Judge compares the run time of two programs and tells you whether there’s a statistically significant difference.
Judge is stored in bzr on Launchpad:
$ bzr branch lp:judge
Suppose you’re working on a program and have made some changes, and you want to know if you made it faster or slower. Judge will help:
$ judge 'bzr2.3 status' 'bzr2.5 status'
   n     mean   sd    min    max  cmd
  50  332.8ms  8.1  320.6  355.1  bzr2.3 st
  50  255.0ms  7.4  245.0  283.1  bzr2.5 st
-77.787ms -23.4% p=0.000
difference is significant at 95.0% confidence (p=0.000)
Judge does two main things:
- It runs the commands alternately several times, and measures the time they take. Judge discards the first run of each of the commands, since it may be overly affected by caches warming up.
- It does a small amount of mathematics to provide an opinion on whether the observed difference is likely to be repeatable. Specifically, it uses Welch’s t-test: a two-sample unpooled t-test for unequal variances. (Both steps are sketched below.)
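Judge’s actual implementation is in the branch above; purely as an illustration, here is a minimal Python sketch of both steps. The names time_alternately and welch_t are hypothetical, and the run count is a placeholder.

    import math
    import subprocess
    import time
    from statistics import mean, variance

    def time_alternately(cmd_a, cmd_b, runs=50):
        """Run two argv lists alternately, returning wall-clock times per command."""
        times = ([], [])
        for i in range(runs + 1):                    # one extra round for warm-up
            for sample, argv in zip(times, (cmd_a, cmd_b)):
                start = time.monotonic()
                subprocess.run(argv, capture_output=True)
                elapsed = time.monotonic() - start
                if i > 0:                            # discard each command's first run
                    sample.append(elapsed)
        return times

    def welch_t(a, b):
        """Welch's t statistic and degrees of freedom (two-sample, unpooled)."""
        va, vb = variance(a), variance(b)            # sample variances
        na, nb = len(a), len(b)
        se2 = va / na + vb / nb                      # squared standard error of the difference
        t = (mean(a) - mean(b)) / math.sqrt(se2)
        # Welch-Satterthwaite approximation for the degrees of freedom
        df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
        return t, df

A two-sided p-value then comes from the t distribution with df degrees of freedom, e.g. 2 * scipy.stats.t.sf(abs(t), df).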
The first two arguments to judge are the command lines to test. Note that the commands are executed directly, not through a shell, so you cannot use shell metacharacters such as quoting and redirection.

Note also that while Unix shells will expand variables such as $HOME within double-quoted strings, they typically will not expand ~ inside quotes, and they do no expansion at all within single quotes:
judge "ls ~/dir1" "ls ~/dir2" # probably not what you want judge 'ls $HOME/dir1' 'ls $HOME/dir2' # nor this judge "ls $HOME/dir1" "ls $HOME/dir2" # works
Judge logs the output from the commands into
judge.log in the current
directory. You can use this to check that the commands did what you
intended them to do. This file is replaced on each run.
Interpreting the results
There are three possible outcomes from a Judge trial: an insubstantial result, an insignificant result, or a substantial significant result.
Insubstantial result

There is no substantial difference in mean run time between the two commands (by default, less than a 1% difference):
   n     mean   sd    min    max  cmd
  50  306.7ms  0.5  304.1  307.3  sleep 0.3
  50  306.9ms  0.3  306.0  307.2  sleep 0.3
+0.121ms +0.0% p=0.155
difference is not substantial (+0.0% change)
In this case the distributions of values may greatly overlap, so it’s hard to tell whether the specific difference between the means arose by chance or not. However, for the purposes of benchmarking programs, just knowing the means were very similar tells you that there’s probably no difference.
Insignificant result

The difference in mean run time is small compared to the variability of the two programs, so it’s hard to be sure the difference did not just arise by chance:
   n    mean   sd   min   max  cmd
  50   7.7ms  0.4   5.8   8.6  sleep 0.001
  50   7.8ms  0.2   7.5   8.6  sleep 0.0011
+0.091ms +1.2% p=0.084
difference is probably not significant at 95.0% confidence (p=0.084)
In this case you may want to repeat the test with a larger number of runs, or look for a benchmark that will make the effect you’re trying to measure stronger, or perhaps look for other sources of extrinsic variability.
Substantial significant result
There was a difference that was both significant (compared to the variance of the measurements) and substantial (compared to the absolute values):
   n     mean   sd    min    max  cmd
  50  106.9ms  0.2  106.6  107.6  sleep 0.1
  50  206.9ms  0.2  205.9  207.7  sleep 0.2
+99.957ms +93.5%
difference is significant at 95.0% confidence (p=0.000)
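Taken together, the decision logic amounts to something like the following sketch (a hypothetical function, not Judge’s code; the 1% substantiality threshold and 95% confidence level, i.e. alpha = 0.05, are the defaults described above):

    def classify(percent_change, p_value, alpha=0.05, substantial_pct=1.0):
        """Classify a trial into one of the three outcomes described above."""
        if abs(percent_change) < substantial_pct:
            return "difference is not substantial"
        if p_value >= alpha:
            return "difference is probably not significant"
        return "difference is significant"

Note that substantiality is judged against the size of the change itself, while significance is judged against the confidence level (95% confidence corresponds to alpha = 0.05).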
Judge tries to help you generate a statistically robust measurement, but you should still be careful in using the results:
Cautions about benchmark commands
- The commands, hardware, or test data in your experiment might not be representative of the cases your users care about.
- The commands may not be doing what you think they are: check judge.log to see their output.
- The commands might be dominated by startup time, and that may not be what you really care about at the moment.
- The ‘hot’ case of running the commands repeatedly may not be representative of the case you care about.
Cautions about interpreting the results
- Resist the temptation to re-roll until you do get an apparently significant result. With enough experiments, you will eventually get an apparently significant difference, but this doesn’t mean there actually is a difference.
- The measured times may be perturbed by other activity on the machine. Judge runs the commands repeatedly and alternately to reduce the chance that any particular command is disproportionately affected by other tasks, but it is still possible.
- The fact that a result seems unlikely to have arisen by chance (low p-value) is no guarantee that it did not in fact arise by chance.
- Do not attempt to over-interpret the p-value: the best practice is to set, in advance, a level at which results will be considered significant, and then see whether they reach it or not. It’s normal for the p-value to vary from one run to another. Don’t think of a result as being “nearly significant”: if there is not enough data to get a significant result, you may not know how close you are.
- Judge only looks at the difference in mean run time. However, what matters most for your situation may be something else, perhaps the 95% worst-case result, or the chance of exceeding a particular time budget (see the sketch below).
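For instance, if you capture the raw per-run times yourself, a tail statistic such as the 95% worst case is simple to compute. A minimal sketch using the nearest-rank method (percentile is a hypothetical helper, not part of Judge):

    import math

    def percentile(samples, q):
        """The q-th percentile (0 < q <= 100) of samples, by nearest rank."""
        ordered = sorted(samples)
        rank = math.ceil(q / 100 * len(ordered))   # nearest-rank position, 1-based
        return ordered[rank - 1]

    # percentile(times, 95) is the time that 95% of runs came in at or under.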
Future ideas

- Make more of the parameters configurable.
- Measure other attributes than just wall-clock time (CPU and system time, I/O, etc.); there is some discussion of what Linux reports at <http://stackoverflow.com/questions/7205806/is-getrusage-broken-in-linux-2-6-30>.
- Make Judge installable by pip.
- Perhaps control how many runs to do, either by fitting them into a particular overall time budget or by looking at the variability of the first few runs.
- Log the raw data in a reusable form.
- Allow controlling the log directory.
- Report whether the commands fail.
- Allow specifying non-measured setup or teardown commands, to establish test data and to allow caches to be either filled or emptied.
- Perhaps take a better approach to the insubstantial-difference case, to help understand whether the difference is significant or not. It might be better to compute the p-value for the hypothesis that the population means differ by more than 1%.
- Allow running just one command, and calculate the mean, sd, etc.
FreeBSD’s ministat, written by Poul-Henning Kamp, applies a similar statistical test. The main differences are that ministat draws ASCII-art histograms, and that Judge takes care of running the programs, while ministat reads in externally measured numbers.
Judge is Copyright 2011 Martin Pool.
Judge is licensed under the Apache License, Version 2.0.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.