Python benchmark sizes

I’ve always been an advocate for new benchmarks for Python: the PyPy benchmark suite is good, but I’ve always felt that it misses certain real-world features.  Benchmark-picking seems to always be a contentious issue with different people defining “real world features” in different ways (take a look at the JavaScript benchmark situation), but I’ve always felt that the existing Python benchmarks are pretty micro-benchmark-y — which is not necessarily a bad thing, but we as a community might benefit from having larger macrobenchmarks.  I’ve debated with people about the characteristics of the PyPy benchmark suite (especially the “Django macrobenchmark”), so I decided to write a tool to collect some numbers.

The goal of the tool is to get an understanding of how much code is important to the benchmark: specifically, how many lines of code cover 99% of the execution time of the benchmark.  There are many possible statistics that we can gather that would be meaningful, but I chose this one as a rough measure of the amount of hot code in the benchmark.  This is an admittedly crude heuristic — it is dependent on the implementation used to run it (Python 2.7.6 for these results), and has all of the flaws that come with using lines-of-code to measure anything.  Still, I think it can still be useful as a rough starting point for talking about the sizes of our benchmarks.

The tool works by attaching a sampling profiler to the benchmark, noting the line number that was active at the time of the sample, and at the end tallying the most common lines until we have reached 99% of the total number of samples.  I ran the tool over the PyPy benchmark suite and got these results, in “lines of code that comprise 99% of the runtime”:

(Update: I reran the benchmarks to more closely match the way they get reported by PyPy, so the numbers have changed.  See the next section for details.)

  • ai: 12
  • chaos: 78
  • django: 72
  • fannkuch: 16
  • float: 19
  • meteor-contest: 17
  • nbody_modified: 17
  • richards: 99
  • rietveld: 759
  • slowspitfire: 4
  • spambayes: 316
  • spectral-norm: 6
  • spitfire_cstringio: 6
  • telco: 194
  • twisted_iteration: 97
  • twisted_names: 613
  • twisted_pb: 387
  • twisted_tcp: 137
  • (geomean):  36

For reference, my icbd static type inferencer measures in at 1268 lines of code for 99% coverage — it’s a poor benchmark in many ways (non-determinism for one), but this metric suggests that at least along this dimension, it has quite different behavior than the most of the benchmarks in the PyPy benchmark suite.  Again, I’m not trying to say that a small benchmark is necessarily bad or that a large benchmark is necessarily good, but just that we may need more variety in our benchmarks to capture the behavior of different types of programs.  I’m glad to see that there are some larger benchmarks in the PyPy suite, though I think there’s still some room to improve, since they get outnumbered by the smaller benchmarks (the geometric mean is still quite low).

[Update] Some notes on methodology

I picked the 20 benchmarks that PyPy lists on the front page of their Speed Center, which are the benchmarks that they seem to base their published numbers on.

To try to closely match their environment, I modified the benchmark suite’s “runner.py” to output the commands it runs rather than actually run them; in the previous version of this post I just ran the benchmarks with their default arguments.

The PyPy benchmark suite only reports peak performance, and ignores any initialization or warmup time, so I modified the measure_loc tool to ignore those as well.

[Update] Secondary benchmarks

PyPy has a number of benchmarks that they don’t include in their primary benchmark set (the one they use to compare to CPython), but are available in their repository for use.

  • sympy_sum: 244
  • sympy_expand: 148
  • sympy_integrate: 628
  • sympy_str: 513
  • translate: 5805*
  • = with a tracing profiler.  The translate program isn’t signal-safe or easily made to be so, so I have to run it under a tracing profiler.  The tracing profiler is much more invasive and it’s not clear how the numbers compare; they seem to usually be 0-30% higher than the sampling profiler.

Trying it yourself

The tool has been pushed to the Pyston repository, and I’ve tried to make it user-friendly: just run “python measure_loc.py your_script.py your_script_args” or “python measure_loc.py -m your_module your_module_args” and it will spit out some results at the end.  I’d be interested to hear what kinds of results people get with other benchmarks or runtimes, and I hope we can start a discussion that leads to some more comprehensive Python benchmarks.