Python benchmark sizes

I’ve always been an advocate for new benchmarks for Python: the PyPy benchmark suite is good, but I’ve always felt that it misses certain real-world features.  Benchmark-picking seems to always be a contentious issue, with different people defining “real-world features” in different ways (take a look at the JavaScript benchmark situation), but the existing Python benchmarks strike me as pretty micro-benchmark-y — which is not necessarily a bad thing, but we as a community might benefit from having larger macrobenchmarks.  I’ve debated with people about the characteristics of the PyPy benchmark suite (especially the “Django macrobenchmark”), so I decided to write a tool to collect some numbers.

The goal of the tool is to get an understanding of how much code is important to the benchmark: specifically, how many lines of code cover 99% of the execution time of the benchmark.  There are many possible statistics we could gather that would be meaningful, but I chose this one as a rough measure of the amount of hot code in the benchmark.  This is an admittedly crude heuristic — it depends on the implementation used to run it (Python 2.7.6 for these results), and has all of the flaws that come with using lines-of-code to measure anything.  Still, I think it can be useful as a rough starting point for talking about the sizes of our benchmarks.
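
To make the metric concrete, here’s a rough sketch of how that cutoff can be computed; this is only an illustration, not the actual measure_loc code, and line_costs is a hypothetical mapping from (filename, line number) to how much of the runtime was attributed to that line:

    def lines_for_coverage(line_costs, coverage=0.99):
        # line_costs maps (filename, line number) -> how much of the runtime
        # (e.g. how many profiler samples) was attributed to that line.
        total = float(sum(line_costs.values()))
        needed = coverage * total
        covered = 0.0
        hot_lines = 0
        for cost in sorted(line_costs.values(), reverse=True):
            covered += cost
            hot_lines += 1
            if covered >= needed:
                break
        return hot_lines

A benchmark that spends nearly all of its time in a single tight loop will report a very small number here, no matter how much code the program contains overall.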

The tool works by attaching a sampling profiler to the benchmark, recording the line that was executing at each sample, and at the end tallying the most common lines until they account for 99% of the total samples.  I ran the tool over the PyPy benchmark suite and got these results, in “lines of code that comprise 99% of the runtime”:

(Update: I reran the benchmarks to more closely match the way they get reported by PyPy, so the numbers have changed.  See the next section for details.)

  • ai: 12
  • chaos: 78
  • django: 72
  • fannkuch: 16
  • float: 19
  • meteor-contest: 17
  • nbody_modified: 17
  • richards: 99
  • rietveld: 759
  • slowspitfire: 4
  • spambayes: 316
  • spectral-norm: 6
  • spitfire_cstringio: 6
  • telco: 194
  • twisted_iteration: 97
  • twisted_names: 613
  • twisted_pb: 387
  • twisted_tcp: 137
  • (geomean): 36
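
For concreteness, the sampling approach described above can be sketched roughly as follows; this is a simplified, Unix-only illustration built on SIGPROF and setitimer, not the actual measure_loc implementation:

    import collections
    import signal

    sample_counts = collections.Counter()

    def _on_sample(signum, frame):
        # The handler receives the frame that was executing when the timer
        # fired; tallying its (filename, line number) approximates where
        # the runtime is going.
        sample_counts[(frame.f_code.co_filename, frame.f_lineno)] += 1

    def run_sampled(func, interval=0.001):
        # Sample the process every `interval` seconds of CPU time.
        signal.signal(signal.SIGPROF, _on_sample)
        signal.setitimer(signal.ITIMER_PROF, interval, interval)
        try:
            func()
        finally:
            signal.setitimer(signal.ITIMER_PROF, 0, 0)
        return sample_counts

The resulting counts can then be fed into the coverage calculation sketched earlier to produce a single “hot lines” number per benchmark.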

For reference, my icbd static type inferencer measures in at 1268 lines of code for 99% coverage — it’s a poor benchmark in many ways (non-determinism, for one), but this metric suggests that, at least along this dimension, it behaves quite differently from most of the benchmarks in the PyPy benchmark suite.  Again, I’m not trying to say that a small benchmark is necessarily bad or that a large benchmark is necessarily good, just that we may need more variety in our benchmarks to capture the behavior of different types of programs.  I’m glad to see that there are some larger benchmarks in the PyPy suite, though I think there’s still room to improve, since they get outnumbered by the smaller ones (the geometric mean is still quite low).

[Update] Some notes on methodology

I picked the 20 benchmarks that PyPy lists on the front page of their Speed Center, which are the benchmarks that they seem to base their published numbers on.

To match their environment more closely, I modified the benchmark suite’s “runner.py” to output the commands it runs rather than actually running them; in the previous version of this post I had just run the benchmarks with their default arguments.

The PyPy benchmark suite only reports peak performance, and ignores any initialization or warmup time, so I modified the measure_loc tool to ignore those as well.
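
One straightforward way to do that (an illustration of the idea, not necessarily how measure_loc actually implements it) is to leave the sampling timer disarmed until the warmup iterations have finished, so that startup and warmup never contribute samples; bench_fn and the iteration counts below are hypothetical, and the SIGPROF handler from the earlier sketch is assumed to already be installed:

    import signal

    def run_sampled_peak(bench_fn, warmup_iters=3, timed_iters=10, interval=0.001):
        # Warmup iterations run with the timer disarmed, so none of their
        # execution is ever recorded.
        for _ in range(warmup_iters):
            bench_fn()
        # Only the post-warmup ("peak") iterations get sampled.
        signal.setitimer(signal.ITIMER_PROF, interval, interval)
        try:
            for _ in range(timed_iters):
                bench_fn()
        finally:
            signal.setitimer(signal.ITIMER_PROF, 0, 0)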

[Update] Secondary benchmarks

PyPy has a number of benchmarks that they don’t include in their primary benchmark set (the one they use to compare to CPython), but which are available in their repository:

  • sympy_sum: 244
  • sympy_expand: 148
  • sympy_integrate: 628
  • sympy_str: 513
  • translate: 5805*
  • * = measured with a tracing profiler.  The translate program isn’t signal-safe or easily made to be so, so I have to run it under a tracing profiler.  The tracing profiler is much more invasive and it’s not clear how the numbers compare; they seem to usually be 0-30% higher than with the sampling profiler.
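
For reference, a tracing profiler along these lines can be built on sys.settrace, which calls back on every executed line instead of firing on a timer; this is only an illustrative sketch, not the exact fallback that measure_loc uses, and counting line events rather than timed samples is part of why the two sets of numbers aren’t directly comparable:

    import collections
    import sys

    line_counts = collections.Counter()

    def _trace(frame, event, arg):
        # The interpreter calls this for every traced event; counting the
        # 'line' events gives executed-line counts rather than timed samples.
        if event == 'line':
            line_counts[(frame.f_code.co_filename, frame.f_lineno)] += 1
        return _trace

    def run_traced(func):
        sys.settrace(_trace)
        try:
            func()
        finally:
            sys.settrace(None)
        return line_counts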

Trying it yourself

The tool has been pushed to the Pyston repository, and I’ve tried to make it user-friendly: just run “python measure_loc.py your_script.py your_script_args” or “python measure_loc.py -m your_module your_module_args” and it will spit out some results at the end.  I’d be interested to hear what kinds of results people get with other benchmarks or runtimes, and I hope we can start a discussion that leads to some more comprehensive Python benchmarks.

10 thoughts on “Python benchmark sizes”

    • I think that could be interesting — the “total LoC” metric could potentially be correlated with warmup-time difficulty. For now I wanted to focus on just the “hot code size” metric, where I felt the absolute figure was more appropriate; I’m still debating whether it makes more sense to count lines of code with >=N samples vs. the current metric of 99% coverage.
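
      For what it’s worth, the >=N-samples variant would be a small change over the same data (sample_counts here stands for the hypothetical (filename, line number) -> samples mapping that the tool collects):

          def lines_with_min_samples(sample_counts, n):
              # Alternative "hot code size" metric: count every line that
              # received at least n samples, instead of taking the hottest
              # lines until they cover 99% of all samples.
              return sum(1 for count in sample_counts.values() if count >= n)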

      The tool is open source and hopefully understandable, so feel free to modify it to collect other metrics!

    • Hi Maciej, sorry, if I did that it wasn’t intentional. I just ran all of the benchmarks in own/ and unladen_swallow/performance/. A number of them failed due to import errors (presumably since I was not going through your test runner), and some others failed due to not being signal-safe (an unfortunate requirement of the sampling profiler), so I excluded them from the results. I could understand if this has a selection bias against complicated benchmarks, but I assumed it wouldn’t be a big deal — are there specific benchmarks that I should try to get working? Would be happy to update the post.

      • Well, yes, obviously all the big benchmarks are doing complicated stuff including imports. Twisted/translation are the biggest and they involve quite a lot of code. Other benchmarks using large libraries (sympy, …) also come to mind, since those are the big, real world guys. We didn’t write large benchmarks, but we used large open source libraries.

        Essentially, the selection bias is HUGE here, since you ended up picking those that fit in one file, except django, which is just a tiny benchmark (it should not be called django at all; we just picked it up from unladen swallow as a name).

        The selection bias is so huge that the entire post completely missed the point and your data is worthless. Please get stuff running (including translation) and report back then.

      • I had to do a little bit of work to get the translation benchmark running — it’s not signal-safe, and it has a bug that ends up stripping the tracing profiler (sent you a PR). Anyway, it seems to be working now, so I should have results in a few hours (the tracing profiler slows it down quite a bit).

      • Ok, worked on it some more and updated the post. I got the entire benchmark set working, though for consistency I sidelined some of the benchmarks you mentioned (translation, sympy) since you guys don’t include them in your numbers.

      • We include those numbers everywhere except the front page (and it’s mostly because those benchmarks are newer and we’re missing old data, so the historical trend will break). We also run them nightly and include them in any detailed comparison (e.g. if you compare pypy and cpython). You are also missing quite a few benchmarks (e.g. genshi), but not necessarily the biggest ones. http://buildbot.pypy.org/builders/jit-benchmark-linux-x86-64/builds/1284/steps/shell_6/logs/stdio lists all of the benchmarks that get run. I agree that the geometric mean of benchmarks is meaningless as they’re not “representative” in any sense. I dare to say I can’t see them being representative anyway, so we just include benchmarks that we have found 😉
