Pyston roadmap

We’ve spent some time recently thinking about the future of Pyston, our faster implementation of Python, and wanted to share what’s on our mind. For updates please check out our wiki.

Roadmap

Our primary goal at this point is to get more people using Pyston, and our initial approach to that is removing some of the things that make Pyston difficult to use. We are currently 30% faster on our benchmarks and 60% faster on commonly-used benchmarks, and while being even faster would entice more people to use Pyston, we believe we will have a bigger impact by reducing obstacles than by working on performance.

The most common issue that our users report is not being able to install the packages they want. This happens because some packages are difficult to compile; CPython users will typically download pre-compiled versions, but there aren’t pre-compiled packages for Pyston yet. So our current focus is building these packages for our users and providing them in a way that’s easy to install.

To do that we’ve decided to provide the packages through conda. This has a few benefits: first, it lets us provide the packages ourselves instead of waiting for package maintainers to produce Pyston builds; second, it makes it easier for users to get Pyston in the first place; and third, it will hopefully make Pyston itself more portable in the process so that it works on more Linux distros.

After that we have many things we’d like to do, with the exact order to be determined:

  • Set up a CI/CD system
  • Add support for 64-bit ARM
  • Continued performance improvements

And even longer term:

  • Add Mac and Windows support
  • Integrate with Numba
  • Improve multithreading
  • Explore “opt-in” features that allow us to break semantics
  • Continued performance improvements

Target Python versions

We’ve decided that we only have the resources to target a single version of Python at a time. We would love to be able to provide a version of Pyston regardless of the version of Python you want, but that’s not feasible given our small team.

We currently target Python 3.8.12, and plan to retarget to Python 3.10 some time in early 2022. We intend to backport parts of the “Faster CPython” effort going on in 3.11 depending on the level of compatibility.

We are open to backporting semantic changes made in later versions of Python when we think they signal that the Python community considers a particular behavior to be implementation-dependent. We anticipate that these will largely be performance improvements with small technical semantic differences; we handle them on a case-by-case basis and note each one on our wiki.

Supported Pyston versions

While we will try to help with any version of Pyston you are running, because our project moves quickly we won’t be backporting fixes to older versions of Pyston.

If you have any questions/thoughts/suggestions please feel free to join our Discord or file an issue on our GitHub!

Pyston Team Joins Anaconda

We have some very exciting news to announce today: we (Marius and Kevin) are joining Anaconda! Anaconda is a well-known company that produces open-source Python software, and we think that by joining them we can significantly accelerate the trajectory of Pyston, our faster implementation of Python.

[See the corresponding announcement on the Anaconda Blog: https://www.anaconda.com/blog/pyston-team-joins-anaconda]

What will this look like

Things will largely look the same from the outside, except now we will have access to more resources and expertise to move faster. In particular:

  • Pyston remains an open-source project with the same license as CPython
  • Pyston won’t be tied to using conda
  • We still get to set our roadmap, with potentially less time devoted to monetization work. By joining a company with a mature and efficient monetization scheme, we’ll spend more time doing core feature work.
  • Once we need it, we’ll have a governance model that is separate from Anaconda
  • We may develop integrations with other Anaconda projects in ways that are beneficial to both products
  • We’ll continue to work with the community on the other Python performance projects that are underway

Why Anaconda

We talked to a couple of companies about a possible joint future for Pyston, and Anaconda stood out to us in terms of alignment. They’re already doing similar work with Numba and their other projects, and they have a demonstrated open source commitment that means a lot to us.

We also are excited about the possibility of having better integrations with some of their complementary products. We don’t have anything to announce right now, but we already had conda integration on our roadmap, and now that it’s easier, it’s more likely to happen sooner. Together, we are very excited about possibly integrating the features of Numba and Pyston: the two projects target different layers of the stack, and the hope is that by combining features, we will be able to explore more of the space of possible Python optimizations.

And finally, the medium-term roadmap for Pyston mainly involves work to get Pyston into more peoples’ hands. We’re finding that getting an alternative Python implementation adopted requires much more work than simply making it faster, and joining a leading Python distributor will let us short-cut a number of these steps.

The Future

Now that we have Anaconda’s sponsorship, we are planning out a short-term roadmap for the project. We will announce more when it is ready, so stay tuned! In the meantime, give Pyston a try and let us know how it works for you on our Github issue tracker or our Discord channel.

Pyston v2.2: faster and open source

We are proud to announce Pyston v2.2, the latest version of our faster implementation of the Python programming language. This version is significantly faster than previous ones, and importantly is now open source.

We also merged in many changes from CPython and are now based on CPython 3.8.8.

Performance

Pyston v2.2 is 30% faster than stock Python on our web server benchmarks. This is a significant improvement over our previous performance, and if we were feeling cheeky, we would advertise it as “50% more speedup.”

The foundational technology powering Pyston v2.2 is the same as that found in earlier versions, but we have tuned and optimized more areas and found additional speedups, particularly in our JIT and attribute cache mechanisms.

One noteworthy change is that we decided to remove many of the rarely-used debugging features that Python supports, because they are expensive even when not needed. Doing so collectively resulted in a 2% speedup, which was remarkable to us: roughly 2% of the time the world’s computers spend running Python goes to debugging checks that almost no one uses. We’ve disabled those checks and are positioning ourselves as an “optimized build”, similar to binaries built without debugging information. Those who still want debugging features can use the “debug build” of stock Python, since the two are interchangeable. For a full list of the features we removed in Pyston v2.2, please see our wiki.

Open source

As we’ve continued talking to potential customers, we’ve become convinced that Pyston can thrive on an open-source business model, starting primarily with support services. This means that we’ve open-sourced Pyston v2.2, which you can find on our GitHub here.

We’ve archived our old repository to reduce confusion, but you can still find that here.

We are looking into which of our newest changes can be upstreamed to CPython. Throughout this process, we welcome your contributions. Help with getting Pyston packaged for additional platforms would be especially useful.

Moving forward

We continue to try to make Pyston as compelling and easy to use as possible. Working Pyston into your projects should be as easy as replacing “python” with “pyston.” If that’s not the case, we’d love to hear about it on our GitHub issues tracker or on our Discord channel. We hope you’ll give Pyston a try and see that it really is the easiest way to speed up your Python code.

Pyston v2: 20% faster Python

We’re very excited to release Pyston v2, a faster and highly compatible implementation of the Python programming language. Version 2 is 20% faster than stock Python 3.8 on our macrobenchmarks. More importantly, it is likely to be faster on your code. Pyston v2 can reduce server costs, reduce user latencies, and improve developer productivity.

Pyston v2 is easy to deploy, so if you’re looking for better Python performance, we encourage you to take five minutes and try Pyston. Doing so is one of the easiest ways to speed up your project.

Performance

Pyston v2 provides a noticeable speedup on many workloads while having few drawbacks. Our focus has been on web serving workloads, but Pyston v2 is also faster on other workloads and popular benchmarks.

Our team put together a new public Python macrobenchmark suite that measures the performance of several commonly-used Python projects. The benchmarks in this suite are larger than those found in other Python suites, making them more likely to be representative of real-world applications. Even though this gives us a lower headline number than other projects, we believe it translates to better speedups for real use cases. Pyston v2 still performs well on microbenchmarks, running twice as fast as standard Python on tests like chaos.py and nbody.py.

Here are our performance results:

                                CPython 3.8.5    Pyston 2.0    PyPy 7.3.2
flaskblogging warmup time [1]   n/a              n/a           85s
flaskblogging mean latency      5.1ms            4.1ms         2.5ms
flaskblogging p99 latency       6.3ms            5.2ms         5.8ms
flaskblogging memory usage      47MB             54MB          228MB
djangocms warmup time [1]       n/a              n/a           105s
djangocms mean latency          14.1ms           11.8ms        15.9ms
djangocms p99 latency           15.0ms           12.8ms        179ms
djangocms memory usage          84MB             91MB          279MB
Pylint speedup                  1x               1.16x         0.50x
mypy speedup                    1x               1.07x [2]     unsupported
PyTorch speedup                 1x               1.00x [2]     unsupported
PyPy benchmark suite [3]        1x               1.36x         2.48x

Results were collected on an m5.large EC2 instance running Ubuntu 20.04.

[1] Warmup time is defined as time until the benchmark reached 95% of peak performance; if it was not distinguishable from noise it is marked “n/a”. Only post-warmup behavior is considered for latency measurement.
[2] mypy and PyTorch don’t support automatically building their C extensions from source, so these Pyston numbers use our unsafe compatibility mode
[3] The PyPy benchmark suite was modified to only run the benchmarks that are compatible with Python 3.8

Results analysis

In our targeted benchmarks (djangocms and flaskblogging), Pyston v2 provides an average 1.22x speedup in mean latency and a 1.18x improvement in p99 latency, while using just a few more megabytes per process. We have not yet invested time in optimizing the other benchmarks.

“p99 latency” is the 99th percentile of the response-time distribution, a common metric in web serving contexts since it can provide insight into user experience that is lost by taking an average. PyPy’s high p99 latency on djangocms comes from periodic latency spikes, presumably from garbage collection pauses. CPython and Pyston both exhibit periodic spikes, presumably from their cycle collectors, but those spikes are both less frequent and much smaller in magnitude.
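For readers unfamiliar with the metric, here is a small illustrative snippet (our own example, not part of the benchmark harness) showing how mean and p99 figures can be derived from a list of per-request latencies:

```python
# Hypothetical sketch: computing mean and p99 latency from a list of
# per-request response times (in milliseconds). This is an illustration of
# the metric, not our actual benchmark harness.
def summarize(latencies_ms):
    ordered = sorted(latencies_ms)
    mean = sum(ordered) / len(ordered)
    # p99 is the value that 99% of requests fall at or below.
    p99_index = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return mean, ordered[p99_index]

mean_ms, p99_ms = summarize([4.1, 4.3, 4.0, 5.2, 4.2, 11.7, 4.4, 4.1])
print("mean=%.1fms p99=%.1fms" % (mean_ms, p99_ms))
```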

The mypy and PyTorch benchmarks show a natural boundary of Pyston v2. These benchmarks both do the bulk of their work in C extensions, which are unaffected by our Python speedups. We natively support the C API and do not have an emulation layer, so we are still able to provide a small boost to mypy performance and do not degrade PyTorch or NumPy performance. Your benefit will depend on your mix of Python and C extension work.

Technical approach

We’re planning on going into more detail in future blog posts, but some of the techniques we use in Pyston v2 include:

  • A very-low-overhead JIT using DynASM
  • Quickening (a rough sketch of the idea follows this list)
  • General CPython optimizations
  • Build process improvements
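We’ll describe these properly in later posts, but as a rough illustration of the quickening idea, here is a sketch of ours written in Python (Pyston’s actual implementation operates on bytecode and generated code): a call site can be rewritten to a type-specialized handler after observing its operands, with a fallback to the generic path if the assumption ever breaks.

```python
# Hypothetical sketch of "quickening": after observing the operand types at a
# generic operation site, rewrite the site to a specialized handler, and
# deoptimize back to the generic path if the assumption breaks. This is an
# illustration of the concept, not Pyston's actual implementation.

def generic_add(site, a, b):
    if type(a) is int and type(b) is int:
        site.handler = int_add        # quicken: future calls skip the checks
    return a + b

def int_add(site, a, b):
    if type(a) is not int or type(b) is not int:
        site.handler = generic_add    # deoptimize back to the generic path
        return a + b
    return a + b                      # fast path: both operands known to be ints

class Site:
    def __init__(self):
        self.handler = generic_add

site = Site()
print(site.handler(site, 1, 2))       # first call observes ints and quickens
print(site.handler(site, 3, 4))       # subsequent calls take the fast path
```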

Compatibility

Since Pyston is a fork of CPython, we believe it is one of the most compatible alternative Python implementations available today. It supports all the same features and C API that CPython does.

While Pyston is functionally identical in theory, in practice there are some temporary compatibility hurdles, as there are for any new Python implementation. Please see our wiki for details.

Availability

Pyston v2.0 is immediately available as a pre-built package. Currently, we have packages for Ubuntu 18.04 and 20.04 x86_64. If you would like support for a different OS, let us know by filing an issue in our issue tracker.

Trying out Pyston is as simple as installing our package, replacing python3 with pyston3, and reinstalling your dependencies with pip-pyston3 install (though see our wiki for a known issue about setuptools). If you already have an automated build set up, the change should be just a few lines.

Our plan is to open-source the code in the future, but since compiler projects are expensive and we no longer have benevolent corporate sponsorship, it is currently closed-source while we iron out our business model.

Reaching us

We are designing Pyston for developers and love to hear about your needs and experiences. So, we’ve set up a Discord server where you can chat with us. If you’d like a commercially-supported version of Pyston, please send us an email.

We’ve optimized Pyston for several use cases but are eager to hear about new ones so that we can make it even more beneficial. If you run into any problems or instances where Pyston does not help as much as expected, please let us know!

Background

We designed Pyston v1 at Dropbox to speed up Python for its web serving workloads. After the project ended, some of us from the team brainstormed how we would do it differently if we were to do it again. In early 2020, enough pieces were in place for us to start a company and work on Pyston full-time.

Pyston v2 is inspired by but is technically unrelated to the original Pyston v1 effort.

Moving forward

We’re on a mission to make Python faster and have plenty of ideas to do so. That means we’re actively looking for people to join the team. Let us know if you’d like to get involved. Otherwise stay tuned for future releases and reach out if you have any questions!

Pyston 0.6.1 released, and future plans

Hello all, we’re happy to release Pyston version 0.6.1, the latest version of our high-performance Python JIT.  v0.6.1 contains performance enhancements over 0.6, bringing Pyston to 95% faster than CPython on standard benchmarks.

On the other hand, this is the last release that Dropbox is sponsoring.  We wanted to take some time to talk about what that means, both about the space of Python performance, and about the Pyston project specifically.

What’s happened

It’s hard to break down the change in cost-benefit analysis, but here are some factors that went into our decision:

  • We spent much more time than we expected on compatibility
  • We similarly had to spend more time on memory usage, which turned out to be a bigger concern than expected
  • Dropbox has increasingly been writing its performance-sensitive code in other languages, such as Go

Our personal take is that the increase on the “cost” side could potentially be considered typical, whereas the decrease on the “benefit” side was probably a larger driver.  It’s hard to say, though, since if we had managed to build things twice as fast the calculus would have been different.

Where we are

We are quite proud that, over the last three years, we’ve been able to achieve meaningful speedups while maintaining a higher level of compatibility than other solutions: we are the first alternative Python runtime to be able to run Dropbox faster.

As for numbers, on the just-released v0.6.1, we are 95% faster on standard Python benchmarks.  On web-workload benchmarks that we created, we are 48% faster.  On Dropbox’s server, we are 10% faster.

We think it’s worth mentioning that the 10% speedup on Dropbox code is just a small fraction of what we think is possible with our approach. We’ve spent most of our time working on compatibility and memory usage and have not had time to optimize this particular workload.

Where we go from here

Marius and I are no longer spending our time working on Pyston and are transitioning to other projects.  The project itself remains open source and available, and we are investigating ways that the project can continue, either in whole or in part.  We are also looking into upstreaming parts of our code back to CPython, since our code is now based on theirs.

We’re proud of what we’ve done, and we are looking forward to going into the technical details in the near future.  We also take some small consolation in having helped map out which Python performance-versus-compatibility tradeoffs may be valuable.

In the end, we are happy that we attempted this, are excited about the many potential ways that our work on Pyston could still be useful, and are happy to refocus ourselves on domains with more immediate needs.

Pyston 0.5 released

Today we are extremely excited to announce the v0.5 release of Pyston, our high performance Python JIT. We’ve been a bit quiet for the past few months, and that’s because we’ve been working on some behind-the-scenes technology that we are finally ready to unveil. It might be a bit less shiny than some other things we could have worked on, but this change makes Pyston much more ready to use.

Pyston is now using reference counting.

Refcounting

Reference counting (“refcounting”) is a form of automatic memory management. It’s usually viewed as slower and less sophisticated than using a tracing garbage collector (a “GC”), the predominant technique in modern languages. All past versions of Pyston contained tracing garbage collectors, and much of our work from 0.4 to 0.5 was tearing it out in favor of refcounting.

Why did we do this? In short, because CPython (the main Python implementation) uses refcounting. We used a GC initially to try to get more performance. But applying a tracing GC to a refcounting C API, such as the one that Python has, is risky and comes with many performance pitfalls. And most challengingly, Pyston wants to support the large amount of existing code that relies on the special properties refcounting provides (predictable, immediate destruction). We found that we had to go to greater and greater lengths to support these programs, and there were also cases where we wouldn’t be able to support the applications in their current form.
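As a small example of the kind of code we mean (our own illustration, with a hypothetical file path), a lot of real-world code implicitly expects destructors to run the moment the last reference disappears:

```python
# Illustration of the "predictable immediate destruction" that refcounting
# provides and that much existing code relies on. Under CPython's refcounting,
# the file is closed as soon as `f` is rebound; under a tracing GC, the close
# might not happen until some later collection.
class TrackedFile:
    def __init__(self, path):
        self.f = open(path, "w")
    def __del__(self):
        print("closing file")
        self.f.close()

f = TrackedFile("/tmp/example.txt")   # hypothetical path, for illustration
f = None          # with refcounting, "closing file" prints here, immediately
print("after rebinding")
```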

So we decided to bite the bullet and convert to refcounting, with the goal of getting better application compatibility.

How did we do?

NumPy

We are very happy to announce: we can run NumPy, unmodified.

Specifically: on their latest release (v1.11), we run their entire test suite with one test failure, for which they’ve accepted our patch. For their latest trunk, we have three test failures. We do need to use a modified version of part of their build chain (Cython), and we are currently slower on the test suite than CPython.

Regardless, we are very happy with this result, especially because we will continue to improve both the compatibility and performance.

Other goodies

There are quite a few non-refcounting features that made it into this release as well:

  • Signal handling
  • Frame introspection of exited frames
  • Generator cleanup
  • Support for more C API functions, such as custom tracebacks
  • and many more small fixes than we can list here

These are a large part of our progress on NumPy, and they also help us run other tricky libraries such as py.test, lxml, and cffi. We’ve also greatly reduced the number of modifications that we maintain to the Python standard libraries and C extensions. Overall, refcounting was a big investment, but it’s bought us compatibility wins that we would have had a very hard time getting otherwise.

Performance

Unfortunately, since performance wasn’t our goal for this release, we did slide backwards a bit. v0.5 is about 10% slower than v0.4 was, largely due to the change to refcounting. We are okay with the regression since we explicitly focused on compatibility for the last six months, and our refcounting implementation still has many available optimizations.

As a side note, the “conventional wisdom” is that refcounting should have been even slower compared to using a GC.  We attribute the smaller-than-expected gap mainly to the compatibility restrictions that hampered our GC implementation.

There is a lot of low-hanging performance fruit available to us right now which we have been explicitly avoiding while we finished refcounting. Now would be a great time to consider contributing since we have more ideas than we can implement ourselves. This is especially true when it comes to NumPy performance.

Currently, we take about twice as long to run the NumPy test suite as CPython does. We don’t know how this will translate to performance on real NumPy programs, but we do know that much of the slowdown falls into two categories. The first is that NumPy hits code paths that are otherwise rare in Pyston and are currently unoptimized. The second is a bit more subtle: NumPy frequently calls from C code back into the Python runtime, which is expensive for us because it doesn’t benefit from our JIT (in addition to being previously rare). We have techniques inside Pyston to handle these situations and invoke our JIT from C code, and we’d like to start exposing them so that NumPy and other libraries can use them.
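As an illustration of that second category (our own example, not one of our benchmarks), any NumPy API that takes a Python callable ends up crossing from NumPy’s C code back into the Python runtime once per element:

```python
# Illustration of C code calling back into the Python runtime: np.vectorize
# invokes the Python-level function once per array element from inside
# NumPy's C loops, so none of that per-element work benefits from a JIT.
import numpy as np

def scale(x):
    return x * 2 + 1

arr = np.arange(1000000)
slow = np.vectorize(scale)(arr)   # roughly one C-to-Python transition per element
fast = arr * 2 + 1                # stays inside NumPy's C kernels
assert (slow == fast).all()
```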

Looking forward

We apologize — again — for the lengthy release cycle. We didn’t expect refcounting to take this long, and we even knew that it would take longer than we expected. We’re planning on doing another blog post to talk about what the difficulties were with it and go into more of the technical details of our refcounting system.

Moving forward, our plan for 0.6 is to focus on performance. We would love help from the community on identifying what is important to make performant. We could work on making the NumPy test suite fast, but it may not end up translating to real NumPy workloads.

We’re at the point that trying out Pyston should be easy; it won’t benefit all workloads, but it should be easy to drop it in and see if it does. To test it out, try

docker run -it pyston/pyston

or check out our readme for other options for obtaining Pyston.  To try NumPy, use the “pyston/pyston-numpy” image instead.

We have quite a few optimization ideas lined up, and the pressure has been strong to delay the 0.5 release “just one more week” so that we have time to include some of them. Expect to see an 0.5.1 release that improves performance.

Final words

Refcounting brings Pyston one step closer to being a drop-in replacement for CPython. There is still much more work to do, but we feel that with refcounting we’ve reached a threshold where we’d like to start getting Pyston into peoples’ hands. It’s still very much beta software, so there are many rough edges and unoptimized cases. But we want your feedback on what’s working and what’s not.

Finally, we would like to thank all of our open source contributors who have contributed to this release, and especially Nexedi for their employment of Boxiang Sun, one of our core contributors who helped greatly with the NumPy support.

  • Boxiang Sun
  • Dong-hee Na
  • Rudi Chen
  • Long Ang
  • @LoyukiL
  • Tony Narlock
  • Felipe Volpone
  • Daniel Milde
  • Krish Monut
  • Jacek Wielemborek

Pyston 0.4 released

For a list of common questions about our project, please see our FAQ.

We are very excited to release Pyston 0.4, the latest version of our high-performance Python JIT.  We have a lot to announce with this release, the highlights being the ability to render Dropbox webpages and performance 25% better than CPython (the main Python implementation) on our benchmark suite.  We are also excited to unveil our project logo.

A lot has happened in the eight months since the 0.3 release: the 0.4 release contains 2000 commits, three times as many commits as either the 0.2 or 0.3 release.  Moving forward, our plan is to release every four months, but for now please enjoy a double-sized release.

Compatibility

While not individually headline-worthy, this release includes a large number of new features:

  • Unicode support
  • Multiple inheritance
  • Support for weakrefs and finalizers (__del__), including proper ordering
  • with-statements
  • exec s in {}
  • Mutating functions in place, such as by setting func_code, func_defaults, or func_globals
  • Import hooks
  • Set comprehensions
  • Much improved C API support
  • Better support for standard command line arguments
  • Support for multi-line statements in the REPL
  • Traceback and frame objects, locals()

Together, these mean that we support almost all Python semantics.  In addition, we’ve implemented a large number of things that aren’t usually considered “features” but are nonetheless important for supporting common libraries.  This includes small things, such as supporting all the combinations of arguments that builtin functions can take (for example, passing None as the function to map()), as well as “fun” things, such as mutating sys.modules to change the result of an import statement.
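As a contrived illustration of the sys.modules trick (our own example, not from any particular library), patterns like this show up in mocking frameworks and lazy importers, which is why supporting them matters:

```python
# Contrived illustration of mutating sys.modules to change what an import
# statement returns. Not a language "feature" per se, but real libraries
# (mocking frameworks, lazy importers) depend on it working.
import sys
import types

fake = types.ModuleType("json")
fake.loads = lambda s: "intercepted!"
sys.modules["json"] = fake

import json                      # resolves to the module object we installed
print(json.loads('{"a": 1}'))    # prints: intercepted!
```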

Together, these new features mean that we support many common libraries.  We successfully run the test suites of a number of libraries such as django and sqlalchemy, and are continually adding more.  We have also started running the CPython test suite and have added 153 (out of 401) of their test files to our testing suite.

We also have some initial support for NumPy.  This isn’t a priority for us at the moment (our initial target codebase doesn’t use NumPy), but we spent a small amount of time on it and got some simple NumPy examples working.

And most importantly, we now have the ability to run the main Dropbox server, and can render many of its webpages.  There’s still more work to be done here — we need to get the test suite running, and get a performance-testing regimen in place so we can start reporting real performance numbers and comparisons — but we are extremely happy with the progress here.

C API

One thing that has helped a lot in this process is our improved C API support.  CPython has a C API that can be used for writing extension modules, and starting in the 0.2 release we added a basic compatibility layer that translated between our APIs and the CPython ones.  Over time we’ve been able to extend this compatibility to the point that not only can we support C extensions, but we also support running CPython’s internal code, since it is written to the same API.  This means that to support a new API function we can now use CPython’s implementation of the function rather than implementing it on top of our APIs.

As we’ve implemented more and more APIs using CPython’s implementation, it’s become hard to continue thinking of our support as a compatibility layer, and it’s more realistic to think of CPython as the base for our runtime rather than a compatibility target.  This has also been very useful towards our goal of running the Dropbox server: we have been able to directly use CPython’s implementation of many tricky features, such as unicode handling.  We wouldn’t have been able to run the Dropbox server in this amount of time if we had to implement the entire Python runtime ourselves.

Performance

We’ve made a number of improvements to Pyston’s performance, including:

  • Adding a custom C++ exception unwinder.  This new unwinder takes advantage of Pyston’s existing restrictions to make C++ exceptions twice as fast.
  • Using fast return-code-based exceptions when we think exceptions will be common, either due to the specifics of the code, or due to runtime profiling.
  • A baseline JIT tier, which sits between the interpreter and the LLVM JIT.  This tier approaches the performance of the LLVM tier but has much lower startup overhead.
  • A new on-disk cache that eliminates most of the LLVM tier’s cost on non-initial runs.
  • Many tracing enhancements, producing better code and supporting more cases
  • New C API calling conventions that can greatly speed up calling C API functions.
  • Converted some builtin modules to shared modules to improve startup time.
  • Added a PGO build, and used its function ordering in normal builds as well.

Conspicuously absent from this list are better LLVM optimizations.  Our LLVM tier has been able to do well on microbenchmarks, but on “real code” it tends to have very little knowledge of the behavior of the program, even if it knows all of the types.  This is because knowing the types of objects only peels away the first level of dynamicism: we can figure out what function we should call, but that function will itself often contain a dynamic call.  For example, if we know that we are calling the len() function, we can eliminate the dynamic dispatch and immediately call into our implementation of len() — but that implementation will itself do a dynamic call of arg.__len__().  While len() is common enough that we could special-case it in our LLVM tier, this kind of multiple-levels-of-dynamicism is very common, and we have been increasingly relying on our mini tracing JIT to peel away all layers at once.  The result is that we get good execution of each individual bytecode, but the downside is that we are currently lacking good inter-bytecode optimizations.  Our plan is to integrate the per-bytecode tracing JIT and the LLVM method JIT to achieve the best of both worlds.
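To make the len() example concrete, here is a Python-level schematic of the two layers of dispatch (our own illustration, not Pyston internals):

```python
# Schematic of the "multiple levels of dynamicism" described above, written in
# Python rather than showing Pyston's internals. Knowing that the callee is
# len() removes one dynamic dispatch, but len() itself immediately performs
# another one through the argument's type.
def call_function(func, arg):
    # Level 1: which function are we calling? Type feedback can often resolve
    # this to the builtin len().
    return func(arg)

def builtin_len(arg):
    # Level 2: len() itself dispatches dynamically on the argument's type.
    return type(arg).__len__(arg)

print(call_function(builtin_len, [1, 2, 3]))   # 3
print(call_function(builtin_len, "hello"))     # 5
```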

Benchmarks

We updated our benchmarks suite to use three real-world libraries: our suite contains a benchmark based on each of pyxl, django, and sqlalchemy. Benchmark selection is a contentious topic, but we like these benchmarks because they are more typical of web servers than existing Python benchmark suites.

On these benchmarks, we are 25% faster than CPython (and 25% slower than PyPy 4.0.0).  We have a full performance tracking site, where you can see our latest benchmark results (note: that last link will auto-update over time and isn’t comparing the same configurations as the 25% result).

Community

We also have a number of exciting developments that aren’t directly related to our code:

  • We switched from a Makefile build system to a CMake-based one.  This lets us have some nice features such as a configure step, faster builds (by supporting Ninja), and down the road easier support for new platforms.  This was done by an open source contributor, Daniel Agar, and we are very thankful!
  • We have more docs.  Check out our wiki for some documentation on internal systems, or tips for new contributors.  Browsing the codebase should be easier as well.
  • We have a logo!
  • We had 184 commits from 11 open source contributors.  A special shoutout to Boxiang Sun, who has greatly helped with our compatibility!

Final words

We have a pre-built binary available to download on Github (though please see the notes there on running it).  Pyston is still in a pre-launch state, so please expect crashes or occasional poor performance, depending on what you run it on.  But if you see any of that, please let us know either in our Gitter channel or by filing a Github issue.  We’re excited to hear what you think!

If you are in the Bay Area, we are having a talk + meetup at the Dropbox SF office at 6:30pm on November 10th.  We only have a few spaces left, so please RSVP if you are interested.  More details at the RSVP link.

We have a lot of exciting things planned for our 0.5 series as well.  Our current goals are to implement a few final features (such as inspection of stack frames after they exit), to continue improving performance, and to start running some Dropbox services on Pyston.  It’s an exciting time, and as always we are taking new contributors!  If you’re interested in contributing, feel free to peruse our docs, check out our list of open issues, or just say hi!

Pyston 0.3: Self-hosting Sufficiency

We’ve been working hard over the past five months and are very happy to release Pyston 0.3, the newest version of our high-performance Python implementation. The biggest features of this release are that we can now run all of our internal scripts on Pyston, as well as improved performance.  We also have some exciting news to share about our project status and plans.

Language compatibility

Self-hosting, or running a compiler through itself, is one of the best ways to demonstrate language compatibility. Pyston isn’t a static compiler or written in Python, so “self-hosting” is a bit of a misnomer / attention grabber, but we still have a number of internal Python scripts of various complexity, and with this release we can now run them all on Pyston. The most complex of our scripts is our test runner, which spawns multiple threads, spawns subprocesses to run the tests, calls pickle to load the expected results, and reports back to the user. In the process it executes a few thousand lines of code across a few dozen standard libraries and extension modules.

Unfortunately, we make fairly little use of our self-host ability at the moment.  We only have a single Python script that’s actually involved in the building of Pyston and even then only tangentially.  And we can’t default to running our tester in self-host mode, since what if we have a bug that breaks the test runner and makes all the tests pass?  But at least we have the ability.

For some quantitative stats of debatable value, we can look at how many of the Python standard libraries and extension modules we can import.  (Note: this is just importing the library correctly, not testing any of its functionality beyond that.  Hopefully in the 0.4 release we can say how many of the CPython test cases we can pass.)  At the time of our 0.2 release, we were able to import 56 top-level standard libraries, and 12 standard extension modules.  Now, with the 0.3 release, we are able to import 117 libraries and 27 extension modules, which is more than twice as many.

We still have a long way to go, though, since this is only about half of the libraries and extension modules in CPython (though we don’t have to support all of them immediately).  Thankfully, our C API support is becoming fairly developed, and while it was originally intended for supporting C extension modules, it works just as well to support CPython’s internal code.  We’ve gotten to the point that we can often copy large swaths of code from CPython into Pyston without modification, and while it’s hard to measure, I think we currently compile about as much CPython code into Pyston as code that we wrote ourselves.  So without really intending it, we’ve been adopting a “CPython with a replaced core” architecture and been moving away from the “completely from scratch” model we started with.  Regardless of whether we fully adopt that strategy or not, we’re currently able to use large amounts of implementation from CPython and move much faster.

Performance

We were hesitant to announce performance numbers in the 0.1 and 0.2 releases, since both of those releases focused on longer term investments (getting the core infrastructure in place, and language features, respectively) from which we didn’t want to get distracted.  In the past month or so, though, we’ve finally taken the time to go back and expand our benchmark suite and fix some of the low hanging fruit that we skipped during initial implementation, and are happy to talk about how we’re doing. The result is that we are now (on our small benchmark suite) faster than CPython!  We are currently 1% faster than CPython using a geometric mean, with individual benchmarks varying between 2x faster and 2x slower.  You can see more details and up-to-date benchmark results at speed.pyston.org.  (A hearty thanks to the PyPy team for the performance tracking software.)

“1% faster than CPython” is clearly not our overall performance target, but we are happy with the speed at which we got here, and the amount of optimization headroom we still have.  Moving forward, we could continue working on optimizations and have more impressive benchmark results, but we’re taking this milestone as a signal that we should shift focus back to feature work again.

If we were to break down our performance versus CPython, we (unsurprisingly) have better steady-state performance but worse startup time.  As a quick measure of how our benchmark suite balances the two, the benchmark geomean has a value of 6.0 seconds; it’s hard to tell if this is the same balance as for our target server workloads.

  • Most of our startup time comes from LLVM jitting our code.  This doesn’t mean that LLVM is to blame: our AST interpreter is fairly slow, requiring us to often tier out of it to our LLVM JIT.  We also generate some very large LLVM IR in order to support our frame introspection, which slows down compilation times.  We have a number of ideas on how to improve startup time on both these fronts (make LLVM jit quicker, and go to it less).
  • For steady-state performance, we tend to do well at executing our JIT’ed code, but our memory system — though much better than it was in 0.2 — is still not as good as CPython’s or other implementations’.  Most of our speedup comes from our inline caching mechanisms (sketched after this list), and we still have a lot of open headroom for more type speculations and LLVM optimizations, since we do almost none of either.
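For readers unfamiliar with the technique, here is a rough Python-level sketch of what an inline cache for an attribute lookup does (a conceptual illustration of ours; the real caches live in generated machine code and carry more guards):

```python
# Rough sketch of an inline cache for a single attribute-lookup site: remember
# the type seen last time, and when the same type shows up again with an
# ordinary instance attribute, skip the generic lookup protocol (MRO walk,
# descriptors, __getattr__, ...). Conceptual only; real inline caches are
# generated code with additional guards.
class AttributeCacheSite:
    def __init__(self, attr_name):
        self.attr_name = attr_name
        self.cached_type = None

    def lookup(self, obj):
        if type(obj) is self.cached_type and self.attr_name in obj.__dict__:
            return obj.__dict__[self.attr_name]     # fast path: cache hit
        value = getattr(obj, self.attr_name)        # slow path: generic lookup
        self.cached_type = type(obj)                # remember for next time
        return value

class Point:
    def __init__(self, x):
        self.x = x

site = AttributeCacheSite("x")
print(site.lookup(Point(1)))   # slow path, fills the cache
print(site.lookup(Point(2)))   # fast path, same type as last time
```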

Project plans

On the project management side, we now have multiple people working full time on the project, in addition to the part-time help we’ve been getting!  With the additional resources we’ve been able to move more quickly (you can see an uptick in GitHub commits), and we’ve set some aggressive goals for running Dropbox on Pyston.  We’re very excited about how much we’re going to be able to get done.

Our goal moving forward is to continue expanding the fraction of the language+runtime that we support, and maintain certain performance targets as we go.  Our current performance target is 1x CPython, but we may loosen it in order to prioritize feature work, since that tends to be more time-sensitive (blocks more things) than performance work.  We’ll be targeting larger and larger applications to run under Pyston, with the ultimate target being the Dropbox server codebase.

Conclusion

As always, you can find our code on GitHub.  We’ve released a binary that may or may not run on your system; it’s available for you to play with if you’re interested, but remember that this is still an alpha and not ready for real use.  If you run into issues or would like to contribute, please let us know!

Python benchmark sizes

I’ve always been an advocate for new benchmarks for Python: the PyPy benchmark suite is good, but I’ve always felt that it misses certain real-world features.  Benchmark-picking seems to always be a contentious issue with different people defining “real world features” in different ways (take a look at the JavaScript benchmark situation), but I’ve always felt that the existing Python benchmarks are pretty micro-benchmark-y — which is not necessarily a bad thing, but we as a community might benefit from having larger macrobenchmarks.  I’ve debated with people about the characteristics of the PyPy benchmark suite (especially the “Django macrobenchmark”), so I decided to write a tool to collect some numbers.

The goal of the tool is to get an understanding of how much code is important to the benchmark: specifically, how many lines of code cover 99% of the execution time of the benchmark.  There are many possible statistics we could gather that would be meaningful, but I chose this one as a rough measure of the amount of hot code in the benchmark.  This is an admittedly crude heuristic — it is dependent on the implementation used to run it (Python 2.7.6 for these results), and it has all of the flaws that come with using lines-of-code to measure anything.  Still, I think it can be useful as a rough starting point for talking about the sizes of our benchmarks.

The tool works by attaching a sampling profiler to the benchmark, noting the line number that was active at the time of the sample, and at the end tallying the most common lines until we have reached 99% of the total number of samples.  I ran the tool over the PyPy benchmark suite and got these results, in “lines of code that comprise 99% of the runtime”:

(Update: I reran the benchmarks to more closely match the way they get reported by PyPy, so the numbers have changed.  See the next section for details.)

  • ai: 12
  • chaos: 78
  • django: 72
  • fannkuch: 16
  • float: 19
  • meteor-contest: 17
  • nbody_modified: 17
  • richards: 99
  • rietveld: 759
  • slowspitfire: 4
  • spambayes: 316
  • spectral-norm: 6
  • spitfire_cstringio: 6
  • telco: 194
  • twisted_iteration: 97
  • twisted_names: 613
  • twisted_pb: 387
  • twisted_tcp: 137
  • (geomean):  36

For reference, my icbd static type inferencer measures in at 1268 lines of code for 99% coverage — it’s a poor benchmark in many ways (non-determinism for one), but this metric suggests that at least along this dimension, it has quite different behavior than most of the benchmarks in the PyPy benchmark suite.  Again, I’m not trying to say that a small benchmark is necessarily bad or that a large benchmark is necessarily good, but just that we may need more variety in our benchmarks to capture the behavior of different types of programs.  I’m glad to see that there are some larger benchmarks in the PyPy suite, though I think there’s still some room to improve, since they get outnumbered by the smaller benchmarks (the geometric mean is still quite low).

[Update] Some notes on methodology

I picked the 20 benchmarks that PyPy lists on the front page of their Speed Center, which are the benchmarks that they seem to base their published numbers on.

To try to closely match their environment, I modified the benchmark suite’s “runner.py” to output the commands it runs rather than actually run them; in the previous version of this post I just ran the benchmarks with their default arguments.

The PyPy benchmark suite only reports peak performance, and ignores any initialization or warmup time, so I modified the measure_loc tool to ignore those as well.

[Update] Secondary benchmarks

PyPy has a number of benchmarks that they don’t include in their primary benchmark set (the one they use to compare to CPython), but are available in their repository for use.

  • sympy_sum: 244
  • sympy_expand: 148
  • sympy_integrate: 628
  • sympy_str: 513
  • translate: 5805*
  * = with a tracing profiler.  The translate program isn’t signal-safe or easily made to be so, so I have to run it under a tracing profiler.  The tracing profiler is much more invasive and it’s not clear how the numbers compare; they seem to usually be 0-30% higher than the sampling profiler.

Trying it yourself

The tool has been pushed to the Pyston repository, and I’ve tried to make it user-friendly: just run “python measure_loc.py your_script.py your_script_args” or “python measure_loc.py -m your_module your_module_args” and it will spit out some results at the end.  I’d be interested to hear what kinds of results people get with other benchmarks or runtimes, and I hope we can start a discussion that leads to some more comprehensive Python benchmarks.
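For those curious about the mechanics, the core of the sampling approach described above boils down to something like the following simplified sketch (our own condensation; the real measure_loc tool in the Pyston repository handles more cases, such as ignoring warmup):

```python
# Simplified sketch of a line-level sampling profiler along the lines described
# above: fire a profiling timer, record the (filename, line) active in the
# interrupted frame at each sample, then report how many distinct lines are
# needed to cover 99% of the samples. Unix-only (SIGPROF); not the actual
# measure_loc implementation.
import signal
from collections import Counter

samples = Counter()

def on_sample(signum, frame):
    # Record the line that was executing when the sample fired.
    samples[(frame.f_code.co_filename, frame.f_lineno)] += 1

def profile(fn, interval=0.001):
    # Sample based on consumed CPU time using the profiling timer.
    signal.signal(signal.SIGPROF, on_sample)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        fn()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)

def lines_covering(fraction=0.99):
    # Count the most-sampled lines until `fraction` of samples are covered.
    total = sum(samples.values())
    covered = 0
    count = 0
    for _, n in samples.most_common():
        covered += n
        count += 1
        if covered >= fraction * total:
            break
    return count

def busy():
    s = 0
    for i in range(2000000):
        s += i * i
    return s

profile(busy)
print("lines covering 99% of samples:", lines_covering())
```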