Pyston 0.6.1 released, and future plans

Hello all, we’re happy to release Pyston version 0.6.1, the latest version of our high-performance Python JIT.  v0.6.1 contains performance enhancements over 0.6, bringing Pyston to 95% faster than CPython on standard benchmarks.

On the other hand, this is the last release that Dropbox is sponsoring.  We wanted to take some time to talk about what that means, both about the space of Python performance, and about the Pyston project specifically.

What’s happened

It’s hard to break down the change in cost-benefit analysis, but here are some factors that went into our decision:

  • We spent much more time than we expected on compatibility
  • We similarly had to spend more time on memory usage due to it being a bigger concern than expected
  • Dropbox has increasingly been writing its performance-sensitive code in other languages, such as Go

Our personal take is that the increase on the “cost” side could potentially be considered typical, whereas the decrease on the “benefit” side was probably a larger driver.  It’s hard to say, though, since if we had managed to build things twice as fast the calculus would have been different.

Where we are

We are quite proud that, over the last three years, we’ve been able to achieve meaningful speedups while maintaining a higher level of compatibility than other solutions: we are the first alternative Python runtime to be able to run Dropbox faster.

As for numbers, on the just-released v0.6.1, we are 95% faster on standard Python benchmarks.  On web-workload benchmarks that we created, we are 48% faster.  On Dropbox’s server, we are 10% faster.

We think it’s worth mentioning that the 10% speedup on Dropbox code is just a small fraction of what we think is possible with our approach. We’ve spent most of our time working on compatibility and memory usage and have not had time to optimize this particular workload.

Where we go from here

Marius and I are no longer spending our time working on Pyston and are transitioning to other projects.  The project itself remains open source and available, and we are investigating ways that the project can continue, either in whole or in part.  We are also looking into upstreaming parts of our code back to CPython, since our code is now based on theirs.

We’re proud of what we’ve done and we are looking forward to going into more detail about the technical details in the near future.  We also take some small consolation in having helped map out what Python performance-versus-compatibility tradeoffs may be valuable.

In the end, we are happy that we attempted this, are excited about the many potential ways that our work on Pyston could still be useful, and are happy to refocus ourselves on domains with more immediate needs.

Pyston 0.6 released

We are excited to announce the v0.6 release of Pyston, our high performance Python JIT.

In this release our main goal was to reduce the overall memory footprint.  It also contains a lot of additional smaller changes for improving compatibility and fixing bugs.

Memory usage reductions

One of the big items which reduced memory usage was moving away from representing the interpreter instructions in a tree and instead storing them as an actual bytecode. Now, each instruction follows each other in memory and does not involve pointer-chasing.

We are also much more aggressive in freeing unused memory now. For example for very hot functions which we compiled using the LLVM JIT (our highest tier) we now free the code which the baseline JIT emitted earlier-on. Additional bigger improvements were accomplished by making our analysis passes more memory efficient and fixing a few leaks.

This is a chart compares the maximal resident set size of several 64bit linux python implementations (lower is better) on a machine with 32GB of RAM.

While max RSS is not a very accurate memory usage number for various reasons, like not taking into account that pages can be shared between processes and only measuring peak usage, we think it shows nevertheless a very useful insight about how much (up to 2x) Pyston 0.6 improved over 0.5.1.

While we are happy that we were able to reduce the memory usage quite significantly in a few weeks we are not yet satisfied with it and will spend more time on reducing the memory usage further and developing better tools to investigate it. We have several ideas for this – some of the bytecode related ones are summarized here.

Additional changes

This release contains a lot of fixes for compatibility issues and other bugs.  We also spent time on making it easier to replace CPython with Pyston, such as by more closely matching its directory structure and following its ‘dict’ ordering.  We can now, for example, run pip and virtualenv unmodified, without requiring any upstream patches like other implementations do.

Aside: NumPy performance

NumPy hasn’t been a priority for us, but from time to time we check on how well we can run it.  We’ve focused on compatibility in the past, but for this post we took a look into performance as well.  We don’t have any NumPy-specific optimizations, so we were happy to see this graph from PyPy’s numpy benchmark runner:


As you can see, we closely match CPython’s performance on NumPy microbenchmarks, and are able to beat it on a few of the smaller ones.  Our current understanding is that we are doing better on the left two benchmarks because they run much more quickly — in about 1000x less time than the right three.  These shorter benchmarks spend a significant amount of time transitioning into and out of NumPy, which Pyston can help with, whereas the right three benchmarks are completely dominated by time inside NumPy.

As a side note, we the Pyston team don’t want to be in the business of picking what NumPy workloads are important.  If you have a program that you think shows real-world NumPy usage, please let us know because we would love to start benchmarking real programs rather than simple matrix operations.

Final words

As always, you can check out our online speed center for more details on our performance and memory usage.

We would like to thank all our open source contributors who contributed to this release, and especially Nexedi for their employment of Boxiang Sun, one of our core contributors.

  • Dong-hee Na
  • Krish Munot
  • Long Ang
  • Lucien Chan
  • SangHee Lee

Pyston 0.5.1 released

We are excited to announce the v0.5.1 release of Pyston, our high performance Python JIT.
This minor release passes all SciPy tests and is on average about 15% faster than 0.5.0. In addition several bug fixes and compatibility improvements got merged.

Performance related changes:

We released recently a blog post about our baseline JIT and inline caches. This release brings a lot of improvements in this area, some of the changes are:

  • the number of ICs slots is now variable. Before we specified for every IC how many slots it has and how large they should be (all slots in a IC had the same size). This often led to higher memory usage than necessary. We changed it now to a fixed size of memory which will than get filled with variable size slots whenever a new slot is required and there is space left in the IC. In addition this makes our IC size estimates in the LLVM tier more accurate because they are now based on the number of bytes we required in the bjit tier.
  • the interpreter reuses the stack slots (internally called vregs) assigned to temporary values which are only live in a basic block. This reduces stack usage which saves memory and made Pyston faster.
  • better non null value tracking, stack spilling, duplicate guard removal and much more temporary values will get held in registers
  • the bjit and ICs can now use callee-save register which removes a lot of spilling around calls
  • added a script which allows to inspect jited code directly from `perf report`.
    • usage with `make perf_<testname>`
  • our codegen and analysis passes now work on the vreg numbers which allows us to use arrays as internal data structures instead of hash tables which makes the code easier to understand and faster
  • faster reference counting pass in the code generator of the LLVM tier

Performance comparison:

startup performance benchmarks:


This benchmarks show that the startup time improved significantly. Part of this comes from the numerous bjit improvements mentioned above (the chart also contains a direct comparison between the bjit performance of the different releases).

steady state benchmarks:



There are still a lot of low hanging fruit and we still have a huge amount of ideas for (performance) improvements for future releases.
The next months we will use to make Pyston ready for usage at dropbox – this is going to be very exciting 🙂

Finally, we would like to thank all of our open source contributors who have contributed to this release, and especially Nexedi for their employment of Boxiang Sun, one of our core contributors who helped greatly with the SciPy support.

  • Cullen Rhodes
  • Long Ang
  • Lucien Chan


baseline jit and inline caches

Creating an implementation for a dynamic language using just in time compilation (JIT) techniques involves a lot of compromises mainly between complexity of the implementation, speed, warm-up time and memory usage.
Especially speed is a difficult trade-off because it’s very easy to end-up spending more time optimizing a piece of code and emitting the assembly than we will ever be able to save by executing faster than executing it in a less optimized way.
This causes most JIT language implementations to use an approach of different tiers – approaches to running the code and different amount of optimizations done depending on how often the specific piece of code gets executed. Thereby reducing the chance that more time will get spend transforming the code in to a more efficient representation than it would take to execute it in a less efficient representation.

baseline just in time compiler

We noticed that our interpreter is interpreting code quite slowly while the LLVM tier takes a lot of time to JIT (even with the object cache which made it much faster) so it was obvious that we either have to speed the interpreter up or introduce a new tier in between.
There are well-known problems with our interpreter, mainly it’s slow because it does not represent the code in a contiguous block of memory (bytecode) but instead it involves a lot of pointer chasing because we reuse our AST nodes. Fixing this would be comparable easy but we still thought that this will only improve the performance a little bit but will not give us the performance we want.

About a year ago we introduced a new execution tier instead, the baseline jit (bjit). It is used for python code which is executed a medium number of times and therefore lives between the interpreter and the LLVM JIT tier. In practice this means most code which executes more than 25 times will currently end-up in the bjit and if it gets executed more than about 2500 times we will recompile it using the LLVM tier.

The main goal of the bjit is to generate reasonable machine code very fast and making heavy use of inline caches to get good performance (more on this further down).
It involved a number of design decisions (some may change in the future) but what we currently ended up with:

  • reuse our inline cache mechanism
    • it transform the bjit from only being able to remove the interpretation overhead (which is quite low for python – it depends on the workload but probably not more than 20%) to a JIT which actually is able to improve the performance by a much larger factor
  • generate machine code for a basic block at a time
    • only generating code for blocks which actually get executed reduces the time to generate code and memory usage at the expense of not being able to do optimizations across blocks (at the moment)
  • highly coupled to the interpreter and using the same frame format
    • making it very easy and fast to switch between the interpreter and bjit at every basic block start
    • we can fallback to the interpreter for blocks which contain operations which we are unable to JIT or for blocks which are unreasonable to JIT because the may be very large and generating code for them would cost too much memory
    • makes it easy to tier up to the bjit when we interpret a function which contains a loop with a large amount of iterations
  • does not use type analysis and all code it generates makes no assumptions about types
    • this makes it always safe to execute code in the bjit
    • type specific code is only inside the ICs and always contains a call to a generic implementation in case the assumptions don’t hold
  • all types are boxed / real python objects
  • it collects type information which we will use in LLVM tier to generate more optimized code later on if the function turns out to be hot
    • if an assumption in the LLVM tier turns out to be wrong we will deoptimize to the interpreter/bjit

Inline Cache

the inline cache mechanism is used in the LLVM tier and in the baseline JIT and is currently responsible for most of the performance improvements over the cpython interpreter (which does not use this technique). It removes most of the dynamic dictionary lookups and additional branching which a “normal” python interpreter often has to do. For every operation where we can use ICs we will provide a block of memory and fill it with a lot of nops and a call to the generic implementation of the operation. Therefore the first time we execute the code we will call into the generic implementation but it will trace the execution of the operation using the arguments supplied. It then fills in the block of memory a more optimized type specific version of the operation which we can use the next time we will hit this IC slot if the assumptions the trace made still hold.

Here is a simple diagram of how a IC with two slots could look like:


A simple example will make it easier to understand what we are doing.

For the python function:

def f(a, b):
    return a + b

The CFG will look like this:

Block 0 'entry'; Predecessors: Successors:
 #0 = a
 #1 = b
 #2 = #0+#1
 return #2

We will now look at the IC for #2 = #0+#1

For example if we call f(1, 1) for the first time the C++ function binop() will trace the execution and fill in the memory block with the code to do an addition between two python int objects (it uses a C++ helper function called intAddInt()):


Notice the guard comparisons inside the first IC slot, they make sure that we will only use the more optimized implementation of the operation if it’s safe to do so (in this case the arguments have the same types and the types did not get modified since the trace got created) and otherwise jump to the next IC slot. Or if there is no optimized version call the generic implementation which is always safe to execute.

Most code is not very dynamic which means filling in one or two slots with optimized versions of a operation is enough to catch all encountered cases.
For example if we later on call f("hello ", "world") we will add a new slot in the IC:


We use ICs for nearly all operations not only for binary ones like the example showed. We also use them for stuff like global scope variable lookup, retrieving and setting attributes and much more (we also support more than two slots). Not all traces call helper functions like we have seen in the example some are inlined in the slot.

Pyston will overwrite slots if they already generated slots turn out to be invalid or unused because they assumption of the trace don’t hold anymore. Some code (luckily this is uncommon) is highly dynamic in this cases we will try to fill in the slot with a less aggressive version if possible – one which makes less assumption. If not possible we will just always call the generic version (like cpython always does).

The code we emit inside the ICs has similar trade offs to the bjit code – mainly it needs to get emitted very fast. We prefer generating smaller code instead of faster one because of the fixed size of the inline cache. It’s better to generate a smaller version which allows us to embed more slots if necessary and trashes the instruction cache less.

lots of ideas for improvements

Both the inline cache mechanism and the bjit have a lot of room for improvements. Some of the ideas we have are:

  • directly emit the content of some of the IC slots of the bjit in the LLVM tier as LLVM IR which makes it accessible to a powerful optimization pipeline which emits much better code with sophisticated inlining and much more
  • generating better representation for highly polymorphic sides
  • smarter (less) guards
  • introducing a simple IR which allows us to do some optimizations
  • better register allocation
  • allow tracing of additional operations
  • removal of unnecessary reference counting operations
  • the whole trace generation requires writing manual c++ code (called ‘rewriter’ inside the code base) which makes them quite hard to write but with the benefit of giving us total control of how a slot looks like. In the future we could try find a better trade-off by automatically generating them from the c++ code or LLVM IR when possible

We’ve already made a lot of improvements in this area, stay tuned for a 0.5.1 blog post talking about them 🙂

Pyston 0.5 released

Today we are extremely excited to announce the v0.5 release of Pyston, our high performance Python JIT. We’ve been a bit quiet for the past few months, and that’s because we’ve been working on some behind-the-scenes technology that we are finally ready to unveil. It might be a bit less shiny than some other things we could have worked on, but this change makes Pyston much more ready to use.

Pyston is now using reference counting.


Reference counting (“refcounting”), is a form of automatic memory management. It’s usually viewed as slower and less sophisticated than using a tracing garbage collector (a “GC”), the predominant technique in modern languages. All past versions of Pyston contained tracing garbage collectors, and much of our work from 0.4 to 0.5 was tearing it out in favor of refcounting.

Why did we do this? In short, because CPython (the main Python implementation) uses refcounting. We used a GC initially to try to get more performance. But applying a tracing GC to a refcounting C API, such as the one that Python has, is risky and comes with many performance pitfalls. And most challengingly, Pyston wants to support the large amount of code that has been written that relies on the special properties that refcounting provides (predictable immediate destruction). We found that we had to go to greater and greater lengths to support these programs, and there were also cases where we wouldn’t be able to support the applications in their current form.

So we decided to bite the bullet and convert to refcounting, with the goal of getting better application compatibility.

How did we do?


We are very happy to announce: we can run NumPy, unmodified.

Specifically: on their latest release (v1.11), we run their entire test suite with one test failure, for which they’ve accepted our patch. For their latest trunk, we have three test failures. We do need to use a modified version of part of their build chain (Cython), and we are currently slower on the test suite than CPython.

Regardless, we are very happy with this result, especially because we will continue to improve both the compatibility and performance.

Other goodies

There are quite a few non-refcounting features that made it into this release as well:

  • Signal handling
  • Frame introspection of exited frames
  • Generator cleanup
  • Support for more C API functions, such as custom tracebacks
  • and many more small fixes than we can list here

These are a large part of our progress on NumPy, and they also help us run other tricky libraries such as py.test, lxml, and cffi. We’ve also greatly reduced the number of modifications that we maintain to the Python standard libraries and C extensions. Overall, refcounting was a big investment, but it’s bought us compatibility wins that we would have had a very hard time getting otherwise.


Unfortunately, since performance wasn’t our goal for this release, we did slide backwards a bit. v0.5 is about 10% slower than v0.4 was, largely due to the change to refcounting. We are okay with the regression since we explicitly focused on compatibility for the last six months, and our refcounting implementation still has many available optimizations.

As a side note, the “conventional wisdom” is that refcounting should have been even slower compared to using a GC.  We attribute this mainly to the compatibility restrictions that hampered our GC implementation.

There is a lot of low-hanging performance fruit available to us right now which we have been explicitly avoiding while we finished refcounting. Now would be a great time to consider contributing since we have more ideas than we can implement ourselves. This is especially true when it comes to NumPy performance.

Currently, we take about twice as long to run the NumPy test suite as CPython does. We don’t know how this will translate to performance on real NumPy programs, but we do know that much of the slowdown falls into two categories: the first is NumPy hits code paths that are otherwise-rare in Pyston and are currently unoptimized. The second is a bit more subtle: NumPy frequently calls from C code back into the Python runtime, which is expensive for us because it doesn’t benefit from our JIT (in addition to being previously-rare). We have techniques inside Pyston to handle these situations and invoke our JIT from C code, and we’d like to start exposing that so that NumPy and other libraries can use it.

Looking forward

We apologize — again — for the lengthy release cycle. We didn’t expect refcounting to take this long, and we even knew that it would take longer than we expected. We’re planning on doing another blog post to talk about what the difficulties were with it and go into more of the technical details of our refcounting system.

Moving forward, our plan for 0.6 is to focus on performance. We would love help from the community on identifying what is important to make performant. We could work on making the NumPy test suite fast, but it may not end up translating to real NumPy workloads.

We’re at the point that trying out Pyston should be easy; it won’t benefit all workloads, but it should be easy to drop it in and see if it does. To test it out, try

docker run -it pyston/pyston

or check out our readme for other options for obtaining Pyston.  To try NumPy, use the “pyston/pyston-numpy” image instead.

We have quite a few optimization ideas lined up, and the pressure has been strong to delay the 0.5 release “just one more week” so that we have time to include some of them. Expect to see an 0.5.1 release that improves performance.

Final words

Refcounting brings Pyston one step closer to being a drop-in replacement for CPython. There is still much more work to do, but we feel like with refcounting we’ve reached a threshold where we’d like to start getting Pyston into peoples’ hands. It’s still very much beta software, so there are many rough edges and unoptimized casses. But we want your feedback on what’s working and what’s not.

Finally, we would like to thank all of our open source contributors who have contributed to this release, and especially Nexedi for their employment of Boxiang Sun, one of our core contributors who helped greatly with the NumPy support.

  • Boxiang Sun
  • Dong-hee Na
  • Rudi Chen
  • Long Ang
  • @LoyukiL
  • Tony Narlock
  • Felipe Volpone
  • Daniel Milde
  • Krish Monut
  • Jacek Wielemborek

Pyston 0.4 released

For a list of common questions about our project, please see our FAQ.

We are very excited to release Pyston 0.4, the latest version of our high-performance Python JIT.  We have a lot to announce with this release, with the highlights of being able to render Dropbox webpages, and achieve performance 25% better than CPython (the main Python implementation) on our benchmark suite.  We are also excited to unveil our project logo:

A lot has happened in the eight months since the 0.3 release: the 0.4 release contains 2000 commits, three times as many commits as either the 0.2 or 0.3 release.  Moving forward, our plan is to release every four months, but for now please enjoy a double-sized release.


While not individually headline-worthy, this release includes a large number of new features:

  • Unicode support
  • Multiple inheritance
  • Support for weakrefs and finalizers (__del__), including proper ordering
  • with-statements
  • exec s in {}
  • Mutating functions in place, such as by setting func_code, func_defaults, or func_globals
  • Import hooks
  • Set comprehensions
  • Much improved C API support
  • Better support for standard command line arguments
  • Support for multi-line statements in the REPL
  • Traceback and frame objects, locals()

Together, these mean that we support almost all Python semantics.  In addition, we’ve implemented a large number of things that aren’t usually considered “features” but nonetheless are important to supporting common libraries.  This includes small things such as supporting all the combinations of arguments builtin functions can take (passing None as the function to map()) or “fun” things such as mutating sys.modules to change the result of an import statement.

Together, these new features mean that we support many common libraries.  We successfully run the test suites of a number of libraries such as django and sqlalchemy, and are continually adding more.  We have also started running the CPython test suite and have added 153 (out of 401) of their test files to our testing suite.

We also have some initial support for NumPy.  This isn’t a priority for us at the moment (our initial target codebase doesn’t use NumPy), but we spent a small amount of time on it and got some simple NumPy examples working.

And most importantly, we now have the ability to run the main Dropbox server, and can render many of its webpages.  There’s still more work to be done here — we need to get the test suite running, and get a performance-testing regimen in place so we can start reporting real performance numbers and comparisons — but we are extremely happy with the progress here.


One thing that has helped a lot in this process is our improved C API support.  CPython has a C API that can be used for writing extension modules, and starting in the 0.2 release we added a basic compatibility layer that translated between our APIs and the CPython ones.  Over time we’ve been able to extend this compatibility to the point that not only can we support C extensions, but we also support running CPython’s internal code, since it is written to the same API.  This means that to support a new API function we can now use CPython’s implementation of the function rather than implementing it on top of our APIs.

As we’ve implemented more and more APIs using CPython’s implementation, it’s become hard to continue thinking of our support as a compatibility layer, and it’s more realistic to think of CPython as the base for our runtime rather than a compatibility target.  This has also been very useful towards our goal of running the Dropbox server: we have been able to directly use CPython’s implementation of many tricky features, such as unicode handling.  We wouldn’t have been able to run the Dropbox server in this amount of time if we had to implement the entire Python runtime ourselves.


We’ve made a number of improvements to Pyston’s performance, including:

  • Adding a custom C++ exception unwinder.  This new unwinder takes advantage of Pyston’s existing restrictions to make C++ exceptions twice as fast.
  • Using fast return-code-based exceptions when we think exceptions will be common, either due to the specifics of the code, or due to runtime profiling.
  • A baseline jit tier, which sits between the interpreter and the LLVM JIT.  This tier approaches the performance of the LLVM tier but has much lower startup overhead.
  • A new on-disk cache that eliminates most of the LLVM tier’s cost on non-initial runs.
  • Many tracing enhancements, producing better code and supporting more cases
  • New CAPI calling conventions that can greatly speed up calling CAPI functions.
  • Converted some builtin modules to shared modules to improve startup time.
  • Added a PGO build, and used its function ordering in normal builds as well.

Conspicuously absent from this list is better LLVM optimizations.  Our LLVM tier has been able to do well on microbenchmarks, but on “real code” it tends to have very little knowledge of the behavior of the program, even if it knows all of the types.  This is because knowing the types of objects only peels away the first level of dynamicism: we can figure out what function we should call, but that function will itself often contain a dynamic call.  For example, if we know that we are calling the len() function, we can eliminate the dynamic dispatch and immediately call into our implementation of len() — but that implementation will itself do a dynamic call of arg.__len__().  While len() is common enough that we could special-case it in our LLVM tier, this kind of multiple-levels-of-dynamicism is very common, and we have been increasingly relying on our mini tracing JIT to peel away all layers at once.  The result is that we get good execution of each individual bytecode, but the downside is that we are currently lacking good inter-bytecode optimizations.  Our plan is to integrate the per-bytecode tracing JIT and the LLVM method JIT to achieve the best of both worlds.


We updated our benchmarks suite to use three real-world libraries: our suite contains a benchmark based on each of pyxl, django, and sqlalchemy. Benchmark selection is a contentious topic, but we like these benchmarks because they are more typical of web servers than existing Python benchmark suites.

On these benchmarks, we are 25% faster than CPython (and 25% slower than PyPy 4.0.0).  We have a full performance tracking site, where you can see our latest benchmark results (note: that last link will auto-update over time and isn’t comparing the same configurations as the 25% result).


We also have a number of exciting developments that aren’t directly related to our code:

  • We switched from a Makefile build system to a CMake-based one.  This lets us have some nice features such as a configure step, faster builds (by supporting Ninja), and down the road easier support for new platforms.  This was done by an open source contributor, Daniel Agar, and we are very thankful!
  • We have more docs.  Check out our wiki for some documentation on internal systems, or tips for new contributors.  Browsing the codebase should be easier as well.
  • We have a logo!
  • We had 184 commits from 11 open source contributors.  A special shoutout to Boxiang Sun, who has greatly helped with our compatibility!

Final words

We have a pre-built binary available to download on Github (though please see the notes there on running it).  Pyston is still in a pre-launch state, so please expect crashes or occasional poor performance, depending on what you run it on.  But if you see any of that, please let us know either in our Gitter channel or by filing a Github issue.  We’re excited to hear what you think!

If you are in the Bay Area, we are having a talk + meetup at the Dropbox SF office at 6:30pm on November 10th.  We only have a few spaces left, so please RSVP if you are interested.  More details at the RSVP link.

We have a lot of exciting things planned for our 0.5 series as well.  Our current goals are to implement a few final features (such as inspection of stack frames after they exit), to continue improving performance, and to start running some Dropbox services on Pyston.  It’s an exciting time, and as always we are taking new contributors!  If you’re interested in contributing, feel free to peruse our docs, check out our list of open issues, or just say hi!