Pyston talk recording

November 24, 2015 (updated December 2, 2015) Kevin Modzelewski

Hi all, two weeks ago we gave a talk about the current status of Pyston, including some technical details about our JIT strategy and its motivation. I’m glad many of you could make it! For those who couldn’t, the talk is now uploaded; check it out:

Download / alternative viewer

[Update] Slides

Pyston 0.4 released

November 3, 2015 Kevin Modzelewski

For a list of common questions about our project, please see our FAQ.

We are very excited to release Pyston 0.4, the latest version of our high-performance Python JIT. We have a lot to announce with this release. The highlights are that Pyston can now render Dropbox webpages, and that it performs 25% better than CPython (the main Python implementation) on our benchmark suite. We are also excited to unveil our project logo:

A lot has happened in the eight months since the 0.3 release: the 0.4 release contains 2000 commits, three times as many commits as either the 0.2 or 0.3 release. Moving forward, our plan is to release every four months, but for now please enjoy a double-sized release.

Compatibility

While not individually headline-worthy, this release includes a large number of new features:

Unicode support
Multiple inheritance
Support for weakrefs and finalizers (__del__), including proper ordering
with-statements
exec s in {}
Mutating functions in place, such as by setting func_code, func_defaults, or func_globals
Import hooks
Set comprehensions
Much improved C API support
Better support for standard command line arguments
Support for multi-line statements in the REPL
Traceback and frame objects, locals()

Together, these mean that we support almost all Python semantics. In addition, we’ve implemented a large number of things that aren’t usually considered “features” but nonetheless are important to supporting common libraries. This includes small things such as supporting all the combinations of arguments builtin functions can take (passing None as the function to map()) or “fun” things such as mutating sys.modules to change the result of an import statement.
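Two of the behaviors mentioned above, mutating a function in place and mutating sys.modules to change what an import statement returns, can be sketched as follows. This is only an illustration, not code from the Pyston test suite: the function names and the hypothetical_settings module are made up, and the sketch uses the `__code__` / `__defaults__` spellings, which Python 2.6+ accepts as aliases for func_code / func_defaults.

```python
import sys
import types


def greet(name="world"):
    return "hello " + name


def shout(name="world"):
    return "HELLO " + name.upper()


# Mutate the function object in place. Python 2 also spells these
# attributes func_defaults and func_code.
greet.__defaults__ = ("pyston",)
greet.__code__ = shout.__code__

# Put a stub module into sys.modules; the import statement below then
# resolves to the stub instead of searching the filesystem.
# "hypothetical_settings" is an invented name used only for illustration.
stub = types.ModuleType("hypothetical_settings")
stub.DEBUG = True
sys.modules["hypothetical_settings"] = stub

import hypothetical_settings

result = greet()  # runs shout's code with greet's new default argument
```

A runtime has to honor both mutations at the next call or import, which is why they count as compatibility work even though few programs use them directly.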

Together, these new features mean that we support many common libraries. We successfully run the test suites of a number of libraries such as django and sqlalchemy, and are continually adding more. We have also started running the CPython test suite and have added 153 (out of 401) of their test files to our testing suite.

We also have some initial support for NumPy. This isn’t a priority for us at the moment (our initial target codebase doesn’t use NumPy), but we spent a small amount of time on it and got some simple NumPy examples working.

And most importantly, we now have the ability to run the main Dropbox server, and can render many of its webpages. There’s still more work to be done here — we need to get the test suite running, and get a performance-testing regimen in place so we can start reporting real performance numbers and comparisons — but we are extremely happy with the progress here.

C API

One thing that has helped a lot in this process is our improved C API support. CPython has a C API that can be used for writing extension modules, and starting in the 0.2 release we added a basic compatibility layer that translated between our APIs and the CPython ones. Over time we’ve been able to extend this compatibility to the point that not only can we support C extensions, but we also support running CPython’s internal code, since it is written to the same API. This means that to support a new API function we can now use CPython’s implementation of the function rather than implementing it on top of our APIs.

As we’ve implemented more and more APIs using CPython’s implementation, it’s become hard to continue thinking of our support as a compatibility layer, and it’s more realistic to think of CPython as the base for our runtime rather than a compatibility target. This has also been very useful towards our goal of running the Dropbox server: we have been able to directly use CPython’s implementation of many tricky features, such as unicode handling. We wouldn’t have been able to run the Dropbox server in this amount of time if we had to implement the entire Python runtime ourselves.

Performance

We’ve made a number of improvements to Pyston’s performance, including:

Adding a custom C++ exception unwinder. This new unwinder takes advantage of Pyston’s existing restrictions to make C++ exceptions twice as fast.
Using fast return-code-based exceptions when we think exceptions will be common, either due to the specifics of the code, or due to runtime profiling.
A baseline JIT tier, which sits between the interpreter and the LLVM JIT. This tier approaches the performance of the LLVM tier but has much lower startup overhead.
A new on-disk cache that eliminates most of the LLVM tier’s cost on non-initial runs.
Many tracing enhancements, producing better code and supporting more cases.
New CAPI calling conventions that can greatly speed up calling CAPI functions.
Converted some builtin modules to shared modules to improve startup time.
Added a PGO build, and used its function ordering in normal builds as well.
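The tiering mentioned in the list above (interpreter, then baseline JIT, then LLVM JIT) can be pictured with a toy model that promotes a function as it gets hotter. This is a sketch under stated assumptions: the class, the call-count trigger, and the threshold values are all hypothetical and are not Pyston's actual heuristics.

```python
class TieredFunction(object):
    """Toy model of a function that moves through execution tiers."""

    def __init__(self, baseline_after=2, llvm_after=5):
        # Hypothetical thresholds: number of calls after which the
        # function is promoted to the next tier.
        self.baseline_after = baseline_after
        self.llvm_after = llvm_after
        self.calls = 0

    def tier_for_next_call(self):
        """Count one call and report which tier would execute it."""
        self.calls += 1
        if self.calls > self.llvm_after:
            return "llvm_jit"
        if self.calls > self.baseline_after:
            return "baseline_jit"
        return "interpreter"


func = TieredFunction()
history = [func.tier_for_next_call() for _ in range(6)]
```

In this model a cold function never pays any compilation cost, a warm one pays only the cheaper baseline cost, and only the hottest functions reach the most expensive tier.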

Conspicuously absent from this list are better LLVM optimizations. Our LLVM tier has been able to do well on microbenchmarks, but on “real code” it tends to have very little knowledge of the behavior of the program, even if it knows all of the types. This is because knowing the types of objects only peels away the first level of dynamism: we can figure out what function we should call, but that function will itself often contain a dynamic call. For example, if we know that we are calling the len() function, we can eliminate the dynamic dispatch and immediately call into our implementation of len(), but that implementation will itself do a dynamic call of arg.__len__(). While len() is common enough that we could special-case it in our LLVM tier, this kind of multi-level dynamism is very common, and we have been increasingly relying on our mini tracing JIT to peel away all layers at once. The result is that we get good execution of each individual bytecode, but the downside is that we are currently lacking good inter-bytecode optimizations. Our plan is to integrate the per-bytecode tracing JIT and the LLVM method JIT to achieve the best of both worlds.
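The len() example can be made concrete with a rough Python model of what a len() implementation has to do. The names model_len and Playlist are invented for this sketch, and the real builtin performs more checks than shown here.

```python
class Playlist(object):
    def __init__(self, songs):
        self.songs = list(songs)

    def __len__(self):
        return len(self.songs)


def model_len(obj):
    """Rough model of len(): a known entry point with a dynamic call inside."""
    # First level of dynamism (already resolved): we know we are in len().
    # Second level: which __len__ runs depends on the argument's type.
    len_method = getattr(type(obj), "__len__", None)
    if len_method is None:
        raise TypeError("object of type %r has no len()" % type(obj).__name__)
    return len_method(obj)


count = model_len(Playlist(["a", "b", "c"]))
```

Knowing statically that the call target is model_len removes only the outer dispatch. The lookup of `__len__` on the argument's type still happens at run time, and that remaining layer is the one a tracing approach can also remove.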

Benchmarks

We updated our benchmark suite to use three real-world libraries: our suite contains a benchmark based on each of pyxl, django, and sqlalchemy. Benchmark selection is a contentious topic, but we like these benchmarks because they are more typical of web servers than existing Python benchmark suites.

On these benchmarks, we are 25% faster than CPython (and 25% slower than PyPy 4.0.0). We have a full performance tracking site, where you can see our latest benchmark results (note: that last link will auto-update over time and isn’t comparing the same configurations as the 25% result).

Community

We also have a number of exciting developments that aren’t directly related to our code:

We switched from a Makefile build system to a CMake-based one. This lets us have some nice features such as a configure step, faster builds (by supporting Ninja), and down the road easier support for new platforms. This was done by an open source contributor, Daniel Agar, and we are very thankful!
We have more docs. Check out our wiki for some documentation on internal systems, or tips for new contributors. Browsing the codebase should be easier as well.
We have a logo!
We had 184 commits from 11 open source contributors. A special shoutout to Boxiang Sun, who has greatly helped with our compatibility!

Final words

We have a pre-built binary available to download on GitHub (though please see the notes there on running it). Pyston is still in a pre-launch state, so please expect crashes or occasional poor performance, depending on what you run it on. But if you see any of that, please let us know either in our Gitter channel or by filing a GitHub issue. We’re excited to hear what you think!

If you are in the Bay Area, we are having a talk + meetup at the Dropbox SF office at 6:30pm on November 10th. We only have a few spaces left, so please RSVP if you are interested. More details at the RSVP link.

We have a lot of exciting things planned for our 0.5 series as well. Our current goals are to implement a few final features (such as inspection of stack frames after they exit), to continue improving performance, and to start running some Dropbox services on Pyston. It’s an exciting time, and as always we are taking new contributors! If you’re interested in contributing, feel free to peruse our docs, check out our list of open issues, or just say hi!

Caching object code

July 14, 2015 Marius Wachtler

In this blog post I want to briefly describe a new feature that recently landed in Pyston and that also addresses one of the most common pieces of feedback we receive: a lot of people are under the impression that LLVM is not ready to be used as a JIT, because its main focus is static compilation, where fast code generation matters less than it does for a JIT.

While I agree that an LLVM JIT is quite expensive compared to the baseline JIT tiers in other projects, we expect to partly mitigate this cost while still taking advantage of the good code quality and advanced features LLVM provides.

We observe that a lot of non-benchmark code consists of dozens of functions that are hot enough to be worth tiering up to the LLVM JIT. The small amount of time needed to JIT a single function adds up, so we spend significant time JITing functions when starting applications. For example, starting pip (the package manager) currently takes about 2.3 seconds, of which about 1.4 seconds go to JITing 66 Python functions. Of those 1.4 seconds, about 1.1 seconds are spent optimizing the LLVM IR and lowering it to machine code (instruction selection, register allocation, etc.), and only a much smaller amount of time is spent generating the LLVM IR. We decided that the best solution was to cache the generated machine code on disk so that we can reuse it the next time we encounter the same function (e.g. on the next startup).

This approach is a little more complicated than just checking whether the source code of a function has changed, because we support type specializations and OSR frames, and we embed a lot of pointers in the generated code (which change between runs). That’s why we chose (for now) to still generate the LLVM IR, then hash it and look for a cached object file with the same hash. Since the cached code must not contain pointers to changing addresses, I changed Pyston so that whenever the IR emitter encounters a pointer address (e.g. a reference to a non-Python unicode string created by the parser), it emits a symbolic name (like an external variable) and records the pointer value in a map.

Here comes the advantage of using the powerful LLVM project to JIT stuff: it contains a runtime linker that is able to relocate the addresses in our JITed object code. This means that when we load an object, we replace the symbolic names with the actual pointer values, which lets us reuse the same assembly on different runs with different memory layouts.
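The scheme described above (emit IR with symbolic names, hash the IR, reuse cached object code, then patch in the real addresses) can be sketched in a few lines of Python. Everything here is a toy stand-in and an assumption for illustration: the "IR" is plain text, the "object code" is a list of tuples, the cache is an in-memory dict rather than files on disk, and none of the names correspond to Pyston or LLVM APIs.

```python
import hashlib


def emit_ir(constants):
    """Emit toy IR that refers to each object by symbolic name, not address."""
    symbols = {}
    lines = []
    for index, obj in enumerate(constants):
        name = "c%d" % index
        symbols[name] = id(obj)  # remember this run's "pointer" value
        lines.append("load @" + name)
    return "\n".join(lines), symbols


def toy_compile(ir_text):
    """Stand-in for the expensive optimize-and-lower step."""
    return [tuple(line.split(" @")) for line in ir_text.splitlines()]


class ObjectCache(object):
    def __init__(self):
        self._store = {}  # IR hash -> object code (a real cache would use disk)
        self.compiles = 0

    def get_or_compile(self, ir_text):
        key = hashlib.sha256(ir_text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = toy_compile(ir_text)
            self.compiles += 1
        return self._store[key]


def relocate(object_code, symbols):
    """Replace each symbolic name with this run's actual address."""
    return [(op, symbols[name]) for op, name in object_code]


cache = ObjectCache()
first, second = object(), object()  # same role, different addresses
ir_a, symbols_a = emit_ir([first])
ir_b, symbols_b = emit_ir([second])
code_a = relocate(cache.get_or_compile(ir_a), symbols_a)
code_b = relocate(cache.get_or_compile(ir_b), symbols_b)
```

Because the addresses live in the symbol map rather than in the IR text, both "runs" hash to the same key and the expensive step runs only once; only the cheap relocation differs between them.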

Results

[Figure: Object cache effect on pip startup]
The result is that we cut the time to JIT the functions from 1.4 seconds down to 350ms. Of those 350ms, merely 60ms are actually spent hashing the IR, decompressing and loading the object code, and relocating the pointers (down from 1.1 seconds).
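As a quick consistency check, the figures quoted in this post can be worked through directly. The inputs below are the numbers from the text; the estimated new startup time is an inference that assumes the non-JIT part of pip's startup is unchanged, and it is not a measurement reported here.

```python
# Figures quoted in the post, in seconds.
startup_total_before = 2.3
jit_before = 1.4        # JITing 66 functions without the cache
lowering_before = 1.1   # optimizing and lowering the IR to machine code
jit_after = 0.350       # JITing the same functions with the cache
cache_after = 0.060     # hashing, decompressing, loading, relocating

# Time left for IR generation before and after (about 0.3s in both cases,
# which fits with the IR still being generated on every run).
ir_generation_before = jit_before - lowering_before
ir_generation_after = jit_after - cache_after

# Overall JIT speedup, and speedup of the step the cache replaces.
jit_speedup = jit_before / jit_after
lowering_speedup = lowering_before / cache_after

# Assumption: everything outside the JIT takes the same time as before.
estimated_startup_after = startup_total_before - jit_before + jit_after

print("JIT speedup: %.1fx" % jit_speedup)
print("Lowering step speedup: %.1fx" % lowering_speedup)
print("Estimated startup: %.2fs" % estimated_startup_after)
```

The remaining JIT time is dominated by IR generation, which the cache does not avoid.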

I think this is a good example of how significant performance improvements can be made with a small amount of effort. There is a lot of room to improve this simple caching mechanism; for example, we could use it for a new ahead-of-time compile mode for important code (e.g. the standard Python library) using a higher optimization level. While this change alone will not eliminate all of LLVM’s higher JITing cost, we are excited to implement additional features and performance optimizations inside Pyston.

If you are interested in more detailed performance statistics, pass “-s” on Pyston’s command line. You will see much more output, but you may have to look into the source code to see what each stat entry measures.

Pyston 0.3: Self-hosting Sufficiency

February 24, 2015 Kevin Modzelewski

We’ve been working hard over the past five months and are very happy to release Pyston 0.3, the newest version of our high-performance Python implementation. The biggest features of this release are that we can now run all of our internal scripts on Pyston, as well as improved performance. We also have some exciting news to share about our project status and plans.

Language compatibility

Self-hosting, or running a compiler through itself, is one of the best ways to demonstrate language compatibility. Pyston isn’t a static compiler or written in Python, so “self-hosting” is a bit of a misnomer / attention grabber, but we still have a number of internal Python scripts of various complexity, and with this release we can now run them all on Pyston. The most complex of our scripts is our test runner, which spawns multiple threads, spawns subprocesses to run the tests, calls pickle to load the expected results, and reports back to the user. In the process it executes a few thousand lines of code across a few dozen standard libraries and extension modules.

Unfortunately, we make fairly little use of our self-host ability at the moment. We only have a single Python script that’s actually involved in the building of Pyston and even then only tangentially. And we can’t default to running our tester in self-host mode, since what if we have a bug that breaks the test runner and makes all the tests pass? But at least we have the ability.

For some quantitative stats of debatable value, we can look at how many of the Python standard libraries and extension modules we can import. (Note: this is just importing the library correctly, not testing any of its functionality beyond that. Hopefully in the 0.4 release we can say how many of the CPython test cases we can pass.) At the time of our 0.2 release, we were able to import 56 top-level standard libraries, and 12 standard extension modules. Now, with the 0.3 release, we are able to import 117 libraries and 27 extension modules, which is more than twice as many.
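An import-only check of this kind can be approximated with a short script. This is a minimal sketch, not the script used to produce the numbers above: the function name and the module list are illustrative, and hypothetical_missing_module is a made-up name included to show a failure.

```python
import importlib


def count_importable(module_names):
    """Try to import each module; report which succeed and which fail."""
    imported, failed = [], []
    for name in module_names:
        try:
            importlib.import_module(name)
            imported.append(name)
        except Exception:
            # Any failure counts as "cannot import"; as in the text, nothing
            # beyond the import itself is exercised.
            failed.append(name)
    return imported, failed


imported, failed = count_importable(
    ["json", "math", "hypothetical_missing_module"]
)
print("%d importable, %d failed" % (len(imported), len(failed)))
```

Running the same list under two implementations and comparing the counts gives the kind of coarse compatibility figure quoted above.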

We still have a long way to go, though, since this is only about half of the libraries and extension modules in CPython (though we don’t have to support all of them immediately). Thankfully, our C API support is becoming fairly developed, and while it was originally intended for supporting C extension modules, it works just as well to support CPython’s internal code. We’ve gotten to the point that we can often copy large swaths of code from CPython into Pyston without modification, and while it’s hard to measure, I think we currently compile about as much CPython code into Pyston as code that we wrote ourselves. So without really intending it, we’ve been adopting a “CPython with a replaced core” architecture and been moving away from the “completely from scratch” model we started with. Regardless of whether we fully adopt that strategy or not, we’re currently able to use large amounts of implementation from CPython and move much faster.

Performance

We were hesitant to announce performance numbers in the 0.1 and 0.2 releases, since both of those releases focused on longer term investments (getting the core infrastructure in place, and language features, respectively) from which we didn’t want to get distracted. In the past month or so, though, we’ve finally taken the time to go back and expand our benchmark suite and fix some of the low hanging fruit that we skipped during initial implementation, and are happy to talk about how we’re doing. The result is that we are now (on our small benchmark suite) faster than CPython! We are currently 1% faster than CPython using a geometric mean, with individual benchmarks varying between 2x faster and 2x slower. You can see more details and up-to-date benchmark results at speed.pyston.org. (A hearty thanks to the PyPy team for the performance tracking software.)
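A geometric mean is a common way to summarize per-benchmark ratios because a 2x speedup and a 2x slowdown cancel out exactly, which an arithmetic mean does not do. The sketch below shows the calculation; the speedup values are invented for illustration and are not Pyston's measured results.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    if not values:
        raise ValueError("geometric_mean requires at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-benchmark speedups (CPython time / Pyston time):
# above 1.0 means Pyston is faster on that benchmark, below 1.0 slower.
speedups = [2.0, 0.5, 1.25, 0.8]

geomean = geometric_mean(speedups)
arithmetic = sum(speedups) / len(speedups)

print("geometric mean:  %.3f" % geomean)
print("arithmetic mean: %.3f" % arithmetic)
```

With these made-up inputs the geometric mean reports no net speedup while the arithmetic mean suggests one, which is why the geometric mean is the more conservative summary for ratios.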

“1% faster than CPython” is clearly not our overall performance target, but we are happy with the speed at which we got here, and the amount of optimization headroom we still have. Moving forward, we could continue working on optimizations and have more impressive benchmark results, but we’re taking this milestone as a signal that we should shift focus back to feature work again.

If we were to break down our performance versus CPython, we (unsurprisingly) have better steady-state performance but worse startup time. As a quick measure of how our benchmark suite balances the two, the benchmark geomean has a value of 6.0 seconds; it’s hard to tell if this is the same balance as for our target server workloads.

Most of our startup time comes from LLVM jitting our code. This doesn’t mean that LLVM is to blame: our AST interpreter is fairly slow, requiring us to often tier out of it to our LLVM JIT. We also generate some very large LLVM IR in order to support our frame introspection, which slows down compilation times. We have a number of ideas on how to improve startup time on both these fronts (make LLVM jit quicker, and go to it less).
For steady-state performance, we tend to do well at executing our JIT’ed code, but our memory system — though much better than it was in 0.2 — is still not as good as CPython’s or other implementations’. Most of our speedup comes from our inline caching mechanisms, and we still have a lot of open headroom for more type speculations and LLVM optimizations, since we do almost none of either.

Project plans

On the project management side, we now have multiple people working full time on the project, in addition to the part-time help we’ve been getting! With the additional resources we’ve been able to move more quickly (you can see an uptick in GitHub commits), and we’ve set some aggressive goals for running Dropbox on Pyston. We’re very excited about how much we’re going to be able to get done.

Our goal moving forward is to continue expanding the fraction of the language+runtime that we support, and maintain certain performance targets as we go. Our current performance target is 1x CPython, but we may loosen it in order to prioritize feature work, since that tends to be more time-sensitive (blocks more things) than performance work. We’ll be targeting larger and larger applications to run under Pyston, with the ultimate target being the Dropbox server codebase.

Conclusion

As always, you can find our code on GitHub. We’ve released a binary that may or may not run on your system, but it is available for you to play with if you’re interested. Remember, though, that this is still an alpha and not ready for real use. If you run into issues or would like to contribute, please let us know!
