Wednesday, February 25, 2015

Experiments in Pyrlang with RPython

Pyrlang is an Erlang BEAM bytecode interpreter written in RPython.

It implements approximately 25% of the BEAM instruction set. It supports integer arithmetic (but not bignums), closures, exception handling, some operators on atoms, lists and tuples, user modules, and multiple processes on a single core. Pyrlang is still in development.

There are some differences between BEAM and the VM of PyPy:

  • BEAM is a register-based VM, whereas the VM in PyPy is stack-based.
  • There is no traditional call-stack in BEAM. The Y register in BEAM serves a similar purpose, but it is also used to store some local variables.
  • BEAM has neither typical language-level threads nor OS-level threads; it only has language-level processes, whose behavior is very similar to the actor model.

Regarding the bytecode dispatch loop, Pyrlang uses a while loop to fetch instructions and operands, call the function corresponding to each instruction, and jump back to the head of the loop. Due to the differences between the RPython call-stack and BEAM’s Y register, we decided to implement and manage the Y register by hand. PyPy, on the other hand, uses RPython’s call stack to implement Python’s call stack, so the function for the dispatch loop in PyPy calls itself recursively. This does not happen in Pyrlang.
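As a rough sketch of this structure, here is a minimal dispatch loop in Python; the names fetch and handlers and the process attributes are illustrative, not Pyrlang's actual code:

def dispatch_loop(process):
    pc = process.pc
    while True:
        # decode one instruction and its operands
        instr, operands, next_pc = fetch(process.code, pc)
        # call the function implementing this instruction; it returns the
        # new instruction pointer instead of calling dispatch_loop
        # recursively, so the RPython call-stack stays flat and the
        # Y register is managed by hand
        pc = handlers[instr](process, operands, next_pc)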

The Erlang compiler (erlc) usually compiles the bytecode instructions for function invocation into CALL (for normal invocation) and CALL_ONLY (for tail-recursive invocation). These can be implemented with trampoline semantics:

  • CALL instruction: The VM pushes the current instruction pointer (called the program counter in PyPy) onto the Y register and jumps to the destination label. When it encounters a RETURN instruction, the VM pops the instruction pointer from the Y register and returns there to continue executing the outer function.
  • CALL_ONLY instruction: The VM simply jumps to the destination label, without touching the Y register. As a result, a tail-recursive invocation never grows the Y register.

The current implementation only inserts the can_enter_jit JIT hint after the CALL_ONLY instruction. This means that the JIT only traces tail-recursive invocations in Erlang code, which have a semantic very similar to loops in imperative programming languages like Python.
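Here is a sketch of how these handlers and the hint could look; the handler names and the process structure are made up for illustration, but JitDriver and can_enter_jit are the real RPython JIT API:

from rpython.rlib.jit import JitDriver

jitdriver = JitDriver(greens=['pc', 'code'], reds=['process'])

def handle_call(process, pc, target):
    process.y_reg.append(pc)    # push the return address onto the Y register
    return target               # jump to the callee

def handle_call_only(process, pc, target):
    # tail call: nothing is pushed, so the Y register never grows; this is
    # the loop boundary of the Erlang code, so hint the JIT here (the
    # dispatch loop itself would contain the matching jit_merge_point)
    jitdriver.can_enter_jit(pc=target, code=process.code, process=process)
    return target

def handle_return(process, pc):
    return process.y_reg.pop()  # resume the caller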

We have also written a simple scheduler to implement language-level processes on a single core. The scheduler keeps a runnable queue. On each iteration, it pops one element (a process object with its dispatch loop) from the queue and executes that process's dispatch loop. Inside the dispatch loop there is a counter called the "reduction", which is decremented during the execution of the loop; when the reduction reaches 0, the dispatch loop terminates. The scheduler then pushes that process back into the runnable queue, pops the next one from the queue, and so on.
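A simplified sketch of this scheduler (illustrative names; the real one also has to handle waiting processes and message delivery):

class Scheduler(object):
    def __init__(self):
        self.runnable_queue = []

    def schedule(self):
        while self.runnable_queue:
            process = self.runnable_queue.pop(0)     # pop the next process
            process.reduction = 2000                 # the default value
            # the dispatch loop decrements the reduction and returns True
            # when the process has finished, False when it merely ran out
            # of reductions
            finished = process.dispatch_loop()
            if not finished:
                self.runnable_queue.append(process)  # run it again later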

We are planning to implement a multi-process scheduler for multi-core CPUs, which will require multiple schedulers and possibly a runnable queue per core, but that will be another story. :-)

Methods

We wrote two benchmark programs of Erlang:

  • FACT: A benchmark that calculates the factorial in a tail-recursive style. Because we haven’t implemented big integers, we apply a remainder operation to the value passed to the next iteration, so the number never overflows (see the sketch after this list).
  • REVERSE: The benchmark creates a reversed list of numbers, such as [20000, 19999, 19998, …], and applies a bubble sort to it.
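For reference, here is a Python rendering of what the FACT loop does (the real benchmark is written in Erlang, and the modulus value here is made up):

def fact(n, acc=1):
    while n > 1:                     # the tail recursion becomes a loop
        acc = acc * n % 1000000007   # keep the accumulator in fixnum range
        n -= 1
    return acc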

Results

The Value of Reduction

We used REVERSE to evaluate the JIT with different values of reduction:

The X axis is the value of the reduction, and the Y axis is the execution time (in seconds).

When the reduction value is small, it influences performance significantly, but as it grows larger the speed increases only slightly. We use 2000 as the default reduction value (the same default as the official Erlang interpreter).

Surprisingly, the trace is always generated, even when the reduction is very small, such as 0, where the dispatch loop can only run for a very limited number of iterations (the language-level process executes fewer instructions than one entire loop per scheduler switch). The generated trace is almost the same regardless of the reduction value.

The RPython JIT only cares about what code it meets, not who executes it, which is why the JIT always generates the trace above. The trace can even be shared among different threads if they execute the same code.

The overhead at low reduction values probably comes from the scheduler switching between processes too frequently, or from too-frequent switching between the bytecode interpreter and native code, rather than from the JIT itself.

Here is more explanation from Armin Rigo:

“The JIT works well because you’re using a scheme where some counter is decremented (and the soft-thread interrupted when it reaches zero) only once in each app-level loop. The soft-thread switch is done by returning to some scheduler, which will resume a different soft-thread by calling it. It means the JIT can still compile each of the loops as usual, with the generated machine code containing the decrease-and-check-for-zero operation which, when true, exits the assembler.”

Fair Process Switching vs. Unfair Process Switching

We were also concerned about the timing of the reduction decrement. In our initial version of Pyrlang, we decreased the reduction value at every local function invocation, module function invocation, and BIF (built-in function) invocation, since this is what the official Erlang interpreter does. However, since the JIT in RPython basically traces the target-language loop (which is the tail-recursive invocation in Pyrlang), it is typically better to keep the loop whole during a switch of the language-level process. We modified Pyrlang so that the reduction decrement only occurs after CALL_ONLY, which is the loop boundary of the target language.

Of course, this strategy may cause "unfair" execution among language-level processes. For example, if one process consists of a single long sequence of code, it executes until the end of that code. On the other hand, a process with a very short loop may execute only a few steps before being switched out by the scheduler. However, in the real world this "unfairness" is usually considered acceptable, and it is used in many VM implementations, including PyPy, to improve overall performance.

We compared these two versions of Pyrlang on the FACT benchmark. The reduction decrement behaves quite differently because there are some BIF invocations inside the loop: in the old version, the process can be suspended at loop boundaries or at any other function invocation, while in the new version it can be suspended only at loop boundaries.

We found the strategy effective, removing around 7% of the overhead. We also compared it on REVERSE, but since there are no extra invocations inside that trace, it cannot provide any performance improvement there. In the real world, we believe there is usually more than one extra invocation inside a single loop, so this strategy should be effective in most cases.

Comparison with Default Erlang and HiPE

We compared the performance of Pyrlang with the default Erlang interpreter and the HiPE (High Performance Erlang) compiler. HiPE is an official Erlang compiler that can compile Erlang source code to native code. The speed of Erlang programs obviously improves, at the cost of generality.

Please note that Pyrlang is still in development, so in some situations it does less work than the default Erlang interpreter, such as not checking for integer overflow into big integers, and not checking and taking locks when accessing the message queues of language-level processes; it is therefore faster. The final version of Pyrlang may be slower.

We used the two benchmark programs above and made sure both of them run for more than five seconds, to cover the JIT warm-up time of RPython. The experiment environment is an OS X 10.10 machine with a 3.5 GHz 6-core Intel Xeon E5 CPU and 14 GB of 1866 MHz DDR3 ECC memory.

Let’s look at the result of FACT. The graph shows that Pyrlang runs 177.41% faster on average than Erlang, and at almost the same speed as HiPE. However, since we haven’t implemented big integers in Pyrlang, the arithmetic operators do not do any overflow checking, so it is reasonable to expect the final version of Pyrlang to be slower than both the current version and HiPE.

As for REVERSE, the graph shows that Pyrlang runs 45.09% faster than Erlang, but 63.45% slower than HiPE on average. We think this is reasonable: there are only a few arithmetic operators in this benchmark, so the speeds of the three implementations are closer. However, we observed that at the scale of 40,000, Pyrlang slowed down significantly (111.35% slower than HiPE) compared with the other two scales (56.38% and 22.63% slower than HiPE).

So far we can only hypothesize about why Pyrlang slows down at that scale. We suspect the overhead comes from GC. The BEAM bytecode provides some GC hints that help the default Erlang implementation perform some GC operations immediately; for example, using GC_BIF instead of a plain BIF instruction tells the VM that there may be a GC opportunity, and how many live variables exist around that instruction. In Pyrlang we do not use these hints but rely entirely on RPython’s GC. When there is a huge number of objects at runtime (for REVERSE, the Erlang list objects), execution therefore slows down.

Ruochen Huang

Monday, February 23, 2015

linalg support in pypy/numpy

Introduction

PyPy's numpy support has matured enough that it can now support the lapack/blas libraries through the numpy.linalg module. To install the version of numpy this blog post refers to, install PyPy version 2.5.0 or newer, and run this:

pypy -m pip install git+https://bitbucket.org/pypy/numpy.git

This update is a major step forward for PyPy's numpy support. Many of the basic matrix operations depend on linalg; even matplotlib requires it to display legends (a pypy-friendly version of matplotlib 1.3 is available at https://github.com/mattip/matplotlib).

A number of improvements and adaptations, some of which are in the newly-released PyPy 2.5.0, made this possible:
  • Support for an extended frompyfunc(), which in the PyPy version supports much of the ufunc API (signatures, multiple dtypes) allowing creation of pure-python, jit-friendly ufuncs. An additional keyword allows choosing between out = func(in) or func(in, out) ufunc signatures. More explanation follows.
  • Support for GenericUfuncs via PyPy's (slow) capi-compatibility layer. The underlying mechanism actually calls the internal implementation of frompyfunc().
  • A cffi version of _umath_linalg. Since cffi uses dlopen() to call into shared objects, we added support in the numpy build system to create non-python shared libraries from source code in the numpy tree. We also rewrote parts of the c-based _umath_linalg.c.src in python, renamed numpy's umath_linalg capi module to umath_linalg_capi, and use it as a shared object through cffi (a rough sketch of this pattern follows this list).
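To illustrate that pattern, here is a sketch of loading a shared object and declaring an entry point through cffi; the declaration and the library name below are assumptions for illustration, not the actual _umath_linalg bindings:

import cffi

ffi = cffi.FFI()
# declare only the entry points we call; dgesv_ is LAPACK's linear solver
ffi.cdef("""
    void dgesv_(int *n, int *nrhs, double *a, int *lda,
                int *ipiv, double *b, int *ldb, int *info);
""")
lib = ffi.dlopen("libumath_linalg_capi.so")  # hypothetical shared object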

Status

We have not yet implemented all the linalg features. dtype resolution via casting is missing, especially for complex ndarrays, which leads to slight numerical errors where numpy uses a more precise type for intermediate calculations. Other missing features in PyPy's numpy support may have implications for complete linalg support.

Some OSX users have noticed they need to update pip to version 6.0.8 to overcome a regression in pip, and it is not clear if we support all combinations of blas/lapack implementations on all platforms.

Over the next few weeks we will be ironing out these issues.

Performance

A simple benchmark is shown below, but let's state the obvious: PyPy's JIT and the iterators built into PyPy's ndarray implementation will in most cases be no faster than CPython's numpy. The JIT can help where there is a mixture of python and numpy-array code. We do have plans to implement lazy evaluation and to further optimize PyPy's support for numeric python, but numpy is quite good at what it does.

HowTo for PyPy's extended frompyfunc

The magic enabling blas support is a rewrite of the _umath_linalg c-based module as a cffi-python module that creates ufuncs via frompyfunc. We extended the numpy frompyfunc to allow it to function as a replacement for the generic ufunc available in numpy only through the c-api.

We start with the basic frompyfunc, which wraps a python function into a ufunc:
 
from numpy import frompyfunc

def times2(in0):
    return in0 * 2

ufunc = frompyfunc(times2, 1, 1)

In cpython's numpy the dtype of the result is always object, which is not implemented (yet) in PyPy, so this example will fail. While the utility of object dtypes can be debated, in the meantime we add a non-numpy-compatible keyword argument dtypes to frompyfunc. If dtypes=['match'], the output dtype will match the dtype of the first input ndarray:

from numpy import arange

ufunc = frompyfunc(times2, 1, 1, dtypes=['match'])
ai = arange(24).reshape(3, 4, 2)
ao = ufunc(ai)
assert (ao == ai * 2).all()

I hear you ask "why is the dtypes keyword argument a list?" This is so we can support the Generalized Universal Function API, which allows specifying a number of specialized functions and the input-output dtypes each specialized function accepts.
Note that the ufunc feeds the values of ai to the function one at a time; the function operates on scalar values. To support more complicated ufunc calls, the generalized ufunc API allows defining a signature, which specifies the layout of the ndarray inputs and outputs. So we extended frompyfunc with a signature keyword as well.
We add one further extension to frompyfunc: we allow a Boolean keyword stack_inputs to specify the argument layout of the function itself. If the function is of the form:
 
out0, out1, ... = func(in0, in1,...)

then stack_inputs is False. If it is True the function is of the form:
 
func(in0, in1, ... out0, out1, ...)

Here is a complete example of using frompyfunc to create a ufunc, based on this link:
 
from numpy import frompyfunc, arange, dtype

def times2(in_array, out_array):
    # with stack_inputs=True the function receives whole ndarrays
    in_flat = in_array.flat
    out_flat = out_array.flat
    for i in range(in_array.size):
        out_flat[i] = in_flat[i] * 2

ufunc = frompyfunc([times2, times2], 1, 1,
                   signature='(i)->(i)',
                   dtypes=[dtype(int), dtype(int),
                           dtype(float), dtype(float),
                          ],
                   stack_inputs=True,
                   )
ai = arange(10, dtype=int)
ai2 = ufunc(ai)
assert all(ai2 == ai * 2)

Using this extended syntax, we rewrote the lapack calls into the blas functions in pure python, no C needed. Benchmarking showed that this approach was actually much slower than using the upstream umath_linalg module via cpyext, as can be seen in the benchmarks below. This is due to the need to copy c-aligned data into Fortran-aligned format: our __getitem__ and __setitem__ iterators are not as fast as pointer arithmetic in C. So we next tried a hybrid approach: compile and use numpy's umath_linalg python module as a shared object, and call the optimized specific wrapper function from it.

Benchmarks

Here are some benchmarks, running a tight loop of the different versions of linalg.inv(a), where a is a 10x10 double ndarray. The benchmark ran on an i7 processor running Ubuntu 14.04, 64-bit:
Impl.                                               Time after warmup
CPython 2.7 + numpy 1.10.dev + lapack               8.9 msec/1000 loops
PyPy 2.5.0 + numpy + lapack via cpyext              8.6 msec/1000 loops
PyPy 2.5.0 + numpy + lapack via pure python + cffi  19.9 msec/1000 loops
PyPy 2.5.0 + numpy + lapack via python + c + cffi   9.5 msec/1000 loops


While no general conclusions may be drawn from a single micro-benchmark, it does indicate that there is some merit in the approach taken.

Conclusion

PyPy's numpy now includes a working linalg module. There are still some rough corners, but hopefully we have implemented the parts you need. While the speed of the isolated linalg function is no faster than CPython and upstream numpy, it should not be significantly slower either. Your use case may see an improvement if you use a mix of python and lapack, which is the usual case.

Please let us know how it goes. We love to hear success stories too.

We still have challenges at all levels of programming, and are always looking for people willing to contribute, so stop by on IRC at #pypy.

mattip and the PyPy Team

Wednesday, February 11, 2015

NumPyPy status - January 2015

Hi Everyone

Here is what has been done in January thanks to the funding of NumPyPy. I would like to thank all the donors and remind you that you can still donate:
  • I have focused on implementing the object dtype this month; it is now possible to store objects inside ndarrays using the object dtype
  • It is also possible to add an object ndarray to any other ndarray (implementing the other operators is trivial)
The things I plan on working on next are:
  • Implementing the missing operations for object arrays
  • Implementing garbage collection support for object arrays (currently, storing an object inside an ndarray doesn't keep the object alive)
  • Packaging NumPyPy on PyPI
Cheers
Romain

Tuesday, February 3, 2015

PyPy 2.5.0 released

PyPy 2.5.0 - Pincushion Protea

We’re pleased to announce PyPy 2.5, which contains significant performance enhancements and bug fixes.
You can download the PyPy 2.5.0 release here:
We would like to thank our donors for their continued support of the PyPy project, including those who donate to our three sub-projects, as well as our volunteers and contributors (10 new committers joined PyPy since the last release). We’ve shown quite a bit of progress, but we’re slowly running out of funds. Please consider donating more, or even better, convince your employer to donate, so we can finish those projects! The three sub-projects are:
  • Py3k (supporting Python 3.x): We have released a Python 3.2.5 compatible version we call PyPy3 2.4.0, and are working toward a Python 3.3 compatible version
  • STM (software transactional memory): We have released a first working version, and continue to try out new promising paths of achieving a fast multithreaded Python
  • NumPy which requires installation of our fork of upstream numpy, available on bitbucket

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows, and OpenBSD), as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
While we support 32-bit python on Windows, work on the native Windows 64-bit python is still stalled; we would welcome a volunteer to handle that.

Highlights

  • The past months have seen pypy mature and grow, as rpython becomes the go-to solution for writing fast dynamic language interpreters. Our separation of rpython and the python interpreter PyPy is now much clearer in the PyPy documentation and we now have separate RPython documentation.
  • We have improved warmup time as well as jitted code performance: more than 10% compared to pypy-2.4.0. We no longer zero out memory allocated in the gc nursery by default, work that was started during a GSoC.
  • Passing objects between C and PyPy has been improved. We are now able to pass raw pointers to C (without copying) using pinning. This improves I/O; benchmarks that use networking intensively improved by about 50%. File() operations still need some refactoring but are already showing a 20% improvement on our benchmarks. Let us know if you see similar improvements.
  • Our integrated numpy support gained much of the GenericUfunc api in order to support the lapack/blas linalg module of numpy. This dovetails with work in the pypy/numpy repository to support linalg both through the (slower) cpyext capi interface and also via (the faster) pure python cffi interface, using an extended frompyfunc() api. We will soon post a separate blog post specifically about linalg and PyPy.
  • Dictionaries are now ordered by default, see the blog post
  • Our nightly translations use --shared by default, including on OS X and Linux
  • We now more carefully handle errno (and GetLastError, WSAGetLastError) tying the handlers as close as possible to the external function call, in non-jitted as well as jitted code.
  • Issues reported with our previous release were resolved after reports from users on our issue tracker at https://bitbucket.org/pypy/pypy/issues or on IRC at #pypy.
We have further improvements on the way: rpython file handling, finishing numpy linalg compatibility, numpy object dtypes, a better profiler, as well as support for Python stdlib 2.7.9.
Please try it out and let us know what you think. We especially welcome success stories, we know you are using PyPy, please tell us about it!
Cheers
The PyPy Team

Thursday, January 22, 2015

Faster, more memory efficient and more ordered dictionaries on PyPy

Hello everyone!

As of today, we merged the latest branch that brings better dictionaries to PyPy by default. The work is based on an idea by Raymond Hettinger on python-dev, with prior work done notably in Java.  It was done by Maciej Fijałkowski and Armin Rigo, with Laurence Tratt recently prodding us to finish it.  (Earlier work going in a similar direction include Alex Gaynor's work on ordered dicts in Topaz, which was also used in the Hippy VM.  Each of these pieces of work is itself based on the original dict implementation in RPython, whose origins fade in the Subversion prehistory of PyPy.)  Coincidentally, a very similar idea has been implemented in Zend PHP very recently. Zend implementation description.

This post covers the basics of design and implementation as well as some basic benchmarks.

Dictionaries are now ordered!

One surprising part is that the new design, besides being more memory efficient, is ordered by design: it preserves the insertion order.  This is not forbidden by the Python language, which allows any order.  It makes the collections.OrderedDict subclass much faster than before: it is now a thin subclass of dict.  Obviously, we recommend that any portable Python program continues to use OrderedDict when ordering is important.  Note that a non-portable program might rely on more: for example, a **keywords argument now receives the keywords in the same order as the one in which they were given in the call.  (Whether such a thing might be called a language design change or not is a bit borderline.)  The point is that Python programs that work on CPython or previous versions of PyPy should continue to work on PyPy.
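For instance (a small illustration, not from the original post):

def f(**kwargs):
    return list(kwargs)     # the keys, in the order the dict keeps them

print(f(a=1, b=2, c=3))     # ['a', 'b', 'c'] on PyPy with the new dicts;
                            # arbitrary order on CPython 2.7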

There is one exception, though.  The iterators of the OrderedDict subclass are now working just like the ones of the dict builtin: they will raise RuntimeError when iterating if the dictionary was modified.  In the CPython design, the class OrderedDict explicitly doesn't worry about that, and instead you get some result that might range from correct to incorrect to crashes (i.e. random Python exceptions).
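A minimal illustration of the new behavior:

from collections import OrderedDict

d = OrderedDict.fromkeys('abc')
for key in d:
    d['x'] = 1    # raises RuntimeError on PyPy; CPython's OrderedDict may
                  # instead give correct, incorrect or crashing results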

Original PyPy dictionary design

Originally, PyPy dictionaries, as well as CPython dictionaries are implemented as follows (simplified view):

struct dict {
   long num_items;
   dict_entry* items;   /* pointer to array */
}

struct dict_entry {
   long hash;
   PyObject* key;
   PyObject* value;
}

Where items is a sparse array, with 1/3 to 1/2 of the items being NULL. The average space occupied by a dictionary is 3 * WORD * 12/7 plus some small constant (the smallest dict has 8 entries, which is 8 * 3 * WORD + 2 * WORD = 26 WORDs).

New PyPy dictionary design

The new PyPy dictionary is split in two arrays:

struct dict {
    long num_items;
    variable_int *sparse_array;
    dict_entry* compact_array;
}

struct dict_entry {
    long hash;
    PyObject *key;
    PyObject *value;
}

Here, compact_array stores all the items in order of insertion, while sparse_array is a 1/2 to 2/3 full array of integers. The integers themselves are of the smallest size necessary for indexing the compact_array. So if compact_array has less than 256 items, then sparse_array will be made of bytes; if less than 2^16, it'll be two-byte integers; and so on.
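To make the design concrete, here is a heavily simplified Python sketch of lookup and insertion in this scheme; collision handling, resizing and deletion are omitted, and all names are illustrative:

FREE = -1

def lookup(d, key):
    slot = hash(key) % len(d.sparse_array)
    index = d.sparse_array[slot]      # a small integer: byte, 2 bytes, ...
    if index == FREE:
        raise KeyError(key)
    h, k, v = d.compact_array[index]  # entries sit in insertion order
    return v

def insert(d, key, value):
    slot = hash(key) % len(d.sparse_array)
    d.compact_array.append((hash(key), key, value))  # append keeps order
    d.sparse_array[slot] = len(d.compact_array) - 1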

This design saves quite a bit of memory. For example, on 64-bit systems we could, but almost never do, need to index more than 4 billion elements, and for small dicts the extra sparse_array takes very little space. For example, a 100-element dict would take on average, with the original design on 64-bit, 100 * 12/7 * WORD * 3 =~ 4100 bytes, while with the new design it's 100 * 12/7 + 3 * WORD * 100 =~ 2600 bytes, quite a significant saving.

GC friendliness

The obvious benefit of having more compact dictionaries is an increased cache friendliness. In modern CPUs cache misses are much more costly than doing additional simple work, like having an additional level of (in-cache) indirection. Additionally, there is a GC benefit coming from it. When doing a minor collection, the GC has to visit all the GC fields in old objects that can point to young objects. In the case of large arrays, this can prove problematic since the array grows and with each minor collection we need to visit more and more GC pointers. In order to avoid it, large arrays in PyPy employ a technique called "card marking" where the GC only visits "cards" or subsets of arrays that were modified between collections. The problem with dictionaries was that by design modifications in a dictionary occur randomly, hence a lot of cards used to get invalidated. In the new design, however, new items are typically appended to the compact_array, hence invalidate much fewer cards --- which improves GC performance.  (The new sparse_array is an array of integers, so it does not suffer from the same problems.)

Deletion

Deleting entries from dictionaries is not very common, but important in a few use cases.  To preserve order, when we delete an entry, we mark the entry as removed but don't otherwise shuffle the remaining entries.  If we repeat this operation often enough, there will be a lot of removed entries in the (originally compact) array.  At this point, we need to do a "packing" operation, which moves all live entries to the start of the array (and then reindexes the sparse array, as the positions changed).  This works well, but there are use cases where previously no reindexing was ever needed, so it makes these cases a bit slower (for example when repeatedly adding and removing keys in equal number).
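A sketch of that packing step, under the same simplifications as in the sketch above (REMOVED is an illustrative marker for deleted entries):

REMOVED = object()

def pack(d):
    # move the live entries to the start of compact_array, dropping the
    # entries marked as removed, then rebuild the sparse index
    d.compact_array = [entry for entry in d.compact_array
                       if entry is not REMOVED]
    d.sparse_array = [FREE] * len(d.sparse_array)
    for index, (h, key, value) in enumerate(d.compact_array):
        d.sparse_array[h % len(d.sparse_array)] = index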

Benchmarks

The PyPy speed benchmarks show mostly small effects, see changes. The microbenchmarks that we did show large improvements on large and very large dictionaries (in particular, building dictionaries of at least a couple of hundred items is now twice as fast) and break even on small ones (between 20% slower and 20% faster, depending very much on the usage patterns and sizes of the dictionaries). The new dictionaries enable various optimization possibilities which we're going to explore in the near future.

Cheers,
fijal, arigo and the PyPy team


Thursday, January 15, 2015

Leysin Winter Sprint (20-28th February 2015)

The next PyPy sprint will be in Leysin, Switzerland, for the tenth time. This is a fully public sprint: newcomers and topics other than those proposed below are welcome.

Goals and topics of the sprint

The details depend on who is here and ready to work. We might touch topics such as:

  • cleaning up the optimization step in the JIT, changing the register allocation done by the JIT's backend, or improving the warm-up time

  • STM (Software Transactional Memory), notably: trying to come up with benchmarks and measuring them carefully, in order to test and improve the conflict reporting tools, and more generally to figure out how practical it is to avoid conflicts in large projects

  • vmprof - work on a statistical profiler for CPython and PyPy, including making it more user-friendly.

  • Py3k (Python 3.x support), NumPyPy (the numpy module)

  • added: cffi 1.0, trying out pygame+cffi on Raspberry Pi devices

  • And as usual, the main side goal is to have fun in winter sports :-) We can take a day off for ski.

Exact times

For a change, and as an attempt to simplify things, I specified the dates as 20-28 February 2015, where 20 and 28 are travel days. We will work full days between the 21st and the 27th. You are of course allowed to show up for only a part of that time, too.

Location and Accommodation

Leysin, Switzerland, "same place as before". Let me refresh your memory: both the sprint venue and the lodging will be in a very spacious pair of chalets built specifically for bed & breakfast: Ermina. The place has a good ADSL Internet connection with wireless installed. You can of course arrange your own lodging anywhere (as long as you are in Leysin, you cannot be more than a 15 minutes walk away from the sprint venue), but I definitely recommend lodging there too -- you won't find a better view anywhere else (though you probably won't get much worse ones easily, either :-)

Please confirm that you are coming so that we can adjust the reservations as appropriate. In the past, the rates were around 60 CHF a night all included in 2-person rooms, with breakfast. Now, the rooms available are either single-person (or couple), or rooms for 3 persons. The latter choice is recommended and should be under 60 CHF per person.

Please register by Mercurial, or on the pypy-dev mailing list if you do not yet have check-in rights.

You need a Swiss-to-(insert country here) power adapter. There will be some Swiss-to-EU adapters around, and at least one EU-format power strip.

Friday, November 28, 2014

September donations and thank you to the Python Software Foundation!

Hello everyone!

We would like to show you a short update on the PyPy funding. We gathered a total of $15,986 in the month of September and as per earlier agreement, the Python Software Foundation donated $10,000 to PyPy. We would like to thank everyone participating and the PSF in particular for supporting the PyPy project and making our work possible!

We've been working hard on the goals outlined in the funding proposals.

  • PyPy Python 3 support has been in beta for a while and it's already being used by many people, as seen by the number of reported bugs. We're currently supporting 3.2, planning on moving towards 3.4 in the future.
  • Software Transactional Memory has been a successful research project, with first real world results shown during the Warsaw sprint.
  • More detailed update on numpy will be published soon. A little spoiler is that we're planning on addressing matplotlib, scipy and the larger ecosystem to some extent. Stay tuned!

Again, thanks to everyone who donated and happy Thanksgiving to everyone on that side of the world!

Cheers,
fijal and the entire PyPy team


Monday, November 17, 2014

Tornado without a GIL on PyPy STM

This post is by Konstantin Lopuhin, who tried PyPy STM during the Warsaw sprint.

Python has a GIL, right? Not quite - PyPy STM is a python implementation without a GIL, so it can scale CPU-bound work to several cores. PyPy STM is developed by Armin Rigo and Remi Meier, and supported by community donations. You can read more about it in the docs.

Although PyPy STM is still a work in progress, in many cases it can already run CPU-bound code faster than regular PyPy when using multiple cores. Here we will see how to slightly modify the Tornado IO loop to use the transaction module. This module is described in the docs and is really simple to use - please see an example there. An event loop of Tornado, or any other asynchronous web server, looks like this (with some simplifications):

while True:
    for callback in list(self._callbacks):
        self._run_callback(callback)
    event_pairs = self._impl.poll()
    self._events.update(event_pairs)
    while self._events:
        fd, events = self._events.popitem()
        handler = self._handlers[fd]
        self._handle_event(fd, handler, events)

We get IO events and run the handlers for all of them; these handlers can also register new callbacks, which we run too. When using such a framework, it is very nice to have a guarantee that all handlers are run serially, so you do not have to use any locks. This is an ideal case for the transaction module - it guarantees that things appear to run serially, so in user code we do not need any locks. We just need to change the code above to something like:

while True:
    for callback in list(self._callbacks):
        transaction.add(                # added
            self._run_callback, callback)
    transaction.run()                   # added
    event_pairs = self._impl.poll()
    self._events.update(event_pairs)
    while self._events:
        fd, events = self._events.popitem()
        handler = self._handlers[fd]
        transaction.add(                # added
            self._handle_event, fd, handler, events)
    transaction.run()                   # added

The actual commit is here; we had to extract a little function to run the callback.

Part 1: a simple benchmark: primes

Now we need a simple benchmark. Let's start with this one: calculate a list of primes up to a given number, and return it as JSON:

def is_prime(n):
    for i in xrange(2, n):
        if n % i == 0:
            return False
    return True

class MainHandler(tornado.web.RequestHandler):
    def get(self, num):
        num = int(num)
        primes = [n for n in xrange(2, num + 1) if is_prime(n)]
        self.write({'primes': primes})

We can benchmark it with siege:

siege -c 50 -t 20s http://localhost:8888/10000

But this does not scale. The CPU load is at 101-104%, and we handle 30% fewer requests per second. The reason for the slowdown is STM overhead: it needs to keep track of all writes and reads in order to detect conflicts. And the reason for using only one core is, obviously, conflicts! Fortunately, we can see what these conflicts are if we run code like this (here 4 is the number of cores to use):

PYPYSTM=stm.log ./primes.py 4

Then we can use print_stm_log.py to analyse this log. It lists the most expensive conflicts:

14.793s lost in aborts, 0.000s paused (1258x STM_CONTENTION_INEVITABLE)
File "/home/ubuntu/tornado-stm/tornado/tornado/httpserver.py", line 455, in __init__
    self._start_time = time.time()
File "/home/ubuntu/tornado-stm/tornado/tornado/httpserver.py", line 455, in __init__
    self._start_time = time.time()
...

There are only three kinds of conflicts; they are described in the stm source. Here we see that two threads call into an external function to get the current time, and we cannot roll back either of them, so one of them must wait until the other transaction finishes. For now we can hack around this by disabling this timing - it is only needed for internal profiling in tornado.

If we do it, we get the following results (but see caveats below):

Impl.        req/s
PyPy 2.4     14.4
CPython 2.7   3.2
PyPy-STM 1    9.3
PyPy-STM 2   16.4
PyPy-STM 3   20.4
PyPy-STM 4   24.2

As we can see, in this benchmark PyPy STM using just two cores can beat regular PyPy! This is not linear scaling, there are still conflicts left, and this is a very simple example but still, it works!

But it's not that simple yet :)

First, these are best-case numbers after long (much longer than for regular PyPy) warmup. Second, it can sometimes crash (although removing old pyc files fixes it). Third, benchmark meta-parameters are also tuned.

Here we get relatively good results only when there are a lot of concurrent clients - as a result, a lot of requests pile up, the server is not keeping up with the load, and the transaction module is busy running these piled-up requests. If we decrease the number of concurrent clients, the results get slightly worse. Another thing we can tune is how heavy each request is - again, if we ask for primes up to a lower number, then less time is spent doing calculations and more time is spent in tornado, and the results get much worse.

Besides the time.time() conflict described above, there are a lot of others. The bulk of time is lost in these two conflicts:

14.153s lost in aborts, 0.000s paused (270x STM_CONTENTION_INEVITABLE)
File "/home/ubuntu/tornado-stm/tornado/tornado/web.py", line 1082, in compute_etag
    hasher = hashlib.sha1()
File "/home/ubuntu/tornado-stm/tornado/tornado/web.py", line 1082, in compute_etag
    hasher = hashlib.sha1()

13.484s lost in aborts, 0.000s paused (130x STM_CONTENTION_WRITE_READ)
File "/home/ubuntu/pypy/lib_pypy/transaction.py", line 164, in _run_thread
    got_exception)

The first one is presumably calling into some C function from the stdlib, and we get the same conflict as for time.time() above, but it can be fixed on the PyPy side, as we can be sure that computing sha1 is pure.

It is easy to hack around this one too, just removing etag support, but if we do it, performance is much worse, only slightly faster than regular PyPy, with the top conflict being:

83.066s lost in aborts, 0.000s paused (459x STM_CONTENTION_WRITE_WRITE)
File "/home/arigo/hg/pypy/stmgc-c7/lib-python/2.7/_weakrefset.py", line 70, in __contains__
File "/home/arigo/hg/pypy/stmgc-c7/lib-python/2.7/_weakrefset.py", line 70, in __contains__

Comment by Armin: It is unclear why this happens so far. We'll investigate...

The second conflict (without etag tweaks) originates in the transaction module, from this piece of code:

while True:
    self._do_it(self._grab_next_thing_to_do(tloc_pending),
                got_exception)
    counter[0] += 1

Comment by Armin: This is a conflict in the transaction module itself; ideally, it shouldn't have any, but in order to do that we might need a little bit of support from RPython or C code. So this is pending improvement.

The Tornado modification used in this blog post is based on 3.2.dev2. As of now, the latest version is 4.0.2, and if we apply the same changes to this version, we no longer get any scaling on this benchmark, and there are no conflicts that take any substantial time.

Comment by Armin: There are two possible reactions to a conflict. We can either abort one of the two threads, or (depending on the circumstances) just pause the current thread until the other one commits, after which the thread will likely be able to continue. The tool ``print_stm_log.py`` did not report conflicts that cause pauses. It has been fixed very recently. Chances are that on this test it would report long pauses and point to locations that cause them.

Part 2: a more interesting benchmark: A-star

Although we have seen that PyPy STM is not all moonlight and roses, it is interesting to see how it works on a more realistic application.

astar.py is a simple game where several players move on a map (represented as a list of lists of integers), build and destroy walls, and ask the server to give them shortest paths between two points using A-star search, adapted from an ActiveState recipe.

The benchmark bench_astar.py simulates players and tries to put the main load on A-star search, but it also does some wall building and destruction. There are no locks around map modifications, as normal tornado executes all callbacks serially, and we can keep this guarantee with the atomic blocks of PyPy STM. This is also an example of a program that is not trivial to scale to multiple cores with separate processes (assuming more interesting shared state and logic).

This benchmark is very noisy due to the randomness of client interactions (it may also be non-linear), so only lower and upper bounds for the number of requests are reported.

Impl.        req/s
PyPy 2.4     5 .. 7
CPython 2.7  0.5 .. 0.9
PyPy-STM 1   2 .. 4
PyPy-STM 4   2 .. 6

Clearly this is a very bad benchmark, but still we can see that scaling is worse and STM overhead is sometimes higher. The bulk of conflicts come from the transaction module (we have seen it above):

91.655s lost in aborts, 0.000s paused (249x STM_CONTENTION_WRITE_READ)
File "/home/ubuntu/pypy/lib_pypy/transaction.py", line 164, in _run_thread
    got_exception)

Although it is definitely not ready for production use, you can already try to run things, report bugs, and see what is missing in user-facing tools and libraries.

Benchmarks setup:

Wednesday, November 5, 2014

PyPy IO improvements


Hello everyone!

We've wrapped up the Warsaw sprint, so I would like to describe some branches which have been recently merged and which improved the I/O and the GC: gc_no_cleanup_nursery and gc-incminimark-pinning.

The first branch was started by Wenzhu Man for her Google Summer of Code and finished by Maciej Fijałkowski and Armin Rigo. The PyPy GC works by allocating new objects in the young object area (the nursery), simply by incrementing a pointer. After each minor collection, the nursery has to be cleaned up. For simplicity, the GC used to do it by zeroing the whole nursery.

This approach has bad effects on the cache, since you zero a large piece of memory at once, and it does unnecessary work for things that don't require zeroing, like large strings. We mitigated the first problem somewhat with incremental nursery zeroing, but this branch removes the zeroing completely, thus improving string handling and recursive code (since jitframes don't require zeroed memory either). I measured the effect on two examples: a recursive implementation of fibonacci, and gcbench, to measure GC performance.

The results for fibonacci and gcbench are below (normalized to cpython 2.7). Benchmarks were run 50 times each (note that the big standard deviation comes mostly from the warmup at the beginning, true figures are smaller):

benchmark  CPython           PyPy 2.4            PyPy non-zero
fibonacci  4.8+-0.15 (1.0x)  0.59+-0.07 (8.1x)   0.45+-0.07 (10.6x)
gcbench    22+-0.36 (1.0x)   1.34+-0.28 (16.4x)  1.02+-0.15 (21.6x)

The second branch was done by Gregor Wegberg for his master thesis and finished by Maciej Fijałkowski and Armin Rigo. Because of the way it works, the PyPy GC from time to time moves objects in memory, meaning that their address can change. Therefore, if you want to pass pointers to some external C function (for example, write(2) or read(2)), you need to ensure that the objects they point to will not be moved by the GC (e.g. when running a different thread). PyPy up to 2.4 solves the problem by copying the data into or from a non-movable buffer, which is obviously inefficient. This branch introduces the concept of "pinning", which allows us to inform the GC that it is not allowed to move a certain object for a short period of time. This introduces a bit of extra complexity in the garbage collector, but improves the I/O performance quite drastically, because we no longer need the extra copy to and from the non-movable buffers.

In this benchmark, which does I/O in a loop, we either write a number of bytes from a freshly allocated string into /dev/null or read a number of bytes from /dev/full. I'm showing the results for PyPy 2.4, PyPy with non-zero-nursery and PyPy with non-zero-nursery and object pinning. Those are wall times for cases using os.read/os.write and file.read/file.write, normalized against CPython 2.7.

Benchmarks were done using PyPy 2.4 and revisions 85646d1d07fb for non-zero-nursery and 3d8fe96dc4d9 for non-zero-nursery and pinning. The benchmarks were run once, since the standard deviation was small.

The Y axis is speed, normalized to CPython; higher is better.

What we can see is that os.read and os.write both improved greatly and now outperform CPython in every combination. File operations are a little more tricky: while those branches improved the situation a bit, the improvement is not as drastic as for the os versions. It really should not be the case, and it showcases how our file buffering is inferior to CPython's. We plan on removing our own buffering and using FILE* in C in the near future, so we should outperform CPython on those too (since our allocations are cheaper). If you look carefully at the benchmark, the write function is copied three times. This hack is intended to avoid the JIT overspecializing the assembler code, which happens because the buffering code was written way before the JIT was done. In fact, our buffering is hilariously bad, but if the stars align correctly it can be JIT-compiled to something that's not half bad. Try removing the hack and seeing how the performance of the last benchmark drops :-) Again, this hack should be absolutely unnecessary once we remove our own buffering, so stay tuned for more.

Cheers,
fijal

Tuesday, October 21, 2014

PyPy3 2.4.0 released

We're pleased to announce the availability of PyPy3 2.4.0!

This release contains several bugfixes and enhancements. Among the user-facing improvements specific to PyPy3:
  • Better Windows compatibility, e.g. the nt module functions _getfinalpathname & _getfileinformation are now supported (the former is required for the popular pathlib library for example)
  • Various fsencode PEP 383 related fixes to the posix module (readlink, uname, ttyname and ctermid) and improved locale handling
  • Switched the default binary name on POSIX distributions from 'pypy' to 'pypy3' (which symlinks to 'pypy3.2')
  • Fixed a couple different crashes related to parsing Python 3 source code

And improvements shared with the recent PyPy 2.4.0 release:
  • internal refactoring in string and GIL handling which led to significant speedups
  • improved handling of multiple objects (like sockets) in long-running programs. They are collected and released more efficiently, reducing memory use. In simpler terms - we closed what looked like a memory leak
  • Windows builds now link statically to zlib, expat, bzip, and openssl-1.0.1i
  • Many issues were resolved since the 2.3.1 release in June

You can download PyPy3 2.4.0 here http://pypy.org/download.html.

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and 3.2.5. It's fast (pypy 2.4 and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows, and OpenBSD, as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux. 
We would like to thank our donors for the continued support of the PyPy project.

The complete release notice is here.

Please try it out and let us know what you think. We especially welcome success stories, please tell us about how it has helped you!

Cheers, The PyPy Team

Tuesday, October 14, 2014

Couchbase contribution to PyPy

Hello everyone!

We always offer to put on the blog info about our sponsors who donate substantial amounts of money. So far most people decided to stay anonymous, so this is the first blog post describing our sponsor and his relationship to PyPy, hopefully not the last. We'll also publish a full blog post about the PSF-matched fundraiser soon. This is a guest post by Brent Woodruff from Couchbase.

Couchbase is a leading NoSQL document database that provides a flexible data model, high performance, scalability, and high availability. Couchbase is a commercially supported open source project. Visit us at http://www.couchbase.com and http://github.com/couchbase.

Couchbase Inc. donated $2000.00, and employees of Couchbase personally contributed a disclosed additional $230.00, towards PyPy progress during the September funding drive. These funds will see a match from the Python Software Foundation.

PyPy is primarily used by Couchbase employees to perform product analysis and troubleshooting using internally developed tools. Every customer of Couchbase benefits from the use of PyPy, both due to the rapid development provided by Python and the speed of the resulting tools provided by the PyPy JIT interpreter.

“PyPy is great - it gave us a 4x speedup in our CPU-intensive internal application over CPython” -Dave Rigby and Daniel Owen, Couchbase Engineers

Additionally, Couchbase has a preliminary CFFI-based Couchbase client available for PyPy users.


Monday, September 22, 2014

PyPy 2.4.0 released, 9 days left in funding drive

We're pleased to announce the availability of PyPy 2.4.0; faster, fewer bugs, and updated to the python 2.7.8 stdlib.

This release contains several bugfixes and enhancements. Among the user-facing improvements:
  • internal refactoring in string and GIL handling which led to significant speedups
  • improved handling of multiple objects (like sockets) in long-running programs. They are collected and released more efficiently, reducing memory use. In simpler terms - we closed what looked like a memory leak
  • Windows builds now link statically to zlib, expat, bzip, and openssl-1.0.1i
  • Many issues were resolved since the 2.3.1 release in June

You can download PyPy 2.4.0 here http://pypy.org/download.html.

We would like to also point out that in September, the Python Software Foundation will match funds for any donations up to $10k, so head over to our website and help this mostly-volunteer effort out.

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and 3.2.5. It's fast (pypy 2.4 and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows, and OpenBSD, as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux. 
We would like to thank our donors for the continued support of the PyPy project.

The complete release notice is here.

Please try it out and let us know what you think. We especially welcome success stories, please tell us about how it has helped you!

Cheers, The PyPy Team

Monday, September 8, 2014

PyPy 2.4-beta just in time for PSF's funding drive

We're pleased to announce the availability of PyPy 2.4-beta1; faster, fewer bugs, and updated to the python 2.7.8 stdlib.

This release contains several bugfixes and enhancements. Among the user-facing improvements:
  • internal refactoring in string and GIL handling which led to significant speedups
  • improved handling of multiple objects (like sockets) in long-running programs. They are collected and released more efficiently, reducing memory use. In simpler terms - we closed what looked like a memory leak
  • Windows builds now link statically to zlib, expat, bzip, and openssl-1.0.1i
  • Many issues were resolved since the 2.3.1 release in June

You can download the PyPy 2.4-beta1 release here http://pypy.org/download.html.

We would like to also point out that in September, the Python Software Foundation will match funds for any donations up to $10k, so head over to our website and help this mostly-volunteer effort out.

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and 3.2.5. It's fast (pypy 2.4 and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows, and OpenBSD, as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux. 
We would like to thank our donors for the continued support of the PyPy project.

The complete release notice is here.

Please try it out and let us know what you think. We especially welcome success stories, please tell us about how it has helped you!

Cheers, The PyPy Team

News Flash from the beta release cycle:
  • Note that the beta release mistakenly identifies itself in sys.pypy_version_info as releaselevel=='final'; please do not mistake this for a final version
  • The beta can hit an "Illegal instruction" exception in jitted code on ARMv6 processors like the Raspberry Pi. This will be fixed for the release.


Monday, September 1, 2014

Python Software Foundation Matching Donations this Month

We're extremely excited to announce that for the month of September, any amount you donate to PyPy will be matched (up to $10,000) by the Python Software Foundation.

This includes any of our ongoing fundraisers: NumPyPy, STM, Python3, or our
general fundraising.

Here are some of the things your previous donations have helped accomplish:

  • Getting PyPy3 completed (currently 3.2, with 3.3 work underway)
  • New research and production engineering on STM for PyPy
  • Lots of progress on NumPy for PyPy
  • Significant performance improvements

You can see a preview of what's coming in our next 2.4 release in the draft release notes.

Thank you to all the individuals and companies which have donated so far.

So please, donate today: http://pypy.org/

(Please be aware that the donation progress bars are not live updating, so
don't be afraid if your donation doesn't show up immediately).

Saturday, August 9, 2014

A Field Test of Software Transactional Memory Using the RSqueak Smalltalk VM

Extending the Smalltalk RSqueakVM with STM

by Conrad Calmez, Hubert Hesse, Patrick Rein and Malte Swart supervised by Tim Felgentreff and Tobias Pape

Introduction

After pypy-stm, we can announce that a second VM implementation supports software transactional memory: the RSqueakVM (which used to be called SPyVM). RSqueakVM is a Smalltalk implementation based on the RPython toolchain. We have added STM support based on the STM tools from RPython (rstm). The benchmarks indicate that linear scale-up is possible, although in some situations the STM overhead limits the speedup.

The work was done as a master's project at the Software Architecture Group of Professor Robert Hirschfeld at the Hasso Plattner Institut at the University of Potsdam. We - four students - worked about one and a half days per week for four months on the topic. The RSqueakVM was originally developed during a sprint at the University of Bern. When we started the project, we were new to the topic of building VMs / interpreters.

We would like to thank Armin, Remi and the #pypy IRC channel, who supported us over the course of our project. We would also like to thank Toni Mattis and Eric Seckler, who provided us with an initial code base.

Introduction to RSqueakVM

Like the original Smalltalk implementation, the RSqueakVM executes a given Squeak Smalltalk image, which contains the Smalltalk code and a snapshot of previously created objects and active execution contexts. These execution contexts are scheduled inside the image (like greenlets) and are not mapped to OS threads. The non-STM RSqueakVM therefore runs on only one OS thread.

Changes to RSqueakVM

The core adjustments to support STM were inside the VM and are transparent from the view of a Smalltalk user. Additionally, we added Smalltalk code to influence the behavior of the STM. As the RSqueakVM had run in one OS thread so far, we added the capability to start OS threads. Essentially, we added an additional way to launch a new Smalltalk execution context (thread); in contrast to the original one, this one creates a new native OS thread, not a Smalltalk-internal green thread.

STM (with automatic transaction boundaries) already solves the problem of concurrent access to a single value, as this is protected by the STM transactions (to be more precise, one instruction). But there are cases where the application relies on the fact that a bigger group of changes is executed either completely or not at all (atomically). Without further information, transaction borders could fall in the middle of such a set of atomic statements. rstm allows aggregating multiple statements into one higher-level transaction. To let the application mark the beginning and the end of these atomic blocks (high-level transactions), we added two more STM-specific extensions to Smalltalk.

Benchmarks

So far, RSqueak was executed in a single OS thread. rstm enables us to execute the VM using several OS threads, so we expected a speed-up in benchmarks which use multiple threads. We measured this speed-up with two benchmarks: a simple parallel summation, where each thread sums up a predefined interval, and an implementation of Mandelbrot, where each thread computes a range of predefined lines.

To assess the speed-up, we used one RSqueakVM compiled with rstm enabled, running the benchmarks once with OS threads and once with Smalltalk green threads. The workload always remained the same; only the number of threads increased. To assess the overhead imposed by the STM transformation, we also ran the green-threads version on an unmodified RSqueakVM. All VMs were translated with the JIT optimization, and all benchmarks were run once before the measurement to warm up the JIT. As the JIT optimization works, it is likely to be adopted by VM creators (the baseline RSqueakVM already uses it), so results with this optimization are more relevant in practice than those without it. We measured the execution time by getting the system time in Squeak. The results are:

Parallel Sum Ten Million

Benchmark Parallel Sum 10,000,000
Thread Count RSqueak green threads RSqueak/STM green threads RSqueak/STM OS threads Slow down from RSqueak green threads to RSqueak/STM green threads Speed up from RSqueak/STM green threads to RSqueak/STM OS threads
1 168.0 ms 240.0 ms 290.9 ms 0.70 0.83
2 167.0 ms 244.0 ms 246.1 ms 0.68 0.99
4 167.8 ms 240.7 ms 366.7 ms 0.70 0.66
8 168.1 ms 241.1 ms 757.0 ms 0.70 0.32
16 168.5 ms 244.5 ms 1460.0 ms 0.69 0.17

Parallel Sum One Billion

Benchmark Parallel Sum 1,000,000,000

Thread Count RSqueak green threads RSqueak/STM green threads RSqueak/STM OS threads Slow down from RSqueak green threads to RSqueak/STM green threads Speed up from RSqueak/STM green threads to RSqueak/STM OS threads
1 16831.0 ms 24111.0 ms 23346.0 ms 0.70 1.03
2 17059.9 ms 24229.4 ms 16102.1 ms 0.70 1.50
4 16959.9 ms 24365.6 ms 12099.5 ms 0.70 2.01
8 16758.4 ms 24228.1 ms 14076.9 ms 0.69 1.72
16 16748.7 ms 24266.6 ms 55502.9 ms 0.69 0.44

Mandelbrot Iterative

Benchmark Mandelbrot
Thread Count RSqueak green threads RSqueak/STM green threads RSqueak/STM OS threads Slow down from RSqueak green threads to RSqueak/STM green threads Speed up from RSqueak/STM green threads to RSqueak/STM OS Threads
1 724.0 ms 983.0 ms 1565.5 ms 0.74 0.63
2 780.5 ms 973.5 ms 5555.0 ms 0.80 0.18
4 781.0 ms 982.5 ms 20107.5 ms 0.79 0.05
8 779.5 ms 980.0 ms 113067.0 ms 0.80 0.01

Discussion of benchmark results

First of all, the ParallelSum benchmarks show that parallelism actually pays off, at least for sufficiently large, embarrassingly parallel problems. Thus RSqueak can also benefit from rstm.

On the other hand, our Mandelbrot implementation shows the limits of our current rstm integration. We implemented two versions of the algorithm: one using a single low-level array and one using two nested collections. In both versions, each job calculates only a distinct range of rows, and both lead to a slowdown. The summary of the state of rstm transactions shows that there are a lot of inevitable transactions (transactions which cannot abort and must run to completion). One reason might be the interactions between the VM and its low-level extensions, so-called plugins. We have to investigate this further.

Limitations

Although the current VM setup works well enough to support our benchmarks, the VM still has limitations. First of all, as it is based on rstm, it currently only runs on 64-bit Linux.

Besides this, we also have two major limitations regarding the VM itself. First, the atomic interface exposed in Smalltalk currently does not work when the VM is compiled using the just-in-time compiler transformation. Simple examples such as the concurrent parallel sum work fine, while more complex benchmarks such as chameneos fail. The reasons for this are currently beyond our understanding. Second, Smalltalk supports green threads, i.e. threads which are managed by the VM and not mapped to OS threads. We currently support starting new Smalltalk threads as OS threads instead of starting them as green threads. However, existing threads in a Smalltalk image are not migrated to OS threads, but remain running as green threads.

Future work for STM in RSqueak

The work we presented revealed interesting problems; we propose the following problem statements for further analysis:
  • Inevitable transactions in benchmarks. This looks like it could limit other applications too, so it should be solved.
  • Collection implementations aware of STM: The current implementation of collections can cause a lot of STM collisions due to their internal memory structure. We believe there is potential for performance improvements if we replace these collections in an STM-enabled interpreter with implementations that cause fewer STM collisions. As already proposed by Remi Meier, bags, sets and lists are of particular interest.
  • Finally, we exposed STM through language features such as the atomic method, which is provided by the VM. Originally, it was possible to model STM transaction barriers implicitly by using clever locks; now they are exposed via the atomic keyword. From a language design point of view, the question arises whether this is a good solution, and what features an STM-enabled interpreter must provide to the user in general. Of particular interest are, for example, access to the transaction length, and hints for transaction borders and their performance impact.

    Details for the technically inclined

    • Adjustments to the interpreter loop were minimal.
    • STM works at bytecode granularity, that is, there is an implicit transaction border after every executed bytecode. Possible alternatives: break transactions only after certain bytecodes, or break transactions one abstraction layer above, e.g. at object methods (setters, getters).
    • rstm calls were exposed using primitives (a way to expose native code in Smalltalk); this was mainly used for atomic.
    • Starting and stopping OS threads is exposed via primitives as well. Threads are started from within the interpreter.
    • For STM-enabled Smalltalk code we currently have different image versions. However, another way to add, load and replace code in the Smalltalk code base is required to make switching between STM and non-STM code simple.

      Details on the project setup

      From a non-technical perspective, a problem we encountered was the huge round-trip times (on our machines up to 600s, or 900s with JIT enabled). This led to a tendency towards bigger code changes ("Before we compile, let's also add this"), lost flow ("What were we doing before?") and testing different compiled interpreters in parallel ("How is this version different from the others?"). As a consequence, it was harder to test and correct errors. While this is not as much of a problem for other RPython VMs, the RSqueakVM needs to execute the entire image, which makes running it untranslated even slower.

      Summary

      The benchmarks show that a speed-up is possible, but also that in some situations the STM overhead can eat up the speed-up. The resulting STM-enabled VM still has some limitations: as rstm currently only runs on 64-bit Linux, the RSqueakVM does so as well. Even though it is now possible for us to create new threads that map to OS threads within the VM, the migration of existing Smalltalk threads remains problematic.

      We showed that an existing VM code base can benefit from STM in terms of scaling up. Furthermore, it was relatively easy to enable STM support. This may also be valuable to VM developers considering adding STM support to their VMs.

      Saturday, July 5, 2014

      PyPy-STM: first "interesting" release

      Hi all,

      PyPy-STM is now reaching a point where we can say it's good enough to be a GIL-less Python. (We don't guarantee there are no more bugs, so please report them :-) The first official STM release:

      This corresponds roughly to PyPy 2.3 (not 2.3.1). It requires 64-bit Linux. More precisely, this release is built for Ubuntu 12.04 to 14.04; you can also rebuild it from source by getting the branch stmgc-c7. You need clang to compile, and you need a patched version of llvm.

      This version's performance can reasonably be compared with a regular PyPy, where both include the JIT. Thanks for following the meandering progress of PyPy-STM over the past three years --- we're finally getting somewhere really interesting! We cannot thank enough all contributors to the previous PyPy-STM money pot that made this possible. And, although this blog post is focused on the results from that period of time, I have of course to remind you that we're running a second call for donation for future work, which I will briefly mention again later.

      A recap of what we did to get there: around the start of the year we found a new model, a "redo-log"-based STM which uses a couple of hardware tricks to not require chasing pointers, giving it (in this context) exceptionally cheap read barriers. This idea was developed over the following months and (relatively) easily integrated with the JIT compiler. The most recent improvements on the Garbage Collection side are closing the gap with a regular PyPy (there is still a bit more to do there). There is some preliminary user documentation.

      Today, the result of this is a PyPy-STM that is capable of running pure Python code on multiple threads in parallel, as we will show in the benchmarks that follow. A quick warning: this is only about pure Python code. We didn't try so far to optimize the case where most of the time is spent in external libraries, or even manipulating "raw" memory like array.array or numpy arrays. To some extent there is no point because the approach of CPython works well for this case, i.e. releasing the GIL around the long-running operations in C. Of course it would be nice if such cases worked as well in PyPy-STM --- which they do to some extent; but checking and optimizing that is future work.

      As a starting point for our benchmarks, when running code that only uses one thread, we get a slow-down between 1.2 and 3: at worst, three times as slow; at best only 20% slower than a regular PyPy. This worst case has been brought down --it used to be 10x-- by recent work on "card marking", a useful GC technique that is also present in the regular PyPy (and about which I don't find any blog post; maybe we should write one :-) The main remaining issue is fork(), or any function that creates subprocesses: it works, but is very slow. To remind you of this fact, it prints a line to stderr when used.

      Now the real main part: when you run multithreaded code, it scales very nicely with two threads, and less-than-linearly but still not badly with three or four threads. Here is an artificial example:

          total = 0
          lst1 = ["foo"]
          for i in range(100000000):
              lst1.append(i)
              total += lst1.pop()

      We run this code N times, once in each of N threads (full benchmark). Run times, best of three:

      Number of threads   Regular PyPy (head)           PyPy-STM
      N = 1               real 0.92s, user+sys 0.92s    real 1.34s, user+sys 1.34s
      N = 2               real 1.77s, user+sys 1.74s    real 1.39s, user+sys 2.47s
      N = 3               real 2.57s, user+sys 2.56s    real 1.58s, user+sys 4.106s
      N = 4               real 3.38s, user+sys 3.38s    real 1.64s, user+sys 5.35s

      (The "real" time is the wall clock time. The "user+sys" time is the recorded CPU time, which can be larger than the wall clock time if multiple CPUs run in parallel. This was run on a 4x2 cores machine. For direct comparison, avoid loops that are so trivial that the JIT can remove all allocations from them: right now PyPy-STM does not handle this case well. It has to force a dummy allocation in such loops, which makes minor collections occur much more frequently.)
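
      For the curious, a minimal harness for the table above could look like this in ordinary Python (a sketch only; the linked full benchmark is the authoritative version, and the timing method here is an assumption):

          import threading, time

          def work():
              # the artificial example from above
              total = 0
              lst1 = ["foo"]
              for i in range(100000000):
                  lst1.append(i)
                  total += lst1.pop()

          def run(n):
              # run the same workload once in each of n threads
              threads = [threading.Thread(target=work) for _ in range(n)]
              start = time.time()
              for t in threads:
                  t.start()
              for t in threads:
                  t.join()
              return time.time() - start   # wall-clock ("real") time

          for n in (1, 2, 3, 4):
              print(n, min(run(n) for _ in range(3)))   # best of three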

      Four threads is the limit so far: only four threads can be executed in parallel. Similarly, the memory usage is limited to 2.5 GB of GC objects. These two limitations are not hard to increase, but at least increasing the memory limit requires fighting against more LLVM bugs. (Include here snark remarks about LLVM.)

      Here are some measurements from more real-world benchmarks. This time, the amount of work is fixed and we parallelize it on T threads. The first benchmark is just running translate.py on a trunk PyPy. The last three benchmarks are here.

      Benchmark                                 PyPy 2.3 (PyPy head)   PyPy-STM, T=1   T=2             T=3             T=4
      translate.py --no-allworkingmodules
      (annotation step)                         184s (170s)            386s (2.10x)    n/a
      multithread-richards, 5000 iterations     24.2s (16.8s)          52.5s (2.17x)   37.4s (1.55x)   25.9s (1.07x)   32.7s (1.35x)
      mandelbrot, divided in 16-18 bands        22.9s (18.2s)          27.5s (1.20x)   14.4s (0.63x)   10.3s (0.45x)   8.71s (0.38x)
      btree                                     2.26s (2.00s)          2.01s (0.89x)   2.22s (0.98x)   2.14s (0.95x)   2.42s (1.07x)

      This shows various cases that can occur:

      • The mandelbrot example runs with minimal overhead and very good parallelization. It divides the plane into bands, and each of the T threads receives the same number of bands.
      • Richards, a classical benchmark for PyPy (tweaked to run the iterations in multiple threads), is hard to beat on regular PyPy: we suspect that the difference is due to the fact that a lot of paths through the loops don't allocate, triggering the issue already explained above. Moreover, the speed of Richards was again improved dramatically recently, in trunk.
      • The translation benchmark measures the time translate.py takes to run the first phase only, "annotation" (for now it consumes too much memory to run translate.py to the end). Moreover the timing starts only after the large number of subprocesses spawned at the beginning (mostly gcc). This benchmark is not parallel, but we include it for reference here. The slow-down factor of 2.1x is still too much, but we have some idea about the reasons: most likely, again the Garbage Collector, missing the regular PyPy's very fast small-object allocator for old objects. Also, translate.py is an example of application that could, with reasonable efforts, be made largely parallel in the future using atomic blocks.
      • Atomic blocks are also present in the btree benchmark. I'm not completely sure, but it seems that, in this case, the atomic blocks create too many conflicts between the threads for actual parallelization: the base time is very good, but running more threads does not help at all.

      As a summary, PyPy-STM looks already useful to run CPU-bound multithreaded applications. We are certainly still going to fight slow-downs, but it seems that there are cases where 2 threads are enough to outperform a regular PyPy, by a large margin. Please try it out on your own small examples!

      And, at the same time, please don't attempt to retrofit threads into an existing large program just to benefit from PyPy-STM! Our goal is not to send everyone down the obscure route of multithreaded programming and its dark traps. We are finally going to shift our main focus to phase 2 of our research (donations welcome): how to enable a better way of writing multi-core programs. The starting point is to fix and test atomic blocks, as sketched below. Then we will have to debug common causes of conflicts and fix them or work around them; and try to see how common frameworks like Twisted can be adapted.
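
      To give a taste of this direction, here is a sketch of how atomic blocks can be driven from Python via the transaction module described in the preliminary pypy-stm documentation; treat the exact API (transaction.add, transaction.run) as an assumption based on that documentation:

          import transaction   # pypy-stm's preliminary module (API assumed)

          results = []

          def square(i):
              # each scheduled call runs as one atomic transaction;
              # conflicting transactions are transparently retried,
              # so the order of `results` is not deterministic
              results.append(i * i)

          for i in range(100):
              transaction.add(square, i)   # schedule one transaction per call
          transaction.run()                # execute them, potentially in parallel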

      Lots of work ahead, but lots of work behind too :-)

      Armin (thanks Remi as well for the work).

      Friday, June 20, 2014

      PyPy3 2.3.1 - Fulcrum

      We're pleased to announce the first stable release of PyPy3. PyPy3
      targets Python 3 (3.2.5) compatibility.

      We would like to thank all of the people who donated to the py3k proposal
      for supporting the work that went into this.

      You can download the PyPy3 2.3.1 release here:

      http://pypy.org/download.html#pypy3-2-3-1

      Highlights

      • The first stable release of PyPy3: support for Python 3!
      • The stdlib has been updated to Python 3.2.5
      • Additional support for the u'unicode' syntax (PEP 414) from Python 3.3
      • Updates from the default branch, such as incremental GC and various JIT
        improvements
      • Resolved some notable JIT performance regressions from PyPy2:
      • Re-enabled the previously disabled collection (list/dict/set) strategies
      • Resolved performance of iteration over range objects
      • Resolved handling of Python 3's exception __context__ unnecessarily forcing
        frame object overhead

      What is PyPy?

      PyPy is a very compliant Python interpreter, almost a drop-in replacement for
      CPython 2.7.6 or 3.2.5. It's fast due to its integrated tracing JIT compiler.

      This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows,
      and OpenBSD,
      as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.

      While we support 32-bit Python on Windows, work on the native Windows 64-bit Python is still stalling; we would welcome a volunteer
      to handle that.

      How to use PyPy?

      We suggest using PyPy from a virtualenv. Once you have a virtualenv
      installed, you can follow instructions from pypy documentation on how
      to proceed. This document also covers other installation schemes.

      Cheers,
      the PyPy team

      Sunday, June 8, 2014

      PyPy 2.3.1 - Terrestrial Arthropod Trap Revisited

      We're pleased to announce PyPy 2.3.1, a feature-and-bugfix improvement over our recent 2.3 release last month.

      Among the user-facing improvements, this release contains several bugfixes and enhancements:
      • The built-in struct module was renamed to _struct, solving issues with IDLE and other modules
      • Support for compilation with gcc-4.9
      • A CFFI-based version of the gdbm module is now included in our binary bundle
      • Many issues were resolved since the 2.3 release on May 8

      You can download the PyPy 2.3.1 release here:

      http://pypy.org/download.html

      PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 2.3.1 and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

      This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows, and OpenBSD, as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux. 
      We would like to thank our donors for the continued support of the PyPy project.

      The complete release notice is here.

      Please try it out and let us know what you think. We especially welcome success stories, please tell us about how it has helped you!

      Cheers, The PyPy Team

      Friday, May 9, 2014

      PyPy 2.3 - Terrestrial Arthropod Trap

      We’re pleased to announce PyPy 2.3, which targets version 2.7.6 of the Python language. This release updates the stdlib from 2.7.3, jumping directly to 2.7.6.

      This release also contains several bugfixes and performance improvements, many generated by real users finding corner cases. CFFI has made it easier than ever to use existing C code with both cpython and PyPy, easing the transition for packages like cryptography, Pillow (Python Imaging Library [Fork]), a basic port of pygame-cffi, and others.

      PyPy can now be embedded in a hosting application, for instance inside uWSGI.

      You can download the PyPy 2.3 release here:

      http://pypy.org/download.html

      PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 2.3 and cpython 2.7.x performance comparison; note that cpython's speed has not changed since 2.7.2) due to its integrated tracing JIT compiler.

      This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows, and OpenBSD, as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux. 

      We would like to thank our donors for the continued support of the PyPy project.

      The complete release notice is here.

      Cheers, The PyPy Team