Skip to main content

Fixing a Bug in PyPy's Incremental GC

Introduction

Since last summer, I've been looking on and off into a weird and hard to reproduce crash bug in PyPy. It was manifesting only on CI, and it seemed to always happen in the AST rewriting phase of pytest, the symptoms being that PyPy would crash with a segfault. All my attempts to reproduce it locally failed, and my attempts to try to understand the problem by dumping the involved ASTs lead nowhere.

A few weeks ago, we got two more bug reports, the last one by the authors of the nanobind binding generator, with the same symptoms: crash in AST rewriting, only on CI. I decided to make a more serious push to try to find the bug this time. Ultimately the problem turned out to be several bugs in PyPy's garbage collector (GC) that had been there since its inception in 2013. Understanding the situation turned out to be quite involved, additionally complicated by this being the first time that I was working on this particular aspect of PyPy's GC. Since the bug was so much work to find, I thought I'd write a blog post about it.

The blog post consists of three parts: first a chronological description of what I did to find the bug, a technical explanation of what goes wrong, some reflections on the bug (and then a bonus bug I also found in the process).

Finding the Bug

I started from the failing nanobind CI runs that ended with a segfault of the PyPy interpreter. This was only an intermittent problem, not every run was failing. When I tried to just run the test suite locally, I couldn't get it to fail. Therefore at first I tried to learn more about what was happening by looking on the CI runners.

Running on CI

I forked the nanobind repo and hacked the CI script in order to get it to use a PyPy build with full debug information and more assertions turned on. In order to increase the probability of seeing the crash I added an otherwise unused matrix variable to the CI script that just contained 32 parameters. This means every build is done 32 times (sorry Github for wasting your CPUs 😕). With that amount of repetition, I got at least one job of every build that was crashing.

Then I added the -Xfaulthandler option to the PyPy command which will use the faulthandler module try to print a Python stacktrace if the VM segfaults to confirm that PyPy was indeed crashing in the AST rewriting phase of pytest, which pytest uses for nicer assertions. I experimented with hacking our faulthandler implementation to also give me a C-level callstack, but that didn't work as well as I hoped.

Then I tried to run gdb on CI to try to get it to print a C callstack at the crash point. You can get gdb to execute commands as if typed at the prompt with the -ex commandline option, I used something like this:

gdb -ex "set confirm off" -ex "set pagination off" -ex \
    "set debuginfod enabled off" -ex run -ex where -ex quit \
    --args <command> <arguments>

But unfortunately the crash never occurred when running in gdb.

Afterwards I tried the next best thing, which was configuring the CI runner to dump a core file and upload it as a build artifact, which worked. Looking at the cores locally only sort of worked, because I am running a different version of Ubuntu than the CI runners. So I used tmate to be able to log into the CI runner after a crash and interactively used gdb there. Unfortunately what I learned from that was that the bug was some kind of memory corruption, which is always incredibly unpleasant to debug. Basically the header word of a Python object had been corrupted somehow at the point of the crash, which means that it's vtable wasn't usable any more.

(Sidenote: PyPy doesn't really use a vtable pointer, instead it uses half a word in the header for the vtable, and the other half for flags that the GC needs to keep track of the state of the object. Corrupting all this is still bad.)

Reproducing Locally

At that point it was clear that I had to push to reproduce the problem on my laptop, to allow me to work on the problem more directly and not to always have to go via the CI runner. Memory corruption bugs often have a lot of randomness (depending on which part of memory gets modified, things might crash or more likely just happily keep running). Therefore I decided to try to brute-force reproducing the crash by simply running the tests many many times. Since the crash happened in the AST rewriting phase of pytest, and that happens only if no pyc files of the bytecode-compiled rewritten ASTs exist, I made sure to delete them before every test run.

To repeat the test runs I used multitime, which is a simple program that runs a command repeatedly. It's meant for lightweight benchmarking purposes, but it also halts the execution of the command if that command exits with an error (and it sleeps a small random time between runs, which might help with randomizing the situation, maybe). Here's a demo:

(Max pointed out autoclave to me when reviewing this post, which is a more dedicated tool for this job.)

Thankfully, running the tests repeatedly eventually lead to a crash, solving my "only happens on CI" problem. I then tried various variants to exclude possible sources of errors. The first source of errors to exclude in PyPy bugs is the just-in-time compiler, so I reran the tests with --jit off to see whether I could still get it to crash, and thankfully I eventually could (JIT bugs are often very annoying).

Next source of bugs to exclude where C-extensions. Since those were the tests of nanobind, a framework for creating C-extension modules I was a bit worried that the bug might be in our emulation of CPython's C-API. But running PyPy with the -v option (which will print all the imports as they happen) confirmed that at the point of crash no C-extension had been imported yet.

Using rr

I still couldn't get the bug to happen in GDB, so the tool I tried next was rr, the "reverse debugger". rr can record the execution of a program and later replay it arbitrarily often. This gives you a time-traveling debugger that allows you to execute the program backwards in addition to forwards. Eventually I managed to get the crash to happen when running the tests with rr record --chaos (--chaos randomizes some decisions that rr takes, to try to increase the chance of reproducing bugs).

Using rr well is quite hard, and I'm not very good at it. The main approach I use with rr to debug memory corruption is to replay the crash, then set a watchpoint for the corrupted memory location, then use the command reverse-continue to find the place in the code that mutated the memory location. reverse-continue is like continue, except that it will execute the program backwards from the current point. Here's a little demo of this:

Doing this for my bug revealed that the object that was being corrupted was erroneously collected by the garbage collector. For some reason the GC had wrongly decided that the object was no longer reachable and therefore put the object into a freelist by writing a pointer to the next entry in the freelist into the first word of the object, overwriting the object's header. The next time the object was used things crashed.

Side-quest: wrong GC assertions

At this point in the process, I got massively side-tracked. PyPy's GC has a number of debug modes that you can optionally turn on. Those slow down the program execution a lot, but they should in theory help to understand why the GC goes wrong. When I turned them on, I was getting a failing assertion really early in the test execution, complaining about an invariant violation in the GC logic. At first this made me very happy. I thought that this would help me fix the bug more quickly.

Extremely frustratingly, after two days of work I concluded that the assertion logic itself was wrong. I have fixed that in the meantime too, the details of that are in the bonus section at the end of the post.

Using GDB scripting to find the real bug

After that disaster I went back to the earlier rr recording without GC assertions and tried to understand in more detail why the GC decided to free an object that was still being referenced. To be able to do that I used the GDB Python scripting API to write some helper commands to understand the state of the GC heap (rr is an extension of GDB, so the GDB scripting API works in rr too).

The first (small) helper command I wrote with the GDB scripting API was a way to pretty-print the currently active GC flags of a random PyPy object, starting just from the pointer. The more complex command I wrote was an object tracer, which follows pointers to GC objects starting from a root object to explore the object graph. The object tracer isn't complete, it doesn't deal with all the complexities of PyPy's GC. But it was good enough to help me with my problem, I found out that the corrupted object was stored in an array.

As an example, here's a function that uses the GDB API to walk one of the helper data structures of the GC, a stack of pointers:

def walk_addr_stack(obj):
    """ walk an instance of the AddressStack class (which is a linked list of
    arrays of 1019 pointers).

    the first of the arrays is only partially filled with used_in_last_chunk
    items, all the other chunks are full."""
    if obj.type.code == gdb.TYPE_CODE_PTR:
        obj = obj.dereference()
    used_in_last_chunk = lookup(obj, "used_in_last_chunk")
    chunk = lookup(obj, "inst_chunk").dereference()
    while 1:
        items = lookup(chunk, "items")
        for i in range(used_in_last_chunk):
            yield items[i]
        chunk = lookup(chunk, "next")
        if not chunk:
            break
        chunk = chunk.dereference()
        used_in_last_chunk = 1019

The full file of supporting code I wrote can be found in this gist. This is pretty rough throw-away code, however.

In the following recording I show a staged debugging session with some of the extra commands I wrote with the Python API. The details aren't important, I just wanted to give a bit of a flavor of what inspecting objects looks like:

The next step was to understand why the array content wasn't being correctly traced by the GC, which I eventually managed with some conditional breakpoints, more watchpoints, and using reverse-continue. It turned out to be a bug that occurs when the content of one array was memcopied into another array. The technical details of why the array wasn't traced correctly are described in detail in the next section.

Writing a unit test

To try to make sure I really understood the bug correctly I then wrote a GC unit test that shows the problem. Like most of PyPy, our GC is written in RPython, a (somewhat strange) subset/dialect of Python2, which can be compiled to C code. However, since it is also valid Python2 code, it can be unit-tested on top of a Python2 implementation (which is one of the reasons why we keep maintaining PyPy2).

In the GC unit tests you have a lot of control about what order things happen in, e.g. how objects are allocated, when garbage collection phases happen, etc. After some trying I managed to write a test that crashes with the same kind of memory corruption that my original crash exhibited: an object that is still reachable via an array is collected by the GC. To give you a flavor of what this kind of test looks like, here's an (edited for clarity) version of the test I eventually managed to write

def test_incrementality_bug_arraycopy(self):
    source = self.malloc(VAR, 8) # first array
    # the stackroots list emulates the C stack
    self.stackroots.append(source)
    target = self.malloc(VAR, 8) # second array
    self.stackroots.append(target)
    node = self.malloc(S) # unrelated object, will be collected
    node.x = 5
    # store reference into source array, calling the write barrier
    self.writearray(source, 0, node)
    val = self.gc.collect_step()
    source = self.stackroots[0] # reload arrays, they might have moved
    target = self.stackroots[1]
    # this GC step traces target
    val = self.gc.collect_step()

    # emulate what a memcopy of arrays does
    res = self.gc.writebarrier_before_copy(source, target, 0, 0, 2)
    assert res
    target[0] = source[0] # copy two elements of the arrays
    target[1] = source[1]
    # now overwrite the reference to node in source
    self.writearray(source, 0, lltype.nullptr(S))
    # this GC step traces source
    self.gc.collect_step()
    # some more collection steps, crucially target isn't traced again
    # but node is deleted
    for i in range(3):
        self.gc.collect_step()
    # used to crash, node got collected
    assert target[0].x == 5

One of the good properties of testing our GC that way is that all the memory is emulated. The crash in the last line of the test isn't a segfault at all, instead you get a nice exception saying that you tried to access a freed chunk of memory and you can then debug this with a python2 debugger.

Fixing the Bug

With the unit test in hand, fixing the test was relatively straightforward (the diff in its simplest form is anyway only a single line change). After this first version of my fix, I talked to Armin Rigo who helped me find different case that was still wrong, in the same area of the code.

I also got help by the developers at PortaOne who are using PyPy on their servers and had seen some mysterious PyPy crashes recently, that looked related to the GC. They did test deployments of my fixes in their various stages to their servers to try to see whether stability improved for them. Unfortunately in the end it turned out that their crashes are an unrelated GC bug related to object pinning, which we haven't resolved yet.

Writing a GC fuzzer/property based test

Finding bugs in the GC is always extremely disconcerting, particularly since this one manged to hide for so long (more than ten years!). Therefore I wanted to use these bugs as motivation to try to find more problems in PyPy's GC. Given the ridiculous effectiveness of fuzzing, I used hypothesis to write a property-based test. Every test performs a sequence of randomly chosen steps from the following list:

  • allocate an object
  • read a random field from a random object
  • write a random reference into a random object
  • drop a random stack reference
  • perform one GC step
  • allocate an array
  • read a random index from a random array
  • write to an array
  • memcopy between two arrays

This approach of doing a sequence of steps is pretty close to the stateful testing approach of hypothesis, but I just implemented it manually with the data strategy.

Every one of those steps is always performed on both the tested GC, and on some regular Python objects. The Python objects provide the "ground truth" of what the heap should look like, so we can compare the state of the GC objects with the state of the Python objects to find out whether the GC made a mistake.

In order to check whether the test is actually useful, I reverted my bug fixes and made sure that the test re-finds both the spurious GC assertion error and the problems with memcopying an array.

In addition, the test also found corner cases in my fix. There was a situation that I hadn't accounted for, which the test found after eventually. I also plan on adding a bunch of other GC features as steps in the test to stress them too (for example weakrefs, identity hashes, pinning, maybe finalization).

At the point of publishing this post, the fixes got merged to the 2.7/3.9/3.10 branches of PyPy, and will be part of the next release (v7.3.16).

The technical details of the bug

In order to understand the technical details of the bug, I need to give some background explanations about PyPy's GC.

PyPy's incremental GC

PyPy uses an incremental generational mark-sweep GC. It's generational and therefore has minor collections (where only young objects get collected) and major collections (collecting long-lived objects eventually, using a mark-and-sweep algorithm). Young objects are allocated in a nursery using a bump-pointer allocator, which makes allocation quite efficient. They are moved out of the nursery by minor collections. In order to find references from old to young objects the GC uses a write barrier to detect writes into old objects.

The GC is also incremental, which means that its major collections aren't done all at once (which would lead to long pauses). Instead, major collections are sliced up into small steps, which are done directly after a minor collection (the GC isn't concurrent though, which would mean that the GC does work in a separate thread).

The incremental GC uses tri-color marking to reason about the reachable part of the heap during the marking phase, where every old object can be:

  • black: already marked, reachable, definitely survives the collection
  • grey: will survive, but still needs to be marked
  • white: potentially dead

The color of every object is encoded by setting flags in the object header.

The GC maintains the invariant that black objects must never point to white objects. At the start of a major collection cycle the stack roots are turned gray. During the mark phase of a major collection cycle, the GC will trace gray objects, until none are left. To trace a gray object, all the objects it references have to be marked grey if they are white so far. After a grey object is traced, it can be marked black (because all the referenced objects are now either black or gray). Eventually, there are no gray objects left. At that point (because no white object can be reached from a black one) all the white objects are known to be unreachable and can therefore be freed.

The GC is incremental because every collection step will only trace a limited number of gray objects, before giving control back to the program. This leads to a problem: if an already traced (black) object is changed between two marking steps of the GC, the program can mutate that object and write a new reference into one of its fields. This could lead to an invariant violation, if the referenced object is white. Therefore, the GC uses the write barrier (which it needs anyway to find references from old to young objects) to mark all black objects that are modified gray, and then trace them again at one of the later collection steps.

The special write barrier of memcopy

Arrays use a different kind of write barrier than normal objects. Since they can be arbitrarily large, tracing them can take a long time. Therefore it's potentially wasteful to trace them fully at a minor collection. To fix this, the array write barrier keeps more granular information about which parts of the array have been modified since the last collection step. Then only the modified parts of the array need to be traced, not the whole array.

In addition, there is another optimization for arrays, which is that memcopy is treated specially by the GC. If memcopy is implemented by simply writing a loop that copies the content of one array to the other, that will invoke the write barrier every single loop iteration for the write of every array element, costing a lot of overhead. Here's some pseudo-code:

def arraycopy(source, dest, source_start, dest_start, length):
    for i in range(length):
        value = source[source_start + i]
        dest[dest_start + i] = value # <- write barrier inserted here

Therefore the GC has a special memcopy-specific write barrier that will perform the GC logic once before the memcopy loop, and then use a regular (typically SIMD-optimized) memcopy implementation from libc. Roughly like this:

def arraycopy(source, dest, source_start, dest_start, length):
    gc_writebarrier_before_array_copy(source, dest, source_start, dest_start, length)
    raw_memcopy(cast_to_voidp(source) + source_start,
                cast_to_voidp(dest) + dest_start,
                sizeof(itemtype(source)) * length)

(this is really a rough sketch. The real code is much more complicated.)

The bug

The bugs turned out to be precisely in this memcopy write barrier. When we implemented the current GC, we adapted our previous GC, which was a generational mark-sweep GC but not incremental. We started with most of the previous GC's code, including the write barriers. The regular write barriers were adapted to the new incremental assumptions, in particular the need for the write barrier to also turn black objects back to gray when they are modified during a marking phase. This was simply not done at all for the memcopy write barrier, at least in two of the code paths. Fixing this problem fixes the unit tests and stops the crashes.

Reflections

The way the bug was introduced is really typical. A piece of code (the memcopy write barrier) was written under a set of assumptions. Then those assumptions changed later. Not all the code pieces that relied on these assumptions to be correct were updated. It's pretty hard to prevent this in all situations.

I still think we could have done more to prevent the bug occurring. Writing a property-based test for the GC would have been a good idea given the complexity of the GC, and definitely something we did in other parts of our code at the time (just using the random module mostly, we started using hypothesis later).

It's a bit of a mystery to me why this bug managed to be undetected for so long. Memcopy happens in a lot of pretty core operations of e.g. lists in Python (list.extend, to name just one example). To speculate, I would suspect that all the other preconditions for the bug occurring made it pretty rare:

  • the content of an old list that is not yet marked needs to be copied into another old list that is marked already
  • the source of the copy needs to also store an object that has no other references
  • the source of the copy then needs to be overwritten with other data
  • then the next collection steps need to be happening at the right points
  • ...

Given the complexity of the GC logic I also wonder whether some lightweight formal methods would have been a good idea. Formalizing some of the core invariants in B or TLA+ and then model checking them up to some number of objects would have found this problem pretty quickly. There are also correctness proofs for GC algorithms in some research papers, but I don't have a good overview of the literature to point to any that are particularly good or bad. Going such a more formal route might have fixed this and probably a whole bunch of other bugs, but of course it's a pretty expensive (and tedious) approach.

While it was super annoying to track this down, it was definitely good to learn a bit more about how to use rr and the GDB scripting interface.

Bonus Section: The Wrong Assertion

Some more technical information about the wrong assertion is in this section.

Background: pre-built objects

PyPy's VM-building bootstrapping process can "freeze" a bunch of heap objects into the final binary. This allows the VM to start up quickly, because those frozen objects are loaded by the OS as part of the binary.

Those frozen pre-built objects are parts of the 'roots' of the garbage collector and need to be traced. However, tracing all the pre-built objects at every collection would be very expensive, because there are a lot of them (about 150,000 in a PyPy 3.10 binary). Tracing them all is also not necessary, because most of them are never modified. Unmodified pre-built objects can only reference other pre-built objects, which can never be deallocated anyway. Therefore we have an optimization that uses the write barrier (which we need anyway to find old-to-young pointers) to notice when a pre-built object gets modified for the very first time. If that happens, it gets added to the set of pre-built objects that gets counted as a root, and is traced as a root at collections from then on.

The wrong assertion

The assertion that triggered when I turned on the GC debug mode was saying that the GC found a reference from a black to a white object, violating its invariant. Unmodified pre-built objects count as black, and they aren't roots, because they can only ever reference other pre-built objects. However, when a pre-built object gets modified for the first time, it becomes part of the root set and will be marked gray. This logic works fine.

The wrong assertion triggers if a pre-built object is mutated for the very first time in the middle of an incremental marking phase. While the pre-built object gets added to the root set just fine, and will get traced before the marking phase ends, this is encoded slightly differently for pre-built objects, compared to "regular" old objects. Therefore, the invariant checking code wrongly reported a black->white pointer in this situation.

To fix it I also wrote a unit test checking the problem, made sure that the GC hypothesis test also found the bug, and then fixed the wrong assertion to take the color encoding of pre-built objects into account.

The bug managed to be invisible because we don't tend to turn on the GC assertions very often. We only do that when we find a GC bug, which is of course also when we need it the most to be correct.

Acknowledgements

Thanks to Matti Picus, Max Bernstein, Wouter van Heyst for giving me feedback on drafts of the post. Thanks to Armin Rigo for reviewing the code and pointing out holes in my thinking. Thanks to the original reporters of the various forms of the bug, including Lily Foote, David Hewitt, Wenzel Jakob.

PyPy v7.3.15 release

PyPy v7.3.15: release of python 2.7, 3.9, and 3.10

The PyPy team is proud to release version 7.3.15 of PyPy.

This is primarily a bug-fix release, and includes work done to migrate PyPy to Git and Github.

The release includes three different interpreters:

  • PyPy2.7, which is an interpreter supporting the syntax and the features of Python 2.7 including the stdlib for CPython 2.7.18+ (the + is for backported security updates)

  • PyPy3.9, which is an interpreter supporting the syntax and the features of Python 3.9, including the stdlib for CPython 3.9.18.

  • PyPy3.10, which is an interpreter supporting the syntax and the features of Python 3.10, including the stdlib for CPython 3.10.13.

The interpreters are based on much the same codebase, thus the multiple release. This is a micro release, all APIs are compatible with the other 7.3 releases. It follows after 7.3.14 release on Dec 25, 2023

We recommend updating. You can find links to download the v7.3.15 releases here:

https://pypy.org/download.html

We would like to thank our donors for the continued support of the PyPy project. If PyPy is not quite good enough for your needs, we are available for direct consulting work. If PyPy is helping you out, we would love to hear about it and encourage submissions to our blog via a pull request to https://github.com/pypy/pypy.org

We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: bug fixes, PyPy and RPython documentation improvements, or general help with making RPython's JIT even better.

If you are a python library maintainer and use C-extensions, please consider making a HPy / CFFI / cppyy version of your library that would be performant on PyPy. In any case, both cibuildwheel and the multibuild system support building wheels for PyPy.

What is PyPy?

PyPy is a Python interpreter, a drop-in replacement for CPython It's fast (PyPy and CPython 3.7.4 performance comparison) due to its integrated tracing JIT compiler.

We also welcome developers of other dynamic languages to see what RPython can do for them.

We provide binary builds for:

  • x86 machines on most common operating systems (Linux 32/64 bits, Mac OS 64 bits, Windows 64 bits)

  • 64-bit ARM machines running Linux (aarch64).

  • Apple M1 arm64 machines (macos_arm64).

  • s390x running Linux

PyPy support Windows 32-bit, Linux PPC64 big- and little-endian, and Linux ARM 32 bit, but does not release binaries. Please reach out to us if you wish to sponsor binary releases for those platforms. Downstream packagers provide binary builds for debian, Fedora, conda, OpenBSD, FreeBSD, Gentoo, and more.

What else is new?

For more information about the 7.3.15 release, see the full changelog.

Please update, and continue to help us make pypy better.

Cheers, The PyPy Team

PyPy has moved to Git, GitHub

PyPy has moved its canonical repo and issue tracker from https://foss.heptapod.net/pypy/pypy to https://github.com/pypy/pypy. Obviously, this means development will now be tracked in Git rather than Mercurial.

Motivation

We still feel Mercurial is a better version control system. The named branch model and user interface are superior. But

  • foss.heptapod.net is not well indexed in google/bing/duckduckgo search, so people find it harder to search for issues in the project.

  • Since Heptapod has tightened its spam control, we get reports that users create issues only to have them flagged as spam.

  • Open Source has become synonymous with GitHub, and we are too small to change that.

  • Much of the current development comes as a reaction to fixing issues. Tracking interlocking issues is easier if all the code is on the same platform.

  • The FAQ presents two arguments against the move. Github notes solves much of point (1): the difficulty of discovering provenance of commits, although not entirely. But the main problem is point (2), it turns out that not moving to GitHub is an impediment to contribution and issue reporting.

  • People who wish to continue to use Mercurial can use the same method below to push to GitHub.

  • GitHub is more resource rich than foss.heptapod.net. We could add CI jobs to replace some of our aging buildbot infrastructure.

Method

The migration required two parts: migrating the code and then migrating the issues and merge requests.

Code migration 1: code and notes

I used a fork of git-remote-hg to create a local Git repo with all the changesets. Then I wanted to add a Git note to each commit with the branch it came from. So I prepared a file with two columns: the Git commit hash, and the corresponding branch from Mercurial. Mercurial can describe each commit in two ways: either the commit hash or by a number index. I used hg log to convert an index i to the Mercurial hash, and then git-hg-helper from git-remote-hg to convert the Mercurial hash to a Git hash:

$(cd pypy-git; git-hg-helper git-rev $(cd ../pypy-hg; hg log -r $i -T"{node}\n"))

Then I used hg log again to print the Mercurial branch for the index i:

$(cd pypy-hg; hg log -r $i -T'{branch}\n')

Putting these two together, I could loop over all the commits by their numerical index to prepare the file. Then I iterated over each line in the file, and added the Git note. Since the git note add command works on the current HEAD, I needed to checkout each commit in turn and then add the note:

git checkout -q <hash> && git notes --ref refs/notes/branch add -m branch:<branch>

I could then use git push --all to push to GitHub.

Code migration 2: prepare the branches

PyPy has almost 500 open branches. The code migration created all the branch HEADs, but git push --all did not push them. I needed to check them out and push each one. So I created a file with all the branch names

cd pypy-hg; hg branches | cut -f1 -d" " > branches.txt

and then push each one to the GitHub repo

while read branch; do git checkout branches/$branch && git push origin branches/$branch; done < branches.txt

Note that the branches were named branches/XXX by the migration, not branch/XXX. This confuses the merge request migration, more about that later.

Issue and merge request migration

I used the solution from node-gitlab-2-github which worked almost perfectly. It is important to do the conversion on a private repo otherwise every mention of a sucessfully mapped user name notifies the user about the transfer. This can be quite annoying for a repo the size of PyPy with 600 merge requests and over 4000 issues. Issues transfered without a problem: the script properly retained the issue numbers. However the script does not convert the Mercurial hashes to Git hashes, so the bare hashes in comments show up without a link to the commit. Merge requests are more of a problem:

  • The Mercurial named branch "disappears" once it is merged, so a merge request to a merged branch does not find the target branch name in Git. The conversion creates an issue instead with the label gitlab merge request.
  • For some reason, the branches created by git-remote-hg are called branches/XXX and not branch/XXX as expected by GitLab. This messes up the merge request/PR conversion. For some of the branches (open PRs and main target branches) I manually created additional branches without the es. The net result is that open merge requests became open PRs, merged merge requests became issues, and closed-not-merged merge requests were not migrated.

Layered conversions

PyPy already migrated once from Bitbucket to Heptapod. Many of the issues reflect the multiple transitions: they have lines like "Created originally on Bitbucket by XXX" from the first transition, and an additional line "In Heptapod" from this transition.

Credits

We would like to express our gratitude to the Octobus team who support Heptapod. The transition from Bitbucket was quite an effort, and they have generously hosted our developement since then. We wish them all the best, and still believe that Mercurial should have "won".

Next steps

While the repo at GitHub is live, there are still a few more things we need to do:

  • Documentation needs an update for the new repo and the build automation from readthedocs must be adjusted.
  • The wiki should be copied from Heptapod.
  • buildbot.pypy.org should also look at the new repo. I hope the code is up to the task of interacting with a Git repo.
  • speed.pypy.org tracks changes, it too needs to reference the new location
  • To keep tracking branches with Git notes on new commits, I activated a github action by Julian to add a Git branch note to each commit. Please see the README there for directions on using Git notes.
  • Some of the merge requests were not migrated. If someone wants to, they could migrate those once they figure out the branch naming problems.

Additionally, now is the time for all of you to prove the move is worthwhile:

  • Star the repo, let others know how to find it,
  • Help fix some of the open issues or file new ones,
  • Take advantage of the more familiar workflow to get involved in the project,
  • Suggest ways to improve the migration: are there things I missed or could have done better?

How will development change?

Heptapod did not allow personal forks, so we were generous with a commit bit to the main repo. Additionally, we (well, me) have been using a commit-directly-to-main workflow. We will now be adopting a more structured workflow. Please fork the repo and submit a pull request for any changes. We can now add some pre-merge CI to check that the PR at least passes the first stage of translation. The live and active branches will be:

  • main: what was "default" in Mercurial, it is the Python2.7 interpreter and the base of the RPython interpreter,
  • py3.9: the Python3.9 interpreter, which also includes all RPython changes from main. This is exactly like on Mercurial, and
  • py3.10: the Python3.10 interpreter, which also includes all RPython changes from main and all bugfixes from py3.9. This is exactly like on Mercurial.

Working between the repos

Finding commits

If you want to figure out how a Mercurial commit relates to a Git commit, you can use git-hg-helper. You run it in the Git repo. It takes the full long hash from one repo and gives you the corresponding hash of the other repo:

$ git-hg-helper git-rev d64027c4c2b903403ceeef2c301f5132454491df
4527e62ad94b0e940a5b0f9f20d29428672f93f7
$ git-hg-helper hg-rev 4527e62ad94b0e940a5b0f9f20d29428672f93f7
d64027c4c2b903403ceeef2c301f5132454491df
Finding branches

Branches migrated from Mercurial will have a branches prefix, not branch. While GitLab uses branch for its prefix, the git-remote-hg script uses branches. New work should be in a PR targeting main, py3.9 or py3.10.

Thanks for helping to make PyPy better.

Matti

PyPy v7.3.14 release

PyPy v7.3.14: release of python 2.7, 3.9, and 3.10

The PyPy team is proud to release version 7.3.14 of PyPy.

Hightlights of this release are compatibility with HPy-0.9, cffi 1.16, additional C-API interfaces, and more python3.10 fixes.

The release includes three different interpreters:

  • PyPy2.7, which is an interpreter supporting the syntax and the features of Python 2.7 including the stdlib for CPython 2.7.18+ (the + is for backported security updates)

  • PyPy3.9, which is an interpreter supporting the syntax and the features of Python 3.9, including the stdlib for CPython 3.9.18.

  • PyPy3.10, which is an interpreter supporting the syntax and the features of Python 3.10, including the stdlib for CPython 3.10.13.

The interpreters are based on much the same codebase, thus the multiple release. This is a micro release, all APIs are compatible with the other 7.3 releases. It follows after 7.3.13 release on Sept 29, 2023.

We recommend updating. You can find links to download the v7.3.14 releases here:

https://pypy.org/download.html

We would like to thank our donors for the continued support of the PyPy project. If PyPy is not quite good enough for your needs, we are available for direct consulting work. If PyPy is helping you out, we would love to hear about it and encourage submissions to our blog via a pull request to https://github.com/pypy/pypy.org

We would also like to thank our contributors and encourage new people to join the project. Since the last release we have contributions from three new contributors. PyPy has many layers and we need help with all of them: bug fixes, PyPy and RPython documentation improvements, or general help with making RPython's JIT even better.

If you are a python library maintainer and use C-extensions, please consider making a HPy / CFFI / cppyy version of your library that would be performant on PyPy. In any case, both cibuildwheel and the multibuild system support building wheels for PyPy.

What is PyPy?

PyPy is a Python interpreter, a drop-in replacement for CPython It's fast (PyPy and CPython 3.7.4 performance comparison) due to its integrated tracing JIT compiler.

We also welcome developers of other dynamic languages to see what RPython can do for them.

We provide binary builds for:

  • x86 machines on most common operating systems (Linux 32/64 bits, Mac OS 64 bits, Windows 64 bits)

  • 64-bit ARM machines running Linux (aarch64).

  • Apple M1 arm64 machines (macos_arm64).

  • s390x running Linux

PyPy support Windows 32-bit, Linux PPC64 big- and little-endian, and Linux ARM 32 bit, but does not release binaries. Please reach out to us if you wish to sponsor binary releases for those platforms. Downstream packagers provide binary builds for debian, Fedora, conda, OpenBSD, FreeBSD, Gentoo, and more.

What else is new?

For more information about the 7.3.14 release, see the full changelog.

Please update, and continue to help us make pypy better.

Cheers, The PyPy Team

PyPy v7.3.13 release

PyPy v7.3.13: release of python 2.7, 3.9, and 3.10

The PyPy team is proud to release version 7.3.13 of PyPy. This is primarily a security/bug-fix release. CPython released security patches, and this release also improves the ability to use type specifications via PyType_FromSpec and friends. There are also some small speed-ups.

The release includes three different interpreters:

  • PyPy2.7, which is an interpreter supporting the syntax and the features of Python 2.7 including the stdlib for CPython 2.7.18+ (the + is for backported security updates)

  • PyPy3.9, which is an interpreter supporting the syntax and the features of Python 3.9, including the stdlib for CPython 3.9.18.

  • PyPy3.10, which is an interpreter supporting the syntax and the features of Python 3.10, including the stdlib for CPython 3.10.13. Note it requires at least cython 0.29.35 or cython 3.0.0b3.

The interpreters are based on much the same codebase, thus the multiple release. This is a micro release, all APIs are compatible with the other 7.3 releases. It follows after 7.3.12 release on June 16, 2023.

We recommend updating. You can find links to download the v7.3.13 releases here:

https://pypy.org/download.html

We would like to thank our donors for the continued support of the PyPy project. If PyPy is not quite good enough for your needs, we are available for direct consulting work. If PyPy is helping you out, we would love to hear about it and encourage submissions to our blog via a pull request to https://github.com/pypy/pypy.org

We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: bug fixes, PyPy and RPython documentation improvements, or general help with making RPython's JIT even better.

If you are a python library maintainer and use C-extensions, please consider making a HPy / CFFI / cppyy version of your library that would be performant on PyPy. In any case, both cibuildwheel and the multibuild system support building wheels for PyPy.

What is PyPy?

PyPy is a Python interpreter, a drop-in replacement for CPython It's fast (PyPy and CPython 3.7.4 performance comparison) due to its integrated tracing JIT compiler.

We also welcome developers of other dynamic languages to see what RPython can do for them.

We provide binary builds for:

  • x86 machines on most common operating systems (Linux 32/64 bits, Mac OS 64 bits, Windows 64 bits)

  • 64-bit ARM machines running Linux (aarch64).

  • Apple M1 arm64 machines (macos_arm64).

  • s390x running Linux

PyPy support Windows 32-bit, Linux PPC64 big- and little-endian, and Linux ARM 32 bit, but does not release binaries. Please reach out to us if you wish to sponsor binary releases for those platforms. Downstream packagers provide binary builds for debian, Fedora, conda, OpenBSD, FreeBSD, Gentoo, and more.

What else is new?

For more information about the 7.3.13 release, see the full changelog.

Please update, and continue to help us make pypy better.

Cheers, The PyPy Team

PyPy v7.3.12 release

PyPy v7.3.12: release of python 2.7, 3.9, and 3.10.

The PyPy team is proud to release version 7.3.12 of PyPy. This release includes a new string-to-int algorithm (also appearing in CPython 3.12) that is faster than the older one; support for symlinks in Windows; and our first Python3.10 version.

The release includes three different interpreters:

  • PyPy2.7, which is an interpreter supporting the syntax and the features of Python 2.7 including the stdlib for CPython 2.7.18+ (the + is for backported security updates)

  • PyPy3.9, which is an interpreter supporting the syntax and the features of Python 3.9, including the stdlib for CPython 3.9.17.

  • PyPy3.10, which is an interpreter supporting the syntax and the features of Python 3.10, including the stdlib for CPython 3.10.12. This is our first release of 3.10, but based on past experience we are quite confident in its compatibility with upstream. Of course, we recommend testing your code with this new version before putting it into production. Note it does require at least cython 0.29.35 or cython 3.0.0b3

The interpreters are based on much the same codebase, thus the multiple release. This is a micro release, all APIs are compatible with the other 7.3 releases. It follows after 7.3.11 release on Dec 29, 2022

We recommend updating. You can find links to download the v7.3.12 releases here:

https://pypy.org/download.html

We would like to thank our donors for the continued support of the PyPy project. If PyPy is not quite good enough for your needs, we are available for direct consulting work. If PyPy is helping you out, we would love to hear about it and encourage submissions to our blog via a pull request to https://github.com/pypy/pypy.org

We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: bug fixes, PyPy and RPython documentation improvements, or general help with making RPython's JIT even better. Since the previous release, we have accepted contributions from one new contributor, thanks for pitching in, and welcome to the project!

If you are a python library maintainer and use C-extensions, please consider making a HPy / CFFI / cppyy version of your library that would be performant on PyPy. In any case, both cibuildwheel and the multibuild system support building wheels for PyPy.

What is PyPy?

PyPy is a Python interpreter, a drop-in replacement for CPython 2.7, 3.9 and 3.10. It's fast (PyPy and CPython 3.7.4 performance comparison) due to its integrated tracing JIT compiler.

We also welcome developers of other dynamic languages to see what RPython can do for them.

We provide binary builds for:

  • x86 machines on most common operating systems (Linux 32/64 bits, Mac OS 64 bits, Windows 64 bits)

  • 64-bit ARM machines running Linux (aarch64).

  • Apple M1 arm64 machines (macos_arm64).

  • s390x running Linux

PyPy support Windows 32-bit, Linux PPC64 big- and little-endian, and Linux ARM 32 bit, but does not release binaries. Please reach out to us if you wish to sponsor binary releases for those platforms. Downstream packagers provide binary builds for debian, Fedora, conda, OpenBSD, FreeBSD, Gentoo, and more.

What else is new?

For more information about the 7.3.12 release, see the full changelog.

Please update, and continue to help us make pypy better.

Cheers, The PyPy Team

RPython-based emulator speeds up RISC-V simulation over 15x

In cooperation with RISC-V International, who funded a part of this project, we recently created a workflow to use RPython to take a Sail RISC-V model and automatically create a RISC-V ISA emulator from it, which we call Pydrofoil. The simulator sped up booting a linux emulator from 35 minutes (using the standard Sail-generated emulator in C) to 2 minutes, a speedup of 17.5x. More details about the process are in the RISC-V blog post.

A few take-aways from the project:

  • While PyPy has shown it can speed up generic python code about 4x, the technology behind PyPy can really shine in other areas.

  • RPython is malleable and can be molded to many tasks, the RPython meta-JIT is very flexible.

  • A JIT is well-suited for the problem of emulation, because it can perform dynamic binary translation.

PyPy can solve real world performance problems, even somewhat unusual ones. Please get in touch and let us know how we can help you solve yours!

Repeated string concatenation is quadratic in PyPy (and CPython)

This is a super brief blog post responding to an issue that we got on the PyPy issue tracker. I am moving my response to the blog (with permission of the submitter) to have a post to point to, since it's a problem that comes up with some regularity. It's also documented on our page of differences between PyPy and CPython but I thought an additional blog post might be good.

The issue pointed out that a small program that operates on strings is much slower on PyPy compared to CPython. The program is a solution for 2016's Advent of Code Day 16 and looks like this:

def dragon(a):
    b = a[::-1].replace('0','r').replace('1','0').replace('r','1')
    return a+'0'+b

def diffstr(a):
    b = ""
    for i in range(0,len(a),2):
        b += ['0','1'][a[i] == a[i+1]]
    return b

def iterdiff(a):
    b = a
    while(len(b) % 2 == 0):
        b = diffstr(b)
    return b

size = 35651584
initstate = '10010000000110000'
while(len(initstate) < size):
    initstate = dragon(initstate)
initstate = initstate[:size]
print(iterdiff(initstate))

The submitter pointed out, that the program is fast on CPython (~8s on my laptop) and slow (didn't finish) on PyPy.

The reason for the performance difference is that += on strings in a loop has quadratic complexity in PyPy, which is what diffstr does. To see the quadraticness, consider that to add a character at the end of the string, the beginning of the string needs to be copied into a new chunk of memory. If the loop runs n times, that means there are

1 + 2 + 3 + ... + n = n * (n + 1) // 2

character copies.

Repeated string concatenations are in principle also quadratic in CPython, but CPython has an optimization that makes them sometimes not quadratic, which is what makes this program not too slow in CPython.

In order to fix the problem on PyPy it's best to use a list for the string parts, which has the right amortized O(1) complexity for .append calls, and then use str.join after the loop:

def diffstr(a):
    b = []
    for i in range(0,len(a),2):
        b.append(['0','1'][a[i] == a[i+1]])
    return "".join(b)

With this change the program becomes a little bit faster on CPython for me, and on PyPy it stops being quadratic and runs in ~3.5s.

In general, it's best not to rely on the presence of this optimization in CPython either. Sometimes, a small innocent looking changes will break CPython's optimization. E.g. this useless change makes CPython also take ages:

def diffstr(a):
    b = ""
    for i in range(0,len(a),2):
        b += ['0','1'][a[i] == a[i+1]]
        c = b
    return b

The reason why this change breaks the optimization in CPython is that it only triggers if the reference count of b is 1, in which case it uses realloc on the string. The change is unrealistic of course, but you could imagine a related that keeps an extra reference to b for a sensible reason.

Another situation in which the optimization doesn't work is discussed in this StackOverflow question with an answer by Tim Peters.

It's unlikely that PyPy will fix this. We had a prototype how to do it, but it seems very little "production" code uses += on strings in a loop, and the fix makes the strings implementation quite a bit more complex.

So, in summary, don't use repeated concatenations in a loop!

PyPy v7.3.11 release

PyPy v7.3.11: release of python 2.7, 3.8, and 3.9

The PyPy team is proud to release version 7.3.11 of PyPy. As could be expected, the first release of macOS arm64 impacted the macOS x86-64 build, so this is a bug release to restore the ability of macOS users to run PyPy on macOS < 11.0. It also incoporates the latest CPython stdlib updates released the day after 7.3.10 went out, and a few more bug fixes. The release includes three different interpreters:

  • PyPy2.7, which is an interpreter supporting the syntax and the features of Python 2.7 including the stdlib for CPython 2.7.18+ (the + is for backported security updates)

  • PyPy3.8, which is an interpreter supporting the syntax and the features of Python 3.8, including the stdlib for CPython 3.8.16. Note we intend to drop support for this version in an upcoming release as soon as we release Pyython 3.10.

  • PyPy3.9, which is an interpreter supporting the syntax and the features of Python 3.9, including the stdlib for CPython 3.9.16.

The interpreters are based on much the same codebase, thus the multiple release. This is a micro release, all APIs are compatible with the other 7.3 releases and follows quickly on the heals of the 7.3.10 release on Dec 6.

We recommend updating. You can find links to download the v7.3.11 releases here:

https://pypy.org/download.html

We would like to thank our donors for the continued support of the PyPy project. If PyPy is not quite good enough for your needs, we are available for direct consulting work. If PyPy is helping you out, we would love to hear about it and encourage submissions to our blog via a pull request to https://github.com/pypy/pypy.org

We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: bug fixes, PyPy and RPython documentation improvements, or general help with making RPython's JIT even better. Since the previous release, we have accepted contributions from one new contributor, thanks for pitching in, and welcome to the project!

If you are a python library maintainer and use C-extensions, please consider making a HPy / CFFI / cppyy version of your library that would be performant on PyPy. In any case, both cibuildwheel and the multibuild system support building wheels for PyPy.

What is PyPy?

PyPy is a Python interpreter, a drop-in replacement for CPython 2.7, 3.8 and 3.9. It's fast (PyPy and CPython 3.7.4 performance comparison) due to its integrated tracing JIT compiler.

We also welcome developers of other dynamic languages to see what RPython can do for them.

We provide binary builds for:

  • x86 machines on most common operating systems (Linux 32/64 bits, Mac OS 64 bits, Windows 64 bits)

  • 64-bit ARM machines running Linux (aarch64).

  • Apple M1 arm64 machines (macos_arm64).

  • s390x running Linux

PyPy support Windows 32-bit, Linux PPC64 big- and little-endian, and Linux ARM 32 bit, but does not release binaries. Please reach out to us if you wish to sponsor binary releases for those platforms. Downstream packagers provide binary builds for debian, Fedora, conda, OpenBSD, FreeBSD, Gentoo, and more.

What else is new?

For more information about the 7.3.11 release, see the full changelog.

Please update, and continue to help us make pypy better.

Cheers, The PyPy Team

Finding JIT Optimizer Bugs using SMT Solvers and Fuzzing

In this blog post I want to describe a recent bug finding technique that I've added to the PyPy JIT testing infrastructure. This technique uses the Z3 theorem prover to find bugs in the optimizer of PyPy's JIT, in particular its integer operation optimizations. The approach is based on things I have learned from John Regehr's blog (this post is a good first one to read), Twitter, and on his (et al) paper Alive2: Bounded Translation Validation for LLVM. The work was triggered by a recent miscompilation bug my current bachelor student Nico Rittinghaus found.

Background: Python Integers in the PyPy JIT

The optimizer of PyPy's JITs operates on traces, which are linear sequences of instructions with guards. The instructions in the traces operate on different machine-level data types, machine integers, doubles, pointers, bools, etc. In this post we'll be mostly concerned with machine integers.

To given some wider context I'll explain a bit how Python ints in the user code relate to the types that are used in traces when the PyPy Python implementation is used. When PyPy turns a regular Python 3 function into a trace, there is a lot of work happening in the JIT frontend to try to observe and infer the types that the Python function concretely uses at runtime. The traces are generated under these typing assumptions. Therefore, code that uses ints in the Python code can typically be translated into traces that operate on machine integers. In order to make sure that the Python integer semantics are upheld, many of the operations in the traces need to check that the integer results of some operations still fit into a machine integer. If that is not the case (a rare situation for most programs), the trace is left via a guard, execution falls back to the interpreter, and there a big integer representation is chosen for the too big value (the big integer representation is done via a pointer and some storage on the heap).

All of this machinery is not going to be too relevant for the rest of the post. For the post it's important to know that trace instructions operate on machine integers and other low-level types, and some of the operations can optionally check whether the results still fit into a machine integer. These trace operations are improved by the optimizer, which tries to transform the trace into one that behaves the same, but is less costly to execute.

Background: Bounds Analysis in PyPy's JIT

The optimizer of PyPy's JIT has an analysis based on abstract interpretation that tries to find out whether the integer values stored in a variable are actually not using the full 64 bit (or 32 bit) range, but instead fit into some smaller range. This means that for every integer variable x in a trace, the JIT compiler tracks upper and lower bounds of the runtime value of that variable: a range [a, b] such that for every concrete runtime value v that gets stored in variable x, a <= v <= b must be true. a and b start out as the most general MININT and MAXINT, but sometimes there is extra information that makes it possible to improve these known bounds, and that is often useful to optimize the code.

A typical example is that the JIT knows that the length of a string is non-negative, so for this kind of code: x = len(s) where s is a string, x gets a range [0, MAXINT] assigned. With this information we could for example remove a check x + 10 < 0 completely, because it can never be true.

The bounds information is useful for optimization, but the analysis of the bounds is also a source of bugs in the JIT, because the reasoning is often subtle and easy to get wrong in corner cases. We already use a number of testing techniques to try to make sure that it is correct. A simple one is property-based testing using Hypothesis on the operations on bounds. Even though Hypothesis is fantastic, it unfortunately does not catch absolutely all the bugs even if we'd like it too, as we'll see in the next section.

Motivation: A JIT Miscompilation

I am currently supervising a Bachelor thesis by Nico Rittinghaus, who is extending the integer analysis in the JIT. He'll probably write a separate blog post about that soon. In the process of his work, the current bounds analysis code got a lot of scrutiny, and we found out that one of the unit tests of the bounds analysis was actually incorrect, and the example code in that unit test was optimized incorrectly. This case of incorrect optimization is not a big deal for regular Python code, because it involved a "wrapping integer addition operation", i.e. one where overflowing results just wrap around to negative values. All the additions and other arithmetic operations that the PyPy Python frontend generates actually have overflow checks (to be able to switch to a big integer representation if needed). However, it's still possible to trigger the problem with the __pypy__.intop.int_add API which is a function that exposes wraparound arithmetic on Python ints.

Here's the miscompilation. The JIT optimizes the following function:

import __pypy__

def wrong(x):
    a = __pypy__.intop.int_add(x, 10)
    if a < 15:
        if x < 6:
            return 0
        return 1
    return 2

Into the following code:

import __pypy__

def wrong(x):
    a = __pypy__.intop.int_add(x, 10)
    if a < 15:
        return 0
    return 2

Basically the faulty reasoning of the JIT looks like this: if int_add(x, 10) < 15 then it must follow that x < 5, which is stronger than x < 6, so the second if is always true. This sounds good, but is actually wrong if the addition + 10 wrapped around. So if x == MAXINT, then int_add(x, 10) == MININT + 9 < 15. But MAXINT < 5 is not correct.

Note how the same reasoning with overflow-checking addition is correct! If x + 10 < 15 and the + didn't overflow, then indeed x < 6. And if your mind bends starting to think about all this, you understand some of the difficulty of getting the JIT correct in this area.

How could we have avoided this bug?

One exercise I try to do after finding bugs is to reflect on ways that the bug could have been avoided. I think this is particularly important in the JIT, where bugs are potentially really annoying to find and can cause very strange behaviour in basically arbitrary Python code.

It's easy to always answer this question with "try to think more carefully when working", but that approach cannot be relied on in complicated situations, because humans don't concentrate perfectly for long stretches of time.

A situation-specific problem I identified was the bad design of the range analysis API. A range is not just represented by two numbers, instead it's two numbers and two bools that are supposed to represent that some operation did or did not underflow/overflow. The meaning of these bools was quite hard to grasp and easy to get wrong, so probably they should never have been introduced in the first place (and my bugfix indeed removed them).

But in the rest of this blog post I want to talk about another, systematic approach that can be applied to the problem of mis-optimizations of integer operations, and that is done by applying an SMT solver to the problem.

An SMT solver (Satisfyability Modulo Theories) is a tool that can be used to find out whether mathematical formulas are "satisfiable", i.e. whether some chosen set of inputs exists that will make the formulas evaluate to true. SMT solvers are commonly used in a wide range of CS applications including program correctness proofs, program synthesis, etc. The most widely known one is probably Z3 by Microsoft Research which has the nice advantage of coming with an easy-to-use Python binding.

Going into this I basically knew next to nothing about SMT solvers (despite having been embedded in a formal methods research group for years!) so it was an interesting new world to learn about.

As briefly mentioned in the introduction, the approach I took followed a similar (but much more properly executed) one applied to LLVM operations, called Alive2. Krister Waldfridsson has done similar work for GCC recently, described on his blog.

Z3 Proof of Concept

The first thing I did was to try to get Z3 find the above bug, by encoding the input program into an SMT formula by hand and trying to get Z3 to prove the condition that the JIT thinks is always true. The Z3 code for this looks as follows:

from z3 import BitVec, Implies, prove
x = BitVec('x', 64)
a = x + 10
cond1 = a < 15
cond2 = x < 6
prove(Implies(cond1, cond2))

Here, x is defined to be a bit vector variable of width 64, which is a datatype that can be used to represent bounded machine integers. Addition on bit vectors performs wraparound arithmetic, like the __pypy__.intop.int_add call in the original code. The JIT optimized the second condition away, so essentially it was convinced that the first condition implies the second one. The above snippet tries to get Z3 to confirm this.

When run, the above program prints:

counterexample
[x = 9223372036854775803]

Which shows the bug. As a small side-note, I thought it was cool that the process of "proving" something in Z3 basically means trying to find an example for the negation of the formula. If no counterexample can be found for the negation, the original formula is true. If the original formula turns out to be false (like here) we get a nice example that shows the problem to go with it.

It's not realistic to hand-translate all the hundreds of unit-tests into Z3 formulas and then ask Z3 to prove the optimizations. Instead, we want to have a program that does this for us.

SMT Checking of the JIT Optimizer

What we want from this program is the following: given an unoptimized trace and its optimized version, we want to use Z3 to check whether the optimized trace behaves identically to the unoptimized one. One question is what "behaves identically" means. What we care about is the outputs of the trace being the same values, no matter how they are computed. Also, for every guard we want to make sure that it fails in identical ways in the optimized and unoptimized versions. A guard is only allowed to be optimized away if it can never fail. The code that comes after a guard can assume that the guard has not failed, because otherwise execution would have left the trace. All of this should be true regardless for the values of the input variables of the trace.

So in order to check that the two traces are behaving identically, we do the following:

  • We create Z3 variables for every input variable. We use the same input variables both for the unoptimized as well as the optimized trace.

  • We align the two traces at the corresponding guards. Thankfully the optimizer keeps track of which optimized guard corresponds to which unoptimized input guard.

  • All the operations before a guard are translated into Z3 formulas, for both versions of the trace.

  • For two corresponding guards, we ask Z3 to prove that the guard conditions are identical.

  • For a guard that was optimized away we ask Z3 to prove that the condition is always true.

  • After a guard, we tell Z3 that from now on it can assume that the guard condition is true.

  • We repeat this, guard for guard, until we reach the end of the trace. There, we ask Z3 to prove that the output variables in the unoptimized trace and the optimized trace are identical (every trace can return one or many values).

I implemented this, it's not a lot of code, basically a couple of hundred lines of (somewhat hacky) Python code. So far I only support integer operations. Here are some parts of the code to give you a flavor of what this looks like.

This is the code that translates operations into Z3 formulas:

def add_to_solver(self, ops, state):
    for op in ops:
        if op.type != 'v': # is it an operation with a result
            res = self.newvar(op)
        else: # or does it return void
            res = None

       # ...

        # convert arguments
        if op.numargs() == 1:
            arg0 = self.convertarg(op, 0)
        elif op.numargs() == 2:
            arg0 = self.convertarg(op, 0)
            arg1 = self.convertarg(op, 1)

        # compute results
        if opname == "int_add":
            expr = arg0 + arg1
        elif opname == "int_sub":
            expr = arg0 - arg1
        elif opname == "int_mul":
            expr = arg0 * arg1
        elif opname == "int_and":
            expr = arg0 & arg1
        elif opname == "int_or":
            expr = arg0 | arg1
        elif opname == "int_xor":
            expr = arg0 ^ arg1

        # ...  more operations, some shown below

        self.solver.add(res == expr)

New Z3 variables are defined by the helper function newvar, which adds the operation to a dictionary box_to_z3 mapping boxes (=variables) to Z3 variables. Due to the SSA property that traces have, a variable must be defined before its first use.

Here's what newvar looks like (LONG_BIT is a constant that is either 64 or 32, depending on the target architecture):

def newvar(self, box, repr=None):
    # ... some logic around making the string representation
    # somewhat nicer omitted
    result = z3.BitVec(repr, LONG_BIT)
    self.box_to_z3[box] = result
    return result

The convert method turns an operation argument (either a constant or a variable) into a Z3 formula (either a constant bit vector or an already defined Z3 variable). convertarg is a helper function that takes an operation, reads its nth argument and converts it.

def convert(self, box):
    if isinstance(box, ConstInt):
        return z3.BitVecVal(box.getint(), LONG_BIT)
    return self.box_to_z3[box]

def convertarg(self, box, arg):
    return self.convert(box.getarg(arg))

The lookup of variables in box_to_z3 that convert does cannot fail, because the variable must have been defined before use.

Comparisons return the bit vector 0 or bit vector 1, we use a helper function cond to turn the Z3 truth value of the comparison into a bit vector:

def cond(self, z3expr):
    return z3.If(z3expr, TRUEBV, FALSEBV)


def add_to_solver(self, ops, state):
        # ... start as above

        # more cases
        elif opname == "int_eq":
            expr = self.cond(arg0 == arg1)
        elif opname == "int_ne":
            expr = self.cond(arg0 != arg1)
        elif opname == "int_lt":
            expr = self.cond(arg0 < arg1)
        elif opname == "int_le":
            expr = self.cond(arg0 <= arg1)
        elif opname == "int_gt":
            expr = self.cond(arg0 > arg1)
        elif opname == "int_ge":
            expr = self.cond(arg0 >= arg1)
        elif opname == "int_is_true":
            expr = self.cond(arg0 != FALSEBV)
        elif opname == "uint_lt":
            expr = self.cond(z3.ULT(arg0, arg1))
        elif opname == "uint_le":
            expr = self.cond(z3.ULE(arg0, arg1))
        elif opname == "uint_gt":
            expr = self.cond(z3.UGT(arg0, arg1))
        elif opname == "uint_ge":
            expr = self.cond(z3.UGE(arg0, arg1))
        elif opname == "int_is_zero":
            expr = self.cond(arg0 == FALSEBV)

        # ... rest as above

So basically for every trace operation that operates on integers I had to give a translation into Z3 formulas, which is mostly straightforward.

Guard operations get converted into a Z3 boolean by their own helper function, which looks like this:

def guard_to_condition(self, guard, state):
    opname = guard.getopname()
    if opname == "guard_true":
        return self.convertarg(guard, 0) == TRUEBV
    elif opname == "guard_false":
        return self.convertarg(guard, 0) == FALSEBV
    elif opname == "guard_value":
        return self.convertarg(guard, 0) == self.convertarg(guard, 1)

    # ... some more exist, shown below

Some operations are a bit trickier. An important example in the context of this blog post are integer operations that check for overflow. The overflow operations return a result, but also a boolean whether the operation overflowed or not.

def add_to_solver(self, ops, state):

        # ... more cases

        elif opname == "int_add_ovf":
            expr = arg0 + arg1
            m = z3.SignExt(LONG_BIT, arg0) + z3.SignExt(LONG_BIT, arg1)
            state.no_ovf = m == z3.SignExt(LONG_BIT, expr)
        elif opname == "int_sub_ovf":
            expr = arg0 - arg1
            m = z3.SignExt(LONG_BIT, arg0) - z3.SignExt(LONG_BIT, arg1)
            state.no_ovf = m == z3.SignExt(LONG_BIT, expr)
        elif opname == "int_mul_ovf":
            expr = arg0 * arg1
            m = z3.SignExt(LONG_BIT, arg0) * z3.SignExt(LONG_BIT, arg1)
            state.no_ovf = m == z3.SignExt(LONG_BIT, expr)

        # ...

The boolean is computed by comparing the result of the bit vector operation with the result of converting the input bit vectors into an abstract (arbitrary precision) integer and the result back to bit vectors. Let's go through the addition case step by step, the other cases work analogously.

The addition in the first elif that computes expr is an addition on bit vectors, therefore it is performing wraparound arithmetic. z3.SignExt(LONG_BIT, arg0) sign-extends arg0 from a bit vector of LONG_BIT bits to an abstract, arbitrary precision integer. The addition in the second line is therefore an addition between abstract integers, so it will never overflow and just compute the correct result as an integer.

The condition to check for overflow is now: if the results of the two different ways to do the addition are the same, then overflow did not occur. So in order to compute state.no_ovf in the addition case the code converts the result of the bit vector wraparound addition to an abstract integer (using SignExt again), and then compares that to the integer result.

This boolean can then be checked by the guard operations guard_no_overflow and guard_overflow.

def guard_to_condition(self, guard, state):

    # ... more cases

    elif opname == "guard_no_overflow":
        assert state.no_ovf is not None
        return state.no_ovf
    elif opname == "guard_overflow":
        assert state.no_ovf is not None
        return z3.Not(state.no_ovf)

    # ... more cases

Finding the Bug, Again

Let's actually make all of this more concrete by applying it to the trace of our original bug. The input trace and the incorrectly optimized trace for that look like this (differences highlighted):

# input                       # optimized
[i0]                          [i0]
i1 = int_add(i0, 10)          i1 = int_add(i0, 10)
i2 = int_lt(i1, 15)           i2 = int_lt(i1, 15)
guard_true(i2)                guard_true(i2)
i3 = int_lt(i0, 6)            jump(0)
guard_true(i3)
jump(0)

Note that the trace represents just one of the paths through the control flow graph of the original function, which is typical for tracing JITs (the other paths could incrementally get added later).

The first guards in both these traces correspond to each other, so the first chunks to check are the first three operations (lines 1-4). Those operations don't get changed by the optimizer at all.

These two identical traces get translated to the following Z3 formulas:

i1unoptimized == input_i0 + 10
i2unoptimized == If(i1unoptimized < 15, 1, 0)
i1optimized == input_i0 + 10
i2optimized == If(i1optimized < 15, 1, 0)

To check that the two corresponding guards are the same, the solver is asked to prove that (i2unoptimized == 1) == (i2optimized == 1). This is correct, because the formulas for i2unoptimized and i2optimized are completely identical.

After checking that the guards behave the same, we add the knowledge to the solver that the guards passed. So the Z3 formulas become:

i1unoptimized == input_i0 + 10
i2unoptimized == If(i1unoptimized < 15, 1, 0)
i1optimized == input_i0 + 10
i2optimized == If(i1optimized < 15, 1, 0)
i1optimized == 1
i2optimized == 1

Now we continue with the remaining operations of the two traces (lines 6-8).

We start by adding the int_lt operation in the unoptimized trace to the Z3 formulas:

...
i3unoptimized == If(input_i0 < 6, 1, 0)

Because the second guard was optimized away, we need to ask Z3 to prove that i3unoptimized == 1 is always true, which fails and gives the following counterexample:

input_i0 = 9223372036854775800
i1unoptimized = 9223372036854775810
i2unoptimized = 0
i1optimized = 9223372036854775810
i2optimized = 1
i3unoptimized = 0

Thus demonstrating the bug. The fact that the Z3-based equivalence check also managed to find the original motivating bug without manually translating it to a formula is a good confirmation that the approach works.

Second bug

So with this code I applied the Z3-based equivalence check to all our optimizer unit tests. In addition to the bug we've been discussing the whole post, it also found another buggy test! I had found it too by hand by staring at all the tests in the process of writing all the Z3 infrastructure, but it was still a good confirmation that the process worked. This bug was in the range analysis for int_neg, integer negation. It failed to account that -MININT == MININT and therefore did a mis-optimization along the following lines:

import __pypy__

def wrong(x):
    a = __pypy__.intop.int_sub(0, x)
    if a < 0:
        if x > 0:
            return 0
        return 1
    return 2

Which was wrongly optimized into:

import __pypy__

def wrong(x):
    a = __pypy__.intop.int_sub(0, x)
    if a < 0:
        return 0
    return 2

This is wrong precisely for x == MININT.

Generating Random Traces

These two bugs were the only two that the Z3 checker found for existing unit tests. To try to find some more bugs I combined PyPy's existing random trace generator with the Z3 optimization checker. The random trace generator has so far been mostly used to find bugs in the machine code backends, particularly also in the register allocator. So far we haven't used it with our optimizer, but my experiments show that we should have!

I'm going to describe a little bit how the random trace generator works. It's actually not that complicated, but there's one neat trick to it.

The basic idea is straightforward, it starts out with an empty trace with a random number of input variables. Then it adds some number of operations to the trace, either regular operations or guards. Every operation takes already existing variables as input.

The neat trick is that our random trace generator keeps a concrete random example value for every one of the input variables, and an example result for every operation. In this way, it is possible to generate guards that are consistent with the example values to ensure that running the trace to its end is possible with at least one set of values.

Here's an example random trace that is generated, together with the random example inputs and the results of every operation at the end of every line:

[i0, i1, i2, i3, i4, i5] # example values: 9, 11, -8, -95, 46, 57
i6 = int_add_ovf(i3, i0) # -86
guard_no_overflow()
i7 = int_sub(i2, -35/ci) # 27
i8 = uint_ge(i3, i5) # 1
guard_true(i8)
i9 = int_lt(i7, i8) # 0
i10 = int_mul_ovf(34/ci, i7) # 918
guard_no_overflow()
i11 = int_and(i10, 63/ci) # 22
i12 = int_rshift(i3, i11) # -1
i13 = int_is_zero(i7) # 0
i14 = int_is_true(i13) # 0
guard_false(i13)
i15 = int_lt(i8, i4) # 1
i16 = int_and(i6, i0) # 8
i17 = uint_ge(i6, -6/ci) # 0
finish()

Note how every guard generated is true for the example values.

I have been running this combination of random trace generation and Z3 checking for many nights and it has found some bugs, which I'll describe in the next section. It should probably be run for a lot longer, but still a useful exercise already.

In this mode, I'm giving every Z3 call a time limit to make sure that the random tests don't just take arbitrarily long. This means that asking Z3 to prove something can have three outcomes, either it's proved, or Z3 finds a counterexample, or Z3 times out.

Bugs Found

In addition to the two bugs I've already described, I'll briefly list the additional bugs that were found by optimizing random traces and then trying to prove the equivalence with Z3.

Most of the bugs were actually identified by optimizing random traces alone, not by the Z3 component. They manifested as assert failures in the JIT compiler.

  • The JIT concluded after 12 == int_mul(x, 12) that x == 1, which is incorrect if overflow occurred (a counterexample is 0x8000000000000001).

  • An amusing bug, where from 0 == int_lshift(0x1000000000000000, x) with x <= 0 <= 15, the JIT concluded that 0x1000000000000000 == 0, triggering an assert. This wrong conclusion was again caused by not taking the possibility of overflow into account.

  • A corner case in an optimization for chained integer additions with a constant, where in complex enough expressions, the wrong IR API was used (which works correctly in simple cases). Again, this triggered an assert.

This shows that we should have been fuzzing our JIT optimizer already (not a surprising observation in hindsight, fuzz all the things!).

Thankfully, there was also one further bug that really failed in the Z3 verifier. It's a bug in common subexpression elimination / arithmetic simplification, which again does not take overflow correctly into account.

The buggy trace looks like this (unfortunately it's not easily possible to show this bug in Python code).

[a, b]
c = int_add(a, b)
r = int_sub_ovf(c, b)
guard_no_ovf()
finish(r)

This was optimized to:

[a, b]
finish(a)

Which is incorrect, because the guard can fail given the right inputs. But the optimizer concluded that the subtraction is safe, because its the inverse of an earlier addition, not taking into account that this earlier addition can have overflowed.

Note that a related optimization is actually correct. Given this code:

[a, b]
c = int_add_ovf(a, b)
guard_no_ovf()
r = int_sub(c, b)
finish(r)

It can be optimized to:

[a, b]
c = int_add_ovf(a, b)
guard_no_ovf()
finish(a)

Future Work and Conclusion

In the current form the Z3 checker is only a start, even though it has already been concretely useful. There are various directions into which we could extend it. In addition to generate random tests completely from scratch, we could also start from the existing manually written unit-tests and randomly mutate those.

I also want to extend the Z3 checker with support more operations, heap operations in particular (but it's not quite clear to me how to model garbage collection).

I also want to try to switch the code away from the Z3 API and use the more general smtlib interface directly, in order to be able to use other SMT checkers than Z3, eg CVC4.

But all in all this was a fun and not too hard way to find a bunch of bugs in our optimizer! And the infrastructure is now in place, which means that we run some random test cases every time we execute our tests. This is going to be particularly useful when we do further work on the integer reasoning of the JIT (like Nico is doing, for example). As of time of writing of this post, all the bugs mentioned have been fixed and the Z3 code has landed on the default branch and runs as part of PyPy's CI infrastructure.

Acknowledgements

Thanks to Saam Barati, Max Bernstein, Joshua Schmidt and Martin Berger, for great feedback on drafts of this post!