# A DSL for Peephole Transformation Rules of Integer Operations in the PyPy JIT

As is probably apparent from the sequence of blog posts about the topic in the last year, I have been thinking about and working on integer optimizations in the JIT compiler a lot. This work was mainly motivated by Pydrofoil, where integer operations matter a lot more than for your typical Python program.

In this post I'll describe my most recent change, which is a new small domain specific language that I implemented to specify peephole optimizations on integer operations in the JIT. It uses pattern matching to specify how (sequences of) integer operations should be simplified and optimized. The rules are then compiled to RPython code that then becomes part of the JIT's optimization passes.

To make it less likely to introduce incorrect optimizations into the JIT, the rules are automatically proven correct with Z3 as part of the build process (for a more hands-on intro to how that works you can look at the knownbits post). In this blog post I want to motivate why I introduced the DSL and give an introduction to how it works.

## Motivation¶

This summer, after I wrote my scripts to mine JIT traces for missed optimization opportunities, I started implementing a few of the integer peephole rewrite that the script identified. Unfortunately, doing so led to the problem that the way we express these rewrites up to now is very imperative and verbose. Here's a snippet of RPython code that shows some rewrites for integer multiplication (look at the comments to see what the different parts actually do). You don't need to understand the code in detail, but basically it's in very imperative style and there's quite a lot of boilerplate.

def optimize_INT_MUL(self, op): arg0 = get_box_replacement(op.getarg(0)) b0 = self.getintbound(arg0) arg1 = get_box_replacement(op.getarg(1)) b1 = self.getintbound(arg1) if b0.known_eq_const(1): # 1 * x == x self.make_equal_to(op, arg1) elif b1.known_eq_const(1): # x * 1 == x self.make_equal_to(op, arg0) elif b0.known_eq_const(0) or b1.known_eq_const(0): # 0 * x == x * 0 == 0 self.make_constant_int(op, 0) else: for lhs, rhs in [(arg0, arg1), (arg1, arg0)]: lh_info = self.getintbound(lhs) if lh_info.is_constant(): x = lh_info.get_constant_int() if x & (x - 1) == 0: # x * (2 ** c) == x << c new_rhs = ConstInt(highest_bit(lh_info.get_constant_int())) op = self.replace_op_with(op, rop.INT_LSHIFT, args=[rhs, new_rhs]) self.optimizer.send_extra_operation(op) return elif x == -1: # x * -1 == -x op = self.replace_op_with(op, rop.INT_NEG, args=[rhs]) self.optimizer.send_extra_operation(op) return else: # x * (1 << y) == x << y shiftop = self.optimizer.as_operation(get_box_replacement(lhs), rop.INT_LSHIFT) if shiftop is None: continue if not shiftop.getarg(0).is_constant() or shiftop.getarg(0).getint() != 1: continue shiftvar = get_box_replacement(shiftop.getarg(1)) shiftbound = self.getintbound(shiftvar) if shiftbound.known_nonnegative() and shiftbound.known_lt_const(LONG_BIT): op = self.replace_op_with( op, rop.INT_LSHIFT, args=[rhs, shiftvar]) self.optimizer.send_extra_operation(op) return return self.emit(op)

Adding more rules to these functions is very tedious and gets super confusing when the functions get bigger. In addition I am always worried about making mistakes when writing this kind of code, and there is no feedback at all about which of these rules are actually applied a lot in real programs.

Therefore I decided to write a small domain specific language with the goal of expressing these rules in a more declarative way. In the rest of the post I'll describe the DSL (most of that description is adapted from the documentation about it that I wrote).

## The Peephole Rule DSL¶

### Simple transformation rules¶

The rules in the DSL specify how integer operation can be transformed into cheaper other integer operations. A rule always consists of a name, a pattern, and a target. Here's a simple rule:

add_zero: int_add(x, 0) => x

The name of the rule is `add_zero`

. It matches operations in the trace of the
form `int_add(x, 0)`

, where `x`

will match anything and `0`

will match only the
constant zero. After the `=>`

arrow is the target of the rewrite, i.e. what the
operation is rewritten to, in this case `x`

.

The rule language has a list of which of the operations are commutative, so `add_zero`

will also optimize `int_add(0, x)`

to `x`

.

Variables in the pattern can repeat:

sub_x_x: int_sub(x, x) => 0

This rule matches against `int_sub`

operations where the two arguments are the
same (either the same box, or the same constant).

Here's a rule with a more complicated pattern:

sub_add: int_sub(int_add(x, y), y) => x

This pattern matches `int_sub`

operations, where the first argument was
produced by an `int_add`

operation. In addition, one of the arguments of the
addition has to be the same as the second argument of the subtraction.

The constants `MININT`

, `MAXINT`

and `LONG_BIT`

(which is either 32 or 64,
depending on which platform the JIT is built for) can be used in rules, they
behave like writing numbers but allow bit-width-independent formulations:

is_true_and_minint: int_is_true(int_and(x, MININT)) => int_lt(x, 0)

It is also possible to have a pattern where some arguments needs to be a constant, without specifying which constant. Those patterns look like this:

sub_add_consts: int_sub(int_add(x, C1), C2) # incomplete # more goes here => int_sub(x, C)

Variables in the pattern that start with a `C`

match against constants only.
However, in this current form the rule is incomplete, because the variable `C`

that is being used in the target operation is not defined anywhere. We will see
how to compute it in the next section.

### Computing constants and other intermediate results¶

Sometimes it is necessary to compute intermediate results that are used in the target operation. To do that, there can be extra assignments between the rule head and the rule target.:

sub_add_consts: int_sub(int_add(x, C1), C2) # incomplete C = C1 + C1 => int_sub(x, C)

The right hand side of such an assignment is a subset of Python syntax,
supporting arithmetic using `+`

, `-`

, `*`

, and certain helper functions.
However, the syntax allows you to be explicit about unsignedness for some
operations. E.g. `>>u`

exists for unsigned right shifts (and I plan to add
`>u`

, `>=u`

, `<u`

, `<=u`

for comparisons).

Here's an example of a rule that uses `>>u`

:

urshift_lshift_x_c_c: uint_rshift(int_lshift(x, C), C) mask = (-1 << C) >>u C => int_and(x, mask)

### Checks¶

Some rewrites are only true under certain conditions. For example,
`int_eq(x, 1)`

can be rewritten to `x`

, if `x`

is known to store a boolean value. This can
be expressed with *checks*:

eq_one: int_eq(x, 1) check x.is_bool() => x

A check is followed by a boolean expression. The variables from the pattern can
be used as `IntBound`

instances in checks (and also in assignments) to find out
what the abstract interpretation of the JIT knows about the value of a trace variable
(`IntBound`

is the name of the abstract domain that the JIT uses for integers,
despite the fact that it also stores knownbits information nowadays).

Here's another example:

mul_lshift: int_mul(x, int_lshift(1, y)) check y.known_ge_const(0) and y.known_le_const(LONG_BIT) => int_lshift(x, y)

It expresses that `x * (1 << y)`

can be rewritten to `x << y`

but checks that
`y`

is known to be between `0`

and `LONG_BIT`

.

Checks and assignments can be repeated and combined with each other:

mul_pow2_const: int_mul(x, C) check C > 0 and C & (C - 1) == 0 shift = highest_bit(C) => int_lshift(x, shift)

In addition to calling methods on `IntBound`

instances, it's also possible to
access their attributes, like in this rule:

and_x_c_in_range: int_and(x, C) check x.lower >= 0 and x.upper <= C & ~(C + 1) => x

### Rule Ordering and Liveness¶

The generated optimizer code will give preference to applying rules that
produce a constant or a variable as a rewrite result. Only if none of those
match do rules that produce new result operations get applied. For example, the
rules `sub_x_x`

and `sub_add`

are tried before trying `sub_add_consts`

,
because the former two rules optimize to a constant and a variable
respectively, while the latter produces a new operation as the result.

The rule `sub_add_consts`

has a possible problem, which is that if the
intermediate result of the `int_add`

operation in the rule head is used by
some other operations, then the `sub_add_consts`

rule does not actually
reduce the number of operations (and might actually make things slightly worse
due to increased register pressure). However, currently it would be extremely
hard to take that kind of information into account in the optimization pass of
the JIT, so we optimistically apply the rules anyway.

### Checking rule coverage¶

Every rewrite rule should have at least one unit test where it triggers. To ensure this, the unit test file that mainly checks integer optimizations in the JIT has an assert at the end of a test run, that every rule fired at least once.

### Printing rule statistics¶

The JIT can print statistics about which rule fired how often in the
`jit-intbounds-stats`

logging category, using the PYPYLOG mechanism. For
example, to print the category to stdout at the end of program execution, run
PyPy like this:

PYPYLOG=jit-intbounds-stats:- pypy ...

The output of that will look something like this:

int_add add_reassoc_consts 2514 add_zero 107008 int_sub sub_zero 31519 sub_from_zero 523 sub_x_x 3153 sub_add_consts 159 sub_add 55 sub_sub_x_c_c 1752 sub_sub_c_x_c 0 sub_xor_x_y_y 0 sub_or_x_y_y 0 int_mul mul_zero 0 mul_one 110 mul_minus_one 0 mul_pow2_const 1456 mul_lshift 0 ...

### Termination and Confluence¶

Right now there are unfortunately no checks that the rules actually rewrite operations towards "simpler" forms. There is no cost model, and also nothing that prevents you from writing a rule like this:

neg_complication: int_neg(x) # leads to infinite rewrites => int_mul(-1, x)

Doing this would lead to endless rewrites if there is also another rule that turns multiplication with -1 into negation.

There is also no checking for confluence (yet?), i.e. the property that all rewrites starting from the same input trace always lead to the same output trace, no matter in which order the rules are applied.

### Proofs¶

It is very easy to write a peephole rule that is not correct in all corner
cases. Therefore all the rules are proven correct with Z3 before compiled into
actual JIT code, by default. When the proof fails, a (hopefully minimal)
counterexample is printed. The counterexample consists of values for all the
inputs that fulfil the checks, values for the intermediate expressions, and
then two *different* values for the source and the target operations.

E.g. if we try to add the incorrect rule:

mul_is_add: int_mul(a, b) => int_add(a, b)

We get the following counterexample as output:

Could not prove correctness of rule 'mul_is_add' in line 1 counterexample given by Z3: counterexample values: a: 0 b: 1 operation int_mul(a, b) with Z3 formula a*b has counterexample result vale: 0 BUT target expression: int_add(a, b) with Z3 formula a + b has counterexample value: 1

If we add conditions, they are taken into account and the counterexample will fulfil the conditions:

mul_is_add: int_mul(a, b) check a.known_gt_const(1) and b.known_gt_const(2) => int_add(a, b)

This leads to the following counterexample:

Could not prove correctness of rule 'mul_is_add' in line 46 counterexample given by Z3: counterexample values: a: 2 b: 3 operation int_mul(a, b) with Z3 formula a*b has counterexample result vale: 6 BUT target expression: int_add(a, b) with Z3 formula a + b has counterexample value: 5

Some `IntBound`

methods cannot be used in Z3 proofs because their control
flow is too complex. If that is the case, they can have Z3-equivalent
formulations defined (in every case this is done, it's a potential proof hole if
the Z3 friendly reformulation and the real implementation differ from each
other, therefore extra care is required to make very sure they are equivalent).

It's possible to skip the proof of individual rules entirely by adding
`SORRY_Z3`

to its body (but we should try not to do that too often):

eq_different_knownbits: int_eq(x, y) SORRY_Z3 check x.known_ne(y) => 0

### Checking for satisfiability¶

In addition to checking whether the rule yields a correct optimization, we also
check whether the rule can ever apply. This ensures that there are *some*
runtime values that would fulfil all the checks in a rule. Here's an example of
a rule violating this:

never_applies: int_is_true(x) check x.known_lt_const(0) and x.known_gt_const(0) # impossible condition, always False => x

Right now the error messages if this goes wrong are not completely easy to understand. I hope to be able to improve this later:

Rule 'never_applies' cannot ever apply in line 1 Z3 did not manage to find values for variables x such that the following condition becomes True: And(x <= x_upper, x_lower <= x, If(x_upper < 0, x_lower > 0, x_upper < 0))

### Implementation Notes¶

The implementation of the DSL is done in a relatively ad-hoc manner. It is parsed using rply, there's a small type checker that tries to find common problems in how the rules are written. Z3 is used via the Python API, like in the previous blog posts that are using it. The pattern matching RPython code is generated using an approach inspired by Luc Maranget's paper Compiling Pattern Matching to Good Decision Trees. See this blog post for an approachable introduction.

## Conclusion¶

Now that I've described the DSL, here are the rules that are equivalent to the imperative code in the motivation section:

mul_zero: int_mul(x, 0) => 0 mul_one: int_mul(x, 1) => x mul_minus_one: int_mul(x, -1) => int_neg(x) mul_pow2_const: int_mul(x, C) check C > 0 and C & (C - 1) == 0 shift = highest_bit(C) => int_lshift(x, shift) mul_lshift: int_mul(x, int_lshift(1, y)) check y.known_ge_const(0) and y.known_le_const(LONG_BIT) => int_lshift(x, y)

The current status of the DSL is that it got merged to PyPy's main branch. I rewrote a part of the integer rewrites into the DSL, but some are still in the old imperative style (mostly for complicated reasons, the easily ported ones are all done). Since I've only been porting optimizations that had existed prior to the existence of the DSL, performance numbers of benchmarks didn't change.

There are a number of features that are still missing and some possible extensions that I plan to work on in the future:

All the integer operations that the DSL handles so far are the variants that do not check for overflow (or where overflow was proven to be impossible to happen). In regular Python code the overflow-checking variants int_add_ovf etc are much more common, but the DSL doesn't support them yet. I plan to fix this, but don't completely understand how the correctness proofs for them should be done correctly.

A related problem is that I don't understand what it means for a rewrite to be correct if some of the operations are only defined for a subset of the input values. E.g. division isn't defined if the divisor is zero. In theory, a division operation in the trace should always be preceded by a check that the divisor isn't zero. But sometimes other optimization move the check around and the connection to the division gets lost or muddled. What optimizations can we still safely perform on the division? There's lots of prior work on this question, but I still don't understand what the correct approach in our context would be.

Ordering comparisons like

`int_lt`

,`int_le`

and their unsigned variants are not ported to the DSL yet. Comparisons are an area where the JIT is not super good yet at optimizing away operations. This is a pretty big topic and I've started a project with Nico Rittinghaus to try to improve the situation a bit more generally.A more advanced direction of work would be to implement a simplified form of e-graphs (or ae-graphs). The JIT has like half of an e-graph data structure already, and we probably can't afford a full one in terms of compile time costs, but maybe we can have two thirds or something?

## Acknowledgements¶

Thank you to Max Bernstein and Martin Berger for super helpful feedback on drafts of the post!

## Comments