<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>PyPy (Posts about gc)</title><link>https://www.pypy.org/</link><description></description><atom:link href="https://www.pypy.org/categories/gc.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2026 &lt;a href="mailto:pypy-dev@pypy.org"&gt;The PyPy Team&lt;/a&gt; </copyright><lastBuildDate>Mon, 23 Mar 2026 21:26:46 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>How fast can the RPython GC allocate?</title><link>https://www.pypy.org/posts/2025/06/rpython-gc-allocation-speed.html</link><dc:creator>CF Bolz-Tereick</dc:creator><description>&lt;p&gt;While working on a paper about &lt;a href="https://pypy.org/posts/2025/02/pypy-gc-sampling.html"&gt;allocation profiling in
VMProf&lt;/a&gt; I got curious
about how quickly the RPython GC can allocate an object. I wrote a small
RPython benchmark program to get an idea of the order of magnitude.&lt;/p&gt;
&lt;p&gt;The basic idea is to just allocate an instance in a tight loop:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;A&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# preliminary idea, see below&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The RPython type inference will find out that instances of &lt;code&gt;A&lt;/code&gt; have a single
&lt;code&gt;i&lt;/code&gt; field, which is an integer. In addition to that field, every RPython object
needs one word of GC meta-information. Therefore one instance of &lt;code&gt;A&lt;/code&gt; needs 16
bytes on a 64-bit architecture.&lt;/p&gt;
&lt;p&gt;However, measuring like this is not good enough, because the RPython static
optimizer would remove the allocation since the object isn't used. But we can
confuse the escape analysis sufficiently by always keeping two instances alive
at the same time:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;A&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# print the instances at the end&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(I confirmed that the allocation isn't being removed by looking at the C code
that the RPython compiler generates from this.)&lt;/p&gt;
&lt;p&gt;This is doing a little bit more work than needed, because of the &lt;code&gt;a.i = i&lt;/code&gt;
instance attribute write. We can also (optionally) leave the field
uninitialized.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;initialize_field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;initialize_field&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
            &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# make sure always two objects are alive&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
            &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'s'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;object_size_in_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="c1"&gt;# GC header, one integer field&lt;/span&gt;
    &lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;object_size_in_words&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1024.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1024.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1024.0&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'GB'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'GB/s'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then we need to add some RPython scaffolding:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;loops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;with_init&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;with_init&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"with initialization"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"without initialization"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;with_init&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;target&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To build a binary:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="go"&gt;pypy rpython/bin/rpython targetallocatealot.py&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This turns the RPython code into C code and uses a C compiler to turn that
into a binary, which contains both our code above and the RPython garbage
collector.&lt;/p&gt;
&lt;p&gt;Then we can run it (all results again from my AMD Ryzen 7 PRO 7840U, running
Ubuntu Linux 24.04.2):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;./targetallocatealot-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1000000000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="go"&gt;without initialization&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;A object at 0x7c71ad84cf60&amp;gt; &amp;lt;A object at 0x7c71ad84cf70&amp;gt;&lt;/span&gt;
&lt;span class="go"&gt;0.433825 s&lt;/span&gt;
&lt;span class="go"&gt;14.901161 GB&lt;/span&gt;
&lt;span class="go"&gt;34.348322 GB/s&lt;/span&gt;
&lt;span class="gp"&gt;$ &lt;/span&gt;./targetallocatealot-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1000000000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="go"&gt;with initialization&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;A object at 0x71b41c82cf60&amp;gt; &amp;lt;A object at 0x71b41c82cf70&amp;gt;&lt;/span&gt;
&lt;span class="go"&gt;0.501856 s&lt;/span&gt;
&lt;span class="go"&gt;14.901161 GB&lt;/span&gt;
&lt;span class="go"&gt;29.692100 GB/s&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
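&lt;p&gt;As a sanity check, the reported throughput can be recomputed from the printed loop count and elapsed time. This is just a small sketch in plain Python (not RPython); the constant and function names are my own, and the elapsed time is the rounded value from the first run above, so the last digits can differ slightly.&lt;/p&gt;

```python
# Numbers copied from the "without initialization" run shown above.
LOOPS = 1000000000
ELAPSED_SECONDS = 0.433825
WORD_SIZE = 8          # bytes per word on a 64-bit machine
OBJECT_WORDS = 2       # GC header plus one integer field


def allocated_gib(loops):
    """Total memory allocated, in units of 1024**3 bytes."""
    return loops * WORD_SIZE * OBJECT_WORDS / 1024.0 ** 3


def throughput(loops, seconds):
    """Return (GiB per second, allocations per second)."""
    return allocated_gib(loops) / seconds, loops / seconds


gib_per_s, allocs_per_s = throughput(LOOPS, ELAPSED_SECONDS)
print("%.6f GiB" % allocated_gib(LOOPS))
print("%.2f GiB/s" % gib_per_s)
print("%.2e allocations/s" % allocs_per_s)
```

&lt;p&gt;Note that the benchmark divides by 1024 three times, so the unit it prints as GB is really GiB.&lt;/p&gt;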

&lt;p&gt;Let's compare it with the Boehm GC:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;pypy&lt;span class="w"&gt; &lt;/span&gt;rpython/bin/rpython&lt;span class="w"&gt; &lt;/span&gt;--gc&lt;span class="o"&gt;=&lt;/span&gt;boehm&lt;span class="w"&gt; &lt;/span&gt;--output&lt;span class="o"&gt;=&lt;/span&gt;targetallocatealot-c-boehm&lt;span class="w"&gt; &lt;/span&gt;targetallocatealot.py&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="go"&gt;...&lt;/span&gt;
&lt;span class="gp"&gt;$ &lt;/span&gt;./targetallocatealot-c-boehm&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1000000000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="go"&gt;without initialization&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;A object at 0xffff8bd058a6e3af&amp;gt; &amp;lt;A object at 0xffff8bd058a6e3bf&amp;gt;&lt;/span&gt;
&lt;span class="go"&gt;9.722585 s&lt;/span&gt;
&lt;span class="go"&gt;14.901161 GB&lt;/span&gt;
&lt;span class="go"&gt;1.532634 GB/s&lt;/span&gt;
&lt;span class="gp"&gt;$ &lt;/span&gt;./targetallocatealot-c-boehm&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1000000000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="go"&gt;with initialization&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;A object at 0xffff88e1132983af&amp;gt; &amp;lt;A object at 0xffff88e1132983bf&amp;gt;&lt;/span&gt;
&lt;span class="go"&gt;9.684149 s&lt;/span&gt;
&lt;span class="go"&gt;14.901161 GB&lt;/span&gt;
&lt;span class="go"&gt;1.538717 GB/s&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is not a fair comparison: the Boehm GC uses conservative stack
scanning, so it cannot move objects, and a non-moving GC needs a much more
complicated allocation path.&lt;/p&gt;
&lt;h3 id="lets-look-at-perf-stats"&gt;Let's look at &lt;code&gt;perf stat&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;We can use &lt;code&gt;perf&lt;/code&gt; to get some statistics about the executions:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;perf&lt;span class="w"&gt; &lt;/span&gt;stat&lt;span class="w"&gt; &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;cache-references,cache-misses,cycles,instructions,branches,faults,migrations&lt;span class="w"&gt; &lt;/span&gt;./targetallocatealot-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10000000000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="go"&gt;without initialization&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;A object at 0x7aa260e35980&amp;gt; &amp;lt;A object at 0x7aa260e35990&amp;gt;&lt;/span&gt;
&lt;span class="go"&gt;4.301442 s&lt;/span&gt;
&lt;span class="go"&gt;149.011612 GB&lt;/span&gt;
&lt;span class="go"&gt;34.642245 GB/s&lt;/span&gt;

&lt;span class="go"&gt; Performance counter stats for './targetallocatealot-c 10000000000 0':&lt;/span&gt;

&lt;span class="go"&gt;     7,244,117,828      cache-references                                                      &lt;/span&gt;
&lt;span class="go"&gt;        23,446,661      cache-misses                     #    0.32% of all cache refs         &lt;/span&gt;
&lt;span class="go"&gt;    21,074,240,395      cycles                                                                &lt;/span&gt;
&lt;span class="go"&gt;   110,116,790,943      instructions                     #    5.23  insn per cycle            &lt;/span&gt;
&lt;span class="go"&gt;    20,024,347,488      branches                                                              &lt;/span&gt;
&lt;span class="go"&gt;             1,287      faults                                                                &lt;/span&gt;
&lt;span class="go"&gt;                24      migrations                                                            &lt;/span&gt;

&lt;span class="go"&gt;       4.303071693 seconds time elapsed&lt;/span&gt;

&lt;span class="go"&gt;       4.297557000 seconds user&lt;/span&gt;
&lt;span class="go"&gt;       0.003998000 seconds sys&lt;/span&gt;

&lt;span class="gp"&gt;$ &lt;/span&gt;perf&lt;span class="w"&gt; &lt;/span&gt;stat&lt;span class="w"&gt; &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;cache-references,cache-misses,cycles,instructions,branches,faults,migrations&lt;span class="w"&gt; &lt;/span&gt;./targetallocatealot-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10000000000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="go"&gt;with initialization&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;A object at 0x77ceb0235980&amp;gt; &amp;lt;A object at 0x77ceb0235990&amp;gt;&lt;/span&gt;
&lt;span class="go"&gt;5.016772 s&lt;/span&gt;
&lt;span class="go"&gt;149.011612 GB&lt;/span&gt;
&lt;span class="go"&gt;29.702688 GB/s&lt;/span&gt;

&lt;span class="go"&gt; Performance counter stats for './targetallocatealot-c 10000000000 1':&lt;/span&gt;

&lt;span class="go"&gt;     7,571,461,470      cache-references                                                      &lt;/span&gt;
&lt;span class="go"&gt;       241,915,266      cache-misses                     #    3.20% of all cache refs         &lt;/span&gt;
&lt;span class="go"&gt;    24,503,497,532      cycles                                                                &lt;/span&gt;
&lt;span class="go"&gt;   130,126,387,460      instructions                     #    5.31  insn per cycle            &lt;/span&gt;
&lt;span class="go"&gt;    20,026,280,693      branches                                                              &lt;/span&gt;
&lt;span class="go"&gt;             1,285      faults                                                                &lt;/span&gt;
&lt;span class="go"&gt;                21      migrations                                                            &lt;/span&gt;

&lt;span class="go"&gt;       5.019444749 seconds time elapsed&lt;/span&gt;

&lt;span class="go"&gt;       5.012924000 seconds user&lt;/span&gt;
&lt;span class="go"&gt;       0.005999000 seconds sys&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is pretty cool: we can run this loop at more than 5 instructions per cycle. Every
allocation takes &lt;code&gt;110116790943 / 10000000000 ≈ 11&lt;/code&gt; instructions and
&lt;code&gt;21074240395 / 10000000000 ≈ 2.1&lt;/code&gt; cycles, including the loop around it.&lt;/p&gt;
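&lt;p&gt;The same per-iteration arithmetic can be done for both runs at once. The following is a small plain-Python sketch; the dictionary layout and the helper name &lt;code&gt;per_iteration&lt;/code&gt; are made up for illustration, and the counter values are copied from the two &lt;code&gt;perf stat&lt;/code&gt; outputs above.&lt;/p&gt;

```python
ITERATIONS = 10000000000

# Counter values copied from the two perf stat runs above.
RUNS = {
    "without initialization": {"instructions": 110116790943,
                               "cycles": 21074240395},
    "with initialization": {"instructions": 130126387460,
                            "cycles": 24503497532},
}


def per_iteration(counters, iterations=ITERATIONS):
    """Return (instructions, cycles, instructions per cycle) per loop iteration."""
    insns = counters["instructions"] / float(iterations)
    cycles = counters["cycles"] / float(iterations)
    return insns, cycles, insns / cycles


for name in sorted(RUNS):
    insns, cycles, ipc = per_iteration(RUNS[name])
    print("%s: %.2f insns, %.2f cycles, %.2f insn/cycle"
          % (name, insns, cycles, ipc))
```

&lt;p&gt;With these counters the run that initializes the field comes out at roughly 13 instructions per iteration, about two more than the run that leaves it uninitialized.&lt;/p&gt;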
&lt;h3 id="how-often-does-the-gc-run"&gt;How often does the GC run?&lt;/h3&gt;
&lt;p&gt;The RPython GC queries the L2 cache size to determine the size of the nursery.
We can find out what it is by setting the &lt;code&gt;PYPYLOG&lt;/code&gt; environment variable, selecting the proper logging
categories, and printing to &lt;code&gt;stderr&lt;/code&gt; via &lt;code&gt;:-&lt;/code&gt;:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;&lt;span class="nv"&gt;PYPYLOG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gc-set-nursery-size,gc-hardware:-&lt;span class="w"&gt; &lt;/span&gt;./targetallocatealot-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="go"&gt;[f3e6970465723] {gc-set-nursery-size&lt;/span&gt;
&lt;span class="go"&gt;nursery size: 270336&lt;/span&gt;
&lt;span class="go"&gt;[f3e69704758f3] gc-set-nursery-size}&lt;/span&gt;
&lt;span class="go"&gt;[f3e697047b9a1] {gc-hardware&lt;/span&gt;
&lt;span class="go"&gt;L2cache = 1048576&lt;/span&gt;
&lt;span class="go"&gt;[f3e69705ced19] gc-hardware}&lt;/span&gt;
&lt;span class="go"&gt;[f3e69705d11b5] {gc-hardware&lt;/span&gt;
&lt;span class="go"&gt;memtotal = 32274210816.000000&lt;/span&gt;
&lt;span class="go"&gt;[f3e69705f4948] gc-hardware}&lt;/span&gt;
&lt;span class="go"&gt;[f3e6970615f78] {gc-set-nursery-size&lt;/span&gt;
&lt;span class="go"&gt;nursery size: 4194304&lt;/span&gt;
&lt;span class="go"&gt;[f3e697061ecc0] gc-set-nursery-size}&lt;/span&gt;
&lt;span class="go"&gt;with initialization&lt;/span&gt;
&lt;span class="go"&gt;NULL &amp;lt;A object at 0x7fa7b1434020&amp;gt;&lt;/span&gt;
&lt;span class="go"&gt;0.000008 s&lt;/span&gt;
&lt;span class="go"&gt;0.000000 GB&lt;/span&gt;
&lt;span class="go"&gt;0.001894 GB/s&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So the nursery is 4 MiB. This means that when we allocate 149 GiB (10 billion objects of 16 bytes each) the GC needs to perform about &lt;code&gt;10000000000 * 16 / 4194304 ≈ 38147&lt;/code&gt; minor collections. Let's confirm that:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;&lt;span class="nv"&gt;PYPYLOG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gc-minor:out&lt;span class="w"&gt; &lt;/span&gt;./targetallocatealot-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10000000000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="go"&gt;with initialization&lt;/span&gt;
&lt;span class="go"&gt;&amp;lt;A object at 0x7991e3835980&amp;gt; &amp;lt;A object at 0x7991e3835990&amp;gt;&lt;/span&gt;
&lt;span class="go"&gt;5.315511 s&lt;/span&gt;
&lt;span class="go"&gt;149.011612 GB&lt;/span&gt;
&lt;span class="go"&gt;28.033356 GB/s&lt;/span&gt;
&lt;span class="gp"&gt;$ &lt;/span&gt;head&lt;span class="w"&gt; &lt;/span&gt;out
&lt;span class="go"&gt;[f3ee482f4cd97] {gc-minor&lt;/span&gt;
&lt;span class="go"&gt;[f3ee482f53874] {gc-minor-walkroots&lt;/span&gt;
&lt;span class="go"&gt;[f3ee482f54117] gc-minor-walkroots}&lt;/span&gt;
&lt;span class="go"&gt;minor collect, total memory used: 0&lt;/span&gt;
&lt;span class="go"&gt;number of pinned objects: 0&lt;/span&gt;
&lt;span class="go"&gt;total size of surviving objects: 0&lt;/span&gt;
&lt;span class="go"&gt;time taken: 0.000029&lt;/span&gt;
&lt;span class="go"&gt;[f3ee482f67b7e] gc-minor}&lt;/span&gt;
&lt;span class="go"&gt;[f3ee4838097c5] {gc-minor&lt;/span&gt;
&lt;span class="go"&gt;[f3ee48380c945] {gc-minor-walkroots&lt;/span&gt;
&lt;span class="gp"&gt;$ &lt;/span&gt;grep&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{gc-minor-walkroots"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;out&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;wc&lt;span class="w"&gt; &lt;/span&gt;-l
&lt;span class="go"&gt;38147&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
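&lt;p&gt;The estimate of the number of minor collections can be reproduced in a few lines of plain Python. This is only a back-of-the-envelope sketch with constant names of my own choosing; it assumes that every one of the 10 billion objects takes exactly 16 bytes of nursery space and ignores anything else the program allocates.&lt;/p&gt;

```python
NURSERY_SIZE = 4194304      # bytes, from the gc-set-nursery-size log line
OBJECT_SIZE = 16            # bytes per instance of A
LOOPS = 10000000000

# How many instances fit into the nursery before it is full.
objects_per_nursery = NURSERY_SIZE // OBJECT_SIZE

# Every time the nursery fills up, one minor collection is needed.
total_bytes = LOOPS * OBJECT_SIZE
expected_collections = total_bytes / float(NURSERY_SIZE)

print(objects_per_nursery)
print("%.2f" % expected_collections)
```

&lt;p&gt;The result is a bit under 38147, which fits the 38147 &lt;code&gt;gc-minor-walkroots&lt;/code&gt; entries found in the log.&lt;/p&gt;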

&lt;p&gt;Each minor collection is very quick, because a minor collection is
O(surviving objects), and in this program only one object survives each time
(the other instance is in the process of being allocated).
Also, the GC root shadow stack is only one entry, so walking that is super
quick as well. The time the minor collections take is logged to the out file:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;grep&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"time taken"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;out&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;tail
&lt;span class="go"&gt;time taken: 0.000002&lt;/span&gt;
&lt;span class="go"&gt;time taken: 0.000002&lt;/span&gt;
&lt;span class="go"&gt;time taken: 0.000002&lt;/span&gt;
&lt;span class="go"&gt;time taken: 0.000002&lt;/span&gt;
&lt;span class="go"&gt;time taken: 0.000002&lt;/span&gt;
&lt;span class="go"&gt;time taken: 0.000002&lt;/span&gt;
&lt;span class="go"&gt;time taken: 0.000002&lt;/span&gt;
&lt;span class="go"&gt;time taken: 0.000003&lt;/span&gt;
&lt;span class="go"&gt;time taken: 0.000002&lt;/span&gt;
&lt;span class="go"&gt;time taken: 0.000002&lt;/span&gt;
&lt;span class="gp"&gt;$ &lt;/span&gt;grep&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"time taken"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;out&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;grep&lt;span class="w"&gt; &lt;/span&gt;-o&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.*"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;numsum
&lt;span class="go"&gt;0.0988160000000011&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(This number is super approximate, because the logged times are rounded to six decimal places.)&lt;/p&gt;
&lt;p&gt;That means that &lt;code&gt;0.0988160000000011 / 5.315511 ≈ 2%&lt;/code&gt; of the time is spent in the GC.&lt;/p&gt;
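&lt;p&gt;If &lt;code&gt;numsum&lt;/code&gt; isn't installed, the same sum can be computed with a few lines of plain Python. This is a rough sketch: the function names are made up, and it assumes that the log contains lines of the form &lt;code&gt;time taken: 0.000002&lt;/code&gt; as in the excerpts above. The sample list is just a tiny stand-in for the real file.&lt;/p&gt;

```python
def total_minor_collection_time(lines):
    """Sum the values of all 'time taken: X' lines, in seconds."""
    prefix = "time taken:"
    total = 0.0
    for line in lines:
        line = line.strip()
        if line.startswith(prefix):
            total += float(line[len(prefix):])
    return total


def gc_fraction(gc_seconds, wall_seconds):
    """Fraction of the wall-clock time that was spent in minor collections."""
    return gc_seconds / wall_seconds


# Tiny stand-in for the real log file, in the format shown above.
sample = [
    "[f3ee482f4cd97] {gc-minor",
    "minor collect, total memory used: 0",
    "time taken: 0.000029",
    "[f3ee482f67b7e] gc-minor}",
    "time taken: 0.000002",
]
print("%.6f" % total_minor_collection_time(sample))

# Fraction for the run above, using the numbers reported there.
print("%.1f%%" % (100 * gc_fraction(0.0988160000000011, 5.315511)))
```

&lt;p&gt;For the real log, pass the open file object instead of the sample list, for example &lt;code&gt;total_minor_collection_time(open("out"))&lt;/code&gt;.&lt;/p&gt;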
&lt;h3 id="what-does-the-generated-machine-code-look-like"&gt;What does the generated machine code look like?&lt;/h3&gt;
&lt;p&gt;The allocation fast path of the RPython GC is a simple bump pointer. In Python
pseudo-code it would look roughly like this:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_free&lt;/span&gt;
&lt;span class="c1"&gt;# Move nursery_free pointer forward by totalsize&lt;/span&gt;
&lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_free&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;totalsize&lt;/span&gt;
&lt;span class="c1"&gt;# Check if this allocation would exceed the nursery&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_free&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# If it does =&amp;gt; collect the nursery and allocate afterwards&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collect_and_reserve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;totalsize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hdr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;GC&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
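&lt;p&gt;To play with the idea, here is a toy model of a bump-pointer nursery in plain Python. It is only a sketch and not the actual RPython GC code: addresses are plain integers, the class &lt;code&gt;ToyNursery&lt;/code&gt; is made up, and its &lt;code&gt;collect_and_reserve&lt;/code&gt; simply pretends that nothing survives a minor collection. Unlike the pseudo-code above, it checks how many objects still fit before bumping the pointer, instead of bumping first and comparing afterwards.&lt;/p&gt;

```python
class ToyNursery(object):
    """Toy model of a bump-pointer nursery; addresses are plain integers."""

    def __init__(self, size):
        self.nursery_free = 0        # next free address
        self.nursery_top = size      # end of the nursery
        self.minor_collections = 0

    def collect_and_reserve(self, totalsize):
        # Hypothetical slow path: pretend nothing survives, so the whole
        # nursery is free again, then reserve space for the new object.
        self.minor_collections += 1
        self.nursery_free = totalsize
        return 0

    def malloc(self, totalsize):
        result = self.nursery_free
        # How many objects of this size still fit into the nursery?
        room = (self.nursery_top - result) // totalsize
        if room == 0:
            # Not enough space left: take the slow path.
            return self.collect_and_reserve(totalsize)
        # Fast path: bump the pointer and hand out the old value.
        self.nursery_free = result + totalsize
        return result


nursery = ToyNursery(64)
addresses = [nursery.malloc(16) for _ in range(10)]
print(addresses)
print(nursery.minor_collections)
```

&lt;p&gt;With a 64 byte nursery and 16 byte objects, every fifth allocation has to take the slow path, so ten allocations trigger two (pretend) minor collections.&lt;/p&gt;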

&lt;p&gt;So we can disassemble the compiled binary &lt;code&gt;targetallocatealot-c&lt;/code&gt; and try to
find the equivalent logic in machine code. I'm super bad at reading machine
code, but I tried to annotate what I think is the core loop (the version
without initializing the &lt;code&gt;i&lt;/code&gt; field) below:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cb68&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;mov&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rbx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rdi&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cb6b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;mov&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rdx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rbx&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="cp"&gt;# initialize object header of object allocated in previous iteration&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cb6e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;movq&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;$0x4c8&lt;/span&gt;&lt;span class="p"&gt;,(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rbx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="cp"&gt;# loop termination check&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cb75&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;cmp&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rbp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;r12&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cb78&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;je&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;ccb8&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="cp"&gt;# load nursery_free&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cb7e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;mov&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mh"&gt;0x33c13&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rip&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rdx&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="cp"&gt;# increment loop counter&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cb85&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;$0x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rbp&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="cp"&gt;# add 16 (size of object) to nursery_free&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cb89&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;lea&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mh"&gt;0x10&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rdx&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rax&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="cp"&gt;# compare nursery_top with new nursery_free&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cb8d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;cmp&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rax&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mh"&gt;0x33c24&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="cp"&gt;# store new nursery_free&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cb94&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;mov&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rax&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mh"&gt;0x33bfd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="cp"&gt;# if new nursery_free exceeds nursery_top, fall through to the slow path; otherwise jump back to the top of the loop&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cb9b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;jae&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;cb68&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="cp"&gt;# slow path from here on:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="cp"&gt;# save live object from last iteration to GC shadow stack&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cb9d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;mov&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rbx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mh"&gt;-0x8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rcx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cba1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;mov&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;r13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;rdi&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cba4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;mov&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;$0x10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;esi&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="cp"&gt;# do minor collection&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;cba9&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mi"&gt;20800&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;pypy_g_IncrementalMiniMarkGC_collect_and_reserve&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
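
&lt;p&gt;To make the fast path easier to follow, here is a toy Python model of the same bump-pointer logic. It is only a sketch: the class &lt;code&gt;Nursery&lt;/code&gt; and its fields are made up for illustration, addresses are plain integers, and &lt;code&gt;collect_and_reserve&lt;/code&gt; is a hypothetical stand-in that simply resets the nursery instead of doing a real minor collection.&lt;/p&gt;

```python
class Nursery:
    """Toy model of the bump-pointer fast path; addresses are plain integers."""

    def __init__(self, size):
        self.free = 0          # next unused address (models nursery_free)
        self.top = size        # end of the nursery (models nursery_top)
        self.collections = 0   # how often the slow path ran

    def malloc(self, nbytes):
        result = self.free
        new_free = result + nbytes
        if new_free > self.top:
            # slow path: the nursery is exhausted
            return self.collect_and_reserve(nbytes)
        # fast path: just bump the pointer
        self.free = new_free
        return result

    def collect_and_reserve(self, nbytes):
        # Hypothetical stand-in for a minor collection: pretend that all
        # objects died, reset the nursery and reserve space at its start.
        self.collections += 1
        self.free = nbytes
        return 0


nursery = Nursery(size=64)
addresses = [nursery.malloc(16) for _ in range(6)]
print(addresses, nursery.collections)
```

&lt;p&gt;With a 64-byte nursery and 16-byte objects, four allocations fit before the slow path has to run, which mirrors the compare-and-branch in the assembly above.&lt;/p&gt;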

&lt;h3 id="running-the-benchmark-as-regular-python-code"&gt;Running the benchmark as regular Python code&lt;/h3&gt;
&lt;p&gt;So far we ran this code as &lt;em&gt;RPython&lt;/em&gt;, i.e. type inference is performed and the
program is translated to a C binary. We can also run it on top of PyPy, as a
regular Python3 program. However, an instance of a user-defined class in regular
Python when run on PyPy is actually a much larger object, due to &lt;a href="https://pypy.org/posts/2010/11/efficiently-implementing-python-objects-3838329944323946932.html"&gt;dynamic
typing&lt;/a&gt;.
It's at least 7 words, which is 56 bytes.&lt;/p&gt;
&lt;p&gt;However, we can simply use &lt;code&gt;int&lt;/code&gt; objects instead. Integers are allocated on the
heap and consist of two words, one for the GC and one with the
machine-word-sized integer value, if the integer fits into a signed 64-bit
representation (otherwise a different, less compact representation is used,
which can represent arbitrarily large integers).&lt;/p&gt;
&lt;p&gt;Therefore, we can simply use this kind of code:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;time&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# make sure always two objects are alive&lt;/span&gt;
    &lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;object_size_in_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="c1"&gt;# GC header, one integer field&lt;/span&gt;
    &lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;object_size_in_words&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1024.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1024.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1024.0&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'GB'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'GB/s'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;loops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this case we can't really leave the value uninitialized though.&lt;/p&gt;
&lt;p&gt;We can run this both with and without the JIT:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;pypy3&lt;span class="w"&gt; &lt;/span&gt;allocatealot.py&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1000000000&lt;/span&gt;
&lt;span class="go"&gt;999999998 999999999&lt;/span&gt;
&lt;span class="go"&gt;14.901161193847656 GB&lt;/span&gt;
&lt;span class="go"&gt;17.857494904899553 GB/s&lt;/span&gt;
&lt;span class="gp"&gt;$ &lt;/span&gt;pypy3&lt;span class="w"&gt; &lt;/span&gt;--jit&lt;span class="w"&gt; &lt;/span&gt;off&lt;span class="w"&gt; &lt;/span&gt;allocatealot.py&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1000000000&lt;/span&gt;
&lt;span class="go"&gt;999999998 999999999&lt;/span&gt;
&lt;span class="go"&gt;14.901161193847656 GB&lt;/span&gt;
&lt;span class="go"&gt;0.8275382375297171 GB/s&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is obviously much less efficient than the C code, because the PyPy JIT generates
much less efficient machine code than GCC. Still, "only" twice as slow is kind
of cool anyway.&lt;/p&gt;
&lt;p&gt;(Running it with CPython doesn't really make sense for these measurements, since
CPython ints are bigger – &lt;code&gt;sys.getsizeof(5)&lt;/code&gt; reports 28 bytes.)&lt;/p&gt;
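
&lt;p&gt;As a sanity check on the numbers printed above, the arithmetic can be redone by hand. The sketch below assumes 8-byte words and two words per integer object, as described earlier; the two throughput values are copied from the runs above. Note that the benchmark divides by 1024 three times, so its "GB" is really GiB.&lt;/p&gt;

```python
WORD_SIZE = 8                 # bytes per machine word (assumes a 64-bit build)
OBJECT_SIZE_IN_WORDS = 2      # GC header plus one integer field
GIB = 1024.0 ** 3

loops = 1000000000
allocated = loops * WORD_SIZE * OBJECT_SIZE_IN_WORDS / GIB

# Throughput figures copied from the two runs shown above
with_jit = 17.857494904899553
without_jit = 0.8275382375297171

print(round(allocated, 3), 'GB allocated')
print(round(allocated / with_jit, 2), 's spent in the loop with the JIT')
print(round(allocated / without_jit, 2), 's spent in the loop without the JIT')
print(round(with_jit / without_jit, 1), 'x faster with the JIT')
```

&lt;p&gt;The allocated amount matches the 14.9 GB that both runs report, and dividing it by the throughput recovers the wall-clock time of each run.&lt;/p&gt;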
&lt;h3 id="the-machine-code-that-the-jit-generates"&gt;The machine code that the JIT generates&lt;/h3&gt;
&lt;p&gt;Unfortunately it's a bit of a journey to show the machine code that PyPy's JIT generates for this. First we need to run with all JIT logging categories enabled, writing the log to a file called &lt;code&gt;out&lt;/code&gt;:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;&lt;span class="nv"&gt;PYPYLOG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;jit:out&lt;span class="w"&gt; &lt;/span&gt;pypy3&lt;span class="w"&gt; &lt;/span&gt;allocatealot.py&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1000000000&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then we can read the log file to find the trace IR for the loop under the logging category &lt;code&gt;jit-log-opt&lt;/code&gt;:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;532&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;descr&lt;/span&gt;&lt;span 
class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TargetToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;137358545605472&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;debug_merge_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-9~#24 FOR_ITER'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# are we at the end of the loop&lt;/span&gt;
&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;552&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i45&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;int_lt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i35&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;555&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;guard_true&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;descr&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Guard0x7ced4756a160&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i44&lt;/span&gt;&lt;span 
class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i34&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;561&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i47&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;int_add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;debug_merge_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-9~#26 STORE_FAST'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;debug_merge_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-10~#28 LOAD_FAST'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;debug_merge_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-10~#30 STORE_FAST'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;debug_merge_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-11~#32 LOAD_FAST'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;debug_merge_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-11~#34 STORE_FAST'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;debug_merge_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-11~#36 JUMP_ABSOLUTE'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# update iterator object&lt;/span&gt;
&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;565&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;setfield_gc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i47&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;descr&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;FieldS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pypy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__builtin__&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;functional&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_IntRangeIterator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inst_current&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;569&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;guard_not_invalidated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;descr&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Guard0x7ced4756a1b0&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i34&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# check for signals&lt;/span&gt;
&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;569&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i49&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;getfield_raw_i&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;137358624889824&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;descr&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;FieldS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pypysig_long_struct_inner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;582&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i51&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;int_lt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i49&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;586&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;guard_false&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i51&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;descr&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Guard0x7ced4754db78&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i34&lt;/span&gt;&lt;span 
class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;debug_merge_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-9~#24 FOR_ITER'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# allocate the integer (allocation sunk to the end of the trace)&lt;/span&gt;
&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;592&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p52&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;new_with_vtable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;descr&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SizeDescr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;630&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;setfield_gc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p52&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;descr&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;FieldS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pypy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objspace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;intobject&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_IntObject&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inst_intval&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pure&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;634&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;jump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p52&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i47&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;descr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TargetToken&lt;/span&gt;&lt;span 
class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;137358545605472&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To find the machine code address of the trace, we need to search for this line:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="nx"&gt;Loop&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;cfbolz&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;projects&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;gitpypy&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;allocatealot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;py&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;FOR_ITER&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;\
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;has&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;address&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x7ced473ffa0b&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x7ced473ffbb0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bootstrap&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x7ced473ff980&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then we can use a script in the PyPy repo to disassemble the generated machine code:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;pypy&lt;span class="w"&gt; &lt;/span&gt;rpython/jit/backend/tool/viewcode.py&lt;span class="w"&gt; &lt;/span&gt;out
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This will dump all the machine code to stdout, and open a &lt;a href="https://pypy.org/posts/2021/04/ways-pypy-graphviz.html"&gt;pygame-based
graphviz cfg&lt;/a&gt;. In there
we can search for the address and see this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Graphviz based visualization of the machine code the JIT generates" src="https://www.pypy.org/images/2025-allocatealot-machine-code.png"&gt;&lt;/p&gt;
&lt;p&gt;Here's an annotated version with what I think this code does:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="x"&gt;# increment the profile counter&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb40:   48 ff 04 25 20 9e 33    incq   0x38339e20&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb47:   38 &lt;/span&gt;

&lt;span class="x"&gt;# check whether the loop is done&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb48:   4c 39 fe                cmp    %r15,%rsi&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb4b:   0f 8d 76 01 00 00       jge    0x7ced473ffcc7&lt;/span&gt;

&lt;span class="x"&gt;# increment iteration variable&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb51:   4c 8d 66 01             lea    0x1(%rsi),%r12&lt;/span&gt;

&lt;span class="x"&gt;# update iterator object&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb55:   4d 89 61 08             mov    %r12,0x8(%r9)&lt;/span&gt;

&lt;span class="x"&gt;# check for ctrl-c/thread switch&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb59:   49 bb e0 1b 0b 4c ed    movabs $0x7ced4c0b1be0,%r11&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb60:   7c 00 00 &lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb63:   49 8b 0b                mov    (%r11),%rcx&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb66:   48 83 f9 00             cmp    $0x0,%rcx&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb6a:   0f 8c 8f 01 00 00       jl     0x7ced473ffcff&lt;/span&gt;

&lt;span class="x"&gt;# load nursery_free pointer&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb70:   49 8b 8b d8 30 f6 fe    mov    -0x109cf28(%r11),%rcx&lt;/span&gt;

&lt;span class="x"&gt;# add size (16)&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb77:   48 8d 51 10             lea    0x10(%rcx),%rdx&lt;/span&gt;

&lt;span class="x"&gt;# compare against nursery top&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb7b:   49 3b 93 f8 30 f6 fe    cmp    -0x109cf08(%r11),%rdx&lt;/span&gt;

&lt;span class="x"&gt;# jump to slow path if nursery is full&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb82:   0f 87 41 00 00 00       ja     0x7ced473ffbc9&lt;/span&gt;

&lt;span class="x"&gt;# store new value of nursery free&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb88:   49 89 93 d8 30 f6 fe    mov    %rdx,-0x109cf28(%r11)&lt;/span&gt;

&lt;span class="x"&gt;# initialize GC header&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb8f:   48 c7 01 30 11 00 00    movq   $0x1130,(%rcx)&lt;/span&gt;

&lt;span class="x"&gt;# initialize integer field&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb96:   48 89 41 08             mov    %rax,0x8(%rcx)&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb9a:   48 89 f0                mov    %rsi,%rax&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffb9d:   48 89 8d 60 01 00 00    mov    %rcx,0x160(%rbp)&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffba4:   4c 89 e6                mov    %r12,%rsi&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffba7:   e9 94 ff ff ff          jmp    0x7ced473ffb40&lt;/span&gt;
&lt;span class="x"&gt;7ced473ffbac:   0f 1f 40 00             nopl   0x0(%rax)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="conclusion"&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;The careful design of the RPython GC's allocation fast path gives pretty good
allocation rates. This technique isn't really new; it's a pretty typical way to
design a GC. Apart from that, my main conclusion would be that computers are
fast or something? Indeed, when we ran the same code on my colleague's
two-year-old AMD, we got quite a bit worse results, so a lot of the speed seems
to be due to the hard work of CPU architects.&lt;/p&gt;</description><category>benchmarking</category><category>gc</category><category>rpython</category><guid>https://www.pypy.org/posts/2025/06/rpython-gc-allocation-speed.html</guid><pubDate>Sun, 15 Jun 2025 13:48:30 GMT</pubDate></item><item><title>Low Overhead Allocation Sampling with VMProf in PyPy's GC</title><link>https://www.pypy.org/posts/2025/02/pypy-gc-sampling.html</link><dc:creator>Christoph Jung</dc:creator><description>&lt;h3 id="introduction"&gt;Introduction&lt;/h3&gt;
&lt;p&gt;There are many time-based statistical profilers around (like VMProf or py-spy
just to name a few). They allow the user to pick a trade-off between profiling
precision and runtime overhead.&lt;/p&gt;
&lt;p&gt;On the other hand there are memory profilers
such as &lt;a href="https://github.com/bloomberg/memray"&gt;memray&lt;/a&gt;. They can be handy for
finding leaks or for discovering functions that allocate a lot of memory.
Memory profilers typically save every single allocation a program does. This
results in precise profiling, but larger overhead.&lt;/p&gt;
&lt;p&gt;In this post we describe our experimental approach to low overhead statistical
memory profiling. Instead of saving every single allocation a program does, it
only saves every nth allocated byte. We have tightly integrated VMProf and the
PyPy Garbage Collector to achieve this. The main technical insight is that the
check whether an allocation should be sampled can be made free. This is done by
folding it into the bump pointer allocator check that PyPy’s GC uses to
find out if it should start a minor collection. This way, the fast path is
exactly the same with and without memory sampling.&lt;/p&gt;
&lt;h3 id="background"&gt;Background&lt;/h3&gt;
&lt;p&gt;To understand how the profiler and the GC interact, let's take a brief look at
both of them first.&lt;/p&gt;
&lt;h4 id="vmprof"&gt;VMProf&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/vmprof/vmprof-python"&gt;VMProf&lt;/a&gt; is a statistical time-based profiler for PyPy. VMProf samples the stack of currently running Python functions a certain user-configured number of times per second. By adjusting
this number, the overhead of profiling can be modified to pick the right trade-off between overhead and precision of the profile. In the resulting profile, functions with a long runtime stand out the most, functions with a shorter runtime less so. If you want a more detailed introduction to VMProf and how to use it with PyPy, take a look
at &lt;a href="https://pypy.org/posts/2024/05/vmprof-firefox-converter.html"&gt;this blog post&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="pypys-gc"&gt;PyPy’s GC&lt;/h4&gt;
&lt;p&gt;PyPy uses a generational incremental copying collector. That means there are two spaces for allocated objects, the nursery and the old-space. Freshly allocated objects are placed in the nursery. When the nursery is full, it is collected and all objects that survive are tenured, i.e. moved into the old-space. The old-space is much larger than the nursery and is collected less frequently and &lt;a href="https://www.pypy.org/posts/2024/03/fixing-bug-incremental-gc.html"&gt;incrementally&lt;/a&gt; (not completely
collected in one go, but step-by-step). Old-space collection is not relevant for the rest of the post, though. We will now take a look at nursery allocations and how the nursery is collected.&lt;/p&gt;
&lt;h4 id="bump-pointer-allocation-in-the-nursery"&gt;Bump Pointer Allocation in the Nursery&lt;/h4&gt;
&lt;p&gt;The nursery (a small contiguous memory area) uses two pointers to keep track of where its free space begins and where it ends. They are called &lt;code&gt;nursery_free&lt;/code&gt; and &lt;code&gt;nursery_top&lt;/code&gt;. When memory is allocated, the GC checks whether there is enough space left in the nursery. If there is, the &lt;code&gt;nursery_free&lt;/code&gt; pointer is returned as the start address of the newly allocated memory, and &lt;code&gt;nursery_free&lt;/code&gt; is moved forward by the amount of allocated memory.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.pypy.org/images/2025_02_allocation_sampling_images/nursery_allocation.svg"&gt;&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;allocate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;totalsize&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="c1"&gt;# Save position, where the object will be allocated to as result&lt;/span&gt;
  &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_free&lt;/span&gt;
  &lt;span class="c1"&gt;# Move nursery_free pointer forward by totalsize&lt;/span&gt;
  &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_free&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;totalsize&lt;/span&gt;
  &lt;span class="c1"&gt;# Check if this allocation would exceed the nursery&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_free&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# If it does =&amp;gt; collect the nursery and allocate afterwards&lt;/span&gt;
      &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collect_and_reserve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;totalsize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;# result is a pointer into the nursery, obj will be allocated there&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;collect_and_reserve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size_of_allocation&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# do a minor collection and return the start of the nursery afterwards&lt;/span&gt;
    &lt;span class="n"&gt;minor_collection&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_free&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Understanding this is crucial for our allocation sampling approach, so let us go through this step-by-step.&lt;/p&gt;
&lt;p&gt;We already saw what an allocation into a non-full nursery looks like. But what happens if the nursery is (too) full?&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.pypy.org/images/2025_02_allocation_sampling_images/nursery_full.svg"&gt;&lt;/p&gt;
&lt;p&gt;As soon as an object doesn't fit into the nursery anymore, the nursery will be collected. A nursery collection moves all surviving objects into the old-space, so that the nursery is empty afterwards and the requested allocation can be made.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.pypy.org/images/2025_02_allocation_sampling_images/nursery_collected.svg"&gt;&lt;/p&gt;
&lt;p&gt;(Note that this is still a bit of a simplification.)&lt;/p&gt;
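&lt;p&gt;To make the two cases above concrete, here is a minimal, runnable toy model of the nursery. It is only a sketch under simplifying assumptions: addresses are plain integers, &lt;code&gt;ToyNursery&lt;/code&gt; and its &lt;code&gt;minor_collections&lt;/code&gt; counter are made-up names, the minor collection is only counted and not actually performed, and every request is assumed to be smaller than the nursery.&lt;/p&gt;

```python
# Toy model of bump-pointer allocation; addresses are plain integers.
# ToyNursery is an illustrative name, not part of the RPython GC.
class ToyNursery(object):
    def __init__(self, size):
        self.nursery_free = 0      # next free "address"
        self.nursery_top = size    # end of the nursery
        self.minor_collections = 0

    def allocate(self, totalsize):
        # Fast path: bump the pointer and compare against the top.
        result = self.nursery_free
        self.nursery_free = result + totalsize
        if self.nursery_free > self.nursery_top:
            result = self.collect_and_reserve(totalsize)
        return result

    def collect_and_reserve(self, totalsize):
        # Slow path: pretend a minor collection emptied the nursery,
        # then place the new object at its start.
        self.minor_collections += 1
        self.nursery_free = totalsize
        return 0


nursery = ToyNursery(size=64)
addresses = [nursery.allocate(16) for _ in range(6)]
print(addresses)
print(nursery.minor_collections)
```

&lt;p&gt;With a 64-byte nursery and 16-byte objects, the first four allocations take the fast path. The fifth one no longer fits, so it goes through the slow path and ends up at the start of the emptied nursery.&lt;/p&gt;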
&lt;h3 id="sampling-approach"&gt;Sampling Approach&lt;/h3&gt;
&lt;p&gt;The last section described how nursery allocation normally works. Now we'll talk about how we integrate the new allocation sampling approach into it.&lt;/p&gt;
&lt;p&gt;To decide whether the GC should trigger a sample, the sampling logic is integrated into the bump pointer allocation logic. Usually, when there is not enough space in the nursery left to fulfill an allocation request, the nursery will be collected and the allocation will be done afterwards. We reuse that mechanism for sampling, by introducing a new pointer called &lt;code&gt;sample_point&lt;/code&gt; that is calculated by &lt;code&gt;sample_point = nursery_free + sample_n_bytes&lt;/code&gt; where &lt;code&gt;sample_n_bytes&lt;/code&gt; is the number of bytes allocated before a sample is made (i.e. our sampling rate).&lt;/p&gt;
&lt;p&gt;Imagine we have a nursery of 2 MB and want to sample every 512 KB allocated. The nursery would then look like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.pypy.org/images/2025_02_allocation_sampling_images/nursery_sampling.svg"&gt;&lt;/p&gt;
&lt;p&gt;We use the sample point as &lt;code&gt;nursery_top&lt;/code&gt;, so that allocating a chunk of 512 KB would exceed the nursery top and start a nursery collection. But of course we don't want to do a minor collection just then, so before starting a collection, we need to check whether the nursery is actually full or whether we have merely exceeded a sample point. The latter triggers a VMProf stack sample. Afterwards we don't actually do a minor collection, but change &lt;code&gt;nursery_top&lt;/code&gt; and immediately return to the caller.&lt;/p&gt;
&lt;p&gt;The last picture is a conceptual simplification. Only one sampling point exists at any given time. After we create the sampling point, it is used as the nursery top. Once it is exceeded, we just add &lt;code&gt;sample_n_bytes&lt;/code&gt; to that sampling point, i.e. move it forward.&lt;/p&gt;
&lt;p&gt;Here's what the updated &lt;code&gt;collect_and_reserve&lt;/code&gt; function looks like:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;collect_and_reserve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size_of_allocation&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Check if we exceeded a sample point or if we need to do a minor collection&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_top&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_point&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# One allocation could exceed multiple sample points&lt;/span&gt;
        &lt;span class="c1"&gt;# Sample, move sample_point forward&lt;/span&gt;
        &lt;span class="n"&gt;vmprof&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_point&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;sample_n_bytes&lt;/span&gt;

        &lt;span class="c1"&gt;# Set sample point as new nursery_top if it fits into the nursery&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sample_point&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;real_nursery_top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample_point&lt;/span&gt;
        &lt;span class="c1"&gt;# Or use the real nursery top if it does not fit&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;real_nursery_top&lt;/span&gt;

        &lt;span class="c1"&gt;# Is there enough memory left inside the nursery&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_free&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;size_of_allocation&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Yes =&amp;gt; move nursery_free forward&lt;/span&gt;
            &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_free&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;size_of_allocation&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_free&lt;/span&gt;

    &lt;span class="c1"&gt;# We did not exceed a sampling point and must do a minor collection, or&lt;/span&gt;
    &lt;span class="c1"&gt;# we exceeded a sample point but we needed to do a minor collection anyway&lt;/span&gt;
    &lt;span class="n"&gt;minor_collection&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nursery_free&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
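&lt;p&gt;For readers who want to experiment, the following is a self-contained toy version of the same idea that can be run directly. It is a sketch, not the code in the PyPy branch: &lt;code&gt;ToySamplingNursery&lt;/code&gt; is a made-up name, addresses are plain integers, a counter stands in for the VMProf stack sample, the minor collection is only counted, and &lt;code&gt;sample_n_bytes&lt;/code&gt; is assumed to be no larger than the nursery. The toy returns the start address of the newly allocated object.&lt;/p&gt;

```python
# Toy model of allocation sampling; addresses are plain integers.
# ToySamplingNursery and its counters are illustrative names only.
class ToySamplingNursery(object):
    def __init__(self, size, sample_n_bytes):
        self.real_nursery_top = size
        self.sample_n_bytes = sample_n_bytes
        self.samples = 0
        self.minor_collections = 0
        self.reset()

    def reset(self):
        # Empty nursery; the first sample point doubles as nursery_top.
        self.nursery_free = 0
        self.sample_point = self.sample_n_bytes
        self.nursery_top = min(self.sample_point, self.real_nursery_top)

    def allocate(self, totalsize):
        # Fast path: identical with and without sampling.
        result = self.nursery_free
        if result + totalsize > self.nursery_top:
            return self.collect_and_reserve(totalsize)
        self.nursery_free = result + totalsize
        return result

    def collect_and_reserve(self, totalsize):
        if self.nursery_top == self.sample_point:
            self.samples += 1  # stands in for a VMProf stack sample
            self.sample_point += self.sample_n_bytes
            self.nursery_top = min(self.sample_point, self.real_nursery_top)
            result = self.nursery_free
            if self.nursery_top >= result + totalsize:
                self.nursery_free = result + totalsize
                return result
        # The nursery really is full: count a (pretend) minor collection.
        self.minor_collections += 1
        self.reset()
        self.nursery_free = totalsize
        return 0


nursery = ToySamplingNursery(size=2048, sample_n_bytes=512)
for _ in range(64):  # 64 objects of 64 bytes = 4096 bytes in total
    nursery.allocate(64)
print("%d samples, %d minor collections"
      % (nursery.samples, nursery.minor_collections))
```

&lt;p&gt;In this run, samples are taken when the allocations cross 512, 1024, 1536 and 2048 bytes in the first nursery fill, and again at 512, 1024 and 1536 bytes after the single minor collection. The fast path in &lt;code&gt;allocate&lt;/code&gt; has no knowledge of sampling at all.&lt;/p&gt;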

&lt;h3 id="why-is-the-overhead-low"&gt;Why is the Overhead ‘low’&lt;/h3&gt;
&lt;p&gt;The most important property of our approach is that the bump-pointer fast path is not changed at all. If sampling is turned off, the slow path in &lt;code&gt;collect_and_reserve&lt;/code&gt; has three extra instructions for the if at the beginning, but these add only a very small amount of overhead compared to doing a minor collection.&lt;/p&gt;
&lt;p&gt;When sampling is on, the extra logic in &lt;code&gt;collect_and_reserve&lt;/code&gt; gets executed. Every time an allocation exceeds the &lt;code&gt;sample_point&lt;/code&gt;, &lt;code&gt;collect_and_reserve&lt;/code&gt; will sample the Python functions currently executing. The resulting overhead is directly controlled by &lt;code&gt;sample_n_bytes&lt;/code&gt;. After sampling, the &lt;code&gt;sample_point&lt;/code&gt; and &lt;code&gt;nursery_top&lt;/code&gt; must be set accordingly. This will be done once after sampling in &lt;code&gt;collect_and_reserve&lt;/code&gt;. At some point a nursery collection will free the nursery and set the new &lt;code&gt;sample_point&lt;/code&gt; afterwards.&lt;/p&gt;
&lt;p&gt;That means that the overhead mostly depends on the sampling rate and the rate at which the user program allocates memory, as the combination of those two factors determines the number of samples.&lt;/p&gt;
&lt;p&gt;Since the sampling rate can be adjusted from as low as 64 bytes to a theoretical maximum of ~4 GB (at the moment), the tradeoff between number of samples (i.e. profiling precision) and overhead can be freely adjusted.&lt;/p&gt;
&lt;p&gt;We also suspect a link between user program stack depth and overhead (a deeper stack takes longer to walk, leading to higher overhead), especially when walking the C call stack too.&lt;/p&gt;
&lt;h3 id="sampling-rates-bigger-than-the-nursery-size"&gt;Sampling rates bigger than the nursery size&lt;/h3&gt;
&lt;p&gt;The nursery usually has a size of a few megabytes, but profiling long-running or larger applications with tons of allocations could result in a very high number of samples per second (and thus overhead). To combat that, it is possible to use sampling rates larger than the nursery size.&lt;/p&gt;
&lt;p&gt;The sampling point is not limited by the nursery size, but if it is 'outside' the nursery (e.g. because &lt;code&gt;sample_n_bytes&lt;/code&gt; is set to twice the nursery size) it won't be used as &lt;code&gt;nursery_top&lt;/code&gt; until it 'fits' into the nursery.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.pypy.org/images/2025_02_allocation_sampling_images/nursery_sampling_larger_than_nursery.svg"&gt;&lt;/p&gt;
&lt;p&gt;After every nursery collection, we'd usually set the &lt;code&gt;sample_point&lt;/code&gt; to &lt;code&gt;nursery_free + sample_n_bytes&lt;/code&gt;, but if it is larger than the nursery, then the amount of collected memory during the last nursery collection is subtracted from &lt;code&gt;sample_point&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.pypy.org/images/2025_02_allocation_sampling_images/nursery_sampling_larger_than_nursery_post_minor.svg"&gt;&lt;/p&gt;
&lt;p&gt;At some point the &lt;code&gt;sample_point&lt;/code&gt; will be smaller than the nursery size. It will then be used as &lt;code&gt;nursery_top&lt;/code&gt; again to trigger a sample when exceeded.&lt;/p&gt;
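&lt;p&gt;One way to read the adjustment described in this section is the following sketch. It is an interpretation, not the actual GC code: the function name and its parameters are hypothetical, offsets are relative to the start of the nursery, and each minor collection is assumed to happen after a full nursery worth of allocations.&lt;/p&gt;

```python
# Offsets are relative to the start of the nursery; names are illustrative.
def sample_point_after_collection(sample_point, sample_n_bytes,
                                  nursery_size, collected_bytes):
    if sample_point > nursery_size:
        # The pending sample lies beyond the nursery: carry the rest over.
        return sample_point - collected_bytes
    # Otherwise start a fresh interval at the start of the empty nursery.
    return sample_n_bytes


nursery_size = 4096
sample_n_bytes = 10240  # 2.5 times the nursery size
sample_point = sample_n_bytes
history = [sample_point]
while sample_point > nursery_size:
    # Assumption: the whole nursery was filled before each collection.
    sample_point = sample_point_after_collection(
        sample_point, sample_n_bytes, nursery_size,
        collected_bytes=nursery_size)
    history.append(sample_point)
print(history)
```

&lt;p&gt;After two minor collections the sample point sits at offset 2048 inside the nursery, so the sample is triggered after 2 * 4096 + 2048 = 10240 allocated bytes, which is exactly &lt;code&gt;sample_n_bytes&lt;/code&gt;.&lt;/p&gt;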
&lt;h3 id="differences-to-time-based-sampling"&gt;Differences to Time-Based Sampling&lt;/h3&gt;
&lt;p&gt;As mentioned in the introduction, time-based sampling ‘hits’ functions with high runtime, and allocation sampling ‘hits’ functions allocating a lot of memory. But are those always different functions? The answer is: sometimes. There can be functions allocating lots of memory that do not have a (relatively) high runtime.&lt;/p&gt;
&lt;p&gt;Another difference to time-based sampling is that the profiling overhead does not solely depend on the sampling rate (if we exclude a potential stack-depth - overhead correlation for now) but also on the amount of memory the user code allocates.&lt;/p&gt;
&lt;p&gt;Let us look at an example:&lt;/p&gt;
&lt;p&gt;If we sample every 1024 bytes, and some program A allocates 3 MB and runs for 5 seconds while program B allocates 6 MB and also runs for 5 seconds, there will be ~3000 samples when profiling A, but ~6000 samples when profiling B. That means we cannot give a ‘standard’ sampling rate as time-based profilers usually do (e.g. VMProf uses ~1000 samples/s for time sampling), as the number of resulting samples, and thus the overhead, depends on both the sampling rate and the amount of memory allocated by the program.&lt;/p&gt;
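&lt;p&gt;The arithmetic behind this example can be spelled out in a few lines. This is only a back-of-the-envelope sketch: &lt;code&gt;expected_samples&lt;/code&gt; is a made-up helper, and it assumes binary megabytes and exactly one sample per &lt;code&gt;sample_n_bytes&lt;/code&gt; allocated.&lt;/p&gt;

```python
MB = 1024 * 1024  # assuming binary megabytes


def expected_samples(allocated_bytes, sample_n_bytes):
    # One sample per sample_n_bytes allocated, rounded down.
    return allocated_bytes // sample_n_bytes


runtime_s = 5.0
for name, allocated in [("A", 3 * MB), ("B", 6 * MB)]:
    samples = expected_samples(allocated, 1024)
    print("%s: %d samples, %.1f samples/s"
          % (name, samples, samples / runtime_s))
```

&lt;p&gt;Both programs run for the same time with the same sampling rate, yet B is sampled twice as often as A, simply because it allocates twice as much.&lt;/p&gt;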
&lt;p&gt;For testing and benchmarking, we usually started with a sampling rate of 128 KB and then halved or doubled that (multiple times) depending on sample counts, our need for precision (and size of the profile).&lt;/p&gt;
&lt;h3 id="evaluation"&gt;Evaluation&lt;/h3&gt;
&lt;h4 id="overhead"&gt;Overhead&lt;/h4&gt;
&lt;p&gt;Now let us take a look at the allocation sampling overhead, by profiling some benchmarks. &lt;/p&gt;
&lt;p&gt;The x-axis shows the sampling rate, while the y-axis shows the overhead, which is computed as &lt;code&gt;runtime_with_sampling / runtime_without_sampling&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;All benchmarks were executed five times on a PyPy with JIT and native profiling enabled, so that every dot in the plot is one run of a benchmark.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.pypy.org/images/2025_02_allocation_sampling_images/as_overhead.png"&gt;&lt;/p&gt;
&lt;p&gt;As you probably expected, the overhead drops with higher allocation sampling rates (i.e. more bytes between samples),
ranging from as high as ~390% for 32 KB allocation sampling to less than 10% for 32 MB.&lt;/p&gt;
&lt;p&gt;Let me give one concrete example: one run of the microbenchmark at 32 KB sampling took 15.596 seconds and triggered 822050 samples.
That amounts to a ridiculous &lt;code&gt;822050 / 15.596 = ~52709&lt;/code&gt; samples per second.&lt;/p&gt;
&lt;p&gt;There is probably no need for that many samples per second, so for 'real' application profiling a much higher sampling rate would be sufficient.&lt;/p&gt;
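&lt;p&gt;As a rough illustration of how one might pick such a higher sampling rate, the numbers from this run can be scaled to a target number of samples per second. This is a sketch under strong assumptions: the target of 1000 samples/s is hypothetical, "32 KB" is taken to mean 32 * 1024 bytes, and the program is assumed to keep allocating at the same rate, which ignores that less sampling overhead also shortens the runtime.&lt;/p&gt;

```python
samples = 822050
runtime_s = 15.596
sample_n_bytes = 32 * 1024  # assuming "32 KB" means 32 * 1024 bytes

samples_per_second = samples / runtime_s
print("%.0f samples/s" % samples_per_second)

# Hypothetical target, similar to what time-based sampling typically uses.
target_samples_per_second = 1000.0

# Scale the interval by the ratio between measured and wanted sample rate.
suggested_n_bytes = (sample_n_bytes * samples_per_second
                     / target_samples_per_second)
suggested_mib = suggested_n_bytes / (1024.0 * 1024.0)
print("about %.2f MiB between samples" % suggested_mib)
```

&lt;p&gt;The result should be treated as a starting point for experimentation only, since the allocation rate of the program changes once the profiling overhead goes down.&lt;/p&gt;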
&lt;p&gt;Let us compare that to time sampling.&lt;/p&gt;
&lt;p&gt;This time we ran those benchmarks with 100, 1000 and 2000 samples per second.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.pypy.org/images/2025_02_allocation_sampling_images/ts_overhead.png"&gt;&lt;/p&gt;
&lt;p&gt;The overhead varies with the sampling rate. Both with allocation and time sampling, you can reach any amount of overhead and any level of profiling precision you want. The best approach probably is to just try out a sampling rate and choose what gives you the right tradeoff between precision and overhead (and disk usage).&lt;/p&gt;
&lt;p&gt;The benchmarks used are:&lt;/p&gt;
&lt;p&gt;microbenchmark &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Cskorpion/microbenchmark"&gt;https://github.com/Cskorpion/microbenchmark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pypy microbench.py 65536&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;gcbench &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/pypy/pypy/blob/main/rpython/translator/goal/gcbench.py"&gt;https://github.com/pypy/pypy/blob/main/rpython/translator/goal/gcbench.py&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;print statements removed&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pypy gcbench.py 1&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;pypy translate step&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;first step of the pypy translation (annotation step)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pypy path/to/rpython --opt=0 --cc=gcc --dont-write-c-files --gc=incminimark --annotate path/to/pypy/goal/targetpypystandalone.py&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;interpreter pystone&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;pystone benchmark on top of an interpreted pypy on top of a translated pypy&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pypy path/to/pypy/bin/pyinteractive.py -c "import test.pystone; test.pystone.main(1)"&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All benchmarks executed on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Kubuntu 24.04&lt;/li&gt;
&lt;li&gt;AMD Ryzen 7 5700U&lt;/li&gt;
&lt;li&gt;24 GB DDR4 3200MHz (dual channel)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;SSD benchmarking at read: 1965 MB/s, write: 227 MB/s&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sequential 1MB 1 Thread 8 Queues&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Self-built PyPy with allocation sampling features&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Cskorpion/pypy/tree/gc_allocation_sampling_u_2.7"&gt;https://github.com/Cskorpion/pypy/tree/gc_allocation_sampling_u_2.7&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Modified VMProf with allocation sampling support&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Cskorpion/vmprof-python/tree/pypy_gc_allocation_sampling"&gt;https://github.com/Cskorpion/vmprof-python/tree/pypy_gc_allocation_sampling&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="example"&gt;Example&lt;/h4&gt;
&lt;p&gt;We have also modified &lt;a href="https://github.com/Cskorpion/vmprof-firefox-converter/tree/allocation_sampling"&gt;vmprof-firefox-converter&lt;/a&gt; to show the allocation samples in the Firefox Profiler UI. With the techniques from this post, the output looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.pypy.org/images/2025_02_allocation_sampling_images/allocation_sampling_call_tree.png"&gt;&lt;/p&gt;
&lt;p&gt;While this view is interesting, it would be even better if we could also see what types of objects are being allocated in these functions. We will talk about how to do this in a future blog post.&lt;/p&gt;
&lt;h3 id="conclusion"&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;In this blog post we introduced allocation sampling for PyPy by going through the technical aspects and the corresponding overhead. In a future blog post, we are going to dive into the actual usage of allocation sampling with VMProf, and show an example case study. That will be accompanied by some new improvements and additional features, like extracting the type of an object that triggered a sample.&lt;/p&gt;
&lt;p&gt;So far all this work is still experimental and happening on PyPy branches but
we hope to get the technique stable enough to merge it to main and ship it with
PyPy eventually.&lt;/p&gt;
&lt;p&gt;-- Christoph Jung and CF Bolz-Tereick&lt;/p&gt;</description><category>gc</category><category>profiling</category><category>vmprof</category><guid>https://www.pypy.org/posts/2025/02/pypy-gc-sampling.html</guid><pubDate>Tue, 25 Feb 2025 10:16:00 GMT</pubDate></item><item><title>PyPy for low-latency systems</title><link>https://www.pypy.org/posts/2019/01/pypy-for-low-latency-systems-613165393301401965.html</link><dc:creator>Antonio Cuni</dc:creator><description>&lt;h1 class="title"&gt;
PyPy for low-latency systems&lt;/h1&gt;
Recently I have merged the gc-disable branch, introducing a couple of features
which are useful when you need to respond to certain events with the lowest
possible latency.  This work has been kindly sponsored by &lt;a class="reference external" href="https://www.gambitresearch.com/"&gt;Gambit Research&lt;/a&gt;
(which, by the way, is a very cool and geeky place to &lt;a class="reference external" href="https://www.gambitresearch.com/jobs.html"&gt;work&lt;/a&gt;, in case you
are interested).  Note also that this is a very specialized use case, so these
features might not be useful for the average PyPy user, unless you have the
same problems as described here.&lt;br&gt;
&lt;br&gt;
The PyPy VM manages memory using a generational, moving Garbage Collector.
Periodically, the GC scans the whole heap to find unreachable objects and
frees the corresponding memory.  Although at first glance this strategy might
sound expensive, in practice the total cost of memory management is far less
than e.g. on CPython, which is based on reference counting.  While maybe
counter-intuitive, the main advantage of a non-refcount strategy is
that allocation is very fast (especially compared to malloc-based allocators),
and deallocation of objects which die young is basically free. More
information about the PyPy GC is available &lt;a class="reference external" href="https://pypy.readthedocs.io/en/latest/gc_info.html#incminimark"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;br&gt;
As we said, the total cost of memory management is less on PyPy than on
CPython, and it's one of the reasons why PyPy is so fast.  However, one big
disadvantage is that while on CPython the cost of memory management is spread
all over the execution of the program, on PyPy it is concentrated into GC
runs, causing observable pauses which interrupt the execution of the user
program.&lt;br&gt;
To avoid excessively long pauses, the PyPy GC has been using an &lt;a class="reference external" href="https://www.pypy.org/posts/2013/10/incremental-garbage-collector-in-pypy-8956893523842234676.html"&gt;incremental
strategy&lt;/a&gt; since 2013. The GC runs as a series of "steps", letting the user
program make progress between steps.&lt;br&gt;
&lt;br&gt;
The following chart shows the behavior of a real-world, long-running process:&lt;br&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="https://3.bp.blogspot.com/-44yKwUVK3BE/XC4X9XL4BII/AAAAAAAABbE/XdTCIoyA-eYxvxIgJhFHaKnzxjhoWStHQCEwYBhgL/s1600/gc-timing.png" style="margin-right: 1em;"&gt;&lt;img border="0" height="246" src="https://3.bp.blogspot.com/-44yKwUVK3BE/XC4X9XL4BII/AAAAAAAABbE/XdTCIoyA-eYxvxIgJhFHaKnzxjhoWStHQCEwYBhgL/s640/gc-timing.png" width="640"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
The orange line shows the total memory used by the program, which
increases linearly while the program progresses. Every ~5 minutes, the GC
kicks in and the memory usage drops from ~5.2GB to ~2.8GB (this ratio is controlled
by the &lt;a class="reference external" href="https://pypy.readthedocs.io/en/latest/gc_info.html#environment-variables"&gt;PYPY_GC_MAJOR_COLLECT&lt;/a&gt; env variable).&lt;br&gt;
The purple line shows aggregated data about the GC timing: the whole
collection takes ~1400 individual steps over the course of ~1 minute, and each
point represents the &lt;strong&gt;maximum&lt;/strong&gt; time a single step took during the past 10
seconds. Most steps take ~10-20 ms, although we see a horrible peak of ~100 ms
towards the end. We have not yet investigated what causes it, but we
suspect it is related to the deallocation of raw objects.&lt;br&gt;
&lt;br&gt;
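As a back-of-the-envelope check of the numbers in the chart, the snippet below computes the observed ratio between peak and post-collection memory, and the heap size at which a major collection would be triggered. The factor 1.82 is an assumption here (it is the documented default of PYPY_GC_MAJOR_COLLECT; the value actually used by this process is not stated), and the variable names are made up for illustration:&lt;br&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;after_collection_gb = 2.8   # read off the chart
peak_gb = 5.2               # read off the chart
observed_ratio = peak_gb / after_collection_gb

# Assumption: the documented default of PYPY_GC_MAJOR_COLLECT.
major_collect_factor = 1.82
trigger_gb = after_collection_gb * major_collect_factor

print(round(observed_ratio, 2), round(trigger_gb, 1))
&lt;/pre&gt;&lt;/div&gt;
With these inputs the observed ratio is about 1.86 and the trigger point about 5.1GB, which would be consistent with memory still growing a little while the incremental collection is in progress.&lt;br&gt;
&lt;br&gt;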
These multi-millisecond pauses are a problem for systems where it is important
to respond to certain events with a latency which is both low and consistent.
If the GC kicks in at the wrong time, it might cause unacceptable pauses during
the collection cycle.&lt;br&gt;
&lt;br&gt;
Let's look again at our real-world example. This is a system which
continuously monitors an external stream; when a certain event occurs, we want
to take an action. The following chart shows the maximum time it takes to
complete one such action, aggregated every minute:&lt;br&gt;
&lt;br&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="https://4.bp.blogspot.com/-FO9uFHSqZzU/XC4YC8LZUpI/AAAAAAAABa8/B8ZOrEgbVJUHoO65wxvCMVpvciO_d_0TwCLcBGAs/s1600/normal-max.png" style="margin-right: 1em;"&gt;&lt;img border="0" height="240" src="https://4.bp.blogspot.com/-FO9uFHSqZzU/XC4YC8LZUpI/AAAAAAAABa8/B8ZOrEgbVJUHoO65wxvCMVpvciO_d_0TwCLcBGAs/s640/normal-max.png" width="640"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
You can clearly see that the baseline response time is around 20-30
ms. However, we can also see periodic spikes of around 50-100 ms, with peaks up
to ~350-450 ms! After a bit of investigation, we concluded that most (although
not all) of the spikes were caused by the GC kicking in at the wrong time.&lt;br&gt;
&lt;br&gt;
The work I did in the &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;gc-disable&lt;/span&gt;&lt;/tt&gt; branch aims to fix this problem by
introducing &lt;a class="reference external" href="https://pypy.readthedocs.io/en/latest/gc_info.html#semi-manual-gc-management"&gt;two new features&lt;/a&gt; to the &lt;tt class="docutils literal"&gt;gc&lt;/tt&gt; module:&lt;br&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;gc.disable()&lt;/tt&gt;, which previously only inhibited the execution of
finalizers without actually touching the GC, now disables the GC major
collections. After a call to it, you will see the memory usage grow
indefinitely.&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;gc.collect_step()&lt;/tt&gt; is a new function which you can use to manually
execute a single incremental GC collection step.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
It is worth noting that &lt;tt class="docutils literal"&gt;gc.disable()&lt;/tt&gt; disables &lt;strong&gt;only&lt;/strong&gt; the major
collections, while minor collections still run.  Moreover, thanks to the
JIT's virtuals, many objects with a short and predictable lifetime are not
allocated at all. The end result is that most objects with a short lifetime are
still collected as usual, so the impact of &lt;tt class="docutils literal"&gt;gc.disable()&lt;/tt&gt; on memory growth
is not as bad as it might sound.&lt;br&gt;
&lt;br&gt;
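As a rough sketch of how the two functions fit together (this is not code from the gc-disable branch), major collections are turned off and the program runs collection steps by hand when it has time to spare. The helper name &lt;tt class="docutils literal"&gt;run_gc_steps&lt;/tt&gt;, the step budget and the fallback to &lt;tt class="docutils literal"&gt;gc.collect()&lt;/tt&gt; on interpreters without &lt;tt class="docutils literal"&gt;gc.collect_step()&lt;/tt&gt; are assumptions made for illustration; the &lt;tt class="docutils literal"&gt;major_is_done&lt;/tt&gt; attribute of the returned stats object is taken from the PyPy documentation:&lt;br&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import gc


def run_gc_steps(max_steps):
    """Run at most max_steps incremental GC steps.

    Returns True once a major collection has completed, False if the
    step budget ran out first.
    """
    collect_step = getattr(gc, "collect_step", None)
    if collect_step is None:
        # Assumption: on an interpreter without gc.collect_step()
        # (e.g. CPython) we simply do a full collection instead.
        gc.collect()
        return True
    for _ in range(max_steps):
        if collect_step().major_is_done:
            return True
    return False


gc.disable()    # on PyPy: no more automatic major collections
try:
    # ... handle latency-critical events here ...
    finished = run_gc_steps(100)    # idle time: pay for some GC work now
finally:
    gc.enable()
&lt;/pre&gt;&lt;/div&gt;
Capping the number of steps per call bounds how much GC work is done in one go; if the function returns False, the remaining steps can be run at the next idle moment.&lt;br&gt;
&lt;br&gt;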
Combining these two functions, it is possible to take control of the GC to
make sure it runs only when it is acceptable to do so.  For an example of
usage, you can look at the implementation of a &lt;a class="reference external" href="https://github.com/antocuni/pypytools/blob/master/pypytools/gc/custom.py"&gt;custom GC&lt;/a&gt; inside &lt;a class="reference external" href="https://pypi.org/project/pypytools/"&gt;pypytools&lt;/a&gt;.
Notably, it also defines a &lt;tt class="docutils literal"&gt;with &lt;span class="pre"&gt;nogc():&lt;/span&gt;&lt;/tt&gt; context manager
which you can use to mark performance-critical sections where the GC is not
allowed to run.&lt;br&gt;
&lt;br&gt;
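A context manager in this spirit could look roughly like the following. This is a simplified sketch and not the pypytools implementation: the names &lt;tt class="docutils literal"&gt;no_major_gc&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;handle_event&lt;/tt&gt; are hypothetical, and the code relies only on &lt;tt class="docutils literal"&gt;gc.disable()&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;gc.enable()&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;gc.isenabled()&lt;/tt&gt;:&lt;br&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import gc
from contextlib import contextmanager


@contextmanager
def no_major_gc():
    """Keep major collections from starting inside the with-block."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state so that nested sections and
        # callers which already disabled the GC are left untouched.
        if was_enabled:
            gc.enable()


def handle_event(payload):
    # Hypothetical latency-critical handler.
    with no_major_gc():
        return sorted(payload)
&lt;/pre&gt;&lt;/div&gt;
Remembering whether the GC was enabled on entry means that an outer section which has already called &lt;tt class="docutils literal"&gt;gc.disable()&lt;/tt&gt; does not get the GC switched back on by an inner one.&lt;br&gt;
&lt;br&gt;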
The following chart compares the behavior of the default PyPy GC and the new
custom GC, after careful placement of &lt;tt class="docutils literal"&gt;nogc()&lt;/tt&gt; sections:&lt;br&gt;
&lt;br&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="https://1.bp.blogspot.com/-bGqs0WrOEBk/XC4YJN0uZfI/AAAAAAAABbA/4EXOASvy830IKBoTFtrnmY22Vyd_api-ACLcBGAs/s1600/nogc-max.png" style="margin-right: 1em;"&gt;&lt;img border="0" height="242" src="https://1.bp.blogspot.com/-bGqs0WrOEBk/XC4YJN0uZfI/AAAAAAAABbA/4EXOASvy830IKBoTFtrnmY22Vyd_api-ACLcBGAs/s640/nogc-max.png" width="640"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
The yellow line is the same as before, while the purple line shows the new
system: almost all the spikes are gone, and the baseline performance is about 10%
better. There is still one spike towards the end, but after some investigation
we concluded that it was &lt;strong&gt;not&lt;/strong&gt; caused by the GC.&lt;br&gt;
&lt;br&gt;
Note that this does &lt;strong&gt;not&lt;/strong&gt; mean that the whole program became magically
faster: we simply moved the GC pauses to some other place which is &lt;strong&gt;not&lt;/strong&gt;
shown in the graph. In this specific use case, the technique was useful
because it allowed us to shift the GC work to places where pauses are more
acceptable.&lt;br&gt;
&lt;br&gt;
All in all, a pretty big success, I think.  These features are already
available in the nightly builds of PyPy, and will be included in the next
release: take this as a New Year's present :)&lt;br&gt;
&lt;br&gt;
Antonio Cuni and the PyPy team</description><category>gc</category><category>sponsors</category><guid>https://www.pypy.org/posts/2019/01/pypy-for-low-latency-systems-613165393301401965.html</guid><pubDate>Thu, 03 Jan 2019 14:21:00 GMT</pubDate></item></channel></rss>