STM with threads
Hi all,
A quick update. The first version of pypy-stm based on regular
threads is ready. It still has no JIT and carries a 4-or-5-times
performance hit, so it is not particularly fast, but I am happy that it turns out not
to be much slower than the previous thread-less attempts. It is at
least fast enough to run faster (in real time) than an equivalent no-STM
PyPy, if fed with an eight-threaded program on an eight-core machine
(provided, of course, you don't mind it eating all 8 cores' CPU power
instead of just one :-).
You can download and play around with this binary for Linux 64. It
was made from the stm-thread branch of the PyPy repository (translate.py --stm -O2 targetpypystandalone.py). (Be sure
to put it where it can find its stdlib, e.g. by putting it inside the
directory from the official 1.9 release.)
This binary supports the thread module and runs without the GIL.
So, despite the factor-of-4 slow-down issue, it should be the fourth
complete Python interpreter in which we can reasonably claim to have
resolved the problem of the GIL. (The first one was Greg Stein's Python
1.4, re-explored here; the second one is Jython; the third one is
IronPython.) Unlike the previous three, it is also the first one to
offer full GIL semantics to the programmer, and additionally
thread.atomic (see below). I should also add that we're likely to
see in the next year a 5th such interpreter, too, based on Hardware
Transactional Memory (same approach as with STM, but using e.g.
Intel's HTM).
The binary I linked to above supports all built-in modules from PyPy,
apart from signal, which is still being worked on (this can be a bit
annoying because standard library modules like subprocess depend on
it). The sys.get/setcheckinterval() functions can be used to tweak
the frequency of the automatic commits. Additionally, it offers
thread.atomic, described in the previous blog post as a way to
create longer atomic sections (with the observable effect of preventing
the "GIL" to be released during that time). A complete
transaction.py module based on it is available from the sources.
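For illustration, here is a rough sketch of the kind of usage transaction.py enables, based on the add()/run() API shown later in these posts (details may differ from the actual module):

import transaction

total = [0]

def work(i):
    # each call runs as one atomic transaction; conflicting
    # transactions are transparently retried, so the result is
    # as if all of them had run serially, in some order
    total[0] += i

for i in range(10):
    transaction.add(work, i)
transaction.run()
print total[0]   # 45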
The main missing features are:
- the signal module;
- the Garbage Collector, which does not do major collections so far, only minor ones;
- and finally, the JIT, which needs some amount of integration to generate the correctly-tweaked assembler.
Have fun!
Armin.
PyPy 1.9 - Yard Wolf
We're pleased to announce the 1.9 release of PyPy. This release brings mostly
bugfixes, performance improvements, other small improvements and overall
progress on the numpypy effort.
It also brings an improved situation on Windows and OS X.
You can download the PyPy 1.9 release here:
https://pypy.org/download.html
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for
CPython 2.7. It's fast (pypy 1.9 and cpython 2.7.2 performance comparison)
due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64 or
Windows 32. Windows 64 work is still stalled; we would welcome a volunteer
to handle that.
Thanks to our donors
But first of all, we would like to say thank you to all people who
donated some money to one of our four calls:
- NumPy in PyPy (got so far $44502 out of $60000, 74%)
- Py3k (Python 3) (got so far $43563 out of $105000, 41%)
- Software Transactional Memory (got so far $21791 of $50400, 43%)
- as well as our general PyPy pot.
Thank you all for proving that it is indeed possible for a small team of
programmers to get funded like that, at least for some
time. We want to include this thank you in the present release
announcement even though most of the work is not finished yet. More
precisely, neither Py3k nor STM is ready to make it into an official release
yet: people interested in them need to grab and (attempt to) translate
PyPy from the corresponding branches (respectively py3k and
stm-thread).
Highlights
- This release still implements Python 2.7.2.
- Many bugs were corrected for Windows 32 bit. This includes new functionality to test the validity of file descriptors, and correct handling of the calling conventions for ctypes. (Still not much progress on Win64.) A lot of work on this has been done by Matti Picus and Amaury Forgeot d'Arc.
- Improvements in cpyext, our emulator for CPython C extension modules. For example PyOpenSSL should now work. We thank various people for help.
- Sets now have strategies just like dictionaries. This means for example that a set containing only ints will be more compact (and faster).
- A lot of progress on various aspects of numpypy. See the numpy-status page for the automatic report.
- It is now possible to create and manipulate C-like structures using the PyPy-only _ffi module. The advantage over using e.g. ctypes is that _ffi is very JIT-friendly, and getting/setting of fields is translated to few assembler instructions by the JIT. However, this is mostly intended as a low-level backend to be used by more user-friendly FFI packages, and the API might change in the future. Use it at your own risk.
- The non-x86 backends for the JIT are progressing but are still not merged (ARMv7 and PPC64).
- JIT hooks for inspecting the created assembler code have been improved. See the JIT hooks documentation for details.
- select.kqueue has been added (BSD).
- Handling of keyword arguments has been drastically improved in the best-case scenario: proxy functions which simply forward *args and **kwargs to another function now perform much better with the JIT (a sketch of this pattern appears after this list).
- List comprehension has been improved.
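A minimal sketch of that forwarding pattern, for illustration only (the names are hypothetical, not from PyPy itself):

def make_proxy(wrapped):
    def proxy(*args, **kwargs):
        # simply forwards all positional and keyword arguments;
        # this best-case forwarding is what the 1.9 JIT now handles well
        return wrapped(*args, **kwargs)
    return proxy

fast_len = make_proxy(len)
assert fast_len([1, 2, 3]) == 3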
JitViewer
There will be a corresponding 1.9 release of JitViewer which is guaranteed to work
with PyPy 1.9. See the JitViewer docs for details.
Cheers,
The PyPy Team
I have taken a look at the mentioned numpypy table (https://buildbot.pypy.org/numpy-status/latest.html), and it lies in many ways. First, some methods are marked as "done" but are not implemented yet; e.g. consider searchsorted:
>>>> from numpypy import searchsorted
>>>> searchsorted([1,2,3],[2,3])
Traceback (most recent call last):
File "", line 1, in
File "/home/dmitrey/Install/pypy-c-jit-55492-ac392fb76904-linux/lib_pypy/numpypy/core/fromnumeric.py", line 763, in searchsorted
raise NotImplementedError('Waiting on interp level method')
NotImplementedError: Waiting on interp level method
(and AFAIK there are many other similar numpypy funcs that are present in dir(numpypy), but only raise NotImplementedError).
Second, some funcs like all and any, also marked there as "done", don't work with the "axis" parameter and thus should also be unmarked.
FYI, as a temporary replacement for some numpy funcs still missing in PyPy (atleast_1d, atleast_2d, hstack, vstack, cumsum, isscalar, asscalar, asfarray, flatnonzero, tile, zeros_like, ones_like, empty_like, where, searchsorted; with the "axis" parameter: nan(arg)min, nan(arg)max, all, any),
I have implemented them at app-level (thus PyPy developers refuse to commit them, but some users could be interested right now); see https://openopt.org/PyPy for more details and my sincere opinion on the situation.
Best wishes for PyPy developers and users, D.
Hi Dmitrey, nice to hear from you.
The page is automatically generated - we should probably just disable those functions; I can't remember the exact reason why they're there in the first place.
When it comes to missing arguments - you just can't help it. It's an automatically generated page that should give only an overview.
As far as your patches go - yes, we need tests and we also need tests that cover corner cases. This is very important for us, we can live without the rest (like implementations on the interp-level). We do care about quality a lot.
Cheers,
fijal
hi fijal,
as far as I remember, the main reason the PyPy developers (I don't remember who exactly) rejected my proposed funcs was app-level vs interp-level, not corner testcases (they even said "don't start the func, it must be interp-level"). Thus, to speed up the OpenOpt port to PyPy, I went another way, and as you have probably seen, some OpenOpt Suite functionality is already available in PyPy and sometimes works faster.
If app-level is ok for some of those funcs mentioned above, you or any other PyPy programmer can take anything from the code; as for me, I have lots of other things to do in my projects, especially now, before a regular release, and thus cannot allocate time to create testcases for the numpy funcs.
BTW, what about fancy indexing with int arrays (https://bugs.pypy.org/issue1130) - when will it be implemented? It's very important for many Python projects and has been hanging for a long time already.
Py3k status update #4
This is the fourth status update about our work on the py3k branch, which we
can work on thanks to all of the people who donated to the py3k proposal.
For various reasons, less work than usual has been done since the last status
update. However, some interesting things happened anyway.
As readers know, so far we spent most of the effort in fixing all PyPy's own
tests which started to fail for various py2/py3 differences. Most of them
failed for shallow reasons, e.g. syntactic changes or the int/long
unifications. Others failed for subtle differences and needed a bit more care,
for example the fact that unbound methods are gone in Py3k.
The good news is that finally we are seeing the light at the end of the
tunnel. Most of them have been fixed. For some other tests, we introduced the
concept of "py3k-skipping": some optimizations and modules are indeed failing,
but right now we are concentrating on completing the core language and so we
are not interested in those. Once the core language is done, we will be
able to easily find and work on the py3k-skipped tests. In particular, for
now we disabled the Int and String dict strategies, which are broken
because of the usual int/long unification and str vs bytes. As for modules,
for now _continuation (needed for stackless) and _multiprocessing do
not work yet.
Another non-trivial feature we implemented is the proper cleaning of exception
variables when we exit except blocks. This is a feature which touches
lots of levels of PyPy, starting from astcompiler, down to the bytecode
interpreter. It took two days of headaches, but in the end we made it :-).
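To make the semantics concrete, here is the Py3k behavior in question (plain Python 3, not PyPy-specific code):

try:
    raise ValueError("example")
except ValueError as exc:
    pass
# Python 3 implicitly deletes 'exc' when the except block exits,
# so the next line raises NameError: name 'exc' is not defined
print(exc)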
Additionally, Amaury did a lot of improvements to cpyext, which had been
broken since forever on this branch.
As for the next plans, now that things are starting to work and PyPy's own
tests mostly pass, we can finally start to run the compiled PyPy against
CPython's test suite. It is very likely that we will have tons of failures at
the beginning, but once we start to fix them one by one, a Py3k-compatible
PyPy will be closer and closer.
Does anyone actually use Python 3? That whole project of Guido's reminds me of "things you should never do: rewrites."
https://www.neilgunton.com/doc/?o=1&doc_id=8583
I cheered at your update when I saw it originally - but did not write this here.
Since no one else has yet, I want to go back and fix the mistake:
Great work!
I’m anxious to see my python3 code running under pypy!
@Connelly Barnes:
Wat.
I use Python3 almost exclusively, mainly because filter, map, and friends return iterators as FSM intended. I haven't done much string work, but that's another major win. And it's not like 2.7's EOL'd.
In short, un-bunch your knickers and roll with the times.
STM update: back to threads?
Hi again,
Here is another update on the status of Software Transactional Memory on PyPy.
Those of you who have been closely following this blog since last year know that, from the very first post about STM, I explored various design ideas about the API that we should get when programming in Python.
I went full circle, and now I am back to where I started (with one important difference: a very rough but working implementation of pypy-stm).
What I realized is that the "thread" module is not that bad after all --- I mean, yes, it is a horribly low-level interface, but it is general enough to build various interesting things on top of it. What the "stm-thread" branch of PyPy contains is, basically, the regular "thread" module in which the GIL was replaced with STM. It gives multicore capabilities to any program based on multiple threads. (This is so far exactly the same idea as the one being investigated for Hardware Transactional Memory. It is roughly also what you would get if you managed to convince GCC 4.7 to compile CPython using STM.)
Now while this might already be quite interesting to some people, here is how it relates to all I said previously: namely, threads are bad, and some new "transaction" module would be a better idea.
There is one new core functionality in the "stm-thread" branch: it is "thread.atomic", a context manager that can be used in a "with" statement (exact name subject to change). In terms of the GIL, it prevents the GIL from being released in the "with" block. In terms of STM, it prevents a "transaction break", which means that the whole "with" statement runs in one single transaction. (From the Python programmer's point of view, the net effect is the same.)
So far, no ground-breaking news. But what I missed previously is that this is enough to give multicore capabilities even to a program that is not using threads so far. It is possible to rewrite an equivalent of the old transaction module in a few pages of pure Python, using "thread.atomic". Something along the following lines: start N threads that each reads from a Queue.Queue() the next job to do, and does it in a "with thread.atomic" block. The STM version of PyPy is then able to run these atomic blocks concurrently. The key point is that the slightly delicate handling of threads should be nicely hidden inside the new "transaction" module, and from outside the observed behavior would be exactly as if the transactions that we schedule are run serially.
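Something like the following minimal sketch, assuming the thread.atomic context manager from the stm-thread branch (a real transaction module would also need to handle shutdown and exceptions):

import thread, Queue

jobs = Queue.Queue()

def worker():
    while True:
        func, args = jobs.get()
        with thread.atomic:     # the whole job runs as one transaction
            func(*args)
        jobs.task_done()

def add(func, *args):
    jobs.put((func, args))

def run(num_threads=4):
    for _ in range(num_threads):
        thread.start_new_thread(worker, ())
    jobs.join()                 # wait until all queued jobs are finished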
The point I kept missing was that, yes, this sounds like nonsense, because it seems that we create N threads just to serialize their work again in "thread.atomic" sections. In fact this would be nonsense in any model that would "just" remove the GIL to let multiple threads run concurrently without crashing. Indeed, you have multiple threads, but their atomic blocks would be again a sort of GIL: only one of them would run at a time. And this is indeed the simple model of execution that you get even with STM --- but not the model of performance. The performance with STM scales with the number of cores, as long as there is enough non-conflicting work to do.
So in summary the complete circle back to the starting point is that threads might be a good low-level model. It lends itself naturally to, say, a kind of program in which the main thread polls file descriptors using select() or the Linux epoll(), and the work received is split among N other threads --- which is the kind of program you would naturally write in other languages that don't have a GIL, say Java. The other threads can then use "thread.atomic" blocks to protect sections of their work. The traditional Transactional Memory point of view is that you use such blocks to guard the short sections of code that communicate with other threads or modify global state, but nothing prevents you from using much larger sections: you should be able to scale them up to the size of a native "unit of work", so that every unit is naturally atomic. And then it's only a matter of design: you can tweak an existing module that does the thread pooling to add one "with thread.atomic"; or do it yourself from scratch; or (if the design is compatible enough) just plug in the proposed pure-Python "transaction" module. Or if you feel like it you can even use threads directly (but keep in mind that using threads too explicitly is not a composable abstraction, whereas higher-level designs typically are).
At the end of the day, you can write or reuse programs whose global structure you are already familiar with, for example with a thread pool (that can be hidden in a library if you prefer), or any other structure with or without explicit threads. But you can do so without all the mess that comes with threads, like locks and deadlocks. From that angle it is really similar to Garbage Collection: e.g. the Boehm GC (now used by GCC itself) lets you write C code like you are used to, but forgetting all you had to learn about careful explicit memory management.
So I'm not sure if I fully grok STM, but my basic understanding of the workflow for a transaction is this:
1. Make a copy of whatever it is you're planning to use, ie, 'stuff'.
2. Do anything that doesn't have side effects (writing to memory/disk).
3. Acquire a lock & compare the state of the parts of 'stuff' you want to change to the current state.
4a. If 'stuff to write' is unchanged, write it and release lock.
4b. Otherwise, release lock and restart transaction.
With the context manager, how is 'stuff' determined? Does it record everything in locals()? That seems like it might be excessive. Would it make sense to expose 'stuff' to the programmer?
If you were to expose 'stuff' to the programmer, I'd think you'd want a new local context where the only variables available were those explicitly specified as 'stuff' (and builtins, etc) so as to avoid congruency accidents. Something like:
with atomic(f, x, y, z, q) as f, x, y, z, q:
    z += f(x, y)
    y = x
    x = q.pop()
This would also help remind folks to keep their transactions small.
Furthermore, this could easily be transformed into a very useful (function) decorator that uses the function's arguments as the 'stuff'.
Am I missing something? Are my suggestions reasonable?
@Benjamin:
My understanding is that STM uses this type of transaction: https://en.wikipedia.org/wiki/Optimistic_concurrency_control
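For reference, a minimal sketch of that optimistic scheme, with purely hypothetical helper objects (this is not how pypy-stm is actually implemented):

def run_transaction(body, state):
    # 'state' is a hypothetical object with copy/lock/version/commit
    while True:
        snapshot = state.copy()             # 1. private working copy
        body(snapshot)                      # 2. side-effect-free work
        with state.lock:                    # 3. lock and validate
            if state.version == snapshot.base_version:
                state.commit(snapshot)      # 4a. publish the changes
                return
        # 4b. someone else committed in the meantime: retry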
@Benjamin: no, that's not reasonable at all in the context of large transactions. "Help remind folks to keep their transactions small" is precisely what I don't want: I want large transactions. This might be harder to do efficiently, it might be more conflict-prone, etc.; but what I don't want is the classical situation where you have to be very careful about keeping your transactions as small as possible, because that's just as hard and error-prone as using locks.
What I want is for "the average programmer" to not use the "thread" module at all, including "thread.atomic". This should be part of a library that does thread pooling and dispatching (large) transactions.
You know, of course, that stackless has an "atomic" property, and stacklesslib has a stacklesslib.utils.atomic context manager.
I recently modified stackless so that the "atomic" property also inhibited GIL release, so that inter-thread tasklet operations could be made safe.
On a whim I scoured the python archives and found that such a property had been proposed to cPython but rejected (unwisely imho) in favor of general locking.
Perhaps we can get them to reconsider?
Oh, and btw:
an "atomic" property in regular cPython (and stackless) of course only prevents preemptive release of the GIL. Any blocking IO calls will still cause a "co-operative" GIL release. For this reason, "atomic" cannot be replace generic locks completely.
How does this play with longer "transactions" in STM?
@Kris: ah, interesting. You did the same as what I attempted in my hack of CPython at https://bitbucket.org/arigo/cpython-withatomic . This didn't really work out, though, because the stdlib (including file objects) use regular locks. A simple "print" in an atomic block could lead to deadlocks: the atomic block can block waiting for the stdout's file lock to be released, but it does so without releasing the GIL. Now the lock would typically be released by another thread --- if only it could grab the GIL for a short while.
You can see the workaround I found in the last few commit messages of the above repository, but I'm not satisfied with it... In general I'm still unsure what the best way is. For now in pypy-stm I'm going to hack on a case-by-case basis to convert the locks to atomic sections.
Perhaps it is possible to do it semi-generically, e.g. convert all syntactically nested "with lock:" statements in the user code into "with atomic:" statements (similar to next year's Intel CPUs, which will have "lock elision" to help convert from lock-based to HTM programs). As far as I know, this idea doesn't work in all situations, e.g. if you acquire a lock in one thread and release it in another thread.
As far as I can say, this issue is the main blocker preventing any further progress on the CPython side. It is certainly the reason I stopped pushing for it last year.
@Kris: ah, ok: you have a version of "atomic" that doesn't prevent the GIL from being released around I/O calls. This is different from the version described in this post, which is also what I assumed in my previous answer. In a "with atomic" block, the GIL is not released under any circumstance (equivalently, the whole "atomic" block runs as a single transaction), so that the programmer can assume that a "with atomic" block is truly atomic.
@Arne: here is an example using directly thread.atomic. In your multithreaded application, at some point, you want to remove an item from list1 and add it to list2, knowing that list1 and list2 are also accessed by other threads. Then you write:
with thread.atomic:
    x = list1.pop()
    list2.append(x)
This is a classical STM example. What I'm pushing for is not that, though: it is for not writing multithreaded code in the first place. With the proper library code you can write code like the first few lines of transaction. The library code would itself use thread.atomic, but not you directly.
Yes, sorry for not being clear, Armin. But an "atomic" flag that inhibits involuntary thread switching is useful too, because it is a fast "lock" around all kinds of code:
with atomic:
    foo = foo + 1  # multi-threading-safe
without the overhead of real locks.
In our GIL world, real locks only benefit areas that incur thread-blocking operations such as IO.
Anyway, that is off-topic, I suppose :)
Of course, we cannot replace thread._Lock with an "atomic" equivalent, because it is a non-recursive entity, also used for such things as condition variables!
Not a very wise move, in retrospect.
@Kris: indeed. I found out a way that should in all cases either work or raise an exception if unsupported (and not unexpectedly deadlock).
The unsupported situation is: we are in a "with atomic" block trying to acquire a lock, and this lock is acquired already. In this case, there is nothing the interpreter can do automatically. It can only complain rather than deadlocking: no other thread is going to run in parallel to release the lock.
This should let the "common use case" work, which is locks used as scoped mutexes. Caveat: only as long as you use them either only in "with atomic" blocks --- because they appear to be fully serialized, so the mutex will never block --- or only outside "with atomic" blocks.
This leaves the case of mixed usage as unsupported, but I don't see how it could reasonably be supported.
So for now, pypy-stm will raise "oups deadlock" if you try to use "print" statements both inside and outside atomic blocks in parallel... that's the best I could come up with so far.
thanks for the article. might want to reword "This is so far exactly the idea same than the one being investigated for Hardware Transactional Memory.". :)
To expand slightly on what someone else commented, there was a talk not too long ago by some guys who found out that using queues to communicate between threads can be a pretty hefty bottleneck. They were using the JVM.
The talk is interesting because they actually measured the stuff they do and compared it with how it affects the CPU pipelines/caches. The queue discussion is around 32 minutes into the talk.
It's perhaps not relevant for pypy-stm at the moment, but it's definitely relevant for anyone interested in high-performance multithreaded code.
Good job, Armin!
This is exactly what Python needs, and if it turns out hard rather than insanely hard, all the better!
I am not entirely sure about the concept being implemented in pypy-stm or, rather, about what is planned for a parallel PyPy in the future.
I think I am a pretty conservative programmer, and I actually dislike the idea of running code twice because of conflicts which could have been foreseen at development time ;). I still see the advantages STM brings regarding development time.
So I'm wondering about a point which was not entirely clear in your post. You're saying you don't want people to (be forced to?) write short transactions. However, in a project which is both CPU and memory intensive, could I still try to keep the thread.atomic sections as small as possible, to avoid unnecessary overhead but still get effective locks?
@Jonas: it is not always obvious at development time -- to say the least -- how to avoid all conflicts. Think about how hard it is to add automatic GC to C++ in a large project: it's messy, but you might get pretty far with just reference counting -- until some point when you lose because of cyclic references. If instead you had used a proper GC-managed language, the problem would just not exist. It's the same with Transactional Memory and conflicts: you can think harder and harder about using locks correctly, until your program becomes really messy; or you can give up and use TM, solving the issue instantly and letting you think again about your original problem.
Regarding the transaction size: with a good implementation, big transactions should not be slower than small transactions. The only potential drawback of having big transactions is that the risks of conflicts might increase (depending on your program).
Note that this question has a different answer for Python than for C, where code outside transactions runs faster than code within transactions. It is not so in Python. The reason is that transactions are always needed in Python: either explicitly, or implicitly in order to protect the interpreter structures (in replacement of the famous GIL).
Is there any plan to add type declarations as some optional mode in PyPy, like Cython allows? Because PyPy can sometimes give some speed up, but when it doesn't it seems the alternative for the user is to go back to CPython + Cython.
@Armin: Looks nice!
But you’re right: The explicit transaction still looks nicer.
I think though, that both can nicely complement each other:
(1) The transaction is efficient for pushing out parts of the code from the main run to get it multithreaded (think “#pragma omp parallel for” from OpenMP).
(2) The thread.atomic is efficient for protecting stuff inside a threaded application. Also I like that I don’t have to explicitly state which variables I want to protect. And I like that it is not full locking: If I don’t actually get a conflict, other code still runs in parallel.
The first actually looks more interesting though, because it might be possible to make every for-loop run like this, as long as later runs are not dependent on the result of previous runs. This would require quite heavy runtime analysis, though.
STM update (and thanks everybody)
A short update on the Software Transactional Memory (STM) side. Let me remind you that the work is to add STM internally into PyPy, with the goal of letting the user's programs run on multiple cores after a minor adaptation. (The goal is not to expose STM to the user's program.) I will soon write some official documentation that explains in more details exactly what you get. For now you can read the previous blog posts, and you can also find technical details in the call for donation itself; or directly look at how I adapted the examples linked to later in this post.
I have now reached the point where the basics seem to work. There is no integration with the JIT so far; moreover the integration with the Garbage Collection subsystem is not finished right now, but at least it is "not crashing in my simple tests and not leaking memory too quickly". (It means that it is never calling __del__ so far, although it releases memory; and when entering transactional mode or when going to the next transaction, all live objects become immortal. This should still let most not-too-long-running programs work.)
If you want to play with it, you can download this binary (you need to put it in a place with the paths lib-python and lib_pypy, for example inside the main directory from a regular nightly tarball or from a full checkout). This version was compiled for Linux x86 32-bit from the stm-gc branch on the 25th of April. It runs e.g. the modified version of richards. This branch could also be translated for Linux x86-64, but not for other OSes nor other CPUs for now.
The resulting pypy-stm exposes the same interface as the pure Python transaction module, which is an emulator (running on CPython or any version of PyPy) which can be used to play around and prepare your programs. See the comments in there. A difference is that the real pypy-stm doesn't support epoll right now, so it cannot be used yet to play with a branch of Twisted that was already adapted (thanks Jean-Paul Calderone); but that's coming soon. For now you can use it to get multi-core usage on purely computational programs.
I did for example adapt PyPy's own translate.py: see the tweak in rpython/rtyper.py. Lines 273-281 are all that I needed to add, and they are mostly a "simplification and parallelization" of the lines above. There are a few more places in the whole translate.py that could be similarly modified, but overall it is just that: a few places. I did not measure performance, but I checked that it is capable of using multiple cores in the RTyping step of translation, with --- as expected --- some still-reasonable number of conflicts, particularly at the beginning when shared data structures are still being built.
On a few smaller, more regular examples like richards, I did measure the performance. It is not great, even taking into account that it has no JIT so far. Running pypy-stm with one thread is roughly 5 times slower than running a regular PyPy with no JIT (it used to be better in previous versions, but they didn't have any GC; nevertheless, I need to investigate). However, it does seem to scale. At least, it scales roughly as expected on my 2-real-cores, 4-hyperthreaded-cores laptop (i.e. for N between 1 and 4, the N-threaded pypy-stm performs similarly to N independent pypy-stm's running one thread each).
And finally...
...a big thank you to everyone who contributed some money to support this! As you see on the PyPy site, we got more than $6700 so far in only 5 or 6 weeks. Thanks to that, my contract started last Monday, and I am now paid a small salary via the Software Freedom Conservancy (thanks Bradley M. Kuhn for organizational support from the SFC). Again, thank you everybody!
UPDATE: The performance regression was due to disabling an optimization, the method cache, which caused non-deterministic results --- the performance could vary by up to a factor of two. Today, as a workaround, I made the method cache transaction-local for now; it is only effective for transactions that run for long enough (maybe 0.1ms or 1ms), but at least it is there in this situation. In the version of richards presented above, the transactions are too short to make a difference (around 0.015ms).
I don't get it. It's great that pypy libs and so on will be multithreaded with good performance, but how does that help you to write a multithreaded program with good performance, if you don't expose the tools you used to do that?
I think the idea is that the GIL would be gone since internally the interpreter would use STM, and at the programmer level, you would be free to use the normal threading mechanisms
@Texatril no, the point is that you would not have to. You write a normal event-based program with the transaction module and boom, it works. It's easier than writing correct multithreaded code.
Ah, you kinda contradicted yourself by saying the goal wasn't to expose STM to users' programs, but then saying that it exposed the same API as the transaction module.
The transaction module is pretty horrible though. Might I suggest a better syntax than the transaction module? Something like exceptions would be better:
begin:
    ...
commit

or:

transaction:
    ...
rollback:
    retry
perhaps with an option (in a later version?) to replace the "retry" with alternate code.
@Anonymous that would be a bad API, because you cannot fail a transaction. It'll be automatically retried until it finishes. That's in line with correct programs, just multithreaded.
Anonymous: the user level API is any asynchronous event handling framework that uses the transaction library internally to handle events in parallel.
So, for example, you take *any* Twisted program and run it on pypy-stm and it will use the available number of cores to process events without losing any of the normal correctness guarantees of event-based programming.
The goal is really not to expose STM to the user. The pure Python transaction module is a working implementation, running on a single core but running. The fact that pypy-stm provides an alternate implementation, based on STM and giving multi-core usage --- this is the implementation detail.
That's why it has the kind of API you see, and not some STM syntax like "begin: rollback: commit". I also dislike custom keywords, because then we can no longer run the program on CPython or non-STM-based PyPys. But I know I am no language designer myself, so the details are open for discussion.
Nick: thanks for the clarifications. Note however that the transaction module is also meant to be used directly, e.g. in CPU-intensive computational programs that don't use any event framework, like I did in rpython/rtyper.py.
That sounds great!
From the code I wondered, though, if it’s not actually only 2 lines:
for block in pending:
    transaction.add(self.specialize_block, block)
transaction.run()
That sounds like map() - for example like the futures module:
with concurrent.futures.ThreadPoolExecutor() as e:
    e.map(...)
Similarly something like
with transaction.Runner() as r:
    r.map(self.specialize_block, block)
might be easier.
Anyway: Your STM project sounds great!
@arne: right, maybe. It points to a similarity, at least. This simple example corresponds nicely to map(), but in other examples (like richards) we add() more transactions from within transactions. Nevertheless, using the "with ... as t:" syntax might work, by passing the "t" inside transactions in order to call t.map() or t.add() on it too.
This would also open the door to naturally nest these constructs. Right now if you call transaction.run() inside a transaction, you get an error. Such a case is more work to support in the current implementation, but from the surface it looks like a transaction.Runner() kind of interface should allow us to express what we need.
NumPy on PyPy progress report
Hello.
A lot of things happened in March, like PyCon. I was also busy doing other things (pictured), so apologies for the late numpy status update.
However, a lot of things have happened and numpy continues to be one of the main points of entry for hacking on PyPy. Apologies to all the people whose patches I don't review in a timely manner, but seriously, you do a lot of work.
This list of changes is definitely not exhaustive, and I might be forgetting important contributions. In a loose order:
- Matti Picus made out parameter work for a lot of (but not all) functions.
- We merged record dtypes support. The only missing dtypes left are complex (important), datetime (less important) and object (which will probably never be implemented because it makes very little sense and is a mess with moving GCs).
- Taavi Burns and others implemented lots of details, including lots of ufuncs. On the completely unscientific measure of "implemented functions" on the numpypy status page, we're close to 50% of numpy working. In reality it might be more or less, but after complex dtypes we're getting very close to running real programs.
- Bool indexing of arrays of the same size should work, leaving only arrays-of-ints indexing as the last missing element of fancy indexing.
- I did some very early experiments on SSE. This work is seriously preliminary - in fact the only implemented operation is addition of float single-dimension numpy arrays. However, results are encouraging, given that our assembler generator is far from ideal:
            Numpy   PyPy SSE   PyPy   GCC non-looped   GCC looped
a+b         0.6s    0.3s       0.4s   0.3s             0.25s
a+b+c       1.9s    0.35s      0.5s   0.7s             0.32s
a+b+c+d+e   3.2s    0.36s      0.8s   1.7s             0.51s
The benchmark repo is available. GCC was run with -O3, no further options specified. PyPy was run with default options, the SSE branch is under backend-vector-ops, but it's not working completely yet.
One might argue that the C and the Python code are not the same - indeed they are not. It just shows one possible approach to writing numeric code.
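For reference, the Python side of what is being measured is roughly of this shape (the exact code is in the benchmark repo):

from numpypy import zeros

n = 10 * 1000 * 1000
a = zeros(n)
b = zeros(n)
c = zeros(n)
# a single whole-array elementwise expression; this is what gets timed
res = a + b + c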
Next step would be to just continue implementing missing features such as
- specialised arrays, e.g. masked arrays and matrices
- core modules such as fft, linalg, random.
- numpy's testing framework
The future is hard to predict, but we're not far off!
Cheers,
fijal
UPDATE: Indeed, string and unicode dtypes are not supported yet. They're as important as the complex dtype.
Hello,
Could you be a bit more precise about the GCC test?
For instance, is the GCC code using SSE too? Is it written as a single loop (x[i] = a[i] + b[i] + c[i]) or as several consecutive loops, first a+b, then (a+b) + c?
Just to know :-)
One thing I'll note is that I do from time to time use the object dtype. Occasionally, I've got multidimensional arrays of objects, and the array operations from numpy are useful. I don't really get a speed advantage there, but the interface from numpy is useful. But it's not super necessary and certainly not a priority.
Sorry, didn't RTFA completely. I just had a look at the C code.
Still, a question: is PyPy doing the optimization of combining operations into one step?
A "good" Fortran compiler should be able to do those optimizations, for instance.
You should compare to numpy with a JIT such as numexpr; it would be interesting to see whether PyPy is able to beat the numexpr JIT.
Hi, Numpy masked arrays, matrices and the testing framework are pure Python, so why do you need to implement them?
Ralf, we don't have to implement the pure-python stuff, so much as we need to make sure the features of NumPy's core that they depend on are implemented.
Support for objects is actually quite useful: please reconsider adding it.
Here is a very useful case: the manipulation of arrays of numbers with uncertainties (special uncertainties.UFloat objects). Numbers with uncertainties behave very much like regular numbers: it is very useful to be able to use the regular NumPy syntax for array operations, for calculating matrix inverses when the matrices contain numbers with uncertainties, etc. I know many people use these features.
It would be *great* (read: irreplaceable :) to have support for the object NumPy dtype.
This sounds really cool!
And it would be awesome if you’d manage to coordinate with numpy, so the projects merge into a single python codebase with two separate backends: one C-extension-based for CPython and one pure-Python-based for pypy.
Any chance of comparing with Fortran? There are assumptions about pointers and alignment that a Fortran compiler can make.
Nice...but what is the next step?
Numpy alone is not that useful.
"We" need at least scipy and matplotlib.
Are you going to port all these modules? I don't think so.
One way forward could be to have numpy in pypy and at least scipy and matplotlib working with the pypy C api at a decent speed.
What do you think?
What about pickling? I'd love to experiment with hybrid CPython/PyPy execution using some magic from the multiprocessing module or a similar parallel computation framework.
Hello,
This is a very promising result, thank you for sharing it.
Could you give a few more details about the differences wrt numpy?
What would people have to do to use numpypy with scipy?
I think the numpy.linalg module is pretty important.
How to move efforts into this?
PyCon 2012 wrap up
So, PyCon happened. This was the biggest PyCon ever and probably the biggest gathering of Python hackers ever.
From the PyPy perspective, a lot at PyCon was about PyPy. Listing things:
- David Beazley presented an excellent keynote describing his experience diving head-first into PyPy and at least partly failing. He, however, did not fail to explain bits and pieces about PyPy's architecture. Video is available.
- We gave tons of talks, including the tutorial, why pypy by example and pypy's JIT architecture
- We had a giant influx of new committers, easily doubling the number of pull requests ever created for PyPy. The main topics for newcomers were numpy and py3k, disproving what David said about PyPy being too hard to dive into ;)
- Guido argued in his keynote that Python is not too slow. In the meantime, we're trying to prove him correct :-)
We would like to thank everyone who talked to us, shared ideas and especially those who participated in sprints - we're always happy to welcome newcomers!
I'm sure there are tons of things I forgot, but thank you all!
Cheers, fijal
Py3k status update #3
A lot of work has been done during the last month: as usual, the list of changes is too big to be reported in a detailed way, so this is just a summary of what happened.
One of the most active areas was killing old and deprecated features. In particular, we killed support for the __cmp__ special method and its cousins, the cmp builtin function and the cmp keyword argument of list.sort() and sorted(). Killing is easy, but then you have to fix all the places which break because of it, including all the types which relied on __cmp__ to be comparable, fixing all the tests which tried to order objects which are no longer orderable, and implementing new behavior like forbidding calling hash() on objects which implement __eq__ but not __hash__.
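For example, sorting code that relied on cmp has to be rewritten with a key function (the standard py2-to-py3 change):

words = ["bb", "a", "ccc"]
# Python 2 style, now gone in Py3k:
#   words.sort(cmp=lambda a, b: cmp(len(a), len(b)))
words.sort(key=len)            # the Py3k way
assert words == ["a", "bb", "ccc"]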
Among the other features, we killed lots of now-gone functions in the operator module, the builtins apply(), reduce() and buffer, and the os.* functions to deal with temporary files, which have been deprecated in favour of the tempfile module.
The other topic which can't be missing in a py3k status update is, as usual, string-vs-unicode. In this round, we fixed bugs in string formatting (in particular to teach format() to always use unicode strings) and various corner cases about calling the (possibly overridden) __str__ method on subclasses of str. Believe me, you don't want to know the precise rules :-).
Other features which we worked on and fixed tests for include, but are not limited to, marshal, hashlib, zipimport, _socket and itertools, plus the usual endless lists of tests which fail for shallow reasons such as syntactic differences, int vs long, range() vs list(range()), etc. As a result, the number of failing tests dropped from 650 to 235: we are beginning to see the light at the end of the tunnel :-)
Benjamin finished implementing Python 3 syntax. Most of it was small cleanups and tweaks to be compatible with CPython such as making True and False keywords and preventing . . . (note spaces between dots) from being parsed as Ellipsis. Larger syntax additions included keyword only arguments and function annotations.
Finally, we did some RPython fixes, so that it is possible again to translate PyPy in the py3k branch. However, the resulting binary is a strange beast which mixes python 2 and python 3 semantics, so it is unusable for anything but showing friends how cool it is.
I would like to underline that I was not alone in doing all this work. In particular, a lot of people joined the PyPy sprint at Pycon and worked on the branch, as you can clearly see in this activity graph. I would like to thank all who helped!
cheers,
Antonio and Benjamin
Very cool work!
Thanks for the update! I'll need to see if I can already let it hit my own python3 project (I had to convert that to python2.x to make it run with pypy; being able to get rid of that step would be really cool!)
Do you already have prebuilt binaries of pypy3?
I don't think that there is any chance that a python3 project will run as of now; there are still tons of features missing. So far my job has mostly been to fix all the failing tests in the PyPy testsuite. When I've finished, I'll be able to start with new features.
And, for the same reason: no prebuilt binaries yet, sorry.
OK, thanks for the info!
I’m anxious to test it, once you give it a chance to run simple unicode-using code!
PyPy sprint in Leipzig, Germany (June 22-27)
The next PyPy sprint will be held --- for the first time in a while --- in a place where we haven't been so far: Leipzig, Germany, at the Python Academy's Teaching Center. It will take place from the 22nd to the 27th of June 2012, before EuroPython. Thanks to Mike Müller for organizing it!
This is a fully public sprint, everyone is welcome to join us. All days are full sprint days, so it is recommended to arrive the 21st and leave the 28th.
Topics and goals
Open. Here are some goals:
- numpy: progress towards completing the numpypy module; try to use it in real code
- stm: progress on Transactional Memory; try out the transaction module on real code.
- jit optimizations: there are a number of optimizations we can still try out or refactor.
- work on various, more efficient data structures for the Python language. A good example would be lazy string slicing/concatenation or more efficient objects.
- any other PyPy-related topic is fine too.
Grants
For students, we have the possibility to support some costs via PyPy funds. Additionally, we can support you applying for grants from the PSF and other sources.
Registration
If you'd like to come, please sign up either by announcing yourself on pypy-dev, or by directly adding yourself to the list of people. (We need to have a head count for the organization.) If you are new to the project please drop a note about your interests and post any questions.
More...
For more information, please see the sprint announcement.
Call for donations for Software Transactional Memory
Hi all,
The Software Transactional Memory call for donations is up. From the proposal:
Previous attempts on Hardware Transactional Memory focused on parallelizing existing programs written using the thread or threading modules. However, as argued here, this may not be the most practical way to achieve real multithreading; it seems that better alternatives would offer good scalability too. Notably, Transactional Memory could benefit any event-based system that is written to dispatch events serially (Twisted-based, most GUI toolkits, Stackless, gevent, and so on). The events would internally be processed in parallel, while maintaining the illusion of serial execution, with all the corresponding benefits of safety. This should be possible with minimal changes to the event dispatchers. This approach has been described by the Automatic Mutual Exclusion work at Microsoft Research, but has not been implemented anywhere (to the best of our knowledge).
Note that, yes, this gives you both sides of the coin: you keep using your non-thread-based program (without worrying about locks and their drawbacks like deadlocks, races, and friends), and your programs benefit from all your cores. In more detail, a low-level built-in module will provide the basics to start transactions in parallel; but this module will only be used internally in a tweaked version of, say, a Twisted reactor. Using this reactor will be enough for your existing Twisted-based programs to actually run on multiple cores. You, as a developer of the Twisted-based program, have only to care about improving the parallelizability of your program (e.g. by splitting time-consuming transactions into several parts; the exact rules will be published in detail once they are known).
The point is that your program is always correct, and can be tweaked to improve performance. This is the opposite of what explicit threads and locks give you: a performant program which you need to tweak to remove bugs. Arguably, this approach is the reason why you use Python in the first place :-)
Armin
Great news, really looking forward to experimenting with that, good luck!
My question is: will it map to an OS thread being created on each event dispatch, or can it potentially be optimized somehow? I mean, you can end up with code that has tons of small events, and creating an OS thread for each event would slow down your program.
@k_bx it's not like that at all. There are links in the proposal that may enlighten, depending on what you already know.
Indeed, it is creating a pool of N threads and reusing them, where N is configurable. Ideally it should default to the number of cores you have, detected in some (sadly non-portable) way.
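For what it's worth, one common way to detect the number of cores from Python (multiprocessing.cpu_count() exists since 2.6, but can still raise NotImplementedError on some platforms):

import multiprocessing

try:
    num_threads = multiprocessing.cpu_count()
except NotImplementedError:
    num_threads = 4   # arbitrary fallback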
Are any of you affiliated with a university? Since this is research, maybe you can get a grant for a post-doc or a PhD position.
Trivial comment - on the donation page in the "What is Transactional Memory?" section, I think a (TM) has been turned into a superscript TM (as in trademark).
This sounds exciting for the kinds of Python programs that would benefit from TM, but can anyone give a ballpark estimate of what percentage of programs that might be?
Personally, I write various (non-evented) Python scripts (replacements for Bash scripts, IRC bot, etc) and do a lot of Django web dev. It's not clear that I or similar people would benefit from Transactional Memory.
Is that correct?
Could u update the donation page? It doesn't seem to be tallying the amounts.
I am really excited to see this work even if it is pure research (I donated $200). It would be awesome if
stm:
    pre:
        # init transaction state
    trans:
        # parallel stuff
So it would be easy to retry failed transactions or be able to reorder them for contention or perf.
offtopic:
there is a project to help bring C# and C++ together
https://github.com/mono/cxxi
and fork https://github.com/kthompson/cxxi
in essence: there is a generation step which then allows one to easily use C++ objects in C# and vice versa.
considering that ctypes is very much like P/Invoke, it looks like pypy might have something similar for Python/C++ environments; this might make it much easier to port, for example, Blender to use pypy as a scripting language.
Could you post an example snippet of code which would benefit from that?
I ask because I have trouble really imagining example code.
Something minimal with the least possible amount of extension modules which I could just throw into the pypy and pypy-tm interpreter and see the difference.
STM has so much potential. I wonder if it gets the attention of the hacker community it deserves. And if not, why not? I hope this is getting more recognition in the future.
Ah... didn't mean to post it anonymously.
Nice!
@Paul: my guess would be that the majority of people that know STM are still looking at it from the point of view of short or very short transactions, as a replacement of locking. Even gcc 4.7 got an STM extension, but it cannot be used with long-running transactions: the performance is not at all tuned for this case, and you cannot express things you need in real long-running transactions, like interrupting them for I/O.
Moreover, the single-core 4x performance hit is usually far more than what people are willing to accept --- not realizing that single-core speed will in many cases soon be an outdated way of measuring performance: the future is toward many-core machines.
For a casual Python programmer like me, how does STM affect the way I write my programs? I know about the suggested benefits of STM on multi-core machines. However, what I'm asking is: what do I have to do differently to get that benefit?
Thanks
@Anonymous: https://foss.heptapod.net/pypy/pypy/-/tree/branch//stm-thread/pypy/doc/stm.rst