Python is NOT Single Threaded (and how to bypass the GIL)

I’ve encountered this misconception about Python a lot: that Python is single-threaded, and that to use more than one core on your CPU you must use the multiprocessing module. In this video I show that this is utterly false, and that you can be very CPU-efficient with Python when calling out to something like a C extension (Numba, in this case). I feel this is an important nuance to understand, hence this video.
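
The point can be demonstrated with nothing but the standard library. The video uses Numba, but `hashlib.pbkdf2_hmac` is another example of a C-backed routine that releases the GIL while it computes, so a thread pool really does spread the work across cores (the passwords and iteration count below are arbitrary):

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

def derive(password: bytes) -> bytes:
    # pbkdf2_hmac is implemented in C and releases the GIL while it
    # runs, so several threads can occupy several cores at once
    return hashlib.pbkdf2_hmac("sha256", password, b"salt", 200_000)

passwords = [b"pw%d" % i for i in range(4)]

start = time.perf_counter()
sequential = [derive(p) for p in passwords]
seq_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(derive, passwords))
thr_time = time.perf_counter() - start

print(f"sequential: {seq_time:.2f}s  threaded: {thr_time:.2f}s")
```

On a multi-core machine the threaded timing typically comes out well below the sequential one; swap a pure-Python function in for `derive` and the two become roughly equal, which is exactly the nuance the video is about.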


48 thoughts on “Python is NOT Single Threaded (and how to bypass the GIL)”
  1. To the inevitable "nuh huhh" commenters: save your collective breaths. I have lost any interest in discussing this with anyone unwilling to listen. This video isn't about language superiority or "what X was designed to do", it's about showing a way to write better code involving python. If that is helpful to you, great. If that somehow makes you angry, maybe take a step back and re-evaluate your priorities in life.

    I will no longer be responding to "no this is wrong" comments, and I will remove any of them that are disrespectful or offensive. You have been warned.

  2. Good video. Racket calls its green threads "thread" (which may not be ideal in all cases) and its multiprocessing thing "places".
    I like that it calls them places, because it is a nice analogy. I started writing an explanation, but it looked too much like a book, so I deleted it.
    Another thing I want to mention: it also has "distributed places" which is a way to do distributed programming / writing programs that run across multiple machines (of course there is no magic to make that super easy).

  3. I need to handle thousands of huge plain-text files, and there are nested loops in my script. Could you please give me advice on which library to use for acceleration? Numba? Multiprocessing? Or something else?

  4. When I was first learning Python 3 years ago, I fell into this category of being upset with Python's GIL and claiming it's "not true multithreading." This being because I came from a C++ background, where if I needed to do parallel processing on complex data types/structures, such as flying PSO particles in a distributed fashion, it was fairly straightforward in C++ but rather impossible in Python without the performance hit.

    Since then, I've learned to use Python as a "wrapper" for calling more complex operations underneath. Occasionally, I still find myself using the multiprocessing library as a crutch to run a set of NumPy or pandas operations in parallel, especially when I'm in a bind. But the one thing Python seriously lacks is native, performant parallel processing on complex structures. It's not a problem if you can fit your code into the interface of one of the underlying libraries, but it can be a problem for novel approaches in research. It's still my biggest gripe about Python, even though it's good at almost everything else. The whole pickle approach to passing args to processes almost made our work a no-go in a previous project.
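
    The pickle constraint mentioned above is easy to see directly; a minimal sketch (the payload here is made up for illustration):

```python
import pickle

# Arguments to worker processes travel through pickle, so whatever you
# pass must survive a round trip -- plain data structures are fine...
payload = {"weights": [0.1, 0.2], "label": "run-7"}
assert pickle.loads(pickle.dumps(payload)) == payload

# ...but closures, lambdas, open files and the like are not picklable,
# which is what turns "just parallelize it" into a refactoring exercise
try:
    pickle.dumps(lambda x: x + 1)
except Exception as exc:
    print("cannot pickle a lambda:", exc)
```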

  5. Numba is too young. It doesn't even support passing a Python list into a function; instead we need to use NumPy, and even then it doesn't support Unicode arrays. According to the Numba documentation, built-in functions like .split() are slower in nopython mode than the CPython implementation. There are too many limits when using Numba which are not covered in this video. For industrial applications I think Cython is the best option to accelerate Python.

  6. Hey… great video. To emphasize how dumb these sorts of blanket statements are: I run some heavily parallel applications built in Python, and the option we use, which people will say is cheating but totally avoids the GIL, is subprocess. Yeah, you just launch multiple processes and they have no issue with the GIL at all. We run systems with 900+ co-operating processes, and they replaced C programs that used IPC mechanisms, by avoiding formal IPC (all IPC mechanisms are just ways to slow code down so that processes don't smash common resources, so the fundamental improvement is to require fewer common resources). By modifying algorithms to use mechanisms that avoided locks whenever possible, the Python ended up faster (like 10x) than the C it replaced. Yes, it uses a lot of memory; my app is after execution speed, and we can afford memory. That's one use case, others will differ.
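
    The subprocess pattern described above can be sketched in a few lines; each child is a completely separate interpreter with its own GIL (the worker commands here are toy stand-ins):

```python
import subprocess
import sys

# Each worker is its own interpreter process with its own GIL, so the
# workers run truly in parallel and share nothing by default.
workers = [
    subprocess.Popen(
        [sys.executable, "-c", f"print(sum(range({n})))"],
        stdout=subprocess.PIPE,
        text=True,
    )
    for n in (10, 100, 1000)
]
results = [int(proc.communicate()[0]) for proc in workers]
print(results)  # [45, 4950, 499500]
```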

  7. If htop marks the different threads with different PIDs, doesn't this mean that we can delegate the different threads to different cores? I really didn't understand the second point. If the first is true and we are creating kernel-level threads ("green threads" as opposed to user-level threads is sooo weird to me), then it should be true that we can assign these very threads to different cores?
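
    What htop lists in its PID column for threads are Linux thread IDs, and since Python 3.8 `threading.get_native_id()` exposes them directly, so the first point is easy to verify; the kernel is indeed free to place each of these on any core, and it is the GIL, not the scheduler, that stops them running bytecode simultaneously. A minimal sketch:

```python
import threading

native_ids = []
lock = threading.Lock()
barrier = threading.Barrier(4)

def work():
    with lock:
        native_ids.append(threading.get_native_id())
    barrier.wait()  # hold all four threads alive at the same time

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Four simultaneously-alive threads, four distinct kernel thread IDs:
# these are exactly the per-thread "PIDs" that htop shows.
print(sorted(native_ids))
```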

  8. I am doing computation- (and memory-) heavy work right now, and the whole team doesn't want to do it in vanilla Python. They're pushing to use pandas instead.

    Which is a perfect example of why Python is single-threaded. Not because it factually is, but because in actual tasks, no one uses Python multithreading.

  9. Uhm, you're spreading misinformation with clickbait titles. Having multiple threads does not mean you're executing them in parallel as "system-level threads." It doesn't matter how many threads you have running; only ONE is executing at a time, essentially making Python behave as a single thread. That is what people say and always have said. The GIL is designed right into its core; trying to say "hey look guys, I see another thread in htop, lolz, proven wrong" is the worst attempt to disprove this ever. Prove they are executing at the same time with pure Python code. They aren't.

  10. Not a topic I was wondering about much, but as a Python programmer without much in-depth knowledge, I find myself Googling and Wikipedia-ing a bunch of terms mentioned in the video just so I can follow. PS: In the last few years, I have built up a stubborn belief that, given enough effort, there is nothing I cannot eventually understand. So I am not intimidated by more advanced content such as this as I was before. This video (and, I'm sure, the others) is perhaps more informative than you think.

  11. More or less the reason why I more or less quit Python.

    Nowadays I often use a language that is more or less as slow as Python but allows for parallelism, Erlang. You may want to check it out.

  12. Yeah, but don't you lose some of the advantages Python provides? Wouldn't it be better just to stick to a language better suited for multithreading, i.e. C, C++, C#, Java, etc.?

  13. This is exactly what I need. I am done with seeing "Introduction to *insert item here*" stuff when all I wanted was some intermediate-level information. Much appreciated mate, you earned another sub.

  14. The main issue is that a video like this will have the opposite of the intended effect: people are now going to go around and use threads simply because they heard that Python is in fact using proper system threads, ignoring the nuance. I've seen that happen often enough.

    Ultimately, in Python threads ACT as if they aren't real threads, nicely proven by spawning a few threads that all append the current timestamp to separate lists. You can very clearly see that the multiple threads are never executing at the same time. This is your pudding: it's what the threads act like. That they don't act like that in NumPy etc. is a nice detail, but it ultimately doesn't disprove the core statement: Python threads are not (read as: don't act like) threads. I will continue to point this out to beginners, and I will continue to recommend other ways of speeding up their code, like using Numba (with GIL-released threads if you want more speed).

    Ultimately, it's the same as telling people Python is interpreted: it's also compiled, but explaining that to beginners is entirely pedantic, as Python acts entirely like an interpreted language.
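
    The timestamp experiment described above can be sketched like this (thread and sample counts are arbitrary); merging the per-thread lists by time shows long runs from a single thread while it holds the GIL:

```python
import threading
import time

def spin(out, n=50_000):
    # a pure-Python CPU loop: the GIL lets only one thread at a time
    # execute bytecode, so these loops never truly overlap
    for i in range(n):
        if i % 1000 == 0:
            out.append(time.perf_counter())

lists = [[] for _ in range(3)]
threads = [threading.Thread(target=spin, args=(out,)) for out in lists]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Interleave the samples by timestamp; long runs of the same thread
# index show the GIL handing execution from one thread to the next.
merged = sorted((ts, i) for i, out in enumerate(lists) for ts in out)
print(len(merged))  # 150 samples from 3 threads
```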

  15. Maybe these are misconceptions, but that doesn't change the fact that Python's threading model is fundamentally broken as long as the GIL exists. I use Python regularly, mostly in the ways you imply are completely fine, and I still regularly bump into the GIL. There are simply some computationally intensive things that are difficult to fit into libraries like NumPy or PyTorch, and there is a lot of stuff that is even difficult to fit well into Python's multiprocessing paradigm.

  16. Man, you make great videos. It's just at the right level. As someone who would class themselves as an intermediate Python programmer, your videos provide just the right amount of information and examples for me to immediately apply them and run with it. Thank you and keep up the great work!

  17. Firstly, I don't think anybody who knows Python thinks it's single-threaded. If it were, it wouldn't need a GIL. Secondly, async is not green threads, it's cooperative.
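
    The cooperative point is easy to demonstrate: an asyncio coroutine only gives up control at an `await`, unlike a preemptively scheduled thread. A minimal sketch:

```python
import asyncio

order = []

async def greedy():
    # no await inside: a coroutine that never yields runs to completion
    # before the event loop can switch to anything else
    order.append("G")
    order.append("G")

async def polite():
    order.append("P")
    await asyncio.sleep(0)  # explicitly hand control back to the loop
    order.append("P")

async def main():
    await asyncio.gather(greedy(), polite())

asyncio.run(main())
print(order)  # ['G', 'G', 'P', 'P']
```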

  18. What a clickbait video. Offloading some calculations through FFI does not make Python multithreaded in the way programmers mean when they hear "multithreading". Nor does sequential concurrency.

  19. Right off the bat: the advice to use multiprocessing instead of threading came from Guido himself. It's not a "misconception", it's a quote from the creator of Python, and he said it without the context of "what if you offload some tasks in your code to C++ code".
    The second part is, nobody ever claimed that Python uses green threads; of course it creates a system thread. But because of the GIL, every thread waits for the GIL to free up before executing, which makes your threading tasks not truly concurrent and renders Python single-threaded in that regard, no matter what you wrote in your code, unlike NodeJS, Golang, and others.

  20. The main thesis is pretty wrong. Nobody cares if Python threads are implemented using kernel threads or not. What ACTUALLY matters is whether you can get SIMULTANEOUS EXECUTION, because that's what's going to let you use all of the many cores that are usually available. If you can't do that easily (e.g. you need to resort to multiple processes, or some third-party facility), then Python is not amenable to a VERY WIDE range of applications. This non-amenableness explains something said near the beginning: When was the last time you saw a CPU-bound Python program? The reason this doesn't exist is because when people set out to write a CPU-intensive application, they immediately throw out Python as a possible implementation language.

    Numba looks pretty interesting, and was new to me, so I did get something out of this, but the fact that you have to explicitly go out of your way to release the GIL in special sections of code proves the general wisdom correct: When using CPython, you have to jump through extra hoops to use all of your many cores, because in its "native" state, Python doesn't let you do that.

  21. I've actually never heard anyone mistake Python for being single-threaded. I feel like most people understand pretty well that "it only uses up to 1 CPU worth of processing". What I see much more often is people saying Node.js is a single-threaded language.

  22. There are some misconceptions here. In the old days we had 1 CPU with 1 core. Nevertheless, full multi-threading was possible, because the OS would apply time slices and alternate the threads using its thread manager.

    This is very useful with a GUI program, because the GUI code would be sitting idle most of the time, waiting for a keypress or mouse click. So the worker threads would be able to do their job in the meantime. Only when the user would take some action, the GUI thread would kick in. So the GUI would remain responsive, while the worker threads would do their long tasks. Think of reindexing a database.

    Also handy with scripts, e.g. when you want to cancel a job.
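
    The responsiveness pattern described above still works under the GIL, because blocking waits release it. In this sketch the worker's `time.sleep` stands in for a long blocking job (like waiting on the database), while the main thread keeps ticking like a GUI event loop:

```python
import threading
import time

def worker(done):
    # time.sleep releases the GIL, just like waiting on a socket or a
    # database would, so the "GUI" thread below keeps running freely
    time.sleep(0.2)
    done.append("reindex finished")

done = []
t = threading.Thread(target=worker, args=(done,))
t.start()

ticks = 0
while t.is_alive():
    ticks += 1          # the GUI thread keeps servicing its event loop
    time.sleep(0.01)
t.join()

print(done, "after", ticks, "ticks")
```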

  23. Great video! My question: is it possible to make my own good-looking console that can represent a lot of data at once with Python? (Like the core performance monitor in the i3 console, something like in your video.)

  24. Very good demonstration.

    There are 4 distinct levels of thread execution touched on: physical processors; SMT (simultaneous multi-threading, where one core has multiple contexts; "virtual core" is a bit of a misnomer, as they have their own physical registers and are just as real as the sibling they share execution units with); operating-system threads (which were repeatedly referred to as "proper"); and in-process green threads.

    The load spike moving from processor to processor in the GIL demonstration is probably execution, not threads, moving between cores; it just varies which thread acquires the GIL. Linux tends to keep threads on the same core when possible, something that can be enforced using thread affinity; this reduces the overhead of warming up the caches for particular tasks. In some processor cores, e.g. the Xmos XS1, a single thread can't use all the execution resources; in the XS1, 4 threads are required to saturate the pipeline. The multiprocessing example probably bottlenecked in the scheduling of work or the transfer of parameters and results, which are serialized because they run on the parent control process. Larger queues are often used to mitigate this, but that won't help if task generation takes longer than task execution.

  25. The real reason that no one changes these languages to have full parallelism like C++ is that you can't change a library or a compiler or an interpreter to do that – not and still have garbage collection anyway. It has to be baked in from the initial design – and the decisions that you have to make to make that possible are significantly different. If you wrote a new version of the language that did all that, none of the existing libraries would work with the new language. It still might be worthwhile, but it's a deeply breaking change AND it requires rewriting from the ground up.

    The kludges that are happening now, copying data across processes instead of sharing it, might be enough, though it IS possible to write an actually-sharing version of a dynamic language. Note that Numba's nogil only works on code that doesn't use Python types… that's quite a restriction.

    By the way, languages with advanced garbage collectors like Java and .NET don't actually use system threads, or rather they have no more than one system thread per hardware thread. This is because advanced garbage collectors need to do context switches only at safe points; otherwise, finishing a GC phase would have to wait for all language threads (including the ones that don't currently have an active timeslice) to count out the phase change, because of the limitations on how cores see changes that other cores have made in memory.

    So these advanced systems DO use green threading, they use one green thread at a time per hardware thread or hardware hyperthread.

    Google's Go language is similar, except that it doesn't even emulate preemptive multitasking. So: no preemptive multitasking and no green threads. Instead there's a limit on the number of simultaneous threads, so that counting out a GC phase only requires a response from the currently running hardware threads.
