I’ve been learning a lot of Twisted lately and I’m really enjoying it. Back in the day when I did C++ coding I almost exclusively used event-driven programming, so getting back into it again with a Python twist (pun fully intended) is making my coding juices flow once more.
One of the drivers for me to learn more about it is to be able to maintain the Launchpad Build Farm manager, which is written mostly using Twisted events. When it was first put together by Celso to replace the old slave scanner, it was quite a tough task to slot it into the existing Launchpad code base without too much disruption. He did a pretty good job, but now the code is in need of an overhaul as it scales quite badly.
There are three main reasons that it scales badly:
- Each time it wakes up for a scan “cycle”, it scans every builder in turn.
- It synchronously uploads build results, blocking further progress until the upload completes.
- It synchronously polls the builders before setting up Twisted Deferreds to do the actual dispatch in parallel.
I started tackling the first of these a while ago, as it was obvious there was a better way. Instead of polling each builder in turn, the manager now sets up a separate Deferred for each builder in the system and lets it fire independently of all the others. Because the code is not really threaded, there are no race conditions or exclusive-access issues to worry about, and we get a couple of really big wins from this approach:
- If there’s any kind of exception during the scan, it only takes out the current scan cycle on a single builder, not all of them. Other builders continue to dispatch quite happily.
- We can now overlap events better. Previously, we did all the dispatching at the same time, which mostly involved sitting and waiting for the reset scripts to finish on the virtual builders (which sometimes take up to 30 seconds). Now, we can kick that off and poll another builder or upload a build at the same time. Throughput has increased slightly, but the effect on the build farm page is dramatic: you no longer see builders sitting idle for long periods while there’s a big queue.
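As a rough illustration of the per-builder approach, here is a minimal sketch. It is not the Build Farm manager's code: it uses `asyncio` from the standard library as a stand-in for Twisted so that it runs without extra dependencies, and the builder names, `scan_builder`, `scan_loop` and `scan_all` are all hypothetical. The shape is the same, though: one independent unit of work per builder, with a failure confined to the builder that raised it.

```python
import asyncio

# Hypothetical builder names; "broken" simulates a builder whose scan raises.
BUILDERS = ["alpha", "bravo", "broken"]


async def scan_builder(name):
    """Stand-in for one scan of a single builder."""
    await asyncio.sleep(0)  # yield to the event loop, as real network I/O would
    if name == "broken":
        raise RuntimeError("scan failed on " + name)
    return "dispatched"


async def scan_loop(name, cycles, results):
    """Scan one builder repeatedly, independently of every other builder."""
    for _ in range(cycles):
        try:
            outcome = await scan_builder(name)
        except Exception as exc:
            # A failure only costs this builder its current cycle.
            outcome = "failed: %s" % exc
        results.setdefault(name, []).append(outcome)


async def scan_all(builders, cycles=2):
    results = {}
    # One independent task per builder rather than one loop over all of them.
    await asyncio.gather(*(scan_loop(b, cycles, results) for b in builders))
    return results


def run_scans():
    return asyncio.run(scan_all(BUILDERS))
```

Calling `run_scans()` returns a dict of outcomes per builder. The healthy builders keep dispatching on every cycle even though the broken one fails each time.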
This code went live in production a month ago and we immediately had a massive problem with it that I had not thought of. Because the builder reset events (which run as a subprocess) now happen in parallel with other non-Twisted operations, those operations get interrupted by SIGCHLD when the subprocess finishes. This manifests itself as an EINTR error in the middle of a comms operation with another builder. We added some code to retry the operation in this case and it finally worked! I’m really happy that this was the only problem.
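The retry can be sketched roughly as follows. This is an assumption about the general shape of such a fix, not the code that actually landed: `retry_on_eintr`, `max_attempts` and `FlakyChannel` are hypothetical names, and the interruption is simulated by raising an `OSError` with `errno.EINTR` rather than by a real signal.

```python
import errno


def retry_on_eintr(operation, *args, max_attempts=5):
    """Call operation(*args), retrying if a signal interrupts it with EINTR."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(*args)
        except OSError as exc:
            # Anything other than EINTR is a real error, and so is
            # running out of attempts: re-raise in both cases.
            if exc.errno != errno.EINTR or attempt == max_attempts:
                raise


class FlakyChannel:
    """Hypothetical comms channel whose first read is interrupted."""

    def __init__(self):
        self.calls = 0

    def read_status(self):
        self.calls += 1
        if self.calls == 1:
            # Simulates a system call interrupted by SIGCHLD.
            raise OSError(errno.EINTR, "Interrupted system call")
        return "idle"
```

With this in place, `retry_on_eintr(FlakyChannel().read_status)` swallows the first interrupted call and returns the result of the second. On Python 3.5 and later the standard library retries most interrupted system calls itself (PEP 475), so an explicit wrapper like this mostly matters on older interpreters.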
However, the really, really big win in scalability will come with Jelmer’s fix which will land Real Soon Now. This will remove the blocking that happens when uploading builds – a block that can last for over a minute for large uploads. I can’t wait to see this branch land.
Here’s to more Twisted!