At Rippling, our mission is to free smart people to work on hard problems. A crucial part of that mission relies on our application’s speed and performance. In this post, we'll delve into how we identified that Python garbage collection was a significant contributor to tail latency, and how we reduced P99 latency by 50% by tuning when and how the garbage collector runs.
The long tail latency
Earlier this year, Rippling had a performance problem. We had significantly improved our API response times as a whole, but our page load times were not dropping at the same rate. Roughly 25% of page load times were far beyond their median without correlating to any single product or customer behavior. We had one crucial hint: our P99 API latency was still high—around 3 seconds, compared to a P50 of 50 milliseconds.
In a complex client app, page load times are functions of API latency, but aren’t always linearly proportional. This effect is amplified when our client makes many API calls (sometimes up to 30 on a single page), because a single slow response is enough to delay rendering.
We approached the problem statistically by asking, “If 1% of API requests are slow, and we make 30 API requests, how likely is it that at least one of the requests is slow?” A request has a 99% chance to fall within P99 latency, so we find the probability of 30 requests all falling within P99, then find the complement:
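Treating the requests as independent, the numbers work out like this:

```python
p_fast = 0.99            # a single request lands within P99 latency
n_requests = 30          # API calls on a single page load

# Probability all 30 requests are fast, then the complement:
# the chance at least one is slow
p_at_least_one_slow = 1 - p_fast ** n_requests
print(f"{p_at_least_one_slow:.0%}")  # ≈ 26%
```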
A spot-on match for our page load metrics.
We ruled out some obvious culprits quickly: CPU usage was fine and request routing time was insignificant. The slowdowns were spread across every endpoint, and our Datadog traces showed different hotspots each time. OS context switching seemed unlikely (although we checked anyway), and our databases were all healthy. But one idea couldn’t be ruled out: our programming language, Python. (We use Python extensively at Rippling; almost all requests go through a Python service.)
Instrumenting our app
Python is a memory-managed language. Removing unused objects is handled in two distinct ways: reference counting and cyclic garbage collection. The reference counter works seamlessly as objects go in and out of scope, destroying (and de-allocating) objects when their reference count hits zero. Even very clean Python code creates cycles, which tie up references, and the cyclic garbage collector is necessary to clean them up. When the garbage collector runs, the Python program is suspended until it is complete.
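For instance, two objects that reference each other keep each other's reference counts above zero even after both go out of scope, so only the cyclic collector can reclaim them. A minimal illustration:

```python
import gc

class Node:
    def __init__(self):
        self.partner = None

a, b = Node(), Node()
a.partner, b.partner = b, a  # a reference cycle
del a, b                     # refcounts stay above zero; nothing is freed yet

collected = gc.collect()     # the cyclic collector finds and frees the cycle
print(collected)
```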
We considered the Python garbage collector early on, but our problem seemed too severe. Our app was pausing for 2 to 4 seconds on many of these requests; surely garbage collection couldn't be that slow.
Python doesn't report this data directly, but it is possible to create a simple fixture to collect it. The Python garbage collector module (“gc”) lets you register callbacks. The callback is notified each time the garbage collector starts and stops, passing along metadata at the end. If we capture a timestamp for each event, we can easily measure the total time spent garbage collecting. It looks something like this:
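A minimal sketch of such a fixture (the class and attribute names here are illustrative; `gc.callbacks`, the `phase` argument, and the `info` dict are the real API):

```python
import gc
import time

class GcStats:
    """Tracks total time spent in cyclic garbage collection."""

    def __init__(self):
        self._start = None
        self.total_gc_seconds = 0.0
        self.collections = 0

    def install(self):
        gc.callbacks.append(self._on_gc_event)

    def _on_gc_event(self, phase, info):
        if phase == "start":
            self._start = time.monotonic()
        elif phase == "stop" and self._start is not None:
            self.total_gc_seconds += time.monotonic() - self._start
            self.collections += 1
            self._start = None
            # On "stop", info carries metadata such as
            # info["generation"] and info["collected"]
```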
We attached our fixture to every API worker and saw extreme results. Our worker processes frequently paused in garbage collection for multiple seconds, especially when a request ran a third-generation collection.
A brief detour into garbage collection
CPython, the reference implementation of Python, uses a generational garbage collector. Objects that end up in cyclic references must pass through the cyclic garbage collector. The cyclic garbage collector can be expensive, so if an object survives one collection cycle, CPython moves it into a higher generation, which is scanned less frequently over time. This optimization generally saves compute time (by avoiding repeat traversals of long-lived collections), but it comes with a cost: later generations slow down as they collect more objects. Python attempts to minimize this cost by running fewer third-generation collections as the generation size increases. But this only saves amortized costs–an individual collection event can still be very slow.
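CPython exposes the counters and thresholds that drive this schedule. On most builds, generation 0 collects after roughly 700 net container allocations, and each higher generation only runs after the one below it has run several times:

```python
import gc

# (threshold0, threshold1, threshold2): gen 0 runs after ~700 net
# container allocations; gen 1 after 10 gen-0 runs; gen 2 after
# 10 gen-1 runs (the defaults on most CPython builds)
print(gc.get_threshold())

# Current per-generation allocation/collection counters
print(gc.get_count())
```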
For long-running computation, this strategy is great. For an API, where consistency is essential, these random pauses are a problem.
We considered three approaches:
- Refactor our app to reduce cyclic references
- Disable the cyclic garbage collector entirely. Let the workers accumulate garbage and restart them when they reach a memory threshold
- Keep the garbage collector on, but somehow garbage collect “later” when it was convenient
Refactoring our entire app would have been a large undertaking. Plus, we like many of our internal graph data structures, even though their cyclic nature costs extra processing time. (We build many in-memory graphs to support products like Unity.) Disabling the garbage collector would require constant tuning of worker restart thresholds, and done wrong, we could quickly lose control of our API workers' stability.
We wanted to run the garbage collector at a better time, but we had no clear mechanism to do it. More specifically, we wanted to run it in-between requests. Our web framework (Django) has request hooks, but even the post-request hooks run before the request is fully closed. We realized we could write a custom worker if we switched to Gunicorn.
Gunicorn to the rescue
Gunicorn is a lightweight WSGI app server written entirely in Python. Gunicorn allows you to specify a custom worker, without maintaining a fork of the server itself. For our purposes, this was perfect. We could add custom behavior right after the request was completely processed, even after the connection to the client was closed.
Simple, easy to maintain, and it gives us all the power we need. Since we don't need to run garbage collection on every request, we run each generation at a fixed interval, roughly tuned to our use case. (An astute engineer pointed out we had finally found a real-world use for FizzBuzz.) With this change, we saw a 40% reduction in P99 API latency.
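The FizzBuzz-style schedule can be sketched without the Gunicorn plumbing. In this sketch, the intervals are illustrative rather than our production values, and the scheduler would be called from the custom worker's post-request path, with automatic collection turned off via `gc.disable()` so collections only happen where we choose:

```python
import gc

class GcScheduler:
    """Runs each GC generation at a fixed request interval.

    Intervals here are illustrative, not production-tuned values.
    """

    def __init__(self, gen0_every=100, gen1_every=1_000, gen2_every=10_000):
        self.count = 0
        # Check the rarest (most expensive) generation first,
        # so a gen-2 boundary doesn't trigger a cheaper collection instead
        self.intervals = [(2, gen2_every), (1, gen1_every), (0, gen0_every)]

    def after_request(self):
        """Call once per completed request. Returns the generation
        collected, or None if no collection was due."""
        self.count += 1
        for generation, interval in self.intervals:
            if self.count % interval == 0:
                gc.collect(generation)
                return generation
        return None
```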
At this point, we had completely eliminated the garbage collection pauses from our API responses. But our third-generation collections still always took over 2 seconds to run. Those 2 seconds weren't contributing to latency anymore, but they were a nuisance. (That’s CPU time we’re paying for!)
We looked back to our long-lived objects (those objects being scanned in the third generation). We realized the vast majority of them were created before our app started serving requests, and they would live for the duration of the process. We knew we'd never need to garbage collect these objects, but the garbage collector didn’t.
The Python garbage collector has another trick: we can “gc.freeze()” our long-lived objects, hiding them from the garbage collector indefinitely. In Django, the WSGI entry point is the perfect spot to do this. As a bonus, our warmup call already lives in this sequence, so we capture any side effects from that process too.
This gives the garbage collector a clean slate before we serve our first request. We do a full garbage collection cycle right before, in order to make sure we don't freeze any actual garbage.
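In code, the end of the warmup sequence looks something like this (the function name is ours; `gc.collect()`, `gc.freeze()`, and `gc.get_freeze_count()` are real APIs, available since Python 3.7):

```python
import gc

def freeze_startup_objects():
    """Call once at the end of WSGI startup, before serving requests."""
    gc.collect()  # sweep real garbage first so we don't freeze it
    gc.freeze()   # move all surviving objects out of every GC generation

freeze_startup_objects()
print(gc.get_freeze_count())  # objects now hidden from the collector
```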
This change dropped our average third-generation collection time from over 2 seconds to below 500ms, an 80%+ speedup.
[Chart: sampled production third-generation collection times. "Low garbage, without gc.freeze()" (previous best case) vs. "Full production garbage, with gc.freeze()" (new worst case)]
Future improvement and caveats
Our current hard-coded request interval is working well for most of our infrastructure. That said, a linear interval is too aggressive for some high-throughput APIs, so we’ll be investigating better request interval coefficients in the future.
It’s worth noting that running the garbage collector pauses execution on all Python threads in a process. For us, this isn't an issue, because each of our workers runs in its own process. For servers configured to use multiple threads, deferring GC to between requests will be harder to synchronize.
By the end, our P99 latency dropped by over 50% and the earlier slow pages quickly sped up.
The Python garbage collector served us well on its defaults for years, but as our app design shifted, we found the heavy GC pauses, although infrequent, unacceptable for our front end. The change was gentle—we still run frequent GC cycles and keep our garbage low—but the impact was significant. Importantly, we were able to make this change without disrupting other development efforts in our product.