This two-part series is a behind-the-scenes look at our journey scaling Rippling without sacrificing our execution speed. In part one, we’ll do a deep dive into the API bottleneck issue we were facing, and in part two we’ll explain how we solved it.
Things are slow
Last year, there were complaints about page load times from our large customers. Our monitoring of site performance showed that P50 load times were reasonable, but P95 load times had become unacceptably slow for our largest customers. The impressions of site speed are determined by our customers' slowest experiences, not their fastest ones.
P95 Site Wide Page Load time for all and large customers (June 2022)
We launched a number of efforts at all layers of our tech stack to address the problem. Product teams were responsible for identifying and fixing slow pages for their products. Our performance team was responsible for speeding up our application/API layer. Our infra team was responsible for ensuring that our database was indexed correctly to avoid slow queries.
Let’s dive into the changes we made to our API layer to address these concerns—and how they gave us sitewide performance gains.
The read path
Our north star performance metric was P95 page load time for all pages across our application. We measure P95 time as opposed to the median or some other metric because it accounts for the slowest data points from our largest customers who are most acutely affected by performance issues. Read API traffic accounts for most of the time it takes to load a page of our application, so we focused our efforts on optimizing our API read path.
Where time is spent in a Read API Request
Here’s how the read path works for our application. The frontend makes a request to our API servers, the user is authenticated, then the API servers:
- Grab data from the DB and marshal it into a Document object
- Determine which fields on the object the user is allowed to see and exclude the rest
- Compute user-defined functions called “properties” on the Document object
- Convert the Document object and computed properties into JSON to be sent to the frontend
The first step is handled by PyMongo, our DB driver. The last three steps are handled by the Serializer: a Django Rest Framework (DRF) abstraction that operates on both the read and write paths. On the write path, the serializer converts data in the other direction and does write validation and permissions before sending the update to the DB.
On the read path request overhead is usually fixed and indexed DB queries are fast. As we request more data from the database, time spent in the request is dominated by the Serializer. So we decided to improve the Serializer bottleneck to speed up reads.
Why are Serializers slow?
There are several reasons why Serializers are the bottleneck along our read path.
Instantiating our Object Document Mapping (ODM) is slow
Along the read path, our server marshals data from the DB into a Document object. We use MongoEngine as our ODM, and one of its weaknesses is that it is slow to instantiate objects. This is because the time it takes to instantiate a MongoEngine Document object is proportional to the size of the data that is passed to it from the DB. At first this sounds reasonable, but consider the following extremely common use case:
- Suppose there are two objects: one with 10 fields and another with 100 fields. Our frontend only ever needs 5 fields from either of these objects. Instantiating the first object is much faster than the second, even though we only ever need to access 5 fields in both cases.
Instantiating MongoEngine objects accounts for close to 30% of our read request time, so having it be much slower for larger objects is unacceptable.
You might be wondering: can’t we only marshal the fields to MongoEngine that we actually need if we know them ahead of time? The answer is yes, that functionality exists in MongoEngine. But because we allow computing and sending properties (arbitrary functions over the object) to the frontend, we don’t know ahead of time which fields we need to marshal into the MongoEngine object. So we have to assume we will need all of them.
Property serialization is unbounded and expensive
Properties are arbitrary user-defined functions over a MongoEngine object. Here’s an example property which formats a phone number as part of an Address info object:
Properties can access any number of fields from the current object, and can even trigger DB queries on their own. Properties are incredibly useful from a developer velocity perspective, but there are a number of issues with allowing them to be sent to the frontend:
- We can’t tell ahead of time which data from the DB that we need to compute a property, so we request all of it from the DB. This isn’t so bad from a DB query perspective given that the DB is the fastest part of our request. But it does affect how fast we can instantiate our ODM, which accounts for 30% of the request.
- Properties that make their own DB queries, i.e. joins, are subject to the N+1 query problem. We don’t know which properties are making joins, so we can’t batch joins ahead of time when serializing N objects. Instead we have to make N queries serially as part of computing this property for the N objects being serialized.
- Serializing properties can get slower over time as business logic gets more complex.
Properties are a great abstraction for developing business logic on the backend, but really shouldn’t be allowed to be sent to the frontend.
All fields anti-pattern
DRF Serializers take a model and list of fields of the model to serialize as arguments. DRF also allow for you specify all fields as a shortcut:
This is handy if you don’t want to write all the fields out in your serializer, but it’s not secure or performant:
- As the number of fields and properties on a model grows, the time it takes to serialize objects of that model grows.
- It’s unlikely that you need every field from an object when retrieving it to the frontend, especially for very large models. If you do, it’s likely that your API is not granular enough and is being used as a catch-all data source on the frontend.
- For Rippling, it means that more properties are being serialized to the frontend than are necessary. If someone writes a very slow property that is for the backend only, a serializer that uses this shortcut will compute the property and send it to the frontend anyway.
- Performance aside, having an allow-list instead of a block-list is secure by design. You know exactly which fields you are sending to the frontend and that behavior does not change if you add a sensitive field to the model unless you want it to.
Use of the all fields pattern is pervasive in our codebase which means that our API servers are sending more data to the frontend than is actually used. It also means that our read endpoints are getting slower as the number of fields and properties on models grows over time.
Permissions are slow, but necessary
Rippling Serializers are built on top of DRF Serializers and we added our in-house permissions framework at this layer. We built permissions at this abstraction layer because we needed field-level permissions, and MongoDB only provides permissions at the collection level.
Permissions can take anywhere from 10–30% of a read request depending on the ACLs of the user and field-level permission definitions on an object. Users like admins with dense ACLs experience slower permission evaluation. Permissions are necessary and keep Rippling secure, but we assume a variable performance overhead by evaluating them in real time on the read path.
Underlying Serializers are slow
Rippling Serializers are built on top of DRF Serializers, and DRF Serializers are an abstraction over both the read and write path. As such, they provide field validation and transformation for all field types on both the read and write path.
For our application, field validation is unnecessary for reads given that we validate along the write path. Field transformation is also only necessary for non-primitive fields as the JSON output for primitive fields is usually the same as what the DB provides us.
Sample object data from the DB and its serialized JSON counterpart
The abstraction of DRF Serializers is helpful, but at the cost of performance along the read path.
At this point, we had a pretty good grasp of all of the different parameters that contributed to the read bottleneck in our serializers. It was a combination of behaviors and design decisions in our application architecture that contributed to response latency. In part two, we’ll dive into exactly how we solved the bottleneck and safely rolled out the optimization at the platform level.