A Better Framework for Computed Fields
There are only two hard things in computer science: cache invalidation and naming things.
-- Phil Karlton
If you’ve ever been through a class on databases, you’ve probably had the refrain “normalize your data!” drilled into your head. One of the benefits of this approach is that it eliminates data redundancy: data stored in multiple places, or more generally, data that depends on the value of other data. In an ideal world, every column in a database table would be an independent value, and updating a column would be an operation independent of any other values in the database. This helps us avoid thorny problems around data consistency: if we store cost_in_cents and cost_in_dollars in our table, then every time we update one of the values we have to update the other, otherwise one will be wrong! The situation becomes even worse if these values are stored in different tables -- now the only way to prevent inconsistencies is to use transactions to modify all related rows in one go.
Of course, you should never have cost_in_cents and cost_in_dollars as separate values in your database, but unfortunately, the real world is messy, and sometimes we need to store columns that depend on the values of other columns, other rows, or even other tables. One reason for this is that it allows us to index and efficiently query values that would otherwise be prohibitively expensive to compute on the fly: this is, of course, the magic of caching. But wherever you introduce caches, you invariably end up with stale caches -- and that’s how this particular journey down the rabbit hole began, ending with us building a new applicative-functor-based framework to efficiently calculate derived values (and keep them up-to-date!).
Let’s start from the beginning. As you might guess, PayrollRun is an important model in our codebase. A PayrollRun has references to many PayrollRunEntries, each corresponding to the pay for a given employee in a pay period. A PayrollRunEntry has a boolean field isPayrollReady which determines whether we’re currently able to actually disburse the payment to that employee -- if it’s false we won’t pay them, so freshness is obviously critical. It accounts for every reason that we’re not able to pay that employee, including invalid addresses, unverified bank accounts, locations we don’t currently support, invalid pay configurations, lack of tax filing information like a W4, etc. -- we end up performing dozens of queries on dozens of tables to set this one bit that determines whether the entry should actually be paid out. Here’s a simplified sample of what goes into calculating it:
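```python
# A simplified sketch -- the helper names here are illustrative, and the real
# calculation spans many more models and checks.
def calculateIsPayrollReady(entry):
    not_ready_reasons = []
    not_ready_reasons += check_home_address_validity(entry.employee)
    not_ready_reasons += check_bank_account_verification(entry.employee)
    not_ready_reasons += check_work_location_supported(entry.employee)
    not_ready_reasons += check_pay_configuration(entry)
    not_ready_reasons += check_tax_filing_info(entry.employee)  # e.g. a missing W4
    not_ready_reasons += RWCPayrollSettings.calculate_not_ready_reasons_for_direct_deposit(entry)
    return len(not_ready_reasons) == 0
```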
This is obviously a pretty expensive computation, so caching it is important -- otherwise fetching every individual payroll entry could take a second or two, and fetching all entries in a run could take several minutes. This value depends on dozens of other rows, though, so we need to make sure that it’s kept up-to-date: if an employee adds a bank account and is now eligible to be included in the current payroll run, we want admins to see that change reflected near-instantly.
The way we used to handle this was by adding post-save hooks to every model that the calculation depends on. That’s actually pretty nontrivial! Because data fetches can happen anywhere in any Python method, a call to a method like:
RWCPayrollSettings.calculate_not_ready_reasons_for_direct_deposit
can end up introducing half a dozen implicit dependencies on other models that aren’t obvious from a cursory inspection of this method. To determine all the dependencies of this function you have to inspect all its method calls recursively, and even that isn’t enough, as attribute access on models can lead to new database fetches if the attribute corresponds to a foreign key. It can be quite easy for an engineer to modify this method and not realize they’ve added a new implicit dependency that requires a new hook to maintain consistency.
There’s another issue here, too: because calculateIsPayrollReady is an expensive method, we couldn’t just throw it into the critical path for saves on all these other models -- instead, we had to enqueue a background job in our hook that could update the cached value. For the interval of time between saving the dependency and updating the cached value, isPayrollReady will be stale, with no way for an end-user to recognize that.
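Concretely, the old pattern looked something like this (a hypothetical sketch -- the hook decorator and job names are assumptions):

```python
# Hypothetical sketch of the old approach -- names are illustrative.
@post_save(BankAccount)
def on_bank_account_save(bank_account):
    # Recomputing inline would be too slow, so enqueue a background job
    # for every payroll entry that might be affected.
    for entry in PayrollRunEntry.objects(employee=bank_account.employee):
        enqueue_job(recompute_is_payroll_ready, entry.id)
```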
We quickly realized that we needed a better approach. An ideal framework for these computed fields would have the following properties:
- For a given computed value and its dependencies, add hooks such that when any of the dependencies change, the value is recomputed. Additionally, provide a cheap way to determine whether the computed value is stale.
- Automatically compute the dependencies of a computed field from the code itself, without having to manually specify them.
This is how we built that framework.
Determining cache freshness
We realized that we should be able to model every computed field as a pure function of two things:
- Queries -- any query performed within a computation introduces a dependency on that query: if the set of objects returned by the query changes, the cache should be considered stale.
- Field paths -- after querying for other objects, any foreign key fields from those objects used in the computed field calculation should be considered dependencies as well. If one of those fields changes on an object in the queryset, then the cache should also be considered stale.
This needs to work recursively, as we need to also consider fields in objects accessed through references on queryset results. For example, if I query for employees, and then use employee.workLocation.address.fullAddress in my computation (Employee and WorkLocation are stored in different tables), I need to track the address of the workLocation as a dependency.
If both of these things are unchanged -- the queries I make return the same objects as results, and the values of the fields I access on those objects are the same -- then my computed field should return the same result, and so we can safely use the cached value.
If I know that my queries are returning the same objects, then determining whether the values have changed is easy: along with my cached value I can store the time of its computation and check whether the newest updatedAt timestamp for the queryset results is newer than it -- this should be a pretty fast database query with no serialization required. How can we determine whether a query returns the same objects at time t1 and time t2? There are two different ways that it could return different objects:
- A document is in the queryset at t2 that was not in the queryset at t1. If that’s the case, this new document will have an updatedAt timestamp that’s newer than t1.
- A document that was in the queryset at t1 is not in the queryset at t2. This is trickier, because max(updatedAt) over the documents in the queryset won’t necessarily change. However, the count of documents in the queryset will change.
By storing these two pieces of information along with our cached value for each query dependency, we can quickly determine whether the dependencies of our value have been modified. Actually, since we need to recompute the value if any of the dependencies have changed, it’s sufficient to store the sum of queryset counts and the max over all max(updatedAt) values.
Updates to documents accessed through references also need to be tracked -- we store the ids of these along with our cached value and do a max(updatedAt) query over them as well.
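Concretely, the staleness check boils down to a couple of cheap aggregation queries. A simplified sketch, assuming we store the metadata described above alongside the cached value:

```python
# Simplified sketch -- the metadata field names here are assumptions.
def is_stale(cached):
    total_count, newest = 0, cached.computed_at
    for query in cached.query_dependencies:
        queryset = query.to_queryset()   # re-run the dependency query
        total_count += queryset.count()  # cheap count, no serialization
        latest = queryset.order_by('-updatedAt').scalar('updatedAt').first()
        if latest is not None:
            newest = max(newest, latest)
    # (Documents reached through references get the same max(updatedAt)
    # check over their stored ids; omitted here for brevity.)
    # A changed count catches removals; a newer timestamp catches
    # additions and in-place modifications.
    return total_count != cached.total_count or newest > cached.computed_at
```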
Refreshing fields on dependency changes
The above algorithm for determining cache freshness provides us with an easy way to lazily implement these computed fields: before accessing the cached value of a field, we can first quickly determine its freshness and recompute if stale. However, if a field is expensive to compute, this has the potential to create significant performance problems by placing the recomputation in the critical path of the read request. Instead, for many applications, we would prefer to move the recomputation to the write: if a dependency of our computed field changes, we should recompute the value of the field and update the cached value, allowing any reads to simply use the cached value without performing any recomputation. In order for dependency saves to trigger recomputations of the fields that depend on them, we need to maintain a reverse index mapping a given object to all the fields that depend on it. This is non-trivial, though: maintaining a mapping of object IDs isn’t sufficient, because our computed fields depend on queries, not just objects. To illustrate the difference, consider the following example.
Suppose I have a cached field that stores the average salary of employees in a particular work location -- something like this simplified sketch:
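```python
# A simplified sketch of the computed field (the salary field path is illustrative).
def average_salary_in_location(location_id):
    employees = Employee.objects(currentJob__location=location_id)
    salaries = [employee.currentJob.salary for employee in employees]
    return sum(salaries) / len(salaries) if salaries else None
```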
It’ll depend on the results of the following query:
Employee.objects(currentJob__location=LOCATION_ID)
If I run my query at time t1, it’ll return a set of IDs, which in a sense are the objects this computed field depends on; if currentJob.location changes for any of them, we’ll need to redo the calculation. However, this is not sufficient: if another employee’s currentJob.location changes to be equal to LOCATION_ID, that also necessitates a recalculation of our computed value, despite that employee not being in our original queryset. This suggests that our index shouldn’t map from dependencies to fields directly, but rather from queries to fields.
So, for every model in our database, we maintain a reverse index mapping queries over that model to the computed fields that depend on them. Our reverse index contains documents generated by the following algorithm:
- For every field f used in a query for the dependency, first determine whether the field is orderable.
  - If not, add index fields of the form f_inclusion and f_exclusion.
  - If it is orderable, in addition to the above, also add f_lower and f_upper.
- Given a simple query, where each field is constrained by the operators eq, ne, lt, lte, gt, gte, in, and nin, we map it to an index document by decomposing each field’s constraints into a set of bounds plus additional included or excluded points.
From a given dependency, finding the queries that it matches can then be done by constructing a compound query that does the following: for every queryable field in the document, either the field does not exist in the index document, or it satisfies f_lower__lte=dependency_field AND f_upper__gte=dependency_field AND f_inclusion__all=[dependency_field] AND f_exclusion__not__all=[dependency_field].
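To make this concrete, here’s a hypothetical example -- the index schema and field names below are illustrative, not our exact implementation:

```python
# Hypothetical index document for the query:
#   Employee.objects(salary__gte=50000, department__in=['eng', 'sales'])
index_doc = {
    'computed_field': 'average_salary_in_location',
    'salary_lower': 50000,                     # from salary__gte
    'salary_upper': float('inf'),              # no upper bound given
    'department_inclusion': ['eng', 'sales'],  # from department__in
}

# On an Employee save, find the computed fields whose queries it matches
# (MongoEngine-style Q objects, simplified to two fields):
matches = QueryIndex.objects(
    (Q(salary_lower__exists=False)
     | (Q(salary_lower__lte=employee.salary) & Q(salary_upper__gte=employee.salary)))
    & (Q(department_inclusion__exists=False)
       | Q(department_inclusion__all=[employee.department]))
)
```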
For a given dependency, this query will return the list of computed fields that depend on it, allowing us to trigger background jobs to update them all on every dependency save. When combined with the cache freshness logic above, this gives us the best of both worlds -- we run background jobs to update our cached values when necessary, but we avoid inconsistencies caused by race conditions because an intervening read can detect the staleness of the value and either wait for the background job to finish or recompute the field itself.
Automating dependency tracking (and optimizing data fetching)
Of course, in order for this approach to work, we need to know the dependencies of our field up front -- but as we saw above in the isPayrollReady case, that’s not easy! Data fetches today can happen anywhere, even on property access, and keeping track of the whole web of recursive dependencies is a losing battle. It’s much better to make the flow of data an integral part of your framework itself, as anyone who’s switched from jQuery to React can attest. To do this, we need to make data fetching something that happens explicitly instead of implicitly: we can write the business logic for our computed fields in such a way that each field declares the queries it depends on and what to do with the results of the queries, but doesn’t perform the fetching itself -- the fetching is left to the framework as an implementation detail. This also enables the possibility of performing aggressive optimizations on data fetching; by allowing the framework to operate on batches of query dependencies, we can batch the fetches themselves and eliminate duplicate or overlapping queries.
Haxl, a Haskell framework built by Facebook and used to implement the rules engine for abuse detection there, provides an intuitive interface to express these sorts of deferred computations through the applicative functor abstraction. If you’re not familiar with applicative functors, no worries, Wikipedia is our friend:
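“In functional programming, an applicative functor, or an applicative for short, is an intermediate structure between functors and monads. In category theory they are called Closed Monoidal Functors. Applicative functors allow functorial computations to be sequenced (unlike plain functors), but don’t allow using results from prior computations in the definition of subsequent ones (unlike monads).”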
If that definition makes you want to run away screaming in the other direction, like it did for me, the good news is that all the category theory is just a nice way to formalize what at its core is a very simple concept that you’re probably already familiar with.
If you’ve written Javascript recently, you’ve likely written blocks of code that look like this:
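```javascript
// Kick off several fetches, then wait until all of them have resolved
// (the URLs here are illustrative).
const requests = [
  fetch('https://coolstuff.com/pages/1'),
  fetch('https://coolstuff.com/pages/2'),
  fetch('https://coolstuff.com/pages/3'),
];
const pages = await Promise.all(requests);
```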
If you think about it, Promise.all is actually quite a nifty function. It takes a bunch of values, each wrapped in its own context so we can’t get at it directly (in the case of a fetch, the value doesn’t even exist yet as we’re still waiting to receive it from the server), and combines them to yield a list of values, wrapped in a single context. That’s exactly what an applicative functor enables! (Promise is technically not an example of an applicative functor, but the distinction is pretty pedantic and is only meaningful in the special case of nested promises).
Why is being able to wrap multiple values in a context and merge those contexts useful? Well, in this particular case, it allows us to block on every single value becoming available, and to do the fetches in parallel in the meantime, instead of one at a time.
At this point you might be saying “Wait, but I can just write a bunch of awaits in a row and the fetches will still execute in parallel, too. Sure, the Promise.all makes for nicer syntax and error-handling, but what’s so special about it?”
```javascript
await requests[0];
await requests[1];
await requests[2];
```
Sure, in Javascript, Promises begin executing even before their constructor returns, so writing awaits in serial doesn’t prevent parallel execution. Imagine if you had to await a Promise before its execution started, though. In that case, it’s obvious that Promise.all is indispensable -- it’s the only way to execute multiple operations in parallel. Why would we ever want to structure a system this way, instead of executing eagerly? There are two key reasons.
- Separating context construction from execution allows us to cheaply build these contexts and inspect them, without committing to execution. If context construction is cheap, but execution is expensive (as in the case of data fetching), then this allows us to perform what you might think of as static analysis (static in the context of the expensive operation, that is) -- this means that we can run a function returning a dependency-tracking applicative functor without actually fetching any dependencies, but with a guarantee that these dependencies are the ones that the post-fetch code path depends on.
- The ability to combine contexts before starting any execution allows us to make optimizations that aren’t possible without the combined context. A quick example: suppose our executor was intelligent enough to know that coolstuff.com has a batch API, too, and could make a single fetch to coolstuff.com/batch?pages=1,2,3 -- this sort of optimization is only possible with the shared context that the three fetches are to be made in parallel.
An applicative functor is defined by two abilities: first, the ability to put a value (or a function) into a context that simply yields the original value/function, and second, the ability to combine multiple values in multiple contexts into a single value in a single context, without ever taking any of the values out of their individual contexts. Coming back to our dependency-tracking problem, we can think of a block of code like our average_salary_in_location from before,
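```python
# Our hypothetical field again:
def average_salary_in_location(location_id):
    employees = Employee.objects(currentJob__location=location_id)
    salaries = [employee.currentJob.salary for employee in employees]
    return sum(salaries) / len(salaries) if salaries else None
```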
as conceptually two separate functions:
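```python
# 1. A description of the data we need -- constructing the query, not running it:
def average_salary_query(location_id):
    return Employee.objects(currentJob__location=location_id)

# 2. A pure computation over the fetched results:
def compute_average_salary(employees):
    salaries = [employee.currentJob.salary for employee in employees]
    return sum(salaries) / len(salaries) if salaries else None
```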
Notably, average_salary_in_location is just the function composition of these two subfunctions.
Our applicative functor itself is quite trivial -- the batching/optimization framework that it feeds into is where most of the complexity lies. Here’s a sketch of its shape (simplified; the names and details below are assumptions, not our exact implementation):
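```python
# A simplified sketch -- the real class has considerably more plumbing.
class DeferredComputation:
    def __init__(self, queries, on_results):
        self.queries = queries        # the data we need, described but not fetched
        self.on_results = on_results  # a pure function over the fetched results

    @staticmethod
    def pure(value):
        # Wrap a plain value in a context that needs no data.
        return DeferredComputation([], lambda results: value)

    @staticmethod
    def gather(computations):
        # Combine many contexts into one by concatenating their query lists --
        # the Promise.all analogue, and the hook for batching and deduplication.
        queries = [q for c in computations for q in c.queries]

        def on_results(results):
            outputs, offset = [], 0
            for c in computations:
                n = len(c.queries)
                outputs.append(c.on_results(results[offset:offset + n]))
                offset += n
            return outputs

        return DeferredComputation(queries, on_results)
```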
The key idea is that we want to decouple every computed field computation into two steps: data fetching, and computations on the fetched data. In between these, the framework should actually fetch the data and feed it back into the method to be operated on. We can implement this with generators in Python, under the constraint that each generator performs exactly one yield. A simplified sketch of the execution loop (fetch_all here is a stand-in for the real batching executor):
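```python
# Simplified sketch -- fetch_all stands in for the real batching executor.
def evaluate(generator_fn, *args):
    gen = generator_fn(*args)
    deferred = next(gen)                   # the single yield: a DeferredComputation
    results = fetch_all(deferred.queries)  # the framework fetches (and batches) these
    try:
        gen.send(deferred.on_results(results))  # resume the generator with the data
    except StopIteration as stop:
        return stop.value                  # the generator's return value is the field
```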
Let’s take a look at our original isPayrollReady field, now implemented in this framework (a simplified sketch -- the real field checks many more conditions, and the decorator name is illustrative):
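```python
@computed_field
def is_payroll_ready(entry):
    # Declare the queries we depend on up front; the framework fetches them.
    employees, dd_reasons = yield DeferredComputation.gather([
        Employee.deferred_objects(id=entry.employee),
        RWCPayrollSettings.deferred_not_ready_reasons_for_direct_deposit(entry),
    ])
    if dd_reasons:
        return False
    employee = employees[0]
    # A plain foreign-key traversal -- tracked automatically via a proxy object.
    if not employee.currentJob.workLocation.address.fullAddress:
        return False
    return True
```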
Note the use of the deferred_objects method in place of the objects call. Instead of performing a query, deferred_objects merely returns a representation of the query without evaluating it. The framework gets the query from the yield, fetches the corresponding data, and passes it back into the generator, setting the appropriate variable to the result of the query. Because the framework now has an explicit understanding of the dependencies of the field, it can handle all the metadata tracking and storage necessary to enable the cache validity and refresh algorithms described above. If multiple queries are being performed, the framework can also batch and optimize queries for maximum performance, without the programmer having to worry about the details of these optimizations -- DeferredComputation.gather is like Promise.all, except the underlying execution engine does behind-the-scenes batching and optimization.
Notice that we’ve replaced our original call to RWCPayrollSettings.calculate_not_ready_reasons_for_direct_deposit with a call to RWCPayrollSettings.deferred_not_ready_reasons_for_direct_deposit.
This is a reimplementation of the logic of the original function, but in our deferred field framework -- specifying dependencies up front doesn’t mean abandoning code reuse! As long as the data-fetching functions we call are also written in this framework, we can call them and still get the same guarantees.
Notice, too, that we didn’t modify the code checking WorkLocation addresses at all. Since this is a simple foreign-key lookup from currentJob, our framework is able to automatically track the dependency without it needing to be specified up front -- this is implemented via a proxy object that tracks field accesses.
What does this mean? If a work location is updated with a new address, or an employee adds a bank account, isPayrollReady will be updated immediately -- and we can delete all the post_save hooks that were previously littering our codebase! If this calculation needs to be updated to take into account another model, it can just fetch the corresponding data and use it: the new dependency will be tracked without any additional developer effort.
It can take a while to get used to writing code in this two-phase style, but we’ve found that it’s worth it: it’s a huge relief to not have to worry about freshness or query batching or any of the concerns we used to have to think about when handling these cached query-dependent computations. From the product perspective, it’s allowed us to expose much more configurability to the end-user in the form of embedded rules engines, with framework-level optimization of the rules they write to ensure maximum performance, and it’s allowed us internally to write cleaner data-fetching code while eliminating a whole class of cache invalidation issues.
If you’re interested in working on problems like this and building infrastructure to modernize traditionally painful HR and business processes, check out our jobs page! We’re hiring for engineering roles in our San Francisco and Bangalore offices.