The code behind Rippling's no-code workflow automation tools

Dilanka Dharmasena, Engineering Manager | Feb 2, 2022

For those engineers who work in a larger organization with a separate HR department or where managers handle HR tasks, allow me to illustrate the behind-the-scenes process of doing something as simple as changing an employee’s address.

Step 1: An employee changes their address

Step 2: An HR representative sees the change, sighs, and bangs their head against a desk

Step 3: They proceed to change the employee’s address in all of the relevant business systems and send out emails across the company to all stakeholders in order to comply with policies or change stuff in other systems

The exact process may vary, but imagine how many thousands of hours are wasted across your organization for simple changes like these. And that is assuming everything goes as planned—what happens when no one sees the change or someone misses one of those other emails?

“Why isn’t all this automated?” you ask; great question!

You can probably guess where I am going with this: We’ve built a product at Rippling to eliminate the busy work of administering a company while being customizable enough to adapt to the many permutations of systems and policies seen across organizations.

With Workflow Automator, organizations can, for example:

  • Configure a Slack message to be sent to a manager who attempts to hire someone into the engineering department with a salary below a certain threshold
  • Send a reminder via email to an engineer with more than 5 open pull requests

There are countless possibilities, and you can read more about Workflow Automator here, but today, we’re going to be taking a look at one of the underlying systems that makes this a reality. Several major in-house technologies, including our new scripting language, went into building Workflow Automator; here we’ll be highlighting the workflow engine that glues everything together.

First, we’ll define what a workflow is, and then we’ll take a look at what the workflow engine actually does and some goals it sought to address before wrapping up with a brief discussion on its implementation.

So, good news: a workflow is pretty much what you are thinking of right now. It is a series of events that needs to be run as a result of some launch trigger, whether that be some new data point matching a query or someone clicking a “start” button on a webpage. In the context of Workflow Automator, it is the former: something like “if an employee’s salary changes to above $60,000, send an email to the finance department.” In the most general case, a workflow is an “if, then” series of events, and the workflow engine is a framework that 1) monitors the state of the “if” condition and 2) manages the execution of any follow-up events while facilitating the transfer of data between these events.

There are two types of events in a workflow:

  • Trigger: an event that launches a workflow
  • Action: an operation that is performed after a trigger fires (the canonical workflow step)

In most cases, the sequence and dependency tree of events is structured as a directed acyclic graph (DAG). I say “most cases” because you can nest workflows, which breaks the traditional definition, but that is an implementation detail beyond the scope of this discussion.
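To make the DAG idea concrete, here is a minimal Python sketch of a workflow as one trigger plus a set of actions, each listing the events it depends on. The `Workflow` class, its method names, and the event names are hypothetical and chosen for illustration only; they are not Rippling's actual data model. The standard library's `graphlib` handles ordering and cycle detection.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class Workflow:
    """One trigger plus actions, stored as action -> set of upstream events."""
    trigger: str
    upstream: dict[str, set[str]] = field(default_factory=dict)

    def add_action(self, name: str, depends_on: set[str]) -> None:
        self.upstream[name] = set(depends_on)

    def execution_order(self) -> list[str]:
        """Return one valid ordering; raises CycleError if the graph is cyclic."""
        graph = {self.trigger: set(), **self.upstream}
        return list(TopologicalSorter(graph).static_order())

    def downstream_of(self, event: str) -> set[str]:
        """Actions that directly depend on the given event."""
        return {name for name, deps in self.upstream.items() if event in deps}


wf = Workflow(trigger="address_changed")
wf.add_action("send_slack", {"address_changed"})
wf.add_action("send_email", {"address_changed"})
wf.add_action("create_ticket", {"send_slack", "send_email"})
order = wf.execution_order()  # trigger first, create_ticket last
```

Storing each action's upstream set, rather than a flat list of steps, is what lets an engine answer "what is downstream of this event?" when an event reports back.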

One important detail we have not mentioned yet follows from the fact that actions have dynamic dependencies. For example, if you want to send a Slack message, you need to know where to send it, and to make this workflow even half useful, that recipient is probably related to something in the trigger. So at runtime, you need to be able to pass data through this DAG.

This is where the concept of contexts comes in. A context is a blob of data that is generated by a trigger or action when it executes. In an earlier example, when the trigger “an employee changes their address” fires, we know the ID of the person in question. That ID lives inside the trigger’s context and can be used by any action downstream of it. So if a downstream action sends a Slack message, it can pull that ID from the upstream context and use it to determine its recipients. These contexts essentially form a parallel DAG at runtime.
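As a rough illustration of contexts, the sketch below treats every event as a function that receives the contexts of its upstream events and returns its own. All names here (`run_job`, `address_changed_trigger`, `send_slack_action`, the `employee_id` field) are assumptions made for the example, and the Slack action only records what it would do instead of calling a real API.

```python
from typing import Any, Callable

Context = dict[str, Any]
# An event takes its upstream contexts (keyed by event name) and returns its own.
Event = Callable[[dict[str, Context]], Context]


def address_changed_trigger(_: dict[str, Context]) -> Context:
    # Hypothetical payload: the trigger knows which employee changed.
    return {"employee_id": "emp_123"}


def send_slack_action(upstream: dict[str, Context]) -> Context:
    employee_id = upstream["address_changed"]["employee_id"]
    # A real action would call Slack here; this sketch only records the intent.
    return {"recipient": f"manager-of:{employee_id}", "status": "success"}


def run_job(events: list[tuple[str, Event, list[str]]]) -> dict[str, Context]:
    """Run events in the given (already topologically sorted) order."""
    contexts: dict[str, Context] = {}
    for name, event, upstream_names in events:
        visible = {u: contexts[u] for u in upstream_names}
        contexts[name] = event(visible)
    return contexts


contexts = run_job([
    ("address_changed", address_changed_trigger, []),
    ("send_slack", send_slack_action, ["address_changed"]),
])
```

The resulting `contexts` mapping mirrors the shape of the workflow itself, which is one way to read the "parallel DAG" described above.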

For the sake of common terminology, we will refer to a single run of a workflow as a job going forward.

Touching on goals, our workflow engine must allow engineers and users to swap events in and out, so, for example, if someone wants to create a calendar event instead of a JIRA ticket, they can do that. As a framework-level system, we also wanted an event built for one product (e.g. send an email in Workflow Automator) to be usable by some other product that may eventually get built. This is perhaps a given, but we also want to minimize or eliminate the possibility that errors outside of the event’s own business logic cause the workflow to fail, with infrastructure issues chief among them. More specific to Rippling, this library needs to be infrastructure-agnostic and able to adapt as we grow. Currently we run on our in-house task runner, but that may change as we scale.

Now that we’ve discussed some of our high-level goals, we’ll briefly highlight some important implementation decisions we made without getting too into the weeds.

For starters, our workflow engine processes workflows in 6 stages:

Stage 1: Listen for triggers

Our users have defined some conditions under which to trigger a workflow. We register these conditions and listen for matching events or data changes. If a match is detected, we spool up stage 2 and pass in the event context.

Stage 2: Trigger state updates + anomaly detection

When we detect a triggering event, we want to do two things. First, we want to dynamically update the state of our trigger to adjust triggering conditions (e.g. prevent duplicate triggers on the same employee). Second, we want to ensure that this triggering event is valid before progressing to the next stage (e.g. filter out a torrent of workflows caused by a bad migration somewhere else in the company).
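The two checks in this stage can be sketched as a small gate in front of job creation. This is a simplified illustration under assumed rules, not the actual detection logic: duplicates are keyed on a hypothetical `(workflow_id, employee_id)` pair, and an "anomaly" is just more than `max_events` admitted triggers inside a sliding time window.

```python
from collections import deque


class TriggerGate:
    """Rejects duplicate triggers and bursts that look like a bad migration."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.seen_keys: set[tuple[str, str]] = set()
        self.recent: deque[float] = deque()  # timestamps of admitted triggers

    def admit(self, workflow_id: str, employee_id: str, now: float) -> bool:
        key = (workflow_id, employee_id)
        # Trigger state check: this workflow already fired for this employee.
        if key in self.seen_keys:
            return False
        # Forget admitted triggers that have fallen out of the window.
        while self.recent and now - self.recent[0] >= self.window_seconds:
            self.recent.popleft()
        # Anomaly check: too many triggers in the window, so drop this one.
        if len(self.recent) >= self.max_events:
            return False
        self.seen_keys.add(key)
        self.recent.append(now)
        return True


gate = TriggerGate(max_events=2, window_seconds=60)
results = [
    gate.admit("wf", "e1", now=0.0),   # admitted
    gate.admit("wf", "e1", now=1.0),   # duplicate, rejected
    gate.admit("wf", "e2", now=2.0),   # admitted
    gate.admit("wf", "e3", now=3.0),   # burst, rejected
    gate.admit("wf", "e3", now=61.0),  # window has moved on, admitted
]
```

Passing `now` in explicitly keeps the gate deterministic and easy to test; a production version would presumably persist this state instead of holding it in memory.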

Stage 3: Launch new job

For triggering events deemed valid, we want to create a new job, or workflow run, for each of them; that happens in this stage.

Stage 4: Update job state

Our trigger (from stage 3) or action (from stage 6) reports back to the workflow engine to update the status of the current job and determine which actions are downstream of it.

Stage 5: Determine ready state of next action

For pending actions downstream of the event that reported back in stage 4, we evaluate whether they are ready to execute. This is most relevant for actions that should only execute if upstream events completed with a certain status (e.g. success or failed) or completed with a certain output (e.g. created a calendar event for more than 10 people). For each action that is deemed ready, we move on to the next stage.

Accounting for concurrency is particularly important here as there may be multiple upstream actions reporting back simultaneously.
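A minimal sketch of this readiness check follows, assuming each action declares which statuses it accepts from each upstream event. The names `Requirement`, `Readiness`, `evaluate`, and `Launcher` are hypothetical. The in-process lock stands in for whatever atomic claim a real multi-worker system would use (for example, a conditional database write); it only illustrates why two upstream actions reporting back at once must not both launch the same downstream action.

```python
import threading
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    SUCCESS = "success"
    FAILED = "failed"


class Readiness(Enum):
    WAITING = "waiting"   # some upstream event has not finished yet
    READY = "ready"       # every upstream event finished with an accepted status
    SKIPPED = "skipped"   # an upstream event finished with a status we do not accept


@dataclass(frozen=True)
class Requirement:
    upstream: str
    allowed: frozenset[Status]


def evaluate(requirements: list[Requirement], statuses: dict[str, Status]) -> Readiness:
    waiting = False
    for req in requirements:
        status = statuses.get(req.upstream, Status.PENDING)
        if status is Status.PENDING:
            waiting = True
        elif status not in req.allowed:
            return Readiness.SKIPPED
    return Readiness.WAITING if waiting else Readiness.READY


class Launcher:
    """Ensures a ready action is claimed by exactly one reporter."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._claimed: set[str] = set()

    def try_claim(self, action: str) -> bool:
        with self._lock:
            if action in self._claimed:
                return False
            self._claimed.add(action)
            return True


reqs = [
    Requirement("send_slack", frozenset({Status.SUCCESS})),
    Requirement("send_email", frozenset({Status.SUCCESS, Status.FAILED})),
]
launcher = Launcher()
```

With these requirements, the action waits until both upstream events finish, runs if Slack succeeded regardless of the email outcome, and is skipped if Slack failed.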

Stage 6: Perform the action

Simple enough, this is where we send an email, trigger a webhook, etc. Some actions, like outgoing messages via Slack, require us to rate limit our third-party API requests. In these cases, stage 6 sends these messages to a rate limiting queue and a listener waits for them to be processed before continuing on. When this stage succeeds or fails, we loop back to stage 4 to update the job.
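One simple way to picture the rate limiting queue is a queue that releases at most one request per fixed interval and reports a status back when each request finishes. The sketch below is an assumption about the general shape, not the real mechanism: `RateLimitedQueue`, `submit`, and `pump` are invented names, and the `on_done` callback plays the role of the listener that hands the result back to stage 4.

```python
import time
from collections import deque
from typing import Callable


class RateLimitedQueue:
    """Releases at most one queued request per `min_interval` seconds."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = min_interval
        self.clock = clock
        self._pending: deque[tuple[Callable[[], object], Callable[[str], None]]] = deque()
        self._next_allowed = float("-inf")

    def submit(self, request: Callable[[], object],
               on_done: Callable[[str], None]) -> None:
        self._pending.append((request, on_done))

    def pump(self) -> int:
        """Send whatever the limit allows right now; return how many were sent."""
        sent = 0
        while self._pending and self.clock() >= self._next_allowed:
            request, on_done = self._pending.popleft()
            self._next_allowed = self.clock() + self.min_interval
            try:
                request()
                status = "success"
            except Exception:
                status = "failed"
            on_done(status)  # the listener reports the outcome back to the engine
            sent += 1
        return sent


now = [0.0]    # fake clock so the example is deterministic
reported = []  # statuses handed back to the job
queue = RateLimitedQueue(min_interval=1.0, clock=lambda: now[0])
queue.submit(lambda: None, reported.append)
queue.submit(lambda: 1 / 0, reported.append)  # simulated failing request
first = queue.pump()
now[0] = 1.0
second = queue.pump()
```

Both outcomes reach the callback, which matches the description above: success and failure alike loop back to update the job.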

You can think of each stage as a function that takes as input the configuration of the workflow and the state of the job (what event triggered it, the status of upstream actions, etc.). These stages are parallelized, so a given job may be executing multiple actions simultaneously.

Now, how do we ensure that workflows progress through each stage even if transient and unexpected errors arise with the underlying infrastructure or external APIs? This is where our system of tickets and handlers comes in. Before each stage is run, a ‘ticket’ must be created for it. A ticket is simply an entry in the database that indicates that the stage must be run. This persistence is important as it maintains our workflow’s state through external disruptions. When a worker pops this ticket off the queue, before the ticket is marked as complete, a ‘handler’ is created. This is basically the same thing as a ticket, but intra-stage as opposed to the inter-stage ticket. The stage executes while the handler is marked as in progress, and before the handler is marked as complete, we create a ticket for the following stage. Obviously, errors during handoff can result in duplicates, so deduplication is critical.

These persisted tickets and handlers allow our recovery system to detect and restart the workflow after external disruptions at a partially complete state, meaning that an entire run of a workflow doesn’t have to be restarted due to a timed-out API call.
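The handoff order described above (handler created before the ticket completes, next ticket created before the handler completes) can be sketched with an in-memory stand-in for the database. Everything here is illustrative: `Store`, `process`, `interrupted`, and the stage names are hypothetical, and deduplication is reduced to "one ticket per (job, stage) pair".

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Key = tuple[str, str]  # (job_id, stage)


@dataclass
class Store:
    """In-memory stand-in for the persisted ticket and handler tables."""
    tickets: dict[Key, str] = field(default_factory=dict)   # "open" | "complete"
    handlers: dict[Key, str] = field(default_factory=dict)  # "in_progress" | "complete"

    def create_ticket(self, job_id: str, stage: str) -> bool:
        """Idempotent: a duplicate ticket for the same (job, stage) is ignored."""
        key = (job_id, stage)
        if key in self.tickets:
            return False
        self.tickets[key] = "open"
        return True

    def interrupted(self) -> list[Key]:
        """Stages that started but never finished, e.g. after a crash."""
        return [k for k, state in self.handlers.items() if state == "in_progress"]


def process(store: Store, job_id: str, stage: str,
            run_stage: Callable[[str], None], next_stage: Optional[str]) -> bool:
    key = (job_id, stage)
    if store.handlers.get(key) == "complete":
        return False  # already done; a duplicate delivery is a no-op
    store.handlers.setdefault(key, "in_progress")  # handler exists before ticket completes
    store.tickets[key] = "complete"
    run_stage(job_id)  # may raise; the handler then stays "in_progress"
    if next_stage is not None:
        store.create_ticket(job_id, next_stage)    # next ticket exists before handler completes
    store.handlers[key] = "complete"
    return True


store = Store()
store.create_ticket("job-1", "launch_job")
attempts = []


def flaky_stage(job_id: str) -> None:
    attempts.append(job_id)
    if len(attempts) == 1:
        raise TimeoutError("simulated timed-out API call")


try:
    process(store, "job-1", "launch_job", flaky_stage, "update_job_state")
except TimeoutError:
    pass

# Recovery pass: re-run only the stage that was left in progress.
for job_id, stage in store.interrupted():
    process(store, job_id, stage, flaky_stage, "update_job_state")
```

Because at every moment either an open ticket or an in-progress handler exists for unfinished work, a recovery pass always has something to find. The cost is that a stage may run more than once, which is why deduplication matters.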

Tracing these design decisions back to our initial product goals, we wanted to provide users with a configurable, but reliable, automation framework that eliminates the busy work and stress of day-to-day administration. As we continue to run this system in production at greater and greater scale, we will share our learnings, but for now, be on the lookout for articles highlighting other technical underpinnings of our Unity launch.

As always, if you’re interested in building systems like this and in tackling new product challenges with a fun team, take a look at our careers page here!