7 Lessons from building our RADIUS server in the cloud
Karthik Bhuvaneswaran — Nov 18, 2020
What is RADIUS?
Remote Authentication Dial-In User Service (RADIUS) is a networking protocol that provides centralized Authentication, Authorization, and Accounting (AAA or Triple A) management for users who connect and use a network service.
You’re probably familiar with WiFi authentication protocols like WPA-Personal that require sharing a single password to access a WiFi network. These protocols are convenient for home WiFi use, but they’re dangerously insecure for accessing an office WiFi network or VPN.
Why? Consider these questions:
- How do you ensure the single password is not shared across insecure channels like emails / chat / common company whiteboards?
- When an employee is terminated, what prevents them from keeping the shared password after they leave?
- If a single employee’s password needs to be rotated, how do you do this without rotating the password for everyone in the office?
WPA-Enterprise solves these issues by allowing different passwords for each employee. When an employee connects to a WiFi network using WPA-Enterprise, the WiFi router connects to a RADIUS authentication server to validate the user’s individual credentials. RADIUS is a fairly old protocol that runs directly over UDP, unlike more modern SSO standards such as SAML, which run over HTTP.
Most companies can’t or don’t want to host their own RADIUS authentication server. Rippling hosts a cloud-based RADIUS authentication server for our customers, so that employees can log in to their WiFi with the same Rippling credentials they use for other applications. And when an employee is terminated in Rippling, they immediately and automatically lose access to the WiFi network, along with all other sensitive accounts managed by Rippling.
At Rippling, we’ve built a redundant and highly available infrastructure for RADIUS that turned out to be a little different than building redundancy for traditional web applications. In this post we’ll talk about the lessons we learned building redundancy for our RADIUS authentication system.
Project Goal & Lessons
We had supported RADIUS for routers and Virtual Private Networks (VPNs) for the previous two years, and during that time two dedicated Amazon Elastic Compute Cloud (EC2) instances running RADIUS servers were more than enough for the ingress traffic. Because we are growing quickly, we started a project with two goals in mind:
- Auto scaling – RADIUS servers should be able to automatically scale during peak traffic load
- Auto healing – RADIUS servers should be able to heal automatically by fixing unhealthy servers
These are crucial for any production server, and especially in our case, because downtime in the RADIUS server could leave our customers unable to use their WiFi networks.
We achieved both goals, and with them high availability for our RADIUS servers. We learned a lot along the way and would like to highlight a few lessons.
Lesson 1: In this case, it’s better to build than to buy.
In the current era, no one would build their own AWS; everyone remembers how painful on-premises infrastructure maintenance was before AWS. In the same spirit, we started looking for a RADIUS-as-a-service provider who could partner with us to provide high reliability at a reasonable price. To our surprise, we found the same challenges across all RADIUS-as-a-service providers:
- They run on premises and don’t have the same ease of scaling that cloud-native providers do.
- This in turn leads to higher per-user pricing.
- RADIUS-as-a-service providers didn’t provide the customization functionality that our customers need for more advanced authentication mechanisms.
In general we believe the bar for choosing to build over buy should be high. In this case, since RADIUS is part of our core identity management infrastructure, we decided to build the system in-house to provide the customization, scalability, and pricing that our customers expect.
Lesson 2: Containerize builds for easy development, testing and production deployment
When we started adding support for RADIUS servers, we didn’t want to reinvent the wheel, so we built on top of proven Linux libraries. This significantly accelerated our RADIUS implementation, thanks to the open source world! We initially started with Vagrant + Ubuntu for our development. Unfortunately, Ubuntu version differences between development and production created several inconsistencies during deployment.
So, we moved the environment from Vagrant to Docker containers, with one Dockerfile for development, testing and production. This completely removed the “but it works on my machine!” scenario. See our earlier post for more advantages of containerizing your builds.
Lesson 3: Use NLB and sidecar containers for instance health checks
Load balancers play a crucial role in distributed systems. Amazon Application Load Balancers (ALB) support only HTTP and HTTPS traffic, which covers the vast majority of services hosted in AWS. This wouldn’t work for us because RADIUS uses UDP and we also need to support IP-based load balancing. Network Load Balancers (NLB) serve this purpose by load balancing at the Transport Layer rather than the Application Layer that ALB uses.
Because we required UDP support, NLB was the obvious choice over ALB. There was, however, an odd limitation with NLB health checks: the NLB would serve traffic over UDP, but health checks had to happen over TCP. As a result, we had no choice but to run 2 processes in the container:
- Main traffic – a RADIUS server process listening on port 1812 over UDP
- Health check – a Python TCP health check wrapper listening on port 1814 over TCP
Then another interesting problem appeared. With the ECS auto scaling and load balancer integration, NLB can run health checks either on a fixed port or on the traffic port. Auto scaling places multiple containers on a machine and registers each one with the load balancer on a dynamic traffic port. Because our health check runs on a separate port, and that port cannot be dynamic, we were restricted to one container per EC2 instance.
We had three options to proceed further:
- Use a static host port for both UDP traffic and TCP health checks, but with limited scalability of 1 task per instance.
This would lead to underutilization of the instance, so we discarded this option. The majority of our services use an HTTP interface, and the RADIUS server is our only UDP service, with a maximum of 6 containers. Since the EC2 machines are shared across services, 1 task per instance would have worked, but we were looking for a more scalable approach to support us in the future.
- Run a separate daemon service on the same cluster instances which would serve health check requests.
- Both custom and other service containers can be placed on the same hosts so that at least health checks are in sync.
We did not love this option because the daemon service checked health status for the instances, not the containers. If an instance had no RADIUS container, a health check daemon running on it would be meaningless. More granular health checks at the container level were a better choice than instance-level checks.
- Use a sidecar container approach along with AWS VPC networking mode where we get a unique IP per container. So each machine will have multiple containers with different internal IPs.
- The number of tasks per machine is now limited by the number of ENIs (Elastic Network Interfaces) the machine can support. This limit can be raised further with ENI trunking.
- Because the IP is specific to the task, we can have static port mapping as well as multiple tasks per machine.
- Have a single ECS task with 2 containers listening on different protocols and ports to help manage traffic on UDP and health check on TCP separately
- RADIUS server container exposed on port 1812/UDP
- Python health check container exposed on port 1814/TCP, for health checks only
- UDP load balancing with Network Load Balancer
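As a rough sketch of this third option, the task definition and target group settings might look like the following, written as Python dictionaries in the shape that boto3’s `register_task_definition` and `create_target_group` calls accept. Only the ports, protocols, and the awsvpc network mode come from the list above; the family, container names, images, memory sizes, and VPC ID are placeholders, not Rippling’s actual values.

```python
# Hypothetical ECS task definition: one task, two containers, one task-level IP.
TASK_DEFINITION = {
    "family": "radius-server",                  # placeholder name
    "networkMode": "awsvpc",                    # each task gets its own ENI and IP
    "requiresCompatibilities": ["EC2"],
    "containerDefinitions": [
        {
            "name": "radius",                   # placeholder name
            "image": "example/radius:latest",   # placeholder image
            "essential": True,
            "memory": 256,                      # placeholder size (MiB)
            "portMappings": [{"containerPort": 1812, "protocol": "udp"}],
        },
        {
            "name": "healthcheck",              # sidecar, placeholder name
            "image": "example/radius-healthcheck:latest",
            "essential": True,
            "memory": 128,
            "portMappings": [{"containerPort": 1814, "protocol": "tcp"}],
        },
    ],
}

# Hypothetical NLB target group: UDP traffic, TCP health check on the sidecar port.
TARGET_GROUP = {
    "Name": "radius-udp",                       # placeholder name
    "Protocol": "UDP",
    "Port": 1812,
    "VpcId": "vpc-PLACEHOLDER",
    "TargetType": "ip",                         # awsvpc tasks are registered by IP
    "HealthCheckProtocol": "TCP",
    "HealthCheckPort": "1814",
}
```

Because every task has its own IP address, both ports can stay static even when several tasks share one EC2 instance.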
The sidecar approach worked well. It is a commonly used pattern when the main app and the health check run as two separate processes.
This helps us check if the instance is up, but doesn’t provide a guarantee that the RADIUS requests are being processed properly. So this alone doesn’t provide confidence when we do auto recovery and auto scaling.
Lesson 4: Use a custom health check endpoint to verify the full RADIUS stack is functioning
There are several ways to implement health checks for a given server.
At a minimum, we should detect that the instance is up and the server process is running. Better, we could validate that other systems can communicate with the endpoint and get valid content back.
Best of all is to ensure that the server can communicate with all of its downstream systems and return the correct ACCEPT code for a valid user in the system.
To solve this problem, we wrote a thin Python wrapper that opens a TCP port for the NLB health check and runs an end-to-end RADIUS test over UDP with several retries. ECS waits 10 seconds for a successful health check response, which is ample time to run a consistent health check in Python with up to 6 retries. To ride out intermittent issues, we also configured ECS to drain the container only after three consecutive health checks fail. The health check frequency for the ECS container was set to 30 seconds.
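The end-to-end UDP test described above could be sketched roughly as follows. This is not Rippling’s implementation: it assumes a simple PAP-style Access-Request built by hand from the RFC 2865 packet layout, a dedicated test user, and a 1.5 second timeout per attempt so that 6 attempts fit inside the 10 second window. The function names and parameters are illustrative.

```python
import hashlib
import os
import socket
import struct

ACCESS_REQUEST = 1  # RADIUS packet codes from RFC 2865
ACCESS_ACCEPT = 2


def hide_password(password: bytes, secret: bytes, authenticator: bytes) -> bytes:
    """Obfuscate a User-Password value as described in RFC 2865, section 5.2."""
    padded = password + b"\x00" * (-len(password) % 16) or b"\x00" * 16
    hidden, previous = b"", authenticator
    for i in range(0, len(padded), 16):
        digest = hashlib.md5(secret + previous).digest()
        previous = bytes(p ^ d for p, d in zip(padded[i:i + 16], digest))
        hidden += previous
    return hidden


def attribute(attr_type: int, value: bytes) -> bytes:
    """Encode one attribute as type, length (including the 2-byte header), value."""
    return struct.pack("!BB", attr_type, len(value) + 2) + value


def build_access_request(identifier: int, username: str, password: str,
                         secret: bytes, authenticator: bytes) -> bytes:
    """Build an Access-Request carrying User-Name (1) and User-Password (2)."""
    attrs = attribute(1, username.encode()) + attribute(
        2, hide_password(password.encode(), secret, authenticator))
    header = struct.pack("!BBH", ACCESS_REQUEST, identifier, 20 + len(attrs))
    return header + authenticator + attrs


def radius_is_healthy(host: str, port: int, username: str, password: str,
                      secret: bytes, retries: int = 6, timeout: float = 1.5) -> bool:
    """Return True only if the server answers a test login with Access-Accept."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for attempt in range(retries):
            identifier = attempt % 256
            packet = build_access_request(
                identifier, username, password, secret, os.urandom(16))
            try:
                sock.sendto(packet, (host, port))
                reply, _addr = sock.recvfrom(4096)
            except OSError:  # includes timeouts: retry with a fresh packet
                continue
            if len(reply) >= 20 and reply[1] == identifier:
                return reply[0] == ACCESS_ACCEPT
    return False
```

A production probe would also want to verify the Response Authenticator on the reply, and some servers require a Message-Authenticator attribute on requests; both are left out here to keep the sketch short.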
Here is an excerpt from the Python TCP wrapper:
    import os
    import socket
    import subprocess
    import sys

    RECHECK_INTERVAL_SECONDS = 10

    # Create a TCP/IP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_name = "0.0.0.0"  # Modify this if you want to bind only on a specific IP address
    server_address = (server_name, int(os.environ["TCP_HEALTH_CHECK_PORT"]))
    print("starting up on %s port %s" % server_address, file=sys.stdout)
    # ...
    # If the RADIUS service is not available, let the health check fail by not listening for 10 seconds
    child = subprocess.Popen(
        # ...
        "1812",  # 1812 is the standard RADIUS port
        # ...
    )
    (output, err) = child.communicate()
    # Meanwhile, check the health status again after 10 seconds to see whether the RADIUS server has recovered
    # If the service does not recover on successive tries:
    # * Then consecutive health checks will fail
    # * Current instance will be de-registered and drained
    if child.returncode != 0:
        ...
    print("waiting for a connection", file=sys.stdout)
    connection, client_address = sock.accept()
    print("client connected:", client_address, file=sys.stdout)
    data = bytes("Server Alive", "utf-8")
    # ...
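The excerpt above omits several steps, so here is a minimal, self-contained sketch of the same idea under stated assumptions. The `probe` callable stands in for the end-to-end RADIUS check, and closing the listening socket while the probe fails (so the load balancer’s TCP connect is refused) is an assumption about how “not listening” might be implemented. The function name and the `max_cycles` parameter are hypothetical; `max_cycles` exists only so the loop can be stopped in a test.

```python
import socket
import time
from typing import Callable, Optional


def serve_health_gate(port: int, probe: Callable[[], bool],
                      recheck_seconds: float = 10.0,
                      max_cycles: Optional[int] = None,
                      host: str = "0.0.0.0") -> None:
    """Accept TCP health checks only while probe() reports a healthy server."""
    listener = None
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        cycles += 1
        if not probe():
            # Unhealthy: stop listening so TCP health checks fail, then re-check.
            if listener is not None:
                listener.close()
                listener = None
            time.sleep(recheck_seconds)
            continue
        if listener is None:
            listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            listener.bind((host, port))
            listener.listen(5)
            listener.settimeout(recheck_seconds)
        try:
            connection, _client = listener.accept()
        except socket.timeout:
            continue  # nobody connected in this window: run the probe again
        with connection:
            connection.sendall(b"Server Alive")
    if listener is not None:
        listener.close()
```

In a sidecar container this might be started as `serve_health_gate(int(os.environ["TCP_HEALTH_CHECK_PORT"]), probe)`, with `probe` wrapping whatever RADIUS test command or function is available.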
Lesson 5: Always keep the application servers stateless
In order to scale the servers, we had to make sure the RADIUS servers were stateless and didn’t store any data at rest. (Other RADIUS-as-a-service products do store state.) All user-level authentication data is passed to RADIUS through a secure encrypted tunnel and is held only in transit, for as long as required. This made our containers very thin, stateless, and dataless. This, in turn, increased security and made it easy to replicate the containers during peak load.
Lesson 6: Use Elastic IPs in addition to A records
High-end routers like Cisco Meraki and Ubiquiti UniFi support domain names, so you can configure them to connect to “radius1.rippling.com” and they’ll do a DNS lookup and connect to the right servers. But many lower-end routers don’t support domain names and need to be configured with a specific IP address. Always reserve Elastic IPs, which don’t change across restarts or re-provisioning, because these values are configured in routers and are painful for customers to change.
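One way to pin those addresses, assuming the Elastic IPs are attached directly to the Network Load Balancer, is to pass one allocation per subnet when the load balancer is created. The dictionary below follows the parameter shape of boto3’s `create_load_balancer` call for `elbv2`; the name, subnet IDs, and allocation IDs are placeholders.

```python
# Hypothetical NLB parameters: one reserved Elastic IP per subnet.
LOAD_BALANCER = {
    "Name": "radius1",                      # placeholder name
    "Type": "network",
    "Scheme": "internet-facing",
    "SubnetMappings": [
        {"SubnetId": "subnet-PLACEHOLDER-A", "AllocationId": "eipalloc-PLACEHOLDER-A"},
        {"SubnetId": "subnet-PLACEHOLDER-B", "AllocationId": "eipalloc-PLACEHOLDER-B"},
    ],
}
```

With a setup like this, the A record for the domain name and the IP addresses typed into lower-end routers point at the same fixed addresses.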
Lesson 7: Provide secondary domains for higher availability
We implemented auto recovery and auto scaling in ECS. This reduces downtime without any human intervention.
As Murphy’s law states, “If something can go wrong, it will.”
What if one instance goes down?
Auto recovery should spawn a new container and kill the unhealthy one.
What if the NLB, or all the instances under an NLB, go down simultaneously? (This is very rare.)
We added two separate Network Load Balancers (NLB) to increase availability even further.
If the routers and VPN systems support a secondary RADIUS server and have one configured, they should seamlessly fall back from radius1.rippling.com to radius2.rippling.com, resulting in zero downtime.
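Routers implement this fallback themselves, but the behavior can be illustrated with a short sketch. The function name and the injected `attempt` callable are hypothetical. It assumes that only a missing response moves the client on to the next server, while a definite accept or reject from any server is final.

```python
from typing import Callable, Sequence


def authenticate_with_failover(servers: Sequence[str],
                               attempt: Callable[[str], bool]) -> bool:
    """Try each RADIUS server in order, moving on only when one does not respond."""
    for server in servers:
        try:
            return attempt(server)  # a definite accept or reject ends the search
        except OSError:
            continue  # timeout or unreachable: fall back to the next server
    raise ConnectionError("no RADIUS server responded")
```

For example, calling it with `["radius1.rippling.com", "radius2.rippling.com"]` only contacts the secondary when the attempt against the primary raises a timeout.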
We migrated all of our existing customers to this new RADIUS infrastructure and have seen excellent uptime since then. We also have better visibility through proper health checks and have greater confidence that the system will scale even as usage grows exponentially.
If you’re interested in working on problems like this and building infrastructure to modernize traditionally painful HR and business processes, check out our jobs page! We’re hiring for engineering and infrastructure roles in our San Francisco and Bangalore offices.