Summary: Elixir has many constructs for doing work asynchronously. Learn them and use them!
When cast iron was first used in building facades, architects treated it like the prevailing material at the time: stone. Stone has relatively poor tensile strength, meaning a flat piece of stone above a window could not bear the weight of a tall building above it. Thus, engineers would use arches to convert the tensile stresses into compressive stress, which stone handles very well.
Iron, however, has much stronger tensile strength and did not need to be bent into an arch to be structurally sound. Still, because architects in the mid-19th century had been trained in techniques to work around stone's shortcomings, they continued to use these techniques even though they were no longer necessary with the new materials. It took several decades for architects and engineers to grow out of the old paradigms and use the new technologies in novel ways, and we can see examples of window arches getting flatter and more abstract.
Similarly, experienced Rails developers coming to Elixir and Phoenix try to reimplement familiar patterns that evolved to compensate for the weaknesses of the old system. In my consulting, I see many newcomers translate their experience into unidiomatic Elixir code that doesn't harness the strengths of the technology. Today we'll look at one tool Rails developers reflexively reach for, and how we might do things differently in Elixir: Background Job Queues.
Background Job Queues
Any nontrivial Rails application has a background job framework such as Sidekiq. We use this system to:
- defer work so the user isn't kept waiting, as when setting up a user account with data from external sources
- perform ongoing background work that isn't initiated by a user, such as fetching the latest value of your cryptocurrency
- speed up response times by moving nonessential work out of the user request cycle, like sending a welcome email
So what should you use instead?
Elixir has several tools available for doing work in the background; the main ones are Supervisors, GenServers, and Tasks. You'll have to interrogate your use case and weigh your options when choosing which to use. How would we use them in our examples above?
Deferring Work
In the first case, we have a hypothetical system that completes a user's account setup by connecting to two external data sources. We use the user's address to find the nearest fulfillment center, and we create a record in a separate CRM with the user's name and birthday. We then store `fulfillment_center_id` and `crm_id` keys on the user in our main database (these don't have to be local foreign keys; they're just ids to track the user across services). Here I recommend Supervised Tasks. This lets us do the jobs concurrently and restart them if they fail. We also don't care about the return values of these tasks - they perform some side effects and that's it. Let's build out our supervision tree for this scenario.
Your application should define a `Task.Supervisor` that starts when your application launches. We give it the name `YourApp.AccountSetupSupervisor`, which gives the supervisor semantic meaning for our use case, and we define a module with the same name with convenience functions for interacting with it. Your system can have many `Task.Supervisor`s for managing different types of tasks.
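As a sketch (the module names are illustrative, following the `YourApp` convention used here), the named `Task.Supervisor` slots into your application's supervision tree like this:

```elixir
defmodule YourApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # A Task.Supervisor dedicated to account-setup work, registered
      # under a semantically meaningful name.
      {Task.Supervisor, name: YourApp.AccountSetupSupervisor}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: YourApp.Supervisor)
  end
end
```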
When a user account is created, your controller or account context calls `YourApp.AccountSetupSupervisor.set_up_user_account/1` to complete the user account setup. At this point our user is already in our database. This module spawns two Tasks - one for connecting to a Fulfillment Service and one for connecting to the CRM - and adds them to its supervision tree with `Task.Supervisor.start_child/5`. We use the `restart: :transient` option to tell the supervisor to restart the Task only if it exits abnormally, i.e., some operation failed and the process crashed. By default, the supervisor will restart the process up to 3 times in 5 seconds, then give up. If the operation succeeds, the process exits normally, and life goes on.
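A minimal sketch of the convenience module might look like this; `YourApp.Fulfillment.connect/1` and `YourApp.CRM.connect/1` are assumed entry points into the service modules, and the module shares its name with the `Task.Supervisor` process:

```elixir
defmodule YourApp.AccountSetupSupervisor do
  @moduledoc """
  Convenience functions for the Task.Supervisor registered under this
  module's name in the application's supervision tree.
  """

  # YourApp.Fulfillment and YourApp.CRM are assumed to expose connect/1.
  def set_up_user_account(user) do
    for {mod, fun} <- [{YourApp.Fulfillment, :connect}, {YourApp.CRM, :connect}] do
      # restart: :transient restarts the Task only on abnormal exit;
      # a normal exit (the work succeeded) is left alone.
      {:ok, _pid} =
        Task.Supervisor.start_child(__MODULE__, mod, fun, [user], restart: :transient)
    end

    :ok
  end
end
```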
We place our logic in two separate modules, called `YourApp.Fulfillment` and `YourApp.CRM`. Each is responsible for knowing how to connect to its external service and for updating the user account with its results. This pattern also lets us put all code related to interacting with each service into one module. We pass the user down into each Task so that it can extract the relevant API keys for its external calls, as well as update the user. Resist the urge to find clever ways to avoid passing arguments; this itch is responsible for many poor design decisions.
This way, the caller immediately returns to the user, who can see a helpful About Us screen while the system is hard at work. If one of the Tasks fails to connect to its data source, we can rely on the Supervisor's restart strategy to retry a few times and eventually give up. We confine the responsibilities of each Task to enforce a clear separation of duties, and we can call this code from elsewhere in our application if we want to. And we reduce our database calls, since the Supervisor and Tasks get a user struct directly instead of a `user_id`.
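Here's one possible shape for such a module - every name is illustrative, and the private helpers are stubs standing in for the real HTTP call and Ecto update:

```elixir
defmodule YourApp.Fulfillment do
  # Every name here is illustrative. In a real app, find_nearest_center/1
  # would call the fulfillment service's HTTP API and update_user/2 would
  # persist the id with Ecto; both are stubbed to show the shape.
  def connect(user) do
    center_id = find_nearest_center(user.address)
    update_user(user, fulfillment_center_id: center_id)
  end

  # Stub for the HTTP lookup against the fulfillment service.
  defp find_nearest_center(_address), do: "fc_42"

  # Stub for Repo.update/1; merges the new key onto the user.
  defp update_user(user, fields), do: Map.merge(user, Map.new(fields))
end
```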
Ongoing Background Work
In the second case of ongoing background work, try using a GenServer. GenServers are cool for two reasons: you don't have to serialize arguments to them as you do with Sidekiq, and resource-intensive processes (did you drop an infinite recursive loop into your code?) won't plunder your user-facing responsiveness, thanks to the BEAM's fair scheduling. You can also drop some useful statistics into your GenServer's `state` so you can see what it's doing with the help of the `observer`.
Let's look at our cryptocurrency account value example. If we've wired up our frontend to update the DOM in realtime via Phoenix channels once our system has the data, our only job is to ping the node every second for the latest data. For these recurring tasks, I use a technique where we spin up a GenServer during application startup. The GenServer asks the BEAM to send it a `work` message after 1000ms and then goes to sleep. Remember, sleeping processes do not affect the performance of your system. When the interval has elapsed, the BEAM sends the message to the GenServer, which wakes up, does the work, schedules another message to be sent in 1000ms, and goes back to sleep.
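A minimal sketch of this self-scheduling pattern, with `fetch_latest_value/0` stubbed in place of the real call to the node:

```elixir
defmodule YourApp.CryptoTicker do
  use GenServer

  @interval 1_000

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    schedule_work()
    {:ok, %{latest: nil}}
  end

  @impl true
  def handle_info(:work, state) do
    latest = fetch_latest_value()
    # Schedule the next tick, then go back to sleep until it arrives.
    schedule_work()
    {:noreply, %{state | latest: latest}}
  end

  defp schedule_work, do: Process.send_after(self(), :work, @interval)

  # Stub for the real call to the node; returns a random float here.
  defp fetch_latest_value, do: :rand.uniform()
end
```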
If the work is simple and you don't mind a little drift, you can have the GenServer handle both the work and the scheduling/sleeping responsibilities. If the work is expensive or your intervals are absolute, have the GenServer spawn a Task to do the heavy lifting and let the GenServer be responsible only for the scheduling. Bear in mind here that if a Task takes an unexpectedly long time to complete, it may overwrite fresher data because it finished last.
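Here's a sketch of that split (names illustrative): the GenServer schedules the next tick before handing the work to a supervised Task, so the interval stays fixed even when a fetch runs long. It assumes a `Task.Supervisor` named `YourApp.TickerTaskSupervisor` exists in your supervision tree:

```elixir
defmodule YourApp.ScheduledFetcher do
  use GenServer

  @interval 1_000

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    Process.send_after(self(), :tick, @interval)
    {:ok, nil}
  end

  @impl true
  def handle_info(:tick, state) do
    # Schedule the next tick first, so a slow fetch can't push it back.
    Process.send_after(self(), :tick, @interval)

    # The heavy lifting happens in a supervised Task, not in this process.
    Task.Supervisor.start_child(YourApp.TickerTaskSupervisor, fn ->
      fetch_and_broadcast()
    end)

    {:noreply, state}
  end

  # Stub: fetch the latest value and push it over a Phoenix channel.
  defp fetch_and_broadcast, do: :ok
end
```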
You can spin up a GenServer for each user, but you may end up in a situation where your system is hammering the node with requests for the latest data; GenServers are all working concurrently! This approach generally works best with system-wide state, such as the weather in Los Angeles. We only need one process to fetch the latest conditions (sunny and warm 🏖), and all connected users simply use that same piece of data.
Moving Nonessential Work
Cool! What about the third case, where we want to speed up response times to the user? These actions are generally one-offs or fire-and-forget, so we'll use a `Task`. Simply wrap your email-sending code in a `Task.start` block and you're done! Yay!
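A sketch of the fire-and-forget pattern - `deliver_welcome_email/1` is a stand-in for your real mailer call (e.g. Swoosh or Bamboo):

```elixir
defmodule YourApp.Onboarding do
  def finish_signup(user) do
    # Fire-and-forget: the caller returns immediately, and nobody
    # restarts the Task if delivery crashes.
    Task.start(fn -> deliver_welcome_email(user) end)
    {:ok, user}
  end

  # Stub for the real mailer call.
  defp deliver_welcome_email(user) do
    IO.puts("Sending welcome email to #{user.email}")
  end
end
```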
This approach doesn't give you a lot of options for retrying failed jobs. If the work is critical, start a supervisor and have it spawn a task that can be restarted. If you're tempted to say "that sounds like a lot of work, why not just pull in a background job system," don't fall for it! Starting a supervision tree is simple and straightforward. Do it twice and you'll do it in your sleep. Remember, a background queue has several moving pieces, and by "moving pieces" I mean "points of failure." It's your 2am PagerDuty call.
Other Considerations
There are some situations where a background job framework does make sense. If you reboot your system while it's doing work, the work may be lost. Such cases call for an independent system that survives application restarts.
If that's your use case, I recommend a message broker like RabbitMQ (also built in Erlang!) instead of the Redis-plus-separate-copy-of-your-app setup. For one, RabbitMQ has much more robust monitoring, queueing, and retry features. Plus, you can put your consumers (i.e., job workers) right inside your main application! Running your workers in a separate application adds a big cognitive load and complicates your deployment strategy. Having the workers inside your app (remember, processes are fairly scheduled) lets you seamlessly interact with the rest of your app's state and keeps deployment simple. Two big wins.
Oh, and you no longer need to worry about "priority" queues. Everything happens concurrently, and processes sending emails won't interfere with processes building out user accounts. (It's possible to give processes High Priority on the BEAM scheduler, but this is extremely rare and you should avoid it unless you Know What You Are Doing.)
Be judicious about whether you really need the ability to retry jobs. Often, when failures rack up in our Sidekiq dashboard, we just clear them out. Important retries are usually done manually, or at least under close supervision by an engineer, instead of being blindly requeued.
Many of us came to Elixir because it addresses shortcomings in the systems we're used to working on. We lose many of those benefits if we take our baggage with us! Much of the fluency in choosing how to do work in the background comes from understanding Erlang's process model and how the system is built to let you spawn concurrent processes to do work. The more you learn about processes, the better and more idiomatic your architectural decisions will be.
I coach teams coming to Elixir on best practices. Want to avoid common mistakes? Hire me!
Project? Question? Reach out and say hello.
Sign up for our infrequent newsletter to hear what we're thinking about.