Background Workers: The Unsung Heroes That Keep Your Application Alive
Let me tell you about the time a single background job took down an entire e-commerce platform for six hours on Black Friday. One job. Six hours. Millions in lost revenue. The job was processing order confirmations, hit a malformed email address, raised an unhandled exception, and because it was configured to retry indefinitely with no circuit breaker, it blocked the entire queue. Every other job (shipping notifications, inventory updates, payment confirmations) just sat there waiting. This is not a hypothetical. This happened. And it is exactly why I am writing this post.
Background workers are the unsung heroes of modern web applications. They handle all the stuff that would otherwise make your users stare at loading spinners: sending emails, processing images, syncing data, generating reports, charging credit cards. But they are also a massive source of production incidents when not designed properly.
I have been running Sidekiq in production for over a decade, across dozens of applications ranging from small startups to enterprise systems processing millions of jobs per day. The patterns in this post are battle-tested and represent hard-won lessons from real outages.
Why Background Workers Exist
Let us start with the basics. When a user clicks a button on your website, they expect something to happen quickly. Classic response-time research suggests that anything over about 100 milliseconds no longer feels instantaneous, and users will abandon a page that takes more than a few seconds to load.
But some operations take time. Sending an email might take 500ms because you have to connect to an SMTP server. Processing an uploaded image might take 2 seconds. Generating a PDF report might take 10 seconds. Syncing data with a third party API might take who knows how long depending on their servers.
If you do these operations synchronously (meaning the user waits while they complete), your application feels sluggish. Worse, if you are doing them inside a web request, you are tying up a server process that could be handling other requests. Under load, this leads to request queuing, timeouts, and eventually your entire application grinding to a halt.
Background workers solve this by moving slow operations out of the request/response cycle. Instead of doing the work immediately, you queue it up and return a response to the user. A separate process picks up the queued work and does it asynchronously.
The user clicks "Place Order" and instantly sees "Order confirmed!" Meanwhile, in the background, workers are sending confirmation emails, updating inventory, notifying the warehouse, and charging the credit card. The user does not wait for any of this.
The Stack: Sidekiq, Redis, and Valkey
In the Ruby world, Sidekiq is the undisputed king of background job processing. It is fast, reliable, and has been battle tested in production by thousands of companies for over a decade.
Sidekiq uses Redis as its job queue. Redis is an in memory data store that is incredibly fast for the kind of operations Sidekiq needs: pushing jobs onto queues, popping jobs off queues, and tracking job state.
A Quick Note on Valkey
In 2024, Redis changed its licensing from the permissive BSD license to a dual license that restricts how cloud providers can offer Redis as a service. In response, the Linux Foundation forked Redis and created Valkey, which continues under the original BSD license.
For our purposes, Valkey is a drop in replacement for Redis. Everything I say about Redis in this post applies equally to Valkey. If you are starting a new project or your organisation has concerns about Redis licensing, use Valkey. The commands are identical, the protocol is identical, and Sidekiq works with both.
# Gemfile
gem 'sidekiq', '~> 7.2'

# For Redis
gem 'redis', '~> 5.0'
# Or for Valkey (same gem, different server)
# gem 'redis', '~> 5.0' # Valkey speaks the Redis protocol

# config/initializers/sidekiq.rb
# Works with either Redis or Valkey
Sidekiq.configure_server do |config|
  config.redis = {
    url: ENV.fetch('REDIS_URL', 'redis://localhost:6379/0'),
    network_timeout: 5,
    pool_timeout: 5
  }
end

Sidekiq.configure_client do |config|
  config.redis = {
    url: ENV.fetch('REDIS_URL', 'redis://localhost:6379/0'),
    network_timeout: 5,
    pool_timeout: 5
  }
end
Your First Background Job
Let us write a simple job that sends a welcome email:
# app/jobs/send_welcome_email_job.rb
# Dead simple job to send welcome emails
# Nothing fancy, just gets the job done
class SendWelcomeEmailJob
  include Sidekiq::Job

  def perform(user_id)
    user = User.find(user_id)
    UserMailer.welcome(user).deliver_now
  end
end
And enqueue it from your controller:
# app/controllers/registrations_controller.rb
class RegistrationsController < ApplicationController
  def create
    @user = User.new(user_params)
    if @user.save
      # Enqueue the email job - returns immediately
      # User doesn't wait for email to actually send
      SendWelcomeEmailJob.perform_async(@user.id)
      redirect_to dashboard_path, notice: 'Welcome aboard!'
    else
      render :new
    end
  end
end
The key thing here is perform_async. This does not send the email. It serialises the job (the class name and arguments) into Redis and returns immediately. The user gets their response in milliseconds.
Meanwhile, a Sidekiq worker process running separately picks up the job from Redis and executes the perform method. If the email takes 500ms to send, that is fine. The user is already looking at their dashboard.
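To make "serialises the job into Redis" concrete, here is a rough sketch in plain Ruby of the payload that perform_async builds and pushes. The field names shown are the essential ones; Sidekiq's real payload carries a few extra fields (retry settings and other metadata), so treat this as an illustration rather than the exact wire format:

```ruby
require 'json'
require 'securerandom'

# Sketch of the payload perform_async serialises into Redis.
# Illustrative only: Sidekiq's actual payload includes additional
# metadata such as retry configuration.
def build_job_payload(klass, *args)
  {
    'class'      => klass,          # which job class to instantiate
    'args'       => args,           # positional args passed to #perform
    'queue'      => 'default',      # which Redis list to push onto
    'jid'        => SecureRandom.hex(12), # unique job id
    'created_at' => Time.now.to_f
  }
end

payload = JSON.generate(build_job_payload('SendWelcomeEmailJob', 42))
# Sidekiq then does roughly: redis.lpush('queue:default', payload)
```

This is why enqueueing is so fast: it is one small JSON string and one Redis push, regardless of how slow the actual work is.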
The Problem: Jobs Are Not Islands
Here is where things get dangerous. That simple job I just showed you? It has several critical flaws that could take down your entire application.
Flaw 1: What If The User Does Not Exist?
def perform(user_id)
  user = User.find(user_id) # BOOM! RecordNotFound if user was deleted
  UserMailer.welcome(user).deliver_now
end
Between the time the job was enqueued and when it runs, the user might have been deleted. Maybe they requested account deletion. Maybe an admin removed them. Maybe there was a database rollback.
User.find will raise ActiveRecord::RecordNotFound, which Sidekiq will catch and retry. And retry. And retry. By default, Sidekiq retries failed jobs 25 times over about 21 days. That is 25 exceptions in your error tracker, 25 wasted processing cycles, and potentially 25 alert notifications.
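The "about 21 days" figure falls out of Sidekiq's default retry delay, which is roughly (retry_count ** 4) + 15 seconds plus a random jitter. A quick back-of-envelope sketch (jitter omitted) shows how 25 retries stretch into weeks:

```ruby
# Sidekiq's default retry delay is approximately (count ** 4) + 15
# seconds, plus jitter (omitted here). Summing the schedule for the
# default 25 retries shows why it spans roughly three weeks.
def retry_delay(count)
  (count ** 4) + 15
end

total_seconds = (0...25).sum { |count| retry_delay(count) }
days = total_seconds / 86_400.0 # roughly 20 days before jitter
```

Early retries come fast (15s, 16s, 31s, ...) and later ones are spaced days apart, which is why a permanently broken job lingers in the retry set for so long.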
Flaw 2: What If The Email Fails?
def perform(user_id)
  user = User.find(user_id)
  UserMailer.welcome(user).deliver_now # BOOM! SMTP error, timeout, whatever
end
SMTP servers go down. Network connections time out. Rate limits get hit. Email addresses are malformed. Any of these will raise an exception.
Again, Sidekiq retries. But here is the insidious part: if your email provider is having an outage, every single email job will fail. They all go into the retry queue. When the provider comes back, you suddenly have thousands of jobs all retrying at once, potentially overwhelming the provider again.
Flaw 3: Shared Resources
The really nasty problems happen when jobs share resources. Consider this:
# app/jobs/process_order_job.rb
# This job has a massive problem - can you spot it?
class ProcessOrderJob
  include Sidekiq::Job

  def perform(order_id)
    order = Order.find(order_id)
    # Lock the order row to prevent double processing
    order.with_lock do
      return if order.processed?

      # These all need to succeed together
      charge_payment(order)
      update_inventory(order)
      notify_warehouse(order)
      send_confirmation(order)
      order.update!(processed: true)
    end
  end

  private

  def charge_payment(order)
    # Hits Stripe API - might be slow or fail
    Stripe::Charge.create(amount: order.total_cents, ...)
  end

  def update_inventory(order)
    # Database updates - might deadlock
    order.line_items.each do |item|
      item.product.decrement!(:stock, item.quantity)
    end
  end

  def notify_warehouse(order)
    # Hits external API - might timeout
    WarehouseAPI.new.create_shipment(order)
  end

  def send_confirmation(order)
    # Sends email - might fail
    OrderMailer.confirmation(order).deliver_now
  end
end
This job does four different things, any of which can fail. If notify_warehouse times out, the entire job fails and retries. But wait, we already charged the payment! Now on retry, we will try to charge again.
Or consider: if update_inventory causes a database deadlock, the job fails. Meanwhile, another job processing a different order is trying to update the same product's stock. Both jobs fail, both retry, both deadlock again. Rinse and repeat until someone notices.
The Golden Rule: Jobs Must Be Independent
This is the single most important lesson in this entire post. Tattoo it on your arm if you have to:
Every job must be independent. Every job must be idempotent. Every job must assume it will fail and be retried.
Independence
A job should not depend on the state left behind by another job. It should not assume jobs run in any particular order. It should fetch all the data it needs at execution time, not rely on data from when it was enqueued.
# BAD: Depends on previous job setting a flag
class SendShippingNotificationJob
  include Sidekiq::Job

  def perform(order_id)
    order = Order.find(order_id)
    # Assumes ProcessOrderJob already ran and set shipped_at
    # What if it hasn't? What if it failed?
    raise 'Not shipped yet!' unless order.shipped_at

    OrderMailer.shipped(order).deliver_now
  end
end

# GOOD: Checks state and handles gracefully
class SendShippingNotificationJob
  include Sidekiq::Job

  def perform(order_id)
    order = Order.find_by(id: order_id)
    # User or order might not exist anymore
    return unless order

    # Not shipped yet? That's fine, just don't send notification
    # Maybe the shipment job will enqueue us again later
    return unless order.shipped_at

    # Already notified? Don't spam the customer
    return if order.shipping_notification_sent_at

    OrderMailer.shipped(order).deliver_now
    order.update!(shipping_notification_sent_at: Time.current)
  end
end
Idempotency
A job is idempotent if running it multiple times has the same effect as running it once. This is crucial because Sidekiq might run your job multiple times due to retries, network issues, or worker crashes.
# BAD: Not idempotent - will charge multiple times on retry
class ChargeOrderJob
  include Sidekiq::Job

  def perform(order_id)
    order = Order.find(order_id)
    Stripe::Charge.create(
      amount: order.total_cents,
      customer: order.user.stripe_customer_id
    )
    order.update!(paid: true)
  end
end

# GOOD: Idempotent - checks if already charged
class ChargeOrderJob
  include Sidekiq::Job

  def perform(order_id)
    order = Order.find_by(id: order_id)
    return unless order
    # Already paid? Nothing to do
    return if order.paid?

    # Use idempotency key so Stripe won't double charge
    # Even if we crash after charging but before updating
    charge = Stripe::Charge.create(
      amount: order.total_cents,
      customer: order.user.stripe_customer_id,
      idempotency_key: "order_#{order.id}_charge"
    )
    order.update!(
      paid: true,
      stripe_charge_id: charge.id,
      paid_at: Time.current
    )
  end
end
Graceful Failure
Jobs should handle errors gracefully rather than exploding and relying on retries:
# BAD: Explodes on any error
class SyncUserToMailchimpJob
  include Sidekiq::Job

  def perform(user_id)
    user = User.find(user_id)
    mailchimp.lists.add_member(
      list_id: ENV['MAILCHIMP_LIST_ID'],
      email: user.email,
      merge_fields: { FNAME: user.first_name }
    )
  end
end

# GOOD: Handles expected errors, only retries on transient failures
class SyncUserToMailchimpJob
  include Sidekiq::Job
  sidekiq_options retry: 5 # Limit retries

  def perform(user_id)
    user = User.find_by(id: user_id)
    return unless user
    return unless user.email.present?

    begin
      mailchimp.lists.add_member(
        list_id: ENV['MAILCHIMP_LIST_ID'],
        email: user.email,
        merge_fields: { FNAME: user.first_name }
      )
    rescue Mailchimp::AlreadySubscribedError
      # That's fine, they are already on the list
      Rails.logger.info("User #{user_id} already subscribed to Mailchimp")
    rescue Mailchimp::InvalidEmailError => e
      # Don't retry - this email will never work
      Rails.logger.warn("Invalid email for user #{user_id}: #{e.message}")
      user.update!(mailchimp_sync_failed: true, mailchimp_sync_error: e.message)
    rescue Mailchimp::RateLimitError
      # Transient error - retry with backoff
      raise # Let Sidekiq retry
    rescue Mailchimp::ServerError
      # Transient error - retry with backoff
      raise # Let Sidekiq retry
    end
  end
end
Breaking Up Monolithic Jobs
Remember that terrible ProcessOrderJob from earlier? Let us fix it by breaking it into independent jobs:
# app/jobs/process_order_job.rb
# Orchestrator job - just coordinates the others
# Each step is its own independent job
class ProcessOrderJob
  include Sidekiq::Job

  def perform(order_id)
    order = Order.find_by(id: order_id)
    return unless order
    return if order.processing_started_at?

    # Mark that we have started - prevents double processing
    order.update!(processing_started_at: Time.current)

    # Enqueue each step as a separate job
    # They will run independently and can fail independently
    ChargeOrderJob.perform_async(order_id)
    UpdateInventoryJob.perform_async(order_id)
    NotifyWarehouseJob.perform_async(order_id)
    SendOrderConfirmationJob.perform_async(order_id)
  end
end

# app/jobs/charge_order_job.rb
# Just handles payment - nothing else
class ChargeOrderJob
  include Sidekiq::Job
  sidekiq_options queue: 'critical', retry: 10

  def perform(order_id)
    order = Order.find_by(id: order_id)
    return unless order
    return if order.paid?

    charge = Stripe::Charge.create(
      amount: order.total_cents,
      customer: order.user.stripe_customer_id,
      idempotency_key: "order_#{order.id}_v1"
    )
    order.update!(
      paid: true,
      stripe_charge_id: charge.id,
      paid_at: Time.current
    )
  rescue Stripe::CardError => e
    # Card declined - don't retry, notify the user
    order.update!(payment_failed: true, payment_error: e.message)
    PaymentFailedMailer.notify(order).deliver_later
  end
end

# app/jobs/update_inventory_job.rb
# Just handles inventory - nothing else
class UpdateInventoryJob
  include Sidekiq::Job
  sidekiq_options queue: 'default', retry: 5

  def perform(order_id)
    order = Order.find_by(id: order_id)
    return unless order
    return if order.inventory_updated_at?

    # Use a transaction for the inventory updates
    # But only the inventory updates, not other stuff
    ActiveRecord::Base.transaction do
      order.line_items.each do |item|
        product = item.product.lock!
        product.decrement!(:stock, item.quantity)
      end
      order.update!(inventory_updated_at: Time.current)
    end
  end
end

# app/jobs/notify_warehouse_job.rb
# Just handles warehouse notification - nothing else
class NotifyWarehouseJob
  include Sidekiq::Job
  sidekiq_options queue: 'external_apis', retry: 15

  def perform(order_id)
    order = Order.find_by(id: order_id)
    return unless order
    return if order.warehouse_notified_at?

    shipment = WarehouseAPI.new.create_shipment(
      order_id: order.id,
      items: order.line_items.map(&:to_warehouse_format),
      address: order.shipping_address
    )
    order.update!(
      warehouse_notified_at: Time.current,
      warehouse_shipment_id: shipment.id
    )
  rescue WarehouseAPI::ServiceUnavailable
    # Transient - let it retry
    raise
  rescue WarehouseAPI::InvalidAddressError => e
    # Permanent failure - need human intervention
    order.update!(warehouse_error: e.message)
    AdminAlertJob.perform_async('warehouse_address_error', order_id)
  end
end

# app/jobs/send_order_confirmation_job.rb
# Just handles the confirmation email - nothing else
class SendOrderConfirmationJob
  include Sidekiq::Job
  sidekiq_options queue: 'mailers', retry: 5

  def perform(order_id)
    order = Order.find_by(id: order_id)
    return unless order
    return if order.confirmation_sent_at?

    OrderMailer.confirmation(order).deliver_now
    order.update!(confirmation_sent_at: Time.current)
  end
end
Now if the warehouse API is down, only NotifyWarehouseJob fails and retries. The payment goes through, inventory updates, and the customer gets their confirmation email. The warehouse notification will eventually succeed when their API recovers.
If the email fails, the customer still gets charged correctly and the warehouse still gets notified. We can investigate the email failure separately.
This is the power of independent jobs. Failures are isolated. One broken thing does not cascade into everything being broken.
Queue Design: Keeping Critical Jobs Moving
Not all jobs are created equal. A password reset email is more urgent than a weekly analytics rollup. A payment confirmation is more critical than syncing data to a CRM.
Sidekiq lets you define multiple queues with different priorities:
# config/sidekiq.yml
# Queue configuration - the numbers are weights, not thread counts
# Higher weight means the queue is checked more often
:concurrency: 10
:queues:
  - [critical, 6]      # weight 6 - checked most often
  - [default, 3]       # weight 3
  - [mailers, 2]       # weight 2
  - [low, 1]           # weight 1 - checked least often
  - [external_apis, 2] # weight 2
The numbers in brackets are weights. Sidekiq will process jobs from critical 6 times as often as from low. But this is probabilistic, not absolute, so low priority jobs still make progress.
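You can picture weighted polling as a lottery: on each poll, Sidekiq effectively shuffles a list in which every queue appears as many times as its weight, then checks queues in that order. A conceptual sketch (this models the behaviour; it is not Sidekiq's internal code):

```ruby
# Conceptual model of weighted queue polling: expand each queue name
# by its weight, shuffle, and check queues in the resulting order.
# 'critical' holds 6 of the 12 slots, so it tends to be checked
# first - but 'low' still gets its turn, which is why low-priority
# jobs keep making progress.
WEIGHTS = { 'critical' => 6, 'default' => 3, 'mailers' => 2, 'low' => 1 }.freeze

def queue_lottery(weights)
  weights.flat_map { |name, weight| [name] * weight }
end

poll_order = queue_lottery(WEIGHTS).shuffle
```

This is the sense in which the weights are probabilistic rather than absolute priorities.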
For truly critical jobs, I prefer running separate Sidekiq processes:
# Process 1: Only critical jobs
bundle exec sidekiq -q critical -c 5
# Process 2: Default and mailers
bundle exec sidekiq -q default -q mailers -c 10
# Process 3: Low priority and external APIs
bundle exec sidekiq -q low -q external_apis -c 5
This way, even if the external_apis queue gets backed up because a third party is slow, your critical jobs keep flowing.
Retry Strategies and Dead Letter Queues
Sidekiq's default retry behaviour is sensible but you should understand and customise it:
# Default: 25 retries over ~21 days
# Retry intervals: 15s, 16s, 31s, 96s, 271s, ... (exponential backoff)
class SomeJob
  include Sidekiq::Job

  # Customise retry behaviour
  sidekiq_options(
    retry: 10,     # Max 10 retries instead of 25
    dead: true,    # Move to dead queue after max retries (default)
    backtrace: 10  # Store 10 lines of backtrace
  )

  # Custom retry delay
  sidekiq_retry_in do |count, exception|
    case exception
    when RateLimitError
      # Rate limited - wait longer
      60 * (count + 1) # 1 min, 2 min, 3 min, etc
    when ServiceUnavailable
      # Service down - exponential backoff
      (count ** 4) + 15 + (rand(10) * (count + 1))
    else
      # Default exponential backoff
      (count ** 4) + 15 + (rand(10) * (count + 1))
    end
  end
end
After exhausting retries, jobs go to the Dead Job queue (sometimes called the Dead Letter Queue or DLQ). These are jobs that have failed permanently and need human attention.
You must monitor your dead job queue. If it is filling up, something is wrong. I set up alerts that fire if more than X jobs die per hour:
# app/jobs/concerns/dead_job_alerting.rb
# Alert when jobs die - don't let them pile up silently
module DeadJobAlerting
  extend ActiveSupport::Concern

  included do
    sidekiq_retries_exhausted do |job, exception|
      # Log it properly
      Rails.logger.error(
        "Job #{job['class']} died after #{job['retry_count']} retries. " \
        "Args: #{job['args']}. Error: #{exception.message}"
      )

      # Alert the team
      Slack.notify(
        channel: '#alerts',
        text: ":skull: Job died: #{job['class']} - #{exception.message}"
      )

      # Track metrics
      StatsD.increment('sidekiq.jobs.dead', tags: ["job:#{job['class']}"])
    end
  end
end

# Include in your jobs
class ImportantJob
  include Sidekiq::Job
  include DeadJobAlerting
  # ...
end
Monitoring and Alerting
You cannot fix what you cannot see. Sidekiq provides a web UI that you should absolutely deploy:
# config/routes.rb
require 'sidekiq/web'

Rails.application.routes.draw do
  # Protect with authentication
  authenticate :user, ->(user) { user.admin? } do
    mount Sidekiq::Web => '/sidekiq'
  end
end
But the web UI is for manual inspection. For production monitoring, you need metrics and alerts:
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.on(:startup) do
    # Report queue sizes every minute
    Thread.new do
      loop do
        stats = Sidekiq::Stats.new
        StatsD.gauge('sidekiq.enqueued', stats.enqueued)
        StatsD.gauge('sidekiq.processed', stats.processed)
        StatsD.gauge('sidekiq.failed', stats.failed)
        StatsD.gauge('sidekiq.dead', stats.dead_size)
        StatsD.gauge('sidekiq.retry', stats.retry_size)
        StatsD.gauge('sidekiq.workers', stats.workers_size)

        Sidekiq::Queue.all.each do |queue|
          StatsD.gauge('sidekiq.queue.size', queue.size, tags: ["queue:#{queue.name}"])
          StatsD.gauge('sidekiq.queue.latency', queue.latency, tags: ["queue:#{queue.name}"])
        end

        sleep 60
      end
    end
  end
end
Key metrics to alert on:
Queue latency: How long jobs wait before being picked up. If this spikes, you need more workers or your jobs are too slow.
Dead job count: Jobs that have failed permanently. Should be near zero. Any increase needs investigation.
Retry queue size: Jobs waiting to be retried. A sustained high number indicates systematic failures.
Queue depth: Number of jobs waiting. Occasional spikes are fine, sustained growth means you are not keeping up.
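A concrete alert rule for the first metric: compare each queue's latency against a per-queue threshold, because a critical job waiting 45 seconds is an incident while a low-priority job waiting 45 seconds is business as usual. The thresholds below are illustrative; in production the latency numbers come from Sidekiq::Queue.all mapped over queue.latency:

```ruby
# Per-queue latency thresholds in seconds. These numbers are
# illustrative assumptions - tune them for your own queues.
LATENCY_THRESHOLDS = {
  'critical' => 30,
  'default'  => 300,
  'low'      => 3_600
}.freeze

# Given a hash of { queue_name => latency_seconds }, return the
# queues that have breached their threshold and should page someone.
def queues_breaching(latencies, thresholds = LATENCY_THRESHOLDS)
  latencies.select { |queue, seconds| seconds > thresholds.fetch(queue, 300) }.keys
end
```

Wire the result into whatever alerting channel you already use; the point is that one global threshold for all queues produces either noise or silence.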
Lessons From Production
Let me share some specific lessons from real outages:
Lesson 1: Arguments Must Be Serialisable
# BAD: Passing ActiveRecord objects
SendEmailJob.perform_async(@user) # Serialises the whole object!
# GOOD: Pass IDs
SendEmailJob.perform_async(@user.id)
Sidekiq serialises job arguments to JSON. ActiveRecord objects become huge blobs of data, and Sidekiq 7's strict argument checking will reject non-JSON-native arguments outright. Worse, by the time the job runs, the object might be stale. Always pass IDs and fetch fresh data in the job.
Lesson 2: Jobs Should Be Fast to Enqueue
# BAD: Slow to enqueue
def create
  @order = Order.create!(order_params)
  # This hits Redis once per line item, inside the request!
  @order.line_items.each do |item|
    UpdateInventoryJob.perform_async(item.id)
  end
end

# GOOD: Bulk enqueue
def create
  @order = Order.create!(order_params)
  # Single job that handles all items
  UpdateOrderInventoryJob.perform_async(@order.id)
end
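If you genuinely do need one job per item, Sidekiq (6.3+) offers perform_bulk, which pushes many jobs in a single Redis round trip instead of one network call each. The shaping of the arguments is the only subtlety: an array of argument arrays, one sub-array per job. A sketch:

```ruby
# perform_bulk takes an array of argument arrays - one sub-array per
# job. This helper (plain Ruby) shapes a list of ids accordingly.
def bulk_args(item_ids)
  item_ids.map { |id| [id] } # each job receives one positional argument
end

# In the controller it would look roughly like:
#   UpdateInventoryJob.perform_bulk(bulk_args(@order.line_items.ids))
```

One round trip for a thousand jobs instead of a thousand round trips keeps the request fast even when the fan-out is large.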
Lesson 3: Watch Your Memory
Sidekiq workers are long running processes. Memory leaks accumulate over time:
# BAD: Loading too much into memory
class ProcessAllOrdersJob
  include Sidekiq::Job

  def perform
    orders = Order.where(status: 'pending').to_a # Loads ALL into memory
    orders.each { |o| process(o) }
  end
end

# GOOD: Batch processing
class ProcessAllOrdersJob
  include Sidekiq::Job

  def perform
    Order.where(status: 'pending').find_each(batch_size: 100) do |order|
      process(order)
    end
  end
end
Lesson 4: Test Your Jobs
# spec/jobs/send_welcome_email_job_spec.rb
require 'rails_helper'

RSpec.describe SendWelcomeEmailJob, type: :job do
  describe '#perform' do
    let(:user) { create(:user) }

    it 'sends welcome email' do
      expect {
        described_class.new.perform(user.id)
      }.to change { ActionMailer::Base.deliveries.count }.by(1)
    end

    # These two examples assume the hardened version of the job:
    # find_by with an early return, plus a sent-at guard for idempotency
    it 'handles missing user gracefully' do
      expect {
        described_class.new.perform(999_999)
      }.not_to raise_error
    end

    it 'is idempotent' do
      described_class.new.perform(user.id)
      expect {
        described_class.new.perform(user.id)
      }.not_to change { ActionMailer::Base.deliveries.count }
    end
  end
end
Conclusion
Background workers are not optional for modern web applications. But they are also a significant source of complexity and potential failures. The difference between a robust system and a house of cards comes down to how well you design your jobs.
Remember:
Jobs must be independent. Do not rely on other jobs having run first.
Jobs must be idempotent. Running twice should have the same effect as running once.
Jobs must handle failure gracefully. Because failure is not a matter of if, but when.
Queue design matters. Separate critical jobs from bulk processing.
Monitor everything. You cannot fix what you cannot see.
I have seen too many applications brought down by poorly designed background jobs. The Black Friday outage I mentioned at the start? It was completely preventable. A few guards, some retry limits, and proper queue isolation would have kept the system running.
Do not learn these lessons the hard way. Design your jobs properly from the start.
More than a decade of running Sidekiq in production went into these patterns. The horror stories are real.
Need help designing a robust background job architecture? Or debugging mysterious job failures? I have seen it all and can help you build systems that stay up when things go wrong. Let us chat.