Background Workers: The Unsung Heroes That Keep Your Application Alive

Let me tell you about the time a single background job took down an entire e-commerce platform for six hours on Black Friday. One job. Six hours. Millions in lost revenue. The job was processing order confirmations, hit a malformed email address, raised an unhandled exception, and, because it was configured to retry indefinitely with no circuit breaker, it blocked the entire queue. Every other job (shipping notifications, inventory updates, payment confirmations) just sat there waiting. This is not a hypothetical. It happened. And it is exactly why I am writing this post.

Background workers are the unsung heroes of modern web applications. They handle all the stuff that would otherwise make your users stare at loading spinners: sending emails, processing images, syncing data, generating reports, charging credit cards. But they are also a massive source of production incidents when not designed properly.

I have been running Sidekiq in production for over a decade across dozens of applications, from small startups to enterprise systems processing millions of jobs per day. The patterns in this post are battle tested and represent hard won lessons from real outages.

Why Background Workers Exist

Let us start with the basics. When a user clicks a button on your website, they expect something to happen quickly. Research suggests anything slower than about 100 milliseconds no longer feels instantaneous, and users will abandon a page that takes more than 3 seconds to load.

But some operations take time. Sending an email might take 500ms because you have to connect to an SMTP server. Processing an uploaded image might take 2 seconds. Generating a PDF report might take 10 seconds. Syncing data with a third party API might take who knows how long depending on their servers.

If you do these operations synchronously (meaning the user waits while they complete), your application feels sluggish. Worse, if you are doing them inside a web request, you are tying up a server process that could be handling other requests. Under load, this leads to request queuing, timeouts, and eventually your entire application grinding to a halt.

Background workers solve this by moving slow operations out of the request/response cycle. Instead of doing the work immediately, you queue it up and return a response to the user. A separate process picks up the queued work and does it asynchronously.

The user clicks "Place Order" and instantly sees "Order confirmed!" Meanwhile, in the background, workers are sending confirmation emails, updating inventory, notifying the warehouse, and charging the credit card. The user does not wait for any of this.

The Stack: Sidekiq, Redis, and Valkey

In the Ruby world, Sidekiq is the undisputed king of background job processing. It is fast, reliable, and has been battle tested in production by thousands of companies for over a decade.

Sidekiq uses Redis as its job queue. Redis is an in memory data store that is incredibly fast for the kind of operations Sidekiq needs: pushing jobs onto queues, popping jobs off queues, and tracking job state.

A Quick Note on Valkey

In 2024, Redis changed its licensing from the permissive BSD license to a dual license that restricts how cloud providers can offer Redis as a service. In response, the Linux Foundation forked Redis and created Valkey, which continues under the original BSD license.

For our purposes, Valkey is a drop in replacement for Redis. Everything I say about Redis in this post applies equally to Valkey. If you are starting a new project or your organisation has concerns about Redis licensing, use Valkey. The commands are identical, the protocol is identical, and Sidekiq works with both.

# Gemfile
gem 'sidekiq', '~> 7.2'

# For Redis
gem 'redis', '~> 5.0'

# Or for Valkey (same gem, different server)
# gem 'redis', '~> 5.0'  # Valkey speaks the Redis protocol

# config/initializers/sidekiq.rb
# Works with either Redis or Valkey

Sidekiq.configure_server do |config|
  config.redis = {
    url: ENV.fetch('REDIS_URL', 'redis://localhost:6379/0'),
    network_timeout: 5,
    pool_timeout: 5
  }
end

Sidekiq.configure_client do |config|
  config.redis = {
    url: ENV.fetch('REDIS_URL', 'redis://localhost:6379/0'),
    network_timeout: 5,
    pool_timeout: 5
  }
end

Your First Background Job

Let us write a simple job that sends a welcome email:

# app/jobs/send_welcome_email_job.rb
# Dead simple job to send welcome emails
# Nothing fancy, just gets the job done

class SendWelcomeEmailJob
  include Sidekiq::Job

  def perform(user_id)
    user = User.find(user_id)
    UserMailer.welcome(user).deliver_now
  end
end

And enqueue it from your controller:

# app/controllers/registrations_controller.rb

class RegistrationsController < ApplicationController
  def create
    @user = User.new(user_params)
    
    if @user.save
      # Enqueue the email job - returns immediately
      # User doesn't wait for the email to actually send
      SendWelcomeEmailJob.perform_async(@user.id)
      
      redirect_to dashboard_path, notice: 'Welcome aboard!'
    else
      render :new
    end
  end
end

The key thing here is perform_async. This does not send the email. It serialises the job (the class name and arguments) into Redis and returns immediately. The user gets their response in milliseconds.

Meanwhile, a Sidekiq worker process running separately picks up the job from Redis and executes the perform method. If the email takes 500ms to send, that is fine. The user is already looking at their dashboard.
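To make the serialisation concrete, here is a rough sketch of the payload shape that perform_async pushes into Redis. This is illustrative, not Sidekiq's source; the field names follow Sidekiq's documented job format, and build_job_payload is a made-up helper.

```ruby
require "json"
require "securerandom"

# Sketch of the job hash perform_async serialises into Redis.
# Real Sidekiq stores these fields (plus a few more, varying by version).
def build_job_payload(klass, *args)
  {
    "class" => klass.to_s,
    "args"  => args,                  # must be JSON-serialisable - pass IDs, not objects
    "queue" => "default",
    "jid"   => SecureRandom.hex(12),  # unique job ID
    "created_at" => Time.now.to_f
  }
end

payload = JSON.generate(build_job_payload("SendWelcomeEmailJob", 42))
# A worker later parses this JSON and calls SendWelcomeEmailJob.new.perform(42)
```

Because the payload is plain JSON, anything you pass as an argument has to survive a round trip through serialisation, which is exactly why the lessons later in this post insist on passing IDs rather than objects.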

The Problem: Jobs Are Not Islands

Here is where things get dangerous. That simple job I just showed you? It has several critical flaws that could take down your entire application.

Flaw 1: What If The User Does Not Exist?

def perform(user_id)
  user = User.find(user_id)  # BOOM! RecordNotFound if user was deleted
  UserMailer.welcome(user).deliver_now
end

Between the time the job was enqueued and when it runs, the user might have been deleted. Maybe they requested account deletion. Maybe an admin removed them. Maybe there was a database rollback.

User.find will raise ActiveRecord::RecordNotFound, which Sidekiq will catch and retry. And retry. And retry. By default, Sidekiq retries failed jobs 25 times over about 21 days. That is 25 exceptions in your error tracker, 25 wasted processing cycles, and potentially 25 alert notifications.
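The "about 21 days" figure falls straight out of Sidekiq's documented default backoff, roughly (count ** 4) + 15 seconds per attempt plus a small random jitter. Summing the deterministic part over 25 retries shows where the three-week window comes from:

```ruby
# Deterministic part of Sidekiq's default retry delay for attempt `count`:
#   (count ** 4) + 15 seconds (plus a small random jitter per attempt)
total_seconds = (0...25).sum { |count| (count ** 4) + 15 }
days = total_seconds / 86_400.0
puts format("~%.1f days", days)  # => "~20.4 days", jitter adds a little more
```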

Flaw 2: What If The Email Fails?

def perform(user_id)
  user = User.find(user_id)
  UserMailer.welcome(user).deliver_now  # BOOM! SMTP error, timeout, whatever
end

SMTP servers go down. Network connections time out. Rate limits get hit. Email addresses are malformed. Any of these will raise an exception.

Again, Sidekiq retries. But here is the insidious part: if your email provider is having an outage, every single email job will fail. They all go into the retry queue. When the provider comes back, you suddenly have thousands of jobs all retrying at once, potentially overwhelming the provider again.
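One standard mitigation for this thundering-herd effect is jitter: randomise each retry's delay so a batch of simultaneous failures spreads back out over time instead of retrying in lockstep. A minimal sketch of the idea (retry_delay_with_jitter is a hypothetical helper, not part of Sidekiq, which already jitters its default schedule):

```ruby
# Exponential-ish backoff with up to 100% random jitter.
# Two jobs that fail at the same instant get different retry times,
# so they do not hammer the recovering provider together.
def retry_delay_with_jitter(count, base: 15)
  backoff = (count ** 4) + base   # grows quickly with each attempt
  backoff + rand(backoff + 1)     # jitter desynchronises the herd
end
```

For attempt 3 this yields somewhere between 96 and 192 seconds, and the spread widens with every attempt.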

Flaw 3: Shared Resources

The really nasty problems happen when jobs share resources. Consider this:

# app/jobs/process_order_job.rb
# This job has a massive problem - can you spot it?

class ProcessOrderJob
  include Sidekiq::Job

  def perform(order_id)
    order = Order.find(order_id)
    
    # Lock the order row to prevent double processing
    order.with_lock do
      return if order.processed?
      
      # These all need to succeed together
      charge_payment(order)
      update_inventory(order)
      notify_warehouse(order)
      send_confirmation(order)
      
      order.update!(processed: true)
    end
  end
  
  private
  
  def charge_payment(order)
    # Hits Stripe API - might be slow or fail
    Stripe::Charge.create(amount: order.total_cents, ...)
  end
  
  def update_inventory(order)
    # Database updates - might deadlock
    order.line_items.each do |item|
      item.product.decrement!(:stock, item.quantity)
    end
  end
  
  def notify_warehouse(order)
    # Hits external API - might timeout
    WarehouseAPI.new.create_shipment(order)
  end
  
  def send_confirmation(order)
    # Sends email - might fail
    OrderMailer.confirmation(order).deliver_now
  end
end

This job does four different things, any of which can fail. If notify_warehouse times out, the entire job fails and retries. But wait, we already charged the payment! Now on retry, we will try to charge again.

Or consider: if update_inventory causes a database deadlock, the job fails. Meanwhile, another job processing a different order is trying to update the same product's stock. Both jobs fail, both retry, both deadlock again. Rinse and repeat until someone notices.
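A common defence against this kind of repeat deadlock is to always acquire row locks in a single global order, typically ascending primary key, so two jobs touching the same products can never hold locks in opposite orders. A sketch of the idea (lock_order_for is a hypothetical helper, not from the job above):

```ruby
# Derive a canonical lock order from an order's line items:
# deduplicated product IDs, ascending. Two jobs sharing products
# will request the same rows in the same sequence, so neither can
# wait on a lock the other already holds.
def lock_order_for(line_items)
  line_items.map(&:product_id).uniq.sort
end

# Inside the job, something like:
#   ActiveRecord::Base.transaction do
#     Product.where(id: lock_order_for(order.line_items)).order(:id).lock.each do |product|
#       # decrement stock for this product's line items
#     end
#   end
```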

The Golden Rule: Jobs Must Be Independent

This is the single most important lesson in this entire post. Tattoo it on your arm if you have to:

Every job must be independent. Every job must be idempotent. Every job must assume it will fail and be retried.

Independence

A job should not depend on the state left behind by another job. It should not assume jobs run in any particular order. It should fetch all the data it needs at execution time, not rely on data from when it was enqueued.

# BAD: Depends on previous job setting a flag
class SendShippingNotificationJob
  include Sidekiq::Job

  def perform(order_id)
    order = Order.find(order_id)
    # Assumes ProcessOrderJob already ran and set shipped_at
    # What if it hasn't? What if it failed?
    raise 'Not shipped yet!' unless order.shipped_at
    OrderMailer.shipped(order).deliver_now
  end
end

# GOOD: Checks state and handles gracefully
class SendShippingNotificationJob
  include Sidekiq::Job

  def perform(order_id)
    order = Order.find_by(id: order_id)
    
    # User or order might not exist anymore
    return unless order
    
    # Not shipped yet? That's fine, just don't send the notification
    # Maybe the shipment job will enqueue us again later
    return unless order.shipped_at
    
    # Already notified? Don't spam the customer
    return if order.shipping_notification_sent_at
    
    OrderMailer.shipped(order).deliver_now
    order.update!(shipping_notification_sent_at: Time.current)
  end
end

Idempotency

A job is idempotent if running it multiple times has the same effect as running it once. This is crucial because Sidekiq might run your job multiple times due to retries, network issues, or worker crashes.

# BAD: Not idempotent - will charge multiple times on retry
class ChargeOrderJob
  include Sidekiq::Job

  def perform(order_id)
    order = Order.find(order_id)
    Stripe::Charge.create(
      amount: order.total_cents,
      customer: order.user.stripe_customer_id
    )
    order.update!(paid: true)
  end
end

# GOOD: Idempotent - checks if already charged
class ChargeOrderJob
  include Sidekiq::Job

  def perform(order_id)
    order = Order.find_by(id: order_id)
    return unless order
    
    # Already paid? Nothing to do
    return if order.paid?
    
    # Use idempotency key so Stripe wont double charge
    # Even if we crash after charging but before updating
    charge = Stripe::Charge.create(
      amount: order.total_cents,
      customer: order.user.stripe_customer_id,
      idempotency_key: "order_#{order.id}_charge"
    )
    
    order.update!(
      paid: true,
      stripe_charge_id: charge.id,
      paid_at: Time.current
    )
  end
end

Graceful Failure

Jobs should handle errors gracefully rather than exploding and relying on retries:

# BAD: Explodes on any error
class SyncUserToMailchimpJob
  include Sidekiq::Job

  def perform(user_id)
    user = User.find(user_id)
    
    mailchimp.lists.add_member(
      list_id: ENV['MAILCHIMP_LIST_ID'],
      email: user.email,
      merge_fields: { FNAME: user.first_name }
    )
  end
end

# GOOD: Handles expected errors, only retries on transient failures
class SyncUserToMailchimpJob
  include Sidekiq::Job
  
  sidekiq_options retry: 5  # Limit retries

  def perform(user_id)
    user = User.find_by(id: user_id)
    return unless user
    return unless user.email.present?
    
    begin
      mailchimp.lists.add_member(
        list_id: ENV['MAILCHIMP_LIST_ID'],
        email: user.email,
        merge_fields: { FNAME: user.first_name }
      )
    rescue Mailchimp::AlreadySubscribedError
      # That's fine, they are already on the list
      Rails.logger.info("User #{user_id} already subscribed to Mailchimp")
    rescue Mailchimp::InvalidEmailError => e
      # Don't retry - this email will never work
      Rails.logger.warn("Invalid email for user #{user_id}: #{e.message}")
      user.update!(mailchimp_sync_failed: true, mailchimp_sync_error: e.message)
    rescue Mailchimp::RateLimitError, Mailchimp::ServerError
      # Transient errors - let Sidekiq retry with backoff
      raise
    end
  end
end

Breaking Up Monolithic Jobs

Remember that terrible ProcessOrderJob from earlier? Let us fix it by breaking it into independent jobs:

# app/jobs/process_order_job.rb
# Orchestrator job - just coordinates the others
# Each step is its own independent job

class ProcessOrderJob
  include Sidekiq::Job

  def perform(order_id)
    order = Order.find_by(id: order_id)
    return unless order
    return if order.processing_started?
    
    # Mark that we have started - prevents double processing
    order.update!(processing_started_at: Time.current)
    
    # Enqueue each step as a separate job
    # They will run independently and can fail independently
    ChargeOrderJob.perform_async(order_id)
    UpdateInventoryJob.perform_async(order_id)
    NotifyWarehouseJob.perform_async(order_id)
    SendOrderConfirmationJob.perform_async(order_id)
  end
end

# app/jobs/charge_order_job.rb
# Just handles payment - nothing else

class ChargeOrderJob
  include Sidekiq::Job
  sidekiq_options queue: 'critical', retry: 10

  def perform(order_id)
    order = Order.find_by(id: order_id)
    return unless order
    return if order.paid?
    
    charge = Stripe::Charge.create(
      amount: order.total_cents,
      customer: order.user.stripe_customer_id,
      idempotency_key: "order_#{order.id}_v1"
    )
    
    order.update!(
      paid: true,
      stripe_charge_id: charge.id,
      paid_at: Time.current
    )
  rescue Stripe::CardError => e
    # Card declined - dont retry, notify the user
    order.update!(payment_failed: true, payment_error: e.message)
    PaymentFailedMailer.notify(order).deliver_later
  end
end

# app/jobs/update_inventory_job.rb
# Just handles inventory - nothing else

class UpdateInventoryJob
  include Sidekiq::Job
  sidekiq_options queue: 'default', retry: 5

  def perform(order_id)
    order = Order.find_by(id: order_id)
    return unless order
    return if order.inventory_updated?
    
    # Use a transaction for the inventory updates
    # But only the inventory updates, not other stuff
    ActiveRecord::Base.transaction do
      order.line_items.each do |item|
        product = item.product.lock!
        product.decrement!(:stock, item.quantity)
      end
      
      order.update!(inventory_updated_at: Time.current)
    end
  end
end

# app/jobs/notify_warehouse_job.rb
# Just handles warehouse notification - nothing else

class NotifyWarehouseJob
  include Sidekiq::Job
  sidekiq_options queue: 'external_apis', retry: 15

  def perform(order_id)
    order = Order.find_by(id: order_id)
    return unless order
    return if order.warehouse_notified?
    
    shipment = WarehouseAPI.new.create_shipment(
      order_id: order.id,
      items: order.line_items.map(&:to_warehouse_format),
      address: order.shipping_address
    )
    
    order.update!(
      warehouse_notified_at: Time.current,
      warehouse_shipment_id: shipment.id
    )
  rescue WarehouseAPI::ServiceUnavailable
    # Transient - let it retry
    raise
  rescue WarehouseAPI::InvalidAddressError => e
    # Permanent failure - need human intervention
    order.update!(warehouse_error: e.message)
    AdminAlertJob.perform_async('warehouse_address_error', order_id)
  end
end

# app/jobs/send_order_confirmation_job.rb
# Just handles the confirmation email - nothing else

class SendOrderConfirmationJob
  include Sidekiq::Job
  sidekiq_options queue: 'mailers', retry: 5

  def perform(order_id)
    order = Order.find_by(id: order_id)
    return unless order
    return if order.confirmation_sent?
    
    OrderMailer.confirmation(order).deliver_now
    order.update!(confirmation_sent_at: Time.current)
  end
end

Now if the warehouse API is down, only NotifyWarehouseJob fails and retries. The payment goes through, inventory updates, and the customer gets their confirmation email. The warehouse notification will eventually succeed when their API recovers.

If the email fails, the customer still gets charged correctly and the warehouse still gets notified. We can investigate the email failure separately.

This is the power of independent jobs. Failures are isolated. One broken thing does not cascade into everything being broken.

Queue Design: Keeping Critical Jobs Moving

Not all jobs are created equal. A password reset email is more urgent than a weekly analytics rollup. A payment confirmation is more critical than syncing data to a CRM.

Sidekiq lets you define multiple queues with different priorities:

# config/sidekiq.yml
# Queue configuration - order matters!
# Sidekiq processes queues in the order listed

:concurrency: 10
:queues:
  - [critical, 6]      # 6 threads dedicated to critical
  - [default, 3]       # 3 threads for default
  - [mailers, 2]       # 2 threads for emails
  - [low, 1]           # 1 thread for low priority
  - [external_apis, 2] # 2 threads for external API calls

The numbers in brackets are weights. Sidekiq will process jobs from critical 6 times as often as from low. But this is probabilistic, not absolute, so low priority jobs still make progress.
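Conceptually, weighted polling can be pictured as expanding each queue name by its weight, shuffling, and checking queues in the resulting order. The sketch below simulates that and shows critical coming up first roughly half the time with the weights above (6 of 12 total weight). This illustrates the behaviour; it is not Sidekiq's actual source.

```ruby
# Weights matching the sidekiq.yml above
WEIGHTS = { "critical" => 6, "default" => 3, "mailers" => 2, "low" => 1 }

# Expand each queue by its weight, shuffle, deduplicate:
# the first entry is the queue checked first on this fetch.
def weighted_fetch_order(weights)
  weights.flat_map { |name, w| [name] * w }.shuffle.uniq
end

# Simulate many fetches and count which queue comes up first
first_counts = Hash.new(0)
10_000.times { first_counts[weighted_fetch_order(WEIGHTS).first] += 1 }
# critical lands first ~50% of the time, low only ~8% -
# favoured heavily, but low still makes progress
```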

For truly critical jobs, I prefer running separate Sidekiq processes:

# Process 1: Only critical jobs
bundle exec sidekiq -q critical -c 5

# Process 2: Default and mailers
bundle exec sidekiq -q default -q mailers -c 10

# Process 3: Low priority and external APIs
bundle exec sidekiq -q low -q external_apis -c 5

This way, even if the external_apis queue gets backed up because a third party is slow, your critical jobs keep flowing.

Retry Strategies and Dead Letter Queues

Sidekiq's default retry behaviour is sensible but you should understand and customise it:

# Default: 25 retries over ~21 days
# Retry intervals: 15s, 16s, 31s, 96s, 271s, ... (exponential backoff)

class SomeJob
  include Sidekiq::Job
  
  # Customise retry behaviour
  sidekiq_options(
    retry: 10,           # Max 10 retries instead of 25
    dead: true,          # Move to dead queue after max retries (default)
    backtrace: 10        # Store 10 lines of backtrace
  )
  
  # Custom retry delay
  sidekiq_retry_in do |count, exception|
    case exception
    when RateLimitError
      # Rate limited - wait longer
      60 * (count + 1)  # 1 min, 2 min, 3 min, etc
    when ServiceUnavailable
      # Service down - exponential backoff
      (count ** 4) + 15 + (rand(10) * (count + 1))
    else
      # Default exponential backoff
      (count ** 4) + 15 + (rand(10) * (count + 1))
    end
  end
end

After exhausting retries, jobs go to the Dead Job queue (sometimes called the Dead Letter Queue or DLQ). These are jobs that have failed permanently and need human attention.

You must monitor your dead job queue. If it is filling up, something is wrong. I set up alerts that fire if more than X jobs die per hour:

# app/jobs/concerns/dead_job_alerting.rb
# Alert when jobs die - don't let them pile up silently

module DeadJobAlerting
  extend ActiveSupport::Concern

  included do
    sidekiq_retries_exhausted do |job, exception|
      # Log it properly
      Rails.logger.error(
        "Job #{job['class']} died after #{job['retry_count']} retries. " \
        "Args: #{job['args']}. Error: #{exception.message}"
      )
      
      # Alert the team
      Slack.notify(
        channel: '#alerts',
        text: ":skull: Job died: #{job['class']} - #{exception.message}"
      )
      
      # Track metrics
      StatsD.increment('sidekiq.jobs.dead', tags: ["job:#{job['class']}"])
    end
  end
end

# Include in your jobs
class ImportantJob
  include Sidekiq::Job
  include DeadJobAlerting
  
  # ...
end

Monitoring and Alerting

You cannot fix what you cannot see. Sidekiq provides a web UI that you should absolutely deploy:

# config/routes.rb
require 'sidekiq/web'

Rails.application.routes.draw do
  # Protect with authentication
  authenticate :user, ->(user) { user.admin? } do
    mount Sidekiq::Web => '/sidekiq'
  end
end

But the web UI is for manual inspection. For production monitoring, you need metrics and alerts:

# config/initializers/sidekiq.rb

Sidekiq.configure_server do |config|
  config.on(:startup) do
    # Report queue sizes every minute
    Thread.new do
      loop do
        stats = Sidekiq::Stats.new
        
        StatsD.gauge('sidekiq.enqueued', stats.enqueued)
        StatsD.gauge('sidekiq.processed', stats.processed)
        StatsD.gauge('sidekiq.failed', stats.failed)
        StatsD.gauge('sidekiq.dead', stats.dead_size)
        StatsD.gauge('sidekiq.retry', stats.retry_size)
        StatsD.gauge('sidekiq.workers', stats.workers_size)
        
        Sidekiq::Queue.all.each do |queue|
          StatsD.gauge('sidekiq.queue.size', queue.size, tags: ["queue:#{queue.name}"])
          StatsD.gauge('sidekiq.queue.latency', queue.latency, tags: ["queue:#{queue.name}"])
        end
        
        sleep 60
      end
    end
  end
end

Key metrics to alert on:

Queue latency: How long jobs wait before being picked up. If this spikes, you need more workers or your jobs are too slow.

Dead job count: Jobs that have failed permanently. Should be near zero. Any increase needs investigation.

Retry queue size: Jobs waiting to be retried. A sustained high number indicates systematic failures.

Queue depth: Number of jobs waiting. Occasional spikes are fine, sustained growth means you are not keeping up.
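As a sketch of how the latency metric can drive alerts, here is a hypothetical check that flags queues breaching per-queue thresholds. The threshold numbers are illustrative; tune them to your own SLAs.

```ruby
# Illustrative per-queue latency budgets, in seconds
LATENCY_THRESHOLDS = {
  "critical" => 30,     # critical jobs should start within 30s
  "default"  => 300,
  "low"      => 3_600   # low priority can wait up to an hour
}.freeze

# Given { queue_name => latency_in_seconds } (e.g. from Sidekiq::Queue#latency),
# return the queues that have blown their budget.
def breached_queues(latencies, thresholds = LATENCY_THRESHOLDS)
  latencies.select { |queue, latency| latency > thresholds.fetch(queue, 300) }.keys
end
```

Run this from the same metrics loop shown earlier and page when the result is non-empty; a breached critical queue is exactly the signal that workers are saturated or a job is wedged.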

Lessons From Production

Let me share some specific lessons from real outages:

Lesson 1: Arguments Must Be Serialisable

# BAD: Passing ActiveRecord objects
SendEmailJob.perform_async(@user)  # Serialises the whole object!

# GOOD: Pass IDs
SendEmailJob.perform_async(@user.id)

Sidekiq serialises job arguments to JSON. ActiveRecord objects become huge blobs of data. Worse, by the time the job runs, the object might be stale. Always pass IDs and fetch fresh data in the job.

Lesson 2: Jobs Should Be Fast to Enqueue

# BAD: Slow to enqueue
def create
  @order = Order.create!(order_params)
  
  # This makes 1000 round trips to Redis during enqueue!
  @order.line_items.each do |item|
    UpdateInventoryJob.perform_async(item.id)
  end
end

# GOOD: Enqueue one job for the whole order
# (Sidekiq's perform_bulk is another option for enqueueing many jobs at once)
def create
  @order = Order.create!(order_params)
  
  # Single job that handles all items
  UpdateOrderInventoryJob.perform_async(@order.id)
end

Lesson 3: Watch Your Memory

Sidekiq workers are long running processes. Memory leaks accumulate over time:

# BAD: Loading too much into memory
class ProcessAllOrdersJob
  include Sidekiq::Job

  def perform
    orders = Order.where(status: 'pending').to_a  # Loads ALL into memory
    orders.each { |o| process(o) }
  end
end

# GOOD: Batch processing
class ProcessAllOrdersJob
  include Sidekiq::Job

  def perform
    Order.where(status: 'pending').find_each(batch_size: 100) do |order|
      process(order)
    end
  end
end

Lesson 4: Test Your Jobs

# spec/jobs/send_welcome_email_job_spec.rb
require 'rails_helper'

RSpec.describe SendWelcomeEmailJob, type: :job do
  describe '#perform' do
    let(:user) { create(:user) }

    it 'sends welcome email' do
      expect {
        described_class.new.perform(user.id)
      }.to change { ActionMailer::Base.deliveries.count }.by(1)
    end

    it 'handles missing user gracefully' do
      expect {
        described_class.new.perform(999999)
      }.not_to raise_error
    end

    it 'is idempotent' do
      described_class.new.perform(user.id)
      
      expect {
        described_class.new.perform(user.id)
      }.not_to change { ActionMailer::Base.deliveries.count }
    end
  end
end

Conclusion

Background workers are not optional for modern web applications. But they are also a significant source of complexity and potential failures. The difference between a robust system and a house of cards comes down to how well you design your jobs.

Remember:

Jobs must be independent. Do not rely on other jobs having run first.

Jobs must be idempotent. Running twice should have the same effect as running once.

Jobs must handle failure gracefully. Because failure is not a matter of if, but when.

Queue design matters. Separate critical jobs from bulk processing.

Monitor everything. You cannot fix what you cannot see.

I have seen too many applications brought down by poorly designed background jobs. The Black Friday outage I mentioned at the start? It was completely preventable. A few guards, some retry limits, and proper queue isolation would have kept the system running.

Do not learn these lessons the hard way. Design your jobs properly from the start.

I have been running Sidekiq in production for over a decade across applications processing millions of jobs per day. The patterns in this post are battle tested. The horror stories are real.

Need help designing a robust background job architecture? Or debugging mysterious job failures? I have seen it all and can help you build systems that stay up when things go wrong. Let us chat.
