
Multi-Worker Deployment Guide

Problem Statement

When running FastAPI with multiple workers (e.g., uvicorn app:app --workers 4), the lifespan function runs in every worker process, so singleton services start once per worker:

  • Email schedulers send duplicate notifications (4x emails if 4 workers)
  • Background tasks run redundantly across all workers
  • Database migrations/hashing may cause race conditions
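
To see the issue firsthand, here is a minimal reproduction using plain FastAPI (no project code); each worker process prints its own startup line:

# demo.py - run with: uvicorn demo:app --workers 4
import os
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # This prints once PER WORKER PROCESS, not once per deployment.
    print(f"lifespan startup in PID {os.getpid()}")
    yield


app = FastAPI(lifespan=lifespan)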

Solution: File-Based Worker Coordination

We use file-based locking to ensure only ONE worker runs singleton services. This approach:

  • Works across different process managers (uvicorn, gunicorn, systemd)
  • No external dependencies (Redis, databases)
  • Failover (if the primary worker crashes, the OS releases the lock and a newly started worker can acquire it)
  • Simple and reliable

Implementation

1. Worker Coordination Module

The worker_coordination.py module provides:

from alpine_bits_python.worker_coordination import is_primary_worker

# In your lifespan function
is_primary, worker_lock = is_primary_worker()

if is_primary:
    # Start schedulers, background tasks, etc.
    start_email_scheduler()
else:
    # This is a secondary worker - skip singleton services
    pass
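
Note: keep the returned worker_lock referenced for the lifetime of the process. Depending on how the lock is implemented, letting it be garbage-collected can close the underlying file descriptor and silently release the lock, demoting the primary.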

2. How It Works

┌─────────────────────────────────────────────────────┐
│  uvicorn --workers 4                                 │
└─────────────────────────────────────────────────────┘
         │
         ├─── Worker 0 (PID 1001) ─┐
         ├─── Worker 1 (PID 1002) ─┤
         ├─── Worker 2 (PID 1003) ─┤  All try to acquire
         └─── Worker 3 (PID 1004) ─┘  /tmp/alpinebits_primary_worker.lock

                    │
                    ▼

    Worker 0: ✓ Lock acquired → PRIMARY
    Worker 1: ✗ Lock busy → SECONDARY
    Worker 2: ✗ Lock busy → SECONDARY
    Worker 3: ✗ Lock busy → SECONDARY
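
Under the hood this is ordinary advisory file locking. A minimal sketch of the mechanism using fcntl.flock (illustrative only; the actual WorkerLock in worker_coordination.py may differ in detail):

import fcntl
import os


class WorkerLock:
    """Advisory file lock; the OS releases it automatically if the process dies."""

    def __init__(self, path="/tmp/alpinebits_primary_worker.lock"):
        self.path = path
        self._file = None

    def acquire(self) -> bool:
        self._file = open(self.path, "w")
        try:
            # Non-blocking exclusive lock: exactly one process can hold it.
            fcntl.flock(self._file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            self._file.close()
            self._file = None
            return False
        # Record our PID for debugging (see "Check Lock File" below).
        self._file.write(str(os.getpid()))
        self._file.flush()
        return True

    def release(self) -> None:
        if self._file:
            fcntl.flock(self._file, fcntl.LOCK_UN)
            self._file.close()
            self._file = None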

3. Lifespan Function

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Determine primary worker using file lock
    is_primary, worker_lock = is_primary_worker()

    _LOGGER.info("Worker startup: pid=%d, primary=%s", os.getpid(), is_primary)

    # All workers: shared setup
    config = load_config()
    engine = create_async_engine(DATABASE_URL)

    # Only primary worker: singleton services
    if is_primary:
        # Start email scheduler
        email_handler, report_scheduler = setup_logging(
            config, email_service, loop, enable_scheduler=True
        )
        report_scheduler.start()

        # Run database migrations/hashing
        await hash_existing_customers()
    else:
        # Secondary workers: skip schedulers
        email_handler, report_scheduler = setup_logging(
            config, email_service, loop, enable_scheduler=False
        )

    yield

    # Cleanup
    if report_scheduler:
        report_scheduler.stop()

    # Release lock
    if worker_lock:
        worker_lock.release()
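
Wiring the lifespan into the application is the standard FastAPI pattern:

app = FastAPI(lifespan=lifespan)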

Deployment Scenarios

Development (Single Worker)

# No special configuration needed
uvicorn alpine_bits_python.api:app --reload

Result: Single worker becomes primary automatically.

Production (Multiple Workers)

# 4 workers for handling concurrent requests
uvicorn alpine_bits_python.api:app --workers 4 --host 0.0.0.0 --port 8000

Result:

  • Worker 0 becomes PRIMARY → runs schedulers
  • Workers 1-3 are SECONDARY → handle requests only

With Gunicorn

gunicorn alpine_bits_python.api:app \
    --workers 4 \
    --worker-class uvicorn.workers.UvicornWorker \
    --bind 0.0.0.0:8000

Result: Same as uvicorn - one primary, rest secondary.

Docker Compose

services:
  api:
    image: alpinebits-api
    command: uvicorn alpine_bits_python.api:app --workers 4 --host 0.0.0.0
    volumes:
      - /tmp:/tmp  # Important: Share lock file location

Important: When using multiple containers, ensure they share the same lock file location or use Redis-based coordination instead.

Monitoring & Debugging

Check Which Worker is Primary

Look for log messages at startup:

Worker startup: pid=1001, primary=True
Worker startup: pid=1002, primary=False
Worker startup: pid=1003, primary=False
Worker startup: pid=1004, primary=False

Check Lock File

# See which PID holds the lock
cat /tmp/alpinebits_primary_worker.lock
# Output: 1001

# Verify process is running
ps aux | grep 1001
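
To check whether the lock is actually held (not just whether the file exists), you can probe it with the flock utility from util-linux:

# Exits non-zero immediately if another process holds the lock
flock -n /tmp/alpinebits_primary_worker.lock -c true \
    && echo "lock is free" || echo "lock is held"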

Testing Worker Coordination

Run the test script:

uv run python test_worker_coordination.py

Expected output:

Worker 0 (PID 30773): ✓ I am PRIMARY
Worker 1 (PID 30774): ✗ I am SECONDARY
Worker 2 (PID 30775): ✗ I am SECONDARY
Worker 3 (PID 30776): ✗ I am SECONDARY
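
If you want to reproduce this without the script, a rough equivalent using multiprocessing looks like the following (the actual test_worker_coordination.py may differ):

import multiprocessing
import os
import time

from alpine_bits_python.worker_coordination import is_primary_worker


def worker(index: int) -> None:
    is_primary, lock = is_primary_worker()
    role = "PRIMARY" if is_primary else "SECONDARY"
    print(f"Worker {index} (PID {os.getpid()}): I am {role}")
    time.sleep(2)  # hold the lock while the other workers report
    if lock:
        lock.release()


if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()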

Failover Behavior

Primary Worker Crashes

  1. Primary worker holds lock
  2. Primary worker crashes/exits → lock is automatically released by OS
  3. Existing secondary workers remain secondary (they already failed to acquire lock)
  4. Next restart: First worker to start becomes new primary

Graceful Restart

  1. Send SIGTERM to workers
  2. Primary worker releases lock in shutdown
  3. New workers start, one becomes primary

Lock File Location

Default: /tmp/alpinebits_primary_worker.lock

Change Lock Location

from alpine_bits_python.worker_coordination import WorkerLock

# Custom location
lock = WorkerLock("/var/run/alpinebits/primary.lock")
is_primary = lock.acquire()

Production recommendation: Use /var/run/ or /run/ for lock files (automatically cleaned on reboot).
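
If the directory does not exist yet (common for per-service paths under /var/run/), create it before acquiring:

import os

os.makedirs("/var/run/alpinebits", exist_ok=True)
lock = WorkerLock("/var/run/alpinebits/primary.lock")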

Common Issues

Issue: All workers think they're primary

Cause: Lock file path not accessible or workers running in separate containers.

Solution:

  • Check file permissions on lock directory
  • For containers: Use shared volume or Redis-based coordination

Issue: No worker becomes primary

Cause: Lock file from previous run still exists.

Solution:

# Clean up stale lock file
rm /tmp/alpinebits_primary_worker.lock
# Restart application

Issue: Duplicate emails still being sent

Cause: Usually not a bug. The email alert handler intentionally runs on all workers so that errors from any worker trigger an alert; only the email scheduler is restricted to the primary.

Solution: This is the intended behavior: alerts may come from any worker, while scheduled reports come only from the primary. If scheduled reports themselves arrive in duplicate, verify that enable_scheduler=True is passed only on the primary worker.

Alternative Approaches

Redis-Based Coordination

For multi-container deployments, consider Redis-based locks:

import redis
from redis.lock import Lock

redis_client = redis.Redis(host='redis', port=6379)
lock = Lock(redis_client, "alpinebits_primary_worker", timeout=60)

if lock.acquire(blocking=False):
    # This is the primary worker
    start_schedulers()

Pros: Works across containers
Cons: Requires Redis dependency
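
One caveat with the sketch above: a Redis lock created with timeout=60 expires after 60 seconds unless the holder renews it, at which point another worker could acquire it. The primary should therefore extend the lock periodically (Lock.extend is part of redis-py); a minimal renewal loop:

import threading
import time


def keep_lock_alive(lock, interval=30):
    # Re-arm the TTL well before it expires so the primary keeps its role.
    while True:
        time.sleep(interval)
        lock.extend(30)


threading.Thread(target=keep_lock_alive, args=(lock,), daemon=True).start()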

Environment Variable Flag

# Manually set primary worker
ALPINEBITS_PRIMARY_WORKER=true uvicorn app:app

Pros: Simple
Cons: Manual configuration, no automatic failover

Best Practices

  1. Use file locks for single-host deployments (our implementation)
  2. Use Redis locks for multi-container deployments
  3. Log primary/secondary status at startup
  4. Always release locks on shutdown
  5. Keep lock files in /var/run/ or /tmp/
  6. Don't rely on process names (unreliable with uvicorn)
  7. Don't use environment variables (no automatic failover)
  8. Don't skip coordination (will cause duplicate notifications)

Summary

With file-based worker coordination:

  • Only ONE worker runs singleton services (schedulers, migrations)
  • All workers handle HTTP requests normally
  • Failover when the primary worker crashes (a restarted worker picks up the released lock)
  • No external dependencies needed
  • Works with uvicorn, gunicorn, and other ASGI servers

This ensures you get the benefits of multiple workers (concurrency) without duplicate email notifications or race conditions.