
Multi-Worker Deployment Guide

Problem Statement

When running FastAPI with multiple workers (e.g., uvicorn app:app --workers 4), the lifespan function runs in every worker process, so singleton services start once per worker:

  • Email schedulers send duplicate notifications (4x emails if 4 workers)
  • Background tasks run redundantly across all workers
  • Database migrations/hashing may cause race conditions
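
To see the issue firsthand, here is a minimal reproduction using plain FastAPI (no project code); each worker process prints its own startup line:

# demo.py - run with: uvicorn demo:app --workers 4
import os
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # This prints once PER WORKER PROCESS, not once per deployment.
    print(f"lifespan startup in PID {os.getpid()}")
    yield


app = FastAPI(lifespan=lifespan)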

Solution: File-Based Worker Coordination

We use file-based locking to ensure only ONE worker runs singleton services. This approach:

  • Works across different process managers (uvicorn, gunicorn, systemd)
  • No external dependencies (Redis, databases)
  • Failover (if the primary worker crashes, the OS releases the lock and a newly started worker can acquire it)
  • Simple and reliable

Implementation

1. Worker Coordination Module

The worker_coordination.py module provides:

from alpine_bits_python.worker_coordination import is_primary_worker

# In your lifespan function
is_primary, worker_lock = is_primary_worker()

if is_primary:
    # Start schedulers, background tasks, etc.
    start_email_scheduler()
else:
    # This is a secondary worker - skip singleton services
    pass
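
Note: keep the returned worker_lock referenced for the lifetime of the process. Depending on how the lock is implemented, letting it be garbage-collected can close the underlying file descriptor and silently release the lock, demoting the primary.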

2. How It Works

┌─────────────────────────────────────────────────────┐
│  uvicorn --workers 4                                 │
└─────────────────────────────────────────────────────┘
         │
         ├─── Worker 0 (PID 1001) ─┐
         ├─── Worker 1 (PID 1002) ─┤
         ├─── Worker 2 (PID 1003) ─┤  All try to acquire
         └─── Worker 3 (PID 1004) ─┘  /tmp/alpinebits_primary_worker.lock

                    │
                    ▼

    Worker 0: ✓ Lock acquired → PRIMARY
    Worker 1: ✗ Lock busy → SECONDARY
    Worker 2: ✗ Lock busy → SECONDARY
    Worker 3: ✗ Lock busy → SECONDARY
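
Under the hood this is ordinary advisory file locking. A minimal sketch of the mechanism using fcntl.flock (illustrative only; the actual WorkerLock in worker_coordination.py may differ in detail):

import fcntl
import os


class WorkerLock:
    """Advisory file lock; the OS releases it automatically if the process dies."""

    def __init__(self, path="/tmp/alpinebits_primary_worker.lock"):
        self.path = path
        self._file = None

    def acquire(self) -> bool:
        self._file = open(self.path, "w")
        try:
            # Non-blocking exclusive lock: exactly one process can hold it.
            fcntl.flock(self._file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            self._file.close()
            self._file = None
            return False
        # Record our PID for debugging (see "Check Lock File" below).
        self._file.write(str(os.getpid()))
        self._file.flush()
        return True

    def release(self) -> None:
        if self._file:
            fcntl.flock(self._file, fcntl.LOCK_UN)
            self._file.close()
            self._file = None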

3. Lifespan Function

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Determine primary worker using file lock
    is_primary, worker_lock = is_primary_worker()

    _LOGGER.info("Worker startup: pid=%d, primary=%s", os.getpid(), is_primary)

    # All workers: shared setup
    config = load_config()
    engine = create_async_engine(DATABASE_URL)

    # Only primary worker: singleton services
    if is_primary:
        # Start email scheduler
        email_handler, report_scheduler = setup_logging(
            config, email_service, loop, enable_scheduler=True
        )
        report_scheduler.start()

        # Run database migrations/hashing
        await hash_existing_customers()
    else:
        # Secondary workers: skip schedulers
        email_handler, report_scheduler = setup_logging(
            config, email_service, loop, enable_scheduler=False
        )

    yield

    # Cleanup
    if report_scheduler:
        report_scheduler.stop()

    # Release lock
    if worker_lock:
        worker_lock.release()
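
Wiring the lifespan into the application is the standard FastAPI pattern:

app = FastAPI(lifespan=lifespan)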

Deployment Scenarios

Development (Single Worker)

# No special configuration needed
uvicorn alpine_bits_python.api:app --reload

Result: Single worker becomes primary automatically.

Production (Multiple Workers)

# 4 workers for handling concurrent requests
uvicorn alpine_bits_python.api:app --workers 4 --host 0.0.0.0 --port 8000

Result:

  • Worker 0 becomes PRIMARY → runs schedulers
  • Workers 1-3 are SECONDARY → handle requests only

With Gunicorn

gunicorn alpine_bits_python.api:app \
    --workers 4 \
    --worker-class uvicorn.workers.UvicornWorker \
    --bind 0.0.0.0:8000

Result: Same as uvicorn - one primary, rest secondary.

Docker Compose

services:
  api:
    image: alpinebits-api
    command: uvicorn alpine_bits_python.api:app --workers 4 --host 0.0.0.0
    volumes:
      - /tmp:/tmp  # Important: Share lock file location

Important: When using multiple containers, ensure they share the same lock file location or use Redis-based coordination instead.

Monitoring & Debugging

Check Which Worker is Primary

Look for log messages at startup:

Worker startup: pid=1001, primary=True
Worker startup: pid=1002, primary=False
Worker startup: pid=1003, primary=False
Worker startup: pid=1004, primary=False

Check Lock File

# See which PID holds the lock
cat /tmp/alpinebits_primary_worker.lock
# Output: 1001

# Verify process is running
ps aux | grep 1001
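
To check whether the lock is actually held (not just whether the file exists), you can probe it with the flock utility from util-linux:

# Exits non-zero immediately if another process holds the lock
flock -n /tmp/alpinebits_primary_worker.lock -c true \
    && echo "lock is free" || echo "lock is held"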

Testing Worker Coordination

Run the test script:

uv run python test_worker_coordination.py

Expected output:

Worker 0 (PID 30773): ✓ I am PRIMARY
Worker 1 (PID 30774): ✗ I am SECONDARY
Worker 2 (PID 30775): ✗ I am SECONDARY
Worker 3 (PID 30776): ✗ I am SECONDARY
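
If you want to reproduce this without the script, a rough equivalent using multiprocessing looks like the following (the actual test_worker_coordination.py may differ):

import multiprocessing
import os
import time

from alpine_bits_python.worker_coordination import is_primary_worker


def worker(index: int) -> None:
    is_primary, lock = is_primary_worker()
    role = "PRIMARY" if is_primary else "SECONDARY"
    print(f"Worker {index} (PID {os.getpid()}): I am {role}")
    time.sleep(2)  # hold the lock while the other workers report
    if lock:
        lock.release()


if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()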

Failover Behavior

Primary Worker Crashes

  1. Primary worker holds lock
  2. Primary worker crashes/exits → lock is automatically released by OS
  3. Existing secondary workers remain secondary (they already failed to acquire lock)
  4. Next restart: First worker to start becomes new primary

Graceful Restart

  1. Send SIGTERM to workers
  2. Primary worker releases lock in shutdown
  3. New workers start, one becomes primary

Lock File Location

Default: /tmp/alpinebits_primary_worker.lock

Change Lock Location

from alpine_bits_python.worker_coordination import WorkerLock

# Custom location
lock = WorkerLock("/var/run/alpinebits/primary.lock")
is_primary = lock.acquire()

Production recommendation: Use /var/run/ or /run/ for lock files (automatically cleaned on reboot).
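
If the directory does not exist yet (common for per-service paths under /var/run/), create it before acquiring:

import os

os.makedirs("/var/run/alpinebits", exist_ok=True)
lock = WorkerLock("/var/run/alpinebits/primary.lock")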

Common Issues

Issue: All workers think they're primary

Cause: Lock file path not accessible or workers running in separate containers.

Solution:

  • Check file permissions on lock directory
  • For containers: Use shared volume or Redis-based coordination

Issue: No worker becomes primary

Cause: Lock file from previous run still exists.

Solution:

# Clean up stale lock file
rm /tmp/alpinebits_primary_worker.lock
# Restart application

Issue: Duplicate emails still being sent

Cause: Usually not a bug. The email alert handler intentionally runs on all workers so that errors from any worker trigger an alert; only the email scheduler is restricted to the primary.

Solution: This is the intended behavior: alerts may come from any worker, while scheduled reports come only from the primary. If scheduled reports themselves arrive in duplicate, verify that enable_scheduler=True is passed only on the primary worker.

Alternative Approaches

Redis-Based Coordination

For multi-container deployments, consider Redis-based locks:

import redis
from redis.lock import Lock

redis_client = redis.Redis(host='redis', port=6379)
lock = Lock(redis_client, "alpinebits_primary_worker", timeout=60)

if lock.acquire(blocking=False):
    # This is the primary worker
    start_schedulers()

Pros: Works across containers
Cons: Requires Redis dependency
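
One caveat with the sketch above: a Redis lock created with timeout=60 expires after 60 seconds unless the holder renews it, at which point another worker could acquire it. The primary should therefore extend the lock periodically (Lock.extend is part of redis-py); a minimal renewal loop:

import threading
import time


def keep_lock_alive(lock, interval=30):
    # Re-arm the TTL well before it expires so the primary keeps its role.
    while True:
        time.sleep(interval)
        lock.extend(30)


threading.Thread(target=keep_lock_alive, args=(lock,), daemon=True).start()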

Environment Variable Flag

# Manually set primary worker
ALPINEBITS_PRIMARY_WORKER=true uvicorn app:app

Pros: Simple
Cons: Manual configuration, no automatic failover

Best Practices

  1. Use file locks for single-host deployments (our implementation)
  2. Use Redis locks for multi-container deployments
  3. Log primary/secondary status at startup
  4. Always release locks on shutdown
  5. Keep lock files in /var/run/ or /tmp/
  6. Don't rely on process names (unreliable with uvicorn)
  7. Don't use environment variables (no automatic failover)
  8. Don't skip coordination (will cause duplicate notifications)

Summary

With file-based worker coordination:

  • Only ONE worker runs singleton services (schedulers, migrations)
  • All workers handle HTTP requests normally
  • Failover when the primary worker crashes (a restarted worker picks up the released lock)
  • No external dependencies needed
  • Works with uvicorn, gunicorn, and other ASGI servers

This ensures you get the benefits of multiple workers (concurrency) without duplicate email notifications or race conditions.