# Multi-Worker Deployment Guide

## Problem Statement

When running FastAPI with multiple workers (e.g., `uvicorn app:app --workers 4`), the `lifespan` function runs in **every worker process**. This causes singleton services to run multiple times:

- ❌ **Email schedulers** send duplicate notifications (4x emails if 4 workers)
- ❌ **Background tasks** run redundantly across all workers
- ❌ **Database migrations/hashing** may cause race conditions

## Solution: File-Based Worker Coordination

We use **file-based locking** to ensure only ONE worker runs singleton services. This approach:

- ✅ Works across different process managers (uvicorn, gunicorn, systemd)
- ✅ No external dependencies (Redis, databases)
- ✅ Automatic failover (if the primary worker crashes, the OS releases the lock and a restarted worker can acquire it)
- ✅ Simple and reliable

## Implementation

### 1. Worker Coordination Module

The `worker_coordination.py` module provides:

```python
from alpine_bits_python.worker_coordination import is_primary_worker

# In your lifespan function
is_primary, worker_lock = is_primary_worker()

if is_primary:
    # Start schedulers, background tasks, etc.
    start_email_scheduler()
else:
    # This is a secondary worker - skip singleton services
    pass
```

### 2. How It Works

```
┌─────────────────────────────────────────────────────┐
│                uvicorn --workers 4                  │
└─────────────────────────────────────────────────────┘
        │
        ├─── Worker 0 (PID 1001) ─┐
        ├─── Worker 1 (PID 1002) ─┤
        ├─── Worker 2 (PID 1003) ─┤  All try to acquire
        └─── Worker 3 (PID 1004) ─┘  /tmp/alpinebits_primary_worker.lock
                                  │
                                  ▼
        Worker 0: ✓ Lock acquired → PRIMARY
        Worker 1: ✗ Lock busy     → SECONDARY
        Worker 2: ✗ Lock busy     → SECONDARY
        Worker 3: ✗ Lock busy     → SECONDARY
```

### 3. Lifespan Function

```python
async def lifespan(app: FastAPI):
    # Determine primary worker using file lock
    is_primary, worker_lock = is_primary_worker()

    _LOGGER.info("Worker startup: pid=%d, primary=%s", os.getpid(), is_primary)

    # All workers: shared setup
    config = load_config()
    engine = create_async_engine(DATABASE_URL)

    # Only primary worker: singleton services
    if is_primary:
        # Start the email scheduler
        email_handler, report_scheduler = setup_logging(
            config, email_service, loop, enable_scheduler=True
        )
        report_scheduler.start()

        # Run database migrations/hashing
        await hash_existing_customers()
    else:
        # Secondary workers: skip schedulers
        email_handler, report_scheduler = setup_logging(
            config, email_service, loop, enable_scheduler=False
        )

    yield

    # Cleanup
    if report_scheduler:
        report_scheduler.stop()

    # Release lock
    if worker_lock:
        worker_lock.release()
```

## Deployment Scenarios

### Development (Single Worker)

```bash
# No special configuration needed
uvicorn alpine_bits_python.api:app --reload
```

Result: The single worker becomes primary automatically.

### Production (Multiple Workers)

```bash
# 4 workers for handling concurrent requests
uvicorn alpine_bits_python.api:app --workers 4 --host 0.0.0.0 --port 8000
```

Result:

- Worker 0 becomes PRIMARY → runs schedulers
- Workers 1-3 are SECONDARY → handle requests only

### With Gunicorn

```bash
gunicorn alpine_bits_python.api:app \
    --workers 4 \
    --worker-class uvicorn.workers.UvicornWorker \
    --bind 0.0.0.0:8000
```

Result: Same as uvicorn - one primary, the rest secondary.

### Docker Compose

```yaml
services:
  api:
    image: alpinebits-api
    command: uvicorn alpine_bits_python.api:app --workers 4 --host 0.0.0.0
    volumes:
      - /tmp:/tmp  # Important: share the lock file location
```

**Important**: When using multiple containers, ensure they share the same lock file location or use Redis-based coordination instead.
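For reference, the file-based lock underlying all of these scenarios can be sketched roughly as follows. This is a minimal illustration assuming POSIX `fcntl.flock` semantics; the actual `worker_coordination.py` module may differ in details, and the `WorkerLock` internals shown here are illustrative, not the shipped implementation:

```python
import fcntl
import os


class WorkerLock:
    """Non-blocking, exclusive file lock (POSIX only, via flock)."""

    def __init__(self, path: str = "/tmp/alpinebits_primary_worker.lock"):
        self.path = path
        self._fd = None

    def acquire(self) -> bool:
        """Return True if this process obtained the lock, False if it is busy."""
        self._fd = open(self.path, "a+")
        try:
            # LOCK_NB makes the call fail immediately instead of blocking
            fcntl.flock(self._fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            self._fd.close()
            self._fd = None
            return False
        # Record our PID so operators can inspect who holds the lock
        self._fd.seek(0)
        self._fd.truncate()
        self._fd.write(str(os.getpid()))
        self._fd.flush()
        return True

    def release(self) -> None:
        """Unlock and close; the OS also releases flock locks on process exit."""
        if self._fd is not None:
            fcntl.flock(self._fd, fcntl.LOCK_UN)
            self._fd.close()
            self._fd = None


def is_primary_worker(path: str = "/tmp/alpinebits_primary_worker.lock"):
    """Return (is_primary, lock); keep the lock object alive for the worker's lifetime."""
    lock = WorkerLock(path)
    return lock.acquire(), lock
```

Because `flock` locks are tied to the open file description, the kernel releases them automatically when the holding process dies, which is what makes crash failover possible without any cleanup code.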
## Monitoring & Debugging

### Check Which Worker Is Primary

Look for log messages at startup:

```
Worker startup: pid=1001, primary=True
Worker startup: pid=1002, primary=False
Worker startup: pid=1003, primary=False
Worker startup: pid=1004, primary=False
```

### Check the Lock File

```bash
# See which PID holds the lock
cat /tmp/alpinebits_primary_worker.lock
# Output: 1001

# Verify the process is running
ps aux | grep 1001
```

### Testing Worker Coordination

Run the test script:

```bash
uv run python test_worker_coordination.py
```

Expected output:

```
Worker 0 (PID 30773): ✓ I am PRIMARY
Worker 1 (PID 30774): ✗ I am SECONDARY
Worker 2 (PID 30775): ✗ I am SECONDARY
Worker 3 (PID 30776): ✗ I am SECONDARY
```

## Failover Behavior

### Primary Worker Crashes

1. The primary worker holds the lock.
2. The primary worker crashes/exits → the lock is automatically released by the OS.
3. Existing secondary workers remain secondary (they already failed to acquire the lock and do not retry).
4. **Next restart**: The first worker to start becomes the new primary.

### Graceful Restart

1. Send SIGTERM to the workers.
2. The primary worker releases the lock during shutdown.
3. New workers start; one becomes primary.

## Lock File Location

Default: `/tmp/alpinebits_primary_worker.lock`

### Change Lock Location

```python
from alpine_bits_python.worker_coordination import WorkerLock

# Custom location
lock = WorkerLock("/var/run/alpinebits/primary.lock")
is_primary = lock.acquire()
```

**Production recommendation**: Use `/var/run/` or `/run/` for lock files (they are automatically cleaned on reboot).

## Common Issues

### Issue: All workers think they're primary

**Cause**: The lock file path is not accessible, or the workers are running in separate containers.

**Solution**:

- Check file permissions on the lock directory
- For containers: use a shared volume or Redis-based coordination

### Issue: No worker becomes primary

**Cause**: A lock file from a previous run still exists.

**Solution**:

```bash
# Clean up the stale lock file
rm /tmp/alpinebits_primary_worker.lock

# Restart the application
```

### Issue: Duplicate emails still being sent

**Cause**: The email handler runs on all workers (not just the scheduler).

**Solution**: This is correct behavior. The email **alert handler** runs on all workers (to catch errors from any worker), while the email **scheduler** runs only on the primary - alerts come from any worker, scheduled reports only from the primary.

## Alternative Approaches

### Redis-Based Coordination

For multi-container deployments, consider Redis-based locks:

```python
import redis
from redis.lock import Lock

redis_client = redis.Redis(host="redis", port=6379)
lock = Lock(redis_client, "alpinebits_primary_worker", timeout=60)

if lock.acquire(blocking=False):
    # This is the primary worker
    start_schedulers()
```

**Pros**: Works across containers
**Cons**: Requires a Redis dependency, and because the lock expires after `timeout` seconds, the primary must periodically extend it to remain primary.

### Environment Variable (Not Recommended)

```bash
# Manually designate the primary worker
ALPINEBITS_PRIMARY_WORKER=true uvicorn app:app
```

**Pros**: Simple
**Cons**: Manual configuration, no automatic failover

## Best Practices

1. ✅ **Use file locks for single-host deployments** (our implementation)
2. ✅ **Use Redis locks for multi-container deployments**
3. ✅ **Log primary/secondary status at startup**
4. ✅ **Always release locks on shutdown**
5. ✅ **Keep lock files in `/var/run/` or `/tmp/`**
6. ❌ **Don't rely on process names** (unreliable with uvicorn)
7. ❌ **Don't use environment variables** (no automatic failover)
8. ❌ **Don't skip coordination** (it will cause duplicate notifications)

## Summary

With file-based worker coordination:

- ✅ Only ONE worker runs singleton services (schedulers, migrations)
- ✅ All workers handle HTTP requests normally
- ✅ Automatic failover if the primary worker crashes
- ✅ No external dependencies needed
- ✅ Works with uvicorn, gunicorn, and other ASGI servers

This gives you the benefits of multiple workers (concurrency) without duplicate email notifications or race conditions.
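The lock-file checks from the Monitoring section can also be scripted. Below is a small, hypothetical helper (not part of the shipped module) that reads the PID recorded in the lock file and checks whether that process is still alive; a cron job or health check could alert when no live primary is found:

```python
import os

# Default lock location used throughout this guide
LOCK_PATH = "/tmp/alpinebits_primary_worker.lock"


def primary_pid(lock_path: str = LOCK_PATH):
    """Return the PID recorded in the lock file, or None if absent/unreadable."""
    try:
        with open(lock_path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def pid_is_alive(pid: int) -> bool:
    """True if a process with this PID exists (signal 0 sends nothing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

Usage: `pid = primary_pid()` followed by `pid is not None and pid_is_alive(pid)` tells you whether a primary worker appears to be running.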