Worker coordination with file locks

This commit is contained in:
Jonas Linter
2025-10-15 10:07:42 +02:00
parent 0d04a546cf
commit 361611ae1b
7 changed files with 944 additions and 14 deletions

@@ -0,0 +1,297 @@
# Multi-Worker Deployment Guide
## Problem Statement
When running FastAPI with multiple workers (e.g., `uvicorn app:app --workers 4`), the `lifespan` function runs in **every worker process**. This causes singleton services to run multiple times:
- **Email schedulers** send duplicate notifications (4x emails if 4 workers)
- **Background tasks** run redundantly across all workers
- **Database migrations/hashing** may cause race conditions
## Solution: File-Based Worker Coordination
We use **file-based locking** to ensure only ONE worker runs singleton services. This approach:
- ✅ Works across different process managers (uvicorn, gunicorn, systemd)
- ✅ No external dependencies (Redis, databases)
- ✅ Automatic failover (if primary worker crashes, another can acquire lock)
- ✅ Simple and reliable
## Implementation
### 1. Worker Coordination Module
The `worker_coordination.py` module provides:
```python
from alpine_bits_python.worker_coordination import is_primary_worker

# In your lifespan function
is_primary, worker_lock = is_primary_worker()

if is_primary:
    # Start schedulers, background tasks, etc.
    start_email_scheduler()
else:
    # This is a secondary worker - skip singleton services
    pass
```
### 2. How It Works
```
┌────────────────────────────────────┐
│        uvicorn --workers 4         │
└────────────────────────────────────┘
          │
          ├─── Worker 0 (PID 1001) ─┐
          ├─── Worker 1 (PID 1002) ─┤
          ├─── Worker 2 (PID 1003) ─┤  All try to acquire
          └─── Worker 3 (PID 1004) ─┘  /tmp/alpinebits_primary_worker.lock

Worker 0: ✓ Lock acquired → PRIMARY
Worker 1: ✗ Lock busy     → SECONDARY
Worker 2: ✗ Lock busy     → SECONDARY
Worker 3: ✗ Lock busy     → SECONDARY
```
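The lock attempt itself is only a few lines of `fcntl`. Below is a minimal sketch of the pattern; the project's actual `worker_coordination.py` may differ in detail, and the `WorkerLock` internals shown here are illustrative:

```python
import fcntl
import os


class WorkerLock:
    """Advisory file lock; at most one process can hold it at a time."""

    def __init__(self, path="/tmp/alpinebits_primary_worker.lock"):
        self.path = path
        self._file = None

    def acquire(self):
        """Return True if this process obtained the lock, False otherwise."""
        # "a+" avoids truncating a PID written by the current holder
        self._file = open(self.path, "a+")
        try:
            fcntl.flock(self._file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Another worker already holds the lock
            self._file.close()
            self._file = None
            return False
        # Record our PID for debugging (so you can `cat` the lock file)
        self._file.seek(0)
        self._file.truncate()
        self._file.write(str(os.getpid()))
        self._file.flush()
        return True

    def release(self):
        if self._file is not None:
            fcntl.flock(self._file, fcntl.LOCK_UN)
            self._file.close()
            self._file = None


def is_primary_worker(path="/tmp/alpinebits_primary_worker.lock"):
    """Return (is_primary, lock). Secondary workers get (False, None)."""
    lock = WorkerLock(path)
    if lock.acquire():
        return True, lock
    return False, None
```

Because `flock` locks die with the holding process, a crashed primary releases the lock automatically even though the file itself remains on disk.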
### 3. Lifespan Function
```python
async def lifespan(app: FastAPI):
    # Determine primary worker using file lock
    is_primary, worker_lock = is_primary_worker()
    _LOGGER.info("Worker startup: pid=%d, primary=%s", os.getpid(), is_primary)

    # All workers: shared setup
    config = load_config()
    engine = create_async_engine(DATABASE_URL)

    # Only primary worker: singleton services
    if is_primary:
        # Start email scheduler
        email_handler, report_scheduler = setup_logging(
            config, email_service, loop, enable_scheduler=True
        )
        report_scheduler.start()

        # Run database migrations/hashing
        await hash_existing_customers()
    else:
        # Secondary workers: skip schedulers
        email_handler, report_scheduler = setup_logging(
            config, email_service, loop, enable_scheduler=False
        )

    yield

    # Cleanup
    if report_scheduler:
        report_scheduler.stop()

    # Release lock
    if worker_lock:
        worker_lock.release()
```
## Deployment Scenarios
### Development (Single Worker)
```bash
# No special configuration needed
uvicorn alpine_bits_python.api:app --reload
```
Result: Single worker becomes primary automatically.
### Production (Multiple Workers)
```bash
# 4 workers for handling concurrent requests
uvicorn alpine_bits_python.api:app --workers 4 --host 0.0.0.0 --port 8000
```
Result:
- Worker 0 becomes PRIMARY → runs schedulers
- Workers 1-3 are SECONDARY → handle requests only
### With Gunicorn
```bash
gunicorn alpine_bits_python.api:app \
--workers 4 \
--worker-class uvicorn.workers.UvicornWorker \
--bind 0.0.0.0:8000
```
Result: Same as uvicorn - one primary, rest secondary.
### Docker Compose
```yaml
services:
api:
image: alpinebits-api
command: uvicorn alpine_bits_python.api:app --workers 4 --host 0.0.0.0
volumes:
- /tmp:/tmp # Important: Share lock file location
```
**Important**: When using multiple containers, ensure they share the same lock file location or use Redis-based coordination instead.
## Monitoring & Debugging
### Check Which Worker is Primary
Look for log messages at startup:
```
Worker startup: pid=1001, primary=True
Worker startup: pid=1002, primary=False
Worker startup: pid=1003, primary=False
Worker startup: pid=1004, primary=False
```
### Check Lock File
```bash
# See which PID holds the lock
cat /tmp/alpinebits_primary_worker.lock
# Output: 1001
# Verify process is running
ps aux | grep 1001
```
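Reading the PID file only works if the holder wrote its PID there. The `flock(1)` utility (from util-linux, assumed installed) can probe the live lock state directly, which is more reliable than checking that the file merely exists:

```shell
# Succeeds instantly if the lock is free; fails if a worker holds it
flock --nonblock /tmp/alpinebits_primary_worker.lock --command 'echo free' \
  || echo 'held by a running worker'
```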
### Testing Worker Coordination
Run the test script:
```bash
uv run python test_worker_coordination.py
```
Expected output:
```
Worker 0 (PID 30773): ✓ I am PRIMARY
Worker 1 (PID 30774): ✗ I am SECONDARY
Worker 2 (PID 30775): ✗ I am SECONDARY
Worker 3 (PID 30776): ✗ I am SECONDARY
```
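The shipped test script is not reproduced here, but the same race can be simulated with only the standard library. A self-contained sketch (the temp path and worker count are illustrative):

```python
import fcntl
import multiprocessing as mp
import os
import tempfile

LOCK_PATH = os.path.join(tempfile.gettempdir(), "coordination_demo.lock")


def worker(idx, done, results):
    """Race for the lock, report the outcome, and (if primary) hold it."""
    f = open(LOCK_PATH, "a")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        primary = True
    except BlockingIOError:
        primary = False
    results.put((idx, os.getpid(), primary))
    done.wait()  # the winner keeps the lock held until everyone has reported


def run_demo(num_workers=4):
    done = mp.Event()
    results = mp.Queue()
    procs = [
        mp.Process(target=worker, args=(i, done, results))
        for i in range(num_workers)
    ]
    for p in procs:
        p.start()
    outcomes = [results.get() for _ in range(num_workers)]
    done.set()
    for p in procs:
        p.join()
    for idx, pid, primary in sorted(outcomes):
        mark, role = ("✓", "PRIMARY") if primary else ("✗", "SECONDARY")
        print(f"Worker {idx} (PID {pid}): {mark} I am {role}")
    return outcomes


if __name__ == "__main__":
    run_demo()
```

Exactly one worker wins regardless of how the OS interleaves the attempts, because whichever process acquires the lock holds it until all workers have reported their result.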
## Failover Behavior
### Primary Worker Crashes
1. Primary worker holds lock
2. Primary worker crashes/exits → lock is automatically released by OS
3. Existing secondary workers remain secondary (they already failed to acquire lock)
4. **Next restart**: First worker to start becomes new primary
### Graceful Restart
1. Send SIGTERM to workers
2. Primary worker releases lock in shutdown
3. New workers start, one becomes primary
## Lock File Location
Default: `/tmp/alpinebits_primary_worker.lock`
### Change Lock Location
```python
from alpine_bits_python.worker_coordination import WorkerLock
# Custom location
lock = WorkerLock("/var/run/alpinebits/primary.lock")
is_primary = lock.acquire()
```
**Production recommendation**: Use `/var/run/` or `/run/` for lock files (automatically cleaned on reboot).
## Common Issues
### Issue: All workers think they're primary
**Cause**: Lock file path not accessible or workers running in separate containers.
**Solution**:
- Check file permissions on lock directory
- For containers: Use shared volume or Redis-based coordination
### Issue: No worker becomes primary
**Cause**: Lock file from previous run still exists.
**Solution**:
```bash
# Clean up stale lock file
rm /tmp/alpinebits_primary_worker.lock
# Restart application
```
### Issue: Duplicate emails still being sent
**Cause**: Usually a misreading of which email component runs where.
**Explanation**: The email **alert handler** runs on all workers (so errors from any worker trigger an immediate alert), while the email **scheduler** runs only on the primary. This is the intended behavior: alerts can come from any worker, scheduled reports only from the primary.
## Alternative Approaches
### Redis-Based Coordination
For multi-container deployments, consider Redis-based locks:
```python
import redis
from redis.lock import Lock

redis_client = redis.Redis(host='redis', port=6379)

# timeout=60 auto-expires the lock; a long-lived primary must
# periodically renew it (e.g. lock.extend(60)) or it will lapse
lock = Lock(redis_client, "alpinebits_primary_worker", timeout=60)
if lock.acquire(blocking=False):
    # This is the primary worker
    start_schedulers()
```
**Pros**: Works across containers
**Cons**: Requires Redis dependency
### Environment Variable (Not Recommended)
```bash
# Manually set primary worker
ALPINEBITS_PRIMARY_WORKER=true uvicorn app:app
```
**Pros**: Simple
**Cons**: Manual configuration, no automatic failover
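For contrast, the corresponding check inside `lifespan` would be a one-liner (variable name taken from the example above; shown only to make the trade-off concrete):

```python
import os

# Manually designated primary: simple, but if this worker dies no other
# worker takes over, and the operator must ensure exactly one worker
# is started with the variable set.
is_primary = os.environ.get("ALPINEBITS_PRIMARY_WORKER", "").lower() == "true"
```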
## Best Practices
1. **Use file locks for single-host deployments** (our implementation)
2. **Use Redis locks for multi-container deployments**
3. **Log primary/secondary status at startup**
4. **Always release locks on shutdown**
5. **Keep lock files in `/var/run/` or `/tmp/`**
6. **Don't rely on process names** (unreliable with uvicorn)
7. **Don't use environment variables** (no automatic failover)
8. **Don't skip coordination** (will cause duplicate notifications)
## Summary
With file-based worker coordination:
- ✅ Only ONE worker runs singleton services (schedulers, migrations)
- ✅ All workers handle HTTP requests normally
- ✅ Automatic failover if primary worker crashes
- ✅ No external dependencies needed
- ✅ Works with uvicorn, gunicorn, and other ASGI servers
This ensures you get the benefits of multiple workers (concurrency) without duplicate email notifications or race conditions.

@@ -0,0 +1,154 @@
╔══════════════════════════════════════════════════════════════════════════════╗
║ MULTI-WORKER FASTAPI ARCHITECTURE ║
╚══════════════════════════════════════════════════════════════════════════════╝
┌─────────────────────────────────────────────────────────────────────────────┐
│ Command: uvicorn alpine_bits_python.api:app --workers 4 │
└─────────────────────────────────────────────────────────────────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Master Process (uvicorn supervisor) ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
│ │ │ │
┌───────────┼──────────┼──────────┼──────────┼───────────┐
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌──────────────────┐
│Worker 0│ │Worker 1│ │Worker 2│ │Worker 3│ │Lock File │
│PID:1001│ │PID:1002│ │PID:1003│ │PID:1004│ │/tmp/alpinebits │
└────┬───┘ └───┬────┘ └───┬────┘ └───┬────┘ │_primary_worker │
│ │ │ │ │.lock │
│ │ │ │ └──────────────────┘
│ │ │ │ ▲
│ │ │ │ │
└─────────┴──────────┴──────────┴─────────────┤
All try to acquire lock │
│ │
▼ │
┌───────────────────────┐ │
│ fcntl.flock(LOCK_EX) │────────────┘
│ Non-blocking attempt │
└───────────────────────┘
┏━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━┓
▼ ▼
┌─────────┐ ┌──────────────┐
│SUCCESS │ │ WOULD BLOCK │
│(First) │ │(Others) │
└────┬────┘ └──────┬───────┘
│ │
▼ ▼
╔════════════════════════════════╗ ╔══════════════════════════════╗
║ PRIMARY WORKER ║ ║ SECONDARY WORKERS ║
║ (Worker 0, PID 1001) ║ ║ (Workers 1-3) ║
╠════════════════════════════════╣ ╠══════════════════════════════╣
║ ║ ║ ║
║ ✓ Handle HTTP requests ║ ║ ✓ Handle HTTP requests ║
║ ✓ Start email scheduler ║ ║ ✗ Skip email scheduler ║
║ ✓ Send daily reports ║ ║ ✗ Skip daily reports ║
║ ✓ Run DB migrations ║ ║ ✗ Skip DB migrations ║
║ ✓ Hash customers (startup) ║ ║ ✗ Skip customer hashing ║
║ ✓ Send error alerts ║ ║ ✓ Send error alerts ║
║ ✓ Process webhooks ║ ║ ✓ Process webhooks ║
║ ✓ AlpineBits endpoints ║ ║ ✓ AlpineBits endpoints ║
║ ║ ║ ║
║ Holds: worker_lock ║ ║ worker_lock = None ║
║ ║ ║ ║
╚════════════════════════════════╝ ╚══════════════════════════════╝
│ │
│ │
└──────────┬───────────────────────────┘
┌───────────────────────────┐
│ Incoming HTTP Request │
└───────────────────────────┘
(Load balanced by OS)
┌───────────┴──────────────┐
│ │
▼ ▼
Any worker can handle Round-robin distribution
the request normally across all 4 workers
╔══════════════════════════════════════════════════════════════════════════════╗
║ SINGLETON SERVICES ║
╚══════════════════════════════════════════════════════════════════════════════╝
Only run on PRIMARY worker:
┌─────────────────────────────────────────────────────────────┐
│ Email Scheduler │
│ ├─ Daily Report: 8:00 AM │
│ └─ Stats Collection: Per-hotel reservation counts │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Startup Tasks (One-time) │
│ ├─ Database table creation │
│ ├─ Customer data hashing/backfill │
│ └─ Configuration validation │
└─────────────────────────────────────────────────────────────┘
╔══════════════════════════════════════════════════════════════════════════════╗
║ SHARED SERVICES ║
╚══════════════════════════════════════════════════════════════════════════════╝
Run on ALL workers (primary + secondary):
┌─────────────────────────────────────────────────────────────┐
│ HTTP Request Handling │
│ ├─ Webhook endpoints (/api/webhook/*) │
│ ├─ AlpineBits endpoints (/api/alpinebits/*) │
│ └─ Health checks (/api/health) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Error Alert Handler │
│ └─ Any worker can send immediate error alerts │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Event Dispatching │
│ └─ Background tasks triggered by webhooks │
└─────────────────────────────────────────────────────────────┘
╔══════════════════════════════════════════════════════════════════════════════╗
║ SHUTDOWN & FAILOVER ║
╚══════════════════════════════════════════════════════════════════════════════╝
Graceful Shutdown:
┌─────────────────────────────────────────────────────────────┐
│ 1. SIGTERM received │
│ 2. Stop scheduler (primary only) │
│ 3. Close email handler │
│ 4. Release worker_lock (primary only) │
│ 5. Dispose database engine │
└─────────────────────────────────────────────────────────────┘
Primary Worker Crash:
┌─────────────────────────────────────────────────────────────┐
│ 1. Primary worker crashes │
│ 2. OS automatically releases file lock │
│ 3. Secondary workers continue handling requests │
│ 4. On next restart, first worker becomes new primary │
└─────────────────────────────────────────────────────────────┘
╔══════════════════════════════════════════════════════════════════════════════╗
║ KEY BENEFITS ║
╚══════════════════════════════════════════════════════════════════════════════╝
✓ No duplicate email notifications
✓ No race conditions in database operations
✓ Automatic failover if primary crashes
✓ Load distribution for HTTP requests
✓ No external dependencies (Redis, etc.)
✓ Simple and reliable