Worker coordination with file locks
docs/MULTI_WORKER_DEPLOYMENT.md (new file, 297 lines)

# Multi-Worker Deployment Guide

## Problem Statement

When running FastAPI with multiple workers (e.g., `uvicorn app:app --workers 4`), the `lifespan` function runs in **every worker process**. This causes singleton services to run multiple times:

- ❌ **Email schedulers** send duplicate notifications (4x emails if 4 workers)
- ❌ **Background tasks** run redundantly across all workers
- ❌ **Database migrations/hashing** may cause race conditions

## Solution: File-Based Worker Coordination

We use **file-based locking** to ensure only ONE worker runs singleton services. This approach:

- ✅ Works across different process managers (uvicorn, gunicorn, systemd)
- ✅ Needs no external dependencies (Redis, databases)
- ✅ Fails over automatically (if the primary worker crashes, another can acquire the lock)
- ✅ Is simple and reliable

## Implementation

### 1. Worker Coordination Module

The `worker_coordination.py` module provides:

```python
from alpine_bits_python.worker_coordination import is_primary_worker

# In your lifespan function
is_primary, worker_lock = is_primary_worker()

if is_primary:
    # Start schedulers, background tasks, etc.
    start_email_scheduler()
else:
    # This is a secondary worker - skip singleton services
    pass
```

### 2. How It Works

```
┌─────────────────────────────────────────────────────┐
│                uvicorn --workers 4                  │
└─────────────────────────────────────────────────────┘
                          │
            ├─── Worker 0 (PID 1001) ─┐
            ├─── Worker 1 (PID 1002) ─┤
            ├─── Worker 2 (PID 1003) ─┤  All try to acquire
            └─── Worker 3 (PID 1004) ─┘  /tmp/alpinebits_primary_worker.lock
                          │
                          ▼
            Worker 0: ✓ Lock acquired → PRIMARY
            Worker 1: ✗ Lock busy     → SECONDARY
            Worker 2: ✗ Lock busy     → SECONDARY
            Worker 3: ✗ Lock busy     → SECONDARY
```
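
Under the hood the election is just an OS-level advisory file lock: every worker makes a non-blocking `fcntl.flock` attempt on the same lock file, and only the first one succeeds (the architecture diagram later in this commit depicts it the same way). The sketch below is illustrative rather than a copy of the shipped `worker_coordination.py`; it assumes the `WorkerLock` / `is_primary_worker()` API used throughout this guide and writes the holder's PID into the lock file so it can be inspected:

```python
# Illustrative sketch of worker_coordination.py -- the real module may differ.
import fcntl
import os

DEFAULT_LOCK_FILE = "/tmp/alpinebits_primary_worker.lock"


class WorkerLock:
    """Exclusive, non-blocking file lock used to elect the primary worker."""

    def __init__(self, lock_file: str = DEFAULT_LOCK_FILE):
        self.lock_file = lock_file
        self._fd = None

    def acquire(self) -> bool:
        """Return True if this process obtained the lock, False otherwise."""
        self._fd = os.open(self.lock_file, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            fcntl.flock(self._fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Another worker already holds the lock.
            os.close(self._fd)
            self._fd = None
            return False
        # Record our PID so operators can inspect the lock file.
        os.ftruncate(self._fd, 0)
        os.write(self._fd, str(os.getpid()).encode())
        return True

    def release(self) -> None:
        """Release the lock; the OS also releases it if the process dies."""
        if self._fd is not None:
            fcntl.flock(self._fd, fcntl.LOCK_UN)
            os.close(self._fd)
            self._fd = None


def is_primary_worker(lock_file: str = DEFAULT_LOCK_FILE):
    """Return (True, lock) for the primary worker, (False, None) for secondaries."""
    lock = WorkerLock(lock_file)
    if lock.acquire():
        return True, lock
    return False, None
```

Because the kernel drops `flock` locks automatically when the owning process exits, a crashed primary never leaves the lock stuck (see the failover section below).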

### 3. Lifespan Function

```python
async def lifespan(app: FastAPI):
    # Determine primary worker using file lock
    is_primary, worker_lock = is_primary_worker()

    _LOGGER.info("Worker startup: pid=%d, primary=%s", os.getpid(), is_primary)

    # All workers: shared setup
    config = load_config()
    engine = create_async_engine(DATABASE_URL)

    # Only primary worker: singleton services
    if is_primary:
        # Start email scheduler
        email_handler, report_scheduler = setup_logging(
            config, email_service, loop, enable_scheduler=True
        )
        report_scheduler.start()

        # Run database migrations/hashing
        await hash_existing_customers()
    else:
        # Secondary workers: skip schedulers
        email_handler, report_scheduler = setup_logging(
            config, email_service, loop, enable_scheduler=False
        )

    yield

    # Cleanup
    if report_scheduler:
        report_scheduler.stop()

    # Release lock
    if worker_lock:
        worker_lock.release()
```

## Deployment Scenarios

### Development (Single Worker)

```bash
# No special configuration needed
uvicorn alpine_bits_python.api:app --reload
```

Result: Single worker becomes primary automatically.

### Production (Multiple Workers)

```bash
# 4 workers for handling concurrent requests
uvicorn alpine_bits_python.api:app --workers 4 --host 0.0.0.0 --port 8000
```

Result:

- Worker 0 becomes PRIMARY → runs schedulers
- Workers 1-3 are SECONDARY → handle requests only

### With Gunicorn

```bash
gunicorn alpine_bits_python.api:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000
```

Result: Same as uvicorn - one primary, rest secondary.

### Docker Compose

```yaml
services:
  api:
    image: alpinebits-api
    command: uvicorn alpine_bits_python.api:app --workers 4 --host 0.0.0.0
    volumes:
      - /tmp:/tmp  # Important: Share lock file location
```

**Important**: When using multiple containers, ensure they share the same lock file location or use Redis-based coordination instead.

## Monitoring & Debugging

### Check Which Worker is Primary

Look for log messages at startup:

```
Worker startup: pid=1001, primary=True
Worker startup: pid=1002, primary=False
Worker startup: pid=1003, primary=False
Worker startup: pid=1004, primary=False
```

### Check Lock File

```bash
# See which PID holds the lock
cat /tmp/alpinebits_primary_worker.lock
# Output: 1001

# Verify the process is running
ps aux | grep 1001
```

### Testing Worker Coordination

Run the test script:

```bash
uv run python test_worker_coordination.py
```

Expected output:

```
Worker 0 (PID 30773): ✓ I am PRIMARY
Worker 1 (PID 30774): ✗ I am SECONDARY
Worker 2 (PID 30775): ✗ I am SECONDARY
Worker 3 (PID 30776): ✗ I am SECONDARY
```
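
The test script itself is not reproduced in these docs. A minimal version of such a check could simply race a few processes for the lock; the sketch below assumes the `is_primary_worker()` API shown earlier and is not the shipped `test_worker_coordination.py`:

```python
# Sketch of a coordination check: spawn 4 processes that race for the lock.
import multiprocessing
import os

from alpine_bits_python.worker_coordination import is_primary_worker


def worker(index: int, barrier) -> None:
    is_primary, lock = is_primary_worker()
    role = "✓ I am PRIMARY" if is_primary else "✗ I am SECONDARY"
    print(f"Worker {index} (PID {os.getpid()}): {role}")
    barrier.wait()  # hold the lock until every worker has reported
    if lock is not None:
        lock.release()


if __name__ == "__main__":
    barrier = multiprocessing.Barrier(4)
    procs = [
        multiprocessing.Process(target=worker, args=(i, barrier)) for i in range(4)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Exactly one process should report PRIMARY; the barrier keeps the winner holding the lock until all four have printed their role.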

## Failover Behavior

### Primary Worker Crashes

1. The primary worker holds the lock.
2. The primary worker crashes or exits → the OS automatically releases the lock (see the sketch below).
3. Existing secondary workers remain secondary (they already failed to acquire the lock).
4. **Next restart**: The first worker to start becomes the new primary.
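
Step 2 is a property of `flock` itself: the kernel drops the lock as soon as the holding process dies, even though the lock file stays on disk. A small sketch to convince yourself (again assuming the `WorkerLock` API; the demo path is made up):

```python
# Demonstrates that the OS releases the flock when the holder dies.
import multiprocessing

from alpine_bits_python.worker_coordination import WorkerLock


def crashing_primary() -> None:
    lock = WorkerLock("/tmp/failover_demo.lock")
    assert lock.acquire()
    raise SystemExit(1)  # simulate a crash: exit without calling release()


if __name__ == "__main__":
    p = multiprocessing.Process(target=crashing_primary)
    p.start()
    p.join()

    # The stale file is still on disk, but the lock itself is gone,
    # so a restarted worker can become primary immediately.
    lock = WorkerLock("/tmp/failover_demo.lock")
    print("re-acquired after crash:", lock.acquire())  # True
    lock.release()
```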

### Graceful Restart

1. Send SIGTERM to the workers.
2. The primary worker releases the lock during shutdown.
3. New workers start, and one becomes primary.

## Lock File Location

Default: `/tmp/alpinebits_primary_worker.lock`

### Change Lock Location

```python
from alpine_bits_python.worker_coordination import WorkerLock

# Custom location
lock = WorkerLock("/var/run/alpinebits/primary.lock")
is_primary = lock.acquire()
```

**Production recommendation**: Use `/var/run/` or `/run/` for lock files (they are cleaned automatically on reboot). Make sure the directory exists and is writable by the service user.

## Common Issues

### Issue: All workers think they're primary

**Cause**: Lock file path not accessible or workers running in separate containers.

**Solution**:

- Check file permissions on the lock directory
- For containers: use a shared volume or Redis-based coordination

### Issue: No worker becomes primary

**Cause**: Lock file from a previous run still exists.

**Solution**:

```bash
# Clean up stale lock file
rm /tmp/alpinebits_primary_worker.lock
# Restart application
```

### Issue: Duplicate emails still being sent

**Cause**: The email alert handler runs on all workers, not just the scheduler.

**Solution**: This is correct behavior. The email **alert handler** runs on all workers (so errors from any worker are reported immediately), while the email **scheduler** runs only on the primary. Alerts can come from any worker; scheduled reports come only from the primary.

## Alternative Approaches

### Redis-Based Coordination

For multi-container deployments, consider Redis-based locks:

```python
import redis
from redis.lock import Lock

redis_client = redis.Redis(host='redis', port=6379)
lock = Lock(redis_client, "alpinebits_primary_worker", timeout=60)

if lock.acquire(blocking=False):
    # This is the primary worker
    start_schedulers()
```

- **Pros**: Works across containers
- **Cons**: Requires a Redis dependency

### Environment Variable (Not Recommended)

```bash
# Manually set primary worker
ALPINEBITS_PRIMARY_WORKER=true uvicorn app:app
```

- **Pros**: Simple
- **Cons**: Manual configuration, no automatic failover
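
For completeness, reading such a flag would look roughly like this (a hypothetical sketch; the variable name is taken from the example above, and whether the application actually reads it is not specified here):

```python
# Sketch: selecting the primary via an environment variable (not recommended).
# Note: with `uvicorn --workers 4`, every worker process inherits the same
# variable, and nothing takes over if the designated process dies.
import os

if os.getenv("ALPINEBITS_PRIMARY_WORKER", "false").lower() == "true":
    start_email_scheduler()  # assumed to exist, as in the examples above
```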

## Best Practices

1. ✅ **Use file locks for single-host deployments** (our implementation)
2. ✅ **Use Redis locks for multi-container deployments**
3. ✅ **Log primary/secondary status at startup**
4. ✅ **Always release locks on shutdown**
5. ✅ **Keep lock files in `/var/run/` or `/tmp/`**
6. ❌ **Don't rely on process names** (unreliable with uvicorn)
7. ❌ **Don't use environment variables** (no automatic failover)
8. ❌ **Don't skip coordination** (will cause duplicate notifications)

## Summary

With file-based worker coordination:

- ✅ Only ONE worker runs singleton services (schedulers, migrations)
- ✅ All workers handle HTTP requests normally
- ✅ Automatic failover if the primary worker crashes
- ✅ No external dependencies needed
- ✅ Works with uvicorn, gunicorn, and other ASGI servers

This ensures you get the benefits of multiple workers (concurrency) without duplicate email notifications or race conditions.

docs/architecture_diagram.txt (new file, 154 lines)

╔══════════════════════════════════════════════════════════════════════════════╗
║                       MULTI-WORKER FASTAPI ARCHITECTURE                       ║
╚══════════════════════════════════════════════════════════════════════════════╝

┌─────────────────────────────────────────────────────────────────────────────┐
│  Command: uvicorn alpine_bits_python.api:app --workers 4                     │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
              ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
              ┃    Master Process (uvicorn supervisor)     ┃
              ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
                   │          │          │          │
       ┌───────────┼──────────┼──────────┼──────────┼───────────┐
       │           │          │          │          │           │
       ▼           ▼          ▼          ▼          ▼           ▼
  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐  ┌──────────────────┐
  │Worker 0│  │Worker 1│  │Worker 2│  │Worker 3│  │Lock File         │
  │PID:1001│  │PID:1002│  │PID:1003│  │PID:1004│  │/tmp/alpinebits   │
  └────┬───┘  └───┬────┘  └───┬────┘  └───┬────┘  │_primary_worker   │
       │          │           │           │       │.lock             │
       │          │           │           │       └──────────────────┘
       │          │           │           │                ▲
       │          │           │           │                │
       └──────────┴───────────┴───────────┴────────────────┤
                  All try to acquire lock                   │
                            │                               │
                            ▼                               │
                ┌───────────────────────┐                   │
                │ fcntl.flock(LOCK_EX)  │───────────────────┘
                │ Non-blocking attempt  │
                └───────────────────────┘
                            │
           ┏━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━┓
           ▼                                 ▼
      ┌─────────┐                    ┌──────────────┐
      │SUCCESS  │                    │ WOULD BLOCK  │
      │(First)  │                    │(Others)      │
      └────┬────┘                    └──────┬───────┘
           │                                │
           ▼                                ▼

╔════════════════════════════════╗    ╔══════════════════════════════╗
║        PRIMARY WORKER          ║    ║      SECONDARY WORKERS       ║
║      (Worker 0, PID 1001)      ║    ║        (Workers 1-3)         ║
╠════════════════════════════════╣    ╠══════════════════════════════╣
║                                ║    ║                              ║
║ ✓ Handle HTTP requests         ║    ║ ✓ Handle HTTP requests       ║
║ ✓ Start email scheduler        ║    ║ ✗ Skip email scheduler       ║
║ ✓ Send daily reports           ║    ║ ✗ Skip daily reports         ║
║ ✓ Run DB migrations            ║    ║ ✗ Skip DB migrations         ║
║ ✓ Hash customers (startup)     ║    ║ ✗ Skip customer hashing      ║
║ ✓ Send error alerts            ║    ║ ✓ Send error alerts          ║
║ ✓ Process webhooks             ║    ║ ✓ Process webhooks           ║
║ ✓ AlpineBits endpoints         ║    ║ ✓ AlpineBits endpoints       ║
║                                ║    ║                              ║
║ Holds: worker_lock             ║    ║ worker_lock = None           ║
║                                ║    ║                              ║
╚════════════════════════════════╝    ╚══════════════════════════════╝
                │                                     │
                │                                     │
                └──────────────────┬──────────────────┘
                                   │
                                   ▼
                      ┌───────────────────────────┐
                      │   Incoming HTTP Request   │
                      └───────────────────────────┘
                                   │
                         (Load balanced by OS)
                                   │
                       ┌───────────┴──────────────┐
                       │                          │
                       ▼                          ▼
            Any worker can handle       Round-robin distribution
            the request normally        across all 4 workers

╔══════════════════════════════════════════════════════════════════════════════╗
║                              SINGLETON SERVICES                               ║
╚══════════════════════════════════════════════════════════════════════════════╝

Only run on PRIMARY worker:

┌─────────────────────────────────────────────────────────────┐
│  Email Scheduler                                            │
│  ├─ Daily Report: 8:00 AM                                   │
│  └─ Stats Collection: Per-hotel reservation counts          │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│  Startup Tasks (One-time)                                   │
│  ├─ Database table creation                                 │
│  ├─ Customer data hashing/backfill                          │
│  └─ Configuration validation                                │
└─────────────────────────────────────────────────────────────┘


╔══════════════════════════════════════════════════════════════════════════════╗
║                                SHARED SERVICES                                ║
╚══════════════════════════════════════════════════════════════════════════════╝

Run on ALL workers (primary + secondary):

┌─────────────────────────────────────────────────────────────┐
│  HTTP Request Handling                                      │
│  ├─ Webhook endpoints (/api/webhook/*)                      │
│  ├─ AlpineBits endpoints (/api/alpinebits/*)                │
│  └─ Health checks (/api/health)                             │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│  Error Alert Handler                                        │
│  └─ Any worker can send immediate error alerts              │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│  Event Dispatching                                          │
│  └─ Background tasks triggered by webhooks                  │
└─────────────────────────────────────────────────────────────┘


╔══════════════════════════════════════════════════════════════════════════════╗
║                              SHUTDOWN & FAILOVER                              ║
╚══════════════════════════════════════════════════════════════════════════════╝

Graceful Shutdown:
┌─────────────────────────────────────────────────────────────┐
│  1. SIGTERM received                                        │
│  2. Stop scheduler (primary only)                           │
│  3. Close email handler                                     │
│  4. Release worker_lock (primary only)                      │
│  5. Dispose database engine                                 │
└─────────────────────────────────────────────────────────────┘

Primary Worker Crash:
┌─────────────────────────────────────────────────────────────┐
│  1. Primary worker crashes                                  │
│  2. OS automatically releases file lock                     │
│  3. Secondary workers continue handling requests            │
│  4. On next restart, first worker becomes new primary       │
└─────────────────────────────────────────────────────────────┘


╔══════════════════════════════════════════════════════════════════════════════╗
║                                  KEY BENEFITS                                  ║
╚══════════════════════════════════════════════════════════════════════════════╝

✓ No duplicate email notifications
✓ No race conditions in database operations
✓ Automatic failover if primary crashes
✓ Load distribution for HTTP requests
✓ No external dependencies (Redis, etc.)
✓ Simple and reliable