Worker coordination with file locks
108
QUICK_REFERENCE.md
Normal file
@@ -0,0 +1,108 @@
# Multi-Worker Quick Reference

## TL;DR

**Problem**: Running 4 workers causes duplicate emails and race conditions, because every worker starts its own schedulers.

**Solution**: File-based locking ensures that only ONE worker (the primary) runs the schedulers.

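For illustration, here is a minimal sketch of how a file-based primary election can work. It assumes POSIX advisory locks via `fcntl.flock`; the project's actual `worker_coordination.py` may be implemented differently. The function name `try_become_primary` and the demo lock path are hypothetical.

```python
import fcntl
import os
import tempfile


def try_become_primary(lock_path):
    """Hypothetical helper: return an open file that holds the lock,
    or None if another holder already has it."""
    fh = open(lock_path, "a+")
    try:
        # Non-blocking exclusive lock: fails immediately if already held.
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    # Record our PID so the holder can be inspected with `cat`.
    fh.seek(0)
    fh.truncate()
    fh.write(str(os.getpid()))
    fh.flush()
    return fh


# Demo in a throwaway directory (not the real lock location).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_primary.lock")
first = try_become_primary(demo_path)   # acquires the lock
second = try_become_primary(demo_path)  # lock is taken, so this is None
print("first is primary:", first is not None)
print("second is primary:", second is not None)
```

With this approach the operating system drops the lock when the holding file is closed or the process exits, so the file must stay open for as long as the worker should remain primary.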
## Commands

```bash
# Development (single worker, automatically primary)
uvicorn alpine_bits_python.api:app --reload

# Production (4 workers, one becomes primary)
uvicorn alpine_bits_python.api:app --workers 4 --host 0.0.0.0 --port 8000

# Test worker coordination
uv run python test_worker_coordination.py

# Run all tests
uv run pytest tests/ -v
```

## Check Which Worker is Primary

Look for these lines in the startup logs (the `←` notes are annotations, not log output):

```
[INFO] Worker startup: pid=1001, primary=True ← PRIMARY
[INFO] Worker startup: pid=1002, primary=False ← SECONDARY
[INFO] Worker startup: pid=1003, primary=False ← SECONDARY
[INFO] Worker startup: pid=1004, primary=False ← SECONDARY
[INFO] Daily report scheduler started ← Only on PRIMARY
```

## Lock File

**Location**: `/tmp/alpinebits_primary_worker.lock`

**Check lock status**:
```bash
# See which PID holds the lock
cat /tmp/alpinebits_primary_worker.lock
# Output: 1001

# Verify that the process is still running
ps -p 1001
```

**Clean stale lock** (only if the PID above is no longer running):
```bash
rm /tmp/alpinebits_primary_worker.lock
# Then restart the application
```

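The manual check above can also be scripted. The following sketch assumes the lock file contains only the holder's PID as text, as the `cat` output suggests; `read_lock_pid` and `pid_is_running` are hypothetical helper names, not part of the project.

```python
import os

LOCK_PATH = "/tmp/alpinebits_primary_worker.lock"


def read_lock_pid(lock_path):
    """Return the PID stored in the lock file, or None if missing/unreadable."""
    try:
        with open(lock_path) as fh:
            return int(fh.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def pid_is_running(pid):
    """Check whether a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


pid = read_lock_pid(LOCK_PATH)
if pid is None:
    print("No lock file (or no PID in it)")
elif pid_is_running(pid):
    print(f"Lock held by running process {pid}")
else:
    print(f"Lock file names PID {pid}, which is not running: likely stale")
```

A PID can be reused by an unrelated process after a reboot or a long uptime, so treat a "running" result as a hint rather than proof that the holder is a worker.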
## What Runs Where

| Service | Primary Worker | Secondary Workers |
|---------|----------------|-------------------|
| HTTP requests | ✓ Yes | ✓ Yes |
| Email scheduler | ✓ Yes | ✗ No |
| Error alerts | ✓ Yes | ✓ Yes (all workers can send) |
| DB migrations | ✓ Yes | ✗ No |
| Customer hashing | ✓ Yes | ✗ No |

## Troubleshooting

### All workers think they're primary

**Cause**: Lock file not accessible

**Fix**: Check permissions on `/tmp/` or change the lock location

### No worker becomes primary

**Cause**: Stale lock file

**Fix**: `rm /tmp/alpinebits_primary_worker.lock` and restart

### Still getting duplicate emails

**Check**: Are you seeing duplicate **scheduled reports** or **error alerts**?
- Scheduled reports should only come from the primary ✓
- Error alerts can come from any worker (by design) ✓

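The first two symptoms can be spotted by counting `primary=True` lines in the startup logs. This is a rough sketch that assumes the log format shown in this document; `primary_pids` and `diagnose` are hypothetical names.

```python
import re

# Matches the startup line format shown in this quick reference.
STARTUP_RE = re.compile(r"Worker startup: pid=(\d+), primary=(True|False)")


def primary_pids(log_lines):
    """Return the PIDs of all workers that logged primary=True."""
    pids = []
    for line in log_lines:
        match = STARTUP_RE.search(line)
        if match and match.group(2) == "True":
            pids.append(int(match.group(1)))
    return pids


def diagnose(log_lines):
    """Map the number of primaries to the matching troubleshooting hint."""
    count = len(primary_pids(log_lines))
    if count == 0:
        return "no primary: check for a stale lock file"
    if count > 1:
        return "multiple primaries: check that the lock file is accessible"
    return "ok: exactly one primary"


sample = [
    "[INFO] Worker startup: pid=1001, primary=True",
    "[INFO] Worker startup: pid=1002, primary=False",
    "[INFO] Worker startup: pid=1003, primary=False",
    "[INFO] Worker startup: pid=1004, primary=False",
]
print(diagnose(sample))
```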
## Code Example

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI

from alpine_bits_python.worker_coordination import is_primary_worker


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Acquire lock - only one worker succeeds
    is_primary, worker_lock = is_primary_worker()

    if is_primary:
        # Start singleton services (`scheduler` is defined elsewhere in the app)
        scheduler.start()

    # All workers handle requests
    yield

    # Release lock on shutdown
    if worker_lock:
        worker_lock.release()
```

## Documentation

- **Full guide**: `docs/MULTI_WORKER_DEPLOYMENT.md`
- **Solution summary**: `SOLUTION_SUMMARY.md`
- **Implementation**: `src/alpine_bits_python/worker_coordination.py`
- **Test script**: `test_worker_coordination.py`
|
||||