# Multi-Worker Deployment Guide

## Problem Statement

When running FastAPI with multiple workers (e.g., `uvicorn app:app --workers 4`), the `lifespan` function runs in **every worker process**. This causes singleton services to run multiple times:

- ❌ **Email schedulers** send duplicate notifications (4x emails if 4 workers)
- ❌ **Background tasks** run redundantly across all workers
- ❌ **Database migrations/hashing** may cause race conditions
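
For illustration, a naive lifespan that starts a scheduler unconditionally runs it once per worker (a minimal sketch; `start_email_scheduler` is a placeholder for any singleton service):

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # With --workers 4 this body executes in all four worker processes,
    # so the scheduler below is started four times.
    scheduler = start_email_scheduler()  # placeholder singleton service
    yield
    scheduler.stop()

app = FastAPI(lifespan=lifespan)
```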

## Solution: File-Based Worker Coordination

We use **file-based locking** to ensure only ONE worker runs singleton services. This approach:

- ✅ Works across different process managers (uvicorn, gunicorn, systemd)
- ✅ No external dependencies (Redis, databases)
- ✅ Failover: if the primary worker crashes, the OS releases the lock and the next worker to start becomes primary
- ✅ Simple and reliable

## Implementation

### 1. Worker Coordination Module

The `worker_coordination.py` module provides:

```python
from alpine_bits_python.worker_coordination import is_primary_worker

# In your lifespan function
is_primary, worker_lock = is_primary_worker()

if is_primary:
    # Start schedulers, background tasks, etc.
    start_email_scheduler()
else:
    # This is a secondary worker - skip singleton services
    pass
```
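
The module itself is not reproduced here. A minimal sketch of what such a file lock can look like, using `fcntl.flock` (the names `WorkerLock` and `is_primary_worker` match the usage above; the details are illustrative, not the actual implementation):

```python
import fcntl
import os

LOCK_FILE = "/tmp/alpinebits_primary_worker.lock"

class WorkerLock:
    """Non-blocking, OS-level file lock owned by this process."""

    def __init__(self, path: str = LOCK_FILE):
        self.path = path
        self._fd: int | None = None

    def acquire(self) -> bool:
        # O_CREAT without O_TRUNC: opening must not clobber the holder's PID.
        fd = os.open(self.path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            # Non-blocking exclusive lock; raises OSError if already held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)
            return False
        # We hold the lock: record our PID for debugging ("Check Lock File" below).
        os.ftruncate(fd, 0)
        os.write(fd, str(os.getpid()).encode())
        self._fd = fd
        return True

    def release(self) -> None:
        if self._fd is not None:
            fcntl.flock(self._fd, fcntl.LOCK_UN)
            os.close(self._fd)
            self._fd = None

def is_primary_worker() -> tuple[bool, WorkerLock]:
    lock = WorkerLock()
    return lock.acquire(), lock
```

Because the lock is owned by the file descriptor, the kernel releases it automatically when the holding process exits or crashes; this is what makes the failover behavior described below possible.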

### 2. How It Works

```
┌─────────────────────────────────────────────────────┐
│                 uvicorn --workers 4                 │
└─────────────────────────────────────────────────────┘
        │
        ├─── Worker 0 (PID 1001) ─┐
        ├─── Worker 1 (PID 1002) ─┤
        ├─── Worker 2 (PID 1003) ─┤  All try to acquire
        └─── Worker 3 (PID 1004) ─┘  /tmp/alpinebits_primary_worker.lock
                                  │
                                  ▼
        Worker 0: ✓ Lock acquired → PRIMARY
        Worker 1: ✗ Lock busy     → SECONDARY
        Worker 2: ✗ Lock busy     → SECONDARY
        Worker 3: ✗ Lock busy     → SECONDARY
```

### 3. Lifespan Function

```python
async def lifespan(app: FastAPI):
    # Determine primary worker using file lock
    is_primary, worker_lock = is_primary_worker()

    _LOGGER.info("Worker startup: pid=%d, primary=%s", os.getpid(), is_primary)

    # All workers: shared setup
    config = load_config()
    engine = create_async_engine(DATABASE_URL)

    # Only primary worker: singleton services
    if is_primary:
        # Start email scheduler
        email_handler, report_scheduler = setup_logging(
            config, email_service, loop, enable_scheduler=True
        )
        report_scheduler.start()

        # Run database migrations/hashing
        await hash_existing_customers()
    else:
        # Secondary workers: skip schedulers
        email_handler, report_scheduler = setup_logging(
            config, email_service, loop, enable_scheduler=False
        )

    yield

    # Cleanup
    if report_scheduler:
        report_scheduler.stop()

    # Release lock
    if worker_lock:
        worker_lock.release()
```
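
For completeness, the handler is registered on the application object; FastAPI expects it wrapped in `@asynccontextmanager` (standard FastAPI usage, omitted from the excerpt above):

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    ...  # body as shown above

app = FastAPI(lifespan=lifespan)
```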

## Deployment Scenarios

### Development (Single Worker)

```bash
# No special configuration needed
uvicorn alpine_bits_python.api:app --reload
```

Result: Single worker becomes primary automatically.

### Production (Multiple Workers)

```bash
# 4 workers for handling concurrent requests
uvicorn alpine_bits_python.api:app --workers 4 --host 0.0.0.0 --port 8000
```

Result:

- Worker 0 becomes PRIMARY → runs schedulers
- Workers 1-3 are SECONDARY → handle requests only

### With Gunicorn

```bash
gunicorn alpine_bits_python.api:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000
```

Result: Same as uvicorn - one primary, rest secondary.

### Docker Compose

```yaml
services:
  api:
    image: alpinebits-api
    command: uvicorn alpine_bits_python.api:app --workers 4 --host 0.0.0.0
    volumes:
      - /tmp:/tmp  # Important: Share lock file location
```

**Important**: When using multiple containers, ensure they share the same lock file location or use Redis-based coordination instead.

## Monitoring & Debugging

### Check Which Worker is Primary

Look for log messages at startup:

```
Worker startup: pid=1001, primary=True
Worker startup: pid=1002, primary=False
Worker startup: pid=1003, primary=False
Worker startup: pid=1004, primary=False
```

### Check Lock File

```bash
# See which PID holds the lock
cat /tmp/alpinebits_primary_worker.lock
# Output: 1001

# Verify process is running
ps aux | grep 1001
```

### Testing Worker Coordination

Run the test script:

```bash
uv run python test_worker_coordination.py
```

Expected output:

```
Worker 0 (PID 30773): ✓ I am PRIMARY
Worker 1 (PID 30774): ✗ I am SECONDARY
Worker 2 (PID 30775): ✗ I am SECONDARY
Worker 3 (PID 30776): ✗ I am SECONDARY
```
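
The script itself is not shown here; a minimal sketch of an equivalent check, spawning several processes that each try to become primary (assumes the `is_primary_worker` API above):

```python
import os
import time
from multiprocessing import Process

from alpine_bits_python.worker_coordination import is_primary_worker

def worker(index: int) -> None:
    is_primary, lock = is_primary_worker()
    role = "✓ I am PRIMARY" if is_primary else "✗ I am SECONDARY"
    print(f"Worker {index} (PID {os.getpid()}): {role}")
    # Hold the lock long enough for sibling processes to observe it as busy.
    time.sleep(1)
    if is_primary:
        lock.release()

if __name__ == "__main__":
    procs = [Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```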

## Failover Behavior

### Primary Worker Crashes

1. Primary worker holds the lock
2. Primary worker crashes/exits → lock is automatically released by the OS
3. Existing secondary workers remain secondary (they already failed to acquire the lock)
4. **Next restart**: First worker to start becomes the new primary

### Graceful Restart

1. Send SIGTERM to workers
2. Primary worker releases the lock during shutdown
3. New workers start, one becomes primary

## Lock File Location

Default: `/tmp/alpinebits_primary_worker.lock`

### Change Lock Location

```python
from alpine_bits_python.worker_coordination import WorkerLock

# Custom location
lock = WorkerLock("/var/run/alpinebits/primary.lock")
is_primary = lock.acquire()
```

**Production recommendation**: Use `/var/run/` or `/run/` for lock files (automatically cleaned on reboot).

## Common Issues

### Issue: All workers think they're primary

**Cause**: Lock file path not accessible, or workers running in separate containers.

**Solution**:

- Check file permissions on the lock directory
- For containers: use a shared volume or Redis-based coordination

### Issue: No worker becomes primary

**Cause**: Lock file from a previous run still exists.

**Solution**:

```bash
# Clean up stale lock file
rm /tmp/alpinebits_primary_worker.lock
# Restart application
```

### Issue: Duplicate emails still being sent

**Cause**: The email handler runs on all workers, not just the scheduler.

**Solution**: This is correct behavior. The email **alert handler** runs on all workers (to catch errors from any worker), while the email **scheduler** runs only on the primary. Alerts can come from any worker; scheduled reports come only from the primary.
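
In code, the distinction is just the `enable_scheduler` flag from the lifespan example: the branch there can be condensed to one call.

```python
# All workers attach the email alert handler; only the primary worker
# enables the scheduled report job (condensed from the lifespan example).
email_handler, report_scheduler = setup_logging(
    config, email_service, loop, enable_scheduler=is_primary
)
```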

## Alternative Approaches

### Redis-Based Coordination

For multi-container deployments, consider Redis-based locks:

```python
import redis
from redis.lock import Lock

redis_client = redis.Redis(host='redis', port=6379)
lock = Lock(redis_client, "alpinebits_primary_worker", timeout=60)

if lock.acquire(blocking=False):
    # This is the primary worker
    start_schedulers()
```

**Pros**: Works across containers

**Cons**: Requires Redis dependency
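
Note that with `timeout=60` the lock expires unless renewed, so a long-lived primary must extend it before the TTL runs out. A sketch, assuming a recent redis-py where `Lock.extend` accepts `replace_ttl` (which resets the TTL instead of adding to it):

```python
import asyncio

async def keep_lock_alive(lock, interval: float = 30.0) -> None:
    # Re-arm the TTL well before it expires. If the worker dies, this task
    # dies with it, the lock lapses, and another worker can take over.
    while True:
        await asyncio.sleep(interval)
        lock.extend(60, replace_ttl=True)
```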

### Environment Variable (Not Recommended)

```bash
# Manually set primary worker
ALPINEBITS_PRIMARY_WORKER=true uvicorn app:app
```

**Pros**: Simple

**Cons**: Manual configuration, no automatic failover

## Best Practices

1. ✅ **Use file locks for single-host deployments** (our implementation)
2. ✅ **Use Redis locks for multi-container deployments**
3. ✅ **Log primary/secondary status at startup**
4. ✅ **Always release locks on shutdown**
5. ✅ **Keep lock files in `/var/run/` or `/tmp/`**
6. ❌ **Don't rely on process names** (unreliable with uvicorn)
7. ❌ **Don't use environment variables** (no automatic failover)
8. ❌ **Don't skip coordination** (will cause duplicate notifications)

## Summary

With file-based worker coordination:

- ✅ Only ONE worker runs singleton services (schedulers, migrations)
- ✅ All workers handle HTTP requests normally
- ✅ Failover if the primary worker crashes
- ✅ No external dependencies needed
- ✅ Works with uvicorn, gunicorn, and other ASGI servers

This ensures you get the benefits of multiple workers (concurrency) without duplicate email notifications or race conditions.