Holy db migrations batman

This commit is contained in:
Jonas Linter
2025-12-03 10:41:34 +01:00
parent 2c3d779ab2
commit 08f85d1b26
19 changed files with 752 additions and 84 deletions

396
database_schema_analysis.md Normal file
View File

@@ -0,0 +1,396 @@
# Database Schema Analysis
## Overview
This document analyzes the database schema for normalization issues, redundancy, and potential improvements.
## Schema Summary
The database contains 13 tables organized around several core concepts:
- **Customer/Guest Management**: `customers`, `hashed_customers`, `conversion_guests`
- **Reservations**: `reservations`, `conversions`, `conversion_rooms`
- **Hotels**: `hotels`, `hotel_inventory`, `room_availability`
- **Webhooks**: `webhook_endpoints`, `webhook_requests`
- **Tracking**: `acked_requests`
---
## Major Issues Identified
### 1. **CRITICAL: Dual Customer Systems (Data Duplication)**
**Problem**: The schema maintains two parallel customer tracking systems:
- `customers` + `hashed_customers` (from Wix forms)
- `conversion_guests` (from PMS)
**Impact**:
- Same person can exist in both systems with no linkage
- `conversion_guests.hashed_customer_id` attempts to link but this is backward (many-to-one instead of one-to-one)
- Data inconsistency when same guest appears in both sources
**Details**:
```
customers (id=1, email="john@example.com")
└─ hashed_customers (id=1, customer_id=1, hashed_email="abc123...")
conversion_guests (hotel_id="HOTEL1", guest_id=42, guest_email="john@example.com")
└─ hashed_customer_id = NULL (or points to hashed_customers.id=1 after matching)
```
**Recommendation**:
- Create a unified `persons` table with a `source` field ("wix", "pms", "merged")
- Both `customers` and `conversion_guests` should reference this unified entity
- Implement proper guest matching/merging logic
---
### 2. **Data Redundancy: Hashed Values Stored Separately**
**Problem**: `hashed_customers` and `conversion_guests` store hashed values in separate columns alongside originals.
**Current Structure**:
```
customers:
- email_address (plaintext)
- phone (plaintext)
hashed_customers:
- customer_id (FK to customers)
- hashed_email
- hashed_phone
- hashed_given_name
...
```
**Issues**:
- Violates 3NF (derived data stored in separate table)
- Synchronization required between `customers` and `hashed_customers`
- If customer data changes, hashed version can become stale
- Extra JOIN required for every Meta Conversion API call
**Better Approach**:
Option A: Store hashed values directly in `customers` table as additional columns
Option B: Compute hashes on-the-fly (SHA256 is fast, ~1-2ms per hash)
**Recommendation**:
- **Short term**: Keep current structure but add triggers to auto-update hashed values
- **Long term**: Move hashed columns into `customers` table directly
---
### 3. **Advertising Account IDs Duplicated Across Tables**
**Problem**: `meta_account_id` and `google_account_id` appear in 3 places:
- `hotels` table (canonical source)
- `reservations` table (copied at creation time)
- Derived from `fbclid`/`gclid` tracking parameters
**Current Flow**:
```
hotels.meta_account_id = "123456"
reservation created with fbclid
reservations.meta_account_id = "123456" (copied from hotels)
```
**Issues**:
- Denormalization without clear benefit
- If hotel's account ID changes, old reservations have stale data
- Mixed source of truth (sometimes from hotels, sometimes from tracking params)
**Recommendation**:
- Remove `meta_account_id` and `google_account_id` from `reservations`
- Always derive from `hotels` table via JOIN
- If tracking-derived account differs from hotel's account, log a warning
---
### 4. **Hotel Information Duplicated in Reservations**
**Problem**: `reservations` table stores `hotel_code` and `hotel_name` but has no FK to `hotels` table.
**Issues**:
- Data can become inconsistent if hotel name changes
- No referential integrity
- Unclear if `hotel_code` matches `hotels.hotel_id`
**Recommendation**:
- Add `hotel_id` FK column to `reservations` pointing to `hotels.hotel_id`
- Remove `hotel_code` and `hotel_name` columns
- Derive hotel information via JOIN when needed
---
### 5. **Weak Foreign Key Consistency**
**Problem**: Mixed use of `ON DELETE` policies:
- Some FKs use `SET NULL` (appropriate for nullable relationships)
- Some use `CASCADE` (appropriate for child records)
- Some use `NO ACTION` (prevents deletion, may cause issues)
- `conversions` table has confusing composite FK setup with `hotel_id` and `guest_id`
**Examples**:
```sql
-- Good: Child data should be deleted with parent
hotel_inventory.hotel_id hotels.hotel_id (ON DELETE CASCADE)
-- Questionable: Should webhook requests survive hotel deletion?
webhook_requests.hotel_id hotels.hotel_id (ON DELETE NO ACTION)
-- Inconsistent: Why SET NULL vs CASCADE?
reservations.customer_id customers.id (ON DELETE SET NULL)
reservations.hashed_customer_id hashed_customers.id (ON DELETE CASCADE)
```
**Recommendation**:
Review each FK and establish consistent policies:
- Core data (hotels, customers): SET NULL to preserve historical records
- Supporting data (hashed_customers, inventory): CASCADE
- Transactional data (webhooks, conversions): Decide on retention policy
---
### 6. **Confusing Composite Foreign Key in Conversions**
**Problem**: The `conversions` table has a composite FK that's incorrectly mapped:
```python
# In db.py lines 650-655
__table_args__ = (
ForeignKeyConstraint(
["hotel_id", "guest_id"],
["conversion_guests.hotel_id", "conversion_guests.guest_id"],
ondelete="SET NULL",
),
)
```
**But the database shows**:
```
Foreign Keys:
hotel_id -> conversion_guests.hotel_id (ON DELETE SET NULL)
guest_id -> conversion_guests.hotel_id (ON DELETE SET NULL) # ← WRONG!
guest_id -> conversion_guests.guest_id (ON DELETE SET NULL)
hotel_id -> conversion_guests.guest_id (ON DELETE SET NULL) # ← WRONG!
```
**Impact**:
- Database has 4 FKs instead of 1 composite FK
- Mapping is incorrect (guest_id → hotel_id doesn't make sense)
- Could cause constraint violations or allow orphaned records
**Recommendation**:
- Fix the composite FK definition in SQLAlchemy
- Run a migration to drop incorrect FKs and recreate properly
---
### 7. **Unclear Relationship Between Reservations and Conversions**
**Problem**: The relationship between `reservations` (from Wix forms) and `conversions` (from PMS) is complex:
```
conversions:
- reservation_id (FK to reservations) - matched by tracking IDs
- customer_id (FK to customers) - matched by guest details
- hashed_customer_id (FK to hashed_customers) - matched by hashed guest details
- guest_id (FK to conversion_guests) - the actual PMS guest
```
**Issues**:
- Three different FK fields to three different customer/guest tables
- Matching logic is unclear from schema alone
- `directly_attributable` and `guest_matched` flags indicate matching quality, but this should be more explicit
**Recommendation**:
- Add a `match_confidence` enum field: "exact_id", "high_confidence", "medium_confidence", "no_match"
- Add `match_method` field to explain how the link was made
- Consider a separate `reservation_conversion_links` table to make the many-to-many relationship explicit
---
### 8. **Room Type Information Scattered**
**Problem**: Room information appears in multiple places:
- `reservations.room_type_code`, `room_classification_code`, `room_type`
- `conversion_rooms.room_type`, `room_number`
- `hotel_inventory.inv_type_code`, `inv_code`, `room_name`
**Issues**:
- No clear master data for room types
- Room type codes not standardized across sources
- No FK between `reservations.room_type_code` and `hotel_inventory.inv_type_code`
**Recommendation**:
- Create a `room_types` reference table linked to hotels
- Add FKs from reservations and conversion_rooms to room_types
- Standardize room type codes across all sources
---
## Normalization Analysis
### 1st Normal Form (1NF): ✅ PASS
- All columns contain atomic values
- **Exception**: `reservations.children_ages` stores comma-separated values
- Should be: separate `reservation_children` table with age column
### 2nd Normal Form (2NF): ⚠️ MOSTLY PASS
- All non-key attributes depend on the full primary key
- **Issue**: Some denormalized data exists (hotel names, account IDs in reservations)
### 3rd Normal Form (3NF): ❌ FAIL
Multiple violations:
- `hashed_customers` stores derived data (hashes) that depend on `customers`
- `reservations.meta_account_id` depends on `hotels` via hotel_code
- `reservations.hotel_name` depends on `hotels` via hotel_code
---
## Data Integrity Issues
### Missing Foreign Keys
1. **reservations.hotel_code** → should FK to hotels.hotel_id
2. **reservations.room_type_code** → should FK to hotel_inventory
3. **acked_requests.unique_id** → should FK to reservations.unique_id (or be nullable)
### Missing Indexes
Consider adding for query performance:
1. `customers.email_address` - for lookups during conversion matching
2. `conversions.reservation_date` - for time-based queries
3. `conversion_rooms.total_revenue` - for revenue analytics
4. `reservations.start_date`, `end_date` - for date range queries
### Missing Constraints
1. **Check constraints** for date logic:
- `reservations.end_date > start_date`
- `conversion_rooms.departure_date > arrival_date`
2. **Check constraints** for counts:
- `num_adults >= 0`, `num_children >= 0`
3. **NOT NULL constraints** on critical fields:
- `customers.contact_id` should be NOT NULL (it's the natural key)
- `conversions.hotel_id` is NOT NULL ✓ (good)
---
## Recommendations Priority
### HIGH PRIORITY (Data Integrity)
1. Fix composite FK in `conversions` table (lines 650-655 in db.py)
2. Add `hotel_id` FK to `reservations` table
3. Add missing NOT NULL constraints on natural keys
4. Add check constraints for date ranges and counts
### MEDIUM PRIORITY (Normalization)
5. Unify customer/guest systems into a single `persons` entity
6. Remove duplicate account ID fields from `reservations`
7. Remove `hotel_name` from `reservations` (derive via JOIN)
8. Create `reservation_children` table for children_ages
### LOW PRIORITY (Performance & Cleanup)
9. Move hashed fields into `customers` table (remove `hashed_customers`)
10. Add indexes for common query patterns
11. Create `room_types` reference table
12. Add `match_confidence` and `match_method` to `conversions`
---
## Positive Aspects
✅ Good use of composite keys (`conversion_guests`, `hotel_inventory`)
✅ Unique constraints on natural keys (`contact_id`, `webhook_secret`)
✅ Proper use of indexes on frequently queried fields
✅ Cascade deletion for child records (inventory, rooms)
✅ Tracking metadata (created_at, updated_at, first_seen, last_seen)
✅ Webhook deduplication via `payload_hash`
✅ JSON storage for flexible data (`conversion_rooms.daily_sales`)
---
## Suggested Refactoring Path
### Phase 1: Fix Critical Issues (1-2 days)
- Fix composite FK in conversions
- Add hotel_id FK to reservations
- Add missing constraints
### Phase 2: Normalize Customer Data (3-5 days)
- Create unified persons/guests table
- Migrate existing data
- Update matching logic
### Phase 3: Clean Up Redundancy (2-3 days)
- Remove duplicate account IDs
- Merge hashed_customers into customers
- Create room_types reference
### Phase 4: Enhance Tracking (1-2 days)
- Add match_confidence fields
- Improve conversion attribution
- Add missing indexes
---
## Query Examples Affected by Current Issues
### Issue: Duplicate Customer Data
```sql
-- Current: Find all reservations for a guest (requires checking both systems)
SELECT r.* FROM reservations r
WHERE r.customer_id = ?
OR r.hashed_customer_id IN (
SELECT id FROM hashed_customers WHERE contact_id = ?
);
-- After fix: Simple unified query
SELECT r.* FROM reservations r
WHERE r.person_id = ?;
```
### Issue: Missing Hotel FK
```sql
-- Current: Get hotel info for reservation (unreliable)
SELECT r.*, r.hotel_name
FROM reservations r
WHERE r.id = ?;
-- After fix: Reliable JOIN
SELECT r.*, h.hotel_name, h.meta_account_id
FROM reservations r
JOIN hotels h ON r.hotel_id = h.hotel_id
WHERE r.id = ?;
```
### Issue: Hashed Data in Separate Table
```sql
-- Current: Get customer for Meta API (requires JOIN)
SELECT hc.hashed_email, hc.hashed_phone
FROM reservations r
JOIN hashed_customers hc ON r.hashed_customer_id = hc.id
WHERE r.id = ?;
-- After fix: Direct access
SELECT c.hashed_email, c.hashed_phone
FROM reservations r
JOIN customers c ON r.customer_id = c.id
WHERE r.id = ?;
```
---
## Conclusion
The schema is **functional but has significant normalization and consistency issues**. The main problems are:
1. **Dual customer tracking systems** that should be unified
2. **Redundant storage of derived data** (hashes, account IDs)
3. **Missing foreign key relationships** (hotels, room types)
4. **Inconsistent deletion policies** across foreign keys
5. **Broken composite foreign key** in conversions table
The database violates 3NF in several places and could benefit from a refactoring effort. However, the issues are primarily architectural rather than critical bugs, so the system can continue operating while improvements are made incrementally.
**Estimated effort to fix all issues**: 1-2 weeks of development + testing
**Risk level**: Medium (requires data migration and careful FK updates)
**Recommended approach**: Incremental fixes starting with high-priority items