Holy db migrations batman
This commit is contained in:
396
database_schema_analysis.md
Normal file
396
database_schema_analysis.md
Normal file
@@ -0,0 +1,396 @@
|
||||
# Database Schema Analysis
|
||||
|
||||
## Overview
|
||||
This document analyzes the database schema for normalization issues, redundancy, and potential improvements.
|
||||
|
||||
## Schema Summary
|
||||
The database contains 13 tables organized around several core concepts:
|
||||
- **Customer/Guest Management**: `customers`, `hashed_customers`, `conversion_guests`
|
||||
- **Reservations**: `reservations`, `conversions`, `conversion_rooms`
|
||||
- **Hotels**: `hotels`, `hotel_inventory`, `room_availability`
|
||||
- **Webhooks**: `webhook_endpoints`, `webhook_requests`
|
||||
- **Tracking**: `acked_requests`
|
||||
|
||||
---
|
||||
|
||||
## Major Issues Identified
|
||||
|
||||
### 1. **CRITICAL: Dual Customer Systems (Data Duplication)**
|
||||
|
||||
**Problem**: The schema maintains two parallel customer tracking systems:
|
||||
- `customers` + `hashed_customers` (from Wix forms)
|
||||
- `conversion_guests` (from PMS)
|
||||
|
||||
**Impact**:
|
||||
- Same person can exist in both systems with no linkage
|
||||
- `conversion_guests.hashed_customer_id` attempts to link but this is backward (many-to-one instead of one-to-one)
|
||||
- Data inconsistency when same guest appears in both sources
|
||||
|
||||
**Details**:
|
||||
```
|
||||
customers (id=1, email="john@example.com")
|
||||
└─ hashed_customers (id=1, customer_id=1, hashed_email="abc123...")
|
||||
|
||||
conversion_guests (hotel_id="HOTEL1", guest_id=42, guest_email="john@example.com")
|
||||
└─ hashed_customer_id = NULL (or points to hashed_customers.id=1 after matching)
|
||||
```
|
||||
|
||||
**Recommendation**:
|
||||
- Create a unified `persons` table with a `source` field ("wix", "pms", "merged")
|
||||
- Both `customers` and `conversion_guests` should reference this unified entity
|
||||
- Implement proper guest matching/merging logic
|
||||
|
||||
---
|
||||
|
||||
### 2. **Data Redundancy: Hashed Values Stored Separately**
|
||||
|
||||
**Problem**: `hashed_customers` and `conversion_guests` store hashed values in separate columns alongside originals.
|
||||
|
||||
**Current Structure**:
|
||||
```
|
||||
customers:
|
||||
- email_address (plaintext)
|
||||
- phone (plaintext)
|
||||
|
||||
hashed_customers:
|
||||
- customer_id (FK to customers)
|
||||
- hashed_email
|
||||
- hashed_phone
|
||||
- hashed_given_name
|
||||
...
|
||||
```
|
||||
|
||||
**Issues**:
|
||||
- Violates 3NF (derived data stored in separate table)
|
||||
- Synchronization required between `customers` and `hashed_customers`
|
||||
- If customer data changes, hashed version can become stale
|
||||
- Extra JOIN required for every Meta Conversion API call
|
||||
|
||||
**Better Approach**:
|
||||
Option A: Store hashed values directly in `customers` table as additional columns
|
||||
Option B: Compute hashes on-the-fly (SHA256 is fast, ~1-2ms per hash)
|
||||
|
||||
**Recommendation**:
|
||||
- **Short term**: Keep current structure but add triggers to auto-update hashed values
|
||||
- **Long term**: Move hashed columns into `customers` table directly
|
||||
|
||||
---
|
||||
|
||||
### 3. **Advertising Account IDs Duplicated Across Tables**
|
||||
|
||||
**Problem**: `meta_account_id` and `google_account_id` appear in 3 places:
|
||||
- `hotels` table (canonical source)
|
||||
- `reservations` table (copied at creation time)
|
||||
- Derived from `fbclid`/`gclid` tracking parameters
|
||||
|
||||
**Current Flow**:
|
||||
```
|
||||
hotels.meta_account_id = "123456"
|
||||
↓
|
||||
reservation created with fbclid
|
||||
↓
|
||||
reservations.meta_account_id = "123456" (copied from hotels)
|
||||
```
|
||||
|
||||
**Issues**:
|
||||
- Denormalization without clear benefit
|
||||
- If hotel's account ID changes, old reservations have stale data
|
||||
- Mixed source of truth (sometimes from hotels, sometimes from tracking params)
|
||||
|
||||
**Recommendation**:
|
||||
- Remove `meta_account_id` and `google_account_id` from `reservations`
|
||||
- Always derive from `hotels` table via JOIN
|
||||
- If tracking-derived account differs from hotel's account, log a warning
|
||||
|
||||
---
|
||||
|
||||
### 4. **Hotel Information Duplicated in Reservations**
|
||||
|
||||
**Problem**: `reservations` table stores `hotel_code` and `hotel_name` but has no FK to `hotels` table.
|
||||
|
||||
**Issues**:
|
||||
- Data can become inconsistent if hotel name changes
|
||||
- No referential integrity
|
||||
- Unclear if `hotel_code` matches `hotels.hotel_id`
|
||||
|
||||
**Recommendation**:
|
||||
- Add `hotel_id` FK column to `reservations` pointing to `hotels.hotel_id`
|
||||
- Remove `hotel_code` and `hotel_name` columns
|
||||
- Derive hotel information via JOIN when needed
|
||||
|
||||
---
|
||||
|
||||
### 5. **Weak Foreign Key Consistency**
|
||||
|
||||
**Problem**: Mixed use of `ON DELETE` policies:
|
||||
- Some FKs use `SET NULL` (appropriate for nullable relationships)
|
||||
- Some use `CASCADE` (appropriate for child records)
|
||||
- Some use `NO ACTION` (prevents deletion, may cause issues)
|
||||
- `conversions` table has confusing composite FK setup with `hotel_id` and `guest_id`
|
||||
|
||||
**Examples**:
|
||||
```sql
|
||||
-- Good: Child data should be deleted with parent
|
||||
hotel_inventory.hotel_id → hotels.hotel_id (ON DELETE CASCADE)
|
||||
|
||||
-- Questionable: Should webhook requests survive hotel deletion?
|
||||
webhook_requests.hotel_id → hotels.hotel_id (ON DELETE NO ACTION)
|
||||
|
||||
-- Inconsistent: Why SET NULL vs CASCADE?
|
||||
reservations.customer_id → customers.id (ON DELETE SET NULL)
|
||||
reservations.hashed_customer_id → hashed_customers.id (ON DELETE CASCADE)
|
||||
```
|
||||
|
||||
**Recommendation**:
|
||||
Review each FK and establish consistent policies:
|
||||
- Core data (hotels, customers): SET NULL to preserve historical records
|
||||
- Supporting data (hashed_customers, inventory): CASCADE
|
||||
- Transactional data (webhooks, conversions): Decide on retention policy
|
||||
|
||||
---
|
||||
|
||||
### 6. **Confusing Composite Foreign Key in Conversions**
|
||||
|
||||
**Problem**: The `conversions` table has a composite FK that's incorrectly mapped:
|
||||
|
||||
```python
|
||||
# In db.py lines 650-655
|
||||
__table_args__ = (
|
||||
ForeignKeyConstraint(
|
||||
["hotel_id", "guest_id"],
|
||||
["conversion_guests.hotel_id", "conversion_guests.guest_id"],
|
||||
ondelete="SET NULL",
|
||||
),
|
||||
)
|
||||
```
|
||||
|
||||
**But the database shows**:
|
||||
```
|
||||
Foreign Keys:
|
||||
hotel_id -> conversion_guests.hotel_id (ON DELETE SET NULL)
|
||||
guest_id -> conversion_guests.hotel_id (ON DELETE SET NULL) # ← WRONG!
|
||||
guest_id -> conversion_guests.guest_id (ON DELETE SET NULL)
|
||||
hotel_id -> conversion_guests.guest_id (ON DELETE SET NULL) # ← WRONG!
|
||||
```
|
||||
|
||||
**Impact**:
|
||||
- Database has 4 FKs instead of 1 composite FK
|
||||
- Mapping is incorrect (guest_id → hotel_id doesn't make sense)
|
||||
- Could cause constraint violations or allow orphaned records
|
||||
|
||||
**Recommendation**:
|
||||
- Fix the composite FK definition in SQLAlchemy
|
||||
- Run a migration to drop incorrect FKs and recreate properly
|
||||
|
||||
---
|
||||
|
||||
### 7. **Unclear Relationship Between Reservations and Conversions**
|
||||
|
||||
**Problem**: The relationship between `reservations` (from Wix forms) and `conversions` (from PMS) is complex:
|
||||
|
||||
```
|
||||
conversions:
|
||||
- reservation_id (FK to reservations) - matched by tracking IDs
|
||||
- customer_id (FK to customers) - matched by guest details
|
||||
- hashed_customer_id (FK to hashed_customers) - matched by hashed guest details
|
||||
- guest_id (FK to conversion_guests) - the actual PMS guest
|
||||
```
|
||||
|
||||
**Issues**:
|
||||
- Three different FK fields to three different customer/guest tables
|
||||
- Matching logic is unclear from schema alone
|
||||
- `directly_attributable` and `guest_matched` flags indicate matching quality, but this should be more explicit
|
||||
|
||||
**Recommendation**:
|
||||
- Add a `match_confidence` enum field: "exact_id", "high_confidence", "medium_confidence", "no_match"
|
||||
- Add `match_method` field to explain how the link was made
|
||||
- Consider a separate `reservation_conversion_links` table to make the many-to-many relationship explicit
|
||||
|
||||
---
|
||||
|
||||
### 8. **Room Type Information Scattered**
|
||||
|
||||
**Problem**: Room information appears in multiple places:
|
||||
- `reservations.room_type_code`, `room_classification_code`, `room_type`
|
||||
- `conversion_rooms.room_type`, `room_number`
|
||||
- `hotel_inventory.inv_type_code`, `inv_code`, `room_name`
|
||||
|
||||
**Issues**:
|
||||
- No clear master data for room types
|
||||
- Room type codes not standardized across sources
|
||||
- No FK between `reservations.room_type_code` and `hotel_inventory.inv_type_code`
|
||||
|
||||
**Recommendation**:
|
||||
- Create a `room_types` reference table linked to hotels
|
||||
- Add FKs from reservations and conversion_rooms to room_types
|
||||
- Standardize room type codes across all sources
|
||||
|
||||
---
|
||||
|
||||
## Normalization Analysis
|
||||
|
||||
### 1st Normal Form (1NF): ✅ PASS
|
||||
- All columns contain atomic values
|
||||
- **Exception**: `reservations.children_ages` stores comma-separated values
|
||||
- Should be: separate `reservation_children` table with age column
|
||||
|
||||
### 2nd Normal Form (2NF): ⚠️ MOSTLY PASS
|
||||
- All non-key attributes depend on the full primary key
|
||||
- **Issue**: Some denormalized data exists (hotel names, account IDs in reservations)
|
||||
|
||||
### 3rd Normal Form (3NF): ❌ FAIL
|
||||
Multiple violations:
|
||||
- `hashed_customers` stores derived data (hashes) that depend on `customers`
|
||||
- `reservations.meta_account_id` depends on `hotels` via hotel_code
|
||||
- `reservations.hotel_name` depends on `hotels` via hotel_code
|
||||
|
||||
---
|
||||
|
||||
## Data Integrity Issues
|
||||
|
||||
### Missing Foreign Keys
|
||||
1. **reservations.hotel_code** → should FK to hotels.hotel_id
|
||||
2. **reservations.room_type_code** → should FK to hotel_inventory
|
||||
3. **acked_requests.unique_id** → should FK to reservations.unique_id (or be nullable)
|
||||
|
||||
### Missing Indexes
|
||||
Consider adding for query performance:
|
||||
1. `customers.email_address` - for lookups during conversion matching
|
||||
2. `conversions.reservation_date` - for time-based queries
|
||||
3. `conversion_rooms.total_revenue` - for revenue analytics
|
||||
4. `reservations.start_date`, `end_date` - for date range queries
|
||||
|
||||
### Missing Constraints
|
||||
1. **Check constraints** for date logic:
|
||||
- `reservations.end_date > start_date`
|
||||
- `conversion_rooms.departure_date > arrival_date`
|
||||
|
||||
2. **Check constraints** for counts:
|
||||
- `num_adults >= 0`, `num_children >= 0`
|
||||
|
||||
3. **NOT NULL constraints** on critical fields:
|
||||
- `customers.contact_id` should be NOT NULL (it's the natural key)
|
||||
- `conversions.hotel_id` is NOT NULL ✓ (good)
|
||||
|
||||
---
|
||||
|
||||
## Recommendations Priority
|
||||
|
||||
### HIGH PRIORITY (Data Integrity)
|
||||
1. Fix composite FK in `conversions` table (lines 650-655 in db.py)
|
||||
2. Add `hotel_id` FK to `reservations` table
|
||||
3. Add missing NOT NULL constraints on natural keys
|
||||
4. Add check constraints for date ranges and counts
|
||||
|
||||
### MEDIUM PRIORITY (Normalization)
|
||||
5. Unify customer/guest systems into a single `persons` entity
|
||||
6. Remove duplicate account ID fields from `reservations`
|
||||
7. Remove `hotel_name` from `reservations` (derive via JOIN)
|
||||
8. Create `reservation_children` table for children_ages
|
||||
|
||||
### LOW PRIORITY (Performance & Cleanup)
|
||||
9. Move hashed fields into `customers` table (remove `hashed_customers`)
|
||||
10. Add indexes for common query patterns
|
||||
11. Create `room_types` reference table
|
||||
12. Add `match_confidence` and `match_method` to `conversions`
|
||||
|
||||
---
|
||||
|
||||
## Positive Aspects
|
||||
|
||||
✅ Good use of composite keys (`conversion_guests`, `hotel_inventory`)
|
||||
✅ Unique constraints on natural keys (`contact_id`, `webhook_secret`)
|
||||
✅ Proper use of indexes on frequently queried fields
|
||||
✅ Cascade deletion for child records (inventory, rooms)
|
||||
✅ Tracking metadata (created_at, updated_at, first_seen, last_seen)
|
||||
✅ Webhook deduplication via `payload_hash`
|
||||
✅ JSON storage for flexible data (`conversion_rooms.daily_sales`)
|
||||
|
||||
---
|
||||
|
||||
## Suggested Refactoring Path
|
||||
|
||||
### Phase 1: Fix Critical Issues (1-2 days)
|
||||
- Fix composite FK in conversions
|
||||
- Add hotel_id FK to reservations
|
||||
- Add missing constraints
|
||||
|
||||
### Phase 2: Normalize Customer Data (3-5 days)
|
||||
- Create unified persons/guests table
|
||||
- Migrate existing data
|
||||
- Update matching logic
|
||||
|
||||
### Phase 3: Clean Up Redundancy (2-3 days)
|
||||
- Remove duplicate account IDs
|
||||
- Merge hashed_customers into customers
|
||||
- Create room_types reference
|
||||
|
||||
### Phase 4: Enhance Tracking (1-2 days)
|
||||
- Add match_confidence fields
|
||||
- Improve conversion attribution
|
||||
- Add missing indexes
|
||||
|
||||
---
|
||||
|
||||
## Query Examples Affected by Current Issues
|
||||
|
||||
### Issue: Duplicate Customer Data
|
||||
```sql
|
||||
-- Current: Find all reservations for a guest (requires checking both systems)
|
||||
SELECT r.* FROM reservations r
|
||||
WHERE r.customer_id = ?
|
||||
OR r.hashed_customer_id IN (
|
||||
SELECT id FROM hashed_customers WHERE contact_id = ?
|
||||
);
|
||||
|
||||
-- After fix: Simple unified query
|
||||
SELECT r.* FROM reservations r
|
||||
WHERE r.person_id = ?;
|
||||
```
|
||||
|
||||
### Issue: Missing Hotel FK
|
||||
```sql
|
||||
-- Current: Get hotel info for reservation (unreliable)
|
||||
SELECT r.*, r.hotel_name
|
||||
FROM reservations r
|
||||
WHERE r.id = ?;
|
||||
|
||||
-- After fix: Reliable JOIN
|
||||
SELECT r.*, h.hotel_name, h.meta_account_id
|
||||
FROM reservations r
|
||||
JOIN hotels h ON r.hotel_id = h.hotel_id
|
||||
WHERE r.id = ?;
|
||||
```
|
||||
|
||||
### Issue: Hashed Data in Separate Table
|
||||
```sql
|
||||
-- Current: Get customer for Meta API (requires JOIN)
|
||||
SELECT hc.hashed_email, hc.hashed_phone
|
||||
FROM reservations r
|
||||
JOIN hashed_customers hc ON r.hashed_customer_id = hc.id
|
||||
WHERE r.id = ?;
|
||||
|
||||
-- After fix: Direct access
|
||||
SELECT c.hashed_email, c.hashed_phone
|
||||
FROM reservations r
|
||||
JOIN customers c ON r.customer_id = c.id
|
||||
WHERE r.id = ?;
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
The schema is **functional but has significant normalization and consistency issues**. The main problems are:
|
||||
|
||||
1. **Dual customer tracking systems** that should be unified
|
||||
2. **Redundant storage of derived data** (hashes, account IDs)
|
||||
3. **Missing foreign key relationships** (hotels, room types)
|
||||
4. **Inconsistent deletion policies** across foreign keys
|
||||
5. **Broken composite foreign key** in conversions table
|
||||
|
||||
The database violates 3NF in several places and could benefit from a refactoring effort. However, the issues are primarily architectural rather than critical bugs, so the system can continue operating while improvements are made incrementally.
|
||||
|
||||
**Estimated effort to fix all issues**: 1-2 weeks of development + testing
|
||||
**Risk level**: Medium (requires data migration and careful FK updates)
|
||||
**Recommended approach**: Incremental fixes starting with high-priority items
|
||||
Reference in New Issue
Block a user