# Database Schema Analysis ## Overview This document analyzes the database schema for normalization issues, redundancy, and potential improvements. ## Schema Summary The database contains 13 tables organized around several core concepts: - **Customer/Guest Management**: `customers`, `hashed_customers`, `conversion_guests` - **Reservations**: `reservations`, `conversions`, `conversion_rooms` - **Hotels**: `hotels`, `hotel_inventory`, `room_availability` - **Webhooks**: `webhook_endpoints`, `webhook_requests` - **Tracking**: `acked_requests` --- ## Major Issues Identified ### 1. **CRITICAL: Dual Customer Systems (Data Duplication)** **Problem**: The schema maintains two parallel customer tracking systems: - `customers` + `hashed_customers` (from Wix forms) - `conversion_guests` (from PMS) **Impact**: - Same person can exist in both systems with no linkage - `conversion_guests.hashed_customer_id` attempts to link but this is backward (many-to-one instead of one-to-one) - Data inconsistency when same guest appears in both sources **Details**: ``` customers (id=1, email="john@example.com") └─ hashed_customers (id=1, customer_id=1, hashed_email="abc123...") conversion_guests (hotel_id="HOTEL1", guest_id=42, guest_email="john@example.com") └─ hashed_customer_id = NULL (or points to hashed_customers.id=1 after matching) ``` **Recommendation**: - Create a unified `persons` table with a `source` field ("wix", "pms", "merged") - Both `customers` and `conversion_guests` should reference this unified entity - Implement proper guest matching/merging logic --- ### 2. **Data Redundancy: Hashed Values Stored Separately** **Problem**: `hashed_customers` and `conversion_guests` store hashed values in separate columns alongside originals. **Current Structure**: ``` customers: - email_address (plaintext) - phone (plaintext) hashed_customers: - customer_id (FK to customers) - hashed_email - hashed_phone - hashed_given_name ... ``` **Issues**: - Violates 3NF (derived data stored in separate table) - Synchronization required between `customers` and `hashed_customers` - If customer data changes, hashed version can become stale - Extra JOIN required for every Meta Conversion API call **Better Approach**: Option A: Store hashed values directly in `customers` table as additional columns Option B: Compute hashes on-the-fly (SHA256 is fast, ~1-2ms per hash) **Recommendation**: - **Short term**: Keep current structure but add triggers to auto-update hashed values - **Long term**: Move hashed columns into `customers` table directly --- ### 3. **Advertising Account IDs Duplicated Across Tables** **Problem**: `meta_account_id` and `google_account_id` appear in 3 places: - `hotels` table (canonical source) - `reservations` table (copied at creation time) - Derived from `fbclid`/`gclid` tracking parameters **Current Flow**: ``` hotels.meta_account_id = "123456" ↓ reservation created with fbclid ↓ reservations.meta_account_id = "123456" (copied from hotels) ``` **Issues**: - Denormalization without clear benefit - If hotel's account ID changes, old reservations have stale data - Mixed source of truth (sometimes from hotels, sometimes from tracking params) **Recommendation**: - Remove `meta_account_id` and `google_account_id` from `reservations` - Always derive from `hotels` table via JOIN - If tracking-derived account differs from hotel's account, log a warning --- ### 4. **Hotel Information Duplicated in Reservations** **Problem**: `reservations` table stores `hotel_code` and `hotel_name` but has no FK to `hotels` table. **Issues**: - Data can become inconsistent if hotel name changes - No referential integrity - Unclear if `hotel_code` matches `hotels.hotel_id` **Recommendation**: - Add `hotel_id` FK column to `reservations` pointing to `hotels.hotel_id` - Remove `hotel_code` and `hotel_name` columns - Derive hotel information via JOIN when needed --- ### 5. **Weak Foreign Key Consistency** **Problem**: Mixed use of `ON DELETE` policies: - Some FKs use `SET NULL` (appropriate for nullable relationships) - Some use `CASCADE` (appropriate for child records) - Some use `NO ACTION` (prevents deletion, may cause issues) - `conversions` table has confusing composite FK setup with `hotel_id` and `guest_id` **Examples**: ```sql -- Good: Child data should be deleted with parent hotel_inventory.hotel_id → hotels.hotel_id (ON DELETE CASCADE) -- Questionable: Should webhook requests survive hotel deletion? webhook_requests.hotel_id → hotels.hotel_id (ON DELETE NO ACTION) -- Inconsistent: Why SET NULL vs CASCADE? reservations.customer_id → customers.id (ON DELETE SET NULL) reservations.hashed_customer_id → hashed_customers.id (ON DELETE CASCADE) ``` **Recommendation**: Review each FK and establish consistent policies: - Core data (hotels, customers): SET NULL to preserve historical records - Supporting data (hashed_customers, inventory): CASCADE - Transactional data (webhooks, conversions): Decide on retention policy --- ### 6. **Confusing Composite Foreign Key in Conversions** **Problem**: The `conversions` table has a composite FK that's incorrectly mapped: ```python # In db.py lines 650-655 __table_args__ = ( ForeignKeyConstraint( ["hotel_id", "guest_id"], ["conversion_guests.hotel_id", "conversion_guests.guest_id"], ondelete="SET NULL", ), ) ``` **But the database shows**: ``` Foreign Keys: hotel_id -> conversion_guests.hotel_id (ON DELETE SET NULL) guest_id -> conversion_guests.hotel_id (ON DELETE SET NULL) # ← WRONG! guest_id -> conversion_guests.guest_id (ON DELETE SET NULL) hotel_id -> conversion_guests.guest_id (ON DELETE SET NULL) # ← WRONG! ``` **Impact**: - Database has 4 FKs instead of 1 composite FK - Mapping is incorrect (guest_id → hotel_id doesn't make sense) - Could cause constraint violations or allow orphaned records **Recommendation**: - Fix the composite FK definition in SQLAlchemy - Run a migration to drop incorrect FKs and recreate properly --- ### 7. **Unclear Relationship Between Reservations and Conversions** **Problem**: The relationship between `reservations` (from Wix forms) and `conversions` (from PMS) is complex: ``` conversions: - reservation_id (FK to reservations) - matched by tracking IDs - customer_id (FK to customers) - matched by guest details - hashed_customer_id (FK to hashed_customers) - matched by hashed guest details - guest_id (FK to conversion_guests) - the actual PMS guest ``` **Issues**: - Three different FK fields to three different customer/guest tables - Matching logic is unclear from schema alone - `directly_attributable` and `guest_matched` flags indicate matching quality, but this should be more explicit **Recommendation**: - Add a `match_confidence` enum field: "exact_id", "high_confidence", "medium_confidence", "no_match" - Add `match_method` field to explain how the link was made - Consider a separate `reservation_conversion_links` table to make the many-to-many relationship explicit --- ### 8. **Room Type Information Scattered** **Problem**: Room information appears in multiple places: - `reservations.room_type_code`, `room_classification_code`, `room_type` - `conversion_rooms.room_type`, `room_number` - `hotel_inventory.inv_type_code`, `inv_code`, `room_name` **Issues**: - No clear master data for room types - Room type codes not standardized across sources - No FK between `reservations.room_type_code` and `hotel_inventory.inv_type_code` **Recommendation**: - Create a `room_types` reference table linked to hotels - Add FKs from reservations and conversion_rooms to room_types - Standardize room type codes across all sources --- ## Normalization Analysis ### 1st Normal Form (1NF): ✅ PASS - All columns contain atomic values - **Exception**: `reservations.children_ages` stores comma-separated values - Should be: separate `reservation_children` table with age column ### 2nd Normal Form (2NF): ⚠️ MOSTLY PASS - All non-key attributes depend on the full primary key - **Issue**: Some denormalized data exists (hotel names, account IDs in reservations) ### 3rd Normal Form (3NF): ❌ FAIL Multiple violations: - `hashed_customers` stores derived data (hashes) that depend on `customers` - `reservations.meta_account_id` depends on `hotels` via hotel_code - `reservations.hotel_name` depends on `hotels` via hotel_code --- ## Data Integrity Issues ### Missing Foreign Keys 1. **reservations.hotel_code** → should FK to hotels.hotel_id 2. **reservations.room_type_code** → should FK to hotel_inventory 3. **acked_requests.unique_id** → should FK to reservations.unique_id (or be nullable) ### Missing Indexes Consider adding for query performance: 1. `customers.email_address` - for lookups during conversion matching 2. `conversions.reservation_date` - for time-based queries 3. `conversion_rooms.total_revenue` - for revenue analytics 4. `reservations.start_date`, `end_date` - for date range queries ### Missing Constraints 1. **Check constraints** for date logic: - `reservations.end_date > start_date` - `conversion_rooms.departure_date > arrival_date` 2. **Check constraints** for counts: - `num_adults >= 0`, `num_children >= 0` 3. **NOT NULL constraints** on critical fields: - `customers.contact_id` should be NOT NULL (it's the natural key) - `conversions.hotel_id` is NOT NULL ✓ (good) --- ## Recommendations Priority ### HIGH PRIORITY (Data Integrity) 1. Fix composite FK in `conversions` table (lines 650-655 in db.py) 2. Add `hotel_id` FK to `reservations` table 3. Add missing NOT NULL constraints on natural keys 4. Add check constraints for date ranges and counts ### MEDIUM PRIORITY (Normalization) 5. Unify customer/guest systems into a single `persons` entity 6. Remove duplicate account ID fields from `reservations` 7. Remove `hotel_name` from `reservations` (derive via JOIN) 8. Create `reservation_children` table for children_ages ### LOW PRIORITY (Performance & Cleanup) 9. Move hashed fields into `customers` table (remove `hashed_customers`) 10. Add indexes for common query patterns 11. Create `room_types` reference table 12. Add `match_confidence` and `match_method` to `conversions` --- ## Positive Aspects ✅ Good use of composite keys (`conversion_guests`, `hotel_inventory`) ✅ Unique constraints on natural keys (`contact_id`, `webhook_secret`) ✅ Proper use of indexes on frequently queried fields ✅ Cascade deletion for child records (inventory, rooms) ✅ Tracking metadata (created_at, updated_at, first_seen, last_seen) ✅ Webhook deduplication via `payload_hash` ✅ JSON storage for flexible data (`conversion_rooms.daily_sales`) --- ## Suggested Refactoring Path ### Phase 1: Fix Critical Issues (1-2 days) - Fix composite FK in conversions - Add hotel_id FK to reservations - Add missing constraints ### Phase 2: Normalize Customer Data (3-5 days) - Create unified persons/guests table - Migrate existing data - Update matching logic ### Phase 3: Clean Up Redundancy (2-3 days) - Remove duplicate account IDs - Merge hashed_customers into customers - Create room_types reference ### Phase 4: Enhance Tracking (1-2 days) - Add match_confidence fields - Improve conversion attribution - Add missing indexes --- ## Query Examples Affected by Current Issues ### Issue: Duplicate Customer Data ```sql -- Current: Find all reservations for a guest (requires checking both systems) SELECT r.* FROM reservations r WHERE r.customer_id = ? OR r.hashed_customer_id IN ( SELECT id FROM hashed_customers WHERE contact_id = ? ); -- After fix: Simple unified query SELECT r.* FROM reservations r WHERE r.person_id = ?; ``` ### Issue: Missing Hotel FK ```sql -- Current: Get hotel info for reservation (unreliable) SELECT r.*, r.hotel_name FROM reservations r WHERE r.id = ?; -- After fix: Reliable JOIN SELECT r.*, h.hotel_name, h.meta_account_id FROM reservations r JOIN hotels h ON r.hotel_id = h.hotel_id WHERE r.id = ?; ``` ### Issue: Hashed Data in Separate Table ```sql -- Current: Get customer for Meta API (requires JOIN) SELECT hc.hashed_email, hc.hashed_phone FROM reservations r JOIN hashed_customers hc ON r.hashed_customer_id = hc.id WHERE r.id = ?; -- After fix: Direct access SELECT c.hashed_email, c.hashed_phone FROM reservations r JOIN customers c ON r.customer_id = c.id WHERE r.id = ?; ``` --- ## Conclusion The schema is **functional but has significant normalization and consistency issues**. The main problems are: 1. **Dual customer tracking systems** that should be unified 2. **Redundant storage of derived data** (hashes, account IDs) 3. **Missing foreign key relationships** (hotels, room types) 4. **Inconsistent deletion policies** across foreign keys 5. **Broken composite foreign key** in conversions table The database violates 3NF in several places and could benefit from a refactoring effort. However, the issues are primarily architectural rather than critical bugs, so the system can continue operating while improvements are made incrementally. **Estimated effort to fix all issues**: 1-2 weeks of development + testing **Risk level**: Medium (requires data migration and careful FK updates) **Recommended approach**: Incremental fixes starting with high-priority items