Files
alpinebits_python/database_schema_analysis.md
2025-12-03 10:41:34 +01:00

13 KiB

Database Schema Analysis

Overview

This document analyzes the database schema for normalization issues, redundancy, and potential improvements.

Schema Summary

The database contains 13 tables organized around several core concepts:

  • Customer/Guest Management: customers, hashed_customers, conversion_guests
  • Reservations: reservations, conversions, conversion_rooms
  • Hotels: hotels, hotel_inventory, room_availability
  • Webhooks: webhook_endpoints, webhook_requests
  • Tracking: acked_requests

Major Issues Identified

1. CRITICAL: Dual Customer Systems (Data Duplication)

Problem: The schema maintains two parallel customer tracking systems:

  • customers + hashed_customers (from Wix forms)
  • conversion_guests (from PMS)

Impact:

  • Same person can exist in both systems with no linkage
  • conversion_guests.hashed_customer_id attempts to link but this is backward (many-to-one instead of one-to-one)
  • Data inconsistency when same guest appears in both sources

Details:

customers (id=1, email="john@example.com")
  └─ hashed_customers (id=1, customer_id=1, hashed_email="abc123...")

conversion_guests (hotel_id="HOTEL1", guest_id=42, guest_email="john@example.com")
  └─ hashed_customer_id = NULL (or points to hashed_customers.id=1 after matching)

Recommendation:

  • Create a unified persons table with a source field ("wix", "pms", "merged")
  • Both customers and conversion_guests should reference this unified entity
  • Implement proper guest matching/merging logic

2. Data Redundancy: Hashed Values Stored Separately

Problem: hashed_customers and conversion_guests store hashed values in separate columns alongside originals.

Current Structure:

customers:
  - email_address (plaintext)
  - phone (plaintext)

hashed_customers:
  - customer_id (FK to customers)
  - hashed_email
  - hashed_phone
  - hashed_given_name
  ...

Issues:

  • Violates 3NF (derived data stored in separate table)
  • Synchronization required between customers and hashed_customers
  • If customer data changes, hashed version can become stale
  • Extra JOIN required for every Meta Conversion API call

Better Approach: Option A: Store hashed values directly in customers table as additional columns Option B: Compute hashes on-the-fly (SHA256 is fast, ~1-2ms per hash)

Recommendation:

  • Short term: Keep current structure but add triggers to auto-update hashed values
  • Long term: Move hashed columns into customers table directly

3. Advertising Account IDs Duplicated Across Tables

Problem: meta_account_id and google_account_id appear in 3 places:

  • hotels table (canonical source)
  • reservations table (copied at creation time)
  • Derived from fbclid/gclid tracking parameters

Current Flow:

hotels.meta_account_id = "123456"
  ↓
reservation created with fbclid
  ↓
reservations.meta_account_id = "123456" (copied from hotels)

Issues:

  • Denormalization without clear benefit
  • If hotel's account ID changes, old reservations have stale data
  • Mixed source of truth (sometimes from hotels, sometimes from tracking params)

Recommendation:

  • Remove meta_account_id and google_account_id from reservations
  • Always derive from hotels table via JOIN
  • If tracking-derived account differs from hotel's account, log a warning

4. Hotel Information Duplicated in Reservations

Problem: reservations table stores hotel_code and hotel_name but has no FK to hotels table.

Issues:

  • Data can become inconsistent if hotel name changes
  • No referential integrity
  • Unclear if hotel_code matches hotels.hotel_id

Recommendation:

  • Add hotel_id FK column to reservations pointing to hotels.hotel_id
  • Remove hotel_code and hotel_name columns
  • Derive hotel information via JOIN when needed

5. Weak Foreign Key Consistency

Problem: Mixed use of ON DELETE policies:

  • Some FKs use SET NULL (appropriate for nullable relationships)
  • Some use CASCADE (appropriate for child records)
  • Some use NO ACTION (prevents deletion, may cause issues)
  • conversions table has confusing composite FK setup with hotel_id and guest_id

Examples:

-- Good: Child data should be deleted with parent
hotel_inventory.hotel_id  hotels.hotel_id (ON DELETE CASCADE)

-- Questionable: Should webhook requests survive hotel deletion?
webhook_requests.hotel_id  hotels.hotel_id (ON DELETE NO ACTION)

-- Inconsistent: Why SET NULL vs CASCADE?
reservations.customer_id  customers.id (ON DELETE SET NULL)
reservations.hashed_customer_id  hashed_customers.id (ON DELETE CASCADE)

Recommendation: Review each FK and establish consistent policies:

  • Core data (hotels, customers): SET NULL to preserve historical records
  • Supporting data (hashed_customers, inventory): CASCADE
  • Transactional data (webhooks, conversions): Decide on retention policy

6. Confusing Composite Foreign Key in Conversions

Problem: The conversions table has a composite FK that's incorrectly mapped:

# In db.py lines 650-655
__table_args__ = (
    ForeignKeyConstraint(
        ["hotel_id", "guest_id"],
        ["conversion_guests.hotel_id", "conversion_guests.guest_id"],
        ondelete="SET NULL",
    ),
)

But the database shows:

Foreign Keys:
  hotel_id -> conversion_guests.hotel_id (ON DELETE SET NULL)
  guest_id -> conversion_guests.hotel_id (ON DELETE SET NULL)  # ← WRONG!
  guest_id -> conversion_guests.guest_id (ON DELETE SET NULL)
  hotel_id -> conversion_guests.guest_id (ON DELETE SET NULL)  # ← WRONG!

Impact:

  • Database has 4 FKs instead of 1 composite FK
  • Mapping is incorrect (guest_id → hotel_id doesn't make sense)
  • Could cause constraint violations or allow orphaned records

Recommendation:

  • Fix the composite FK definition in SQLAlchemy
  • Run a migration to drop incorrect FKs and recreate properly

7. Unclear Relationship Between Reservations and Conversions

Problem: The relationship between reservations (from Wix forms) and conversions (from PMS) is complex:

conversions:
  - reservation_id (FK to reservations) - matched by tracking IDs
  - customer_id (FK to customers) - matched by guest details
  - hashed_customer_id (FK to hashed_customers) - matched by hashed guest details
  - guest_id (FK to conversion_guests) - the actual PMS guest

Issues:

  • Three different FK fields to three different customer/guest tables
  • Matching logic is unclear from schema alone
  • directly_attributable and guest_matched flags indicate matching quality, but this should be more explicit

Recommendation:

  • Add a match_confidence enum field: "exact_id", "high_confidence", "medium_confidence", "no_match"
  • Add match_method field to explain how the link was made
  • Consider a separate reservation_conversion_links table to make the many-to-many relationship explicit

8. Room Type Information Scattered

Problem: Room information appears in multiple places:

  • reservations.room_type_code, room_classification_code, room_type
  • conversion_rooms.room_type, room_number
  • hotel_inventory.inv_type_code, inv_code, room_name

Issues:

  • No clear master data for room types
  • Room type codes not standardized across sources
  • No FK between reservations.room_type_code and hotel_inventory.inv_type_code

Recommendation:

  • Create a room_types reference table linked to hotels
  • Add FKs from reservations and conversion_rooms to room_types
  • Standardize room type codes across all sources

Normalization Analysis

1st Normal Form (1NF): PASS

  • All columns contain atomic values
  • Exception: reservations.children_ages stores comma-separated values
    • Should be: separate reservation_children table with age column

2nd Normal Form (2NF): ⚠️ MOSTLY PASS

  • All non-key attributes depend on the full primary key
  • Issue: Some denormalized data exists (hotel names, account IDs in reservations)

3rd Normal Form (3NF): FAIL

Multiple violations:

  • hashed_customers stores derived data (hashes) that depend on customers
  • reservations.meta_account_id depends on hotels via hotel_code
  • reservations.hotel_name depends on hotels via hotel_code

Data Integrity Issues

Missing Foreign Keys

  1. reservations.hotel_code → should FK to hotels.hotel_id
  2. reservations.room_type_code → should FK to hotel_inventory
  3. acked_requests.unique_id → should FK to reservations.unique_id (or be nullable)

Missing Indexes

Consider adding for query performance:

  1. customers.email_address - for lookups during conversion matching
  2. conversions.reservation_date - for time-based queries
  3. conversion_rooms.total_revenue - for revenue analytics
  4. reservations.start_date, end_date - for date range queries

Missing Constraints

  1. Check constraints for date logic:

    • reservations.end_date > start_date
    • conversion_rooms.departure_date > arrival_date
  2. Check constraints for counts:

    • num_adults >= 0, num_children >= 0
  3. NOT NULL constraints on critical fields:

    • customers.contact_id should be NOT NULL (it's the natural key)
    • conversions.hotel_id is NOT NULL ✓ (good)

Recommendations Priority

HIGH PRIORITY (Data Integrity)

  1. Fix composite FK in conversions table (lines 650-655 in db.py)
  2. Add hotel_id FK to reservations table
  3. Add missing NOT NULL constraints on natural keys
  4. Add check constraints for date ranges and counts

MEDIUM PRIORITY (Normalization)

  1. Unify customer/guest systems into a single persons entity
  2. Remove duplicate account ID fields from reservations
  3. Remove hotel_name from reservations (derive via JOIN)
  4. Create reservation_children table for children_ages

LOW PRIORITY (Performance & Cleanup)

  1. Move hashed fields into customers table (remove hashed_customers)
  2. Add indexes for common query patterns
  3. Create room_types reference table
  4. Add match_confidence and match_method to conversions

Positive Aspects

Good use of composite keys (conversion_guests, hotel_inventory) Unique constraints on natural keys (contact_id, webhook_secret) Proper use of indexes on frequently queried fields Cascade deletion for child records (inventory, rooms) Tracking metadata (created_at, updated_at, first_seen, last_seen) Webhook deduplication via payload_hash JSON storage for flexible data (conversion_rooms.daily_sales)


Suggested Refactoring Path

Phase 1: Fix Critical Issues (1-2 days)

  • Fix composite FK in conversions
  • Add hotel_id FK to reservations
  • Add missing constraints

Phase 2: Normalize Customer Data (3-5 days)

  • Create unified persons/guests table
  • Migrate existing data
  • Update matching logic

Phase 3: Clean Up Redundancy (2-3 days)

  • Remove duplicate account IDs
  • Merge hashed_customers into customers
  • Create room_types reference

Phase 4: Enhance Tracking (1-2 days)

  • Add match_confidence fields
  • Improve conversion attribution
  • Add missing indexes

Query Examples Affected by Current Issues

Issue: Duplicate Customer Data

-- Current: Find all reservations for a guest (requires checking both systems)
SELECT r.* FROM reservations r
WHERE r.customer_id = ?
   OR r.hashed_customer_id IN (
      SELECT id FROM hashed_customers WHERE contact_id = ?
   );

-- After fix: Simple unified query
SELECT r.* FROM reservations r
WHERE r.person_id = ?;

Issue: Missing Hotel FK

-- Current: Get hotel info for reservation (unreliable)
SELECT r.*, r.hotel_name
FROM reservations r
WHERE r.id = ?;

-- After fix: Reliable JOIN
SELECT r.*, h.hotel_name, h.meta_account_id
FROM reservations r
JOIN hotels h ON r.hotel_id = h.hotel_id
WHERE r.id = ?;

Issue: Hashed Data in Separate Table

-- Current: Get customer for Meta API (requires JOIN)
SELECT hc.hashed_email, hc.hashed_phone
FROM reservations r
JOIN hashed_customers hc ON r.hashed_customer_id = hc.id
WHERE r.id = ?;

-- After fix: Direct access
SELECT c.hashed_email, c.hashed_phone
FROM reservations r
JOIN customers c ON r.customer_id = c.id
WHERE r.id = ?;

Conclusion

The schema is functional but has significant normalization and consistency issues. The main problems are:

  1. Dual customer tracking systems that should be unified
  2. Redundant storage of derived data (hashes, account IDs)
  3. Missing foreign key relationships (hotels, room types)
  4. Inconsistent deletion policies across foreign keys
  5. Broken composite foreign key in conversions table

The database violates 3NF in several places and could benefit from a refactoring effort. However, the issues are primarily architectural rather than critical bugs, so the system can continue operating while improvements are made incrementally.

Estimated effort to fix all issues: 1-2 weeks of development + testing Risk level: Medium (requires data migration and careful FK updates) Recommended approach: Incremental fixes starting with high-priority items