397 lines
13 KiB
Markdown
397 lines
13 KiB
Markdown
# Database Schema Analysis
|
|
|
|
## Overview
|
|
This document analyzes the database schema for normalization issues, redundancy, and potential improvements.
|
|
|
|
## Schema Summary
|
|
The database contains 13 tables organized around several core concepts:
|
|
- **Customer/Guest Management**: `customers`, `hashed_customers`, `conversion_guests`
|
|
- **Reservations**: `reservations`, `conversions`, `conversion_rooms`
|
|
- **Hotels**: `hotels`, `hotel_inventory`, `room_availability`
|
|
- **Webhooks**: `webhook_endpoints`, `webhook_requests`
|
|
- **Tracking**: `acked_requests`
|
|
|
|
---
|
|
|
|
## Major Issues Identified
|
|
|
|
### 1. **CRITICAL: Dual Customer Systems (Data Duplication)**
|
|
|
|
**Problem**: The schema maintains two parallel customer tracking systems:
|
|
- `customers` + `hashed_customers` (from Wix forms)
|
|
- `conversion_guests` (from PMS)
|
|
|
|
**Impact**:
|
|
- Same person can exist in both systems with no linkage
|
|
- `conversion_guests.hashed_customer_id` attempts to link but this is backward (many-to-one instead of one-to-one)
|
|
- Data inconsistency when same guest appears in both sources
|
|
|
|
**Details**:
|
|
```
|
|
customers (id=1, email="john@example.com")
|
|
└─ hashed_customers (id=1, customer_id=1, hashed_email="abc123...")
|
|
|
|
conversion_guests (hotel_id="HOTEL1", guest_id=42, guest_email="john@example.com")
|
|
└─ hashed_customer_id = NULL (or points to hashed_customers.id=1 after matching)
|
|
```
|
|
|
|
**Recommendation**:
|
|
- Create a unified `persons` table with a `source` field ("wix", "pms", "merged")
|
|
- Both `customers` and `conversion_guests` should reference this unified entity
|
|
- Implement proper guest matching/merging logic
|
|
|
|
---
|
|
|
|
### 2. **Data Redundancy: Hashed Values Stored Separately**
|
|
|
|
**Problem**: `hashed_customers` and `conversion_guests` store hashed values in separate columns alongside originals.
|
|
|
|
**Current Structure**:
|
|
```
|
|
customers:
|
|
- email_address (plaintext)
|
|
- phone (plaintext)
|
|
|
|
hashed_customers:
|
|
- customer_id (FK to customers)
|
|
- hashed_email
|
|
- hashed_phone
|
|
- hashed_given_name
|
|
...
|
|
```
|
|
|
|
**Issues**:
|
|
- Violates 3NF (derived data stored in separate table)
|
|
- Synchronization required between `customers` and `hashed_customers`
|
|
- If customer data changes, hashed version can become stale
|
|
- Extra JOIN required for every Meta Conversion API call
|
|
|
|
**Better Approach**:
|
|
Option A: Store hashed values directly in `customers` table as additional columns
|
|
Option B: Compute hashes on-the-fly (SHA256 is fast, ~1-2ms per hash)
|
|
|
|
**Recommendation**:
|
|
- **Short term**: Keep current structure but add triggers to auto-update hashed values
|
|
- **Long term**: Move hashed columns into `customers` table directly
|
|
|
|
---
|
|
|
|
### 3. **Advertising Account IDs Duplicated Across Tables**
|
|
|
|
**Problem**: `meta_account_id` and `google_account_id` appear in 3 places:
|
|
- `hotels` table (canonical source)
|
|
- `reservations` table (copied at creation time)
|
|
- Derived from `fbclid`/`gclid` tracking parameters
|
|
|
|
**Current Flow**:
|
|
```
|
|
hotels.meta_account_id = "123456"
|
|
↓
|
|
reservation created with fbclid
|
|
↓
|
|
reservations.meta_account_id = "123456" (copied from hotels)
|
|
```
|
|
|
|
**Issues**:
|
|
- Denormalization without clear benefit
|
|
- If hotel's account ID changes, old reservations have stale data
|
|
- Mixed source of truth (sometimes from hotels, sometimes from tracking params)
|
|
|
|
**Recommendation**:
|
|
- Remove `meta_account_id` and `google_account_id` from `reservations`
|
|
- Always derive from `hotels` table via JOIN
|
|
- If tracking-derived account differs from hotel's account, log a warning
|
|
|
|
---
|
|
|
|
### 4. **Hotel Information Duplicated in Reservations**
|
|
|
|
**Problem**: `reservations` table stores `hotel_code` and `hotel_name` but has no FK to `hotels` table.
|
|
|
|
**Issues**:
|
|
- Data can become inconsistent if hotel name changes
|
|
- No referential integrity
|
|
- Unclear if `hotel_code` matches `hotels.hotel_id`
|
|
|
|
**Recommendation**:
|
|
- Add `hotel_id` FK column to `reservations` pointing to `hotels.hotel_id`
|
|
- Remove `hotel_code` and `hotel_name` columns
|
|
- Derive hotel information via JOIN when needed
|
|
|
|
---
|
|
|
|
### 5. **Weak Foreign Key Consistency**
|
|
|
|
**Problem**: Mixed use of `ON DELETE` policies:
|
|
- Some FKs use `SET NULL` (appropriate for nullable relationships)
|
|
- Some use `CASCADE` (appropriate for child records)
|
|
- Some use `NO ACTION` (prevents deletion, may cause issues)
|
|
- `conversions` table has confusing composite FK setup with `hotel_id` and `guest_id`
|
|
|
|
**Examples**:
|
|
```sql
|
|
-- Good: Child data should be deleted with parent
|
|
hotel_inventory.hotel_id → hotels.hotel_id (ON DELETE CASCADE)
|
|
|
|
-- Questionable: Should webhook requests survive hotel deletion?
|
|
webhook_requests.hotel_id → hotels.hotel_id (ON DELETE NO ACTION)
|
|
|
|
-- Inconsistent: Why SET NULL vs CASCADE?
|
|
reservations.customer_id → customers.id (ON DELETE SET NULL)
|
|
reservations.hashed_customer_id → hashed_customers.id (ON DELETE CASCADE)
|
|
```
|
|
|
|
**Recommendation**:
|
|
Review each FK and establish consistent policies:
|
|
- Core data (hotels, customers): SET NULL to preserve historical records
|
|
- Supporting data (hashed_customers, inventory): CASCADE
|
|
- Transactional data (webhooks, conversions): Decide on retention policy
|
|
|
|
---
|
|
|
|
### 6. **Confusing Composite Foreign Key in Conversions**
|
|
|
|
**Problem**: The `conversions` table has a composite FK that's incorrectly mapped:
|
|
|
|
```python
|
|
# In db.py lines 650-655
|
|
__table_args__ = (
|
|
ForeignKeyConstraint(
|
|
["hotel_id", "guest_id"],
|
|
["conversion_guests.hotel_id", "conversion_guests.guest_id"],
|
|
ondelete="SET NULL",
|
|
),
|
|
)
|
|
```
|
|
|
|
**But the database shows**:
|
|
```
|
|
Foreign Keys:
|
|
hotel_id -> conversion_guests.hotel_id (ON DELETE SET NULL)
|
|
guest_id -> conversion_guests.hotel_id (ON DELETE SET NULL) # ← WRONG!
|
|
guest_id -> conversion_guests.guest_id (ON DELETE SET NULL)
|
|
hotel_id -> conversion_guests.guest_id (ON DELETE SET NULL) # ← WRONG!
|
|
```
|
|
|
|
**Impact**:
|
|
- Database has 4 FKs instead of 1 composite FK
|
|
- Mapping is incorrect (guest_id → hotel_id doesn't make sense)
|
|
- Could cause constraint violations or allow orphaned records
|
|
|
|
**Recommendation**:
|
|
- Fix the composite FK definition in SQLAlchemy
|
|
- Run a migration to drop incorrect FKs and recreate properly
|
|
|
|
---
|
|
|
|
### 7. **Unclear Relationship Between Reservations and Conversions**
|
|
|
|
**Problem**: The relationship between `reservations` (from Wix forms) and `conversions` (from PMS) is complex:
|
|
|
|
```
|
|
conversions:
|
|
- reservation_id (FK to reservations) - matched by tracking IDs
|
|
- customer_id (FK to customers) - matched by guest details
|
|
- hashed_customer_id (FK to hashed_customers) - matched by hashed guest details
|
|
- guest_id (FK to conversion_guests) - the actual PMS guest
|
|
```
|
|
|
|
**Issues**:
|
|
- Three different FK fields to three different customer/guest tables
|
|
- Matching logic is unclear from schema alone
|
|
- `directly_attributable` and `guest_matched` flags indicate matching quality, but this should be more explicit
|
|
|
|
**Recommendation**:
|
|
- Add a `match_confidence` enum field: "exact_id", "high_confidence", "medium_confidence", "no_match"
|
|
- Add `match_method` field to explain how the link was made
|
|
- Consider a separate `reservation_conversion_links` table to make the many-to-many relationship explicit
|
|
|
|
---
|
|
|
|
### 8. **Room Type Information Scattered**
|
|
|
|
**Problem**: Room information appears in multiple places:
|
|
- `reservations.room_type_code`, `room_classification_code`, `room_type`
|
|
- `conversion_rooms.room_type`, `room_number`
|
|
- `hotel_inventory.inv_type_code`, `inv_code`, `room_name`
|
|
|
|
**Issues**:
|
|
- No clear master data for room types
|
|
- Room type codes not standardized across sources
|
|
- No FK between `reservations.room_type_code` and `hotel_inventory.inv_type_code`
|
|
|
|
**Recommendation**:
|
|
- Create a `room_types` reference table linked to hotels
|
|
- Add FKs from reservations and conversion_rooms to room_types
|
|
- Standardize room type codes across all sources
|
|
|
|
---
|
|
|
|
## Normalization Analysis
|
|
|
|
### 1st Normal Form (1NF): ✅ PASS
|
|
- All columns contain atomic values
|
|
- **Exception**: `reservations.children_ages` stores comma-separated values
|
|
- Should be: separate `reservation_children` table with age column
|
|
|
|
### 2nd Normal Form (2NF): ⚠️ MOSTLY PASS
|
|
- All non-key attributes depend on the full primary key
|
|
- **Issue**: Some denormalized data exists (hotel names, account IDs in reservations)
|
|
|
|
### 3rd Normal Form (3NF): ❌ FAIL
|
|
Multiple violations:
|
|
- `hashed_customers` stores derived data (hashes) that depend on `customers`
|
|
- `reservations.meta_account_id` depends on `hotels` via hotel_code
|
|
- `reservations.hotel_name` depends on `hotels` via hotel_code
|
|
|
|
---
|
|
|
|
## Data Integrity Issues
|
|
|
|
### Missing Foreign Keys
|
|
1. **reservations.hotel_code** → should FK to hotels.hotel_id
|
|
2. **reservations.room_type_code** → should FK to hotel_inventory
|
|
3. **acked_requests.unique_id** → should FK to reservations.unique_id (or be nullable)
|
|
|
|
### Missing Indexes
|
|
Consider adding for query performance:
|
|
1. `customers.email_address` - for lookups during conversion matching
|
|
2. `conversions.reservation_date` - for time-based queries
|
|
3. `conversion_rooms.total_revenue` - for revenue analytics
|
|
4. `reservations.start_date`, `end_date` - for date range queries
|
|
|
|
### Missing Constraints
|
|
1. **Check constraints** for date logic:
|
|
- `reservations.end_date > start_date`
|
|
- `conversion_rooms.departure_date > arrival_date`
|
|
|
|
2. **Check constraints** for counts:
|
|
- `num_adults >= 0`, `num_children >= 0`
|
|
|
|
3. **NOT NULL constraints** on critical fields:
|
|
- `customers.contact_id` should be NOT NULL (it's the natural key)
|
|
- `conversions.hotel_id` is NOT NULL ✓ (good)
|
|
|
|
---
|
|
|
|
## Recommendations Priority
|
|
|
|
### HIGH PRIORITY (Data Integrity)
|
|
1. Fix composite FK in `conversions` table (lines 650-655 in db.py)
|
|
2. Add `hotel_id` FK to `reservations` table
|
|
3. Add missing NOT NULL constraints on natural keys
|
|
4. Add check constraints for date ranges and counts
|
|
|
|
### MEDIUM PRIORITY (Normalization)
|
|
5. Unify customer/guest systems into a single `persons` entity
|
|
6. Remove duplicate account ID fields from `reservations`
|
|
7. Remove `hotel_name` from `reservations` (derive via JOIN)
|
|
8. Create `reservation_children` table for children_ages
|
|
|
|
### LOW PRIORITY (Performance & Cleanup)
|
|
9. Move hashed fields into `customers` table (remove `hashed_customers`)
|
|
10. Add indexes for common query patterns
|
|
11. Create `room_types` reference table
|
|
12. Add `match_confidence` and `match_method` to `conversions`
|
|
|
|
---
|
|
|
|
## Positive Aspects
|
|
|
|
✅ Good use of composite keys (`conversion_guests`, `hotel_inventory`)
|
|
✅ Unique constraints on natural keys (`contact_id`, `webhook_secret`)
|
|
✅ Proper use of indexes on frequently queried fields
|
|
✅ Cascade deletion for child records (inventory, rooms)
|
|
✅ Tracking metadata (created_at, updated_at, first_seen, last_seen)
|
|
✅ Webhook deduplication via `payload_hash`
|
|
✅ JSON storage for flexible data (`conversion_rooms.daily_sales`)
|
|
|
|
---
|
|
|
|
## Suggested Refactoring Path
|
|
|
|
### Phase 1: Fix Critical Issues (1-2 days)
|
|
- Fix composite FK in conversions
|
|
- Add hotel_id FK to reservations
|
|
- Add missing constraints
|
|
|
|
### Phase 2: Normalize Customer Data (3-5 days)
|
|
- Create unified persons/guests table
|
|
- Migrate existing data
|
|
- Update matching logic
|
|
|
|
### Phase 3: Clean Up Redundancy (2-3 days)
|
|
- Remove duplicate account IDs
|
|
- Merge hashed_customers into customers
|
|
- Create room_types reference
|
|
|
|
### Phase 4: Enhance Tracking (1-2 days)
|
|
- Add match_confidence fields
|
|
- Improve conversion attribution
|
|
- Add missing indexes
|
|
|
|
---
|
|
|
|
## Query Examples Affected by Current Issues
|
|
|
|
### Issue: Duplicate Customer Data
|
|
```sql
|
|
-- Current: Find all reservations for a guest (requires checking both systems)
|
|
SELECT r.* FROM reservations r
|
|
WHERE r.customer_id = ?
|
|
OR r.hashed_customer_id IN (
|
|
SELECT id FROM hashed_customers WHERE contact_id = ?
|
|
);
|
|
|
|
-- After fix: Simple unified query
|
|
SELECT r.* FROM reservations r
|
|
WHERE r.person_id = ?;
|
|
```
|
|
|
|
### Issue: Missing Hotel FK
|
|
```sql
|
|
-- Current: Get hotel info for reservation (unreliable)
|
|
SELECT r.*, r.hotel_name
|
|
FROM reservations r
|
|
WHERE r.id = ?;
|
|
|
|
-- After fix: Reliable JOIN
|
|
SELECT r.*, h.hotel_name, h.meta_account_id
|
|
FROM reservations r
|
|
JOIN hotels h ON r.hotel_id = h.hotel_id
|
|
WHERE r.id = ?;
|
|
```
|
|
|
|
### Issue: Hashed Data in Separate Table
|
|
```sql
|
|
-- Current: Get customer for Meta API (requires JOIN)
|
|
SELECT hc.hashed_email, hc.hashed_phone
|
|
FROM reservations r
|
|
JOIN hashed_customers hc ON r.hashed_customer_id = hc.id
|
|
WHERE r.id = ?;
|
|
|
|
-- After fix: Direct access
|
|
SELECT c.hashed_email, c.hashed_phone
|
|
FROM reservations r
|
|
JOIN customers c ON r.customer_id = c.id
|
|
WHERE r.id = ?;
|
|
```
|
|
|
|
---
|
|
|
|
## Conclusion
|
|
|
|
The schema is **functional but has significant normalization and consistency issues**. The main problems are:
|
|
|
|
1. **Dual customer tracking systems** that should be unified
|
|
2. **Redundant storage of derived data** (hashes, account IDs)
|
|
3. **Missing foreign key relationships** (hotels, room types)
|
|
4. **Inconsistent deletion policies** across foreign keys
|
|
5. **Broken composite foreign key** in conversions table
|
|
|
|
The database violates 3NF in several places and could benefit from a refactoring effort. However, the issues are primarily architectural rather than critical bugs, so the system can continue operating while improvements are made incrementally.
|
|
|
|
**Estimated effort to fix all issues**: 1-2 weeks of development + testing
|
|
**Risk level**: Medium (requires data migration and careful FK updates)
|
|
**Recommended approach**: Incremental fixes starting with high-priority items
|