Data Engineering · ML · BI · 🏗 Architect + Builder · Independent Project · ⚠ Simulated Data
IceBase —
Hockey Analytics
Platform
A full-stack data engineering, machine learning, and business intelligence platform built on Databricks for the Idaho Mashers — a mock professional hockey franchise. From raw seed data through a Lakeflow declarative pipeline, a Unity Catalog medallion architecture, K-Means fan segmentation, and XGBoost churn prediction with MLflow tracking, to three branded AI/BI dashboards.
Databricks · Delta Lake · Unity Catalog · Lakeflow Pipelines · K-Means Clustering · XGBoost · MLflow · AI/BI Dashboards · Medallion Architecture · Auto Loader · PySpark · SQL · Python
Idaho Mashers Logo
Idaho Mashers
View on GitHub →
Project Phases

Six phases.
Five delivered.

1
Foundation
Seed data generator · 5k fans · 82 games · Narrative encoding
2
Pipeline
Lakeflow declarative · Bronze→Silver · Auto Loader · DQ Expectations
3
Gold Layer
customer_360 · game_revenue · retention_cohort · Job DAG
4
ML Models
K-Means segments · XGBoost churn · MLflow · UC Registry
5
Dashboards
3 AI/BI dashboards · Fan Health · Revenue Ops · Segments
6
Alerts
SQL Alerts · Monitoring capstone To-Do
The Design Philosophy

The data existed.
The story didn't.

Most data engineering portfolio projects generate random data, build a pipeline on top of it, and stop there. The output is technically correct but tells no story a business person would recognize.

IceBase was designed differently. Every number in this platform was deliberately authored to reflect real business situations that sports organizations actually face — a hot start, a midseason slump where marketing over-relied on discounts, a late-season comeback, and a jersey retirement ceremony that became the single biggest fan acquisition event of the year.

The goal was to build a platform where every spike and drop in the data has a cause you can explain to a CMO, a VP of Revenue, or a retention manager. The pipeline, the ML models, and the dashboards were all built to surface both the good stories and the bad ones.

The season arc — hot start, midseason slump, late push, Jersey Night — is encoded as a Python configuration object in the seed generator. Every downstream table, model, and dashboard inherits this narrative. That's what makes it a business intelligence project, not just a data engineering exercise.
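The exact contents of that object live in the seed generator; a minimal sketch of the idea, with hypothetical phase names, field names, and values, looks roughly like this:

```python
# Minimal sketch of a season-narrative config. The phase names, field names, and
# numbers here are illustrative assumptions, not the actual PHASE_WEIGHTS object.
PHASE_WEIGHTS = {
    "hot_start":    {"games": range(1, 21),  "fill_rate": 0.92, "price_mult": 1.10, "promo_rate": 0.05, "new_fans_per_game": 40},
    "slump":        {"games": range(21, 55), "fill_rate": 0.70, "price_mult": 0.95, "promo_rate": 0.30, "new_fans_per_game": 15},
    "late_push":    {"games": range(55, 83), "fill_rate": 0.85, "price_mult": 1.05, "promo_rate": 0.12, "new_fans_per_game": 30},
    "jersey_night": {"games": [77],          "fill_rate": 1.00, "price_mult": 1.20, "promo_rate": 0.08, "new_fans_per_game": 412},
}

def weights_for_game(game_number: int) -> dict:
    """Return the narrative weights governing ticket generation for one game."""
    # Special events (Jersey Night, game 77) override the surrounding season phase.
    for phase in ("jersey_night", "hot_start", "slump", "late_push"):
        if game_number in PHASE_WEIGHTS[phase]["games"]:
            return PHASE_WEIGHTS[phase]
    raise ValueError(f"game {game_number} is not covered by any season phase")
```

Every ticket, promo, and new-fan record drawn for a game is sampled under that game's weights, which is how the slump's promo over-reliance and the Jersey Night acquisition spike stay visible all the way down to the dashboards.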

Design Principle 01 · Authored Narrative
Phase weights control fill rate, price multiplier, promo rate, and fan acquisition volume for each game range. The data isn't random — it's intentional.
Design Principle 02 · Good & Bad Stories
Four good stories and four bad stories baked into the seed data. Promo over-reliance, walk-up acquisition quality, Jersey Night reactivation — all visible in the dashboards.
Design Principle 03 · Business Personas
Every dashboard answers exactly one executive's questions — CMO, VP Revenue, Marketing Strategy. Data engineering is only valuable when it drives decisions.
Design Principle 04 · Production Patterns
Quarantine routing, MLflow lineage tracking, champion model aliases, Lakeflow declarative pipelines — not tutorial patterns. Production patterns.
✓ Good Stories in the Data
  • Season Core fans showed near-zero churn through the slump — proving loyalty program ROI
  • Jersey Night reactivated 94 deeply lapsed fans and drove a 31% 30-day return rate
  • XGBoost model flagged 600+ at-risk fans before full churn — with a real intervention window
  • 412 net-new fans acquired on Jersey Night — highest single-event acquisition of the season
↯ Bad Stories in the Data
  • Promo over-reliance during the slump grew the Promo Hunter segment from 8% to 22% of the fan base
  • 67% of walk-up buyers from the hot start never attended another game — vanity acquisition metrics
  • Marketing had no follow-up playbook for the 412 Jersey Night new acquisitions — 61% received zero outreach
  • Data leakage caught in XGBoost training — AUC of 1.0 diagnosed and fixed to a real 0.94
Architecture

Medallion architecture
end to end.

Unity Catalog governs every layer. Lakeflow handles Bronze→Silver declaratively. Gold is orchestrated by a multi-task Job DAG. ML models live in the Unity Catalog Model Registry with full feature lineage.

IceBase Pipeline Architecture (diagram)
Databricks Free Edition · Unity Catalog · Lakeflow · MLflow · AI/BI Dashboards · icebase.(bronze|silver|gold)
Simulation: Python Faker seed generator + pipeline kickoff simulator → Unity Catalog Volume landing zone
Bronze: raw_tickets_stream · raw_customers_stream · raw_events · raw_promotions
Silver: dim_customer · fact_tickets · dim_game · bridge_promo · quarantine_tickets
Gold: customer_360 · fan_features · game_revenue · retention_cohort · ml_fan_segments · ml_churn_scores
ML Layer: K-Means segmentation · XGBoost churn model · MLflow · UC Registry
Serving: Fan Health Dashboard · Revenue Ops Dashboard · Segment Explorer
Everything runs inside the IceBase Databricks workspace under Unity Catalog governance; Bronze→Silver runs as the Lakeflow pipeline DAG.
Phase 2 — Lakeflow Pipeline

Declarative Bronze
to Silver.

The Bronze→Silver pipeline is a Lakeflow Spark Declarative Pipeline — six notebooks registered as pipeline source files. The language split is intentional: SQL for single-source streaming tables and materialized views, Python for tables that need to union two independent readStreams.

Auto Loader watches the Unity Catalog Volume landing zones and ingests new JSON files incrementally — only processing files not yet seen. Schema evolution is handled automatically. Checkpoint management is owned by the pipeline runtime.
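A minimal sketch of what one of the Python pipeline source files might look like, using the dlt decorator API with Auto Loader; the Volume path, folder names, and the choice of which streams get unioned are assumptions rather than the project's actual code:

```python
import dlt
from pyspark.sql.functions import current_timestamp

# `spark` is provided by the pipeline runtime in Lakeflow source files.
LANDING = "/Volumes/icebase/bronze/landing"  # hypothetical Volume landing-zone path

def read_landing(subdir: str):
    """Auto Loader: incrementally ingest only the JSON files not yet processed."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(f"{LANDING}/{subdir}")
        .withColumn("_ingested_at", current_timestamp())
    )

@dlt.table(name="raw_tickets_stream", comment="Bronze: raw ticket JSON from the landing zone")
def raw_tickets_stream():
    # Unioning two independent readStreams is the reason this source file is Python;
    # a SQL streaming table reads from a single source. Folder names are illustrative.
    return read_landing("tickets_online").unionByName(read_landing("tickets_walkup"))
```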

Data Quality Expectations are enforced at Silver. Null IDs and invalid prices are hard-dropped via expect_or_drop. Bad records don't disappear silently — they route to a quarantine_tickets table for investigation.
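The expectation-plus-quarantine pattern, sketched with illustrative rule and column names (the real expectations live in the Silver notebooks):

```python
import dlt

# Illustrative quality rule: null IDs and invalid prices are hard-dropped at Silver.
VALID_TICKET = "ticket_id IS NOT NULL AND price >= 0"

@dlt.table(name="fact_tickets")
@dlt.expect_or_drop("valid_ticket", VALID_TICKET)
def fact_tickets():
    # Failing rows are dropped here and counted per expectation in the pipeline event log.
    return dlt.read_stream("raw_tickets_stream")

@dlt.table(name="quarantine_tickets", comment="Rows that failed Silver expectations, kept for investigation")
def quarantine_tickets():
    # Inverse of the rule above, written null-safely so bad rows land here instead of vanishing.
    return dlt.read_stream("raw_tickets_stream").filter(
        "ticket_id IS NULL OR price IS NULL OR price < 0"
    )
```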

6 Pipeline Notebooks
4 SQL + 2 Python. SQL for streaming tables and materialized views. Python where two independent readStreams need to be unioned — which SQL can't express cleanly.
Streaming Tables + Materialized Views
Streaming Tables for append-only sources (Auto Loader ingest, Silver cleaning). Materialized Views for static/batch sources like the game schedule and promo history.
Quarantine Routing
Most pipelines silently drop bad records. This one routes them to a named quarantine table — preserved for investigation. A production pattern, not a tutorial shortcut.
dim_game Silver Table Sample Data
Phase 3 — Gold Layer

Three Gold tables.
One automated Job.

c360
customer_360
One row per fan. Total spend, promo sensitivity score, churn flag, revenue net, Jersey Night attendee flag. The CMO's fan dossier.
rev
game_revenue
One row per game. Gross vs net revenue, fill rate, 3-game rolling average, revenue index vs season average. 82 rows exactly. (Rolling-average logic sketched below.)
ret
retention_cohort
30-day and 60-day return windows, churn flag, Jersey Night cohort flag. The ML training ground for Phase 4.
dag
Job DAG
4-task Lakeflow Job. Silver pipeline → customer_360 + game_revenue in parallel → retention_cohort. Runs every 30 minutes.
sim
Simulator
Kickoff notebook drops new ticket and customer JSON files to Volumes, triggering the full automated pipeline end to end.
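To give a flavor of the Gold-layer logic, here is a hedged PySpark sketch of the game_revenue rolling average and revenue index; column names such as capacity and promo_discount are assumptions, not the notebook's actual schema:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
tickets = spark.table("icebase.silver.fact_tickets")
games = spark.table("icebase.silver.dim_game")

# One row per game: gross vs net revenue and fill rate.
per_game = (
    tickets.groupBy("game_id")
    .agg(
        F.sum("price").alias("gross_revenue"),
        F.sum(F.col("price") - F.col("promo_discount")).alias("net_revenue"),
        F.count("*").alias("tickets_sold"),
    )
    .join(games.select("game_id", "game_number", "capacity"), "game_id")
    .withColumn("fill_rate", F.col("tickets_sold") / F.col("capacity"))
)

# 3-game rolling average of net revenue, plus an index against the season average.
rolling = Window.orderBy("game_number").rowsBetween(-2, 0)
season = Window.partitionBy()  # whole-season frame
game_revenue = (
    per_game
    .withColumn("net_revenue_3g_avg", F.avg("net_revenue").over(rolling))
    .withColumn("revenue_index", F.col("net_revenue") / F.avg("net_revenue").over(season))
)

game_revenue.write.mode("overwrite").saveAsTable("icebase.gold.game_revenue")
```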
Lakeflow Job Orchestrator DAG
Job Run History
Phase 4 — Machine Learning

Two models.
Full lineage.

Both models read from a Unity Catalog Feature Table, are tracked in MLflow with full experiment logging, and are registered in the Unity Catalog Model Registry. Predictions write back to Gold as enrichment tables.

Fan Segmentation
Unsupervised
5
Clusters
K-Means
Algorithm
  • Features: RFM metrics, promo sensitivity, seat tier rank, tenure, purchase timing
  • Elbow analysis run for k=2–8, logged to MLflow — k=5 selected at inflection point (sketch below)
  • Cluster labels assigned manually by reading centroid profiles — practitioner judgment step
  • Segments: Season Core · Casual Fan · Promo Hunter · Lapsed · Deeply Lapsed
  • Writes icebase.gold.ml_fan_segments back to Unity Catalog
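A minimal sketch of the elbow loop and segment write-back, using scikit-learn K-Means with MLflow logging; the feature column names and experiment path are assumptions:

```python
import mlflow
from pyspark.sql import SparkSession
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

spark = SparkSession.builder.getOrCreate()

# Hypothetical feature columns read from the Unity Catalog feature table.
FEATURES = ["recency_days", "frequency", "monetary", "promo_sensitivity", "seat_tier_rank", "tenure_days"]
pdf = spark.table("icebase.gold.fan_features").select("fan_id", *FEATURES).toPandas()
X = StandardScaler().fit_transform(pdf[FEATURES])

mlflow.set_experiment("/icebase/fan_segmentation")  # illustrative experiment path
with mlflow.start_run(run_name="kmeans_elbow"):
    for k in range(2, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        # Inertia (within-cluster SSE) logged per k; the elbow at k=5 selects the final model.
        mlflow.log_metric("inertia", km.inertia_, step=k)

final = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
pdf["segment_id"] = final.labels_
# Human-readable labels (Season Core, Promo Hunter, ...) are assigned by reading centroid profiles.
spark.createDataFrame(pdf[["fan_id", "segment_id"]]).write.mode("overwrite").saveAsTable(
    "icebase.gold.ml_fan_segments"
)
```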
Churn Prediction
Supervised
0.94
AUC
0.85
Accuracy
  • Algorithm: XGBoost binary classifier — churn defined as no purchase in 45+ days
  • Training data: Built via FeatureEngineeringClient for full feature lineage tracking (sketch below)
  • Data leakage caught and fixed — AUC of 1.0 diagnosed (recency_days leakage), resolved to 0.94
  • Champion alias set in UC Model Registry — promoting new versions requires no code changes
  • Writes ml_churn_scores with High/Medium/Low risk tiers
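A hedged sketch of how the training set, leakage fix, and registry wiring might fit together with the FeatureEngineeringClient; lookup keys, hyperparameters, and the three-level registry name are assumptions:

```python
import mlflow
import xgboost as xgb
from pyspark.sql import SparkSession
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

spark = SparkSession.builder.getOrCreate()
fe = FeatureEngineeringClient()

# Labels come from the retention cohort; features are looked up by fan_id so the
# registered model keeps lineage back to icebase.gold.fan_features.
labels = spark.table("icebase.gold.retention_cohort").select("fan_id", "churn_flag")
training_set = fe.create_training_set(
    df=labels,
    feature_lookups=[FeatureLookup(table_name="icebase.gold.fan_features", lookup_key="fan_id")],
    label="churn_flag",
    exclude_columns=["recency_days"],  # the leakage fix: derived from the same value as churn_flag
)
pdf = training_set.load_df().toPandas()

X = pdf.drop(columns=["fan_id", "churn_flag"])
y = pdf["churn_flag"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

mlflow.set_experiment("/icebase/churn_prediction")  # illustrative experiment path
with mlflow.start_run(run_name="xgboost_churn"):
    model = xgb.XGBClassifier(n_estimators=300, max_depth=5, eval_metric="auc")
    model.fit(X_train, y_train)
    mlflow.log_metric("auc", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    # Logging through the feature client records feature lineage alongside the model
    # and registers it in the Unity Catalog Model Registry.
    fe.log_model(
        model=model,
        artifact_path="model",
        flavor=mlflow.xgboost,
        training_set=training_set,
        registered_model_name="icebase.gold.xgboost_churn_predictor",
    )
```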
⚡ The Leakage Story — Worth Calling Out
During initial training, the XGBoost model produced an AUC of 1.0 — a near-perfect score that is almost always a red flag, not a success. The cause: recency_days and churn_flag were mathematically derived from the same underlying value — days since last purchase. The model wasn't learning fan behavior, it was learning a tautology. Removing recency_days brought the model to a realistic 0.94 AUC. Catching this and knowing why it happened is the difference between someone who ran a tutorial and someone who understands the craft.
MLflow Experiments
Unity Catalog Model Registry
Fan Features Table
Phase 5 — AI/BI Dashboards

Three dashboards.
One team, one brand.

Three AI/BI dashboards built on Databricks SQL — each one answering a specific executive's questions. Branded with the Idaho Mashers identity. All powered live from Gold and ML output tables.
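Each tile is backed by a query over Gold and ML output tables. A hedged example of the kind of query behind a churn-risk distribution tile; column names are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative dataset query for a churn-risk distribution tile. The dashboards
# themselves are authored in Databricks SQL; this is the same query issued from Python.
risk_distribution = spark.sql("""
    SELECT s.risk_tier,
           COUNT(*)                     AS fans,
           ROUND(AVG(c.total_spend), 2) AS avg_spend
    FROM icebase.gold.ml_churn_scores s
    JOIN icebase.gold.customer_360    c USING (fan_id)
    GROUP BY s.risk_tier
    ORDER BY fans DESC
""")
risk_distribution.show()
```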

Fan Health Dashboard
Audience: CMO · Retention Manager
Fan Health Dashboard
KPI scorecard · Churn risk distribution · High-risk fan watchlist with recommended actions · Jersey Night cohort 30-day return comparison · Acquisition channel quality analysis
Revenue Operations
Audience: VP Revenue · Finance
Revenue Operations Dashboard
Season revenue arc · Jersey Night spike · Promo cost by phase · Fill rate trend
Segment Explorer
Audience: Marketing Strategy · Campaign Managers
Segment Explorer Dashboard
Segment profiles · Channel quality · Campaign action guide table
Dashboard Overview
Results

The stat sheet

5 phases
End-to-End Delivery
Seed data through ML-powered dashboards. Phase 6 SQL Alerts designed and deferred to next cycle.
0.94 AUC
Churn Model Performance
XGBoost churn predictor after catching and correcting data leakage from recency_days feature.
5 clusters
Fan Segments
K-Means on RFM + behavioral features. Labels assigned manually by reading centroid profiles — practitioner judgment step.
3 dashboards
AI/BI Serving Layer
Fan Health · Revenue Ops · Segment Explorer. Each one answers exactly one executive's questions.
82 games
Full Season Modeled
5,000 fans · 47k+ ticket transactions · 4 season phases · Jersey Night event modeled as game 77.
6 notebooks
Lakeflow Pipeline
4 SQL + 2 Python. Single-language notebooks required by Lakeflow — language split is intentional and documented.
4 tasks
Job DAG
Silver pipeline → customer_360 + game_revenue in parallel → retention_cohort. Runs every 30 minutes on schedule.
Free tier
Databricks Free Edition
Full Unity Catalog, Lakeflow, MLflow, and AI/BI Dashboards on Databricks Free Edition with serverless compute.
Engineering Decisions

Key decisions &
tradeoffs

The choices that shaped the architecture — and what was knowingly accepted or given up at each step.

Decision · Rationale · Tradeoff

Authored narrative over random data
Rationale: Random seed data produces technically correct pipelines but tells no business story. The PHASE_WEIGHTS configuration object encodes fill rate, price multiplier, promo rate, and acquisition volume for each game range — making every downstream number explainable to a CMO, not just a data engineer.
Tradeoff: More build time upfront — designing the narrative before generating a single row added significant design work. Paid for itself in the quality of the ML validation and dashboard storytelling.

6-notebook pipeline with language split
Rationale: Lakeflow pipeline source files must be single-language — SQL or Python, not both. The split is intentional: SQL for single-source tables and materialized views (cleaner, more readable), Python where two independent readStreams need to be unioned — which SQL cannot express cleanly.
Tradeoff: Correct implementation — attempting mixed-language notebooks was the first approach and failed. The 6-notebook solution reflects how Databricks Lakeflow actually works, not how older DLT documentation implied it worked.

Quarantine table over silent drops
Rationale: Most pipeline tutorials drop bad records and move on. Routing them to a named quarantine table means bad data is preserved for investigation, the pipeline event log shows exactly how many records failed each expectation, and a SQL Alert can fire when the count spikes.
Tradeoff: Production pattern — adds one more streaming table to manage, but the observability gain is significant. This is the pattern a senior data engineer would reach for in a real org.

Remove recency_days from churn model
Rationale: Initial XGBoost training produced AUC of 1.0 — the signal that something was wrong. recency_days is mathematically derived from churn_flag (no purchase in 45+ days). Leaving it in the model means the model learns a tautology, not fan behavior. Removing it produced a real 0.94 AUC.
Tradeoff: Honest model — 0.94 AUC on clean behavioral features is a genuinely strong result. The leakage catch and fix is a stronger portfolio signal than a perfect 1.0 would have been.

Champion alias over versioned model staging
Rationale: Unity Catalog replaced the old Staging/Production stage model with named aliases. Setting a "champion" alias on the registered model means downstream scoring notebooks reference @champion — not a version number. Promoting a retrained model requires updating the alias pointer, not changing any code.
Tradeoff: Alias documentation required — anyone unfamiliar with the UC alias pattern needs context. Accepted in exchange for a promotion workflow that doesn't require code changes in scoring notebooks.

Three dashboards over one unified report
Rationale: The CMO, VP Revenue, and Marketing Strategy leads ask fundamentally different questions. A unified dashboard forces cognitive filtering that reduces adoption. One persona per dashboard is a product decision — each dashboard is structured around exactly one executive's mental model of the data.
Tradeoff: More build and maintenance surface — three dashboards to update when the Gold schema evolves. Accepted as the correct product tradeoff for usability and stakeholder clarity.
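The champion-alias promotion described above comes down to a small MLflow client call; the registry name and version number here are illustrative:

```python
import mlflow
from mlflow import MlflowClient

mlflow.set_registry_uri("databricks-uc")  # aliases live in the Unity Catalog registry
client = MlflowClient()

# Point the champion alias at a newly validated version; no scoring-notebook changes needed.
client.set_registered_model_alias(
    name="icebase.gold.xgboost_churn_predictor", alias="champion", version=3
)

# Scoring notebooks always load whatever the alias currently points at.
model = mlflow.pyfunc.load_model("models:/icebase.gold.xgboost_churn_predictor@champion")
```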
Gallery

Screenshot gallery

Every pipeline artifact, ML experiment, and dashboard view from the live Databricks workspace.

Databricks Workspace
Databricks Workspace
Full project workspace · Phase folder structure · GitHub Repos connected
Unity Catalog
Unity Catalog
icebase.(bronze|silver|gold) · ML models registered alongside tables
Lakeflow Pipeline
Lakeflow Pipeline DAG
Bronze→Silver declarative pipeline · 6 source files · Dependency graph resolved by runtime
dim_game Sample Data
Silver — dim_game
82 games · Derived columns: result_numeric, is_playoff_relevant, wins_to_date
Fan Features Table
Gold — fan_features
Unity Catalog Feature Table · RFM + behavioral features · ML training ground
Job Orchestrator
Lakeflow Job DAG
4-task orchestrator · Parallel branches · Silver → Gold → ML · 30-min schedule
Job Run History
Job Run History
All tasks succeeded · Pipeline automation confirmed end to end
MLflow Experiments
MLflow Experiments
Elbow analysis + XGBoost churn run · Parameters, AUC, confusion matrix, feature importance
UC Model Registry
UC Model Registry
xgboost_churn_predictor · @champion alias · Feature lineage tab shows fan_features source
Fan Health Dashboard
Fan Health Dashboard
Churn risk distribution · High-risk watchlist · Jersey Night cohort comparison
Revenue Operations Dashboard
Revenue Operations
Season revenue arc · Jersey Night spike · Promo cost by phase · Fill rate trend
Segment Explorer Dashboard
Segment Explorer
Segment profiles · Channel quality cross-tab · Campaign action guide