Most data engineering portfolio projects generate random data, build a pipeline on top of it, and stop there. The output is technically correct but tells no story a business person would recognize.
IceBase was designed differently. Every number in this platform was deliberately authored to reflect real business situations that sports organizations actually face — a hot start, a midseason slump where marketing over-relied on discounts, a late-season comeback, and a jersey retirement ceremony that became the single biggest fan acquisition event of the year.
The goal was to build a platform where every spike and drop in the data has a cause you can explain to a CMO, a VP of Revenue, or a retention manager. The pipeline, the ML models, and the dashboards were all built to surface both the good stories and the bad ones.
The season arc — hot start, midseason slump, late push, Jersey Night — is encoded as a Python configuration object in the seed generator. Every downstream table, model, and dashboard inherits this narrative. That's what makes it a business intelligence project, not just a data engineering exercise.
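A minimal sketch of what that configuration object might look like. The knobs (fill rate, price multiplier, promo rate, acquisition volume) come from the decision table below; the exact keys, game ranges, and values here are illustrative, not the actual seed generator code.

```python
# Illustrative sketch of the season-arc configuration. Key ranges and
# values are hypothetical; the real PHASE_WEIGHTS lives in the seed generator.
PHASE_WEIGHTS = {
    # (first_game, last_game): knobs the generator applies to that range
    (1, 10): {   # hot start
        "fill_rate": 0.92, "price_multiplier": 1.10,
        "promo_rate": 0.05, "new_fans_per_game": 180,
    },
    (11, 24): {  # midseason slump, discount-heavy marketing
        "fill_rate": 0.71, "price_multiplier": 0.85,
        "promo_rate": 0.35, "new_fans_per_game": 60,
    },
    (25, 35): {  # late-season comeback
        "fill_rate": 0.88, "price_multiplier": 1.05,
        "promo_rate": 0.10, "new_fans_per_game": 140,
    },
    (36, 36): {  # Jersey Night: biggest acquisition event of the year
        "fill_rate": 1.00, "price_multiplier": 1.25,
        "promo_rate": 0.02, "new_fans_per_game": 900,
    },
}

def weights_for_game(game_number: int) -> dict:
    """Return the phase knobs that apply to a given game."""
    for (start, end), knobs in PHASE_WEIGHTS.items():
        if start <= game_number <= end:
            return knobs
    raise ValueError(f"game {game_number} is outside the season arc")
```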
Unity Catalog governs every layer. Lakeflow handles Bronze→Silver declaratively. Gold is orchestrated by a multi-task Job DAG. ML models live in the Unity Catalog Model Registry with full feature lineage.
The Bronze→Silver pipeline is a Lakeflow Spark Declarative Pipeline — six notebooks registered as pipeline source files. The language split is intentional: SQL for single-source streaming tables and materialized views, Python for tables that need to union two independent readStreams.
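A sketch of the Python side of that split, assuming the standard dlt module and hypothetical Bronze table names (spark is provided by the pipeline runtime). The union of two independent readStreams is the piece the SQL notebooks don't handle:

```python
import dlt
from pyspark.sql.functions import lit

# Union two independent streaming reads into one Silver table.
# Source table names are hypothetical; the pattern is the point.
@dlt.table(name="silver_ticket_sales")
def silver_ticket_sales():
    box_office = (
        spark.readStream.table("icebase.bronze.box_office_sales")
        .withColumn("channel", lit("box_office"))
    )
    online = (
        spark.readStream.table("icebase.bronze.online_sales")
        .withColumn("channel", lit("online"))
    )
    # allowMissingColumns tolerates schema drift between the two sources
    return box_office.unionByName(online, allowMissingColumns=True)
```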
Auto Loader watches the Unity Catalog Volume landing zones and ingests new JSON files incrementally — only processing files not yet seen. Schema evolution is handled automatically. Checkpoint management is owned by the pipeline runtime.
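A hedged sketch of the Auto Loader read, with a hypothetical Volume path. The cloudFiles options shown are the standard ones for incremental JSON ingestion with schema evolution:

```python
import dlt

# Incremental JSON ingestion from a Unity Catalog Volume landing zone.
# The Volume path is hypothetical; schema tracking and checkpoints are
# owned by the pipeline runtime, so neither is configured here.
@dlt.table(name="bronze_tickets")
def bronze_tickets():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.inferColumnTypes", "true")
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
        .load("/Volumes/icebase/landing/tickets/")
    )
```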
Data Quality Expectations are enforced at Silver. Null IDs and invalid prices are hard-dropped via expect_or_drop. Bad records don't disappear silently — they route to a quarantine_tickets table for investigation.
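A sketch of the expectation-plus-quarantine pattern, with illustrative column names and predicates. The quarantine table reads the same source and inverts the drop conditions, so nothing is lost:

```python
import dlt

# Hard-drop rules on the Silver table; expectation names and predicates
# are illustrative stand-ins for the real ones.
@dlt.table(name="silver_tickets")
@dlt.expect_or_drop("valid_ticket_id", "ticket_id IS NOT NULL")
@dlt.expect_or_drop("valid_price", "price > 0")
def silver_tickets():
    return spark.readStream.table("icebase.bronze.tickets")

# Same source, inverted predicate: dropped records are preserved here
# instead of vanishing, and a SQL Alert can watch this table's count.
@dlt.table(name="quarantine_tickets")
def quarantine_tickets():
    return (
        spark.readStream.table("icebase.bronze.tickets")
        .where("ticket_id IS NULL OR price <= 0")
    )
```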
Both models read from a Unity Catalog Feature Table, tracked in MLflow with full experiment logging, and registered in the Unity Catalog Model Registry. Predictions write back to Gold as enrichment tables.
- icebase.gold.ml_fan_segments
- icebase.gold.ml_churn_scores with High/Medium/Low risk tiers
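A sketch of the logging and registration step, assuming MLflow's Unity Catalog registry and a hypothetical three-level model name. The synthetic training data stands in for the real Feature Table read:

```python
import mlflow
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

mlflow.set_registry_uri("databricks-uc")  # UC registry, not the workspace one

# Stand-in features; the real run reads from the UC Feature Table.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
model = XGBClassifier(max_depth=6, n_estimators=300).fit(X, y)

with mlflow.start_run(run_name="churn_xgb"):
    mlflow.log_metric("auc", 0.94)  # illustrative; the real metric comes from a holdout set
    mlflow.xgboost.log_model(
        model,
        artifact_path="model",
        input_example=X[:5],  # lets MLflow infer the signature UC requires
        registered_model_name="icebase.ml.churn_model",  # hypothetical name
    )
```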
Three AI/BI dashboards built on Databricks SQL — each one answering a specific executive's questions. Branded with the Idaho Mashers identity. All powered live from Gold and ML output tables.
The choices that shaped the architecture — and what was knowingly accepted or given up at each step.
| Decision | Rationale | Tradeoff |
|---|---|---|
| Authored narrative over random data | Random seed data produces technically correct pipelines but tells no business story. The PHASE_WEIGHTS configuration object encodes fill rate, price multiplier, promo rate, and acquisition volume for each game range — making every downstream number explainable to a CMO, not just a data engineer. | More build time upfront — designing the narrative before generating a single row added significant design work. Paid for itself in the quality of the ML validation and dashboard storytelling. |
| 6-notebook pipeline with language split | Lakeflow pipeline source files must be single-language — SQL or Python, not both. The split is intentional: SQL for single-source tables and materialized views (cleaner, more readable), Python where two independent readStreams need to be unioned — which SQL cannot express cleanly. | Correct implementation — attempting mixed-language notebooks was the first approach and failed. The 6-notebook solution reflects how Databricks Lakeflow actually works, not how older DLT documentation implied it worked. |
| Quarantine table over silent drops | Most pipeline tutorials drop bad records and move on. Routing them to a named quarantine table means bad data is preserved for investigation, the pipeline event log shows exactly how many records failed each expectation, and a SQL Alert can fire when the count spikes. | Production pattern — adds one more streaming table to manage, but the observability gain is significant. This is the pattern a senior data engineer would reach for in a real org. |
| Remove recency_days from churn model | Initial XGBoost training produced AUC of 1.0 — the signal that something was wrong. recency_days is mathematically derived from churn_flag (no purchase in 45+ days). Leaving it in the model means the model learns a tautology, not fan behavior. Removing it produced a real 0.94 AUC. | Honest model — 0.94 AUC on clean behavioral features is a genuinely strong result. The leakage catch and fix is a stronger portfolio signal than a perfect 1.0 would have been. |
| Champion alias over versioned model staging | Unity Catalog replaced the old Staging/Production stage model with named aliases. Setting a "champion" alias on the registered model means downstream scoring notebooks reference @champion — not a version number. Promoting a retrained model requires updating the alias pointer, not changing any code (see the sketch after this table). | Alias documentation required — anyone unfamiliar with the UC alias pattern needs context. Accepted in exchange for a promotion workflow that doesn't require code changes in scoring notebooks. |
| Three dashboards over one unified report | The CMO, VP Revenue, and Marketing Strategy leads ask fundamentally different questions. A unified dashboard forces cognitive filtering that reduces adoption. One persona per dashboard is a product decision — each dashboard is structured around exactly one executive's mental model of the data. | More build and maintenance surface — three dashboards to update when the Gold schema evolves. Accepted as the correct product tradeoff for usability and stakeholder clarity. |
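A sketch of the champion-alias promotion described in the table above, using a hypothetical model name and version number:

```python
import mlflow
from mlflow import MlflowClient

mlflow.set_registry_uri("databricks-uc")
client = MlflowClient()

# Promotion is a pointer move: repoint "champion" at the validated version.
# Model name and version number are hypothetical.
client.set_registered_model_alias("icebase.ml.churn_model", "champion", 3)

# Scoring notebooks resolve the alias at load time, so promoting a
# retrained model never requires a code change downstream.
champion = mlflow.pyfunc.load_model("models:/icebase.ml.churn_model@champion")
```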
Every pipeline artifact, ML experiment, and dashboard view from the live Databricks workspace.