Feature Stores: A Deep Dive
How feature stores work: storage, synchronization, and point-in-time correctness.
Feature stores maintain a single source of truth for ML features across training and serving, abstracting away the complex data management behind a simple API. Under the hood, they handle storage, synchronization, and point-in-time correctness automatically.
What Is a Feature Store?
Many machine learning models fail in production not because of bad algorithms, but because of bad data management [1].
Features used during model training sometimes differ from those used in production. This mismatch, known as training/serving skew, degrades model performance because the model sees different feature values at inference time than it saw during training [2][3].
Training datasets can also accidentally include feature values that were not available when the label occurred. This leaks future information into the training data, making offline evaluation metrics appear better than they are [4].
As teams build more models, another issue emerges: the same feature is often recomputed across multiple pipelines. These duplicated implementations drift over time, leading to inconsistent feature definitions, redundant work, and difficult debugging [5].
These problems all stem from the same issue: feature data is difficult to manage consistently across the machine learning lifecycle.
Feature stores were created to address this challenge. Uber introduced this concept in 2017 as part of its Michelangelo platform [6]. The core idea is simple: maintain a single source of truth for features so they can be defined once and reused consistently for both model training and production inference [7].
For engineers, the feature store interface is straightforward.
Call get_historical_features() to build training datasets and get_online_features() to retrieve features for real-time predictions.
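As a rough illustration of this two-method interface, here is a minimal in-memory sketch. The class and method names mirror common feature store SDKs such as Feast, but this is a toy model, not any real API:

```python
from datetime import datetime

class ToyFeatureStore:
    """Toy feature store illustrating the two retrieval paths.

    Real systems back these methods with an offline store
    (columnar files) and an online store (key-value database).
    """

    def __init__(self):
        # Offline side: full history of (entity_id, timestamp, features).
        self.offline_rows = []
        # Online side: latest feature values per entity.
        self.online_rows = {}

    def write(self, entity_id, timestamp, features):
        self.offline_rows.append((entity_id, timestamp, features))
        self.online_rows[entity_id] = features

    def get_historical_features(self, entity_id):
        """Return all historical rows for training-set construction."""
        return [r for r in self.offline_rows if r[0] == entity_id]

    def get_online_features(self, entity_id):
        """Return the latest features for real-time inference."""
        return self.online_rows.get(entity_id)

store = ToyFeatureStore()
store.write(42, datetime(2024, 3, 1), {"login_count": 2})
store.write(42, datetime(2024, 3, 4), {"login_count": 5})

print(store.get_online_features(42))           # {'login_count': 5}
print(len(store.get_historical_features(42)))  # 2
```

The caller never sees the split between the two storage layers; that is exactly the abstraction the rest of this post unpacks.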
But behind this simple interface lies significant infrastructure. A feature store must compute features across batch and streaming pipelines, store them in systems optimized for training and serving, keep those systems synchronized, and enforce point-in-time correctness.
This post explains how feature stores accomplish this behind the scenes.
Training and Serving Are Different Workloads
Training and serving use the same features but in completely different ways.
- Training: operates on massive historical datasets. It may scan billions of rows. Throughput and efficiency are prioritized over latency.
- Serving: requires making a prediction for a single feature vector in milliseconds. The system cannot scan the entire dataset.
Serving both workloads from a single storage system does not work. Feature stores solve this by splitting storage into two layers: an offline store for batch workloads and an online store for low-latency access. Both stores contain the same features but are optimized for their workloads.
This is essentially SCD Type 4 (Slowly Changing Dimensions), with the offline store as your historical ledger and the online store as your current-value table for real-time reads.
Training can also be performed online using streaming updates, and inference can sometimes be done in batch offline. For simplicity, this post focuses on the common case of offline training and online inference.
Storage Design
Offline Store
The offline store handles batch workloads such as training and backfills. Data is stored in columnar formats such as Parquet with Delta Lake or Hudi to enable ACID transactions [8].
user_id:        [ 42          17          99         ]
event_timestamp:[ 2024-03-01  2024-02-15  2024-03-10 ]
login_count:    [ 2           8           3          ]
avg_purchase:   [ 40          25          67         ]

Training jobs can read only the columns they need. Compression is also very efficient, especially for low-cardinality columns, where techniques like run-length encoding work well [9].
The offline feature store acts as a historical repository of feature data used for model training and batch workloads, often storing months or years of historical feature values. Efficient compression makes it practical and inexpensive to retain this data for training, debugging, and backfilling newly created features.
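To illustrate why low-cardinality columns compress so well, here is a sketch of run-length encoding in plain Python. This is the idea only, not how Parquet implements it internally:

```python
def rle_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    """Expand [value, count] pairs back into the original column."""
    return [v for v, count in runs for _ in range(count)]

# A low-cardinality column, e.g. a country-code feature:
column = ["US"] * 5 + ["DE"] * 3 + ["US"] * 2
encoded = rle_encode(column)
print(encoded)  # [['US', 5], ['DE', 3], ['US', 2]]
assert rle_decode(encoded) == column
```

Ten values shrink to three pairs; on columns with millions of rows and a handful of distinct values, the savings are dramatic.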
Online Store
The online store is optimized for low-latency access. Data is stored row by row or as key-value pairs so that all features for a single entity can be retrieved with a single fast lookup.
| user_id | event_timestamp | login_count | avg_purchase |
| ------- | --------------- | ----------- | ------------ |
| 42 | 2024-03-01 | 2 | 40 |
| 17 | 2024-02-15 | 8 | 25 |
| 99      | 2024-03-10      | 3           | 67           |

Row-oriented databases such as Postgres or key-value stores like Redis are common choices. They provide millisecond read latency but are expensive for storing large historical datasets. For this reason, the online store usually keeps only the most recent feature values.
Offline vs Online Store Summary
| Feature | Offline Store | Online Store |
|---|---|---|
| Purpose | Model training and batch inference | Real-time inference |
| Access pattern | Large batch workloads | Single item lookups |
| Storage format | Columnar (Delta, Hudi) | Row-oriented or key-value (Postgres, Redis) |
| Data retained | Full historical feature values | Latest value per entity |
| Latency | Seconds to hours | Milliseconds |
| Cost | Low cost with strong compression | Higher cost for low-latency reads |
Keeping these two stores synchronized is one of the main engineering challenges.
Keeping Offline and Online Stores in Sync
Feature values must land in both stores. Writing to both systems independently is risky. If one write job succeeds and the other fails, the stores diverge. This is known as the dual write problem.
There are two common approaches to solving it.
Streaming Updates with Kafka
One method is to route all feature updates through a streaming platform such as Kafka [10].
The feature pipeline writes new feature values to a Kafka topic. Events are usually serialized using a schema format such as Avro and registered in a schema registry. This ensures that all downstream consumers interpret the feature payload the same way, and allows safe schema evolution as features change over time.
Both the online and offline stores subscribe to this topic. One consumer writes updates to the online store for low-latency inference. The other consumer writes the same events to the offline store for historical storage.
Kafka acts as the central log for all feature updates. Once an event is written to the topic, downstream services replicate it into both storage systems. This pattern avoids the dual write problem.
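The fan-out pattern can be sketched with a plain Python list standing in for the Kafka topic. This is a toy model; real consumer groups track their offsets in Kafka itself:

```python
# Toy model: the topic is an append-only list of events, and each
# consumer tracks its own read offset, as Kafka consumer groups do.
topic = []

def publish(event):
    topic.append(event)

class Consumer:
    def __init__(self, sink):
        self.offset = 0   # position in the log
        self.sink = sink  # the store this consumer feeds

    def poll(self):
        # Apply every event written since the last poll.
        while self.offset < len(topic):
            self.sink.append(topic[self.offset])
            self.offset += 1

online_sink, offline_sink = [], []
online_consumer = Consumer(online_sink)
offline_consumer = Consumer(offline_sink)

publish({"user_id": 42, "login_count": 2})
online_consumer.poll()   # online store updates immediately
publish({"user_id": 17, "login_count": 8})
online_consumer.poll()
offline_consumer.poll()  # offline store catches up later, in one batch

assert online_sink == offline_sink  # both stores saw the same log
```

Because there is a single write path (the log), the two stores can never receive different events, only the same events at different times.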
By default, Kafka is configured to provide at-least-once delivery guarantees [11]. If a consumer fails, it can replay events from the log. This provides redundancy, but also introduces the possibility of duplicates.
To maintain correctness, feature stores typically use idempotent writes in the online store and deduplication or ACID upserts in the offline store. Both systems can also rely on the ordered event stream to apply updates deterministically [5].
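Idempotent writes are what make at-least-once delivery safe: applying the same event twice leaves the store unchanged. A minimal sketch (keying on entity and comparing event timestamps is one common choice, not the only one):

```python
online_store = {}

def apply_event(event):
    """Idempotent upsert keyed by entity.

    Only apply the event if it is at least as new as what we already
    hold for this entity, so duplicate deliveries and out-of-order
    replays cannot overwrite fresh values with stale ones.
    """
    key = event["user_id"]
    current = online_store.get(key)
    if current is None or event["event_timestamp"] >= current["event_timestamp"]:
        online_store[key] = event

fresh = {"user_id": 42, "event_timestamp": "2024-03-04", "login_count": 5}
apply_event(fresh)
apply_event(fresh)  # duplicate delivery: no effect

stale = {"user_id": 42, "event_timestamp": "2024-03-01", "login_count": 2}
apply_event(stale)  # replayed older event: ignored

print(online_store[42]["login_count"])  # 5
```

With this property, a consumer can crash and replay from any earlier offset without corrupting the online store.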
The online store updates almost immediately, so new features are available for inference. The offline store may lag slightly because updates are often batched and compacted before being written to columnar storage.
The result is two systems that remain eventually consistent.
The Hopsworks feature store implements this architecture to maintain consistency between online and offline feature stores. See the Hopsworks documentation for more detail.
Batch Materialization
The Kafka approach streams updates into both stores in real time. Another approach is simpler. Treat the offline store as the source of truth and periodically copy feature values into the online store.
A scheduled batch job reads the latest feature values from the offline store and writes them into the online store.
This process is called materialization. The job queries recent feature values from the offline store and loads them into the online store so they are available for low-latency inference.
This design keeps the system simple. The offline store remains the system of record, and the online store provides quick lookups for the latest feature values.
The tradeoff for simplicity is freshness. The online store only reflects new data after each materialization run. This is often acceptable for models that tolerate slightly stale features, such as periodic recommendation systems.
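A materialization job boils down to "latest row per entity, then upsert." Here is a toy in-memory version; a real system (e.g. Feast's materialization) runs the same logic against actual offline and online stores:

```python
from datetime import date

# Offline store: full history of feature rows.
offline_store = [
    {"user_id": 42, "event_timestamp": date(2024, 3, 1), "login_count": 2},
    {"user_id": 42, "event_timestamp": date(2024, 3, 4), "login_count": 5},
    {"user_id": 17, "event_timestamp": date(2024, 2, 15), "login_count": 8},
]

def materialize(offline_rows):
    """Copy the latest feature values per entity into an online store."""
    online = {}
    for row in sorted(offline_rows, key=lambda r: r["event_timestamp"]):
        online[row["user_id"]] = row  # later rows overwrite earlier ones
    return online

online_store = materialize(offline_store)
print(online_store[42]["login_count"])  # 5, the latest value for user 42
```

Between runs, the online store simply serves whatever the last materialization left behind, which is where the freshness tradeoff comes from.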
The Feast feature store follows this architecture and periodically materializes features from the offline store into the online store. See the Feast documentation on the Online Store and Offline Store to understand how this works.
Point-in-Time Correctness
Joining labels to features introduces a risk of data leakage [4].
A naive join can include feature values that were not available when the label occurred. This makes offline metrics look artificially good, but the model can perform poorly in production.
Example: predicting whether a user will buy a premium subscription
Suppose a label occurs on 2024-03-03.
| user_id | event_timestamp | purchased_premium |
| ------- | --------------- | ----------------- |
| 42      | 2024-03-03      | true              |

And the feature table holds these rows for the same user:

| user_id | event_timestamp | login_count | avg_purchase |
| ------- | --------------- | ----------- | ------------ |
| 42      | 2024-03-01      | 2           | 40           |
| 42      | 2024-03-04      | 5           | 41           |
| 42      | 2024-03-07      | 9           | 42           |

If your join logic simply grabs the latest record, it will use a login_count of 9. But on the day of the purchase, that number was 2. By using 9, you've leaked information from four days in the future into your training set.
The Fix: Point-in-Time Join
To maintain temporal integrity, you must perform a point-in-time join. This ensures you only retrieve feature values where features.event_timestamp <= labels.event_timestamp.
1. The Modern Approach: ASOF Join
If your data processing framework supports it, the ASOF JOIN is the cleanest way to grab the single most recent record relative to the label timestamp.
SELECT
labels.user_id,
labels.event_timestamp,
labels.purchased_premium,
features.login_count,
features.avg_purchase
FROM labels
ASOF LEFT JOIN features
ON labels.user_id = features.user_id
AND labels.event_timestamp >= features.event_timestamp
2. The Standard SQL Approach: Window Functions
If ASOF JOIN isn't available, you can use a QUALIFY clause with a window function.
SELECT
labels.user_id,
labels.event_timestamp,
labels.purchased_premium,
features.login_count,
features.avg_purchase
FROM labels
LEFT JOIN features
ON labels.user_id = features.user_id
AND labels.event_timestamp >= features.event_timestamp
QUALIFY ROW_NUMBER() OVER (
PARTITION BY labels.user_id, labels.event_timestamp
ORDER BY features.event_timestamp DESC
) = 1
The QUALIFY clause is gaining widespread adoption and is supported by platforms such as Snowflake, DuckDB, and Databricks.
If QUALIFY is not available in your SQL dialect, the same result can typically be achieved using a window function combined with a CTE.
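The same logic can also be expressed outside SQL. Here is a sketch in plain Python using the example data from above: for each label, take the most recent feature row at or before the label's timestamp.

```python
from datetime import date

labels = [
    {"user_id": 42, "event_timestamp": date(2024, 3, 3), "purchased_premium": True},
]
features = [
    {"user_id": 42, "event_timestamp": date(2024, 3, 1), "login_count": 2},
    {"user_id": 42, "event_timestamp": date(2024, 3, 4), "login_count": 5},
    {"user_id": 42, "event_timestamp": date(2024, 3, 7), "login_count": 9},
]

def point_in_time_join(labels, features):
    """For each label, attach the latest feature row that was already
    known at the label's timestamp, so no future values can leak in."""
    joined = []
    for label in labels:
        eligible = [
            f for f in features
            if f["user_id"] == label["user_id"]
            and f["event_timestamp"] <= label["event_timestamp"]
        ]
        best = max(eligible, key=lambda f: f["event_timestamp"], default=None)
        joined.append({**label,
                       "login_count": best["login_count"] if best else None})
    return joined

rows = point_in_time_join(labels, features)
print(rows[0]["login_count"])  # 2, not the leaked future value 9
```

This brute-force scan is O(labels x features); the SQL formulations above let the engine do the same thing with sorted merge strategies at scale.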
Automation via Feature Stores
Writing these queries manually for dozens of features is tedious and error-prone.
When you call a single method like get_historical_features(), the feature store handles point-in-time correctness automatically, ensuring your model only learns from feature values that were available at the time of prediction.
Summary
Feature stores solve several core challenges in production machine learning systems.
Storage: Training and serving have fundamentally different access patterns. Training requires scanning large historical datasets, while online inference requires retrieving a single feature vector within milliseconds. Feature stores separate these workloads into two systems: an offline store optimized for large analytical queries and an online store optimized for low-latency lookups.
Synchronization: Both stores must contain consistent feature values. Two common architectures address this. Streaming pipelines publish updates through systems such as Kafka, allowing both stores to ingest events in near real time. Alternatively, batch materialization treats the offline store as the source of truth and periodically loads the latest feature values into the online store. Streaming prioritizes freshness, while materialization prioritizes simplicity.
Point-in-time correctness: Training datasets must only include feature values that were available when the label occurred. Without this constraint, joins can leak future information and inflate offline evaluation metrics. Feature stores enforce point-in-time joins automatically, ensuring models learn from data that reflects the true prediction context.
By handling these concerns, feature stores allow machine learning teams to define features once and reuse them consistently across training and production. Behind a simple interface, they provide the infrastructure needed to keep feature data reliable, reproducible, and aligned across the entire ML lifecycle.
For Further Reading
- Feast: Feature Store Concepts
- Hopsworks: The Feature Store for Machine Learning
- Databricks: Databricks Feature Store
- Tecton: What is Tecton?
- Chalk.ai: What is a Feature Store? A Complete Guide for ML Teams