Should Hub Definitions Follow Source Systems or Business Entities in Data Vault 2.0?

Rhys Hanscombe
Jan 28

One of the most common modeling questions in Data Vault 2.0 appears deceptively simple:

Should Hubs be defined based on source system tables or on business entities?

A recent discussion in the Data Community forum surfaced this exact dilemma and produced a clear, experience-backed answer that many data professionals struggle to articulate under delivery pressure.

This post breaks down the problem, the reasoning, and the practical rule you can apply on your next Data Vault model.

The Problem: Source Tables vs Business Reality

The original question came from a practitioner working with multiple operational systems. In their case:

Client data and contact data arrive from separate source tables.
Business users conceptually view both as a single “Client” entity.
There is concern about scalability if 10–15 different tables feed into one hub.
There is uncertainty about whether merging concepts will cause pain later.

This situation is common. Source systems are optimized for transactions, not semantics. Data Vault promises integration, but the modeling choices still feel risky when the vault begins to grow.

So what should drive hub design?

The Core Principle: Data Vault Is Business-Centric

The strongest response in the thread cuts directly to the heart of Data Vault 2.0:

Hubs must represent business entities, not source system structures.

This is not an opinion. It is foundational to why Data Vault exists.

Source systems are volatile. Tables change. Fields move. Systems are replaced. If hubs mirror source structures, the vault inherits this instability and loses its purpose as a long-term integration layer.

Business entities, by contrast, are comparatively stable. Customers, products, contracts, and suppliers persist as concepts even when the systems that record them change.

Data Vault’s strength is its ability to unify multiple representations of the same business concept across disparate sources. That unification happens at the hub.
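To make the unification concrete, here is a minimal sketch of two source tables loading the same Client hub. It is an illustration under assumptions, not a reference implementation: the source names (crm.client, billing.customer), the column names, the trim-and-upper-case normalization, and the choice of SHA-256 are all hypothetical, and the hub is a plain in-memory dictionary rather than a database table.

```python
import hashlib
from datetime import datetime, timezone


def hub_hash_key(business_key: str) -> str:
    """Hash a normalized business key (trimmed, upper-cased; an assumed convention)."""
    normalized = business_key.strip().upper()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def load_hub(hub: dict, business_keys, record_source: str) -> None:
    """Insert business keys not yet in the hub; existing keys are left untouched."""
    for key in business_keys:
        hk = hub_hash_key(key)
        if hk not in hub:
            hub[hk] = {
                "client_bk": key.strip().upper(),
                "record_source": record_source,  # first source to deliver the key
                "load_date": datetime.now(timezone.utc),
            }


# Two hypothetical sources describe overlapping clients.
hub_client = {}
load_hub(hub_client, ["c-1001", "C-1002"], record_source="crm.client")
load_hub(hub_client, ["C-1001 ", "C-1003"], record_source="billing.customer")

for row in hub_client.values():
    print(row["client_bk"], row["record_source"])
```

Both sources deliver client C-1001 in slightly different formats, yet the hub holds it once. The hub stays the same size no matter how many tables feed it; only the number of distinct business keys matters.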

A Simple Test for Hub Design

One contributor offered a practical test that cuts through abstract debate:

Can this concept exist independently?

Apply it like this:

Can a Client exist without a Contact?
Can a Contact exist without a Client?

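If contact details only make sense in the context of a client, then “Contact” is not a hub. It is descriptive information. It belongs in satellites attached to the Client hub. If, on the other hand, a Contact can exist on its own, for example as a person associated with several clients, it warrants its own hub and a link to Client.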

This reframes the problem away from tables and toward meaning.

If two concepts are inseparable at the business level, splitting them into separate hubs creates artificial complexity and semantic confusion.
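The outcome of the test for the Client and Contact case might look like the sketch below, where contact details live in a satellite keyed by the Client hub's hash key. The class names, fields, and placeholder hash keys are hypothetical. The parent check in add_contact is there only to illustrate the dependency; it is not a claim about how a real vault load enforces it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class HubClient:
    client_hk: str       # hash key derived from the business key
    client_bk: str       # the business key itself
    record_source: str
    load_date: datetime


@dataclass(frozen=True)
class SatClientContact:
    client_hk: str       # parent hub key: a contact row has no identity of its own
    load_date: datetime
    record_source: str
    email: str
    phone: str


def add_contact(hub_rows: dict, sat_rows: list, contact: SatClientContact) -> None:
    """Append a contact row, rejecting it if the parent client is not in the hub."""
    if contact.client_hk not in hub_rows:
        raise KeyError(f"no client hub row for hash key {contact.client_hk}")
    sat_rows.append(contact)


now = datetime.now(timezone.utc)
hub_rows = {"hk-1001": HubClient("hk-1001", "C-1001", "crm.client", now)}
sat_rows = []
add_contact(
    hub_rows,
    sat_rows,
    SatClientContact("hk-1001", now, "crm.contact", "a@example.com", "555-0100"),
)
```

Notice that SatClientContact carries no business key of its own. That absence is the modeling decision: the contact is described only in terms of the client it belongs to.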

Why Reporting Requirements Should Not Drive Hub Design

Another critical clarification from the discussion:

Hubs should not be driven by reporting or downstream consumption.

Reporting needs change frequently. Dashboards evolve. Metrics are redefined. If hub definitions are shaped by current reporting, the model becomes brittle.

Hubs exist to anchor business identity through business keys. Satellites exist to capture descriptive change over time. Links exist to express relationships between hubs.
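As a rough illustration of the third role, a link row can be sketched as nothing more than the hash keys of the hubs it connects, plus a key of its own. The Client-to-Contract relationship, the column names, the "||" delimiter, and the use of SHA-256 are assumptions made for this example only.

```python
import hashlib


def hash_key(*business_keys: str) -> str:
    """Hash one or more normalized business keys joined by a delimiter."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_link_row(client_bk: str, contract_bk: str, record_source: str) -> dict:
    """Build a link row relating a client hub row to a contract hub row."""
    return {
        "client_contract_hk": hash_key(client_bk, contract_bk),  # the link's own key
        "client_hk": hash_key(client_bk),                        # points at the Client hub
        "contract_hk": hash_key(contract_bk),                    # points at the Contract hub
        "record_source": record_source,
    }


link_row = build_link_row("c-1001", "K-77", record_source="crm.contract")
```

The link holds no descriptive attributes and no reporting logic. It records only that the relationship exists, which is why changing a dashboard should never require changing it.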

When these roles are blurred to satisfy short-term reporting convenience, rework becomes inevitable.

The Fear of “Too Many Satellites” Is Misplaced

A recurring anxiety in Data Vault modeling is scale. What happens when a single hub ends up with many satellites fed by many sources?

The forum response was blunt and accurate:

Hubs and satellites are cheap.
Incorrect semantics are expensive.

Having many satellites is normal. It is expected. It is how Data Vault absorbs change without structural upheaval.

Storage is inexpensive. Compute scales. Re-modeling a vault because a hub was defined incorrectly is far more costly than maintaining additional satellites.

If multiple source tables describe the same business entity from different perspectives, that is exactly what satellites are for.
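A small sketch of that pattern: one satellite per source, all hanging off the same Client hash key, with a new row written only when the attributes change. The satellite names, attributes, and the hash-diff construction are hypothetical choices for illustration, and the in-memory lists stand in for real tables.

```python
import hashlib
from datetime import datetime, timezone


def hash_diff(attributes: dict) -> str:
    """Hash the attribute values in a stable (sorted-by-name) order."""
    payload = "||".join(f"{name}={attributes[name]}" for name in sorted(attributes))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_satellite(satellite: list, client_hk: str, attributes: dict, record_source: str) -> bool:
    """Append a row only if it differs from the latest row for this hub key.

    Returns True when a row was inserted, False when the delivery was unchanged.
    """
    new_diff = hash_diff(attributes)
    latest = next((r for r in reversed(satellite) if r["client_hk"] == client_hk), None)
    if latest is not None and latest["hash_diff"] == new_diff:
        return False
    satellite.append({
        "client_hk": client_hk,
        "load_date": datetime.now(timezone.utc),
        "record_source": record_source,
        "hash_diff": new_diff,
        **attributes,
    })
    return True


# One satellite per source, both attached to the same client hash key.
satellites = {"sat_client_crm": [], "sat_client_billing": []}
load_satellite(satellites["sat_client_crm"], "hk-1001", {"name": "Acme Ltd", "segment": "SMB"}, "crm.client")
load_satellite(satellites["sat_client_billing"], "hk-1001", {"payment_terms": "NET30"}, "billing.customer")
# A repeated, unchanged delivery adds no row.
load_satellite(satellites["sat_client_crm"], "hk-1001", {"name": "Acme Ltd", "segment": "SMB"}, "crm.client")
```

Adding a sixteenth source under this approach means adding another entry to satellites. The hub and the existing satellites are not touched.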

Iteration Is Not Failure

Another subtle but important point surfaced in the discussion:

Modeling is iterative.

You will not get every hub definition perfect on day one. That is acceptable. What matters is grounding decisions in business meaning rather than technical convenience.

Engage business stakeholders early. Ask how they think about the data. Validate whether two concepts are truly distinct in their mental model.

Data Vault is designed to evolve, but evolution is far cheaper when the core business entities are modeled correctly from the start.

Practical Takeaways for Data Vault Practitioners

If you remember nothing else, remember this checklist:

Define hubs around stable business entities.
Ignore source system table boundaries when defining hubs.
Use the “can it exist independently?” test.
Do not let reporting requirements dictate hub design.
Expect and accept many satellites.
Optimize for semantic correctness, not table count.

These principles scale. They hold whether you have two sources or twenty.

Join the Conversation

This post is based on a real discussion between experienced Data Vault practitioners in the Data Community forum.

If you are modeling a vault today, you will face these decisions repeatedly. Reading static guidance helps. Debating real scenarios helps more.

👉 Join the Data Community forum to read the full discussion and take part in future ones.

Ask questions. Challenge assumptions. Learn from practitioners solving the same problems you are.

Good data models are not built in isolation.