Most bad analysis doesn't start with the chart. It starts earlier, when someone accepts a dataset at face value because it loaded cleanly into a dashboard or looked tidy enough in Excel.
Before I build anything on top of data, I want to know what it represents, where it came from, what's missing, and what kinds of decisions it's safe to support. It doesn't need to be perfect - almost no real-world dataset is. But trust usually comes down to a few practical checks.
1. What is this dataset actually measuring?
What does one row represent? What's being counted? What's excluded? What do the fields mean in plain English?
A dataset can be technically clean and still be misleading if the definitions underneath it are fuzzy. Fields like "customer", "transaction", "active user", or "sales" can mean different things depending on the source system and the team using it.
I once worked on a category review where scan data showed single-digit growth in a segment that I knew was growing in the double digits. The numbers were accurate - but a major brand had quietly dropped out of the segment hierarchy. That one missing brand accounted for about 80% of the segment's real growth. The data was clean. The definition was wrong.
If I can't explain what a dataset represents without hiding behind system language, I'm not ready to trust it yet.
2. Where did it come from?
The source tells you what kinds of problems to expect before you find them the hard way.
Manually entered data? Expect inconsistency. Operational system? Expect process-driven quirks. Third-party platform? Expect hidden assumptions and field definitions that don't quite match what you'd assume. Stitched together from multiple systems? Expect hierarchy mismatches and key alignment issues - products mapped to the wrong segment, store names that don't match store codes, time periods that don't line up across sources.
When I know a dataset was hand-entered by store managers across 700 locations, I check for different problems than when it came from a clean API feed. The source narrows down where to look.
3. Is it complete, mutually exclusive, and collectively exhaustive?
This is where a lot of analysis goes wrong - not because the data is inaccurate, but because it's not structured properly for the question being asked.
Collectively exhaustive means nothing is missing. If I'm looking at a trend, are there gaps in the time series? If I'm comparing segments, are any products excluded? If I'm reporting by geography, are some regions only partially mapped? A dataset can be accurate for what it contains and still be incomplete in a way that makes the conclusion unsafe.
Mutually exclusive means nothing is double-counted. Is a product sitting in two segments? Is a transaction appearing in two time periods because of a reporting lag? Are stores counted under both their old and new region after a restructure? If the categories overlap, the totals are wrong - and the error often isn't obvious until someone asks why the numbers don't reconcile.
The test is simple: if I sum the parts, do they equal the whole? If they don't, something is either missing or duplicated, and I need to find out which before I build anything on top of it.
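The sum-the-parts test is easy to automate. A minimal sketch, using invented example data and field names ("product", "sales"): it checks whether the parts reconcile to a stated whole, and whether any key appears more than once.

```python
# Sketch of a MECE check on invented data: segment rows that should
# reconcile to a known whole, with no product counted twice.
from collections import Counter

def check_mece(rows, whole_total, key="product", value="sales"):
    """Return (gap between whole and sum of parts, duplicated keys)."""
    keys = [r[key] for r in rows]
    duplicates = [k for k, n in Counter(keys).items() if n > 1]
    parts_total = sum(r[value] for r in rows)
    gap = whole_total - parts_total
    return gap, duplicates

rows = [
    {"product": "A", "sales": 400},
    {"product": "B", "sales": 250},
    {"product": "B", "sales": 250},  # product B sits in two segments
]
gap, dupes = check_mece(rows, whole_total=1000)
print(gap)    # 100: something is missing from the parts
print(dupes)  # ['B']: something is double-counted
```

A non-zero gap alone doesn't tell you whether the problem is a missing piece or a duplicate, which is why the check reports both.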
4. Has the meaning changed over time?
Trend analysis is where weak datasets get exposed.
Did field definitions change between periods? Were categories added, removed, or renamed? Did the collection method change? Did the geography or hierarchy shift?
A category might show growth not because sales improved, but because the definition expanded to include products that were previously classified elsewhere. A suburb-level time series might show population decline not because people left, but because the boundary was redrawn between census editions - which is exactly the problem I had to solve when building the LGA Language Explorer.
Sometimes the data isn't wrong. It's just no longer comparable in the way people assume it is. If I'm looking at anything over time, I want to know whether the structure underneath stayed stable across the period.
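That stability check can also be made mechanical. A minimal sketch, on invented monthly data: it flags holes in the time series and categories that appear or disappear between the first and last period, since either one can make a "trend" definitional rather than real.

```python
# Sketch of a structural-drift check on invented monthly data:
# find missing months, and categories added or removed over the window.

def structural_drift(period_categories):
    """period_categories: dict mapping (year, month) -> set of categories."""
    periods = sorted(period_categories)
    gaps = []
    y, m = periods[0]
    # Walk month by month from first to last period, noting holes
    while (y, m) <= periods[-1]:
        if (y, m) not in period_categories:
            gaps.append((y, m))
        m += 1
        if m > 12:
            y, m = y + 1, 1
    added = period_categories[periods[-1]] - period_categories[periods[0]]
    removed = period_categories[periods[0]] - period_categories[periods[-1]]
    return gaps, added, removed

data = {
    (2023, 11): {"snacks", "drinks"},
    (2023, 12): {"snacks", "drinks"},
    (2024, 2): {"snacks", "drinks", "energy drinks"},  # Jan missing; new category
}
gaps, added, removed = structural_drift(data)
print(gaps)   # [(2024, 1)]: the trend line has a hole
print(added)  # {'energy drinks'}: growth may be definitional, not real
```

Comparing only the endpoints keeps the sketch short; in practice you'd compare each consecutive pair of periods to catch categories that flicker in and out mid-series.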
5. What decisions is this data safe for?
Not every dataset needs to be trusted for every purpose. Some are fine for directional analysis but not for commercial decisions. Some are good enough for internal reporting but would fall apart under external scrutiny. Some are useful for generating hypotheses but not strong enough to act on.
I find it more useful to ask "what is this dataset safe for?" rather than treating trust as a binary. That framing is more honest, and it leads to better work - because the answer is usually "it depends on what you're trying to do with it."
Before moving ahead
Before I rely on a dataset, I want a clear view on five things: what it actually represents, where it came from, whether it's complete and properly structured, whether it's comparable over time, and what decisions it can safely support.
If I can answer those clearly, I'm usually in a good place to start building.
If I can't, the right move is to slow down - before the problem gets buried under charts, commentary, and false confidence.
Trusting a dataset is less about perfection and more about understanding its limits before you ask it to support a decision.