Data: Garbage in, gospel out¶
Data quantity: The eternal whinge¶
Data scientists will always say they need more data. It’s their default setting—like a smoke alarm that goes off every time you make toast. “The model underperforms because we only have ten million samples!” they proclaim, eyes wide, as if data were oxygen and you’d just turned off their life support.
In fairness, there are genuine cases of data famine. Trying to train a machine learning model on three JPEGs and a hunch isn’t going to get you very far. It’s like trying to write a novel using only fridge magnets. But more often than not, the problem isn’t that you don’t have enough data—it’s that you don’t have useful data.
You can drown in a sea of inputs and still die of thirst. Does your dataset reflect the rich tapestry of human experience, or does it resemble a damp pile of Tesco receipts and mislabelled cat photos? Ten thousand nearly identical images of the same blurry stapler won’t teach your model much beyond the tragic reality of your office supplies. And a billion rows are meaningless if 99% of them consist of missing values, default timestamps, or the word “unknown” spelled seven different ways.
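If you want to know whether the billion rows contain anything worth keeping, a few minutes of profiling beats a week of arguing about sample size. Here is a rough sketch using pandas; the file name and the country column are invented purely for illustration:

```python
import pandas as pd

# Purely illustrative: a hypothetical customer export, not anyone's real schema.
df = pd.read_csv("customers.csv")

print(df.shape)                        # raw size, for people who like big numbers
print(df.isna().mean().sort_values())  # fraction missing, per column
print(df.duplicated().mean())          # fraction of rows that are exact duplicates

# What actually lives in a supposedly informative column
# (expect several creative spellings of "unknown").
print(df["country"].str.strip().str.lower().value_counts().head(20))
```

Three print statements, and you'll know whether you have a dataset or a very large pile of nothing.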
So yes, quantity matters. But not as much as your average data scientist thinks it does—especially when they haven’t looked at a histogram since 2021.
Data quality: Where hope goes to die¶
Ah, “quality.” That ever-elusive ideal, invoked with solemnity but rarely seen in the wild. Ask anyone to define it and you’ll get answers ranging from “not awful” to “whatever passed the last pipeline without crashing.”
Let’s examine the three horsemen of data quality:
Consistency¶
The notion that your data might behave predictably, like a dependable train schedule or a well-socialised golden retriever. But instead, it acts more like a moody cat—capricious, inscrutable, and very likely to claw you for your trouble. One field contains “United Kingdom”, another “UK”, a third “U.K.”, and the fourth has “Europe, sort of.” Try doing analysis on that without losing your will to live.
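The boring cure is a canonical mapping applied before anyone is allowed to group by country. A minimal sketch, with an invented alias table and column name; anything it doesn't recognise gets left alone for a human to squint at:

```python
import pandas as pd

# Invented alias table; in practice this grows every time you look at the data.
COUNTRY_ALIASES = {
    "uk": "United Kingdom",
    "u.k.": "United Kingdom",
    "united kingdom": "United Kingdom",
    "great britain": "United Kingdom",
}

def normalise_country(value):
    """Collapse known variants; leave anything unrecognised for a human to judge."""
    if not isinstance(value, str):
        return value  # NaN and friends pass through untouched
    key = value.strip().lower()
    return COUNTRY_ALIASES.get(key, value.strip())

df = pd.DataFrame({"country": ["United Kingdom", "UK", "U.K.", "Europe, sort of"]})
df["country_clean"] = df["country"].map(normalise_country)
print(df["country_clean"].value_counts())
# United Kingdom     3
# Europe, sort of    1
```

Note that "Europe, sort of" survives on purpose: silently guessing what it meant is how you end up with a fourth spelling of the problem.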
Correctness¶
You’d hope your data reflects reality, or at least something adjacent to it. But often it’s more like a fever dream transcribed by a sleep-deprived intern. You spot an entry: “Customer Age: 217.” Maybe they’re immortal. Maybe someone leaned on the keyboard. Either way, your fraud detection model just signed them up for a youth savings account.
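Nothing in the pipeline objects to a 217-year-old unless you write the objection down. A small plausibility check, not a proof of correctness, with made-up bounds and column name:

```python
import pandas as pd

# Invented example values; the bounds are a judgement call, not a law of nature.
df = pd.DataFrame({"customer_age": [34, 217, -3, 58]})

suspect = ~df["customer_age"].between(0, 120)
print(df[suspect])  # rows to route to a human, not to silently drop or "fix"
```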
Completeness¶
The cheerful fiction that your dataset isn’t mostly empty space wrapped in lies. Eighty percent missing values? “No problem,” says someone, “we’ll just use mean imputation.” Brilliant—now all your customers have exactly average income, average height, and precisely the same number of cats. This may work if you’re modelling Lego figures, but not actual people.
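If you must impute, at least let the model know you did it. A toy illustration (values invented) of how mean imputation turns everyone into the same suspiciously average person, plus the one extra column that keeps you honest:

```python
import pandas as pd

# Toy column, values invented: mostly missing income.
income = pd.Series([32_000, None, None, None, 95_000, None, None, None, None, None])
print(income.isna().mean())  # 0.8 -- eighty percent missing

# Mean imputation: everyone who was missing now earns exactly the average.
filled = income.fillna(income.mean())
print(filled.value_counts())  # one suspiciously popular income

# The cheap alternative to lying: keep a flag so downstream models know
# which values were invented after the fact.
flagged = pd.DataFrame({
    "income": filled,
    "income_was_missing": income.isna(),
})
print(flagged)
```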
The inevitable truth¶
Cleaning data is like flossing: tedious, thankless, and absolutely necessary. No one applauds you for doing it, but skip it and you’ll soon find everything falling apart. Unlike code, bad data doesn’t throw a stack trace—it just sits there, corrupting your insights like a damp patch behind the fridge. Silent, smelly, and quietly ruining your life.