This post was inspired by the following:
Tell us your entertaining tales of catastrophic work mistakes you have made or witnessed.
Some number of years ago, my younger self didn’t exactly drop the production database, but he sure did clear out the most important table in it.
In my defense, it was fairly early in the morning, and I wasn’t a coffee drinker yet, so my brain wasn’t firing on all cylinders. And speaking of firing, I was lucky enough to keep my job, and my employer was lucky enough to keep the customer.
I won’t be naming any names here—except my own, obviously—so some details of the story have been changed to protect both the innocent and the guilty. I also don’t have the exact code, so you’ll have to suspend a bit of disbelief.
I was working at a small SaaS company where I didn’t have access to the production database (which actually describes most of my résumé). One of our customers had imported data into the platform, and much of the source text was entirely lowercase. Some of the data fields were things like names, where capitalization would be expected, so the experience really wasn’t ideal. We wanted to do them a solid by fixing it, because we were awesome like that—and because they asked us to do so—and I was tasked with creating a quick and dirty solution.
The approach I came up with was reasonable, I think: I wrote a Python script that would read the table in question and write UPDATE SQL statements with values run through str.capitalize()—easy enough! There were quite a few columns, so I iterated through them, which Python makes super easy.
As I mentioned, I don’t have this snippet available, but it probably looked something like this (with the rows being queried from the DB):
# The columns whose values needed their capitalization fixed.
COLUMNS = ["person_name", "company", "category"]

rows = [
    {
        "id": 313,
        "person_name": "nik kantar",
        "company": "acme inc.",
        "count": 13,
        "status": "legit",
        "category": "oopsies",
    },
    {
        "id": 1024,
        "person_name": "ada lovelace",
        "company": "acme inc.",
        "count": 24,
        "status": "legit",
        "category": "also oopsies",
    },
    {
        "id": 9001,
        "person_name": "margaret hamilton",
        "company": "not acme inc.",
        "count": 8,
        "status": "super legit",
        "category": "also oopsies",
    },
]

statements = []
for row in rows:
    statement = "UPDATE very_important_table SET "
    updates = []
    for column in COLUMNS:
        # Capitalize every space-separated word in the original value.
        original_value = row[column]
        parts = original_value.split(" ")
        updated_value = " ".join([part.capitalize() for part in parts])
        update = column + " = '" + updated_value + "'"
        updates.append(update)
    statement += ", ".join(updates)
    statement = statement + " WHERE id = " + str(row["id"]) + ";"
    statements.append(statement)

output = "\n".join(statements)
print(output)
I didn’t have f-strings available to me, and I was some number of years less experienced, so I’m sure the actual version was clumsier than the above, while trying to be entirely too clever.
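Just for comparison, here's a rough sketch of how the guts might look with f-strings today. This is purely my reconstruction, and it still pastes values straight into the SQL:

statements = []
for row in rows:
    updates = []
    for column in COLUMNS:
        # Same per-word capitalization as before, just less ceremony.
        fixed = " ".join(part.capitalize() for part in row[column].split(" "))
        updates.append(f"{column} = '{fixed}'")
    statements.append(f"UPDATE very_important_table SET {', '.join(updates)} WHERE id = {row['id']};")
print("\n".join(statements))

The output is identical either way.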
The result was something like this, minus the formatting for legibility:
UPDATE very_important_table
SET person_name = 'Nik Kantar',
    company = 'Acme Inc.',
    category = 'Oopsies'
WHERE id = 313;

UPDATE very_important_table
SET person_name = 'Ada Lovelace',
    company = 'Acme Inc.',
    category = 'Also Oopsies'
WHERE id = 1024;

UPDATE very_important_table
SET person_name = 'Margaret Hamilton',
    company = 'Not Acme Inc.',
    category = 'Also Oopsies'
WHERE id = 9001;
I know for an absolute fact I tested this on some synthetic data I had in my local database, and it worked like a charm. I ran the script, it generated the SQL, I ran that on the DB, and verified the results.
Empowered by this massive success, I sent the Python script to the person who was going to run it on production data. They went through the process, but were too preoccupied to validate the changes on the spot, which wasn’t ideal, but our confidence about this fairly simple transformation was high.
That is until a few moments later, when someone at the company told us they couldn’t access the single most important page in the product—powered by this very table—while logged in as that customer. Uh-oh…
A little bit later the person with access to production data was able to take a look, and, uhh, it wasn’t great: the table was full of values based on column names—that is, the person_name column had "Person_name" for every entry we updated, company was "Company", and so on.
It turned out that I did the equivalent of original_value = column instead of original_value = row[column] in whatever syntax I actually used.
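In terms of the reconstructed snippet above, the inner loop effectively became this:

    # inside the for row in rows: loop, same as before
    for column in COLUMNS:
        original_value = column  # oops: the column name itself, not row[column]
        parts = original_value.split(" ")
        updated_value = " ".join([part.capitalize() for part in parts])
        update = column + " = '" + updated_value + "'"
        updates.append(update)

Since none of the column names contain spaces, every single row dutifully got a statement like this, clobbering the real values:

UPDATE very_important_table SET person_name = 'Person_name', company = 'Company', category = 'Category' WHERE id = 313;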
Such an easy mistake to make…but wait, didn’t I test this?! Well…yes, I did. But my test data was pretty awful. I was working with values extremely close to column names, like "person name" and "company", so my sleepy eyes glazed over the differences. And my partner in crime didn’t verify the generated SQL before running it, since it was a lot more verbose and illegible than the sample above.
We were very fortunate to have a backup recent enough that no changes to customer data had been made since. I fixed the script, we were able to run it on the backup table, and everything was fine.
Ever since, I’ve tried my best to use at least slightly more representative test data. If I at all can, I’ll make up realistic values for names, write full sentences for descriptions, and pick a variety of numerical values. Really, anything is better than "person name" for a column named person_name…
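To make that concrete (a reconstruction, not the actual test data): with the first kind of row below, the buggy 'Person_name' output looks almost identical to the intended 'Person Name', while the second kind makes the mistake impossible to miss.

# roughly what my test data looked like: values nearly identical to the column names
sleepy_row = {"id": 1, "person_name": "person name", "company": "company", "category": "category"}

# what it should have looked like: values that resemble real data
representative_row = {"id": 2, "person_name": "grace hopper", "company": "acme shipping co.", "category": "navy stuff"}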
Thanks for reading! You can keep up with my writing via the feed or newsletter, or you can get in touch via email or Mastodon.