Claude Code for Non-Developers
Working with Data

Cleaning Messy Data

How to use Claude Code to fix inconsistent names, missing values, duplicate rows, and other common data problems

Why real data is messy

If you've worked with data from more than one source, you already know: data is rarely clean when you first get it.

A CRM export spells customer names differently than your billing system. A survey tool records dates as "January 15, 2025" while your spreadsheet uses "01/15/2025." Someone entered "California" in one row and "CA" in the next. The same customer appears three times because they signed up with different email addresses.

These problems aren't unusual. They're the norm. Industry studies estimate that only about 3% of companies' data meets basic quality standards.

In a spreadsheet, fixing these issues means hours of Find & Replace, sorting, manual deduplication, and careful scanning. With Claude Code, you describe what "clean" looks like and let it handle the grunt work.

The golden rule: work on a copy

Before you clean anything, make a copy of your data file.

You heard this advice on the first page of this module, and it bears repeating here because the stakes are higher. When you ask Claude Code a question about your data, it reads the file and shows you an answer — the file doesn't change. But when you ask Claude Code to clean your data, it rewrites the file. If something goes wrong, you want the original untouched.

You can ask Claude Code to make the copy for you:

Make a copy of customer_orders.csv called customer_orders_clean.csv and work with the copy from now on

Claude Code creates the copy and uses it for all cleaning operations going forward. Your original file stays exactly as it was.

Tip: Some people prefer creating the copy themselves — duplicate the file in Finder or File Explorer before starting Claude Code. Either way works. The point is: never clean your only copy.
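You never have to write code yourself, but it can help to see what the copy step amounts to behind the scenes. Here's a rough Python sketch — the filenames mirror this page's example, and the tiny file contents are invented for illustration:

```python
import shutil
import tempfile
from pathlib import Path

# Work in a scratch folder with a made-up file, just so the sketch runs.
workdir = Path(tempfile.mkdtemp())
original = workdir / "customer_orders.csv"
original.write_text("customer_name,region\nAlice,West\n")

# Copy the file; every cleaning step afterwards touches only the copy.
copy = workdir / "customer_orders_clean.csv"
shutil.copy(original, copy)

# No matter what happens to the copy, the original stays exactly as it was.
```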

Finding the problems first

Before you fix anything, ask Claude Code to tell you what's wrong.

Read customer_orders_clean.csv and give me a data quality report. What problems do you see — inconsistent values, missing data, duplicates, formatting issues?

Claude Code scans the entire file and reports back something like:

Data quality report for customer_orders_clean.csv (2,847 rows):

Inconsistent values:
- "region" column has 6 variations for the same regions:
  "West", "west", "WEST", "West Coast", "W", "Pacific" (likely all mean West)
- "status" column mixes "Shipped", "shipped", "SHIPPED", "Complete"

Missing data:
- customer_name: 3 rows missing
- email: 15 rows missing
- region: 8 rows missing

Duplicates:
- 23 rows appear to be exact duplicates (all columns match)
- 7 additional rows have matching name + email but different order dates
  (likely same customer, different orders — not true duplicates)

Formatting:
- order_date uses three formats: "2025-01-15", "01/15/2025", "January 15, 2025"
- unit_price mixes formats: some values have dollar signs, some don't, and the number of decimal places varies

This report gives you a map of everything that needs fixing. You can tackle it piece by piece or ask Claude Code to fix everything at once.
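Behind the scenes, a quality report like this comes from a handful of simple checks. Here's a sketch of those checks in Python using the pandas library — the four sample rows are invented stand-ins for the real file:

```python
import pandas as pd

# A tiny stand-in for customer_orders_clean.csv (invented sample rows).
df = pd.DataFrame({
    "region": ["West", "west", "West", None],
    "email": ["a@x.com", None, "a@x.com", "b@x.com"],
})

# Inconsistent values: list the distinct spellings in a column.
unique_regions = df["region"].dropna().unique().tolist()

# Missing data: count the gaps in each column.
missing_counts = df.isna().sum()

# Exact duplicates: rows where every column matches an earlier row.
exact_dupes = int(df.duplicated().sum())
```

Each finding in the report maps to one of these checks, run across every column in the file.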

Fixing inconsistent values

Inconsistent values are the most common data quality problem. Different people and systems enter the same information different ways.

Tell Claude Code what the correct values should be:

In the region column, standardize all values. "West", "west", "WEST", "West Coast", "W", and "Pacific" should all be "West". Do the same for all other regions — pick the most common spelling and use it everywhere.

Claude Code reads through the column, maps every variation to a standard value, and makes the changes.

For the status column, be specific:

Standardize the status column. Use "Shipped", "Pending", and "Cancelled" as the only three values. Map "Complete" and "shipped" to "Shipped", and so on.

What matters is giving Claude Code a clear target. "Fix the inconsistencies" works, but "standardize to these three values" works better because Claude Code doesn't have to guess what you want.

Tip: If you're not sure what all the variations are, ask Claude Code to show you first: "Show me all unique values in the region column and how many times each one appears." That way you can decide what the standard should be before asking for changes.
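Under the hood, standardizing comes down to a mapping from every variation to its canonical value. A sketch in pandas, using the region variations from the report above (the sample data and the mapping are illustrative):

```python
import pandas as pd

# Invented sample column containing the six region variations.
df = pd.DataFrame({"region": ["West", "west", "WEST", "West Coast", "W", "Pacific"]})

# Map every known variation to the standard spelling.
region_map = {"west": "West", "WEST": "West", "West Coast": "West",
              "W": "West", "Pacific": "West"}
df["region"] = df["region"].replace(region_map)
```

This is why a clear target helps: the mapping is explicit, and anything not covered by it is left alone rather than guessed at.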

Handling missing data

Not all missing data needs to be fixed the same way. Some gaps you can fill, some you should leave empty, and some rows might need to be removed entirely.

Start by understanding what's missing and why it matters:

For the 15 rows with missing email addresses, show me those rows. Can the emails be filled in from other information in the file?

Claude Code might find that some of those customers appear in other rows with their email included. Or it might find that there's no way to recover the information.

Here are the most common approaches, each with what to tell Claude Code.

You can fill in from context when other columns have enough clues:

Where customer_name is missing but email exists, use the part before the @ sign as the customer name, capitalized

You can fill with a placeholder so records don't vanish from analysis:

For rows with missing region, fill in "Unknown" so they're not silently dropped from analysis

You can remove rows that are too incomplete to be useful:

Remove any rows where both customer_name and email are missing — we can't use those records

And sometimes you leave gaps alone. The 1,204 empty "notes" fields from our file don't need fixing — most orders don't have notes.

Decide what to do for each column rather than applying a blanket rule. Tell Claude Code your reasoning and it applies the right fix.
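For the curious, all three approaches are one-liners in pandas. This sketch applies a different rule to each column, using invented sample rows:

```python
import pandas as pd

# Invented sample with three kinds of gaps.
df = pd.DataFrame({
    "customer_name": ["Alice", None, None],
    "email": ["alice@x.com", "bob@x.com", None],
    "region": ["West", None, "East"],
})

# Fill with a placeholder so rows aren't silently dropped from analysis.
df["region"] = df["region"].fillna("Unknown")

# Fill from context: take the part of the email before the @, capitalized.
mask = df["customer_name"].isna() & df["email"].notna()
df.loc[mask, "customer_name"] = (
    df.loc[mask, "email"].str.split("@").str[0].str.capitalize()
)

# Remove rows too incomplete to use: both name and email missing.
df = df.dropna(subset=["customer_name", "email"], how="all")
```

Note that each column gets its own rule — there's no single "fix missing data" operation.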

Removing duplicates

Duplicate rows seem straightforward, but there's a distinction Claude Code's quality report already caught: exact duplicates versus records that share some fields.

For exact duplicates (every column is identical):

Remove exact duplicate rows — keep the first occurrence and delete the rest

For near-duplicates, you need to be more specific about what counts as "the same record." Those seven rows with matching name and email but different order dates? Those are probably the same customer placing multiple orders — not duplicates at all.

Don't remove rows that have the same name and email but different order_date values. Those are separate orders from repeat customers.

If you're unsure, ask Claude Code to show you the candidates before removing anything:

Show me all rows that share the same email address. Don't remove anything yet — I want to see them first.

Review the list, then tell Claude Code which ones to keep.

Heads up: Deduplication is one place where Claude Code can make mistakes if you're not specific enough. "Remove duplicates" without clarification might remove legitimate records. Always specify what makes two rows "the same" — is it the same email? Same name and payment method? Every column matching?
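The distinction between exact duplicates and repeat customers is easy to see in code. A pandas sketch with invented rows for one customer:

```python
import pandas as pd

# Invented sample: three orders recorded for the same customer.
df = pd.DataFrame({
    "name": ["Ana", "Ana", "Ana"],
    "email": ["ana@x.com", "ana@x.com", "ana@x.com"],
    "order_date": ["2025-01-15", "2025-01-15", "2025-02-03"],
})

# Exact duplicates: every column matches. Keep the first occurrence.
df = df.drop_duplicates(keep="first")

# The two remaining rows share name and email but differ on order_date —
# separate orders from a repeat customer, so they both survive.
```

A careless rule like "same name and email means duplicate" would collapse those two orders into one, which is exactly the mistake the heads-up above warns about.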

Fixing date and format issues

Mixed date formats are a headache in spreadsheets, but they're a one-line fix with Claude Code:

Standardize all dates in the order_date column to YYYY-MM-DD format

Claude Code recognizes the three different formats in the file ("2025-01-15", "01/15/2025", "January 15, 2025") and converts them all to a consistent format.

For the unit_price column:

Standardize unit_price so every value has a dollar sign and two decimal places, like $29.99

The pattern is the same for any formatting task. Show Claude Code an example of the correct format, and ask it to apply that everywhere.
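Behind the scenes, this kind of normalization tries each known format until one fits. A sketch using Python's standard datetime module, covering the three date formats from this file (the price helper is an invented example of the same pattern):

```python
from datetime import datetime

# The three formats found in the order_date column.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"]

def to_iso(value: str) -> str:
    """Try each known format; return the date as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {value}")

def to_price(value: str) -> str:
    """Normalize a price to one dollar sign and two decimal places."""
    return f"${float(value.lstrip('$')):.2f}"
```

Values that match none of the known formats raise an error instead of being silently mangled — the code equivalent of asking Claude Code to flag anything it can't confidently convert.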

Previewing changes before they happen

For small files, you might be comfortable letting Claude Code clean everything at once. For larger files or high-stakes data, preview the changes first.

Show me what the first 20 rows would look like after all these cleaning steps, but don't save any changes yet

Claude Code shows you a before-and-after comparison so you can confirm the changes are correct. If something looks wrong, adjust your instructions before applying changes to the whole file.

You can also ask for a summary of what will change:

How many values will change if you standardize the region column? Show me the mapping you'll use.

This shows something like:

Proposed changes to region column:
"west" → "West" (41 rows)
"WEST" → "West" (12 rows)
"West Coast" → "West" (8 rows)
"W" → "West" (3 rows)
"Pacific" → "West" (2 rows)

Total: 66 values will change out of 2,847 rows

Now you know exactly what's going to happen before it happens.
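A preview like that is just a count of how many values the mapping would touch, computed without changing anything. A pandas sketch (sample rows and counts are invented, not the real file's):

```python
import pandas as pd

# Invented sample column and a mapping to preview.
df = pd.DataFrame({"region": ["West", "west", "west", "Pacific"]})
region_map = {"west": "West", "Pacific": "West"}

# Which rows would the mapping change, and how many of each variation?
affected = df["region"].isin(list(region_map))
would_change = int(affected.sum())
counts = df.loc[affected, "region"].value_counts()

# Nothing has been modified — this is a read-only dry run.
```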

Describing what "clean" looks like

The most effective cleaning prompts give Claude Code a picture of the finished product, not a list of problems to chase.

Compare these two approaches:

"Fix the inconsistencies in the region column." That's problem-focused. Claude Code has to decide what "fixed" means.

"The region column should contain exactly four values: East, West, North, South. Map all existing values to one of these four. Show me any values you're not sure how to map." That's outcome-focused. Claude Code knows exactly what you expect and will flag anything ambiguous instead of guessing.

For a full cleanup, you can describe the end state up front:

Clean this file so that:
- All dates are in YYYY-MM-DD format
- Region uses exactly: East, West, North, South
- Status uses exactly: Shipped, Pending, Cancelled
- unit_price values all have dollar signs and two decimal places
- No exact duplicate rows
- Rows missing both name and email are removed
- Save the result to customer_orders_clean.csv

Claude Code works through each rule, applies all the changes, and saves the result.

Verifying the cleanup

After cleaning, check the results:

Give me a data quality report on the cleaned file. Are there any remaining inconsistencies, duplicates, or formatting issues?

Claude Code runs the same quality checks on the cleaned file and tells you what's left.

You should also spot-check manually. Open the cleaned file in a spreadsheet and scan a few rows to confirm the changes look right. Check that the row count makes sense. If you started with 2,847 rows and removed 23 exact duplicates, you should have 2,824 rows.
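That row-count arithmetic is worth making a habit, and it's trivial to check mechanically. A small pandas sketch with invented order IDs:

```python
import pandas as pd

# Invented sample: four rows, one of which is an exact duplicate.
df = pd.DataFrame({"order_id": [101, 101, 102, 103]})

exact_dupes = int(df.duplicated().sum())   # rows identical to an earlier row
cleaned = df.drop_duplicates(keep="first")

# The cleaned row count should drop by exactly the duplicates removed.
assert len(cleaned) == len(df) - exact_dupes
```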

Heads up: Cleaning operations are where the verification habits from the previous page matter most. Claude Code is rewriting your data, not answering questions about it. A wrong answer is annoying. A wrong edit to your data file could mean sending reports based on corrupted numbers. Always check.

Common cleaning prompts

What you want → what to type:

Quality assessment → "Give me a data quality report on this file"
Standardize text values → "Standardize the [column] to use these values: [list]"
Fix dates → "Convert all dates in [column] to YYYY-MM-DD format"
Fill missing values → "Fill missing [column] values with [default or rule]"
Remove exact duplicates → "Remove exact duplicate rows, keep the first occurrence"
Preview changes → "Show me what the first 20 rows would look like after these changes"
Show unique values → "Show all unique values in [column] with counts"
Remove invalid rows → "Remove rows where [condition]"
Format numbers → "Standardize [column] to the format [example]"
Verify cleanup → "Run a data quality report on the cleaned file"

Next, you'll learn how to transform and combine data — converting between formats, merging files from different sources, and adding calculated columns like profit margins and percentages.
