
When Data Analysis Goes Wrong

How to recognize and prevent the most common data analysis errors in Claude Code

Wrong numbers that look right

Everything you've built in this module depends on the numbers being correct. And most of the time, they are. But when Claude Code gets something wrong in a data analysis, the result doesn't look wrong. There's no error message. No red warning. You get a clean number in a well-formatted sentence, and it looks exactly like a correct answer.

That's what makes data errors dangerous. A formula error in a spreadsheet at least shows you a broken cell. Claude Code hands you a confident, polished wrong answer and moves on.

This page covers the three most common ways data analysis goes wrong, and how to catch each one before the numbers reach anyone else.

Error 1: column misinterpretation

You saw this briefly on an earlier page: Claude Code guesses what each column means, and sometimes it guesses wrong. But the problem goes deeper than a single wrong answer.

When Claude Code misreads a column, every follow-up question in that session inherits the mistake. If it decides your value column is a customer satisfaction score instead of revenue, then your "top customers by revenue" list is actually sorted by satisfaction. Your "revenue by region" chart is actually satisfaction by region. Your summary report quotes satisfaction numbers labeled as revenue. The error compounds with every step.

Here's what makes this hard to catch: the shape of the results looks normal. You asked for the top 10 customers by revenue and got a list of 10 customers with numbers next to them. Nothing looks wrong unless you happen to know the actual revenue numbers.

The warning signs:

  • Numbers are in a different range than you expect (hundreds instead of thousands, or vice versa)
  • Rankings don't match your intuition ("Our biggest customer is ranked seventh?")
  • Totals don't match a number you've seen before in a spreadsheet or dashboard
  • A column that should have a wide range of values shows a narrow one (or the reverse)

How to catch it:

Ask Claude Code to show you the raw data behind any result that will go into a report:

Show me the actual rows for the top 5 customers in that list. Include all columns, not just the ones in the summary.

When you see the full rows, you can check whether the number Claude Code called "revenue" actually came from the revenue column.
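Behind the scenes, that request becomes a few lines of Python. Here's a rough sketch of the kind of script involved, assuming a hypothetical customer_orders.csv with customer_name and total columns (the same example file used in the data dictionary later on this page):

import pandas as pd

# Load the full file (hypothetical filename)
df = pd.read_csv("customer_orders.csv")

# Top 5 customers by the column being treated as revenue
top5 = df.groupby("customer_name")["total"].sum().nlargest(5)

# Show every column for those customers' rows, not just the summary
print(df[df["customer_name"].isin(top5.index)])

If the values printed under "total" don't look like revenue, the column was misread.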

Error 2: silent sampling

This one is sneaky. When a dataset is large, Claude Code sometimes analyzes only part of it and presents the results as if they cover the whole file.

There's no hard limit at work here. Claude Code can read files of any size on your machine. The problem is in the Python script it writes behind the scenes. Claude Code might write a script that reads the first 1,000 rows of a 50,000-row file, or samples every tenth row, or filters out rows with missing values, all without mentioning it. The analysis runs, the numbers come back, and nothing in the response says "based on a sample of 1,000 rows." It just says "the total revenue is $2.3 million."
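To make that concrete, here's what the difference looks like in pandas, sketched with a hypothetical filename. A single nrows argument is all it takes to silently truncate an analysis:

import pandas as pd

# Silent sampling: only the first 1,000 rows ever reach the analysis
sample = pd.read_csv("customer_orders.csv", nrows=1000)

# What you actually want: every row in the file
full = pd.read_csv("customer_orders.csv")

print(len(sample), "rows in the sample;", len(full), "rows in the full file")

Nothing in the sampled version fails or warns; it just computes totals over a fraction of the data.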

With smaller files (a few hundred or a few thousand rows), this rarely happens. The risk climbs with larger datasets: tens of thousands of rows or more.

The warning signs:

  • The total row count in the analysis doesn't match what you know the file contains
  • Totals are suspiciously round or lower than expected
  • "The data contains 1,000 rows" when you know you exported 47,000

How to catch it:

Before starting any analysis on a large file, ask Claude Code to confirm the scope:

How many rows are in this file? Confirm you're analyzing all of them, not a sample.

If Claude Code says it sampled or filtered rows, ask it to reprocess the full dataset:

Run the analysis again on every row in the file. Don't sample or skip any rows.

You can also add this as a standing instruction in your CLAUDE.md file (more on that in a moment).

Error 3: hallucinated statistics

This is the rarest of the three, but the one that can do the most damage. Claude Code sometimes generates statistics that don't come from your data at all.

This tends to happen in summary sections. Claude Code analyzes your data, produces accurate numbers for individual questions, and then writes a summary paragraph that includes a statistic it invented. The summary might say "customer retention improved by 12% year-over-year" when your data doesn't contain retention data at all.

There's a pattern worth knowing about fabricated numbers. Made-up figures tend to use round percentages (numbers ending in 0 or 5). They also tend to sound authoritative and specific in a way that discourages questioning: "Revenue increased by 23.4%" is less likely to be checked than "revenue increased." The false precision creates false confidence.

The warning signs:

  • A statistic in a summary that you didn't ask about
  • Numbers that seem too neat or too specific for your dataset
  • Claims about trends or patterns that go beyond what the data could show
  • References to metrics that aren't in your file (retention, satisfaction, NPS when your file only has sales data)

How to catch it:

For any number that will appear in a report or presentation, trace it back to the source:

Where did the 12% retention figure come from? Show me the specific rows and calculation.

If Claude Code can't point to specific rows and a clear calculation, the number isn't real.
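When a number is real, the trail back to the data is short. Here's a minimal pandas sketch of what that trace looks like, using the same hypothetical file:

import pandas as pd

df = pd.read_csv("customer_orders.csv")

# Step 1: does the claimed metric even exist as a column?
print(df.columns.tolist())

# Step 2: recompute any figure that does map to a real column
print("revenue total:", df["total"].sum())

If "retention" never appears in the column list, no calculation on this file could have produced a retention figure.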

The CLAUDE.md data dictionary

The best way to prevent all three errors is to tell Claude Code about your data before it starts guessing.

In Module 2, you learned about the CLAUDE.md file, a plain-text file that tells Claude Code about your project. For data work, the most useful thing you can put in a CLAUDE.md is a data dictionary: a description of your data files and what each column means, written in plain language.

Here's what that looks like. Create a file called CLAUDE.md in your data project folder with content like this:

# Data project

## Files
- customer_orders.csv: All orders from January 2024 through December 2025

## Column definitions for customer_orders.csv
- order_id: Unique identifier for each order (integer)
- customer_name: Buyer's full name
- order_date: Date the order was placed (format: YYYY-MM-DD)
- total: Dollar amount of the order — this is revenue
- quantity: Number of items in the order
- region: Sales territory (West, East, Central, South)
- status: Fulfillment status (Shipped, Pending, Cancelled)
- rep: Name of the sales representative

## Rules
- Always analyze all rows. Do not sample.
- When reporting revenue, use the "total" column and exclude Cancelled orders.
- Dates are in YYYY-MM-DD format.
- Verify row counts before and after any filtering.

Claude Code reads this file automatically at the start of every conversation in that folder. You write it once, and every session starts with Claude Code already knowing what your columns mean, what format your dates are in, and what rules to follow.

Column misinterpretation largely disappears because Claude Code no longer has to guess. The "always analyze all rows" rule prevents silent sampling. And explicit column definitions make it harder for Claude Code to invent metrics that aren't in your data.
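To see the rules in action: with the dictionary above in place, the script behind a revenue question should look roughly like this sketch, with the Cancelled filter and row-count checks built in (an illustration of what honoring the rules looks like, not the exact code Claude Code will write):

import pandas as pd

df = pd.read_csv("customer_orders.csv")

# Revenue per the dictionary: "total" column, Cancelled orders excluded
kept = df[df["status"] != "Cancelled"]
revenue = kept["total"].sum()

# Row counts before and after filtering, per the rules
print(len(df), "rows before filtering;", len(kept), "after; revenue:", revenue)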

You don't need to document every column. Focus on the ones that could be confused: columns with generic names like value, type, status, amount, or score. If a column name is self-explanatory (like customer_email), skip it.

Building a verification habit

These techniques work best as habits, not heroics.

You won't verify every number Claude Code gives you. That would defeat the purpose of using it. But verify any number before it leaves your desk. Before it goes into a report, a presentation, a Slack message, or a decision.

Here's a practical checklist:

  1. Before you start, add column descriptions to your CLAUDE.md (or describe them at the start of the session)
  2. During analysis, if a number surprises you, ask Claude Code to show the raw rows behind it
  3. Before sharing, ask "How many rows did you analyze?" and check one total against a known value
  4. In summaries, read every statistic and ask yourself whether that metric actually exists in your data

Most sessions only need step one. But the one time you catch a wrong number before someone makes a decision on it, you'll be glad you built the habit.

Heads up: These errors are not unique to Claude Code. Spreadsheet formulas reference wrong cells. Dashboard tools mismap columns. Analysts make copy-paste mistakes. The difference is that Claude Code is confident and articulate about its mistakes, which makes them easier to miss. The verification habits here apply to any tool that produces numbers for you.

Quick reference: verification prompts

What you want to check, and what to type:

  • Column interpretation: "Which column did you use for [metric]? Show me a sample of values from that column"
  • Full dataset scope: "How many rows are in this file? Confirm you're analyzing all of them"
  • Calculation breakdown: "Show me the exact calculation for [number]. Which rows and columns did you use?"
  • Statistic source: "Where did the [specific number] come from? Show me the data it's based on"
  • Reprocess without sampling: "Run that analysis again on every row. Don't sample or filter unless I ask"
  • Sanity check: "What's the total number of [items] and the sum of [column]? I want to verify against my records"

Next, you'll pull together everything from this module into a data analysis toolkit: a quick reference of the most useful prompts, a project folder setup, and guidance on when Claude Code is the right tool for data work versus when to reach for something else.
