How to Fix Broken CSV Files: 7 Common Problems and Solutions
CSV file not opening correctly? This guide covers the 7 most common CSV problems—encoding issues, wrong delimiters, unquoted commas, and more—with step-by-step fixes.
- Problem 1: Encoding Issues (UTF-8 vs ANSI vs UTF-16)
- Problem 2: Wrong Delimiter (Semicolons, Tabs, Pipes)
- Problem 3: Fields Containing Commas Without Quotes
- Problem 4: Inconsistent Line Endings (CRLF vs LF)
- Problem 5: BOM Character (Byte Order Mark)
- Problem 6: Duplicate Column Headers
- Problem 7: Ragged Rows (Inconsistent Column Count)
- Fixing Multiple Problems at Once
- Preventing CSV Problems
- Frequently Asked Questions
  - Why does Excel open my CSV with all data in one column?
  - My CSV looks fine in Notepad but breaks in my script. What is going on?
  - Can I fix a CSV without writing code?
CSV is supposed to be simple. It's just text with commas. But open any CSV in a tool you didn't generate it with, and you quickly discover that "just text with commas" is a spec flexible enough to cause serious pain.
The scenario plays out like this: a vendor sends you a data export. You open it in Excel, and column B contains what appears to be the entire record as one giant cell. Or the first column header is ï»¿name instead of name. Or you import it into your database and every row fails validation because the column count doesn't match. The file looks perfectly normal in Notepad.
This guide covers the seven problems that account for the vast majority of broken CSV files, with concrete fixes for each.
Problem 1: Encoding Issues (UTF-8 vs ANSI vs UTF-16)
What it looks like: Characters like é, ü, ñ, or 中 display as question marks, boxes, or garbled sequences like Ã© or â€œ. This is called mojibake: text decoded with the wrong encoding.
Why it happens: CSV files have no header that declares their encoding. When a file is saved in Windows-1252 (ANSI) and opened expecting UTF-8, the byte sequences for non-ASCII characters mean something different. The bytes 0xE9 0x74 0xE9 are "été" in Latin-1 but decode to garbage in UTF-8.
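The mismatch is easy to reproduce in a couple of lines. The same three bytes come back intact under Latin-1 but are not even valid UTF-8:

```python
raw = b"\xe9t\xe9"  # the bytes for "été" in Latin-1 / Windows-1252

# Decoded with the right encoding, the text survives
print(raw.decode("latin-1"))  # été

# Decoded as UTF-8, the same bytes are rejected outright:
# 0xE9 starts a multi-byte sequence, but 0x74 ('t') is not a
# valid continuation byte
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("UnicodeDecodeError:", e.reason)
```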
The fix:
In Python, read the file with the correct encoding and re-save as UTF-8:
import chardet

# First, detect the encoding
with open("contacts.csv", "rb") as f:
    raw = f.read()
detected = chardet.detect(raw)
print(detected)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}

# Then re-save as UTF-8
with open("contacts.csv", "r", encoding="ISO-8859-1") as f:
    content = f.read()
with open("contacts_utf8.csv", "w", encoding="utf-8", newline="") as f:
    f.write(content)
In Excel, instead of double-clicking the file, use Data → From Text/CSV and explicitly select the encoding (the "File Origin" dropdown) in the import dialog.
Tip: If you're generating CSV files programmatically, always specify encoding="utf-8-sig" in Python — the sig variant adds a BOM that helps Excel detect UTF-8 automatically.
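As a sketch of that pattern (the filename and rows here are invented for illustration):

```python
import csv

rows = [["id", "name"], ["1", "Renée"], ["2", "François"]]

# utf-8-sig prepends the BOM bytes EF BB BF, which Excel uses to
# recognize the file as UTF-8 instead of the legacy ANSI codepage
with open("export.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# The BOM really is there at the byte level
with open("export.csv", "rb") as f:
    print(f.read(3))  # b'\xef\xbb\xbf'
```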
Problem 2: Wrong Delimiter (Semicolons, Tabs, Pipes)
What it looks like: Your file opens as a single column with all the data jammed together, or you see values like John;Doe;john@example.com in the first column.
Why it happens: CSV stands for "comma-separated" but the actual delimiter varies by region and tool. European versions of Excel default to semicolons because commas are used as decimal separators in many European locales. Some database exports use tabs (.tsv) or pipes (|) to avoid conflicts with data that contains commas.
The fix:
In Python with pandas, specify the separator explicitly:
import pandas as pd
# Try semicolon
df = pd.read_csv("export.csv", sep=";")
# Or let pandas sniff the delimiter automatically
df = pd.read_csv("export.csv", sep=None, engine="python")
print(df.head())
In Excel, use Data → From Text/CSV and look for the "Delimiter" dropdown in the import wizard. Select the correct delimiter and the data will parse correctly.
Note: If you receive CSV files from multiple sources and can't predict the delimiter, Python's csv.Sniffer can detect it automatically from the first few lines.
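A quick sketch of sniffing: Sniffer inspects a sample of the text and guesses the dialect, which you can then pass straight to the reader:

```python
import csv
import io

sample = "id;name;city\n1;Maria;Austin\n2;James;New York\n"

# Restricting the candidate delimiters makes the guess more reliable
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
print(dialect.delimiter)  # ;

reader = csv.reader(io.StringIO(sample), dialect)
print(next(reader))  # ['id', 'name', 'city']
```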
Problem 3: Fields Containing Commas Without Quotes
What it looks like: A row that should have 5 columns suddenly has 6 or 7. Address fields like "Austin, TX" split across two columns.
Why it happens: The CSV spec says fields containing the delimiter character must be wrapped in double quotes. Many export tools don't do this correctly, so New York, NY appears literally in the file and gets split on the comma.
The raw file might look like:
id,name,city,email
1042,Maria Hernandez,Austin TX,maria@example.com
1043,James O'Brien,"New York, NY",james@example.com
Row 1043 is fine: the city is quoted correctly. Row 1042 avoided the issue only because its city contains no comma. If a tool exported New York, NY without the quotes, that row would split into five fields instead of four.
The fix:
In Python, use the csv module which handles RFC 4180 quoting correctly when reading:
import csv

with open("addresses.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    for row in reader:
        print(len(row), row)  # Correctly parsed rows
For fixing the file itself, load it with pandas (which handles most quoting issues), then re-export:
import csv
import pandas as pd

df = pd.read_csv("addresses.csv")
df.to_csv("addresses_fixed.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)
QUOTE_NONNUMERIC wraps every non-numeric field in quotes, eliminating ambiguity.
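You can see the effect directly with the csv module (writing to an in-memory buffer here just for illustration):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC)

# The int stays bare; both strings get quoted, comma or not
writer.writerow([1043, "James O'Brien", "New York, NY"])
print(buf.getvalue())  # 1043,"James O'Brien","New York, NY"
```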
Problem 4: Inconsistent Line Endings (CRLF vs LF)
What it looks like: Stray \r characters appear at the end of field values, imports fail with "unexpected end of file" errors, or a file with CR-only (classic Mac) line endings displays as one continuous line on Unix.
Why it happens: Windows uses \r\n (CRLF) as a line ending. Unix and macOS use \n (LF). Some tools are strict about which they expect. A file with mixed line endings — some CRLF, some LF — confuses most parsers.
The fix:
In Python, always open CSV files with newline="" and let the csv module handle line ending translation:
import csv

# The newline="" parameter is important: it prevents double-translation
with open("data.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    rows = list(reader)
To normalize line endings in a file:
with open("data.csv", "rb") as f:
    content = f.read()

# Replace CRLF with LF
content = content.replace(b"\r\n", b"\n")
# Replace any remaining standalone CR
content = content.replace(b"\r", b"\n")

with open("data_fixed.csv", "wb") as f:
    f.write(content)
Problem 5: BOM Character (Byte Order Mark)
What it looks like: The first column header is ï»¿id or \ufeffid instead of id. Queries like df["id"] fail with a KeyError even though you can see "id" in the column list.
Why it happens: Excel adds a BOM (Byte Order Mark — the bytes EF BB BF in UTF-8) to the beginning of CSV files it saves as "UTF-8". This is technically valid but breaks many parsers that don't strip the BOM. When you print the column names and they look right, but indexing fails, an invisible BOM is usually the culprit.
The fix:
import pandas as pd
# encoding="utf-8-sig" automatically strips the BOM on read
df = pd.read_csv("excel_export.csv", encoding="utf-8-sig")
print(df.columns[0]) # 'id', not '\ufeffid'
Or strip it manually from a file:
with open("excel_export.csv", "rb") as f:
    content = f.read()

if content.startswith(b"\xef\xbb\xbf"):
    content = content[3:]

with open("export_no_bom.csv", "wb") as f:
    f.write(content)
Problem 6: Duplicate Column Headers
What it looks like: Two or more columns have the same name. pandas silently renames them to name, name.1, name.2. SQL imports fail entirely. Your pivot tables and lookups return wrong data.
Why it happens: This typically happens when merging data from multiple exports, copying data between sheets, or when an export tool fails to deduplicate headers in a joined query.
The fix:
import csv
import pandas as pd

# pandas renames duplicates as it reads (email, email.1, ...), so
# detect them from the raw header line before loading
with open("merged_export.csv", newline="", encoding="utf-8") as f:
    header = next(csv.reader(f))
duplicates = sorted({name for name in header if header.count(name) > 1})
print(f"Duplicate columns: {duplicates}")

# Rename the mangled duplicates to meaningful names
df = pd.read_csv("merged_export.csv")
df = df.rename(columns={
    "email": "primary_email",
    "email.1": "secondary_email"
})
df.to_csv("merged_fixed.csv", index=False)
For systematic deduplication when you can't manually rename each one:
# Make all columns unique by appending a counter
cols = pd.Series(df.columns)
for dup in cols[cols.duplicated()].unique():
    cols[cols[cols == dup].index.values.tolist()] = [
        dup if i == 0 else f"{dup}_{i}"
        for i in range(sum(cols == dup))
    ]
df.columns = cols
Problem 7: Ragged Rows (Inconsistent Column Count)
What it looks like: Most rows have 8 columns but some have 7 or 9. SQL imports fail with "column count mismatch." pandas pads short rows with NaN and, by default, raises a ParserError on rows with extra fields.
Why it happens: Trailing commas (some tools add an extra delimiter at the end of every row, creating a phantom empty column), incorrectly quoted fields that contain unescaped newlines, or genuinely malformed exports where some rows are missing fields.
The fix:
First, diagnose the extent of the problem:
import csv

expected_cols = None
problem_rows = []

with open("data.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        if i == 0:
            expected_cols = len(row)
            print(f"Header has {expected_cols} columns")
        elif len(row) != expected_cols:
            problem_rows.append((i + 1, len(row), row))

print(f"\nFound {len(problem_rows)} ragged rows:")
for line_num, col_count, row in problem_rows[:10]:
    print(f"  Line {line_num}: {col_count} columns — {row}")
Then fix trailing-comma issues:
import csv

with open("data.csv", newline="", encoding="utf-8") as f_in, \
     open("data_fixed.csv", "w", newline="", encoding="utf-8") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:
        # Strip trailing empty fields
        while row and row[-1] == "":
            row.pop()
        writer.writerow(row)
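When the problem is not just trailing commas and you only need the well-formed rows, newer pandas versions (1.3+) can drop the bad lines for you on read; a small sketch with inline data:

```python
import io

import pandas as pd

data = "id,name,city\n1,Maria,Austin\n2,James,New York,extra\n3,Ana,Lisbon\n"

# on_bad_lines controls rows with too many fields:
# "error" (the default), "warn", or "skip"
df = pd.read_csv(io.StringIO(data), on_bad_lines="skip")
print(len(df))  # 2 — the 4-field row is dropped
```

Use this only when losing the offending rows is acceptable; for recoverable data, fix the file as shown above instead.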
Fixing Multiple Problems at Once
When a file has several issues simultaneously — encoding problems, a BOM, inconsistent delimiters, and ragged rows — fixing them one by one is tedious and error-prone. Our CSV Cleaner tool detects and fixes all of these automatically. Upload your file, and it will report every issue found (encoding, delimiter, duplicate headers, ragged rows, BOM presence) and let you apply fixes with a single click. You can also use our CSV to JSON converter to validate that the file parses correctly before using it downstream.
Preventing CSV Problems
The best fix is avoiding the problem in the first place. When generating CSV programmatically:
- Always use a proper CSV library. Never build CSV strings by hand with string concatenation. Use Python's csv module, pandas' to_csv, or a library in your language of choice. They handle quoting, escaping, and line endings correctly.
- Always specify encoding explicitly. Use UTF-8 everywhere. If the recipient uses Excel on Windows, use utf-8-sig to add a BOM that Excel recognizes.
- Validate on export. Read back the file you just wrote and check that row counts match. A one-line sanity check saves hours of debugging downstream.
- Document your delimiter. If you use a non-comma delimiter, say so in the filename (export_pipe_delimited.csv) or accompanying documentation. Don't make the recipient guess.
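The validate-on-export step can be a few lines; a sketch where the rows and filename are placeholders:

```python
import csv

rows = [["id", "name"], ["1", "Maria"], ["2", "James"]]

with open("out.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# Read back and confirm the shape before shipping the file
with open("out.csv", newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

assert len(read_back) == len(rows), "row count changed on round-trip"
assert all(len(r) == len(rows[0]) for r in read_back), "ragged row written"
```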
Frequently Asked Questions
Why does Excel open my CSV with all data in one column?
Almost certainly a delimiter mismatch. Your file uses a semicolon or tab as the delimiter, but Excel is trying to parse it as comma-separated. Use Data → From Text/CSV instead of double-clicking, and select the correct delimiter in the import wizard. Alternatively, use the CSV Cleaner tool to normalize the delimiter to commas before opening in Excel.
My CSV looks fine in Notepad but breaks in my script. What is going on?
The most likely cause is encoding. Notepad renders most encodings reasonably well, but it's not strict. Your script probably expects UTF-8, and the file may be Windows-1252 (ANSI). Open it with the chardet library in Python to detect the encoding, then re-save as UTF-8. A BOM character is another common culprit — use encoding="utf-8-sig" when reading with pandas.
Can I fix a CSV without writing code?
Yes. LibreOffice Calc gives you more import control than Excel and handles encoding detection better. Online tools like the CSV Cleaner handle the most common issues without requiring any code. For one-off fixes, a good text editor (VS Code, Sublime Text) can change encoding, find/replace line endings, and strip BOM characters through the GUI.