
JSON to CSV: How to Handle Special Characters and Encoding

Why special characters break CSV files, and how to fix every encoding problem: UTF-8 BOM for Excel, escaped Unicode in JSON, accented characters, emoji, null bytes, embedded newlines, and Windows-1252 issues.

You convert a JSON file to CSV. Python says it worked. You open the file in Excel and see Ã© instead of é, question marks instead of emoji, or a first column header that starts with ï»¿. The data is there — it's just garbled.

Encoding problems are the most common source of silent data corruption when converting JSON to CSV, and they're almost entirely caused by one mismatch: JSON is always UTF-8, but CSV has no defined encoding, and Excel doesn't default to UTF-8.

This guide covers every encoding issue you'll encounter, what causes it, and the exact fix.


The core problem: CSV has no encoding standard

JSON is specified to be UTF-8 (RFC 8259 requires it for JSON exchanged between systems). If you have a valid JSON file, it's UTF-8. That's guaranteed.

CSV has no such guarantee. RFC 4180 — the closest thing to a CSV standard — doesn't mention encoding at all. In practice:

  • Python writes UTF-8 by default
  • Excel on Windows defaults to Windows-1252 (a legacy Western European encoding)
  • Excel on Mac defaults to macOS Roman or UTF-8 depending on the version
  • Google Sheets reads and writes UTF-8

When you write a UTF-8 CSV and open it in Excel on Windows, Excel misreads the encoding, and any character outside plain ASCII — accented letters, em dashes, curly quotes, emoji — comes out wrong.
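
You can reproduce the mismatch in three lines of Python by encoding text as UTF-8 and deliberately decoding the bytes as Windows-1252 — a minimal sketch of what Excel does when it guesses wrong:

text = "café naïve €10"
utf8_bytes = text.encode("utf-8")

# Decoding UTF-8 bytes as Windows-1252 produces the classic mojibake
print(utf8_bytes.decode("cp1252"))
# cafÃ© naÃ¯ve â‚¬10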


Fix 1: Add a UTF-8 BOM for Excel

A BOM (byte order mark) is a three-byte sequence (\xef\xbb\xbf) at the start of the file that signals "this file is UTF-8." Excel uses it to detect encoding. Without it, Excel guesses — and guesses wrong.

Python standard library

import csv
import json

with open("data.json", encoding="utf-8") as f:
    records = json.load(f)

# utf-8-sig writes the BOM automatically
with open("output.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)

utf-8-sig is Python's built-in encoding name for UTF-8 with BOM. It writes the BOM before the first write and strips it automatically when reading — you never handle it manually.
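
You can confirm the read-side behavior with a quick round trip (a sketch against the output.csv written above):

# Reading back with utf-8-sig: the BOM never reaches your data
with open("output.csv", encoding="utf-8-sig") as f:
    header = f.readline()

print(repr(header[0]))  # first header character, not '\ufeff'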

pandas

import pandas as pd
import json

with open("data.json") as f:
    records = json.load(f)

df = pd.DataFrame(records)
df.to_csv("output.csv", index=False, encoding="utf-8-sig")

Verify it worked

Open the file in a hex editor or run:

python -c "
with open('output.csv', 'rb') as f:
    print(f.read(3).hex())
# Should print: efbbbf
"

ef bb bf = UTF-8 BOM. Excel will now read the file correctly.


Fix 2: Handle encoding on the read side

The BOM fix handles the write side. If you're reading a CSV that was written with a different encoding, you need to decode it correctly first.

Detect the encoding of an unknown file

# pip install chardet
import chardet

with open("mystery.csv", "rb") as f:
    result = chardet.detect(f.read())

print(result)
# {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

Then read with that encoding:

import pandas as pd

df = pd.read_csv("mystery.csv", encoding="iso-8859-1")

Common encodings you'll encounter:

| Encoding name | Also known as            | Where it comes from                      |
|---------------|--------------------------|------------------------------------------|
| utf-8         | UTF-8                    | Linux, Mac, modern tools                 |
| utf-8-sig     | UTF-8 with BOM           | Excel (when saved as "CSV UTF-8 (BOM)")  |
| cp1252        | Windows-1252             | Excel on Windows (default)               |
| iso-8859-1    | Latin-1                  | Old European systems, some databases     |
| cp1250        | Windows Central European | Excel in Central/Eastern Europe          |
| mac_roman     | macOS Roman              | Old Mac exports                          |
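
If chardet gives low confidence (common on short files), a pragmatic fallback is to try these encodings in order. This helper is a sketch, not part of any library; note that iso-8859-1 maps every byte to a character, so it always succeeds and belongs last:

def read_with_fallback(path, encodings=("utf-8-sig", "cp1252", "iso-8859-1")):
    # Try each encoding until one decodes without errors.
    # utf-8-sig also handles plain UTF-8 files without a BOM.
    for enc in encodings:
        try:
            with open(path, encoding=enc, newline="") as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path} with any of {encodings}")

text, used = read_with_fallback("mystery.csv")
print(f"Decoded with: {used}")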

Re-encoding a garbled CSV

If someone gave you a CSV in Windows-1252 and you need UTF-8:

with open("windows.csv", encoding="cp1252") as f_in, \
     open("utf8.csv", "w", encoding="utf-8-sig") as f_out:
    f_out.write(f_in.read())

Unicode escape sequences in JSON

JSON allows characters to be written as \uXXXX escape sequences: \u00e9 is the same as é. Python's json.load() decodes these automatically — but some tools don't.

import json

raw = '{"name": "Andr\\u00e9", "city": "M\\u00fcnchen"}'
data = json.loads(raw)
print(data)
# {'name': 'André', 'city': 'München'}

When you use json.load() or json.loads(), all \uXXXX sequences are decoded to their actual Unicode characters before you do anything else. The CSV you write will contain André, not the literal text Andr\u00e9.

The ensure_ascii flag

When writing JSON in Python, json.dumps() defaults to ensure_ascii=True, which escapes all non-ASCII characters back to \uXXXX:

import json

data = {"name": "André"}

# Default — escapes non-ASCII
json.dumps(data)
# '{"name": "Andr\\u00e9"}'

# With ensure_ascii=False — keeps Unicode as-is
json.dumps(data, ensure_ascii=False)
# '{"name": "André"}'

If your JSON pipeline outputs escaped sequences and you need readable characters in the CSV, either:

  1. Use json.load() to read (auto-decodes), or
  2. Decode manually with raw.encode("utf-8").decode("unicode_escape"), which is safe only when the string is pure ASCII apart from the escapes (see the sketch below).
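
A quick illustration of why the manual route is fragile (a sketch; json.loads remains the robust option):

import json

raw = '{"name": "Andr\\u00e9"}'

# Option 1: json.loads decodes the escape correctly
print(json.loads(raw)["name"])  # André

# Option 2 works while the input is ASCII plus escapes...
print(raw.encode("utf-8").decode("unicode_escape"))  # {"name": "André"}

# ...but corrupts real non-ASCII characters, because unicode_escape
# reads the intermediate bytes as Latin-1
mixed = 'café \\u00e9'
print(mixed.encode("utf-8").decode("unicode_escape"))  # cafÃ© é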

Special characters inside CSV fields

CSV uses commas, double quotes, and newlines as structural characters. When these appear inside field values, they need to be handled correctly — or your CSV parser reads the wrong number of columns.

Python's csv module does this correctly

import csv
import json

records = [
    {"name": 'O\'Brien, "The Boss"', "notes": "Line one\nLine two"},
    {"name": "Müller GmbH & Co.", "notes": 'Price: €1,234.56'}
]

with open("output.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "notes"])
    writer.writeheader()
    writer.writerows(records)

The csv module wraps any field containing commas, double quotes, or newlines in double quotes, and doubles any internal double quotes — per RFC 4180:

name,notes
"O'Brien, ""The Boss""","Line one
Line two"
Müller GmbH & Co.,"Price: €1,234.56"

The newline="" requirement

Always open CSV files with newline="". Without it, Python's universal newlines mode translates \r\n to \n on read — which breaks multiline fields — and adds extra \r on Windows on write. The csv module handles line endings itself; tell Python not to interfere.

# Correct
with open("output.csv", "w", newline="", encoding="utf-8-sig") as f:

# Wrong — breaks multiline fields and Windows line endings
with open("output.csv", "w", encoding="utf-8-sig") as f:

Emoji and multi-byte characters

Emoji (and many CJK characters) are encoded in UTF-8 as 3 or 4 bytes. They're perfectly valid in JSON strings and in UTF-8 CSV files. The only issue is Excel.

Excel can display emoji in cells but has inconsistent support for 4-byte UTF-8 characters (outside the Basic Multilingual Plane) in CSV import. The BOM fix (utf-8-sig) is still the right solution — it ensures Excel reads the file as UTF-8 at all.

import csv
import json

records = [
    {"product": "Coffee ☕", "rating": "⭐⭐⭐⭐⭐", "review": "Great! 🎉"},
]

with open("reviews.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["product", "rating", "review"])
    writer.writeheader()
    writer.writerows(records)

If a downstream system can't handle emoji and you need to strip them:

import re

def strip_emoji(text):
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map
        "\U0001F1E0-\U0001F1FF"  # flags
        "\U00002700-\U000027BF"  # dingbats
        "\U0001F900-\U0001F9FF"  # supplemental symbols
        "]+",
        flags=re.UNICODE,
    )
    return emoji_pattern.sub("", text)

cleaned = [{k: strip_emoji(v) if isinstance(v, str) else v
            for k, v in row.items()}
           for row in records]

Invisible characters that corrupt CSV

Some encoding problems are invisible in a text editor but break parsing or introduce garbage data.

Non-breaking space (U+00A0)

Looks identical to a regular space but is a different character (U+00A0, encoded as \xc2\xa0 in UTF-8). Causes lookup failures when you try to match "New York" against a value that contains a non-breaking space.

# Detect
text = "New York"
print(repr(text))  # 'New\xa0York'

# Fix — replace with regular space
cleaned = text.replace("\xa0", " ")

# Fix all whitespace variants at once
import unicodedata
cleaned = "".join(" " if unicodedata.category(c) == "Zs" else c for c in text)

Zero-width space (U+200B) and zero-width non-joiner (U+200C)

Often copied in from web content. Invisible but present.

def strip_zero_width(text):
    # U+200B ZWSP, U+200C ZWNJ, U+200D ZWJ, U+200E LRM, U+200F RLM
    zero_width = "\u200b\u200c\u200d\u200e\u200f"
    return "".join(c for c in text if c not in zero_width)

Note: U+FEFF is the BOM character. It's correct at the start of a file but garbage if it appears in the middle of content (which happens when files are concatenated carelessly).
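
A quick check for stray BOM characters (a sketch against the output.csv from earlier; utf-8-sig strips only a leading BOM, so anything found here is mid-file garbage):

with open("output.csv", encoding="utf-8-sig") as f:
    text = f.read()

stray = [i for i, c in enumerate(text) if c == "\ufeff"]
if stray:
    print(f"Stray BOM characters at offsets: {stray[:10]}")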

Null bytes (\x00)

Some databases and JSON exports contain null bytes in strings. CSV parsers handle this inconsistently — some truncate the string at the null byte, others fail entirely.

# Remove null bytes from all string fields
records = [
    {k: v.replace("\x00", "") if isinstance(v, str) else v
     for k, v in row.items()}
    for row in records
]

Soft hyphens and other format characters

import unicodedata

def clean_string(text):
    # Remove format characters (category Cf): soft hyphens (U+00AD),
    # zero-width characters, directional marks, and similar
    return "".join(c for c in text if unicodedata.category(c) != "Cf")

Normalizing Unicode (NFC vs NFD)

The same visible character can be encoded in multiple ways in Unicode. é can be a single code point (U+00E9, the NFC form) or a base e followed by a combining acute accent (U+0065 U+0301, the NFD form). The two look identical on screen, but Python's default string comparison works code point by code point, so they compare as unequal.

s1 = "\u00e9"      # é — precomposed (NFC)
s2 = "e\u0301"     # é — e + combining accent (NFD)
print(s1 == s2)    # False
print(len(s1))     # 1
print(len(s2))     # 2

Normalize to NFC before writing to CSV to ensure consistency:

import unicodedata

def normalize(text):
    return unicodedata.normalize("NFC", text) if isinstance(text, str) else text

records = [{k: normalize(v) for k, v in row.items()} for row in records]

NFC is the standard form for text in files and databases. NFD sometimes appears in JSON generated on macOS.
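
Python 3.8 added unicodedata.is_normalized for checking a string's form directly:

import unicodedata

s = "e\u0301"  # NFD-form é
print(unicodedata.is_normalized("NFC", s))  # False
print(unicodedata.is_normalized("NFD", s))  # True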


Right-to-left text (Arabic, Hebrew, Persian)

RTL text is valid UTF-8 and writes to CSV without issues. The display problem is in Excel: Excel doesn't set the cell or sheet direction based on content, so Arabic text appears left-aligned and visually reversed in an LTR sheet.

For programmatic use (piping into databases, APIs, data pipelines), RTL text in UTF-8 CSV is perfectly fine as-is. For human-readable Excel output, use .xlsx format with openpyxl instead — you can right-align cells and set a right-to-left reading order:

import pandas as pd
import openpyxl
from openpyxl.styles import Alignment

df = pd.DataFrame(records)
df.to_excel("output.xlsx", index=False, engine="openpyxl")

wb = openpyxl.load_workbook("output.xlsx")
ws = wb.active

for row in ws.iter_rows():
    for cell in row:
        if isinstance(cell.value, str) and any("\u0590" <= c <= "\u06ff" for c in cell.value):
            # Hebrew and Arabic blocks; readingOrder=2 sets right-to-left order
            cell.alignment = Alignment(horizontal="right", readingOrder=2)

wb.save("output.xlsx")

Line endings: \r\n vs \n

RFC 4180 specifies \r\n (CRLF) line endings for CSV. Python's csv module writes CRLF on all platforms when you use newline="". Without newline="", Python adds an extra \r on Windows, producing \r\r\n.

On Linux/Mac, most tools accept both. Excel on Windows expects CRLF.

# Check what line endings your CSV has
with open("output.csv", "rb") as f:
    content = f.read()

if b"\r\r\n" in content:
    print("Double carriage return — newline='' was missing on open()")
elif b"\r\n" in content:
    print("CRLF — correct for RFC 4180")
elif b"\n" in content:
    print("LF only — fine for most tools, may cause issues in old Excel")

Full pipeline: JSON to clean CSV

Combining all the fixes into a single production-ready function:

import csv
import json
import unicodedata

def clean_value(value):
    if not isinstance(value, str):
        return value
    # NFC normalization
    value = unicodedata.normalize("NFC", value)
    # Remove null bytes, zero-width characters, and stray BOM characters
    remove = "\x00\u200b\u200c\u200d\u200e\u200f\ufeff"
    value = "".join(c for c in value if c not in remove)
    # Normalize whitespace variants to regular space
    value = "".join(
        " " if unicodedata.category(c) == "Zs" else c
        for c in value
    )
    return value.strip()

def json_to_csv_clean(json_path, csv_path, encoding="utf-8-sig"):
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)

    if not records:
        raise ValueError("JSON file contains no records")

    # Clean all string values
    cleaned = [
        {k: clean_value(v) for k, v in row.items()}
        for row in records
    ]

    headers = list(cleaned[0].keys())

    with open(csv_path, "w", newline="", encoding=encoding) as f:
        writer = csv.DictWriter(f, fieldnames=headers, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(cleaned)

    print(f"Wrote {len(cleaned)} rows to {csv_path}")

# Usage
json_to_csv_clean("data.json", "output.csv")              # Excel-safe (BOM)
json_to_csv_clean("data.json", "output.csv", "utf-8")     # No BOM (for pipelines)

Debugging checklist

When your CSV has garbled characters, work through this list in order:

1. Check the file's actual encoding:

with open("problem.csv", "rb") as f:
    print(f.read(3).hex())  # efbbbf = UTF-8 BOM, else unknown

2. Use chardet to detect encoding:

import chardet
with open("problem.csv", "rb") as f:
    print(chardet.detect(f.read(10000)))

3. Read with the detected encoding:

df = pd.read_csv("problem.csv", encoding="cp1252")

4. Re-write with utf-8-sig:

df.to_csv("fixed.csv", index=False, encoding="utf-8-sig")

5. Verify in a text editor before opening in Excel — VS Code shows encoding in the status bar. If it shows UTF-8 correctly, the issue is Excel's import, not the file.

6. When importing into Excel, use Data → From Text/CSV instead of double-clicking. This lets you specify the encoding explicitly in the import wizard.


Frequently Asked Questions

Why does é become Ã© in Excel?

This is the classic UTF-8 misread as Windows-1252. The byte sequence for é in UTF-8 is \xc3\xa9. When misread as Windows-1252, \xc3 is Ã and \xa9 is ©, giving you Ã©. Fix: write with encoding="utf-8-sig" so Excel detects the encoding correctly.

Why is there a ï»¿ at the start of my first cell?

Your file has a UTF-8 BOM (\xef\xbb\xbf), but the tool that opened it doesn't recognize it. The BOM bytes are being displayed as the characters ï»¿ (misread as Latin-1). Fix on read: use encoding="utf-8-sig", which strips the BOM when reading. Fix at the source: write without a BOM (encoding="utf-8") if the BOM causes problems for your downstream tool.

My CSV looks correct in VS Code but wrong in Excel — why?

VS Code reads UTF-8. Excel on Windows defaults to Windows-1252. Add the BOM: encoding="utf-8-sig". The three-byte BOM prefix tells Excel to read UTF-8 instead of guessing.

Does the JSON to CSV converter handle encoding correctly?

Yes. The converter runs in the browser, which uses UTF-16 internally for strings. All JSON Unicode escapes (\uXXXX) are decoded automatically. The downloaded CSV uses UTF-8 with BOM by default so it opens correctly in Excel. There's no re-encoding step — the file is generated directly from JavaScript strings.

Should I use UTF-8 or UTF-8 with BOM?

Use UTF-8 with BOM (utf-8-sig) when the file will be opened in Excel by humans. Use plain UTF-8 (no BOM) when the file goes to a database, API, data pipeline, or any programmatic consumer — most parsers handle UTF-8 correctly, and some fail on the BOM prefix.
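
Converting between the two is a single read/write pair (a sketch; filenames are placeholders):

# Strip a leading BOM: read with utf-8-sig, write plain utf-8
with open("excel.csv", encoding="utf-8-sig", newline="") as f_in, \
     open("pipeline.csv", "w", newline="", encoding="utf-8") as f_out:
    f_out.write(f_in.read())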

How do I handle a JSON file that has mixed encodings in different fields?

This shouldn't happen — valid JSON is UTF-8 throughout, and json.load() will raise a UnicodeDecodeError if it encounters invalid UTF-8 bytes. If you get that error, the file isn't valid UTF-8 JSON. Try reading the file as bytes and decoding with errors="replace" to see where the bad bytes are:

with open("bad.json", "rb") as f:
    raw = f.read()

text = raw.decode("utf-8", errors="replace")
# Any bad bytes become � (U+FFFD) in the output — search for them
bad_positions = [i for i, c in enumerate(text) if c == "�"]
print(f"Bad bytes at positions: {bad_positions[:10]}")

Then either fix the source, or decode the file with its actual encoding (cp1252, latin-1, etc.) before parsing it as JSON.