JSON to CSV with pandas: The Complete Workflow
A deep dive into converting JSON to CSV using pandas — json_normalize, record_path, meta, type coercion, column renaming, filtering, and exporting large datasets. Real API examples throughout.
- When pandas is the right choice
- `pd.read_json` vs `pd.json_normalize`
- Flattening nested objects with `json_normalize`
- Controlling depth with `max_level`
- Expanding arrays with `record_path` and `meta`
- Expanding multiple levels
- Column operations before export
- Renaming columns
- Selecting and reordering columns
- Dropping internal or sensitive fields
- Type coercion before export
- Unix timestamps to human-readable dates
- Currency amounts (cents → dollars)
- Booleans to Yes/No for Excel users
- Filling missing values explicitly
- Writing the CSV
- `encoding="utf-8-sig"` vs `encoding="utf-8"`
- Processing large JSON files with pandas
- Strategy 1: Newline-delimited JSON (NDJSON) with `chunksize`
- Strategy 2: Standard JSON arrays with `ijson`
- A complete production-ready script
- Frequently Asked Questions
- What is the difference between `pd.read_json` and `pd.json_normalize`?
- Why does `json_normalize` produce NaN for some fields?
- How do I handle a JSON array that's nested two levels deep?
- Can pandas handle JSON files larger than RAM?
- Why do my float columns have too many decimal places in the CSV?
- How do I make the CSV open correctly in Excel?
pandas is the go-to library for JSON-to-CSV conversion when your data needs more than a straight dump. It handles nested structures, type coercion, column filtering, joins, and aggregations — all before you write a single CSV row.
This guide focuses on the pandas workflow end-to-end. For a broader comparison of the standard library, pandas, and online converters, see the complete guide to converting JSON to CSV in Python. For error handling, see the JSON to CSV errors guide.
When pandas is the right choice
Use pandas when you need to do anything beyond a straight dump:
- Your JSON is nested (objects inside objects, arrays of objects)
- You want to filter rows, rename columns, or reorder fields before export
- You're doing this inside a larger analysis pipeline
- Records have inconsistent keys and you want NaN instead of a crash
- You want type coercion — dates as dates, numbers as numbers
If your JSON is flat and you just want a file, the standard library approach is three lines and zero dependencies. Use that.
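For comparison, here is a minimal sketch of that zero-dependency path, assuming a flat array where every record has the same keys (the data is inlined here for illustration; in practice you'd read it from a file):

```python
import csv
import io
import json

# A flat JSON array -- every record has the same keys
raw = '[{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]'
rows = json.loads(raw)

# csv.DictWriter maps dict keys straight to CSV columns
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```

For file output, replace the `StringIO` with `open("users.csv", "w", newline="")`.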
pd.read_json vs pd.json_normalize
These two functions look similar but do different jobs.
pd.read_json is a file/string reader. It reads JSON and tries to infer a tabular structure. Use it for flat JSON arrays and newline-delimited JSON files.
pd.json_normalize is a structure transformer. It takes a Python list (that you've already loaded) and recursively flattens nested dicts into columns. Use it for anything with nesting.
import pandas as pd
import json
# pd.read_json — best for flat arrays and NDJSON files
df = pd.read_json("flat_users.json") # flat array
df = pd.read_json("data.ndjson", lines=True) # one JSON object per line
# pd.json_normalize — best for nested structures
with open("nested_payments.json") as f:
    data = json.load(f)
df = pd.json_normalize(data, sep=".") # nested API response
For nested JSON, always prefer json_normalize over pd.read_json. read_json leaves nested dicts as Python objects inside a single column rather than flattening them.
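A quick way to see the difference, using a tiny hypothetical record:

```python
import io
import json

import pandas as pd

records = [{"id": 1, "customer": {"name": "Ada", "email": "ada@example.com"}}]

# read_json keeps the nested dict as one object column named 'customer'
df_read = pd.read_json(io.StringIO(json.dumps(records)))

# json_normalize expands it into dot-notation columns
df_norm = pd.json_normalize(records, sep=".")
# df_norm columns: id, customer.name, customer.email
```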
Flattening nested objects with json_normalize
Here's a realistic Stripe payment intent response:
[
  {
    "id": "pi_3MtwBwLkdIwHu7ix28a3tqPD",
    "amount": 2000,
    "currency": "usd",
    "status": "succeeded",
    "created": 1680729600,
    "customer": {
      "id": "cus_NffrFeUfNV2Hib",
      "email": "jenny.rosen@example.com",
      "name": "Jenny Rosen"
    },
    "payment_method": {
      "type": "card",
      "card": {
        "brand": "visa",
        "last4": "4242",
        "exp_month": 8,
        "exp_year": 2026
      }
    },
    "metadata": {
      "order_id": "6735",
      "campaign": "summer_sale"
    }
  }
]
json_normalize with sep="." flattens this into:
import pandas as pd, json
with open("stripe_payments.json") as f:
    payments = json.load(f)
df = pd.json_normalize(payments, sep=".")
print(df.columns.tolist())
# ['id', 'amount', 'currency', 'status', 'created',
# 'customer.id', 'customer.email', 'customer.name',
# 'payment_method.type', 'payment_method.card.brand',
# 'payment_method.card.last4', 'payment_method.card.exp_month',
# 'payment_method.card.exp_year',
# 'metadata.order_id', 'metadata.campaign']
df.to_csv("payments.csv", index=False)
Every nested field becomes a dot-notation column name. Three levels of nesting flatten cleanly.
Controlling depth with max_level
Deeply nested structures can produce unwieldy column names. max_level stops flattening at a given depth:
df = pd.json_normalize(payments, sep=".", max_level=1)
# Flattens one level: customer.*, metadata.*, payment_method.type
# Stops there: payment_method.card stays as a dict column (it sits two levels deep)
print(df.columns.tolist())
# ['id', 'amount', 'currency', 'status', 'created',
#  'customer.id', 'customer.email', 'customer.name',
#  'payment_method.type', 'payment_method.card',
#  'metadata.order_id', 'metadata.campaign']
Columns that weren't fully flattened will contain dicts. Serialize them to JSON strings before writing to CSV:
for col in df.columns:
    if df[col].apply(lambda x: isinstance(x, (dict, list))).any():
        df[col] = df[col].apply(
            lambda x: json.dumps(x) if isinstance(x, (dict, list)) else x
        )
Expanding arrays with record_path and meta
When each JSON record contains an array of sub-records — like order line items, GitHub commits, or Jira issue comments — record_path expands each array element into its own row.
import pandas as pd
orders = [
    {
        "order_id": "ORD-1001",
        "customer_email": "alice@example.com",
        "order_date": "2024-11-15",
        "shipping": {"carrier": "UPS", "tracking": "1Z999AA10123456784"},
        "line_items": [
            {"sku": "LAPTOP-15", "qty": 1, "unit_price": 1299.00, "discount": 0.10},
            {"sku": "MOUSE-BT", "qty": 2, "unit_price": 49.99, "discount": 0.00},
        ],
    },
    {
        "order_id": "ORD-1002",
        "customer_email": "bob@example.com",
        "order_date": "2024-11-16",
        "shipping": {"carrier": "FedEx", "tracking": "449044304137821"},
        "line_items": [
            {"sku": "KEYBOARD-MEC", "qty": 1, "unit_price": 149.00, "discount": 0.05},
        ],
    },
]

df = pd.json_normalize(
    orders,
    record_path="line_items",
    meta=[
        "order_id",
        "customer_email",
        "order_date",
        ["shipping", "carrier"],  # nested meta field
    ],
    sep=".",
    errors="ignore",
)
print(df)
# sku qty unit_price discount order_id customer_email order_date shipping.carrier
# 0 LAPTOP-15 1 1299.00 0.10 ORD-1001 alice@example.com 2024-11-15 UPS
# 1 MOUSE-BT 2 49.99 0.00 ORD-1001 alice@example.com 2024-11-15 UPS
# 2 KEYBOARD-MEC 1 149.00 0.05 ORD-1002 bob@example.com 2024-11-16 FedEx
df.to_csv("order_line_items.csv", index=False)
Key points:
- `record_path="line_items"` expands the `line_items` array — each element becomes a row
- `meta` lists parent fields to repeat on every row
- Nested meta fields are passed as a list: `["shipping", "carrier"]` → column `shipping.carrier`
- `errors="ignore"` prevents crashes when a meta field is missing on some records
- Two orders with three total line items become three CSV rows
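To see `errors="ignore"` at work, here is a toy data set (hypothetical) where the second order has no `shipping` block at all; the meta column comes back as NaN instead of raising a KeyError:

```python
import pandas as pd

orders = [
    {"order_id": "A1", "shipping": {"carrier": "UPS"},
     "line_items": [{"sku": "X", "qty": 1}]},
    {"order_id": "A2",  # no shipping key at all
     "line_items": [{"sku": "Y", "qty": 2}]},
]

df = pd.json_normalize(
    orders,
    record_path="line_items",
    meta=["order_id", ["shipping", "carrier"]],
    sep=".",
    errors="ignore",  # missing meta field -> NaN instead of KeyError
)
# Columns: sku, qty, order_id, shipping.carrier
```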
Expanding multiple levels
When arrays are nested two levels deep, call json_normalize iteratively:
# GitHub API: repos → contributors
repos = [
    {
        "full_name": "tensorflow/tensorflow",
        "stargazers_count": 178000,
        "topics": ["machine-learning", "deep-learning"],
        "contributors": [
            {"login": "tensorflower-gardener", "contributions": 24891, "type": "Bot"},
            {"login": "yifeif", "contributions": 2156, "type": "User"},
        ],
    },
]

df = pd.json_normalize(
    repos,
    record_path="contributors",
    meta=["full_name", "stargazers_count"],
    sep=".",
)

# Handle the 'topics' primitive array — join to string
df_repos = pd.json_normalize(repos, sep=".")
df_repos["topics"] = df_repos["topics"].apply(
    lambda x: "|".join(x) if isinstance(x, list) else x
)
Column operations before export
Once your data is in a DataFrame, you have the full pandas API for cleanup before writing.
Renaming columns
Long dot-notation names can be hard to work with in spreadsheets:
# Rename specific columns
df = df.rename(columns={
    "customer.email": "email",
    "customer.name": "customer_name",
    "payment_method.card.last4": "card_last4",
    "payment_method.card.brand": "card_brand",
})
# Rename all columns: replace dots with underscores
df.columns = [col.replace(".", "_") for col in df.columns]
Selecting and reordering columns
# Keep only the columns you want, in the order you want
EXPORT_COLUMNS = [
    "id", "status", "amount", "currency",
    "customer_name", "email",
    "card_brand", "card_last4",
    "created",
]
df = df[EXPORT_COLUMNS] # Raises KeyError if a column is missing
# Or, safe version that silently skips missing columns:
df = df[[col for col in EXPORT_COLUMNS if col in df.columns]]
Dropping internal or sensitive fields
# Drop columns by prefix
metadata_cols = [col for col in df.columns if col.startswith("metadata.")]
df = df.drop(columns=metadata_cols)
# Drop columns with all-null values
df = df.dropna(axis=1, how="all")
Type coercion before export
Raw JSON values are often the wrong type for your downstream system. Fix types in the DataFrame before calling to_csv.
Unix timestamps to human-readable dates
# Convert Unix epoch (seconds) to an ISO 8601 string
df["created"] = pd.to_datetime(df["created"], unit="s", utc=True)
df["created"] = df["created"].dt.strftime("%Y-%m-%d %H:%M:%S UTC")
# Or keep just the date portion — run this on the raw epoch values,
# i.e. instead of (not after) the strftime conversion above
df["created_date"] = pd.to_datetime(df["created"], unit="s", utc=True).dt.date
Currency amounts (cents → dollars)
Stripe and many payment APIs store amounts in the smallest currency unit (cents for USD):
df["amount"] = df["amount"] / 100 # 2000 → 20.00
df["amount"] = df["amount"].round(2)
Booleans to Yes/No for Excel users
bool_columns = df.select_dtypes(include="bool").columns
df[bool_columns] = df[bool_columns].replace({True: "Yes", False: "No"})
Filling missing values explicitly
# Fill all NaN with empty string (matches to_csv's default rendering, but explicit)
df = df.fillna("")
# Or fill specific columns with meaningful defaults
df["discount"] = df["discount"].fillna(0)
df["status"] = df["status"].fillna("unknown")
Writing the CSV
df.to_csv(
    "output.csv",
    index=False,            # Don't write the row index as a column
    encoding="utf-8-sig",   # UTF-8 with BOM — Excel opens this correctly
    float_format="%.2f",    # Limit float precision (avoids 89.98999999999999)
    date_format="%Y-%m-%d",
)
encoding="utf-8-sig" vs encoding="utf-8"
utf-8-sig adds a UTF-8 BOM (byte order mark) at the start of the file. This tells Excel to open it as UTF-8 automatically. Without it, Excel on Windows may interpret the file as Latin-1 and garble non-ASCII characters (accented names, currency symbols, CJK characters).
For files that will be consumed by code (databases, APIs, other scripts), use plain utf-8. For files going to a human with Excel, use utf-8-sig.
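A quick way to confirm the difference is to inspect the first three bytes of each file; `utf-8-sig` prepends the BOM bytes `EF BB BF`:

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"name": ["Zoë"]})

with tempfile.TemporaryDirectory() as tmp:
    plain, excel = Path(tmp, "plain.csv"), Path(tmp, "excel.csv")
    df.to_csv(plain, index=False, encoding="utf-8")
    df.to_csv(excel, index=False, encoding="utf-8-sig")

    plain_head = plain.read_bytes()[:3]
    excel_head = excel.read_bytes()[:3]
    # utf-8-sig starts with the 3-byte BOM EF BB BF; plain utf-8 does not
```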
Processing large JSON files with pandas
json.load() reads the entire file into RAM. For files over ~500MB, this can exhaust memory.
Strategy 1: Newline-delimited JSON (NDJSON) with chunksize
If your file is newline-delimited (one JSON object per line — also called NDJSON or JSONL), pandas can read it in chunks:
import pandas as pd

output_path = "output.csv"
first_chunk = True
rows_written = 0

for chunk in pd.read_json("large_data.ndjson", lines=True, chunksize=50_000):
    # Apply the same transformations to each chunk
    chunk = pd.json_normalize(chunk.to_dict("records"), sep=".")
    chunk.to_csv(
        output_path,
        mode="w" if first_chunk else "a",
        header=first_chunk,
        index=False,
        encoding="utf-8",
    )
    first_chunk = False
    rows_written += len(chunk)
    print(f"Rows written so far: {rows_written}")

print("Done")
Strategy 2: Standard JSON arrays with ijson
For standard JSON arrays (not newline-delimited), use ijson to stream records:
# pip install ijson
import ijson
import pandas as pd

CHUNK_SIZE = 10_000

def stream_records(filepath, prefix="item"):
    with open(filepath, "rb") as f:
        yield from ijson.items(f, prefix)

def process_chunk(records):
    return pd.json_normalize(records, sep=".")

output_path = "output.csv"
first_chunk = True
buffer = []

for record in stream_records("large_data.json"):
    buffer.append(record)
    if len(buffer) >= CHUNK_SIZE:
        df_chunk = process_chunk(buffer)
        df_chunk.to_csv(
            output_path,
            mode="w" if first_chunk else "a",
            header=first_chunk,
            index=False,
        )
        first_chunk = False
        buffer = []

# Flush remaining records (header still needed if the file was under one chunk)
if buffer:
    df_chunk = process_chunk(buffer)
    df_chunk.to_csv(
        output_path,
        mode="w" if first_chunk else "a",
        header=first_chunk,
        index=False,
    )

print("Done")
A complete production-ready script
This combines everything above into a script you can adapt for real API data:
import pandas as pd
import json

def json_to_csv(
    input_path: str,
    output_path: str,
    record_path: str | None = None,
    meta_fields: list | None = None,
    rename_map: dict | None = None,
    export_columns: list | None = None,
    timestamp_columns: list | None = None,
    drop_prefixes: list | None = None,
) -> int:
    """
    Convert a JSON file to CSV with optional flattening, expansion, and cleanup.
    Returns the number of rows written.
    """
    with open(input_path, encoding="utf-8-sig") as f:
        data = json.load(f)

    # Unwrap common API envelope patterns
    if isinstance(data, dict):
        for key in ("data", "items", "results", "records"):
            if key in data and isinstance(data[key], list):
                data = data[key]
                break

    if not isinstance(data, list):
        raise ValueError(f"Expected a list, got {type(data).__name__}")

    # Flatten / expand
    kwargs = {"sep": ".", "errors": "ignore"}
    if record_path:
        kwargs["record_path"] = record_path
    if meta_fields:
        kwargs["meta"] = meta_fields
    df = pd.json_normalize(data, **kwargs)

    # Drop prefixes (e.g. internal metadata)
    if drop_prefixes:
        drop_cols = [c for c in df.columns if any(c.startswith(p) for p in drop_prefixes)]
        df = df.drop(columns=drop_cols)

    # Convert timestamps
    if timestamp_columns:
        for col in timestamp_columns:
            if col in df.columns:
                df[col] = (
                    pd.to_datetime(df[col], unit="s", utc=True)
                    .dt.strftime("%Y-%m-%d %H:%M:%S UTC")
                )

    # Serialize any remaining nested objects
    for col in df.columns:
        if df[col].apply(lambda x: isinstance(x, (dict, list))).any():
            df[col] = df[col].apply(
                lambda x: json.dumps(x) if isinstance(x, (dict, list)) else x
            )

    # Rename columns
    if rename_map:
        df = df.rename(columns=rename_map)

    # Select and order columns
    if export_columns:
        df = df[[c for c in export_columns if c in df.columns]]

    df = df.fillna("")
    df.to_csv(output_path, index=False, encoding="utf-8-sig")
    return len(df)
# Usage — Stripe payments
rows = json_to_csv(
    input_path="stripe_payments.json",
    output_path="stripe_export.csv",
    rename_map={
        "customer.email": "email",
        "customer.name": "name",
        "payment_method.card.brand": "card_brand",
        "payment_method.card.last4": "card_last4",
    },
    export_columns=["id", "status", "amount", "currency", "name", "email",
                    "card_brand", "card_last4", "created"],
    timestamp_columns=["created"],
    drop_prefixes=["metadata."],
)
print(f"Exported {rows} rows")
Frequently Asked Questions
What is the difference between pd.read_json and pd.json_normalize?
read_json reads a file or string and returns a DataFrame. It works well for flat JSON arrays and NDJSON files. json_normalize takes a Python list of dicts and recursively flattens nested structures into columns. For nested JSON, always use json_normalize — read_json will leave nested objects as dict columns rather than expanding them.
Why does json_normalize produce NaN for some fields?
Fields that don't exist in every record come back as NaN. This is correct behavior — the record simply didn't have that field. Use df.fillna("") to convert all NaN to empty strings before writing to CSV, or use df["column"].fillna("default") for specific columns.
How do I handle a JSON array that's nested two levels deep?
Use nested json_normalize calls. First expand the outer array with record_path, then if the expanded rows still contain nested objects, run json_normalize again on that DataFrame using df.to_dict("records"). See the "Expanding multiple levels" section above.
Can pandas handle JSON files larger than RAM?
Not directly — json.load() reads everything at once. For large files, either convert to NDJSON first and use pd.read_json(..., lines=True, chunksize=N), or stream records with ijson and process in batches. See the "Processing large JSON files" section above.
Why do my float columns have too many decimal places in the CSV?
By default, pandas writes floats with full IEEE 754 precision, which gives values like 89.98999999999999. Pass float_format="%.2f" to to_csv to limit to 2 decimal places, or round the columns explicitly with df[col] = df[col].round(2) before export.
How do I make the CSV open correctly in Excel?
Two things: (1) use encoding="utf-8-sig" so Excel recognizes the UTF-8 encoding instead of defaulting to Latin-1; (2) if your locale uses ; as the list separator, you may need sep=";" in to_csv. Excel uses the Windows locale's list separator when deciding whether to split CSV columns automatically.
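For example, forcing a semicolon separator (the column names here are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Zoë"], "amount": [20.0]})

# Semicolon separator for locales where Excel splits columns on ';'
csv_text = df.to_csv(sep=";", index=False)
# When writing to a file destined for Excel, add encoding="utf-8-sig" as well
```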