
JSON to CSV with pandas: The Complete Workflow

A deep dive into converting JSON to CSV using pandas — json_normalize, record_path, meta, type coercion, column renaming, filtering, and exporting large datasets. Real API examples throughout.

pandas is the standard library for JSON-to-CSV conversion when your data needs more than a straight dump. It handles nested structures, type coercion, column filtering, joins, and aggregations — all before you write a single CSV row.

This guide focuses on the pandas workflow end-to-end. For a broader comparison of the standard library, pandas, and online converters, see the complete guide to converting JSON to CSV in Python. For error handling, see the JSON to CSV errors guide.

When pandas is the right choice

Use pandas when you need to do anything beyond a straight dump:

  • Your JSON is nested (objects inside objects, arrays of objects)
  • You want to filter rows, rename columns, or reorder fields before export
  • You're doing this inside a larger analysis pipeline
  • Records have inconsistent keys and you want NaN instead of a crash
  • You want type coercion — dates as dates, numbers as numbers

If your JSON is flat and you just want a file, the standard library approach is three lines and zero dependencies. Use that.
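For reference, that standard-library path is just json plus csv.DictWriter. A minimal sketch (the payload here is inline for illustration; in practice it would come from json.load on a file):

```python
import csv
import json

# A flat JSON array with consistent keys (illustrative payload)
raw = '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]'
rows = json.loads(raw)

with open("users.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```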

pd.read_json vs pd.json_normalize

These two functions look similar but do different jobs.

pd.read_json is a file/string reader. It reads JSON and tries to infer a tabular structure. Use it for flat JSON arrays and newline-delimited JSON files.

pd.json_normalize is a structure transformer. It takes a Python object you've already loaded (a dict or a list of dicts) and recursively flattens nested dicts into columns. Use it for anything with nesting.

import pandas as pd
import json

# pd.read_json — best for flat arrays and NDJSON files
df = pd.read_json("flat_users.json")                          # flat array
df = pd.read_json("data.ndjson", lines=True)                  # one JSON object per line

# pd.json_normalize — best for nested structures
with open("nested_payments.json") as f:
    data = json.load(f)
df = pd.json_normalize(data, sep=".")                         # nested API response

For nested JSON, always prefer json_normalize over pd.read_json. read_json leaves nested dicts as Python dict objects inside cells rather than flattening them into columns.
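A minimal contrast, using a hypothetical one-record payload:

```python
import io
import json

import pandas as pd

nested = '[{"id": 1, "customer": {"email": "a@example.com"}}]'

# read_json keeps the nested object as a single dict-valued column
df_read = pd.read_json(io.StringIO(nested))
print(df_read["customer"].iloc[0])  # {'email': 'a@example.com'}

# json_normalize expands it into a dot-notation column
df_norm = pd.json_normalize(json.loads(nested), sep=".")
print(df_norm.columns.tolist())     # ['id', 'customer.email']
```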

Flattening nested objects with json_normalize

Here's a realistic Stripe payment intent response:

[
  {
    "id": "pi_3MtwBwLkdIwHu7ix28a3tqPD",
    "amount": 2000,
    "currency": "usd",
    "status": "succeeded",
    "created": 1680729600,
    "customer": {
      "id": "cus_NffrFeUfNV2Hib",
      "email": "jenny.rosen@example.com",
      "name": "Jenny Rosen"
    },
    "payment_method": {
      "type": "card",
      "card": {
        "brand": "visa",
        "last4": "4242",
        "exp_month": 8,
        "exp_year": 2026
      }
    },
    "metadata": {
      "order_id": "6735",
      "campaign": "summer_sale"
    }
  }
]

json_normalize with sep="." flattens this into:

import pandas as pd, json

with open("stripe_payments.json") as f:
    payments = json.load(f)

df = pd.json_normalize(payments, sep=".")
print(df.columns.tolist())
# ['id', 'amount', 'currency', 'status', 'created',
#  'customer.id', 'customer.email', 'customer.name',
#  'payment_method.type', 'payment_method.card.brand',
#  'payment_method.card.last4', 'payment_method.card.exp_month',
#  'payment_method.card.exp_year',
#  'metadata.order_id', 'metadata.campaign']

df.to_csv("payments.csv", index=False)

Every nested field becomes a dot-notation column name. Three levels of nesting flatten cleanly.

Controlling depth with max_level

Deeply nested structures can produce unwieldy column names. max_level stops flattening at a given depth:

df = pd.json_normalize(payments, sep=".", max_level=1)
# Flattens exactly one level: customer.* and metadata.* expand fully,
# but the deeper card object survives as a dict column
print(df.columns.tolist())
# ['id', 'amount', 'currency', 'status', 'created',
#  'customer.id', 'customer.email', 'customer.name',
#  'payment_method.type', 'payment_method.card',
#  'metadata.order_id', 'metadata.campaign']

Columns that weren't fully flattened will contain dicts. Serialize them to JSON strings before writing to CSV:

for col in df.columns:
    if df[col].apply(lambda x: isinstance(x, (dict, list))).any():
        df[col] = df[col].apply(
            lambda x: json.dumps(x) if isinstance(x, (dict, list)) else x
        )

Expanding arrays with record_path and meta

When each JSON record contains an array of sub-records — like order line items, GitHub commits, or Jira issue comments — record_path expands each array element into its own row.

import pandas as pd

orders = [
    {
        "order_id": "ORD-1001",
        "customer_email": "alice@example.com",
        "order_date": "2024-11-15",
        "shipping": {"carrier": "UPS", "tracking": "1Z999AA10123456784"},
        "line_items": [
            {"sku": "LAPTOP-15", "qty": 1, "unit_price": 1299.00, "discount": 0.10},
            {"sku": "MOUSE-BT",  "qty": 2, "unit_price": 49.99,   "discount": 0.00},
        ],
    },
    {
        "order_id": "ORD-1002",
        "customer_email": "bob@example.com",
        "order_date": "2024-11-16",
        "shipping": {"carrier": "FedEx", "tracking": "449044304137821"},
        "line_items": [
            {"sku": "KEYBOARD-MEC", "qty": 1, "unit_price": 149.00, "discount": 0.05},
        ],
    },
]

df = pd.json_normalize(
    orders,
    record_path="line_items",
    meta=[
        "order_id",
        "customer_email",
        "order_date",
        ["shipping", "carrier"],   # nested meta field
    ],
    sep=".",
    errors="ignore",
)

print(df)
#            sku  qty  unit_price  discount  order_id       customer_email  order_date shipping.carrier
# 0    LAPTOP-15    1     1299.00      0.10  ORD-1001  alice@example.com  2024-11-15              UPS
# 1     MOUSE-BT    2       49.99      0.00  ORD-1001  alice@example.com  2024-11-15              UPS
# 2  KEYBOARD-MEC    1      149.00      0.05  ORD-1002    bob@example.com  2024-11-16            FedEx

df.to_csv("order_line_items.csv", index=False)

Key points:

  • record_path="line_items" expands the line_items array — each element becomes a row
  • meta lists parent fields to repeat on every row
  • Nested meta fields are passed as a list: ["shipping", "carrier"] → column shipping.carrier
  • errors="ignore" prevents crashes when a meta field is missing on some records
  • Two orders with three total line items become three CSV rows

Expanding multiple levels

When arrays are nested two levels deep, call json_normalize iteratively:

# GitHub API: repos → contributors
repos = [
    {
        "full_name": "tensorflow/tensorflow",
        "stargazers_count": 178000,
        "topics": ["machine-learning", "deep-learning"],
        "contributors": [
            {"login": "tensorflower-gardener", "contributions": 24891, "type": "Bot"},
            {"login": "yifeif",                "contributions": 2156,  "type": "User"},
        ],
    },
]

df = pd.json_normalize(
    repos,
    record_path="contributors",
    meta=["full_name", "stargazers_count"],
    sep=".",
)

# Handle the 'topics' primitive array — join to string
df_repos = pd.json_normalize(repos, sep=".")
df_repos["topics"] = df_repos["topics"].apply(
    lambda x: "|".join(x) if isinstance(x, list) else x
)
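For a true two-level expansion, run a second pass over the first pass's rows. A sketch with hypothetical data (repos containing issues containing comments):

```python
import pandas as pd

repos = [
    {
        "full_name": "octo/demo",
        "issues": [
            {
                "number": 7,
                "comments": [
                    {"author": "alice", "body": "First!"},
                    {"author": "bob",   "body": "Agreed."},
                ],
            },
        ],
    },
]

# Pass 1: expand the outer array (issues), carrying repo metadata
issues = pd.json_normalize(repos, record_path="issues", meta=["full_name"], sep=".")

# Pass 2: expand the inner array (comments), carrying issue metadata
comments = pd.json_normalize(
    issues.to_dict("records"),
    record_path="comments",
    meta=["full_name", "number"],
    sep=".",
)
print(comments.columns.tolist())
```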

Column operations before export

Once your data is in a DataFrame, you have the full pandas API for cleanup before writing.

Renaming columns

Long dot-notation names can be hard to work with in spreadsheets:

# Rename specific columns
df = df.rename(columns={
    "customer.email": "email",
    "customer.name": "customer_name",
    "payment_method.card.last4": "card_last4",
    "payment_method.card.brand": "card_brand",
})

# Rename all columns: replace dots with underscores
df.columns = [col.replace(".", "_") for col in df.columns]

Selecting and reordering columns

# Keep only the columns you want, in the order you want
EXPORT_COLUMNS = [
    "id", "status", "amount", "currency",
    "customer_name", "email",
    "card_brand", "card_last4",
    "created",
]

df = df[EXPORT_COLUMNS]  # Raises KeyError if a column is missing
# Or, safe version that silently skips missing columns:
df = df[[col for col in EXPORT_COLUMNS if col in df.columns]]

Dropping internal or sensitive fields

# Drop columns by prefix
metadata_cols = [col for col in df.columns if col.startswith("metadata.")]
df = df.drop(columns=metadata_cols)

# Drop columns with all-null values
df = df.dropna(axis=1, how="all")

Type coercion before export

Raw JSON values are often the wrong type for your downstream system. Fix types in the DataFrame before calling to_csv.

Unix timestamps to human-readable dates

# Convert Unix epoch (seconds) to ISO 8601 string
df["created"] = pd.to_datetime(df["created"], unit="s", utc=True)
df["created"] = df["created"].dt.strftime("%Y-%m-%d %H:%M:%S UTC")

# Or just the date portion
df["created_date"] = pd.to_datetime(df["created"], unit="s", utc=True).dt.date

Currency amounts (cents → dollars)

Stripe and many payment APIs store amounts in the smallest currency unit (cents for USD):

df["amount"] = df["amount"] / 100  # 2000 → 20.00
df["amount"] = df["amount"].round(2)

Booleans to Yes/No for Excel users

bool_columns = df.select_dtypes(include="bool").columns
df[bool_columns] = df[bool_columns].replace({True: "Yes", False: "No"})

Filling missing values explicitly

# Fill all NaN with empty string (to_csv writes NaN as empty by default;
# this just makes it explicit)
df = df.fillna("")

# Or fill specific columns with meaningful defaults
df["discount"] = df["discount"].fillna(0)
df["status"]   = df["status"].fillna("unknown")

Writing the CSV

df.to_csv(
    "output.csv",
    index=False,          # Don't write the row index as a column
    encoding="utf-8-sig", # UTF-8 with BOM — Excel opens this correctly
    float_format="%.2f",  # Limit float precision (avoids 89.98999999999999)
    date_format="%Y-%m-%d",
)

encoding="utf-8-sig" vs encoding="utf-8"

utf-8-sig adds a UTF-8 BOM (byte order mark) at the start of the file. This tells Excel to open it as UTF-8 automatically. Without it, Excel on Windows may interpret the file as Latin-1 and garble non-ASCII characters (accented names, currency symbols, CJK characters).

For files that will be consumed by code (databases, APIs, other scripts), use plain utf-8. For files going to a human with Excel, use utf-8-sig.
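You can confirm the BOM is actually there by reading the first three bytes back (file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"name": ["José"], "amount": [20.0]})
df.to_csv("excel_export.csv", index=False, encoding="utf-8-sig")

with open("excel_export.csv", "rb") as f:
    print(f.read(3))  # b'\xef\xbb\xbf', the UTF-8 BOM Excel looks for
```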

Processing large JSON files with pandas

json.load() reads the entire file into RAM. For files over ~500MB, this can exhaust memory.

Strategy 1: Newline-delimited JSON (NDJSON) with chunksize

If your file is newline-delimited (one JSON object per line — also called NDJSON or JSONL), pandas can read it in chunks:

import pandas as pd

output_path = "output.csv"
first_chunk = True
total_rows = 0

for chunk in pd.read_json("large_data.ndjson", lines=True, chunksize=50_000):
    # Apply the same transformations to each chunk
    chunk = pd.json_normalize(chunk.to_dict("records"), sep=".")
    chunk.to_csv(
        output_path,
        mode="w" if first_chunk else "a",
        header=first_chunk,
        index=False,
        encoding="utf-8",
    )
    first_chunk = False
    total_rows += len(chunk)
    print(f"Processed chunk: {total_rows:,} rows written so far")

print("Done")

Strategy 2: Standard JSON arrays with ijson

For standard JSON arrays (not newline-delimited), use ijson to stream records:

# pip install ijson
import ijson
import pandas as pd

CHUNK_SIZE = 10_000

def stream_records(filepath, prefix="item"):
    with open(filepath, "rb") as f:
        yield from ijson.items(f, prefix)

def process_chunk(records):
    return pd.json_normalize(records, sep=".")

output_path = "output.csv"
first_chunk = True
buffer = []

for record in stream_records("large_data.json"):
    buffer.append(record)
    if len(buffer) >= CHUNK_SIZE:
        df_chunk = process_chunk(buffer)
        df_chunk.to_csv(
            output_path,
            mode="w" if first_chunk else "a",
            header=first_chunk,
            index=False,
        )
        first_chunk = False
        buffer = []

# Flush remaining records
if buffer:
    df_chunk = process_chunk(buffer)
    df_chunk.to_csv(output_path, mode="a", header=False, index=False)

print("Done")

A complete production-ready script

This combines everything above into a script you can adapt for real API data:

import pandas as pd
import json


def json_to_csv(
    input_path: str,
    output_path: str,
    record_path: str | None = None,
    meta_fields: list | None = None,
    rename_map: dict | None = None,
    export_columns: list | None = None,
    timestamp_columns: list | None = None,
    drop_prefixes: list | None = None,
) -> int:
    """
    Convert a JSON file to CSV with optional flattening, expansion, and cleanup.
    Returns the number of rows written.
    """
    with open(input_path, encoding="utf-8-sig") as f:
        data = json.load(f)

    # Unwrap common API envelope patterns
    if isinstance(data, dict):
        for key in ("data", "items", "results", "records"):
            if key in data and isinstance(data[key], list):
                data = data[key]
                break

    if not isinstance(data, list):
        raise ValueError(f"Expected a list, got {type(data).__name__}")

    # Flatten / expand
    kwargs = {"sep": ".", "errors": "ignore"}
    if record_path:
        kwargs["record_path"] = record_path
        if meta_fields:
            kwargs["meta"] = meta_fields
    df = pd.json_normalize(data, **kwargs)

    # Drop prefixes (e.g. internal metadata)
    if drop_prefixes:
        drop_cols = [c for c in df.columns if any(c.startswith(p) for p in drop_prefixes)]
        df = df.drop(columns=drop_cols)

    # Convert timestamps
    if timestamp_columns:
        for col in timestamp_columns:
            if col in df.columns:
                df[col] = (
                    pd.to_datetime(df[col], unit="s", utc=True)
                    .dt.strftime("%Y-%m-%d %H:%M:%S UTC")
                )

    # Serialize any remaining nested objects
    for col in df.columns:
        if df[col].apply(lambda x: isinstance(x, (dict, list))).any():
            df[col] = df[col].apply(
                lambda x: json.dumps(x) if isinstance(x, (dict, list)) else x
            )

    # Rename columns
    if rename_map:
        df = df.rename(columns=rename_map)

    # Select and order columns
    if export_columns:
        df = df[[c for c in export_columns if c in df.columns]]

    df = df.fillna("")

    df.to_csv(output_path, index=False, encoding="utf-8-sig")
    return len(df)


# Usage — Stripe payments
rows = json_to_csv(
    input_path="stripe_payments.json",
    output_path="stripe_export.csv",
    rename_map={
        "customer.email": "email",
        "customer.name": "name",
        "payment_method.card.brand": "card_brand",
        "payment_method.card.last4": "card_last4",
    },
    export_columns=["id", "status", "amount", "currency", "name", "email", "card_brand", "card_last4", "created"],
    timestamp_columns=["created"],
    drop_prefixes=["metadata."],
)
print(f"Exported {rows} rows")

Frequently Asked Questions

What is the difference between pd.read_json and pd.json_normalize?

read_json reads a file or string and returns a DataFrame. It works well for flat JSON arrays and NDJSON files. json_normalize takes a Python list of dicts and recursively flattens nested structures into columns. For nested JSON, always use json_normalize; read_json will leave nested objects as dict columns rather than expanding them.

Why does json_normalize produce NaN for some fields?

Fields that don't exist in every record come back as NaN. This is correct behavior — the record simply didn't have that field. Use df.fillna("") to convert all NaN to empty strings before writing to CSV, or use df["column"].fillna("default") for specific columns.
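A two-record sketch of the behavior:

```python
import pandas as pd

# Second record has a field the first one lacks
records = [{"sku": "A-1"}, {"sku": "B-2", "discount": 0.1}]
df = pd.json_normalize(records)
print(df["discount"].tolist())  # [nan, 0.1]

# Fill the gap with a meaningful default before export
df["discount"] = df["discount"].fillna(0)
print(df["discount"].tolist())  # [0.0, 0.1]
```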

How do I handle a JSON array that's nested two levels deep?

Use nested json_normalize calls. First expand the outer array with record_path, then if the expanded rows still contain nested objects, run json_normalize again on that DataFrame using df.to_dict("records"). See the "Expanding multiple levels" section above.

Can pandas handle JSON files larger than RAM?

Not directly — json.load() reads everything at once. For large files, either convert to NDJSON first and use pd.read_json(..., lines=True, chunksize=N), or stream records with ijson and process in batches. See the "Processing large JSON files" section above.

Why do my float columns have too many decimal places in the CSV?

By default, pandas writes floats with full IEEE 754 precision, which gives values like 89.98999999999999. Pass float_format="%.2f" to to_csv to limit to 2 decimal places, or round the columns explicitly with df[col] = df[col].round(2) before export.
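A quick demonstration with the classic 0.1 + 0.2 artifact (file names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"amount": [0.1 + 0.2]})  # stored as 0.30000000000000004

df.to_csv("raw.csv", index=False)
print(open("raw.csv").read().splitlines()[1])      # 0.30000000000000004

df.to_csv("rounded.csv", index=False, float_format="%.2f")
print(open("rounded.csv").read().splitlines()[1])  # 0.30
```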

How do I make the CSV open correctly in Excel?

Two things: (1) use encoding="utf-8-sig" so Excel recognizes the UTF-8 encoding instead of defaulting to Latin-1; (2) if your locale uses ; as the list separator, you may need sep=";" in to_csv. Excel uses the Windows locale's list separator when deciding whether to split CSV columns automatically.