
How to Convert CSV to JSON in Python

Complete Python guide to converting CSV to JSON — from a two-line pandas solution to nested JSON, grouped output, type casting, NDJSON, and large-file streaming. Covers every output shape you'll actually need.

CSV is how data gets exported. JSON is how APIs and modern applications consume it. Converting between them in Python is a daily task — and there are a dozen ways to do it, each with different trade-offs depending on the shape of JSON you need.

This guide covers every pattern you're likely to encounter: flat arrays, nested objects, grouped JSON, NDJSON, type casting, and large-file streaming. Each section is self-contained, so skip to the shape that matches what you're building.

If you just need a one-off conversion without writing code, the CSV to JSON converter handles it in the browser — no install, nothing uploaded.


The two-line solution (pandas)

If you have a flat CSV and need a flat JSON array, this is all you need:

import pandas as pd

df = pd.read_csv("data.csv")
df.to_json("output.json", orient="records", indent=2)

That's it. orient="records" produces the standard JSON shape — an array of objects, one per row:

[
  {"id": 1, "name": "Alice Chen", "dept": "Engineering", "salary": 95000},
  {"id": 2, "name": "Bob Kumar", "dept": "Design", "salary": 78000},
  {"id": 3, "name": "Sara Mills", "dept": "Management", "salary": 110000}
]

The rest of this guide covers what to do when this isn't enough.


Method 1: Standard library (no dependencies)

Python's csv module handles CSV-to-JSON without any third-party packages. This is useful in environments where you can't install pandas, or when you want explicit control over every field.

Basic flat array

import csv
import json

with open("data.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    rows = list(reader)

with open("output.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)

csv.DictReader reads each row as a dict with the header row as keys. The output is identical in structure to the pandas example above.

Caveat: Every value is a string. A CSV has no type information — "95000" stays "95000" unless you cast it. See the type casting section below.
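You can verify this with a quick in-memory check. This sketch uses io.StringIO in place of a real file, with made-up column names:

```python
import csv
import io
import json

# In-memory CSV standing in for a real file
raw = "id,salary\n1,95000\n2,78000\n"

rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0])  # {'id': '1', 'salary': '95000'} -- every value is a string

# Cast explicitly to get JSON numbers instead of strings
typed = [{"id": int(r["id"]), "salary": int(r["salary"])} for r in rows]
print(json.dumps(typed))  # [{"id": 1, "salary": 95000}, {"id": 2, "salary": 78000}]
```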

Streaming to JSON (memory-efficient)

For large files, don't load all rows into memory before writing. Build the JSON array incrementally:

import csv
import json

with open("large.csv", newline="", encoding="utf-8") as f_in, \
     open("output.json", "w", encoding="utf-8") as f_out:

    reader = csv.DictReader(f_in)
    f_out.write("[\n")
    
    for i, row in enumerate(reader):
        prefix = "  " if i == 0 else ",\n  "
        f_out.write(prefix + json.dumps(row))
    
    f_out.write("\n]")

This writes one JSON object at a time, so memory usage is proportional to one row, not the whole file.
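To sanity-check the pattern, the same logic can be run against in-memory buffers and the result parsed back. A sketch, with io.StringIO standing in for the input and output files:

```python
import csv
import io
import json

raw = "id,name\n1,Alice\n2,Bob\n"
f_in, f_out = io.StringIO(raw), io.StringIO()

# Same incremental-write logic as above
reader = csv.DictReader(f_in)
f_out.write("[\n")
for i, row in enumerate(reader):
    prefix = "  " if i == 0 else ",\n  "
    f_out.write(prefix + json.dumps(row))
f_out.write("\n]")

# The incrementally built text parses as a valid JSON array
parsed = json.loads(f_out.getvalue())
print(parsed)  # [{'id': '1', 'name': 'Alice'}, {'id': '2', 'name': 'Bob'}]
```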


Method 2: pandas — full control over output

pandas to_json has an orient parameter that controls the JSON shape:

| orient | Output shape | Use when |
|--------|--------------|----------|
| "records" | [{...}, {...}] | Standard — one object per row |
| "index" | {"0": {...}, "1": {...}} | Need row numbers as keys |
| "split" | {"columns": [...], "data": [[...]]} | Compact; preserves column order |
| "table" | Schema + data (JSON Table spec) | Need type metadata |
| "values" | [[val, val], [val, val]] | Raw 2D array, no keys |

For almost every use case, orient="records" is correct.

Compact vs. pretty-printed output

# Pretty (readable)
df.to_json("output.json", orient="records", indent=2)

# Compact (smaller file, faster to parse)
df.to_json("output.json", orient="records")

Compact output for a 3-row file:

[{"id":1,"name":"Alice Chen","dept":"Engineering","salary":95000},{"id":2,"name":"Bob Kumar","dept":"Design","salary":78000},{"id":3,"name":"Sara Mills","dept":"Management","salary":110000}]

Handling NaN (the most common pandas gotcha)

pandas represents missing CSV values as NaN. By default, to_json converts NaN to null in JSON — which is correct behavior. But if you convert the DataFrame to a dict first and call json.dumps on the result, NaN leaks through as a bare NaN token, which is not valid JSON:

# This is wrong — NaN is emitted as a bare NaN token
rows = df.to_dict(orient="records")
json.dumps(rows)  # NaN is not valid JSON

# This is correct — to_json handles NaN → null
df.to_json("output.json", orient="records", indent=2)

# Or if you need a string, not a file:
json_str = df.to_json(orient="records", indent=2)

Preserving column order

pandas to_json writes columns in the order they appear in the DataFrame, which matches the CSV column order by default. If you've reordered columns during processing, the output order reflects that.

To force a specific column order:

cols = ["id", "name", "dept", "salary"]
df[cols].to_json("output.json", orient="records", indent=2)

Type casting: strings to numbers, booleans, and nulls

The most important thing to understand about CSV: everything is a string. A CSV cell containing 95000 is the string "95000". If your JSON consumer expects a number and gets a string, it will fail.

pandas handles this automatically

pd.read_csv() infers types by default. Columns of integers become int64, decimals become float64, and to_json writes them as JSON numbers — no extra work.

df = pd.read_csv("employees.csv")
print(df.dtypes)
# id        int64
# name      object   ← string
# salary    int64
# score     float64

Override inference when needed:

# Force a column to stay string (e.g., zip codes like "01234")
df = pd.read_csv("data.csv", dtype={"zip_code": str, "phone": str})

# Parse date columns
df = pd.read_csv("data.csv", parse_dates=["created_at", "updated_at"])

Manual casting with the csv module

When using the standard library, cast explicitly:

import csv
import json

def cast(value):
    if value == "":
        return None
    if value.lower() in ("true", "yes"):
        return True
    if value.lower() in ("false", "no"):
        return False
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        pass
    return value

with open("data.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    rows = [{k: cast(v) for k, v in row.items()} for row in reader]

with open("output.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)

This turns "95000" → 95000, "" → null, "true" → true, "3.14" → 3.14, and leaves everything else as a string.


Building nested JSON from a flat CSV

A flat CSV becomes a flat JSON array by default. Sometimes you need nested JSON — a column called address.city should become {"address": {"city": "..."}}.

From dot-notation column names

Given a CSV like:

id,name,address.city,address.state,contact.email
1,Alice Chen,Austin,TX,alice@example.com

Build nested objects from the dotted column names:

import csv
import json

def set_nested(obj, keys, value):
    for key in keys[:-1]:
        obj = obj.setdefault(key, {})
    obj[keys[-1]] = value

def flatten_row_to_nested(row):
    result = {}
    for key, value in row.items():
        parts = key.split(".")
        set_nested(result, parts, value)
    return result

with open("flat.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    rows = [flatten_row_to_nested(row) for row in reader]

with open("nested.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)

Input row {"id": "1", "address.city": "Austin", "address.state": "TX"} becomes:

{
  "id": "1",
  "address": {
    "city": "Austin",
    "state": "TX"
  }
}

From explicit column groupings (pandas)

import pandas as pd
import json

df = pd.read_csv("flat.csv")

records = []
for _, row in df.iterrows():
    record = {
        "id": row["id"],
        "name": row["name"],
        "address": {
            "city": row["address.city"],
            "state": row["address.state"],
        },
        "contact": {
            "email": row["contact.email"],
        },
    }
    records.append(record)

with open("nested.json", "w") as f:
    json.dump(records, f, indent=2)

More verbose but more readable for complex schemas.


Grouped JSON: one key per group

A common pattern: you have a flat CSV with a repeating column (like department or region) and you want JSON grouped by that column — a dict of arrays, or an array of objects each containing a sub-array.

Dict of arrays (group by key)

dept,name,salary
Engineering,Alice Chen,95000
Engineering,Bob Kumar,88000
Design,Sara Mills,78000
Design,Tom Park,72000

Target output:

{
  "Engineering": [
    {"name": "Alice Chen", "salary": 95000},
    {"name": "Bob Kumar", "salary": 88000}
  ],
  "Design": [
    {"name": "Sara Mills", "salary": 78000},
    {"name": "Tom Park", "salary": 72000}
  ]
}

import pandas as pd
import json

df = pd.read_csv("employees.csv")

grouped = {
    dept: g.drop(columns="dept").to_dict(orient="records")
    for dept, g in df.groupby("dept")
}

with open("by_dept.json", "w") as f:
    json.dump(grouped, f, indent=2)

Array of group objects

Target output:

[
  {
    "dept": "Engineering",
    "count": 2,
    "employees": [{"name": "Alice Chen", "salary": 95000}, ...]
  }
]

import pandas as pd
import json

df = pd.read_csv("employees.csv")

result = []
for dept, group in df.groupby("dept"):
    result.append({
        "dept": dept,
        "count": len(group),
        "employees": group.drop(columns="dept").to_dict(orient="records"),
    })

with open("departments.json", "w") as f:
    json.dump(result, f, indent=2)

With the standard library

import csv
import json
from collections import defaultdict

groups = defaultdict(list)

with open("employees.csv", newline="") as f:
    for row in csv.DictReader(f):
        dept = row.pop("dept")
        groups[dept].append(row)

with open("by_dept.json", "w") as f:
    json.dump(dict(groups), f, indent=2)

NDJSON output (one JSON object per line)

NDJSON (newline-delimited JSON) is used in data pipelines, log streaming, and as input to tools like jq and mlr. Each line is a complete JSON object — no outer array brackets.

import pandas as pd

df = pd.read_csv("data.csv")
df.to_json("output.ndjson", orient="records", lines=True)

Output:

{"id":1,"name":"Alice Chen","dept":"Engineering","salary":95000}
{"id":2,"name":"Bob Kumar","dept":"Design","salary":78000}

lines=True enables NDJSON mode. Combine with streaming for large files:

import pandas as pd

chunk_size = 10_000

with open("output.ndjson", "w") as f_out:
    for chunk in pd.read_csv("large.csv", chunksize=chunk_size):
        f_out.write(chunk.to_json(orient="records", lines=True) + "\n")

Large files: pandas chunking

For CSV files too large to load into memory at once, read in chunks:

import pandas as pd
import json

chunk_size = 50_000
first = True

with open("output.json", "w") as f_out:
    f_out.write("[\n")
    
    for chunk in pd.read_csv("large.csv", chunksize=chunk_size):
        records = json.loads(chunk.to_json(orient="records"))
        for record in records:
            if not first:
                f_out.write(",\n")
            f_out.write("  " + json.dumps(record))
            first = False
    
    f_out.write("\n]")

print("Done")

For NDJSON output, chunking is simpler — just append lines:

import pandas as pd

with open("output.ndjson", "w") as f_out:
    for chunk in pd.read_csv("large.csv", chunksize=50_000):
        f_out.write(chunk.to_json(orient="records", lines=True))
        f_out.write("\n")

Handling real-world CSV problems

BOM and encoding issues

CSVs from Excel often start with a UTF-8 BOM (byte order mark, U+FEFF). pandas handles this automatically, but the standard library needs encoding="utf-8-sig":

# csv module
with open("from_excel.csv", newline="", encoding="utf-8-sig") as f:
    reader = csv.DictReader(f)

# pandas (handles BOM automatically)
df = pd.read_csv("from_excel.csv")

Non-standard delimiters

CSVs from European locales often use semicolons (;) or tabs as delimiters:

# Semicolon-delimited
df = pd.read_csv("european.csv", sep=";")

# Tab-delimited
df = pd.read_csv("data.tsv", sep="\t")

# Auto-detect (slower)
df = pd.read_csv("data.csv", sep=None, engine="python")

Quoted fields with embedded newlines

The csv module handles RFC 4180 quoting automatically — fields that span multiple lines inside double quotes are read correctly:

with open("data.csv", newline="") as f:  # newline="" is required
    reader = csv.DictReader(f)

Do not pass newline="\n" — always use newline="" to let the csv module handle line endings.
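A quick sketch of that behavior, using an in-memory CSV where one quoted field spans two physical lines:

```python
import csv
import io

# The "notes" value of the first record contains an embedded newline
raw = 'id,notes\n1,"first line\nsecond line"\n2,plain\n'

rows = list(csv.DictReader(io.StringIO(raw)))
print(len(rows))          # 2 -- the embedded newline did not split the record
print(rows[0]["notes"])   # both lines, kept as one field
```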

Date columns in pandas

# Parse dates on read
df = pd.read_csv("data.csv", parse_dates=["created_at", "updated_at"])

# to_json with ISO 8601 dates
df.to_json("output.json", orient="records", indent=2, date_format="iso")

# Custom date format
df["created_at"] = pd.to_datetime(df["created_at"]).dt.strftime("%Y-%m-%d")
df.to_json("output.json", orient="records", indent=2)

Empty cells become null

pandas converts empty CSV cells to NaN, and to_json converts those to JSON null:

id,name,email
1,Alice,alice@example.com
2,Bob,

[
  {"id": 1, "name": "Alice", "email": "alice@example.com"},
  {"id": 2, "name": "Bob", "email": null}
]

To keep empty string instead of null:

df = pd.read_csv("data.csv").fillna("")
df.to_json("output.json", orient="records", indent=2)

Complete production script

A reusable script that handles the most common CSV-to-JSON conversion needs:

import pandas as pd
import json
import argparse
from pathlib import Path

def csv_to_json(
    csv_path: str,
    json_path: str,
    orient: str = "records",
    indent: int = 2,
    null_as_empty: bool = False,
    str_columns: list[str] | None = None,
    date_columns: list[str] | None = None,
    encoding: str = "utf-8-sig",
) -> int:
    dtype_overrides = {col: str for col in (str_columns or [])}

    df = pd.read_csv(
        csv_path,
        dtype=dtype_overrides,
        parse_dates=date_columns or [],
        encoding=encoding,
    )

    if null_as_empty:
        df = df.fillna("")

    df.to_json(
        json_path,
        orient=orient,
        indent=indent,
        date_format="iso",
        force_ascii=False,
    )

    return len(df)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Convert CSV to JSON")
    parser.add_argument("input",  help="Input CSV file")
    parser.add_argument("output", help="Output JSON file")
    parser.add_argument("--orient", default="records", choices=["records","index","split","table"])
    parser.add_argument("--indent", type=int, default=2)
    parser.add_argument("--null-as-empty", action="store_true")
    parser.add_argument("--str-cols", nargs="*", default=[], metavar="COL")
    parser.add_argument("--date-cols", nargs="*", default=[], metavar="COL")
    args = parser.parse_args()

    count = csv_to_json(
        args.input,
        args.output,
        orient=args.orient,
        indent=args.indent,
        null_as_empty=args.null_as_empty,
        str_columns=args.str_cols,
        date_columns=args.date_cols,
    )
    print(f"Wrote {count} records to {args.output}")

Usage:

# Basic conversion
python csv_to_json.py employees.csv employees.json

# Keep zip codes as strings, parse date column
python csv_to_json.py orders.csv orders.json --str-cols zip_code --date-cols created_at

# Replace nulls with empty strings
python csv_to_json.py data.csv data.json --null-as-empty

Which method should you use?

| Situation | Recommended approach |
|-----------|---------------------|
| Simple flat array, standard schema | pd.read_csv → df.to_json(orient="records") |
| Need explicit type control | csv module + manual cast() function |
| Dot-notation columns → nested objects | csv module + set_nested() helper |
| Group rows by a column | pandas groupby + to_dict(orient="records") |
| NDJSON output | df.to_json(orient="records", lines=True) |
| Large file (> 100MB) | pandas chunksize + streaming write |
| No dependencies allowed | csv + json standard library |
| One-off conversion, no code | CSV to JSON converter |


Frequently Asked Questions

Why are all my values strings in the JSON output?

You're using csv.DictReader without type casting. The CSV format stores everything as text — there's no type information. Use the cast() function from the type casting section, or switch to pandas which infers types automatically.

How do I get compact JSON (no spaces or newlines)?

# pandas
df.to_json("output.json", orient="records")  # omit indent

# json module
json.dump(rows, f, separators=(",", ":"))

My CSV has duplicate column names — what happens?

pandas renames them automatically: name, name.1, name.2. To control this, rename columns after reading:

df = pd.read_csv("data.csv")
df.columns = ["id", "first_name", "last_name"]  # override with your names

How do I convert only specific rows to JSON?

Filter before converting:

df = pd.read_csv("data.csv")

# Filter and convert
df[df["status"] == "active"].to_json("active.json", orient="records", indent=2)

# Multiple conditions
df[(df["dept"] == "Engineering") & (df["salary"] > 90000)].to_json(
    "senior_engineers.json", orient="records", indent=2
)

How do I output one JSON file per group?

import pandas as pd

df = pd.read_csv("employees.csv")

for dept, group in df.groupby("dept"):
    filename = f"{dept.lower().replace(' ', '_')}.json"
    group.drop(columns="dept").to_json(filename, orient="records", indent=2)
    print(f"Wrote {len(group)} rows to {filename}")

What's the difference between orient="records" and orient="split"?

orient="records" produces [{"col": val, ...}, ...] — one object per row, keys repeated on every object. orient="split" produces {"columns": ["col1", "col2"], "data": [[val, val], ...]} — keys stored once, data as a 2D array. split is more compact for wide tables with many columns; records is what most APIs and tools expect.
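A side-by-side sketch on a toy DataFrame (requires pandas):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})

# One object per row, keys repeated
print(df.to_json(orient="records"))
# [{"id":1,"name":"Alice"},{"id":2,"name":"Bob"}]

# Keys stored once, data as a 2D array
print(df.to_json(orient="split"))
```

Note that split also includes an "index" key by default; pass index=False to to_json to omit it.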