CSV to JSON: the quoting, typing, and encoding traps
CSV looks trivial until you actually parse it. The quoting rules, the everything-is-a-string problem, embedded newlines, and the encoding issues that turn a clean-looking spreadsheet into broken JSON.
CSV is the format everyone assumes is simple: rows of values separated by commas, split on the comma, done. Then a field contains a comma, or a quote, or a line break, or a leading zero that matters, and the naive split falls apart. Converting CSV to JSON forces every one of these edge cases into the open, because JSON has types and structure where CSV has only text. Here's what actually bites.
CSV is barely a standard
There's an RFC (4180) that describes CSV, but real files predate it and
ignore it freely. The delimiter might be a comma, a semicolon (common
in locales where the comma is a decimal separator), or a tab. The line
ending might be \n or \r\n. There may or may not be a header row.
The first rule of parsing CSV is that you can't assume — a parser that
works on one export breaks on the next. This is why "just split on
commas" is wrong before you even reach the hard parts.
Quoting is the core rule
The one piece of CSV that is well-defined is how quoting rescues special characters. A field wrapped in double quotes can contain commas, line breaks, and quotes:
name,note
"Smith, John","He said ""hi"" twice"
That's two fields: Smith, John and He said "hi" twice. The
comma inside the quotes is data, not a separator, and a doubled quote
"" is an escaped single quote. A naive split(",") turns this into
four broken pieces. Any correct CSV-to-JSON conversion has to honor
quoting, which is exactly why you reach for a real parser instead of a
one-liner.
Everything is a string
This is the conceptual gap between the formats. CSV has no types — every cell is text. JSON has numbers, booleans, and null. So conversion forces a decision: do you keep everything as strings, or infer types?
Type inference is convenient and dangerous:
007becomes7— and you've destroyed a zip code, a product SKU, or a phone number's leading zero.1.0becomes1, or worse, a long decimal becomes a float that can't represent it exactly.TRUE/FALSEmight become booleans, or stay strings, depending on the tool.- An empty cell might become
"",null, or be omitted entirely.
There's no universally right answer. For data interchange, keeping everything as strings and letting the consumer cast deliberately is the safe default; inferring types is fine for analysis you control. Just know which one your converter is doing, because silent type inference is a classic source of "the data looked fine but the import was wrong."
Embedded newlines break line-based tools
Because a quoted field can contain a literal line break, a CSV record
is not the same as a line of the file. Tools that process CSV
line-by-line — wc -l, naive readline loops, log pipelines — will
miscount and split records that contain multi-line fields. If your row
count seems too high, embedded newlines in quoted fields are the first
suspect.
Encoding and the BOM
CSV carries no encoding declaration, so the bytes are ambiguous. Two recurring problems:
- The BOM. Spreadsheets often save UTF-8 with a byte-order mark at
the start of the file. The first header name then silently becomes
nameinstead ofname, so a lookup by"name"fails for no visible reason. Strip the BOM on read. - Non-UTF-8 exports. Files saved as Latin-1 or a regional code page turn accented characters into mojibake when read as UTF-8. There's no flag for this; you have to know or detect the source encoding.
A short checklist
When a CSV-to-JSON conversion produces wrong output, walk these in order:
- Right delimiter? (comma vs semicolon vs tab)
- Quoting honored, including doubled quotes and embedded commas?
- Header row present and not polluted by a BOM?
- Types — inferred or kept as strings, and is that what you wanted?
- Encoding — is it actually UTF-8?
To convert a file with quoting handled correctly and a clear choice about typing, our CSV ↔ JSON converter does it in the browser so the data stays local. Once it's JSON, if you need to enforce a shape on it before passing it downstream, validating with JSON Schema is the next step; and if you're choosing a format for config rather than tabular data, JSON vs YAML covers that trade-off.