BlogDeveloper

CSV to JSON: the quoting, typing, and encoding traps

CSV looks trivial until you actually parse it. The quoting rules, the everything-is-a-string problem, embedded newlines, and the encoding issues that turn a clean-looking spreadsheet into broken JSON.

CSV is the format everyone assumes is simple: rows of values separated by commas, split on the comma, done. Then a field contains a comma, or a quote, or a line break, or a leading zero that matters, and the naive split falls apart. Converting CSV to JSON forces every one of these edge cases into the open, because JSON has types and structure where CSV has only text. Here's what actually bites.

CSV is barely a standard

There's an RFC (4180) that describes CSV, but real files predate it and ignore it freely. The delimiter might be a comma, a semicolon (common in locales where the comma is a decimal separator), or a tab. The line ending might be \n or \r\n. There may or may not be a header row. The first rule of parsing CSV is that you can't assume — a parser that works on one export breaks on the next. This is why "just split on commas" is wrong before you even reach the hard parts.

Quoting is the core rule

The one piece of CSV that is well-defined is how quoting rescues special characters. A field wrapped in double quotes can contain commas, line breaks, and quotes:

name,note
"Smith, John","He said ""hi"" twice"

That's two fields: Smith, John and He said "hi" twice. The comma inside the quotes is data, not a separator, and a doubled quote "" is an escaped single quote. A naive split(",") turns this into four broken pieces. Any correct CSV-to-JSON conversion has to honor quoting, which is exactly why you reach for a real parser instead of a one-liner.

Everything is a string

This is the conceptual gap between the formats. CSV has no types — every cell is text. JSON has numbers, booleans, and null. So conversion forces a decision: do you keep everything as strings, or infer types?

Type inference is convenient and dangerous:

  • 007 becomes 7 — and you've destroyed a zip code, a product SKU, or a phone number's leading zero.
  • 1.0 becomes 1, or worse, a long decimal becomes a float that can't represent it exactly.
  • TRUE/FALSE might become booleans, or stay strings, depending on the tool.
  • An empty cell might become "", null, or be omitted entirely.

There's no universally right answer. For data interchange, keeping everything as strings and letting the consumer cast deliberately is the safe default; inferring types is fine for analysis you control. Just know which one your converter is doing, because silent type inference is a classic source of "the data looked fine but the import was wrong."

Embedded newlines break line-based tools

Because a quoted field can contain a literal line break, a CSV record is not the same as a line of the file. Tools that process CSV line-by-line — wc -l, naive readline loops, log pipelines — will miscount and split records that contain multi-line fields. If your row count seems too high, embedded newlines in quoted fields are the first suspect.

Encoding and the BOM

CSV carries no encoding declaration, so the bytes are ambiguous. Two recurring problems:

  • The BOM. Spreadsheets often save UTF-8 with a byte-order mark at the start of the file. The first header name then silently becomes name instead of name, so a lookup by "name" fails for no visible reason. Strip the BOM on read.
  • Non-UTF-8 exports. Files saved as Latin-1 or a regional code page turn accented characters into mojibake when read as UTF-8. There's no flag for this; you have to know or detect the source encoding.

A short checklist

When a CSV-to-JSON conversion produces wrong output, walk these in order:

  1. Right delimiter? (comma vs semicolon vs tab)
  2. Quoting honored, including doubled quotes and embedded commas?
  3. Header row present and not polluted by a BOM?
  4. Types — inferred or kept as strings, and is that what you wanted?
  5. Encoding — is it actually UTF-8?

To convert a file with quoting handled correctly and a clear choice about typing, our CSV ↔ JSON converter does it in the browser so the data stays local. Once it's JSON, if you need to enforce a shape on it before passing it downstream, validating with JSON Schema is the next step; and if you're choosing a format for config rather than tabular data, JSON vs YAML covers that trade-off.