# CSV Dialect
CSV Dialect defines a simple format to describe the various dialects of CSV files in a language-agnostic manner. It aims to deal with a reasonably large subset of the features which differ between dialects, such as terminator strings, quoting rules, escape rules and so on.
| Author(s) | Rufus Pollock |
| --- | --- |
| Created | 20 February 2013 |
| Updated | 05 April 2021 |
| JSON Schema | csv-dialect.json |
| Version | 1.2 |
# Language
The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.
# Introduction
CSV Dialect defines a simple format to describe the various dialects of CSV files in a language-agnostic manner. It aims to deal with a reasonably large subset of the features which differ between dialects, such as terminator strings, quoting rules, escape rules and so on. The specification has been modeled around the union of the csv modules in Python and Ruby, and the bulk load capabilities of MySQL and PostgreSQL.
# Excluded
CSV Dialect has nothing to do with the names, contents or types of the headers or data within the CSV file, only how it is formatted. However, CSV Dialect does allow the presence or absence of a header to be specified, similarly to RFC 4180.
CSV Dialect is also orthogonal to the character encoding used in the CSV file. Note that it is possible for files in CSV format to contain data in more than one encoding.
# Usage
CSV Dialect is useful for programs that might have to deal with multiple CSV dialects, but that can rely on being told out-of-band which dialect will be used in a given input stream. This reduces the need for heuristic inference of CSV dialects, and simplifies the implementation of CSV readers, which must juggle dialect inference, schema inference, unseekable input streams, character encoding issues, and the lazy reading of very large input streams.
Some related work can be found in a similar JSON format and in Python’s PEP 305.
# Specification
A CSV Dialect descriptor, `dialect`, MUST be a JSON object with the following properties:

- `delimiter` - specifies the character sequence which separates fields (aka columns). Default = `,`. Example: `\t`. If not present, consumers should assume `,`.
- `lineTerminator` - specifies the character sequence which terminates rows. Default = `\r\n`
- `quoteChar` - specifies a one-character string to use as the quoting character. Default = `"`
- `doubleQuote` - controls the handling of quotes inside fields. If `true`, two consecutive quotes should be interpreted as one. Default = `true`
- `escapeChar` - specifies a one-character string to use for escaping (for example, `\`), mutually exclusive with `quoteChar`. Not set by default
- `nullSequence` - specifies the null sequence (for example, `\N`). Not set by default
- `skipInitialSpace` - specifies how to interpret whitespace which immediately follows a delimiter; if `false`, whitespace immediately after a delimiter should be treated as part of the following field. Default = `false`
- `header` - indicates whether the file includes a header row. If `true`, the first row in the file is a header row, not data. Default = `true`
- `commentChar` - specifies a one-character string; any row that begins with this character should be ignored as a comment. Not set by default
- `caseSensitiveHeader` - indicates that case in the header is meaningful. For example, columns `CAT` and `Cat` should not be equated. Default = `false`
- `csvddfVersion` - a number, in n.n format, e.g., `1.2`. If not present, consumers should assume the latest schema version.
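Since the defaults above were modeled on Python's csv module, a consumer might translate a descriptor into reader options roughly as follows. This is a non-normative sketch: the helper name `reader_kwargs` is illustrative, and `nullSequence`, `commentChar`, `header` and `caseSensitiveHeader` have no direct csv-module equivalent, so they would need handling in application code.

```python
import csv
import io

def reader_kwargs(dialect: dict) -> dict:
    """Map a CSV Dialect descriptor onto csv.reader format parameters.

    nullSequence, commentChar, header and caseSensitiveHeader have no direct
    equivalent in Python's csv module and must be handled by the caller.
    Note that csv.reader ignores lineterminator when reading.
    """
    return {
        "delimiter": dialect.get("delimiter", ","),
        "quotechar": dialect.get("quoteChar", '"'),
        "doublequote": dialect.get("doubleQuote", True),
        "escapechar": dialect.get("escapeChar"),          # None when not set
        "skipinitialspace": dialect.get("skipInitialSpace", False),
        "lineterminator": dialect.get("lineTerminator", "\r\n"),
    }

dialect = {"delimiter": ";", "quoteChar": '"', "header": True}
data = io.StringIO('name;note\r\nfoo;"a; b"\r\n')
rows = list(csv.reader(data, **reader_kwargs(dialect)))
# Apply the header flag outside the csv module.
header, records = (rows[0], rows[1:]) if dialect.get("header", True) else (None, rows)
print(header, records)  # ['name', 'note'] [['foo', 'a; b']]
```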
# Example
Here’s an example:
```json
{
  "dialect": {
    "csvddfVersion": 1.2,
    "delimiter": ";",
    "doubleQuote": true,
    "lineTerminator": "\r\n",
    "quoteChar": "\"",
    "skipInitialSpace": true,
    "header": true,
    "commentChar": "#"
  }
}
```
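A descriptor like this can also be checked against the csv-dialect.json schema listed in the table above. The sketch below assumes a local copy of that schema and uses the third-party jsonschema package; whether the schema expects the bare dialect object or the wrapping document is an assumption here, so treat this as illustrative rather than definitive.

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Assumes csv-dialect.json (see the table above) has been saved locally.
with open("csv-dialect.json") as f:
    schema = json.load(f)

dialect = {
    "csvddfVersion": 1.2,
    "delimiter": ";",
    "doubleQuote": True,
    "lineTerminator": "\r\n",
    "quoteChar": '"',
    "skipInitialSpace": True,
    "header": True,
    "commentChar": "#",
}

try:
    # Assumption: the schema describes the dialect object itself; adjust if it
    # instead expects the wrapping {"dialect": ...} document.
    validate(instance=dialect, schema=schema)
    print("dialect descriptor is valid")
except ValidationError as err:
    print("invalid dialect descriptor:", err.message)
```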