CSV Dialect
A simple format to describe a CSV file's dialect.

Authors Rufus Pollock
Version 1.2
Last Updated 30 January 2017
Created 20 February 2013

Abstract

CSV Dialect defines a simple format to describe the various dialects of CSV files in a language-agnostic manner. It aims to deal with a reasonably large subset of the features which differ between dialects, such as terminator strings, quoting rules, escape rules and so on.

Goals

CSV Dialect shares the design philosophy of all Frictionless Data Specifications, being:

  • Requirements that are driven by simplicity
  • Extensibility and customisation by design
  • Metadata that is human-editable and machine-usable
  • Reuse of existing standard formats for data
  • Language-, technology- and infrastructure-agnostic

Changelog

See the Changelog for information.

Language

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.

Specification

CSV Dialect defines a simple format to describe the various dialects of CSV files in a language-agnostic manner. It aims to deal with a reasonably large subset of the features which differ between dialects, such as terminator strings, quoting rules, escape rules and so on. The specification has been modeled around the union of the csv modules in Python and Ruby, and the bulk load capabilities of MySQL and PostgreSQL.

CSV Dialect has nothing to do with the names, contents or types of the headers or data within the CSV file, only how it is formatted. However, CSV Dialect does allow the presence or absence of a header to be specified, similarly to RFC4180.

CSV Dialect is also orthogonal to the character encoding used in the CSV file. Note that it is possible for files in CSV format to contain data in more than one encoding.

CSV Dialect is useful for programmes which might have to deal with multiple dialects of CSV file, but which can rely on being told out-of-band which dialect will be used in a given input stream. This reduces the need for heuristic inference of CSV dialects, and simplifies the implementation of CSV readers, which must juggle dialect inference, schema inference, unseekable input streams, character encoding issues, and the lazy reading of very large input streams.

Some related work can be found in this comparison of csv dialect support, this example of similar JSON format, and in Python's PEP 305.

Examples

A minimal CSV Dialect looks as follows.

{
  "delimiter": ","
}

A more complete CSV Dialect with all defaults looks as follows.

{
  "delimiter": ",",
  "doubleQuote": true,
  "lineTerminator": "\r\n",
  "quoteChar": "\"",
  "skipInitialSpace": true,
  "header": true
}

A customized CSV Dialect looks as follows.

{
  "delimiter": ";",
  "doubleQuote": false,
  "lineTerminator": "\n",
  "quoteChar": "'",
  "skipInitialSpace": false,
  "header": false
}
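
As a rough illustration of how a reader might consume such a descriptor, the sketch below maps the properties from the customized example above onto keyword arguments of Python's standard csv module. The names PROPERTY_TO_KWARG and reader_kwargs, and the sample data, are assumptions made for this illustration and are not part of the specification. Note also that Python's csv module only accepts a one-character delimiter, which is narrower than the character sequence this specification allows.

```python
import csv
import io

# Hypothetical mapping from CSV Dialect property names to the keyword
# arguments accepted by Python's csv.reader. "lineTerminator" is left out
# because csv.reader ignores it, and "header" is not a csv.reader concept.
PROPERTY_TO_KWARG = {
    "delimiter": "delimiter",
    "doubleQuote": "doublequote",
    "quoteChar": "quotechar",
    "escapeChar": "escapechar",
    "skipInitialSpace": "skipinitialspace",
}


def reader_kwargs(dialect):
    """Translate a CSV Dialect descriptor (a dict) into csv.reader kwargs."""
    return {
        kwarg: dialect[prop]
        for prop, kwarg in PROPERTY_TO_KWARG.items()
        if prop in dialect
    }


# The customized dialect from the example above.
custom_dialect = {
    "delimiter": ";",
    "doubleQuote": False,
    "lineTerminator": "\n",
    "quoteChar": "'",
    "skipInitialSpace": False,
    "header": False,
}

# Illustrative data: the quoted field contains the delimiter.
sample = "1;'a;b';c\n2;d;e\n"
rows = list(csv.reader(io.StringIO(sample), **reader_kwargs(custom_dialect)))
print(rows)  # [['1', 'a;b', 'c'], ['2', 'd', 'e']]
```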

Descriptor

A valid CSV Dialect descriptor is an object conforming with the formal reference outlined in Properties, and the following more general requirements.

Form

The descriptor MUST be valid JSON, as described in RFC 4627, and SHOULD be in one of the following forms:

  1. A file named csvdialect.json.
  2. An object, either on its own or nested in another data structure.

A JSON Schema to validate CSV Dialect descriptors is available here.
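
A minimal sketch of the two form requirements (valid JSON, and an object) is shown below. It is not a substitute for validation against the JSON Schema; the function name load_descriptor is an assumption made for this illustration.

```python
import json


def load_descriptor(text):
    """Parse descriptor text and check that it is a JSON object."""
    # json.loads raises json.JSONDecodeError (a ValueError subclass)
    # when the text is not valid JSON.
    descriptor = json.loads(text)
    if not isinstance(descriptor, dict):
        raise ValueError("a CSV Dialect descriptor must be a JSON object")
    return descriptor


descriptor = load_descriptor('{"delimiter": ","}')
print(descriptor["delimiter"])  # ,
```

For a descriptor stored in a file named csvdialect.json, the same function could be applied to the file's contents after reading them as text.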

Media type

The media type for CSV Dialect descriptors MUST be application/vnd.csvdialect+json. This media type is registered with IANA.

URIs

Several properties are defined as URI-formatted strings, which are to be considered as a subset of the formal URI specification described in RFC 3986. The additional constraints imposed are as follows:

  1. The only supported schemes are http and https. Absence of a scheme indicates either a POSIX path or a JSON Pointer (see below).
  2. URLs, indicated by http or https, MUST be fully qualified.
  3. POSIX paths are supported for referencing local files, with the security constraint that they MUST be relative siblings or children of the descriptor. Absolute paths (/) and relative parent paths (../) MUST NOT be used, and implementations SHOULD NOT support these path types.
  4. JSON Pointers are supported as a general referencing mechanism to other properties in the same descriptor, and therefore MUST start with the pound symbol (#).
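
The four constraints above can be sketched as a small classifier. This is one possible reading, not a reference implementation: the function name classify_reference and its return labels are assumptions made for this illustration, and the parent-path check is deliberately strict in that it rejects any ".." segment, including one in the middle of a path.

```python
from urllib.parse import urlsplit


def classify_reference(value):
    """Return 'url', 'json-pointer' or 'path', or raise ValueError."""
    # Constraint 4: JSON Pointers start with the pound symbol.
    if value.startswith("#"):
        return "json-pointer"
    parts = urlsplit(value)
    if parts.scheme:
        # Constraint 1: only http and https are supported.
        if parts.scheme not in ("http", "https"):
            raise ValueError(f"unsupported scheme: {parts.scheme}")
        # Constraint 2: URLs must be fully qualified (have a host part).
        if not parts.netloc:
            raise ValueError("URL is not fully qualified")
        return "url"
    # Constraint 3: no absolute paths and no parent-directory segments.
    if value.startswith("/"):
        raise ValueError("absolute paths are not allowed")
    if ".." in value.split("/"):
        raise ValueError("parent paths are not allowed")
    return "path"


print(classify_reference("https://example.com/csvdialect.json"))  # url
print(classify_reference("#/dialect"))                            # json-pointer
print(classify_reference("data/csvdialect.json"))                 # path
```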

Properties

This section presents a complete description of required and optional properties for a CSV Dialect descriptor.

Adherence to the specification does not imply that additional, non-specified properties cannot be used: a descriptor MAY include any number of properties in addition to those described as required and optional properties.

Required properties

A CSV Dialect descriptor MUST include the following properties.

delimiter

A character sequence to use as the field separator.

Examples
{
  "delimiter": ","
}
{
  "delimiter": ";"
}

doubleQuote

Specifies the handling of quotes inside fields.

If doubleQuote is set to true, two consecutive quote characters inside a quoted field must be interpreted as a single quote character.

Examples
{
  "doubleQuote": true
}
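
To make the doubled-quote rule concrete, the following sketch uses the doublequote parameter of Python's csv module, which behaves in this way. The sample line is illustrative only.

```python
import csv
import io

# The second field contains quotes escaped by doubling them.
line = '1,"She said ""hello""",3\n'

reader = csv.reader(io.StringIO(line), delimiter=",", quotechar='"', doublequote=True)
row = next(reader)
print(row)  # ['1', 'She said "hello"', '3']
```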

Optional properties

A CSV Dialect descriptor SHOULD include the following properties.

lineTerminator

Specifies the character sequence that must be used to terminate rows.

Examples
{
  "lineTerminator": "\r\n"
}
{
  "lineTerminator": "\n"
}
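
On the writing side, the effect of lineTerminator can be illustrated with the lineterminator parameter of Python's csv.writer. The helper name write_rows is an assumption made for this illustration; note that Python's csv.reader ignores this parameter when reading.

```python
import csv
import io


def write_rows(rows, line_terminator):
    """Serialize rows to a string, ending each row with line_terminator."""
    buffer = io.StringIO(newline="")  # newline="" disables newline translation
    writer = csv.writer(buffer, lineterminator=line_terminator)
    writer.writerows(rows)
    return buffer.getvalue()


print(repr(write_rows([["a", "b"], ["c", "d"]], "\r\n")))  # 'a,b\r\nc,d\r\n'
print(repr(write_rows([["a", "b"], ["c", "d"]], "\n")))    # 'a,b\nc,d\n'
```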

nullSequence

Specifies the null sequence, for example, \N.

Examples
{
  "nullSequence": "\\N"
}
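
One plausible way for a reader to apply a null sequence is to replace any field that exactly equals it with a null value. The sketch below assumes that interpretation; the function name read_with_nulls is hypothetical, and Python's None stands in for null.

```python
import csv
import io


def read_with_nulls(text, null_sequence):
    """Yield rows, replacing fields equal to null_sequence with None."""
    for row in csv.reader(io.StringIO(text)):
        yield [None if field == null_sequence else field for field in row]


# "\\N" in Python source is the two-character sequence backslash + N.
rows = list(read_with_nulls("1,\\N,x\n", "\\N"))
print(rows)  # [['1', None, 'x']]
```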

quoteChar

Specifies a one-character string to use as the quoting character.

Examples
{
  "quoteChar": "\""
}
{
  "quoteChar": "'"
}

escapeChar

Specifies a one-character string to use as the escape character.

Examples
{
  "escapeChar": "\\"
}
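
As an illustration, the sketch below reads a line in which quotes inside a quoted field are escaped with a backslash rather than doubled, using the escapechar parameter of Python's csv module. The sample line is illustrative only.

```python
import csv
import io

# On disk this line reads: 1,"She said \"hello\"",3
line = '1,"She said \\"hello\\"",3\n'

reader = csv.reader(
    io.StringIO(line),
    delimiter=",",
    quotechar='"',
    doublequote=False,  # quotes are escaped, not doubled
    escapechar="\\",
)
row = next(reader)
print(row)  # ['1', 'She said "hello"', '3']
```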

skipInitialSpace

Specifies the interpretation of whitespace immediately following a delimiter. If false, whitespace immediately after a delimiter should be treated as part of the subsequent field.

Examples
{
  "skipInitialSpace": true
}
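
The difference between the two settings can be seen with the skipinitialspace parameter of Python's csv module, shown here as an illustration on a made-up line.

```python
import csv
import io

line = "a, b, c\n"

# false: the space after each delimiter is kept as part of the next field.
kept = next(csv.reader(io.StringIO(line), skipinitialspace=False))
print(kept)  # ['a', ' b', ' c']

# true: the space after each delimiter is ignored.
skipped = next(csv.reader(io.StringIO(line), skipinitialspace=True))
print(skipped)  # ['a', 'b', 'c']
```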

header

Specifies if the file includes a header row, always as the first row in the file.

Examples
{
  "header": true
}

caseSensitiveHeader

Specifies if the case of headers is meaningful.

Use of case in source CSV files is not always an intentional decision. For example, should "CAT" and "Cat" be considered to have the same meaning?

Examples
{
  "caseSensitiveHeader": true
}
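
The sketch below shows one way a reader might combine header and caseSensitiveHeader. It assumes that a false caseSensitiveHeader means header names are compared after case folding; the specification text above does not prescribe a normalization, so this is an interpretation, and the function name read_records is hypothetical.

```python
import csv
import io


def read_records(text, *, header, case_sensitive_header):
    """Return (names, rows); names is None when there is no header row."""
    reader = csv.reader(io.StringIO(text))
    if not header:
        return None, list(reader)
    # The header, when present, is always the first row in the file.
    names = next(reader, [])
    if not case_sensitive_header:
        # Assumed normalization: fold case so "CAT" and "Cat" compare equal.
        names = [name.casefold() for name in names]
    return names, list(reader)


names, rows = read_records("CAT,Dog\n1,2\n", header=True, case_sensitive_header=False)
print(names)  # ['cat', 'dog']
print(rows)   # [['1', '2']]
```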

Implementations

The following implementations are available for csv-dialect:

See the implementation page for further information on writing an implementation of a Frictionless Data specification.