CSV Dialect

CSV Dialect defines a simple format to describe the various dialects of CSV files in a language agnostic manner. It aims to deal with a reasonably large subset of the features which differ between dialects, such as terminator strings, quoting rules, escape rules and so on.

Authors Rufus Pollock
Media Type application/vnd.csvdialect+json
Version 1.2
Last Updated 30 January 2017
Created 20 February 2013

Language

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.

Table Of Contents

Introduction

CSV Dialect defines a simple format to describe the various dialects of CSV files in a language agnostic manner. It aims to deal with a reasonably large subset of the features which differ between dialects, such as terminator strings, quoting rules, escape rules and so on. The specification has been modeled around the union of the csv modules in Python and Ruby, and the bulk load capabilities of MySQL and PostgresQL.

Excluded

CSV Dialect has nothing to do with the names, contents or types of the headers or data within the CSV file, only how it is formatted. However, CSV Dialect does allow the presence or absence of a header to be specified, similarly to RFC4180.

CSV Dialect is also orthogonal to the character encoding used in the CSV file. Note that it is possible for files in CSV format to contain data in more than one encoding.

Usage

CSV Dialect is useful for programmes which might have to deal with multiple dialects of CSV file, but which can rely on being told out-of-band which dialect will be used in a given input stream. This reduces the need for heuristic inference of CSV dialects, and simplifies the implementation of CSV readers, which must juggle dialect inference, schema inference, unseekable input streams, character encoding issues, and the lazy reading of very large input streams.

Some related work can be found in this comparison of csv dialect support, this example of similar JSON format, and in Python's PEP 305.

Specification

A CSV Dialect descriptor MUST be a JSON object with the following properties:

  • delimiter - specifies the character sequence which should separate fields (aka columns). Default = ,. Example \t.
  • lineTerminator - specifies the character sequence which should terminate rows. Default = \r\n
  • quoteChar - specifies a one-character string to use as the quoting character. Default = "
  • doubleQuote - controls the handling of quotes inside fields. If true, two consecutive quotes should be interpreted as one. Default = true
  • escapeChar - specifies a one-character string to use for escaping (for example, \), mutually exclusive with quoteChar. Not set by default
  • nullSequence - specifies the null sequence (for example \N). Not set by default
  • skipInitialSpace - specifies how to interpret whitespace which immediately follows a delimiter; if false, it means that whitespace immediately after a delimiter should be treated as part of the following field. Default = true
  • header - indicates whether the file includes a header row. If true the first row in the file is a header row, not data. Default = true
  • caseSensitiveHeader - indicates that case in the header is meaningful. For example, columns CAT and Cat should not be equated. Default = false
  • csvddfVersion - a number, in n.n format, e.g., 1.0. If not present, consumers should assume latest schema version.

Example

Here's an example:

{
  "csvddfVersion": 1.0,
  "delimiter": ",",
  "doubleQuote": true,
  "lineTerminator": "\r\n",
  "quoteChar": "\"",
  "skipInitialSpace": true,
  "header": true
}

Changelog

See the Changelog for information.

Implementations

The following implementations are available for csv-dialect:

See the implementation page for further information on writing an implementation of a Frictionless Data specification.