Tabular Data Package
A simple format to describe tabular data and its metadata.

Authors Paul Walsh
Rufus Pollock
Martin Keegan
Version 1.0.0-rc.1
Last Updated 30 January 2017
Created 7 May 2012

Abstract

Tabular Data Package is a simple container format used to describe and package a collection of data sources with additional metadata about those data sources. By providing a minimum set of required properties and a range of optional properties, the format enables a simple contract for data interoperability (delivery, installation, management) that is governed by minimalism.

Goals

Tabular Data Package shares the design philosophy of all Frictionless Data Specifications, being:

  • Requirements that are driven by simplicity
  • Extensibility and customisation by design
  • Metadata that is human-editable and machine-usable
  • Reuse of existing standard formats for data
  • Language-, technology- and infrastructure-agnostic

Changelog

See the Changelog for information.

Language

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.

Specification

Tabular Data Package is a Data Package profile for publishing and sharing tabular data. Table Schema has a focus on ease of use over the web, and on simplifying the production and consumption of tabular data to/from spreadsheets and data storage backends such as SQL databases.

Tabular Data Package builds directly on the Data Package specification, and therefore a valid Tabular Data Package MUST be a valid Data Package.

Tabular Data Package has the following requirements over and above those imposed by Data Package:

  • There MUST be at least one resource in the resources array
  • Each resource MUST be a valid Tabular Data Resource
  • Each resource.path MUST point to a file in CSV format. The CSV file MUST conform to the constraints described in the CSV dialect specification
  • Each resource MUST have a schema property whose value MUST be a valid Table Schema (either inlined, or as a URI to a file)

Why is CSV the preferred data serialisation format?

  1. CSV is very simple - it is perhaps the most simple data format
  2. CSV is tabular-oriented. Most data structures are either tabular or can be transformed to a tabular structure by some form of normalization
  3. It is open and the "standard" is well-known
  4. It is widely supported - practically every spreadsheet program, relational database and programming language in existence can handle CSV in some form or other
  5. It is text-based and therefore amenable to manipulation and access from a wide range of standard tools (including revision control systems such as git, mercurial and subversion)
  6. It is line-oriented which means it can be incrementally processed - you do not need to read an entire file to extract a single row. For similar reasons it means that the format supports streaming.

More informally:

CSV is the data Kalashnikov: not pretty, but many wars have been fought with it and kids can use it. [@pudo (Friedrich Lindenberg)]

CSV is the ultimate simple, standard data format - streamable, text-based, no need for proprietary tools etc [@rufuspollock (Rufus Pollock)]

Examples

Example 1

Here's an example of a minimal simple data format dataset:

2 files

data.csv
datapackage.json


data.csv
--------

var1,var2,var3
A,1,2
B,3,4

datapackage.json
----------------

{
  "name": "my-dataset",
  # here we list the data files in this dataset
  "resources": [
    {
      "path": "data.csv",
      "schema": {
        "fields": [
          {
            "name": "var1",
            "type": "string"
          },
          {
            "name": "var2",
            "type": "integer"
          },
          {
            "name": "var3",
            "type": "number"
          }
        ]
      }
    }
  ]
}

A minimal Tabular Data Package on disk would be a directory containing a single file:

datapackage.json

Lacking a single piece of actual data would make this Tabular Data Package of doubtful use. A less minimal version would be:

datapackage.json
# a data file (CSV in this case)
data.csv

Additional files such as a README, scripts (for processing or analyzing the data) and other material may be provided. By convention scripts go in a scripts directory and thus, a more elaborate data package could look like this:

datapackage.json
README.md
mydata.csv
data/otherdata.csv
scripts/my-preparation-script.py

Several exemplar data packages can be found in the <a href="https://github.com/datasets">datasets organization on github</a>, including:

Example 2

A simple example of a Tabular Data Package descriptor looks as follows.

{
  "name": "my-unique-datapackage",
  "title": "my Unique Data Package",
  "licenses": [
    "http://example.com/license.md"
  ],
  "resources": [
    {
      "name": "",
      "path": ""
    }
  ]
  "sources": [
    {
      "name": ""
    }
  ]
}

Example 3

A more complete example of a Tabular Data Package descriptor looks as follows.

{
  "name": "my-unique-datapackage",
  "title": "my Unique Data Package",
  "licenses": [
    "http://example.com/license.md",
    "http://example.com/commercial-use.md"
  ],
  "author": {
  },
  "contributors": [
  ],
  "resources": [
    {
      "name": "",
      "path": ""
    },
    {
      "name": "",
      "path": ""
    },
    {
      "name": "",
      "path": ""
    }
  ]
  "sources": [
    {
      "name": ""
    }
  ]
}

Descriptor

A valid Tabular Data Package descriptor is an object conforming with the formal reference outlined in Properties, and, and the following more general requirements.

Form

The descriptor MUST be valid JSON, as described in RFC 4627, and SHOULD be in one of the following forms:

  1. A file named datapackage.json.
  2. An object, either on its own or nested in another data structure.

Media type

The media type for Tabular Data Package descriptors as MUST be application/vnd.datapackage+json. This media type is registered with IANA).

URIs

Several properties are defined as URI-formatted strings, which are to be considered as a subset of the formal URI specification described in RFC 3986. The additional constraints imposed are as follows:

  1. The only supported schemes are http and https. Absence of a scheme indicates either a POSIX path or a JSON Pointer (see below).
  2. URLs, indicated by http or https, MUST be fully qualified.
  3. POSIX paths, are supported for referencing local files, with the security restraint that they MUST be relative siblings or children of the descriptor. Absolute paths (/) and relative parent paths (../) MUST NOT be used, and implementations SHOULD NOT support these path types.
  4. JSON Pointers are supported as a general referencing mechanism to other properties in the same descriptor, and therefore MUST start with the pound symbol (#).

Properties

This section presents a complete description of required and optional properties for a Tabular Data Package descriptor.

Adherence to the specification does not imply that additional, non-specified properties cannot be used: a descriptor MAY include any number of properties in additional to those described as required and optional fields.

Required properties

A Tabular Data Package descriptor MUST include the following properties.

profile

The profile of this descriptor.

Every Package and Resource descriptor has a profile. The default profile, if none is declared, is `default`. The namespace for the profile is the type of descriptor, so, `default` for a Package descriptor is not the same as `default` for a Resource descriptor.
Examples
"profile": "tabular"
"profile": "fiscal"
"profile": "http://example.com/my-profiles-json-schema.json"

resources

An `array` of [Tabular Data Resource](/tabular-dataresource/) objects.

Examples
"resources": [ { "name": "my-data", "data": [ "data.csv" ], "schema": "tableschema.json" "mediatype": "text/csv" } ]
Items

Each item in the Tabular Data Resources array is a **Tabular Data Resource** object. The name, data, schema, profile properties are **required**.

All specified Tabular Data Resource properties are as follows:

profile

The profile of this descriptor.

name

An identifier string. Lower case characters with '.', '_', '-' and '/' are allowed.

data

A reference to the data for this resource. `data` `MUST` be an array of valid URIs.

Items

Each item in the Data array is a **URI** string. The property is **required**.

A minimal example of URI looks like:

"uri": "file.csv"

schema

A Table Schema for this resource, compliant with the [Table Schema](/tableschema/) specification.

title

A human-readable title.

description

A text description. Markdown is encouraged.

homepage

The home on the web that is related to this data package.

sources

The raw sources for this resource.

Items

Each item in the Sources array is a **Source** object. The property is **required**.

All specified Source properties are as follows:

name

An identifier string. Lower case characters with '.', '_', '-' and '/' are allowed.

uri

A URI (with some restrictions), being a fully qualified HTTP address, a relative POSIX path, or a JSON Pointer.

email

An email address.

licenses

The license(s) under which the resource is published.

Items

Each item in the Licenses array is a **License** object. The property is **required**.

All specified License properties are as follows:

name

An identifier string. Lower case characters with '.', '_', '-' and '/' are allowed.

uri

A URI (with some restrictions), being a fully qualified HTTP address, a relative POSIX path, or a JSON Pointer.

title

A human-readable title.

dialect

The CSV dialect descriptor.

format

The file format of this resource.

mediatype

The media type of this resource. Can be any valid media type listed with [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml).

encoding

The file encoding of this resource.

bytes

The size of this resource in bytes.

hash

The MD5 hash of this resource. Indicate other hashing algorithms with the {algorithm}:{hash} format.

Optional properties

A Tabular Data Package descriptor SHOULD include the following properties.

name

A property reserved for globally unique identifiers. Examples of identifiers that are unique include UUIDs and DOIs.

A common usage pattern for Data Packages is as a packaging format within the bounds of a system or platform. In these cases, a unique identifier for a package is desired for common data handling workflows, such as updating an existing package. While at the level of the specification, global uniqueness cannot be validated, consumers using the `id` property `MUST` ensure identifiers are globally unique.
Examples
"id": "b03ec84-77fd-4270-813b-0c698943f7ce"
"id": "http://dx.doi.org/10.1594/PANGAEA.726855"

title

A human-readable title.

Examples
"title": "My Package Title"

description

A text description. Markdown is encouraged.

Examples
# My Package description
All about my package.

homepage

The home on the web that is related to this data package.

Examples
"homepage": { "name": "My Web Page", "uri": "http://example.com/" }

sources

The raw sources for this resource.

Examples
"sources": [ { "name": "World Bank and OECD", "uri": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD" } ]
Items

Each item in the Sources array is a **Source** object. The property is **required**.

All specified Source properties are as follows:

name

An identifier string. Lower case characters with '.', '_', '-' and '/' are allowed.

uri

A URI (with some restrictions), being a fully qualified HTTP address, a relative POSIX path, or a JSON Pointer.

email

An email address.

licenses

The license(s) under which the resource is published.

This property is not legally binding and does not guarantee that the package is licensed under the terms defined herein.
Examples
"licenses": [ { "name": "ODC-PDDL-1.0", "uri": "http://opendatacommons.org/licenses/pddl/" } ]
Items

Each item in the Licenses array is a **License** object. The property is **required**.

All specified License properties are as follows:

name

An identifier string. Lower case characters with '.', '_', '-' and '/' are allowed.

uri

A URI (with some restrictions), being a fully qualified HTTP address, a relative POSIX path, or a JSON Pointer.

title

A human-readable title.