Data Package
A simple format to describe data and its metadata.

Authors: Paul Walsh, Rufus Pollock
Version: 1.0.0-rc.1
Last Updated: 30 January 2017
Created: 12 November 2007

Abstract

Data Package is a simple container format used to describe and package a collection of data sources with additional metadata about those data sources. By providing a minimum set of required properties and a range of optional properties, the format enables a simple contract for data interoperability (delivery, installation, management) that is governed by minimalism.

Goals

Data Package shares the design philosophy of all Frictionless Data Specifications, being:

  • Requirements that are driven by simplicity
  • Extensibility and customisation by design
  • Metadata that is human-editable and machine-usable
  • Reuse of existing standard formats for data
  • Language-, technology- and infrastructure-agnostic

Changelog

See the Changelog for information.

Language

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.

Specification

A Data Package consists of:

  • Metadata that describes the structure and contents of the package
  • Resources such as data files that form the contents of the package

Resources in a Data Package are declared on the resources property, which is an array of Data Resource objects.

The Data Package specification does NOT impose any requirements on the form or structure of data described by a Data Resource. Therefore, Data Package can be used for packaging any kind of data.

The data included in the package may be provided as:

  • Files bundled locally with the package descriptor
  • Remote resources, referenced by URL
  • "Inline" data (see below) which is included directly in the descriptor

A valid descriptor MUST contain both a name property and a resources property. The definition of these properties is described below.

Customisation

A Data Package, like any Frictionless Data descriptor, MAY add any number of additional properties beyond those listed in its specification.

For example, if you were storing time series data and wanted to list the temporal coverage of the data in the Data Package you could add a property temporal (cf. Dublin Core):

"temporal": {
  "name": "19th Century",
  "start": "1800-01-01",
  "end": "1899-12-31"
}

This flexibility enables specific communities to extend Data Packages as appropriate for the data they manage.

Profiles

An extension of Data Package may be formalised as a profile. A profile is a Data Package which extends the default specification towards more specific needs.

A profile is declared on the profile property. For the default Data Package descriptor, this SHOULD be present with a value of default, but if not, the absence of a profile is equivalent to setting "profile": "default".

Custom profiles MUST have a profile property, where the value is a unique identifier for that profile. This unique identifier takes one of two forms: an id from the official Data Package Schema Registry, or a URI that points directly to a JSON Schema that can be used to validate the profile.
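
The profile rules above can be sketched in a few lines of Python. This is only an illustration: the function names resolve_profile and profile_kind are hypothetical, and treating any http or https value as a schema URI (and everything else as a registry id) is an assumption of this sketch, not something the specification defines.

```python
from urllib.parse import urlparse

DEFAULT_PROFILE = "default"


def resolve_profile(descriptor: dict) -> str:
    """Return the declared profile, or "default" when none is declared."""
    return descriptor.get("profile", DEFAULT_PROFILE)


def profile_kind(profile: str) -> str:
    """Classify a profile identifier (assumption: http/https means a schema URI)."""
    scheme = urlparse(profile).scheme
    if scheme in ("http", "https"):
        return "schema-uri"
    return "registry-id"
```

A consumer could call resolve_profile first, then use profile_kind to decide whether to look the identifier up in a registry or fetch a JSON Schema.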

As part of the Frictionless Data Specifications project, we publish a number of Data Package profiles. See those profiles below.

Examples

A minimal Data Package on disk would be a directory containing a single file:

datapackage.json

A less minimal version would be:

datapackage.json
# a data file (CSV in this case)
data.csv

Additional files such as a README, scripts (for processing or analyzing the data) and other material may be provided. By convention, scripts go in a scripts directory, so a more elaborate Data Package could look like this:

datapackage.json
README.md
mydata.csv
data/otherdata.csv
scripts/my-preparation-script.py

Several example Data Packages can be found in the datasets organization on GitHub.

A simple example of a Data Package descriptor looks as follows.

{
  "name": "my-unique-datapackage",
  "title": "My Unique Data Package",
  "licenses": [
    "http://example.com/license.md"
  ],
  "resources": [
    {
      "name": "",
      "data": [ "" ]
    }
  ],
  "sources": [
    {
      "name": ""
    }
  ]
}

A more complete example of a Data Package descriptor looks as follows.

{
  "name": "my-unique-datapackage",
  "title": "My Unique Data Package",
  "licenses": [
    "http://example.com/license.md",
    "http://example.com/commercial-use.md"
  ],
  "author": {
  },
  "contributors": [
  ],
  "resources": [
    {
      "name": "",
      "data": [ "" ]
    },
    {
      "name": "",
      "data": [ "" ]
    },
    {
      "name": "",
      "data": [ "" ]
    }
  ],
  "sources": [
    {
      "name": ""
    }
  ]
}

Descriptor

A valid Data Package descriptor is an object conforming with the formal reference outlined in Properties, and the following more general requirements.

Form

The descriptor MUST be valid JSON, as described in RFC 4627, and SHOULD be in one of the following forms:

  1. A file named datapackage.json.
  2. An object, either on its own or nested in another data structure.
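
As a rough illustration of the two forms, the following Python sketch accepts either an in-memory object or a path on disk. The function name load_descriptor is hypothetical, and accepting a directory that contains datapackage.json is a convenience assumed here rather than a requirement of the specification.

```python
import json
from pathlib import Path
from typing import Union


def load_descriptor(source: Union[str, Path, dict]) -> dict:
    """Return a descriptor from a dict, a datapackage.json path, or a directory."""
    if isinstance(source, dict):
        # Form 2: the descriptor is already an object.
        return source
    path = Path(source)
    if path.is_dir():
        # Form 1: look for the conventional file name inside the directory.
        path = path / "datapackage.json"
    with path.open(encoding="utf-8") as handle:
        descriptor = json.load(handle)
    if not isinstance(descriptor, dict):
        raise ValueError("a Data Package descriptor must be a JSON object")
    return descriptor
```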

A JSON Schema to validate Data Package descriptors is available here.

Media type

The media type for Data Package descriptors MUST be application/vnd.datapackage+json. This media type is registered with IANA.

URIs

Several properties are defined as URI-formatted strings, which are to be considered as a subset of the formal URI specification described in RFC 3986. The additional constraints imposed are as follows:

  1. The only supported schemes are http and https. Absence of a scheme indicates either a POSIX path or a JSON Pointer (see below).
  2. URLs, indicated by http or https, MUST be fully qualified.
  3. POSIX paths are supported for referencing local files, with the security restriction that they MUST be relative siblings or children of the descriptor. Absolute paths (/) and relative parent paths (../) MUST NOT be used, and implementations SHOULD NOT support these path types.
  4. JSON Pointers are supported as a general referencing mechanism to other properties in the same descriptor, and therefore MUST start with the pound symbol (#).
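
The four constraints above could be checked along the following lines. This is a minimal sketch: classify_uri is a hypothetical helper, the returned labels are invented for illustration, and the decision to reject an empty string is an assumption of this sketch.

```python
from urllib.parse import urlparse


def classify_uri(value: str) -> str:
    """Return "url", "json-pointer" or "path"; raise ValueError if disallowed."""
    if not value:
        raise ValueError("empty URI")  # assumption: empty strings are rejected
    if value.startswith("#"):
        # Rule 4: JSON Pointers start with the pound symbol.
        return "json-pointer"
    parsed = urlparse(value)
    if parsed.scheme:
        # Rule 1: only http and https are supported schemes.
        if parsed.scheme not in ("http", "https"):
            raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
        # Rule 2: URLs must be fully qualified (they need a host part).
        if not parsed.netloc:
            raise ValueError("URLs must be fully qualified")
        return "url"
    # Rule 3: paths must stay at or below the descriptor's directory.
    if value.startswith("/"):
        raise ValueError("absolute paths are not allowed")
    if ".." in value.split("/"):
        raise ValueError("parent path segments are not allowed")
    return "path"
```

An implementation would typically call such a check before dereferencing any URI-formatted property, so that unsafe paths are rejected before any file access happens.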

Properties

This section presents a complete description of required and optional properties for a Data Package descriptor.

Adherence to the specification does not imply that additional, non-specified properties cannot be used: a descriptor MAY include any number of properties in addition to those described as required and optional properties.

Required properties

A Data Package descriptor MUST include the following properties.

resources

An array of Data Resource objects, each compliant with the Data Resource specification.

Examples
{
  "resources": [
    {
      "name": "my-data",
      "data": [
        "data.csv"
      ],
      "mediatype": "text/csv"
    }
  ]
}
Items

Each item in the Data Resources array is a Data Resource object. The name and data properties are required, and other defined properties are optional.
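
A structural check for the resources property might look like the sketch below. The function name resource_errors and the wording of the messages are hypothetical; the sketch only checks that the required keys are present and does not validate their values.

```python
def resource_errors(descriptor: dict) -> list:
    """Return a list of problems found in the descriptor's resources property."""
    resources = descriptor.get("resources")
    if not isinstance(resources, list):
        return ["'resources' must be an array of Data Resource objects"]
    errors = []
    for index, resource in enumerate(resources):
        if not isinstance(resource, dict):
            errors.append(f"resources[{index}] must be an object")
            continue
        # name and data are the required properties of each Data Resource.
        for key in ("name", "data"):
            if key not in resource:
                errors.append(f"resources[{index}] is missing '{key}'")
    return errors
```

An empty list means the structure passed this check; it does not replace validation against the published JSON Schema.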

profile

The profile of this descriptor.

Every Package and Resource descriptor has a profile. The default profile, if none is declared, is default. The namespace for the profile is the type of descriptor, so, default for a Package descriptor is not the same as default for a Resource descriptor.

name

An identifier string. Lower case characters with ., _, - and / are allowed.

This is ideally a url-usable and human-readable name. Name SHOULD be invariant, meaning it SHOULD NOT change when its parent descriptor is updated.

data

A reference to the data for this resource. data MUST be an array of valid URIs.

The dereferenced value of each referenced data source in the data array MUST be commensurate with a native, dereferenced representation of the data the resource describes. For example, in a Tabular Data Resource, this means that the dereferenced value of data MUST be an array.

Items

Each item in the Data array is a URI string.

schema

A schema for this resource.

title

A human-readable title.

description

A text description. Markdown is encouraged.

homepage

The home on the web that is related to this data package.

sources

The raw sources for this resource.

Items

Each item in the Sources array is a Source object. The uri property is required, and other defined properties are optional.

name

An identifier string. Lower case characters with ., _, - and / are allowed.

This is ideally a url-usable and human-readable name. Name SHOULD be invariant, meaning it SHOULD NOT change when its parent descriptor is updated.

uri

A URI (with some restrictions), being a fully qualified HTTP address, a relative POSIX path, or a JSON Pointer.

Implementations need to negotiate the type of URI provided, and dereference the data accordingly. There are restrictions imposed on URIs that are POSIX paths: see the notes on descriptors for more information.

email

An email address.

licenses

The license(s) under which the resource is published.

This property is not legally binding and does not guarantee that the package is licensed under the terms defined herein.

Items

Each item in the Licenses array is a License object. The uri property is required, and other defined properties are optional.

name

An identifier string. Lower case characters with ., _, - and / are allowed.

This is ideally a url-usable and human-readable name. Name SHOULD be invariant, meaning it SHOULD NOT change when its parent descriptor is updated.

uri

A URI (with some restrictions), being a fully qualified HTTP address, a relative POSIX path, or a JSON Pointer.

Implementations need to negotiate the type of URI provided, and dereference the data accordingly. There are restrictions imposed on URIs that are POSIX paths: see the notes on descriptors for more information.

title

A human-readable title.

format

The file format of this resource.

csv, xls, json are examples of common formats.

mediatype

The media type of this resource. Can be any valid media type listed with IANA.

encoding

The file encoding of this resource.

bytes

The size of this resource in bytes.

hash

The MD5 hash of this resource. Indicate other hashing algorithms with the {algorithm}:{hash} format.
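
The hash convention can be illustrated with a short Python sketch. The helper names parse_hash and matches_hash are hypothetical, and the assumption that algorithm labels match the names accepted by Python's hashlib (for example sha256) is made here for illustration only; the specification does not list the permitted labels.

```python
import hashlib


def parse_hash(value: str):
    """Split a hash property into (algorithm, digest); a bare digest means MD5."""
    algorithm, separator, digest = value.partition(":")
    if not separator:
        return "md5", value
    return algorithm.lower(), digest


def matches_hash(content: bytes, value: str) -> bool:
    """Check content against a hash property (assumes hashlib algorithm names)."""
    algorithm, digest = parse_hash(value)
    # hashlib.new raises ValueError for algorithm names it does not know.
    return hashlib.new(algorithm, content).hexdigest() == digest.lower()
```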

Optional properties

A Data Package descriptor SHOULD include the following properties.

profile

The profile of this descriptor.

Every Package and Resource descriptor has a profile. The default profile, if none is declared, is default. The namespace for the profile is the type of descriptor, so, default for a Package descriptor is not the same as default for a Resource descriptor.

Examples
{
  "profile": "tabular-data-package"
}
{
  "profile": "http://example.com/my-profiles-json-schema.json"
}

name

An identifier string. Lower case characters with ., _, - and / are allowed.

This is ideally a url-usable and human-readable name. Name SHOULD be invariant, meaning it SHOULD NOT change when its parent descriptor is updated.

Examples
{
  "name": "my-nice-name"
}
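
One way to check a name against the allowed characters is a regular expression, sketched below. The helper is_valid_name is hypothetical, and allowing digits alongside lower case letters is an assumption of this sketch.

```python
import re

# Lower case letters plus ".", "_", "-" and "/"; digits are an assumption here.
NAME_PATTERN = re.compile(r"[a-z0-9._/-]+")


def is_valid_name(name: str) -> bool:
    """Return True when the whole string consists of allowed characters."""
    return NAME_PATTERN.fullmatch(name) is not None
```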

id

A property reserved for globally unique identifiers. Examples of identifiers that are unique include UUIDs and DOIs.

A common usage pattern for Data Packages is as a packaging format within the bounds of a system or platform. In these cases, a unique identifier for a package is desired for common data handling workflows, such as updating an existing package. Although global uniqueness cannot be validated at the level of the specification, consumers using the id property MUST ensure identifiers are globally unique.

Examples
{
  "id": "b03ec84-77fd-4270-813b-0c698943f7ce"
}
{
  "id": "http://dx.doi.org/10.1594/PANGAEA.726855"
}
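
For systems that mint their own identifiers, a random UUID is one option. The sketch below uses Python's standard uuid module; the helper names new_package_id and is_uuid are hypothetical, and nothing here addresses DOIs or other identifier schemes.

```python
import uuid


def new_package_id() -> str:
    """Return a freshly generated random (version 4) UUID as a string."""
    return str(uuid.uuid4())


def is_uuid(value: str) -> bool:
    """Return True when the string parses as a UUID."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True
```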

title

A human-readable title.

Examples
{
  "title": "My Package Title"
}

description

A text description. Markdown is encouraged.

Examples
{
  "description": "# My Package description\nAll about my package."
}

homepage

The home on the web that is related to this data package.

Examples
{
  "homepage": {
    "name": "My Web Page",
    "uri": "http://example.com/"
  }
}

created

The datetime on which this descriptor was created.

The datetime must conform to the string formats for datetime as described in RFC 3339.

Examples
{
  "created": "1985-04-12T23:20:50.52Z"
}
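
A created value in the common RFC 3339 form can be parsed with the Python standard library, as sketched below. The helper parse_created is hypothetical, and the sketch covers only timestamps with a Z or numeric offset, with or without fractional seconds; it is not a complete RFC 3339 validator.

```python
from datetime import datetime

# Two layouts: with and without fractional seconds. %z accepts "Z" and "+hh:mm".
_FORMATS = ("%Y-%m-%dT%H:%M:%S.%f%z", "%Y-%m-%dT%H:%M:%S%z")


def parse_created(value: str) -> datetime:
    """Parse a created timestamp into a timezone-aware datetime."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"not a supported RFC 3339 datetime: {value!r}")
```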

contributors

The contributors to this descriptor.

Examples
{
  "contributors": [
    {
      "name": "Joe Bloggs"
    }
  ]
}
{
  "contributors": [
    {
      "name": "Joe Bloggs",
      "email": "[email protected]",
      "role": "author"
    }
  ]
}
Items

Each item in the Contributors array is a Contributor object. The name property is required, and other defined properties are optional.

name

An identifier string. Lower case characters with ., _, - and / are allowed.

This is ideally a url-usable and human-readable name. Name SHOULD be invariant, meaning it SHOULD NOT change when its parent descriptor is updated.

uri

A URI (with some restrictions), being a fully qualified HTTP address, a relative POSIX path, or a JSON Pointer.

Implementations need to negotiate the type of URI provided, and dereference the data accordingly. There are restrictions imposed on URIs that are POSIX paths: see the notes on descriptors for more information.

email

An email address.

role

keywords

A list of keywords that describe this package.

Examples
{
  "keywords": [
    "data",
    "fiscal",
    "transparency"
  ]
}
Items

Each item in the Keywords array is a string.

licenses

The license(s) under which this package is published.

This property is not legally binding and does not guarantee that the package is licensed under the terms defined herein.

Examples
{
  "licenses": [
    {
      "name": "ODC-PDDL-1.0",
      "uri": "http://opendatacommons.org/licenses/pddl/"
    }
  ]
}
Items

Each item in the Licenses array is a License object. The uri property is required, and other defined properties are optional.
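
The requirement that each item carries a uri can be checked with a small helper such as the one sketched here. The name missing_uri is hypothetical, and the helper checks only for the presence of the key, not whether the value is a valid URI.

```python
def missing_uri(items: list) -> list:
    """Return the indexes of items that are not objects with a 'uri' property."""
    return [
        index
        for index, item in enumerate(items)
        if not isinstance(item, dict) or "uri" not in item
    ]
```

The same check applies to any array whose items require a uri, such as sources.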

name

An identifier string. Lower case characters with ., _, - and / are allowed.

This is ideally a url-usable and human-readable name. Name SHOULD be invariant, meaning it SHOULD NOT change when its parent descriptor is updated.

uri

A URI (with some restrictions), being a fully qualified HTTP address, a relative POSIX path, or a JSON Pointer.

Implementations need to negotiate the type of URI provided, and dereference the data accordingly. There are restrictions imposed on URIs that are POSIX paths: see the notes on descriptors for more information.

title

A human-readable title.

sources

The raw sources for this package.

Examples
{
  "sources": [
    {
      "name": "World Bank and OECD",
      "uri": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
    }
  ]
}
Items

Each item in the Sources array is a Source object. The uri property is required, and other defined properties are optional.

name

An identifier string. Lower case characters with ., _, - and / are allowed.

This is ideally a url-usable and human-readable name. Name SHOULD be invariant, meaning it SHOULD NOT change when its parent descriptor is updated.

uri

A URI (with some restrictions), being a fully qualified HTTP address, a relative POSIX path, or a JSON Pointer.

Implementations need to negotiate the type of URI provided, and dereference the data accordingly. There are restrictions imposed on URIs that are POSIX paths: see the notes on descriptors for more information.

email

An email address.

Implementations

The following implementations are available for data-package:

See the implementation page for further information on writing an implementation of a Frictionless Data specification.