This document is meant to serve as an introduction and an entry point into writing a library that implements a Frictionless Data specification. The focus is on two libraries in particular - Data Package and Table Schema - as implementing these libraries essentially implements the whole family of specifications.
The reader, being an implementer/maintainer of such libraries, should get a clear understanding of the reference material available for undertaking work, and the minimal set of actions that such libraries must enable for their users.
We prefer to focus on actions rather than features, feature sets, user stories, or more formal API specifications as we want to leave enough flexibility for implementations that follow the idioms of the host language, yet we do want to ensure a common base of what can be done with an implementation in any language.
While OKI and various other third parties have used the Data Package family of specifications with great success for several years, it is mostly over the last 12 months that we have started to see more mature libraries that implement the specifications at a “low level” for ease of reuse.
descriptors that are passed to these libraries. This enables significant reuse across implementations for descriptor validation logic.
# High-level requirements
Data Package and
Also, see the stack reference section below for some naming conventions we use, which should ideally be followed in new implementations.
# Data Package library
The Data Package library can load and validate any descriptor for a Data Package Profile, allow the creation and modification of descriptors, and expose methods for reading and streaming data in the package. When a descriptor is a Tabular Data Package, it uses the Table Schema library, and exposes its functionality, for each resource object in the resources array.
- Data Package specification
- Data Package Profiles
- Tabular Data Package specification (the most commonly used and useful Profile)
- JSON Schema Registry
- Python implementation
- R implementation
- read an existing Data Package descriptor
- validate an existing Data Package descriptor, including profile-specific validation via the registry of JSON Schemas
- create a new Data Package descriptor
- edit an existing Data Package descriptor
- as part of editing a descriptor, helper methods to add and remove resources from the resources array
- validate edits made to a data package descriptor
- save a Data Package descriptor to a file path
- zip a Data Package descriptor and its co-located references (more generically: “zip a data package”)
- read a zip file that “claims” to be a data package
- save a zipped Data Package to disk
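The actions above can be sketched in a few lines of standard-library Python. This is only an illustration of the shape such a library might take, not the API of the Python implementation: `PackageSketch` and all of its methods are hypothetical names, and `errors()` is a stand-in structural check where a real library would validate against the JSON Schema for the relevant profile from the registry.

```python
import json
import zipfile
from pathlib import Path


class PackageSketch:
    """Hypothetical minimal wrapper around a Data Package descriptor (a dict)."""

    def __init__(self, descriptor=None):
        # Create a new descriptor when none is supplied.
        self.descriptor = descriptor if descriptor is not None else {"resources": []}

    @classmethod
    def load(cls, path):
        # Read an existing descriptor from a file path.
        return cls(json.loads(Path(path).read_text(encoding="utf-8")))

    @classmethod
    def load_zip(cls, zip_path):
        # Read a zip file that "claims" to be a data package.
        with zipfile.ZipFile(zip_path) as archive:
            if "datapackage.json" not in archive.namelist():
                raise ValueError("archive has no datapackage.json at its root")
            return cls(json.loads(archive.read("datapackage.json")))

    def errors(self):
        # Stand-in for profile validation: a few structural checks only.
        resources = self.descriptor.get("resources")
        if not isinstance(resources, list) or not resources:
            return ["'resources' must be a non-empty array"]
        found = []
        for index, resource in enumerate(resources):
            if "name" not in resource:
                found.append(f"resource {index} has no 'name'")
            if "path" not in resource and "data" not in resource:
                found.append(f"resource {index} needs 'path' or 'data'")
        return found

    def add_resource(self, resource):
        # Helper for editing: append a resource object.
        self.descriptor.setdefault("resources", []).append(resource)

    def remove_resource(self, name):
        # Helper for editing: drop every resource with the given name.
        self.descriptor["resources"] = [
            r for r in self.descriptor.get("resources", []) if r.get("name") != name
        ]

    def save(self, path):
        # Save the descriptor to a file path.
        Path(path).write_text(json.dumps(self.descriptor, indent=2), encoding="utf-8")

    def save_zip(self, zip_path, base_dir):
        # Zip the descriptor together with its co-located (relative, local) files.
        with zipfile.ZipFile(zip_path, "w") as archive:
            archive.writestr("datapackage.json", json.dumps(self.descriptor, indent=2))
            for resource in self.descriptor.get("resources", []):
                path = resource.get("path")
                if isinstance(path, str) and "://" not in path:
                    archive.write(Path(base_dir) / path, arcname=path)
```

Calling `errors()` again after each `add_resource` or `remove_resource` covers the "validate edits" action. An implementation that follows the idioms of its host language may well expose validation as a property or raise exceptions instead.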
# Table Schema library
The Table Schema library can load and validate any Table Schema descriptor, allow the creation and modification of descriptors, and expose methods for reading and streaming data that conforms to a Table Schema via the Tabular Data Resource abstraction.
- Table Schema specification
- Tabular Data Resource specification
- JSON Schema Registry
- Python implementation
- read an existing Table Schema descriptor
- validate an existing Table Schema descriptor using the JSON Schema spec
- create a new Table Schema descriptor
- edit an existing Table Schema descriptor
- provide a model-type interface to interact with a descriptor
- infer a Table Schema descriptor from a supplied sample of data
- validate a data source against the Table Schema descriptor, including in response to editing the descriptor
- enable streaming and reading of a data source through a Table Schema (cast on iteration)
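To make the last three actions more concrete, here is a minimal sketch of inference and "cast on iteration" in plain Python. The function names `infer_schema` and `iter_cast` are hypothetical, only the `integer`, `number`, and `string` types are handled, and Python's own `int` and `float` parsing is assumed to be an acceptable approximation of the specification's lexical rules, which it is not in every detail.

```python
# Candidate types, ordered from most to least specific.
CASTERS = {"integer": int, "number": float, "string": str}


def _all_cast(values, caster):
    # True when every value in the column can be cast with the given caster.
    try:
        for value in values:
            caster(value)
    except ValueError:
        return False
    return True


def infer_schema(headers, sample_rows):
    """Infer a Table Schema descriptor from headers and a sample of string rows."""
    fields = []
    for index, name in enumerate(headers):
        column = [row[index] for row in sample_rows]
        for type_name, caster in CASTERS.items():
            if _all_cast(column, caster):
                # The first (most specific) type that fits the whole column wins.
                fields.append({"name": name, "type": type_name})
                break
    return {"fields": fields}


def iter_cast(schema, rows):
    """Yield each row cast to native values, raising ValueError on a bad row."""
    casters = [CASTERS[field["type"]] for field in schema["fields"]]
    for row_number, row in enumerate(rows, start=1):
        if len(row) != len(casters):
            raise ValueError(
                f"row {row_number}: expected {len(casters)} cells, got {len(row)}"
            )
        try:
            cast_row = [cast(cell) for cast, cell in zip(casters, row)]
        except ValueError as error:
            raise ValueError(f"row {row_number}: {error}") from error
        yield cast_row
```

Because `iter_cast` is a generator, rows are validated lazily as the consumer iterates, so a large source never has to be held in memory. Re-running it after editing the descriptor is one simple way to validate data in response to an edit.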
# On dereferencing and descriptor validation
Some properties in the Frictionless Data specifications allow a path (a URL or a POSIX path) that resolves to an object. The most prominent example of this is the schema property on Tabular Data Resource descriptors.
Allowing such references has practical use for publishers, for example in allowing schema reuse. However, it does introduce difficulties in the validation of such properties. For example, validating a path pointing to a schema rather than the schema object itself will do little to guarantee the integrity of the schema definition. Therefore, implementors MUST dereference such “referential” property values before attempting to validate a descriptor. At present, this requirement applies to the following properties in Tabular Data Package and Tabular Data Resource:
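As an illustration of the dereferencing step described in this section, the sketch below resolves a referential property to the object it points at before any validation runs. The helper names `dereference` and `dereference_resource` are hypothetical, the `properties` default lists only the schema property named above, and the sketch assumes the referenced document is JSON.

```python
import json
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def dereference(value, base_path="."):
    """Return the object that a referential property value resolves to."""
    if isinstance(value, dict):
        # Already an inline object: nothing to resolve.
        return value
    if not isinstance(value, str):
        raise TypeError("expected an object or a path string")
    if urlparse(value).scheme in ("http", "https"):
        # Remote reference: fetch and parse the JSON document.
        with urlopen(value) as response:
            return json.load(response)
    # Local reference: resolve relative to the descriptor's directory.
    return json.loads((Path(base_path) / value).read_text(encoding="utf-8"))


def dereference_resource(resource, base_path=".", properties=("schema",)):
    """Return a copy of a resource descriptor with referential properties inlined."""
    resolved = dict(resource)
    for name in properties:
        if name in resolved:
            resolved[name] = dereference(resolved[name], base_path)
    return resolved
```

Validation then runs on the returned copy, so the schema object itself is checked rather than the string that points to it, and the publisher's original descriptor is left untouched.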
# Other libraries
tabulator - which are an important part of the Frictionless Data stack, and we would be delighted to see them implemented in other languages either as standalone libraries, or, as part of a wider effort in implementing Data Package and Table Schema.
tabulator provides a consistent interface for streaming reading and writing of tabular data. It supports CSV, which is required for Table Schema, Tabular Data Resource, and Tabular Data Package, and also supports Excel, JSON, newline delimited JSON, Google Sheets, and ODS.
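For implementers in other languages, the core idea of a consistent streaming interface can be sketched as follows. This is not tabulator's API: `stream_rows` and its `source_format` parameter are hypothetical, only two formats are handled, and each newline-delimited JSON line is assumed to hold a JSON array of cell values.

```python
import csv
import json


def stream_rows(source, source_format="csv"):
    """Yield rows as lists from an open text stream, whatever the format."""
    if source_format == "csv":
        # csv.reader is itself lazy, so rows are produced one at a time.
        yield from csv.reader(source)
    elif source_format == "ndjson":
        for line in source:
            if line.strip():  # skip blank lines
                yield json.loads(line)
    else:
        # Raised on first iteration, because this function is a generator.
        raise ValueError(f"unsupported format: {source_format}")
```

Whatever the underlying format, the consumer sees the same thing: an iterator of rows. That is what lets a Table Schema layer cast values on iteration without caring where the data came from.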
goodtables validates tabular data, checking for structural and schematic errors, and producing reports that can be used to iterate on data file sources as part of common data publication workflows. goodtables uses
datapackage internally, as well as implementing
It may be of general interest that goodtables is also available as a service, goodtables.io, providing continuous data validation in the style of CI solutions for code.
# Work process
If you would like to contribute sections based on idioms in your target language, that would be great: it will serve as a further reference for others, and has the added benefit of enabling our team to learn from you.