csvz is the hot new open database standard that is taking the entire technological world by storm.
A csvz file is literally just a bunch of csv files, in a zip file, that has been renamed to have a ".csvz" file extension.
Are you using csvz? Why not?
csvz is the brave technology that unites the worlds of data science, sql and no-sql. Is it no-sql's answer to the rdbms? Or is it the rdbms answer to no-sql? You decide.
- The csvz specification
  - csvz-0: A csvz file is literally just a bunch of csv files, in a zip file with a file name that ends with ".csvz"
  - csvz-meta-tables: A csvz file can contain a file called tables.csv describing the contents of the file
  - csvz-meta-columns: A csvz file can contain a file called columns.csv
  - csvz-meta-relations: A csvz file can contain a file called relations.csv
  - csvz-meta-csv: A csvz file can contain a file called csv.csv
  - csvz-meta-per-file: The ability to include individual meta-files per csv file
- Suggested specification fragments
- A list of csvz-compliant Tools and Libraries
- Contribute
- License
The csvz specification is broken into meaningful fragments.
Files can call themselves csvz-compliant if they only comply with the first fragment of the specification, csvz-0.
They can also indicate other fragments of the specification that they have implemented, such as csvz-meta-tables, csvz-meta-relations etc.
csvz-0
A csvz file is literally just a bunch of csv files, in a zip file with a file name that ends with ".csvz"
A csvz file is compliant with csvz-0 if it is literally just a bunch of csv files, in a zip file, that has been renamed to have a ".csvz" file extension.
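As an illustration of the csvz-0 rule, here is a minimal Python sketch that writes a csvz file and performs a rough compliance check. The function names `write_csvz` and `looks_like_csvz_0` are hypothetical and not part of the spec. Requiring at least one csv member and writing UTF-8 are assumptions of this sketch, not requirements stated above.

```python
import csv
import io
import zipfile


def write_csvz(path, tables):
    """Write {csv file name: list of rows} into a zip archive at `path`."""
    if not str(path).endswith(".csvz"):
        raise ValueError("a csvz file name must end with '.csvz'")
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, rows in tables.items():
            buffer = io.StringIO(newline="")
            csv.writer(buffer).writerows(rows)
            # writestr encodes str data as UTF-8 (an assumption of this sketch)
            archive.writestr(name, buffer.getvalue())


def looks_like_csvz_0(path):
    """Rough check: '.csvz' name, a readable zip, and only .csv members."""
    if not str(path).endswith(".csvz") or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # Ignore directory entries such as "_meta/"
        members = [n for n in archive.namelist() if not n.endswith("/")]
    # Assumption: "a bunch of csv files" means at least one
    return bool(members) and all(n.endswith(".csv") for n in members)
```

The check only looks at file names. It does not try to parse each member as csv.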
(Note that each fragment has a fragment identifier written at the beginning of the fragment. For example this is csvz-0 and the next fragment is csvz-meta-tables. Fragments are optional, but it is good to know which fragments you do or do not comply with.)
The csv files themselves should be parseable with most csv reading software.
(Anywhere that this spec refers to "a csv file" it means a file that complies with RFC 4180 or a compatible dialect as described by the CSV on the Web Working Group, unless a stricter definition is explicitly given.)
(Anywhere that the csvz specification refers to "this spec" it means the csvz specification.)
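To show what "parseable with most csv reading software" can look like in practice, the following sketch reads every csv member of a csvz file using only Python's standard library. The name `read_csvz` is hypothetical, and decoding as UTF-8 is an assumption, since this fragment does not fix an encoding.

```python
import csv
import io
import zipfile


def read_csvz(path):
    """Return {member name: list of rows} for every .csv member of the archive."""
    tables = {}
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if not name.endswith(".csv"):
                continue
            with archive.open(name) as raw:
                # newline="" lets the csv module handle quoted line breaks itself
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[name] = list(csv.reader(text))
    return tables
```

Every value comes back as a string, because csv itself carries no type information.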
csvz-meta-tables
A csvz file can contain a file called tables.csv describing the contents of the file
Metadata about the contents of the csvz file is contained in a directory called "_meta". The file tables.csv, if present, is inside this directory.
(Assume that the csvz specification reserves the right to create other .csv files under the _meta folder, and to create more folders under it. Details appear in subsequent spec fragments.)
The file tables.csv contains metadata about all of the csv files included in the csvz file.
(The file tables.csv is a csv file.)
(Anywhere that this spec refers to a file with a name that ends with ".csv" it means the file is a "csv file", as described in csvz-0.)
The file tables.csv meets the following description:
- There is a header row naming the columns in this file
- Each data row describes a different csv file within this csvz file
- The columns must include a column called "filename"
- There may be more columns.
- Here are some suggestions:
  - bytes - the size of the file in bytes
  - rows - the number of rows in the file
  - columns - the number of columns in the file
  - description - a description of the file
  - published - the date the data in the file was first published
  - source - information about the source of the data in the file
  - has-column-names - a true/false value indicating if the file has a header row containing column names
  - skip-rows - How many rows need to be skipped, before the data begins? (Rarely need to specify this, but when you need it, you need it!)
  - (todo: where information in tables.csv conflicts with information in csv.csv, then tables.csv has precedence over csv.csv, for the file it describes. For example csv.csv may indicate that all files have header rows, but a specific file may not, and this would be indicated in tables.csv)
- The file tables.csv may also describe itself. See Russell. Note that bytes (for example) might cause a paradox.
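The description above can be sketched as a small generator for tables.csv. This is only an illustration: `build_tables_csv` is a hypothetical name, and the sketch makes several assumptions. It emits the required "filename" column plus the suggested bytes, rows and columns. It counts every row, including any header row. It takes the widest row as the column count. It also skips everything under _meta, which sidesteps the self-description paradox.

```python
import csv
import io
import zipfile


def build_tables_csv(archive):
    """Return the text of a tables.csv describing the data csv files in `archive`."""
    out = io.StringIO(newline="")
    writer = csv.writer(out)
    writer.writerow(["filename", "bytes", "rows", "columns"])
    for info in archive.infolist():
        name = info.filename
        # Assumption: metadata files are not described, avoiding self-reference
        if not name.endswith(".csv") or name.startswith("_meta/"):
            continue
        text = archive.read(name).decode("utf-8")
        rows = list(csv.reader(io.StringIO(text, newline="")))
        width = max((len(r) for r in rows), default=0)
        # info.file_size is the uncompressed size of the member in bytes
        writer.writerow([name, info.file_size, len(rows), width])
    return out.getvalue()
```

One way to use it is to reopen the archive in append mode and store the result with `archive.writestr("_meta/tables.csv", text)`.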
(The word "must" is used for parts of the specification that are required for a file or tool to claim compliance with the standards described in this spec. The word "may" is used for parts which are not required. Optional sections may be covered in more detail, as required elements, in a subsequent fragment of this spec.)
(Whenever suggestions are provided, they are not required for conformance with the current spec fragment. These suggestions may be described more fully in later spec fragments, in which they may be required.)
(Expectations around the encoding of true/false values, and other fundamental data-types, are not currently defined.)
csvz-meta-columns
A csvz file can contain a file called columns.csv
Metadata about the contents of the csvz file is contained in a directory called "_meta". The file columns.csv, if present, is inside this directory.
The file columns.csv contains metadata about all of the columns in all of the csv files included in the csvz file.
The file columns.csv meets the following description:
- There is a header row naming the columns in this file
- Each data row describes one column of one csv file
- The columns must include a column called "filename" and a column called "column".
- It is expected that each combination of "filename" and "column" is unique.
- If the combination of "filename" and "column" is not unique, then any metadata about that file may not be correctly interpreted. This may cause difficulties.
- There should be more columns than just the "filename" and "column" columns. Some suggestions:
  - data-type - the type of the column. (Data-types are not described in this spec fragment, and will be covered in later spec fragments.)
  - nullable - a true/false value indicating if the column can be null
  - max-length - a nullable column that describes the maximum length of the column, in cases where the data-type supports a maximum length
  - unique - a true/false value indicating if the values in the column should be unique
  - primary-key - a true/false value indicating if the column can serve as (part or whole of) the primary key of the table.
  - description - a description of the column
  - units - a nullable name or description of the unit of measure
  - ordinal - the order in which the columns have been written to the file. In cases where there is no header row, or where columns are re-ordered, this can be helpful.
  - published - the date the data in the file was first published
  - source - information about the source of the data in the file
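The requirements above lend themselves to a simple check. The sketch below is not part of the spec and `check_columns_csv` is a hypothetical name. It reads the uniqueness expectation as applying to the pair ("filename", "column"), which is an assumption. It reports a missing required column or a repeated pair.

```python
import csv
import io
from collections import Counter


def check_columns_csv(text):
    """Return a list of problems found in the text of a columns.csv file."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    header = reader.fieldnames or []
    missing = [c for c in ("filename", "column") if c not in header]
    if missing:
        return [f"missing required column: {c}" for c in missing]
    # Count how often each (filename, column) pair is described
    pairs = Counter((row["filename"], row["column"]) for row in reader)
    return [
        f"duplicate entry for {filename}:{column}"
        for (filename, column), count in pairs.items()
        if count > 1
    ]
```

An empty list means that no problems were found by these two checks. It says nothing about the optional columns.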
(The word "should" is used for parts of the specification that are not required, but which will lead to difficulty for users of the data or the tools if they are not complied with.)
Metadata about the contents of the csvz file is contained in a directory called "_meta". The file relations.cs