Sarven Capadisli edited this page Feb 15, 2013 · 27 revisions

Before you start

You are reading this because you have some familiarity with SDMX-ML or the RDF Data Cube vocabulary. Some knowledge of Linked Data practices, XML, and XSLT would be handy as well.

What can it do?

Given Generic SDMX-ML data or metadata as input, the XSLT 2.0 templates transform it to RDF/XML. They use vocabularies such as RDF Data Cube, SDMX-RDF, SKOS, XKOS, and PROV-O.

The transformation follows some common Linked Data practices as well as other ones out of thin air :) If you disagree or would like to propose alternatives, please either contact me or better yet, create an issue. Relevant changes will then be reflected here.

Configuration

The scripts/config.rdf file is used to configure the transformations. Here is an outline of the noteworthy settings used by the templates.

Agency identifiers and URIs

agencies.ttl is used to track the mappings for maintenance agencies. It includes the identifier of each maintenance agency (i.e., the SDMX publisher) as found in the SDMX Registry, as well as the base URI for that agency. This file allows references to external agency identifiers to be resolved to their base URIs and used in the transformations. Currently an agency is recognised either as "SDMX" or as an agency that publishes the actual statistics.

In the case of SDMX, a reference to SDMX CodeLists and Codes is typically indicated by the component agency being set to SDMX, e.g., codelistAgency="SDMX" on a structure:Component and/or agencyID="SDMX" on a CodeList with id="CL_FREQ". When this is detected, the corresponding URIs from the SDMX-RDF vocabulary are used, e.g., https://purl.org/linked-data/sdmx/2009/code#freq for metadata and https://purl.org/linked-data/sdmx/2009/code#freq-A for data.

Similarly, an agency might use some other agency's codes. By following the same URI pattern conventions, the agency file is used to find the corresponding base URI in order to make a reference. For example, here is a coded property that's used by the European Central Bank (4F0) to associate a code list that's defined by Eurostat (4D0):

<https://4F0.270a.info/property/OBS_STATUS>
     <https://purl.org/linked-data/cube#codeList> <https://4D0.270a.info/code/CL_OBS_STATUS> .

Naturally, the transformation does not re-define metadata that's from an external agency as the owners of the data would define them under their authority.
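
As a rough illustration of the lookup described above, here is a minimal Python sketch. The dictionary is a hypothetical in-memory stand-in for agencies.ttl, and the function names are made up for this example; the actual templates do this in XSLT.

```python
# Hypothetical stand-in for agencies.ttl: agency identifier -> base URI.
AGENCY_BASE_URIS = {
    "4F0": "https://4F0.270a.info/",  # European Central Bank
    "4D0": "https://4D0.270a.info/",  # Eurostat
}

QB_CODE_LIST = "https://purl.org/linked-data/cube#codeList"


def codelist_uri(agency_id: str, codelist_id: str) -> str:
    """Build a code list URI under the base URI of the agency that owns it."""
    return f"{AGENCY_BASE_URIS[agency_id]}code/{codelist_id}"


def coded_property_triple(user_agency, concept_id, owner_agency, codelist_id):
    """Link a property of one agency to a code list defined by another."""
    subject = f"{AGENCY_BASE_URIS[user_agency]}property/{concept_id}"
    return (subject, QB_CODE_LIST, codelist_uri(owner_agency, codelist_id))


triple = coded_property_triple("4F0", "OBS_STATUS", "4D0", "CL_OBS_STATUS")
print(triple)
```

Only the reference is created here; nothing about the Eurostat code list itself is re-defined.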

URI configurations

Base URIs can be set for classes, codelists, concept schemes, datasets, slices, properties, provenance, as well as for the source SDMX data.

The value for uriThingSeparator e.g., /, lets one set the delimiter to separate the "thing" from the rest of the URI. In the Linked Data community, this is typically either a / or #. For example, if slash is used, a URI would end up like https://example.org/code/CL_GEO (note the last slash before CL_GEO). If hash is used, a URI would end up like https://example.org/code#CL_GEO.

Similarly, uriDimensionSeparator can be set to separate the dimension values that are used in RDF Data Cube observation URIs. As each observation should have its own unique URI, the URI is constructed by taking the dimension values as safe terms to be used in URIs, separated by the value in uriDimensionSeparator. For example, here is a crazy looking observation URI where uriDimensionSeparator is set to /: https://example.org/dataset/DSD_T_PERSON_STATTAB-01-2A01/5938/1/15497/4/21/1/2011/2011-12-31. But with uriThingSeparator set to # and uriDimensionSeparator set to -, it could end up like https://example.org/dataset/DSD_T_PERSON_STATTAB-01-2A01#5938-1-15497-4-21-1-2011-2011-12-31. If you are wondering about DSD_T_PERSON_STATTAB-01-2A01, that's the KeyFamily (DSD) id, and https://example.org/dataset/ would be the value that can be set in config for the base URI for dataset.
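
The effect of the two separators can be sketched in a few lines of Python. This is only an illustration of the pattern described above, not the XSLT that the templates use; the function name is hypothetical, and percent-encoding the values with urllib.parse.quote is an assumption about what "safe terms" means.

```python
from urllib.parse import quote


def observation_uri(dataset_base, key_family_id, dimension_values,
                    thing_separator="/", dimension_separator="/"):
    """Join the dimension values into one observation URI."""
    # Assumption: each value is percent-encoded so it is safe inside a URI.
    safe_values = [quote(value, safe="") for value in dimension_values]
    key = dimension_separator.join(safe_values)
    return f"{dataset_base}{key_family_id}{thing_separator}{key}"


dimensions = ["5938", "1", "15497", "4", "21", "1", "2011", "2011-12-31"]
base = "https://example.org/dataset/"
dsd = "DSD_T_PERSON_STATTAB-01-2A01"

print(observation_uri(base, dsd, dimensions))
print(observation_uri(base, dsd, dimensions, "#", "-"))
```

The first call reproduces the slash-separated URI and the second the hash and hyphen variant from the paragraph above.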

The creator's URI can also be set; it is also used for provenance data.

Default to language

A default xml:lang can be forced on skos:prefLabel and skos:definition when the language is not present in the source data. If config.rdf contains a non-empty lang value, it is used, e.g.:

<rdf:Description>
    <rdf:value>en</rdf:value>
    <rdfs:label>lang</rdfs:label>
</rdf:Description>

Default language may also be applied in the case of Annotations. See Interlinking SDMX Annotations for example.
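
To make the fallback concrete, here is a small Python sketch that reads the lang value from a config fragment and applies it only when the source has no language. The wrapping rdf:RDF element, the function names, and the sample label are assumptions made for the example; the real lookup happens inside the XSLT templates.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Standard RDF and RDFS namespace URIs.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Assumed shape of a config fragment, wrapped in rdf:RDF so it parses on its own.
SAMPLE_CONFIG = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}">
  <rdf:Description>
    <rdf:value>en</rdf:value>
    <rdfs:label>lang</rdfs:label>
  </rdf:Description>
</rdf:RDF>"""


def config_value(config_xml, label):
    """Return the non-empty rdf:value of the entry with this rdfs:label, else None."""
    root = ET.fromstring(config_xml)
    for description in root.iter(f"{{{RDF}}}Description"):
        if description.findtext(f"{{{RDFS}}}label") == label:
            value = (description.findtext(f"{{{RDF}}}value") or "").strip()
            return value or None
    return None


def pref_label(text, source_lang, default_lang):
    """Serialise a skos:prefLabel, preferring the language found in the source."""
    lang = source_lang or default_lang
    attribute = f' xml:lang="{lang}"' if lang else ""
    return f"<skos:prefLabel{attribute}>{escape(text)}</skos:prefLabel>"


default_lang = config_value(SAMPLE_CONFIG, "lang")
print(pref_label("Aeugst am Albis", None, default_lang))
```

A label that already carries a language in the source keeps it; the default is only a fallback.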

Interlinking SDMX Annotations

SDMX Annotations contain important information that can be put to use by the publisher. Data in AnnotationTypes are typically used as publisher's internal conventions. Hence, there is no standardization on how they are used across all SDMX publishers. In order not to leave this information behind in the final transformation, the configuration allows publishers to define the way they should be transformed. This is done by setting interlinkAnnotationTypes: the AnnotationType to detect (in rdfs:label), the predicate (as an XML QName) to use (in rdf:predicate), and the instances of Concepts or Codes to apply to (in rdf:type). Currently this feature is only applied to Annotations in Concepts and Codes. For example, given the following SDMX snippet:

<structure:CodeList id="CL_HGDE_GDE" agencyID="CH1_RN">
  <structure:Code value="13256">
    <structure:Description>Aeugst am Albis</structure:Description>
    <structure:Annotations>
      <common:Annotation>
        <common:AnnotationType>CODE_OFS</common:AnnotationType>
        <common:AnnotationText>1</common:AnnotationText>
      </common:Annotation>
      <common:Annotation>
        <common:AnnotationType>ABBREV</common:AnnotationType>
        <common:AnnotationText>A.a.A.</common:AnnotationText>
      </common:Annotation>
      <common:Annotation>
        <common:AnnotationType>REC_TYPE</common:AnnotationType>
        <common:AnnotationTitle>11</common:AnnotationTitle>
      </common:Annotation>
    </structure:Annotations>
  </structure:Code>
</structure:CodeList>

and the following configuration in config.rdf:

<rdf:value>
  <rdf:Description>
    <rdf:value>https://example.org/property/</rdf:value>
    <rdfs:label>property</rdfs:label>
  </rdf:Description>
</rdf:value>

<rdf:value>
  <rdf:Description>
    <rdf:value>
      <rdf:Description>
        <rdf:predicate>xkos:hasPart</rdf:predicate>
        <rdf:type>CODE_OFS</rdf:type>
        <rdfs:label>AnnotationText</rdfs:label>
        <rdfs:range>https://example.org/code/CL_HGDE_GDE</rdfs:range>
      </rdf:Description>
    </rdf:value>
    <rdf:value>
      <rdf:Description>
        <rdf:predicate>skos:altLabel</rdf:predicate>
        <rdf:type>ABBREV</rdf:type>
        <rdfs:label>AnnotationText</rdfs:label>
        <rdfs:range>Literal</rdfs:range>
      </rdf:Description>
    </rdf:value>
    <rdf:value>
      <rdf:Description>
        <rdf:predicate>property:GDE_GARTE</rdf:predicate>
        <rdf:type>REC_TYPE</rdf:type>
        <rdfs:label>AnnotationTitle</rdfs:label>
        <rdfs:range>https://example.org/code/CL_HGDE_MODALITY</rdfs:range>
      </rdf:Description>
    </rdf:value>
    <rdfs:label>interlinkAnnotationTypes</rdfs:label>
  </rdf:Description>
</rdf:value>

would result in the final RDF/XML transformation like:

<rdf:Description rdf:about="https://example.org/code/CL_HGDE_GDE/13256">
  <xkos:hasPart rdf:resource="https://example.org/code/CL_HGDE_GDE/1"/>
  <skos:altLabel>A.a.A.</skos:altLabel>
  <property:GDE_GARTE rdf:resource="https://example.org/code/CL_HGDE_MODALITY/11"/>
</rdf:Description>

Only the AnnotationTypes with a corresponding configuration will be applied, and unspecified ones will be skipped.

If the default language had been set, the output would have contained xml:lang="{$lang}".
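
The matching logic can be sketched in Python as follows. This is my reading of the example configuration above, not the actual template code: each rule is keyed by the AnnotationType it detects and carries the predicate, the annotation element to read the value from, and either a code list base URI or Literal as the range. The data structures and the extra NOTE annotation are hypothetical.

```python
# Rules mirroring the example configuration above:
# AnnotationType -> (predicate, element holding the value, range).
RULES = {
    "CODE_OFS": ("xkos:hasPart", "AnnotationText",
                 "https://example.org/code/CL_HGDE_GDE"),
    "ABBREV": ("skos:altLabel", "AnnotationText", "Literal"),
    "REC_TYPE": ("property:GDE_GARTE", "AnnotationTitle",
                 "https://example.org/code/CL_HGDE_MODALITY"),
}


def interlink(annotations, rules, thing_separator="/"):
    """Turn configured annotations into (predicate, kind, value) statements."""
    statements = []
    for annotation in annotations:
        rule = rules.get(annotation["AnnotationType"])
        if rule is None:
            continue  # AnnotationTypes without a configuration are skipped
        predicate, element, value_range = rule
        value = annotation.get(element)
        if value is None:
            continue  # the configured element is absent from this annotation
        if value_range == "Literal":
            statements.append((predicate, "literal", value))
        else:
            uri = f"{value_range}{thing_separator}{value}"
            statements.append((predicate, "resource", uri))
    return statements


annotations = [
    {"AnnotationType": "CODE_OFS", "AnnotationText": "1"},
    {"AnnotationType": "ABBREV", "AnnotationText": "A.a.A."},
    {"AnnotationType": "REC_TYPE", "AnnotationTitle": "11"},
    {"AnnotationType": "NOTE", "AnnotationText": "not configured"},
]
for statement in interlink(annotations, RULES):
    print(statement)
```

The statements would be attached to the code https://example.org/code/CL_HGDE_GDE/13256; the NOTE annotation produces nothing because it has no rule.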

Omitting components

There are cases in which certain parts of the data contain errors. omitComponents is a configuration option to explicitly skip over those parts until the data is fixed at source, without giving up on the rest of the data at hand and without making any significant assumptions about, or changes to, the remaining data. For example, if the Attribute values in a DataSet don't correspond to coded values (e.g., because they contain whitespace), they can be skipped without damaging the rest of the data. This obviously gives up some precision in favour of still making use of the data. The configuration looks like this (in Turtle):

[ rdfs:label "omitComponents" ;
    rdf:value [ rdf:type "structure:Attribute" ;
                rdf:value "UNIT"
    ]
]

See also issue #30.
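
In effect, the option acts as a filter over the components of each observation. Here is a minimal Python sketch of that idea; the dictionary-based representation of components and the sample values are hypothetical, and only the structure:Attribute / UNIT pair comes from the configuration above.

```python
# (component type, component id) pairs to skip, as in the configuration above.
OMIT_COMPONENTS = {("structure:Attribute", "UNIT")}


def keep_components(components, omit=OMIT_COMPONENTS):
    """Drop the configured components and leave everything else untouched."""
    return [
        component for component in components
        if (component["type"], component["id"]) not in omit
    ]


# Hypothetical components of one observation.
observation = [
    {"type": "structure:Dimension", "id": "FREQ", "value": "A"},
    {"type": "structure:Attribute", "id": "UNIT", "value": "per cent "},
    {"type": "structure:Attribute", "id": "OBS_STATUS", "value": "A"},
]
print(keep_components(observation))
```

Only the UNIT attribute is dropped; the dimension and the other attribute pass through unchanged.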

Vocabularies

Besides the common vocabularies (RDF, RDFS, OWL, XSD), the RDF Data Cube vocabulary is used to describe multi-dimensional statistical data, and SDMX-RDF for the statistical information model. PROV-O is used for provenance coverage. And of course SKOS and XKOS to cover concepts, concept schemes and their relationships to one another. XKOS is currently applied primarily for hierarchical lists here (I hope I understood the vocabulary correctly).

Provenance

There is provenance level data:

Resources of type qb:DataStructureDefinition, qb:DataSet, skos:ConceptScheme are also typed with prov:Entity, and given prov:wasAttributedTo with the value from creator (which is typed with prov:Agent) in config.rdf.

There is a unique prov:Activity for each transformation. It has a dcterms:title and contains values for prov:startedAtTime, prov:wasAssociatedWith (the creator), prov:used (i.e., the source XML and the XSL used to transform it), and prov:generated (the output, which in turn prov:wasDerivedFrom the source data URI). It also declares the licensing information (taken from config.rdf) using dcterms:license.

A provenance document may be provided to the transformer. This XML document would contain prov:Activity information which indicates the location of the XML document on the local filesystem which would later be transformed. It contains other provenance data like when it was retrieved, with what tools, and so on.

If that provenance document is provided to the transformer, the provenance template looks into that XML to see if there is provenance information about the XML that it is transforming. If there is, it makes a link between the current provenance activity (which is the transformation) and the earlier provenance activity (which is the retrieval) using prov:wasInformedBy.
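
The statements described in this section can be sketched as plain triples in Python. This is an illustration only: the function, its parameters, and all the example URIs and the timestamp are hypothetical, and the direction of prov:wasDerivedFrom (output derived from source data) is my reading of the description above.

```python
def transformation_provenance(activity, creator, source_xml, xsl, output,
                              source_data_uri, started_at,
                              retrieval_activity=None):
    """List the provenance triples for one transformation run."""
    triples = [
        (activity, "rdf:type", "prov:Activity"),
        (activity, "prov:startedAtTime", started_at),
        (activity, "prov:wasAssociatedWith", creator),
        (activity, "prov:used", source_xml),
        (activity, "prov:used", xsl),
        (activity, "prov:generated", output),
        (output, "prov:wasDerivedFrom", source_data_uri),
    ]
    if retrieval_activity is not None:
        # Link the transformation to the earlier retrieval activity.
        triples.append((activity, "prov:wasInformedBy", retrieval_activity))
    return triples


# All values below are made up for the example.
triples = transformation_provenance(
    activity="https://example.org/provenance/activity/transform-1",
    creator="https://example.org/creator#i",
    source_xml="generic.data.xml",
    xsl="generic.xsl",
    output="https://example.org/dataset/EXAMPLE",
    source_data_uri="https://example.org/source/generic.data.xml",
    started_at="2013-02-15T00:00:00Z",
    retrieval_activity="https://example.org/provenance/activity/retrieve-1",
)
for triple in triples:
    print(triple)
```

Without a provenance document, retrieval_activity stays None and no prov:wasInformedBy link is produced.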

Versions

As SDMX data publishers version their classifications and in turn the Cubes that are generated refer to particular versions of those classifications, versions need to be explicitly part of URIs in order to uniquely identify classifications. Although this goes against the general recommendation out there for not including the version in the URI, it is a good exception here. Otherwise, how would creating new terms for URIs without the version information be any different? For some background, see also #31.

URI Patterns

Here is an outline of the URI patterns that are used. example.org is used as an example domain (see also: Agency identifiers and URIs), followed by class, code, concept, dataset, property, provenance, or slice as example path segments (i.e., they can be changed in the config). /s are used to separate the things and dimensions in URIs, which can also be changed in the config. Variable values are derived directly from the source SDMX. Some skos:ConceptSchemes have uriValidFromToSeparator, which is generated by combining date validity information when both validFrom and validTo are provided.

qb:DataStructureDefinition

https://example.org/dataset/{$KeyFamilyRef}/structure

qb:Observation

https://example.org/dataset/{$KeyFamilyRef}/{dimension-1}/../{dimension-n}

qb:Slice

https://example.org/slice/{$KeyFamilyRef}/{dimension-1}/../{dimension-n-excluding-FREQ-concept}

skos:Collection

https://example.org/code/{$version}/{$hierarchicalCodeListID}
https://example.org/code/{$version}/{$hierarchyID}

sdmx:CodeList

https://example.org/code/{$version}/{$codeListID}

skos:ConceptScheme

https://example.org/concept/{$version}/{$conceptSchemeID}

skos:Concept , sdmx:Concept

https://example.org/code/{$version}/{$codeListID}/{@codeID}
https://example.org/concept/{$version}/{$conceptSchemeID}/{@conceptID}

owl:Class and rdfs:Class

https://example.org/class/{$version}/{$codeListID}

rdf:Property , qb:DimensionProperty , qb:MeasureProperty , qb:AttributeProperty

https://example.org/property/{$conceptID}
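
To show how the version segment keeps classifications apart, here is a tiny Python sketch of the versioned patterns above. The helper name and the sample version numbers and identifiers are hypothetical.

```python
def versioned_uri(base, version, *identifiers, separator="/"):
    """Fill a pattern such as {base}{version}/{codeListID}/{codeID}."""
    return base + separator.join([version, *identifiers])


# Hypothetical code list and code identifiers in two versions.
old = versioned_uri("https://example.org/code/", "1.0", "CL_GEO", "CH")
new = versioned_uri("https://example.org/code/", "2.0", "CL_GEO", "CH")
print(old)
print(new)
```

The same code in two versions of a code list ends up with two distinct URIs, which is the point made in the Versions section.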

Properties

Properties used in structure (DSD, codelists, ..) and data (observations) are listed below:

Structure

https://example.org/property/{$conceptID}
https://purl.org/dc/terms/identifier
https://purl.org/dc/terms/references
https://purl.org/linked-data/cube#attribute
https://purl.org/linked-data/cube#codeList
https://purl.org/linked-data/cube#component
https://purl.org/linked-data/cube#componentAttachment
https://purl.org/linked-data/cube#componentProperty
https://purl.org/linked-data/cube#concept
https://purl.org/linked-data/cube#dimension
https://purl.org/linked-data/cube#measure
https://purl.org/linked-data/cube#order
https://purl.org/linked-data/cube#sliceKey
https://purl.org/linked-data/sdmx/2009/concept#dataRev
https://purl.org/linked-data/sdmx/2009/concept#dsi
https://purl.org/linked-data/sdmx/2009/concept#mAgency
https://purl.org/linked-data/sdmx/2009/concept#validFrom
https://purl.org/linked-data/sdmx/2009/concept#validTo
https://purl.org/linked-data/xkos#hasPart
https://purl.org/linked-data/xkos#isPartOf
https://www.w3.org/1999/02/22-rdf-syntax-ns#type
https://www.w3.org/2000/01/rdf-schema#comment
https://www.w3.org/2000/01/rdf-schema#range
https://www.w3.org/2000/01/rdf-schema#seeAlso
https://www.w3.org/2000/01/rdf-schema#subClassOf
https://www.w3.org/2004/02/skos/core#definition
https://www.w3.org/2004/02/skos/core#hasTopConcept
https://www.w3.org/2004/02/skos/core#inScheme
https://www.w3.org/2004/02/skos/core#member
https://www.w3.org/2004/02/skos/core#notation
https://www.w3.org/2004/02/skos/core#prefLabel
https://www.w3.org/2004/02/skos/core#topConceptOf
https://www.w3.org/ns/prov#generated
https://www.w3.org/ns/prov#startedAtTime
https://www.w3.org/ns/prov#used
https://www.w3.org/ns/prov#wasAssociatedWith
https://www.w3.org/ns/prov#wasAttributedTo
https://www.w3.org/ns/prov#wasDerivedFrom

Data

https://example.org/property/{$conceptID}
https://purl.org/linked-data/cube#dataSet
https://purl.org/linked-data/cube#observation
https://purl.org/linked-data/cube#slice
https://purl.org/linked-data/cube#sliceStructure
https://purl.org/linked-data/cube#structure
https://www.w3.org/1999/02/22-rdf-syntax-ns#type
https://www.w3.org/ns/prov#generated
https://www.w3.org/ns/prov#startedAtTime
https://www.w3.org/ns/prov#used
https://www.w3.org/ns/prov#wasAssociatedWith
https://www.w3.org/ns/prov#wasAttributedTo
https://www.w3.org/ns/prov#wasDerivedFrom

Types of resources

Types of resources in the structure (DSD, codelists, ..) and data (observations) are listed below:

Structure

https://example.org/class/{$version}/{$codeListID}
https://purl.org/linked-data/cube#AttributeProperty
https://purl.org/linked-data/cube#ComponentSpecification
https://purl.org/linked-data/cube#DataStructureDefinition
https://purl.org/linked-data/cube#DimensionProperty
https://purl.org/linked-data/cube#MeasureProperty
https://purl.org/linked-data/sdmx#CodeList
https://purl.org/linked-data/sdmx#Concept
https://purl.org/linked-data/sdmx#DataStructureDefinition
https://www.w3.org/1999/02/22-rdf-syntax-ns#Property
https://www.w3.org/2000/01/rdf-schema#Class
https://www.w3.org/2002/07/owl#Class
https://www.w3.org/2004/02/skos/core#Collection
https://www.w3.org/2004/02/skos/core#Concept
https://www.w3.org/2004/02/skos/core#ConceptScheme
https://www.w3.org/ns/prov#Activity
https://www.w3.org/ns/prov#Agent
https://www.w3.org/ns/prov#Entity

Data

https://purl.org/linked-data/cube#DataSet
https://purl.org/linked-data/cube#Observation
https://www.w3.org/ns/prov#Activity
https://www.w3.org/ns/prov#Agent
https://www.w3.org/ns/prov#Entity

Datatypes

Some of the XSD datatypes are applied to object resources based on SDMX structure:TextFormat/@textType. See also issues #3 and #9, and the coverage below.
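
As a sketch of what such a mapping could look like, here is a small Python example. The table below is an assumption for illustration; which textType values the templates actually map, and to which XSD datatypes, is not spelled out here.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Assumed mapping from SDMX textType values to XSD datatype local names.
TEXT_TYPE_TO_XSD = {
    "String": "string",
    "Double": "double",
    "Integer": "integer",
    "Boolean": "boolean",
    "DateTime": "dateTime",
}


def typed_value(value, text_type):
    """Pair a value with its XSD datatype URI, or None when no mapping is known."""
    local_name = TEXT_TYPE_TO_XSD.get(text_type)
    datatype = f"{XSD}{local_name}" if local_name else None
    return (value, datatype)


print(typed_value("3.14", "Double"))
print(typed_value("n/a", "SomethingElse"))
```

Values with an unknown or missing textType are left untyped rather than guessed.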

How to run:

  1. Edit scripts/config.rdf to configure things like base URIs, delimiters to use in URIs, or even how to put SDMX AnnotationTypes into good use. If you don't edit, it will work with defaults (e.g., example.org, /).

  2. Either use the provided scripts/generic.sh to transform generic SDMX-ML in data/ to RDF/XML, or run the templates on your own data with an XSLT 2.0 processor, with commands along the lines of the ones below (using the Debian saxonb-xslt as an example).

The following takes the metadata from generic.structure.xml and uses the scripts/generic.xsl template to create the corresponding RDF/XML in generic.structure.rdf. The xmlDocument parameter tells the processor which file is being transformed (it is also used for provenance data); just reuse the same value as the input XML given in -s. The pathToGenericStructure parameter value is the same as xmlDocument in this case because we are transforming the SDMX KeyFamily / DSD itself:

saxonb-xslt -s generic.structure.xml -xsl generic.xsl xmlDocument=generic.structure.xml pathToGenericStructure=generic.structure.xml > generic.structure.rdf

Similar to above, but this time we are going to use the generic.structure.xml for the generic data. The following generates the RDF/XML generic.data.rdf from generic.data.xml by making use of the generic structure data in generic.structure.xml with parameter pathToGenericStructure:

saxonb-xslt -t -tree:linked -s generic.data.xml -xsl generic.xsl xmlDocument=generic.data.xml pathToGenericStructure=generic.structure.xml > generic.data.rdf

-tree:linked in saxonb-xslt helps with large files, as does giving more memory to the processor.

Optionally, pathToProvDocument (for extra provenance information) and pathToConfig (to use a custom config, otherwise default config.rdf is used) parameters can be passed in.
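
If you prefer to drive the transformation from a script, the commands above can be assembled in Python. This is a minimal sketch using only the options and parameters mentioned in this section; the function names are hypothetical, and it assumes that xmlDocument always reuses the -s value, as described above.

```python
import shlex
import subprocess


def saxon_command(source, structure, stylesheet="generic.xsl",
                  prov_document=None, config=None, large_input=False):
    """Assemble the saxonb-xslt argument list for one transformation."""
    command = ["saxonb-xslt"]
    if large_input:
        command += ["-t", "-tree:linked"]  # options used above for the data file
    command += [
        "-s", source,
        "-xsl", stylesheet,
        f"xmlDocument={source}",  # same value as the -s input
        f"pathToGenericStructure={structure}",
    ]
    if prov_document:
        command.append(f"pathToProvDocument={prov_document}")
    if config:
        command.append(f"pathToConfig={config}")
    return command


def transform(command, output_path):
    """Run the command and write the resulting RDF/XML to output_path."""
    with open(output_path, "wb") as output:
        subprocess.run(command, stdout=output, check=True)


structure_cmd = saxon_command("generic.structure.xml", "generic.structure.xml")
data_cmd = saxon_command("generic.data.xml", "generic.structure.xml",
                         large_input=True)
print(shlex.join(structure_cmd))
print(shlex.join(data_cmd))
```

Calling transform(structure_cmd, "generic.structure.rdf") would then run the first command, provided saxonb-xslt is installed and the files exist.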

Coverage

The following is a coverage (in progress) based on sample data.

BIS OECD UN ECB WB IMF FAO EUROSTAT BFS
"External agencies" refers to agencies in which the SDMX publisher is using an external agency's concepts, codelists etc.
External Agencies SDMX EUROSTAT IAEG SDMX OECD
Annotation(Type) Y Y Y
Hierarchical CodeLists Y Y Y Y Y Y
Datatype (OBS_VALUE) String Double Double Double
Datatype (TIME_FORMAT) String String
Datatype (TIME_PERIOD) String
Datatype (OBS_STATUS) String String
Datatype (OBS_CONF) String
SDMX Version 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0