Create ISO19115-2 reader in MDTranslator #4862
Comments
I did some quick analysis: I basically copied the iso19115-3 writer over the iso19115-2 writer and analyzed the git changes. You can see this here. Some high-level notes:
Happy to discuss any of the nuances with the @GSA/data-gov-dev-team at office hours. |
the intention is to work downward from |
a simple test I ran locally just to confirm what we already know. we currently only use the iso19139ngdc validator for spatial types. this test uses that schema.
from lxml import etree # not standard library
# I download the file but you get the idea
xml_doc = "https://github.com/GSA/mdTranslator/blob/datagov/test/readers/iso19115_2/testData/wip_iso19115-2_datagov_harvest_altered_source.xml"
xml_doc = etree.parse(xml_doc)
xsd_doc = "https://github.com/ckan/ckanext-spatial/blob/master/ckanext/spatial/validation/xml/iso19139ngdc/schema/gmi/gmi.xsd"
xsd_doc = etree.parse(xsd_doc)
xmlschema = etree.XMLSchema(xsd_doc)
xmlschema.validate(xml_doc) # returns True
from lxml import etree # not standard library
# I download the file but you get the idea
xml_doc = "https://github.com/GSA/mdTranslator/blob/datagov/test/readers/iso19115_2/testData/wip_iso19115-2_datagov_harvest_altered_source.xml"
xml_doc = etree.parse(xml_doc)
xsd_doc = "https://github.com/ISO-TC211/XML/blob/master/schemas.isotc211.org/19115/-2/gmi/1.0/gmi.xsd"
xsd_doc = etree.parse(xsd_doc)
xmlschema = etree.XMLSchema(xsd_doc)
# ^ this fails and throws the following error. after removing the problem element it works.
"""
>>> xmlschema = etree.XMLSchema(xsd_doc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/xmlschema.pxi", line 89, in lxml.etree.XMLSchema.__init__
lxml.etree.XMLSchemaParseError: element decl. '{https://standards.iso.org/iso/19115/-2/gmi/1.0}resultFile', attribute 'type': The QName value '{https://www.isotc211.org/2005/gmx}MX_DataFile_PropertyType' does not resolve to a(n) type definition., line 153
"""
xmlschema.validate(xml_doc) # returns False
notes
|
This is good research. Thanks for doing this, Reid! |
just got confirmation from chris that the child inherits the requirement from its parent. this impacts the first condition often implemented in the beginning of each |
gml xsd containing information on
the important question is which required elements can be "substituted" with an acceptable |
found a python library which can convert xsd trees into plantuml files by running:
<xs:complexType name="CI_Citation_Type">
  <xs:annotation>
    <xs:documentation>Standardized resource reference</xs:documentation>
  </xs:annotation>
  <xs:complexContent>
    <xs:extension base="gco:AbstractObject_Type">
      <xs:sequence>
        <xs:element name="title" type="gco:CharacterString_PropertyType"/>
        <xs:element name="alternateTitle" type="gco:CharacterString_PropertyType" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="date" type="gmd:CI_Date_PropertyType" maxOccurs="unbounded"/>
        <xs:element name="edition" type="gco:CharacterString_PropertyType" minOccurs="0"/>
        <xs:element name="editionDate" type="gco:Date_PropertyType" minOccurs="0"/>
        <xs:element name="identifier" type="gmd:MD_Identifier_PropertyType" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="citedResponsibleParty" type="gmd:CI_ResponsibleParty_PropertyType" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="presentationForm" type="gmd:CI_PresentationFormCode_PropertyType" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="series" type="gmd:CI_Series_PropertyType" minOccurs="0"/>
        <xs:element name="otherCitationDetails" type="gco:CharacterString_PropertyType" minOccurs="0"/>
        <xs:element name="collectiveTitle" type="gco:CharacterString_PropertyType" minOccurs="0"/>
        <xs:element name="ISBN" type="gco:CharacterString_PropertyType" minOccurs="0"/>
        <xs:element name="ISSN" type="gco:CharacterString_PropertyType" minOccurs="0"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>
to this...
doesn't look like it pulls in whether an element is optional/required, but it does include whether it's unbounded (i.e.
it also appears to resolve the properties/attributes of elements as well... so according to this
<gmd:CI_Citation gco:nilReason="inapplicable" gco:title="some title" gco:uuidref="asoidhoads1092u3e0hasodh"/>
@dataclass
class CiCitationType(AbstractObjectType):
    """
    Standardized resource reference.
    """
    class Meta:
        name = "CI_Citation_Type"
    title: Optional[CharacterStringPropertyType] = field(
        default=None,
        metadata={
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
            "required": True,
        },
    )
    alternate_title: List[CharacterStringPropertyType] = field(
        default_factory=list,
        metadata={
            "name": "alternateTitle",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    date: List[CiDatePropertyType] = field(
        default_factory=list,
        metadata={
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
            "min_occurs": 1,
        },
    )
    edition: Optional[CharacterStringPropertyType] = field(
        default=None,
        metadata={
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    edition_date: Optional[DatePropertyType] = field(
        default=None,
        metadata={
            "name": "editionDate",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    identifier: List[MdIdentifierPropertyType] = field(
        default_factory=list,
        metadata={
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    cited_responsible_party: List[CiResponsiblePartyPropertyType] = field(
        default_factory=list,
        metadata={
            "name": "citedResponsibleParty",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    presentation_form: List[CiPresentationFormCodePropertyType] = field(
        default_factory=list,
        metadata={
            "name": "presentationForm",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    series: Optional[CiSeriesPropertyType] = field(
        default=None,
        metadata={
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    other_citation_details: Optional[CharacterStringPropertyType] = field(
        default=None,
        metadata={
            "name": "otherCitationDetails",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    collective_title: Optional[CharacterStringPropertyType] = field(
        default=None,
        metadata={
            "name": "collectiveTitle",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    isbn: Optional[CharacterStringPropertyType] = field(
        default=None,
        metadata={
            "name": "ISBN",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    issn: Optional[CharacterStringPropertyType] = field(
        default=None,
        metadata={
            "name": "ISSN",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
|
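for reference, the optional/required and unbounded information discussed above lives in the minOccurs and maxOccurs attributes of each xs:element, and both default to 1 when omitted. Below is a minimal stdlib-only sketch of reading those attributes. The helper name occurrence_rules and the trimmed three-element fragment are illustrative assumptions, not part of mdTranslator or of the tools mentioned in this thread, and it ignores occurrence constraints on enclosing sequence/choice groups and anything inherited from a base type.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Trimmed copy of the CI_Citation_Type sequence quoted earlier (three elements only).
XSD_FRAGMENT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="CI_Citation_Type">
    <xs:sequence>
      <xs:element name="title" type="gco:CharacterString_PropertyType"/>
      <xs:element name="alternateTitle" type="gco:CharacterString_PropertyType"
                  minOccurs="0" maxOccurs="unbounded"/>
      <xs:element name="date" type="gmd:CI_Date_PropertyType" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""


def occurrence_rules(xsd_text):
    """Map each named xs:element to its required/repeatable flags (hypothetical helper)."""
    root = ET.fromstring(xsd_text)
    rules = {}
    for element in root.iter(f"{{{XS}}}element"):
        name = element.get("name")
        if name is None:
            continue  # skip ref="..." declarations
        # XSD defaults: minOccurs=1 and maxOccurs=1 when the attribute is absent.
        min_occurs = int(element.get("minOccurs", "1"))
        max_occurs = element.get("maxOccurs", "1")
        rules[name] = {
            "required": min_occurs >= 1,
            "repeatable": max_occurs == "unbounded" or int(max_occurs) > 1,
        }
    return rules


rules = occurrence_rules(XSD_FRAGMENT)
for name, rule in rules.items():
    print(name, rule)
```

this lines up with the generated dataclass above: title is required and single-valued, alternateTitle is an optional list, and date is a list with a minimum of one entry.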
there's a cool utility
unfortunately, this fails and produces a parsing error.
|
improvements based on testing against production documents using the code in this pr
|
it's common for xsd elements to be substitutable. here's a simple example
<xs:element name="Anchor" type="gmx:Anchor_Type" substitutionGroup="gco:CharacterString"/>
this means
<!-- with substitution -->
<gmd:keyword>
  <gmx:Anchor xlink:href="https://www.ncei.noaa.gov/archive/accession/0000003" xlink:actuate="onRequest">0000003</gmx:Anchor>
</gmd:keyword>
<!-- without substitution -->
<gmd:keyword>
  <gco:CharacterString>0000003</gco:CharacterString>
</gmd:keyword>
thankfully with leaf nodes like these i can simply update the xpath with an or operator, as mentioned in my comment above |
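a rough illustration of the "accept either leaf" idea, written in Python with only the standard library rather than in the reader's own code. In a full XPath 1.0 engine this would be a union such as gco:CharacterString | gmx:Anchor; ElementTree's XPath subset has no union operator, so the sketch tries each candidate tag in turn. The function name keyword_texts, the wrapper root element, the second keyword value, and the gco namespace URI (patterned on the gmd/gmx URIs quoted in this thread) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "gmd": "https://www.isotc211.org/2005/gmd",
    "gco": "https://www.isotc211.org/2005/gco",  # assumed, follows the gmd/gmx pattern
    "gmx": "https://www.isotc211.org/2005/gmx",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical document holding one keyword of each form.
SAMPLE = f"""
<root xmlns:gmd="{NS['gmd']}" xmlns:gco="{NS['gco']}"
      xmlns:gmx="{NS['gmx']}" xmlns:xlink="{NS['xlink']}">
  <gmd:keyword>
    <gmx:Anchor xlink:href="https://www.ncei.noaa.gov/archive/accession/0000003">0000003</gmx:Anchor>
  </gmd:keyword>
  <gmd:keyword>
    <gco:CharacterString>0000004</gco:CharacterString>
  </gmd:keyword>
</root>
"""

# Leaf elements accepted as the text holder, in order of preference.
SUBSTITUTES = ("gco:CharacterString", "gmx:Anchor")


def keyword_texts(xml_text):
    """Return the text of every gmd:keyword, whichever substitute carries it."""
    root = ET.fromstring(xml_text)
    texts = []
    for keyword in root.iterfind(".//gmd:keyword", NS):
        for tag in SUBSTITUTES:
            node = keyword.find(tag, NS)
            if node is not None and node.text:
                texts.append(node.text.strip())
                break  # first matching substitute wins
    return texts


print(keyword_texts(SAMPLE))
```

this only covers leaf nodes; substitutions that change the structure underneath an element would need more than a wider path expression.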
doing a test run on https://data.noaa.gov/waf/NOAA/NESDIS/ncei/accessions/iso/xml/. the first validation failure occurred on dataset 5407. the issue involved an incorrectly processed attribute of an |
two takeaways from meeting with liam:
|
here's the schema detection function. i figured the logic would be something like this so it's nice to have some validation.
function identifyXMLFormat (options, metadata) {
  const { DOMParser, xpath } = options.deps;
  const XMLNS = {
    GMD: 'https://www.isotc211.org/2005/gmd',
    GMI: 'https://www.isotc211.org/2005/gmi',
    // GMI: 'https://standards.iso.org/iso/19115/-2/gmi/1.0',
    MDB1: 'https://standards.iso.org/iso/19115/-3/mdb/1.0',
    MDB2: 'https://standards.iso.org/iso/19115/-3/mdb/2.0'
  };
  const select = xpath.useNamespaces(XMLNS);
  const parser = new DOMParser();
  const doc = parser.parseFromString(metadata.toString(), 'application/xml');
  const rootNamespace = doc.documentElement.namespaceURI;
  switch (rootNamespace) {
    case 'https://www.mozilla.org/newlayout/xml/parsererror.xml':
      throw new Error(doc.documentElement.textContent);
    case XMLNS.GMD:
      return 'iso_19139';
    case XMLNS.GMI:
      return 'iso_19115-2';
    case XMLNS.MDB1:
      return 'iso_19115-3:1.0';
    case XMLNS.MDB2:
      return 'iso_19115-3:2.0';
    case null:
      // catch esri metadata
      const esri = select('//esri', doc);
      if (esri.length > 0) {
        throw new Error(
          `Esri metadata has been identified. Metadata cannot be processed. Exiting.`
        );
      }
      const root = select('//metadata', doc);
      if (root && root[0]) {
        return 'csdgm';
      }
      // Fall through for error handling
    default:
      throw new Error(
        `Unknown XML metadata namespace ${doc.documentElement.namespaceURI}`
      );
  }
}
|
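the same root-namespace check can be sketched in Python with the standard library, in case it helps when testing documents outside the translator. This is an approximation of the function above under stated assumptions, not a port: it copies the namespace table as quoted, leaves out the Esri and CSDGM branch for documents without a namespace, and the names FORMATS, root_namespace, and identify_xml_format are made up for the example.

```python
import xml.etree.ElementTree as ET

# Root namespace -> format label, copied from the function quoted above.
FORMATS = {
    "https://www.isotc211.org/2005/gmd": "iso_19139",
    "https://www.isotc211.org/2005/gmi": "iso_19115-2",
    "https://standards.iso.org/iso/19115/-3/mdb/1.0": "iso_19115-3:1.0",
    "https://standards.iso.org/iso/19115/-3/mdb/2.0": "iso_19115-3:2.0",
}


def root_namespace(xml_text):
    """Return the namespace URI of the root element, or None if it has none."""
    root = ET.fromstring(xml_text)
    # ElementTree spells qualified tags as "{namespace}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None


def identify_xml_format(xml_text):
    """Classify a metadata document by its root namespace (hypothetical helper)."""
    namespace = root_namespace(xml_text)
    if namespace is None:
        # The original also probes for Esri and CSDGM here; omitted in this sketch.
        raise ValueError("root element has no namespace")
    try:
        return FORMATS[namespace]
    except KeyError:
        raise ValueError(f"Unknown XML metadata namespace {namespace}") from None


doc = '<gmi:MI_Metadata xmlns:gmi="https://www.isotc211.org/2005/gmi"/>'
print(identify_xml_format(doc))
```

keying on the root namespace alone means a 19115-2 record is only recognized when its root element is in the gmi namespace; a record whose root is in the gmd namespace would be classified as iso_19139 by this check.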
Liam or Chris will work on the review of this ticket |
User Story
In order to transform ISO19115-2 documents into DCATUS, datagov wants to create an ISO19115-2 reader in mdtranslator
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
WHEN the document is ingested by mdtranslator
THEN it will be transformed into DCATUS
Background
ISO19115-2 to DCATUS Progress
(24/24)
(source)
* in progress
** blocked
"r" reviewed (read and write)
program code and bureau code aren't required in the non-federal version of DCATUS. there's also no mapping from iso to these fields, so we're skipping them.
[Any helpful contextual notes or links to artifacts/evidence, if needed]
Security Considerations (required)
[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]
Sketch
[Notes or a checklist reflecting our understanding of the selected approach]
Improvements