Create ISO19115-2 reader in MDTranslator #4862

Open
24 of 27 tasks
rshewitt opened this issue Aug 23, 2024 · 14 comments
Assignees
Labels
H2.0/Harvest-Transform Transform Logic for Harvesting 2.0

Comments

@rshewitt
Contributor

rshewitt commented Aug 23, 2024

User Story

In order to transform ISO19115-2 documents into DCATUS, datagov wants to create an ISO19115-2 reader in mdtranslator

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN an ISO19115-2 document
    WHEN the document is ingested by mdtranslator
    THEN it will be transformed into DCATUS

Background

  • mdtranslator doc. use the ticket template in this document for each module.
  • ISO19115-3 reader dev info. the general idea is the same for this ticket.
  • this is really 2 tickets
    • creating the reader
    • transforming the document into DCATUS
  • much of the work on ISO19115-3 can be reused for this. The xml elements appear to be identical. The only difference so far seems to be the namespace used.
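If the elements really are identical apart from the namespace, the lookup logic could in principle be shared by swapping the prefix map. Below is a minimal Python sketch of that idea, for illustration only (it is not mdtranslator code). The two namespace maps, the `md` prefix alias, and `find_title` are assumptions introduced here; the URIs should be confirmed against the real schemas.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs, for illustration; confirm against the real schemas.
NS_19115_2 = {
    "md": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}
NS_19115_3 = {
    "md": "http://standards.iso.org/iso/19115/-3/cit/1.0",
    "gco": "http://standards.iso.org/iso/19115/-3/gco/1.0",
}


def find_title(citation, ns):
    """Return the citation title text using whichever namespace map is passed in."""
    node = citation.find("md:title/gco:CharacterString", ns)
    return node.text if node is not None else None


doc_2 = ET.fromstring(
    '<gmd:CI_Citation xmlns:gmd="http://www.isotc211.org/2005/gmd" '
    'xmlns:gco="http://www.isotc211.org/2005/gco">'
    "<gmd:title><gco:CharacterString>example title</gco:CharacterString></gmd:title>"
    "</gmd:CI_Citation>"
)
doc_3 = ET.fromstring(
    '<cit:CI_Citation xmlns:cit="http://standards.iso.org/iso/19115/-3/cit/1.0" '
    'xmlns:gco="http://standards.iso.org/iso/19115/-3/gco/1.0">'
    "<cit:title><gco:CharacterString>example title</gco:CharacterString></cit:title>"
    "</cit:CI_Citation>"
)

# The same function serves both documents; only the map changes.
title_2 = find_title(doc_2, NS_19115_2)
title_3 = find_title(doc_3, NS_19115_3)
```

The path string stays the same for both versions, which is the reuse the bullet above is hoping for.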

ISO19115-2 to DCATUS Progress

(24/24)
(source)

* in progress
** blocked
"r" reviewed (read and write)

program code and bureau code aren't required in the non-federal version of DCATUS. there's also no mapping from iso to this so we're skipping them.

  • ProgramCode
  • BureauCode
    [Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

[Notes or a checklist reflecting our understanding of the selected approach]

Improvements

  1. determine which modules must return hash vs nil comment1, comment2
  2. logs comment. some proposals: execution error = an empty input doc; validation error = everything else. idk what another example of an execution error would be, possibly whatever is considered "structure"?
  3. consider not returning when encountering an error/warning when it's possible so we gather more information in one process ( e.g. TimePeriod, BoundingBox )
  4. determine which required elements can be "substituted" with an acceptable nilReason. we throw a warning/error every time a required element is missing without considering if it has a nilReason. an acceptable and present nilReason means we don't need to throw a warning/error.
  5. determine full list of siblings for a given extension parent. related to comment
  6. refactor requirement of elements based on the exact type definition. comment. an element can be self-enclosing and valid.
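Improvement 3 (keep going after an error/warning so one pass gathers more information) could look roughly like the Python sketch below. It is illustrative only: `unpack_time_period`, the `begin`/`end` keys, and the message wording are hypothetical and not taken from mdtranslator.

```python
def unpack_time_period(values):
    """Check every field and collect all problems instead of returning at the first one."""
    messages = []
    if not values.get("begin"):
        messages.append("WARNING: time period begin is missing")
    if not values.get("end"):
        messages.append("WARNING: time period end is missing")
    # Only build a result when nothing was flagged; the caller still gets every message.
    result = None if messages else {"begin": values["begin"], "end": values["end"]}
    return result, messages


# An empty input reports both missing fields in a single pass.
empty_result, empty_messages = unpack_time_period({})
ok_result, ok_messages = unpack_time_period({"begin": "2020-01-01", "end": "2020-12-31"})
```

The trade-off is that the caller has to inspect the message list rather than rely on an early return.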
@rshewitt rshewitt added the H2.0/Harvest-Transform Transform Logic for Harvesting 2.0 label Aug 23, 2024
@jbrown-xentity
Contributor

I did some quick analysis; I basically copied the iso19115-3 writer over the iso19115-2 writer and analyzed the git changes. You can see this here. Some high level notes:

  • The style has definitely changed between the two: some variable names that are more or less standardized within each version differ slightly, and the code conventions differ as well. We don't need to carry that over to the reader; those can be consistent in most cases.
  • There are a handful of classes that are not implemented, not sure if those were incomplete in -2, incomplete in -3, or an actual difference.
  • Many of the meaningful changes have to do with the namespace change (see here), which is not a large lift.
  • At this high level of analysis, it's unknown how much the modules actually used for transforming into DCAT-US differ.

Happy to discuss any of the nuances with the @GSA/data-gov-dev-team at an office hours.

@rshewitt
Contributor Author

rshewitt commented Aug 29, 2024

the intention is to work downward from title to PrimaryITInvestmentUII. check the box when the PR is merged. add ~4% to the progress bar (1/26). here's a child ticket

@Bagesary Bagesary moved this to 📟 Sprint Backlog [7] in data.gov team board Aug 29, 2024
@btylerburton btylerburton self-assigned this Sep 25, 2024
@rshewitt
Contributor Author

rshewitt commented Oct 7, 2024

a simple test I ran locally just to confirm what we already know. we currently only use the iso19139ngdc validator for spatial types. this test uses that schema.

from lxml import etree # not standard library

# in practice I downloaded the file and parsed the local copy; the url is shown for reference
xml_doc = "https://github.com/GSA/mdTranslator/blob/datagov/test/readers/iso19115_2/testData/wip_iso19115-2_datagov_harvest_altered_source.xml"
xml_doc = etree.parse(xml_doc)

xsd_doc = "https://github.com/ckan/ckanext-spatial/blob/master/ckanext/spatial/validation/xml/iso19139ngdc/schema/gmi/gmi.xsd"
xsd_doc = etree.parse(xsd_doc)

xmlschema = etree.XMLSchema(xsd_doc)

xmlschema.validate(xml_doc) # returns True
  • the same document fails when validated against ISO19115-2
from lxml import etree # not standard library

# in practice I downloaded the file and parsed the local copy; the url is shown for reference
xml_doc = "https://github.com/GSA/mdTranslator/blob/datagov/test/readers/iso19115_2/testData/wip_iso19115-2_datagov_harvest_altered_source.xml"
xml_doc = etree.parse(xml_doc)

xsd_doc = "https://github.com/ISO-TC211/XML/blob/master/schemas.isotc211.org/19115/-2/gmi/1.0/gmi.xsd"
xsd_doc = etree.parse(xsd_doc)

xmlschema = etree.XMLSchema(xsd_doc)
# ^ this fails and throws the following error. after removing the problem element it works.
"""
>>> xmlschema = etree.XMLSchema(xsd_doc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/xmlschema.pxi", line 89, in lxml.etree.XMLSchema.__init__
lxml.etree.XMLSchemaParseError: element decl. '{http://standards.iso.org/iso/19115/-2/gmi/1.0}resultFile', attribute 'type': The QName value '{http://www.isotc211.org/2005/gmx}MX_DataFile_PropertyType' does not resolve to a(n) type definition., line 153
"""

xmlschema.validate(xml_doc) # returns False

notes

  • the ngdc namespace schemas are present locally in ckanext-spatial and are referenced via relative path ( e.g. schemaLocation="../gco/gco.xsd" ).
  • the iso19115-2 namespace schemas are referenced via hyperlink and aren't present locally.
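To see at a glance which dependencies a given xsd pulls in by relative path versus by hyperlink, something like the following standard-library sketch could be used. `schema_locations` and the inline `sample` schema are made up for illustration; a real gmi.xsd would be read from disk instead.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

XS = "http://www.w3.org/2001/XMLSchema"


def schema_locations(xsd_text):
    """Return (location, kind) for each xs:import / xs:include found in an XSD string."""
    root = ET.fromstring(xsd_text)
    found = []
    for tag in ("import", "include"):
        for node in root.iter(f"{{{XS}}}{tag}"):
            location = node.get("schemaLocation")
            if location is None:
                continue
            # Anything with an http(s) scheme is fetched remotely; the rest is a local path.
            is_remote = urlparse(location).scheme in ("http", "https")
            found.append((location, "remote" if is_remote else "relative"))
    return found


# Hypothetical schema with one dependency of each kind.
sample = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:import namespace="urn:example:a" schemaLocation="../gco/gco.xsd"/>
  <xs:import namespace="urn:example:b" schemaLocation="https://example.org/schemas/b.xsd"/>
</xs:schema>"""

locations = schema_locations(sample)
```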

@btylerburton
Contributor

This is good research. Thanks for doing this, Reid!

@rshewitt
Contributor Author

rshewitt commented Oct 8, 2024

just got confirmation from chris that a child inherits the requirement from its parent. this affects the first condition usually implemented at the beginning of each self.unpack method, specifically whether we throw a warning/error in that condition. as an example, <gmd:identificationInfo> is required at least once within <gmi:MI_Metadata>, which means <gmd:MD_DataIdentification> must exist within <gmd:identificationInfo>. we're already throwing an error in this exact circumstance, which is good, but we want to implement this across the board.
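A rough Python illustration of the inherited-requirement check described above (not the actual unpack code; `unpack_identification_info`, the namespace map, and the message text are assumptions for the sketch):

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the gmd/gmi prefixes.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gmi": "http://www.isotc211.org/2005/gmi",
}


def unpack_identification_info(x_metadata):
    """Return (MD_DataIdentification element or None, list of messages)."""
    messages = []
    x_info = x_metadata.find("gmd:identificationInfo", NS)
    if x_info is None:
        messages.append("ERROR: gmd:identificationInfo is missing")
        return None, messages
    x_data_id = x_info.find("gmd:MD_DataIdentification", NS)
    if x_data_id is None:
        # The parent is required, so its child inherits that requirement.
        messages.append("ERROR: gmd:MD_DataIdentification is missing")
        return None, messages
    return x_data_id, messages


# The wrapper element is present but empty, so the child check should fire.
doc = ET.fromstring(
    '<gmi:MI_Metadata xmlns:gmi="http://www.isotc211.org/2005/gmi" '
    'xmlns:gmd="http://www.isotc211.org/2005/gmd">'
    "<gmd:identificationInfo/>"
    "</gmi:MI_Metadata>"
)
node, messages = unpack_identification_info(doc)
```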

@btylerburton btylerburton removed their assignment Oct 8, 2024
@rshewitt
Contributor Author

rshewitt commented Oct 9, 2024

the gml xsd contains information on the nilReason attribute, which is commonly used throughout many other relevant namespaces. many complex types have something like <xs:attribute ref="gco:nilReason"/> in their declaration, and bubbling up the hierarchy leads to the gml xsd linked at the beginning of this comment. it looks like the acceptable values for nilReason are inapplicable, missing, template, unknown, withheld, and anything matching the regex other:\w{2,}.

the important question is which required elements can be "substituted" with an acceptable nilReason? in every case so far we throw a warning/error when a required element is missing.
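Whichever elements end up being substitutable, the value check itself is small. Here is a hedged Python sketch based only on the values listed above; `is_acceptable_nil_reason` and `nil_reason_of` are hypothetical helpers, and the gco namespace URI is an assumption.

```python
import re
import xml.etree.ElementTree as ET

GCO = "http://www.isotc211.org/2005/gco"  # assumed gco namespace URI
NIL_REASONS = {"inapplicable", "missing", "template", "unknown", "withheld"}
OTHER_PATTERN = re.compile(r"other:\w{2,}")


def is_acceptable_nil_reason(value):
    """True when the value is one of the listed reasons or matches other:\\w{2,}."""
    if value is None:
        return False
    return value in NIL_REASONS or OTHER_PATTERN.fullmatch(value) is not None


def nil_reason_of(element):
    """Read the gco:nilReason attribute from an element, or None when absent."""
    return element.get(f"{{{GCO}}}nilReason")


x_title = ET.fromstring(
    '<gmd:title xmlns:gmd="http://www.isotc211.org/2005/gmd" '
    'xmlns:gco="http://www.isotc211.org/2005/gco" gco:nilReason="missing"/>'
)
# An empty element with an acceptable nilReason would not need a warning/error.
skip_warning = is_acceptable_nil_reason(nil_reason_of(x_title))
```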

@rshewitt
Contributor Author

rshewitt commented Oct 10, 2024

found a python library which can convert xsd trees into plantuml files by running:

f=ckanext-spatial/ckanext/spatial/validation/xml/iso19139ngdc/schema/gmi/gmi.xsd
poetry run xsdata $f --output plantuml --package [output_dir_name]

CI_Citation_Type goes from...

<xs:complexType name="CI_Citation_Type">
		<xs:annotation>
			<xs:documentation>Standardized resource reference</xs:documentation>
		</xs:annotation>
		<xs:complexContent>
			<xs:extension base="gco:AbstractObject_Type">
				<xs:sequence>
					<xs:element name="title" type="gco:CharacterString_PropertyType"/>
					<xs:element name="alternateTitle" type="gco:CharacterString_PropertyType" minOccurs="0" maxOccurs="unbounded"/>
					<xs:element name="date" type="gmd:CI_Date_PropertyType" maxOccurs="unbounded"/>
					<xs:element name="edition" type="gco:CharacterString_PropertyType" minOccurs="0"/>
					<xs:element name="editionDate" type="gco:Date_PropertyType" minOccurs="0"/>
					<xs:element name="identifier" type="gmd:MD_Identifier_PropertyType" minOccurs="0" maxOccurs="unbounded"/>
					<xs:element name="citedResponsibleParty" type="gmd:CI_ResponsibleParty_PropertyType" minOccurs="0" maxOccurs="unbounded"/>
					<xs:element name="presentationForm" type="gmd:CI_PresentationFormCode_PropertyType" minOccurs="0" maxOccurs="unbounded"/>
					<xs:element name="series" type="gmd:CI_Series_PropertyType" minOccurs="0"/>
					<xs:element name="otherCitationDetails" type="gco:CharacterString_PropertyType" minOccurs="0"/>
					<xs:element name="collectiveTitle" type="gco:CharacterString_PropertyType" minOccurs="0"/>
					<xs:element name="ISBN" type="gco:CharacterString_PropertyType" minOccurs="0"/>
					<xs:element name="ISSN" type="gco:CharacterString_PropertyType" minOccurs="0"/>
				</xs:sequence>
			</xs:extension>
		</xs:complexContent>
	</xs:complexType>

to this...

Image

it doesn't look like it captures whether an element is optional/required, but it does include whether it's unbounded (i.e. [] )
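Since the diagram drops optional/required, that information can be read straight from the xsd instead. This is a small standard-library sketch: `occurrence_table` is a made-up helper, and the sample is a trimmed copy of the CI_Citation_Type sequence above wrapped in a bare xs:schema element.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"


def occurrence_table(xsd_text):
    """Map each named xs:element to (minOccurs, maxOccurs); both default to 1 in XSD."""
    root = ET.fromstring(xsd_text)
    table = {}
    for node in root.iter(f"{{{XS}}}element"):
        name = node.get("name")
        if name is None:
            continue
        # maxOccurs stays a string because it can be the literal "unbounded".
        table[name] = (int(node.get("minOccurs", "1")), node.get("maxOccurs", "1"))
    return table


sample = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="title" type="gco:CharacterString_PropertyType"/>
  <xs:element name="alternateTitle" type="gco:CharacterString_PropertyType" minOccurs="0" maxOccurs="unbounded"/>
  <xs:element name="date" type="gmd:CI_Date_PropertyType" maxOccurs="unbounded"/>
</xs:schema>"""

table = occurrence_table(sample)
```

An element with a minimum of 1 or more is required; a minimum of 0 makes it optional.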

it also appears to resolve the properties/attributes of elements as well...

Image

so according to this CI_Citation can look like?

<gmd:CI_Citation gco:nilReason="inapplicable" gco:title="some title" gco:uuidref="asoidhoads1092u3e0hasodh"/>

xsdata can also convert xsd trees into python dataclasses. here's what citation looks like as a dataclass

@dataclass
class CiCitationType(AbstractObjectType):
    """
    Standardized resource reference.
    """

    class Meta:
        name = "CI_Citation_Type"

    title: Optional[CharacterStringPropertyType] = field(
        default=None,
        metadata={
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
            "required": True,
        },
    )
    alternate_title: List[CharacterStringPropertyType] = field(
        default_factory=list,
        metadata={
            "name": "alternateTitle",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    date: List[CiDatePropertyType] = field(
        default_factory=list,
        metadata={
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
            "min_occurs": 1,
        },
    )
    edition: Optional[CharacterStringPropertyType] = field(
        default=None,
        metadata={
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    edition_date: Optional[DatePropertyType] = field(
        default=None,
        metadata={
            "name": "editionDate",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    identifier: List[MdIdentifierPropertyType] = field(
        default_factory=list,
        metadata={
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    cited_responsible_party: List[CiResponsiblePartyPropertyType] = field(
        default_factory=list,
        metadata={
            "name": "citedResponsibleParty",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    presentation_form: List[CiPresentationFormCodePropertyType] = field(
        default_factory=list,
        metadata={
            "name": "presentationForm",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    series: Optional[CiSeriesPropertyType] = field(
        default=None,
        metadata={
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    other_citation_details: Optional[CharacterStringPropertyType] = field(
        default=None,
        metadata={
            "name": "otherCitationDetails",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    collective_title: Optional[CharacterStringPropertyType] = field(
        default=None,
        metadata={
            "name": "collectiveTitle",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    isbn: Optional[CharacterStringPropertyType] = field(
        default=None,
        metadata={
            "name": "ISBN",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
    issn: Optional[CharacterStringPropertyType] = field(
        default=None,
        metadata={
            "name": "ISSN",
            "type": "Element",
            "namespace": "https://www.isotc211.org/2005/gmd",
        },
    )
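The generated field metadata carries the requirement information ("required": True on title, "min_occurs": 1 on date), so it can be listed programmatically. The sketch below uses a cut-down stand-in class, with `CitationSketch` and `required_field_names` being hypothetical and `str` replacing the generated property types, to keep it self-contained.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class CitationSketch:
    """Stand-in for the generated CiCitationType, reduced to four fields."""

    title: Optional[str] = field(
        default=None, metadata={"type": "Element", "required": True}
    )
    alternate_title: List[str] = field(
        default_factory=list, metadata={"name": "alternateTitle", "type": "Element"}
    )
    date: List[str] = field(
        default_factory=list, metadata={"type": "Element", "min_occurs": 1}
    )
    edition: Optional[str] = field(default=None, metadata={"type": "Element"})


def required_field_names(cls):
    """List XML names of fields flagged required or with min_occurs of at least 1."""
    names = []
    for f in fields(cls):
        if f.metadata.get("required") or f.metadata.get("min_occurs", 0) >= 1:
            # Fall back to the Python attribute name when no XML name is recorded.
            names.append(f.metadata.get("name", f.name))
    return names


required = required_field_names(CitationSketch)
```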

@rshewitt
Contributor Author

rshewitt commented Oct 10, 2024

there's a cool utility xsdata offers which is to download a schema and its dependencies. here's an example using ISO19115-2.

poetry run xsdata download https://standards.iso.org/iso/19115/-2/gmi/1.0/gmi.xsd

unfortunately, this fails and produces a parsing error.

xsdata.exceptions.ParserError: Unknown property {http://www.w3.org/2001/XMLSchema}schema:{http://www.w3.org/1999/xhtml}head

@btylerburton btylerburton moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Oct 11, 2024
@rshewitt
Contributor Author

rshewitt commented Nov 4, 2024

improvements based on testing against production documents using the code in this pr

  • child elements of TimePeriod ( e.g. gml:beginPosition or gml:endPosition ) can have an indeterminatePosition attribute which appears to serve a similar purpose to nilReason. this needs to be handled.
  • gmd:date within gmd:CI_Date can have different children ( i.e. gco:DateTime or gco:Date ). only gco:Date is handled currently. need to update the associated module to find either one. this relates to improvement number 5
    • depending on the situation this could be as easy as updating the xpath with an OR ( i.e. gco:Date | gco:DateTime )
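The "find either one" idea for gmd:date can be sketched in Python as below. Python's standard ElementTree does not support the XPath union operator, so this version tries each candidate path in turn; with a full XPath 1.0 engine the union form above should work directly. `find_first`, `read_ci_date`, and the namespace map are assumptions for the sketch.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the gmd/gco prefixes.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def find_first(element, paths):
    """Return the first node matched by any of the candidate paths, else None."""
    for path in paths:
        node = element.find(path, NS)
        if node is not None:
            return node
    return None


def read_ci_date(x_ci_date):
    """Return the date text whether it is carried by gco:Date or gco:DateTime."""
    node = find_first(x_ci_date, ["gmd:date/gco:Date", "gmd:date/gco:DateTime"])
    return node.text if node is not None else None


x_ci_date = ET.fromstring(
    '<gmd:CI_Date xmlns:gmd="http://www.isotc211.org/2005/gmd" '
    'xmlns:gco="http://www.isotc211.org/2005/gco">'
    "<gmd:date><gco:DateTime>2024-11-04T00:00:00</gco:DateTime></gmd:date>"
    "</gmd:CI_Date>"
)
date_text = read_ci_date(x_ci_date)
```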

@rshewitt
Contributor Author

rshewitt commented Nov 5, 2024

it's common for xsd elements to be substitutable. here's a simple example

<xs:element name="Anchor" type="gmx:Anchor_Type" substitutionGroup="gco:CharacterString"/>

this means gmx:Anchor can replace/substitute gco:CharacterString. a practical example i found in a noaa dataset is...

<!-- with substitution --> 
<gmd:keyword>
    <gmx:Anchor xlink:href="https://www.ncei.noaa.gov/archive/accession/0000003" xlink:actuate="onRequest">0000003</gmx:Anchor>
</gmd:keyword>

<!-- without substitution --> 
<gmd:keyword>
    <gco:CharacterString>0000003</gco:CharacterString>
</gmd:keyword>

thankfully with leaf nodes like these i can simply update the xpath with an or operator like mentioned in my comment above
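For the keyword case specifically, a reader may also want to keep the link that gmx:Anchor carries. A minimal Python sketch follows, assuming the namespace URIs below; `read_keyword` and its return shape are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the prefixes used in the example above.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
    "gmx": "http://www.isotc211.org/2005/gmx",
}
XLINK = "http://www.w3.org/1999/xlink"


def read_keyword(x_keyword):
    """Return keyword text plus href (None for a plain gco:CharacterString)."""
    for tag in ("gco:CharacterString", "gmx:Anchor"):
        node = x_keyword.find(tag, NS)
        if node is not None:
            return {"keyword": node.text, "href": node.get(f"{{{XLINK}}}href")}
    return None


x_keyword = ET.fromstring(
    '<gmd:keyword xmlns:gmd="http://www.isotc211.org/2005/gmd" '
    'xmlns:gmx="http://www.isotc211.org/2005/gmx" '
    'xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<gmx:Anchor xlink:href="https://www.ncei.noaa.gov/archive/accession/0000003" '
    'xlink:actuate="onRequest">0000003</gmx:Anchor>'
    "</gmd:keyword>"
)
keyword = read_keyword(x_keyword)
```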

@rshewitt
Contributor Author

rshewitt commented Nov 5, 2024

doing a test run on https://data.noaa.gov/waf/NOAA/NESDIS/ncei/accessions/iso/xml/. the first validation failure occurred on dataset 5407. the issue involved an incorrectly processed attribute of a gml:endPosition element. temporal elements like this can contain an "indeterminatePosition" attribute with enumeration values found in the "TimeIndeterminateValueType" in gml temporal. this has been fixed.
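For reference, handling that attribute could look roughly like this in Python. It is not the actual fix; `read_position` is a hypothetical helper, the gml namespace URI in the sample is an assumption, and the allowed values are the TimeIndeterminateValueType enumeration as I understand it (after, before, now, unknown).

```python
import xml.etree.ElementTree as ET

# TimeIndeterminateValueType enumeration values from the gml temporal schema.
INDETERMINATE_VALUES = {"after", "before", "now", "unknown"}


def read_position(x_position):
    """Return the position text and its indeterminatePosition attribute, if any."""
    text = (x_position.text or "").strip()
    indeterminate = x_position.get("indeterminatePosition")
    if indeterminate is not None and indeterminate not in INDETERMINATE_VALUES:
        raise ValueError(f"unexpected indeterminatePosition: {indeterminate}")
    # An empty element is fine as long as the attribute explains why.
    return {"value": text or None, "indeterminate": indeterminate}


x_end = ET.fromstring(
    '<gml:endPosition xmlns:gml="http://www.opengis.net/gml/3.2" '
    'indeterminatePosition="now"/>'
)
end_position = read_position(x_end)
```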

@rshewitt
Contributor Author

rshewitt commented Nov 20, 2024

two takeaways from meeting with liam:

  • arcpro can export iso docs without element namespaces ( i.e. < citation > vs < cit:citation > ). this will cause problems. he mentioned geoplatform has an automated process to add them back in. i asked him to share it.
  • he mentioned geoplatform also has a schema detection process which i requested to see.
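Until the geoplatform process is shared, a quick way to spot documents exported without element namespaces is to list the tags that carry none. This is only a detection sketch in Python, not a repair; `unqualified_tags` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def unqualified_tags(xml_text):
    """Return the sorted set of element names that have no namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{uri}name", so a missing brace means no namespace.
    return sorted({el.tag for el in root.iter() if not el.tag.startswith("{")})


sample = (
    '<metadata xmlns:cit="urn:example:cit">'
    "<cit:citation/>"
    "<citation/>"
    "</metadata>"
)
missing = unqualified_tags(sample)
```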

@rshewitt
Contributor Author

rshewitt commented Nov 21, 2024

here's the schema detection function. i figured the logic would be something like this so it's nice to have some validation.

function identifyXMLFormat (options, metadata) {
  const { DOMParser, xpath } = options.deps;
  const XMLNS = {
    GMD: 'http://www.isotc211.org/2005/gmd',
    GMI: 'http://www.isotc211.org/2005/gmi',
    // GMI: 'http://standards.iso.org/iso/19115/-2/gmi/1.0',
    MDB1: 'http://standards.iso.org/iso/19115/-3/mdb/1.0',
    MDB2: 'http://standards.iso.org/iso/19115/-3/mdb/2.0'
  };

  const select = xpath.useNamespaces(XMLNS);
  const parser = new DOMParser();
  const doc = parser.parseFromString(metadata.toString(), 'application/xml');
  const rootNamespace = doc.documentElement.namespaceURI;

  switch (rootNamespace) {
    case 'http://www.mozilla.org/newlayout/xml/parsererror.xml':
      throw new Error(doc.documentElement.textContent);

    case XMLNS.GMD:
      return 'iso_19139';

    case XMLNS.GMI:
      return 'iso_19115-2';

    case XMLNS.MDB1:
      return 'iso_19115-3:1.0';

    case XMLNS.MDB2:
      return 'iso_19115-3:2.0';

    case null:
      // catch esri metadata
      const esri = select('//esri', doc);
      if (esri.length > 0) {
        throw new Error(
          `Esri metadata has been identified. Metadata cannot be processed. Exiting.`
        );
      }

      const root = select('//metadata', doc);
      if (root && root[0]) {
        return 'csdgm';
      }
    // Fall through for error handling

    default:
      throw new Error(
        `Unknown XML metadata namespace ${doc.documentElement.namespaceURI}`
      );
  }
}
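A rough Python equivalent of the same root-namespace check, in case it is useful on the harvester side. This is a sketch, not a port that has been run against real documents: `identify_xml_format` and `FORMATS` are names introduced here, and the namespace URIs are written with the http scheme on the assumption that this matches the documents being harvested.

```python
import xml.etree.ElementTree as ET

# Assumed root namespace to format mapping, mirroring the function above.
FORMATS = {
    "http://www.isotc211.org/2005/gmd": "iso_19139",
    "http://www.isotc211.org/2005/gmi": "iso_19115-2",
    "http://standards.iso.org/iso/19115/-3/mdb/1.0": "iso_19115-3:1.0",
    "http://standards.iso.org/iso/19115/-3/mdb/2.0": "iso_19115-3:2.0",
}


def identify_xml_format(xml_text):
    """Identify the metadata format from the root element's namespace."""
    root = ET.fromstring(xml_text)  # raises ParseError on malformed XML
    if root.tag.startswith("{"):
        namespace = root.tag[1:].split("}", 1)[0]
        if namespace in FORMATS:
            return FORMATS[namespace]
        raise ValueError(f"Unknown XML metadata namespace {namespace}")
    # No root namespace: check for esri metadata first, then csdgm.
    if root.tag == "esri" or root.find(".//esri") is not None:
        raise ValueError("Esri metadata has been identified. Metadata cannot be processed.")
    if root.tag == "metadata" or root.find(".//metadata") is not None:
        return "csdgm"
    raise ValueError("Unknown XML metadata namespace None")


gmi_format = identify_xml_format(
    '<gmi:MI_Metadata xmlns:gmi="http://www.isotc211.org/2005/gmi"/>'
)
csdgm_format = identify_xml_format("<metadata><idinfo/></metadata>")
```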

@Bagesary

Liam or Chris will work on the review of this ticket

@Bagesary Bagesary moved this from 🏗 In Progress [8] to 📟 Sprint Backlog [7] in data.gov team board Nov 21, 2024
Projects
Status: 📟 Sprint Backlog [7]
Development

No branches or pull requests

4 participants