Small application to convert GeoNames data from a giant file in a hard-to-parse format into multiple files of rational XML.

stevedlawrence/GeoNames

GeoNames

GeoNames is a big and useful data set found at www.geonames.org.

The file is quite large: many gigabytes when uncompressed.

It's not an easy-to-use format unless you have a data import program that already understands it. The single giant file is also awkward to work with. Each record contains a quasi-XML fragment, which an XML parser can't read as-is: the records must first be parsed and then reassembled into viable XML.
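To make the "parse, then reassemble" step concrete, here is a minimal Scala sketch. It is not the code in this repository. It assumes the records have already been split into an identifier and a fragment string, and that a fragment may carry its own XML declaration; the names `Record`, `stripDeclaration`, `escapeAttr`, and `assemble`, and the `<records>`/`<record>` wrapper elements, are all hypothetical.

```scala
object ReassembleSketch {
  // Hypothetical record shape: an identifier plus the quasi-XML fragment text.
  final case class Record(id: String, fragment: String)

  // Matches a leading XML declaration such as <?xml version="1.0"?>.
  private val XmlDecl = """^\s*<\?xml[^>]*\?>""".r

  // A well-formed document allows only one declaration, at the very start,
  // so drop any declaration carried by an individual fragment.
  def stripDeclaration(fragment: String): String =
    XmlDecl.replaceFirstIn(fragment, "").trim

  // Escape the characters that are unsafe inside a double-quoted attribute.
  def escapeAttr(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;")

  // Wrap every fragment in its own <record> element under a single root.
  def assemble(records: Seq[Record]): String = {
    val header = """<?xml version="1.0" encoding="UTF-8"?>"""
    val body = records.map { r =>
      s"""  <record id="${escapeAttr(r.id)}">${stripDeclaration(r.fragment)}</record>"""
    }
    (Seq(header, "<records>") ++ body ++ Seq("</records>")).mkString("\n")
  }
}
```

The result has a single declaration and a single root element, so a standard XML parser can read it in one pass.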

Note that this DFDL schema is not really a practical way to process this kind of data. There are RDF-style importers that handle this sort of data directly. Nevertheless, this example illustrates the kind of thing DFDL can do easily: converting data from an ad-hoc file format into XML or JSON.

This module is a DFDL schema that parses this data and makes it possible to reconstruct a "real XML" representation.

This schema is portable to both IBM DFDL and Daffodil.

Note that if you parse and then unparse this data, you get back exactly the format you started from. You don't want to unparse this data; you want to write out something different.

See https://github.com/OpenDFDL/daffodil-spark for code that actually converts GeoNames data into XML files. It shows how to use Apache Spark to operate on the data after parsing it with Daffodil. The assembly into XML and the creation of a multi-file compressed-XML dataset are both parallelized by Spark.
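For readers who only want to see what a "multi-file compressed-XML dataset" might look like, the following is a small sequential sketch in plain Scala. It deliberately uses neither Spark nor Daffodil and is not taken from daffodil-spark. It assumes the fragments are already well-formed XML elements; the object name `ChunkedXmlWriter`, the `part-NNNNN.xml.gz` naming scheme, and the `<records>` root element are hypothetical.

```scala
import java.io.OutputStreamWriter
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import java.util.zip.GZIPOutputStream

object ChunkedXmlWriter {
  // Write `fragments` into gzip-compressed XML files holding at most
  // `perFile` fragments each, and return the paths that were written.
  def writeChunks(fragments: Iterator[String], perFile: Int, outDir: Path): List[Path] = {
    Files.createDirectories(outDir)
    fragments.grouped(perFile).zipWithIndex.map { case (chunk, i) =>
      val path = outDir.resolve(f"part-$i%05d.xml.gz") // hypothetical naming scheme
      val out = new OutputStreamWriter(
        new GZIPOutputStream(Files.newOutputStream(path)), StandardCharsets.UTF_8)
      try {
        // Each file is a complete document: one declaration, one root element.
        out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<records>\n")
        chunk.foreach(fragment => out.write(fragment + "\n"))
        out.write("</records>\n")
      } finally out.close()
      path
    }.toList // force the lazy iterator so every file is actually written
  }
}
```

Because each chunk is written independently of the others, this is the kind of step a framework such as Spark can distribute across workers.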

Languages

  • Scala 100.0%