Rework schema (list of headers) and document #29

Closed
rufuspollock opened this issue May 29, 2013 · 26 comments
Labels
Website The frontend of publicbodies.org website ★★★

Comments

@rufuspollock
Member

@rossjones suggested: "Would it make sense for publicbodies.org to follow the popolo spec at https://popoloproject.com/data.html" (that link is now broken)

Correct link is: https://popoloproject.com/specs/organization.html

Seems a great idea!

Current fields

Current fields and suggested changes (e.g. to be in line with popolo as much as possible). Note the list of changes is in progress and incomplete.

  • title => name (in org name)
  • abbr => abbreviation
  • key => id (?)
  • category => classification
  • parent => DELETE (just have parent_id)
  • parent_key => parent_id
  • description
  • url
  • jurisdiction => DELETE (just have jurisdiction code)
  • jurisdiction_code = ISO two-letter code where one exists. Otherwise we coin one.
  • source => DELETE in favour of source URL (??)
  • source_url => keep
    • make clear there is no point pointing at exactly the same API endpoint - much more useful to point at a specific location
    • (??) DELETE entirely and just credit in contributor notes (we already have a bunch of different sources for data and as people add the problem will get worse)
    • Could have multiple sources per entry (??)
  • address
  • contact => What's the difference from address?
  • email
  • tags => keep
    • at the moment several of the files use tags (though not necessarily consistently)
  • created_at => DELETE (little value ...)
  • updated_at => DELETE (ditto)

Add:

  • other_names: semicolon-separated list of alternate names
  • founding_date: ISO 8601
  • dissolution_date: ISO 8601
  • image
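
To make the proposed renames concrete, here is a minimal sketch of a header conversion. It is not the conversion script from the repository: the `RENAMES` table simply transcribes the list above (including the entries still marked with question marks), and the function name `convert` is hypothetical.

```python
import csv
import io

# Old header -> new header, transcribed from the proposal above.
# None marks a column the proposal suggests deleting. Several of these
# (key => id, source, created_at, updated_at) are still open questions.
RENAMES = {
    "title": "name",
    "abbr": "abbreviation",
    "key": "id",
    "category": "classification",
    "parent": None,
    "parent_key": "parent_id",
    "jurisdiction": None,
    "source": None,
    "created_at": None,
    "updated_at": None,
}

# Columns the proposal adds; they start out empty.
NEW_COLUMNS = ["other_names", "founding_date", "dissolution_date", "image"]


def convert(old_csv_text):
    """Return CSV text with renamed headers, dropped and added columns."""
    reader = csv.DictReader(io.StringIO(old_csv_text))
    # Keep every column that is not explicitly mapped to None.
    kept = [h for h in reader.fieldnames if RENAMES.get(h, h) is not None]
    new_headers = [RENAMES.get(h, h) for h in kept] + NEW_COLUMNS

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=new_headers, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Columns missing from the row (the new ones) are written as "".
        writer.writerow({RENAMES.get(h, h): row[h] for h in kept})
    return out.getvalue()
```

Columns not listed in `RENAMES` (description, url, address and so on) pass through unchanged.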

Consider switch to JSON from CSV

Pros / Cons

  • (+) Greater flexibility, ability to directly match org spec
    • In particular can handle multiple values, multiple identifiers
  • (-) Much bigger and less compact. Harder for people to work with (e.g. CSV usable in spreadsheets etc)
  • (-) More complexity (but perhaps necessary)
@jpmckinney

Popolo doesn't define a CSV representation yet - there is RDF and JSON so far. On the RDF path, I'm not sure if Linked CSV is ready. On the JSON path, it should be straightforward to re-use JSON fields as CSV headers.

Is there a documented version of the CSV schema? The datapackage.json doesn't describe the difference between parent and parent_key for example. Once I understand the schema, I can propose one that uses Popolo terms.

@rufuspollock
Member Author

@jpmckinney list of fields set out above and some initial suggested changes.

@markbrough your thoughts here re IATI very useful ...

@jpmckinney

I'll review more closely in a bit, but to clarify one point, you can use fields outside of those within Popolo while still being conformant: https://popoloproject.com/specs/#conformance So, if you want to keep created_at, that should be fine (I'll actually be adding it to Popolo as it came up in the previous round of feedback).

@rufuspollock
Member Author

@jpmckinney any thoughts here? I'm aiming to do a rev of this (and possibly finalize it) asap. I guess the big question here is CSV vs JSON (I mean for JSON we'd just take the full popolo version I think). If CSV, how do we map, and how do we handle things like fields with multiple possible values? Options are:

  • Inline into field in a hacky way (e.g. aliases could be ; separated)
  • Inline JSON into a field :-/
  • Have a separate "table" joined to main table
  • ...?

@jpmckinney

Sorry for delay, I'll look at this within the next day.

@jpmckinney

The "abbr" column in the CSV would be the "other_names" array in the JSON. Maybe rename "abbr" to "other_name"? Otherwise I think all the other header names conform.

CSV has the big advantage of more people being able to understand, create and use it. Is it anticipated that many fields will be multi-value? Has that come up already? How much detailed info are these lists expected to contain?

If the project is expected to maintain a fairly narrow scope with only essential/primary data, then CSV should be enough. If it's expected to expand to provide detailed info for at least some jurisdictions, then JSON is necessary.

A hybrid approach may allow people to submit CSVs (for those jurisdictions that don't (yet) have detailed info), and a script would be run to convert those CSVs to JSON. Thoughts?
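
A rough sketch of what such a conversion script could look like, using only the Python standard library. The function name `csv_to_json`, the choice of `other_names` as the multi-value column, and the ";" separator are assumptions for illustration. The output is plain JSON keyed by the CSV headers, not a checked Popolo document.

```python
import csv
import io
import json


def csv_to_json(csv_text, multi_value_columns=("other_names",), separator=";"):
    """Convert submitted CSV text into a JSON array of records."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {}
        for header, value in row.items():
            if header is None:
                # Extra cells beyond the header row are ignored in this sketch.
                continue
            value = (value or "").strip()
            if header in multi_value_columns:
                # Multi-value cells become JSON arrays.
                record[header] = [v.strip() for v in value.split(separator) if v.strip()]
            elif value:
                # Empty cells are left out of the JSON record.
                record[header] = value
        records.append(record)
    return json.dumps(records, indent=2, ensure_ascii=False)
```

Contributors would keep editing CSV, and the JSON would be regenerated from it, so the CSV stays the single source for jurisdictions without detailed info.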

Re: multi-value columns in CSV:

  1. Within-column separators like ";" or "|" have a small risk of causing parsing issues, and are a real headache to escape if they ever occur within one of the multiple values. Not that bad, on the whole.
  2. Inline JSON is worse than within-column separators, I think.
  3. Depending on how important the additional values are, this may be acceptable.
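
To illustrate the escaping headache in option 1, here is a small sketch of a within-column separator with backslash escaping. The backslash convention and the helper names `join_values` and `split_values` are assumptions, not something the project has adopted.

```python
def join_values(values, sep=";", esc="\\"):
    """Join values into one cell, escaping the separator and the escape char."""
    escaped = [v.replace(esc, esc + esc).replace(sep, esc + sep) for v in values]
    return sep.join(escaped)


def split_values(cell, sep=";", esc="\\"):
    """Split a cell written by join_values back into its values."""
    if cell == "":
        # An empty cell is treated as "no values".
        return []
    values, current, i = [], [], 0
    while i < len(cell):
        ch = cell[i]
        if ch == esc and i + 1 < len(cell):
            # Escaped character: take the next character literally.
            current.append(cell[i + 1])
            i += 2
        elif ch == sep:
            # Unescaped separator: close the current value.
            values.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    values.append("".join(current))
    return values
```

A plain `cell.split(";")` would cut a name such as "Dept; of Health" in two, which is why every consumer of the CSV would need to know about the escaping rule.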

@rufuspollock
Member Author

OK, so I think we'll go for plain CSV and see how we do. I've made another tweak to include other_names.

@jpmckinney

I don't know if a new other_names field will be used that frequently - I was just suggesting renaming abbr, but in retrospect I guess there's utility to picking out the shortest version of a name, e.g. for display on mobiles or other space-constrained places. Why not rename to abbreviation, though, since no other field name is abbreviated?

For source_url, I think it may be useful to keep. I write scrapers for public bodies, and assign the source to the page on the authoritative source's website that was scraped.

@rufuspollock
Member Author

@jpmckinney all good suggestions (as usual!) - let's run with both of them. I've updated the change proposal above to reflect these.

@rufuspollock
Member Author

Added founding_date and dissolution_date and image to add.

@stefanw could you clarify what contact is used for versus address in the de data - see https://datapipes.okfnlabs.org/csv/head%2010/html?url=https://github.com/okfn/publicbodies/raw/master/data/de.csv

rufuspollock added a commit that referenced this issue Oct 6, 2013
* datapackage.json: new schema with descriptions
* data: update all data in line with new schema (this should be lossless)
* scripts: conversion script
* app: minor update to app and templates for new schema
@rufuspollock
Member Author

FIXED.

@stefanw

stefanw commented Oct 6, 2013

contact is a text field that contains phone/fax numbers, while address contains one or more of the physical addresses of the public body.

@jpmckinney

Awesome! Where can I find docs for the schema? Is it datapackage.json?

@jpmckinney

@stefanw wouldn't it make sense to split phone numbers into voice, fax, etc. instead of having an ambiguously named contact field?

@stefanw

stefanw commented Oct 7, 2013

@jpmckinney this distinction comes from the German public body dataset out of FragDenStaat.de. The fields were modeled after the original federal data source which was not structured enough to make an easy distinction between voice/fax. Surely this can be inferred from prefixes ("Tel.", "Fax:" etc.). The contact data was never needed, we were only after emails.

This should in no way dictate the structure of an ideal dataset.

@rufuspollock
Member Author

@jpmckinney for docs of schema see https://github.com/okfn/publicbodies#data which links to https://data.okfn.org/community/okfn/publicbodies (that is nicer than looking at the datapackage.json)

@rufuspollock
Member Author

@stefanw so could I drop the contact field in the de dataset in favour of address and email (already in the dataset)?

@stefanw

stefanw commented Oct 7, 2013

Depends on what you want the publicbodies dataset to contain, I don't mind either. I could also parse out voice/fax if it helps, should be an easy regex.
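
As a sketch of what that regex might look like: the pattern below assumes the contact text labels each number with a "Tel" or "Fax" prefix, as described earlier in the thread. The exact layout of the strings in de.csv is not shown here, so the sample format, the `voice`/`fax` keys and the function name `split_contact` are all assumptions.

```python
import re

# Assumed layout: labelled numbers such as "Tel.: 030 1234-0, Fax: 030 1234-99".
# The real contact strings in the de dataset may differ.
LABELLED_NUMBER = re.compile(
    r"(?P<label>Tel|Fax)\s*[.:]*\s*(?P<number>\+?\d[\d\s()/\-]*\d)",
    re.IGNORECASE,
)


def split_contact(contact):
    """Sort labelled numbers in a free-text contact field into voice and fax."""
    result = {"voice": [], "fax": []}
    for match in LABELLED_NUMBER.finditer(contact):
        kind = "fax" if match.group("label").lower() == "fax" else "voice"
        # Collapse runs of whitespace inside the number.
        number = " ".join(match.group("number").split())
        result[kind].append(number)
    return result
```

Returning lists leaves room for several numbers of each kind per public body. Unlabelled numbers are simply skipped, so the original contact text would need to be kept until the results have been checked.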

@hannesgassert
Contributor

+1 for specific voice / fax etc. fields, with the possibility to have several per line.

@augusto-herrmann
Collaborator

+1 for specific voice/fax fields.
contact, as suggested by @stefanw, is not appropriate for phone numbers. According to the reference and to popolo, it is for an address to send letters to.

@stefanw

stefanw commented Apr 2, 2014

@augusto-herrmann I did not suggest anything, I merely answered the question and explained the existing fields. Popolo supports many types of contact info (postal address, email, phone, fax etc.) under "contact_details".

@rufuspollock
Member Author

I'm very happy for a new set of fields to go in: @augusto-herrmann could you distill a core set of changes with a description of each field and we'll review. Also very much welcome input from @jpmckinney here so we keep aligned with popolo on this.

@jpmckinney

I'll be happy to review any proposed changes to the schema, just @-mention me in any new issues.

@augusto-herrmann
Collaborator

@rgrp, the link https://data.okfn.org/community/okfn/publicbodies (also referenced in the README) has since become broken. Has the schema documentation been moved somewhere else? If so, it would be nice to have a redirect.

@rufuspollock
Member Author

@augusto-herrmann that's a bug in data.okfn.org which is getting fixed now.

@rufuspollock
Member Author

@augusto-herrmann ok - the issue was that the data package is actually named public-bodies whilst the repo is named publicbodies, so the redirect was not working correctly. Now fixed.
