Rework schema (list of headers) and document #29

rufuspollock · 2013-05-29T21:31:31Z

@rossjones suggested: "Would it make sense for publicbodies.org to follow the popolo spec at https://popoloproject.com/data.html" (that link is now broken)

Correct link is: https://popoloproject.com/specs/organization.html

Seems a great idea!

Current fields

Current fields and suggested changes (e.g. to be in line with popolo as much as possible). Note the list of changes is in progress and incomplete.

title => name (in org name)
abbr => abbreviation
key => id (?)
category => classification
parent => DELETE (just have parent_id)
parent_key => parent_id
description
url
jurisdiction => DELETE (just have jurisdiction code)
jurisdiction_code = ISO two-letter country code where one exists. Otherwise we coin one.
source => DELETE in favour of source URL (??)
source_url => keep
- make clear there is no point pointing at exactly the same API endpoint - much more useful to point at a specific location
- (??) DELETE entirely and just credit in contributor notes (we already have a bunch of different sources for data and as people add the problem will get worse)
- Could have multiple sources per entry (??)
address
contact => What's the difference from address?
email
tags => keep
- at the moment several of the files use tags (though not necessarily consistently)
created_at => DELETE (little value ...)
updated_at => DELETE (ditto)

Add:

other_names: semicolon-separated list of alternate names
founding_date: ISO 8601
dissolution_date: ISO 8601
image
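As a rough illustration of the renames and deletions listed above, here is a minimal sketch of a header conversion. The mapping follows the proposal as written, including the entries still marked with question marks (key => id, dropping source); the sample row, the `RENAMES` name, and the `convert_row` helper are made up for the example and are not the project's actual conversion script.

```python
import csv
import io

# Old header -> new header, following the proposal above.
# None marks a column the proposal drops. Tentative entries
# ("key => id (?)", "source => DELETE (??)") are included as written.
RENAMES = {
    "title": "name",
    "abbr": "abbreviation",
    "key": "id",
    "category": "classification",
    "parent": None,
    "parent_key": "parent_id",
    "jurisdiction": None,
    "source": None,
    "created_at": None,
    "updated_at": None,
}


def convert_row(row):
    """Return a copy of row with headers renamed and dropped columns removed."""
    out = {}
    for old, value in row.items():
        new = RENAMES.get(old, old)  # headers not listed are kept as-is
        if new is not None:
            out[new] = value
    return out


# Hypothetical sample data, only to exercise the mapping.
sample = io.StringIO(
    "title,abbr,key,category,parent,parent_key,jurisdiction,jurisdiction_code,url\n"
    "Example Office,EO,xx/example-office,ministry,,,Exampleland,XX,http://example.org\n"
)
rows = [convert_row(r) for r in csv.DictReader(sample)]
print(rows[0])
```

Columns added by the proposal (other_names, founding_date, dissolution_date, image) would simply be new, initially empty headers and are not handled here.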

Consider switch to JSON from CSV

Pros / Cons

(+) Greater flexibility, ability to directly match org spec
- In particular can handle multiple values, multiple identifiers
(-) Much bigger and less compact. Harder for people to work with (e.g. CSV usable in spreadsheets etc)
(-) More complexity (but perhaps necessary)

jpmckinney · 2013-05-29T22:13:30Z

Popolo doesn't define a CSV representation yet - there is RDF and JSON so far. On the RDF path, I'm not sure if Linked CSV is ready. On the JSON path, it should be straightforward to re-use JSON fields as CSV headers.

Is there a documented version of the CSV schema? The datapackage.json doesn't describe the difference between parent and parent_key for example. Once I understand the schema, I can propose one that uses Popolo terms.

rufuspollock · 2013-06-23T13:16:58Z

@jpmckinney list of fields set out above and some initial suggested changes.

@markbrough your thoughts here re IATI very useful ...

jpmckinney · 2013-06-23T16:41:33Z

I'll review more closely in a bit, but to clarify one point, you can use fields outside of those within Popolo while still being conformant (see https://popoloproject.com/specs/#conformance). So, if you want to keep created_at, that should be fine (I'll actually be adding it to Popolo as it came up in the previous round of feedback).

rufuspollock · 2013-07-15T18:43:23Z

@jpmckinney any thoughts here? I'm aiming to do a rev (and possibly finalize) this asap. I guess the big question here is CSV vs JSON (I mean for JSON we'd just take the full popolo version I think). If CSV, how do we map and how do we handle things like fields with multiple possible values? Options are:

Inline into field in a hacky way (e.g. aliases could be ; separated)
Inline JSON into a field :-/
Have a separate "table" joined to main table
...?

jpmckinney · 2013-07-15T20:54:25Z

Sorry for delay, I'll look at this within the next day.

jpmckinney · 2013-07-17T05:34:43Z

The "abbr" column in the CSV would be the "other_names" array in the JSON. Maybe rename "abbr" to "other_name"? Otherwise I think all the other header names conform.

CSV has the big advantage of more people being able to understand, create and use it. Is it anticipated that many fields will be multi-value? Has that come up already? How much detailed info are these lists expected to contain?

If the project is expected to maintain a fairly narrow scope with only essential/primary data, then CSV should be enough. If it's expected to expand to provide detailed info for at least some jurisdictions, then JSON is necessary.

A hybrid approach may allow people to submit CSVs (for those jurisdictions that don't (yet) have detailed info), and a script would be run to convert those CSVs to JSON. Thoughts?
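To make the hybrid idea concrete, here is a minimal sketch of such a conversion script. It assumes, as I understand the Popolo spec, that other_names is an array of objects with a name property; the column names in the sample and the `row_to_org` helper are illustrative only, not an agreed schema.

```python
import csv
import io
import json


def row_to_org(row):
    """Turn one flat CSV row into a nested, Popolo-style organization dict."""
    # Copy over non-empty columns; "abbr" is handled separately below.
    org = {key: value for key, value in row.items() if value and key != "abbr"}
    # Assumption: the single "abbr" value becomes a one-element other_names array.
    if row.get("abbr"):
        org["other_names"] = [{"name": row["abbr"]}]
    return org


# Hypothetical sample data, only to exercise the conversion.
sample = io.StringIO(
    "id,name,abbr,url\n"
    "xx/example-office,Example Office,EO,\n"
)
orgs = [row_to_org(r) for r in csv.DictReader(sample)]
print(json.dumps(orgs, indent=2))
```

Contributors would keep editing the flat CSV, and anything that needs multiple values would only exist on the JSON side.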

Re: multi-value columns in CSV:

Within-column separators like ";" or "|" have a small risk of causing parsing issues, and are a real headache to escape if they ever occur within one of the multiple values. Not that bad, on the whole.
Inline JSON is worse than within-column separators, I think.
Depending on how important the additional values are, this may be acceptable.
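For the within-column separator option, the escaping headache looks roughly like this. This is a sketch under assumptions: ";" as the separator and backslash as the escape character are arbitrary choices for the example, and `join_values` / `split_values` are hypothetical helper names.

```python
def join_values(values, sep=";", esc="\\"):
    """Pack several values into one CSV cell, escaping the separator."""
    return sep.join(
        v.replace(esc, esc + esc).replace(sep, esc + sep) for v in values
    )


def split_values(cell, sep=";", esc="\\"):
    """Reverse join_values: split on unescaped separators only."""
    if cell == "":
        return []
    values, current, escaped = [], [], False
    for ch in cell:
        if escaped:
            current.append(ch)  # take the escaped character literally
            escaped = False
        elif ch == esc:
            escaped = True
        elif ch == sep:
            values.append("".join(current))
            current = []
        else:
            current.append(ch)
    values.append("".join(current))
    return values


# Hypothetical names, one of which contains the separator itself.
names = ["Dept; of Examples", "DoE", "a\\b"]
cell = join_values(names)
print(cell)
print(split_values(cell))
```

Every consumer of the CSV has to implement the same unescaping, which is the main cost of this option.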

rufuspollock · 2013-09-29T16:17:33Z

OK, so I think we'll go for plain CSV and see how we do. I've made another tweak to include other_names.

jpmckinney · 2013-09-29T16:40:42Z

I don't know if a new other_names field will be used that frequently - I was just suggesting renaming abbr, but in retrospect I guess there's utility to picking out the shortest version of a name, e.g. for display on mobiles or other space-constrained places. Why not rename to abbreviation, though, since no other field name is abbreviated?

For source_url, I think it may be useful to keep. I write scrapers for public bodies, and assign the source to the page on the authoritative source's website that was scraped.

rufuspollock · 2013-09-30T09:06:37Z

@jpmckinney all good suggestions (as usual!) - let's run with both of them. I've updated the change proposal above to reflect these.

rufuspollock · 2013-10-06T11:31:04Z

Added founding_date and dissolution_date and image to add.

@stefanw could you clarify what contact is used for versus address in the de data - see https://datapipes.okfnlabs.org/csv/head%2010/html?url=https://github.com/okfn/publicbodies/raw/master/data/de.csv

* datapackage.json: new schema with descriptions
* data: update all data in line with new schema (this should be lossless)
* scripts: conversion script
* app: minor update to app and templates for new schema

rufuspollock · 2013-10-06T14:09:50Z

FIXED.

stefanw · 2013-10-06T15:20:18Z

contact is a text field that contains phone/fax numbers, while address contains one or more of the physical addresses of the public body.

jpmckinney · 2013-10-06T15:55:14Z

Awesome! Where can I find docs for the schema? Is it datapackage.json?

jpmckinney · 2013-10-06T21:38:04Z

@stefanw wouldn't it make sense to split phone numbers into voice, fax, etc. instead of having an ambiguously named contact field?

stefanw · 2013-10-07T07:34:16Z

@jpmckinney this distinction comes from the German public body dataset out of FragDenStaat.de. The fields were modeled after the original federal data source which was not structured enough to make an easy distinction between voice/fax. Surely this can be inferred from prefixes ("Tel.", "Fax:" etc.). The contact data was never needed, we were only after emails.

This should in no way dictate the structure of an ideal dataset.

rufuspollock · 2013-10-07T09:43:50Z

@jpmckinney for docs of schema see https://github.com/okfn/publicbodies#data which links to https://data.okfn.org/community/okfn/publicbodies (that is nicer than looking at the datapackage.json)

rufuspollock · 2013-10-07T09:44:28Z

@stefanw so could I drop the contact field in the de dataset in favour of address and email (already in the dataset)?

stefanw · 2013-10-07T09:45:49Z

Depends on what you want the publicbodies dataset to contain, I don't mind either way. I could also parse out voice/fax if it helps, should be an easy regex.
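A minimal sketch of what that regex could look like, assuming the contact text uses prefixes such as "Tel." and "Fax:" as described earlier in the thread. The exact formats in the de data are not shown here, so the pattern, the `parse_contact` helper, and the sample string are all assumptions.

```python
import re

# Assumption: numbers are introduced by "Tel" or "Fax", optionally followed
# by "." and/or ":", and consist of digits, spaces, and common punctuation.
PATTERN = re.compile(
    r"(?P<kind>Tel|Fax)[.:]*\s*(?P<number>[+\d][\d\s()/\-]*\d)",
    re.IGNORECASE,
)


def parse_contact(contact):
    """Split a free-text contact field into voice and fax numbers."""
    result = {"voice": [], "fax": []}
    for match in PATTERN.finditer(contact):
        key = "voice" if match.group("kind").lower() == "tel" else "fax"
        result[key].append(match.group("number").strip())
    return result


# Hypothetical contact string, not taken from the dataset.
parsed = parse_contact("Tel.: 030 1234-0, Fax: 030 1234-555")
print(parsed)
```

Spelled-out prefixes such as "Telefon" are not matched by this pattern, so a real run would need checking against the actual data.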

hannesgassert · 2013-10-09T19:28:13Z

+1 for specific voice / fax etc. fields, with the possibility to have several per line.

augusto-herrmann · 2014-04-01T18:51:34Z

+1 for specific voice/fax fields.
contact, as suggested by @stefanw, is not appropriate for phone numbers. According to the reference and to popolo, it is for an address to send letters to.

stefanw · 2014-04-02T09:58:22Z

@augusto-herrmann I did not suggest anything, I merely answered the question and explained the existing fields. Popolo supports many types of contact info (postal address, email, phone, fax etc.) under "contact_details".

rufuspollock · 2014-04-02T15:07:48Z

I'm very happy for a new set of fields to go in: @augusto-herrmann could you distill a core set of changes with descriptions of the fields and we'll review. Also very much welcome input from @jpmckinney here so we keep aligned with popolo on this.

jpmckinney · 2014-04-02T15:56:07Z

I'll be happy to review any proposed changes to the schema, just @-mention me in any new issues.

augusto-herrmann · 2014-06-02T17:05:17Z

@rgrp, the link https://data.okfn.org/community/okfn/publicbodies (also referenced in the README) has since become broken. Has the schema documentation been moved somewhere else? If so, it would be nice to have a redirect.

rufuspollock · 2014-06-03T07:04:19Z

@augusto-herrmann that's a bug in data.okfn.org which is getting fixed now.

rufuspollock · 2014-06-03T07:14:26Z

@augusto-herrmann ok - the issue was that the data package is actually named public-bodies whilst repo is named publicbodies so redirect was not working correctly. Now fixed.

rufuspollock mentioned this issue Jun 23, 2013

United States csv #15

Open

jpmckinney mentioned this issue Aug 27, 2013

Follow-up with OKF PublicBodies project popolo-project/popolo-spec#38

Closed

rufuspollock mentioned this issue Sep 27, 2013

Integrate Swiss Federal Data #40

Closed

rufuspollock closed this as completed Oct 6, 2013

This was referenced Oct 6, 2013

JSON output from frontend conforms to Popolo schema #49

Closed

Data for Quebec #26

Closed

rufuspollock added a commit that referenced this issue Oct 6, 2013

[#29][s]: addendum to previous commit to add back in full jurisdiction name via explicit lookup.

198275a

rufuspollock mentioned this issue Oct 6, 2013

Instructions for data contributors #23

Closed
