Entity names #19

JGuetschow · 2021-03-01T13:04:57Z

I'm wondering if we want to have a standard for variable names. In PRIMAP1 it's all upper case letters. For PRIMAP2 we have specified a way to add GWP information to variable names, but no convention for the variables themselves. I think all uppercase is sometimes hard to read. I think we should have a specification to simplify running code on different datasets.

mikapfl · 2021-03-01T13:14:19Z

Usually, variables are gases, in which case it would make sense to use the same capitalization as openscm (e.g. CO2). What other entities are there? Gas baskets like F-gases and population, right?

JGuetschow · 2021-03-01T13:19:10Z

There will be a lot of economic variables (different GDP variants, etc.).

JGuetschow · 2021-03-01T13:19:59Z

I agree on the variables available in openscm. But openscm doesn't have the baskets, right?

mikapfl · 2021-03-01T13:24:37Z

Nope, no baskets in openscm.

mikapfl · 2021-03-01T13:27:31Z

Is there some other standard (maybe from the IIASA universe) that we can follow? If not, we have to write one ourselves, but it would be less work if there is already something. (-:

JGuetschow · 2021-03-02T08:38:52Z

I don't know for sure, but I think the IIASA databases have some standard, though I doubt it's described anywhere.

mikapfl · 2021-03-02T09:12:22Z

From pyam, I found this "standard": https://data.ene.iiasa.ac.at/database/
However, it doesn't actually define a lot of interesting variables and is very wordy. CF conventions unfortunately don't deal with any socio-economic variable names (and are very wordy for emissions).

FAOstat has entity lists, but they use codes instead of shorter names, which I think is pretty user-hostile (hey there, here you got data for EL-3148).

Maybe we should make our own list? If so, what should our rules be? Use normal English capitalization rules, so that we end up with population, kyotoghg, F-gases, etc.?

AnnGuenther · 2021-03-02T09:15:38Z

I think that in case we use population, kyotoghg, F-gases, I would be for KyotoGHG (even though it is not great...)

And I also would vote against using codes as in FAOstat.

AnnGuenther · 2021-03-02T09:18:02Z

Would it make sense to have a list with normal English capitalization rules, but then convert it to uppercase for internal use, so that wrong capitalization does not crash the program?

mikapfl · 2021-03-02T09:18:17Z

The World Bank also has an entity list, but I don't know if we want to use it, e.g. GDP (constant 2010 US$) ends up as NY.GDP.MKTP.KD, which is totally non-transparent for me.

@AnnGuenther: Do you have a reason for KyotoGHG? Just because Kyoto is a name and GHG is an abbreviation, so that English capitalization rules yield KyotoGHG, or is there another reason?

AnnGuenther · 2021-03-02T09:19:28Z

No other reason, just the ones you listed.

mikapfl · 2021-03-02T09:21:00Z

I don't really like silently correcting e.g. capitalization. A KeyError when I use Population instead of population is pretty easy to interpret and the error is immediate. If we do have some normalization happening, we have to remember to do this normalization everywhere, otherwise using Population works at first and breaks later, which will be harder to diagnose.
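
To make the trade-off concrete, here is a minimal sketch. It uses a plain dict as a stand-in for a dataset; the entity names come from this thread, the values are dummies, and `normalize` is a hypothetical helper, not an existing primap2 function.

```python
# Dummy data: a plain dict standing in for a dataset keyed by entity name.
data = {"population": 1.0, "CO2": 2.0, "KyotoGHG": 3.0}

# Strict lookup: wrong capitalization fails immediately and visibly.
try:
    data["Population"]
    strict_error = None
except KeyError as err:
    strict_error = err


def normalize(name):
    """Hypothetical normalization: compare entity names in upper case."""
    return name.upper()


# Normalized lookup: works, but only if every access path normalizes.
normalized = {normalize(key): value for key, value in data.items()}
via_normalized = normalized[normalize("Population")]

# A code path that forgets to normalize still fails, just later on.
try:
    normalized["Population"]
    forgotten_error = None
except KeyError as err:
    forgotten_error = err
```

With strict names, the first access already raises the `KeyError`; with normalization, the failure moves to whichever code path skipped the `normalize` call.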

mikapfl · 2021-03-09T16:46:14Z

I've started adding entity names to a terminology over in climate_categories. So far, there are only emission rates of "gases" from openscm_units ("gases" is wrong here, because e.g. black carbon is not a gas, but what other word would be more correct?), but maybe you can have a look at whether the level of detail and the general idea seem good.

It's all bundled in a PR: pik-primap/climate_categories#1

The definition is here: https://github.com/pik-primap/climate_categories/pull/1/files#diff-c28a5ab1cbffcb57d64c46d658e69f373450cce100d9a8f70c72b89648a45f16

Maybe we can continue the discussion in the pull request.

AnnGuenther · 2021-03-09T16:55:00Z

(climate) forcers or drivers instead of gases?

rgieseke · 2021-03-10T12:42:37Z

Did you take a look at the https://github.com/openENTRANCE/nomenclature project? Not sure whether it's as wordy as the Default|IPCC|Emissions|Inventories.

@danielhuppmann is pretty keen on interop so he might have ideas.

danielhuppmann · 2021-03-10T12:55:28Z

Thanks for looping me in @rgieseke - had a look at the discussion so far and the referenced PR. Not sure whether I understand the objective here, but here are two (more concrete) references to related work.

In openENTRANCE, we tried to formalize the (previously implicit) guidelines for variables names in the IIASA & IAMC universe, with an aim for readability. See here.
The current variable definitions in openENTRANCE follow the IPCC SR15 scenario ensemble (which in turn is based on CD-LINKS, ADVANCE, ...). For the question at hand, see how Kyoto-GHG with a specific GWP conversion metric is named here

mikapfl · 2021-03-10T13:53:18Z

Hi,

thanks for chiming in!

Information for context: There are two things happening in primap2 land at the moment:

We are looking to get all the terminologies from PRIMAP1 (not publicly available, I'm afraid) into primap2 so that we can read in all data that we need to.
We figured that since there seems to be no easy-to-install python package which contains commonly used terminologies like the IPCC categories in computer-readable format, we should build one.

I had a look at the openENTRANCE/nomenclature project before embarking on building our own package, but as far as I could tell from the available documentation, the goal is different there. E.g., there is no hierarchy of IPCC1996 and IPCC2006 categories, and I am also not sure how it would fit into your format (would category 1.A.3.b.iii be Emissions|IPCC2006|1|A|3|b|iii? And N2O emissions in this category would then be Emissions|IPCC2006|1|A|3|b|iii|N2O?). To me, it looked like openENTRANCE/nomenclature is specifically for the openENTRANCE project and its data format with exactly six dimensions, but we needed something more general.

That said, we can check whether we can reuse some of the definitions of openENTRANCE/nomenclature for primap2.

Cheers,

Mika

danielhuppmann · 2021-03-11T09:07:03Z

Thanks for the context! Don't want to overload this conversation, so my response is as concise as possible - and let's have a follow-up (spoken) discussion somewhere else if there is interest...

The goal of the openENTRANCE nomenclature:

build a list of variables (definitions) starting from previous projects such that it can be extended in later projects
readability is key - hence the yaml file format and very descriptive variable definitions
the Python package is a utility to facilitate working with it - but if someone wants to use the yaml files in an R workflow or copy-paste to Excel, that is fine (and is going to happen, given our user base)

Re your question about 1.A.3.b.iii, I would implement it in our yaml lingo as

Emissions|N2O|Energy|Transportation|Road|Heavy Duty Trucks and Buses:
    definition: <bla>
    unit: kt CO2e
    ipcc_2006: 1.A.3.b.iii
    notes: <bla>

You should also take a look at the OpenEnergyOntology (h/t @Ludee & @christian-rli) - they use a formal ontology framework to write their definitions and interrelations...

khaeru · 2021-03-11T10:31:41Z

@danielhuppmann alerted me to this issue. To pick up on one point:

> The World Bank also has an entity list, but I don't know if we want to use it, e.g. GDP (constant 2010 US$) ends up as NY.GDP.MKTP.KD, which is totally non-transparent for me.

This is broader than the World Bank; it reflects the use of SDMX (https://sdmx.org/?page_id=5008, https://datahelpdesk.worldbank.org/knowledgebase/articles/1886701-sdmx-api-queries) which provides an information model that can cover most climate/energy/etc. use cases (at least, all that I've seen). A key like NY.GDP.MKTP.KD might seem opaque per se, but I'd argue that it reflects a more mature, thoroughly-considered approach to problems that we often try, unnecessarily, to solve anew.

As briefly as possible:

Specific data dimensions are linked to abstract concepts;
In a particular data structure definition (DSD) / data sets that are “structured by” that DSD, that concept can be represented by codes from a particular codelist (browse many: https://registry.sdmx.org/items/codelist.html)
Each code has an id (machine-readable), and optional, multilingual name, description, and annotations.
- Applications can decide whether to use/display the id or name; whichever is more suitable.
The code lists and concept schemes are published (incl. versioned) and referenced from DSDs; and the DSDs are referenced by data.

NY.GDP.MKTP.KD is a composite:

There are 4 dimensions/concepts here, separated by .
The 2nd gives the thing measured. GDP is the ID (short, machine-readable) of one code; the plain-language name (in English) might be “Gross domestic product”.
The 4th is the inflation method applied. KD is the ID; the English name might be "Constant 2010 US dollars".

So a different key/composite like NY.GDP.MKTP.CD (also visible in the WB WDI glossary) conveys that 3 concepts are the same, but the last is different; CD is the ID of a different code, with a different name (“Current dollars”).
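
As a rough sketch of the composite idea, the key can be split on `.` and compared position by position. The function names and the `CODE_NAMES` table are made up for illustration; only the code names mentioned above are filled in, and positions without a known name stay as bare IDs.

```python
# Code names taken from the explanation above, keyed by 0-based position in the key.
# This table is illustrative, not a published code list.
CODE_NAMES = {
    1: {"GDP": "Gross domestic product"},
    3: {"KD": "Constant 2010 US dollars", "CD": "Current dollars"},
}


def split_key(key):
    """Split a composite key into its per-dimension code IDs."""
    return key.split(".")


def differing_positions(key_a, key_b):
    """Return the 0-based positions at which two composite keys differ."""
    parts_a, parts_b = split_key(key_a), split_key(key_b)
    if len(parts_a) != len(parts_b):
        raise ValueError("keys have different numbers of dimensions")
    return [i for i, (a, b) in enumerate(zip(parts_a, parts_b)) if a != b]


def describe(key):
    """Replace each code ID by its name where one is known, else keep the ID."""
    return [
        CODE_NAMES.get(i, {}).get(code, code)
        for i, code in enumerate(split_key(key))
    ]


parts = split_key("NY.GDP.MKTP.KD")
diff = differing_positions("NY.GDP.MKTP.KD", "NY.GDP.MKTP.CD")
labels = describe("NY.GDP.MKTP.CD")
```

Here `diff` contains only the last position, which is the point made above: the two keys share three concepts and differ in the fourth.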

Publishing and referring to such code lists is, IMO, much better than trying to cram all metadata into labels on every data set.
For instance, at https://registry.sdmx.org/items/codelist.html one can see the Eurostat (ESTAT) code list for the "area" concept (CL_AREA). Notice that they provide all possible definitions of the "EU". A reference to this code list, and the use of a code from this list, is 100% unambiguous about what is represented, while allowing precision and fine distinctions.

Over at transportenergy/database#62 we're trying to take this approach, namely:

Define the distinct concepts relevant to some or all data, using IDs, names, and descriptions.
Create (or use existing) code lists for each, again with IDs, names, descriptions, annotations, and sometimes hierarchy.

After having done so, it's certainly possible to:

collapse the IDs of codes for multiple concepts into a key like NY.GDP.MKTP.KD, or
collapse the names into a “variable” name using some string formatting. (This is what the WB does for the WDIs; see https://databank.worldbank.org/AjaxDownload/FileDownloadHandler.ashx?filename=WB_WDI_DSD.xml&filetype=DSD).

But it's also possible to handle data in its original dimensions (one per distinct concept), or (as analysis requires) to restore those dimensions when receiving data that's labeled with a collapsed "variable name".

Apologies for a long comment!

mikapfl · 2021-03-11T12:12:08Z

@khaeru
Thank you for the pointers and explanations.
I am still a bit confused by the code concept there. For example, where can I find the dimensions/concepts to fully decode e.g. NY.GDP.MKTP.KD? I browsed the SDMX websites and also the explanations at the World Bank, and couldn't find any pointers to what the "parts" of each code mean, only what the full code means, plus lots of useful but not directly related ontologies at SDMX.

khaeru · 2021-03-12T08:47:13Z

@mikapfl sorry, I should have included that URL: https://datahelpdesk.worldbank.org/knowledgebase/articles/201175-how-does-the-world-bank-code-its-indicators

To be clear, the World Bank uses these internally, but does not publish separate SDMX code lists for the constituent parts, because they don't intend to publish data for/support general public usage of all combinations. Instead (last URL in my first comment) they provide a code list called "SERIES" that includes some of these composite codes but also others, based on other schemes.

To expand a little on my point about "collapsing": for instance, if data (e.g. for a measure like <id=EMI, name=Emissions>) has conceptual dimensions like "Species" (coded as <id=CO2; name=Carbon Dioxide>, <id=N2O>, etc.) and "Sector" (coded as <id=T, name=Transport>, <id=A, name=Agriculture>, etc.):

Measure	Species	Sector	Value
Emissions	CO2	T	1.1
Emissions	N2O	T	0.2
Emissions	CO2	A	3.3
Emissions	N2O	A	0.4

…then, one defines a new code list "VARIABLE" using a simple & transparent algorithm, e.g.:

from itertools import product

# product(…) iterates over the measure, species, and sector code lists;
# each code is assumed to have an .id and a .name
for measure, species, sector in product(…):
    # Mixing IDs and names is fine, according to need, as long as we're clear what is done
    id = f"{measure.name}|{species.id}|{sector.id}"

    name = f"{measure.name} of {species.name.lower()} from {sector.name.lower()}"

    # Store the mapping to full dimensions in the description; this could be done in several ways
    description = f'{{MEASURE="{measure.id}", SPECIES="{species.id}", SECTOR="{sector.id}"}}'

    # (create and store a code)

…giving:

"Emissions|CO2|T"
"Emissions of carbon dioxide from transport"
{MEASURE="EMI", SPECIES="CO2", SECTOR="T"}

Then publish the data with 3 distinct conceptual dimensions collapsed to 1:

Variable	Value
Emissions|CO2|T	1.1
Emissions|N2O|T	0.2
Emissions|CO2|A	3.3
Emissions|N2O|A	0.4

…and the VARIABLE codelist, which includes all the information needed for users to restore the actual dimensions, if they want.
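
Restoring the dimensions could look roughly like the sketch below. It is not SDMX tooling: the code list is a plain dict (one of the "several ways" of storing the mapping, chosen here instead of a description string), and `restore` is a hypothetical helper.

```python
from itertools import product

# Codes from the example: measure ID -> name, plus species and sector IDs.
MEASURES = {"EMI": "Emissions"}
SPECIES = ["CO2", "N2O"]
SECTORS = ["T", "A"]

# Build the VARIABLE code list: collapsed ID -> original dimensions.
variable_codelist = {
    f"{measure_name}|{species_id}|{sector_id}": {
        "MEASURE": measure_id,
        "SPECIES": species_id,
        "SECTOR": sector_id,
    }
    for (measure_id, measure_name), species_id, sector_id in product(
        MEASURES.items(), SPECIES, SECTORS
    )
}

# Published data with the three dimensions collapsed into one "Variable" column.
collapsed = [
    ("Emissions|CO2|T", 1.1),
    ("Emissions|N2O|T", 0.2),
    ("Emissions|CO2|A", 3.3),
    ("Emissions|N2O|A", 0.4),
]


def restore(rows, codelist):
    """Look each variable up in the code list and re-attach its dimensions."""
    return [{**codelist[variable], "VALUE": value} for variable, value in rows]


restored = restore(collapsed, variable_codelist)
```

The lookup goes through the code list rather than splitting the variable name on `|`, so it keeps working even if a name or ID happens to contain the separator.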

This is what we see from the World Bank: NY.GDP.MKTP.KD is analogous to Emissions|CO2|T.

Among other reasons, I think this approach can cover the common case in energy/climate where we include multiple measures in the same data set for which different concepts/dimensions are relevant. (For instance, the "Species" concept/dimension is relevant for the "Emissions" measure, but not for "Population".) Other solutions I've seen include (a) add many columns for every dimension relevant to any one measure (overkill) and (b) split to distinct data sets/data flows, one for each measure, with the appropriate dimensions for each (SDMX does support this, but I realize it's beyond capacity for most of us at this moment).

JGuetschow · 2021-07-08T08:54:19Z

In addition to a naming convention, it would be great to have lists of alternative names for the entities, since e.g. F-gases often have several names for the same gas, each with a different notation.
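
One possible shape for such a list, as a minimal sketch: a table from canonical name to alternative notations, turned into a strict reverse lookup. The `ALIASES` entries and the function names are hypothetical examples, not an agreed convention.

```python
# Hypothetical alias table: canonical entity name -> other notations for the same gas.
ALIASES = {
    "HFC134a": ["HFC-134a", "R-134a"],
    "SF6": ["sulphur hexafluoride", "sulfur hexafluoride"],
}


def build_lookup(aliases):
    """Map every known notation (including the canonical one) to the canonical name."""
    lookup = {}
    for canonical, others in aliases.items():
        for name in [canonical, *others]:
            if name in lookup and lookup[name] != canonical:
                raise ValueError(f"alias {name!r} is ambiguous")
            lookup[name] = canonical
    return lookup


LOOKUP = build_lookup(ALIASES)


def canonical_name(name):
    """Return the canonical name; unknown notations raise KeyError (no silent fixes)."""
    return LOOKUP[name]


resolved = canonical_name("R-134a")
```

Unknown or differently capitalized names still raise a `KeyError`, in line with the preference earlier in this thread for immediate errors over silent correction.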

JGuetschow added the enhancement New feature or request label Mar 1, 2021

khaeru mentioned this issue Jun 30, 2021

Reuse/consider alignment with iTEM / SDMX code IAMconsortium/nomenclature#10

Open

JGuetschow added the priority: medium medium priority issue. Good to solve soon if possible label Mar 27, 2023
