Properly deal with duplicate variableIds #28
Comments
The problem is that some variables exist twice in the API, once for each of two categories. An example is:
[{'variableId': 957732, 'categoryId': 9559, 'classificationId': 10650, 'measureId': 10596, 'gasId': 10637, 'unitId': 175},
 {'variableId': 957732, 'categoryId': 9608, 'classificationId': 10650, 'measureId': 10596, 'gasId': 10637, 'unitId': 175}]
That is of course very confusing. What we do at the moment is to discard all but the first variable. That's probably not correct, but it is impossible to decide which one is correct. Additionally, there is a bug where we sometimes use the last instead of the first variable, which is why you noticed this.
I am not sure how to solve this. Strictly speaking, we should probably discard all data with ambiguous variables because it is simply impossible to know which of the two variables is correct (the data points only specify a variable, and in the example above, it is impossible to say if data with the variableId |
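To make the ambiguity concrete, here is a minimal sketch of how such duplicates could be detected. It only assumes that the variable listing is available as a list of dicts shaped like the two entries quoted above; `find_ambiguous_variables` is a hypothetical helper, not a function of `unfccc_di_api`.

```python
from collections import defaultdict

# The two entries quoted in the comment above. In practice the list would
# come from the API's variable listing (how it is fetched is not shown here).
variables = [
    {'variableId': 957732, 'categoryId': 9559, 'classificationId': 10650,
     'measureId': 10596, 'gasId': 10637, 'unitId': 175},
    {'variableId': 957732, 'categoryId': 9608, 'classificationId': 10650,
     'measureId': 10596, 'gasId': 10637, 'unitId': 175},
]


def find_ambiguous_variables(variables):
    """Hypothetical helper: map every variableId that occurs with more than
    one categoryId to the sorted list of those categoryIds."""
    categories = defaultdict(set)
    for var in variables:
        categories[var['variableId']].add(var['categoryId'])
    return {vid: sorted(cats) for vid, cats in categories.items() if len(cats) > 1}


ambiguous = find_ambiguous_variables(variables)
print(ambiguous)  # {957732: [9559, 9608]}
```

Running something like this over the full variable listing would give the set of variableIds for which "first wins" and "last wins" produce different categories.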
We could also try to find out what the official web query thing does with these duplicate variables, and be bug-compatible with them. |
In my case the duplicate data is actually fine. It's data for the number of livestock and it's just logical that the same number of cows is reported in enteric fermentation and manure management. So I actually do get the correct data just for the wrong category. |
The web interface just gives you the category you filtered for. No data for the same variable for other categories. |
But does it have the same information or is it dropping the duplicate variables one way or another? I mean, I can fix the bug that I found and then it will be one or the other category, but not both - but maybe the incorrect category. |
I think I don't understand the problem. When I set a category in the query, why can't you use the datapoint for the category given in the query? |
The data query in the API works like this:
The root of the problem is now that in the list of
A secondary problem is that, when filtering the variables for the correct category, we use a different information source than when filling in the info in the last step. That's what is causing the issue here: all the data you are getting has at least one variable with the 3.A category, some also have a variable with the 3.B category, and for some of those with two categories, the parsing code chooses the 3.B category. On the other hand, if you request the 3.B category, the filtering code likely already filters out all the variables with two categories, because it only considers the first category and 3.B comes after 3.A. The result will then be missing some data points, but will never contain 3.A.
I would like to solve the first problem (finding out what it actually means if two variables with identical ID and different categories exist) and always return proper data, instead of only solving the secondary problem (consistently discarding all but the first variable with the same ID). I'll make a list of duplicate variableIds and what they mean. Maybe all of these cases are like the 3.A/3.B case, where the data is legitimately the same thing (likely activity data) which can and should be supplied for multiple categories. |
These are all duplicate variables:
|
OK, I think I understand the problem now. It seems that internally the database has independent tables for variable, category, etc., and the table with the actual data just references the IDs and can thus hold arbitrary combinations of variable and sector. For emissions data there is a many-to-one correspondence of variable to category, so we can infer the category from the variable. That is not the case for, e.g., activity data, where the same activity data (e.g. number of cows) might be used for several sectors. |
Yeah, I looked through the list, and I think we can distinguish two cases, logically:
the same category with different IDs
Some categories have several IDs, for whatever reason. An example is:
Here, there is no ambiguity.
the same data, different categories
Sometimes activity data or sub-categories have the same data (e.g. the total can be equal to a single sub-category if it is the only sub-category) for different categories. Examples are:
In each case, I think it is correct to put the data into both categories. |
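As a rough sketch of what "putting the data into both categories" could look like: the variable entries are the ones quoted earlier in this thread, while the data point shape, the placeholder value, and the helper name `expand_to_categories` are assumptions made for illustration only and are not taken from the library.

```python
from collections import defaultdict

# Variable entries as quoted earlier in the thread.
variables = [
    {'variableId': 957732, 'categoryId': 9559, 'classificationId': 10650,
     'measureId': 10596, 'gasId': 10637, 'unitId': 175},
    {'variableId': 957732, 'categoryId': 9608, 'classificationId': 10650,
     'measureId': 10596, 'gasId': 10637, 'unitId': 175},
]

# Hypothetical data point: per the discussion, data points reference only a
# variableId. The year and value are placeholders, not real inventory data.
data_points = [{'variableId': 957732, 'year': 1990, 'numberValue': 1.0}]


def expand_to_categories(data_points, variables):
    """Hypothetical helper: emit one row per (data point, category) pair,
    so a variable listed under two categories yields two rows."""
    categories_by_variable = defaultdict(list)
    for var in variables:
        known = categories_by_variable[var['variableId']]
        if var['categoryId'] not in known:  # keep first-seen order, no repeats
            known.append(var['categoryId'])

    rows = []
    for point in data_points:
        for category_id in categories_by_variable.get(point['variableId'], []):
            rows.append({**point, 'categoryId': category_id})
    return rows


rows = expand_to_categories(data_points, variables)
for row in rows:
    print(row)
```

The same function also covers the "same category with different IDs" case if the category ID is afterwards translated to its name, since both rows then end up in the same bucket.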
Now, I see two things to correct in our parsing/querying.
unrestricted queries
If a user asks for all data, the "same category with different IDs" case is trivial, because we don't distinguish the categories based on their ID. It is the same data and will be put into the same bucket, so all is fine. For the "same data, different categories" case, we should make sure to properly put the data into both categories. This is not logically difficult, but it takes some work to change the parsing functions to deal with it.
restricted queries
Queries like yours, where the user asks for a specific category only, are more difficult. Currently, the user has to specify a |
The second problem can be seen like this:
In [2]: import unfccc_di_api
In [3]: reader = unfccc_di_api.UNFCCCApiReader()
In [4]: reader.annex_one_reader.query(party_codes=['DEU'], category_ids=[10476])
Out[4]:
party category classification measure gas unit year numberValue stringValue
0 DEU 6. Other Total for category Emission factor information CH4 no unit 1990 NaN NA
1 DEU 6. Other Total for category Emission factor information CH4 no unit 1991 NaN NA
2 DEU 6. Other Total for category Emission factor information CH4 no unit 1992 NaN NA
3 DEU 6. Other Total for category Emission factor information CH4 no unit 1993 NaN NA
4 DEU 6. Other Total for category Emission factor information CH4 no unit 1994 NaN NA
.. ... ... ... ... ... ... ... ... ...
770 DEU 6. Other Total for category Net emissions/removals SO2 kt 2016 NaN NO
771 DEU 6. Other Total for category Net emissions/removals SO2 kt 2017 NaN NO
772 DEU 6. Other Total for category Net emissions/removals SO2 kt 2018 NaN NO
773 DEU 6. Other Total for category Net emissions/removals SO2 kt 2019 NaN NO
774 DEU 6. Other Total for category Net emissions/removals SO2 kt Base year NaN NO
[775 rows x 9 columns]
In [5]: reader.annex_one_reader.query(party_codes=['DEU'], category_ids=[10485])
Out[5]:
party category classification measure gas unit year numberValue stringValue
0 DEU 6. Other Total for category Indirect emissions N2O kt 1990 None NO
1 DEU 6. Other Total for category Indirect emissions N2O kt 1991 None NO
2 DEU 6. Other Total for category Indirect emissions N2O kt 1992 None NO
3 DEU 6. Other Total for category Indirect emissions N2O kt 1993 None NO
4 DEU 6. Other Total for category Indirect emissions N2O kt 1994 None NO
.. ... ... ... ... ... ... ... ... ...
243 DEU 6. Other Total for category Net emissions/removals Unspecified mix of HFCs and PFCs kt CO2 equivalent 2016 None NO
244 DEU 6. Other Total for category Net emissions/removals Unspecified mix of HFCs and PFCs kt CO2 equivalent 2017 None NO
245 DEU 6. Other Total for category Net emissions/removals Unspecified mix of HFCs and PFCs kt CO2 equivalent 2018 None NO
246 DEU 6. Other Total for category Net emissions/removals Unspecified mix of HFCs and PFCs kt CO2 equivalent 2019 None NO
247 DEU 6. Other Total for category Net emissions/removals Unspecified mix of HFCs and PFCs kt CO2 equivalent Base year None NO
[248 rows x 9 columns]
Depending on the exact code the user uses, there are either 775 or 248 rows returned, and the user is left to figure this out themselves. The correct thing that the user has to do currently is:
In [6]: reader.annex_one_reader.query(party_codes=['DEU'], category_ids=[10485, 10476])
Out[6]:
party category classification measure gas unit year numberValue stringValue
0 DEU 6. Other Total for category Emission factor information CH4 no unit 1990 NaN NA
1 DEU 6. Other Total for category Emission factor information CH4 no unit 1991 NaN NA
2 DEU 6. Other Total for category Emission factor information CH4 no unit 1992 NaN NA
3 DEU 6. Other Total for category Emission factor information CH4 no unit 1993 NaN NA
4 DEU 6. Other Total for category Emission factor information CH4 no unit 1994 NaN NA
... ... ... ... ... ... ... ... ... ...
1018 DEU 6. Other Total for category Net emissions/removals Unspecified mix of HFCs and PFCs kt CO2 equivalent 2016 NaN NO
1019 DEU 6. Other Total for category Net emissions/removals Unspecified mix of HFCs and PFCs kt CO2 equivalent 2017 NaN NO
1020 DEU 6. Other Total for category Net emissions/removals Unspecified mix of HFCs and PFCs kt CO2 equivalent 2018 NaN NO
1021 DEU 6. Other Total for category Net emissions/removals Unspecified mix of HFCs and PFCs kt CO2 equivalent 2019 NaN NO
1022 DEU 6. Other Total for category Net emissions/removals Unspecified mix of HFCs and PFCs kt CO2 equivalent Base year NaN NO
[1023 rows x 9 columns]
which is not directly obvious looking at the category hierarchy ( |
Why were you doing a restricted query? Do you need the possibility to filter for |
I did a restricted query because I'm only looking for certain category/classification/measure combinations, to find out which of the data I need is available from the interface. |
Okay, so for you it would be easier if The changes necessary for proper handling of all situations are not difficult then, only a bit tedious because the internal data structures use the |
It would be best if one could use either the ID or the name. On the other hand, I have so far not seen any meaningful difference between categories with the same name but different IDs.
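One way the "ID or name" idea could be sketched: a small resolver that accepts either form and returns every category ID sharing the name. The ID-to-name pairs below are the ones mentioned in this thread; the mapping itself and the helper `resolve_category_ids` are hypothetical and not part of `unfccc_di_api`.

```python
# ID -> name pairs as they appear in this thread. A real mapping would have
# to be built from the API's category listing.
category_names = {
    10476: '6. Other',
    10485: '6. Other',
    9559: 'Enteric Fermentation',
    9608: 'Manure Management',
}


def resolve_category_ids(selector, category_names):
    """Hypothetical helper: accept a numeric ID or a category name and
    return the sorted list of all category IDs it refers to."""
    if isinstance(selector, int):
        if selector not in category_names:
            raise KeyError(f'unknown category ID: {selector}')
        return [selector]
    matches = sorted(cid for cid, name in category_names.items() if name == selector)
    if not matches:
        raise KeyError(f'unknown category name: {selector!r}')
    return matches


print(resolve_category_ids('6. Other', category_names))  # [10476, 10485]
print(resolve_category_ids(9559, category_names))        # [9559]
```

The list returned for a name could then be passed as `category_ids` to `reader.annex_one_reader.query(...)`, which corresponds to the query with both IDs shown earlier in the thread.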
Description
When using the annex_one_reader of unfccc_di_api.UNFCCCApiReader with a category filter for 9559 (Enteric Fermentation), the query result also contains data for Manure Management (9608). Querying for Manure Management returns data for Manure Management only.
What I Did
Here's a minimal code example to reproduce the problem
Actual output is
Expected output is