📊 Update faostat data #2416

Merged
merged 62 commits into from
Mar 25, 2024

pabloarosado
Contributor

@pabloarosado pabloarosado commented Mar 14, 2024

Main changes

  • Added a multi_merge function to owid.catalog (that works properly with Table objects).
  • Adapted all faostat scripts and steps to the new ETL conventions and the new metadata.
  • Removed datasets that were not used and caused technical issues, namely "World Census of Agriculture" (wcad) and "Energy use" (gn).
  • Changed country names following FAOSTAT changes (e.g. "Netherlands" -> "Netherlands (Kingdom of the)").
  • Removed datasets that no longer exist in FAOSTAT, namely ef, el, and ep. I have checked that all variables used in charts from these datasets can (in principle) be replaced by the analogous ones from rf, rl, and rfn, respectively (but I will need to confirm this once the grapher datasets exist).
  • In qcl (and therefore in the food explorer) I renamed Flax fibre -> Flax, raw or retted. This follows FAOSTAT's own change, in which item 773 (Flax fibre) disappeared from qcl and was replaced by 771 (Flax, raw or retted).
    • Item 773 can still be found in other datasets, e.g. qi, where it is defined as "Flax, processed but not spun", with description: "Broken, scutched, hackled etc. but not spun. Traditionally, FAO has used this commodity to identify production in its raw state; in reality, the primary agricultural product is the commodity 01929.01 (Flax, raw or retted) which can either be used for the production of fibre or for other purposes (Unofficial definition)".
    • Item 771 now in qcl is defined as "Flax, raw or retted", with description: "Flax Straw, spp. Linum usitatissimum. Flax is cultivated for seed as well as for fibre. The fibre is obtained from the stem of the plant. Data are reported in terms of straw. (Unofficial definition)".
  • Manually fixed some spurious values, and removed anomalies that are no longer present in the data.
  • Many other small fixes in different steps.
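
The multi_merge helper from the first item merges more than two tables in one call. A minimal sketch of that idea in plain pandas, assuming it amounts to folding a pairwise merge over a list of tables. This is not taken from owid.catalog: the name `multi_merge_sketch`, its parameters, and the sample data are hypothetical, and the real function additionally has to carry Table metadata, which plain DataFrames do not have.

```python
from functools import reduce

import pandas as pd


def multi_merge_sketch(tables, on, how="inner"):
    """Merge a list of DataFrames pairwise, left to right (hypothetical helper)."""
    if not tables:
        raise ValueError("At least one table is required.")
    return reduce(lambda left, right: pd.merge(left, right, on=on, how=how), tables)


# Made-up sample data, for illustration only.
area = pd.DataFrame({"country": ["A", "B"], "year": [2020, 2020], "area": [1.0, 2.0]})
production = pd.DataFrame({"country": ["A", "B"], "year": [2020, 2020], "production": [10.0, 30.0]})
population = pd.DataFrame({"country": ["A"], "year": [2020], "population": [5]})

# A left merge keeps both countries; the missing population for "B" becomes NaN.
merged = multi_merge_sketch([area, production, population], on=["country", "year"], how="left")
```

Folding with `reduce` keeps the call site to one line regardless of how many tables are involved, at the cost of applying the same `on` and `how` to every pair.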

TO-DO for Pablo R (in separate PRs):

  • To be able to see the new data in the global food explorer, merge 📊 Update global food explorer owid-content#39
  • Update all crop yield data and explorer.
  • Improve all FAOSTAT metadata (even if it is currently using the new fields, the metadata is still very messy for some indicators).
  • Update all steps using FAOSTAT, and archive old FAOSTAT steps (since they use old functions and metadata).
  • After merging, chart-sync staging-update-faostat-data -> production.

For reviewers

There is no need to thoroughly review this monster PR. I'd suggest:

  • @Marigold could you please review the changes to owid.catalog? Also, note that I temporarily removed some of the fixes that you made to etl/steps/data/garden/faostat/2024-03-14/faostat_fbsc.py to control for memory use. I did that because the concatenate function was not propagating metadata. ETL did not complain in staging, but we may need to adapt this function to work with tables (with metadata propagation). If you think we should do that before merging, let me know and I'll look into it.
  • @spoonerf feel free to have a quick look at some parts of the code as you see fit, and maybe at some of the charts (just scroll through the admin to see the latest changes). Please let me know if you find anything odd (apart from old metadata, and bad map brackets, which is something I should fix soon).

Thanks a lot!!

@pabloarosado pabloarosado self-assigned this Mar 14, 2024
@lucasrodes lucasrodes changed the title Update faostat data 📊 Update faostat data Mar 18, 2024
@pabloarosado pabloarosado marked this pull request as ready for review March 22, 2024 09:15
Collaborator

@Marigold Marigold left a comment

PR looks good. If it works without memory optimisations then go ahead and merge it. We'll see what it does in production.

EDIT: I checked tb, and there are only two object columns, with sizes around 2 GB each. That's not that much, but perhaps converting them to categories wouldn't cause problems with metadata?

ipdb> p tb.fao_unit_short_name.memory_usage(deep=True) / 2**20
1631.690894126892
ipdb> p tb.fao_element.memory_usage(deep=True) / 2**20
2025.320728302002
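
The effect of such a conversion can be checked on a small made-up column. This is a sketch that assumes plain pandas rather than an owid.catalog Table, so it says nothing about whether metadata survives the conversion; the column name mirrors the ipdb session above, but the values are invented.

```python
import pandas as pd

# Invented data: a long column with few distinct strings, as is typical of unit or element names.
df = pd.DataFrame({"fao_element": ["Production", "Yield", "Area harvested"] * 100_000})
# Force Python-object storage, matching the "object columns" described above.
df["fao_element"] = df["fao_element"].astype(object)

# Same measurement as in the ipdb session: deep memory usage in MiB.
mb_object = df["fao_element"].memory_usage(deep=True) / 2**20
mb_category = df["fao_element"].astype("category").memory_usage(deep=True) / 2**20

print(f"object: {mb_object:.1f} MiB, category: {mb_category:.1f} MiB")
```

A categorical column stores each distinct string once plus a small integer code per row, so the saving grows with the number of repeated values.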

@spoonerf
Contributor

Hey Pablo!

This looks great - what an absolute monster of a dataset, it never ceases to amaze me!

I'll write some thoughts on the charts down here, but generally they look great:

@pabloarosado pabloarosado merged commit 32fff70 into master Mar 25, 2024
8 of 10 checks passed
@pabloarosado pabloarosado deleted the update-faostat-data branch March 25, 2024 15:43
3 participants