Add data import from data.statistik.gv.at #11

Merged on Aug 9, 2021 (68 commits)

bernhard-da
Collaborator

work-in-progress: add od-functionality

@GregorDeCillia
Contributor

GregorDeCillia commented Jul 9, 2021

Thank you very much for this contribution!

As discussed behind our firewall, I have already tested this and the new od_table class is almost perfectly interchangeable with the existing sc_table class.

I'll do some fine-tuning in the next couple of days and hopefully I'll be able to address these points:

  • Format categorical columns as factor. This is necessary for STATgraph::sg_timeseries() which uses levels(). Currently, the parameter color does not work with sg_timeseries()
  • Use a format in $field(i)$parsed which is consistent with sc_table: <date> for time variables and <fct> for categorical variables
  • The timings only measure the download time for the datasets, not for the metadata
  • Unnecessary traffic in od_table$initialize()? It's probably more efficient to test for a 404 response.
  • Inline TODO
    metafields$total <- TRUE # # todo: check if total is always included
  • Set aggregation-function in $meta$measures$fun to NA
  • For the population-forecasts until 2101, the time variable is not recognized. This has to be fixed in
    if (!all(year %in% 1900:2100))
    Example: OGD_bevstprogjdgebland_PR_BEVJDGB_4 (Bevölkerung zum Jahresanfang nach dem Geburtsland 2016 bis 2101).
  • Move the rendering-logic in $render() to STATgraphApp
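
For the population-forecast point, a minimal sketch of a less restrictive year check. The helper name `looks_like_year()` and its default bounds are assumptions for illustration and not the code currently in the package:

```r
# Hypothetical helper: decide whether a vector of labels looks like years.
# The upper bound is a parameter instead of the hard-coded 2100.
looks_like_year <- function(x, min_year = 1900L, max_year = 2200L) {
  year <- suppressWarnings(as.integer(as.character(x)))
  !anyNA(year) && all(year >= min_year & year <= max_year)
}

looks_like_year(2016:2101)          # forecasts until 2101 are accepted
looks_like_year(c("2016", "Total")) # non-numeric labels are rejected
```

Making the bound a parameter avoids having to touch the check again when longer forecasts are published.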

@GregorDeCillia GregorDeCillia changed the title start feature: add opendata-functionality Add data import from data.statistik.gv.at Jul 9, 2021
@GregorDeCillia GregorDeCillia added the feature New feature or request label Jul 9, 2021
@GregorDeCillia GregorDeCillia self-assigned this Jul 9, 2021
@bernhard-da
Collaborator Author

@GregorDeCillia the issue with incorrect timings is already fixed.

@GregorDeCillia
Contributor

@bernhard-da thanks. I think the other TODOs should also be pretty straightforward to complete.

@GregorDeCillia
Contributor

GregorDeCillia commented Jul 9, 2021

  • parse_time and round are currently ignored in od_tabulate()
    parse_time <- round <- FALSE
    • For round this makes sense, since there is no metadata to provide a precision. Maybe either remove the parameter or use a warning/error in case it is set to TRUE?
    • For parse_time this is relevant in case the original labels of the months/quarters/years are required for graphs or tables. Unlike in sc_table, the $data slot in od_table contains parsed time entries, so the if (parse_time) block doesn't work
  • Translate error messages to English.
    stop(paste0("Datensatz ", shQuote(id), " kann nicht eingelesen werden.\n",
  • Add support for English labels
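
One way the warning variant for round could look, as a rough sketch. `check_od_args()` is a hypothetical stand-in for the argument handling in od_tabulate(), not an existing function:

```r
# Hypothetical stand-in for the argument handling in od_tabulate().
# `round` cannot be honoured without precision metadata, so warn and ignore it.
check_od_args <- function(round = FALSE, parse_time = FALSE) {
  if (isTRUE(round)) {
    warning("`round` is ignored: no precision metadata is available",
            call. = FALSE)
    round <- FALSE
  }
  list(round = round, parse_time = parse_time)
}

# Show the warning as a message and continue with the corrected arguments.
args <- withCallingHandlers(
  check_od_args(round = TRUE),
  warning = function(w) {
    message("caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
args$round
```

A warning keeps existing calls working, whereas an error or removing the parameter would break them.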

GregorDeCillia and others added 10 commits July 9, 2021 17:41
until now, all fields were shown if
no argument was provided

the $extras$attribute_description field
in the meta-jsons of open.data can now be
read correctly for all 273 "valid" datasets

valid datasets are datasets that satisfy

* json$resources[[1]] == "csv"
* length(json$resources) > 1

The returned descriptions are not always consistent with the contents of
${opendata_id}_HEADER.csv because the
labels might be slightly different.

However, the codes always match and use the
same ordering

read_delim throws an error if the url for
the csv is passed as a character. Due to
differences in R<4.0 and the newest version,
this caused a bug in od_table()

- status code 200 never occurs
- check if raw object is empty rather
  than parsing it
- cache csv files in ~/.STATcubeR_cache
- only download data again if last_modified
  in json is newer than cache
- avoid nested loops for parsing
- exclude unnecessary columns in meta$measures
- make sure $data uses factors for
  classification variables
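
The cache rule listed above (download again only if last_modified is newer than the cached file) can be sketched as follows. The helper `needs_download()` is hypothetical, and `last_modified` is assumed to be already parsed into a POSIXct value:

```r
# Hypothetical helper: download only if there is no cached copy or if the
# `last_modified` timestamp from the metadata is newer than the cached file.
needs_download <- function(cache_file, last_modified) {
  if (!file.exists(cache_file)) return(TRUE)
  last_modified > file.mtime(cache_file)
}

cache_file <- tempfile(fileext = ".csv")
needs_download(cache_file, Sys.time())          # no cache yet
writeLines("a;b", cache_file)
needs_download(cache_file, Sys.time() - 3600)   # cache is newer than the source
needs_download(cache_file, Sys.time() + 3600)   # source was modified after caching
```
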
@GregorDeCillia
Contributor

GregorDeCillia commented Jul 13, 2021

I just made some major changes to the import logic in 1179486. The returned object now only contains three columns in $meta$measures since fun and annotations are not applicable for this class. It also enables caching, see the commit message.

Now the question is how to deal with aggregation of data. The way I see it, there are two types of datasets

  • Datasets that only contain measures which can be aggregated using unweighted sums. Those generally do not contain total codes and can/must be aggregated directly using base::sum()
  • Datasets such as OGD_veste309_Veste309_1, where certain measures cannot be aggregated directly. Those seem to always contain total codes for all fields except time variables

My suggestion: Since it is basically impossible to automatically detect total codes, I'd suggest od_tabulate() aggregates directly by default. If a user wants to use total codes for aggregation, those have to be supplied as characters. For example

x <- od_table("OGD_veste309_Veste309_1")
## define total codes
x$total_codes(
  `C-A11-0` = "A11-1", 
  `C-STAATS-0` = "STAATS-9", 
  `C-VEBDL-0` = "VEBDL-10", 
  `C-BESCHV-0` = "BESCHV-1"
)
## `C-STAATS-0` and `C-BESCHV-0` are aggregated using total codes
## For `C-A11-0` and `C-VEBDL-0` total codes are excluded to make the result tidy
x$tabulate("C-A11-0", "C-VEBDL-0", "F-VESTE_AM")
#> # A tibble: 18 x 3
#>    Sex    `Region (NUTS2)`   `Arithmetic mean`
#>    <fct>  <fct>                          <int>
#>  1 Male   AT11 Burgenland                   17
#>  2 Male   AT12 Lower Austria                18
#>  3 Male   AT13 Vienna                       21
#>  4 Male   AT21 Carinthia                    18
#>  5 Male   AT22 Styria                       18
#>  6 Male   AT31 Upper Austria                20
#>  7 Male   AT32 Salzburg                     19
#>  8 Male   AT33 Tyrol                        19
#>  9 Male   AT34 Vorarlberg                   20
#> 10 Female AT11 Burgenland                   15
#> 11 Female AT12 Lower Austria                14
#> 12 Female AT13 Vienna                       17
#> 13 Female AT21 Carinthia                    14
#> 14 Female AT22 Styria                       15
#> 15 Female AT31 Upper Austria                15
#> 16 Female AT32 Salzburg                     15
#> 17 Female AT33 Tyrol                        15
#> 18 Female AT34 Vorarlberg                   15
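
To make the difference between the two aggregation modes concrete, here is a toy sketch in base R. All codes and values are invented for illustration and are not taken from the dataset above:

```r
# Toy data with a total code ("SEX-0") next to the regular codes.
# All codes and values here are invented for illustration.
d <- data.frame(
  sex    = c("SEX-0", "SEX-1", "SEX-2", "SEX-0", "SEX-1", "SEX-2"),
  region = c("AT11", "AT11", "AT11", "AT12", "AT12", "AT12"),
  count  = c(30, 10, 20, 70, 30, 40),
  mean   = c(16, 17, 15, 16, 18, 14)
)

# Variant 1: drop the total rows and sum directly (fine for counts).
no_totals <- d[d$sex != "SEX-0", ]
by_sum <- aggregate(count ~ region, data = no_totals, FUN = sum)

# Variant 2: keep only the total rows (needed for means, which are not additive).
by_total <- d[d$sex == "SEX-0", c("region", "count", "mean")]

by_sum
by_total
```

For the count column both variants agree, but summing the mean column would be meaningless, which is why such measures need the total codes.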

What do you think @bernhard-da ?

Another issue with aggregation is hierarchical fields which include partial sums. In some cases, this can be detected in {OGD_ID}_{FIELD_ID}.csv using column 3, but most of the time, the hierarchy structure is not encoded anywhere. In a way, total codes are a special case of hierarchies where all other levels can be considered direct children of the total code.

this affects two datasets, which are both
problematic with the current import logic.

- `OGDEXT_BINNENWAND_1` has no variable
  codes in the first line of
  `OGDEXT_BINNENWAND_1.csv`
- `OGDEXT_VORNAMEN_1` contains fields
  of type <chr> like `F-VORNAME`

add column `label_en` to
- `$meta$database`
- `$meta$measures`
- `$meta$fields`
- `$field(i)`

this option affects
- the printing method
- the labeling of $data

TODO: language can be switched with
{object}$language <- {new_language},
however, `$data` is not updated in
this case

restructure sc_table to make it compatible
with the base class sc_data. In the new
version, the main part of tabulate() is
executed in the base class which provides
more flexibility.

- mixtures of fields with totals and without
  totals are allowed
- parameters can be specified as codes or
  as labels
- parameter raw can be used to return
  codes instead of labels
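
Accepting either codes or labels could work roughly like this. The lookup table and the helper `resolve_codes()` are hypothetical and only illustrate the matching idea:

```r
# Hypothetical lookup table of a field: codes and their labels (invented).
field <- data.frame(
  code  = c("A11-1", "A11-2", "A11-3"),
  label = c("Total", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Resolve user input that may be given either as code or as label.
resolve_codes <- function(x, field) {
  idx <- match(x, field$code)
  by_label <- match(x, field$label)
  # Fall back to label matching where no code matched.
  idx[is.na(idx)] <- by_label[is.na(idx)]
  if (anyNA(idx)) {
    stop("unknown code or label: ", paste(x[is.na(idx)], collapse = ", "))
  }
  field$code[idx]
}

resolve_codes(c("A11-2", "Female"), field)
```

Codes take precedence over labels here, so a label that happens to equal another entry's code would resolve to the code.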

The sc_table class now also inherits
total_codes() and a more flexible
implementation of field()

$data and $meta are now parsed eagerly
which means that the slots are calculated
at construction time rather than the first
access

there was a regression regarding the
annotations parameter. Annotations are
stored as attributes in $data_raw and
dropped during aggregation. They currently
don't have priority and will be
re-implemented at some point

if the tibble package is attached, the data
will be printed with this class. These changes
do not cause a formal dependency to tibble
because they only reroute S3 dispatch if
the `print.tbl` generic is in the current
search path.

register S3 methods if pillar is available
with registerS3method(). Alternatively,
vctrs::s3_register() could be used
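
A rough sketch of conditional registration with registerS3method(). `pillar::tbl_sum` is an existing generic, while the class name "sc_tibble" and the method body are placeholders and not necessarily what the package uses:

```r
# Placeholder method for the pillar generic tbl_sum().
tbl_sum_sc_tibble <- function(x, ...) {
  c("STATcubeR tibble" = paste(nrow(x), "x", ncol(x)))
}

# Register the method only if the optional package is installed, so that
# no formal dependency on pillar is needed.
register_optional_methods <- function() {
  if (requireNamespace("pillar", quietly = TRUE)) {
    registerS3method("tbl_sum", "sc_tibble", tbl_sum_sc_tibble,
                     envir = asNamespace("pillar"))
    return(TRUE)
  }
  FALSE
}

register_optional_methods()
```

In a package, a function like this would typically be called from .onLoad().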

Also, export the print method for
sc_tibble_meta() and avoid warnings in
devtools::check() due to param inconsistency
because of missing ellipsis

use \uxxxx escape to print the ellipsis
in the footer of print.sc_tibble_meta

since this class is not exported and there
are no factory methods, use the classname
as the name of the R6 class generator object

the field $data_raw was renamed to $data and
the previous $data is now only available via
$tabulate()

* use R6/roxygen2 to document the whole class
* move request time to $meta$source
* rename $raw -> $json
* advanced printing
* R6/roxygen2 documentation
* link ?sc_table_class from ?sc_table

od_tabulate() now only matches for labels
in the current language. If the language
of a table is set to "en", german labels
cannot be used. Update examples accordingly
@GregorDeCillia
Contributor

Breaking changes

  • in sc_data: $data_raw has been renamed to $data and the old version of $data is now only available through $tabulate()
  • od_table()$raw was renamed to od_table()$json
  • sc_data()$meta$database was renamed to sc_data()$meta$source

@GregorDeCillia
Contributor

GregorDeCillia commented Aug 9, 2021

A new version of the pkgdown site is available at https://statistikat.github.io/STATcubeR/dev/

This now includes a roxygen2 documentation of the three main R6 classes

The class documentation is supplemental to the documentation of the constructor methods

The index sites were updated

  • The new page includes a well formatted article index and only the most important articles are directly linked from the navbar
  • The reference index now includes categories to group man pages according to their purpose.

Updates for all pkgdown related source files will be added in a separate branch (#13) because I cannot test the REST API documentation in my current development environment and need to transfer it to another server. The pkgdown manual is still ahead of the VCS in some regard. For example: od_cache_summary() and od_downloads() from the file management article are not available in the VCS.

when developing the R6 documentation,
it was briefly tested how the man pages
would look if the od_table class was
directly exported as od_Table. These
man entries mistakenly use this
invalid syntax

not related to od_table, but the old version
caused errors because of positional arguments
in R/table_custom.R#L27

the annotation parameter is not working
properly with the introduction of
sg_data which allows tabulate() to operate
via sums or via total codes. Use
if(FALSE) to skip this example for now

in the long run, it will have to be decided
how annotations should be aggregated in
$tabulate()

previously, this function expected sc_table
objects and now it operates with the base
class. This required some rerouting of the
different implementations and merging
of certain man-pages

the od_tabulate() function was removed from
the NAMESPACE because it would just be an
alias for sc_tabulate() at this point

the fact that the annotations param is broken
is now part of the class documentation
of sc_table_class
@GregorDeCillia GregorDeCillia merged commit 8416f63 into master Aug 9, 2021
@GregorDeCillia GregorDeCillia mentioned this pull request Aug 10, 2021
@GregorDeCillia GregorDeCillia deleted the merge-od branch August 11, 2021 12:17
@GregorDeCillia GregorDeCillia mentioned this pull request Feb 20, 2023
5 tasks