Add data import from data.statistik.gv.at #11

bernhard-da · 2021-07-09T12:00:28Z

work-in-progress: add od-functionality

GregorDeCillia · 2021-07-09T12:22:01Z

Thank you very much for this contribution!

As discussed behind our firewall, I have already tested this and the new od_table class is almost perfectly interchangeable with the existing sc_table class.

I'll do some fine-tuning over the next couple of days and will hopefully be able to address these points:

- Format categorical columns as factors. This is necessary for STATgraph::sg_timeseries(), which uses levels(). Currently, the parameter color does not work with sg_timeseries().
- Use a format in $field(i)$parsed which is consistent with sc_table: <date> for time variables and <fct> for categorical variables.
- The timings only measure the download time for the datasets, not for the metadata.
- Unnecessary traffic in od_table$initialize()? It's probably more efficient to test for a 404 response.
- Inline TODO

  STATcubeR/R/od_utils.R

  Line 98 in ecc01ac

  metafields$total <- TRUE # # todo: check if total is always included
- Set aggregation-function in $meta$measures$fun to NA
- For the population forecasts until 2101, the time variable is not recognized. This has to be fixed in

  STATcubeR/R/table_field.R

  Line 24 in ecc01ac

  if (!all(year %in% 1900:2100))

  Example: OGD_bevstprogjdgebland_PR_BEVJDGB_4 (Bevölkerung zum Jahresanfang nach dem Geburtsland 2016 bis 2101, i.e. population at the beginning of the year by country of birth, 2016 to 2101).
- Move the rendering logic in $render() to STATgraphApp
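
For the time-variable point above, a relaxed range check could look roughly like the following. This is only a sketch: `looks_like_years()` is a hypothetical helper that does not exist in STATcubeR, and the default upper bound of 2150 is an assumption that would need to be confirmed against the datasets.

```r
# Hypothetical helper, not part of STATcubeR: checks whether all labels
# of a field can be read as years within a configurable range.
looks_like_years <- function(x, min_year = 1900L, max_year = 2150L) {
  # non-numeric labels become NA instead of raising a warning
  year <- suppressWarnings(as.integer(as.character(x)))
  !anyNA(year) && all(year >= min_year & year <= max_year)
}

looks_like_years(c("2016", "2050", "2101"))            # forecast years pass
looks_like_years(c("2016", "2101"), max_year = 2100L)  # the old bound fails
looks_like_years(c("Male", "Female"))                  # not a time variable
```

Keeping the bounds as arguments avoids hard-coding a range that a longer forecast could exceed again.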

bernhard-da · 2021-07-09T12:36:30Z

@GregorDeCillia the issue with incorrect timings is already fixed.

GregorDeCillia · 2021-07-09T12:44:40Z

@bernhard-da thanks. I think the other TODOs should also be pretty straightforward to complete.

GregorDeCillia · 2021-07-09T15:12:05Z

- parse_time and round are currently ignored in od_tabulate()

  STATcubeR/R/od_tabulate.R

  Line 38 in 3257bd1

  parse_time <- round <- FALSE

  - For round, this makes sense since there is no metadata to provide a precision. Maybe either remove the parameter or raise a warning/error in case it is set to TRUE?
  - For parse_time, this is relevant in case the original labels of the months/quarters/years are required for graphs or tables. Unlike in sc_table, the $data slot in od_table contains parsed time entries and therefore the if (parse_time) block doesn't work.
- Translate error messages to English.

  STATcubeR/R/od_utils.R

  Line 19 in 3257bd1

  stop(paste0("Datensatz ", shQuote(id), " kann nicht eingelesen werden.\n",
- Add support for English labels

the $extras$attribute_description field in the meta-jsons of open.data can now be read correctly for all 273 "valid" datasets

valid datasets are datasets that satisfy

* json$resources[[1]] == "csv"
* length(json$resources) > 1

The returned descriptions are not always consistent with the contents of ${opendata_id}_HEADER.csv because the labels might be slightly different. However, the codes always match and use the same ordering

GregorDeCillia · 2021-07-13T09:54:23Z

I just made some major changes to the import logic in 1179486. The returned object now only contains three columns in $meta$measures since fun and annotations are not applicable for this class. It also enables caching, see the commit message.

Now the question is how to deal with aggregation of data. The way I see it, there are two types of datasets:

- Datasets that only contain measures which can be aggregated using unweighted sums. Those generally do not contain total codes and can/must be aggregated directly using base::sum()
- Datasets such as OGD_veste309_Veste309_1, where certain measures cannot be aggregated directly. Those seem to always contain total codes for all fields except time variables

My suggestion: Since it is basically impossible to automatically detect total codes, I'd suggest od_tabulate() aggregates directly by default. If a user wants to use total codes for aggregation, those have to be supplied as characters. For example

```r
x <- od_table("OGD_veste309_Veste309_1")
## define total codes
x$total_codes(
  `C-A11-0` = "A11-1",
  `C-STAATS-0` = "STAATS-9",
  `C-VEBDL-0` = "VEBDL-10",
  `C-BESCHV-0` = "BESCHV-1"
)
## `C-STAATS-0` and `C-BESCHV-0` are aggregated using total codes
## For `C-A11-0` and `C-VEBDL-0` total codes are excluded to make the result tidy
x$tabulate("C-A11-0", "C-VEBDL-0", "F-VESTE_AM")
#> # A tibble: 18 x 3
#>    Sex    `Region (NUTS2)`   `Arithmetic mean`
#>    <fct>  <fct>                          <int>
#>  1 Male   AT11 Burgenland                   17
#>  2 Male   AT12 Lower Austria                18
#>  3 Male   AT13 Vienna                       21
#>  4 Male   AT21 Carinthia                    18
#>  5 Male   AT22 Styria                       18
#>  6 Male   AT31 Upper Austria                20
#>  7 Male   AT32 Salzburg                     19
#>  8 Male   AT33 Tyrol                        19
#>  9 Male   AT34 Vorarlberg                   20
#> 10 Female AT11 Burgenland                   15
#> 11 Female AT12 Lower Austria                14
#> 12 Female AT13 Vienna                       17
#> 13 Female AT21 Carinthia                    14
#> 14 Female AT22 Styria                       15
#> 15 Female AT31 Upper Austria                15
#> 16 Female AT32 Salzburg                     15
#> 17 Female AT33 Tyrol                        15
#> 18 Female AT34 Vorarlberg                   15
```

What do you think @bernhard-da ?

Another issue with aggregation is hierarchical fields which include partial sums. In some cases, this can be detected in {OGD_ID}_{FIELD_ID}.csv using column 3, but most of the time, the hierarchy structure is not encoded anywhere. In a way, total codes are a special case of hierarchies where all other levels can be considered direct children of the total code.
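
The difference between the two aggregation strategies can be illustrated with a toy example in base R. This is only a sketch and not the od_tabulate() implementation; the data frame, the codes (with "S-0" assumed to be the total code of `sex`) and the column names are invented for illustration.

```r
# Toy data, invented for illustration: "S-0" is the total code of `sex`
d <- data.frame(
  sex    = c("S-0", "S-1", "S-2", "S-0", "S-1", "S-2"),
  region = c("R-1", "R-1", "R-1", "R-2", "R-2", "R-2"),
  count  = c(30, 10, 20, 70, 30, 40),
  mean   = c(17, 18, 16.5, 15, 16, 14.25)
)

# Strategy 1: drop the total rows and add up the remaining ones.
# Only valid for additive measures such as `count`.
by_sum <- aggregate(count ~ region, data = d[d$sex != "S-0", ], FUN = sum)

# Strategy 2: keep only the rows that carry the total code.
# This also works for non-additive measures such as `mean`.
by_total <- d[d$sex == "S-0", c("region", "count", "mean")]
```

For `count`, both strategies give the same result per region. For `mean`, summing the rows would be meaningless, so only the rows with the total code provide a usable aggregate.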

restructure sc_table to make it compatible with the base class sc_data. In the new version, the main part of tabulate() is executed in the base class which provides more flexibility.

- mixtures of fields with totals and without totals are allowed
- parameters can be specified as codes or as labels
- parameter raw can be used to return codes instead of labels

The sc_table class now also inherits total_codes() and a more flexible implementation of field()

$data and $meta are now parsed eagerly which means that the slots are calculated at construction time rather than the first access

there was a regression regarding the annotations parameter. Annotations are stored as attributes in $data_raw and dropped during aggregation. They currently don't have priority and will be re-implemented at some point

register S3 methods if pillar is available with registerS3method(). Alternatively, vctrs::s3_register() could be used

Also, export the print method for sc_tibble_meta() and avoid warnings in devtools::check() due to param inconsistency because of missing ellipsis

use \uxxxx escape to print the ellipsis in the footer of print.sc_tibble_meta

GregorDeCillia · 2021-08-09T08:46:28Z

Breaking changes

- in sc_data: $data_raw has been renamed to $data and the old version of $data is only available through $tabulate() now
- od_table()$raw was renamed to od_table()$json
- sc_data()$meta$database was renamed to sc_data()$meta$source

GregorDeCillia · 2021-08-09T09:06:53Z

A new version of the pkgdown site is available at https://statistikat.github.io/STATcubeR/dev/

- This now includes a roxygen2 documentation of the three main R6 classes
  - The class documentation is supplemental to the documentation of the constructor methods
- The index sites were updated
  - The new page includes a well-formatted article index and only the most important articles are directly linked from the navbar
  - The reference index now includes categories to group man pages according to their purpose

Updates for all pkgdown-related source files will be added in a separate branch (#13) because I cannot test the REST API documentation in my current development environment and need to transfer it to another server. The pkgdown manual is still ahead of the VCS in some respects. For example, od_cache_summary() and od_downloads() from the file management article are not available in the VCS.

the annotations parameter is not working properly with the introduction of sg_data which allows tabulate() to operate via sums or via total codes. Use if(FALSE) to skip this example for now

in the long run, it will have to be decided how annotations should be aggregated in $tabulate()

previously, this function expected sc_table objects and now it operates with the base class. This required some rerouting of the different implementations and merging of certain man pages

the od_tabulate() function was removed from the NAMESPACE because it would just be an alias for sc_tabulate() at this point

the fact that the annotations param is broken is now part of the class documentation of sc_table_class

start feature: add opendata-functionality

ecc01ac

GregorDeCillia changed the title ~~start feature: add opendata-functionality~~ Add data import from data.statistik.gv.at Jul 9, 2021

GregorDeCillia added the feature New feature or request label Jul 9, 2021

GregorDeCillia self-assigned this Jul 9, 2021

GregorDeCillia added 5 commits July 9, 2021 16:32

Set aggregation-function to NA

40abe23

in `$meta$measures$fun`

adapt time parser for od_table()

1bec67f

- re-implement d3e8ef0
- allow years up to 2150

add $tabulate() as method

a0f6f47

od_version() -> sc_version()

1a85062

report version at time of request/parse

3257bd1

GregorDeCillia and others added 10 commits July 9, 2021 17:41

show first field per default in $field()

dc72e06

until now, all fields were shown if no argument was provided

whoops

bd22451

don't use read_delim() with factors

3da3d97

read_delim throws an error if the url for the csv is passed as a character. Due to differences in R<4.0 and the newest version, this caused a bug in od_table()

read colons (:) as NA

80852ed

german -> english

fb5f851

text-wrapping in od_table.print

124829c

improve checks for httr response

1565c59

- status code 200 never occurs
- check if raw object is empty rather than parsing it

add more datasets ids in od_table docs

6a9c4c6

refactor od_create_data

1179486

- cache csv files in ~/.STATcubeR_cache
- only download data again if last_modified in json is newer than cache
- avoid nested loops for parsing
- exclude unnecessary columns in meta$measures
- make sure $data uses factors for classification variables
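
The cache rule from this commit message (download again only if last_modified in the json is newer than the cached file) can be sketched as follows. This is not the package's implementation: `cache_is_stale()` is a hypothetical helper, and it assumes that last_modified has already been parsed to POSIXct and that the modification time of the cached file is a usable reference.

```r
# Hypothetical helper, not the package's implementation.
# `last_modified` is assumed to be a POSIXct taken from the metadata json.
cache_is_stale <- function(cache_file, last_modified) {
  # nothing cached yet: the file has to be downloaded
  if (!file.exists(cache_file)) return(TRUE)
  # cached copy is older than the server-side modification time
  file.mtime(cache_file) < last_modified
}

cache_file <- tempfile(fileext = ".csv")
writeLines("a;b", cache_file)
cache_is_stale(cache_file, Sys.time() - 3600)  # cache is newer: reuse it
cache_is_stale(cache_file, Sys.time() + 3600)  # json is newer: download again
```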

GregorDeCillia added 6 commits July 13, 2021 13:51

helper function od_json_get()

b694c6a

minimize dependencies

cadc8b1

add @bernhard-da as contributor

cd1c8ab

don't allow ids starting with OGDEXT

5b0a2ec

this affects two datasets, which are both problematic with the current import logic.

- `OGDEXT_BINNENWAND_1` has no variable codes in the first line of `OGDEXT_BINNENWAND_1.csv`
- `OGDEXT_VORNAMEN_1` contains fields of type <chr> like `F-VORNAME`

import english labels

5d78b2a

add column `label_en` to

- `$meta$database`
- `$meta$measures`
- `$meta$fields`
- `$field(i)`

add language option in od_table()

8039602

this option affects

- the printing method
- the labeling of $data

TODO: language can be switched with {object}$language <- {new_language}, however, `$data` is not updated in this case

GregorDeCillia added 15 commits August 5, 2021 16:38

version bump: 0.2.1

617403a

advanced printing with tibble

f175b95

if the tibble package is attached, the data will be printed with this class. These changes do not cause a formal dependency to tibble because they only reroute S3 dispatch if the `print.tbl` generic is in the current search path.
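
A conditional reroute of S3 dispatch along these lines might look as follows. This is only a sketch and not the package's code: `has_tbl_print()` and `as_printable()` are hypothetical helpers, and checking for a registered print method via utils::getS3method() is an assumption, which tests for a loaded method rather than for the search path mentioned above.

```r
# Hypothetical sketch, not the package's code: prepend the "tbl" class only
# when a print method for "tbl" can be found, so that print() dispatches to it.
has_tbl_print <- function() {
  !is.null(utils::getS3method("print", "tbl", optional = TRUE))
}

as_printable <- function(x, use_tbl = has_tbl_print()) {
  if (use_tbl) class(x) <- unique(c("tbl", class(x)))
  x
}

x <- as_printable(data.frame(a = 1:3), use_tbl = FALSE)
class(x)  # unchanged, so plain data.frame printing is used
```

Because the class is only changed at runtime, no import of tibble or pillar is needed for this approach.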

update roxygen docs

14a9a53

document missing param in sc_tabulate()

b8f430e

sc_data_class -> sc_data, data_raw -> data

45c6368

since this class is not exported and there are no factory methods, use the classname as the name of the R6 class generator object

the field $data_raw was renamed to $data and the previous $data is now only available via $tabulate()

document od_table class

7b92833

* use R6/roxygen2 to document the whole class
* move request time to $meta$source
* rename $raw -> $json
* advanced printing

document sc_table_class

c1917b7

* R6/roxygen2 documentation
* link ?sc_table_class from ?sc_table

add unique param to od_list()

8b59c68

advanced printing of od_resources

5bf5bde

update od_tabulate examples

4737336

od_tabulate() now only matches labels in the current language. If the language of a table is set to "en", German labels cannot be used. Update examples accordingly
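
The matching rule described in this commit message can be sketched with toy metadata. This is not the od_tabulate() implementation: `resolve_codes()` is a hypothetical helper, and the codes and labels below are invented for illustration.

```r
# Toy metadata, invented for illustration
fields <- data.frame(
  code     = c("C-SEX", "C-REGION"),
  label_de = c("Geschlecht", "Bundesland"),
  label_en = c("Sex", "Region"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: resolve user input to codes, accepting either codes
# or labels in the currently selected language only.
resolve_codes <- function(x, fields, language = c("en", "de")) {
  language <- match.arg(language)
  labels <- fields[[paste0("label_", language)]]
  idx <- ifelse(x %in% fields$code, match(x, fields$code), match(x, labels))
  if (anyNA(idx))
    stop("Unknown field(s): ", paste(x[is.na(idx)], collapse = ", "))
  fields$code[idx]
}

resolve_codes(c("Sex", "C-REGION"), fields, language = "en")
resolve_codes("Geschlecht", fields, language = "de")
```

With language = "en", the call resolve_codes("Geschlecht", fields) stops with an error, which mirrors the behavior described in the commit message.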

don't show messages for omitted levels

5a44143

fix documentation typo

26380bc

update roxygen generated files

fafc844

add roxygen generated files for R6 classes

9d0f694

GregorDeCillia added 8 commits August 9, 2021 12:50

add roxygen docs for sc_table_class

14d8746

od_table$new() -> od_table()

6c43089

when developing the R6 documentation, it was briefly tested how the man pages would look if the od_table class was directly exported as od_Table. These man entries mistakenly use this invalid syntax

param add_totals for table_custom

6bcf6e0

not related to od_table, but the old version caused errors because of positional arguments in R/table_custom.R#L27

add github.io to DESCRIPTION -> url

e12f992

version -> 0.2.2

5182442

fix broken @example

4bcd075

GregorDeCillia merged commit 8416f63 into master Aug 9, 2021

GregorDeCillia mentioned this pull request Aug 10, 2021

Time type: Week #15

Closed

GregorDeCillia deleted the merge-od branch August 11, 2021 12:17

GregorDeCillia mentioned this pull request Feb 20, 2023

re-implement annotatons #39

Open

5 tasks
