Fennica: Harmonized Finnish national bibliography

This repository contains code for cleaning, enriching and automatically generating reports on the Finnish national bibliography, Fennica.

The live document is deployed in a CSC Rahti container: https://fennica-fennica.rahtiapp.fi. The generated bookdown document consists of several sections, or "chapters". Each section focuses on different fields of the raw MARC-formatted data. Most chapters also include visualizations that give a quick overview of what the data looks like. Processed CSV datasets can also be downloaded for further analysis.

This README describes how to reproduce the analyses and generate the notebook.

Origins of data

The data was downloaded from the National Metadata Repository, Melinda. See more: https://melinda.kansalliskirjasto.fi/

Reproducing the workflow, or how to create "Fennica metadata conversions" from scratch

1. Clone the repository to your computer.

# In terminal / GIT
git clone https://github.com/fennicahub/fennica.git

2. Download the dataset from the National Library website

collect.py (this script was provided to us by Osma Suominen, the National Library of Finland)

3. Transform the raw data into a readable CSV format by running the following Python scripts one by one, in this order

full_fennica_file.py

raw_fennica_transform.py

combine_csv.py
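As an illustration of the last part of this step, the sketch below concatenates several CSV files that share a header into one file. It is not the code in combine_csv.py: the function name `combine_csv`, its arguments and the header check are assumptions made for this example only.

```python
import csv


def combine_csv(part_paths, out_path):
    """Concatenate CSV files that share a header into a single file.

    Hypothetical helper, not taken from the repository.
    Returns the number of data rows written.
    """
    rows_written = 0
    header = None
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for part in part_paths:
            with open(part, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # skip completely empty files
                if header is None:
                    # The first non-empty file defines the header.
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"Header mismatch in {part}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Failing on a header mismatch, rather than silently appending, is a design choice for the sketch: it makes it harder to combine files whose columns are in a different order.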

4. Pick priority fields from the transformed file

pick_fields.py
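A minimal sketch of what picking fields can look like, assuming the transformed file is a CSV with one column per field. This is not the code in pick_fields.py; the function `pick_fields` and the column names in the usage example (`record_id`, `041a`, `245a`, `260c`) are hypothetical.

```python
import csv
import io


def pick_fields(in_file, out_file, fields):
    """Copy only the columns listed in `fields` from in_file to out_file.

    Hypothetical helper, not taken from the repository.
    """
    reader = csv.DictReader(in_file)
    available = reader.fieldnames or []
    missing = [name for name in fields if name not in available]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    # extrasaction="ignore" drops every column that is not in `fields`.
    writer = csv.DictWriter(out_file, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Usage with in-memory data; the column names are made up for illustration.
raw = io.StringIO("record_id,041a,245a,260c\n1,fin,Kalevala,1835\n")
picked = io.StringIO()
pick_fields(raw, picked, ["record_id", "245a"])
print(picked.getvalue())
```

With real files, open both with `newline=""` and an explicit encoding, as the `csv` module documentation recommends.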

5. In RStudio, run init.R to collect the priority fields into a main data frame

6. Run the <field.R> script for each field in fennica/inst/examples to harmonize the fields separately and to create summary tables

language.R

publication_time.R

title.R

title_uniform.R

7. The main polish functions, which clean and harmonize the different types of data fields, are in fennica/R

polish_years.R

polish_languages.R

polish_title.R

8. Render the qmd file for each field (<field.qmd>) in fennica/inst/examples

publication_time.qmd

language.qmd

title.qmd

title_uniform.qmd

9. Render the whole notebook from the RStudio terminal

quarto render

To render a single file:

quarto render <field_name>.qmd

10. Upload the summary tables to Allas by running the allas.R script

allas.R

Description of the Webhook workflow, image from CSC Documentation

The bookdown document is rendered with GitHub Actions. The generated files are placed in the gh-pages branch of the GitHub repository. From there they are copied to Rahti by a webhook and served by an nginx server.

Earlier material

Links to notebooks that are not actively maintained but may contain useful information regarding related past work.

The analyses cover several steps including XML parsing, data harmonization, removing unrecognized entries, enriching and organizing the data, carrying out statistical summaries, analysis, visualization and automated document generation.
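To make the harmonization and "removing unrecognized entries" steps concrete, here is a small sketch of polishing a free-text publication year. It is written in Python for illustration and is not the repository's R implementation; the function `polish_year` and the plausibility bounds `min_year` and `max_year` are assumptions.

```python
import re

# A run of exactly four digits that is not part of a longer number.
YEAR_PATTERN = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def polish_year(raw, min_year=1400, max_year=2100):
    """Return the first plausible four-digit year in `raw`, or None.

    Hypothetical helper: entries without a recognizable year are
    returned as None so that they can be filtered out or reported.
    """
    if raw is None:
        return None
    for match in YEAR_PATTERN.finditer(str(raw)):
        year = int(match.group(1))
        if min_year <= year <= max_year:
            return year
    return None


# Usage: harmonize a few raw values and count the unrecognized ones.
raw_values = ["[1835]", "Helsinki, 1902.", "s.a.", "c1999"]
polished = [polish_year(value) for value in raw_values]
unrecognized = sum(year is None for year in polished)
```

Returning None instead of guessing keeps unrecognized entries visible, so they can be counted in a summary table rather than silently distorting the statistics.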

Licensing

The analyses and full source code are provided in this repository and can be freely reused under the BSD 2-Clause (FreeBSD) open source licence. The analyses are based on R and rely on various R packages.

The original data has been published openly by the National Library of Finland.

Acknowledgements

The project is currently developed with research and infrastructure funding from the Research Council of Finland (DHL-FI and FIN-CLARIAH). The work is based on past and present collaboration between the Turku Data Science Group (University of Turku), the Helsinki Computational History Group (COMHIS) (University of Helsinki) and the National Library of Finland (Fennica data collection). For the list of contributors, see contributors and the related publications.

Contact

Email: [email protected] / [email protected]

The project is under active open development.
