This repository supports two articles that grew out of the master's-level course Introduction to Quantitative Linguistics (IQL) at Universitat Politècnica de Catalunya (spring semester, 2022). Specifically:
- Direct and indirect evidence of compression of word lengths. Zipf's law of abbreviation revisited (arXiv:2303.10128)
- The optimality of word lengths. Theoretical foundations and an empirical study (arXiv:2208.10384)

Authors:
- Sonia Petrini
- Antoni Casas-i-Muñoz
- Jordi Cluet-i-Martinell
- Mengxue Wang
- Christian Bentz
- Ramon Ferrer-i-Cancho
The repository contains the following folders:
- code: all the R and Python code developed to preprocess and analyze the data (the R code must be run with the parent directory of this folder as the working directory)
- data: the Common Voice Forced Alignments and Parallel Universal Dependencies datasets, both filtered (filtered subfolder) and not filtered (non_filtered subfolder) as described in the paper. The other subfolder contains additional material used throughout the project
- figures: figures produced for the paper, both using the filtered data (filtered subfolder) and the non-filtered data (non_filtered subfolder)
- latex_tables: LaTeX tables produced for the paper, both using the filtered data (filtered subfolder) and the non-filtered data (non_filtered subfolder)
- results: CSV files obtained from the analysis, both using the filtered data (filtered subfolder) and the non-filtered data (non_filtered subfolder)
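Because the R code depends on the working directory, it can help to check the location before running anything. The following is a minimal sketch, assuming only the folder names listed above; the helpers `missing_folders` and `output_dir` are hypothetical and are not part of the repository.

```python
from pathlib import Path

# Top-level folders listed in this README.
EXPECTED_FOLDERS = ("code", "data", "figures", "latex_tables", "results")


def missing_folders(root="."):
    """Return the expected top-level folders that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


def output_dir(kind, filtered=True, root="."):
    """Build the path to, e.g., results/filtered or figures/non_filtered.

    Hypothetical helper: it only joins the folder names described above
    and does not check that the resulting path exists.
    """
    if kind not in ("data", "figures", "latex_tables", "results"):
        raise ValueError(f"unknown folder: {kind}")
    return Path(root) / kind / ("filtered" if filtered else "non_filtered")


if __name__ == "__main__":
    missing = missing_folders()
    if missing:
        print("Not at the repository root; missing:", ", ".join(missing))
    else:
        print("Working directory looks like the repository root.")
```

If the script reports missing folders, change to the directory that contains the code folder before running the R scripts.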
The two branches correspond to the first and the second article, respectively. The pud data differs slightly between branches because we improved its preprocessing after the first article was published. However, the changes are minimal, concern only a few languages, and do not affect the qualitative results.
Throughout the repository, pud stands for the Parallel Universal Dependencies collection and cv stands for the Common Voice Forced Alignments collection.