IQL-course/IQL-Research-Project-21-22

The optimality of word lengths

This repository supports two articles produced in the master's-level course Introduction to Quantitative Linguistics (IQL) at Universitat Politècnica de Catalunya (spring semester, 2022):

  • Direct and indirect evidence of compression of word lengths. Zipf's law of abbreviation revisited (arXiv:2303.10128)
  • The optimality of word lengths. Theoretical foundations and an empirical study (arXiv:2208.10384)

Authors

  • Sonia Petrini
  • Antoni Casas-i-Muñoz
  • Jordi Cluet-i-Martinell
  • Mengxue Wang
  • Christian Bentz
  • Ramon Ferrer-i-Cancho

Repository organization

The repository contains the following folders:

  • code: all the R and Python code developed to preprocess and analyze the data (the R code must be run from the parent directory)
  • data: the Common Voice Forced Alignments and Parallel Universal Dependencies datasets, both filtered (filtered subfolder) and unfiltered (non_filtered subfolder), as described in the paper. The other subfolder contains further material used throughout the project
  • figures: figures produced for the paper from the filtered data (filtered subfolder) and the unfiltered data (non_filtered subfolder)
  • latex_tables: LaTeX tables produced for the paper from the filtered data (filtered subfolder) and the unfiltered data (non_filtered subfolder)
  • results: CSV files obtained from the analysis of the filtered data (filtered subfolder) and the unfiltered data (non_filtered subfolder)
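
As a minimal sketch of how this layout can be navigated from Python, the helper below builds the path to the filtered or non_filtered subfolder of a top-level folder and lists the CSV files in results. The function names subfolder and list_result_csvs are hypothetical and not part of the repository. The sketch also assumes that the CSV files sit directly inside the chosen subfolder, which the description above does not confirm.

```python
from pathlib import Path

# Top-level folders that the list above describes as having
# "filtered" and "non_filtered" subfolders.
FOLDERS_WITH_FILTER_SPLIT = ("data", "figures", "latex_tables", "results")


def subfolder(root, folder, filtered=True):
    """Return <root>/<folder>/filtered or <root>/<folder>/non_filtered.

    Hypothetical helper: it only joins path components and does not
    check that the directory exists.
    """
    if folder not in FOLDERS_WITH_FILTER_SPLIT:
        raise ValueError(f"unexpected folder: {folder!r}")
    return Path(root) / folder / ("filtered" if filtered else "non_filtered")


def list_result_csvs(root, filtered=True):
    """List the CSV files directly inside the chosen results subfolder.

    Assumption: the CSV files are not nested in deeper subfolders.
    """
    return sorted(subfolder(root, "results", filtered).glob("*.csv"))
```

For example, list_result_csvs(".", filtered=False) called from the repository root would return the CSV files under results/non_filtered, if any are stored at that level.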

Branches

The two branches correspond to the first and the second article, respectively. The pud data differs slightly between branches because we improved its preprocessing after the first article was published. The changes are minimal, affect only a few languages, and do not alter the qualitative results.

Notes

Throughout the repository, pud stands for the Parallel Universal Dependencies collection and cv stands for the Common Voice Forced Alignments collection.

About

Research Project of the IQL 2021-22 course
