Please split data set and cli tooling into separate projects/repositories #29

alerque · 2021-03-26T11:57:51Z

Migrated from #14 where it is slightly off topic:

> I really hope the web interface and ability to use it via CLI and package are beneficial here.

Sure they are! But that isn't the question. The question is about the data. In fact this highlights something I'm convinced is a major mistake in this project: the data set should be one project and the CLI tooling should be another. Having these coupled in one repository will be a serious limitation down the road. The data may well prove useful in other contexts where the tooling would be clutter, and the tooling should be decoupled from the data versioning so people can potentially use the tooling with different data (older versions, a fork for contested data, substituting CLDR data, etc.). Perhaps some day you will want to rework the tooling so the CLI works differently. That could break a lot of older projects that would otherwise simply refresh their data, because the tooling change would block them from updating.
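To make the decoupling concrete, here is a minimal sketch of tooling that is told which data set to use instead of shipping with one. Everything in it is hypothetical: the tool name `langtool`, the `--data` flag, and the one-file-per-language layout are illustrations, not the project's actual CLI or schema, and JSON stands in for the project's YAML only so the sketch needs nothing beyond the standard library.

```python
import argparse
import json
from pathlib import Path


def load_dataset(data_dir):
    """Load every *.json file in data_dir into a dict keyed by file stem."""
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(data_dir).glob("*.json"))
    }


def build_parser():
    # "langtool" and "--data" are hypothetical names used for illustration.
    parser = argparse.ArgumentParser(prog="langtool")
    parser.add_argument("--data", required=True,
                        help="directory holding the data set to use")
    parser.add_argument("code", help="language code to look up")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    dataset = load_dataset(args.data)
    entry = dataset.get(args.code)
    if entry is None:
        print(f"{args.code}: not in this data set")
        return 1
    print(f"{args.code}: {entry['name']}")
    return 0
```

With this shape, calling `main(["--data", "path/to/data", "eng"])` against an older release, a fork, or a converted CLDR export only changes the path argument; the tooling version stays the same.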

MrBrezina · 2021-03-26T12:40:02Z

That’s a fair point. We merged it purely for practical reasons (it was practical for us). Let’s wait until we resolve some early issues with the data structure we still have.

alerque · 2021-03-26T12:44:14Z

Sure, for early prototyping it's handy to mess with the data structures and tooling at the same time without separate procedures for collating them. Just don't wait too long to get them split up... every person that starts using this for anything will have to refactor as soon as you do, so the balance between "easier for us developers doing early prototyping" and "easier for consumers" will tip sooner than developers tend to notice. In particular, don't save it for "the big 1.0.0"; you want to hash out the way the projects correlate and the release process before you call it good, not at the same time.

kontur · 2021-03-26T13:28:01Z

One thing to note here is that reading the plain yaml dataset with the Python library actually augments it (orthography inheritance, macrolanguages, glyph decomposition for checking, etc.). So I think there are three components:

1. Raw data
2. Python wrapper around the data
3. CLI tools
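As a rough illustration of why the raw data and the wrapper are separate layers, here is a minimal sketch of one kind of augmentation, orthography inheritance. The record layout and the field names (`orthographies`, `inherit`, `base`) are assumptions made up for this example and are not taken from the actual database schema or library code.

```python
# Hypothetical raw records, as they might look straight after parsing.
RAW = {
    "aaa": {"name": "Language A", "orthographies": [{"base": "A B C"}]},
    "bbb": {"name": "Language B", "orthographies": [{"inherit": "aaa"}]},
}


def resolve_orthographies(raw, code, _seen=None):
    """Return the orthographies for `code`, with every inherit reference
    replaced by the referenced language's resolved orthographies."""
    seen = set() if _seen is None else _seen
    if code in seen:
        raise ValueError(f"circular inheritance at {code!r}")
    seen = seen | {code}
    resolved = []
    for ortho in raw[code]["orthographies"]:
        if "inherit" in ortho:
            # Follow the reference; the raw record itself is left untouched.
            resolved.extend(resolve_orthographies(raw, ortho["inherit"], seen))
        else:
            resolved.append(dict(ortho))
    return resolved
```

A consumer reading only the raw records would see `{"inherit": "aaa"}` for `bbb`, while a consumer going through the wrapper would get the resolved orthography.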

In terms of maintaining data integrity in the database YAML we are using a bunch of scripts for validating and saving as well; these could be separated as "tests" for the data, but I think it might also be valuable to emphasize the Pythonic way of accessing the data, as opposed to using the YAML directly.

E.g. the database use:

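```python
from hyperglot.languages import Languages

# Load the database through the Python wrapper instead of reading the YAML
langs = Languages()
print(langs['eng'])
# Character list elided in the original comment
print(langs.get_support_from_chars(["A", "B", ...]))
```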

It's not something we documented well so far, but imo the yaml is "only" data input. One thing I had considered also is generating and using a pickled cache object from the yaml for accessing the language data programmatically.
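The pickled cache idea could look roughly like the sketch below. It is an assumption-laden illustration, not the project's implementation: the function name `load_with_cache` is made up, and the default parser is `json.load` only to keep the example standard-library-only; for the YAML database one would pass a YAML parser as `parse` instead.

```python
import json
import os
import pickle


def load_with_cache(source_path, cache_path, parse=json.load):
    """Return the parsed data, reusing a pickle cache when the cache file
    is at least as new as the source file."""
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    # Cache missing or stale: parse the source and rewrite the cache.
    with open(source_path, "r", encoding="utf-8") as fh:
        data = parse(fh)
    with open(cache_path, "wb") as fh:
        pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return data
```

Since unpickling can execute arbitrary code, a cache like this should only be loaded when it was generated locally from trusted data, which is one more reason to treat it as a build artifact of the wrapper and not as part of the data set.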

All that said, I think the point made is a good one. We are seeing the same conflict of concerns now that we have publicized the project: issues and PRs are being split between CLI and data.

alerque mentioned this issue Mar 26, 2021

Compare to CLDR #14

Closed

MrBrezina added the enhancement label (New feature or request) Mar 26, 2021

MrBrezina assigned kontur Mar 26, 2021

kontur added the documentation label (Improvements or additions to documentation) Jun 27, 2022
