Please split data set and cli tooling into separate projects/repositories #29

Open
alerque opened this issue Mar 26, 2021 · 3 comments
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

@alerque
Contributor

alerque commented Mar 26, 2021

Migrated from #14 where it is slightly off topic:


> I really hope the web interface and ability to use it via CLI and package are beneficial here.

Sure they are! But that isn't the question. The question is about the data. In fact this highlights something I'm convinced is a major mistake in this project: the data set should be one project and the CLI tooling should be another. Having these coupled in one repository will be a serious limitation down the road. The data may well prove useful in other contexts where the tooling would be clutter, and the tooling should be decoupled from the data versioning so people can potentially use the tooling with different data (older versions, a fork for contested data, substituting CLDR data, etc.). Perhaps some day you want to rework the tooling a bit so the CLI works differently. That may break a lot of old projects that would otherwise refresh their data, but now the tooling would block them.
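To make the decoupling concrete, here is a minimal sketch of tooling that is pointed at a data directory instead of shipping with one. Everything in it is hypothetical: `LanguageTool`, `write_dataset`, the one-JSON-file-per-language layout, and the codes `aaa`/`bbb` are illustration only and not this project's actual API or data format.

```python
import json
import tempfile
from pathlib import Path


class LanguageTool:
    """Hypothetical tooling that is handed its data rather than bundling it."""

    def __init__(self, data_dir):
        # JSON stands in for the real data format to keep the sketch stdlib-only.
        self.languages = {
            path.stem: json.loads(path.read_text(encoding="utf-8"))
            for path in sorted(Path(data_dir).glob("*.json"))
        }

    def supported(self, chars):
        """Return the codes of languages whose characters are all in chars."""
        available = set(chars)
        return sorted(
            code
            for code, entry in self.languages.items()
            if set(entry["chars"]) <= available
        )


def write_dataset(directory, entries):
    """Write one made-up JSON file per language code (illustration only)."""
    for code, chars in entries.items():
        path = Path(directory) / f"{code}.json"
        path.write_text(json.dumps({"chars": chars}), encoding="utf-8")


# The same tool runs unchanged against an "official" data set and a fork.
with tempfile.TemporaryDirectory() as official, tempfile.TemporaryDirectory() as fork:
    write_dataset(official, {"aaa": "ab", "bbb": "abc"})
    write_dataset(fork, {"aaa": "abd"})
    official_result = LanguageTool(official).supported("abc")
    fork_result = LanguageTool(fork).supported("abc")

print(official_result)
print(fork_result)
```

With that shape, pinning an older data release or swapping in a fork is a matter of changing the path, and a tooling release never forces a data update or the other way around.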

@alerque alerque mentioned this issue Mar 26, 2021
@MrBrezina MrBrezina added the enhancement New feature or request label Mar 26, 2021
@MrBrezina
Member

That’s a fair point. We merged it purely for practical reasons (it was practical for us). Let’s wait until we resolve some early issues with the data structure we still have.

@alerque
Contributor Author

alerque commented Mar 26, 2021

Sure, for early prototyping it's handy to mess with the data structures and tooling at the same time without separate procedures for collating them. Just don't wait too long to get them split up. Every person who starts using this for anything will have to refactor as soon as you do, so the balance between "easier for us developers doing early prototyping" and "easier for consumers" will tip sooner than developers tend to notice. In particular, don't save it for "the big 1.0.0": you want to hash out the way the projects correlate and the release process before you call it good, not at the same time.

@kontur
Contributor

kontur commented Mar 26, 2021

One thing to note here is that reading the plain YAML dataset with the Python library actually augments it (orthography inheritance, macrolanguages, glyph decomposition for checking, etc.). So I think there are three components:

  • Raw data
  • Python wrapper around the data
  • CLI tools

To maintain data integrity in the database YAML we are also using a bunch of scripts for validating and saving. These could be separated out as "tests" for the data, but I think it might also be valuable to emphasize the Pythonic way of accessing the data, as opposed to using the YAML directly.

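E.g. using the database from Python:

from hyperglot.languages import Languages
# Load the language database; the library augments the raw YAML on read
langs = Languages()
# Look up a single language by its code
print(langs['eng'])
# Query language support for a list of characters
print(langs.get_support_from_chars(["A", "B", ...]))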

It's not something we have documented well so far, but imo the YAML is "only" data input. One thing I had also considered is generating a pickled cache object from the YAML and using that for accessing the language data programmatically.
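A rough sketch of what such a pickled cache could look like, assuming the cache is rebuilt whenever the source file is newer than the pickle. `load_with_cache` and the stand-in `parse` function are hypothetical names; in practice `parse` would be whatever reads and augments the YAML, which is left out here so the example runs with the standard library alone.

```python
import pickle
import tempfile
from pathlib import Path


def load_with_cache(source, cache, parse):
    """Return parse(source), reusing a pickled copy while it is still fresh."""
    source, cache = Path(source), Path(cache)
    # The cache counts as fresh if it exists and is at least as new as the source.
    if cache.exists() and cache.stat().st_mtime >= source.stat().st_mtime:
        with cache.open("rb") as fh:
            return pickle.load(fh)
    # Otherwise parse the source and store the result for the next call.
    data = parse(source)
    with cache.open("wb") as fh:
        pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return data


calls = []


def parse(path):
    """Stand-in parser (an assumption): records each call and reads lines."""
    calls.append(path.name)
    return {"lines": path.read_text(encoding="utf-8").splitlines()}


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "languages.txt"
    cache = Path(tmp) / "languages.pickle"
    source.write_text("eng\ndeu\n", encoding="utf-8")
    first = load_with_cache(source, cache, parse)   # parses and writes the cache
    second = load_with_cache(source, cache, parse)  # served from the pickle
    parse_calls = len(calls)

print(first == second, parse_calls)
```

Since unpickling can execute arbitrary code, a cache like this should only ever be generated and read locally, never distributed as the data itself.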

All that said, I think the point made is a good one. Now that we have publicized the project, we are seeing the same conflict of concerns, with issues and PRs split between CLI and data.

@kontur kontur added the documentation Improvements or additions to documentation label Jun 27, 2022