Support dataset caching on a server #68
Comments
Can we call it OPUS?
I was thinking of more automation than what is currently at OPUS.
Automatic caching would be nice. There could be some practical issues, but we can discuss how to solve them.
By the way, there is no need to crawl the OPUS website; there is an API at https://opus.nlpl.eu/opusapi/ (note that we are working on a new version, because updating the DB broke recently after some changes in storage etc.).
Yes, that would work! Thanks. I will start using the OPUS API in the future.
Currently, if a dataset hosting server goes down, our library stops working for those datasets (see #66).
We have a local cache, but it won't proactively download files ahead of time, and the cache is confined to a single node (unless we scp the whole cache directory or place it on a network FS).
Solution / future enhancement:
We need a cache server! Assuming we can afford AND maintain a server with enough bandwidth and storage:
Unlike OPUS, the server has to be just a cache of files (no processing, no manual work once it is set up).
This architecture would be similar to Maven with its central repository: https://mvnrepository.com/repos/central
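To make the idea concrete, here is a minimal sketch of how a client could use such a cache server: check the local cache first, then try the original host, then fall back to the cache server. The function names (`fetch_with_fallback`, `http_get`) and the ordering of sources are assumptions for illustration, not the library's existing API.

```python
from pathlib import Path
from typing import Callable, Iterable
from urllib.request import urlopen


def http_get(url: str) -> bytes:
    """Default fetcher: plain HTTP(S) GET via the standard library."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def fetch_with_fallback(local_path: Path, urls: Iterable[str],
                        fetch: Callable[[str], bytes] = http_get) -> Path:
    """Return local_path, downloading from the first URL that works.

    Hypothetical helper: `urls` is an ordered list of sources, e.g. the
    original host first and the cache server second.
    """
    if local_path.exists():  # local cache hit: no network access needed
        return local_path
    errors = []
    for url in urls:
        try:
            data = fetch(url)
        except OSError as e:  # URLError and timeouts are OSError subclasses
            errors.append(f"{url}: {e}")
            continue
        local_path.parent.mkdir(parents=True, exist_ok=True)
        local_path.write_bytes(data)
        return local_path
    raise OSError("all sources failed: " + "; ".join(errors))
```

With this shape, an outage of the original host (as in #66) only costs one failed request before the cache server is tried.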
Maven has a nice way to uniquely identify packages:
<group>:<artifact>:<version>
We will have to do:
<group>:<artifact>:<languages>:<version>
Where
Also think about: copyright/IP.
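As a rough illustration of the four-part identifier proposed above, the sketch below parses and re-serializes `<group>:<artifact>:<languages>:<version>`. The class name `DatasetID`, the `-` separator between language codes, and the example values are all assumptions; the issue does not specify how the languages field is encoded.

```python
from typing import NamedTuple, Tuple


class DatasetID(NamedTuple):
    """Hypothetical Maven-style coordinate for a dataset."""
    group: str
    artifact: str
    languages: Tuple[str, ...]
    version: str

    @classmethod
    def parse(cls, text: str) -> "DatasetID":
        parts = text.split(":")
        if len(parts) != 4 or not all(parts):
            raise ValueError(
                f"expected <group>:<artifact>:<languages>:<version>, got {text!r}")
        group, artifact, langs, version = parts
        # Assumption: language codes are joined with '-', e.g. "en-de".
        return cls(group, artifact, tuple(langs.split("-")), version)

    def __str__(self) -> str:
        return ":".join(
            [self.group, self.artifact, "-".join(self.languages), self.version])


# Example with made-up values:
did = DatasetID.parse("examplegroup:news:en-de:1")
print(did.languages)
print(did)
```

A fixed, parseable identifier like this could also double as the file path on the cache server, much as Maven maps coordinates to repository paths.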