Support dataset caching on a server #68

Open
thammegowda opened this issue Jun 15, 2021 · 4 comments
Labels: enhancement (New feature or request)

Comments

@thammegowda (Owner)

Currently, if a dataset hosting server goes down, our library stops working for those datasets (see #66).

We have a local cache, but it won't proactively download datasets ahead of time, and it is confined to a single node (unless we scp the whole cache directory or place it on a network FS).

Solution / future enhancement:
We need a cache server! Assuming we can afford AND maintain a server with enough bandwidth and storage:

  1. Proactively download and cache all the datasets on a server that serves as a remote cache.
  2. Allow the mtdata client to access the files from the remote cache (a client-side fallback sketch follows the requirements list below).

Unlike OPUS, the server would be just a cache of files (no processing, no manual work once it is set up).

  • It should automatically download newly listed datasets when a new version of mtdata is released on PyPI.
  • Make it easy to host and serve the cache so we can have backup servers (if needed).
  • Permit others to host and use their own server if they don't trust our cache server.
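
A minimal sketch of that client-side fallback, assuming a configurable mirror list (all names and URLs below are placeholders, not mtdata's real internals):

```python
import urllib.request
import urllib.error

# Hypothetical mirror list: the original host first, then cache server(s).
# Users who run their own cache server could prepend it here.
MIRRORS = [
    "https://original-host.example",   # original source (placeholder)
    "https://cache.mtdata.example",    # our cache server (placeholder)
]

def fetch(path: str) -> bytes:
    """Try each mirror in order; return the first successful download."""
    last_err = None
    for base in MIRRORS:
        url = f"{base}/{path}"
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.URLError as err:
            last_err = err  # host down or unreachable; try the next mirror
    raise RuntimeError(f"All mirrors failed for {path}") from last_err
```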

This architecture would be similar to Maven with its central repository https://mvnrepository.com/repos/central
Maven has a nice way to uniquely identify packages: <group>:<artifact>:<version>
We would need something like: <group>:<artifact>:<languages>:<version> (a sketch follows the field list below)

Where

  • group is the domain or source name, e.g. Statmt, OPUS, Paracrawl, etc.
  • artifact is the name of a specific dataset from that domain, e.g. news_commentary, newstest, newsdev, paracrawl
  • languages are (source, target); we have to support variations of both source and target based on script and region (BCP47, #47)
  • version: as the name says; the default is v1
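
A rough sketch of what such an ID could look like as a data structure (the DatasetId class and parse_id function are illustrative placeholders, not mtdata's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetId:
    group: str      # source/domain name, e.g. "Statmt", "OPUS", "Paracrawl"
    artifact: str   # dataset name within the group, e.g. "news_commentary"
    languages: str  # language pair, e.g. "deu-eng"; BCP47 variants allowed
    version: str = "v1"

    def __str__(self) -> str:
        return f"{self.group}:{self.artifact}:{self.languages}:{self.version}"

def parse_id(text: str) -> DatasetId:
    group, artifact, languages, version = text.split(":")
    return DatasetId(group, artifact, languages, version)

# e.g. parse_id("Statmt:newstest:deu-eng:v2020")  (hypothetical ID)
```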

Also think about copyright / IP:

  • Caching and distributing files via our own server may not be permissible for many datasets.
  • If datasets are permitted for non-commercial use only, how do we restrict or warn users about this?
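
One possible shape for such a warning (the license tags and sets below are made-up placeholders, not actual dataset metadata):

```python
import warnings

# Hypothetical license tags attached to each dataset entry in the index.
NON_COMMERCIAL = {"CC-BY-NC-4.0", "CC-BY-NC-SA-4.0"}
NO_REDISTRIBUTION = {"custom-no-redistribute"}  # placeholder tag

def check_license(dataset_id: str, license_tag: str, serve_from_cache: bool) -> bool:
    """Return True if the dataset may be served from our cache server."""
    if serve_from_cache and license_tag in NO_REDISTRIBUTION:
        return False  # do not mirror; client must fetch from the original host
    if license_tag in NON_COMMERCIAL:
        warnings.warn(f"{dataset_id} is licensed for non-commercial use only "
                      f"({license_tag}); please verify your use case.")
    return True
```
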
@thammegowda added the enhancement (New feature or request) label on Jun 15, 2021
@kpu (Collaborator) commented Jun 22, 2021

Can we call it OPUS?

@thammegowda (Owner, Author)

I was thinking of more automation than what OPUS currently has.
If @Helsinki-NLP / @jorgtied agree with this plan, sure, we can make OPUS the cache server:

  • Automatically generate dataset pages on OPUS, and cache the missing ones.
    Currently, we lack a way to automatically retrieve license files; we will fix that in Licence info for datasets #69. I am not sure what else is required.
  • We need to establish consistency in referencing datasets across the various systems involved.
    I am thinking of <group>/<artifact>/<languages>/<version> as a solution (open to suggestions!).
    Then we can swap <group> from the original source to the cache, e.g. changing the group from Paracrawl to OPUS (see the sketch after this list).
  • We also need a way to automatically sync the cache index to the mtdata client. Currently, we crawl the OPUS site for dataset links.
    If we agree on <group>/<artifact>/<languages>/<version> as the dataset ID, and OPUS automatically picks up new datasets, then we don't need to crawl it; we can assume it exists at OPUS!
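
A sketch of that group-swap idea (the URL templates and resolve function below are invented for illustration; real server layouts will differ):

```python
# Hypothetical URL templates per group; the cache mirrors the same
# <artifact>/<languages>/<version> layout under its own prefix.
URL_TEMPLATES = {
    "Paracrawl": "https://paracrawl.example/{artifact}/{languages}/{version}",
    "OPUS": "https://opus.example/cache/{artifact}/{languages}/{version}",
}

def resolve(group: str, artifact: str, languages: str, version: str,
            prefer_cache: bool = False) -> str:
    """Map a dataset ID to a download URL, optionally swapping in the cache."""
    if prefer_cache:
        group = "OPUS"  # swap the original group for the cache server
    return URL_TEMPLATES[group].format(
        artifact=artifact, languages=languages, version=version)
```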

@jorgtied

Automatic caching would be nice. There could be some practical issues, but we can discuss how to solve them:

  • a cache would be great also to keep things alive if the original link is broken or unavailable - but the synchronization job should avoid overwriting the cached version if the original source is unavailable or empty
  • the licensing issue might be problematic
  • I would also like to continue getting certain datasets "officially" into OPUS - that would require some data processing. So far I have failed to introduce good procedures that make contributions easier.
  • the naming conventions in OPUS are rather <corpusname>/<version>/<format>/<lang/langpair> - would that work?

By the way, there is no need to crawl the OPUS website. There is https://opus.nlpl.eu/opusapi/ (note that we are working on a new version, because updating the DB broke recently due to some changes in storage, etc.)
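
For example, a download-link query against that API could look like the sketch below (the query parameters and the "corpora" response key reflect my reading of the current endpoint and may change with the new version mentioned above):

```python
import json
import urllib.parse
import urllib.request

def opus_links(corpus: str, src: str, tgt: str) -> list:
    """Query the OPUS API for download links of a corpus in a language pair."""
    params = urllib.parse.urlencode({
        "corpus": corpus, "source": src, "target": tgt,
        "preprocessing": "moses", "version": "latest",  # assumed defaults
    })
    url = f"https://opus.nlpl.eu/opusapi/?{params}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = json.load(resp)
    return data.get("corpora", [])  # list of file records with download URLs
```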

@thammegowda (Owner, Author) commented Jul 1, 2021

the naming conventions in OPUS are rather <corpusname>/<version>/<format>/<lang/langpair> - would that work?

Yes, that would work! Thanks. I will start using the OPUS API in the future.
