Support dataset caching on a server #68

Open
thammegowda opened this issue Jun 15, 2021 · 4 comments
Labels: enhancement (New feature or request)

Comments

@thammegowda (Owner)

Currently, if a dataset hosting server goes down, our library stops working for those datasets (see #66).

We have a local cache, but it won't proactively download datasets ahead of time, and it is confined to a single node (unless we scp the whole cache directory or place it on a network FS).

Solution / future enhancement:
We need a cache server! Assuming we can afford AND maintain a server with enough bandwidth and storage:

  1. Proactively download and cache all the datasets on a server that serves as a remote cache.
  2. Allow the mtdata client to access the files from the remote cache (a client-side fallback sketch follows the requirements list below).

Unlike OPUS, the server would be just a cache of files (no processing, no manual work once it is set up).

  • It should automatically download newly listed datasets when a new version of mtdata is released on PyPI.
  • Make it easy to host and serve the cache so we can have backup servers (if needed).
  • Permit others to host and use their own server if they don't trust our cache server.
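
A minimal sketch of that client-side fallback, assuming a configurable mirror list (all names and URLs below are placeholders, not mtdata's real internals):

```python
import urllib.request
import urllib.error

# Hypothetical mirror list: the original host first, then cache server(s).
# Users who run their own cache server could prepend it here.
MIRRORS = [
    "https://original-host.example",   # original source (placeholder)
    "https://cache.mtdata.example",    # our cache server (placeholder)
]

def fetch(path: str) -> bytes:
    """Try each mirror in order; return the first successful download."""
    last_err = None
    for base in MIRRORS:
        url = f"{base}/{path}"
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.URLError as err:
            last_err = err  # host down or unreachable; try the next mirror
    raise RuntimeError(f"All mirrors failed for {path}") from last_err
```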

This architecture would be similar to Maven with its central repository https://mvnrepository.com/repos/central
Maven has a nice way to uniquely identify packages: <group>:<artifact>:<version>
We would need something like: <group>:<artifact>:<languages>:<version> (a sketch follows the field list below)

Where

  • group is the domain or source name, e.g. Statmt, OPUS, Paracrawl, etc.
  • artifact is the name of a specific dataset from that domain, e.g. news_commentary, newstest, newsdev, paracrawl
  • languages are (source, target); we have to support variations of both source and target based on script and region (BCP47, #47)
  • version: as the name says; the default is v1
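
A rough sketch of what such an ID could look like as a data structure (the DatasetId class and parse_id function are illustrative placeholders, not mtdata's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetId:
    group: str      # source/domain name, e.g. "Statmt", "OPUS", "Paracrawl"
    artifact: str   # dataset name within the group, e.g. "news_commentary"
    languages: str  # language pair, e.g. "deu-eng"; BCP47 variants allowed
    version: str = "v1"

    def __str__(self) -> str:
        return f"{self.group}:{self.artifact}:{self.languages}:{self.version}"

def parse_id(text: str) -> DatasetId:
    group, artifact, languages, version = text.split(":")
    return DatasetId(group, artifact, languages, version)

# e.g. parse_id("Statmt:newstest:deu-eng:v2020")  (hypothetical ID)
```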

Also think about copyright / IP:

  • Caching and distributing files via our own server may not be permissible for many datasets.
  • If datasets are permitted for non-commercial use only, how do we restrict or warn users about this?
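
One possible shape for such a warning (the license tags and sets below are made-up placeholders, not actual dataset metadata):

```python
import warnings

# Hypothetical license tags attached to each dataset entry in the index.
NON_COMMERCIAL = {"CC-BY-NC-4.0", "CC-BY-NC-SA-4.0"}
NO_REDISTRIBUTION = {"custom-no-redistribute"}  # placeholder tag

def check_license(dataset_id: str, license_tag: str, serve_from_cache: bool) -> bool:
    """Return True if the dataset may be served from our cache server."""
    if serve_from_cache and license_tag in NO_REDISTRIBUTION:
        return False  # do not mirror; client must fetch from the original host
    if license_tag in NON_COMMERCIAL:
        warnings.warn(f"{dataset_id} is licensed for non-commercial use only "
                      f"({license_tag}); please verify your use case.")
    return True
```
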
@thammegowda added the enhancement (New feature or request) label on Jun 15, 2021
@kpu (Collaborator) commented Jun 22, 2021

Can we call it OPUS?

@thammegowda (Owner, Author)

I was thinking of more automation than what OPUS currently has.
If @Helsinki-NLP / @jorgtied agree with this plan, sure, we can make OPUS the cache server:

  • Automatically generate dataset pages on OPUS, and cache the missing ones.
    Currently, we lack a way to automatically retrieve license files; we will fix that in Licence info for datasets #69. I am not sure what else is required.
  • We need to establish consistency in referencing datasets across the various systems involved.
    I am thinking of <group>/<artifact>/<languages>/<version> as a solution (open to suggestions!).
    Then we can swap <group> from the original source to the cache, e.g. changing the group from Paracrawl to OPUS (see the sketch after this list).
  • We also need a way to automatically sync the cache index to the mtdata client. Currently, we crawl the OPUS site for dataset links.
    If we agree on <group>/<artifact>/<languages>/<version> as the dataset ID, and OPUS automatically picks up new datasets, then we don't need to crawl it; we can assume it exists at OPUS!
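
A sketch of that group-swap idea (the URL templates and resolve function below are invented for illustration; real server layouts will differ):

```python
# Hypothetical URL templates per group; the cache mirrors the same
# <artifact>/<languages>/<version> layout under its own prefix.
URL_TEMPLATES = {
    "Paracrawl": "https://paracrawl.example/{artifact}/{languages}/{version}",
    "OPUS": "https://opus.example/cache/{artifact}/{languages}/{version}",
}

def resolve(group: str, artifact: str, languages: str, version: str,
            prefer_cache: bool = False) -> str:
    """Map a dataset ID to a download URL, optionally swapping in the cache."""
    if prefer_cache:
        group = "OPUS"  # swap the original group for the cache server
    return URL_TEMPLATES[group].format(
        artifact=artifact, languages=languages, version=version)
```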

@jorgtied

Automatic caching would be nice. There could be some practical issues, but we can discuss how to solve them:

  • a cache would be great also to keep things alive if the original link is broken or unavailable - but the synchronization job should avoid overwriting the cached version if the original source is unavailable or empty
  • the licensing issue might be problematic
  • I would also like to continue getting certain datasets "officially" into OPUS - that would require some data processing. So far I have failed to introduce good procedures that make contributions easier.
  • the naming conventions in OPUS are rather <corpusname>/<version>/<format>/<lang/langpair> - would that work?

By the way, there is no need to crawl the OPUS website. There is https://opus.nlpl.eu/opusapi/ (note that we are working on a new version, because updating the DB broke recently due to some changes in storage, etc.)
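
For example, a download-link query against that API could look like the sketch below (the query parameters and the "corpora" response key reflect my reading of the current endpoint and may change with the new version mentioned above):

```python
import json
import urllib.parse
import urllib.request

def opus_links(corpus: str, src: str, tgt: str) -> list:
    """Query the OPUS API for download links of a corpus in a language pair."""
    params = urllib.parse.urlencode({
        "corpus": corpus, "source": src, "target": tgt,
        "preprocessing": "moses", "version": "latest",  # assumed defaults
    })
    url = f"https://opus.nlpl.eu/opusapi/?{params}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = json.load(resp)
    return data.get("corpora", [])  # list of file records with download URLs
```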

@thammegowda (Owner, Author) commented Jul 1, 2021

the naming conventions in OPUS are rather <corpusname>/<version>/<format>/<lang/langpair> - would that work?

Yes, that would work! Thanks. I will start using the OPUS API in the future.
