v2.4.1 vs v2.2 biocache load of GBIF DWcA - v2.4.1 sorting takes very long time (30x) #332

Open
jloomisVCE opened this issue May 6, 2019 · 1 comment

@jloomisVCE

I am working with biocache-store v2.4.1 within bioatlas/ala-docker. While building an alpha site, I attempted to bootstrap the db by loading a GBIF download of ~4.3 million records. In v2.4.1, 'biocache load drxx' appeared to hang after retrieving the zip file from the collectory and unzipping it. Looking at /data/biocache-load/drxx, the pre-processing step that creates, e.g., occurrence.txt-sorted was what took so long: 103 minutes.

I reverted to biocache-store v2.2 within the same bioatlas/ala-docker system. In that case, the same call to 'biocache load drxx' completed the pre-process sorting in 3 minutes.

I believe that the configuration parameters are the same for both, so the difference appears to be the released version.

See attached file.
2.2-vs-2.4.1-biocache-load-dr7-gbif-download.txt

@djtfmartin djtfmartin added the bug label Sep 19, 2019
@ansell
Contributor

ansell commented Oct 23, 2019

This is possibly a performance regression caused by an upstream fix that switched to safe CSV sorting. The previous method relied on the GNU coreutils sort program, which is unsafe for CSV because it sorts line by line and so only works if the files never contain quoted new-line characters.
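
To illustrate why line-based sorting is unsafe here, the following is a minimal Python sketch. It is not the biocache-store code, and the sample data and the function names `line_sort` and `record_sort` are made up for the example. Python's `sorted` stands in for any line-oriented sort; it does not reproduce GNU sort's locale-dependent collation.

```python
import csv
import io

# Hypothetical sample: the record with id 2 has a quoted field
# that contains a new-line character.
data = 'id,remarks\n2,"count:\n10 individuals"\n1,plain\n'


def line_sort(text):
    """Sort physical lines, the way a line-oriented sort tool would."""
    header, *rows = text.splitlines()
    return [header] + sorted(rows)


def record_sort(text):
    """Parse CSV records first, then sort whole records by the id column."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return [header] + sorted(reader, key=lambda row: row[0])


line_sorted = line_sort(data)
record_sorted = record_sort(data)

# Line-based sorting tears the multi-line record apart: the second
# half of record 2 now comes before its first half.
print(line_sorted)

# Record-aware sorting keeps the quoted new-line inside its field.
print(record_sorted)
```

The record-aware version has to parse every field before it can compare anything, which is one plausible reason a safe sort would be slower than a plain line sort. Whether that accounts for the full difference reported above is not established in this thread.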


3 participants