v2.4.1 vs v2.2 biocache load of GBIF DWcA - v2.4.1 sorting takes very long time (30x) #332

jloomisVCE · 2019-05-06T16:02:28Z

I am working with biocache-store v2.4.1 within bioatlas/ala-docker. While building an alpha site, I attempted to bootstrap the db by loading a GBIF download of ~4.3 million records. In v2.4.1, 'biocache load drxx' appeared to hang after retrieving the zip file from the collectory and unzipping it. Looking at /data/biocache-load/drxx, the pre-processing step that creates e.g. occurrence.txt-sorted was taking a long time: 103 minutes.

I reverted to biocache-store v2.2 within the same bioatlas/ala-docker system. In that case, the same call to 'biocache load drxx' completed the pre-process sorting in 3 minutes.

I believe that the configuration parameters are the same for both, so the difference appears to be the released version.

See attached file.
2.2-vs-2.4.1-biocache-load-dr7-gbif-download.txt

ansell · 2019-10-23T04:28:40Z

This is possibly a performance regression caused by an upstream fix that switched to safe CSV sorting. The previous method was unsafe: it used the GNU coreutils sort program and relied on CSV files never containing quoted new-line characters.
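
To illustrate why a line-oriented sort is unsafe for CSV, here is a minimal Python sketch. It is not biocache-store code; the sample data and the function names `line_sort` and `csv_sort` are made up for illustration. It sorts the same three records two ways: by physical line (roughly what a line-based sort tool does) and by parsed CSV record.

```python
import csv
import io

# Hypothetical CSV: the record with id 1 has a quoted new-line in its last field.
raw = 'id,remarks\n3,plain\n1,"line one\nline two"\n2,other\n'


def line_sort(text):
    """Sort physical lines, keeping the header first (line-oriented approach)."""
    header, *lines = text.splitlines()
    return [header] + sorted(lines)


def csv_sort(text):
    """Parse into CSV records first, then sort the records by the first column."""
    rows = list(csv.reader(io.StringIO(text)))
    header, records = rows[0], rows[1:]
    return [header] + sorted(records, key=lambda r: r[0])


naive = line_sort(raw)  # record 1 is split; its second half drifts to the end
safe = csv_sort(raw)    # record 1 stays intact, new-line included

for line in naive:
    print(repr(line))
for row in safe:
    print(row)
```

The line-based version separates `line two"` from the record it belongs to, which corrupts the file. The record-aware version has to parse every field before it can sort, which is one plausible reason a safe implementation would be slower, although the size of the slowdown reported here would still need profiling to explain.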

djtfmartin added the bug label Sep 19, 2019
