ExportFromIndexStream performance improvements required #216

Open
ansell opened this issue Jun 21, 2017 · 6 comments
@ansell
Contributor

ansell commented Jun 21, 2017

The performance of ExportFromIndexStream needs to be improved to reduce the time required for the monthly regeneration of the downloads.ala.org.au archives. It currently takes about 46 hours. Reducing that would allow it to run more often than once a month, keeping the downloads up to date with the biocache.

For reference, Generate GBIF Archives completes in under 4 hours, and it also hits every record.

@adam-collins
Contributor

To avoid CSV parsing issues, change the output format to TSV, or add TSV as an additional output format.
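
A minimal sketch of why TSV sidesteps the parsing problem, assuming embedded tabs and line breaks in values can be replaced with spaces. The function name `to_tsv_row` and the sample record are hypothetical and are not taken from biocache-store:

```python
def to_tsv_row(fields):
    """Join field values into one TSV line.

    Tabs, carriage returns and newlines inside a value are replaced with a
    single space, so a reader can split on tabs without any quoting rules.
    """
    cleaned = []
    for value in fields:
        text = "" if value is None else str(value)
        for ch in ("\t", "\r", "\n"):
            text = text.replace(ch, " ")
        cleaned.append(text)
    return "\t".join(cleaned)


# Hypothetical record: the second value contains commas and double quotes,
# which would need quoting in CSV but pass through unchanged in TSV.
record = ["urn:example:1", 'Mt "Lofty", Adelaide Hills, SA', "Eucalyptus\tobliqua"]
line = to_tsv_row(record)
fields_back = line.split("\t")
```

Splitting `line` on tabs gives back three fields, with the commas and quotes untouched. The trade-off is that tabs and line breaks inside values are lost, which may or may not be acceptable for the download archives.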

@ansell
Contributor Author

ansell commented Apr 27, 2018

@djtfmartin What qualifies this for the "idea" label? It is a serious enough issue that I have more than once considered using another codebase to do the exporting so that exporting does not interrupt the other data management activities.

@djtfmartin
Member

Thanks @ansell. I guess for me it was just a little vague as to what to do here. I'm sure there is a problem, but we need detail (and a plan) to action something.

@ansell
Contributor Author

ansell commented Apr 30, 2018 via email

@ansell
Contributor Author

ansell commented May 1, 2018

[Screenshot: 2018-05-01, 11:20:22 am]

As a reference, it is running right now, and using a consistent 40% CPU on cass-b4, including the Jenkins/biocache-store and the Cassandra CPU usage.

@djtfmartin
Member

I had a little look at the SOLR streaming API. Looks like we need to use docValues to make use of this feature.
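
For illustration, a rough sketch of what a request to Solr's `/export` handler looks like. That handler requires both `fl` and `sort`, and every field named in them needs `docValues="true"` in the schema. The base URL, the collection name `biocache`, and the field names below are assumptions made for the example, not the actual ALA configuration:

```python
from urllib.parse import urlencode


def build_export_url(solr_base, collection, fields, sort, query="*:*"):
    """Build a URL for Solr's /export request handler.

    Both `fl` and `sort` are mandatory for /export, and each field listed
    in them must be defined with docValues in the schema.
    """
    params = {
        "q": query,
        "fl": ",".join(fields),  # fields to stream back
        "sort": sort,            # e.g. "id asc"
    }
    return f"{solr_base.rstrip('/')}/{collection}/export?{urlencode(params)}"


# Hypothetical host, collection and field names.
url = build_export_url(
    "http://localhost:8983/solr",
    "biocache",
    ["id", "scientific_name"],
    "id asc",
)
```

Fields indexed without docValues would need a schema change and a reindex before they could be exported this way, so the set of fields in the download determines how much work this approach involves.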
