Experiencing extremely slow reads on TickStore -- fully executable example included. #895
Comments
I also ran a profiler on a read from TickStore, and below are a few relevant lines. Is there a different type that I should be saving the datetime index in?
ncalls tottime percall cumtime percall filename:lineno(function)
|
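For anyone who wants to collect a comparable profile, here is a minimal sketch using the standard library's cProfile and pstats. The `profile_call` helper and the `slow_sum` stand-in are hypothetical names for illustration; in practice the profiled callable would be the TickStore read.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_sum(n):
    # Hypothetical stand-in for the call being profiled.
    return sum(i * i for i in range(n))


result, report = profile_call(slow_sum, 100_000)
print(report)
```

The report uses the same column header as the profiler output quoted above (ncalls, tottime, percall, cumtime), which makes it easier to compare runs across pandas versions.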
Is there any other information I could provide that would help to better describe this issue? It is still a problem for me. Thanks! |
I did more testing on this, and the problem disappears when running pandas 0.25.3. It also disappears if you change the freq of the sample index generator to '1ns' (although the datetime values returned by TickStore will be incorrect if you feed it nanosecond-level data via the write method). It looks like pandas is handling the to_datetime call on line 388 of tickstore.py:
differently enough in v0.25.3 vs. the 1.x branch that the new version of pandas spends a lot of time in pandas._libs.tslib.array_with_unit_to_datetime, as shown by the profiler, where the old version does not. |
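To experiment with this outside of Arctic, the sketch below builds the same DatetimeIndex from integer epoch milliseconds in two ways: by passing `unit="ms"` to `pd.to_datetime`, and by casting to `datetime64[ms]` in NumPy first so that pandas receives datetime values rather than raw integers. This is an illustration under assumptions, not the code from tickstore.py: the function names and the sample values are made up, and whether the two paths differ in speed depends on the installed pandas version, so time them on your own setup.

```python
import numpy as np
import pandas as pd


def index_via_unit(epoch_ms):
    # pandas interprets the integers itself, using the unit argument.
    return pd.to_datetime(epoch_ms, unit="ms", utc=True)


def index_via_numpy_cast(epoch_ms):
    # NumPy reinterprets the integers as datetime64[ms] first, so pandas
    # is handed datetime values and only has to attach the UTC timezone.
    return pd.to_datetime(epoch_ms.astype("datetime64[ms]"), utc=True)


# Hypothetical sample: one million consecutive epoch-millisecond stamps.
start_ms = 1_600_000_000_000
sample = np.arange(start_ms, start_ms + 1_000_000, dtype=np.int64)

via_unit = index_via_unit(sample)
via_cast = index_via_numpy_cast(sample)

# Both paths should describe the same instants.
print(bool((via_unit == via_cast).all()))
```

Wrapping each function in `timeit` or `%timeit` under both pandas versions is a quick way to check whether the unit-based path is the bottleneck on a given install.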
I just installed and setup Arctic and noticed the slow read performance with the |
Yes, this makes me wonder if man financial (the creator and maintainer of this package) is really still using the old pandas 0.25 branch internally, or if nobody there is using TickStore. Otherwise, surely someone else would have discovered this. |
It's unfortunate, would've been nice to use. I also tested https://github.com/alpacahq/marketstore and it performs well. Would recommend trying it. I've been testing different libraries to determine which one to keep, continue using, and if it's outdated and unmaintained, update it myself. Have you tried other libraries? |
@crazy25000 thanks for the tip! Happy to continue this discussion, but I don't want to clutter up the github issue with it. My email is on my profile if you'd like to chat datastores further! |
This is still an outstanding issue for me, if there is anything else I can provide to help clarify this issue, please let me know. |
I'd bet if you downgrade pandas it will work better. This library isn't extensively used or tested on very recent pandas releases, and there have been cases in the past where behavior changed in pandas (for the worse) and made trivial operations in arctic take incredibly long (e.g. 5 ms to 30 seconds). |
@bmoscon you are correct: if I use the pandas 0.25 branch, the problem is solved. However, this creates pretty serious workflow issues. If a user wants to pull data with 0.25 but then work with the data using a current 1.x version of pandas, they need two different venvs and end up storing the data in some kind of intermediate layer, unless I'm missing an obvious and simpler workaround. It's possible, but it seems to defeat a lot of the benefit of arctic if I am dumping everything into parquet files with code running 0.25 and then loading the parquet files with a 1.x pandas to do the work. The most recent 0.25 release of pandas is from Oct 2019, a little stale at this point. But thank you for the reply; the point is taken that perhaps arctic just doesn't have complete support for 1.x pandas yet. |
I'm having issues with TickStore read speed too. Reading only around 205k rows takes around 1 min, while writing the data works without issues. Is there any way to read tick data (which usually has many thousands of rows) faster? Maybe using dask, modin, or another pandas version with higher speed. |
My solution is to replace line 338 to |
@JunyueLiu I tweaked your suggestion just a little bit to:
and now the reads are back to a normal speed, about 6.5M rows/sec. @JunyueLiu would you like to submit a PR since the fix was basically yours? If not I'll do it and credit you. I would think that this issue must be affecting a lot of people who would benefit from the fix being in a release. |
Feel free to submit the PR. |
I found this issue as well, and found that it can be quicker still by omitting the |
@CmpCtrl I tried your solution and did indeed get about 50% faster reads, about 9.5M rows/sec. Thanks! |
I started a branch to work on a couple of other things as well. mktz() seems really slow: my first call to get the max or min date from a symbol took ~0.6 seconds, and it seemed like most of that was in finding the local timezone. I also brought in the fixes from #887 so I could get back to the latest Python and pandas versions. I haven't done much testing and I am only using a small portion of the functionality, so I'm not sure how relevant these changes are to others. |
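If repeated timezone lookups do turn out to be the cost, one generic way to avoid paying it more than once is to memoize the lookup. The sketch below is only an illustration of that idea using dateutil (which pandas already depends on); `cached_tz` is a hypothetical helper, not Arctic's mktz() and not the change made in the branch.

```python
from functools import lru_cache

from dateutil import tz


@lru_cache(maxsize=None)
def cached_tz(name=None):
    """Hypothetical helper: resolve a timezone once and reuse the object."""
    if name is None:
        # Local timezone detection is the step reported as slow above.
        return tz.tzlocal()
    zone = tz.gettz(name)
    if zone is None:
        raise ValueError(f"unknown timezone: {name!r}")
    return zone


first = cached_tz()       # pays the lookup cost
second = cached_tz()      # served from the cache
print(first is second)
```

Note that a cache like this will not notice if the machine's local timezone changes while the process is running, which may or may not matter for a given workflow.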
Thanks, I am checking out your branch, those functions are useful to me! I need to be on the 1.x version of pandas for other reasons, and min and max date are also useful to me. I hope that at some point this project will be able to standardize on the more recent versions of python and pandas, but my feeling is that the main corporate owner of the project probably has their own internal versions that they use, and that's what it's being maintained for. |
I'm also seeing this problem. Picking up the fix from @jeffneuen 's repo fixed it for me. |
Arctic Version
Arctic Store
Platform and version
Ubuntu Linux 20.04, Python 3.8.8 (Anaconda), running JupyterLab
Modern CPU w/ NVMe
Description of problem and/or code sample that reproduces the issue
I am experiencing very slow Tickstore reads. In my sample code below, the write operation clocks at 1.2s for 5 million rows, which seems good. However, when I read the data, the read operation clocks at 59s.
[]
price float64
dtype: object
CPU times: user 1.16 s, sys: 64.1 ms, total: 1.22 s
Wall time: 1.26 s
CPU times: user 59.3 s, sys: 3.14 s, total: 1min 2s
Wall time: 59.7 s
On the read operation, the process seems to be cpu bound, with a single python thread pegged at 100%.
Not sure if I'm missing something obvious here, like using the wrong data types or something, but writes that are that many multiples faster than reads seem odd.
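The original sample code did not survive in this copy of the issue. As a stand-in, here is a minimal sketch of how comparable test data could be generated and how a call could be timed. The `make_ticks` and `timed` helpers, the `1ms` frequency, the start date, and the random prices are all assumptions for illustration, not the author's code; only the single float64 `price` column matches the dtype output shown above.

```python
import time

import numpy as np
import pandas as pd


def make_ticks(rows, freq="1ms", start="2021-01-01"):
    """Build a tz-aware DataFrame with one float64 'price' column."""
    index = pd.date_range(start=start, periods=rows, freq=freq, tz="UTC")
    rng = np.random.default_rng(0)  # fixed seed so runs are repeatable
    return pd.DataFrame({"price": rng.random(rows)}, index=index)


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    started = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - started


# A smaller frame than the 5 million rows in the report, to keep the demo quick.
ticks, build_seconds = timed(make_ticks, 100_000)
print(ticks.dtypes)
print(f"built {len(ticks)} rows in {build_seconds:.3f}s")
```

With a running MongoDB and an initialized TickStore library object (not shown here), the same `timed` helper could wrap that library's `write` and `read` calls to compare the two durations.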