
cache=True (default) in to_datetime causes too big slow-down #15736

Closed
2 tasks done
MarcoGorelli opened this issue Apr 18, 2024 · 2 comments · Fixed by #15826
Labels
P-medium Priority: medium performance Performance issues or improvements python Related to Python Polars

Comments

@MarcoGorelli
Collaborator

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from datetime import date

s = pl.datetime_range(date(2000, 1, 1), date(2001, 1, 1), '1s', eager=True)
strings = s.dt.strftime('%Y-%m-%dT%H:%M:%S')

# The following takes > 10 seconds
strings.str.to_datetime(format='%Y-%m-%dT%H:%M:%S')

# The following takes < 1 second
strings.str.to_datetime(format='%Y-%m-%dT%H:%M:%S', cache=False)

Log output

No response

Issue description

I don't think the cache should hurt performance. I understand the desire to speed things up when there are duplicates, but all-unique datetimes are a pretty common case too.

Is it possible to limit the cache so that it only holds, say, 50 elements? And then, any new element ejects an existing one from the cache?
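The idea suggested above could look something like the following sketch (hypothetical: the `BoundedCache` name, the 50-element default, and the FIFO eviction policy are illustrative assumptions, not Polars' actual implementation):

```python
from collections import OrderedDict

class BoundedCache:
    """Illustrative fixed-capacity cache: inserting into a full cache
    ejects the oldest entry (FIFO), so memory stays bounded even when
    every key is unique."""

    def __init__(self, capacity=50):
        self.capacity = capacity
        self.data = OrderedDict()

    def get_or_insert(self, key, compute):
        # Return the cached value, or compute, store, and possibly evict.
        if key in self.data:
            return self.data[key]
        if len(self.data) >= self.capacity:
            self.data.popitem(last=False)  # eject the oldest entry
        value = compute(key)
        self.data[key] = value
        return value

cache = BoundedCache(capacity=3)
for k in ["a", "b", "c", "d"]:
    cache.get_or_insert(k, str.upper)
print(list(cache.data))  # "a" was ejected once "d" arrived: ['b', 'c', 'd']
```

With a bound like this, the unique-datetimes case pays only a small constant overhead per lookup instead of growing an unbounded hashmap.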

Expected behavior

Similar performance

Installed versions

In [12]: pl.show_versions()
--------Version info---------
Polars:               0.20.21
Index type:           UInt32
Platform:             Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:               3.11.9 (main, Apr  6 2024, 17:59:24) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              15.0.2
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@MarcoGorelli MarcoGorelli added bug Something isn't working python Related to Python Polars needs triage Awaiting prioritization by a maintainer performance Performance issues or improvements and removed bug Something isn't working labels Apr 18, 2024
@ritchie46
Member

I think FastFixedCache should be used in CachedFunc. Currently it is an unlimited hashmap.

@reswqa
Collaborator

reswqa commented Apr 19, 2024

As we discussed, we should replace the hashmap in CachedFunc with FastFixedCache, with size sqrt(ca.len()). If no one beats me to it, I'll file a PR.
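The proposed fix can be illustrated roughly in Python. This is a hypothetical analogue (the real FastFixedCache is a Rust structure inside Polars; the slot-overwrite eviction policy and the class name here are assumptions for illustration): each key hashes to one slot, a colliding key overwrites whatever occupies that slot, and the slot count is sized as sqrt of the number of values to parse, so the cache never grows with the input.

```python
import math

class FastFixedCacheSketch:
    """Hypothetical sketch: a fixed number of slots, one entry per slot.
    A new key that hashes to an occupied slot evicts the previous entry,
    so memory is O(capacity) regardless of input length."""

    def __init__(self, n_values):
        # Size the cache as sqrt(n_values), per the comment above
        # (minimum of 1 slot).
        self.capacity = max(1, math.isqrt(n_values))
        self.slots = [None] * self.capacity  # each slot: (key, value) or None

    def get_or_insert(self, key, compute):
        i = hash(key) % self.capacity
        entry = self.slots[i]
        if entry is not None and entry[0] == key:
            return entry[1]  # cache hit
        value = compute(key)
        self.slots[i] = (key, value)  # miss: overwrite the slot (eviction)
        return value

# For 1_000_000 strings the cache holds at most 1000 entries.
cache = FastFixedCacheSketch(1_000_000)
print(cache.capacity)  # 1000
```

Unlike an unbounded hashmap, a miss costs one hash and one slot write, so the all-unique case degrades to roughly the cost of parsing alone.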

@reswqa reswqa added P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Apr 19, 2024