44 Improve `WhittakerSmooth`, `AirPls`, and `ArPls` performance #120

MothNik · 2024-05-24T13:10:41Z

This pull request primarily tackles issue #44, but it does not fully close it (see the second point of 🚶 Next Steps).
It should be squashed before merging because it's more than 100 commits.

🏗️ Main Feature Changes

🧑‍💻 Implementations

complete removal of sparse matrices from WhittakerSmooth, AirPls, and ArPls and transitioning to the more appropriate LAPACK banded storage format. Why? Because
- sparse matrices have a huge overhead and their initialisation alone takes longer (> 1 ... 10 ms) than LAPACK's banded solvers take for a complete smooth (mostly < 1 ms). In contrast, the banded storage can be achieved with dense NumPy arrays, which need only a small fraction of the initialisation time of sparse matrices.
- dense NumPy-Arrays allow for more efficient computation of the penalty matrix D.T @ D than what can be achieved via a sparse matrix, even though the logic becomes a bit more elaborate because symmetry has to be exploited while the redundant computations that make up most of the computation time have to be avoided.
- the sparse solver used before (SuperLU) was designed for sparse matrices that can have an arbitrary sparsity pattern, but the matrices for the three algorithms have a very defined sparsity pattern, namely they are banded (tridiagonal for difference order 1, pentadiagonal for difference order 2, and so on). Dedicated banded solvers offer the highest performance here because the algorithm can follow a very well-defined and straightforward pattern right from the start.
unification of the implementation of WhittakerSmooth, AirPls, and ArPls since they all rely on the same base algorithm and only differ in weighting strategies. Now, they all inherit from the class chemotools.utils._whittaker_base.main.WhittakerLikeSolver that handles all the underlying math once in a centralized place. The only thing the 3 transformer classes add now is a customized weighting strategy. Each of the classes uses different access points to the solver depending on the checks/preprocessing they need to conduct.
adding pentapy-support that will check for availability of pentapy at runtime and use its high-performance solver for all scenarios where the difference order is 2 (see ⏱️ Timings).
improvement of internal checks and type conversions via class variables (by coincidence related to BaselineShift fails on 2d array with dtype=int #87).
adding the respective documentation.
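
To make the banded-storage idea above concrete, here is a minimal sketch of a Whittaker smooth solved with SciPy's LAPACK wrapper `scipy.linalg.solve_banded`. It is not the code of this pull request: the function name `whittaker_smooth_banded` and its parameters are hypothetical, and the difference matrix is built densely only to keep the example short (the actual implementation avoids exactly this).

```python
import numpy as np
from scipy.linalg import solve_banded


def whittaker_smooth_banded(y, lam=100.0, diff_order=2, weights=None):
    """Solve (W + lam * D.T @ D) z = W y using LAPACK banded storage.

    Hypothetical helper for illustration only.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)

    # Dense finite difference matrix: simple to read, but O(n^2) memory.
    D = np.diff(np.eye(n), n=diff_order, axis=0)
    A = lam * (D.T @ D)
    A[np.diag_indices(n)] += w

    # LAPACK banded storage: ab[u + i - j, j] == A[i, j]
    u = diff_order  # number of sub- and superdiagonals
    ab = np.zeros((2 * u + 1, n))
    for k in range(-u, u + 1):  # k > 0: superdiagonal, k < 0: subdiagonal
        diag = np.diagonal(A, offset=k)
        if k >= 0:
            ab[u - k, k:] = diag
        else:
            ab[u - k, : n + k] = diag

    # Partially pivoted banded LU solve
    return solve_banded((u, u), ab, w * y)
```

Only the `2 * diff_order + 1` diagonals are handed to the solver, so the solve itself scales linearly with the number of data points.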

⏱️ Timings

In summary, the speedup with the minimum set of dependencies is ~5x for all algorithms. However, when pentapy is used, the speedup can be up to 15x. Since it is used for difference order 2 and this is the standard use case, this is quite some gain.
Yet, Rust-based implementations seem to be even faster, so we definitely did not reach the limit here.
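
The runtime check for pentapy could look roughly like the sketch below. This is an assumption about the general pattern, not the dispatch logic of this pull request: the names `PENTAPY_AVAILABLE` and `solve_pentadiagonal` are hypothetical, and the dense matrix is passed to `pentapy.solve` only for brevity.

```python
import importlib.util

import numpy as np
from scipy.linalg import solve_banded

# Checked once at import time; pentapy stays an optional dependency.
PENTAPY_AVAILABLE = importlib.util.find_spec("pentapy") is not None


def solve_pentadiagonal(a_dense, rhs):
    """Solve a pentadiagonal system, preferring pentapy when installed.

    Hypothetical helper for illustration only.
    """
    if PENTAPY_AVAILABLE:
        import pentapy as pp

        return pp.solve(a_dense, rhs)

    # Fallback: LAPACK banded LU via SciPy, ab[2 + i - j, j] == a[i, j]
    n = a_dense.shape[0]
    ab = np.zeros((5, n))
    for k in range(-2, 3):
        diag = np.diagonal(a_dense, offset=k)
        if k >= 0:
            ab[2 - k, k:] = diag
        else:
            ab[2 - k, : n + k] = diag
    return solve_banded((2, 2), ab, rhs)
```

Both branches return the same solution up to floating-point error, so callers do not need to know which solver was used.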

🌊 `WhittakerSmooth` with difference order 1

Speedup of ~5 to 6 times

🌊 `WhittakerSmooth` with difference order 2

Without pentapy - Speedup of ~5 times

With pentapy - Speedup of ~5 to 15 times

🏴‍☠️ `ArPls`

Without pentapy - Speedup of ~4 times

With pentapy - Speedup of ~5 to 15 times

🛩️ `AirPls` with polynomial order 1

Speedup of ~12 to 5 times

🛩️ `AirPls` with polynomial order 2

Speedup of ~12 to 5 times

With pentapy - Speedup of ~10 to 15 times

🚶 Next Steps

numerical stability of the banded solver (Partially Pivoted LU-decomposition) can only be achieved for difference order up to 2 when the size of the spectra grows to common sizes of 1000 to 10000. Beyond these difference orders, high lam-values are no longer possible. Many Whittaker smoother implementations out there suffer from this, but this is something that should be tackled by, e.g., also invoking banded QR-decomposition.
the baseline algorithms use an initialisation which can be far off from the true baseline. Therefore, they take a lot of iterations to converge. Having a good initial guess could solve the problem.

🎁 Additional features

Given that this was a lot of refactoring, the chance was used to enrich the WhittakerSmooth by

adding sample_weight keyword argument to transform and fit_transform to allow for locally weighting datapoints depending on their noise level. Basically, this puts the WhittakerSmooth on par with ArPls and AirPls which were already able to pass weights.
weights allow for automated determination of lam via maximization of the log marginal likelihood (same approach as for sklearn's GaussianProcessRegressor).
a function to estimate the local/global noise levels which can be used for the weighting required for the log marginal likelihood method (chemotools.smooth.estimate_noise_stddev or chemotools.utils.estimate_noise_stddev).
adding the possibility for a model-based specification of lam via chemotools.smooth.WhittakerSmoothLambda similar to SciPy's Bounds for specifying the bounds of parameters during optimizations
adding a SciPy-like wrapper for banded LU-decompositions (chemotools.utils._banded_linalg).
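
As a rough illustration of how a noise estimate can be turned into weights, the sketch below estimates a global noise standard deviation from second differences and derives inverse-variance weights. This is an assumption for illustration, not the algorithm behind `estimate_noise_stddev`; the helper names `mad_noise_stddev` and `weights_from_noise` are hypothetical.

```python
import numpy as np


def mad_noise_stddev(y):
    """Rough global noise estimate from second differences (hypothetical)."""
    d2 = np.diff(np.asarray(y, dtype=float), n=2)
    mad = np.median(np.abs(d2 - np.median(d2)))
    # 1.4826 rescales the MAD to a Gaussian standard deviation; sqrt(6)
    # undoes the variance inflation of the [1, -2, 1] difference kernel.
    return 1.4826 * mad / np.sqrt(6.0)


def weights_from_noise(sigma, n_points):
    """Inverse-variance weights for a constant noise level (hypothetical)."""
    return np.full(n_points, 1.0 / sigma**2)
```

Weights of this kind could then be passed as `sample_weight`, so that noisier points pull less on the smooth.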

📦📂 Package structure

the settings.json in the .vscode-folder was removed from the GIT version control. Having a file that can overwrite the user's local settings (which might contain much more than just formatting and linting settings) can be quite destructive. It was replaced by a settings_template.json that can provide the basic setup for the user.
the requirements.txt was split into a requirements.txt for the main package capabilities and a requirements-dev.txt for the dependencies needed during development. Migrate to pyptoject.toml #53 will profit from this, since then one can simply point to the requirements.txt from pyproject.toml without having to worry about the user accidentally installing pytest, matplotlib, etc.
basic linting via Ruff was configured in the pyproject.toml (requires settings.json to have Ruff configured). It reveals some unused imports, wrong type hints, and non-pythonic statements that should be tackled in the future.

✅❌ Tests

by including pytest-xdist, the tests can now be run in parallel which saves quite some time. The command I always used for running the tests is

pytest  --cov=chemotools .\tests  -n=auto --cov-report html -x

with pytest.mark.parametrize, the tests were extended by running the same test on multiple input combinations. This was especially useful for transformer functionality and numerics tests.
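
A parametrized test in this style might look like the following sketch. The function under test, `moving_average`, is a hypothetical stand-in and not part of chemotools; stacking two `parametrize` decorators runs the test on every combination of the listed values.

```python
import numpy as np
import pytest


def moving_average(x, window_size):
    """Hypothetical function under test: plain moving average."""
    kernel = np.ones(window_size) / window_size
    return np.convolve(x, kernel, mode="valid")


@pytest.mark.parametrize("window_size", [1, 3, 5])
@pytest.mark.parametrize("n_points", [10, 101])
def test_moving_average_preserves_constant(n_points, window_size):
    # A constant signal must come out unchanged, just shorter.
    x = np.full(n_points, 2.5)
    result = moving_average(x, window_size)
    assert result.shape == (n_points - window_size + 1,)
    assert np.allclose(result, 2.5)
```

With pytest-xdist installed, the six generated test cases can be distributed over the available cores via `pytest -n auto`.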

🪤 Miscellaneous

✍️ Fixed a typo and a type hint here and there

paucablop · 2024-05-24T16:00:33Z

@MothNik FANTASTIC - I have been waiting for this day with a lot of enthusiasm 🤓🤓!!

I am starting to review it right now, and it is a long review, but I hope I can have it done in about a month from now! It is a very exciting contribution. During the review process, we could also start considering how to add the different improvements to the documentation pages 😄

The restructure of the package is a good idea, it goes perfectly in line with #53, which I need to get done during the summer. Having the dev dependencies separated is a great starting point! Also nice to hear you have been using Ruff for linting, it was also on my todo list to transition from black. I did not know about pytest-xdist, but I have started testing it and... it is pretty cool, I like it a lot!😎

I think that now it is my turn, and I have some work to do 🥳🥳

MothNik · 2024-05-24T16:09:01Z

@paucablop You are highly welcome 😸

Yes, it's a lot of files. I'm sorry it turned so big 😅
Take all the time you need and just ping me for the documentation pages ✍️

I usually would not do package restructuring in a feature branch, but the branch required some setup for the development environment, especially for the tests ✅❌
I hope this will help for #53 and also #61 and make the installation easier 💾

As I said, take your time 😸

MothNik · 2024-05-25T22:23:15Z

I want to give special credits and thanks for the support by Guillaume Biessy, the author of Revisiting Whittaker-Henderson Smoothing which is - as far as I'm aware - the best review of the Whittaker-Henderson Smoothing out there because it is illustrated very well and focuses on the key points 🙏

- renamed all variables in `utils.banded_linalg` and `utils.finite_differences` and the respective tests to be more concise

- renamed all variables and function of the `_whittaker_base` and all related tests to make them more concise - removed `combination` in `pytest.mark.parametrize` in favour of individual variables which are more concise - made error checks explicit for error messages and not only the Exception type to avoid unintended error behaviour

- added error message pattern matching to the tests for models and input check utility functions

- renamed `test_for_utils` to `tests_for_utils`

added a guideline for proper use of `pytest`

added line on why tests for error handling are mandatory

MothNik · 2024-06-24T21:34:08Z

@paucablop
I'm done with the renaming of all the variables and functions to make them more readable.
Besides, I also added a tiny cheatsheet for testing with pytest as a README.

IruNikZe and others added 30 commits December 17, 2023 16:36

wip: [44] test push: untracked VSCode settings; added requirements and dev-requirements

3882236

wip: [44] attempt to remove settings.json from project

14d2748

feat: [44] added settings.json-template for also including ruff (linting)

9c10929

wip: [44] git-ignored .vscode-folder again to decouple user settings from repo

4db90e4

fix: [44] fixed wrong type hint; isort; black formatting; trim trailing whitespaces

935062c

feat: [44] added ruff linting checks; added requirements for development

ae352b3

feat: [44] added finite difference computation for Whittaker smoothers

67faaf3

feat: [44] added more banded linear algebra functions with scipys LAPACK wrappers for Whittaker smoothers

d13c5ce

feat: [44] added centralized Whittaker smoother class

0041382

feat: [44] unified whittaker-smoother-like methods with performance enhancement; extended tests; fixed type hints; black formatted; started lint fixes

c3dff48

style: [44] made docstrings of whittaker smoothers black-compatible; added references

995a964

Merge branch 'main' into 44-improve-airpls-and-arpls-performance-sparse-matrix-operations

cc325e3

wip: [44] figured out pentapy problems and saved working status for figuring out 100% fix

0437cc7

refactor: [44] changed paradigm in Whittaker smoothing to work with mixed Tikhonov regularisation; shotgun refactor of all respective modules; extended and parametrized tests; added more FIXMEs and TODOs; adapted docstrings

f37ffe7

fix: [44] added skip for pentapy tests

8f3b56f

refactor: [44] temporarily disabled long computation of squared finite difference matrix in favour of speed; added explanation of Whittaker smoothing

b9fd2e0

test push

fa0479a

fix: removed maturin from dev requirements

25b5682

feat: re-added improved version of banded LU-decomposition; added tests

11f4937

feat: parallelized pytest

d9bb39a

feat: added wrappers for a dataclass-based banded Cholesky decomposition and solve; added preliminary tests

556b92f

refactor: implemented fast computation of both squared forward finite differences matrix version that does not rely on `scipy.sparse` anymore to avoid overhead and redundant computations

064afcf

refactor: renamed decomposition model to solver model; added PentaPy model

097d4f0

feat: added new banded storage conversion function; added submodule header; added sections; renamed "LU-" to "LU "

bc2198b

refactor: removed Cholesky support

67815f9

feat: added submodule header; organized submodule; added automated smoother names enum

8cfa2f5

refactor: major refactoring of Whittaker smoother base class; removed Cholesky; adapted models; used iterators; used more efficient numerical operations; incorporated fast pre-computations; got rid off sparse matrix representations via SciPy

f7f6638

refactor: went for more secure way of specifying the penalty weight

6098848

refactor: replaced intermediate solver distributed method by a dictionary to remove one indirection with one branched logic

254f906

style: went for using ndarray-methods

7b5385f

MothNik added 7 commits May 21, 2024 00:48

test/refactor: included power failure into wrong input test of noise estimation; restructured reference finite difference kernels

77b9f6e

test/feat/refactor: added central finite differences test; made finite difference fixture smarter

f859bab

feat: exposed noise level estimation where it can be useful

7f13f9b

refactor/docs: added kwargs for extrapolator; fixed wrong docstrings and comments; improved docstring

b042e60

refactor: renamed window_size to window_length

e6e8405

doc: added proper Notes for weight selection strategies of `WhittakerSmooth`

c60c8e3

refactor: added psutil-install for pytest-xdist

4a1be7e

MothNik requested a review from paucablop May 24, 2024 13:10

MothNik linked an issue May 24, 2024 that may be closed by this pull request

Improve WhittakerSmooth, AirPLS, and ArPLS performance - sparse matrix operations #44

Open

MothNik self-assigned this May 24, 2024

paucablop added the 💪 enhancement New feature or request label May 24, 2024

paucablop added this to the v0.2.0 milestone May 24, 2024

Revert "refactor: renamed window_size to window_length"

109be51

This reverts commit e6e8405.

refactor: renamed window_length in docstring to window_size

124cf50

This was referenced May 26, 2024

Implement maximum entropy deconvolution #123

Open

[Enhancement] multiple “dependent variables” GeoStat-Framework/pentapy#11

Open

Enable automated window size determination of Savitzky-Golay and Mean filter (maybe also Median filter) #125

Open

MothNik added the ⚙️ maintainability Tasks that help to improve the maintainability label Jun 7, 2024

MothNik added 6 commits June 24, 2024 20:58

style: [44]

5697fd9

- renamed all variables in `utils.banded_linalg` and `utils.finite_differences` and the respective tests to be more concise

style: [44]

45b8224

- added error message pattern matching to the tests for models and input check utility functions

refactor: [44]

36cbf29

- renamed `test_for_utils` to `tests_for_utils`

feat: [44]

e905250

added a guideline for proper use of `pytest`

feat: [44]

3826871

added line on why tests for error handling are mandatory

MothNik added ✍️ documentation Improvements or additions to documentation 🏄 baseline correction Algorithms for baseline correction labels Jun 24, 2024
