
Implementation of Local Cache Algorithm Based on ARC #36376

Closed

Conversation

KinderRiven
Contributor

Changelog category (leave one):

  • Improvement

Changelog entry:

Compared with the LRU algorithm, a local cache algorithm based on ARC can effectively avoid cache pool pollution caused by one-time large-scale scans.
If this commit is accepted, we can support choosing a different local cache algorithm in the config file in subsequent commits.

@robot-ch-test-poll robot-ch-test-poll added the pr-improvement Pull request with some product improvements label Apr 18, 2022
@alesapin
Member

cc @kssenii

@alesapin alesapin added the can be tested Allows running workflows for external contributors label Apr 18, 2022
@kssenii kssenii self-assigned this Apr 18, 2022
@KinderRiven
Contributor Author

Something went wrong in testing, but I couldn't find out what caused the problem when I submitted the code. Can you help me see what's causing it? Or maybe it's due to the test system itself? Thank you! @kssenii

@KinderRiven
Contributor Author

KinderRiven commented Apr 19, 2022

The following figure compares the ARC and LRU algorithms in ClickHouse. We run SSB query Q1.1 four times (a small query that simulates small-scale, intensive user access), then run SSB Q2.1 once (a big query that may scan the whole table, but which is rare in real scenarios), and then run Q1.1 four more times.
As the figure shows, when Q1.1 is executed after Q2.1, the LRU-based system replaces almost all of the data in the buffer pool (which may be unnecessary, because large-scale loads are uncommon), so the subsequent Q1.1 runs show a significant latency rise. The ARC-based system does not have this problem, because it keeps frequently accessed cache data in a higher-level queue to avoid pollution.

[image: Q1.1 latency before and after Q2.1, ARC vs LRU]

@KinderRiven
Contributor Author

We can specify the cache algorithm in the configuration file to manage the local cache (the default is LRU).

[image: configuration file screenshot]
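For illustration, a disk configuration with a selectable cache algorithm might look like the sketch below. The `cache_algorithm` setting name is hypothetical — the actual setting introduced by this PR appears only in the screenshot above:

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <s3_cached>
                <type>s3</type>
                <data_cache_enabled>true</data_cache_enabled>
                <!-- hypothetical setting name, for illustration only -->
                <cache_algorithm>ARC</cache_algorithm>
            </s3_cached>
        </disks>
    </storage_configuration>
</clickhouse>
```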

@kssenii
Member

kssenii commented Apr 19, 2022

Something went wrong in testing, but I couldn't find out what caused the problem when I submitted the code. Can you help me see what's causing it? Or maybe it's due to the test system itself? Thank you! @kssenii

It was a temporary issue with our CI; it should be OK now. I just re-ran all checks, and they should start properly.

@kssenii
Member

kssenii commented Apr 20, 2022

Let's temporarily turn on your cache algorithm for our CI. Go to ClickHouse/tests/config/config.d/storage_conf.xml and change the disk config there. Then it will be applied for some stateless and stateful tests, and also for checks tagged s3 storage.

@KinderRiven
Contributor Author

Let's temporarily turn on your cache algorithm for our CI. Go to ClickHouse/tests/config/config.d/storage_conf.xml and change the disk config there. Then it will be applied for some stateless and stateful tests, and also for checks tagged s3 storage.

I may have found a system error caused by an assert in the local cache. I need to fix it first and then start subsequent testing.
PR

@KinderRiven
Contributor Author

KinderRiven commented Apr 21, 2022

Let's temporarily turn on your cache algorithm for our CI. Go to ClickHouse/tests/config/config.d/storage_conf.xml and change the disk config there. Then it will be applied for some stateless and stateful tests, and also for checks tagged s3 storage.

02240_system_remote_filesystem_cache and 02241_remote_filesystem_cache_on_insert failed, but after my investigation, it may not be due to the addition of the ARC algorithm: I rebased on the latest official master branch, and the failures still occur when retesting locally. My guess is that the tests' reference output was not updated in time after the local cache code changed. @kssenii

@KinderRiven
Contributor Author

After recommitting the official master code to test LRU, the test failures still occur, so I think it really is a CI problem.

src/Common/FileCacheFactory.cpp — resolved review comment (outdated)
tests/config/config.d/storage_conf.xml — resolved review comment (outdated)
src/Common/ARCFileCache.h — resolved review comment (outdated)
src/Common/ARCFileCache.cpp — resolved review comment
@KinderRiven
Contributor Author

Stateless tests flaky check (address, actions) — Timeout, fail: 0, passed: 150
The reason for the above timeout is that we added additional ARC tests, which may make the reserved time (at most 1800 seconds) insufficient.

@KinderRiven KinderRiven force-pushed the local_cache_with_arc_v1 branch 3 times, most recently from b72f4c2 to 9f9f14b Compare May 1, 2022 12:36
return cell.hit_count >= move_threshold;
}

bool ARCFileCache::tryMoveLowToHigh(const FileSegmentCell & cell, std::lock_guard<std::mutex> & cache_lock)
Member


I checked https://www.usenix.org/legacy/events/fast03/tech/full_papers/megiddo/megiddo.pdf and https://dbs.uni-leipzig.de/file/ARC.pdf and it does not really look like ARC described there. To what article did you refer?

Contributor Author


Sorry for not replying in time; I have been busy with some real-life matters. It does look that way, and I will optimize the implementation according to the paper.

@KinderRiven
Contributor Author

KinderRiven commented May 12, 2022

@kssenii I looked at some cache algorithm papers, and our implementation is probably closer to LRU-K (https://dl.acm.org/doi/pdf/10.1145/170036.170081). The ARC algorithm may not be well suited to the current situation, because it requires each cache page to have a fixed size (whereas file segment sizes are not fixed in the local cache implementation). We could first implement an LRU-K algorithm (which is still effective for workloads with scan flushing) and then consider other algorithms (such as an improved ARC algorithm). What do you think?

@kssenii
Member

kssenii commented May 23, 2022

Sorry for not replying for some time.

We could first implement an LRU-K algorithm (which is still effective for workloads with scan flushing) and then consider other algorithms (such as an improved ARC algorithm). What do you think?

I agree 👍🏻. I'll read the paper you mentioned and review once again.

src/Common/ARCFileCache.cpp — 6 resolved review comments
@KinderRiven
Contributor Author

Sorry for the late reply. I may have to close this PR for the following reasons:
[1] Avoiding cache pollution has already been addressed in my previous PR1 and PR2. For example, avoiding excessive downloading is actually similar to the LRU-K implementation.
[2] The current cache algorithm is not decoupled from the cache functionality, so implementing a new cache algorithm may introduce too much redundancy. For example, setting the query cache size and avoiding excessive downloads are currently only supported in LRUFileCache; a new caching algorithm would have to implement these functions again, which is obviously not good.
[3] In the future, if the caching algorithm can be decoupled from the caching functionality, I will introduce some new caching algorithms and consider how to achieve the above goals.

@alexey-milovidov
Member

@KinderRiven there is a work on decoupling caching strategy here: #34651
