DiskCache blobs aren't cleaned on overwrite #206

Open
RustyNova016 opened this issue Apr 23, 2024 · 8 comments

@RustyNova016

I am using a DiskCache with an inner vector, which results in files being written to the blob folder of the cache. However, those blob files aren't removed or overwritten when I overwrite a cache entry. This leads to inflated storage usage that I'd rather not have, since those structs are heavy.

Here's an example program to reproduce the bug:

use cached::{DiskCache, IOCached};
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
struct BigStruct {
    pub id: String,
    pub data: Vec<Vec<u128>>
}

fn main() {
    let cache: DiskCache<String, BigStruct> = DiskCache::new("test").build().unwrap();

    // Fill the cache with lots of data
    for i in 0..100 {
        let big_struct = BigStruct {
            id: i.to_string(),
            data: (0..100).map(|j| {
                (0..100).map(|k| {
                    i + j + k
                }).collect()
            }).collect()
        };

        // cache_remove doesn't do anything either!
        cache.cache_remove(&big_struct.id).unwrap();
        cache.cache_set(big_struct.id.clone(), big_struct).unwrap();
    }
}
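
For reference, a quick way to watch the growth is to sum the blob file sizes between runs. This is just a sketch using the standard library; the blobs/ path is an assumption and needs to point at wherever your DiskCache actually stores its files:

use std::fs;
use std::path::Path;

// Sum the sizes of the files directly inside the cache's blobs/ directory.
// The directory passed in is an assumption; adjust it to your cache location.
fn blobs_size(blobs_dir: &Path) -> std::io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(blobs_dir)? {
        let entry = entry?;
        if entry.file_type()?.is_file() {
            total += entry.metadata()?.len();
        }
    }
    Ok(total)
}

// e.g. println!("{} bytes", blobs_size(Path::new("test/blobs"))?);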
@RustyNova016
Author

Update:
DiskCacheBuilder::set_sync_to_disk_on_cache_change doesn't change anything.
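
For completeness, this is roughly how I set it (a sketch; I'm assuming the builder method chains and takes a bool):

// Same setup as the example above, but asking the builder to sync to disk
// on every cache change. The blob files still pile up.
let cache: DiskCache<String, BigStruct> = DiskCache::new("test")
    .set_sync_to_disk_on_cache_change(true)
    .build()
    .unwrap();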

@ronanM

ronanM commented Jul 1, 2024

Running this test multiple times seems to reproduce the problem of files accumulating in the blobs folder of the cache:

use cached::proc_macro::io_cached;

// Minimal error type so the snippet compiles on its own.
#[derive(Debug, PartialEq)]
enum TestError {
    DiskError(String),
}

#[io_cached(
    disk = true,
    time = 1000000,
    map_error = r##"|e| TestError::DiskError(format!("{:?}", e))"##,
    key = "u32",
    convert = r"{ n }",
)]
async fn cached_disk(n: u32) -> Result<Vec<u32>, TestError> {
    Ok((0..n).collect::<Vec<_>>())
}

#[tokio::test]
async fn test_cached_disk() {
    let n = 1_000_000;
    let expected = Ok((0..n).collect::<Vec<_>>());
    assert_eq!(cached_disk(n).await, expected);
    assert_eq!(cached_disk(n).await, expected);
}


@adundovi

Unfortunately, this issue makes DiskCache useless.

@jaemk
Owner

jaemk commented Jul 20, 2024

Thanks for pointing this out. I'm able to reproduce it as well. It looks like this is due to either a bug in, or the intended behavior of, the underlying sled::Db. It also looks like development has slowed down on that project. It would probably be best to change the backing store to something more reliable, like SQLite.

@jaemk
Owner

jaemk commented Jul 20, 2024

There is an ugly potential workaround until the storage backend is swapped. DiskCacheBuilder lets you set a specific disk directory (set_disk_directory) where each function's file cache will be stored. It looks like sled continuously creates a new copy of the blob data, with the file name being an incrementing integer. You could scan this directory on startup: for each sub-dir (the name of a function in all caps + _v1), scan the blobs/ dir, sort the files, and delete all but the latest one (either keep the largest integer name or check the file metadata for last created).
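
A rough sketch of that startup cleanup, using only the standard library (the directory layout described above is assumed, and the helper name is made up):

use std::fs;
use std::path::{Path, PathBuf};

// For each per-function sub-directory of the disk cache, keep only the file
// in blobs/ with the largest integer name and delete the rest.
fn prune_blobs(cache_root: &Path) -> std::io::Result<()> {
    for func_dir in fs::read_dir(cache_root)? {
        let blobs = func_dir?.path().join("blobs");
        if !blobs.is_dir() {
            continue;
        }
        // Collect (numeric name, path) pairs, skipping anything non-numeric.
        let mut files: Vec<(u64, PathBuf)> = fs::read_dir(&blobs)?
            .filter_map(|e| e.ok())
            .filter_map(|e| {
                let n = e.file_name().to_string_lossy().parse::<u64>().ok()?;
                Some((n, e.path()))
            })
            .collect();
        files.sort_by_key(|(n, _)| *n);
        files.pop(); // keep the newest blob (largest integer name), if any
        for (_, path) in files {
            fs::remove_file(path)?;
        }
    }
    Ok(())
}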

@adundovi

Thanks for pointing this out. I'm able to reproduce it as well. It looks like this is due to either a bug in, or the intended behavior of, the underlying sled::Db. It also looks like development has slowed down on that project. It would probably be best to change the backing store to something more reliable, like SQLite.

Thanks for the prompt reply.

It seems to me that sled::Db is working fine in the sense that it does speed up the function look-up operations, yet it duplicates the "blobs" as you mentioned. Regarding SQLite: it wouldn't be my first choice, since its feature set is overkill for this purpose and it isn't a cargo-friendly dependency.

There is an ugly potential workaround until the storage backend is swapped. DiskCacheBuilder lets you set a specific disk directory (set_disk_directory) where each function's file cache will be stored. It looks like sled continuously creates a new copy of the blob data, with the file name being an incrementing integer. You could scan this directory on startup: for each sub-dir (the name of a function in all caps + _v1), scan the blobs/ dir, sort the files, and delete all but the latest one (either keep the largest integer name or check the file metadata for last created).

I thought about something similar myself, but it really is an ugly workaround :-) I'll give it a try.

@ronanM

ronanM commented Jul 20, 2024

It would probably be best to change the backing store to something more reliable, like SQLite.

RocksDB may be a better choice.

@RustyNova016
Author

Personally, I have switched away from the sled-backed DiskCache at home and use a combination of cacache for storage and serde_rmp to convert to binary data.

Not sure if it's the best way though, but once this issue is sorted I'm definitely switching back.
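
Roughly what that looks like (just a sketch; the cacache async write/read functions and the rmp_serde helpers are assumptions about those crates' APIs):

use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
struct BigStruct {
    id: String,
    data: Vec<Vec<u128>>,
}

// Store and load a value with cacache (content-addressed on-disk storage)
// plus MessagePack serialization, instead of DiskCache.
async fn store(dir: &str, value: &BigStruct) -> Result<(), Box<dyn std::error::Error>> {
    let bytes = rmp_serde::to_vec(value)?;
    cacache::write(dir, &value.id, bytes).await?;
    Ok(())
}

async fn load(dir: &str, id: &str) -> Result<BigStruct, Box<dyn std::error::Error>> {
    let bytes = cacache::read(dir, id).await?;
    Ok(rmp_serde::from_slice(&bytes)?)
}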
