Support deletion vector #1094

Open
2 tasks
wjones127 opened this issue Jan 24, 2023 · 12 comments
Labels
enhancement New feature or request

Comments

@wjones127
Collaborator

Description

For protocol version 3, we will want to support deletion vectors.

  • Support reading tables with deletion vectors
  • Support delete operations using deletion vectors

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vectors

Question: how do we decide to rewrite vs use delete vector?

Use Case

This enables much faster deletes.

Related Issue(s)

Prerequisites:

@houqp
Member

houqp commented Jan 25, 2023

Question: how do we decide to rewrite vs use delete vector?

This looks like a tradeoff between faster reads and faster writes that needs to be decided case by case. If so, it might be better to just let the user decide based on the expected workload pattern.

@wjones127 wjones127 mentioned this issue Feb 5, 2023
21 tasks
@guyrt
Contributor

guyrt commented Feb 6, 2023

+1 to supporting user-owned tradeoff decision. I'm investigating this feature internally and update patterns in individual tables likely dictate the right decision.

For instance, in many dimension tables, edits may be spread randomly through existing data, and merge-on-read will be more efficient. For fact tables with a mostly append pattern (but occasional fact updates), judicious partitioning plus copy-on-write may be superior.
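
To make the tradeoff concrete, here is a minimal sketch of what a user-owned setting could look like. `DeleteMode`, `DeletePlan`, and `plan_delete` are hypothetical names invented for illustration; they are not an existing delta-rs API, and the row counts only approximate the real write cost.

```rust
/// Hypothetical user-facing option; not an existing delta-rs API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeleteMode {
    /// Rewrite every data file that contains a deleted row (favors reads).
    CopyOnWrite,
    /// Leave data files untouched and mark deleted rows in a deletion vector (favors writes).
    MergeOnRead,
}

/// What a delete has to write for one affected data file (illustrative only).
#[derive(Debug, PartialEq, Eq)]
enum DeletePlan {
    RewriteFile { rows_to_write: u64 },
    WriteDeletionVector { rows_marked: u64 },
}

/// Picks the work to do for one file, based purely on the user's chosen mode.
fn plan_delete(mode: DeleteMode, rows_in_file: u64, rows_deleted: u64) -> DeletePlan {
    match mode {
        DeleteMode::CopyOnWrite => DeletePlan::RewriteFile {
            // Every surviving row has to be written again.
            rows_to_write: rows_in_file.saturating_sub(rows_deleted),
        },
        DeleteMode::MergeOnRead => DeletePlan::WriteDeletionVector {
            // Only the deleted row positions are recorded.
            rows_marked: rows_deleted,
        },
    }
}
```

Deleting 10 rows from a 1,000,000-row file means rewriting 999,990 rows under copy-on-write but marking only 10 positions under merge-on-read; the price of the latter is that every later read has to filter those positions out.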

@aersam
Contributor

aersam commented Jul 5, 2023

I don't know if this helps, but I just tried reading a deletion vector file, and this seems to work with the roaring crate:

use std::fs::File;
use std::io::{ErrorKind, Read};

use roaring::RoaringTreemap;

// Magic number that prefixes every serialized deletion vector bitmap.
const DV_MAGIC: u32 = 1681511377;

fn get_deletion_vectors(
    filename: &str,
) -> Result<Vec<RoaringTreemap>, Box<dyn std::error::Error + Send + Sync>> {
    let mut file = File::open(filename)?;

    // The file starts with a single format-version byte.
    let mut version = [0u8; 1];
    file.read_exact(&mut version)?;
    assert_eq!(version[0], 1);

    let mut vectors = Vec::new();
    loop {
        // Each entry begins with a 4-byte big-endian payload size.
        // Hitting end-of-file here means there are no more vectors.
        let mut size_buf = [0u8; 4];
        match file.read_exact(&mut size_buf) {
            Ok(()) => {}
            Err(e) if e.kind() == ErrorKind::UnexpectedEof => return Ok(vectors),
            Err(e) => return Err(e.into()),
        }
        let data_size = u32::from_be_bytes(size_buf) as usize;

        // Read the whole payload up front so the file position lands on the
        // checksum regardless of how many bytes the deserializer consumes.
        let mut data = vec![0u8; data_size];
        file.read_exact(&mut data)?;

        // A 4-byte checksum follows the payload; it is skipped, not verified.
        let mut checksum = [0u8; 4];
        file.read_exact(&mut checksum)?;

        if data.len() < 4 {
            // Too short to hold the magic number, so there is no bitmap to decode.
            continue;
        }

        // The payload is a 4-byte little-endian magic number followed by the bitmap.
        let magic = u32::from_le_bytes(data[..4].try_into()?);
        assert_eq!(magic, DV_MAGIC);
        vectors.push(RoaringTreemap::deserialize_from(&data[4..])?);
    }
}
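
The snippet above reads the 4-byte checksum after each vector but never checks it. As a rough sketch of how that check could look: the Delta protocol describes the checksum as a CRC-32 of the vector's data, so the code below uses the standard CRC-32 (IEEE) algorithm with no extra crates. `checksum_matches` is a hypothetical helper, and both the big-endian byte order of the stored value and the exact bytes covered (assumed here to be the whole payload, magic number included) are assumptions to confirm against PROTOCOL.md.

```rust
/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320), standard library only.
fn crc32(bytes: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in bytes {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones if the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical helper: compares the stored checksum (assumed big-endian)
/// with the CRC-32 computed over the payload bytes.
fn checksum_matches(payload: &[u8], stored: [u8; 4]) -> bool {
    crc32(payload) == u32::from_be_bytes(stored)
}
```

With the payload bytes and the four checksum bytes of one vector in hand, `checksum_matches(&payload, checksum)` returns `false` for a corrupted entry, which a reader could turn into an error instead of decoding the bitmap.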

@aersam
Contributor

aersam commented Jul 10, 2023

Would you accept a PR that adds the required metadata as a first step?

@roeap
Collaborator

roeap commented Jul 10, 2023

Hi @aersam - first of all, thanks for the code snippet; it actually saved me a bit of time working on this elsewhere.

In principle we always welcome contributions. In this case we also do, but there is one caveat. Elsewhere we are currently working hard on getting delta-kernel for rust released which will hopefully significantly boost our protocol support.

The more complex part is that, in order to support deletion vectors, we have to either support reader v3 and writer v7 (i.e. table features) or support a whole bunch of other Delta features as well.

The good news is that we are actively working on it, but since this involves some larger blocks of work, it's likely going to be a few weeks before this can fully land...

With all that said, if you profit from having some intermediate partial support, I'd be happy to review PRs :)
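
As a rough illustration of the reader v3 / writer v7 point: with table features, a table lists the features it uses by name, and a reader must refuse tables that list a reader feature it does not implement. The `Protocol` struct below is a simplified stand-in written for this sketch, not the delta-rs type; the feature name `deletionVectors` is the one used by the Delta protocol.

```rust
use std::collections::HashSet;

/// Simplified stand-in for a table's `protocol` action (illustrative, not the delta-rs type).
struct Protocol {
    min_reader_version: i32,
    min_writer_version: i32,
    reader_features: HashSet<String>,
    writer_features: HashSet<String>,
}

/// True if the table declares deletion vectors through table features.
fn uses_deletion_vectors(p: &Protocol) -> bool {
    // Feature lists only apply from reader version 3 and writer version 7 onwards.
    p.min_reader_version >= 3
        && p.min_writer_version >= 7
        && p.reader_features.contains("deletionVectors")
        && p.writer_features.contains("deletionVectors")
}

/// True if every reader feature listed by the table is in the client's supported set.
fn can_read(p: &Protocol, supported_reader_features: &[&str]) -> bool {
    p.reader_features
        .iter()
        .all(|f| supported_reader_features.iter().any(|s| *s == f.as_str()))
}
```

A client that has not implemented deletion vectors would leave `"deletionVectors"` out of its supported list, so `can_read` returns `false` for such tables rather than silently returning deleted rows.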

@aersam
Contributor

aersam commented Jul 10, 2023

Well, if it's a matter of weeks I can wait. I know that column mapping would actually come first; I just thought this one couldn't be that hard ;)

I did not know about delta-kernel for rust, I'm really glad to hear about it! To be honest I was a bit disappointed as I thought it will be in Java - nothing against Java, but I much prefer Rust, especially for embedding. Where do I find the code for delta-kernel/rust? Just to observe it a bit

Btw, I also corrected the snippet; it had a bug when there are multiple vectors within a file.

@alippai

alippai commented Jul 10, 2023

@roeap where can one follow the Delta kernel initiatives? I saw delta-io/delta#1783 but that's not rust specific, right? Will it happen in this repo or will there be a delta-kernel-rs?

@aersam
Contributor

aersam commented Aug 1, 2023

I'm trying to get the metadata working here: https://github.com/bmsuisse/delta-rs/tree/deletion_vector_meta
Once you have the metadata, you could use it, for example, together with duckdb's read_parquet([parquets...],file_row_number=True) to read tables with deletion vectors.
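
The idea behind pairing the metadata with a per-file row number is that a deletion vector is just a set of row positions within one data file, so reading reduces to dropping the rows whose position is in that set. A minimal sketch, using a `BTreeSet<u64>` as a stand-in for the decoded bitmap (in practice this would be something like a `RoaringTreemap`); `live_rows` is a hypothetical helper name:

```rust
use std::collections::BTreeSet;

/// Keeps only the rows whose position within the file is not marked as deleted.
/// `deleted` stands in for a decoded deletion vector.
fn live_rows<T>(rows: Vec<T>, deleted: &BTreeSet<u64>) -> Vec<T> {
    rows.into_iter()
        .enumerate()
        // The enumeration index plays the role of the file row number.
        .filter(|(idx, _)| !deleted.contains(&(*idx as u64)))
        .map(|(_, row)| row)
        .collect()
}
```

For a file with rows `["a", "b", "c", "d"]` and a deletion vector containing positions 1 and 3, the rows that remain visible are `"a"` and `"c"`.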

wjones127 added a commit that referenced this issue Aug 15, 2023
# Description
This just adds the deletion vector metadata to the actions. It does not
interpret the metadata yet, so reading and writing deletion vectors is not
supported with this. Still, it enables use cases where you use delta-rs
just for metadata retrieval.

I have to add that I'm still learning Rust, and I expect this to take
some iterations until the code quality is sufficient.

# Related Issue(s)
Part of #1094 : Adds the required metadata

# Documentation


https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vectors

---------

Co-authored-by: Will Jones
@djouallah

FWIW, Fabric Data Warehouse just added support for deletion vectors, and suddenly the Delta tables it produces are no longer compatible with delta-rs :(

@boccileonardo

Is this feature still on the roadmap? Tables produced by recent Databricks runtimes include deletion vectors by default, so it seems to me that reading them natively through Rust-based solutions like polars is not currently possible.

@dylan-lee94

Running into the same issue: the latest Databricks runtimes have deletion vectors enabled by default, and our admin won't turn them off. This breaks our Python code that reads with DeltaTable or polars.

@djouallah

Running into the same issue: the latest Databricks runtimes have deletion vectors enabled by default, and our admin won't turn them off. This breaks our Python code that reads with DeltaTable or polars.

As a temporary workaround, duckdb does support reading Delta tables with deletion vectors through its delta extension, which is based on delta kernel rather than delta-rs.
