
ENH: Create index on reads, not on seeks #74

Open
pauldmccarthy opened this issue May 20, 2021 · 8 comments

Comments

@pauldmccarthy (Owner) commented May 20, 2021

Currently the index is built:

  • On calls to zran_seek(n) - the index is built to cover up to position n in the uncompressed data.
  • On successive calls to zran_read - when reading from position 0, one seek point is created at the beginning of the stream, but no further points are created. If another call to zran_read is then made, the index is first expanded to cover everything up to the position where the previous zran_read call finished (re-reading and re-decompressing data from the beginning of the file/stream), and the read is then initialised from that point.

This is complicated and inefficient - zran_seek should just update an internal offset into the uncompressed data, without doing anything else. Then zran_read should both read data and build the index in a single pass.

To support pre-generating an index, without reading any data, zran_read could allow buf to be NULL, which would cause it to read/build the index, but not copy any data.
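For illustration, here is a minimal pure-Python sketch of that proposed control flow. The class and parameter names (SinglePassReader, discard) are made up for this sketch and are not part of the zran API; a real implementation would also store the 32 KB deflate window and bit offset at each seek point, support non-sequential reads, and buffer any decompressed overshoot.

import zlib

class SinglePassReader:
    """Sketch: seek() only records an offset; read() decompresses
    forward and builds the seek-point index as a side effect."""

    def __init__(self, fileobj, spacing=256 * 1024):
        self.fileobj     = fileobj
        self.spacing     = spacing
        self.seek_points = []   # (uncompressed offset, compressed offset)
        self.offset      = 0    # logical position in uncompressed data
        self._in_off     = 0    # compressed bytes consumed so far
        self._out_off    = 0    # uncompressed bytes produced so far
        self._dec        = zlib.decompressobj(zlib.MAX_WBITS | 16)  # gzip

    def seek(self, offset):
        # Proposed zran_seek: update the internal offset, nothing else.
        self.offset = offset

    def read(self, nbytes, discard=False):
        # Proposed zran_read: read data and build the index in a single
        # pass (this toy only supports strictly sequential reads).
        # discard=True is the "buf may be NULL" idea - the index is
        # extended, but no data is copied out.
        out = bytearray()
        while len(out) < nbytes:
            chunk = self.fileobj.read(65536)
            if not chunk:
                break
            data           = self._dec.decompress(chunk)
            self._in_off  += len(chunk)
            self._out_off += len(data)
            out           += data
            # Create a seek point every `spacing` uncompressed bytes.
            if (not self.seek_points or
                    self._out_off - self.seek_points[-1][0] >= self.spacing):
                self.seek_points.append((self._out_off, self._in_off))
        self.offset += min(nbytes, len(out))
        # NB: a real implementation would buffer out[nbytes:] for the
        # next read rather than discard it.
        return b'' if discard else bytes(out[:nbytes])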

@piskvorky commented Mar 26, 2023

@pauldmccarthy also bitten by this… is it enough if I manually call seek(tell()) after each read, or is the workaround more involved?

EDIT: OK, I tried the above and it didn't work, in the sense that raw.seek_points() didn't report the creation of new seek points where I expected them (i.e. after read() + seek(), in rough correspondence with spacing), but rather, strangely, in bulk. Logging showed that raw.seek_points() went from 0 items to 108 items in one jump, with over 30 MB (compressed) of various read() and seek() calls in between producing no seek points.

I guess I don't understand how this works, will have to study the code in detail.

@pauldmccarthy (Owner, Author)

Hi @piskvorky, I don't fully understand the problem you are experiencing - can you paste some example code to use as a basis for discussion?

@piskvorky commented Mar 27, 2023

Sure, thanks for looking into this. I created this minimal example:

import io
import tarfile
import random
from string import ascii_letters

from indexed_gzip import IndexedGzipFile


def generate_tgz(path, num_files=10000, file_size=5*1024):
    """Create a .tgz file containing lots of random files inside."""
    random.seed(42)
    with tarfile.open(name=path, mode='w:gz') as tf:
        for fileno in range(num_files):
            content = (''.join(random.choice(ascii_letters) for i in range(file_size))).encode('utf8')
            ti = tarfile.TarInfo(name=f'{fileno}.txt')
            ti.size = len(content)
            tf.addfile(ti, io.BytesIO(content))


if __name__ == '__main__':
    fname = '/tmp/test_igzip.tgz'
    generate_tgz(fname)

    # Simulate data coming from a stream. In reality, this is from network and not a local disk file.
    raw_obj = open(fname, mode='rb')
    gzip_wrapped = IndexedGzipFile(fileobj=raw_obj, mode='rb', auto_build=True, spacing=256 * 1024, buffer_size=4 * 1024**2)
    tf = tarfile.open(fileobj=gzip_wrapped, mode='r')

    # Simulate advancing through the archive, extracting members.
    # My expectation was that seek points are being created periodically, so we can seek back later.
    last_seek_points = -1
    while True:
        fileinfo = tf.next()
        if not fileinfo:
            break
        content = tf.extractfile(fileinfo)
        gzip_wrapped.seek(gzip_wrapped.tell())  # a no-op seek, attempting to force creating a seek point
        seek_points = list(gzip_wrapped.raw.seek_points())
        if last_seek_points != len(seek_points):
            print(f"at file {fileinfo.name}, {len(seek_points)} seek points, last one {seek_points[-1] if seek_points else '-'}")
            last_seek_points = len(seek_points)

The output is:

at file 0.txt, 0 seek points, last one -
at file 745.txt, 20 seek points, last one (5082521, 3382157)
at file 1490.txt, 33 seek points, last one (8560026, 5697309)
at file 2235.txt, 49 seek points, last one (12840538, 8547225)
at file 2980.txt, 64 seek points, last one (16853227, 11218571)
at file 3725.txt, 80 seek points, last one (21133304, 14068083)
at file 4470.txt, 96 seek points, last one (25413899, 16917874)
at file 5215.txt, 111 seek points, last one (29426378, 19589398)
at file 5960.txt, 127 seek points, last one (33706053, 22438615)
at file 6705.txt, 143 seek points, last one (37985739, 25287739)
at file 7450.txt, 158 seek points, last one (41997656, 27958819)
at file 8195.txt, 174 seek points, last one (46277728, 30808391)
at file 8940.txt, 190 seek points, last one (50559314, 33658717)
at file 9685.txt, 205 seek points, last one (54571040, 36329719)

… while I expected the first seek point to appear around spacing=256KB, rather than after 5 MB.

Basically my mental model for when seek points are created is broken.

@piskvorky

My running hypothesis is that it's somehow related to input buffering – either inside tarfile, or igzip's buffer (set to 4 MB above).
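One way to test that hypothesis might be to bypass the io.BufferedReader wrapper and read via gzip_wrapped.raw directly. A sketch, reusing fname from the example above; it assumes the raw object supports plain read() and tell() calls:

# Read through the unbuffered object in small increments. If seek points
# now appear roughly every `spacing` bytes, the bulk jumps above were an
# artifact of the 4 MB buffer_size wrapper reading far ahead of tarfile.
raw_obj      = open(fname, mode='rb')
gzip_wrapped = IndexedGzipFile(fileobj=raw_obj, mode='rb', auto_build=True,
                               spacing=256 * 1024)

npoints = 0
while True:
    data = gzip_wrapped.raw.read(64 * 1024)  # bypass the buffered wrapper
    if not data:
        break
    points = list(gzip_wrapped.raw.seek_points())
    if len(points) != npoints:
        print(f'{len(points)} seek points after reading '
              f'{gzip_wrapped.raw.tell()} uncompressed bytes')
        npoints = len(points)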

@pauldmccarthy (Owner, Author)

Hi @piskvorky, I'm sorry, I still don't understand what the problem is here - it looks to me like the seek points are being created at approximately 256 KB intervals? e.g.:

In [5]: sp = list(gzip_wrapped.seek_points())
In [6]: sp
Out[6]:
[(0, 25),
 (267776, 178048),
 (534958, 355963),
 (802579, 533854),
 (1069800, 711855),
 (1337605, 889908),
 ...

@piskvorky commented Mar 29, 2023

The problem is that seek points are not being created as the file is being read (which is why I'm hijacking this particular issue - I thought it was related).

Instead, there are no seek points at all, until there are 20 seek points all at once, after ~5 MB have been read (20 × 256 KB ≈ 5 MB).

In other words, at no point is there a single seek point, then two seek points, then three, etc. The number of seek points jumps straight from 0 to 20 to 33 to 49 to 64…

This is despite my explicitly calling seek() in the example code after reading each file, precisely to make sure that seek points are created smoothly. But they're not.

@pauldmccarthy (Owner, Author)

Ok, sorry for my slowness in understanding. Unfortunately, I came to the conclusion that resolving this particular issue would require a major refactor (explanation here).

I originally wrote this library to work with regular on-disk files, and never envisaged it being applied to data streams. It wasn't too difficult to bolt on support for in-memory file-likes (thanks to the work by @epicfaace in #55), but I feel that adding true support for streamed data would be a substantial undertaking.

To be honest, I am leaning towards the idea of a from-scratch re-write in pure Python (using ctypes for the zlib interactions that aren't covered by the standard zlib module - I really should have done this from the outset, rather than use C). The logic in zran.c is not that complex, and would be so much cleaner and easier to maintain in Python than in its current form.
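For the record, a fair amount of the core trick seems expressible with the standard zlib module alone: the deflate window at a seek point is just the last 32 KB of uncompressed output, and a fresh decompressor can be primed with it via the zdict argument in raw-deflate mode. A sketch, assuming the seek point falls on a byte-aligned deflate block boundary - handling arbitrary bit offsets needs zlib's inflatePrime(), and finding block boundaries needs the Z_BLOCK flush mode, neither of which the zlib module exposes; those are the places where ctypes would come in.

import zlib

WINDOW_SIZE = 32 * 1024  # maximum deflate window size

def make_seek_point(compressed_off, uncompressed_off, output_so_far):
    # A seek point is the two offsets plus the last 32 KB of uncompressed
    # data, which is exactly the deflate window at that position.
    return (compressed_off, uncompressed_off, output_so_far[-WINDOW_SIZE:])

def resume_at(fileobj, seek_point):
    # Resume decompression at a stored seek point. Assumes the compressed
    # offset sits on a byte-aligned deflate block boundary; zran.c handles
    # arbitrary bit offsets with inflatePrime(), which Python's zlib
    # module does not expose.
    compressed_off, uncompressed_off, window = seek_point
    fileobj.seek(compressed_off)
    # wbits=-15 selects raw deflate; zdict primes the 32 KB window so that
    # back-references to data before the seek point still resolve.
    return zlib.decompressobj(wbits=-15, zdict=window), uncompressed_off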

Unfortunately I can't envisage having the time to sit down and do this work any time soon, as I am continually tied up with unrelated work (and most of my free time is spent away from the screen!). However, if I don't get around to it, I am more than happy to support/review work done by somebody else.

@piskvorky commented Mar 29, 2023

I understand, thank you for your support. Being maximally away from the screen and juggling time between open source & family is also something I aspire to!
