
ENH: Create index on reads, not on seeks #74

Open
pauldmccarthy opened this issue May 20, 2021 · 8 comments

Comments

@pauldmccarthy (Owner) commented May 20, 2021

Currently the index is built:

  • On calls to zran_seek(n) - the index is built to cover up to position n in the uncompressed data.
  • On successive calls to zran_read - when reading from position 0, one seek point is created at the beginning of the stream, but no further points are created. If another call to zran_read is then made, the index is first expanded to cover everything up to the position where the previous zran_read call finished (re-reading and re-decompressing data from the beginning of the file/stream), and the read is then initialised from that point.

This is complicated and inefficient - zran_seek should just update an internal offset into the uncompressed data, without doing anything else. Then zran_read should both read data and build the index in a single pass.

To support pre-generating an index, without reading any data, zran_read could allow buf to be NULL, which would cause it to read/build the index, but not copy any data.
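For illustration, here is a minimal pure-Python sketch of that proposed control flow. The class and parameter names (SinglePassReader, discard) are made up for this sketch and are not part of the zran API; a real implementation would also store the 32 KB deflate window and bit offset at each seek point, support non-sequential reads, and buffer any decompressed overshoot.

import zlib

class SinglePassReader:
    """Sketch: seek() only records an offset; read() decompresses
    forward and builds the seek-point index as a side effect."""

    def __init__(self, fileobj, spacing=256 * 1024):
        self.fileobj     = fileobj
        self.spacing     = spacing
        self.seek_points = []   # (uncompressed offset, compressed offset)
        self.offset      = 0    # logical position in uncompressed data
        self._in_off     = 0    # compressed bytes consumed so far
        self._out_off    = 0    # uncompressed bytes produced so far
        self._dec        = zlib.decompressobj(zlib.MAX_WBITS | 16)  # gzip

    def seek(self, offset):
        # Proposed zran_seek: update the internal offset, nothing else.
        self.offset = offset

    def read(self, nbytes, discard=False):
        # Proposed zran_read: read data and build the index in a single
        # pass (this toy only supports strictly sequential reads).
        # discard=True is the "buf may be NULL" idea - the index is
        # extended, but no data is copied out.
        out = bytearray()
        while len(out) < nbytes:
            chunk = self.fileobj.read(65536)
            if not chunk:
                break
            data           = self._dec.decompress(chunk)
            self._in_off  += len(chunk)
            self._out_off += len(data)
            out           += data
            # Create a seek point every `spacing` uncompressed bytes.
            if (not self.seek_points or
                    self._out_off - self.seek_points[-1][0] >= self.spacing):
                self.seek_points.append((self._out_off, self._in_off))
        self.offset += min(nbytes, len(out))
        # NB: a real implementation would buffer out[nbytes:] for the
        # next read rather than discard it.
        return b'' if discard else bytes(out[:nbytes])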

@piskvorky commented Mar 26, 2023

@pauldmccarthy also bitten by this… is it enough if I manually call seek(tell()) after each read, or is the workaround more involved?

EDIT: OK, I tried the above and it didn't work, in the sense that raw.seek_points() didn't report the creation of new seek points where I expected them (i.e. after read() + seek(), in rough correspondence with spacing), but rather, strangely, in bulk. Logging showed that raw.seek_points() went from 0 items to 108 items in one jump, with over 30 MB (compressed) of various read() and seek() calls in between producing no seek points.

I guess I don't understand how this works, will have to study the code in detail.

@pauldmccarthy (Owner, Author)

Hi @piskvorky, I don't fully understand the problem you are experiencing - can you paste some example code to use as a basis for discussion?

@piskvorky commented Mar 27, 2023

Sure, thanks for looking into this. I created this minimal example:

import io
import tarfile
import random
from string import ascii_letters

from indexed_gzip import IndexedGzipFile


def generate_tgz(path, num_files=10000, file_size=5*1024):
    """Create a .tgz file containing lots of random files inside."""
    random.seed(42)
    with tarfile.open(name=path, mode='w:gz') as tf:
        for fileno in range(num_files):
            content = (''.join(random.choice(ascii_letters) for i in range(file_size))).encode('utf8')
            ti = tarfile.TarInfo(name=f'{fileno}.txt')
            ti.size = len(content)
            tf.addfile(ti, io.BytesIO(content))


if __name__ == '__main__':
    fname = '/tmp/test_igzip.tgz'
    generate_tgz(fname)

    # Simulate data coming from a stream. In reality, this is from network and not a local disk file.
    raw_obj = open(fname, mode='rb')
    gzip_wrapped = IndexedGzipFile(fileobj=raw_obj, mode='rb', auto_build=True, spacing=256 * 1024, buffer_size=4 * 1024**2)
    tf = tarfile.open(fileobj=gzip_wrapped, mode='r')

    # Simulate advancing through the archive, extracting members.
    # My expectation was that seek points are being created periodically, so we can seek back later.
    last_seek_points = -1
    while True:
        fileinfo = tf.next()
        if not fileinfo:
            break
        content = tf.extractfile(fileinfo)
        gzip_wrapped.seek(gzip_wrapped.tell())  # a no-op seek, attempting to force creating a seek point
        seek_points = list(gzip_wrapped.raw.seek_points())
        if last_seek_points != len(seek_points):
            print(f"at file {fileinfo.name}, {len(seek_points)} seek points, last one {seek_points[-1] if seek_points else '-'}")
            last_seek_points = len(seek_points)

The output is:

at file 0.txt, 0 seek points, last one -
at file 745.txt, 20 seek points, last one (5082521, 3382157)
at file 1490.txt, 33 seek points, last one (8560026, 5697309)
at file 2235.txt, 49 seek points, last one (12840538, 8547225)
at file 2980.txt, 64 seek points, last one (16853227, 11218571)
at file 3725.txt, 80 seek points, last one (21133304, 14068083)
at file 4470.txt, 96 seek points, last one (25413899, 16917874)
at file 5215.txt, 111 seek points, last one (29426378, 19589398)
at file 5960.txt, 127 seek points, last one (33706053, 22438615)
at file 6705.txt, 143 seek points, last one (37985739, 25287739)
at file 7450.txt, 158 seek points, last one (41997656, 27958819)
at file 8195.txt, 174 seek points, last one (46277728, 30808391)
at file 8940.txt, 190 seek points, last one (50559314, 33658717)
at file 9685.txt, 205 seek points, last one (54571040, 36329719)

… while I expected the first seek point to appear around spacing=256KB, rather than after 5 MB.

Basically my mental model for when seek points are created is broken.

@piskvorky

My running hypothesis is that it's somehow related to input buffering – either inside tarfile, or igzip's buffer (set to 4 MB above).
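One way to test that hypothesis might be to bypass the io.BufferedReader wrapper and read via gzip_wrapped.raw directly. A sketch, reusing fname from the example above; it assumes the raw object supports plain read() and tell() calls:

# Read through the unbuffered object in small increments. If seek points
# now appear roughly every `spacing` bytes, the bulk jumps above were an
# artifact of the 4 MB buffer_size wrapper reading far ahead of tarfile.
raw_obj      = open(fname, mode='rb')
gzip_wrapped = IndexedGzipFile(fileobj=raw_obj, mode='rb', auto_build=True,
                               spacing=256 * 1024)

npoints = 0
while True:
    data = gzip_wrapped.raw.read(64 * 1024)  # bypass the buffered wrapper
    if not data:
        break
    points = list(gzip_wrapped.raw.seek_points())
    if len(points) != npoints:
        print(f'{len(points)} seek points after reading '
              f'{gzip_wrapped.raw.tell()} uncompressed bytes')
        npoints = len(points)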

@pauldmccarthy (Owner, Author)

Hi @piskvorky, I'm sorry, I still don't understand what the problem is here - it looks to me like the seek points are being created at approximately 256 KB intervals? e.g.:

In [5]: sp = list(gzip_wrapped.seek_points())
In [6]: sp
Out[6]:
[(0, 25),
 (267776, 178048),
 (534958, 355963),
 (802579, 533854),
 (1069800, 711855),
 (1337605, 889908),
 ...

@piskvorky commented Mar 29, 2023

The problem is that seek points are not being created as the file is being read (which is why I'm hijacking this particular issue - I thought it was related).

Instead, there are no seek points at all, until there are 20 seek points all at once, after ~5 MB have been read (20 × 256 KB ≈ 5 MB).

In other words, at no point is there a single seek point, then two seek points, then three, etc. The number of seek points jumps straight from 0 to 20 to 33 to 49 to 64…

This is despite my explicitly calling seek() in the example code after reading each file, precisely to make sure that seek points are created smoothly. But they're not.

@pauldmccarthy (Owner, Author)

Ok, sorry for my slowness in understanding. Unfortunately, I came to the conclusion that resolving this particular issue would require a major refactor (explanation here).

I originally wrote this library to work with regular on-disk files, and never envisaged it being applied to data streams. It wasn't too difficult to bolt on support for in-memory file-likes (thanks to the work by @epicfaace in #55), but I feel that adding true support for streamed data would be a substantial undertaking.

To be honest, I am leaning towards the idea of a from-scratch re-write in pure Python (using ctypes for the zlib interactions that aren't covered by the standard zlib module - I really should have done this from the outset, rather than use C). The logic in zran.c is not that complex, and would be so much cleaner and easier to maintain in Python than in its current form.
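For the record, a fair amount of the core trick seems expressible with the standard zlib module alone: the deflate window at a seek point is just the last 32 KB of uncompressed output, and a fresh decompressor can be primed with it via the zdict argument in raw-deflate mode. A sketch, assuming the seek point falls on a byte-aligned deflate block boundary - handling arbitrary bit offsets needs zlib's inflatePrime(), and finding block boundaries needs the Z_BLOCK flush mode, neither of which the zlib module exposes; those are the places where ctypes would come in.

import zlib

WINDOW_SIZE = 32 * 1024  # maximum deflate window size

def make_seek_point(compressed_off, uncompressed_off, output_so_far):
    # A seek point is the two offsets plus the last 32 KB of uncompressed
    # data, which is exactly the deflate window at that position.
    return (compressed_off, uncompressed_off, output_so_far[-WINDOW_SIZE:])

def resume_at(fileobj, seek_point):
    # Resume decompression at a stored seek point. Assumes the compressed
    # offset sits on a byte-aligned deflate block boundary; zran.c handles
    # arbitrary bit offsets with inflatePrime(), which Python's zlib
    # module does not expose.
    compressed_off, uncompressed_off, window = seek_point
    fileobj.seek(compressed_off)
    # wbits=-15 selects raw deflate; zdict primes the 32 KB window so that
    # back-references to data before the seek point still resolve.
    return zlib.decompressobj(wbits=-15, zdict=window), uncompressed_off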

Unfortunately I can't envisage having the time to sit down and do this work any time soon, as I am continually tied up with unrelated work (and most of my free time is spent away from the screen!). However, if I don't get around to it, I am more than happy to support/review work done by somebody else.

@piskvorky commented Mar 29, 2023

I understand, thank you for your support. Being maximally away from the screen and juggling time between open source & family is also something I aspire to!
