Miller produces no output on TSV with > 64K characters per line #1501

Closed
GiulioCentorame opened this issue Feb 20, 2024 · 11 comments · Fixed by #1505

Comments

@GiulioCentorame

Hi John,
Hope you are doing well. I just wanted to open an issue wrt this problem I have encountered with one of my files.

To recap: I am working with a big file with approximately the following structure

f.ID    f.1.0.0    f.2.0.0    f.3.0.1    f.3.0.2
1    A    22    B32    -1    
2    F    38    B1    65 

(~5,000 columns and ~500,000 records)

Running mlr --itsv cut -f f.ID [file] does not seem to work as expected, as the program just "hangs" without printing anything on the shell/returning anything to stdout/stderr. We tried a few things to make it run with @aborruso, but nothing seemed to work as intended. This is running on an HPC machine and I provided up to 200 GB RAM to miller, without success.

Apologies for opening an issue without a reproducible example, this seems to happen only with a specific file that I cannot share due to data sharing policies.
Cheers

@johnkerl
Owner

Hi @GiulioCentorame ! I'll try to reproduce here.

Meanwhile can you please tell me what this produces?

mlr --itsv head -n 10 then cut -f f.ID [file] 

Then replace the 10 with 100, then 1000, 10000, 100000 -- this will help us see if this is indeed size-related.

@GiulioCentorame
Author

mlr --itsv head -n 10 then cut -f f.eid file1.tab returns nothing in the problematic file. Its sister file works normally:

$ mlr --itsv head -n 10 then cut -f f.eid file2.tab
f.eid=1
f.eid=2
f.eid=3
f.eid=4
f.eid=5
f.eid=6
f.eid=7
f.eid=8
f.eid=9
f.eid=10

It doesn't seem to work even when running mlr --itsv head -n 1 only, or head -n 1 file1.tab | mlr --itsv head -n 1 then cut -f f.eid. The good news is that I can share the column headers, which seem to be affected by the problem anyway and don't contain sensitive data: file1_header.txt

@aborruso
Contributor

A note for @johnkerl: if I use Miller 5 and the sample file of @GiulioCentorame, it works (I duplicated the header line of the file, to have two rows).

Using Miller 6 I get an empty result.

@GiulioCentorame
Author

I can replicate that on my original file too, Miller 5 seems to be working just fine

@johnkerl
Owner

johnkerl commented Feb 24, 2024

@GiulioCentorame and @aborruso -- digging into this now.

I was initially confused by the input data

f.ID    f.1.0.0    f.2.0.0    f.3.0.1    f.3.0.2
1    A    22    B32    -1    
2    F    38    B1    65 

which is not TSV (it has spaces, not tabs) and which has trailing whitespace.

However, on some more thought I was able to reproduce the problem, and moreover, to narrow in on it.

Here's a data-generator script:

#!/usr/bin/env python

import sys

nrow = 2
ncol = 100
if len(sys.argv) == 2:
    ncol = int(sys.argv[1])
if len(sys.argv) == 3:
    ncol = int(sys.argv[1])
    nrow = int(sys.argv[2])

prefix = "k"
for i in range(nrow):
    for j in range(ncol):
        if j == 0:
            sys.stdout.write("%s%07d" % (prefix, j))
        else:
            sys.stdout.write("\t%s%07d" % (prefix, j))
    sys.stdout.write("\n")
    prefix = "v"

Using this script I can produce an arbitrary number of columns:

$ ./mkt.py 5
k0000000	k0000001	k0000002	k0000003	k0000004
v0000000	v0000001	v0000002	v0000003	v0000004

Running this with various column-counts I get

$ ./mkt.py 7280 > 7280.tsv; mlr --t2j cut -f k0000000 7280.tsv
[
{
  "k0000000": "v0000000"
}
]

$ ./mkt.py 7281 > 7281.tsv; mlr --t2j cut -f k0000000 7281.tsv
[
{
  "k0000000": "v0000000"
}
]

$ ./mkt.py 7282 > 7282.tsv; mlr --t2j cut -f k0000000 7282.tsv
[
]

Note

$ head -n 1 7282.tsv|wc -c
   65538

i.e. the problem happens when the line-length exceeds 64K (65536 == 2**16).

So the bug is in the line-buffering reader. I'll dig into this and find a fix.
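The 65538 figure can be checked by hand from the generator script above: each field is 8 characters (a one-letter prefix plus 7 digits), fields are joined by single tabs, and the line ends with a newline. A minimal Go sketch of that arithmetic follows; the function name lineBytes is made up for illustration and is not part of Miller.

```go
package main

import "fmt"

// lineBytes returns the byte length of one line written by the generator
// script: ncol fields of 8 characters each, ncol-1 tab separators, and one
// trailing newline. (Hypothetical helper, for illustration only.)
func lineBytes(ncol int) int {
	const fieldWidth = 8 // "k" or "v" followed by 7 zero-padded digits
	return ncol*fieldWidth + (ncol - 1) + 1
}

func main() {
	const limit = 64 * 1024 // 65536 == 2**16
	for _, ncol := range []int{7280, 7281, 7282} {
		n := lineBytes(ncol)
		fmt.Printf("ncol=%d bytes=%d over64K=%v\n", ncol, n, n > limit)
	}
}
```

For 7281 columns this gives 65529 bytes, which is still under 64K; for 7282 columns it gives 65538, matching the wc -c output above.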

@johnkerl changed the title from "Miller hangs without returning anything" to "Miller hangs without returning anything, on TSV with > 64K characters per line" on Feb 24, 2024
@johnkerl
Owner

johnkerl commented Feb 24, 2024

Charming 😬

$ cat bad1.go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	handle := os.Stdin
	lineScanner := bufio.NewScanner(handle)
	fmt.Println("BEFORE LOOP")
	for lineScanner.Scan() {
		fmt.Println("IN LOOP")
		line := lineScanner.Text()
		fmt.Printf("line len %d\n", len(line))
	}
	fmt.Println("AFTER LOOP")
}
$ go run bad1.go < /Users/kerl/git/johnkerl/miller/7281.tsv
BEFORE LOOP
IN LOOP
line len 65528
IN LOOP
line len 65528
AFTER LOOP
$ go run bad1.go < /Users/kerl/git/johnkerl/miller/7282.tsv
BEFORE LOOP
AFTER LOOP

See also the Go documentation for bufio.Scanner:

Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the Scanner.Buffer. When a scan stops, the reader may have advanced arbitrarily far past the last token. Programs that need more control over error handling or large tokens, or must run sequential scans on a reader, should use bufio.Reader instead.

And of course I used bufio.NewScanner 😬
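For reference, a minimal sketch of two common ways around this limit: raising the cap with Scanner.Buffer, or reading with bufio.Reader as the quoted documentation suggests. This is not necessarily what the fix in Miller does. The function names, the string-based inputs, and the 1 MiB cap are assumptions chosen for illustration. Unlike bad1.go, the scanner variant also returns Err(), so a too-long line is reported rather than silently ending the loop.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// scanLines counts lines using bufio.Scanner. If maxLine > 0 the scanner's
// token cap is raised to maxLine bytes; otherwise the default 64K cap applies.
// The scanner's error is returned so an over-long line is not silent.
func scanLines(input string, maxLine int) (int, error) {
	s := bufio.NewScanner(strings.NewReader(input))
	if maxLine > 0 {
		s.Buffer(make([]byte, 0, 64*1024), maxLine)
	}
	n := 0
	for s.Scan() {
		n++
	}
	return n, s.Err()
}

// readLines counts lines using bufio.Reader, which has no fixed per-line cap.
func readLines(input string) (int, error) {
	r := bufio.NewReader(strings.NewReader(input))
	n := 0
	for {
		line, err := r.ReadString('\n')
		if len(line) > 0 {
			n++
		}
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
	}
}

// twoLongLines builds two newline-terminated lines of lineLen characters each
// (synthetic stand-in for a wide TSV header and data row).
func twoLongLines(lineLen int) string {
	line := strings.Repeat("x", lineLen) + "\n"
	return line + line
}

func main() {
	in := twoLongLines(70000)
	n, err := scanLines(in, 0)
	fmt.Println("default scanner:", n, err)
	n, err = scanLines(in, 1<<20)
	fmt.Println("raised cap:", n, err)
	n, err = readLines(in)
	fmt.Println("bufio.Reader:", n, err)
}
```

With 70000-character lines, the default scanner reports zero lines and a non-nil error (bufio.ErrTooLong), while the other two variants count both lines. Raising the cap only moves the limit, whereas bufio.Reader removes it at the cost of handling EOF by hand.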

@johnkerl changed the title from "Miller hangs without returning anything, on TSV with > 64K characters per line" to "Miller produces no output on TSV with > 64K characters per line" on Feb 25, 2024
@johnkerl reopened this on Feb 25, 2024
@johnkerl
Owner

johnkerl commented Feb 25, 2024

#1505 is merged but there are remaining performance questions to be addressed (GitHub auto-closed this issue because of my merge of #1505; I'm keeping it open).

@johnkerl
Owner

Found the issue; #1506 to follow up; closing this issue.

@johnkerl
Owner

Resolved on #1507 -- there is a full performance-analysis write-up there.

@johnkerl
Owner

@GiulioCentorame you can pull from head and compile from source if you like (https://miller.readthedocs.io/en/latest/build/) -- or you can use mlr --hash-records for the interim -- and I'll cut a Miller 6.12.0 soon.

@GiulioCentorame
Author

I tried pulling from head and it works, thank you so much!

The interesting behaviour now is that mlr --itsv --otsv head -n 10 then cut -f f.eid [file] prints the correct 10 lines almost immediately, but then it takes a while for miller to return the exit status.

E.g. this is miller 5.10.3

$ time mlr --itsv --otsv head -n 10 then cut -f f.eid file1.tab
f.eid
[correct lines]

real    0m16.136s
user    0m16.045s
sys     0m0.035s

and this is miller from aff4b9f (interrupted manually)

$ time ./miller/mlr --itsv --otsv head -n 10 then cut -f f.eid file1.tab
f.eid
[correct lines]
^C^C^C

real    6m15.200s
user    6m0.513s
sys     1m19.823s
