Miller produces no output on TSV with > 64K characters per line #1501

GiulioCentorame · 2024-02-20T02:12:12Z

Hi John,
Hope you are doing well. I just wanted to open an issue wrt this problem I have encountered with one of my files.

To recap: I am working with a big file with approximately the following structure

f.ID    f.1.0.0    f.2.0.0    f.3.0.1    f.3.0.2
1    A    22    B32    -1    
2    F    38    B1    65

(~5,000 columns and ~500,000 records)

Running mlr --itsv cut -f f.ID [file] does not seem to work as expected, as the program just "hangs" without printing anything on the shell/returning anything to stdout/stderr. We tried a few things to make it run with @aborruso, but nothing seemed to work as intended. This is running on an HPC machine and I provided up to 200 GB RAM to miller, without success.

Apologies for opening an issue without a reproducible example, this seems to happen only with a specific file that I cannot share due to data sharing policies.
Cheers


johnkerl · 2024-02-20T14:49:38Z

Hi @GiulioCentorame ! I'll try to reproduce here.

Meanwhile can you please tell me what this produces?

mlr --itsv head -n 10 then cut -f f.ID [file]

Then replace the 10 with 100, then 1000, 10000, 100000 -- this will help us see if this is indeed size-related.

GiulioCentorame · 2024-02-22T06:23:54Z

mlr --itsv head -n 10 then cut -f f.eid file1.tab returns nothing on the problematic file. Its sister file works normally:

$ mlr --itsv head -n 10 then cut -f f.eid file2.tab
f.eid=1
f.eid=2
f.eid=3
f.eid=4
f.eid=5
f.eid=6
f.eid=7
f.eid=8
f.eid=9
f.eid=10

It doesn't seem to work even when running mlr --itsv head -n 1 only, or head -n 1 file1.tab | mlr --itsv head -n 1 then cut -f f.eid. The good news is that I can share the column headers, which seem to be affected by the problem anyway and don't contain sensitive data: file1_header.txt

aborruso · 2024-02-22T08:05:54Z

A note for @johnkerl: if I use Miller 5 and the sample file of @GiulioCentorame, it works (I duplicated the header line of the file, to have two rows).

Using Miller 6 I get an empty result.

GiulioCentorame · 2024-02-23T03:17:10Z

I can replicate that on my original file too: Miller 5 seems to work just fine.

johnkerl · 2024-02-24T18:00:51Z

@GiulioCentorame and @aborruso -- digging into this now.

I was initially confused by the input data

f.ID    f.1.0.0    f.2.0.0    f.3.0.1    f.3.0.2
1    A    22    B32    -1    
2    F    38    B1    65

which is not TSV (it has spaces, not tabs) and which has trailing whitespace.

However, on some more thought I was able to reproduce the problem, and moreover, to narrow in on it.

Here's a data-generator script:

#!/usr/bin/env python

import sys

nrow = 2
ncol = 100
if len(sys.argv) == 2:
    ncol = int(sys.argv[1])
if len(sys.argv) == 3:
    ncol = int(sys.argv[1])
    nrow = int(sys.argv[2])

prefix = "k"
for i in range(nrow):
    for j in range(ncol):
        if j == 0:
            sys.stdout.write("%s%07d" % (prefix, j))
        else:
            sys.stdout.write("\t%s%07d" % (prefix, j))
    sys.stdout.write("\n")
    prefix = "v"

Using this script I can produce an arbitrary number of columns:

$ ./mkt.py 5
k0000000	k0000001	k0000002	k0000003	k0000004
v0000000	v0000001	v0000002	v0000003	v0000004

Running this with various column-counts I get

$ ./mkt.py 7280 > 7280.tsv; mlr --t2j cut -f k0000000 7280.tsv
[
{
  "k0000000": "v0000000"
}
]

$ ./mkt.py 7281 > 7281.tsv; mlr --t2j cut -f k0000000 7281.tsv
[
{
  "k0000000": "v0000000"
}
]

$ ./mkt.py 7282 > 7282.tsv; mlr --t2j cut -f k0000000 7282.tsv
[
]

Note

$ head -n 1 7282.tsv|wc -c
   65538

i.e. the problem happens when the line-length exceeds 64K (65536 == 2**16).
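The byte counts above can be checked by hand. Here is a small sketch in Go, assuming the mkt.py layout shown earlier (fields of 8 characters each, tab-separated, one trailing newline); lineBytes is a hypothetical helper name used only for this illustration:

```go
package main

import "fmt"

// lineBytes returns the byte length of one mkt.py output line, newline
// included: ncol fields of 8 characters each, separated by ncol-1 tabs.
// (Hypothetical helper, not part of Miller.)
func lineBytes(ncol int) int {
	return ncol*8 + (ncol - 1) + 1
}

func main() {
	for _, ncol := range []int{7280, 7281, 7282} {
		n := lineBytes(ncol)
		fmt.Printf("%d columns: %d bytes, over 64K: %v\n", ncol, n, n > 64*1024)
	}
}
```

For 7282 columns this gives 65538, matching the wc -c output, and it is the first column count in this series whose line exceeds 65536 bytes.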

So the bug is in the line-buffering reader. I'll dig into this and find a fix.

johnkerl · 2024-02-24T21:30:24Z

Charming 😬

$ cat bad1.go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	handle := os.Stdin
	lineScanner := bufio.NewScanner(handle)
	fmt.Println("BEFORE LOOP")
	for lineScanner.Scan() {
		fmt.Println("IN LOOP")
		line := lineScanner.Text()
		fmt.Printf("line len %d\n", len(line))
	}
	fmt.Println("AFTER LOOP")
}

$ go run bad1.go < /Users/kerl/git/johnkerl/miller/7281.tsv
BEFORE LOOP
IN LOOP
line len 65528
IN LOOP
line len 65528
AFTER LOOP

$ go run bad1.go < /Users/kerl/git/johnkerl/miller/7282.tsv
BEFORE LOOP
AFTER LOOP

See also

https://pkg.go.dev/bufio#NewScanner
https://pkg.go.dev/bufio#MaxScanTokenSize where we clearly see the 64K limit
https://pkg.go.dev/bufio#Scanner helpfully says:

Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the Scanner.Buffer. When a scan stops, the reader may have advanced arbitrarily far past the last token. Programs that need more control over error handling or large tokens, or must run sequential scans on a reader, should use bufio.Reader instead.

And of course I used bufio.NewScanner 😬

johnkerl · 2024-02-25T20:54:44Z

#1505 is merged but there are remaining performance questions to be addressed (GitHub auto-closed this issue because of my merge of #1505; I'm keeping it open).

johnkerl · 2024-02-25T22:16:28Z

Found the issue; #1506 to follow up; closing this issue.

This issue (#1501): correctness for high-column-count files
#1506 (Read performance can be improved for high-column-count data): read performance for high-column-count files

johnkerl · 2024-02-26T02:52:13Z

Resolved on #1507 -- there is a full performance-analysis write-up there.

johnkerl · 2024-02-26T02:54:06Z

@GiulioCentorame you can pull from head and compile from source if you like (https://miller.readthedocs.io/en/latest/build/) -- or you can use mlr --hash-records for the interim -- and I'll cut a Miller 6.12.0 soon.

GiulioCentorame · 2024-02-26T06:58:03Z

I tried pulling from head and it works, thank you so much!

The interesting behaviour now is that mlr --itsv --otsv head -n 10 then cut -f f.eid [file] prints the correct 10 lines almost immediately, but then it takes a while for miller to return the exit status.

E.g. this is miller 5.10.3

$ time mlr --itsv --otsv head -n 10 then cut -f f.eid file1.tab
f.eid
[correct lines]

real    0m16.136s
user    0m16.045s
sys     0m0.035s

and this is miller from aff4b9f (interrupted manually)

$ time ./miller/mlr --itsv --otsv head -n 10 then cut -f f.eid file1.tab
f.eid
[correct lines]
^C^C^C

real    6m15.200s
user    6m0.513s
sys     1m19.823s

johnkerl changed the title ~~Miller hangs without returning anything~~ Miller hangs without returning anything, on TSV with > 64K characters per line Feb 24, 2024

johnkerl mentioned this issue Feb 25, 2024

Separate out ILineReader abstraction #1504

Merged

johnkerl changed the title ~~Miller hangs without returning anything, on TSV with > 64K characters per line~~ Miller produces no output on TSV with > 64K characters per line Feb 25, 2024

johnkerl mentioned this issue Feb 25, 2024

Miller produces no output on TSV with > 64K characters per line #1505

Merged

johnkerl closed this as completed in #1505 Feb 25, 2024

johnkerl reopened this Feb 25, 2024

johnkerl mentioned this issue Feb 25, 2024

Read performance can be improved for high-column-count data #1506

Closed

johnkerl closed this as completed Feb 25, 2024

johnkerl mentioned this issue Feb 26, 2024

Investigate shutdown latency on mlr head #1509

Open
