flo failed on Large genome #33

Open
pan-genome opened this issue Sep 18, 2020 · 9 comments

@pan-genome

flo failed on a 14 Gb genome with a "corrupted double-linked list (not small)" error. It runs normally on genomes smaller than 4 Gb. The run was on an AWS m5.16xlarge EC2 instance.

rake -f /home/ubuntu/flo/Rakefile &
mkdir run
cp /home/ubuntu/s.fa run/source.fa
cp /home/ubuntu/t.fa run/target.fa
faToTwoBit run/source.fa run/source.2bit
faToTwoBit run/target.fa run/target.2bit
twoBitInfo run/source.2bit stdout | sort -k2nr > run/source.sizes
twoBitInfo run/target.2bit stdout | sort -k2nr > run/target.sizes
faSplit sequence run/target.fa 21 run/chunk_
parallel --joblog run/joblog.faSplit -j 21 -a run/joblst.faSplit
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; and it won't cost you a cent.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence the citation notice: run 'parallel --bibtex'.

123322 pieces of 123923 written
133957 pieces of 134763 written
150983 pieces of 152743 written
156478 pieces of 157558 written
98419 pieces of 99073 written
99082 pieces of 99724 written
103154 pieces of 103663 written
113555 pieces of 113991 written
118767 pieces of 119728 written
123551 pieces of 124526 written
141741 pieces of 142672 written
144495 pieces of 146237 written
130388 pieces of 131310 written
147572 pieces of 148896 written
138549 pieces of 140111 written
141907 pieces of 142961 written
149246 pieces of 150844 written
149613 pieces of 150822 written
197774 pieces of 198899 written
160747 pieces of 162550 written
167525 pieces of 170389 written
parallel --joblog run/joblog.blat -j 21 -a run/joblst.blat
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; and it won't cost you a cent.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence the citation notice: run 'parallel --bibtex'.

corrupted double-linked list (not small)
free(): invalid next size (normal)
free(): invalid next size (normal)
double free or corruption (!prev)
double free or corruption (!prev)
malloc(): smallbin double linked list corrupted
free(): invalid next size (normal)
malloc(): memory corruption
free(): invalid next size (normal)
double free or corruption (!prev)
free(): invalid next size (normal)
double free or corruption (!prev)
double free or corruption (!prev)
rake aborted!
Command failed with status (21): [parallel --joblog run/joblog.blat -j 21 -a...]
/home/ubuntu/flo/Rakefile:153:in `parallel'
/home/ubuntu/flo/Rakefile:99:in `block in <top (required)>'
/home/ubuntu/flo/Rakefile:37:in `block in <top (required)>'
Tasks: TOP => run/liftover.chn
(See full trace by running task with --trace)

[1]+ Exit 1 rake -f /home/ubuntu/flo/Rakefile

@yeban
Collaborator

yeban commented Sep 20, 2020

Not sure if the error is coming from GNU parallel or blat. The contents of run/joblog.blat can help decide. Would you mind posting it?

If it's GNU parallel, you could try using a newer version. The version that the install script installs is quite old.

If it's blat, it is possible that 256 GB is not sufficient memory for the task. Did you monitor the memory usage using htop?
You could try lowering the number of parallel processes that flo runs, using a memory-optimised (r5) instance for more RAM, and taking steps to minimise blat's memory usage, such as creating and providing an ooc file.
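As a rough way to choose a lower process count, the available memory can be divided by an estimate of what one blat job needs. This is only a sketch: `max_jobs` is a hypothetical helper, the per-job figure is a guess that would have to come from watching a single job in htop, and it assumes a Linux host that exposes MemAvailable in /proc/meminfo.

```shell
# Print how many jobs fit in available memory, given a per-job estimate in GiB.
# $1: estimated memory per blat job in GiB (an assumption supplied by the user)
# $2: meminfo file to read (defaults to /proc/meminfo)
max_jobs() {
  awk -v per_job_gib="$1" '
    /^MemAvailable:/ {
      gib = $2 / 1048576            # the value is reported in kB
      print int(gib / per_job_gib)  # round down to whole jobs
    }' "${2:-/proc/meminfo}"
}

# Usage: max_jobs 32
```

The result is an upper bound for the process count in flo's config, not a guarantee that the jobs will succeed.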

@pan-genome
Author

Here is the blat joblog:
run$ cat joblog.blat
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
6 : 1600440618.782 60.508 0 0 0 9 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_05.fa run/chunk_05.fa.psl
9 : 1600440618.787 67.600 0 0 0 9 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_01.fa run/chunk_01.fa.psl
5 : 1600440618.780 74.621 0 0 0 9 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_14.fa run/chunk_14.fa.psl
21 : 1600440618.807 81.061 0 0 0 9 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_20.fa run/chunk_20.fa.psl
4 : 1600440618.778 186.198 0 0 0 9 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_07.fa run/chunk_07.fa.psl
10 : 1600440618.788 312.954 0 0 0 11 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_10.fa run/chunk_10.fa.psl
2 : 1600440618.775 312.980 0 41 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_03.fa run/chunk_03.fa.psl
8 : 1600440618.785 314.005 0 0 0 11 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_09.fa run/chunk_09.fa.psl
13 : 1600440618.793 314.322 0 35 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_08.fa run/chunk_08.fa.psl
14 : 1600440618.795 314.361 0 35 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_17.fa run/chunk_17.fa.psl
20 : 1600440618.805 314.427 0 34 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_02.fa run/chunk_02.fa.psl
12 : 1600440618.791 319.748 0 34 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_13.fa run/chunk_13.fa.psl
7 : 1600440618.783 324.924 0 48 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_18.fa run/chunk_18.fa.psl
11 : 1600440618.790 327.304 0 35 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_04.fa run/chunk_04.fa.psl
15 : 1600440618.796 330.322 0 28 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_00.fa run/chunk_00.fa.psl
19 : 1600440618.803 331.255 0 35 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_11.fa run/chunk_11.fa.psl
17 : 1600440618.800 332.427 0 0 0 11 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_15.fa run/chunk_15.fa.psl
18 : 1600440618.802 332.598 0 34 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_19.fa run/chunk_19.fa.psl
16 : 1600440618.798 333.617 0 35 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_06.fa run/chunk_06.fa.psl
1 : 1600440618.774 341.095 0 34 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_12.fa run/chunk_12.fa.psl
3 : 1600440618.777 345.338 0 34 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_16.fa run/chunk_16.fa.psl
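The Signal column of the joblog above can be tallied to see how the jobs ended. A minimal sketch, assuming the joblog keeps the column order shown (Signal is the eighth field); `summarize_signals` is a hypothetical helper name, not part of flo or GNU Parallel.

```shell
# Count how many jobs ended with each signal in a GNU Parallel joblog.
# Prints "<signal> <count>" per line, sorted by signal number.
summarize_signals() {
  awk 'NR > 1 && NF >= 8 { count[$8]++ }          # skip the header row
       END { for (s in count) print s, count[s] }' "$1" | sort -n
}

# Usage: summarize_signals run/joblog.blat
```

For the joblog above this gives 13 jobs with signal 6, 5 with signal 9, and 3 with signal 11. On Linux x86-64 these are SIGABRT, SIGKILL, and SIGSEGV. SIGABRT is what glibc raises after printing heap-corruption messages such as "free(): invalid next size", and 13 such messages appear in the original output. SIGKILL can come from the kernel's out-of-memory killer, although the joblog alone does not confirm that.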

@pan-genome
Author

What would be the best way to update parallel? Should I install a new version, or update the one in /ext/parallel-20150722?
If I install the new one in a different folder, I then need to point every use of parallel in flo to the new src.

@yeban
Collaborator

yeban commented Sep 20, 2020

if installed new one in different folder, I then need to point all the parallel in flo to the new src

Best to install it in a new folder. You can tell flo about the new folder using the :add_to_path: key in the config file.
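Before rerunning, it may be worth checking which parallel binary would win once the new folder is on the PATH. The sketch below assumes that :add_to_path: places its directories ahead of the existing PATH, which this thread does not confirm; `resolve_tool` is a hypothetical helper and the directory in the usage line is a placeholder.

```shell
# Print the path that a tool name resolves to when a directory is put first on PATH.
# $1: directory to prepend, $2: tool name
resolve_tool() {
  # The subshell keeps the PATH change from leaking into the caller.
  ( PATH="$1:$PATH"; command -v "$2" )
}

# Usage (placeholder directory):
# resolve_tool "$HOME/parallel-new/src" parallel
```

If the printed path is still the copy under ext/parallel-20150722, the new folder is not taking precedence.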

@pan-genome
Author

I changed to an r5.16xlarge instance, used a newer parallel, and lowered the number of parallel processes from 21 to 10, but I still get the same error. Any suggestions? Thanks!
The blat joblog looks like this:
run$ cat joblog.blat
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
5 : 1600620779.307 255.175 0 0 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_01.fa run/chunk_01.fa.psl
7 : 1600620779.310 255.858 0 0 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_08.fa run/chunk_08.fa.psl
1 : 1600620779.302 256.565 0 0 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_03.fa run/chunk_03.fa.psl
8 : 1600620779.311 256.630 0 0 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_00.fa run/chunk_00.fa.psl
10 : 1600620779.314 256.855 0 0 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_02.fa run/chunk_02.fa.psl
2 : 1600620779.303 257.506 0 0 0 11 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_07.fa run/chunk_07.fa.psl
4 : 1600620779.306 257.615 0 0 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_09.fa run/chunk_09.fa.psl
6 : 1600620779.308 257.718 0 0 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_04.fa run/chunk_04.fa.psl
9 : 1600620779.312 258.359 0 0 0 6 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_06.fa run/chunk_06.fa.psl
3 : 1600620779.304 258.777 0 0 0 11 blat -noHead -noHead -fastMap -tileSize=12 -ooc=/home/ubuntu/efs/wheat/4461_12.ooc -minScore=100 -minIdentity=98 run/source.fa run/chunk_05.fa run/chunk_05.fa.psl

@yeban
Collaborator

yeban commented Sep 26, 2020

Sorry, I am not quite sure what is happening here. I have not encountered this error before. From the information we have in this thread, it might well be a bug in blat. It might be worth running the blat commands listed in joblst.blat one by one to check whether all the chunks fail with the above error, or only one in particular. With an isolated example, it might then be worth asking on blat's mailing list.
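Running the job list one command at a time could look like the sketch below. It assumes joblst.blat holds one complete shell command per line, as the joblog suggests; `run_serially` is a hypothetical helper, and the log file name in the usage line is a placeholder.

```shell
# Run each line of a job list as its own shell command, one at a time.
# Prints "<exit status><TAB><command>" per job; job output is appended to a log.
# $1: job list file, $2: log file for stdout and stderr of the jobs
run_serially() {
  job_list=$1
  job_log=$2
  while IFS= read -r cmd; do
    [ -n "$cmd" ] || continue                 # skip blank lines
    # stdin is redirected so a job cannot consume the rest of the job list
    sh -c "$cmd" < /dev/null >> "$job_log" 2>&1
    printf '%s\t%s\n' "$?" "$cmd"
  done < "$job_list"
}

# Usage: run_serially run/joblst.blat serial.log
```

A job killed by a signal is reported with status 128 plus the signal number, so 134 corresponds to SIGABRT and 139 to SIGSEGV.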

Just to be sure, is it possible that the ooc file you constructed is using a different tileSize than what you are using for running blat? I guess not, because you have _12 suffix on the ooc file.

Did you compile blat yourself, or did you download a pre-compiled executable (e.g., using the install script)? It is possible that there is a difference in glibc between your instance and the host on which blat was compiled, in which case compiling blat yourself can help. But this is the kind of issue where you would be better off getting help on blat's mailing list.

I used flo on a ~400 Mb genome split into 40 chunks, so 10 Mb per chunk. I wonder if increasing the number of processes, so that each chunk is smaller, helps.

Lastly, I would quickly check the fasta and psl file for each chunk just to make sure we are not missing something too obvious.
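A quick per-chunk check could count the FASTA records in each chunk and the lines in its psl file. This is a sketch that assumes the naming seen in the joblog, where the output for run/chunk_05.fa is run/chunk_05.fa.psl; `chunk_report` is a hypothetical helper.

```shell
# For each chunk FASTA, print "<file><TAB><sequence count><TAB><psl line count>".
# A missing or empty psl file is reported as 0 lines.
chunk_report() {
  for fa in "$@"; do
    seqs=$(grep -c '^>' "$fa" || true)        # number of FASTA header lines
    if [ -s "$fa.psl" ]; then
      hits=$(wc -l < "$fa.psl" | tr -d ' ')
    else
      hits=0
    fi
    printf '%s\t%s\t%s\n' "$fa" "$seqs" "$hits"
  done
}

# Usage: chunk_report run/chunk_*.fa
```

Chunks with zero sequences, or with an empty psl file where the others have output, would be the first ones to look at.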

@pan-genome
Author

pan-genome commented Sep 29, 2020

Hi,
How can I split the run into more processes than the number of chromosomes/scaffolds? The information page says:
"Number of CPU cores to use (required - not auto detected). This
cannot be greater than the number of scaffolds in the target assembly."
I have 21 chromosomes, so 21 processes is the maximum I can get. It looks like a memory issue in blat, and each chunk is still too big for blat to handle.

@pan-genome
Author

Here is what happened when I ran blat on one chunk:
blat -noHead -fastMap -tileSize=12 -ooc=4461n_12.ooc -minScore=100 -minIdentity=98 source.fa chunk_08.fa chunk_08.fa.psl
Loaded 14547261565 letters in 22 sequences
free(): invalid next size (normal)
Aborted (core dumped)
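The thread does not establish why blat aborts right after loading. One thing that can be checked, offered here as an unconfirmed assumption rather than a diagnosis, is whether the loaded database is larger than a 32-bit offset can address (4294967295 bases). The 14547261565 letters reported above would exceed that, and the runs that worked were below 4 Gb. The sketch reads a sizes file in the two-column name/size layout that twoBitInfo writes to run/source.sizes; `total_bases` is a hypothetical helper.

```shell
# Sum the second column (sequence length) of a twoBitInfo-style sizes file.
total_bases() {
  # %.0f avoids scientific notation for large totals in some awk versions
  awk '{ total += $2 } END { printf "%.0f\n", total }' "$1"
}

# Usage:
# total=$(total_bases run/source.sizes)
# if [ "$total" -gt 4294967295 ]; then
#   echo "database exceeds the 32-bit range: $total bases"
# fi
```

If the total is above that range, it would be a concrete detail to include when asking on blat's mailing list.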

@akshaya-v

Hello!
I am facing a similar issue while running flo on a large genome of ~16 Gb.
Can you please advise if there is a workaround or solution for this issue?
Thanks.
