metaWRAP binning module skips BAM output if multiple pairs of fastqs with same file names are given #541

Open
Prunoideae opened this issue Apr 5, 2024 · 0 comments

I wrote a script to automate the processing of metagenomic samples from 10+ sites. Each direct output of read_qc therefore contains a pair of fastqs at clean_reads/{sample_name}/final_pure_reads_{1/2}.fq (I know the usage tutorial has a step that renames the files with sample names, but I didn't expect skipping it to have an effect later).

So all the reads given share the same final_pure_reads prefix, but in the binning module we have:

for num in "$@"; do
    # paired end reads
    if [ $read_type = paired ]; then
        if [[ $num == *"_1.fastq"* ]]; then
            reads_1=$num
            reads_2=${num%_*}_2.fastq
            if [ ! -s $reads_1 ]; then error "$reads_1 does not exist. Exiting..."; fi
            if [ ! -s $reads_2 ]; then error "$reads_2 does not exist. Exiting..."; fi
            tmp=${reads_1##*/}
            sample=${tmp%_*}
            if [[ ! -f ${out}/work_files/${sample}.bam ]]; then
                comm "Aligning $reads_1 and $reads_2 back to assembly"
                bwa mem -v 1 -t $threads ${out}/work_files/assembly.fa $reads_1 $reads_2 > ${out}/work_files/${sample}.sam
                if [[ $? -ne 0 ]]; then error "Something went wrong with aligning $reads_1 and $reads_2 reads to the assembly. Exiting"; fi
                comm "Sorting the $sample alignment file"
                samtools sort -T ${out}/work_files/tmp-samtools -@ $threads -O BAM -o ${out}/work_files/${sample}.bam ${out}/work_files/${sample}.sam
                if [[ $? -ne 0 ]]; then error "Something went wrong with sorting the alignments. Exiging..."; fi
                rm ${out}/work_files/${sample}.sam
            else
                comm "skipping aligning $sample reads to assembly because ${out}/work_files/${sample}.bam already exists."
            fi
        fi

We see that samples are distinguished only by their file name prefixes, since the directory part of the path is stripped. So every subsequent fastq pair is skipped, because final_pure_reads.bam already exists once the first pair has been aligned.
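
To make the collision concrete, here is a minimal sketch that applies the same two parameter expansions from the snippet above to two inputs. The `derive_sample` helper and the `site_A`/`site_B` paths are made up for illustration and are not part of metaWRAP.

```shell
#!/usr/bin/env bash
# Hypothetical helper: repeats the two expansions the binning module uses
# to turn a forward-read path into a sample name.
derive_sample() {
    local reads_1=$1
    local tmp=${reads_1##*/}    # drop everything up to the last "/" (the directory)
    printf '%s\n' "${tmp%_*}"   # drop the shortest trailing "_*" (here "_1.fastq")
}

# Illustrative paths only: two different sites, identical file names.
derive_sample clean_reads/site_A/final_pure_reads_1.fastq
derive_sample clean_reads/site_B/final_pure_reads_1.fastq
```

Both calls print `final_pure_reads`, so both pairs map to the same `${out}/work_files/final_pure_reads.bam` and the second one hits the "already exists" branch.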

I think this deserves at least an error stating that identical file names are not permitted. Alternatively, it could be fixed by naming the BAM after a hash of the whole path (or simply after the position in the iteration, like 0, 1, 2...) instead of the file name prefix.
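
A rough sketch of both ideas, in case it helps. The function names (`sample_name`, `check_unique_samples`, `unique_sample_name`) are hypothetical and nothing here is taken from the metaWRAP code base; it only assumes the same `_1.fastq` naming the module already matches on.

```shell
#!/usr/bin/env bash
# Hypothetical helper: same derivation as the binning module (basename minus "_1.fastq").
sample_name() {
    local tmp=${1##*/}
    printf '%s\n' "${tmp%_*}"
}

# Option 1: fail early if two forward-read files map to the same sample name.
# Names come from basenames, so they never contain "/", which makes "/" a safe delimiter.
check_unique_samples() {
    local seen="/" f s
    for f in "$@"; do
        case $f in *_1.fastq*) ;; *) continue ;; esac   # only look at forward reads
        s=$(sample_name "$f")
        case $seen in
            *"/$s/"*)
                echo "Error: more than one input maps to sample name '$s' ($f)" >&2
                return 1 ;;
        esac
        seen="$seen$s/"
    done
}

# Option 2: make the name unique by prefixing the position in the argument list.
unique_sample_name() {   # usage: unique_sample_name INDEX READS_1
    printf '%s_%s\n' "$1" "$(sample_name "$2")"
}

# Illustrative paths only.
if ! check_unique_samples clean_reads/site_A/final_pure_reads_1.fastq \
                          clean_reads/site_B/final_pure_reads_1.fastq; then
    echo "duplicate sample names detected"
fi
unique_sample_name 0 clean_reads/site_A/final_pure_reads_1.fastq
unique_sample_name 1 clean_reads/site_B/final_pure_reads_1.fastq
```

The index prefix keeps the BAM names readable but changes if the argument order changes between runs, which would defeat the "already exists" resume check; a hash of the full path would be stable across reorderings at the cost of less readable file names.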
