Training data from HG002, HG003 and HG004 #26

aCoalBall · 2024-06-06T07:30:44Z

Hi,
Thanks for the impressive work of DeepMod2. I am trying to generate the training/valid/test data following your criteria as described in your paper, but I can not get the same number as in Supplementary Table 4. Take HG002 for instance, I downloaded 12 files by LAB01 of the six WGBS methods (TrueSeq, MethylSeq, SPLAT, TrueMethIBS, TrueMethyIOX, and EMSeq) from EpiQC.

GSM5649436_TruSeq_HG002_LAB01_REP02.bedGraph.gz GSM5649437_TruSeq_HG002_LAB01_REP01.bedGraph.gz GSM5649458_SPLAT_HG002_LAB01_REP02.bedGraph.gz GSM5649459_SPLAT_HG002_LAB01_REP01.bedGraph.gz GSM5649479_MethylSeq_HG002_LAB01_REP02.bedGraph.gz GSM5649480_MethylSeq_HG002_LAB01_REP01.bedGraph.gz GSM5649518_EMSeq_HG002_LAB01_REP02.bedGraph.gz GSM5649520_EMSeq_HG002_LAB01_REP01.bedGraph.gz GSM5649543_TrueMethylOX_HG002_LAB01_REP02.bedGraph.gz GSM5649545_TrueMethylOX_HG002_LAB01_REP01.bedGraph.gz GSM5649565_TrueMethylBS_HG002_LAB01_REP02.bedGraph.gz GSM5649567_TrueMethylBS_HG002_LAB01_REP01.bedGraph.gz

I processed these files following the description in your paper:

Before processing, for each WGBS method, I summed the counts from the two replicates and calculated the methylation percentage.
For each of the six WGBS method, we find CpG sites with (coverage >=10 and percentage >=90%) as methylation and (coverage >=10 and percentage < 10%) as unmethylation.
We take the overlapping for ALL the six WGBS methods for the above methylation and unmethylation sites. We then split the data into train/valid/test, according to the region of Chr2-21, Chr22 and Chr1.

Our Train: meth 4,212,407, unmeth: 2,169,575
Your Paper Train: meth 5,332,746, unmeth: 2,490,174

Our Valid: meth 134,316, unmeth: 58,293
Your Paper Valid: meth 179,954, unmeth: 65,730

Could you help me to reproduce the training data split as in Supplmentary Table 4 with more detailed information or kindly sharing your script to generate the data?
Thank you very much!

The text was updated successfully, but these errors were encountered:

umahsn · 2024-06-07T22:27:07Z

Hi, please see the following code for reproducing the training CpG labels:

Process individual datasets for each genome:

h="HG002"

mkdir -p merged/$h

for tp in {TruSeq,TrueMethylBS,TrueMethylOX,EMSeq,MethylSeq,SPLAT}
do
# merge replicates
f1=`find * -name GSM5649*${tp}*${h}_LAB01_REP01.bedGraph`
f2=`find * -name GSM5649*${tp}*${h}_LAB01_REP02.bedGraph`

bedtools intersect -a $f1 -b $f2 -loj |awk -v OFS="\t"  '{print $1, $2, $3,($5+$11)/($5+$11+$6+$12),$5+$11+$6+$12,$5+$11,$6+$12}' |grep -v 'Un\|random\|chrEBV\|p\|M' > merged/$h/merged.$tp.bed

# filter according to depth, and separate according to high or low methylation
mincov=10
high=0.9
low=0.1

mkdir -p merged/$h/dp.$mincov/gt_${high}
mkdir -p merged/$h/dp.$mincov/lt_${low}

cat merged/$h/merged.$tp.bed|awk -v high=$high -v mincov=$mincov  '{if ($5>=mincov && $4>=high){print}}'| grep -v 'X\|Y\|M' > merged/$h/dp.$mincov/gt_${high}/$tp.bed

cat merged/$h/merged.$tp.bed|awk -v low=$low -v mincov=$mincov  '{if ($5>=mincov && $4<=low){print}}'| grep -v 'X\|Y\|M' > merged/$h/dp.$mincov/lt_${low}/$tp.bed
done

Get common sites among six technologies and split into training, testing and validation

g=2

# get common sites
for tp in {"gt_0.9","lt_0.1"}
do 
folder="merged/HG00$g/dp.10/$tp/"
bedtools multiinter -i $folder/*.bed > $folder/outer_join
cat $folder/outer_join|awk '{if($4==6){print}}'|cut -f 1-3 > $folder/common_sites_all
done

#convert sites into stranded CpGs
d=dp.10
l=lt_0.1
h=gt_0.9
mkdir -p merged/HG00$g/$d/training_labels
cat merged/HG00$g/$d/$l/common_sites_all | awk '{print $1 "\t" $2 "\t+\t" 0 "\n" $1 "\t" $2+1 "\t-\t" 0}' > merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h ; cat merged/HG00$g/$d/$h/common_sites_all|awk '{print $1 "\t" $2 "\t+\t" 1 "\n" $1 "\t" $2+1 "\t-\t" 1}' >> merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h

#split datasets
cat merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h|grep  "chr22" > merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h.chr22

cat merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h|grep  -v "chr1$(printf '\t')\|chr22" > merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h.chr2_21

cat merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h|grep  "chr1$(printf '\t')" > merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h.chr1

aCoalBall closed this as completed Jul 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Training data from HG002, HG003 and HG004 #26

Training data from HG002, HG003 and HG004 #26

aCoalBall commented Jun 6, 2024

umahsn commented Jun 7, 2024

Training data from HG002, HG003 and HG004 #26

Training data from HG002, HG003 and HG004 #26

Comments

aCoalBall commented Jun 6, 2024

umahsn commented Jun 7, 2024