Add new submodule pychip

sbslee · Mar 30, 2023 · c110451
1 parent 06c54aa

Showing 4 changed files with 221 additions and 0 deletions.
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -5,6 +5,7 @@ Changelog
 -----------------------
 
 * :issue:`67`: Fix bug in :meth:`pymaf.MafFrame.plot_waterfall` method where ``count=1`` was causing color mismatch.
+* Add new submodule ``pychip``.
 
 0.36.0 (2022-08-12)
 -------------------

diff --git a/README.rst b/README.rst
@@ -185,6 +185,7 @@ Below is the list of submodules available in the fuc API:
 - **common** : The common submodule is used by other fuc submodules such as pyvcf and pybed. It also provides many day-to-day actions used in the field of bioinformatics.
 - **pybam** : The pybam submodule is designed for working with sequence alignment files (SAM/BAM/CRAM). It essentially wraps the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. If you are mainly interested in working with depth of coverage data, please check out the pycov submodule which is specifically designed for the task.
 - **pybed** : The pybed submodule is designed for working with BED files. It implements ``pybed.BedFrame`` which stores BED data as ``pandas.DataFrame`` via the `pyranges <https://github.com/biocore-ntnu/pyranges>`_ package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `BED specification <https://genome.ucsc.edu/FAQ/FAQformat.html>`_.
+- **pychip** : The pychip submodule is designed for working with annotation or manifest files from the Axiom (Thermo Fisher Scientific) and Infinium (Illumina) array platforms.
 - **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. The ``pycov.CovFrame`` class also contains many useful plotting methods such as ``CovFrame.plot_region`` and ``CovFrame.plot_uniformity``.
 - **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
 - **pygff** : The pygff submodule is designed for working with GFF/GTF files. It implements ``pygff.GffFrame`` which stores GFF/GTF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `GFF specification <https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md>`_.

diff --git a/docs/api.rst b/docs/api.rst
@@ -14,6 +14,7 @@ Below is the list of submodules available in the fuc API:
 - **common** : The common submodule is used by other fuc submodules such as pyvcf and pybed. It also provides many day-to-day actions used in the field of bioinformatics.
 - **pybam** : The pybam submodule is designed for working with sequence alignment files (SAM/BAM/CRAM). It essentially wraps the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. If you are mainly interested in working with depth of coverage data, please check out the pycov submodule which is specifically designed for the task.
 - **pybed** : The pybed submodule is designed for working with BED files. It implements ``pybed.BedFrame`` which stores BED data as ``pandas.DataFrame`` via the `pyranges <https://github.com/biocore-ntnu/pyranges>`_ package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `BED specification <https://genome.ucsc.edu/FAQ/FAQformat.html>`_.
+- **pychip** : The pychip submodule is designed for working with annotation or manifest files from the Axiom (Thermo Fisher Scientific) and Infinium (Illumina) array platforms.
 - **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. The ``pycov.CovFrame`` class also contains many useful plotting methods such as ``CovFrame.plot_region`` and ``CovFrame.plot_uniformity``.
 - **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
 - **pygff** : The pygff submodule is designed for working with GFF/GTF files. It implements ``pygff.GffFrame`` which stores GFF/GTF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `GFF specification <https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md>`_.
@@ -48,6 +49,12 @@ fuc.pybed
 .. automodule:: fuc.api.pybed
  :members:
 
+fuc.pychip
+==========
+
+.. automodule:: fuc.api.pychip
+ :members:
+
 fuc.pycov
 =========
 

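The pychip entry added to the README and API docs covers Axiom annotation files, which `AxiomFrame.from_file` in this commit reads as a run of `#` metadata lines followed by a CSV table. The sketch below illustrates that split with the standard library only. It is not the submodule's code: the helper name `split_axiom_text` and the sample text are hypothetical, and only the `Chromosome` column name and the `'---'` placeholder are taken from this commit.

```python
import csv
import io


def split_axiom_text(text):
    """Hypothetical helper: separate leading '#' lines from the CSV table."""
    lines = text.splitlines()
    n = 0
    # Count only the metadata lines at the top of the file.
    while n < len(lines) and lines[n].startswith('#'):
        n += 1
    meta = lines[:n]
    rows = list(csv.DictReader(io.StringIO('\n'.join(lines[n:]))))
    return meta, rows


# Made-up sample text, not a real Axiom annotation file.
sample = (
    '#%key1=value1\n'
    '#%key2=value2\n'
    '"Chromosome","Physical Position","Ref Allele","Alt Allele"\n'
    '"1","12345","A","G"\n'
    '"---","---","C","T"\n'
)

meta, rows = split_axiom_text(sample)
# Keep only probes with a chromosome assignment, as to_vep does.
mapped = [r for r in rows if r['Chromosome'] != '---']
print(len(meta), len(rows), len(mapped))  # prints: 2 2 1
```

Stopping at the first non-`#` line, rather than counting every `#` line in the file, keeps a stray `#` inside the table from shifting the header row.
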
diff --git a/fuc/api/pychip.py b/fuc/api/pychip.py
@@ -0,0 +1,214 @@
+"""
+The pychip submodule is designed for working with annotation or manifest
+files from the Axiom (Thermo Fisher Scientific) and Infinium (Illumina)
+array platforms.
+"""
+
+import gzip
+import os
+import re
+import pandas as pd
+
+class AxiomFrame:
+    """
+    Class for storing Axiom annotation data.
+
+    Parameters
+    ----------
+    meta : list
+        List of metadata lines.
+    df : pandas.DataFrame
+        DataFrame containing annotation data.
+    """
+    def __init__(self, meta, df):
+        self._meta = meta
+        self._df = df.reset_index(drop=True)
+
+    @property
+    def meta(self):
+        """list : List of metadata lines."""
+        return self._meta
+
+    @meta.setter
+    def meta(self, value):
+        self._meta = value
+
+    @property
+    def df(self):
+        """pandas.DataFrame : DataFrame containing annotation data."""
+        return self._df
+
+    @df.setter
+    def df(self, value):
+        self._df = value.reset_index(drop=True)
+
+    @classmethod
+    def from_file(cls, fn):
+        """
+        Construct AxiomFrame from a CSV file.
+
+        Parameters
+        ----------
+        fn : str
+            CSV file (compressed or uncompressed).
+
+        Returns
+        -------
+        AxiomFrame
+            AxiomFrame object.
+        """
+        if fn.startswith('~'):
+            fn = os.path.expanduser(fn)
+
+        if fn.endswith('.gz'):
+            f = gzip.open(fn, 'rt')
+        else:
+            f = open(fn)
+
+        meta = []
+        n = 0
+        for line in f:
+            if line.startswith('#'):
+                meta.append(line)
+                n += 1
+        f.close()
+
+        df = pd.read_csv(fn, skiprows=n)
+
+        return cls(meta, df)
+
+    def to_vep(self):
+        """
+        Convert AxiomFrame to the Ensembl VEP format.
+
+        Returns
+        -------
+        pandas.DataFrame
+            Variants in Ensembl VEP format.
+        """
+        # Drop probes without a chromosome assignment ('---').
+        df = self.df[self.df.Chromosome != '---']
+        def one_row(r):
+            result = []
+            nucleotides = ['A', 'C', 'G', 'T']
+            chrom = r['Chromosome']
+            strand = r['Strand']
+            for alt in r['Alt Allele'].split(' // '):
+                # Reset per allele so one ALT's edits do not leak into the next.
+                ref = r['Ref Allele']
+                start = r['Physical Position']
+                end = r['Position End']
+                if ref in nucleotides and alt in nucleotides: # SNV
+                    pass
+                elif alt == '-': # DEL I
+                    pass
+                elif ref == '-': # INS I (must precede the MNV check)
+                    start += 1
+                    end = start - 1
+                elif len(alt) == len(ref): # MNV
+                    pass
+                elif len(alt) < len(ref) and ref.startswith(alt): # DEL II
+                    start += len(alt)
+                    ref = ref[len(alt):]
+                    alt = '-'
+                elif len(alt) > len(ref) and alt.startswith(ref): # INS II
+                    # Skip the shared REF prefix; keep only the inserted bases.
+                    start += len(ref)
+                    end = start - 1
+                    alt = alt[len(ref):]
+                    ref = '-'
+                else:
+                    pass
+                line = [chrom, start, end, f'{ref}/{alt}', strand]
+                result.append('|'.join([str(x) for x in line]))
+            return ','.join(result)
+        s = df.apply(one_row, axis=1)
+        s = ','.join(s)
+        data = [x.split('|') for x in s.split(',')]
+        df = pd.DataFrame(data).drop_duplicates()
+        df.iloc[:, 1] = df.iloc[:, 1].astype(int)
+        df.iloc[:, 2] = df.iloc[:, 2].astype(int)
+        df = df.sort_values(by=[0, 1])
+        return df
+
+class InfiniumFrame:
+    """
+    Class for storing Infinium manifest data.
+
+    Parameters
+    ----------
+    df : pandas.DataFrame
+        DataFrame containing manifest data.
+    """
+    def __init__(self, df):
+        self._df = df.reset_index(drop=True)
+
+    @property
+    def df(self):
+        """pandas.DataFrame : DataFrame containing manifest data."""
+        return self._df
+
+    @df.setter
+    def df(self, value):
+        self._df = value.reset_index(drop=True)
+
+    @classmethod
+    def from_file(cls, fn):
+        """
+        Construct InfiniumFrame from a CSV file.
+
+        Parameters
+        ----------
+        fn : str
+            CSV file (compressed or uncompressed).
+
+        Returns
+        -------
+        InfiniumFrame
+            InfiniumFrame object.
+        """
+        if fn.startswith('~'):
+            fn = os.path.expanduser(fn)
+
+        if fn.endswith('.gz'):
+            f = gzip.open(fn, 'rt')
+        else:
+            f = open(fn)
+
+        lines = f.readlines()
+        f.close()
+
+        for i, line in enumerate(lines):
+            if line.startswith('[Assay]'):
+                start = i
+                headers = lines[i+1].strip().split(',')
+            elif line.startswith('[Controls]'):
+                end = i
+
+        lines = lines[start+2:end]
+        lines = [x.strip().split(',') for x in lines]
+
+        df = pd.DataFrame(lines, columns=headers)
+
+        return cls(df)
+
+    def to_vep(self):
+        """
+        Convert InfiniumFrame to the Ensembl VEP format.
+
+        Returns
+        -------
+        pandas.DataFrame
+            Variants in Ensembl VEP format.
+        """
+        df = self.df[(self.df.Chr != 'XY') & (self.df.Chr != '0')]
+        def one_row(r):
+            pos = r.MapInfo
+            matches = re.findall(r'\[([^\]]+)\]', r.SourceSeq)
+            if not matches:
+                raise ValueError(f'No bracketed alleles in SourceSeq: {r}')
+            a1, a2 = matches[0].split('/')
+            data = pd.Series([r.Chr, pos, a1, a2])
+            return data
+        df = df.apply(one_row, axis=1)
+        return df
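
To make the allele extraction in `InfiniumFrame.to_vep` concrete, here is a small standalone sketch that applies the same regular expression to made-up `SourceSeq` strings. The function name `extract_alleles` and the example sequences are hypothetical; only the pattern itself comes from the diff above.

```python
import re

# Same pattern as in the diff: capture the text inside the first [...] pair.
BRACKET = re.compile(r'\[([^\]]+)\]')


def extract_alleles(source_seq):
    """Hypothetical helper: return the two alleles written as [A1/A2]."""
    match = BRACKET.search(source_seq)
    if match is None:
        raise ValueError(f'No bracketed alleles in: {source_seq!r}')
    a1, a2 = match.group(1).split('/')
    return a1, a2


# Made-up flanking sequences, for illustration only.
print(extract_alleles('ACGTTGCA[A/G]TTGACCGT'))  # ('A', 'G')
print(extract_alleles('GGCTA[-/TCA]CCGTA'))      # ('-', 'TCA')
```

Unpacking into exactly two names means a bracket holding anything other than two `/`-separated alleles raises a `ValueError` instead of passing through silently, which matches the unpacking in `one_row`.
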