Add new submodule pychip
sbslee committed Mar 30, 2023
1 parent 06c54aa commit c110451
Showing 4 changed files with 221 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -5,6 +5,7 @@ Changelog
-----------------------

* :issue:`67`: Fix bug in :meth:`pymaf.MafFrame.plot_waterfall` method where ``count=1`` was causing color mismatch.
* Add new submodule ``pychip``.

0.36.0 (2022-08-12)
-------------------
1 change: 1 addition & 0 deletions README.rst
@@ -185,6 +185,7 @@ Below is the list of submodules available in the fuc API:
- **common** : The common submodule is used by other fuc submodules such as pyvcf and pybed. It also provides many day-to-day actions used in the field of bioinformatics.
- **pybam** : The pybam submodule is designed for working with sequence alignment files (SAM/BAM/CRAM). It essentially wraps the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. If you are mainly interested in working with depth of coverage data, please check out the pycov submodule which is specifically designed for the task.
- **pybed** : The pybed submodule is designed for working with BED files. It implements ``pybed.BedFrame`` which stores BED data as ``pandas.DataFrame`` via the `pyranges <https://github.com/biocore-ntnu/pyranges>`_ package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `BED specification <https://genome.ucsc.edu/FAQ/FAQformat.html>`_.
- **pychip** : The pychip submodule is designed for working with annotation or manifest files from the Axiom (Thermo Fisher Scientific) and Infinium (Illumina) array platforms.
- **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. The ``pycov.CovFrame`` class also contains many useful plotting methods such as ``CovFrame.plot_region`` and ``CovFrame.plot_uniformity``.
- **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
- **pygff** : The pygff submodule is designed for working with GFF/GTF files. It implements ``pygff.GffFrame`` which stores GFF/GTF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `GFF specification <https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md>`_.
7 changes: 7 additions & 0 deletions docs/api.rst
@@ -14,6 +14,7 @@ Below is the list of submodules available in the fuc API:
- **common** : The common submodule is used by other fuc submodules such as pyvcf and pybed. It also provides many day-to-day actions used in the field of bioinformatics.
- **pybam** : The pybam submodule is designed for working with sequence alignment files (SAM/BAM/CRAM). It essentially wraps the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. If you are mainly interested in working with depth of coverage data, please check out the pycov submodule which is specifically designed for the task.
- **pybed** : The pybed submodule is designed for working with BED files. It implements ``pybed.BedFrame`` which stores BED data as ``pandas.DataFrame`` via the `pyranges <https://github.com/biocore-ntnu/pyranges>`_ package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `BED specification <https://genome.ucsc.edu/FAQ/FAQformat.html>`_.
- **pychip** : The pychip submodule is designed for working with annotation or manifest files from the Axiom (Thermo Fisher Scientific) and Infinium (Illumina) array platforms.
- **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. The ``pycov.CovFrame`` class also contains many useful plotting methods such as ``CovFrame.plot_region`` and ``CovFrame.plot_uniformity``.
- **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
- **pygff** : The pygff submodule is designed for working with GFF/GTF files. It implements ``pygff.GffFrame`` which stores GFF/GTF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `GFF specification <https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md>`_.
@@ -48,6 +49,12 @@ fuc.pybed
.. automodule:: fuc.api.pybed
:members:

fuc.pychip
==========

.. automodule:: fuc.api.pychip
:members:

fuc.pycov
=========

212 changes: 212 additions & 0 deletions fuc/api/pychip.py
@@ -0,0 +1,212 @@
"""
The pychip submodule is designed for working with annotation or manifest
files from the Axiom (Thermo Fisher Scientific) and Infinium (Illumina)
array platforms.
"""

import gzip
import os
import re

import pandas as pd

class AxiomFrame:
    """
    Class for storing Axiom annotation data.

    Parameters
    ----------
    meta : list
        List of metadata lines.
    df : pandas.DataFrame
        DataFrame containing annotation data.
    """
    def __init__(self, meta, df):
        self._meta = meta
        self._df = df.reset_index(drop=True)

    @property
    def meta(self):
        """list : List of metadata lines."""
        return self._meta

    @meta.setter
    def meta(self, value):
        self._meta = value

    @property
    def df(self):
        """pandas.DataFrame : DataFrame containing annotation data."""
        return self._df

    @df.setter
    def df(self, value):
        self._df = value.reset_index(drop=True)

    @classmethod
    def from_file(cls, fn):
        """
        Construct AxiomFrame from a CSV file.

        Parameters
        ----------
        fn : str
            CSV file (compressed or uncompressed).

        Returns
        -------
        AxiomFrame
            AxiomFrame object.
        """
        if fn.startswith('~'):
            fn = os.path.expanduser(fn)

        # Open gzip-compressed files in text mode; plain files as-is.
        if fn.endswith('.gz'):
            f = gzip.open(fn, 'rt')
        else:
            f = open(fn)

        # Collect the '#'-prefixed metadata lines and count them so that
        # pandas can skip them when reading the table.
        meta = []
        n = 0
        for line in f:
            if line.startswith('#'):
                meta.append(line)
                n += 1
        f.close()

        df = pd.read_csv(fn, skiprows=n)

        return cls(meta, df)

    def to_vep(self):
        """
        Convert AxiomFrame to the Ensembl VEP format.

        Returns
        -------
        pandas.DataFrame
            Variants in Ensembl VEP format.
        """
        # Drop probes without a chromosome assignment.
        df = self.df[self.df.Chromosome != '---']

        def one_row(r):
            result = []
            nucleotides = ['A', 'C', 'G', 'T']
            chrom = r['Chromosome']
            strand = r['Strand']
            for alt in r['Alt Allele'].split(' // '):
                # Reset per allele so that trimming one ALT allele does
                # not leak into the next one at multiallelic sites. The
                # int() casts guard against positions read as strings.
                ref = r['Ref Allele']
                start = int(r['Physical Position'])
                end = int(r['Position End'])
                if ref in nucleotides and alt in nucleotides: # SNV
                    pass
                elif alt == '-': # DEL I
                    pass
                elif ref == '-': # INS I
                    # Checked before the equal-length test so that a
                    # single-base insertion is not treated as an MNV.
                    start += 1
                    end = start - 1
                elif len(alt) == len(ref): # MNV
                    pass
                elif len(alt) < len(ref) and ref.startswith(alt): # DEL II
                    # Trim the shared prefix from REF.
                    start += len(alt)
                    ref = ref[len(alt):]
                    alt = '-'
                elif len(alt) > len(ref) and alt.startswith(ref): # INS II
                    # Trim the shared prefix (all of REF) from ALT.
                    start += len(ref)
                    end = start - 1
                    alt = alt[len(ref):]
                    ref = '-'
                else:
                    pass
                line = [chrom, start, end, f'{ref}/{alt}', strand]
                result.append('|'.join([str(x) for x in line]))
            return ','.join(result)

        s = df.apply(one_row, axis=1)
        s = ','.join(s)
        data = [x.split('|') for x in s.split(',')]
        df = pd.DataFrame(data).drop_duplicates()
        # Positions were serialized as text above; restore integers so
        # that sorting is numeric.
        df[1] = df[1].astype(int)
        df[2] = df[2].astype(int)
        df = df.sort_values(by=[0, 1])
        return df

class InfiniumFrame:
    """
    Class for storing Infinium manifest data.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing manifest data.
    """
    def __init__(self, df):
        self._df = df.reset_index(drop=True)

    @property
    def df(self):
        """pandas.DataFrame : DataFrame containing manifest data."""
        return self._df

    @df.setter
    def df(self, value):
        self._df = value.reset_index(drop=True)

    @classmethod
    def from_file(cls, fn):
        """
        Construct InfiniumFrame from a CSV file.

        Parameters
        ----------
        fn : str
            CSV file (compressed or uncompressed).

        Returns
        -------
        InfiniumFrame
            InfiniumFrame object.
        """
        if fn.startswith('~'):
            fn = os.path.expanduser(fn)

        # Open gzip-compressed files in text mode; plain files as-is.
        if fn.endswith('.gz'):
            f = gzip.open(fn, 'rt')
        else:
            f = open(fn)

        lines = f.readlines()
        f.close()

        # The probe table sits between the '[Assay]' and '[Controls]'
        # markers; the line right after '[Assay]' holds the column names.
        for i, line in enumerate(lines):
            if line.startswith('[Assay]'):
                start = i
                headers = lines[i+1].strip().split(',')
            elif line.startswith('[Controls]'):
                end = i

        lines = lines[start+2:end]
        lines = [x.strip().split(',') for x in lines]

        df = pd.DataFrame(lines, columns=headers)

        return cls(df)

    def to_vep(self):
        """
        Convert InfiniumFrame to the Ensembl VEP format.

        Returns
        -------
        pandas.DataFrame
            Variants in Ensembl VEP format.
        """
        # Skip probes whose chromosome is 'XY' or '0'.
        df = self.df[(self.df.Chr != 'XY') & (self.df.Chr != '0')]

        def one_row(r):
            # Alleles are embedded in the source sequence as '[A/B]'.
            matches = re.findall(r'\[([^\]]+)\]', r.SourceSeq)
            if not matches:
                raise ValueError(
                    f'No bracketed alleles found in SourceSeq: {r}')
            a1, a2 = matches[0].split('/')
            return pd.Series([r.Chr, r.MapInfo, a1, a2])

        df = df.apply(one_row, axis=1)
        return df
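
The allele handling in ``AxiomFrame.to_vep`` can be easier to review in isolation. Below is a minimal standalone sketch of the same branch structure, using only the standard library. The function name ``normalize_allele`` is hypothetical and not part of fuc. It assumes that indels written with a shared prefix are trimmed by the length of that prefix, and that an insertion is reported with ``end = start - 1``.

```python
# Hypothetical helper, not part of fuc: mirrors the SNV/DEL/INS/MNV branches
# of AxiomFrame.to_vep for a single REF/ALT pair.
NUCLEOTIDES = {"A", "C", "G", "T"}


def normalize_allele(start, end, ref, alt):
    """Return (start, end, ref, alt) with any shared prefix trimmed."""
    if ref in NUCLEOTIDES and alt in NUCLEOTIDES:
        # SNV: nothing to adjust.
        return start, end, ref, alt
    if alt == "-":
        # Deletion already written without a shared prefix.
        return start, end, ref, alt
    if ref == "-":
        # Insertion already written without a shared prefix.
        return start + 1, start, ref, alt
    if len(alt) == len(ref):
        # MNV: same length, nothing to trim.
        return start, end, ref, alt
    if len(alt) < len(ref) and ref.startswith(alt):
        # Deletion with a shared prefix: drop the prefix from REF.
        return start + len(alt), end, ref[len(alt):], "-"
    if len(alt) > len(ref) and alt.startswith(ref):
        # Insertion with a shared prefix: drop the prefix from ALT.
        new_start = start + len(ref)
        return new_start, new_start - 1, "-", alt[len(ref):]
    # Anything else is passed through unchanged.
    return start, end, ref, alt


if __name__ == "__main__":
    for args in [(100, 100, "A", "G"),
                 (100, 102, "ATG", "A"),
                 (100, 100, "-", "TG"),
                 (100, 100, "A", "ATG")]:
        print(args, "->", normalize_allele(*args))
```

Keeping the function pure (no mutation of shared state) means each ALT allele at a multiallelic site is normalized from the original coordinates.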
