How Python connects you with biological databases? #1 – Uniprot

In bioinfomatics, the possibility to automatically use information gathered in numerous biological databases is crucial. Some databases are really easy to use, wrappers are great, some have very basic wrappers and some has none. There is a great movement to provide easy access to all biological databases and tools but we have still a lot to do.

One of the first databases I came across during Python programming was Uniprot. Uniprot (http://www.uniprot.org/) is not so easy to use through their page if you don’t really know what are you looking for. It’s common thing for biological data – data is so diverse that it is impossible to avoid redundancy and complexity. However, after some time, it gets easier.

Let’s look on example page of human GAPDH protein (http://www.uniprot.org/uniprot/P04406). You can see that data is categorized and it really makes your life easier. You can look at this page e.g., as xml (so you can extract the part you’re interested in) or text (each line starts with two letter information what is in this line, so it can also be extracted with the use of, e.g. regular expressions). There are multiple different approaches proposed to extract information you need (you have to be careful as some of the solutions may work for Python2 or Python3 only):

  1. requests (example shown here: http://stackoverflow.com/questions/15514614/how-to-use-python-get-results-from-uniprot-automatically)
    import requests
    from StringIO import StringIO  # Python 2
    from io import StringIO  # Python 3
    
    params = {"query": "GO:0070337", "format": "fasta"}
    response = requests.get("http://www.uniprot.org/uniprot/", params)
    
    for record in SeqIO.parse(StringIO(r.text), "fasta"):
        # Do what you need here with your sequences.
  2. uniprot tools (I like this way, connecting it with regular expressions you can extract exact information you need; https://pypi.python.org/pypi/uniprot_tools/0.4.1)
    import uniprot as uni
    print uni.map('P31749', f='ACC', t='P_ENTREZGENEID') # map single id
    print uni.map(['P31749','Q16204'], f='ACC', t='P_ENTREZGENEID') # map list of ids
    print uni.retrieve('P31749')
    print uni.retrieve(['P31749','Q16204'])
  3. swissprot (example shown https://www.biostars.org/p/66904/)
    #!/usr/bin/env python
    """Fetch uniprot entries for given go terms"""
    import sys
    from Bio import SwissProt
    #load go terms
    gos = set(sys.argv[1:])
    sys.stderr.write("Looking for %s GO term(s): %s\n" % (len(gos)," ".join(gos)))
    #parse swisprot dump
    k = 0
    sys.stderr.write("Parsing...\n")
    for i,r in enumerate(SwissProt.parse(sys.stdin)):  
        sys.stderr.write(" %9i\r"%(i+1,))
        #parse cross_references
        for ex_db_data in r.cross_references:
            #print ex_db_data
            extdb,extid = ex_db_data[:2]
            if extdb=="GO" and extid in gos:
              k += 1
              sys.stdout.write( ">%s %s\n%s\n" % (r.accessions[0], extid, r.sequence) )
    sys.stderr.write("Reported %s entries\n" % k)  
  4. bioservices (https://pythonhosted.org/bioservices/references.html#bioservices.uniprot.UniProt) – this is interesting service to look at as they intend to include wrappers to all important biological databases
    from bioservices import UniProt
    u = UniProt(verbose=False)
    u.mapping("ACC", "KEGG_ID", query='P43403')
    defaultdict(<type 'list'>, {'P43403': ['hsa:7535']})
    res = u.search("P43403")
    
    # Returns sequence on the ZAP70_HUMAN accession Id
    sequence = u.search("ZAP70_HUMAN", columns="sequence")
  5. urllib

It is proposed on uniprot website, example:

import urllib,urllib2

url = 'http://www.uniprot.org/uploadlists/'

params = {
'from':'ACC',
'to':'P_REFSEQ_AC',
'format':'tab',
'query':'P13368 P20806 Q9UM73 P97793 Q17192'
}

data = urllib.urlencode(params)
request = urllib2.Request(url, data)
contact = "" # Please set your email address here to help us debug in case of problems.
request.add_header('User-Agent', 'Python %s' % contact)
response = urllib2.urlopen(request)
page = response.read(200000)

 

You check out some general info about providing access to biological databases and tools here:

https://pythonhosted.org/bioservices/

Python possibilities for RNA structure folding

There are quite a lot of different software available for biologists for prediction of RNA secondary or tertiary structure. Here, I won’t discuss different algorithmic approaches and when to use them but I will check which approaches are available easily for Python users to include in their software (of course in specific needs only one software would be suitable and different – e.g. shell – approach would be required to use it automatically). I also won’t discuss software for protein structure folding because it is a whole new subject.

  1. Vienna Package (http://www.tbi.univie.ac.at/RNA/#self_packages) – I guess that this is most popular package used in Python for RNA folding. Information from the site: “RNA secondary structure prediction through energy minimization is the most used function in the package. We provide three kinds of dynamic programming algorithms for structure prediction: the minimum free energy algorithm of (Zuker & Stiegler 1981) which yields a single optimal structure, the partition function algorithm of (McCaskill 1990) which calculates base pair probabilities in the thermodynamic ensemble, and the suboptimal folding algorithm of (Wuchty et.al 1999) which generates all suboptimal structures within a given energy range of the optimal energy. For secondary structure comparison, the package contains several measures of distance (dissimilarities) using either string alignment or tree-editing (Shapiro & Zhang 1990). Finally, we provide an algorithm to design sequences with a predefined structure (inverse folding).”
  2. Multifold (https://pypi.python.org/pypi/multifold) – According to the authors:”MultifFold is a Python based tool package for RNA footprinting data analysis.
    • It accepts RNA footprinting data generated from SHAPE-Seq, PARS and DMS-Seq technologies.
    • It can fold multiple secondary structures for each transcript based on user provideed constraints.
    • It could quantify the abundance of each structure centroid using the RNA footprinting data.
    • It provides a series of commands for users to customize every procedure.
    • It could handle RNA footprinting data generated from gene isoforms or overlapped transcripts.”
  3. Forgi (www.tbi.univie.ac.at/~pkerp/forgi) – it is not for folding exactly but for manipulating folded structures and for this usage it is an excellent tool
  4. Barnacle (https://sourceforge.net/projects/barnacle-rna/) – I think it’s not supported any more.
  5. Frnakenstein (http://www.stats.ox.ac.uk/research/genome/software/frnakenstein) – actually uses Vienna Package

There are also other tools written in Python but not implemented with Python interface. Just to name one:

  1. modeRNA (http://genesilico.pl/moderna/)

Also many short scripts are available at github and private websites but I would be careful with them:

  1. http://philipuren.com/python/RNAFolding.php
  2. https://github.com/xlr8runner/RNA-Folding
  3. https://github.com/cgoliver/Nussinov

If I left something out, please include it in the comment 🙂

Subjective guide to upcoming conferences/workshops in 2017

This is a very subjective guide to upcoming conferences in 2017 (in area of Data Science, Bioinformatics, Programming in general). Order is quite random.

  • BOSC 2017

Prague, Czech Republic

July 22 – 23, 2017

https://www.open-bio.org/wiki/BOSC

 

  • EMBL Summer School in Bioinformatics 2017

Heidelberg, Germany

June 26 – 30 June, 2017

http://www.ebi.ac.uk/training/events/2017/summer-school-bioinformatics-0

 

  • EuroPython 2017

Rimini, Italy

July 9 – 16, 2017

https://ep2017.europython.eu/en/

 

  • AMC/BCB – WABI

Boston, US

August 20 – 23, 2017

http://acm-bcb.org/2017/index.php

http://www.acm-bcb.org/WABI/2017/

 

  • RECOMB 2017

Hong Kong

May 3 – 7, 2017

http://cb.csail.mit.edu/cb/recomb2017/

 

Enjoy :)!

 

 

Biopython 1.69 released

Great news for all bioinformatitians out there  – new biopython version was released.

What is new?

First of all Python 2.6 was dropped for further development, but many other versions are supported (Python 2.7, 3.3, 3.4, 3.5, 3.6, PyPy v5.7, PyPy3.5 v5.7 beta, and Jython 2.7).

Many changes has been made, both improving performance and functionality, broadening compatibilities with biological databases.

  • Bio.AlignIO supports MAF format,
  • Bio.SearchIO.AbiIO support FSA files,
  • Uniprot parser parse “submittedName” in XML files,
  • NEXUS parser was improved to improve co-working with tools like the BEAST TreeAnnotator,
  • New parser was introduced for ExPASy Cellosaurus,
  • Bio.Seq module has complement function,
  • SeqFeature object’s qualifiers is explicitly ordered dictionary,
  • Bio.SeqIO UniProt-XML parser was updated to cope with features with unknown locations which can be found in mass spec data.
  • Bio.SeqIO GenBank, EMBL, and IMGT parsers were updated,
  • Bio.Affy package supports CEL format version 4,
  • Bio.Restriction enzyme list has been updated,
  • Bio.PDB.PDBList now can download PDBx/mmCif (new default).

What I really like is that they use “Python PEP8, PEP257 and best practice standard coding style”. Great example to others ;).

You can check out all contributors and details here:

https://news.open-bio.org/2017/04/07/biopython-1-69-released/

Maybe some day you (or I) will be at this list?

And… have fun with new Biopython 🙂