How does Python connect you with biological databases? #1 – UniProt

In bioinformatics, the possibility to automatically use information gathered in numerous biological databases is crucial. Some databases are really easy to use and their wrappers are great, some have very basic wrappers, and some have none. There is a great movement to provide easy access to all biological databases and tools, but we still have a lot to do.

One of the first databases I came across while programming in Python was UniProt. UniProt (http://www.uniprot.org/) is not so easy to use through its web page if you don’t really know what you are looking for. That’s a common thing for biological data – the data is so diverse that it is impossible to avoid redundancy and complexity. However, after some time, it gets easier.

Let’s look at the example page of the human GAPDH protein (http://www.uniprot.org/uniprot/P04406). You can see that the data is categorized, and that really makes your life easier. You can view this page e.g. as XML (so you can extract the part you’re interested in) or as text (each line starts with a two-letter code saying what the line contains, so the parts you need can also be extracted with, e.g., regular expressions).
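
As a quick illustration of the text format, here is a minimal Python 3 sketch (the entry URL and the “DE” line code come from the flat-file format; the regex itself is just an example):

import re
import urllib.request

# Fetch the flat-text entry for human GAPDH (P04406).
url = "http://www.uniprot.org/uniprot/P04406.txt"
text = urllib.request.urlopen(url).read().decode("utf-8")

# Every line starts with a two-letter code; here we grab the "DE" (description) lines.
description_lines = re.findall(r"^DE\s+(.+)$", text, flags=re.MULTILINE)
print("\n".join(description_lines))

There are multiple approaches proposed for extracting the information you need (you have to be careful, as some of the solutions below may work on Python 2 or Python 3 only):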

  1. requests (an example is shown here: http://stackoverflow.com/questions/15514614/how-to-use-python-get-results-from-uniprot-automatically)
    import requests
    from io import StringIO  # Python 3 (on Python 2: from StringIO import StringIO)
    from Bio import SeqIO

    params = {"query": "GO:0070337", "format": "fasta"}
    response = requests.get("http://www.uniprot.org/uniprot/", params)

    for record in SeqIO.parse(StringIO(response.text), "fasta"):
        print(record.id)  # do what you need here with your sequences
  2. uniprot_tools (I like this way; combining it with regular expressions, you can extract the exact information you need; https://pypi.python.org/pypi/uniprot_tools/0.4.1)
    import uniprot as uni

    print(uni.map('P31749', f='ACC', t='P_ENTREZGENEID'))  # map a single id
    print(uni.map(['P31749', 'Q16204'], f='ACC', t='P_ENTREZGENEID'))  # map a list of ids
    print(uni.retrieve('P31749'))
    print(uni.retrieve(['P31749', 'Q16204']))
  3. Bio.SwissProt (example shown at https://www.biostars.org/p/66904/)
    #!/usr/bin/env python
    """Fetch UniProt entries for given GO terms.

    Reads a SwissProt flat-file dump from stdin, e.g.:
    python this_script.py GO:0070337 < uniprot_sprot.dat
    """
    import sys
    from Bio import SwissProt

    # load GO terms from the command line
    gos = set(sys.argv[1:])
    sys.stderr.write("Looking for %s GO term(s): %s\n" % (len(gos), " ".join(gos)))

    # parse the SwissProt dump from stdin
    k = 0
    sys.stderr.write("Parsing...\n")
    for i, r in enumerate(SwissProt.parse(sys.stdin)):
        sys.stderr.write(" %9i\r" % (i + 1,))
        # scan the cross-references for matching GO terms
        for ex_db_data in r.cross_references:
            extdb, extid = ex_db_data[:2]
            if extdb == "GO" and extid in gos:
                k += 1
                sys.stdout.write(">%s %s\n%s\n" % (r.accessions[0], extid, r.sequence))
    sys.stderr.write("Reported %s entries\n" % k)
  4. bioservices (https://pythonhosted.org/bioservices/references.html#bioservices.uniprot.UniProt) – this is an interesting project to watch, as they intend to include wrappers for all important biological databases
    from bioservices import UniProt

    u = UniProt(verbose=False)
    u.mapping("ACC", "KEGG_ID", query='P43403')
    # defaultdict(<type 'list'>, {'P43403': ['hsa:7535']})
    res = u.search("P43403")

    # returns the sequence for the ZAP70_HUMAN accession
    sequence = u.search("ZAP70_HUMAN", columns="sequence")
  5. urllib

This approach is proposed on the UniProt website itself. The example below is Python 2 (it uses urllib and urllib2); a Python 3 equivalent follows it:

import urllib, urllib2

url = 'http://www.uniprot.org/uploadlists/'

params = {
    'from': 'ACC',
    'to': 'P_REFSEQ_AC',
    'format': 'tab',
    'query': 'P13368 P20806 Q9UM73 P97793 Q17192'
}

data = urllib.urlencode(params)
request = urllib2.Request(url, data)
contact = ""  # Please set your email address here to help us debug in case of problems.
request.add_header('User-Agent', 'Python %s' % contact)
response = urllib2.urlopen(request)
page = response.read(200000)
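
A rough Python 3 equivalent of the same request, using urllib.request and urllib.parse against the same endpoint (only the final print is mine, to show the result):

import urllib.parse
import urllib.request

url = 'http://www.uniprot.org/uploadlists/'

params = {
    'from': 'ACC',
    'to': 'P_REFSEQ_AC',
    'format': 'tab',
    'query': 'P13368 P20806 Q9UM73 P97793 Q17192'
}

data = urllib.parse.urlencode(params).encode('utf-8')  # the POST body must be bytes
request = urllib.request.Request(url, data)
contact = ""  # set your email address here to help UniProt debug problems
request.add_header('User-Agent', 'Python %s' % contact)

with urllib.request.urlopen(request) as response:
    page = response.read(200000).decode('utf-8')
print(page)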


You can check out some general info about providing access to biological databases and tools here:

https://pythonhosted.org/bioservices/


Python possibilities for RNA structure folding

There is quite a lot of software available to biologists for predicting RNA secondary or tertiary structure. Here, I won’t discuss the different algorithmic approaches and when to use them, but I will check which tools are easily available to Python users to include in their own software (of course, for a specific need only one program may be suitable, and a different approach – e.g. calling it from the shell – may be required to use it automatically). I also won’t discuss software for protein structure folding, because that is a whole separate subject.

  1. Vienna Package (http://www.tbi.univie.ac.at/RNA/#self_packages) – I guess this is the most popular package used in Python for RNA folding (see a minimal usage sketch after this list). Information from the site: “RNA secondary structure prediction through energy minimization is the most used function in the package. We provide three kinds of dynamic programming algorithms for structure prediction: the minimum free energy algorithm of (Zuker & Stiegler 1981) which yields a single optimal structure, the partition function algorithm of (McCaskill 1990) which calculates base pair probabilities in the thermodynamic ensemble, and the suboptimal folding algorithm of (Wuchty et.al 1999) which generates all suboptimal structures within a given energy range of the optimal energy. For secondary structure comparison, the package contains several measures of distance (dissimilarities) using either string alignment or tree-editing (Shapiro & Zhang 1990). Finally, we provide an algorithm to design sequences with a predefined structure (inverse folding).”
  2. MultiFold (https://pypi.python.org/pypi/multifold) – according to the authors: “MultiFold is a Python based tool package for RNA footprinting data analysis.
    • It accepts RNA footprinting data generated from SHAPE-Seq, PARS and DMS-Seq technologies.
    • It can fold multiple secondary structures for each transcript based on user provided constraints.
    • It could quantify the abundance of each structure centroid using the RNA footprinting data.
    • It provides a series of commands for users to customize every procedure.
    • It could handle RNA footprinting data generated from gene isoforms or overlapped transcripts.”
  3. Forgi (www.tbi.univie.ac.at/~pkerp/forgi) – it is not for folding itself but for manipulating already folded structures, and for that purpose it is an excellent tool
  4. Barnacle (https://sourceforge.net/projects/barnacle-rna/) – I think it is no longer maintained.
  5. Frnakenstein (http://www.stats.ox.ac.uk/research/genome/software/frnakenstein) – actually uses the Vienna Package internally

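For the Vienna Package (item 1 above), structure prediction from Python is essentially a one-liner; here is a minimal sketch, assuming ViennaRNA was installed together with its Python bindings (the RNA module):

import RNA

sequence = "GCGCUUCGCCGCGCGCC"
structure, mfe = RNA.fold(sequence)  # dot-bracket structure and minimum free energy
print(structure)
print("MFE: %.2f kcal/mol" % mfe)
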
There are also other tools written in Python that do not come with a Python interface. Just to name one:

  1. modeRNA (http://genesilico.pl/moderna/)

There are also many short scripts available on GitHub and on private websites, but I would be careful with them:

  1. http://philipuren.com/python/RNAFolding.php
  2. https://github.com/xlr8runner/RNA-Folding
  3. https://github.com/cgoliver/Nussinov

If I left something out, please mention it in the comments 🙂

Subjective guide to upcoming conferences/workshops in 2017

This is a very subjective guide to upcoming conferences in 2017 (in the areas of data science, bioinformatics, and programming in general). The order is quite random.

  • BOSC 2017

Prague, Czech Republic

July 22 – 23, 2017

https://www.open-bio.org/wiki/BOSC


  • EMBL Summer School in Bioinformatics 2017

Heidelberg, Germany

June 26 – 30, 2017

http://www.ebi.ac.uk/training/events/2017/summer-school-bioinformatics-0


  • EuroPython 2017

Rimini, Italy

July 9 – 16, 2017

https://ep2017.europython.eu/en/


  • ACM-BCB – WABI

Boston, US

August 20 – 23, 2017

http://acm-bcb.org/2017/index.php

http://www.acm-bcb.org/WABI/2017/


  • RECOMB 2017

Hong Kong

May 3 – 7, 2017

http://cb.csail.mit.edu/cb/recomb2017/


Enjoy :)!


Biopython 1.69 released

Great news for all bioinformaticians out there – a new Biopython version has been released.

What is new?

First of all, Python 2.6 support was dropped for further development, but many other versions are supported (Python 2.7, 3.3, 3.4, 3.5, 3.6, PyPy v5.7, PyPy3.5 v5.7 beta, and Jython 2.7).

Many changes have been made, both improving performance and functionality and broadening compatibility with biological databases:

  • Bio.AlignIO supports the MAF format,
  • Bio.SeqIO.AbiIO supports FSA files,
  • the UniProt parser now parses “submittedName” in XML files,
  • the NEXUS parser was improved to work better with tools like the BEAST TreeAnnotator,
  • a new parser was introduced for the ExPASy Cellosaurus,
  • the Bio.Seq module now has a complement function (see the quick sketch after this list),
  • the SeqFeature object’s qualifiers attribute is now an explicitly ordered dictionary,
  • the Bio.SeqIO UniProt-XML parser was updated to cope with features with unknown locations, which can be found in mass spec data,
  • the Bio.SeqIO GenBank, EMBL, and IMGT parsers were updated,
  • the Bio.Affy package supports version 4 of the CEL format,
  • the Bio.Restriction enzyme list has been updated,
  • Bio.PDB.PDBList can now download PDBx/mmCIF files (the new default).
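
For example, the new module-level complement function in Bio.Seq works directly on plain strings; a quick sketch (the reverse_complement helper shown alongside it already existed):

from Bio.Seq import complement, reverse_complement

# Both module-level helpers accept plain strings as well as Seq objects.
print(complement("ACGTN"))          # TGCAN
print(reverse_complement("ACGTN"))  # NACGT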

What I really like is that they follow the “Python PEP8, PEP257 and best practice standard coding style”. A great example for others ;).

You can check out all contributors and details here:

https://news.open-bio.org/2017/04/07/biopython-1-69-released/

Maybe some day you (or I) will be on that list?

And… have fun with the new Biopython 🙂

Where to look for bioinformatic knowledge?

Bioinformatics is a rapidly emerging field at the intersection of IT, biology and statistics (and chemistry, physics, and so on). People who work ‘in bioinformatics’ have backgrounds as varied as in no other field of science. But the question is: where to learn bioinformatics, where to find interesting articles, where to find problems and data to work with (to learn on), and finally, where to publish?

Where to learn? Recently, I wrote about Coursera courses (https://scienceisthenewblackblog.wordpress.com/2017/02/20/coursera-and-udemy/), which I really recommend. The bioinformatics courses on Coursera are related to Rosalind (http://rosalind.info), where you can find problems to solve and check whether your solution is correct (and fast enough). The Rosalind project was named after Rosalind Franklin, whose role in understanding the structure of DNA (associated mainly with Watson and Crick) is undervalued. There is also a nice book covering various basic bioinformatics algorithms by the same authors (http://bioinformaticsalgorithms.com/). If you’re already a bit familiar with bioinformatics, the best way is to work on real data. For example, for microarrays and NGS, a lot of exemplary data is deposited at https://www.ncbi.nlm.nih.gov/ (in truly various, not very unified formats, but you can use it to learn), and these datasets are often associated with papers, so you can compare what you got from your analyses with what the authors got.

There are also many Facebook pages (e.g., European Bioinformatics Institute (EMBL-EBI)), Twitter accounts (e.g., @Bioinformatics_, @BioinfoRRmatics) and blogs (e.g., http://www.opiniomics.org/, http://ivory.idyll.org/blog/, https://flxlexblog.wordpress.com/, http://lab.loman.net/) where you can find info on how to start with bioinformatics and how the field is developing. You really just need to google it and you’ll find an ocean of information.

For learning, I also recommend going through some bioinformatics papers and checking out the code (often available on GitHub or on the researchers’ sites): try to run it and understand how it works. What technologies did the authors use (of course, it is better to analyze papers from top bioinformatics labs, to be sure that the choice of technology wasn’t just some graduate student’s guess 😉 ), what data does the software work on, and what output does it provide? Many data types are specific to bioinformatics, and this is an interesting way to learn about them.

As for publishing, there are a lot of possibilities. There are strictly ‘bioinformatics’ journals, to name a couple: Bioinformatics (Oxford), BMC Bioinformatics and Briefings in Bioinformatics. More mathematical/statistical journals have also adapted to publishing bioinformatics papers (e.g., Journal of Theoretical Biology or Journal of Mathematical Biology). There are also many possibilities to publish bioinformatics papers (tools or results of analyses) in biological journals, depending on the biological question investigated. For example, Nucleic Acids Research not only publishes bioinformatics papers on a ‘daily’ basis but also has a special Web Server Issue and a special Database Issue. So the possibilities are really broad.

Happy ‘bioinformating’ 😉

Translation part #1

So I have started to implement the translation process (from an RNA/DNA sequence to a protein sequence).

First of all, I needed to prepare a codon table – an instruction saying which three nucleotides code for which amino acid. I managed that with a Python dictionary. I also prepared a reverse dictionary (for each amino acid, a list of the codons that encode it); a minimal sketch of the idea follows the link below.

https://github.com/science-is-the-new-black/bio_stuff/blob/master/translate/translate_dicts.py
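
A heavily abridged sketch of the two dictionaries (the names here are illustrative; the full tables live in the linked translate_dicts.py):

# Abridged codon table; the full 64-codon dictionary is in translate_dicts.py.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "UUC": "F", "UGG": "W",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons encoded as "*"
    # ... the remaining codons ...
}

# Build the reverse mapping: amino acid -> list of codons that encode it.
AA_TO_CODONS = {}
for codon, aa in CODON_TABLE.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)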

However, when I started to implement the functionality, a couple of questions appeared. Which reading frame should be used? A single one or all of them? Should the sequence be translated along its whole length or from the first start codon? From the first start codon only, or from all start codons? Should translation end at a stop codon, or run to the end of the input sequence with a gap representing each stop codon? Should only one strand be translated, or the reverse strand as well?

I decided to check out existing solutions:

  1. http://www.bioinformatics.org/sms/ – one chosen reading frame (or selection of the region to translate by uppercase), an option to reverse the sequence, and the possibility to define your own genetic code (I don’t think that’s necessary),
  2. http://www.bioinformatics.org/sms2/translate.html – the same, but with a choice of multiple genetic codes,
  3. http://web.expasy.org/translate/ – a choice of multiple genetic codes,
  4. http://www.ebi.ac.uk/Tools/st/emboss_transeq/ – one reading frame, a choice of genetic codes, trim and reverse options,
  5. http://www.fr33.net/translator.php – all reading frames in both directions.

I still don’t have all the answers, but I decided to leave most of the decisions to the user (which means I have to implement all the options). I hadn’t planned to, but now I think that implementing the various genetic codes is necessary.

I started by finding start codons with a regex and by translating the sequence in the three forward reading frames (a rough sketch of this step is below). So far, so good. By the next project post, I think it will be done.
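
A rough sketch of that first step (self-contained; CODON_TABLE is the abridged table from the sketch above, and all names are mine, not the repo’s):

import re

# Abridged codon table, repeated so this snippet runs on its own.
CODON_TABLE = {"AUG": "M", "GCC": "A", "UUU": "F", "UUC": "F", "UGA": "*"}

def translate_frames(rna):
    """Translate the three forward reading frames ("X" = codon not in the table)."""
    frames = []
    for offset in range(3):
        codons = (rna[i:i + 3] for i in range(offset, len(rna) - 2, 3))
        frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames

rna = "GGAUGGCCUUUUGAAUGUUC"
print([m.start() for m in re.finditer("AUG", rna)])  # start codon positions: [2, 14]
print(translate_frames(rna))                         # e.g. frame 2 gives "MAF*MF"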


Development of a pipeline for automated, high-throughput analysis of paraspeckle proteins reveals specific roles for importin α proteins

For today, I have a new paper that is only partially bioinformatic; however, I’ll obviously focus on that part. The article was written by Andrew T. Major, Yoichi Miyamoto, Camden Y. Lo, David A. Jans and Kate L. Loveland, and can be found here: http://www.nature.com/articles/srep43323

Generally, the authors hypothesized that changes in IMPα proteins in paraspeckles are important for spermatogenesis. Because earlier attempts had been unsuccessful (due to natural variability among cells), they built a high-throughput image analysis pipeline to analyze thousands of cells and so cut through the background of normal biological variation. Images were acquired as tiled confocal z-series. The authors used the Imaris software to identify cells, nuclei and PSPC1/SFPQ nuclear spots. Results were exported from Imaris in CSV format, processed with Python scripts, and then statistical analysis was performed with R (all scripts – Python and R, as well as shell scripts – are available as supplementary data).

Thresholds were computed for two fluorescent signals: GFP, used to image the whole cell (cytoplasm and nucleus) and including the autofluorescent signal from unmodified cells, and DAPI/Draq5, used to image the nuclei. Cells located on the border of an image were excluded from the analysis in the R scripts, as were cells with undetectable nuclei. The pipeline calculated the ratio between the nuclear and cytoplasmic fluorescent signals for the proteins, something that is very time-consuming and subjective when done manually. Results obtained with the automatic workflow remained in general agreement with earlier results gathered by manual paraspeckle analysis.

To work with the overwhelming amount of gathered data, the authors used PCA to visualize the 2D relationships among the analyzed parameters. For the statistical analysis, the R script relied on additional packages: “car”, “epitools”, “geepack” and “ggplot2”. Comparisons between groups were made using generalized linear models and logistic or linear regression.

What is really important is that this pipeline could quite easily be adapted to analyses in other research areas. A disadvantage is the dependence on the Imaris software which, besides its unquestionable advantages, is also very expensive. However, with some manipulation, images can also be saved in CSV format from free imaging software such as ImageJ, although the pipeline would then need to be heavily modified and tested.

I’m actually quite impressed with the explanation of the scripts in the readme file and in the scripts themselves, especially as this is not a bioinformatics paper. These pipeline scripts are easier to use than some software that is published explicitly for reuse. Everything worked perfectly on the first run on Ubuntu 16.04 LTS. Good job! It took 5 days to run all the analyses (on a 4-year-old personal computer) and, as expected, the results matched the publication (no, I did not check them all). You just need to check all the dependencies (indicated in the scripts, e.g. the R libraries). However, it would help to explain in more detail (and somewhere easy to find) what each folder contains and how to interpret each CSV file and the results in the generated PDFs.

Moreover, some lines in the scripts are unnecessary, the authors knew that, and they left them in anyway. I don’t know why. In the StatsView.R script we can find lines like:

citation("ggbiplot") # Nothing used from this in final analysis

or

citation("reshape2") # Requirement for ggplot2?

So why include libraries that are not used, and why not check whether something really is a requirement? However, the script is enormous, and it keeps the user updated in the shell about which step it is currently on, so it’s forgivable.

Overall, the scripts are reusable, and this paper is a good example of how our scripts can help accelerate research 🙂