Seaborn library for pretty plots

Seaborn is a visualization library based on matplotlib (and complementary to it – you should really understand matplotlib first). It basically makes your work easier and prettier. The library is not particularly complicated or broad, but it does some things for you that you would otherwise have to do in matplotlib on your own.

Seaborn works very well with the libraries used in Python for data analysis (pandas, numpy, scipy, statsmodels) and can easily be used in a Jupyter notebook to display plots. The most frequently mentioned advantage of seaborn is its built-in themes. Well… they help people who don't know how to combine colors to make plots aesthetic. The second thing is that its functions produce really nice plots that try to show something useful even when called with a minimal set of arguments. At the same time, as with matplotlib, you have an almost endless number of possibilities to adjust your plots.
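For instance, switching the look of all subsequent plots is a single call. A minimal sketch (set_style is one of seaborn's theme helpers, and 'darkgrid' is just one of the built-in styles):

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('darkgrid')       # apply one of seaborn's built-in themes
plt.plot([1, 2, 3], [2, 4, 8])  # any matplotlib plot now picks up the theme
plt.show()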

Additional information about seaborn can be found here: https://seaborn.pydata.org/

Installation is pretty easy:

pip install seaborn

or

conda install seaborn

Additional info about installation (including the development version of seaborn) can be found here: http://seaborn.pydata.org/installing.html

They also did a really nice job when it comes to documentation and tutorials (e.g. https://seaborn.pydata.org/tutorial.html).

My favorite thing about seaborn? I would say the seaborn.distplot function. I usually make two visualizations to look at data before working with it – a scatter plot and a distribution. A scatter plot is easy to obtain in almost any visualization library you work with, as it is with matplotlib. However, to see the distribution along with a KDE plot, I recommend the seaborn function distplot.

Here are some examples: https://seaborn.pydata.org/generated/seaborn.distplot.html?highlight=distplot#seaborn.distplot

Basically, all you need is your data in an array, e.g. as a pandas DataFrame column:

import seaborn as sns
ax = sns.distplot(df['column name'])

To try it out with example data, you can plot normally distributed data:

import seaborn as sns, numpy as np
x = np.random.randn(100)
ax = sns.distplot(x)


Data science in business – perspective from a first-day employee

This is my one and only opportunity to write down what my thoughts about data science were after the first couple of days in a new office. I decided to start another job – my first job in data analysis and my first in a software house – and it felt hugely different from day one.

First of all, I had no idea what anyone was talking about. All those abbreviations and all that office slang are a bit overwhelming at first. You get used to it and understand more every day, but after a couple of days it is still not enough. But hey! During the first weeks you're allowed to ask stupid questions, so use that.

Second of all, your chosen technology is not necessarily the one used in the office. Even if you discussed technologies during the interview and were asked to bring your chosen technology into your new office, you will also have to use the technologies they already use. This is not a bad thing at all – it gives you an opportunity to expand your knowledge and perspective. It is easier to understand the data in the way they work with it, and once you understand it you can go on with your own technologies.

The third thing is co-operation. You will not be alone with the data: others need your results, they (the engineers) change the way the data looks, and they want to learn how to do some analyses on their own – strong co-operation is crucial in data analysis.

Fourth – you need to understand the goal of your existence in the company. You don't analyse data just to analyse it; there is a company strategy behind it, and you have to keep that in mind all the time.

Last, but not least, and connected to all of the above: the company lives its own life, and what you do today may not be so necessary tomorrow, so sometimes you have to deal with unfinished projects. Just go with the flow (and with the company's development).

Python for kids?

Python is a very friendly programming language for adults to start with as their very first programming experience. But can it be used to teach younger people? Of course. I would recommend it, though, as a second step after understanding basic algorithms (e.g. with Scratch or ScratchJr). Python is great for starting the programming journey of teens – those who know what sentence structure is and who can do some mathematics.

Why is Python good to start with?

  • Compared to other widely used programming languages, it has quite an easy syntax. You can experience this from your first line of clean, working code, and it will encourage you to write more.
  • It is a high-level programming language, so you really don't need much code to see an effect – also encouraging.
  • There is a lot of information on the Web about Python, how to start with it, and what to do in case of failure – it is very important to have support, the feeling that there is someone you can turn to for help at any time.
  • If you want to sell something to kids, it must feel interesting to them. Well… you can make simple games with Python (and it's not very hard – see the tiny sketch below this list), you can quite easily prepare a website and, last but not least, coding on a Raspberry Pi may also be an argument.
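To show how little code a playable game needs, here is a minimal number-guessing game sketch (it uses only the standard library; the range 1–20 is an arbitrary choice):

import random

secret = random.randint(1, 20)  # the computer picks a number
guess = None
while guess != secret:
    guess = int(input('Guess a number between 1 and 20: '))
    if guess < secret:
        print('Too low!')
    elif guess > secret:
        print('Too high!')
print('You got it!')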

There are also a couple of books concerning Python for kids. Just to name some:

  1. Python For Kids For Dummies by Brendan Scott
  2. Python for Kids: A Playful Introduction to Programming by Jason R. Briggs
  3. Python Projects for Kids by Jessica Ingrassellino

How does Python connect you with biological databases? #1 – UniProt

In bioinformatics, the ability to automatically use information gathered in numerous biological databases is crucial. Some databases are really easy to use and their wrappers are great, some have very basic wrappers, and some have none. There is a great movement toward providing easy access to all biological databases and tools, but there is still a lot to do.

One of the first databases I came across while programming in Python was UniProt. UniProt (http://www.uniprot.org/) is not so easy to use through its web page if you don't really know what you are looking for. That's a common thing with biological data – it is so diverse that it is impossible to avoid redundancy and complexity. However, after some time, it gets easier.

Let's look at the example page of the human GAPDH protein (http://www.uniprot.org/uniprot/P04406). You can see that the data is categorized, and it really makes your life easier. You can view this page e.g. as XML (so you can extract the part you're interested in) or as text (each line starts with a two-letter code saying what is in that line, so it can also be extracted with, e.g., regular expressions).
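As a quick illustration of the text format, here is a minimal sketch that fetches the plain-text record for P04406 and prints its description (DE) lines; it assumes the entry is reachable under the .txt URL scheme the site used at the time of writing:

import requests

# fetch the plain-text record for human GAPDH (P04406)
r = requests.get('http://www.uniprot.org/uniprot/P04406.txt')
for line in r.text.splitlines():
    if line.startswith('DE'):  # 'DE' lines carry the protein description
        print(line)

There are multiple different approaches proposed to extract the information you need (you have to be careful, as some of the solutions may work for Python 2 or Python 3 only):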

  1. requests (example shown here: http://stackoverflow.com/questions/15514614/how-to-use-python-get-results-from-uniprot-automatically)
    import requests
    from io import StringIO  # Python 3 (in Python 2: from StringIO import StringIO)
    from Bio import SeqIO  # Biopython, needed to parse the FASTA response
    
    params = {"query": "GO:0070337", "format": "fasta"}
    response = requests.get("http://www.uniprot.org/uniprot/", params)
    
    for record in SeqIO.parse(StringIO(response.text), "fasta"):
        print(record.id)  # do what you need here with your sequences
  2. uniprot_tools (I like this way; combined with regular expressions you can extract exactly the information you need; https://pypi.python.org/pypi/uniprot_tools/0.4.1)
    import uniprot as uni
    print(uni.map('P31749', f='ACC', t='P_ENTREZGENEID'))  # map a single id
    print(uni.map(['P31749', 'Q16204'], f='ACC', t='P_ENTREZGENEID'))  # map a list of ids
    print(uni.retrieve('P31749'))
    print(uni.retrieve(['P31749', 'Q16204']))
  3. SwissProt from Biopython (example shown at https://www.biostars.org/p/66904/)
    #!/usr/bin/env python
    """Fetch uniprot entries for given GO terms"""
    import sys
    from Bio import SwissProt
    
    # load GO terms given as command-line arguments
    gos = set(sys.argv[1:])
    sys.stderr.write("Looking for %s GO term(s): %s\n" % (len(gos), " ".join(gos)))
    
    # parse the SwissProt dump from stdin
    k = 0
    sys.stderr.write("Parsing...\n")
    for i, r in enumerate(SwissProt.parse(sys.stdin)):
        sys.stderr.write(" %9i\r" % (i + 1,))
        # scan the cross-references of each record
        for ex_db_data in r.cross_references:
            extdb, extid = ex_db_data[:2]
            if extdb == "GO" and extid in gos:
                k += 1
                sys.stdout.write(">%s %s\n%s\n" % (r.accessions[0], extid, r.sequence))
    sys.stderr.write("Reported %s entries\n" % k)
  4. bioservices (https://pythonhosted.org/bioservices/references.html#bioservices.uniprot.UniProt) – this is an interesting service to look at, as they intend to include wrappers for all important biological databases
    from bioservices import UniProt
    u = UniProt(verbose=False)
    u.mapping("ACC", "KEGG_ID", query='P43403')
    # returns: defaultdict(<type 'list'>, {'P43403': ['hsa:7535']})
    res = u.search("P43403")
    
    # returns the sequence for the ZAP70_HUMAN accession id
    sequence = u.search("ZAP70_HUMAN", columns="sequence")
  5. urllib

This approach is proposed on the UniProt website itself; the original example there is written for Python 2, so below it is adapted to Python 3's module layout:

import urllib.parse, urllib.request  # in Python 2: import urllib, urllib2

url = 'http://www.uniprot.org/uploadlists/'

params = {
    'from': 'ACC',
    'to': 'P_REFSEQ_AC',
    'format': 'tab',
    'query': 'P13368 P20806 Q9UM73 P97793 Q17192'
}

data = urllib.parse.urlencode(params).encode('utf-8')
request = urllib.request.Request(url, data)
contact = ""  # Please set your email address here to help us debug in case of problems.
request.add_header('User-Agent', 'Python %s' % contact)
response = urllib.request.urlopen(request)
page = response.read(200000).decode('utf-8')


You can check out some general info about providing access to biological databases and tools here:

https://pythonhosted.org/bioservices/

Python possibilities for RNA structure folding

There is quite a lot of different software available to biologists for predicting RNA secondary or tertiary structure. Here I won't discuss the different algorithmic approaches and when to use them; instead I will check which approaches are easily available for Python users to include in their software (of course, for specific needs only one program may be suitable, and a different – e.g. shell – approach may be required to use it automatically). I also won't discuss software for protein structure folding, because that is a whole other subject.

  1. Vienna Package (http://www.tbi.univie.ac.at/RNA/#self_packages) – I guess this is the most popular package used in Python for RNA folding (a minimal usage sketch follows this list). Information from the site: “RNA secondary structure prediction through energy minimization is the most used function in the package. We provide three kinds of dynamic programming algorithms for structure prediction: the minimum free energy algorithm of (Zuker & Stiegler 1981) which yields a single optimal structure, the partition function algorithm of (McCaskill 1990) which calculates base pair probabilities in the thermodynamic ensemble, and the suboptimal folding algorithm of (Wuchty et.al 1999) which generates all suboptimal structures within a given energy range of the optimal energy. For secondary structure comparison, the package contains several measures of distance (dissimilarities) using either string alignment or tree-editing (Shapiro & Zhang 1990). Finally, we provide an algorithm to design sequences with a predefined structure (inverse folding).”
  2. Multifold (https://pypi.python.org/pypi/multifold) – According to the authors: “MultiFold is a Python based tool package for RNA footprinting data analysis.
    • It accepts RNA footprinting data generated from SHAPE-Seq, PARS and DMS-Seq technologies.
    • It can fold multiple secondary structures for each transcript based on user provided constraints.
    • It could quantify the abundance of each structure centroid using the RNA footprinting data.
    • It provides a series of commands for users to customize every procedure.
    • It could handle RNA footprinting data generated from gene isoforms or overlapped transcripts.”
  3. Forgi (www.tbi.univie.ac.at/~pkerp/forgi) – it is not exactly for folding but for manipulating already-folded structures, and for this purpose it is an excellent tool.
  4. Barnacle (https://sourceforge.net/projects/barnacle-rna/) – I think it is no longer supported.
  5. Frnakenstein (http://www.stats.ox.ac.uk/research/genome/software/frnakenstein) – it actually uses the Vienna Package.
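Here is the minimal Vienna Package sketch promised above; it assumes the package's Python bindings (the RNA module) are installed, and the sequence is just a toy example:

import RNA

sequence = 'GCGCUUCGCCGCGCGCC'       # toy example sequence
structure, mfe = RNA.fold(sequence)  # minimum free energy prediction
print(structure)                     # secondary structure in dot-bracket notation
print('MFE: %.2f kcal/mol' % mfe)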

There are also other tools written in Python but without a Python interface. Just to name one:

  1. modeRNA (http://genesilico.pl/moderna/)

Many short scripts are also available on GitHub and on private websites, but I would be careful with them:

  1. http://philipuren.com/python/RNAFolding.php
  2. https://github.com/xlr8runner/RNA-Folding
  3. https://github.com/cgoliver/Nussinov

If I left something out, please include it in the comment 🙂

Subjective guide to upcoming conferences/workshops in 2017

This is a very subjective guide to upcoming conferences in 2017 (in the areas of data science, bioinformatics, and programming in general). The order is quite random.

  • BOSC 2017

Prague, Czech Republic

July 22 – 23, 2017

https://www.open-bio.org/wiki/BOSC


  • EMBL Summer School in Bioinformatics 2017

Heidelberg, Germany

June 26 – 30, 2017

http://www.ebi.ac.uk/training/events/2017/summer-school-bioinformatics-0


  • EuroPython 2017

Rimini, Italy

July 9 – 16, 2017

https://ep2017.europython.eu/en/


  • ACM-BCB – WABI

Boston, US

August 20 – 23, 2017

http://acm-bcb.org/2017/index.php

http://www.acm-bcb.org/WABI/2017/


  • RECOMB 2017

Hong Kong

May 3 – 7, 2017

http://cb.csail.mit.edu/cb/recomb2017/


Enjoy :)!


matplotlib #1

matplotlib is one of the best data visualization libraries for Python (or does someone disagree?). It's quite easy to use and the plots you get are really pretty 🙂

In matplotlib #1 I will focus on the basics of the matplotlib library and show an example use of matplotlib.pyplot (a scatter plot). First of all, you need to install matplotlib on your machine, as it is not included in most Python distributions. The easiest way is to use apt-get (Ubuntu) or pip (Ubuntu/Windows). Please refer to the installation guide (https://matplotlib.org/users/installing.html). If you have any problems installing, check Stack Overflow for possible solutions. The most common errors are due to missing dependencies (e.g. pkg-config/libpng-dev/libfreetype6-dev).
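On most setups, a plain pip install is enough (see the installation guide above for platform specifics):

pip install matplotlib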

OK, now that you have matplotlib installed, we can do some magic ;).

First of all, import matplotlib; for plots you will usually use:

import matplotlib.pyplot as plt

For this example, we will use some example data that ships with scikit-learn (it is actually the classic iris dataset, known from R):

from sklearn import datasets
iris = datasets.load_iris()

To prepare this data as a DataFrame, we will do some pandas-based manipulations:

import pandas as pd

frame = pd.DataFrame(iris.data, columns = iris.feature_names)
frame['type'] = iris.target

So now we have a pandas DataFrame with 4 columns of data and a 5th column telling us which plant each row describes (0 – ‘setosa’, 1 – ‘versicolor’, 2 – ‘virginica’). For simply testing the library, we could also use random data generators (e.g. numpy's random module), as sketched below.
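For completeness, here is a minimal sketch of that random-data alternative (the linear relation with noise is an arbitrary choice, just to have something to look at):

import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(50)                # 50 normally distributed x values
y = 2 * x + 0.5 * np.random.randn(50)  # roughly linear relation with noise
plt.scatter(x, y)
plt.show()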

To check what our DataFrame looks like, we can look at its ‘head’:

frame.head()

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  type
0                5.1               3.5                1.4               0.2     0
1                4.9               3.0                1.4               0.2     0
2                4.7               3.2                1.3               0.2     0
3                4.6               3.1                1.5               0.2     0
4                5.0               3.6                1.4               0.2     0

So we can proceed to making our first plot. A line plot (the easiest one) doesn't fit this data, so we will use a scatter plot for all of the plants together, showing the relation between sepal length and petal length.

plt.scatter(frame['sepal length (cm)'], frame['petal length (cm)'])

plt.scatter() creates a scatter plot; its first two arguments are the x values and the y values, respectively.

To actually see the plot we need to write one more line:

plt.show()

[Figure: scatter plot of sepal length vs petal length]

Looks nice, but we really don't know what the units are or what the x and y values represent. Also, you can see that part of the results is separated from the rest. In a later step we will color these dots to see whether that separated part belongs to one of the three species. But first things first.

[Note: if you want to clear the figure, you can always do:

plt.clf()

and start over.]

To add x and y labels with units you can do:

plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')

[Figure: the same scatter plot, now with axis labels]

OK, so now we would like to add some color to see if the separated dots are specific to one species. We have a column that numerically says which result comes from which plant. We can color the dots using the ‘c’ argument of the scatter plot function (we have to provide an array of colors with the same length as the number of dots). In matplotlib, colors can be specified in different ways, e.g. with simple color names like ‘red’, ‘blue’ and ‘green’.

colors = frame['type'].replace(0, 'red').replace(1, 'blue').replace(2, 'green')
plot1 = plt.scatter(frame['sepal length (cm)'], frame['petal length (cm)'], c = colors)

[Figure: scatter plot with the dots colored by species]

So now our plot gives some information. But we need a legend, because I don't think any of you remembers which color is which species.

To do it nicely, it is better to divide the data by ‘type’ and create a distinct scatter plot for each species (on the same image):

plot1 = plt.scatter(frame[frame['type'] == 0]['sepal length (cm)'], frame[frame['type'] == 0]['petal length (cm)'], c = 'red')
plot2 = plt.scatter(frame[frame['type'] == 1]['sepal length (cm)'], frame[frame['type'] == 1]['petal length (cm)'], c = 'blue')
plot3 = plt.scatter(frame[frame['type'] == 2]['sepal length (cm)'], frame[frame['type'] == 2]['petal length (cm)'], c = 'green')
plt.legend([plot1, plot2, plot3], ['setosa', 'versicolor', 'virginica'])

And last, we can add a title to our plot:

plt.title('Iris sepal/petal length')

Overall, our code now looks like this:

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
frame = pd.DataFrame(iris.data, columns = iris.feature_names)
frame['type'] = iris.target
plot1 = plt.scatter(frame[frame['type'] == 0]['sepal length (cm)'], frame[frame['type'] == 0]['petal length (cm)'], c = 'red')
plot2 = plt.scatter(frame[frame['type'] == 1]['sepal length (cm)'], frame[frame['type'] == 1]['petal length (cm)'], c = 'blue')
plot3 = plt.scatter(frame[frame['type'] == 2]['sepal length (cm)'], frame[frame['type'] == 2]['petal length (cm)'], c = 'green')
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.legend([plot1, plot2, plot3], ['setosa', 'versicolor', 'virginica'])
plt.title('Iris sepal/petal length')
plt.show()

[Figure: final scatter plot with labels, legend and title]

I hope everything was understandable. Have fun with matplotlib, and if you have any questions or suggestions for what to do in matplotlib #2 – write them in the comments!