Python for kids?

Python is a very friendly programming language for adults as a very first programming experience. But can it be used to teach younger learners? Of course, although I would recommend it as a second step, after understanding basic algorithms (e.g. with Scratch or ScratchJr). Python is a great way for teens to start their programming journey, once they know what sentence structure is and can do some mathematics.

Why is Python a good language to start with?

  • Compared to other widely used programming languages, it has quite easy syntax. You can experience it from your first line of clean, working code, which will encourage you to write more.
  • It is a high-level programming language, so you really don’t need much code to see an effect – also encouraging.
  • There is a lot of information on the Web about Python, how to start with it and what to do in case of failure – it is very important to have support and the feeling that there is someone to turn to for help at any time.
  • If you want to sell something to kids, it must feel interesting to them. Well… you can make simple games with Python (and it’s not very hard), you can quite easily prepare a website and, last but not least, coding on a Raspberry Pi may also be an argument.

There are also a couple of books concerning Python for kids. Just to name some:

  1. Python For Kids For Dummies by Brendan Scott
  2. Python for Kids: A Playful Introduction to Programming by Jason R. Briggs
  3. Python Projects for Kids by Jessica Ingrassellino

How does Python connect you with biological databases? #1 – UniProt

In bioinformatics, the possibility to automatically use information gathered in numerous biological databases is crucial. Some databases are really easy to use and have great wrappers, some have very basic wrappers and some have none. There is a great movement to provide easy access to all biological databases and tools, but there is still a lot to do.

One of the first databases I came across during Python programming was UniProt. UniProt (http://www.uniprot.org/) is not so easy to use through its web page if you don’t really know what you are looking for. That is a common thing for biological data – the data is so diverse that it is impossible to avoid redundancy and complexity. However, after some time, it gets easier.

Let’s look at the example page of the human GAPDH protein (http://www.uniprot.org/uniprot/P04406). You can see that the data is categorized, which really makes your life easier. You can also view this page e.g. as XML (so you can extract the part you’re interested in) or as text (each line starts with a two-letter code saying what is in that line, so it can also be extracted with, e.g., regular expressions – see the short sketch after the examples below). There are multiple approaches for extracting the information you need (be careful, as some of the solutions may work for Python 2 or Python 3 only):

  1. requests (example shown here: http://stackoverflow.com/questions/15514614/how-to-use-python-get-results-from-uniprot-automatically)
    import requests
    from io import StringIO  # Python 3; for Python 2 use: from StringIO import StringIO
    from Bio import SeqIO    # Biopython, used to parse the returned FASTA text
    
    params = {"query": "GO:0070337", "format": "fasta"}
    response = requests.get("http://www.uniprot.org/uniprot/", params)
    
    for record in SeqIO.parse(StringIO(response.text), "fasta"):
        print(record.id)  # do what you need here with your sequences
  2. uniprot_tools (I like this approach; combined with regular expressions, you can extract exactly the information you need; https://pypi.python.org/pypi/uniprot_tools/0.4.1)
    # note: this example uses Python 2 print syntax
    import uniprot as uni
    print uni.map('P31749', f='ACC', t='P_ENTREZGENEID')              # map a single id
    print uni.map(['P31749', 'Q16204'], f='ACC', t='P_ENTREZGENEID')  # map a list of ids
    print uni.retrieve('P31749')
    print uni.retrieve(['P31749', 'Q16204'])
  3. Bio.SwissProt from Biopython (example shown at https://www.biostars.org/p/66904/)
    #!/usr/bin/env python
    """Fetch UniProt entries for given GO terms."""
    import sys
    from Bio import SwissProt
    
    # load GO terms from the command line
    gos = set(sys.argv[1:])
    sys.stderr.write("Looking for %s GO term(s): %s\n" % (len(gos), " ".join(gos)))
    
    # parse the SwissProt dump from stdin
    k = 0
    sys.stderr.write("Parsing...\n")
    for i, r in enumerate(SwissProt.parse(sys.stdin)):
        sys.stderr.write(" %9i\r" % (i + 1,))
        # scan cross-references for matching GO terms
        for ex_db_data in r.cross_references:
            extdb, extid = ex_db_data[:2]
            if extdb == "GO" and extid in gos:
                k += 1
                sys.stdout.write(">%s %s\n%s\n" % (r.accessions[0], extid, r.sequence))
    sys.stderr.write("Reported %s entries\n" % k)
  4. bioservices (https://pythonhosted.org/bioservices/references.html#bioservices.uniprot.UniProt) – this is an interesting project to look at, as they intend to provide wrappers for all important biological databases
    from bioservices import UniProt
    u = UniProt(verbose=False)
    u.mapping("ACC", "KEGG_ID", query='P43403')
    # returns: defaultdict(<type 'list'>, {'P43403': ['hsa:7535']})
    res = u.search("P43403")
    
    # returns the sequence for the ZAP70_HUMAN accession id
    sequence = u.search("ZAP70_HUMAN", columns="sequence")
  5. urllib

This approach is proposed on the UniProt website; example (Python 2):

import urllib, urllib2

url = 'http://www.uniprot.org/uploadlists/'

params = {
    'from': 'ACC',
    'to': 'P_REFSEQ_AC',
    'format': 'tab',
    'query': 'P13368 P20806 Q9UM73 P97793 Q17192'
}

data = urllib.urlencode(params)
request = urllib2.Request(url, data)
contact = ""  # Please set your email address here to help us debug in case of problems.
request.add_header('User-Agent', 'Python %s' % contact)
response = urllib2.urlopen(request)
page = response.read(200000)
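
As mentioned above, the plain-text format can also be handled with regular expressions. Below is a minimal sketch of my own (not taken from the UniProt documentation) that downloads the text entry for P04406 with requests and pulls out the recommended protein name from the DE line; the exact pattern is an assumption based on the flat-file layout:

import re
import requests

# fetch the flat-text entry for human GAPDH (P04406)
response = requests.get("http://www.uniprot.org/uniprot/P04406.txt")

# every line starts with a two-letter code; DE lines hold the description
match = re.search(r"^DE\s+RecName: Full=(.+?);", response.text, re.MULTILINE)
if match:
    print(match.group(1))  # the recommended protein name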

 

You can check out some general information about providing access to biological databases and tools here:

https://pythonhosted.org/bioservices/

Programming for kids #2 – ScratchJr

ScratchJr is Scratch for younger kids. You don’t even need to be able to read or write to use it, but it will still show you the basics of algorithms and lets you create simple scenes and games.

All you need is a tablet with Android or an iPad and you’re free to go (the app is free for both systems). Like Scratch, it was created at one of the best universities in the world, MIT (Massachusetts Institute of Technology).

Similarly to Scratch, children can draw their ‘ghosts’ and background pictures and make them move (jumping, dancing etc.). Simple loops and conditional actions are also possible. What children really like is the possibility to record their own voices.

If you don’t have a tablet, you can also try some emulators for computers, e.g. http://www.scratchguide.com/how-to-run-scratchjr-on-windows-and-mac/.

You can find more information here: https://www.scratchjr.org/

There are also lots of ideas on YouTube.

Have fun!

Examples of Data Scientists’ Portfolios

While searching for employment as a Data Scientist, it is important to show your skills with a well-prepared portfolio, just as developers show their GitHub accounts to demonstrate their programming skills. Your portfolio should show how you use the skills stated in your resume. I think the most important thing is to tell a story about the chosen data: show what you can do with openly available data, how insightful you are when it comes to asking questions based on the data, and whether you can present the results clearly (but also aesthetically and beautifully).

Let’s look at some examples of how projects can be presented:

1. Projects – the scientific way

http://timdettmers.com/data-science-portfolio/

I like this portfolio because I am a scientist (but it’s probably not ideal for recruiters). Each project is described with an abstract, methods, and results with discussion (with an accompanying figure). It’s quite simple, with no graphical fireworks, but it’s clear.

2. Projects – the more advertising-oriented way

http://binnie869.github.io/

Projects are represented by a title, a short comment and an image that redirects to the GitHub project (code).

3. Analyses – tools used

http://gemelli.spacescience.org/~hahnjm/data_science/data_science.html

As the author states, it is not exactly projects but activities that are shown. Each activity is represented by a graph and a short description of the statistical method/tool used. It certainly shows the author’s skills.

4. Projects – the very advertising-oriented way

http://davidventuri.com/portfolio

The author certainly knows how to make a nice website ;). Again, projects are represented by a title, a short comment (here the caption also includes the technologies used) and an image that redirects to the extended project description or a website presenting the results.

5. Projects – storytelling

http://dsal1951-portfolio-v1.businesscatalyst.com/portfolio.html

I really like this portfolio, as it is a true ‘storytelling’ portfolio. When you enter a project, a story about the data and various approaches to analysing it is presented.

As you can see, each Data Scientist has a different way of showing their expertise. Which is best? Hard to say – it depends on what you want to do with data science and what kind of company you want to work in.

matplotlib #1

matplotlib is one of the best data visualization libraries for Python (or does someone disagree?). It’s quite easy to use and the plots it produces are really pretty 🙂

In matplotlib #1 I will focus on the basics of the matplotlib library and show an example of using matplotlib.pyplot (a scatter plot). First of all, you need to install matplotlib on your machine, as it is not included in most Python distributions. The easiest way is to use apt-get (Ubuntu) or pip (Ubuntu/Windows). Please refer to the installation guide (https://matplotlib.org/users/installing.html). If you have any problems installing, check stackoverflow for possible solutions. The most common errors are due to missing dependencies (e.g. pkg-config/libpng-dev/libfreetype6-dev).
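
A quick way to check that the installation worked is simply to import the library and print its version (a trivial sketch; matplotlib.__version__ is the standard version attribute):

# install e.g. with: pip install matplotlib  (or: sudo apt-get install python3-matplotlib)
import matplotlib
print(matplotlib.__version__)  # if this prints a version number, you are good to go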

OK, if you have matplotlib installed, we can do some magic ;).

First of all, import matplotlib; for plots you will usually use:

import matplotlib.pyplot as plt

For this example, we will use an example dataset that comes with scikit-learn (actually it is the classic iris dataset, also known from R):

from sklearn import datasets
iris = datasets.load_iris()

To prepare this data as a DataFrame, we will do some pandas-based manipulation:

import pandas as pd  # needed for the DataFrame

frame = pd.DataFrame(iris.data, columns=iris.feature_names)
frame['type'] = iris.target

So now we have a pandas DataFrame with 4 columns of data and a 5th column telling us which plant each row describes (0 – ‘setosa’, 1 – ‘versicolor’, 2 – ‘virginica’). For simply testing the library we could also use random data generators (e.g. numpy.random).
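
Just as an aside, a minimal random-data test could look like this (plain numpy, nothing iris-specific; my own sketch):

import numpy as np
import matplotlib.pyplot as plt

x = np.random.rand(100)  # 100 random values between 0 and 1
y = np.random.rand(100)
plt.scatter(x, y)
plt.show()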

To check how our DataFrame looks, we can look at its ‘head’:

frame.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  type
0                5.1               3.5                1.4               0.2     0
1                4.9               3.0                1.4               0.2     0
2                4.7               3.2                1.3               0.2     0
3                4.6               3.1                1.5               0.2     0
4                5.0               3.6                1.4               0.2     0

So we can proceed to making our first plot. A line plot (the easiest one) doesn’t fit this data, so we will use a scatter plot for all of the plants together, showing the relation between sepal length and petal length.

plt.scatter(frame['sepal length (cm)'], frame['petal length (cm)'])

plt.scatter() creates a scatter plot; the first two arguments are the x values and the y values, respectively.

To actually see the plot we need to write one more line:

plt.show()

[Figure: scatterplot1]

Looks nice, but we really don’t know what the units are or what the x and y values represent. Also, you can see that part of the results is separated from the rest. In the next step we will color the dots to see whether that separate group corresponds to one of the three species. But first things first.

[Note: if you want to clear the figure you can always do:

plt.clf()

and start over]

To add x and y labels with units, you can do:

plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')

[Figure: scatterplot2]

OK, so now we would like to add some color to see if the separated dots are specific to one species. We have a column that numerically says which plant each result comes from. We can color the dots using the ‘c’ argument of plt.scatter() (we have to provide an array of colors with the same length as the number of dots). In matplotlib we can define colors in different ways, e.g. with simple color names like ‘red’, ‘blue’ and ‘green’.

colors = frame['type'].replace(0, 'red').replace(1, 'blue').replace(2, 'green')
plot1 = plt.scatter(frame['sepal length (cm)'], frame['petal length (cm)'], c = colors)

[Figure: scatterplot3]

So now our plot gives some information. But we need a legend, because I don’t think any of you remembers which color is which species.

To do it nicely, it is better to split the data by ‘type’ and create a distinct scatter plot for each species (on the same figure).

plot1 = plt.scatter(frame[frame['type'] == 0]['sepal length (cm)'], frame[frame['type'] == 0]['petal length (cm)'], c = 'red')
plot2 = plt.scatter(frame[frame['type'] == 1]['sepal length (cm)'], frame[frame['type'] == 1]['petal length (cm)'], c = 'blue')
plot3 = plt.scatter(frame[frame['type'] == 2]['sepal length (cm)'], frame[frame['type'] == 2]['petal length (cm)'], c = 'green')
plt.legend([plot1, plot2, plot3], ['setosa', 'versicolor', 'virginica'])

And lastly, we can add a title to our plot:

plt.title('Iris sepal/petal length')

Overall, our code now looks like this:

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
frame = pd.DataFrame(iris.data, columns = iris.feature_names)
frame['type'] = iris.target
plot1 = plt.scatter(frame[frame['type'] == 0]['sepal length (cm)'], frame[frame['type'] == 0]['petal length (cm)'], c = 'red')
plot2 = plt.scatter(frame[frame['type'] == 1]['sepal length (cm)'], frame[frame['type'] == 1]['petal length (cm)'], c = 'blue')
plot3 = plt.scatter(frame[frame['type'] == 2]['sepal length (cm)'], frame[frame['type'] == 2]['petal length (cm)'], c = 'green')
plt.xlabel('sepal length (cm)')
plt.ylabel('petal length (cm)')
plt.legend([plot1, plot2, plot3], ['setosa', 'versicolor', 'virginica'])
plt.title('Iris sepal/petal length')
plt.show()

[Figure: scatterplot4]

I hope everything was understandable. Have fun with matplotlib, and if you have any questions or suggestions for what to cover in matplotlib #2 – write in the comments!

Sending e-mails from web app

I decided that part of the results created by the project will be delivered to users by email. I have my site running on CherryPy (Python 3) on Ubuntu 16.04 (until recently I used 14.04 and everything worked on both systems), hosted at DigitalOcean.

What I wanted was to send emails from a specific email address (using my domain name), ensure that the emails won’t end up in the spam folder, and redirect all emails sent to any address on my domain to one specific mailbox.

The easiest solution was Postfix.

The first attempt resulted in successful email sending – however, directly to the spam folder. To solve this issue, there are some free online tools that check what is wrong with the emails you send. Just to name some:

https://www.mail-tester.com/

http://isnotspam.com/

Actually, I liked the latter better. The first one restricts the number of free uses per day, and despite the less user-friendly interface it was easier for me to draw conclusions from the latter. The most important issues were:

  1. HELO greeting
  2. SPF
  3. DKIM

HELO was the easiest, as it is set in the Postfix configuration.

SPF needs to be changed on the DNS side (for me that was in the DigitalOcean panel).

For DKIM I used OpenDKIM (and the other Perl libraries needed for OpenDKIM to function properly).

And that did the trick – I can send emails and receive them as well.

Inside the app (CherryPy) I use smtplib for sending emails and email.mime for formatting the emails and attachments (there are some differences in MIME handling between Python 2 and Python 3).
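
Just to illustrate, a minimal Python 3 sketch (with placeholder addresses and the local Postfix assumed to listen on localhost – not the exact code from my app) could look like this:

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

msg = MIMEMultipart()
msg['From'] = 'results@example.com'  # placeholder sender on your domain
msg['To'] = 'user@example.com'       # placeholder recipient
msg['Subject'] = 'Your results'
msg.attach(MIMEText('Please find your results attached.', 'plain'))

# local Postfix on localhost, so no authentication is needed here
with smtplib.SMTP('localhost') as server:
    server.send_message(msg)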

And it’s up and running :).

 

 

Biopython 1.69 released

Great news for all bioinformaticians out there – a new Biopython version has been released.

What is new?

First of all, support for Python 2.6 was dropped for further development, but many other versions are supported (Python 2.7, 3.3, 3.4, 3.5, 3.6, PyPy v5.7, PyPy3.5 v5.7 beta, and Jython 2.7).

Many changes have been made, improving performance and functionality and broadening compatibility with biological databases:

  • Bio.AlignIO supports the MAF format,
  • Bio.SearchIO.AbiIO supports FSA files,
  • the UniProt parser parses “submittedName” in XML files,
  • the NEXUS parser was improved to work better with tools like the BEAST TreeAnnotator,
  • a new parser was introduced for the ExPASy Cellosaurus,
  • the Bio.Seq module now has a complement function (see the short sketch after this list),
  • the SeqFeature object’s qualifiers attribute is now an explicitly ordered dictionary,
  • the Bio.SeqIO UniProt-XML parser was updated to cope with features with unknown locations, which can be found in mass spec data,
  • the Bio.SeqIO GenBank, EMBL, and IMGT parsers were updated,
  • the Bio.Affy package supports CEL format version 4,
  • the Bio.Restriction enzyme list has been updated,
  • Bio.PDB.PDBList can now download PDBx/mmCIF files (the new default).
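
For example, based on the release notes, the new module-level complement function should work directly on a plain string (a minimal sketch; the Seq object method shown for comparison has been around for much longer):

from Bio.Seq import Seq, complement

print(complement("ATGCTTGGA"))        # new in 1.69: complement of a plain string
print(Seq("ATGCTTGGA").complement())  # the Seq object method, as before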

What I really like is that they follow “Python PEP8, PEP257 and best practice standard coding style”. A great example for others ;).

You can check out all contributors and details here:

https://news.open-bio.org/2017/04/07/biopython-1-69-released/

Maybe some day you (or I) will be on this list?

And… have fun with the new Biopython 🙂