Get noticed – wrapping up

So the contest I was participating in – Get noticed – has ended. It prompted me to start this blog, and I think I will continue to write about bioinformatics, data science, Python and so on (though maybe one post per week will be easier to manage).

I am really happy that I lasted and that I am among the 183 finalists (out of over 1000 contestants).

I had a really great adventure during this contest. I didn’t finish my project but the end of the contest does not mean the end of the project.

I even found the courage to apply for a job in data science and I got it (it will be my third job, but I hope I’ll manage).

In June I plan to launch a real web server with the modules I managed to develop during the contest. So stay tuned.


Testing modules within cherrypy

I was wondering how unit tests could be set up within cherrypy to test not cherrypy itself but the modules it serves. I wanted the tests to run every time new changes were introduced, that is, whenever the server reloads.

There is some information about unit testing in cherrypy; however, none of it directly addressed my issue.

I decided to prepare a single file named unittests.py (remember not to name your file unittest.py, as it would shadow the standard library module and break its import) containing the unit tests and the TextTestRunner setup. Unlike the standard unittest.main(), TextTestRunner does not interrupt the reload of the server.

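import unittest
from module import function1, function2

class MyTestClass(unittest.TestCase):

    def test_search(self):
        # input1/output1 and input2/output2 are placeholders for your own data
        self.assertEqual(function1(input1), output1)
        self.assertEqual(function2(input2), output2)

def start():
    # collect the tests and run them without exiting the interpreter
    suite = unittest.TestLoader().loadTestsFromTestCase(MyTestClass)
    runner = unittest.TextTestRunner()
    runner.run(suite)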

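and in the main cherrypy file, besides the import, I added a single line (unittests.start()):

import cherrypy
import unittests

if __name__ == "__main__":
    unittests.start()  # run the tests before the server starts
    # Page is the application's root class defined in this file
    cherrypy.quickstart(Page(), config="config.conf")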

Every time the server reloaded, the tests were run and I could check whether everything was still correct.
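As a self-contained sketch of the same idea (the function add and the test class AddTest are hypothetical stand-ins for your own modules), the snippet below runs a test case through TextTestRunner and returns the result instead of exiting the interpreter, which is why it is safe to call during server start-up:

```python
import io
import unittest


def add(a, b):
    # Hypothetical stand-in for a function from one of your own modules.
    return a + b


class AddTest(unittest.TestCase):
    def test_add(self):
        self.assertEqual(add(2, 3), 5)


def start(stream=None):
    """Run the tests and return the result without calling sys.exit()."""
    suite = unittest.TestLoader().loadTestsFromTestCase(AddTest)
    runner = unittest.TextTestRunner(stream=stream, verbosity=2)
    return runner.run(suite)


# Capture the report in memory; pass stream=None to print to stderr instead.
report = io.StringIO()
result = start(stream=report)
```

The returned result object can be inspected (for example with result.wasSuccessful()) if you want to log failures or refuse to start the server when a test breaks.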

Python in Big Data #1 – Hadoop & Snakebite

One of the terms mentioned most often when searching for information about Big Data is Hadoop. What exactly is Hadoop? And what are the first steps to handle Hadoop with Python?

Hadoop is built around a distributed file system (HDFS) that enables scalable data handling. Basically, Hadoop was developed to store large amounts of data while providing reliability and scalability. It is based on a system of data blocks. HDFS relies on two kinds of processes: a NameNode that keeps the metadata and DataNodes that store the blocks of data. Blocks are replicated on different machines so that the data survive when one machine crashes.
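To make the block and replication idea more concrete, here is a toy model in plain Python. It is not how HDFS actually chooses DataNodes (the real placement policy is more involved); the round-robin placement, the node names, the 128 MB block size and the replication factor of 3 are all assumptions made for illustration.

```python
def split_into_blocks(file_size, block_size):
    """Return the number of blocks needed to store file_size bytes."""
    return (file_size + block_size - 1) // block_size


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + i) % len(datanodes)] for i in range(replication)
        ]
    return placement


def surviving_copies(placement, crashed_node):
    """For every block, list the replicas left after one node crashes."""
    return {
        block: [node for node in nodes if node != crashed_node]
        for block, nodes in placement.items()
    }


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)  # a 300 MB file in 128 MB blocks
placement = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
after_crash = surviving_copies(placement, "node1")
```

Even with node1 gone, every block in after_crash still has at least two copies, which is the point of replication.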

There is an increasing number of Python libraries for working with Hadoop. Just to name some: Snakebite, mrjob, PySpark.

Snakebite enables basic file operations on HDFS and access to Hadoop from Python applications. It can be installed easily:

pip install snakebite

Once it is installed, you can connect to the HDFS NameNode from a Python application with:

from snakebite.client import Client
client = Client('localhost', 8000)

Then you can use various methods similar to shell commands, such as ls, mkdir and delete, to handle files and directories within HDFS. Another group of methods (e.g. copyToLocal) enables retrieving data from HDFS.

from snakebite.client import Client
client = Client('localhost', 8000)
# copyToLocal yields one result per copied file
for result in client.copyToLocal(['/hdfs_1/file1.txt'], '/tmp'):
    print(result)

Snakebite also provides a CLI client.

Some additional information can be found in a free book:

Hadoop with Python by Zachary Radtka and Donald Miner

How Python connects you with biological databases? #2 – PubMed

PubMed is the biggest database of scientific papers in the biological sciences. There are other databases that gather information about all scientific papers (e.g. Google Scholar, Scopus); however, in the biological sciences the most commonly used one is still PubMed from NCBI.

It is quite easy to access PubMed through its API (Entrez); however, there is an even easier way, using BioPython, which I recommend.

To use it you need BioPython installed on your computer; then import Entrez from BioPython:

from Bio import Entrez

Simple examples are available in the documentation:

Entrez.email = ""  # put your email address here
pmid = "19304878"
handle = Entrez.elink(dbfrom="pubmed", id=pmid, linkname="pubmed_pubmed")
record = Entrez.read(handle)

More specific instructions are in the Tutorial (concerning both PubMed and MedLine):

Entrez.email = ""  # put your email address here
handle = Entrez.esearch(db="pubmed", term="orchid", retmax=463)
record = Entrez.read(handle)
idlist = record["IdList"]

To get detailed information about the articles found, you can use Entrez.efetch with the ids of the articles.

handle = Entrez.efetch(db='pubmed', retmode='xml', id=idlist)
results = Entrez.read(handle)

You can then handle the results as dictionaries in Python.
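As a rough sketch of what handling the results as dictionaries can look like, the helper below collects PMIDs and titles. The nested layout (PubmedArticle, MedlineCitation, Article, ArticleTitle) is an assumption about what the parsed efetch result looks like and may differ between BioPython versions; the sample record is hand-written and not real data, so the snippet runs without a network connection.

```python
def article_titles(results):
    """Collect (PMID, title) pairs from a parsed PubMed efetch result."""
    titles = []
    for article in results["PubmedArticle"]:
        citation = article["MedlineCitation"]
        pmid = str(citation["PMID"])
        title = str(citation["Article"]["ArticleTitle"])
        titles.append((pmid, title))
    return titles


# Hand-written stand-in for the parsed result (assumed layout, made-up values).
sample = {
    "PubmedArticle": [
        {
            "MedlineCitation": {
                "PMID": "00000001",
                "Article": {"ArticleTitle": "A made-up title"},
            }
        },
    ]
}
pairs = article_titles(sample)
```

With real data you would pass the object obtained from parsing the efetch handle instead of sample.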

BioPython really made it easier to use many databases.

Python for kids?

Python is a very friendly programming language for adults to start with as a first programming experience. However, can it be used to teach younger people? Of course. But I would recommend it as a second step, after they understand basic algorithms (e.g. with Scratch or ScratchJr). Python is a great start to the programming journey for teens, who know what sentence structure is and can do some mathematics.

Why is Python good to start with?

  • Compared to other widely used programming languages, it has quite easy syntax. You experience this from your first line of clean, working code, which encourages you to write more.
  • It is a high-level programming language, so you really don’t need much code to see an effect, which is also encouraging.
  • There is a lot of information on the Web about Python: how to start, what to do when something fails. It is very important to have support, the feeling that you have someone to turn to for help at any time.
  • If you want to sell something to kids, it must feel interesting to them. Well… you can make simple games with Python (and it’s not very hard), you can quite easily prepare a website and, last but not least, coding on a Raspberry Pi may also be an argument.
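To give a feeling for how little code a simple game needs, here is a sketch of a guess-the-number game. The function names and the prepared list of guesses are made up for this example; the guesses are passed in as a list so the snippet runs on its own without waiting for keyboard input.

```python
def check_guess(secret, guess):
    """Tell the player whether the guess is too low, too high or correct."""
    if guess < secret:
        return "too low"
    if guess > secret:
        return "too high"
    return "correct"


def play(secret, guesses):
    """Play one round with a prepared list of guesses; return the number of tries."""
    for tries, guess in enumerate(guesses, start=1):
        answer = check_guess(secret, guess)
        print(guess, "is", answer)
        if answer == "correct":
            return tries
    return None  # the player ran out of guesses


tries = play(7, [5, 9, 7])
```

In a real game you would pick the secret with random.randint and read each guess with input(), which is a nice next exercise for a young programmer.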

There are also a couple of books concerning Python for kids. Just to name some:

  1. Python For Kids For Dummies by Brendan Scott
  2. Python for Kids: A Playful Introduction to Programming by Jason R. Briggs
  3. Python Projects for Kids by Jessica Ingrassellino

How Python connects you with biological databases? #1 – Uniprot

In bioinformatics, the ability to automatically use information gathered in numerous biological databases is crucial. Some databases are really easy to use and have great wrappers, some have very basic wrappers and some have none. There is a great movement to provide easy access to all biological databases and tools, but we still have a lot to do.

One of the first databases I came across while programming in Python was Uniprot. Uniprot is not so easy to use through its web page if you don’t really know what you are looking for. That is common for biological data: the data is so diverse that it is impossible to avoid redundancy and complexity. However, after some time, it gets easier.

Let’s look at the example page of the human GAPDH protein. You can see that the data is categorized, which really makes your life easier. You can view this page e.g. as XML (so you can extract the part you’re interested in) or as text (each line starts with a two-letter code saying what the line contains, so it can also be extracted with, e.g., regular expressions). Multiple approaches have been proposed to extract the information you need (be careful, as some of the solutions may work only in Python 2 or only in Python 3):

  1. requests
    import requests
    from Bio import SeqIO
    try:
        from StringIO import StringIO  # Python 2
    except ImportError:
        from io import StringIO  # Python 3
    params = {"query": "GO:0070337", "format": "fasta"}
    # the query URL is missing from this post and has to be filled in
    response = requests.get("", params=params)
    for record in SeqIO.parse(StringIO(response.text), "fasta"):
        # Do what you need here with your sequences.
        print(record.id)
  2. uniprot tools (I like this way; combined with regular expressions, it lets you extract exactly the information you need)
    import uniprot as uni
    print(uni.map('P31749', f='ACC', t='P_ENTREZGENEID')) # map single id
    print(uni.map(['P31749','Q16204'], f='ACC', t='P_ENTREZGENEID')) # map list of ids
    print(uni.retrieve('P31749'))
    print(uni.retrieve(['P31749','Q16204']))
  3. swissprot (the SwissProt parser from BioPython)
    #!/usr/bin/env python
    """Fetch uniprot entries for given go terms"""
    import sys
    from Bio import SwissProt
    #load go terms
    gos = set(sys.argv[1:])
    sys.stderr.write("Looking for %s GO term(s): %s\n" % (len(gos)," ".join(gos)))
    #parse swissprot dump
    k = 0
    for i,r in enumerate(SwissProt.parse(sys.stdin)):
        sys.stderr.write(" %9i\r"%(i+1,))
        #parse cross_references
        for ex_db_data in r.cross_references:
            #print ex_db_data
            extdb,extid = ex_db_data[:2]
            if extdb=="GO" and extid in gos:
                k += 1
                sys.stdout.write( ">%s %s\n%s\n" % (r.accessions[0], extid, r.sequence) )
    sys.stderr.write("Reported %s entries\n" % k)
  4. bioservices (an interesting project to look at, as it aims to include wrappers for all important biological databases)
    from bioservices import UniProt
    u = UniProt(verbose=False)
    u.mapping("ACC", "KEGG_ID", query='P43403')
    # returns: defaultdict(<type 'list'>, {'P43403': ['hsa:7535']})
    res = u.search("P43403")
    # Returns sequence on the ZAP70_HUMAN accession Id
    sequence = u.search("ZAP70_HUMAN", columns="sequence")
  5. urllib

This approach is proposed on the Uniprot website, for example (Python 2):

import urllib, urllib2

url = ''  # the service URL is missing from this post

params = {
'query':'P13368 P20806 Q9UM73 P97793 Q17192'
}

data = urllib.urlencode(params)
request = urllib2.Request(url, data)
contact = "" # Please set your email address here to help us debug in case of problems.
request.add_header('User-Agent', 'Python %s' % contact)
response = urllib2.urlopen(request)
page = response.read()
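Coming back to the text format described earlier, where each line starts with a two-letter code, a small regular expression is enough to pick an entry apart. This is only a sketch: the sample lines are hand-written in the style of that format rather than downloaded, and the helper name lines_by_code is made up for this example.

```python
import re

# Hand-written sample in the style of the Uniprot text format (illustrative only).
SAMPLE = """\
ID   G3P_HUMAN               Reviewed;         335 AA.
AC   P04406;
GN   Name=GAPDH;
"""

# Two capital letters, three spaces, then the content of the line.
LINE = re.compile(r"^([A-Z]{2}) {3}(.*)$")


def lines_by_code(text):
    """Group the content of each line under its two-letter code."""
    grouped = {}
    for raw in text.splitlines():
        match = LINE.match(raw)
        if match:
            code, content = match.groups()
            grouped.setdefault(code, []).append(content.strip())
    return grouped


entry = lines_by_code(SAMPLE)
# Accession numbers are separated by semicolons on the AC lines.
accessions = [
    acc.strip()
    for line in entry.get("AC", [])
    for acc in line.split(";")
    if acc.strip()
]
# The gene name follows "Name=" on the GN line.
gene = re.search(r"Name=([^;]+);", entry["GN"][0]).group(1)
```

The same function works on a page retrieved with any of the approaches above, as long as you request the text format.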


You can check out some general info about providing access to biological databases and tools here:

Programming for kids #2 – ScratchJr

ScratchJr is Scratch for younger kids. You don’t even need to be able to read or write to use it, but it still teaches the basics of algorithms and lets you create simple scenes and games.

All you need is an Android tablet or an iPad and you’re good to go (the app is free on both systems). Like Scratch, it was co-created at one of the best universities in the world, MIT (Massachusetts Institute of Technology).

As in Scratch, children can draw their own characters and backgrounds and make them move (jumping, dancing, etc.). Simple loops and conditional actions are also possible. What children really like is the possibility of recording their own voices.

If you don’t have a tablet, you can also try an emulator on a computer.

You can find more information here:

There are also lots of ideas on YouTube:

Have fun!