Working with Google Analytics data from the Python API

To analyse data from your website (users, sessions, page views), many would choose Google Analytics, or GA (https://analytics.google.com/). To obtain the data you can go to the GA website, choose the metrics and time frame you need, and download the data in csv or xlsx format to work with. However, there is also an API to import data from GA directly into your data-analysis script (e.g. within a Jupyter notebook), so why not explore it instead?

How do you connect to the GA API and use its data? Great step-by-step instructions are available on Google's API quickstart pages (https://developers.google.com/sheets/api/quickstart/python), and here I will just combine all the needed info in one place.

First of all:

  1. Go to this page: https://console.developers.google.com/flows/enableapi?apiid=sheets.googleapis.com to create a project and (automatically) turn on the API. Click Continue, then click Go to credentials.
  2. On the Add credentials to your project page, click the Cancel button.
  3. At the top of the page, select the OAuth consent screen tab. Select an Email address, enter a Product name if not already set, and click the Save button.
  4. Select the Credentials tab, click the Create credentials button and select OAuth client ID.
  5. Select the application type Other, enter a name and click the Create button.
  6. Dismiss the resulting dialog.
  7. Click the file_download (Download JSON) button to the right of the client ID.
  8. Move this file to your working directory and rename it to client_secrets.json.
  9. On your computer, install google-api-python-client:
    pip install --upgrade google-api-python-client

    or

    sudo easy_install --upgrade google-api-python-client

and you’re good to go.
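
To quickly check that the installation worked, you can try importing the package (a minimal sanity check):

import googleapiclient
print(googleapiclient.__version__)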

Now, here is how to connect to GA and send your first request:

(the example is based on https://github.com/EchoFUN/GAreader/blob/master/hello_analytics_api_v3.py)

from googleapiclient.errors import HttpError
from googleapiclient import sample_tools
from oauth2client.client import AccessTokenRefreshError

def ga_request(input_dict):
    # Authorise and build the Analytics v3 service object;
    # sample_tools.init reads client_secrets.json from the working directory.
    service, flags = sample_tools.init(
        [], 'analytics', 'v3', __doc__, __file__,
        scope='https://www.googleapis.com/auth/analytics.readonly')

    try:
        first_profile_id = get_first_profile_id(service)
        if not first_profile_id:
            print('Could not find a valid profile for this user.')
        else:
            results = get_top_keywords(service, first_profile_id, input_dict)
            return results

    except TypeError as error:
        print('There was an error in constructing your query : %s' % error)

    except HttpError as error:
        print('Arg, there was an API error : %s : %s' %
              (error.resp.status, error._get_reason()))

    except AccessTokenRefreshError:
        print('The credentials have been revoked or expired, please re-run '
              'the application to re-authorize')

def get_first_profile_id(service):
    # Walk from accounts to web properties to profiles (views)
    # and return the id of the first profile found.
    accounts = service.management().accounts().list().execute()
    if accounts.get('items'):
        firstAccountId = accounts.get('items')[0].get('id')
        webproperties = service.management().webproperties().list(
            accountId=firstAccountId).execute()

        if webproperties.get('items'):
            firstWebpropertyId = webproperties.get('items')[0].get('id')
            profiles = service.management().profiles().list(
                accountId=firstAccountId,
                webPropertyId=firstWebpropertyId).execute()

            if profiles.get('items'):
                return profiles.get('items')[0].get('id')

    return None

def get_top_keywords(service, profile_id, input_dict):
    # Build and execute the query; include the filters argument only if it is set.
    if input_dict['filters'] == '':
        return service.data().ga().get(
            ids=input_dict['ids'],
            start_date=input_dict['start_date'],
            end_date=input_dict['end_date'],
            metrics=input_dict['metrics'],
            dimensions=input_dict['dimensions']).execute()
    return service.data().ga().get(
        ids=input_dict['ids'],
        start_date=input_dict['start_date'],
        end_date=input_dict['end_date'],
        metrics=input_dict['metrics'],
        filters=input_dict['filters'],
        dimensions=input_dict['dimensions']).execute()

Save this file as ga_api_example.py

You need to remember that in the API, metrics and other feature names may differ from those on the website; e.g. custom dimensions are just called ga:dimension1, ga:dimension2, etc.

And now, after importing the function from your file

from ga_api_example import ga_request

you can prepare a request:

request = {
    "ids": "ga:<your_id>",
    "start_date": "2017-06-25",
    "end_date": "2017-06-25",
    "metrics": "ga:pageviews",
    "filters": "ga:dimension1=~yes",
    "dimensions": ""
}

The =~ operator means 'matches a regular expression', just as in the filters of reports on the GA website.

and execute it:

data = ga_request(request)

Now you have the data in your Python script, where you can work with it, e.g. with pandas.
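
For example, here is a minimal sketch of turning the response into a pandas DataFrame (assuming the v3 response layout, where the returned dictionary has columnHeaders and rows keys):

import pandas as pd

# 'columnHeaders' describes the columns, 'rows' holds the values
# as lists of strings (one list per row)
columns = [header['name'] for header in data.get('columnHeaders', [])]
df = pd.DataFrame(data.get('rows', []), columns=columns)
print(df.head())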

Seaborn library for pretty plots

Seaborn is a visualization library based on matplotlib (and complementary to it; you should really understand matplotlib first). It basically makes your work easier and prettier. The library is not really complicated or broad, but it does some things for you that you would otherwise have to do in matplotlib on your own.

Seaborn works very well with the libraries used in Python for data analysis (pandas, numpy, scipy, statsmodels) and can easily be used in a Jupyter notebook for displaying plots. The most frequently mentioned advantage of seaborn is its built-in themes. Well… they help people who don’t know how to combine different colors to make plots aesthetic. The second thing is that the functions produce really nice plots that try to show something useful even when called with a minimal set of arguments. However, as with matplotlib, you have an almost endless number of possibilities to adjust your plots.
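
For instance, applying the built-in default theme is a single call before plotting (a minimal sketch; sns.set() restyles all subsequent matplotlib plots):

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # apply seaborn's default theme on top of matplotlib
plt.plot(np.cumsum(np.random.randn(100)))  # any ordinary matplotlib plot
plt.show()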

Additional information about seaborn can be found here: https://seaborn.pydata.org/

Installation is pretty easy:

pip install seaborn

or

conda install seaborn

Additional info about installation (including the development version of seaborn) can be found here: http://seaborn.pydata.org/installing.html

They also did a really nice job when it comes to documentation and tutorials (e.g. https://seaborn.pydata.org/tutorial.html)

My favorite thing about seaborn? I would say the seaborn.distplot function. I usually do two visualizations to look at data before working with it: a scatter plot and a distribution. A scatter plot is probably easy to obtain in any visualization library you work with, as it is in matplotlib. However, to see the distribution along with a KDE plot, I recommend the seaborn function distplot.

Here are some examples: https://seaborn.pydata.org/generated/seaborn.distplot.html?highlight=distplot#seaborn.distplot

Basically, all you need to do is have your data in an array, e.g. as a pandas DataFrame column:

import seaborn as sns
ax = sns.distplot(df['column name'])  # histogram plus a KDE overlay

To try it out with example data, you can plot normally distributed random numbers:

import seaborn as sns, numpy as np
x = np.random.randn(100)
ax = sns.distplot(x)

Testing modules within cherrypy

I was wondering how unit tests might be implemented within cherrypy to test not cherrypy itself but the modules included. I wanted the tests to be run every time new changes were introduced, that is, whenever the server reloads.

There is some information about unit testing in cherrypy (e.g. https://stackoverflow.com/questions/14260101/unittesting-cherrypy-webapp); however, none of it directly addressed my issue.

I decided to prepare a single file named unittests.py (remember not to name your file unittest.py, as there will be a problem with the library import) with the unit tests and a TextTestRunner setup. TextTestRunner does not interrupt the reload of the server the way the standard unittest.main() does.

import unittest
from module import function1, function2  # your own module and functions

class MyTestClass(unittest.TestCase):

    def test_search(self):
        # input1/output1 etc. are placeholders for your own test data
        self.assertEqual(function1(input1), output1)
        self.assertEqual(function2(input2), output2)

def start():
    # Run the suite with TextTestRunner, which (unlike unittest.main())
    # does not call sys.exit() and so does not interrupt the server reload.
    suite = unittest.TestSuite()
    suite.addTest(MyTestClass("test_search"))
    runner = unittest.TextTestRunner()
    runner.run(suite)

and in the main cherrypy file, besides the import, I added a single line (unittests.start()):

import cherrypy
import unittests

if __name__ == "__main__":
    cherrypy.config.update("config.conf")
    unittests.start()  # run the tests on every (re)start of the server
    cherrypy.quickstart(Page(), config="config.conf")  # Page is the app's root class

Every time the server reloaded, the tests were run and I could check whether everything was correct.

Python in Big Data #1 – Hadoop & Snakebite

One of the terms most often mentioned when searching for Big Data information is Hadoop. What exactly is Hadoop? And what are the first steps to handle Hadoop with Python?

Hadoop is built around a distributed file system (HDFS) that enables scalable data handling. Basically, Hadoop was developed to store large amounts of data while providing reliability and scalability. It is based on a block storage system. HDFS relies on two kinds of processes: a NameNode, which keeps the metadata, and DataNodes, which store the blocks of data. Blocks of data are replicated on different machines to keep the data safe when one machine crashes.

There is an increasing number of Python libraries that enable working with Hadoop. Just to name some: Snakebite, mrjob, PySpark.

Snakebite enables basic file operations on Hadoop and accessing Hadoop from Python applications. It can be installed easily:

pip install snakebite

Once installed, you can connect to the HDFS NameNode from within your Python application:

from snakebite.client import Client
client = Client('localhost', 8000)  # NameNode host and RPC port

Then you can use various methods similar to shell commands (ls, mkdir, delete) to handle files and directories within Hadoop; another group of methods (e.g. copyToLocal, shown a bit further below) enables retrieving data from HDFS.
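
For example, a minimal sketch of listing the HDFS root directory (assuming the same connection as above; snakebite methods return generators, so you iterate over the results):

from snakebite.client import Client

client = Client('localhost', 8000)
# ls() takes a list of paths and yields a metadata dict per entry
for entry in client.ls(['/']):
    print(entry['path'])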

from snakebite.client import Client
client = Client('localhost', 8000)
# copyToLocal() retrieves files from HDFS to a local directory,
# again yielding a result dict per copied file
for f in client.copyToLocal(['/hdfs_1/file1.txt'], '/tmp'):
    print(f)

Snakebite also provides a CLI client.

Some additional information can be found in a free book:

Hadoop with Python by Zachary Radtka and Donald Miner

http://www.oreilly.com/programming/free/hadoop-with-python.csp

How does Python connect you with biological databases? #2 – PubMed

PubMed is the biggest database of scientific papers in biology. There are other databases that gather information about scientific papers across all fields (e.g. Google Scholar, Scopus); however, in the biological sciences the most commonly used is still PubMed from NCBI.

There is a fairly easy way to access PubMed through its API (Entrez); however, there is an even easier way using BioPython, which I recommend.

To use it, you need BioPython installed on your computer; then import Entrez from BioPython:

from Bio import Entrez

Simple examples are available in the documentation:

http://biopython.org/DIST/docs/api/Bio.Entrez-module.html

Entrez.email = "Your.Name.Here@example.org"
pmid = "19304878"
handle = Entrez.elink(dbfrom="pubmed", id=pmid, linkname="pubmed_pubmed")
record = Entrez.read(handle)
handle.close()

More specific instructions are in the BioPython Tutorial (covering both PubMed and Medline):

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc129

Entrez.email = "example@example.com"
handle = Entrez.esearch(db="pubmed", term="orchid", retmax=463)
record = Entrez.read(handle)
idlist = record["IdList"]
handle.close()

For specific information about the found articles, you can use Entrez.efetch with the ids of the articles.

handle = Entrez.efetch(db='pubmed', retmode='xml', id=idlist)
results = Entrez.read(handle)
handle.close()

Then you can handle the results as dictionaries in Python.
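
For instance, a minimal sketch of pulling the article titles out of the parsed records (assuming the usual layout of PubMed XML as parsed by Entrez.read, where each entry in PubmedArticle contains a MedlineCitation):

for article in results['PubmedArticle']:
    # drill down to the title through the MedlineCitation record
    citation = article['MedlineCitation']
    print(citation['Article']['ArticleTitle'])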

BioPython really made it easier to use many databases.

How does Python connect you with biological databases? #1 – Uniprot

In bioinformatics, the possibility to automatically use information gathered in numerous biological databases is crucial. Some databases are really easy to use and their wrappers are great, some have very basic wrappers, and some have none. There is a great movement to provide easy access to all biological databases and tools, but there is still a lot to do.

One of the first databases I came across during Python programming was Uniprot. Uniprot (http://www.uniprot.org/) is not so easy to use through its web page if you don’t really know what you are looking for. That’s a common thing for biological data: the data is so diverse that it is impossible to avoid redundancy and complexity. However, after some time, it gets easier.

Let’s look at the example page of the human GAPDH protein (http://www.uniprot.org/uniprot/P04406). You can see that the data is categorized, and that really makes your life easier. You can view this page, e.g., as xml (so you can extract the part you’re interested in) or as text (each line starts with a two-letter code saying what is in that line, so information can also be extracted with, e.g., regular expressions). There are multiple approaches to extracting the information you need (be careful, as some of the solutions may work for Python 2 or Python 3 only):

  1. requests (example shown here: http://stackoverflow.com/questions/15514614/how-to-use-python-get-results-from-uniprot-automatically)
    import requests
    from Bio import SeqIO
    from io import StringIO  # Python 3 (on Python 2 use: from StringIO import StringIO)

    params = {"query": "GO:0070337", "format": "fasta"}
    response = requests.get("http://www.uniprot.org/uniprot/", params)

    for record in SeqIO.parse(StringIO(response.text), "fasta"):
        print(record.id)  # do what you need here with your sequences
  2. uniprot tools (I like this approach; combining it with regular expressions, you can extract the exact information you need; https://pypi.python.org/pypi/uniprot_tools/0.4.1) – note the Python 2 print syntax:
    import uniprot as uni
    print uni.map('P31749', f='ACC', t='P_ENTREZGENEID')  # map a single id
    print uni.map(['P31749','Q16204'], f='ACC', t='P_ENTREZGENEID')  # map a list of ids
    print uni.retrieve('P31749')
    print uni.retrieve(['P31749','Q16204'])
  3. swissprot (example shown at https://www.biostars.org/p/66904/)
    #!/usr/bin/env python
    """Fetch uniprot entries for given GO terms"""
    import sys
    from Bio import SwissProt
    # load GO terms from the command line
    gos = set(sys.argv[1:])
    sys.stderr.write("Looking for %s GO term(s): %s\n" % (len(gos), " ".join(gos)))
    # parse the swissprot dump from stdin
    k = 0
    sys.stderr.write("Parsing...\n")
    for i, r in enumerate(SwissProt.parse(sys.stdin)):
        sys.stderr.write(" %9i\r" % (i+1,))
        # scan the cross-references of each record for matching GO terms
        for ex_db_data in r.cross_references:
            extdb, extid = ex_db_data[:2]
            if extdb == "GO" and extid in gos:
                k += 1
                sys.stdout.write(">%s %s\n%s\n" % (r.accessions[0], extid, r.sequence))
    sys.stderr.write("Reported %s entries\n" % k)
  4. bioservices (https://pythonhosted.org/bioservices/references.html#bioservices.uniprot.UniProt) – this is an interesting service to look at, as they intend to include wrappers for all important biological databases
    from bioservices import UniProt
    u = UniProt(verbose=False)
    u.mapping("ACC", "KEGG_ID", query='P43403')
    # -> defaultdict(<type 'list'>, {'P43403': ['hsa:7535']})
    res = u.search("P43403")

    # returns the sequence for the ZAP70_HUMAN accession id
    sequence = u.search("ZAP70_HUMAN", columns="sequence")
  5. urllib

This approach is proposed on the Uniprot website; an example (in Python 2):

import urllib, urllib2

url = 'http://www.uniprot.org/uploadlists/'

params = {
    'from': 'ACC',
    'to': 'P_REFSEQ_AC',
    'format': 'tab',
    'query': 'P13368 P20806 Q9UM73 P97793 Q17192'
}

data = urllib.urlencode(params)
request = urllib2.Request(url, data)
contact = ""  # Please set your email address here to help us debug in case of problems.
request.add_header('User-Agent', 'Python %s' % contact)
response = urllib2.urlopen(request)
page = response.read(200000)
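
The same request in Python 3 would look roughly like this (a sketch using urllib.request and urllib.parse from the standard library, with the same parameters as above):

from urllib.parse import urlencode
from urllib.request import Request, urlopen

url = 'http://www.uniprot.org/uploadlists/'

params = {
    'from': 'ACC',
    'to': 'P_REFSEQ_AC',
    'format': 'tab',
    'query': 'P13368 P20806 Q9UM73 P97793 Q17192'
}

data = urlencode(params).encode('utf-8')  # the POST body must be bytes in Python 3
request = Request(url, data)
contact = ""  # set your email address here to help Uniprot debug problems
request.add_header('User-Agent', 'Python %s' % contact)
response = urlopen(request)
page = response.read(200000).decode('utf-8')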

You can check out some general info about providing access to biological databases and tools here:

https://pythonhosted.org/bioservices/

Python possibilities for RNA structure folding

There is quite a lot of different software available to biologists for the prediction of RNA secondary or tertiary structure. Here, I won’t discuss the different algorithmic approaches and when to use them, but I will check which of them are easily available for Python users to include in their own software (of course, for specific needs only one program may be suitable, and a different approach, e.g. via the shell, may be required to use it automatically). I also won’t discuss software for protein structure folding, because that is a whole new subject.

  1. Vienna Package (http://www.tbi.univie.ac.at/RNA/#self_packages) – I guess this is the most popular package used in Python for RNA folding (see the sketch after this list). Information from the site: “RNA secondary structure prediction through energy minimization is the most used function in the package. We provide three kinds of dynamic programming algorithms for structure prediction: the minimum free energy algorithm of (Zuker & Stiegler 1981) which yields a single optimal structure, the partition function algorithm of (McCaskill 1990) which calculates base pair probabilities in the thermodynamic ensemble, and the suboptimal folding algorithm of (Wuchty et.al 1999) which generates all suboptimal structures within a given energy range of the optimal energy. For secondary structure comparison, the package contains several measures of distance (dissimilarities) using either string alignment or tree-editing (Shapiro & Zhang 1990). Finally, we provide an algorithm to design sequences with a predefined structure (inverse folding).”
  2. Multifold (https://pypi.python.org/pypi/multifold) – According to the authors: “MultiFold is a Python based tool package for RNA footprinting data analysis.
    • It accepts RNA footprinting data generated from SHAPE-Seq, PARS and DMS-Seq technologies.
    • It can fold multiple secondary structures for each transcript based on user provided constraints.
    • It could quantify the abundance of each structure centroid using the RNA footprinting data.
    • It provides a series of commands for users to customize every procedure.
    • It could handle RNA footprinting data generated from gene isoforms or overlapped transcripts.”
  3. Forgi (www.tbi.univie.ac.at/~pkerp/forgi) – it is not exactly for folding but for manipulating folded structures, and for that use it is an excellent tool
  4. Barnacle (https://sourceforge.net/projects/barnacle-rna/) – I think it is not supported anymore.
  5. Frnakenstein (http://www.stats.ox.ac.uk/research/genome/software/frnakenstein) – actually uses Vienna Package
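
As mentioned in point 1, here is a minimal sketch of folding a sequence with the Vienna Package Python bindings (assuming the RNA module that ships with the package is installed):

import RNA  # Python bindings shipped with the Vienna Package

seq = "GCGCUUCGCCGCGCGCC"  # an arbitrary example sequence
structure, mfe = RNA.fold(seq)  # predict the minimum free energy structure
print(seq)
print("%s (%.2f kcal/mol)" % (structure, mfe))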

There are also other tools written in Python but not provided with a Python interface. Just to name one:

  1. modeRNA (http://genesilico.pl/moderna/)

Many short scripts are also available on GitHub and on private websites, but I would be careful with them:

  1. http://philipuren.com/python/RNAFolding.php
  2. https://github.com/xlr8runner/RNA-Folding
  3. https://github.com/cgoliver/Nussinov

If I left something out, please mention it in the comments 🙂