Working with GA analytics from the Python API

To analyse data from your website – users, sessions, page views – many would choose Google Analytics, or GA (https://analytics.google.com/). To obtain the data you can go to the GA website, choose the metrics and time frame you need and download the data in CSV or XLSX format. However, there is also an API to import data from GA directly into a data-analysis script (e.g. within a Jupyter notebook), so why not explore it instead?

How do you connect to the GA API and use data from it? Great step-by-step instructions are available on Google's API quickstart pages (https://developers.google.com/sheets/api/quickstart/python) and here I will just combine all the needed info in one place.

First of all:

  1. Go to this page: https://console.developers.google.com/flows/enableapi?apiid=sheets.googleapis.com and create a project and (automatically) turn on the API. Click Continue, then click Go to credentials.
  2. On the Add credentials to your project page, click the Cancel button.
  3. At the top of the page, select the OAuth consent screen tab. Select an Email address, enter a Product name if not already set, and click the Save button.
  4. Select the Credentials tab, click the Create credentials button and select OAuth client ID.
  5. Select the application type Other, enter a name and click the Create button.
  6. Dismiss the resulting dialog.
  7. Click the file_download(Download JSON) button to the right of the client ID.
  8. Move this file to your working directory and rename it to client_secrets.json.
  9. Install google-api-python-client on your computer:

    pip install --upgrade google-api-python-client

    or

    sudo easy_install --upgrade google-api-python-client

and you’re good to go.
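Before moving on, you can make sure the library installed correctly with a quick import check (just a sanity test, nothing GA-specific; I'm assuming a standard install here):

python -c "import googleapiclient; print(googleapiclient.__version__)"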

Now, here is how to connect to GA and send your first request:

(the example is based on https://github.com/EchoFUN/GAreader/blob/master/hello_analytics_api_v3.py)

from googleapiclient.errors import HttpError
from googleapiclient import sample_tools
from oauth2client.client import AccessTokenRefreshError

def ga_request(input_dict):
    # Authenticate and build the Analytics v3 service object;
    # sample_tools.init handles the OAuth flow using client_secrets.json.
    service, flags = sample_tools.init(
        [], 'analytics', 'v3', __doc__, __file__,
        scope='https://www.googleapis.com/auth/analytics.readonly')

    try:
        first_profile_id = get_first_profile_id(service)
        if not first_profile_id:
            print('Could not find a valid profile for this user.')
        else:
            results = get_top_keywords(service, first_profile_id, input_dict)
            return results

    except TypeError as error:
        print('There was an error in constructing your query: %s' % error)

    except HttpError as error:
        print('Arg, there was an API error: %s : %s' %
              (error.resp.status, error._get_reason()))

    except AccessTokenRefreshError:
        print('The credentials have been revoked or expired, please re-run '
              'the application to re-authorize')

def get_first_profile_id(service):
    # Walk down the hierarchy: account -> web property -> profile (view),
    # returning the ID of the first profile found.
    accounts = service.management().accounts().list().execute()
    if accounts.get('items'):
        firstAccountId = accounts.get('items')[0].get('id')
        webproperties = service.management().webproperties().list(
            accountId=firstAccountId).execute()

        if webproperties.get('items'):
            firstWebpropertyId = webproperties.get('items')[0].get('id')
            profiles = service.management().profiles().list(
                accountId=firstAccountId,
                webPropertyId=firstWebpropertyId).execute()

            if profiles.get('items'):
                return profiles.get('items')[0].get('id')

    return None

def get_top_keywords(service, profile_id, input_dict):
    # Query the Core Reporting API; 'filters' is optional, so it is
    # only passed when the caller provided a non-empty value.
    if input_dict['filters'] == '':
        return service.data().ga().get(
            ids=input_dict['ids'],
            start_date=input_dict['start_date'],
            end_date=input_dict['end_date'],
            metrics=input_dict['metrics'],
            dimensions=input_dict['dimensions']).execute()
    return service.data().ga().get(
        ids=input_dict['ids'],
        start_date=input_dict['start_date'],
        end_date=input_dict['end_date'],
        metrics=input_dict['metrics'],
        filters=input_dict['filters'],
        dimensions=input_dict['dimensions']).execute()

Save this file as ga_api_example.py.

Keep in mind that in the API, metrics and other features may have slightly different names than on the GA website, e.g. custom dimensions are just called ga:dimension1, ga:dimension2 and so on.

And now, after importing the function from your file

from ga_api_example import ga_request

you can prepare a request

request = {
    "ids": "ga:<your_id>",
    "start_date": "2017-06-25",
    "end_date": "2017-06-25",
    "metrics": "ga:pageviews",
    "filters": "ga:dimension1=~yes",
    "dimensions": ""
}

Here =~ means "matches regular expression" – the same regex filtering you can use in reports on the GA website (other operators include == for an exact match and =@ for "contains substring").

and execute it:

data = ga_request(request)

Now you have the data inside the Python script you work in and can process it further, e.g. with pandas.
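The response comes back as a plain dictionary, so a natural next step is turning it into a DataFrame. Here is a minimal sketch, assuming the standard Core Reporting API v3 response layout ('columnHeaders' and 'rows'); the helper name ga_to_dataframe is just my own choice for this example:

import pandas as pd

def ga_to_dataframe(response):
    # Column names come from 'columnHeaders'; the values themselves
    # are in 'rows', a list of lists of strings.
    columns = [header['name'] for header in response['columnHeaders']]
    rows = response.get('rows', [])
    return pd.DataFrame(rows, columns=columns)

df = ga_to_dataframe(data)
print(df.head())

Note that the API returns everything as strings, so metrics like ga:pageviews may need a cast (e.g. df['ga:pageviews'].astype(int)) before any numerical work.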

Data science in business – perspective from a first-day employee

This is my one and only opportunity to write down my thoughts about data science after the first couple of days in a new office. I decided to start another job – my first job in data analysis and my first in a software house – a huge difference from day one.

First of all, I had no idea what anyone was talking about. All those abbreviations and office slang are a bit overwhelming at first, but you get used to it and understand more and more every day – though after a couple of days it is still not enough. But hey! During the first weeks you're allowed to ask stupid questions, so use that.

Second of all, your technology of choice is not necessarily the one used in the office. Even if you discussed the technologies in use during the interview, and you were asked to bring your own technology into the new office, you will also have to use the technologies they use. This is not a bad thing at all: it gives you an opportunity to expand your knowledge and perspectives. It is easier to understand the data the way they work with it, and once you understand it you can go on with your own technologies.

The third thing is co-operation. You will not be alone with the data: others need your results, the engineers change what the data looks like, and they want to learn how to do some analyses on their own – strong co-operation is crucial in data analysis.

Fourth – you need to understand the goal of your existence in the company. You don't analyse the data just to analyse it; there is company strategy behind it, and you have to keep that in mind all the time.

Last, but not least, and connected to all of the above: the company lives its own life, and what you do today may not be so necessary tomorrow – sometimes you have to deal with unfinished projects. Just go with the flow (and with the company's development).

Get noticed – wrapping up

So the contest I was participating in – Get Noticed – has ended. It prompted me to start this blog, and I think I will continue writing about bioinformatics, data science, Python and so on (though maybe one post per week will be easier to manage).

I am really happy that I lasted to the end and am among the 183 finalists (out of over 1000 contestants).

I had a really great adventure during this contest. I didn't finish my project, but the end of the contest does not mean the end of the project.

I even worked up the courage to apply for a job in data science and I got it (it will be my third job, but I hope I'll manage).

In June I plan to start a real web server with the modules I managed to develop during the contest. So keep in touch.

Translation part #2

Going back to the translation module of the project: I was wondering how many of the existing translation codes I should implement (actually they are not so different – existing codes usually differ only by a couple of codons).

I referred to the NCBI page (https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi) to check which codes are broadly accepted and decided to implement all of the codes presented there (as many as 24) – incidentally, they are all bundled with Biopython, as shown in the sketch after the list:

1. The Standard Code

2. The Vertebrate Mitochondrial Code

3. The Yeast Mitochondrial Code

4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code

5. The Invertebrate Mitochondrial Code

6. The Ciliate, Dasycladacean and Hexamita Nuclear Code

7. The Echinoderm and Flatworm Mitochondrial Code

8. The Euplotid Nuclear Code

9. The Bacterial, Archaeal and Plant Plastid Code

10. The Alternative Yeast Nuclear Code

11. The Ascidian Mitochondrial Code

12. The Alternative Flatworm Mitochondrial Code

13. Chlorophycean Mitochondrial Code

14. Trematode Mitochondrial Code

15. Scenedesmus obliquus Mitochondrial Code

16. Thraustochytrium Mitochondrial Code

17. Pterobranchia Mitochondrial Code

18. Candidate Division SR1 and Gracilibacteria Code

19. Pachysolen tannophilus Nuclear Code

20. Karyorelict Nuclear

21. Condylostoma Nuclear

22. Mesodinium Nuclear

23. Peritrich Nuclear

24. Blastocrithidia Nuclear
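As it happens, these NCBI tables ship with Biopython, which is handy for cross-checking an implementation. A minimal sketch (note that the NCBI table IDs are not consecutive, so they do not map one-to-one onto the sequential numbering of the list above):

from Bio.Data import CodonTable

# NCBI table 2 is the Vertebrate Mitochondrial Code.
table = CodonTable.unambiguous_dna_by_id[2]
print(table.names)         # names/aliases of this code
print(table.start_codons)  # initiation codons, alternative ones included
print(table.stop_codons)   # stop codons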

The question that remains is whether to include all the alternative start (initiation) codons, and how to annotate the results to make them clear for the user. Even more important is how to mark translation starts/stops and alternative starts. User experience – part #1, I guess.
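To make the annotation question concrete, one option is to store the initiation codons next to the codon map and flag non-ATG starts explicitly. A minimal sketch (the dictionary layout and function name are made up for this example; the codon assignments follow the NCBI page above):

# A fragment of code 2 (the Vertebrate Mitochondrial Code): only the
# codons that differ from the standard code, plus initiation codons.
VERTEBRATE_MITO = {
    'name': 'The Vertebrate Mitochondrial Code',
    'diff_codons': {'ATA': 'M', 'TGA': 'W', 'AGA': '*', 'AGG': '*'},
    'start_codons': ['ATT', 'ATC', 'ATA', 'ATG', 'GTG'],
    'stop_codons': ['TAA', 'TAG', 'AGA', 'AGG'],
}

def is_alternative_start(codon, table):
    # Alternative initiation codons can start translation even though
    # they are not ATG (in vivo they are still read as methionine).
    return codon in table['start_codons'] and codon != 'ATG'

print(is_alternative_start('ATA', VERTEBRATE_MITO))  # True

Marking such positions in the output (for example, a lower-case 'm' for an alternative start) is one way to keep the result unambiguous for the user.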

So the translation module is not finished yet.

Biopython 1.69 released

Great news for all bioinformaticians out there – a new Biopython version has been released.

What is new?

First of all, support for Python 2.6 was dropped, but many other versions are supported (Python 2.7, 3.3, 3.4, 3.5, 3.6, PyPy v5.7, PyPy3.5 v5.7 beta, and Jython 2.7).

Many changes have been made, improving both performance and functionality and broadening compatibility with biological databases:

  • Bio.AlignIO supports the MAF format,
  • Bio.SeqIO.AbiIO supports FSA files,
  • the UniProt parser parses "submittedName" in XML files,
  • the NEXUS parser was improved to work better with tools like the BEAST TreeAnnotator,
  • a new parser was introduced for the ExPASy Cellosaurus,
  • the Bio.Seq module has a complement function,
  • the SeqFeature object's qualifiers attribute is now an explicitly ordered dictionary,
  • the Bio.SeqIO UniProt-XML parser was updated to cope with features with unknown locations, which can be found in mass-spec data,
  • the Bio.SeqIO GenBank, EMBL, and IMGT parsers were updated,
  • the Bio.Affy package supports CEL format version 4,
  • the Bio.Restriction enzyme list has been updated,
  • Bio.PDB.PDBList can now download PDBx/mmCIF files (the new default format).
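As a quick taste of two of these changes, here is a minimal sketch – I'm assuming the new complement function mirrors the long-standing reverse_complement helper, and the PDB ID and target directory are arbitrary examples:

from Bio.Seq import complement
from Bio.PDB import PDBList

# Complementing a plain string: 'ACGT' -> 'TGCA'
print(complement('ACGT'))

# Downloading a structure as PDBx/mmCIF, the new default format.
pdbl = PDBList()
pdbl.retrieve_pdb_file('1FAT', pdir='structures', file_format='mmCif')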

What I really like is that they follow "Python PEP8, PEP257 and best practice standard coding style". A great example for others ;).

You can check out all contributors and details here:

https://news.open-bio.org/2017/04/07/biopython-1-69-released/

Maybe some day you (or I) will be on that list?

And… have fun with the new Biopython 🙂

Where to look for bioinformatic knowledge?

Bioinformatics is a rapidly emerging field at the intersection of IT, biology and statistics (and chemistry, physics and so on). People who work 'in bioinformatics' have backgrounds as varied as in no other field of science. But the question is: where to learn bioinformatics, where to find interesting articles, where to find problems and data to work with (to learn on), and finally where to publish?

Where to learn? Recently I wrote about Coursera courses (https://scienceisthenewblackblog.wordpress.com/2017/02/20/coursera-and-udemy/), which I really recommend. The bioinformatics courses on Coursera are related to Rosalind (http://rosalind.info), where you can find problems to solve and check whether your solution is sufficient and correct. The Rosalind project was named after Rosalind Franklin, whose role in the understanding of the DNA structure (associated mainly with Watson and Crick) is undervalued. There is also a nice book covering various basic bioinformatic algorithms by the same authors (http://bioinformaticsalgorithms.com/). If you're already a bit familiar with bioinformatics, the best way is to work on real data. For example, for microarrays and NGS a lot of exemplary data is deposited at https://www.ncbi.nlm.nih.gov/ (in varied, not very unified formats, but you can use it to learn), and these data are often associated with papers, so you can compare what you got from your analyses with what the authors got.

There are also many Facebook fan pages (e.g., European Bioinformatics Institute (EMBL-EBI)), Twitter accounts (e.g. @Bioinformatics_, @BioinfoRRmatics) and blogs (e.g., http://www.opiniomics.org/, http://ivory.idyll.org/blog/, https://flxlexblog.wordpress.com/, http://lab.loman.net/) where you can find information on how to start with bioinformatics and on how the field is developing. You really just need to google it and you'll find an ocean of information.

For learning, I also recommend going through some bioinformatic papers and checking out the code (often available on GitHub or on the researchers' sites): try to run it and understand how it works, what technologies the authors used (of course, it is better to analyze papers from top bioinformatic labs, to be sure the choice of technology wasn't just a graduate student's guess 😉 ), what data the software works on and what output it provides. Many data types are specific to bioinformatics, and this is an interesting way to learn about them.

As for publishing, there are a lot of possibilities. There are strictly 'bioinformatics' journals, just to name a couple: Bioinformatics (Oxford), BMC Bioinformatics and Briefings in Bioinformatics. More mathematical/statistical journals have also adapted to publishing bioinformatics papers (e.g., Journal of Theoretical Biology or Journal of Mathematical Biology). Moreover, there are many possibilities to publish bioinformatic papers (tools or results of analyses) in biological journals, depending on the biological question investigated. For example, Nucleic Acids Research not only publishes bioinformatic papers on a 'daily' basis but also has a special Web Server Issue and a special Database Issue. So the possibilities are really broad.

Happy ‘bioinformating’ 😉

Say ‘Thank you’ to stackoverflow!

All of you know how important a site https://stackoverflow.com is for every programmer (it's one of the rare things that unifies Python, Java, C#, Ruby on Rails (etc.) developers). Recently I became an active user of Stack Overflow and, if that's even possible, I'm even more impressed.

Documentation beta is an interesting thing, and earlier I did not pay much attention to it. Especially interesting in the Python section are the incompatibilities between Python 2 and Python 3, as Python supports both versions for now – but I would like to write about that later, as it's a rather broad topic. You can actually learn from scratch using Stack Overflow Documentation: the documentation for R starts with installing R and 'hello world'. This Documentation differs from official documentation because, being written by users, it is based mainly on examples. Developers think about their software a bit (a lot?) differently than the users do. And that's OK – the two kinds of documentation complement each other.

I haven't used Jobs, so I won't comment on it; however, it's nice that it was included on Stack Overflow.

The most important thing: Questions. There are millions of questions about programming in any language, on any system you use. You can learn from the answers and see how many ways there are to solve the same issue. Maybe you'll notice a faster solution for your application? Just for your own learning, it's even fun to reproduce what the asker did and try to solve the problem. When you do, you can submit an answer.

However, the strength of Stack Overflow depends on users' input, so I encourage you to share your expertise with other users. Don't just be a reader: write answers, comment, score answers, ask questions and mark the best answers! It can be really rewarding :). The best way to say 'Thank you' to Stack Overflow is to get active. As we say in Polish: "No one is a lonely island" – get noticed!