ISMB/ECCB 2017 conference – highlights

I would like to comment on two great lectures given at the ISMB/ECCB 2017 conference.

Overall, the conference was great; however, there were so many tracks that it was impossible to attend every interesting talk, as they often overlapped. I am aware that the field covered is really broad, and even with tracks held simultaneously it was a pretty long conference, but for me it was a bit frustrating to have to choose between so many interesting talks.

What caught my eye was the still-increasing interest in machine learning and single-cell sequencing. Judging by this conference, one could assume these were the ‘hot topics’ of bioinformatics in 2017.

I would like to highlight two talks that were not really about bioinformatics itself, but rather about social issues related to it.

(1) Open Humans: Opening human health data – Talk by Madeleine Ball

A really inspiring talk about the open data movement: the pros, cons and ambivalent aspects of gathering open data (understood mainly, but not only, as genome data). The Open Humans project, which Madeleine Ball co-founded and advocates for, is a platform for sharing your data with selected scientific projects (at every step you decide whether you want to share or not). It may help push research towards faster and more reliable answers. However, shared data is sensitive, as it may serve to identify a person, so we need to remember the right not to share. The world is never black and white.

(2) Bioinformatics: A Servant or the Queen of Molecular Biology? – Talk by Pavel Pevzner

This talk was really about education and the next steps for MOOCs (Massive Open Online Courses), not about bioinformatics itself. The discussion afterwards was also interesting, as it showed different points of view on what people believe is the best way to share knowledge and to learn yourself. Are academic lectures really needed? Is personal contact with your professor and your colleagues necessary? Or can you do similarly well (or better) with online peer-to-peer help? The discussion showed that the optimal solution can actually be quite personal (can we do something about it? This is a really important question for pedagogues and educators). I think it was an amazing ending for this conference, as we cannot do science without knowledge sharing and teaching. We cannot be unavailable to those who want to learn and work in our field. We have to show why this is so amazing and worth every effort to push science even a little bit further.

I was a student of Pavel Pevzner during his online courses on Coursera, and the form of knowledge sharing they developed is really amazing and, I think, suitable for learning bioinformatics for both programmers and biologists (but what do I know).

Both lectures should soon be available on the conference’s website and I really recommend listening to them yourself.


SymPy 1.1 has been released

Have you ever used SymPy? As the authors say: “SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python.”

You can install the new release simply with:

pip install -U sympy

It is promised that it will soon also be available via conda.

What is new (among other things)?

1. Many improvements to code generation, including the addition of tensorflow (to lambdify), C++, llvm JIT, and Rust language support, as well as the beginning of AST classes (see the short lambdify sketch after this list).

2. Several bug fixes for floating point numbers using higher than the default precision.

3. Improvements to the group theory module.

4. Implementation of Singularity Functions to solve Beam Bending problems.

5. Improvements to the mechanics module.
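
If you have never touched it, here is a minimal sketch of what symbolic mathematics with SymPy looks like, including the lambdify function mentioned in point 1 (shown here with the plain numpy backend; the expression itself is made up purely for illustration):

import numpy as np
import sympy as sp

x = sp.symbols('x')
expr = sp.sin(x) * sp.exp(x)

# symbolic manipulation: differentiate and integrate exactly
print(sp.diff(expr, x))        # exp(x)*sin(x) + exp(x)*cos(x)
print(sp.integrate(expr, x))   # exp(x)*sin(x)/2 - exp(x)*cos(x)/2

# lambdify turns the symbolic expression into a fast numerical function
f = sp.lambdify(x, expr, 'numpy')
print(f(np.linspace(0, 1, 5)))  # evaluates the expression on a numpy array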

As the main author (Aaron Meurer) says: “A total of 184 people contributed to this release. Of these, 143 people contributed to SymPy for the first time for this release.” Maybe you will be next?

Multiple other projects also use SymPy; just to name a few: Cadabra, a (quantum) field theory system; the LaTeX Expression project for typesetting algebraic expressions; and yt for analyzing and visualizing volumetric data.

 

The official SymPy page can be found here: http://www.sympy.org/en/index.html

It is also freely available on GitHub: https://github.com/sympy/sympy.github.com

Exact information about the release/authors/deprecations/etc. can be found here: https://github.com/sympy/sympy/wiki/Release-Notes-for-1.1

 

Why use Jupyter Notebook for data analyses?

I think that everyone interested in data science and data analysis, somewhere, somehow during their education or internet searches, comes across Jupyter Notebook. Jupyter Notebook is an application that enables you to create (and share) documents that contain code (in various programming languages), explanations (text) and visualizations. Jupyter Notebook is super useful when you want to show your workflow and prepare a how-to for future analyses, for yourself or your team.

I use Jupyter Notebook with Python 3, but you can use it with various other programming languages if you prefer. Python has a very broad range of libraries for statistical analysis, data visualization and machine learning.

With Jupyter Notebook you can show every step of your data transformation, displaying, e.g., pandas DataFrames in a really nice form:

[Screenshot: a pandas DataFrame rendered as a table in Jupyter Notebook]

Moreover, you can include plots together with the code used to create them, so you can easily reproduce them for other data:

[Screenshot: a plot displayed inline below the code cell that produced it]

Just to mention, a super useful thing in Jupyter Notebook is:

%matplotlib inline

which makes your plots appear as soon as you execute a cell, without calling

plt.show()
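
Putting these pieces together, a minimal pair of notebook cells could look like the sketch below (the DataFrame contents are made up purely for illustration):

%matplotlib inline
import pandas as pd

# hypothetical example data
df = pd.DataFrame({
    'day': range(1, 8),
    'pageviews': [120, 98, 133, 170, 160, 90, 75],
})
df.head()  # the last expression of a cell is rendered as a nicely formatted table

# and in the next cell:
df.plot(x='day', y='pageviews')  # the figure appears right below the cell, no plt.show() needed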

I hope you can see the indisputable perks of using Jupyter Notebook; I strongly encourage you to try it out.

If you’re into Jupyter Notebook, there is a conference this August in New York called JupyterCon (https://conferences.oreilly.com/jupyter/jup-ny).

There are a lot of interesting projects around Jupyter Notebook, like JupyterHub (https://jupyterhub.readthedocs.io/en/latest/), which allows a Jupyter server to be used by multiple users, or nteract (https://nteract.io/), which turns Jupyter Notebooks into a desktop application so they are even easier to use.

Working with GA analytics from Python API

To analyse data from your website (users, sessions, page views), many would choose Google Analytics – GA (https://analytics.google.com/). To obtain data you can go to the GA website, choose the metrics and time frame you need, and download the data to work with in csv or xlsx format. However, there is also an API to import data directly from GA into a data-analysis script (e.g. within a Jupyter notebook), so why not explore it instead?

How do you connect to the GA API and use data from it? Great step-by-step instructions are available on the Google APIs websites (https://developers.google.com/sheets/api/quickstart/python), and here I will just combine all of the needed info in one place.

First of all:

  1. Go to this page: https://console.developers.google.com/flows/enableapi?apiid=sheets.googleapis.com to create a project and (automatically) turn on the API. Click Continue, then click Go to credentials.
  2. On the Add credentials to your project page, click the Cancel button.
  3. At the top of the page, select the OAuth consent screen tab. Select an Email address, enter a Product name if not already set, and click the Save button.
  4. Select the Credentials tab, click the Create credentials button and select OAuth client ID.
  5. Select the application type Other, enter a name and click the Create button.
  6. Dismiss the resulting dialog.
  7. Click the file_download (Download JSON) button to the right of the client ID.
  8. Move this file to your working directory and rename it to client_secrets.json.
  9. On your computer, install google-api-python-client:
    pip install --upgrade google-api-python-client

    or

    sudo easy_install --upgrade google-api-python-client

and you’re good to go.

Now how to connect to GA and send your first request:

(the example is based on https://github.com/EchoFUN/GAreader/blob/master/hello_analytics_api_v3.py)

from googleapiclient.errors import HttpError
from googleapiclient import sample_tools
from oauth2client.client import AccessTokenRefreshError

def ga_request(input_dict):
    # Authorise against the Analytics API using client_secrets.json
    service, flags = sample_tools.init(
        [], 'analytics', 'v3', __doc__, __file__,
        scope='https://www.googleapis.com/auth/analytics.readonly')

    try:
        first_profile_id = get_first_profile_id(service)
        if not first_profile_id:
            print('Could not find a valid profile for this user.')
        else:
            results = get_top_keywords(service, first_profile_id, input_dict)
            return results

    except TypeError as error:
        print('There was an error in constructing your query : %s' % error)

    except HttpError as error:
        print('Arg, there was an API error : %s : %s' %
              (error.resp.status, error._get_reason()))

    except AccessTokenRefreshError:
        print('The credentials have been revoked or expired, please re-run '
              'the application to re-authorize')

def get_first_profile_id(service):
    # Walk account -> web property -> view (profile) and return the first profile id found
    accounts = service.management().accounts().list().execute()
    if accounts.get('items'):
        firstAccountId = accounts.get('items')[0].get('id')
        webproperties = service.management().webproperties().list(
            accountId=firstAccountId).execute()

        if webproperties.get('items'):
            firstWebpropertyId = webproperties.get('items')[0].get('id')
            profiles = service.management().profiles().list(
                accountId=firstAccountId,
                webPropertyId=firstWebpropertyId).execute()

            if profiles.get('items'):
                return profiles.get('items')[0].get('id')

    return None

def get_top_keywords(service, profile_id, input_dict):
    # Build and execute the Core Reporting query described by the request dictionary
    if input_dict['filters'] == '':
        return service.data().ga().get(
            ids=input_dict['ids'],
            start_date=input_dict['start_date'],
            end_date=input_dict['end_date'],
            metrics=input_dict['metrics'],
            dimensions=input_dict['dimensions']).execute()
    return service.data().ga().get(
        ids=input_dict['ids'],
        start_date=input_dict['start_date'],
        end_date=input_dict['end_date'],
        metrics=input_dict['metrics'],
        filters=input_dict['filters'],
        dimensions=input_dict['dimensions']).execute()

Save this file as ga_api_example.py

You need to remember that the API may use different names for metrics and other features than the web interface does; e.g. custom dimensions are just called dimension[no] (as in ga:dimension1 in the request below).

And now, after loading your file

from ga_api_example import ga_request

you can prepare a request:

request = {
    "ids": "ga:<your_id>",
    "start_date": "2017-06-25",
    "end_date": "2017-06-25",
    "metrics": "ga:pageviews",
    "filters": "ga:dimension1=~yes",
    "dimensions": ""
}

The =~ operator means “match this regular expression”, just as in the reports on the GA website (== would be an exact match).

and execute it:

data = ga_request(request)

Now you have the data in your Python script, where you can work with it, e.g. with pandas.
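
For example, here is a minimal sketch of turning the response into a pandas DataFrame (assuming the request succeeded and the response follows the v3 format, i.e. columnHeaders describing the columns and rows holding the values):

import pandas as pd

# 'columnHeaders' describes the columns, 'rows' holds the values (as strings)
columns = [header['name'] for header in data.get('columnHeaders', [])]
df = pd.DataFrame(data.get('rows', []), columns=columns)

# metric values come back as strings, so cast them before doing any maths
df['ga:pageviews'] = df['ga:pageviews'].astype(int)
print(df.head())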

Seaborn library for pretty plots

Seaborn is a visualization library based on matplotlib (and complementary to it; you should really understand matplotlib first). It basically makes your work easier and prettier. The library is not particularly complicated or broad, but it does some things for you that you would otherwise have to do in matplotlib on your own.

Seaborn works very well with the libraries used in Python for data analysis (pandas, numpy, scipy, statsmodels) and can easily be used in a Jupyter notebook for displaying plots. The most frequently mentioned advantage of seaborn is its built-in themes. Well… they help people who don’t know how to combine colors to make plots aesthetic. The second thing is that its functions make really nice plots that try to show something useful even when called with a minimal set of arguments. However, as with matplotlib, you have an almost endless number of possibilities to adjust your plots.
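
For instance, switching to one of the built-in themes is a one-liner; here is a small sketch with made-up data (set_style and the 'whitegrid'/'darkgrid' themes are standard seaborn):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')  # other built-in themes: 'darkgrid', 'dark', 'white', 'ticks'

# made-up data, just to show the styling
x = np.random.randn(200)
y = 2 * x + np.random.randn(200)

plt.scatter(x, y)
plt.show()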

Additional information about seaborn can be found here: https://seaborn.pydata.org/

Installation is pretty easy:

pip install seaborn

or

conda install seaborn

Additional info about installation (including the development version of seaborn) can be found here: http://seaborn.pydata.org/installing.html

They did a really nice job also when it comes to documentation and tutorials (e.g. https://seaborn.pydata.org/tutorial.html).

My favorite thing about seaborn? I would say the seaborn.distplot function. I usually do two visualizations to look at data before working with it: a scatter plot and a distribution. A scatter plot is probably easy to obtain in any visualization library you work with, as it is with matplotlib. However, to see the distribution along with a KDE plot, I recommend seaborn’s distplot function.

Here are some examples: https://seaborn.pydata.org/generated/seaborn.distplot.html?highlight=distplot#seaborn.distplot

Basically, all you need is to have your data in an array, e.g. as a pandas DataFrame column:

import seaborn as sns
ax = sns.distplot(df['column name'])

To try it out with example data, you can plot normally distributed data:

import seaborn as sns, numpy as np
x = np.random.randn(100)
ax = sns.distplot(x)

 

Data science in business – a perspective from a first-day employee

This is the one and only opportunity for me to write down my thoughts about data science after the first couple of days in a new office. I decided to start another job, my first in data analysis and my first in a software house, and the difference was huge from day one.

First of all, I had no idea what anyone was talking about. All those abbreviations and office slang are a bit overwhelming at first. But you get used to it and understand more and more every day, even though after a couple of days it is still not enough. But hey! During the first weeks you’re allowed to ask stupid questions, so use that.

Second of all, your technology of choice is not necessarily the one used in the office. Even if you talked about the technologies used during the interview and were asked to bring your chosen technology into the new office, you will also have to use the technologies they already use. This is not a bad thing at all; it gives you an opportunity to broaden your knowledge and perspective. It is easier to understand the data the way they work with it, and once you understand it you can carry on with your own technologies.

The third thing is co-operation. You will not be alone with the data: others need your results, they (the engineers) change the way the data looks, and they want to learn how to do some analyses on their own. Strong co-operation is crucial in data analysis.

Fourth: you need to understand the goal of your existence in the company. You don’t analyse the data just to analyse it; there is a company strategy behind it, and you have to keep that in mind all the time.

Last, but not least, and connected to all of the above: the company lives its own life, what you do today may not be so necessary tomorrow, and sometimes you have to deal with unfinished projects. Just go with the flow (and the company’s development).

Get noticed – wrapping up

So the contest I was participating in – Get Noticed – has ended. It prompted me to start this blog, and I think I will continue to write about bioinformatics, data science, Python and so on (although maybe one post per week will be easier to manage).

I am really happy that I lasted to the end and that I am among the 183 finalists (out of over 1000 contestants).

I had a really great adventure during this contest. I didn’t finish my project but the end of the contest does not mean the end of the project.

I even got the courage to apply for a job in data science, and I got it (it will be my third job, but I hope I’ll manage).

In June I plan to launch a real web server with the modules I managed to develop during the contest. So keep in touch.