Python in Big Data #1 – Hadoop & Snakebite

One of the most frequently mentioned terms when searching for Big Data information is Hadoop. What exactly is Hadoop? And what are the first steps to working with Hadoop from Python?

Hadoop is a framework whose storage layer, the Hadoop Distributed File System (HDFS), enables scalable data handling. Basically, it was developed to store large amounts of data while providing reliability and scalability. HDFS stores files as blocks of data and is based on two kinds of processes: a NameNode that holds the metadata and DataNodes that store the blocks. Blocks are replicated on different machines so that the data survives when one machine crashes.
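To make the block and replication idea more concrete, here is a toy model in plain Python. It is only a sketch: the function names, the node names, the tiny block size and the round-robin placement are my own assumptions for illustration, and real HDFS uses a more elaborate placement policy.

```python
BLOCK_SIZE = 4     # bytes per block; tiny on purpose (real HDFS blocks are far larger)
REPLICATION = 3    # number of copies kept of every block (assumed for this toy model)


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a bytes object into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, datanodes, replication=REPLICATION):
    """Return (metadata, storage).

    metadata plays the NameNode role: block index -> names of the nodes holding it.
    storage plays the DataNode role: node name -> {block index: block bytes}.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    metadata = {}
    storage = {node: {} for node in datanodes}
    for index, block in enumerate(blocks):
        # Simplified round-robin placement: `replication` distinct nodes per block
        chosen = [datanodes[(index + k) % len(datanodes)] for k in range(replication)]
        metadata[index] = chosen
        for node in chosen:
            storage[node][index] = block
    return metadata, storage


def read_file(metadata, storage, failed=()):
    """Reassemble the file using any replica on a node that has not failed."""
    parts = []
    for index in sorted(metadata):
        alive = [node for node in metadata[index] if node not in failed]
        if not alive:
            raise IOError("block %d lost" % index)
        parts.append(storage[alive[0]][index])
    return b"".join(parts)


nodes = ["dn1", "dn2", "dn3", "dn4"]          # hypothetical DataNode names
data = b"hello hadoop world"
metadata, storage = place_blocks(split_into_blocks(data), nodes)

# Simulate the crash of one machine and read the file back anyway
recovered = read_file(metadata, storage, failed={"dn2"})
print(recovered == data)
```

Because every block lives on three different nodes, losing a single node still leaves at least two copies of each block, so the file can be reassembled.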

There is an increasing number of Python libraries that enable working with Hadoop. Just to name some: Snakebite, mrjob, PySpark.

Snakebite enables basic file operations on HDFS and access to HDFS from Python applications. It can be installed easily:

pip install snakebite

Once it is installed, you can connect to the HDFS NameNode from a Python application (the host and port must match your NameNode configuration) with:

from snakebite.client import Client
client = Client('localhost', 8000)

Then you can call various methods similar to shell commands, such as ls, mkdir and delete, to handle files and directories in HDFS. Another group of methods (e.g. copyToLocal) enables retrieving data from HDFS.
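A minimal sketch of how ls, mkdir and delete could be wrapped is shown here. The helper names (list_paths, make_directory, remove_path, connect) are my own and hypothetical, the paths in the usage comments are made up, and the keyword arguments create_parent and recurse reflect my understanding of the Snakebite client API, so check them against the version you have installed.

```python
def connect(host='localhost', port=8000):
    """Create a Snakebite client; host and port are the ones assumed in this post."""
    # Imported lazily so the helpers below can be tried without Snakebite installed
    from snakebite.client import Client
    return Client(host, port)


def list_paths(client, directory):
    """Return the paths that ls reports for `directory`."""
    # ls yields one dictionary per entry; the 'path' key holds the full path
    return [entry['path'] for entry in client.ls([directory])]


def make_directory(client, path):
    """Create `path`, including missing parent directories."""
    # mkdir returns a generator, so it has to be consumed for anything to happen
    return list(client.mkdir([path], create_parent=True))


def remove_path(client, path):
    """Delete `path` recursively."""
    # delete also returns a generator that must be consumed
    return list(client.delete([path], recurse=True))


# Possible usage against a running cluster (hypothetical paths):
#   client = connect()
#   make_directory(client, '/hdfs_1/reports')
#   print(list_paths(client, '/hdfs_1'))
#   remove_path(client, '/hdfs_1/reports')
```

The helpers take the client as an argument, so they can also be exercised with a stand-in object when no cluster is available.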

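from snakebite.client import Client

# Host and port of the NameNode, as in the example above
client = Client('localhost', 8000)
# copyToLocal returns a generator, so iterate over it to perform the copy
for result in client.copyToLocal(['/hdfs_1/file1.txt'], '/tmp'):
    print(result)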

Snakebite also provides a CLI client.

Some additional information can be found in a free book:

Hadoop with Python by Zachary Radtka and Donald Miner


Examples of Data Scientists’ Portfolios

While searching for employment as a Data Scientist, it is important to show your skills with a well-prepared portfolio, just as developers use their GitHub accounts to show their programming skills. Your portfolio should show how you use the skills stated in your résumé. I think the most important thing is to tell a story about the chosen data: show what you can do with openly available data, how insightful you are when asking questions based on the data, and whether you can present the results clearly (but also aesthetically).

Let’s look at some examples of how projects are presented:

1. Projects – scientific way

I like this portfolio because I am a scientist (but it’s probably not ideal for recruiters). Each project is described with an abstract, methods and results with discussion (with an accompanying figure). It’s quite simple, with no graphical fireworks, but it’s clear.

2. Projects – more advertising way

Projects are represented by a title, a short comment and an image that links to the GitHub project (code).

3. Analyses – tools used

As the author states, these are not exactly projects but activities. Each activity is represented by a graph and a short description of the statistical method/tool used. It certainly shows the author’s skills.

4. Projects – very advertising

The author certainly knows how to make a nice website ;). Again, projects are represented by a title, a short comment (here the caption also lists the technologies used) and an image that links to an extended project description or a website presenting the results.

5. Projects – story telling

I really like this portfolio as it is a real ‘storytelling’ portfolio. When you open a project, you are presented with a story about the data and the various approaches to analysing it.

As you can see, each Data Scientist has a different way of showing their expertise. Which is best? Hard to say; it depends on what you want to do with data science and what kind of company you want to work in.

Morality of Data Scientist

Can the graphical representation of data influence the audience’s understanding? No doubt. In the last couple of years there has been a heated discussion about manipulating data (knowingly or simply through ignorance) with various graphical inaccuracies.

When preparing a graphical representation of your data, no matter whether it is statistics taken from big data, the popularity of your app or the speed improvement of a module you’re working on, you need to take the audience of your graphic into account. You should ask yourself what a reader of your graph will understand without your explanation: is it what you wanted them to get from the graph (and I assume that this is the truth, not what you would like to be the truth)? The most common mechanisms of misleading in graphical representations of data (according to specialists in the field) are showing too much data, showing not enough data, and distorting data. There are many examples of such misleading graphics, especially in journalism. However, it is just as important to remember about truthful graphics in data science, which is a constantly growing branch of IT.
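One well-known way of distorting data is a bar chart whose value axis does not start at zero. The small calculation below is my own illustration with made-up numbers (it is not taken from the course or from the reading listed at the end of this post); it compares how much taller one bar looks than the other with how much larger the value really is.

```python
def drawn_height(value, baseline=0.0):
    """Height of a bar for `value` when the axis starts at `baseline`."""
    return value - baseline


def apparent_ratio(smaller, larger, baseline=0.0):
    """How many times taller the larger bar is drawn than the smaller one."""
    return drawn_height(larger, baseline) / drawn_height(smaller, baseline)


# Hypothetical numbers: two results that differ by 10%
before, after = 50.0, 55.0

honest = apparent_ratio(before, after)                  # axis starts at 0
truncated = apparent_ratio(before, after, baseline=45)  # axis starts at 45

print("true ratio of the values: %.2f" % (after / before))
print("ratio of bar heights, axis from 0:  %.2f" % honest)
print("ratio of bar heights, axis from 45: %.2f" % truncated)
```

With the axis starting at 45, the second bar is drawn twice as tall as the first, although the value is only 10% larger. A reader who sees the graph without an explanation will most likely remember the bars, not the tick labels.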

Graphical representations of data created with programming languages are built from scratch, and you can control practically any element of the graph, depending on your knowledge of the libraries. You should therefore work on your skills to make sure that you really are the designer of your graphs and do not rely on accidents that may lead to misleading graphics (through so-called ignorance). In my experience Python and R are perfect for complete control of your graphics. Data scientists and bioinformaticians use these programming languages widely, and many open-source libraries are available.

I was inspired to write this post by the course I’m attending (Coursera, Applied Plotting, Charting & Data Representation in Python, Applied Data Science with Python, week 1).

If you’re interested, here is some additional reading:

Cairo, A. (2015). Graphics lies, misleading visuals. In New Challenges for Data Design (pp. 103-116). Springer London.

Coursera and Udemy

I think everyone knows Coursera, Udemy and the many other websites that provide online courses. Most of the courses offered by these websites are free, and you can learn a lot from them, no matter what your area of interest is. I would like to present some courses that I have tested on myself: what I liked about them and what frustrated me. All of them are connected to Python, bioinformatics or data analysis.

  1. Programming for Everybody (Python) – Coursera


This course is absolutely too easy if you already know Python, but I would really recommend it to people who are starting with this programming language. Of course there are a lot of courses for Python beginners; I tried some myself (maybe I will write about them later), but this one was really user-friendly. In comparison, I didn’t like the course ‘An Introduction to Interactive Programming in Python’ (Coursera). I didn’t even finish it, but on the other hand it has really nice reviews. So I guess it’s really a personal matter which courses are best for you to start Python programming with.


  2. Network analysis in systems biology – Coursera


I have to say that I don’t remember much from this course; I had to go back to Coursera to remind myself what it was all about. That is mostly because I did not use what I learned afterwards, probably because the course explains just one specific problem. For me it was interesting, but keep in mind that the course is a bit narrow. It’s not really about programming to solve network analysis problems, but about how to use the proper software to do it. And as I remember, it was quite easy to slip through the course.


  3. Bioinformatics specialization – Coursera


With this specialization you can extend your knowledge of Python (only if you already know the basics, as they won’t teach you Python; you can actually use any programming language you want) and learn how to use your programming skills to solve real biological problems. For me it was the best course I’ve taken so far. There is quite a lot of algorithmics, and you learn how to turn pseudocode into actual code in your preferred programming language. Your solutions are checked not only for correct answers but also for time efficiency, which is really important when you work with huge biological data sets. If you’re into bioinformatics, this course is a ‘must do’.


  4. Applied Data Science with Python specialization – Coursera


I’m in the middle (let’s say at the beginning) of this specialization, having finished only the first course (because only one has been available so far; I can’t wait for more). I think it’s quite amazing. I was looking for a data science course in Python that would show me how to work with real data, with wrong formatting or missing values, and this is a really great place to start. They give you the basics and how things work (the videos are really helpful), but you really learn from the examples. Sometimes it is hard to get the right answer (sometimes due to vague questions) and it’s really frustrating, but hey! Data science is not so easy in real life, and employers’ requests are not so clear either. People on the course forum are also really helpful. So far so good.


  5. Python for Data Science and Machine Learning Bootcamp – Udemy


I’m in the middle of this course. It is nicely prepared; you can even start with almost no Python expertise, as it includes a crash course in Python. It even has a step-by-step tutorial on how to install the Python version you need. Everything is prepared in Jupyter, a big plus for that. You can try the code for yourself whenever you need to. It is not really challenging most of the time, because if something is too hard you can just check the answer in the other Jupyter notebook, but it’s up to you whether you do.


To sum up briefly, I really recommend online courses like those offered by Coursera and Udemy. You can always find something new to learn. I think you have to be more careful on Udemy, where there are many short courses of little value, but you can also find really well-prepared material. Of course, sometimes you will find a course that is not really useful for you, or too easy or too hard, but it will always broaden your knowledge somehow.


Fun fact: while my boyfriend was spending his money on games in the Steam sale, I spent mine in the Udemy sale ;).