Python in Big Data #1 – Hadoop & Snakebite

One of the terms you run into most often when searching for Big Data information is Hadoop. What exactly is Hadoop? And what are the first steps to working with Hadoop from Python?

At the core of Hadoop is a distributed file system (HDFS) that enables scalable data handling. Basically, Hadoop was developed to store large amounts of data while providing reliability and scalability. HDFS is block-based: each file is split into blocks of data. HDFS relies on two kinds of processes: a NameNode that keeps the metadata and DataNodes that store the blocks of data. Blocks are replicated on different machines so that the data stays available when one machine crashes.
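To make the block and replica idea more concrete, here is a minimal toy model in plain Python. It is only an illustration and not how HDFS is implemented: the function names, the DataNode names and the round-robin placement are assumptions made for this sketch (the real placement policy is rack-aware and more involved). The 128 MB block size and replication factor of 3 are the usual defaults in recent Hadoop versions.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of `file_size` bytes is split into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication):
    """Assign every block to `replication` distinct DataNodes (simple round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + offset) % len(datanodes)]
            for offset in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_node):
    """Return the blocks that still have a replica after `failed_node` crashes."""
    return [
        block_id
        for block_id, nodes in sorted(placement.items())
        if any(node != failed_node for node in nodes)
    ]


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)   # two full blocks and one of 44 MB
placement = place_replicas(len(blocks), ['dn1', 'dn2', 'dn3', 'dn4'], 3)
print(placement)
print(readable_blocks(placement, 'dn2'))         # every block survives the crash
```

In this model the mapping returned by place_replicas plays the role of the metadata kept by the NameNode, while the lists of node names stand for the DataNodes holding each block.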

There is a growing number of Python libraries for working with Hadoop. To name a few: Snakebite, mrjob, PySpark.

Snakebite enables basic file operations on HDFS and gives Python applications access to Hadoop. It can be installed easily:

pip install snakebite

Once it is installed, you can connect to the HDFS NameNode from a Python application with:

from snakebite.client import Client
# host and port of the NameNode; adjust them to match your cluster configuration
client = Client('localhost', 8000)

You can then call methods that mirror familiar shell commands, such as ls, mkdir and delete, to handle files and directories in HDFS. Another group of methods (e.g. copyToLocal) retrieves data from HDFS.

from snakebite.client import Client
client = Client('localhost', 8000)
# copyToLocal returns a lazy generator, so iterate over it to actually run the copy
for result in client.copyToLocal(['/hdfs_1/file1.txt'], '/tmp'):
    print(result)
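The shell-like methods mentioned earlier (ls, mkdir, delete) can be used in a similar way. The sketch below wraps them in small helper functions; the helper names, the demo function and the '/demo_dir' path are hypothetical and only serve the example. Snakebite methods return generators and do nothing until they are consumed, which is why each helper turns the result into a list.

```python
def make_dir(client, path):
    """Create `path` (including missing parents) and return the result dicts."""
    # list() consumes the generator, which is what triggers the operation
    return list(client.mkdir([path], create_parent=True))


def list_dir(client, path):
    """Return the paths of the entries found under `path`."""
    return [entry['path'] for entry in client.ls([path])]


def remove(client, path):
    """Delete `path` recursively and return the result dicts."""
    return list(client.delete([path], recurse=True))


def demo():
    """Run the helpers against a NameNode (assumed to listen on localhost:8000)."""
    # Imported here so the helpers above can be loaded without Snakebite installed
    from snakebite.client import Client
    client = Client('localhost', 8000)
    print(make_dir(client, '/demo_dir'))   # hypothetical directory name
    print(list_dir(client, '/'))
    print(remove(client, '/demo_dir'))
```

Calling demo() requires a running HDFS cluster; the helpers themselves accept any object that offers the same ls, mkdir and delete methods.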

Snakebite also provides a CLI client.

Some additional information can be found in a free book:

Hadoop with Python by Zachary Radtka and Donald Miner

http://www.oreilly.com/programming/free/hadoop-with-python.csp
