Hadoop is one of the terms most often mentioned when searching for information about Big Data. What exactly is Hadoop? And what are the first steps to handling Hadoop with Python?
Hadoop is a framework for scalable data handling built around a distributed file system, HDFS (Hadoop Distributed File System). Basically, Hadoop was developed to store large amounts of data while providing reliability and scalability. It is based on a data-block system: HDFS relies on two kinds of processes, a NameNode that keeps the metadata and DataNodes that store the blocks of data. Blocks of data are replicated across different machines so the data stays available when one machine crashes.
There is an increasing number of Python libraries that enable handling Hadoop. Just to name some: Snakebite, mrjob, PySpark.
Snakebite enables basic file operations on HDFS and access to HDFS from Python applications. It can be installed easily (note that the original Snakebite library runs on Python 2):
pip install snakebite
Once installed, you can connect to the HDFS NameNode from within a Python application:
from snakebite.client import Client
client = Client('localhost', 8000)
Then you can use various methods similar to shell commands: ls, mkdir and delete handle files and directories within HDFS. Another group of methods (e.g. copyToLocal) retrieves data from HDFS.
from snakebite.client import Client
client = Client('localhost', 8000)
for file in client.copyToLocal(['/hdfs_1/file1.txt'], '/tmp'):
    print file
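To illustrate the first group of methods, here is a minimal sketch using the Client API's ls, mkdir and delete, each of which returns a generator of per-path result dictionaries. The helper function names and the /tmp/demo path are made up for this example, and a reachable NameNode (as in the connection snippet above) is assumed.

```python
# Sketch only: assumes the snakebite library and a running NameNode.
# Helper names and paths are illustrative, not part of Snakebite itself.

def list_paths(client, path='/'):
    # ls() yields one dict per entry; 'path' holds the full HDFS path
    return [entry['path'] for entry in client.ls([path])]

def make_then_delete(client, path='/tmp/demo'):
    # mkdir() and delete() also yield per-path result dicts
    for r in client.mkdir([path], create_parent=True):
        print(r)
    for r in client.delete([path], recurse=True):
        print(r)

# Typical usage (requires a live cluster):
#   from snakebite.client import Client
#   client = Client('localhost', 8000)
#   list_paths(client)
#   make_then_delete(client)
```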
Snakebite also provides a CLI client.
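For example, the same operations can be run from the shell (the HDFS paths here are illustrative, and the CLI needs a configured cluster to talk to):

snakebite ls /
snakebite mkdir /tmp/demo
snakebite rm /tmp/demo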
Additional information can be found in a free book:
Hadoop with Python by Zachary Radtka and Donald Miner