Streaming Map/Reduce in Python
Links:
- http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
- http://wiki.apache.org/hadoop/ImportantConcepts
Introduction
Python seems to be supported by many Hadoop providers (Amazon, for example). I would prefer Node.js, but Python will have to do for now.
I have installed Hadoop on my Mac (see separate post).
First create the map and reduce Python scripts (mapper.py and reducer.py) described in the first linked article.
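For orientation, here is a minimal sketch of what the two scripts do. It is not the article's exact code. The function names map_words and reduce_counts are made up for this example, and the reducer assumes its input arrives sorted by key, which Hadoop Streaming provides.

```python
"""Word-count sketch for Hadoop Streaming (hypothetical helper names)."""
from itertools import groupby


def map_words(lines):
    """Mapper step: emit one 'word<TAB>1' record for every word seen."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word, 1)


def reduce_counts(lines):
    """Reducer step: sum the counts of each word.

    Assumes the records are already sorted by word, so equal words
    sit next to each other. Lines without a tab are skipped.
    """
    pairs = (
        line.rstrip("\n").split("\t", 1) for line in lines if "\t" in line
    )
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield "%s\t%d" % (word, total)
```

In mapper.py the first function would be fed sys.stdin and each yielded record printed; reducer.py would do the same with the second function.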
Then copy the data to the Hadoop file system:
# Create a directory for the job input
hadoop fs -mkdir /Users/jonas/hadoop-store/mapred/wordcount
# Do not create the output directory; the job creates it and fails if it already exists
# Copy the data
hadoop fs -put /Users/jonas/git/colmsjo/wip/Python/MapReduce_example/zaratustra.txt /Users/jonas/hadoop-store/mapred/wordcount
# Check that it's there
hadoop fs -ls /Users/jonas/hadoop-store/mapred/wordcount
Now run the job:
hadoop jar /usr/local/Cellar/hadoop/1.1.1/libexec/contrib/streaming/hadoop-streaming-1.1.1.jar \
-file /Users/jonas/git/colmsjo/wip/Python/MapReduce_example/mapper.py \
-mapper /Users/jonas/git/colmsjo/wip/Python/MapReduce_example/mapper.py \
-file /Users/jonas/git/colmsjo/wip/Python/MapReduce_example/reducer.py \
-reducer /Users/jonas/git/colmsjo/wip/Python/MapReduce_example/reducer.py \
-input /Users/jonas/hadoop-store/mapred/wordcount/* \
-output /Users/jonas/hadoop-store/mapred/wordcount-output
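If the job fails, it can help to reproduce the map, sort, reduce pipeline on the local machine first. The sketch below is one way to do that, assuming Python 3.7 or later. The helper name run_locally is invented for this example, and sorting whole lines is only a rough stand-in for Hadoop's shuffle.

```python
import subprocess


def run_locally(mapper_cmd, reducer_cmd, text):
    """Imitate Hadoop Streaming on one machine: map, sort, then reduce."""
    mapped = subprocess.run(
        mapper_cmd, input=text, capture_output=True, text=True, check=True
    ).stdout
    # Hadoop sorts mapper output by key before the reduce phase; sorting
    # whole lines approximates that and is adequate for word counts.
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    return subprocess.run(
        reducer_cmd, input=shuffled, capture_output=True, text=True, check=True
    ).stdout
```

A call such as run_locally(["python", "mapper.py"], ["python", "reducer.py"], open("zaratustra.txt").read()) should then return the same kind of output the job writes to part-00000.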
Show the output:
hadoop fs -cat /Users/jonas/hadoop-store/mapred/wordcount-output/part-00000
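The raw output is long, so a small script can help to pick out the most frequent words. This is a sketch that assumes each output line has the form word, tab, count, as a word-count reducer typically writes it. The function name top_words is made up for this example.

```python
def top_words(lines, n=10):
    """Return the n most frequent (word, count) pairs from job output."""
    counts = []
    for line in lines:
        # Split on the last tab so the count is always the final field.
        word, sep, count = line.rstrip("\n").rpartition("\t")
        if not sep or not count.isdigit():
            continue  # skip lines that are not 'word<TAB>count'
        counts.append((word, int(count)))
    # Highest count first; ties are broken alphabetically.
    counts.sort(key=lambda item: (-item[1], item[0]))
    return counts[:n]
```

Saved in a script that passes sys.stdin to top_words and prints the result, it can be fed by piping the hadoop fs -cat command above into it.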
Delete the output dir if you want to run the job again:
hadoop fs -rmr /Users/jonas/hadoop-store/mapred/wordcount-output