How to use Cascading with Hadoop Streaming
Last time we talked about how to use a raw MapReduce job in Cascading. Now we are going to up the ante by using Hadoop Streaming as a Flow in Cascading. In this example, we hook a Python streaming job into a Cascade.
It's pretty easy once you know how to do it:
- Include hadoop-*-streaming.jar with your Cascading job by putting it in your jar-file
- Use the -file, -cacheFile, or -cacheArchive options (see the Hadoop Streaming page for more details)
Resources
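As a rough sketch of how those streaming options fit together, the helper below assembles an argument list in Python. The function name, its parameters, and every path in the usage comment are illustrative assumptions. Only the option names (-input, -output, -mapper, -reducer, -file, -cacheArchive) come from Hadoop Streaming itself.

```python
def build_streaming_args(input_path, output_path, mapper,
                         files=(), cache_archives=(), reducer=None):
    """Return a Hadoop Streaming argument list (hypothetical helper)."""
    args = ["-input", input_path, "-output", output_path, "-mapper", mapper]
    if reducer is not None:
        args += ["-reducer", reducer]
    # -file ships a local file (such as the mapper script) with the job.
    for local_file in files:
        args += ["-file", local_file]
    # -cacheArchive points at an archive that is already in the cluster's
    # file system; the part after '#' is the link name seen by the task.
    for archive_uri in cache_archives:
        args += ["-cacheArchive", archive_uri]
    return args


# Example call (all paths are made up for illustration):
# build_streaming_args("streaming-in", "streaming-out", "mapper.py",
#                      files=["mapper.py", "nltkandyaml.mod"])
```

Keeping the arguments in a plain list makes it easy to print them and compare against what the Hadoop Streaming page documents before handing them to a job.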
NLTK
To generate the nltkandyaml.mod zip file do the following:
Note that this technique is taken from Cloudera.
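One way to build such an archive, sketched in Python with the standard zipfile module. This is not the Cloudera recipe itself, just an approximation of the idea: the function name is hypothetical, and it assumes the nltk and yaml package directories sit in the current directory.

```python
import os
import zipfile


def zip_packages(package_dirs, archive_path):
    """Zip each package directory, keeping the package name as the top folder."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for package_dir in package_dirs:
            package_dir = os.path.abspath(package_dir)
            parent = os.path.dirname(package_dir)
            for root, _dirs, files in os.walk(package_dir):
                for name in files:
                    full_path = os.path.join(root, name)
                    # Store as e.g. "nltk/corpus/__init__.py" so the archive
                    # root works as an import path.
                    archive.write(full_path, os.path.relpath(full_path, parent))
    return archive_path


# Assumed layout: ./nltk and ./yaml are the installed package directories.
# zip_packages(["nltk", "yaml"], "nltkandyaml.mod")
```

The archive is still an ordinary zip. The .mod extension is only a name, and zipimport does not care what the file is called.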
WordNet
The WordNet zip file needs to be flat, i.e. don't zip up the files inside a subdirectory. You could create this file like so:
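A minimal sketch of building a flat archive in Python, assuming the WordNet corpus files sit directly inside one local directory. The function name and the directory name in the usage comment are hypothetical.

```python
import os
import zipfile


def zip_flat(source_dir, archive_path):
    """Zip the regular files directly inside source_dir with no directory prefix."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(os.listdir(source_dir)):
            full_path = os.path.join(source_dir, name)
            if os.path.isfile(full_path):
                # arcname=name drops the directory part, which keeps the zip flat.
                archive.write(full_path, arcname=name)
    return archive_path


# Assumed layout: ./wordnet holds the corpus files. Write the archive
# outside that directory so it does not end up inside itself.
# zip_flat("wordnet", "wordnet-flat.zip")
```

Listing the archive afterwards should show bare file names with no slashes in them.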
Streaming Script
In Python, we'll be using zipimport.zipimporter to import the nltk libraries from a zip file. In Hadoop 0.20.0, Hadoop didn't decompress our wordnet-flat.zip file automatically (we've heard reports that some versions do, but I'm not sure which). For us the .zip file was placed in lib relative to the pwd of the script. This allowed us to keep the WordNet corpus as a zip and read it in that format.
(In this code we're not using the Python reducer.)
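To make the zipimport step concrete, here is a hedged sketch of a streaming mapper skeleton. The helper names import_from_zip and run_mapper are hypothetical, and the per-line transform is a placeholder rather than any real WordNet processing. Only zipimport.zipimporter and its standard loader methods come from the Python library.

```python
import importlib.util
import sys
import zipimport


def import_from_zip(zip_path, module_name):
    """Load a top-level module or package straight from a zip archive."""
    importer = zipimport.zipimporter(zip_path)
    if hasattr(importer, "find_spec"):  # Python 3.10 and later
        spec = importer.find_spec(module_name)
        if spec is None:
            raise ImportError("%s not found in %s" % (module_name, zip_path))
        module = importlib.util.module_from_spec(spec)
        sys.modules[module_name] = module
        spec.loader.exec_module(module)
        return module
    return importer.load_module(module_name)  # older interpreters


def run_mapper(lines, emit, transform):
    """Feed each input line to transform and emit tab-separated key/value pairs."""
    for line in lines:
        for key, value in transform(line.rstrip("\n")):
            emit("%s\t%s" % (key, value))


# In a streaming script this might look like (names assumed):
# nltk = import_from_zip("nltkandyaml.mod", "nltk")
# run_mapper(sys.stdin, print, my_transform)
```

Passing the line source and the emit function as arguments keeps the mapper loop testable without a Hadoop cluster.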
Cascading Code
Here’s the bulk of the code that will achieve the effect we want. Like last time, we’re using two intermediate taps as the input and output of the streaming job. Also, we’re just using TextLine files for simplicity. If you don’t want the intermediate files hanging around, look at the comments towards the bottom for some example code on how to remove the files when the job is finished running.