<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>X-Combinator &#187; hadoop</title>
	<atom:link href="http://www.xcombinator.com/category/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.xcombinator.com</link>
	<description>making the human scalable</description>
	<lastBuildDate>Thu, 09 Sep 2010 03:46:53 +0000</lastBuildDate>
	
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Cascading, TF-IDF, and BufferedSum (Part 1)</title>
		<link>http://www.xcombinator.com/2009/12/18/cascading-tf-idf-and-bufferedsum-part-1/</link>
		<comments>http://www.xcombinator.com/2009/12/18/cascading-tf-idf-and-bufferedsum-part-1/#comments</comments>
		<pubDate>Fri, 18 Dec 2009 18:08:17 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[cascading]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/?p=188</guid>
		<description><![CDATA[Introduction
A common technique in MapReduce is to input a group of records, calculate a value from that group, and emit each record with the new value attached. While this is easy to do in raw MR jobs, the solution in Cascading is not very obvious. This tutorial introduces a new operation to Cascading called BufferedSum. [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction</h2>
<p>A common technique in MapReduce is to input a group of records, calculate a value from that group, and emit each record with the new value attached. While this is easy to do in raw MR jobs, the solution in Cascading is not very obvious. This tutorial introduces a new operation to Cascading called <code>BufferedSum</code>. <code>BufferedSum</code> allows us to calculate values from a group of tuples and emit the group value to individual tuples in a scalable way.</p>
<p>The operation of <code>BufferedSum</code> is easiest to describe in concrete terms, so let&#8217;s work through an example.</p>
<h2>Example</h2>
<p>When dealing with large numbers of documents in Hadoop, it&#8217;s common for each input file to contain many documents. Our input file in this case will contain two documents:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">a.<span style="color: #006633;">txt</span>\thello world world
b.<span style="color: #006633;">txt</span>\tgoodbye goodbye world</pre></div></div>

<p>Let&#8217;s say we want to calculate <a href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a> for these documents. One of the first values we need is the number of times each term occurs within each document.</p>
<p>First, we will split each line into <code>(document_id, body)</code> pairs:  </p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">  pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span>pipe, 
      <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;line&quot;</span><span style="color: #009900;">&#41;</span>, 
      <span style="color: #000000; font-weight: bold;">new</span> RegexSplitter<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;body&quot;</span><span style="color: #009900;">&#41;</span>, <span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>From there we &#8220;tokenize&#8221; the document and extract each term:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">  pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span>pipe, <span style="color: #666666; font-style: italic;">// tokenize words by space</span>
      <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;body&quot;</span><span style="color: #009900;">&#41;</span>,
      <span style="color: #000000; font-weight: bold;">new</span> RegexSplitGenerator<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term&quot;</span><span style="color: #009900;">&#41;</span>, <span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\\</span>s+&quot;</span><span style="color: #009900;">&#41;</span>, 
      <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;term&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>Now our tuple stream is the following:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">a.<span style="color: #006633;">txt</span> hello
a.<span style="color: #006633;">txt</span> world
a.<span style="color: #006633;">txt</span> world
b.<span style="color: #006633;">txt</span> goodbye
b.<span style="color: #006633;">txt</span> goodbye
b.<span style="color: #006633;">txt</span> world</pre></div></div>
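<p>To make the two <code>Each</code> steps concrete, here is a minimal plain-Java sketch of the same transformation. It does not use Cascading at all; the class name <code>TokenizeSketch</code> and the method <code>toPairs</code> are hypothetical names chosen for illustration, and the sketch assumes every input line contains a tab.</p>

```java
final class TokenizeSketch {
    // Turns "document_id TAB body" lines into one "document_id term" line per term.
    // Assumption: every input line contains a tab separating the id from the body.
    static String toPairs(String[] lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            // First step: split the line into (document_id, body).
            String[] parts = line.split("\t", 2);
            // Second step: split the body on whitespace, one pair per term.
            for (String term : parts[1].trim().split("\\s+")) {
                out.append(parts[0]).append(' ').append(term).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] lines = {"a.txt\thello world world", "b.txt\tgoodbye goodbye world"};
        System.out.print(toPairs(lines));
    }
}
```

<p>Run on the two example documents, this prints the same six <code>(document_id, term)</code> pairs shown above.</p>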

<h2>Count of <code>term</code> in <code>document_id</code></h2>
<p>We now have <code>(document_id, term)</code> and we want to calculate <code>(document_id, term, term_count_in_document)</code>. With Cascading this is easy: group by <code>document_id</code> and <code>term</code>, then apply the <code>Count()</code> aggregator:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">  <span style="color: #666666; font-style: italic;">// count how many times `term` appears in `document_id`</span>
  pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> GroupBy<span style="color: #009900;">&#40;</span>pipe, <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;term&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Every<span style="color: #009900;">&#40;</span>pipe, 
      <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term&quot;</span><span style="color: #009900;">&#41;</span>, 
      <span style="color: #000000; font-weight: bold;">new</span> Count<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term_count_in_document&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>, 
      <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;term&quot;</span>, <span style="color: #0000ff;">&quot;term_count_in_document&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>
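<p>Conceptually, the grouping step sorts the tuples by key and the counting step counts each run of equal keys. The following plain-Java sketch illustrates that idea under simplifying assumptions: it is not Cascading code, each <code>(document_id, term)</code> key is represented as a single space-separated string, and <code>TermCountSketch</code> and <code>countPerKey</code> are hypothetical names.</p>

```java
import java.util.Arrays;

final class TermCountSketch {
    // Sorts the keys (the "group by") and counts each run of equal keys
    // (the "count"), returning one "key count" string per distinct key.
    static String[] countPerKey(String[] keys) {
        String[] sorted = keys.clone();
        Arrays.sort(sorted);
        String[] out = new String[sorted.length];
        int groups = 0;
        String current = null;
        int count = 0;
        for (String key : sorted) {
            if (!key.equals(current)) {
                // A new group starts: emit the finished one, if any.
                if (current != null) out[groups++] = current + " " + count;
                current = key;
                count = 0;
            }
            count++;
        }
        // Emit the last group.
        if (current != null) out[groups++] = current + " " + count;
        return Arrays.copyOf(out, groups);
    }

    public static void main(String[] args) {
        String[] keys = {
            "a.txt hello", "a.txt world", "a.txt world",
            "b.txt goodbye", "b.txt goodbye", "b.txt world"
        };
        for (String line : countPerKey(keys)) System.out.println(line);
    }
}
```

<p>For the example stream this yields <code>a.txt hello 1</code>, <code>a.txt world 2</code>, <code>b.txt goodbye 2</code> and <code>b.txt world 1</code>.</p>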

<h2>Calculating <code>total_terms_in_document</code></h2>
<p>So far, so good. Up to this point Cascading has provided everything we need. Next, however, we want the total number of terms in each document while keeping the tuples we have calculated so far. Put another way, we have an input of <code>(document_id, term, term_count_in_document)</code> and we want to emit <code>(document_id, term, term_count_in_document, total_terms_in_document)</code>.</p>
<p>Our first instinct might be to use <code>GroupBy()</code> and <code>Count()</code> like before. But there is a catch: <code>Every</code> operations emit the operator result with the <em>group tuple</em> (see the <a href="http://www.cascading.org/userguide/html/ch03s02.html#N20228">Each and Every Pipes</a> in the Cascading User Guide). </p>
<p>This means that if we group by <code>document_id</code> and <code>Sum()</code> the <code>term_count_in_document</code> values into <code>total_terms_in_document</code>, we will emit only <code>(document_id, total_terms_in_document)</code>. The number in <code>total_terms_in_document</code> will be accurate, but we lose our <code>term</code> and <code>term_count_in_document</code>.</p>
<p>If we try to keep the other fields by grouping on all three of them, <code>(document_id, term, term_count_in_document)</code>, then we&#8217;ve &#8220;over-grouped&#8221;: every &#8220;group&#8221; is a single tuple (the input tuple), so we never see the count of terms in the document as a whole. <code>BufferedSum</code> was created to solve this problem.</p>
<h2><code>BufferedSum</code></h2>
<p><code>BufferedSum</code> takes as its input three things:</p>
<ul>
<li>The name of the <code>Field</code> to output</li>
<li>The name of the <code>Field</code> to sum</li>
<li>The other <code>Fields</code> to &#8220;pull through&#8221; the operation</li>
</ul>
<p>Here is how we can use <code>BufferedSum</code> to achieve the desired effect:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">// input: (document_id, term, term_count_in_document)</span>
<span style="color: #666666; font-style: italic;">// emits: (document_id, term, term_count_in_document, total_terms_in_document) </span>
pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> GroupBy<span style="color: #009900;">&#40;</span>pipe, <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
pipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Every<span style="color: #009900;">&#40;</span>pipe, 
    <span style="color: #000000; font-weight: bold;">new</span> BufferedSum<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;total_terms_in_document&quot;</span><span style="color: #009900;">&#41;</span>, 
                    <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;term_count_in_document&quot;</span><span style="color: #009900;">&#41;</span>,
                    <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;document_id&quot;</span>, <span style="color: #0000ff;">&quot;term&quot;</span>, <span style="color: #0000ff;">&quot;term_count_in_document&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>, 
    Fields.<span style="color: #006633;">SWAP</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<blockquote>
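<p>Note: the output selector <code>Fields.SWAP</code> is critical. <code>BufferedSum</code> emits complete tuples (the pulled-through fields plus the sum), so its results must replace the incoming fields rather than be appended to them.</p>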
</blockquote>
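<p>The idea behind the operation is a two-pass walk over each group: buffer and sum the tuples, then re-emit every buffered tuple with the sum attached. Here is a minimal plain-Java sketch of that idea for a single group. It is an illustration only, not the Cascading implementation: <code>BufferedSumSketch</code> and <code>attachGroupSum</code> are hypothetical names, rows are plain string arrays, and the whole group is assumed to fit in memory.</p>

```java
final class BufferedSumSketch {
    // Each row is {document_id, term, term_count_in_document}; all rows are
    // assumed to belong to one group (the same document_id).
    static String[] attachGroupSum(String[][] group) {
        // Pass 1: sum the count column over the buffered group.
        double sum = 0.0;
        for (String[] row : group) {
            sum += Double.parseDouble(row[2]);
        }
        // Pass 2: emit every buffered row with the group sum appended.
        String[] out = new String[group.length];
        int i = 0;
        for (String[] row : group) {
            out[i++] = String.join(" ", row) + " " + sum;
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] groupA = {{"a.txt", "hello", "1"}, {"a.txt", "world", "2"}};
        for (String line : attachGroupSum(groupA)) System.out.println(line);
    }
}
```

<p>For the <code>a.txt</code> group this prints <code>a.txt hello 1 3.0</code> and <code>a.txt world 2 3.0</code>: every original tuple survives, and each now carries the document total.</p>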
<h2>Memory considerations</h2>
<p>When using <code>BufferedSum</code>, try to keep your groups small enough to fit in memory. This is not a hard requirement: <code>BufferedSum</code> uses Cascading&#8217;s <code>SpillableTupleList</code>, which spills to disk once it grows past its threshold (10,000 tuples in the code below). That said, spilling is an expensive operation and should be avoided if possible.</p>
<h2>Summary</h2>
<p><code>BufferedSum</code> is useful whenever every tuple in a group needs to carry a group-level sum in Cascading. In Part 2 we will use <code>BufferedSum</code> and Cascading to finish calculating tf-idf.</p>
<h2>The Code</h2>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">package</span> <span style="color: #006699;">com.xcombinator.cascading.operations.buffers</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.flow.FlowProcess</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.BaseOperation</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.Buffer</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.BufferCall</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.Fields</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.Tuple</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.TupleEntry</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.SpillableTupleList</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.Iterator</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #008000; font-style: italic; font-weight: bold;">/**
 * BufferedSum sums a value for every Tuple in a Group and emits every input
 * Tuple with the sum appended.
 * &lt;p/&gt;
 * 
 * 
 * EXAMPLE:
 *
 * {@code 
 *
 * // input: (document_id, term, term_count_in_document)
 * // emits: (document_id, term, term_count_in_document, total_terms_in_document) 
 *
 *     pipe = new GroupBy(pipe, new Fields(&quot;document_id&quot;));
 *     pipe = new Every(pipe, 
 *         new BufferedSum(new Fields(&quot;total_terms_in_document&quot;), 
 *                        new Fields(&quot;term_count_in_document&quot;),
 *                        new Fields(&quot;document_id&quot;, &quot;term&quot;, &quot;term_count_in_document&quot;)), 
 *         Fields.SWAP);
 * }
 *
 * @see BufferedSum
 * 
 */</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> BufferedSum <span style="color: #000000; font-weight: bold;">extends</span> BaseOperation <span style="color: #000000; font-weight: bold;">implements</span> Buffer
  <span style="color: #009900;">&#123;</span>
  <span style="color: #000000; font-weight: bold;">private</span> <span style="color: #003399;">Double</span> sum<span style="color: #339933;">;</span>
  <span style="color: #000000; font-weight: bold;">private</span> SpillableTupleList list<span style="color: #339933;">;</span>
  <span style="color: #000000; font-weight: bold;">private</span> Fields extrasSelector<span style="color: #339933;">;</span>
  <span style="color: #000000; font-weight: bold;">private</span> Fields fieldToSum<span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #008000; font-style: italic; font-weight: bold;">/**
   * Returns a BufferedSum Buffer Operation. 
   *
   * @param emittedSumFieldName a {@link Fields} naming the field to emit the sum value
   * @param fieldToSum          a {@link Fields} naming the field to sum
   * @param extrasSelector      a {@link Fields} naming the other fields to &quot;pull through&quot;. These fields *must* be of the same order and size as the input Tuple
   */</span>
  <span style="color: #000000; font-weight: bold;">public</span> BufferedSum<span style="color: #009900;">&#40;</span> Fields emittedSumFieldName, Fields fieldToSum, Fields extrasSelector <span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">super</span><span style="color: #009900;">&#40;</span> extrasSelector.<span style="color: #006633;">append</span><span style="color: #009900;">&#40;</span> emittedSumFieldName <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">extrasSelector</span> <span style="color: #339933;">=</span> extrasSelector<span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">fieldToSum</span> <span style="color: #339933;">=</span> fieldToSum<span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">void</span> operate<span style="color: #009900;">&#40;</span> FlowProcess flowProcess, BufferCall bufferCall <span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#123;</span>
    Iterator<span style="color: #339933;">&lt;</span>TupleEntry<span style="color: #339933;">&gt;</span> iterator <span style="color: #339933;">=</span> bufferCall.<span style="color: #006633;">getArgumentsIterator</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    sum <span style="color: #339933;">=</span> 0.0D<span style="color: #339933;">;</span>
    list <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> SpillableTupleList<span style="color: #009900;">&#40;</span> <span style="color: #cc66cc;">10000</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">while</span><span style="color: #009900;">&#40;</span> iterator.<span style="color: #006633;">hasNext</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#123;</span>
      TupleEntry arguments <span style="color: #339933;">=</span> iterator.<span style="color: #006633;">next</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// must be called</span>
      sum <span style="color: #339933;">+=</span> arguments.<span style="color: #006633;">getDouble</span><span style="color: #009900;">&#40;</span> <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">fieldToSum</span>.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      list.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span> arguments.<span style="color: #006633;">getTuple</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">for</span><span style="color: #009900;">&#40;</span> Tuple tuple <span style="color: #339933;">:</span> list <span style="color: #009900;">&#41;</span>
      <span style="color: #009900;">&#123;</span>
      bufferCall.<span style="color: #006633;">getOutputCollector</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span> tuple.<span style="color: #006633;">append</span><span style="color: #009900;">&#40;</span> <span style="color: #000000; font-weight: bold;">new</span> Tuple<span style="color: #009900;">&#40;</span> sum <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span></pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://www.xcombinator.com/2009/12/18/cascading-tf-idf-and-bufferedsum-part-1/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>How to use a raw MapReduce job in Cascading</title>
		<link>http://www.xcombinator.com/2009/11/11/how-to-use-a-raw-mapreduce-job-in-cascading/</link>
		<comments>http://www.xcombinator.com/2009/11/11/how-to-use-a-raw-mapreduce-job-in-cascading/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 22:31:18 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[cascading]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[java]]></category>

		<guid isPermaLink="false">http://xcombinator.local/?p=135</guid>
		<description><![CDATA[Cascading is a great abstraction over MapReduce.
However, sometimes you may have code for an existing MapReduce job or want to drop directly to Hadoop for efficiency. Even if you&#8217;re using raw MapReduce jobs, Cascading can still be useful in planning the overall data pipeline. 
The code below is an example of how to use a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cascading.org/">Cascading</a> is a great abstraction over MapReduce.</p>
<p>However, sometimes you may have code for an existing MapReduce job or want to drop directly to Hadoop for efficiency. Even if you&#8217;re using raw MapReduce jobs, Cascading can still be useful in planning the overall data pipeline. </p>
<p>The code below is an example of how to use a raw MapReduce job in a Cascade. The main thing to take away is that we are creating intermediate sinks and sources and relying on Cascading to schedule the flows in the correct order.</p>
<blockquote class="normal">
<p>NOTE: the code below depends on commit <a href="http://github.com/jashmenn/cascading/commit/f0dd84cd89da70c326e7285034e982c33d2d7388">f0dd84cd</a>, a patch to MapReduceFlow.java that lets you explicitly set the Taps for a MapReduceFlow. I&#8217;ve contacted Chris about integrating this into the trunk. </p>
<p>Also note this patch applies to the branch <code>wip-1.1</code> and later.</p>
</blockquote>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">package</span> <span style="color: #006699;">com.xcombinator.hadoopjobs.mapreducetest</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.cascade.*</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.flow.Flow</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.flow.FlowConnector</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.flow.MapReduceFlow</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.aggregator.Count</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.regex.*</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.pipe.*</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.scheme.*</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tap.*</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.tuple.Fields</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.Identity</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.Properties</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.conf.Configuration</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.conf.Configured</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.io.LongWritable</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.io.Text</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.FileInputFormat</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.FileOutputFormat</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.JobConf</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.TextInputFormat</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.TextOutputFormat</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.lib.IdentityMapper</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.lib.IdentityReducer</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.util.Tool</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.util.ToolRunner</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.log4j.Logger</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">cascading.operation.Debug</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.hadoop.mapred.KeyValueTextInputFormat</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #008000; font-style: italic; font-weight: bold;">/**
 * An example of using a raw MapReduce job in Cascading
 */</span>
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> Main <span style="color: #000000; font-weight: bold;">extends</span> Configured <span style="color: #000000; font-weight: bold;">implements</span> Tool
  <span style="color: #009900;">&#123;</span>
  <span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000000; font-weight: bold;">final</span> Logger LOG <span style="color: #339933;">=</span> Logger.<span style="color: #006633;">getLogger</span><span style="color: #009900;">&#40;</span> Main.<span style="color: #000000; font-weight: bold;">class</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">int</span> run<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> args<span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#123;</span>
    JobConf conf <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> JobConf<span style="color: #009900;">&#40;</span>getConf<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">getClass</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">Properties</span> properties <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">Properties</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    FlowConnector.<span style="color: #006633;">setApplicationJarClass</span><span style="color: #009900;">&#40;</span>properties, <span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">getClass</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    CascadeConnector cascadeConnector <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> CascadeConnector<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    FlowConnector flowConnector <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> FlowConnector<span style="color: #009900;">&#40;</span>properties<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #003399;">String</span> inputPath  <span style="color: #339933;">=</span> args<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">String</span> outputPath <span style="color: #339933;">=</span> args<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">String</span> intermediatePath1 <span style="color: #339933;">=</span> args<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;-mr-input&quot;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">String</span> intermediatePath2 <span style="color: #339933;">=</span> args<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot;-mr-output&quot;</span><span style="color: #339933;">;</span>
&nbsp;
    Scheme textLineScheme <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> TextLine<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    Tap sourceTap <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Hfs<span style="color: #009900;">&#40;</span>textLineScheme, inputPath<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    Tap intermediateTap1 <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Hfs<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> TextLine<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;line&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>, intermediatePath1<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    Tap intermediateTap2 <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Hfs<span style="color: #009900;">&#40;</span>textLineScheme, intermediatePath2<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    Tap sinkTap   <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Hfs<span style="color: #009900;">&#40;</span>textLineScheme, outputPath<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// create our first flow, sink to the intermediateTap</span>
    Pipe wsPipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;wordsplit&quot;</span>, 
        <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;line&quot;</span><span style="color: #009900;">&#41;</span>, 
        <span style="color: #000000; font-weight: bold;">new</span> RegexSplitGenerator<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;word&quot;</span><span style="color: #009900;">&#41;</span>, <span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\\</span>s+&quot;</span><span style="color: #009900;">&#41;</span>, 
        <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;word&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    Flow parsedLogFlow <span style="color: #339933;">=</span> flowConnector.<span style="color: #006633;">connect</span><span style="color: #009900;">&#40;</span>sourceTap, intermediateTap1, wsPipe<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// Create a pipe and set our mr job for it </span>
    Pipe importPipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Pipe<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;mr pipe&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    JobConf mrconf <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> JobConf<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    mrconf.<span style="color: #006633;">setJobName</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;custom mr&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    mrconf.<span style="color: #006633;">setOutputKeyClass</span><span style="color: #009900;">&#40;</span>LongWritable.<span style="color: #000000; font-weight: bold;">class</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    mrconf.<span style="color: #006633;">setOutputValueClass</span><span style="color: #009900;">&#40;</span>Text.<span style="color: #000000; font-weight: bold;">class</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// the IdentityMapper, in this case, will actually output the key, which is</span>
    <span style="color: #666666; font-style: italic;">// the byte offset of the line as a long. Not what we'd usually want, but</span>
    <span style="color: #666666; font-style: italic;">// we'll leave it in for now.</span>
    mrconf.<span style="color: #006633;">setMapperClass</span><span style="color: #009900;">&#40;</span>IdentityMapper.<span style="color: #000000; font-weight: bold;">class</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    mrconf.<span style="color: #006633;">setReducerClass</span><span style="color: #009900;">&#40;</span>IdentityReducer.<span style="color: #000000; font-weight: bold;">class</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// note that your input is straight text-lines. This means in a real mr job</span>
    <span style="color: #666666; font-style: italic;">// you'd most likely need to split the line by some convention</span>
    mrconf.<span style="color: #006633;">setInputFormat</span><span style="color: #009900;">&#40;</span>TextInputFormat.<span style="color: #000000; font-weight: bold;">class</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// NOTE: this is both here and in the MapReduceFlow below</span>
    FileInputFormat.<span style="color: #006633;">setInputPaths</span><span style="color: #009900;">&#40;</span>mrconf, intermediateTap1.<span style="color: #006633;">getPath</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>  
    FileOutputFormat.<span style="color: #006633;">setOutputPath</span><span style="color: #009900;">&#40;</span>mrconf, intermediateTap2.<span style="color: #006633;">getPath</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// likewise</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// create our second flow, this one is for the mrjob. Notice source and sink taps</span>
    Flow mrFlow <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> MapReduceFlow<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;mrflow&quot;</span>, 
      mrconf, intermediateTap1, intermediateTap2, <span style="color: #000066; font-weight: bold;">false</span>, <span style="color: #000066; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// create our third &quot;regular&quot; cascading pipe</span>
    Pipe countPipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Pipe<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;count&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// b/c our IdentityMapper is emitting the line's byte offset as the key,</span>
    <span style="color: #666666; font-style: italic;">// just strip that out. You wouldn't have to do this if you had a smarter</span>
    <span style="color: #666666; font-style: italic;">// Mapper class.</span>
    countPipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Each<span style="color: #009900;">&#40;</span>countPipe, 
        <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;line&quot;</span><span style="color: #009900;">&#41;</span>, 
        <span style="color: #000000; font-weight: bold;">new</span> RegexParser<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;word&quot;</span><span style="color: #009900;">&#41;</span>, <span style="color: #0000ff;">&quot;.*?<span style="color: #000099; font-weight: bold;">\\</span>t(.*)&quot;</span><span style="color: #009900;">&#41;</span>, 
        <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;word&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    countPipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> GroupBy<span style="color: #009900;">&#40;</span>countPipe, <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;word&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    countPipe <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Every<span style="color: #009900;">&#40;</span>countPipe, <span style="color: #000000; font-weight: bold;">new</span> Count<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, <span style="color: #000000; font-weight: bold;">new</span> Fields<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;count&quot;</span>, <span style="color: #0000ff;">&quot;word&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// create the flow for the last count pipe</span>
    Flow countFlow <span style="color: #339933;">=</span> flowConnector.<span style="color: #006633;">connect</span><span style="color: #009900;">&#40;</span>intermediateTap2, sinkTap, countPipe<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    Cascade cascade <span style="color: #339933;">=</span> cascadeConnector.<span style="color: #006633;">connect</span><span style="color: #009900;">&#40;</span>parsedLogFlow, mrFlow, countFlow<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    cascade.<span style="color: #006633;">complete</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// if you want to get rid of the intermediate files you </span>
    <span style="color: #666666; font-style: italic;">// could do something like the following here:</span>
    <span style="color: #666666; font-style: italic;">// Path tmp = tap.getPath();</span>
    <span style="color: #666666; font-style: italic;">// FileSystem fs = tmp.getFileSystem(conf);</span>
    <span style="color: #666666; font-style: italic;">// fs.delete(tmp, true);</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> main<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> args<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">Exception</span> 
  <span style="color: #009900;">&#123;</span>
    <span style="color: #000066; font-weight: bold;">int</span> res <span style="color: #339933;">=</span> ToolRunner.<span style="color: #006633;">run</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Configuration<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, <span style="color: #000000; font-weight: bold;">new</span> Main<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, args<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #003399;">System</span>.<span style="color: #006633;">exit</span><span style="color: #009900;">&#40;</span>res<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #009900;">&#125;</span></pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://www.xcombinator.com/2009/11/11/how-to-use-a-raw-mapreduce-job-in-cascading/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8220;Easily&#8221; setup a monitored Hadoop / Hive Cluster in EC2 with PoolParty</title>
		<link>http://www.xcombinator.com/2009/07/08/easily-setup-a-monitored-hadoop-hive-cluster-in-ec2-with-poolparty/</link>
		<comments>http://www.xcombinator.com/2009/07/08/easily-setup-a-monitored-hadoop-hive-cluster-in-ec2-with-poolparty/#comments</comments>
		<pubDate>Wed, 08 Jul 2009 14:13:34 +0000</pubDate>
		<dc:creator>Nate Murray</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[poolparty]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[scalability]]></category>

		<guid isPermaLink="false">http://www.xcombinator.com/2009/07/08/easily-setup-a-monitored-hadoop-hive-cluster-in-ec2-with-poolparty/</guid>
		<description><![CDATA[
Summary
Setting up a scalable Hadoop cluster isn&#8217;t easy, but PoolParty makes it easier
and manageable.
By the time we&#8217;re done with this tutorial you&#8217;ll have a Hadoop cluster consisting of one master node and two slaves.  The slaves are formatted with HDFS and process MapReduce jobs that are delegated to them from the master. 

The whole [...]]]></description>
			<content:encoded><![CDATA[
<h1>Summary</h1>
<p>Setting up a scalable Hadoop cluster isn&#8217;t easy, but PoolParty makes it easier and more manageable.</p>
<p>By the time we&#8217;re done with this tutorial you&#8217;ll have a Hadoop cluster consisting of one master node and two slaves. The slaves are formatted with HDFS and process MapReduce jobs that are delegated to them from the master.</p>
<p>The whole cluster is monitored by Ganglia.</p>
<p> <a href='http://www.xcombinator.com/wp-content/uploads/2009/07/picture-8.png' title='ganglia cluster monitoring'><img src='http://www.xcombinator.com/wp-content/uploads/2009/07/picture-8.thumbnail.png' alt='ganglia cluster monitoring' /></a></p>
<h1>Benefits of PoolParty</h1>
<p>The nodes in a Hadoop cluster are very interdependent. By that I mean that each node needs two or three configuration files that are based on the other nodes currently running in the cluster. As nodes join and leave the cluster, each of these files needs to be updated on every node. PoolParty handles this process for you more or less automatically. The benefit is that you don&#8217;t have to roll your own methods to do this every time you want to set up a cluster.</p>
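<p>To make the &#8220;files based on the other running nodes&#8221; idea concrete, here is a minimal Ruby sketch. It is not PoolParty code: <code>render_host_list</code> and <code>stale?</code> are hypothetical helpers and the hostnames are made up. It only shows why a change in membership forces a rewrite on every node.</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"># Illustrative sketch only: these helpers are hypothetical, not PoolParty's API.

# Render a one-hostname-per-line file body from the nodes currently running.
def render_host_list(hostnames)
  hostnames.uniq.sort.join("\n") + "\n"
end

# True when the copy held by a node no longer matches the cluster membership.
def stale?(current_file_body, hostnames)
  current_file_body != render_host_list(hostnames)
end

running = ["slave1.internal", "slave0.internal"]
on_disk = render_host_list(running)

# A new slave joins, so every node's copy of the list is now out of date.
running = running + ["slave2.internal"]
puts stale?(on_disk, running)
puts render_host_list(running)</pre></div></div>

<p>Sorting and de-duplicating the hostnames keeps the rendered file stable, so a node only rewrites it when the set of nodes really changed.</p>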
<p>In PoolParty, plugins are first-class citizens. This means you can write your own plugins and they are every bit as powerful as the resources that make up PoolParty core itself. This makes it easy to break up server functionality into <em>modules of code</em>. PoolParty, in a sense, gives you object-oriented server configurations. You can, for instance, take a Ganglia object, call a few methods, and PoolParty takes care of executing the required commands to deploy a configured Ganglia cluster.</p>
<h1>Architecture </h1>
<p>PoolParty is built around the notion of <em>pools</em> and <em>clouds</em>. A pool is simply a collection of clouds. A cloud is a homogeneous set of nodes, i.e. <strong>every node in a cloud is <em>configured</em> the same way</strong>. Obviously nodes in a cloud will have different sets of working data as they run, but the idea is that any node in a cloud could be substituted for any other node in that same cloud.</p>
<p>PoolParty itself is designed to be fully distributed and masterless. There is no required concept of &#8220;master&#8221; and &#8220;slave&#8221; in PoolParty itself. That said, many pieces of software, such as Hadoop, do have this concept and PoolParty can be configured to take advantage of that. </p>
<p>We&#8217;ll be setting up our pool as two clouds: <code>hadoop_master</code> and <code>hadoop_slave</code>. <code>hadoop_slave</code> will be a cloud (cluster) of nodes configured to be Hadoop slaves, and <code>hadoop_master</code> will be a cloud of masters. In our example we&#8217;re only going to use one node as the master, but you could relatively easily configure everything to have more than one.</p>
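<p>If it helps to picture the pool and its two clouds as data, here is a rough model in plain Ruby. <code>Cloud</code>, <code>Pool</code> and <code>role_for</code> are names I made up for illustration; they are not PoolParty classes, and the real definitions live in <code>clouds.rb</code>.</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"># Plain Ruby structs for illustration; these are not PoolParty classes.
Cloud = Struct.new(:name, :role, :node_count)
Pool  = Struct.new(:clouds)

pool = Pool.new([
  Cloud.new("hadoop_master", :master, 1),
  Cloud.new("hadoop_slave",  :slave,  2)
])

# Configuration is looked up per cloud, never per node, which is what makes
# the nodes inside one cloud interchangeable.
def role_for(pool, cloud_name)
  cloud = pool.clouds.find { |c| c.name == cloud_name }
  cloud ? cloud.role : nil
end

pool.clouds.each do |cloud|
  puts "#{cloud.name}: #{cloud.node_count} node(s) configured as #{role_for(pool, cloud.name)}"
end</pre></div></div>

<p>Nothing in the model refers to an individual node, which is the point: replacing one slave with another changes nothing at the cloud level.</p>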
<h1>Software involved</h1>
<ul>
<li><a href="http://hadoop.apache.org/core/">Hadoop</a> </li>
<li><a href="http://wiki.apache.org/hadoop/Hive">Hive</a></li>
<li><a href="http://ganglia.info/">Ganglia</a></li>
<li><a href="http://poolpartyrb.com">PoolParty</a></li>
</ul>
<h1>Prerequisites</h1>
<p>This tutorial assumes that:</p>
<ol>
<li><strong>You have the Amazon EC2 Java tools installed</strong>. See <a href="http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/index.html?StartCLI.html">EC2: Getting Started with the Command Line Tools</a></li>
<li><strong>You have the proper EC2 environment variables set up</strong>. See <a href="http://auser.github.com/poolparty/amazon_ec2_setup.html">Setting up EC2</a> on the PoolParty website. For instance, a typical PoolParty install would have these variables in <code>$HOME/.ec2/keys_and_secrets.sh</code>.</li>
<li><strong>You have PoolParty installed from source</strong>. In theory, you should be able to install the gem. However, <em>today</em> you should probably install from source. Make sure you have <code>git://github.com/auser/poolparty.git</code> checked out and then follow the &#8220;Installing&#8221; directions on <a href="http://wiki.github.com/auser/poolparty/installing">the PoolParty wiki</a>. You only need to complete the two sections <strong>Dependencies required to build gem locally</strong> and <strong>Instructions</strong>. This will install all the development dependency gems and make sure you have all of the submodules. <strong>NOTE:</strong> PoolParty deploys Ruby gem versions based on the versions on your <em>local</em> machine, so make sure you have the most recent versions of the required gems installed locally.</li>
<li><strong>You have the <a href="http://github.com/jashmenn/poolparty-examples/tree/master">jashmenn/poolparty-examples</a> repository</strong>. <code>git clone git://github.com/jashmenn/poolparty-examples.git /path/to/poolparty-examples</code> </li>
<li><strong>You have the <a href="http://github.com/jashmenn/poolparty-extensions/tree/master">jashmenn/poolparty-extensions</a> repository</strong>. Note that this directory must be a <em>sibling</em> directory to the <code>poolparty-examples</code> directory. <code>git clone git://github.com/jashmenn/poolparty-extensions.git /path/to/poolparty-extensions</code></li>
</ol>
<h1>EC2 Security</h1>
<p>Now that the code is in place, we need to deal with Amazon&#8217;s security. (See <a href="http://auser.github.com/poolparty/amazon.html">here</a> if you are unclear on how EC2 security works.)</p>
<h2>Setup Keypairs</h2>
<hr />
<p>Every cloud in PoolParty must have its own unique keypair. That&#8217;s important enough that it&#8217;s worth repeating: <em>every cloud in PoolParty must have its own unique keypair</em>.</p>
<p>So run the following commands:</p>

<div class="wp_syntax"><div class="code"><pre class="shell" style="font-family:monospace;">ec2-add-keypair cloud_hadoop_slave &gt; ~/.ssh/cloud_hadoop_slave
ec2-add-keypair cloud_hadoop_master &gt; ~/.ssh/cloud_hadoop_master
chmod 600 ~/.ssh/cloud_hadoop_*</pre></div></div>
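
<p>Since a shared keypair is an easy mistake to make, a quick sanity check can be sketched in Ruby. This is only an illustration: the <code>CLOUD_KEYPAIRS</code> hash mirrors the two keypair names created above, and <code>shared_keypairs</code> is a hypothetical helper, not something PoolParty provides.</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"># Illustrative check only; the hash mirrors the keypair names used in this
# tutorial and the helper name is hypothetical.
CLOUD_KEYPAIRS = {
  "hadoop_master" => "cloud_hadoop_master",
  "hadoop_slave"  => "cloud_hadoop_slave"
}

# Return the keypair names that are used by more than one cloud.
def shared_keypairs(cloud_keypairs)
  cloud_keypairs.values.group_by { |name| name }
                .select { |_name, uses| uses.length > 1 }
                .keys
end

duplicates = shared_keypairs(CLOUD_KEYPAIRS)
if duplicates.empty?
  puts "ok: every cloud has its own keypair"
else
  puts "shared keypairs: " + duplicates.join(", ")
end</pre></div></div>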

<h2>Security Groups</h2>
<hr />
<p>You&#8217;ll also want to create a security group for our <em>pool</em>.</p>

<div class="wp_syntax"><div class="code"><pre class="shell" style="font-family:monospace;">ec2-add-group hadoop_pool -d &quot;the pool of hadoop masters and slaves&quot;</pre></div></div>

<p><strong>NOTICE:</strong> Hadoop has a crazy number of ports that it requires. The ports below will <em>work</em> but may not be the most secure configuration. If you understand this better than I do, please recommend better settings. Otherwise proceed knowing that these ports are probably a little <em>too</em> open.</p>
<p>We also need to open a number of ports for this security group:</p>

<div class="wp_syntax"><div class="code"><pre class="shell" style="font-family:monospace;">ec2-authorize -p 22 hadoop_pool               # ssh
ec2-authorize -p 8642 hadoop_pool             # poolparty internal daemons
ec2-authorize -P icmp -t -1:-1 hadoop_pool    # if you want to ping (optional)
ec2-authorize -p 80 hadoop_pool               # apache

ec2-authorize -p 8649 -P udp hadoop_pool      # ganglia UDP
ec2-authorize hadoop_pool -o hadoop_pool -u xxxxxxxxxxxx # xxxxxxxxxxxx is your Amazon account ID. ugly but true
</pre></div></div>
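<p>If you would rather keep the TCP and UDP port rules above in one place, a small dry-run loop is one way to do it. This is only a sketch: <code>RULES</code> and <code>print_authorize_cmds</code> are hypothetical names, the list covers just the plain port rules (not the ICMP or group-to-group rules), and the function only prints the commands so you can review them before running anything.</p>

```shell
GROUP=hadoop_pool
# port:protocol pairs copied from the ec2-authorize commands above
RULES="22:tcp 8642:tcp 80:tcp 8649:udp"

# Print (do not run) one ec2-authorize command per rule for the given group.
print_authorize_cmds() {
  group=$1
  for rule in $RULES; do
    port=${rule%%:*}
    proto=${rule##*:}
    if [ "$proto" = "udp" ]; then
      echo "ec2-authorize -p $port -P udp $group"
    else
      echo "ec2-authorize -p $port $group"
    fi
  done
}

print_authorize_cmds "$GROUP"
```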

<h1>Start your cloud</h1>
<p><strong>NOTE:</strong> A number of configurations rely on the whole cloud being booted. This means that the first time you run <code>cloud-start</code> you may see a few shell errors. This is okay as long as they go away after subsequent configures. The idea is that all nodes need to be started before the whole configuration will work properly.</p>

<div class="wp_syntax"><div class="code"><pre class="shell" style="font-family:monospace;">cd /path/to/poolparty-examples/hadoop
cloud-list # sanity check, no instances should show up, no exceptions should be raised
cloud-start -vd</pre></div></div>

<p><em>Tons</em> of information will fly by. Be patient; this could take upwards of 15 minutes.</p>
<p>Everything done? Good. Now you&#8217;re going to need to configure a second time. Now that all the nodes are booted they can be configured to talk to each other properly.</p>

<div class="wp_syntax"><div class="code"><pre class="shell" style="font-family:monospace;">cloud-configure -vd</pre></div></div>

<p>Again, tons of output should fly by. Wait for it to finish.</p>
<p>Next, let&#8217;s actually run our Hadoop sample job. Open up <code>hadoop/clouds.rb</code> and find the lines that look like this:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">hadoop <span style="color:#9966CC; font-weight:bold;">do</span>
  configure_master
  prep_example_job
  <span style="color:#008000; font-style:italic;"># run_example_job</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

<p>Uncomment the <code>run_example_job</code> line and configure again. This time we only need to configure the master.</p>

<div class="wp_syntax"><div class="code"><pre class="shell" style="font-family:monospace;">cloud-configure -vd -c hadoop_master</pre></div></div>

<p>This <em>should</em> work, but there is a chance that HDFS won&#8217;t be started in time to load the sample job. If that happens, just configure one more time.<br />
You know it worked if you see output like the following (it won&#8217;t be at the bottom):</p>
<pre>[Fri, 26 Jun 2009 20:09:50 +0000] DEBUG: STDERR: 09/06/26 20:09:11 INFO input.FileInputFormat: Total input paths to process : 3
09/06/26 20:09:12 INFO mapred.JobClient: Running job: job_200906262006_0001
09/06/26 20:09:13 INFO mapred.JobClient:  map 0% reduce 0%
09/06/26 20:09:32 INFO mapred.JobClient:  map 66% reduce 0%
09/06/26 20:09:38 INFO mapred.JobClient:  map 100% reduce 0%
09/06/26 20:09:47 INFO mapred.JobClient:  map 100% reduce 100%
09/06/26 20:09:49 INFO mapred.JobClient: Job complete: job_200906262006_0001
09/06/26 20:09:49 INFO mapred.JobClient: Counters: 17
</pre>
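<p>Because the completion line is buried in a lot of output, it can be easier to save the output to a file and search it. The sketch below assumes you captured the configure output yourself (for example with <code>tee</code>); <code>job_completed</code> is a hypothetical helper, and the sample log is just two lines copied from the output above.</p>

```shell
# To capture real output, you could run something like:
#   cloud-configure -vd -c hadoop_master 2>&1 | tee configure.log
# and then call: job_completed configure.log

# Hypothetical helper: succeeds if the log contains Hadoop's job completion line.
job_completed() {
  grep -q 'mapred.JobClient: Job complete: job_' "$1"
}

# Sample lines copied from the output above, standing in for a captured log.
log=$(mktemp)
cat > "$log" <<'EOF'
09/06/26 20:09:47 INFO mapred.JobClient:  map 100% reduce 100%
09/06/26 20:09:49 INFO mapred.JobClient: Job complete: job_200906262006_0001
EOF

if job_completed "$log"; then
  echo "sample job finished"
else
  echo "no completion line found; configure the master once more"
fi
rm -f "$log"
```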
<p>Congratulations! You now have a scalable Hadoop cluster at your disposal!</p>
<h1>What to do when something goes wrong</h1>
<ul>
<li>Check out the <a href="http://auser.github.com/poolparty/community.html">PoolParty IRC channel</a>. We&#8217;re always around and ready to help in #poolpartyrb.</li>
</ul>
<p>This plugin was based on a number of helpful sites on the web. Check out the following links:</p>
<h2>Hadoop</h2>
<hr />
<ul>
<li><a href="http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)">Michael Noll&#8217;s Hadoop Tutorial</a></li>
</ul>
<h2>Hive</h2>
<hr />
<ul>
<li><a href="http://wiki.apache.org/hadoop/Hive">Apache&#8217;s Hive website</a></li>
</ul>
<h2>Ganglia</h2>
<hr />
<ul>
<li><a href="http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia">IBM&#8217;s Ganglia Tutorial</a></li>
</ul>
<h1>References</h1>
<ul>
<li><a href="http://auser.github.com/poolparty/docs/index.html">PoolParty Documentation</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.xcombinator.com/2009/07/08/easily-setup-a-monitored-hadoop-hive-cluster-in-ec2-with-poolparty/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

