incubator-general mailing list archives

From: Josh Wills
Subject: Re: [DISCUSS] Crunch joining the Apache Incubator
Date: Thu, 17 May 2012 04:57:19 GMT

Hey Priyank,

Replies inlined.

On Wed, May 16, 2012 at 9:25 PM, Priyank Rastogi wrote:
> Hi Josh,
> A couple of queries I have.
> 1) How is this different from Oozie or Cascading?

Re: Oozie, the one context in which Crunch and Oozie overlap is when a
developer is writing a series of MapReduce jobs in Java and wants to
tie the output of one job to the input of another. In Crunch,
you would express that dependency in Java code; in Oozie, you would
express it in XML. In my opinion, expressing those dependencies in
Java has some advantages for certain types of problems, such as the
iterative computations that are performed in systems like Mahout. It
is also nice to be able to construct a data pipeline within a single
language, without having to move back and forth between Java and XML.
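
To illustrate the point about iterative computations, here is a
minimal sketch in plain Java. It does not use the Crunch API: the
class IterativePipelineSketch and the methods halveStep,
runUntilConverged, and maxAbs are hypothetical stand-ins, with an
in-memory list playing the role of a job's input and output. The idea
it shows is only that the output of one pass feeds the next, and that
an ordinary loop decides when to stop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class IterativePipelineSketch {

    // Hypothetical stand-in for one job: reads a dataset, produces a new one.
    static List<Double> halveStep(List<Double> input) {
        return input.stream().map(x -> x / 2.0).collect(Collectors.toList());
    }

    // Largest absolute value in the dataset, used here as the convergence measure.
    static double maxAbs(List<Double> values) {
        return values.stream().mapToDouble(Math::abs).max().orElse(0.0);
    }

    // Each pass consumes the previous pass's output. The dependency between
    // passes and the stopping condition are both ordinary Java control flow.
    static int runUntilConverged(List<Double> initial, double tolerance, int maxPasses) {
        List<Double> current = initial;
        int passes = 0;
        while (passes < maxPasses && maxAbs(current) > tolerance) {
            current = halveStep(current);
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        int passes = runUntilConverged(Arrays.asList(8.0, 4.0), 1.0, 100);
        System.out.println("passes run: " + passes);
    }
}
```

A data-dependent loop like this has no fixed number of steps known in
advance, which is why it is more natural to write in code than in a
static job description.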

Oozie has cron-like job scheduling functionality, which Crunch does
not. It also supports chaining different types of MapReduce-based
systems together, like Pig jobs and Hive queries. Crunch provides a
library of common MapReduce design patterns, like counts,
aggregations, joins, and sorts, that Oozie does not. I think it would
be easy to see a case where a developer would write a Crunch pipeline
whose output was consumed by one or more Pig jobs, and Oozie was used
to schedule the execution of those jobs and specify their

Re: Cascading, Cascading has the same data model as Pig and Hive,
i.e., all of the operations they define are based on a single,
serializable type (SST), which is usually referred to as a "Tuple."
Crunch does not have a single, serializable type-- developers can work
with data in whatever format makes sense for the data they are
processing, such as time series, images, or serializable object
formats like Apache Avro, Apache Thrift, or JSON. The initial reason
that we developed Crunch at Cloudera was a customer project that
required large-scale data pipelines over time series, and we felt that
none of the existing pipeline languages, including Cascading, were
designed for this type of problem. Additionally, Cascading is not an
Apache project, either top-level or in the incubator.
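
The data-model contrast above can be sketched in plain Java. This is
an illustration under stated assumptions, not the actual API of
Cascading, Pig, Hive, or Crunch: a positional Object[] stands in for
the tuple model, and Reading is a hypothetical domain type for one
time-series observation.

```java
import java.util.Arrays;
import java.util.List;

public class TypedVersusTupleSketch {

    // Tuple-style access: fields are positional and untyped, so the caller casts.
    // A wrong index or field type is only detected at run time.
    static double sumTupleField(List<Object[]> tuples, int fieldIndex) {
        double total = 0.0;
        for (Object[] tuple : tuples) {
            total += (Double) tuple[fieldIndex];
        }
        return total;
    }

    // Hypothetical domain type for one time-series observation.
    static final class Reading {
        final long timestampMillis;
        final double value;

        Reading(long timestampMillis, double value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    // Typed access: the compiler checks the field name and its type.
    static double sumReadings(List<Reading> readings) {
        double total = 0.0;
        for (Reading reading : readings) {
            total += reading.value;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Object[]> tuples = Arrays.asList(new Object[] {1L, 2.5}, new Object[] {2L, 3.5});
        List<Reading> readings = Arrays.asList(new Reading(1L, 2.5), new Reading(2L, 3.5));
        System.out.println(sumTupleField(tuples, 1));
        System.out.println(sumReadings(readings));
    }
}
```

Both methods compute the same sum; the difference is whether the
record's shape is known to the compiler or only to the programmer.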

> 2) Are there any patents filed/planned for any part of work within Crunch?

Google has a patent related to FlumeJava, which is the basis for
Crunch's design:

Of course, Google has also patented MapReduce and GFS, the basis of
the core of Apache Hadoop.

Cloudera has not filed any patents on the work done on Crunch and has
no intention of doing so.

> -Priyank
> -----Original Message-----
> From: Josh Wills
> Sent: Thursday, May 17, 2012 8:44 AM
> To:
> Subject: [DISCUSS] Crunch joining the Apache Incubator
> Hi all,
> I would like to propose Crunch, a library for writing MapReduce
> pipelines in Java and Scala, as an Apache Incubator project. The
> proposal is here:
> We would gladly welcome additional volunteers to act as mentors on the
> project, so if this sounds like your cup of tea, please feel free to
> sign up or let us know.
> Thanks!
> Josh
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

Director of Data Science
Twitter: @josh_wills

