incubator-general mailing list archives

From "Franklin, Matthew B."
Subject RE: [VOTE] Accept Crunch into the Apache Incubator
Date Thu, 24 May 2012 11:52:18 GMT
+1 (binding)

>-----Original Message-----
>From: Josh Wills
>Sent: Wednesday, May 23, 2012 2:46 PM
>Subject: [VOTE] Accept Crunch into the Apache Incubator
>I would like to call a vote for accepting "Apache Crunch" for
>incubation in the Apache Incubator. The full proposal is available
>below.  We ask the Incubator PMC to sponsor it, with phunt as
>Champion, and phunt, tomwhite, and acmurthy volunteering to be
>Mentors.
>Please cast your vote:
>[ ] +1, bring Crunch into Incubator
>[ ] +0, I don't care either way,
>[ ] -1, do not bring Crunch into Incubator, because...
>This vote will be open for 72 hours and only votes from the Incubator
>PMC are binding.
>Proposal text from the wiki:
>= Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>== Abstract ==
>Crunch is a Java library for writing, testing, and running pipelines
>of !MapReduce jobs on Apache Hadoop.
>== Proposal ==
>Crunch is a Java library for writing, testing, and running pipelines
>of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
>high-level API for writing and testing complex !MapReduce jobs that
>require multiple processing stages.  It has a simple, flexible, and
>extensible data model that makes it ideal for processing data that
>does not naturally fit into a relational structure, such as time
>series and serialized object formats like JSON and Avro. It supports
>running pipelines either as a series of !MapReduce jobs on an Apache
>Hadoop cluster or in memory on a single machine for fast testing and
>debugging.
>== Background ==
>Crunch was initially developed by Cloudera to simplify the process of
>creating sequences of dependent !MapReduce jobs, especially jobs that
>processed non-relational data like time series. Its design was based
>on a paper Google published about a Java library they developed called
>!FlumeJava that was created in order to solve a similar class of
>problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
>2.0 licensed project in October 2011. Since then, Crunch has been
>formally released twice, as versions 0.1.0 (October 2011) and 0.2.0
>(February 2012), with an incremental update to version 0.2.1 (March
>2012). These releases are also distributed by Cloudera as source and
>binaries from Cloudera's Maven repository.
>== Rationale ==
>Most of the interesting analytical and data processing tasks that are
>run on an Apache Hadoop cluster require a series of !MapReduce jobs to
>be executed in sequence. Developers who are creating these pipelines
>today need to manually assign the sequence of tasks to perform in a
>dependent chain of !MapReduce jobs, even though there are a number of
>well-known patterns for fusing dependent computations together into a
>single !MapReduce stage and for performing common types of joins and
>aggregations. This results in !MapReduce pipelines that are more
>difficult to test, maintain, and extend to support new functionality.
>Furthermore, the type of data that is being stored and processed using
>Apache Hadoop is evolving. Although Hadoop was originally used for
>storing large volumes of structured text in the form of webpages and
>log files, it is now common for Hadoop to store complex, structured
>data formats such as JSON, Apache Avro, and Apache Thrift. These
>formats allow developers to work with serialized objects in
>programming languages like Java, C++, and Python, and allow for new
>types of analysis to be performed on complex data types. Hadoop has
>also been adopted by the scientific research community, who are using
>Hadoop to process time series data, structured binary files in the
>HDF5 format, and large medical and satellite images.
>Crunch addresses these challenges by providing a lightweight and
>extensible Java API for defining the stages of a data processing
>pipeline, which can then be run on an Apache Hadoop cluster as a
>sequence of dependent !MapReduce jobs, or in-memory on a single
>machine to facilitate fast testing and debugging. Crunch relies on a
>small set of primitive abstractions that represent immutable,
>distributed collections of objects. Developers define functions that
>are applied to those objects in order to generate new immutable,
>distributed collections of objects. Crunch also provides a library of
>common !MapReduce patterns for performing efficient joins and
>aggregation operations over these distributed collections that
>developers may integrate into their own pipelines. Crunch also
>provides native support for processing structured binary data formats
>like JSON, Apache Avro, and Apache Thrift, and is designed to be
>extensible to support working with any kind of data format that Java
>supports in its native form.
>== Initial Goals ==
>Crunch is currently in its first major release with a considerable
>number of enhancement requests, tasks, and issues recorded towards its
>future development. The initial goal of this project will be to
>continue to build community in the spirit of the "Apache Way", and to
>address the highly requested features and bug-fixes towards the next
>dot release.
>Some goals include:
> * To stand up a sustaining Apache-based community around the Crunch
>project.
> * Improved documentation of Java libraries and best practices.
> * Support the ability to "fuse" logically independent pipeline stages
>that aggregate the same data in different ways into a single
>!MapReduce job.
> * Performance, usability, and robustness improvements.
> * Improving diagnostic reporting and debugging for individual !MapReduce
>jobs.
> * Providing a centralized place for contributed extensions and
>domain-specific applications.
>= Current Status =
>== Meritocracy ==
>Crunch was initially developed by Josh Wills in September 2011 at
>Cloudera. Developers external to Cloudera provided feedback, suggested
>features and fixes and implemented extensions of Crunch. Cloudera's
>engineering team has since maintained the project with Josh Wills, Tom
>White, and Brock Noland dedicated towards its improvement.
>Contributors to Crunch include developers from multiple organizations,
>including businesses and universities.
>== Community ==
>Crunch is currently used by a number of organizations all over the
>world. Crunch has an active and growing user and developer community
>with active participation on its mailing lists.
>Since open sourcing the project, there have been eight individuals
>from five organizations who have contributed code.
>== Core Developers ==
>The core developers for Crunch are:
> * Brock Noland: Wrote many of the test cases, user documentation, and
>contributed several bug fixes.
> * Josh Wills: Josh wrote much of the original Crunch code.
> * Gabriel Reid: Gabriel significantly improved Crunch's handling of
>Avro data and has contributed several bug fixes for the core planner.
> * Tom White: Tom added several libraries for common !MapReduce
>pipeline operations, including the sort library and a library of set
>operations.
> * Christian Tzolov: Christian has contributed several bug fixes for
>the Avro serialization module and the unit testing framework.
> * Robert Chu: Robert did the left/right/outer join implementations
>for Crunch and fixed several bugs in the runtime configuration logic.
>Several of the core developers of Crunch have contributed towards
>Hadoop or related Apache projects and are familiar with Apache
>principles and philosophy for community driven software development.
>== Alignment ==
>Crunch complements several current Apache projects. It complements
>Hadoop !MapReduce by providing a higher-level API for developing
>complex data processing pipelines that require a sequence of
>!MapReduce jobs to perform. Crunch also supports Apache HBase in order
>to simplify the process of writing !MapReduce jobs that execute over
>HBase tables. Crunch makes extensive use of the Apache Avro data
>format as an internal data representation process that makes
>!MapReduce jobs execute quickly and efficiently.
>= Known Risks =
>== Orphaned Products ==
>Crunch is already deployed in production at multiple companies and
>they are actively participating in creating new features. Crunch is
>getting traction with developers and thus the risks of it being
>orphaned are minimal.
>== Inexperience with Open Source ==
>All code developed for Crunch has been open sourced by Cloudera under
>the Apache 2.0 license.  All committers to Crunch are intimately familiar
>with the Apache model for open-source development and are experienced
>with working with new contributors.
>== Homogeneous Developers ==
>The initial set of committers is from a small set of organizations.
>However, we expect that once approved for incubation, the project will
>attract new contributors from diverse organizations and will thus grow
>organically. The submission of patches from developers from several
>different organizations is a strong indication that Crunch will be
>widely adopted.
>== Reliance on Salaried Developers ==
>It is expected that Crunch will be developed on salaried and volunteer
>time, although all of the initial developers will work on it mainly on
>salaried time.
>== Relationships with Other Apache Products ==
>Crunch depends upon other Apache Projects: Apache Hadoop, Apache
>HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
>Commons components. Its build depends upon Apache Maven.
>Crunch's functionality has some indirect or direct overlap with the
>functionality of Apache Pig and Apache Hive but has several
>significant differences in terms of their user community and the types
>of data they are designed to work with.  Both Hive and Pig are
>high-level languages that are designed to allow non-programmers to
>quickly create and run !MapReduce jobs. Crunch is a Java library whose
>primary community is Java developers who are creating scalable data
>pipelines and !MapReduce-based applications. Additionally, Hive and
>Pig both employ a relational, tuple-oriented data model on top of
>HDFS, which introduces overhead and limits expressive power for
>developers who are working with serialized objects and non-relational
>data types. Crunch uses a lower-level data model that gives developers
>the freedom to work with data in a format that is optimized for the
>problem they are trying to solve.
>== An Excessive Fascination with the Apache Brand ==
>We would like Crunch to become an Apache project to further foster a
>healthy community of contributors and consumers around the project.
>Since Crunch directly interacts with many Apache Hadoop-related
>projects and solves an important problem of many Hadoop users,
>residing in the Apache Software Foundation will increase interaction
>with the larger community.
>= Documentation =
> * Crunch wiki at GitHub:
> * Crunch jira at Cloudera:
> * Crunch javadoc at GitHub:
>= Initial Source =
> *
>== Source and Intellectual Property Submission Plan ==
> * The initial source is already licensed under the Apache License,
>Version 2.0.
>== External Dependencies ==
>The required external dependencies all use the Apache License or
>compatible licenses. The following components have non-Apache licenses:
> * : New BSD
> * org.hamcrest: New BSD
> * org.slf4j: MIT-like License
>Non-Apache build tools that are used by Crunch are as follows:
> * Cobertura: GNU GPLv2
>Note that Cobertura is optional and is only used for calculating unit
>test coverage.
>== Cryptography ==
>Crunch uses standard APIs and tools for SSH and SSL communication
>where necessary.
>= Required Resources =
>== Mailing lists ==
> * crunch-private (with moderated subscriptions)
> * crunch-dev
> * crunch-commits
> * crunch-user
>== Github Repositories ==
>== Issue Tracking ==
>== Other Resources ==
>The existing code already has unit and integration tests so we would
>like a Jenkins instance to run them whenever a new patch is submitted.
>This can be added after project creation.
>= Initial Committers =
> * Brock Noland (brock at cloudera dot com)
> * Josh Wills (jwills at cloudera dot com)
> * Gabriel Reid (gabriel dot reid at gmail dot com)
> * Tom White (tom at cloudera dot com)
> * Christian Tzolov (christian dot tzolov at gmail dot com)
> * Robert Chu (robert at wibidata dot com)
> * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>= Affiliations =
> * Brock Noland, Cloudera
> * Josh Wills, Cloudera
> * Gabriel Reid, !TomTom
> * Tom White, Cloudera
> * Christian Tzolov, !TomTom
> * Robert Chu, !WibiData
> * Vinod Kumar Vavilapalli, Hortonworks
>= Sponsors =
>== Champion ==
> * Patrick Hunt
>== Nominated Mentors ==
> * Tom White
> * Patrick Hunt
> * Arun Murthy
>== Sponsoring Entity ==
> * Apache Incubator PMC
>To unsubscribe, e-mail:
>For additional commands, e-mail: