incubator-general mailing list archives

From Suresh Marru <>
Subject Re: [VOTE] Accept DataFu into the Incubator
Date Sat, 04 Jan 2014 09:41:11 GMT
+1 (binding).

On Dec 31, 2013, at 3:39 PM, Jakob Homan <> wrote:

> Incubator-
> Following the discussion earlier, I'm calling a vote to accept DataFu as a
> new Incubator project.
> The proposal draft is available at:
>, and is also included
> below.
> Vote is open for at least 96h and closes at the earliest on 4 Jan 13:00
> PDT.  I'm letting the vote run an extra day as we're in the holiday season.
> [ ] +1 accept DataFu in the Incubator
> [ ] +/-0
> [ ] -1 because...
> Here's my binding +1.
> -Jakob
> -------------------------------
> Abstract
> DataFu makes it easier to solve data problems using Hadoop and higher level
> languages based on it.
> Proposal
> DataFu provides a collection of Hadoop MapReduce jobs and functions in
> higher level languages based on it to perform data analysis. It provides
> functions for common statistics tasks (e.g. quantiles, sampling), PageRank,
> stream sessionization, and set and bag operations. DataFu also provides
> Hadoop jobs for incremental data processing in MapReduce.
> Background
> DataFu began two years ago as a set of UDFs developed internally at LinkedIn,
> coming from our desire to solve common problems with reusable components.
> Recognizing that the community could benefit from such a library, we added
> documentation, an extensive suite of unit tests, and open sourced the code.
> Since then there have been steady contributions to DataFu as we encountered
> common problems not yet solved by it. Others outside LinkedIn have
> contributed as well. More recently we recognized the challenges with
> efficient incremental processing of data in Hadoop and have contributed a
> set of Hadoop MapReduce jobs as a solution.
> DataFu began as a project at LinkedIn, but it has shown itself to be useful
> to other organizations and developers as well, since they have faced similar
> problems. We would like to share DataFu with the ASF and begin developing a
> community of developers and users within Apache.
> Rationale
> There is a strong need for well tested libraries that help developers solve
> common data problems in Hadoop and higher level languages such as Pig,
> Hive, Crunch, Scalding, etc.
> Current Status
> Meritocracy
> Our intent with this incubator proposal is to start building a diverse
> developer community around DataFu following the Apache meritocracy model.
> Since DataFu was initially open sourced in 2011, it has received
> contributions from both within and outside LinkedIn. We plan to continue
> support for new contributors and work with those who contribute
> significantly to the project to make them committers.
> Community
> DataFu has been building a community of developers for two years. It began
> with contributors from LinkedIn and has received contributions from
> developers at Cloudera since very early on. It has been included
> in Cloudera’s Hadoop Distribution and Apache Bigtop. We hope to extend our
> contributor base significantly and invite all those who are interested in
> solving large-scale data processing problems to participate.
> Core Developers
> DataFu has a strong base of developers at LinkedIn. Matthew Hayes initiated
> the project in 2011, and aside from continued contributions to DataFu has
> also contributed the sub-project Hourglass for incremental MapReduce
> processing. Separate from DataFu he has also open sourced the White
> Elephant project. Sam Shah contributed a significant portion of the
> original code and continues to contribute to the project. William Vaughan
> has been contributing regularly to DataFu for the past two years. Evion Kim
> has been contributing to DataFu for the past year. Xiangrui Meng recently
> contributed implementations of scalable sampling algorithms based on
> research from a paper he published. Chris Lloyd has provided some important
> bug fixes and unit tests. Mitul Tiwari has also contributed to DataFu.
> Mathieu Bastian has been developing MapReduce jobs that we hope to include
> in DataFu. In addition he also leads the open source Gephi project.
> Alignment
> The ASF is the natural choice to host the DataFu project as its goal of
> encouraging community-driven open-source projects fits with our vision for
> DataFu. Additionally, other projects DataFu integrates with, such as Apache
> Pig and Apache Hadoop, and in the future Apache Hive and Apache Crunch, are
> hosted by the ASF and we will benefit and provide benefit by close
> proximity to them.
> Known Risks
> Orphaned Products
> The core developers have been contributing to DataFu for the past two
> years. There is very little risk of DataFu being abandoned given its
> widespread use within LinkedIn.
> Inexperience with Open Source
> DataFu was started as an open source project in 2011 and has remained so
> for two years. Matt initiated the project, and additionally is the creator
> of the open source White Elephant project. He has also contributed patches
> to Apache Pig. Most recently he has released Hourglass as a sub-project of
> DataFu. Sam contributed much of the original code and continues to
> contribute to the project. Will has been contributing to DataFu since it
> was first open sourced. Evion has been contributing for the past year.
> Mathieu leads the open source Gephi project. Jakob has been actively
> involved with the ASF as a full-time Hadoop committer and PMC member.
> Homogeneous Developers
> The current core developers are all from LinkedIn. DataFu has also received
> contributions from other corporations such as Cloudera. Two of these
> developers are among the Initial Committers listed below. We hope to
> establish a developer community that includes contributors from several
> other corporations and we are actively encouraging new contributors via
> presentations and blog posts.
> Reliance on Salaried Developers
> The current core developers are salaried employees of LinkedIn; however,
> they are not paid specifically to work on DataFu. Contributions to DataFu
> arise from the developers solving problems they encounter in their various
> projects. The purpose of DataFu is to share these solutions so that others
> may benefit and build a community of developers striving to solve common
> problems together. Furthermore, once the project has a community built
> around it, we expect to get committers, developers and contributions from
> outside the current core developers.
> Relationships with Other Apache Products
> DataFu is deeply integrated with Apache products. It began as a library of
> user-defined functions for Apache Pig. It has grown to also include Hadoop
> jobs for incremental data processing and in the future will include code
> for other higher level languages built on top of Apache Hadoop.
> An Excessive Obsession with the Apache Brand
> While we respect the reputation of the Apache brand and have no doubts that
> it will attract contributors and users, our interest is primarily to give
> DataFu a solid home as an open source project following an established
> development model.
> Documentation
> Information on DataFu can be found at:
> Initial Source
> The initial source is available at:
> Source and Intellectual Property Submission Plan
>    The DataFu library source code, available on GitHub.
> External Dependencies
> The initial source has the following external dependencies that are either
> included in the final DataFu library or required in order to use it:
>    fastutil (Apache 2.0)
>    joda-time (Apache 2.0)
>    commons-math (Apache 2.0)
>    guava (Apache 2.0)
>    stream (Apache 2.0)
>    jsr-305 (BSD)
>    log4j (Apache 2.0)
>    json (The JSON License)
>    avro (Apache 2.0)
> In addition, the following external libraries are used either in building,
> developing, or testing the project:
>    pig (Apache 2.0)
>    hadoop (Apache 2.0)
>    jline (BSD)
>    antlr (BSD)
>    commons-io (Apache 2.0)
>    testng (Apache 2.0)
>    maven (Apache 2.0)
>    jsr-311 (CDDL-1.0)
>    slf4j (MIT)
>    eclipse (Eclipse Public License 1.0)
>    autojar (GPLv2)
>    jarjar (Apache 2.0)
> Cryptography
> DataFu has user-defined functions that use MD5 and SHA provided by Java’s
> Required Resources
> Mailing Lists
> DataFu-private for private PMC discussions (with moderated subscriptions)
> DataFu-dev
> DataFu-commits
> Subversion Directory
> Git is the preferred source control system: git://
> Issue Tracking
> JIRA DataFu (DataFu)
> Other Resources
> The existing code already has unit tests, so we would like a Hudson
> instance to run them whenever a new patch is submitted. This can be added
> after project creation.
> Initial Committers
>    Matthew Hayes
>    William Vaughan
>    Evion Kim
>    Sam Shah
>    Xiangrui Meng
>    Christopher Lloyd
>    Mathieu Bastian
>    Mitul Tiwari
>    Josh Wills
>    Jarek Jarcec Cecho
> Affiliations
>    Matthew Hayes (LinkedIn)
>    William Vaughan (LinkedIn)
>    Evion Kim (LinkedIn)
>    Sam Shah (LinkedIn)
>    Xiangrui Meng (LinkedIn)
>    Christopher Lloyd (LinkedIn)
>    Mathieu Bastian (LinkedIn)
>    Mitul Tiwari (LinkedIn)
>    Josh Wills (Cloudera)
>    Jarek Jarcec Cecho (Cloudera)
> Sponsors
> Champion
> Jakob Homan (Apache Member)
> Nominated Mentors
>    Ashutosh Chauhan <hashutosh at apache dot org>
>    Roman Shaposhnik <rvs at apache dot org>
>    Ted Dunning <tdunning at apache dot org>
> Sponsoring Entity
> We are requesting the Incubator to sponsor this project.
