incubator-general mailing list archives

From Avery Ching
Subject Re: [PROPOSAL] Proposing Giraph for the Apache Incubator
Date Sat, 16 Jul 2011 03:55:46 GMT

Offline, you and I have discussed potential future collaboration in our projects.  However,
there are significant differences in our approaches today.

* Hama has been focused on BSP computing.  It only recently (June 30 - about 16 days ago)
opened a JIRA for graph processing.  Giraph
has been focused on BSP-based graph processing from day one.
* Giraph runs entirely on the Hadoop infrastructure today.  It is meant to be used on shared
Hadoop clusters and integrated as part of Oozie pipelines.  Today, Hama uses its own infrastructure
and is pretty much a stand-alone system, using only HDFS.
* Giraph has focused on fault-tolerance and dynamic resource usage in a shared Hadoop cluster.
 These are infrastructure-specific challenges that have a lot of value for our users and we
will continue to focus and improve on this.

As we have discussed, things may change when next-gen Hadoop is released; however, that might
take some time.  And when it is released, it will take some time for it to be stable enough
to deploy to our installations.  I think it is productive for us to share ideas (as we have
been doing), but also useful to have separate projects as they are different enough now and
cater to a different set of users.

Even if these projects do overlap one day, the incubator proposal guidelines,
in the section 'Relationships with Other Apache Products', read:

"Apache allows different projects to have competing or overlapping goals. However, this should
mean friendly competition between codebases and cordial cooperation between communities.

It is not always obvious whether a candidate is a direct competitor to an existing project,
an indirect competitor (same problem space, different ecological niche) or are just peers
with some overlap. In the case of indirect competition, it is important that the abstract
describes accurately the niche. Direct competitors should expect to be asked to summarize
architectural differences and similarities to existing projects."

Based on that statement, I expect that if Giraph is accepted in the Apache Incubator, our
projects will hopefully be able to share ideas and grow together.



On Jul 15, 2011, at 4:29 PM, Edward J. Yoon wrote:

Just FYI,

My main concern is that the boundaries you described between 'Apache Hama'
and 'Giraph' could collapse in the near future.

* Someone already contributed a Pregel-like vertex API set on top of
Hama v0.2[1].
* Hama jobs will run on both Hama's own cluster and Hadoop nextGen.

Then the only difference is BSP-based computing vs. BSP-based *graph-only* computing.

Regarding this proposal, I'm +0.



On Sat, Jul 16, 2011 at 3:14 AM, Avery Ching wrote:

I would like to propose Giraph as an Apache Incubator project.  Giraph is a large-scale graph
processing infrastructure (inspired by Pregel) that runs entirely on Hadoop.  Giraph applications
and MapReduce jobs coexist on shared Hadoop instances and Giraph applications can be part
of Oozie workflows as a normal MapReduce job.

Here is a link to the proposal in our GitHub wiki:

The proposal is also inlined below:



= Giraph : Large-scale graph processing on Hadoop =

== Abstract ==

Giraph is a large-scale, fault-tolerant, Bulk Synchronous Parallel (BSP)-based graph processing
framework.

== Proposal ==

Graph processing platforms to run large-scale algorithms (such as PageRank, shared connections,
personalization-based popularity, etc.) have become quite popular.  Some recent examples include
Pregel and HaLoop.  For general-purpose big data computation, the MapReduce computation model
is widely adopted and the most deployed MapReduce infrastructure is Apache Hadoop.  We have
implemented a graph-processing framework that is launched as a typical Hadoop MapReduce job
to leverage existing Hadoop infrastructure, such as Amazon’s EC2.  Giraph builds upon the
graph-oriented nature of Pregel but adds fault-tolerance to the coordinator process
by using ZooKeeper as its centralized coordination service.  Additionally, Giraph will
include a library of generic graph algorithms.
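
For readers unfamiliar with the Pregel-style BSP model mentioned above, here is a minimal, hedged sketch of the idea.  It is a toy single-process simulation written for illustration only: it does not use the Giraph API, Hadoop, or ZooKeeper, and the class name BspMaxValueSketch and its run method are hypothetical.  Each vertex keeps a value, reads the messages sent to it in the previous superstep, and sends messages to its neighbours only when its value changes; the computation stops when no messages are in flight.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy simulation of Pregel-style BSP supersteps (hypothetical, not the Giraph API). */
public class BspMaxValueSketch {

    /**
     * Propagates the largest vertex value through each connected component.
     * edges.get(v) lists the neighbours of vertex v; initial[v] is its starting value.
     */
    public static int[] run(List<List<Integer>> edges, int[] initial) {
        int n = initial.length;
        int[] value = initial.clone();
        // Messages visible in the current superstep, keyed by target vertex.
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        boolean firstSuperstep = true;

        // One loop iteration is one superstep; stop when no messages are pending.
        while (firstSuperstep || !inbox.isEmpty()) {
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            for (int v = 0; v < n; v++) {
                // In the first superstep every vertex announces its own value.
                boolean changed = firstSuperstep;
                for (int msg : inbox.getOrDefault(v, Collections.emptyList())) {
                    if (msg > value[v]) {
                        value[v] = msg;
                        changed = true;
                    }
                }
                // A vertex sends messages only if its value changed.
                if (changed) {
                    for (int target : edges.get(v)) {
                        outbox.computeIfAbsent(target, k -> new ArrayList<>()).add(value[v]);
                    }
                }
            }
            // Barrier: messages sent in this superstep are delivered in the next one.
            inbox = outbox;
            firstSuperstep = false;
        }
        return value;
    }

    public static void main(String[] args) {
        // Chain 0 - 1 - 2 - 3 with starting values 3, 6, 2, 1.
        List<List<Integer>> edges = Arrays.asList(
                Arrays.asList(1), Arrays.asList(0, 2), Arrays.asList(1, 3), Arrays.asList(2));
        System.out.println(Arrays.toString(run(edges, new int[] {3, 6, 2, 1})));
        // prints [6, 6, 6, 6]
    }
}
```

In a real deployment the vertices would be partitioned across workers and the barrier would be coordinated across machines; the sketch only shows the per-vertex compute-and-message pattern.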

== Background ==

Giraph began development as a side project at Yahoo! at the end of 2010.  It
was made functional in a month, and we then started adding various features.  Development has
been focused on internal customers' needs until this point.

== Rationale ==

Web and online social graphs have been rapidly growing in size and scale during the past decade.
In 2008, Google estimated that the number of web pages reached over a trillion.  Online social
networking and email sites, including Yahoo!, Google, Microsoft, Facebook, LinkedIn, and Twitter,
have hundreds of millions of users and are expected to grow much more in the future.  Processing
these graphs plays a big role in delivering relevant and personalized information to users, such as
results from a search engine or news in an online social networking site.

== Initial Goals ==

At this point, most of the functionality has been implemented and we are looking to get more
adoption and contributions from users outside Yahoo!.   We want to ensure that performance
scales and that the code is robust and fault tolerant.

== Current Status ==

=== Meritocracy ===

Giraph was initially developed by Avery Ching and Christian Kunz beginning in December 2010
at Yahoo!.  There are other developers using Giraph at Yahoo! that are making suggestions
and adding code.  We are reaching out to other folks at social networking companies for additional
usage and development.

=== Community ===

Several groups who are interested in either joining our project or using our code have contacted
us.  We certainly believe that there is a lot of interest and are actively looking to improve
and expand the community.

=== Core Developers ===

Avery Ching: Wrote a majority of the code
Christian Kunz: Wrote most of the communication code and security integration with Hadoop

=== Alignment ===

Giraph uses several Apache projects as its underlying infrastructure (Hadoop and ZooKeeper).
It is also built with Apache Maven.

== Known Risks ==

=== Orphaned products ===

There are many social networking companies that would be interested in using this graph-processing
framework and we have already received interest from some of them.  Yahoo! is already using
this code in production and will certainly continue to use it in the future as well.

=== Inexperience with Open Source ===

While the initial developers have limited experience contributing to open-source projects,
Yahoo! as a company has a strong commitment to open-source and we have several advisors that
we can ask for help.

=== Homogenous Developers ===

At this time, the project is relatively young and the developers work at only two companies
(Yahoo! and Jybe).  However, given the interest we have seen in the project, we expect the
diversity to improve in the near future.

=== Reliance on Salaried Developers ===

Currently Giraph is being developed by a combination of salaried and volunteer time.  We expect
that other corporations will take an interest in this project and likely contribute with salaried
developers.  Some individuals will likely spend volunteer time on it as well.  It is still
early in the project and we are hoping for a lot of growth.

=== Relationships with Other Apache Products ===

Giraph depends on many Apache projects: Hadoop, ZooKeeper, Log4j, Commons, etc.  It is built
using Apache Maven.

Giraph has some overlapping functionality with Apache Hama.  However, there are some significant
differences.  Giraph focuses on graph-based bulk synchronous parallel (BSP) computing, while
Apache Hama is more for general-purpose BSP computing.  Giraph runs on the Hadoop infrastructure,
while Apache Hama uses its own computing framework.

=== An Excessive Fascination with the Apache Brand ===

The Apache brand is likely to help us find contributors; however, our interest in Apache
is primarily because the other projects that we depend on are also Apache projects and it
makes sense that all this software be available from the same place.

=== Documentation ===

Currently we have little documentation, but several examples.  We are working on improving
this situation.

=== Initial Source ===

The initial code comes from Yahoo!, where development began in December 2010.  It
is already available on GitHub at

=== Source and Intellectual Property Submission Plan ===

We intend the entire code base to be licensed under the Apache License, Version 2.0.

=== External Dependencies ===

All required dependencies have Apache-compatible licenses.  The components with
non-Apache licenses are listed below:
* JSON – Public Domain

=== Cryptography ===

Giraph depends on secure Hadoop that can optionally use Kerberos.

== Required Resources ==

=== Mailing lists ===

* giraph-private (with moderated subscriptions)
* giraph-dev
* giraph-commits
* giraph-users

=== Subversion Directory ===

=== Issue Tracking ===


=== Other Resources ===

Giraph has integration tests that can be run with the LocalJobRunner.  These same tests are also
designed to be run on a small (even single node) Hadoop cluster.  While not required at this
time, it would be nice if such a resource were available.

=== Initial Committers ===

Avery Ching, aching at yahoo-inc dot com
Christian Kunz, christian at jybe-inc dot com
Owen O’Malley, owen at hortonworks dot com

=== Affiliations ===

Avery Ching, Yahoo!
Christian Kunz, Jybe

== Sponsors ==

=== Champion ===

Owen O’Malley

=== Nominated Mentors ===

Owen O’Malley

=== Sponsoring Entity ===

Apache Incubator PMC

Best Regards, Edward J. Yoon
