incubator-general mailing list archives

From Benson Margulies
Subject Re: [PROPOSAL] Accumulo for the Apache Incubator
Date Fri, 02 Sep 2011 21:39:37 GMT
No votes yet, please, except as an informal expression of (un)enthusiasm.

Owen, you raise two questions.

On the subject of grants, please read the IP description in the
proposal again. You can't 'grant' rights to something that neither you
nor anyone else owns. The proposal offers both a preferred alternative
and a backstop.

On the subject of LGPL, I'll leave it to the authors to answer.

On Fri, Sep 2, 2011 at 5:17 PM, Todd Lipcon wrote:
> Non-binding +1. Regarding Owen's concern over licenses, if I recall
> correctly, those concerns would block graduation from the incubator,
> but not acceptance to it.
> I am also interested in being added as a committer to this proposal.
> As an HBase committer (but not speaking for the project as a whole) I
> think having cross-pollination between the codebases will be
> beneficial to everyone, so I'd like to be involved.
> Thanks
> -Todd
> On Fri, Sep 2, 2011 at 8:45 AM, Billie J Rinaldi wrote:
>> Greetings,
>> I would like to propose Accumulo to be an Apache Incubator project.  Accumulo is
a distributed key/value store that provides expressive cell-level access labels and a server-side
programming mechanism that can modify key/value pairs at various points in the data management
process.  It is based on Google's BigTable design and runs over Apache Hadoop and Zookeeper.
>> Here is a link to the proposal in the Incubator wiki:
>> I've also pasted the initial contents below.
>> Thanks,
>> Billie Rinaldi
>> = Accumulo Proposal =
>> == Abstract ==
>> Accumulo is a distributed key/value store that provides expressive, cell-level access labels.
>> == Proposal ==
>> Accumulo is a sorted, distributed key/value store based on Google's BigTable design.
 It is built on top of Apache Hadoop, Zookeeper, and Thrift.  It features a few novel improvements
on the BigTable design in the form of cell-level access labels and a server-side programming
mechanism that can modify key/value pairs at various points in the data management process.
>> == Background ==
>> Google published the design of BigTable in 2006.  Several other open source projects
have implemented aspects of this design including HBase, CloudStore, and Cassandra.  Accumulo
began its development in 2008.
>> == Rationale ==
>> There is a need for a flexible, high performance distributed key/value store that
provides expressive, fine-grained access labels.  The communities we expect to be most interested
in such a project are government, health care, and other industries where privacy is a concern.
 We have made much progress in developing this project over the past 3 years and believe
both the project and the interested communities would benefit from this work being openly
available and having open development.
>> == Current Status ==
>> === Meritocracy ===
>> We intend to strongly encourage the community to help with and contribute to the
code.  We will actively seek potential committers and help them become familiar with the codebase.
>> === Community ===
>> A strong government community has developed around Accumulo and training classes
have been ongoing for about a year.  Hundreds of developers use Accumulo.
>> === Core Developers ===
>> The developers are mainly employed by the National Security Agency, but we anticipate
interest developing among other companies.
>> === Alignment ===
>> Accumulo is built on top of Hadoop, Zookeeper, and Thrift.  It builds with Maven.
 Due to the strong relationship with these Apache projects, the incubator is a good match
for Accumulo.
>> == Known Risks ==
>> === Orphaned Products ===
>> There is only a small risk of being orphaned.  The community is committed to improving
the codebase of the project due to its fulfilling needs not addressed by any other software.
>> === Inexperience with Open Source ===
>> The codebase has been treated internally as an open source project since its beginning,
and the initial Apache committers have been involved with the code for multiple years.  While
our experience with public open source is limited, we do not anticipate difficulty in operating
under Apache's development process.
>> === Homogeneous Developers ===
>> The committers have multiple employers and it is expected that committers from different
companies will be recruited.
>> === Reliance on Salaried Developers ===
>> The initial committers are all paid by their employers to work on Accumulo and we
expect such employment to continue.  Some of the initial committers would continue as volunteers
even if no longer employed to do so.
>> === Relationships with Other Apache Products ===
>> Accumulo uses Hadoop, Zookeeper, Thrift, Maven, log4j, commons-lang, -net, -io, -jci,
-collections, -configuration, -logging, and -codec.
>> === Relationship to HBase ===
>> Accumulo and HBase are both based on the design of Google's BigTable, so there is
a danger that potential users will have difficulty distinguishing the two or that they will
not see an incentive in adopting Accumulo.  There are a few key areas in which Accumulo differs
from HBase.  Some of the desired features of Accumulo could be incorporated into HBase; however,
the most important of these may be unlikely to be adopted (see cell-level access labels and
iterators below).  It is a possibility that the codebases will ultimately converge, but the
number of differences at the current time warrants a separate project for Accumulo.
>> ==== Access Labels ====
>> Accumulo has an additional portion of its key that sorts after the column qualifier
and before the timestamp.  It is called column visibility and enables expressive cell-level
access control.  Authorizations are passed with each query to control what data is returned
to the user.  The column visibilities are boolean AND and OR combinations of arbitrary strings
(such as "(A&B)|C") and authorizations are sets of strings (such as {C,D}).
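To make the access-label rule above concrete, here is a minimal Python sketch of checking a column visibility against a set of authorizations. It is illustrative only and is not Accumulo's code: the function names `tokenize` and `is_visible` are hypothetical, and the rule that `&` and `|` may only be mixed with parentheses is an assumption of this sketch, not something the proposal states.

```python
OPERATORS = ("&", "|")
SYMBOLS = ("&", "|", "(", ")")


def tokenize(expr):
    """Split a visibility expression into label strings and the symbols & | ( )."""
    tokens, label = [], []
    for ch in expr:
        if ch in SYMBOLS:
            if label:
                tokens.append("".join(label))
                label = []
            tokens.append(ch)
        elif not ch.isspace():
            label.append(ch)
    if label:
        tokens.append("".join(label))
    return tokens


def is_visible(expr, authorizations):
    """Return True if the authorization set satisfies the visibility expression."""
    tokens = tokenize(expr)
    pos = 0

    def parse_term():
        # A term is either a parenthesized sub-expression or a single label.
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if tok in SYMBOLS:
            raise ValueError(f"unexpected token {tok!r}")
        # A label is satisfied when it appears in the authorization set.
        return tok in authorizations

    def parse_expr():
        # A chain of terms joined by one operator kind.
        # Assumption of this sketch: mixing & and | requires parentheses.
        nonlocal pos
        value = parse_term()
        op = None
        while pos < len(tokens) and tokens[pos] in OPERATORS:
            if op is not None and tokens[pos] != op:
                raise ValueError("mixing & and | requires parentheses")
            op = tokens[pos]
            pos += 1
            rhs = parse_term()
            value = (value and rhs) if op == "&" else (value or rhs)
        return value

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result
```

With the example from the proposal, `is_visible("(A&B)|C", {"C", "D"})` evaluates to true because C alone satisfies the expression, while the authorizations {A, D} would not.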
>> ==== Iterators ====
>> Accumulo has a novel server-side programming mechanism that can modify the data written
to disk or returned to the user.  This mechanism can be configured for any of the scopes
where data is read from or written to disk.  It can be used to perform joins on data within
a single tablet.
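As a rough illustration of the idea of stacking server-side transformations over sorted key/value data, here is a small Python sketch. It is not Accumulo's iterator API: `summing_iterator`, `filtering_iterator`, and `apply_stack` are hypothetical names, and summing and filtering are only assumed examples of the kind of modification such a mechanism could perform.

```python
from itertools import groupby
from operator import itemgetter


def summing_iterator(source):
    """Collapse consecutive entries that share a key into one entry holding their sum.

    Relies on the source being sorted by key, so equal keys are adjacent.
    """
    for key, group in groupby(source, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def filtering_iterator(source, predicate):
    """Pass through only the entries for which predicate(key, value) is true."""
    for key, value in source:
        if predicate(key, value):
            yield key, value


def apply_stack(source, stack):
    """Wrap a sorted (key, value) stream with each transformation in order."""
    for wrap in stack:
        source = wrap(source)
    return source


# Hypothetical data: keys are (row, column) pairs, already in sorted order.
entries = [
    (("r1", "clicks"), 2),
    (("r1", "clicks"), 3),
    (("r2", "clicks"), 1),
]
stack = [
    summing_iterator,
    lambda src: filtering_iterator(src, lambda key, value: value >= 2),
]
result = list(apply_stack(iter(entries), stack))
```

Because each stage only consumes a sorted stream and yields a sorted stream, the same stack could in principle be applied wherever data is read or written, which is the property the paragraph above describes.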
>> ==== Flexibility ====
>> HBase requires the user to specify the set of column families to be used up front.
 Accumulo places no restrictions on the column families.  Also, each column family in HBase
is stored separately on disk.  Accumulo allows column families to be grouped together on
disk, as does BigTable.  This enables users to configure how their data is stored, potentially
providing improvements in compression and lookup speeds.  It gives Accumulo a row/column
hybrid nature, while HBase is currently column-oriented.
>> ==== Testing ====
>> Accumulo has testing frameworks that have resulted in its achieving a high level
of correctness and performance.  We have observed that under some configurations and conditions
Accumulo will outperform HBase and provide greater data integrity.
>> ==== Logging ====
>> HBase uses a write-ahead log on the Hadoop Distributed File System.  Accumulo has
its own logging service that does not depend on communication with the HDFS NameNode.
>> ==== Storage ====
>> Accumulo has a relative key file format that improves compression.
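The proposal does not describe the file format, so the following Python sketch only illustrates the general idea of storing each key relative to the previous one in sorted order. It assumes simple shared-prefix encoding of string keys, which may differ from what Accumulo actually does; `encode_relative` and `decode_relative` are hypothetical names.

```python
import os


def encode_relative(sorted_keys):
    """Encode each key as (length of prefix shared with the previous key, remaining suffix)."""
    encoded, previous = [], ""
    for key in sorted_keys:
        # os.path.commonprefix compares character by character, not by path component.
        shared = len(os.path.commonprefix([previous, key]))
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def decode_relative(encoded):
    """Rebuild the original keys from (shared length, suffix) pairs."""
    keys, previous = [], ""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


# Hypothetical keys; adjacent sorted keys tend to share long prefixes.
keys = ["row1:colA", "row1:colB", "row2:colA"]
encoded = encode_relative(keys)
```

Sorted keys usually repeat most of their leading bytes, so storing only the differing suffix leaves less data for a general-purpose compressor to handle.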
>> ==== Areas in which HBase features improvements over Accumulo ====
>> In-memory tables, upserts, coprocessors, and connections to other projects such as Cascading
and Pig.
>> === Expectations ===
>> There is a risk that Accumulo will be criticized for not providing adequate security.
 The access labels in Accumulo do not in themselves provide a complete security solution,
but are a mechanism for labeling each piece of data with the authorizations that are necessary
to see it.
>> === Apache Brand ===
>> Our interest in releasing this code as an Apache incubator project is due to its
strong relationship with other Apache projects, i.e. Hadoop, Zookeeper, and HBase.
>> == Documentation ==
>> There is not currently documentation about Accumulo on the web, but a fair amount
of documentation and training materials exists and will be provided on the Accumulo wiki.  Also, a paper discussing YCSB results for Accumulo will be presented at the
2011 Symposium on Cloud Computing.
>> == Initial Source ==
>> Accumulo has been in development since spring 2008.  There are hundreds of developers
using it and tens of developers have contributed to it.  The core codebase consists of 200,000
lines of code (mainly Java) and 100s of pages of documentation.  There are also a few projects
built on top of Accumulo that may be added to its contrib in the future.  These include support
for Hive, Matlab, YCSB, and graph processing.
>> == Source and Intellectual Property Submission Plan ==
>> Accumulo core code, examples, documentation, and training materials will be submitted
by the National Security Agency.
>> We will also be soliciting contributions of further plugins from MIT Lincoln Labs,
Carnegie Mellon University, and others.
>> Accumulo has been developed by a mix of government employees and private companies
under government contract.  Material developed by government employees is in the public domain
and no U.S. copyright exists in works of the federal government.  For the contractor developed
material in the initial submission, the U.S. Government has sufficient authority per the ICLA
from the copyright owner to contribute the Accumulo code to the incubator.
>> There has been some discussion regarding accepting contributions from US Government
sources on [LEGAL-93]. We propose that the
NSA will sign an ICLA/CCLA if that document could be slightly modified to explicitly address
copyright in works of government employees. Specifically, we propose that the definition of
“You” be modified to include “the copyright owner, the owner of a Contribution not subject
to copyright, or legal entity authorized by the copyright owner that is making this Agreement.”
In addition, we propose that section 2, the copyright license grant, be modified after “You hereby grant”
to state either “to the extent authorized by law” or “to the extent copyright exists
in the Contribution.”  These changes will permit work developed by US Government employees
to be included.
>> One proposed solution is to form a Collaborative Research and Development Agreement
(CRADA) between the Apache Software Foundation and the US Government, but this will not solve
the underlying problem that U.S. law does not grant copyright to works of government employees.
 At this time a CRADA is not necessary, but should one be determined to be necessary,
we would like to work through that process during the incubation phase of Accumulo rather
than before acceptance, as entering into such an agreement may take time.
>> == External Dependencies ==
>> jetty (Apache and EPL), jline (BSD), jfreechart (LGPL), jcommon (LGPL), slf4j (MIT),
junit (CPL)
>> == Cryptography ==
>> none
>> == Required Resources ==
>>  * Mailing Lists
>>   * accumulo-private
>>   * accumulo-dev
>>   * accumulo-commits
>>   * accumulo-user
>>  * Subversion Directory
>>   *
>>  * Issue Tracking
>>   * JIRA Accumulo (ACCUMULO)
>>  * Continuous Integration
>>   * Jenkins builds on
>>  * Web
>>   *
>>   * wiki at or
>> == Initial Committers ==
>>  * Aaron Cordova (aaron at cordovas dot org)
>>  * Adam Fuchs (adam.p.fuchs at ugov dot gov)
>>  * Eric Newton (ecn at swcomplete dot com)
>>  * Billie Rinaldi (billie.j.rinaldi at ugov dot gov)
>>  * Keith Turner (keith.turner at ptech-llc dot com)
>>  * John Vines (john.w.vines at ugov dot gov)
>>  * Chris Waring (christopher.a.waring at ugov dot gov)
>> == Affiliations ==
>>  * Aaron Cordova, The Interllective
>>  * Adam Fuchs, National Security Agency
>>  * Eric Newton, SW Complete Incorporated
>>  * Billie Rinaldi, National Security Agency
>>  * Keith Turner, Peterson Technology LLC
>>  * John Vines, National Security Agency
>>  * Chris Waring, National Security Agency
>> == Sponsors ==
>>  * Champion: Doug Cutting
>>  * Nominated Mentors: Benson Margulies, ?, ?
>>  * Sponsoring Entity: Apache Incubator
> --
> Todd Lipcon
> Software Engineer, Cloudera
