incubator-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hadrian Zbarcea <>
Subject Re: [DISCUSS] [PROPOSAL] Zeppelin for Apache Incubator
Date Fri, 19 Dec 2014 05:09:27 GMT


On 12/18/2014 11:54 PM, Konstantin Boudnik wrote:
> And again - big +1: I think the whole data stack will benefit from it.
> On Sat, Dec 13, 2014 at 05:18PM, Roman Shaposhnik wrote:
>> Hi,
>> I would like to propose Zeppelin as an Apache Incubator
>> project:
>> Please let me know what do you think and feel free to
>> volunteer as additional mentors for the project.
>> The easiest way to get to see what this project looks like
>> in action would be this demo:
>> Thanks,
>> Roman.
>> == Abstract ==
>> Zeppelin is a collaborative data analytics and visualization tool for
>> distributed, general-purpose data processing systems such as Apache
>> Spark, Apache Flink, etc.
>> == Proposal ==
>> Zeppelin is a modern web-based tool for the data scientists to
>> collaborate over large-scale data exploration and visualization
>> projects. It is a notebook style interpreter that enable collaborative
>> analysis sessions sharing between users. Zeppelin is independent of
>> the execution framework itself. Current version runs on top of Apache
>> Spark but it has pluggable interpreter APIs to support other data
>> processing systems. More execution frameworks could be added at a
>> later date i.e Apache Flink, Crunch as well as SQL-like backends such
>> as Hive, Tajo, MRQL.
>> We have a strong preference for the project to be called Zeppelin. In
>> case that may not be feasible, alternative names could be: “Mir”,
>> “Yuga” or “Sora”.
>> == Background ==
>> Large scale data analysis workflow includes multiple steps like data
>> acquisition, pre-processing, visualization, etc and may include
>> inter-operation of multiple different tools and technologies. With the
>> widespread of the open source general-purpose data processing systems
>> like Spark there is a lack of open source, modern user-friendly tools
>> that combine strengths of interpreted language for data analysis with
>> new in-browser visualization libraries and collaborative capabilities.
>> Zeppelin initially started as a GUI tool for diverse set of
>> SQL-over-Hadoop systems like Hive, Presto, Shark, etc. It was open
>> source since its inception in Sep 2013. Later, it became clear that
>> there was a need for a greater web-based tool for data scientists to
>> collaborate on data exploration over the large-scale projects, not
>> limited to SQL. So Zeppelin integrated full support of Apache Spark
>> while adding a collaborative environment with the ability to run and
>> share interpreter sessions in-browser
>> == Rationale ==
>> There are no open source alternatives for a collaborative
>> notebook-based interpreter with support of multiple distributed data
>> processing systems.
>> As a number of companies adopting and contributing back to Zeppelin is
>> growing, we think that having a long-term home at Apache foundation
>> would be a great fit for the project ensuring that processes and
>> procedures are in place to keep project and community “healthy” and
>> free of any commercial, political or legal faults.
>> == Initial Goals ==
>> The initial goals will be to move the existing codebase to Apache and
>> integrate with the Apache development process. This includes moving
>> all infrastructure that we currently maintain, such as: a website, a
>> mailing list, an issues tracker and a Jenkins CI, as mentioned in
>> “Required Resources” section of current proposal.
>> Once this is accomplished, we plan for incremental development and
>> releases that follow the Apache guidelines.
>> To increase adoption the major goal for the project would be to
>> provide integration with as much projects from Apache data ecosystem
>> as possible, including new interpreters for Apache Hive, Apache Drill
>> and adding Zeppelin distribution to Apache Bigtop.
>> On the community building side the main goal is to attract a diverse
>> set of contributors by promoting Zeppelin to wide variety of
>> engineers, starting a Zeppelin user groups around the globe and by
>> engaging with other existing Apache projects communities online.
>> == Current Status ==
>> Currently, Zeppelin has 4 released versions and is used in production
>> at a number of companies across the globe mentioned in Affiliation
>> section. Current implementation status is pre-release with public API
>> not being finalized yet. Current main and default backend processing
>> engine is Apache Spark with consistent support of SparkSQL.
>> Zeppelin is distributed as a binary package which includes an embedded
>> webserver, application itself, a set of libraries and startup/shutdown
>> scripts. No platform-specific installation packages are provided yet
>> but it is something we are looking to provide as part of Apache Bigtop
>> integration.
>> Project codebase is currently hosted at, which will form
>> the basis of the Apache git repository.
>> === Meritocracy ===
>> Zeppelin is an open source project that already leverages meritocracy
>> principles.  It was started by a handfull of people and now it has
>> multiple contributors, although as the number of contribution grows we
>> want to build a diverse developer and user community that is governed
>> by the "Apache way". Users and new contributors will be treated with
>> respect and welcomed; they will earn merit in the project by tendering
>> quality patches and support that move the project forward. Those with
>> a proven support and quality patch track record will be encouraged to
>> become committers.
>> === Community ===
>> Zeppelin already has a burgeoning community of users spread across the
>> world that leverage and contributes to the code base and mailing list.
>> We hope that being part of Apache Foundation will help to grow it more
>> and convert some of the users into active contributors to the project.
>> === Core Developers ===
>> The core developers of Zeppelin are listed in our contributors and
>> initial PPMC below. It is a diverse group of people from two
>> companies, NFLabs and Between, as mentioned in Affiliations section
>> including at least one Apache committer and PPMC member, Lee Moon Soo,
>> of Apache MRQL project.
>> === Alignment ===
>> Zeppelin is already integrated with Apache Spark. Integration with
>> Apache Tajo and Apache MRQL is something that has been currently
>> worked on. Apache Flink is a potential next integration step. We also
>> plan to add a binary distribution of Zeppelin to Apache Bigtop to
>> align it with whole ASF Hadoop data stack.
>> == Known Risks ==
>> We feel that for Zeppelin to become as successful as it can be, it
>> needs to be picked up by as many back-end systems as possible, not
>> only Apache Spark.
>> === Orphaned Products ===
>> Initial code contributors were from the same company but in last few
>> months we see signs of the global adoption, at least 2 more companies
>> in Europe and US have products based on a Zeppelin codebase. Other
>> companies use Zeppelin in production for their data analytics
>> workflows. We believe that this, plus the fact that Zeppelin already
>> have contributors from different companies mitigates this risk well.
>> === Inexperience with Open Source ===
>> Zeppelin was born as an open source project from scratch. Majority of
>> the current core contributors have experience working on other open
>> source projects. We also expect that as we grow the community further
>> based on meritocracy and with the guidance of more experienced mentors
>> this will have a positive influence on the project in the long term.
>> === Homogenous Developers ===
>> The initial committers are from same region but there are already 2
>> companies in the Europe that contribute to Zeppelin and others in US
>> also reviewing it and being active on the mailing list. We are
>> committed to create diverse mix of developers from all over the world.
>> === Reliance on Salaried Developers ===
>> Most of the Zeppelin contributors use it as tool of choice either in
>> their own companies internally or distribute it as part of the
>> product.
>> Backend agnostic design helps to keep it as tool of choice for diverse
>> community of data analysts even if they move from one employee to
>> another.
>> There also is at least one university in US with students who
>> potentially might use Zeppelin for R’n’D projects.
>> === Relationship with Other Apache Products ===
>> Right now Zeppelin relies on Apache Spark to run distributed task
>> across a cluster of machines, but it’s abstract interpreter design
>> allows it to work with other systems like Apache MRQL, Apache Crunch
>> as well as SQL-based systems like Apache Tajo, Apache Hive
>> === A Excessive Fascination with the Apache Brand ===
>> We believe that joining Apache will help us attract more contributors
>> to Zeppelin, by giving us a well-defined, transparent development and
>> governance process under a known brand. The reason for this proposal
>> is not to gain publicity, but to further strengthen the longevity of
>> the project without affiliation with any particular company. There are
>> no plans to use of Apache brand in press releases nor posting
>> advertising of acceptance it into Apache Incubator.
>> === Documentation ===
>> Additional documentation on Zeppelin may be found on its github website:
>>   * Zeppelin overview:
>>   * Zeppelin docs:
>>   * Zeppelin road map:
>> TODO!
>>   * Zeppelin issue tracking:
>>   * Zeppelin codebase:
>>   * User group:
>> == Initial Source ==
>> Zeppelin codebase is currently hosted on Github:
>> === Source and Intellectual Property Submission Plan ===
>> Currently, the Zeppleing codebase is distributed under an Apache 2.0 License.
>> == External Dependencies ==
>> To the best of our knowledge, all other dependencies of Zeppelin are
>> distributed under Apache compatible licenses (e.g. junit is EPL,
>> Eclipse Public License v1.0, atmosphere-jersey is CDDL1.0  and
>> dom4j:dom4 is BSD licensed, org.slf4j and
>> are MIT).
>> Only org.reflections:reflections
>> is WTFPL 2.0, which should not
>> be a problem as of
>> Upon acceptance to the incubator, we would begin a thorough analysis
>> of all transitive dependencies to verify this information and
>> introduce license checking into the build and release process by
>> integrating with Apache Rat.
>> == Required Resources ==
>> === Mailing list ===
>> We will migrate the existing Zeppelin mailing lists as follows:
>>   * -->
>>   *
>>   * for PPMC members
>>   *
>> The latter is to be consistent with the new PIAO naming scheme for podlings.
>> === Source control ===
>> Zeppelin team would like to use Git for source control, as it already
>> uses Git. We request a writeable Git repo for Zeppelin, and mirroring
>> to be set up to Github through INFRA.
>> === Issue Tracking ===
>> Zeppelin currently uses the Jira tracking system
>> We will
>> migrate to the Apache JIRA:
>> === Other Resources ===
>>   * Jenkins/Hudson for builds and test running.
>>   * Wiki for documentation purposes
>>   * Blog to improve project dissemination
>> == Initial Committers ==
>>   * Lee Moon Soo <moon at apache dot org>
>>   * Anthony Corbacho <corbacho.anthony at gmail dot com>, CLA submitted
>>   * Damien Corneau <corneadoug at gmail dot com>, CLA submitted
>>   * Alexander Bezzubov <abezzubov at nflabs dot com>, CLA confirmed
>>   * Kevin Sangwoo Kim <sangwookim dot me at gmail dot us>, CLA confirmed
>> == Affiliations ==
>>   * Lee Moon Soo: NFLabs
>>   * Anthony Corbacho: NFLabs
>>   * Damien Corneau: NFLabs
>>   * Alexander Bezzubov: NFLabs
>>   * Kevin Sangwoo Kim: VCNC (a.k.a Between)
>> == Sponsors ==
>> === Champion ===
>>   * Roman Shaposhnik
>> === Nominated Mentors ===
>>   * Konstantin Boudnik
>>   * Ted Dunning
>>   * Henry Saputra
>>   * Roman Shaposhnik
>> === Sponsoring Entity ===
>>   The Apache Incubator
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message