incubator-general mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: [VOTE] Accept MADlib into the Apache Incubator
Date Sun, 13 Sep 2015 21:41:28 GMT
+1



On Fri, Sep 11, 2015 at 9:11 PM, Skip Intro <alias@2kb.net> wrote:

> +1 (non-binding)
>
> On Wed, Sep 09, 2015 at 07:37PM, Roman Shaposhnik wrote:
> > Following the discussion earlier:
> >    http://s.apache.org/TE6
> >
> > I would like to call a VOTE for accepting MADlib community as a new
> > ASF incubator project.
> >
> > The proposal is available at:
> >    https://wiki.apache.org/incubator/MADlibProposal
> > and is also included at the bottom of this email.
> >
> > Vote is open until at least Mon, 14 September 2015, 23:59:00 PST
> >
> > [ ] +1 accept MADlib into the Apache Incubator
> > [ ] ±0
> > [ ] -1 because...
> >
> > Thanks,
> > Roman.
> >
> > == Abstract ==
> > MADlib is an open-source library (licensed under 2-clause BSD license)
> > for scalable in-database analytics. It provides data-parallel
> > implementations of mathematical, statistical and machine learning
> > methods for structured and unstructured data. The MADlib mission is to
> > foster widespread development of scalable analytic skills, by
> > harnessing efforts from commercial practice, academic research, and
> > open source development.
> >
> > MADlib occupies a unique niche in the realm of data science and
> > machine learning libraries since its SQL APIs can allow it to work on
> > a wide range of data stores and SQL engines.
> >
> > == Proposal ==
> > The current open source
> > community behind MADlib feels that aligning itself with HAWQ's
> > community, governance model, infrastructure and roadmap will allow the
> > project to accelerate adoption and community growth. Given HAWQ's
> > trajectory of entering Apache Software Foundation family as an
> > Incubating project, we feel that the best course of action for MADlib
> > is to follow a similar route.
> >
> > MADlib and HAWQ are complementary technologies in that MADlib
> > in-database analytical functions can run within the HAWQ execution
> > engine. (MADlib also runs on Greenplum Database and PostgreSQL today.)
> > It is expected that contributors to MADlib will be cognizant of the
> > HAWQ ASF project and may contribute to it as well. In short,
> > collaboration between the two communities will make both projects more
> > vibrant and advance the respective technologies in potentially novel
> > directions.
> >
> > Contributors may also look at the HAWQ project as a starting point for
> > ports to other parallel database engines. This proposal highly
> > encourages this type of work as it would help to further realize the
> > original cross-platform goal of MADlib as envisioned by its
> > originators.
> >
> > Thus, the goal of this proposal is to bring the existing MADlib open
> > source community into ASF, change the project's governance model to
> > the "Apache Way" and transition the project's codebase and
> > infrastructure into ASF INFRA. The community has agreed to transfer
> > the brand name "MADlib" to Apache Software Foundation as well.
> >
> > Pivotal Inc. on behalf of the MADlib open source community is
> > submitting this proposal to transition source code and associated
> > artifacts (documentation, web site content, wiki, etc.) to the Apache
> > Software Foundation Incubator under the Apache License, Version 2.0
> > and is asking Incubator PMC to establish a MADlib incubating
> > project.
> >
> > Currently MADlib uses a few category X licensed software tools during
> > its build (mostly for generating documentation):
> >   * doxypy 0.4.2 (GPL)
> >   * doxygen 1.8.4 (GPL)
> >   * TikZ-UML
> >   * bison 2.4 (GPL, with an exception for generated output)
> > We feel that this usage is compatible with an overall project licensed
> > under the ALv2 and don't anticipate any changes.
> > Our usage of LGPL library cern_root-5.34 is expected to go away since
> > the 2 cern modules used are being entirely re-written in MADlib.
> >
> > Finally, MADlib inclusion of MPL licensed library (eigen 3.2.2) into
> > its binary artifact seems to be consistent with ASF recommendation for
> > managing "weak copyleft" dependencies.
> >
> >
> > == Background ==
> > MADlib grew
> > out of discussions between database engine developers, data
> > scientists, IT architects and academics interested in new approaches
> > to scalable, sophisticated in-database analytics. These discussions
> > were written up in a paper in VLDB 2009 that coined the term
> > “MAD Skills” for data analysis
> > (http://dl.acm.org/citation.cfm?id=1687576). The MADlib software
> > project began the following year as a collaboration between
> > researchers at UC Berkeley and engineers and data scientists at
> > Pivotal (former EMC/Greenplum).
> >
> > The initial MADlib codebase came from EMC/Greenplum, UC Berkeley, the
> > University of Wisconsin, and the University of Florida. The project
> > was publicly documented in a paper at VLDB 2012
> > (http://vldb.org/pvldb/vol5/p1700_joehellerstein_vldb2012.pdf). Today
> > MADlib has contributors from around the world including both
> > individuals and institutions. For example, recent contributions have
> > come from Pivotal, Stanford University, and the University of Illinois
> > at Chicago.
> >
> > MADlib was conceived from the outset as a free, open source library
> > for all to use and contribute to. Since its inception, the community
> > has steadily added new methods in the areas of mathematics,
> > statistics, machine learning, and data transformation. The current
> > library includes over 30 principal algorithms as well as many
> > additional operators and utility functions.
> >
> > The methods in MADlib are designed both for in- or out-of-core
> > execution, and for the shared-nothing, scale-out parallelism offered
> > by modern parallel database engines, ensuring that computation is done
> > close to the data. The core functionality is written in declarative
> > SQL statements, which orchestrate data movement to and from disk, and
> > across networked machines. Single-node inner loops take advantage of
> > SQL extensibility to call out to high performance math libraries in
> > user-defined scalar and aggregate functions. At the highest level,
> > tasks that require iteration and/or structure definition are coded in
> > Python driver routines, which are used only to kick off the data-rich
> > computations that happen within the database engine.
> >
> > The first platforms supported by MADlib were Greenplum Database and
> > PostgreSQL. With the development of HAWQ SQL-on-Hadoop technology by
> > Pivotal, MADlib offers a way to perform predictive analytics on very
> > large data sets stored on a Hadoop cluster.
> >
> > Today, MADlib is in active development and is deployed on a wide
> > variety of industry and academic projects across many different
> > verticals.
> >
> > == Rationale ==
> > Enterprises today are
> > seeing the value of landing very large quantities of data in Hadoop
> > clusters with the goal of improving their products and processes. With
> > the proliferation of increasingly sophisticated SQL-on-Hadoop
> > technologies such as HAWQ, analysts can use the familiar SQL language
> > to query this data at scale. This effectively opens the door to Hadoop
> > in the enterprise.
> >
> > Adding SQL-based predictive analytics like MADlib to the equation
> > enables organizations to reason across large data sets without
> > resorting to sampling, which has been a traditional approach when
> > confronted with scale problems. Operating on all of the data with
> > MADlib results in more robust and accurate models.
> >
> > Since MADlib is a SQL-based interface, organizations do not need to
> > re-train their teams on an unfamiliar programming language since SQL
> > skills are ubiquitous in today's enterprises.
> >
> > Given the high velocity of innovation happening in the underlying
> > Hadoop ecosystem, any SQL-based predictive analytics technology that
> > plays in this ecosystem must be commensurately agile to keep up with
> > the community. We strongly believe that in the Big Data space, this
> > can be optimally achieved through a vibrant, diverse, self-governed
> > community collectively innovating around a single codebase while at
> > the same time cross-pollinating with various other data management
> > communities. Apache Software Foundation is the ideal place to meet
> > those ambitious goals.
> >
> > == Initial Goals ==
> > Our initial goals are to bring MADlib into the ASF, transition the
> > engineering and governance processes to be in accordance with the
> > "Apache Way" and foster a collaborative development model closely
> > aligned with that of HAWQ.
> >
> > Another important goal is encouraging efforts to port to other
> > execution engines.
> >
> > The MADlib project will continue developing new functionality in an
> > open, community-driven way. We envision accelerating innovation under
> > ASF governance, in order to meet the requirements of a wide variety of
> > predictive analytics use cases.
> >
> > We will also require transitioning of existing project infrastructure
> > (source code, JIRA, mailing list) to the ASF infrastructure.
> >
> > == Current Status ==
> > Currently, the project
> > is available at http://madlib.net/. The codebase is licensed under a
> > 2-clause BSD license. Our current governance model could be described
> > as a "benevolent dictator" one. As stated above, the existing MADlib
> > community feels that closer alignment with HAWQ community,
> > infrastructure and the governance model as it is being proposed to ASF
> > will allow MADlib project to thrive much more compared to relative
> > isolation from HAWQ.
> >
> > === Meritocracy ===
> > Our proposed list of initial committers includes the current MADlib
> > R&D team at Pivotal and existing active members of the open source
> > project. This group will form a base for the broader community we will
> > invite to collaborate on the codebase. We intend to radically expand
> > the initial developer and user community by running the project in
> > accordance with the "Apache Way". Users and new contributors will be
> > treated with respect and welcomed. By participating in the community
> > and providing quality patches/support that move the project forward,
> > they will earn merit. They also will be encouraged to provide non-code
> > contributions (documentation, events, community management, etc.) and
> > will gain merit for doing so. Those with a proven support and quality
> > track record will be encouraged to become committers.
> >
> > === Community ===
> > If MADlib is accepted for incubation, the primary initial goal will be
> > transitioning the core community towards embracing the Apache Way of
> > project governance. We would solicit major existing contributors to
> > become committers on the project from the start.
> >
> > === Core Developers ===
> > MADlib core developers are skilled in working as part of openly
> > governed communities. That said, most of the core developers are
> > currently NOT affiliated with the ASF and would require new ICLAs
> > before committing to the project.
> >
> > === Alignment ===
> > The following
> > existing ASF projects can be considered when reviewing the MADlib
> > proposal:
> >
> > Apache Mahout project's goal is to build an environment for quickly
> > creating scalable performant machine learning applications. Apache
> > Mahout is, perhaps, the oldest machine learning library in Hadoop
> > ecosystem. The three major components of Mahout are an environment for
> > building scalable algorithms, many new Scala + Spark (H2O in progress)
> > algorithms, and Mahout's mature Hadoop MapReduce algorithms. We see
> > the two projects benefiting from each other's experience of
> > implementing similar classes of algorithms and look forward to a
> > fruitful exchange of ideas between the two communities.
> >
> > Apache Spark is a fast engine for processing large datasets, typically
> > from a Hadoop cluster, and performing batch, streaming, interactive,
> > or machine learning workloads. Recently, Apache Spark has embraced
> > SQL-like APIs around DataFrames at its core. Because of that we would
> > expect a level of collaboration between the two projects. Spark
> > project also contains a library (MLlib) that is the closest cousin to
> > MADlib. MLlib is Apache Spark's scalable machine learning library. We
> > see the two projects benefiting from each other's experience of
> > implementing similar classes of algorithms and look forward to a
> > fruitful exchange of ideas between the two communities.
> >
> > Apache Hive is a data warehouse software that facilitates querying and
> > managing large datasets residing in distributed storage. Hive provides
> > a mechanism to project structure onto this data and query the data
> > using a SQL-like language called HiveQL. We see a potential for MADlib
> > to leverage Hive as a backend the same way it currently leverages
> > PostgreSQL-derived SQL backends. This could be especially useful for
> > longer running algorithms.
> >
> > Apache Drill is a schema-free SQL query engine for Hadoop, NoSQL and
> > Cloud Storage. We see a potential for MADlib to leverage Drill as a
> > backend the same way it currently leverages PostgreSQL-derived SQL
> > backends. This could be especially useful for analyzing data coming
> > from heterogeneous sources and federated by the Drill engine.
> >
> > == Known Risks ==
> > Development has been sponsored mostly
> > by a single company (or its predecessors) thus far and coordinated
> > mainly by the core Pivotal R&D team.
> >
> > So far, the project's governance model has explicitly been a
> > "benevolent dictator" one. For the project to fully transition to the
> > "Apache Way", development must shift towards the meritocracy-centric
> > model of growing a community of contributors balanced with the needs
> > for extreme stability and core implementation coherency.
> >
> > === Orphaned products ===
> > The community proposing MADlib for incubation is an independent open
> > source community. Even though Pivotal happens to be the biggest
> > corporate sponsor of the project (by means of employing the core team)
> > the community goes beyond those affiliated with Pivotal. On top of
> > that, Pivotal is fully committed to maintain its position as one of
> > the leading providers of SQL-based analytics aimed squarely at data
> > scientists. MADlib is the only game in town that can leverage SQL APIs
> > ranging from traditional RDBMS technology all the way to data
> > warehousing (Pivotal Greenplum Database) and into SQL-on-Hadoop
> > (HAWQ). Moreover, Pivotal has a vested interest in making MADlib
> > succeed by driving its close integration with sister ASF projects. We
> > expect this to further reduce the risk of orphaning the product.
> >
> > Even in the absence of support by a particular vendor such as Pivotal,
> > and in a worst-case scenario where HAWQ and Greenplum Database fail to
> > gain traction in OSS, the existence of an established PostgreSQL OSS
> > project means there will still be a working stack for MADlib.
> >
> > === Inexperience with Open Source ===
> > MADlib has been an open source project from the outset. All developers
> > working on the project (regardless of their employment affiliation)
> > did so completely in the open. While the governance model of MADlib
> > has been more of a benevolent dictator model, the project has always
> > been receptive to accepting contributions from all sources and
> > including them in future releases based on thorough code review,
> > testing, and compliance with the project’s coding best practices.
> >
> > === Homogeneous Developers ===
> > While most of the initial committers are employed by Pivotal, there's
> > still a healthy level of interest coming from academia. On top of that
> > we expect to spark curiosity in sister ASF projects and attract
> > developers unaffiliated with Pivotal. Finally, MADlib is being used
> > extensively whenever Pivotal engages with customers on data science
> > projects. This typically means that the skills remain within a
> > customer organization which further increases the chance of turning
> > customer data scientists into MADlib contributors.
> >
> > === Reliance on Salaried Developers ===
> > A large percentage of the contributors are paid to work in the Big
> > Data space. While they might wander from their current employers, they
> > are unlikely to venture far from their core expertise and thus will
> > continue to be engaged with the project regardless of their current
> > employers. In addition, the project is still enjoying popularity in
> > academic circles and we hope that will help mitigate reliance on
> > salaried developers as well.
> >
> > === Relationships with Other Apache Products ===
> > As mentioned in the Alignment section, MADlib may consider various
> > degrees of integration and code exchange with Apache Spark (MLlib),
> > Apache Mahout, Apache Hive and Apache Drill projects. We expect
> > integration points to be inside and outside the project. We look
> > forward to collaborating with these communities as well as other
> > communities under the Apache umbrella.
> >
> > === An Excessive Fascination with the Apache Brand ===
> > While we intend to leverage the Apache "brand" when talking to other
> > projects as a testament to our project’s neutrality, we have no plans
> > for making use of the Apache brand in press releases nor posting
> > billboards advertising acceptance of MADlib into Apache Incubator.
> >
> > == Documentation ==
> > The documentation is currently available at:
> >    https://github.com/madlib/frontpage
> >
> > The documentation is currently licensed under 2-clause BSD license.
> >
> > == Initial Source ==
> > Initial source code is available at:
> >   * MADlib: https://github.com/madlib/madlib
> >   * Testsuite: https://github.com/madlib/testsuite
> >   * Contributors: https://github.com/madlib/contrib
> >
> > The code is currently licensed under 2-clause BSD license.
> >
> > == Source and Intellectual Property Submission Plan ==
> > As soon as MADlib is approved to join the Incubator, the source code
> > will be transitioned via the Software Grant Agreement onto ASF
> > infrastructure and in turn made available under the Apache License,
> > version 2.0. We know of no legal encumbrances that would inhibit the
> > transfer of source code to the ASF.
> >
> > == External Dependencies ==
> >
> > Runtime dependencies:
> >   * boost-1.47.0 (Boost Software License)
> >   * _m_widen_init (MIT for this subcomponent of GCC)
> >   * python-argparse-1.2.1 (PSF LICENSE AGREEMENT FOR PYTHON 2.7.1)
> >   * pyyaml-3.10 (MIT license)
> >   * cern_root-5.34 (LGPL, however this dependency will be removed
> >     since the 2 cern modules used are being entirely re-written in
> >     MADlib)
> >   * eigen-3.2.2 (Mozilla Public License)
> >   * pyxb-1.2.4 (Apache license version 2)
> >   * python (Python Software Foundation License Version 2)
> >   * mathjax-2.5 (Apache license version 2)
> >
> > Build only dependencies:
> >   * doxypy-0.4.2 (GPL)
> >   * cmake-2.8.4 (BSD 3-clause License)
> >   * doxygen >= 1.8.4 (GPL)
> >   * flex >= 2.5.33 (BSD)
> >   * bison >= 2.4 (GPL)
> >   * latex (LaTeX Project Public License)
> >   * TikZ-UML (no license information)
> >
> > Cryptography
> >   * N/A
> >
> > == Required Resources ==
> >
> > === Mailing lists ===
> >   * private@madlib.incubator.apache.org (moderated subscriptions)
> >   * commits@madlib.incubator.apache.org
> >   * dev@madlib.incubator.apache.org
> >   * issues@madlib.incubator.apache.org
> >   * user@madlib.incubator.apache.org
> >
> > === Git Repository ===
> > https://git-wip-us.apache.org/repos/asf/incubator-madlib.git
> >
> > === Issue Tracking ===
> > JIRA Project MADlib (MADLIB)
> >
> > We will also request migration of our current JIRA available at
> > http://jira.madlib.net/
> >
> > === Other Resources ===
> >
> > Means of setting up regular builds for MADlib on builds.apache.org
> > will require integration with Docker support.
> >
> > == Initial Committers ==
> >   * Anirudh Kondaveeti
> >   * Caleb Welton
> >   * Frank McQuillan
> >   * Gang Xiong
> >   * Gautam Muralidhar
> >   * Hitoshi Harada
> >   * Hulya Emir-farinas
> >   * Ian Huston
> >   * KeeSiong Ng
> >   * Noel Sio
> >   * Rahul Iyer
> >   * Rashmi Raghu
> >   * Regunathan Radhakrishnan
> >   * Ronert Obst
> >   * Samuel Ziegler
> >   * Sarah Aerni
> >   * Srivatsan Ramanujam
> >   * Woo Jae Jung
> >   * Xixuan Feng
> >   * Yu Yang
> >   * Atri Sharma
> >   * Greg Chase
> >   * Chloe Jackson
> >   * Roman Shaposhnik
> >   * Vaibhav Gumashta
> >   * Ted Dunning
> >   * Konstantin Boudnik
> >
> > == Affiliations ==
> >   * Hortonworks: Vaibhav Gumashta
> >   * MapR: Ted Dunning
> >   * WANDisco: Konstantin Boudnik
> >   * Barclays: Atri Sharma
> >   * Pivotal: everyone else on this proposal
> >
> > == Sponsors ==
> >
> > === Champion ===
> > Roman Shaposhnik
> >
> > === Nominated Mentors ===
> >
> > The initial mentors are listed below:
> >   * Ted Dunning - Apache Member, MapR
> >   * Konstantin Boudnik - Apache Member, WANDisco
> >   * Roman Shaposhnik - Apache Member, Pivotal
> >
> > === Sponsoring Entity ===
> > We would like to propose Apache incubator to sponsor this project.
