incubator-general mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Looking for Champion
Date Fri, 08 Jun 2018 16:37:12 GMT
On Fri, Jun 8, 2018 at 9:18 AM, Tim Armstrong <tarmstrong@cloudera.com>
wrote:

> > Meanwhile we found Impala is a very good MPP SQL query engine, so we
> > integrated them together.
>
> Palo didn't integrate with Impala, it forked Impala's codebase and embedded
> it in its own repository. I don't remember any attempts from the Palo team
> to engage with the Impala community or attempt to work with us to
> contribute any improvements.
>
> It looks like Palo is still pulling in new code from Impala.  E.g. this
> commit includes a bunch of code I wrote as part of IMPALA-3200:
> https://github.com/baidu/palo/commit/2419384e8a211f10e7636afc6d3423700ba22b5a#diff-1c501d9a8b5c3d1d1cce48d5e1fb0edf
>
> The code isn't owned by any individual, I contributed it to Apache and it's
> free for anyone to do what they want to do with it, but pulling in
> improvements from other projects without any attempt to attribute it or
> contribute improvements back seems contrary to the Apache way.
>

+1. Also briefly browsing the code I found suspicious commits like this one:
https://github.com/baidu/palo/commit/6486be64c319fe0beb8c6b4430c1662de54f182e

... in which a GPL license copyright by Oracle was "fixed" to be an Apache
license copyright Baidu.

So if this project does enter incubation I think we should be extra careful
to audit the origins of all of the source code.

-Todd


> On Fri, Jun 8, 2018 at 9:12 AM, Todd Lipcon <todd@cloudera.com> wrote:
>
> > On Thu, Jun 7, 2018 at 11:55 PM, Li,De(BDG) <lide@baidu.com> wrote:
> >
> > > Hi, Jim
> > >
> > > Thank you for your response.
> > > Actually, we started Palo several years ago, and at that time we
> > > developed the storage engine based on Mesa technology.
> > > Meanwhile we found Impala is a very good MPP SQL query engine, so we
> > > integrated them together.
> > >
> >
> > From what I can tell of the Palo source, it's not so much an integration
> > as a copied-and-modified codebase, right? i.e., Palo does not use Impala
> > as a dependency, but rather shares a lot of code from the Impala project
> > that has since diverged.
> >
> >
> > >
> > > With this integration, the goal of Palo is to implement a single,
> > > full-featured, MySQL-protocol-compatible data warehouse.
> > >
> >
> > That sounds pretty similar to the goals of the Impala project. Impala
> > isn't MySQL-compatible at the moment but that seems more like a
> > particular feature that could be added rather than a distinct identity
> > of the project. Otherwise, Impala's goal is to be a full featured data
> > warehouse engine as well.
> >
> > Generally Apache has no rules against multiple projects fulfilling
> > similar goals or use cases, even when those projects might compete.
> > However I think it would be relatively unusual to incubate a project
> > that appears to be derived from a fork of an existing project, at least
> > without first considering whether the additional feature set could be
> > contributed back to the existing community.
> >
> > -Todd
> >
> >
> > > On 2018/6/8 at 1:55 PM, "Jim Apple" <jbapple@apache.org> wrote:
> > >
> > > >Hello! As a contributor to Impala, I’d be interested in hearing
> > > >thoughts from the Palo community about integration between Impala and
> > > >Palo.
> > > >
> > > >For instance, are there any apparent design goals of Impala that the
> > > >Palo community thinks are fundamentally incompatible with Palo?
> > > >
> > > >Thanks,
> > > >Jim
> > > >
> > > >On 2018/06/08 04:45:32, "Li,De(BDG)" <lide@baidu.com> wrote:
> > > >> Hi all,
> > > >>
> > > > >> I am Reed, a developer on the Palo team (Palo is an MPP-based
> > > > >> interactive SQL data warehouse).
> > > >> https://github.com/baidu/palo/wiki/Palo-Overview
> > > >>
> > > > >> We propose to contribute Palo as an Apache Incubator project, and
> > > > >> we are still looking for a possible Champion if anyone would like to
> > > > >> volunteer. Thanks a lot.
> > > >>
> > > >> Best Regards,
> > > >> Reed
> > > >>
> > > >> ===================
> > > > >> The draft of the proposal is below:
> > > >>
> > > >> #Apache Palo
> > > >>
> > > >> ##Abstract
> > > >>
> > > > >> Palo is an MPP-based interactive SQL data warehouse for reporting and
> > > > >> analysis.
> > > >>
> > > >> ##Proposal
> > > >>
> > > > >> We propose to contribute the Palo codebase and associated artifacts
> > > > >> (e.g. documentation, web-site content etc.) to the Apache Software
> > > > >> Foundation with the intent of forming a productive, meritocratic and
> > > > >> open community around Palo’s continued development, according to the
> > > > >> ‘Apache Way’.
> > > >>
> > > > >> Baidu owns several trademarks regarding Palo, and proposes to transfer
> > > > >> ownership of those trademarks in full to the ASF.
> > > >>
> > > >> ###Overview of Palo
> > > >>
> > > >> Palo’s implementation consists of two daemons: Frontend (FE) and
> > > >>Backend (BE).
> > > >>
> > > > >> **Frontend daemon** consists of a query coordinator and a catalog
> > > > >> manager. The query coordinator is responsible for receiving users’ SQL
> > > > >> queries, compiling queries and managing query execution. The catalog
> > > > >> manager is responsible for managing metadata such as databases, tables,
> > > > >> partitions, replicas, etc. Several frontend daemons can be deployed to
> > > > >> guarantee fault tolerance and load balancing.
> > > >>
> > > > >> **Backend daemon** stores the data and executes the query fragments.
> > > > >> Many backend daemons can also be deployed to provide scalability and
> > > > >> fault tolerance.
> > > >>
> > > > >> A typical Palo cluster generally consists of several frontend daemons
> > > > >> and dozens to hundreds of backend daemons.
> > > >>
> > > > >> Users can use MySQL client tools to connect to any frontend daemon and
> > > > >> submit SQL queries. The Frontend receives a query and compiles it into
> > > > >> query plans executable by the Backend. The Frontend then sends the
> > > > >> query plan fragments to the Backend, which builds a query execution
> > > > >> DAG. Data is fetched and pipelined into the DAG. The final result is
> > > > >> sent to the client via the Frontend. The distribution of query fragment
> > > > >> execution has minimizing data movement and maximizing scan locality as
> > > > >> its main goals.
> > > >>
> > > >> ##Background
> > > >>
> > > > >> At Baidu, prior to Palo, different tools were deployed to meet diverse
> > > > >> requirements in many ways. When a use case required the simultaneous
> > > > >> availability of capabilities that could not all be provided by a single
> > > > >> tool, users were forced to build hybrid architectures that stitch
> > > > >> multiple tools together. We believe that they shouldn’t need to accept
> > > > >> such inherent complexity. A storage system built to provide great
> > > > >> performance across a broad range of workloads provides a more elegant
> > > > >> solution to the problems that hybrid architectures aim to solve. Palo
> > > > >> is the solution.
> > > >>
> > > > >> Palo is designed to be a simple, single, tightly coupled system that
> > > > >> does not depend on other systems. Palo provides high-concurrency,
> > > > >> low-latency point query performance, as well as high-throughput
> > > > >> queries for ad-hoc analysis. Palo provides bulk-batch data loading, as
> > > > >> well as near real-time mini-batch data loading. Palo also provides high
> > > > >> availability, reliability, fault tolerance, and scalability.
> > > >>
> > > >> ##Rationale
> > > >>
> > > > >> Palo mainly integrates the technology of Google Mesa and Apache Impala.
> > > >>
> > > > >> Mesa is a highly scalable analytic data storage system that stores
> > > > >> critical measurement data related to Google's Internet advertising
> > > > >> business. Mesa is designed to satisfy a complex and challenging set of
> > > > >> users’ and systems’ requirements, including near real-time data
> > > > >> ingestion and query ability, as well as high availability, reliability,
> > > > >> fault tolerance, and scalability for large data and query volumes.
> > > >>
> > > > >> Impala is a modern, open-source MPP SQL engine architected from the
> > > > >> ground up for the Hadoop data processing environment. At present, by
> > > > >> virtue of its superior performance and rich functionality, Impala is
> > > > >> comparable to many commercial MPP database query engines. Mesa can
> > > > >> satisfy many of our storage requirements, but Mesa itself does not
> > > > >> provide a SQL query engine; Impala is a very good MPP SQL query engine,
> > > > >> but it lacks a perfect distributed storage engine. So in the end we
> > > > >> chose the combination of these two technologies.
> > > >>
> > > > >> Learning from Mesa’s data model, we developed a distributed storage
> > > > >> engine. Unlike Mesa, this storage engine does not rely on any
> > > > >> distributed file system. We then deeply integrated this storage engine
> > > > >> with the Impala query engine. Query compiling, query execution
> > > > >> coordination and catalog management of the storage engine are
> > > > >> integrated into the frontend daemon; query execution and data storage
> > > > >> are integrated into the backend daemon. With this integration, we
> > > > >> implemented a single, full-featured, high-performance,
> > > > >> state-of-the-art MPP database, while maintaining simplicity.
> > > >>
> > > >> ##Current Status
> > > >>
> > > >> Palo has been an open source project on GitHub
> > > >>(https://github.com/baidu/palo).
> > > >>
> > > >> ###Meritocracy
> > > >>
> > > > >> Palo has been deployed in production at Baidu and is used by more than
> > > > >> 200 lines of business. It has demonstrated great performance benefits
> > > > >> and has proved to be a better way to do reporting and analysis on big
> > > > >> data. Still, we look forward to growing a rich user and developer
> > > > >> community.
> > > >>
> > > >> ###Community
> > > >>
> > > > >> Palo seeks to develop developer and user communities during incubation.
> > > >>
> > > >> ###Core Developers
> > > >>
> > > >> * Ruyue Ma (https://github.com/maruyue,
> > > >>maruyue@baidu.com<mailto:maruyue@baidu.com>)
> > > >> * Chun Zhao (https://github.com/imay,
> > > >>buaa.zhaoc@gmail.com<mailto:buaa.zhaoc@gmail.com>)
> > > >> * Mingyu Chen (https://github.com/morningman,chenmingyu@baidu.com)
> > > > >> * De Li (https://github.com/lide-reed,
> > > > >>mailtolide@sina.com<mailto:mailtolide@sina.com>)
> > > >> * Hao Chen (https://github.com/chenhao7253886,
> > > >>chenhao16@baidu.com<mailto:chenhao16@baidu.com>)
> > > >> * Chaoyong Li (https://github.com/cyongli,
> > > >>lichaoyong@baidu.com<mailto:lichaoyong@baidu.com>)
> > > >> * Bin Lin (https://github.com/lingbin,
> > > >>lingbinlb@gmail.com<mailto:lingbinlb@gmail.com>)
> > > >>
> > > >> ###Alignment
> > > >>
> > > >> Palo is related to several other Apache projects:
> > > >>
> > > > >> * Palo can also read data stored in Apache Hadoop clusters powered by
> > > > >>   the HDFS filesystem.
> > > > >> * Palo is closely integrated with Impala, which is also being proposed
> > > > >>   to the Incubator.
> > > > >> * Palo uses Apache Thrift as its RPC and serialization framework of
> > > > >>   choice.
> > > >>
> > > >> ##Known Risks
> > > >>
> > > >> ###Orphaned Products
> > > >>
> > > > >> The core developers of the Palo team plan to work full time on this
> > > > >> project. There is very little risk of Palo getting orphaned since at
> > > > >> least one large company (Baidu) is extensively using it in
> > > > >> production. For example, there are currently more than 200 use cases
> > > > >> using Palo in production. Furthermore, since Palo was open sourced at
> > > > >> the beginning of October 2017, it has received more than 660 stars and
> > > > >> been forked nearly 170 times. We plan to extend and diversify this
> > > > >> community further through Apache.
> > > >>
> > > >> ###Inexperience with Open Source
> > > >>
> > > > >> The core developers are all active users and followers of open source.
> > > > >> They are already committers and contributors to the Palo GitHub
> > > > >> project. All have been involved with source code that has been
> > > > >> released under an open source license, and several of them also have
> > > > >> experience developing code in an open source environment. Though the
> > > > >> core set of developers do not have Apache open source experience,
> > > > >> there are plans to onboard individuals with Apache open source
> > > > >> experience on to the project.
> > > >>
> > > >> ###Homogenous Developers
> > > >>
> > > > >> Most of the core developers are from Baidu, but after Palo was open
> > > > >> sourced, Palo received a lot of bug fixes and enhancements from other
> > > > >> developers not working at Baidu.
> > > >>
> > > >> ###Reliance on Salaried Developers
> > > >>
> > > > >> Baidu invested in Palo as the OLAP solution and some of its key
> > > > >> engineers are working full time on the project. In addition, since
> > > > >> there is a growing Big Data need for scalable OLAP solutions, we look
> > > > >> forward to other Apache developers and researchers contributing to the
> > > > >> project. Also key to addressing the risk associated with relying on
> > > > >> salaried developers from a single entity is to increase the diversity
> > > > >> of the contributors and actively lobby for domain experts in the BI
> > > > >> space to contribute. Apache Palo intends to do this.
> > > >>
> > > >> ###An Excessive Fascination with the Apache Brand
> > > >>
> > > > >> Palo is proposing to enter incubation at Apache in order to help
> > > > >> efforts to diversify the committer base, not so much to capitalize on
> > > > >> the Apache brand. The Palo project is in production use already inside
> > > > >> Baidu, but is not expected to be a Baidu product for external
> > > > >> customers. As such, the Palo project is not seeking to use the Apache
> > > > >> brand as a marketing tool.
> > > >>
> > > >> ##Documentation
> > > >>
> > > > >> Information about Palo can be found at https://github.com/baidu/palo.
> > > > >> The following links provide more information about Palo in open source:
> > > >>
> > > >> * Palo wiki site: https://github.com/baidu/palo/wiki
> > > >> * Codebase at Github: https://github.com/baidu/palo
> > > >> * Issue Tracking: https://github.com/baidu/palo/issues
> > > >> * Overview: https://github.com/baidu/palo/wiki/Palo-Overview
> > > >> * FAQ: https://github.com/baidu/palo/wiki/Palo-FAQ
> > > >>
> > > >> ##Initial Source
> > > >>
> > > > >> Palo has been under development since 2017 by a team of engineers at
> > > > >> Baidu Inc. It is currently hosted on Github.com under an Apache license
> > > > >> at https://github.com/baidu/palo.
> > > >>
> > > >> ##External Dependencies
> > > >>
> > > >> Palo has the following external dependencies.
> > > >>
> > > >> * Google gflags (BSD)
> > > >> * Google glog (BSD)
> > > >> * Apache Thrift (Apache Software License v2.0)
> > > >> * Apache Commons (Apache Software License v2.0)
> > > >> * Boost (Boost Software License)
> > > >> * OpenLdap (OpenLDAP Software License)
> > > >> * rapidjson (Tencent)
> > > >> * Google RE2 (BSD-style)
> > > >> * lz4 (BSD)
> > > >> * snappy (BSD)
> > > >> * cyrus-sasl (CMU License)
> > > >> * Twitter Bootstrap (Apache Software License v2.0)
> > > >> * d3 (BSD)
> > > >> * LLVM (BSD-like)
> > > >>
> > > >> Build and test dependencies:
> > > >>
> > > >> * ant (Apache Software License v2.0)
> > > >> * Apache Maven (Apache Software License v2.0)
> > > >> * cmake (BSD)
> > > >> * clang (BSD)
> > > >> * Google gtest (Apache Software License v2.0)
> > > >>
> > > >> ##Required Resources
> > > >>
> > > >> ###Mailing List
> > > >>
> > > >> There are currently no mailing lists. The usual mailing lists are
> > > >>expected to be set up when entering incubation:
> > > >>
> > > >>
> > > > >> private@palo.incubator.apache.org
> > > > >> dev@palo.incubator.apache.org
> > > > >> commits@palo.incubator.apache.org
> > > >>
> > > >> ###Subversion Directory
> > > >>
> > > >> Upon entering incubation: https://github.com/baidu/palo.
> > > >> After incubation, we want to move the existing repo from
> > > >>https://github.com/baidu/palo to Apache infrastructure.
> > > >>
> > > >> ###Issue Tracking
> > > >>
> > > > >> Palo currently uses GitHub to track issues. We would like to continue
> > > > >> to do so while we discuss migration possibilities with the ASF Infra
> > > > >> committee.
> > > >>
> > > >> ###Other Resources
> > > >>
> > > >> The existing code already has unit tests so we will make use of
> > > >>existing Apache continuous testing infrastructure. The resulting load
> > > >>should not be very large.
> > > >>
> > > >> ##Initial Committers
> > > >>
> > > >> * Ruyue Ma (https://github.com/maruyue,
> > > >>maruyue@baidu.com<mailto:maruyue@baidu.com>)
> > > >> * Chun Zhao (https://github.com/imay,
> > > >>buaa.zhaoc@gmail.com<mailto:buaa.zhaoc@gmail.com>)
> > > >> * Mingyu Chen (https://github.com/morningman,chenmingyu@baidu.com)
> > > > >> * De Li (https://github.com/lide-reed,
> > > > >>mailtolide@sina.com<mailto:mailtolide@sina.com>)
> > > >> * Hao Chen (https://github.com/chenhao7253886,
> > > >>chenhao16@baidu.com<mailto:chenhao16@baidu.com>)
> > > >> * Chaoyong Li (https://github.com/cyongli,
> > > >>lichaoyong@baidu.com<mailto:lichaoyong@baidu.com>)
> > > >> * Bin Lin (https://github.com/lingbin,
> > > >>lingbinlb@gmail.com<mailto:lingbinlb@gmail.com>)
> > > >>
> > > >> ##Affiliations
> > > >>
> > > > >> The initial committers are employees of Baidu Inc. The nominated
> > > > >> mentors are employees of TODO.
> > > >>
> > > >> ##Sponsors
> > > >>
> > > >> ###Champion
> > > >>
> > > >> TODO
> > > >>
> > > >> ###Nominated Mentors
> > > >>
> > > >> * sijie guo, guosijie@gmail.com<mailto:guosijie@gmail.com>
> > > >> * Luke Han, lukehan@apache.org<mailto:lukehan@apache.org>
> > > >> * Zheng Shao, zshao@apache.org<mailto:zshao@apache.org>
> > > >>
> > > >> ###Sponsoring Entity
> > > >>
> > > >> We are requesting the Incubator to sponsor this project.
> > > >>
> > > >
> > > >---------------------------------------------------------------------
> > > >To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > >For additional commands, e-mail: general-help@incubator.apache.org
> > > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > For additional commands, e-mail: general-help@incubator.apache.org
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera
