incubator-general mailing list archives

From "Li,De(BDG)" <l...@baidu.com>
Subject Re: Looking for Champion
Date Sat, 09 Jun 2018 07:35:42 GMT
Hi Todd,

Thank you for your response.

It was a serious mistake to replace the Oracle license with the Apache
license when updating licenses with a script.

We did not check carefully. In fact, those files are no longer used,
so I removed them and made a new commit.

https://github.com/baidu/palo/commit/ac770c33d445a4c18a0b74f56b28a4180b30bfb7

Best Regards,
Reed


On 2018/6/9 at 12:37 AM, "Todd Lipcon" <todd@cloudera.com> wrote:

>On Fri, Jun 8, 2018 at 9:18 AM, Tim Armstrong <tarmstrong@cloudera.com>
>wrote:
>
>> > Meanwhile we found Impala is a very good MPP SQL query engine, so we
>> > integrated them together.
>>
>> Palo didn't integrate with Impala, it forked Impala's codebase and
>> embedded it in its own repository. I don't remember any attempts from the
>> Palo team to engage with the Impala community or attempt to work with us
>> to contribute any improvements.
>>
>> It looks like Palo is still pulling in new code from Impala.  E.g. this
>> commit includes a bunch of code I wrote as part of IMPALA-3200:
>> https://github.com/baidu/palo/commit/2419384e8a211f10e7636afc6d3423700ba22b5a#diff-1c501d9a8b5c3d1d1cce48d5e1fb0edf
>>
>> The code isn't owned by any individual, I contributed it to Apache and
>> it's free for anyone to do what they want to do with it, but pulling in
>> improvements from other projects without any attempt to attribute it or
>> contribute improvements back seems contrary to the Apache way.
>>
>
>+1. Also briefly browsing the code I found suspicious commits like this
>one:
>https://github.com/baidu/palo/commit/6486be64c319fe0beb8c6b4430c1662de54f182e
>
>... in which a GPL license copyright by Oracle was "fixed" to be an Apache
>license copyright Baidu.
>
>So if this project does enter incubation I think we should be extra
>careful to audit the origins of all of the source code.
>
>-Todd
>
>
>> On Fri, Jun 8, 2018 at 9:12 AM, Todd Lipcon <todd@cloudera.com> wrote:
>>
>> > On Thu, Jun 7, 2018 at 11:55 PM, Li,De(BDG) <lide@baidu.com> wrote:
>> >
>> > > Hi, Jim
>> > >
>> > > Thank you for your response.
>> > > Actually, we started Palo several years ago, and at that time we
>> > > developed the storage engine based on Mesa technology.
>> > > Meanwhile we found Impala is a very good MPP SQL query engine, so we
>> > > integrated them together.
>> > >
>> >
>> > From what I can tell of the Palo source, it's not so much an integration
>> > as a copied-and-modified codebase, right? i.e. Palo does not use Impala as
>> > a dependency, but rather shares a lot of code from the Impala project that
>> > has since diverged.
>> >
>> >
>> > >
>> > > With this integration, the goal of Palo is to implement a single,
>> > > full-featured, MySQL-protocol-compatible data warehouse.
>> > >
>> >
>> > That sounds pretty similar to the goals of the Impala project. Impala
>> > isn't MySQL-compatible at the moment but that seems more like a particular
>> > feature that could be added rather than a distinct identity of the
>> > project. Otherwise, Impala's goal is to be a full featured data warehouse
>> > engine as well.
>> >
>> > Generally Apache has no rules against multiple projects fulfilling
>> > similar goals or use cases, even when those projects might compete.
>> > However I think it would be relatively unusual to incubate a project that
>> > appears to be derived from a fork of an existing project, at least without
>> > first considering whether the additional feature set could be contributed
>> > back to the existing community.
>> >
>> > -Todd
>> >
>> >
>> > > On 2018/6/8 at 1:55 PM, "Jim Apple" <jbapple@apache.org> wrote:
>> > >
>> > > >Hello! As a contributor to Impala, I’d be interested in hearing
>> > > >thoughts from the Palo community about integration between Impala and
>> > > >Palo.
>> > > >
>> > > >For instance, are there any apparent design goals of Impala that the
>> > > >Palo community thinks are fundamentally incompatible with Palo?
>> > > >
>> > > >Thanks,
>> > > >Jim
>> > > >
>> > > >On 2018/06/08 04:45:32, "Li,De(BDG)" <lide@baidu.com> wrote:
>> > > >> Hi all,
>> > > >>
>> > > >> I am Reed, a developer working with the team on Palo (an MPP-based
>> > > >> interactive SQL data warehouse).
>> > > >> https://github.com/baidu/palo/wiki/Palo-Overview
>> > > >>
>> > > >> We propose to contribute Palo as an Apache Incubator project, and
>> > > >> we are still looking for a possible Champion if anyone would like to
>> > > >> volunteer. Thanks a lot.
>> > > >>
>> > > >> Best Regards,
>> > > >> Reed
>> > > >>
>> > > >> ===================
>> > > >> The draft of the proposal is below:
>> > > >>
>> > > >> #Apache Palo
>> > > >>
>> > > >> ##Abstract
>> > > >>
>> > > >> Palo is an MPP-based interactive SQL data warehouse for reporting
>> > > >> and analysis.
>> > > >>
>> > > >> ##Proposal
>> > > >>
>> > > >> We propose to contribute the Palo codebase and associated artifacts
>> > > >> (e.g. documentation, web-site content etc.) to the Apache Software
>> > > >> Foundation with the intent of forming a productive, meritocratic and
>> > > >> open community around Palo’s continued development, according to the
>> > > >> ‘Apache Way’.
>> > > >>
>> > > >> Baidu owns several trademarks regarding Palo, and proposes to
>> > > >> transfer ownership of those trademarks in full to the ASF.
>> > > >>
>> > > >> ###Overview of Palo
>> > > >>
>> > > >> Palo’s implementation consists of two daemons: Frontend (FE) and
>> > > >> Backend (BE).
>> > > >>
>> > > >> **Frontend daemon** consists of a query coordinator and a catalog
>> > > >> manager. The query coordinator is responsible for receiving users’ SQL
>> > > >> queries, compiling queries and managing query execution. The catalog
>> > > >> manager is responsible for managing metadata such as databases, tables,
>> > > >> partitions and replicas. Several frontend daemons can be deployed to
>> > > >> provide fault tolerance and load balancing.
>> > > >>
>> > > >> **Backend daemon** stores the data and executes the query fragments.
>> > > >> Many backend daemons can be deployed to provide scalability and
>> > > >> fault tolerance.
>> > > >>
>> > > >> A typical Palo cluster is generally composed of several frontend
>> > > >> daemons and dozens to hundreds of backend daemons.
>> > > >>
>> > > >> Users can use MySQL client tools to connect to any frontend daemon
>> > > >> and submit SQL queries. The Frontend receives a query and compiles it
>> > > >> into query plans executable by the Backend, then sends the query plan
>> > > >> fragments to the Backend. The Backend builds a query execution DAG, and
>> > > >> data is fetched and pipelined into the DAG. The final result is sent
>> > > >> to the client via the Frontend. The distribution of query fragment
>> > > >> execution aims mainly to minimize data movement and maximize scan
>> > > >> locality.
>> > > >>
>> > > >> ##Background
>> > > >>
>> > > >> At Baidu, prior to Palo, different tools were deployed to meet
>> > > >> diverse requirements. When a use case required capabilities that no
>> > > >> single tool could provide, users were forced to build hybrid
>> > > >> architectures that stitched multiple tools together. We believe they
>> > > >> shouldn’t need to accept such inherent complexity. A storage system
>> > > >> built to provide great performance across a broad range of workloads
>> > > >> is a more elegant solution to the problems that hybrid architectures
>> > > >> aim to solve. Palo is that solution.
>> > > >>
>> > > >> Palo is designed to be a simple, single, tightly coupled system that
>> > > >> does not depend on other systems. Palo provides high-concurrency,
>> > > >> low-latency point queries as well as high-throughput ad-hoc analytical
>> > > >> queries. It supports bulk batch data loading as well as near real-time
>> > > >> mini-batch data loading. Palo also provides high availability,
>> > > >> reliability, fault tolerance, and scalability.
>> > > >>
>> > > >> ##Rationale
>> > > >>
>> > > >> Palo mainly integrates the technology of Google Mesa and Apache
>> > > >> Impala.
>> > > >>
>> > > >> Mesa is a highly scalable analytic data storage system that stores
>> > > >> critical measurement data related to Google's Internet advertising
>> > > >> business. Mesa is designed to satisfy a complex and challenging set of
>> > > >> user and system requirements, including near real-time data
>> > > >> ingestion and query ability, as well as high availability, reliability,
>> > > >> fault tolerance, and scalability for large data and query volumes.
>> > > >>
>> > > >> Impala is a modern, open-source MPP SQL engine architected from the
>> > > >> ground up for the Hadoop data processing environment. By virtue of its
>> > > >> performance and rich functionality, Impala is now comparable to many
>> > > >> commercial MPP database query engines. Mesa can satisfy many of our
>> > > >> storage requirements, but Mesa itself does not provide a SQL query
>> > > >> engine; Impala is a very good MPP SQL query engine, but it lacks an
>> > > >> ideal distributed storage engine. So in the end we chose to combine
>> > > >> these two technologies.
>> > > >>
>> > > >> Learning from Mesa’s data model, we developed a distributed storage
>> > > >> engine. Unlike Mesa, this storage engine does not rely on any
>> > > >> distributed file system. We then deeply integrated this storage engine
>> > > >> with the Impala query engine. Query compilation, query execution
>> > > >> coordination and catalog management of the storage engine are
>> > > >> integrated into the frontend daemon; query execution and data storage
>> > > >> are integrated into the backend daemon. With this integration, we
>> > > >> implemented a single, full-featured, high-performance,
>> > > >> state-of-the-art MPP database while maintaining simplicity.
>> > > >>
>> > > >> ##Current Status
>> > > >>
>> > > >> Palo is currently an open source project on GitHub
>> > > >> (https://github.com/baidu/palo).
>> > > >>
>> > > >> ###Meritocracy
>> > > >>
>> > > >> Palo has been deployed in production at Baidu and serves more than
>> > > >> 200 lines of business. It has demonstrated great performance benefits
>> > > >> and has proved to be a better way to do reporting and analysis on big
>> > > >> data. Still, we look forward to growing a rich user and developer
>> > > >> community.
>> > > >>
>> > > >> ###Community
>> > > >>
>> > > >> Palo seeks to develop developer and user communities during
>> > > >> incubation.
>> > > >>
>> > > >> ###Core Developers
>> > > >>
>> > > >> * Ruyue Ma (https://github.com/maruyue, maruyue@baidu.com)
>> > > >> * Chun Zhao (https://github.com/imay, buaa.zhaoc@gmail.com)
>> > > >> * Mingyu Chen (https://github.com/morningman, chenmingyu@baidu.com)
>> > > >> * De Li (https://github.com/lide-reed, mailtolide@sina.com)
>> > > >> * Hao Chen (https://github.com/chenhao7253886, chenhao16@baidu.com)
>> > > >> * Chaoyong Li (https://github.com/cyongli, lichaoyong@baidu.com)
>> > > >> * Bin Lin (https://github.com/lingbin, lingbinlb@gmail.com)
>> > > >>
>> > > >> ###Alignment
>> > > >>
>> > > >> Palo is related to several other Apache projects:
>> > > >>
>> > > >> * Palo can also read data stored in Apache Hadoop clusters powered
>> > > >>   by the HDFS filesystem.
>> > > >> * Palo is closely integrated with Impala, which is also being
>> > > >>   proposed to the Incubator.
>> > > >> * Palo uses Apache Thrift as its RPC and serialization framework of
>> > > >>   choice.
>> > > >>
>> > > >> ##Known Risks
>> > > >>
>> > > >> ###Orphaned Products
>> > > >>
>> > > >> The core developers of the Palo team plan to work full time on this
>> > > >> project. There is very little risk of Palo getting orphaned since at
>> > > >> least one large company (Baidu) is using it extensively in
>> > > >> production. For example, there are currently more than 200 use cases
>> > > >> using Palo in production. Furthermore, since Palo was open sourced at
>> > > >> the beginning of October 2017, it has received more than 660 stars
>> > > >> and been forked nearly 170 times. We plan to extend and diversify this
>> > > >> community further through Apache.
>> > > >>
>> > > >> ###Inexperience with Open Source
>> > > >>
>> > > >> The core developers are all active users and followers of open
>> > > >> source. They are already committers and contributors to the Palo GitHub
>> > > >> project. All have been involved with source code that has been released
>> > > >> under an open source license, and several of them also have experience
>> > > >> developing code in an open source environment. Though the core set of
>> > > >> developers do not have Apache open source experience, there are plans
>> > > >> to onboard individuals with Apache open source experience onto the
>> > > >> project.
>> > > >>
>> > > >> ###Homogenous Developers
>> > > >>
>> > > >> Most of the core developers are from Baidu, but after Palo was open
>> > > >> sourced, it received a lot of bug fixes and enhancements from
>> > > >> developers not working at Baidu.
>> > > >>
>> > > >> ###Reliance on Salaried Developers
>> > > >>
>> > > >> Baidu invested in Palo as its OLAP solution and some of its key
>> > > >> engineers are working full time on the project. In addition, since
>> > > >> there is a growing Big Data need for scalable OLAP solutions, we look
>> > > >> forward to other Apache developers and researchers contributing to the
>> > > >> project. Key to addressing the risk of relying on salaried developers
>> > > >> from a single entity is to increase the diversity of the contributors
>> > > >> and actively lobby for domain experts in the BI space to contribute.
>> > > >> Apache Palo intends to do this.
>> > > >>
>> > > >> ###An Excessive Fascination with the Apache Brand
>> > > >>
>> > > >> Palo is proposing to enter incubation at Apache in order to help
>> > > >> efforts to diversify the committer base, not so much to capitalize on
>> > > >> the Apache brand. The Palo project is in production use already inside
>> > > >> Baidu, but is not expected to be a Baidu product for external
>> > > >> customers. As such, the Palo project is not seeking to use the Apache
>> > > >> brand as a marketing tool.
>> > > >>
>> > > >> ##Documentation
>> > > >>
>> > > >> Information about Palo can be found at https://github.com/baidu/palo.
>> > > >> The following links provide more information about Palo in open
>> > > >> source:
>> > > >>
>> > > >> * Palo wiki site: https://github.com/baidu/palo/wiki
>> > > >> * Codebase at Github: https://github.com/baidu/palo
>> > > >> * Issue Tracking: https://github.com/baidu/palo/issues
>> > > >> * Overview: https://github.com/baidu/palo/wiki/Palo-Overview
>> > > >> * FAQ: https://github.com/baidu/palo/wiki/Palo-FAQ
>> > > >>
>> > > >> ##Initial Source
>> > > >>
>> > > >> Palo has been under development since 2017 by a team of engineers at
>> > > >> Baidu Inc. It is currently hosted on GitHub.com under an Apache license
>> > > >> at https://github.com/baidu/palo.
>> > > >>
>> > > >> ##External Dependencies
>> > > >>
>> > > >> Palo has the following external dependencies.
>> > > >>
>> > > >> * Google gflags (BSD)
>> > > >> * Google glog (BSD)
>> > > >> * Apache Thrift (Apache Software License v2.0)
>> > > >> * Apache Commons (Apache Software License v2.0)
>> > > >> * Boost (Boost Software License)
>> > > >> * OpenLdap (OpenLDAP Software License)
>> > > >> * rapidjson (Tencent)
>> > > >> * Google RE2 (BSD-style)
>> > > >> * lz4 (BSD)
>> > > >> * snappy (BSD)
>> > > >> * cyrus-sasl (CMU License)
>> > > >> * Twitter Bootstrap (Apache Software License v2.0)
>> > > >> * d3 (BSD)
>> > > >> * LLVM (BSD-like)
>> > > >>
>> > > >> Build and test dependencies:
>> > > >>
>> > > >> * ant (Apache Software License v2.0)
>> > > >> * Apache Maven (Apache Software License v2.0)
>> > > >> * cmake (BSD)
>> > > >> * clang (BSD)
>> > > >> * Google gtest (Apache Software License v2.0)
>> > > >>
>> > > >> ##Required Resources
>> > > >>
>> > > >> ###Mailing List
>> > > >>
>> > > >> There are currently no mailing lists. The usual mailing lists are
>> > > >> expected to be set up when entering incubation:
>> > > >>
>> > > >> * private@palo.incubator.apache.org
>> > > >> * dev@palo.incubator.apache.org
>> > > >> * commits@palo.incubator.apache.org
>> > > >>
>> > > >> ###Subversion Directory
>> > > >>
>> > > >> Upon entering incubation: https://github.com/baidu/palo.
>> > > >> After incubation, we want to move the existing repo from
>> > > >>https://github.com/baidu/palo to Apache infrastructure.
>> > > >>
>> > > >> ###Issue Tracking
>> > > >>
>> > > >> Palo currently uses GitHub to track issues. We would like to continue
>> > > >> to do so while we discuss migration possibilities with the ASF Infra
>> > > >> committee.
>> > > >>
>> > > >> ###Other Resources
>> > > >>
>> > > >> The existing code already has unit tests, so we will make use of the
>> > > >> existing Apache continuous testing infrastructure. The resulting load
>> > > >> should not be very large.
>> > > >>
>> > > >> ##Initial Committers
>> > > >>
>> > > >> * Ruyue Ma (https://github.com/maruyue, maruyue@baidu.com)
>> > > >> * Chun Zhao (https://github.com/imay, buaa.zhaoc@gmail.com)
>> > > >> * Mingyu Chen (https://github.com/morningman, chenmingyu@baidu.com)
>> > > >> * De Li (https://github.com/lide-reed, mailtolide@sina.com)
>> > > >> * Hao Chen (https://github.com/chenhao7253886, chenhao16@baidu.com)
>> > > >> * Chaoyong Li (https://github.com/cyongli, lichaoyong@baidu.com)
>> > > >> * Bin Lin (https://github.com/lingbin, lingbinlb@gmail.com)
>> > > >>
>> > > >> ##Affiliations
>> > > >>
>> > > >> The initial committers are employees of Baidu Inc. The nominated
>> > > >> mentors are employees of TODO.
>> > > >>
>> > > >> ##Sponsors
>> > > >>
>> > > >> ###Champion
>> > > >>
>> > > >> TODO
>> > > >>
>> > > >> ###Nominated Mentors
>> > > >>
>> > > >> * Sijie Guo, guosijie@gmail.com
>> > > >> * Luke Han, lukehan@apache.org
>> > > >> * Zheng Shao, zshao@apache.org
>> > > >>
>> > > >> ###Sponsoring Entity
>> > > >>
>> > > >> We are requesting the Incubator to sponsor this project.
>> > > >>
>> > > >
>> > > >---------------------------------------------------------------------
>> > > >To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> > > >For additional commands, e-mail: general-help@incubator.apache.org
>> > > >
>> > >
>> > >
>> > >
>> >
>> >
>> >
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>> >
>>
>
>
>
>-- 
>Todd Lipcon
>Software Engineer, Cloudera
