incubator-general mailing list archives

From "Li,De(BDG)" <l...@baidu.com>
Subject Re: Looking for Champion
Date Sat, 09 Jun 2018 13:12:54 GMT
Hi Tim, Todd,

Thank you for your response.

We are sorry that we have not contributed any improvements to Impala so far.
I think we will do so soon; it is a good opportunity for us to participate
in the open source community and learn to do things in the Apache way.

One reason is that we thought most of our patches might not be accepted by
Impala. Because there are big differences between Palo and Impala, our
patches apply only to Palo.

Firstly, as a query engine for Hadoop, Impala depends deeply on HDFS and
HBase (at least it did several years ago), but Palo is just the opposite.
We strive to build a single tool which does not depend on any other system.
Simplicity (of developing, deploying and using) and meeting many data
serving requirements in a single system are the main features of Palo.
So we wanted only the query engine from Impala, not other parts such as
reading and writing Hive data.

Secondly, because we introduced the Mesa data model, the Catalog is
different from Impala's.
We developed an in-memory Catalog and also support Rollup and the
aggregation data model. As a consequence, we had to change the SQL grammar
inherited from Impala.
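
As a rough illustration of what the aggregation data model and Rollup mean here, a minimal Python sketch follows. It is not Palo code: the table layout, the column names and the `aggregate` helper are assumptions made up for this example. Rows that share the same key columns are merged by summing the value column, and a rollup is the same aggregation applied over a subset of the key columns.

```python
from collections import defaultdict

def aggregate(rows, key_cols, value_col):
    """Merge rows with equal key columns by summing the value column."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        totals[key] += row[value_col]
    # Emit one merged row per distinct key, in sorted key order.
    return [dict(zip(key_cols, key), **{value_col: total})
            for key, total in sorted(totals.items())]

# Hypothetical raw rows: (date, city) are key columns, clicks is the value column.
rows = [
    {"date": "2018-06-01", "city": "Beijing", "clicks": 3},
    {"date": "2018-06-01", "city": "Beijing", "clicks": 4},
    {"date": "2018-06-01", "city": "Shanghai", "clicks": 5},
    {"date": "2018-06-02", "city": "Beijing", "clicks": 1},
]

# Base table: aggregated on the full key (date, city).
base = aggregate(rows, ["date", "city"], "clicks")

# Rollup: the base table aggregated again on a subset of the key (date only).
rollup_by_date = aggregate(base, ["date"], "clicks")
```

A query that groups only by `date` could then be answered from `rollup_by_date` instead of scanning the larger base table.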

Thirdly, there is a big difference in cluster management and node deployment.
In contrast to Impala, query compiling, query execution coordination and
catalog management of the storage engine are integrated into the frontend
daemon. Query execution and data storage are integrated into the backend
daemon.
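
The frontend/backend split described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `Frontend` and `Backend` classes, their method names and the single "count matching rows" query are hypothetical and do not correspond to Palo's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Backend:
    """Stores a slice of the data and executes query fragments on it."""
    rows: list = field(default_factory=list)

    def execute_fragment(self, predicate):
        # Scan only the locally stored rows and return a partial count.
        return sum(1 for row in self.rows if predicate(row))

@dataclass
class Frontend:
    """Coordinates a query: one fragment per backend, then a merge step."""
    backends: list

    def count_where(self, predicate):
        # Send the same fragment to every backend and merge the partial results.
        partials = [be.execute_fragment(predicate) for be in self.backends]
        return sum(partials)

# Two backends, each holding part of the data; one frontend coordinating them.
fe = Frontend(backends=[Backend(rows=[1, 5, 9]), Backend(rows=[2, 8])])
result = fe.count_where(lambda v: v > 4)
```

Each backend touches only its own rows, so only the small partial counts move between daemons.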

Now, as you mentioned, since Impala's goal is also to be a full-featured
data warehouse engine, maybe some of Palo's features would be useful to
Impala as well.
If it is possible, we would be very happy to contribute code to Impala.
We very much appreciate the Impala community and look forward to
cooperating with it in whatever way.

Best Regards,
Reed



On 2018/6/9 at 12:18 AM, "Tim Armstrong" <tarmstrong@cloudera.com> wrote:

>> Meanwhile we found Impala is a very good MPP SQL query engine, so we
>> integrated them together.
>
>Palo didn't integrate with Impala, it forked Impala's codebase and
>embedded it in its own repository. I don't remember any attempts from the
>Palo team to engage with the Impala community or attempt to work with us to
>contribute any improvements.
>
>It looks like Palo is still pulling in new code from Impala. E.g. this
>commit includes a bunch of code I wrote as part of IMPALA-3200:
>https://github.com/baidu/palo/commit/2419384e8a211f10e7636afc6d3423700ba22b5a#diff-1c501d9a8b5c3d1d1cce48d5e1fb0edf
>
>The code isn't owned by any individual, I contributed it to Apache and
>it's free for anyone to do what they want to do with it, but pulling in
>improvements from other projects without any attempt to attribute it or
>contribute improvements back seems contrary to the Apache way.
>
>Anyway, maybe incubation is an opportunity for us to work together, but
>I'd hope that if Palo does go into incubation that it will rethink some of
>the practices it's been following.
>
>On Fri, Jun 8, 2018 at 9:12 AM, Todd Lipcon <todd@cloudera.com> wrote:
>
>> On Thu, Jun 7, 2018 at 11:55 PM, Li,De(BDG) <lide@baidu.com> wrote:
>>
>> > Hi, Jim
>> >
>> > Thank you for your response.
>> > Actually, we started Palo several years ago, and at that time we
>> > developed the storage engine based on Mesa technology.
>> > Meanwhile we found Impala is a very good MPP SQL query engine, so we
>> > integrated them together.
>> >
>>
>> From what I can tell of the Palo source, it's not so much an integration
>> as a copied-and-modified codebase, right? i.e. Palo does not use Impala as
>> a dependency, but rather shares a lot of code from the Impala project that
>> has since diverged.
>>
>>
>> >
>> > With this integration, the goal of Palo is to implement a single,
>> > full-featured, mysql protocol compatible data warehousing.
>> >
>>
>> That sounds pretty similar to the goals of the Impala project. Impala
>> isn't MySQL-compatible at the moment but that seems more like a particular
>> feature that could be added rather than a distinct identity of the
>> project. Otherwise, Impala's goal is to be a full featured data warehouse
>> engine as well.
>>
>> Generally Apache has no rules against multiple projects fulfilling
>> similar goals or use cases, even when those projects might compete.
>> However I think it would be relatively unusual to incubate a project that
>> appears to be derived from a fork of an existing project, at least without
>> first considering whether the additional feature set could be contributed
>> back to the existing community.
>>
>> -Todd
>>
>>
>> > On 2018/6/8 at 1:55 PM, "Jim Apple" <jbapple@apache.org> wrote:
>> >
>> > >Hello! As a contributor to Impala, I’d be interested in hearing
>> > >thoughts from the Palo community about integration between Impala and
>> > >Palo.
>> > >
>> > >For instance, are there any apparent design goals of Impala that the
>> > >Palo community thinks are fundamentally incompatible with Palo?
>> > >
>> > >Thanks,
>> > >Jim
>> > >
>> > >On 2018/06/08 04:45:32, "Li,De(BDG)" <lide@baidu.com> wrote:
>> > >> Hi all,
>> > >>
>> > >> I am Reed, a developer on the Palo team (Palo is an MPP-based
>> > >>interactive SQL data warehouse).
>> > >> https://github.com/baidu/palo/wiki/Palo-Overview
>> > >>
>> > >> We propose to contribute Palo as an Apache Incubator project, and
>> > >> we are still looking for a possible Champion, if anyone would like to
>> > >>volunteer. Thanks a lot.
>> > >>
>> > >> Best Regards,
>> > >> Reed
>> > >>
>> > >> ===================
>> > >> The draft of the proposal is below:
>> > >>
>> > >> #Apache Palo
>> > >>
>> > >> ##Abstract
>> > >>
>> > >> Palo is an MPP-based interactive SQL data warehouse for reporting
>> > >>and analysis.
>> > >>
>> > >> ##Proposal
>> > >>
>> > >> We propose to contribute the Palo codebase and associated artifacts
>> > >>(e.g. documentation, web-site content etc.) to the Apache Software
>> > >>Foundation with the intent of forming a productive, meritocratic and
>> > >>open community around Palo’s continued development, according to the
>> > >>‘Apache Way’.
>> > >>
>> > >> Baidu owns several trademarks regarding Palo, and proposes to
>> > >>transfer ownership of those trademarks in full to the ASF.
>> > >>
>> > >> ###Overview of Palo
>> > >>
>> > >> Palo’s implementation consists of two daemons: Frontend (FE) and
>> > >>Backend (BE).
>> > >>
>> > >> **Frontend daemon** consists of a query coordinator and a catalog
>> > >>manager. The query coordinator is responsible for receiving users’ SQL
>> > >>queries, compiling them and managing query execution. The catalog
>> > >>manager is responsible for managing metadata such as databases, tables,
>> > >>partitions and replicas. Several frontend daemons can be deployed to
>> > >>provide fault tolerance and load balancing.
>> > >>
>> > >> **Backend daemon** stores the data and executes the query fragments.
>> > >>Many backend daemons can also be deployed to provide scalability and
>> > >>fault tolerance.
>> > >>
>> > >> A typical Palo cluster generally consists of several frontend
>> > >>daemons and dozens to hundreds of backend daemons.
>> > >>
>> > >> Users can use MySQL client tools to connect to any frontend daemon
>> > >>and submit SQL queries. The Frontend receives a query and compiles it
>> > >>into query plans executable by the Backend. The Frontend then sends the
>> > >>query plan fragments to the Backends. Each Backend builds a query
>> > >>execution DAG; data is fetched and pipelined into the DAG. The final
>> > >>result is sent to the client via the Frontend. The distribution of
>> > >>query fragment execution has minimizing data movement and maximizing
>> > >>scan locality as its main goals.
>> > >>
>> > >> ##Background
>> > >>
>> > >> At Baidu, prior to Palo, different tools were deployed to solve
>> > >>diverse requirements in many ways. When a use case required the
>> > >>simultaneous availability of capabilities that could not all be
>> > >>provided by a single tool, users were forced to build hybrid
>> > >>architectures that stitch multiple tools together, but we believe that
>> > >>they shouldn’t need to accept such inherent complexity. A storage
>> > >>system built to provide great performance across a broad range of
>> > >>workloads provides a more elegant solution to the problems that hybrid
>> > >>architectures aim to solve. Palo is that solution.
>> > >>
>> > >> Palo is designed to be a simple, single, tightly coupled system that
>> > >>does not depend on other systems. Palo provides high-concurrency,
>> > >>low-latency point queries as well as high-throughput ad-hoc analytical
>> > >>queries. Palo provides bulk batch data loading as well as near
>> > >>real-time mini-batch data loading. Palo also provides high
>> > >>availability, reliability, fault tolerance, and scalability.
>> > >>
>> > >> ##Rationale
>> > >>
>> > >> Palo mainly integrates the technology of Google Mesa and Apache
>> > >>Impala.
>> > >>
>> > >> Mesa is a highly scalable analytic data storage system that stores
>> > >>critical measurement data related to Google's Internet advertising
>> > >>business. Mesa is designed to satisfy a complex and challenging set of
>> > >>users’ and systems’ requirements, including near real-time data
>> > >>ingestion and query ability, as well as high availability,
>> > >>reliability, fault tolerance, and scalability for large data and query
>> > >>volumes.
>> > >>
>> > >> Impala is a modern, open-source MPP SQL engine architected from the
>> > >>ground up for the Hadoop data processing environment. At present, by
>> > >>virtue of its superior performance and rich functionality, Impala is
>> > >>comparable to many commercial MPP database query engines. Mesa can
>> > >>satisfy many of our storage requirements, but Mesa itself does not
>> > >>provide a SQL query engine; Impala is a very good MPP SQL query engine,
>> > >>but it lacks a perfect distributed storage engine. So in the end we
>> > >>chose to combine these two technologies.
>> > >>
>> > >> Learning from Mesa’s data model, we developed a distributed storage
>> > >>engine. Unlike Mesa, this storage engine does not rely on any
>> > >>distributed file system. We then deeply integrated this storage engine
>> > >>with the Impala query engine. Query compiling, query execution
>> > >>coordination and catalog management of the storage engine are
>> > >>integrated into the frontend daemon; query execution and data storage
>> > >>are integrated into the backend daemon. With this integration, we
>> > >>implemented a single, full-featured, high-performance,
>> > >>state-of-the-art MPP database while maintaining simplicity.
>> > >>
>> > >> ##Current Status
>> > >>
>> > >> Palo has been an open source project on GitHub
>> > >>(https://github.com/baidu/palo).
>> > >>
>> > >> ###Meritocracy
>> > >>
>> > >> Palo has been deployed in production at Baidu and serves more than
>> > >>200 lines of business. It has demonstrated great performance benefits
>> > >>and has proved to be a better way to do reporting and analysis on big
>> > >>data. Still, we look forward to growing a rich user and developer
>> > >>community.
>> > >>
>> > >> ###Community
>> > >>
>> > >> Palo seeks to develop developer and user communities during
>> > >>incubation.
>> > >>
>> > >> ###Core Developers
>> > >>
>> > >> * Ruyue Ma (https://github.com/maruyue, maruyue@baidu.com)
>> > >> * Chun Zhao (https://github.com/imay, buaa.zhaoc@gmail.com)
>> > >> * Mingyu Chen (https://github.com/morningman, chenmingyu@baidu.com)
>> > >> * De Li (https://github.com/lide-reed, mailtolide@sina.com)
>> > >> * Hao Chen (https://github.com/chenhao7253886, chenhao16@baidu.com)
>> > >> * Chaoyong Li (https://github.com/cyongli, lichaoyong@baidu.com)
>> > >> * Bin Lin (https://github.com/lingbin, lingbinlb@gmail.com)
>> > >>
>> > >> ###Alignment
>> > >>
>> > >> Palo is related to several other Apache projects:
>> > >>
>> > >> * Palo can also read data stored in Apache Hadoop clusters powered
>> > >>by the HDFS filesystem.
>> > >> * Palo is closely integrated with Impala, which is also being
>> > >>proposed to the Incubator.
>> > >> * Palo uses Apache Thrift as its RPC and serialization framework of
>> > >>choice.
>> > >>
>> > >> ##Known Risks
>> > >>
>> > >> ###Orphaned Products
>> > >>
>> > >> The core developers of the Palo team plan to work full time on this
>> > >>project. There is very little risk of Palo getting orphaned since at
>> > >>least one large company (Baidu) is extensively using it in
>> > >>production. For example, currently there are more than 200 use cases
>> > >>using Palo in production. Furthermore, since Palo was open sourced at
>> > >>the beginning of October 2017, it has received more than 660 stars and
>> > >>been forked nearly 170 times. We plan to extend and diversify this
>> > >>community further through Apache.
>> > >>
>> > >> ###Inexperience with Open Source
>> > >>
>> > >> The core developers are all active users and followers of open
>> > >>source. They are already committers and contributors to the Palo
>> > >>GitHub project. All have been involved with source code that has been
>> > >>released under an open source license, and several of them also have
>> > >>experience developing code in an open source environment. Though the
>> > >>core developers do not have Apache open source experience, there are
>> > >>plans to onboard individuals with Apache open source experience on to
>> > >>the project.
>> > >>
>> > >> ###Homogenous Developers
>> > >>
>> > >> Most of the core developers are from Baidu, but after Palo was open
>> > >>sourced, it received a lot of bug fixes and enhancements from
>> > >>developers not working at Baidu.
>> > >>
>> > >> ###Reliance on Salaried Developers
>> > >>
>> > >> Baidu invested in Palo as its OLAP solution and some of its key
>> > >>engineers are working full time on the project. In addition, since
>> > >>there is a growing Big Data need for scalable OLAP solutions, we look
>> > >>forward to other Apache developers and researchers contributing to the
>> > >>project. Key to addressing the risk associated with relying on
>> > >>salaried developers from a single entity is increasing the diversity
>> > >>of the contributors and actively lobbying for domain experts in the BI
>> > >>space to contribute. Apache Palo intends to do this.
>> > >>
>> > >> ###An Excessive Fascination with the Apache Brand
>> > >>
>> > >> Palo is proposing to enter incubation at Apache in order to help
>> > >>efforts to diversify the committer base, not so much to capitalize on
>> > >>the Apache brand. The Palo project is already in production use inside
>> > >>Baidu, but is not expected to be a Baidu product for external
>> > >>customers. As such, the Palo project is not seeking to use the Apache
>> > >>brand as a marketing tool.
>> > >>
>> > >> ##Documentation
>> > >>
>> > >> Information about Palo can be found at
>> > >>https://github.com/baidu/palo.
>> > >>The following links provide more information about Palo in open
>> > >>source:
>> > >>
>> > >> * Palo wiki site: https://github.com/baidu/palo/wiki
>> > >> * Codebase at Github: https://github.com/baidu/palo
>> > >> * Issue Tracking: https://github.com/baidu/palo/issues
>> > >> * Overview: https://github.com/baidu/palo/wiki/Palo-Overview
>> > >> * FAQ: https://github.com/baidu/palo/wiki/Palo-FAQ
>> > >>
>> > >> ##Initial Source
>> > >>
>> > >> Palo has been under development since 2017 by a team of engineers
>> > >>at Baidu Inc. It is currently hosted on Github.com under an Apache
>> > >>license at https://github.com/baidu/palo.
>> > >>
>> > >> ##External Dependencies
>> > >>
>> > >> Palo has the following external dependencies.
>> > >>
>> > >> * Google gflags (BSD)
>> > >> * Google glog (BSD)
>> > >> * Apache Thrift (Apache Software License v2.0)
>> > >> * Apache Commons (Apache Software License v2.0)
>> > >> * Boost (Boost Software License)
>> > >> * OpenLdap (OpenLDAP Software License)
>> > >> * rapidjson (Tencent)
>> > >> * Google RE2 (BSD-style)
>> > >> * lz4 (BSD)
>> > >> * snappy (BSD)
>> > >> * cyrus-sasl (CMU License)
>> > >> * Twitter Bootstrap (Apache Software License v2.0)
>> > >> * d3 (BSD)
>> > >> * LLVM (BSD-like)
>> > >>
>> > >> Build and test dependencies:
>> > >>
>> > >> * ant (Apache Software License v2.0)
>> > >> * Apache Maven (Apache Software License v2.0)
>> > >> * cmake (BSD)
>> > >> * clang (BSD)
>> > >> * Google gtest (Apache Software License v2.0)
>> > >>
>> > >> ##Required Resources
>> > >>
>> > >> ###Mailing List
>> > >>
>> > >> There are currently no mailing lists. The usual mailing lists are
>> > >>expected to be set up when entering incubation:
>> > >>
>> > >> private@palo.incubator.apache.org
>> > >> dev@palo.incubator.apache.org
>> > >> commits@palo.incubator.apache.org
>> > >>
>> > >> ###Subversion Directory
>> > >>
>> > >> Upon entering incubation: https://github.com/baidu/palo.
>> > >> After incubation, we want to move the existing repo from
>> > >>https://github.com/baidu/palo to Apache infrastructure.
>> > >>
>> > >> ###Issue Tracking
>> > >>
>> > >> Palo currently uses GitHub to track issues. We would like to continue
>> > >>to do so while we discuss migration possibilities with the ASF Infra
>> > >>committee.
>> > >>
>> > >> ###Other Resources
>> > >>
>> > >> The existing code already has unit tests so we will make use of
>> > >>existing Apache continuous testing infrastructure. The resulting load
>> > >>should not be very large.
>> > >>
>> > >> ##Initial Committers
>> > >>
>> > >> * Ruyue Ma (https://github.com/maruyue, maruyue@baidu.com)
>> > >> * Chun Zhao (https://github.com/imay, buaa.zhaoc@gmail.com)
>> > >> * Mingyu Chen (https://github.com/morningman, chenmingyu@baidu.com)
>> > >> * De Li (https://github.com/lide-reed, mailtolide@sina.com)
>> > >> * Hao Chen (https://github.com/chenhao7253886, chenhao16@baidu.com)
>> > >> * Chaoyong Li (https://github.com/cyongli, lichaoyong@baidu.com)
>> > >> * Bin Lin (https://github.com/lingbin, lingbinlb@gmail.com)
>> > >>
>> > >> ##Affiliations
>> > >>
>> > >> The initial committers are employees of Baidu Inc. The nominated
>> > >>mentors are employees of TODO.
>> > >>
>> > >> ##Sponsors
>> > >>
>> > >> ###Champion
>> > >>
>> > >> TODO
>> > >>
>> > >> ###Nominated Mentors
>> > >>
>> > >> * Sijie Guo, guosijie@gmail.com
>> > >> * Luke Han, lukehan@apache.org
>> > >> * Zheng Shao, zshao@apache.org
>> > >>
>> > >> ###Sponsoring Entity
>> > >>
>> > >> We are requesting the Incubator to sponsor this project.
>> > >>
>> > >
>> > >---------------------------------------------------------------------
>> > >To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> > >For additional commands, e-mail: general-help@incubator.apache.org
>> > >
>> >
>> >
>> >
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>

