incubator-general mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Looking for Champion
Date Fri, 08 Jun 2018 16:12:12 GMT
On Thu, Jun 7, 2018 at 11:55 PM, Li,De(BDG) <lide@baidu.com> wrote:

> Hi, Jim
>
> Thank you for your response.
> Actually, we started Palo several years ago, and at that time we developed
> the storage engine based on Mesa technology.
> Meanwhile, we found that Impala is a very good MPP SQL query engine, so we
> integrated them together.
>

From what I can tell of the Palo source, it's not so much an integration as
a copied-and-modified codebase, right? i.e., Palo does not use Impala as a
dependency, but rather shares a lot of code from the Impala project that
has since diverged.


>
> With this integration, the goal of Palo is to implement a single,
> full-featured, MySQL-protocol-compatible data warehouse.
>

That sounds pretty similar to the goals of the Impala project. Impala isn't
MySQL-compatible at the moment but that seems more like a particular
feature that could be added rather than a distinct identity of the project.
Otherwise, Impala's goal is to be a full featured data warehouse engine as
well.

Generally Apache has no rules against multiple projects fulfilling similar
goals or use cases, even when those projects might compete. However, I think
it would be relatively unusual to incubate a project that appears to be
derived from a fork of an existing project, at least without first
considering whether the additional feature set could be contributed back to
the existing community.

-Todd


> On 2018/6/8 at 1:55 PM, "Jim Apple" <jbapple@apache.org> wrote:
>
> >Hello! As a contributor to Impala, I’d be interested in hearing thoughts
> >from the Palo community about integration between Impala and Palo.
> >
> >For instance, are there any apparent design goals of Impala that the Palo
> >community thinks are fundamentally incompatible with Palo?
> >
> >Thanks,
> >Jim
> >
> >On 2018/06/08 04:45:32, "Li,De(BDG)" <lide@baidu.com> wrote:
> >> Hi all,
> >>
> >> I am Reed, a developer working with the team on Palo (an MPP-based
> >>interactive SQL data warehouse).
> >> https://github.com/baidu/palo/wiki/Palo-Overview
> >>
> >> We propose to contribute Palo as an Apache Incubator project, and
> >> we are still looking for a possible Champion, if anyone would like to
> >>volunteer. Thanks a lot.
> >>
> >> Best Regards,
> >> Reed
> >>
> >> ===================
> >> The draft of the proposal is below:
> >>
> >> #Apache Palo
> >>
> >> ##Abstract
> >>
> >> Palo is an MPP-based interactive SQL data warehouse for reporting and
> >>analysis.
> >>
> >> ##Proposal
> >>
> >> We propose to contribute the Palo codebase and associated artifacts
> >>(e.g. documentation, web-site content etc.) to the Apache Software
> >>Foundation with the intent of forming a productive, meritocratic and
> >>open community around Palo’s continued development, according to the
> >>‘Apache Way’.
> >>
> >> Baidu owns several trademarks regarding Palo, and proposes to transfer
> >>ownership of those trademarks in full to the ASF.
> >>
> >> ###Overview of Palo
> >>
> >> Palo’s implementation consists of two daemons: Frontend (FE) and
> >>Backend (BE).
> >>
> >> **Frontend daemon** consists of a query coordinator and a catalog manager.
> >>The query coordinator is responsible for receiving users’ SQL queries,
> >>compiling queries, and managing query execution. The catalog manager is
> >>responsible for managing metadata such as databases, tables, partitions,
> >>and replicas. Several frontend daemons can be deployed to provide
> >>fault tolerance and load balancing.
> >>
> >> **Backend daemon** stores the data and executes the query fragments.
> >>Many backend daemons could also be deployed to provide scalability and
> >>fault-tolerance.
> >>
> >> A typical Palo cluster generally consists of several frontend daemons
> >>and dozens to hundreds of backend daemons.
> >>
> >> Users can use MySQL client tools to connect to any frontend daemon and
> >>submit SQL queries. The Frontend receives a query and compiles it into
> >>query plans executable by the Backend. The Frontend then sends the query
> >>plan fragments to the Backends. Each Backend builds a query execution DAG,
> >>and data is fetched and pipelined into the DAG. The final result is sent
> >>to the client via the Frontend. The distribution of query fragment
> >>execution takes minimizing data movement and maximizing scan locality as
> >>its main goals.
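
The Frontend/Backend flow described in the proposal can be sketched roughly as follows. This is a toy, in-process illustration only: the names `Fragment`, `Backend`, `Frontend`, `plan`, and `run` are hypothetical and do not come from the Palo codebase, and the "query" is reduced to a single numeric filter. It assumes one fragment per tablet, assigned to the backend that already holds that tablet, which mirrors the stated goal of maximizing scan locality.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Fragment:
    """A hypothetical unit of work sent from a frontend to a backend."""
    fragment_id: int
    tablet: str      # the local data partition this fragment scans
    threshold: int   # toy predicate: keep rows with value > threshold


class Backend:
    """Stores data locally and executes query fragments against it."""

    def __init__(self, name: str, tablets: Dict[str, List[int]]):
        self.name = name
        self.tablets = tablets

    def execute(self, fragment: Fragment) -> List[int]:
        rows = self.tablets[fragment.tablet]
        return [r for r in rows if r > fragment.threshold]


class Frontend:
    """Compiles a query into fragments and coordinates their execution."""

    def __init__(self, backends: List[Backend]):
        self.backends = backends

    def plan(self, threshold: int) -> List[Tuple[Backend, Fragment]]:
        # One fragment per tablet, placed on the backend that holds the
        # tablet, so no data has to move before the scan.
        assignments: List[Tuple[Backend, Fragment]] = []
        for be in self.backends:
            for tablet in sorted(be.tablets):
                frag = Fragment(len(assignments), tablet, threshold)
                assignments.append((be, frag))
        return assignments

    def run(self, threshold: int) -> List[int]:
        # Collect partial results from every backend and merge them
        # before returning the final result to the client.
        result: List[int] = []
        for be, frag in self.plan(threshold):
            result.extend(be.execute(frag))
        return sorted(result)


be1 = Backend("be1", {"t0": [1, 5, 9]})
be2 = Backend("be2", {"t1": [2, 6, 10]})
fe = Frontend([be1, be2])
print(fe.run(4))  # prints [5, 6, 9, 10]
```

A real deployment would replace the in-process calls with RPCs and the single filter with a full plan DAG; the sketch only shows which side plans and which side scans.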
> >>
> >> ##Background
> >>
> >> At Baidu, prior to Palo, different tools were deployed to meet diverse
> >>requirements in many ways. When a use case required the simultaneous
> >>availability of capabilities that could not all be provided by a single
> >>tool, users were forced to build hybrid architectures that stitched
> >>multiple tools together. We believe that they shouldn’t need to
> >>accept such inherent complexity. A storage system built to provide great
> >>performance across a broad range of workloads provides a more elegant
> >>solution to the problems that hybrid architectures aim to solve. Palo is
> >>the solution.
> >>
> >> Palo is designed to be a simple, single, tightly coupled system that does
> >>not depend on other systems. Palo provides high-concurrency, low-latency
> >>point query performance as well as high-throughput ad-hoc analytical
> >>queries. Palo provides bulk batch data loading as well as near real-time
> >>mini-batch data loading. Palo also provides high availability,
> >>reliability, fault tolerance, and scalability.
> >>
> >> ##Rationale
> >>
> >> Palo mainly integrates the technology of Google Mesa and Apache Impala.
> >>
> >> Mesa is a highly scalable analytic data storage system that stores
> >>critical measurement data related to Google's Internet advertising
> >>business. Mesa is designed to satisfy a complex and challenging set of
> >>users’ and systems’ requirements, including near real-time data
> >>ingestion and query ability, as well as high availability, reliability,
> >>fault tolerance, and scalability for large data and query volumes.
> >>
> >> Impala is a modern, open-source MPP SQL engine architected from the
> >>ground up for the Hadoop data processing environment. At present, by
> >>virtue of its superior performance and rich functionality, Impala is
> >>comparable to many commercial MPP database query engines. Mesa can
> >>satisfy many of our storage requirements; however, Mesa
> >>itself does not provide a SQL query engine. Impala is a very good MPP
> >>SQL query engine, but it lacks a perfect distributed storage engine.
> >>So in the end we chose to combine these two technologies.
> >>
> >> Learning from Mesa’s data model, we developed a distributed storage
> >>engine. Unlike Mesa, this storage engine does not rely on any
> >>distributed file system. We then deeply integrated this storage engine
> >>with the Impala query engine. Query compilation, query execution
> >>coordination, and catalog management of the storage engine are integrated
> >>into the frontend daemon; query execution and data storage are integrated
> >>into the backend daemon. With this integration, we implemented a single,
> >>full-featured, high-performance, state-of-the-art MPP database while
> >>maintaining simplicity.
> >>
> >> ##Current Status
> >>
> >> Palo has been an open source project on GitHub
> >>(https://github.com/baidu/palo).
> >>
> >> ###Meritocracy
> >>
> >> Palo has been deployed in production at Baidu and serves more than
> >>200 lines of business. It has demonstrated great performance benefits
> >>and has proved to be a better way to do reporting and analysis on big
> >>data. Still, we look forward to growing a rich user and developer
> >>community.
> >>
> >> ###Community
> >>
> >> Palo seeks to develop developer and user communities during incubation.
> >>
> >> ###Core Developers
> >>
> >> * Ruyue Ma (https://github.com/maruyue, maruyue@baidu.com)
> >> * Chun Zhao (https://github.com/imay, buaa.zhaoc@gmail.com)
> >> * Mingyu Chen (https://github.com/morningman, chenmingyu@baidu.com)
> >> * De Li (https://github.com/lide-reed, mailtolide@sina.com)
> >> * Hao Chen (https://github.com/chenhao7253886, chenhao16@baidu.com)
> >> * Chaoyong Li (https://github.com/cyongli, lichaoyong@baidu.com)
> >> * Bin Lin (https://github.com/lingbin, lingbinlb@gmail.com)
> >>
> >> ###Alignment
> >>
> >> Palo is related to several other Apache projects:
> >>
> >> * Palo can also read data stored in Apache Hadoop clusters powered by
> >>the HDFS filesystem.
> >> * Palo is closely integrated with Impala, which is also being proposed
> >>to the Incubator.
> >> * Palo uses Apache Thrift as its RPC and serialization framework of
> >>choice.
> >>
> >> ##Known Risks
> >>
> >> ###Orphaned Products
> >>
> >> The core developers of the Palo team plan to work full time on this
> >>project. There is very little risk of Palo getting orphaned since at
> >>least one large company (Baidu) is extensively using it in their
> >>production. For example, currently there are more than 200 use cases
> >>using Palo in production. Furthermore, since Palo was open sourced at
> >>the beginning of October 2017, it has received more than 660 stars and
> >>been forked nearly 170 times. We plan to extend and diversify this
> >>community further through Apache.
> >>
> >> ###Inexperience with Open Source
> >>
> >> The core developers are all active users and followers of open source.
> >>They are already committers and contributors to the Palo GitHub project.
> >>All have been involved with source code that has been released under
> >>an open source license, and several of them also have experience
> >>developing code in an open source environment. Though the core set of
> >>developers does not have Apache open source experience, there are plans to
> >>onboard individuals with Apache open source experience onto the project.
> >>
> >> ###Homogenous Developers
> >>
> >> Most of the core developers are from Baidu, but after Palo was open
> >>sourced, Palo received a lot of bug fixes and enhancements from other
> >>developers not working at Baidu.
> >>
> >> ###Reliance on Salaried Developers
> >>
> >> Baidu invested in Palo as its OLAP solution, and some of its key
> >>engineers are working full time on the project. In addition, since there
> >>is a growing Big Data need for scalable OLAP solutions, we look forward
> >>to other Apache developers and researchers contributing to the project.
> >>Key to addressing the risk associated with relying on salaried
> >>developers from a single entity is to increase the diversity of the
> >>contributors and actively lobby for domain experts in the BI space to
> >>contribute. Apache Palo intends to do this.
> >>
> >> ###An Excessive Fascination with the Apache Brand
> >>
> >> Palo is proposing to enter incubation at Apache in order to help
> >>efforts to diversify the committer-base, not so much to capitalize on
> >>the Apache brand. The Palo project is in production use already inside
> >>Baidu, but is not expected to be a Baidu product for external
> >>customers. As such, the Palo project is not seeking to use the Apache
> >>brand as a marketing tool.
> >>
> >> ##Documentation
> >>
> >> Information about Palo can be found at https://github.com/baidu/palo.
> >>The following links provide more information about Palo in open source:
> >>
> >> * Palo wiki site: https://github.com/baidu/palo/wiki
> >> * Codebase at Github: https://github.com/baidu/palo
> >> * Issue Tracking: https://github.com/baidu/palo/issues
> >> * Overview: https://github.com/baidu/palo/wiki/Palo-Overview
> >> * FAQ: https://github.com/baidu/palo/wiki/Palo-FAQ
> >>
> >> ##Initial Source
> >>
> >> Palo has been under development since 2017 by a team of engineers at
> >>Baidu Inc. It is currently hosted on Github.com under an Apache license
> >>at https://github.com/baidu/palo.
> >>
> >> ##External Dependencies
> >>
> >> Palo has the following external dependencies.
> >>
> >> * Google gflags (BSD)
> >> * Google glog (BSD)
> >> * Apache Thrift (Apache Software License v2.0)
> >> * Apache Commons (Apache Software License v2.0)
> >> * Boost (Boost Software License)
> >> * OpenLdap (OpenLDAP Software License)
> >> * rapidjson (Tencent)
> >> * Google RE2 (BSD-style)
> >> * lz4 (BSD)
> >> * snappy (BSD)
> >> * cyrus-sasl (CMU License)
> >> * Twitter Bootstrap (Apache Software License v2.0)
> >> * d3 (BSD)
> >> * LLVM (BSD-like)
> >>
> >> Build and test dependencies:
> >>
> >> * ant (Apache Software License v2.0)
> >> * Apache Maven (Apache Software License v2.0)
> >> * cmake (BSD)
> >> * clang (BSD)
> >> * Google gtest (Apache Software License v2.0)
> >>
> >> ##Required Resources
> >>
> >> ###Mailing List
> >>
> >> There are currently no mailing lists. The usual mailing lists are
> >>expected to be set up when entering incubation:
> >>
> >>
> >> private@palo.incubator.apache.org
> >> dev@palo.incubator.apache.org
> >> commits@palo.incubator.apache.org
> >>
> >> ###Subversion Directory
> >>
> >> Upon entering incubation: https://github.com/baidu/palo.
> >> After incubation, we want to move the existing repo from
> >>https://github.com/baidu/palo to Apache infrastructure.
> >>
> >> ###Issue Tracking
> >>
> >> Palo currently uses GitHub to track issues. We would like to continue to
> >>do so while we discuss migration possibilities with the ASF Infra
> >>committee.
> >>
> >> ###Other Resources
> >>
> >> The existing code already has unit tests so we will make use of
> >>existing Apache continuous testing infrastructure. The resulting load
> >>should not be very large.
> >>
> >> ##Initial Committers
> >>
> >> * Ruyue Ma (https://github.com/maruyue, maruyue@baidu.com)
> >> * Chun Zhao (https://github.com/imay, buaa.zhaoc@gmail.com)
> >> * Mingyu Chen (https://github.com/morningman, chenmingyu@baidu.com)
> >> * De Li (https://github.com/lide-reed, mailtolide@sina.com)
> >> * Hao Chen (https://github.com/chenhao7253886, chenhao16@baidu.com)
> >> * Chaoyong Li (https://github.com/cyongli, lichaoyong@baidu.com)
> >> * Bin Lin (https://github.com/lingbin, lingbinlb@gmail.com)
> >>
> >> ##Affiliations
> >>
> >> The initial committers are employees of Baidu Inc. The nominated
> >>mentors are employees of TODO.
> >>
> >> ##Sponsors
> >>
> >> ###Champion
> >>
> >> TODO
> >>
> >> ###Nominated Mentors
> >>
> >> * Sijie Guo, guosijie@gmail.com
> >> * Luke Han, lukehan@apache.org
> >> * Zheng Shao, zshao@apache.org
> >>
> >> ###Sponsoring Entity
> >>
> >> We are requesting the Incubator to sponsor this project.
> >>
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> >For additional commands, e-mail: general-help@incubator.apache.org
> >
>
>



-- 
Todd Lipcon
Software Engineer, Cloudera
