incubator-general mailing list archives

From Akira Ajisaka <aajis...@apache.org>
Subject Re: [VOTE] Accept Hudi into the Apache Incubator
Date Tue, 15 Jan 2019 05:05:16 GMT
+1 (binding)

-Akira

On Tue, Jan 15, 2019 at 10:25 Jakob Homan <jghoman@gmail.com> wrote:
>
> +1 (binding)
>
> -Jakob
>
> On Mon, Jan 14, 2019 at 5:22 PM Mayank Bansal <mabansal@gmail.com> wrote:
> >
> > +1
> >
> > On Mon, Jan 14, 2019 at 5:11 PM Mohammad Islam <mislam77@yahoo.com.invalid>
> > wrote:
> >
> > >  +1
> > >     On Monday, January 14, 2019, 12:46:48 PM PST, Kenneth Knowles <
> > > kenn@apache.org> wrote:
> > >
> > >  +1
> > >
> > > On Mon, Jan 14, 2019 at 9:38 AM Felix Cheung <felixcheung@apache.org>
> > > wrote:
> > >
> > > > +1
> > > >
> > > >
> > > > On Mon, Jan 14, 2019 at 3:20 AM Suneel Marthi
> > > > <suneel_marthi@yahoo.com.invalid> wrote:
> > > >
> > > > > +1
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > > > On Jan 13, 2019, at 5:34 PM, Thomas Weise <thw@apache.org> wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > > Following the discussion of the Hudi proposal in [1], this is a
> > > > > > > vote on accepting Hudi into the Apache Incubator, per the ASF
> > > > > > > policy [2] and voting rules [3].
> > > > > >
> > > > > > > A vote for accepting a new Apache Incubator podling is a
> > > > > > > majority vote. Everyone is welcome to vote, but only
> > > > > > > Incubator PMC member votes are binding.
> > > > > >
> > > > > > This vote will run for at least 72 hours. Please VOTE as
> > > > > > follows:
> > > > > >
> > > > > > [ ] +1 Accept Hudi into the Apache Incubator
> > > > > > [ ] +0 Abstain
> > > > > > [ ] -1 Do not accept Hudi into the Apache Incubator because
...
> > > > > >
> > > > > > The proposal is included below, but you can also access it on
> > > > > > the wiki [4].
> > > > > >
> > > > > > Thanks for reviewing and voting,
> > > > > > Thomas
> > > > > >
> > > > > > [1]
> > > > > >
> > > > >
> > > >
> > > https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
> > > > > >
> > > > > > [2]
> > > > > >
> > > > >
> > > >
> > > https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
> > > > > >
> > > > > > [3] http://www.apache.org/foundation/voting.html
> > > > > >
> > > > > > [4] https://wiki.apache.org/incubator/HudiProposal
> > > > > >
> > > > > >
> > > > > >
> > > > > > = Hudi Proposal =
> > > > > >
> > > > > > == Abstract ==
> > > > > >
> > > > > > > Hudi is a big-data storage library that provides atomic upserts
> > > > > > > and incremental data streams.
> > > > > >
> > > > > > Hudi manages data stored in Apache Hadoop and other API compatible
> > > > > > distributed file systems/cloud stores.
> > > > > >
> > > > > > == Proposal ==
> > > > > >
> > > > > > > Hudi provides the ability to atomically upsert datasets with new
> > > > > > > values in near-real time, making data available quickly to
> > > > > > > existing query engines like Apache Hive, Apache Spark, & Presto.
> > > > > > > Additionally, Hudi provides a sequence of changes to a dataset
> > > > > > > from a given point-in-time to enable incremental data pipelines
> > > > > > > that yield greater efficiency & lower latency than their typical
> > > > > > > batch counterparts. By carefully managing the number & sizes of
> > > > > > > files, Hudi greatly aids both query engines (e.g., always
> > > > > > > providing well-sized files) and underlying storage (e.g., HDFS
> > > > > > > NameNode memory consumption).
> > > > > >
> > > > > > > Hudi is largely implemented as an Apache Spark library that
> > > > > > > reads/writes data from/to a Hadoop-compatible filesystem. SQL
> > > > > > > queries on Hudi datasets are supported via specialized Apache
> > > > > > > Hadoop input formats that understand Hudi’s storage layout.
> > > > > > > Currently, Hudi manages datasets using a combination of Apache
> > > > > > > Parquet & Apache Avro file/serialization formats.
> > > > > >
> > > > > > == Background ==
> > > > > >
> > > > > > > Apache Hadoop distributed filesystem (HDFS) & other compatible
> > > > > > > cloud storage systems (e.g., Amazon S3, Google Cloud, Microsoft
> > > > > > > Azure) serve as longer term analytical storage for thousands of
> > > > > > > organizations. Typical analytical datasets are built by reading
> > > > > > > data from a source (e.g., upstream databases, messaging buses, or
> > > > > > > other datasets), transforming the data, writing results back to
> > > > > > > storage, & making it available for analytical queries--all of
> > > > > > > this typically accomplished in batch jobs which operate in a bulk
> > > > > > > fashion on partitions of datasets. Such a style of processing
> > > > > > > typically incurs large delays in making data available to
> > > > > > > queries, as well as a lot of complexity in carefully partitioning
> > > > > > > datasets to guarantee latency SLAs.
> > > > > >
> > > > > > > The need for fresher/faster analytics has increased enormously in
> > > > > > > the past few years, as evidenced by the popularity of stream
> > > > > > > processing systems like Apache Spark & Apache Flink, and
> > > > > > > messaging systems like Apache Kafka. By using an updateable state
> > > > > > > store to incrementally compute & instantly reflect new results to
> > > > > > > queries and using a “tailable” messaging bus to publish these
> > > > > > > results to other downstream jobs, such systems employ a different
> > > > > > > approach to building analytical datasets. Even though this
> > > > > > > approach yields low latency, the amount of data managed in such
> > > > > > > real-time data-marts is typically limited in comparison to the
> > > > > > > aforementioned longer term storage options. As a result, the
> > > > > > > overall data architecture has become more complex with more
> > > > > > > moving parts and specialized systems, leading to duplication of
> > > > > > > data and a strain on usability.
> > > > > >
> > > > > > > Hudi takes a hybrid approach. Instead of moving vast amounts of
> > > > > > > batch data to streaming systems, we simply add the streaming
> > > > > > > primitives (upserts & incremental consumption) onto existing
> > > > > > > batch processing technologies. We believe that by adding some
> > > > > > > missing blocks to an existing Hadoop stack, we are able to
> > > > > > > provide similar capabilities right on top of Hadoop at a reduced
> > > > > > > cost and with an increased efficiency, greatly simplifying the
> > > > > > > overall architecture in the process.
> > > > > >
> > > > > > > Hudi was originally developed at Uber (original name “Hoodie”) to
> > > > > > > address such broad inefficiencies in ingest, ETL, & ML pipelines
> > > > > > > across Uber’s data ecosystem that required the upsert &
> > > > > > > incremental consumption primitives supported by Hudi.
> > > > > >
> > > > > > == Rationale ==
> > > > > >
> > > > > > > We truly believe the capabilities supported by Hudi would be
> > > > > > > increasingly useful for big-data ecosystems, as data volumes &
> > > > > > > need for faster data continue to increase. A detailed description
> > > > > > > of target use-cases can be found at
> > > > > > > https://uber.github.io/hudi/use_cases.html.
> > > > > >
> > > > > > > Given our reliance on so many great Apache projects, we believe
> > > > > > > that the Apache way of open source community driven development
> > > > > > > will enable us to evolve Hudi in collaboration with a diverse set
> > > > > > > of contributors who can bring new ideas into the project.
> > > > > >
> > > > > > == Initial Goals ==
> > > > > >
> > > > > > > * Move the existing codebase, website, documentation, and mailing
> > > > > > > lists to an Apache-hosted infrastructure.
> > > > > > > * Integrate with the Apache development process.
> > > > > > > * Ensure all dependencies are compliant with Apache License
> > > > > > > version 2.0.
> > > > > > > * Incrementally develop and release per Apache guidelines.
> > > > > >
> > > > > > == Current Status ==
> > > > > >
> > > > > > > Hudi is a stable project used in production at Uber since 2016
> > > > > > > and was open sourced under the Apache License, Version 2.0 in
> > > > > > > 2017. At Uber, Hudi manages 4000+ tables holding several
> > > > > > > petabytes, bringing our Hadoop warehouse from several hours of
> > > > > > > data delays to under 30 minutes, over the past two years. The
> > > > > > > source code is currently hosted at github.com
> > > > > > > (https://github.com/uber/hudi), which will seed the Apache git
> > > > > > > repository.
> > > > > >
> > > > > > === Meritocracy ===
> > > > > >
> > > > > > > We are fully committed to open, transparent, & meritocratic
> > > > > > > interactions with our community. In fact, one of the primary
> > > > > > > motivations for us to enter the incubation process is to be able
> > > > > > > to rely on Apache best practices that can ensure meritocracy.
> > > > > > > This will eventually help incorporate the best ideas back into
> > > > > > > the project & enable contributors to continue investing their
> > > > > > > time in the project. Current guidelines
> > > > > > > (https://uber.github.io/hudi/community.html#becoming-a-committer)
> > > > > > > have already put in place a meritocratic process which we will
> > > > > > > replace with Apache guidelines during incubation.
> > > > > >
> > > > > > === Community ===
> > > > > >
> > > > > > > The Hudi community is fairly young, since the project was open
> > > > > > > sourced only in early 2017. Currently, Hudi has committers from
> > > > > > > Uber & Snowflake. We have a vibrant set of contributors (~46
> > > > > > > members in our Slack channel) including Shopify, DoubleVerify,
> > > > > > > Vungle, & others, who have either submitted patches or filed
> > > > > > > issues with Hudi pipelines in either early production or testing
> > > > > > > stages. Our primary goal during the incubation would be to grow
> > > > > > > the community and groom our existing active contributors into
> > > > > > > committers.
> > > > > >
> > > > > > === Core Developers ===
> > > > > >
> > > > > > > Current core developers work at Uber & Snowflake. We are
> > > > > > > confident that incubation will help us grow a diverse community
> > > > > > > in an open & collaborative way.
> > > > > >
> > > > > > === Alignment ===
> > > > > >
> > > > > > > Hudi is designed as a general purpose analytical storage
> > > > > > > abstraction that integrates with multiple Apache projects: Apache
> > > > > > > Spark, Apache Hive, Apache Hadoop. It was built using multiple
> > > > > > > Apache projects, including Apache Parquet and Apache Avro, that
> > > > > > > support near-real time analytics right on top of existing Apache
> > > > > > > Hadoop data lakes. Our sincere hope is that being a part of the
> > > > > > > Apache foundation would enable us to drive the future of the
> > > > > > > project in alignment with the other Apache projects for the
> > > > > > > benefit of thousands of organizations that already leverage these
> > > > > > > projects.
> > > > > >
> > > > > > == Known Risks ==
> > > > > >
> > > > > > === Orphaned products ===
> > > > > >
> > > > > > > The risk of abandonment of Hudi is low. It is used in production
> > > > > > > at Uber for petabytes of data and other companies (mentioned in
> > > > > > > community section) are either evaluating or in the early stage
> > > > > > > for production use. Uber is committed to further development of
> > > > > > > the project and to investing resources towards the Apache
> > > > > > > processes & building the community during the incubation period.
> > > > > >
> > > > > > === Inexperience with Open Source ===
> > > > > >
> > > > > > > Even though the initial committers are new to the Apache world,
> > > > > > > some have considerable open source experience - Vinoth Chandar
> > > > > > > (LinkedIn Voldemort, Chromium), Prasanna Rajaperumal (Cloudera
> > > > > > > experience), Zeeshan Qureshi (Chromium) & Balaji Varadarajan
> > > > > > > (LinkedIn Databus). We have been successfully managing the
> > > > > > > current open source community, answering questions and taking
> > > > > > > feedback already. Moreover, we hope to obtain guidance and
> > > > > > > mentorship from current ASF members to help us succeed with the
> > > > > > > incubation.
> > > > > >
> > > > > > === Length of Incubation ===
> > > > > >
> > > > > > > We expect the project to be in incubation for 2 years or less.
> > > > > >
> > > > > > === Homogenous Developers ===
> > > > > >
> > > > > > > Currently, the lead developers for Hudi are from Uber. However,
> > > > > > > we have an active set of early contributors/collaborators from
> > > > > > > Shopify, DoubleVerify and Vungle, that we hope will increase the
> > > > > > > diversity going forward. Once again, a primary motivation for
> > > > > > > incubation is to facilitate this in the Apache way.
> > > > > >
> > > > > > === Reliance on Salaried Developers ===
> > > > > >
> > > > > > > Both the current committers & early contributors have several
> > > > > > > years of core expertise around data systems. Current committers
> > > > > > > are very passionate about the project and have already invested
> > > > > > > hundreds of hours towards helping & building the community. Thus,
> > > > > > > even with employer changes, we expect they will be able to
> > > > > > > actively engage in the project either because they will be
> > > > > > > working in similar areas even with newer employers or out of
> > > > > > > belief in the project.
> > > > > >
> > > > > > === Relationships with Other Apache Products ===
> > > > > >
> > > > > > > To the best of our knowledge, there are no directly competing
> > > > > > > projects that offer the full Hudi feature set, namely upserts,
> > > > > > > incremental streams, efficient storage/file management, &
> > > > > > > snapshot isolation/rollbacks, in a coherent way. However, some
> > > > > > > projects share common goals and technical elements and we will
> > > > > > > highlight them here. Hive ACID/Kudu both offer upsert
> > > > > > > capabilities without storage management/incremental streams. The
> > > > > > > recent Iceberg project offers similar snapshot
> > > > > > > isolation/rollbacks, but not upserts or other data plane
> > > > > > > features. A detailed comparison with their trade-offs can be
> > > > > > > found at https://uber.github.io/hudi/comparison.html.
> > > > > >
> > > > > > > We are committed to open collaboration with such Apache projects
> > > > > > > and to incorporating changes to Hudi or contributing patches to
> > > > > > > other projects, with the goal of making it easier for the
> > > > > > > community at large to adopt these open source technologies.
> > > > > >
> > > > > > === Excessive Fascination with the Apache Brand ===
> > > > > >
> > > > > > > This proposal is not for the purpose of generating publicity. We
> > > > > > > have already been doing talks/meetups independently that have
> > > > > > > helped us build our community. We are drawn towards Apache as a
> > > > > > > potential way of ensuring that our open source community
> > > > > > > management is successful early on so Hudi can evolve into a
> > > > > > > broadly accepted--and used--method of managing data on Hadoop.
> > > > > >
> > > > > > == Documentation ==
> > > > > > > [1] Detailed documentation can be found at
> > > > > > > https://uber.github.io/hudi/
> > > > > >
> > > > > > == Initial Source ==
> > > > > >
> > > > > > > The codebase is currently hosted on GitHub
> > > > > > > (https://github.com/uber/hudi). During incubation, the codebase
> > > > > > > will be migrated to an Apache infrastructure. The source code is
> > > > > > > already Apache 2.0 licensed.
> > > > > >
> > > > > > == Source and Intellectual Property Submission Plan ==
> > > > > >
> > > > > > > Current code is Apache 2.0 licensed and the copyright is assigned
> > > > > > > to Uber. If the project enters the Incubator, Uber will transfer
> > > > > > > the source code & trademark ownership to ASF via a Software Grant
> > > > > > > Agreement.
> > > > > >
> > > > > > == External Dependencies ==
> > > > > >
> > > > > > > Non-Apache dependencies are listed below:
> > > > > >
> > > > > > * JCommander (1.48) Apache-2.0
> > > > > > * Kryo (4.0.0) BSD-2-Clause
> > > > > > * Kryo (2.21) BSD-3-Clause
> > > > > > * Jackson-annotations (2.6.4) Apache-2.0
> > > > > > * Jackson-annotations (2.6.5) Apache-2.0
> > > > > > * jackson-databind (2.6.4) Apache-2.0
> > > > > > * jackson-databind (2.6.5) Apache-2.0
> > > > > > * Jackson datatype: Guava (2.9.4) Apache-2.0
> > > > > > * docker-java (3.1.0-rc-3) Apache-2.0
> > > > > > * Guava: Google Core Libraries for Java (20.0) Apache-2.0
> > > > > > * bijection-avro (0.9.2) Apache-2.0
> > > > > > * com.twitter.common:objectsize (0.0.12) Apache-2.0
> > > > > > * Ascii Table (0.2.5) Apache-2.0
> > > > > > * config (3.0.0) Apache-2.0
> > > > > > * utils (3.0.0) Apache-2.0
> > > > > > * kafka-avro-serializer (3.0.0) Apache-2.0
> > > > > > * kafka-schema-registry-client (3.0.0) Apache-2.0
> > > > > > * Metrics Core (3.1.1) Apache-2.0
> > > > > > * Graphite Integration for Metrics (3.1.1) Apache-2.0
> > > > > > * Joda-Time (2.9.6) Apache-2.0
> > > > > > * JUnit CPL-1.0
> > > > > > * Awaitility (3.1.2) Apache-2.0
> > > > > > * jersey-connectors-apache (2.17) GPL-2.0-only CDDL-1.0
> > > > > > * jersey-container-servlet-core (2.17) GPL-2.0-only CDDL-1.0
> > > > > > * jersey-core-server (2.17) GPL-2.0-only CDDL-1.0
> > > > > > * htrace-core (3.0.4) Apache-2.0
> > > > > > * Mockito (1.10.19) MIT
> > > > > > * scalatest (3.0.1) Apache-2.0
> > > > > > * Spring Shell (1.2.0.RELEASE) Apache-2.0
> > > > > >
> > > > > > All of them are Apache compatible
> > > > > >
> > > > > > == Cryptography ==
> > > > > >
> > > > > > No cryptographic libraries used
> > > > > >
> > > > > > == Required Resources ==
> > > > > >
> > > > > > === Mailing lists ===
> > > > > >
> > > > > > * private@hudi.incubator.apache.org (with moderated subscriptions)
> > > > > > * dev@hudi.incubator.apache.org
> > > > > > * commits@hudi.incubator.apache.org
> > > > > > * user@hudi.incubator.apache.org
> > > > > >
> > > > > > === Git Repositories ===
> > > > > >
> > > > > > > Git is the preferred source control system:
> > > > > > > git://git.apache.org/incubator-hudi
> > > > > >
> > > > > > === Issue Tracking ===
> > > > > >
> > > > > > > We prefer to use the Apache gitbox integration to sync GitHub &
> > > > > > > Apache infrastructure, and rely on GitHub issues & pull requests
> > > > > > > for community engagement. If this is not possible, then we prefer
> > > > > > > JIRA: Hudi (HUDI)
> > > > > >
> > > > > > == Initial Committers ==
> > > > > >
> > > > > > * Vinoth Chandar (vinoth at uber dot com) (Uber)
> > > > > > * Nishith Agarwal (nagarwal at uber dot com) (Uber)
> > > > > > * Balaji Varadarajan (varadarb at uber dot com) (Uber)
> > > > > > > * Prasanna Rajaperumal (prasanna dot raj at gmail dot com) (Snowflake)
> > > > > > * Zeeshan Qureshi (zeeshan dot qureshi at shopify dot com) (Shopify)
> > > > > > * Anbu Cheeralan (alunarbeach at gmail dot com) (DoubleVerify)
> > > > > >
> > > > > > == Sponsors ==
> > > > > >
> > > > > > === Champion ===
> > > > > > Julien Le Dem (julien at apache dot org)
> > > > > >
> > > > > > === Nominated Mentors ===
> > > > > >
> > > > > > * Luciano Resende (lresende at apache dot org)
> > > > > > > * Thomas Weise (thw at apache dot org)
> > > > > > * Kishore Gopalakrishna (kishoreg at apache dot org)
> > > > > > * Suneel Marthi (smarthi at apache dot org)
> > > > > >
> > > > > > === Sponsoring Entity ===
> > > > > >
> > > > > > The Incubator PMC
> > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > > > For additional commands, e-mail: general-help@incubator.apache.org
> > > > >
> > > > >
> > > >
> >
> > --
> > Thanks and Regards,
> > Mayank
> > Cell: 408-718-9370
>
>


