incubator-general mailing list archives

From Arthur Wiedmer <art...@apache.org>
Subject Re: [VOTE] Accept the Iceberg project for incubation
Date Tue, 13 Nov 2018 17:28:33 GMT
+1

(Non-binding)

Best,
Arthur

On Tue, Nov 13, 2018, 09:24 Hugo Louro <hmclouro@gmail.com> wrote:

> +1 (non-binding)
>
> > On Nov 13, 2018, at 9:19 AM, Owen O'Malley <owen.omalley@gmail.com> wrote:
> >
> > +1 (binding)
> >
> >> On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher <dave2wave@comcast.net> wrote:
> >>
> >> +1 (binding)
> >>
> >>> On Nov 13, 2018, at 9:10 AM, Matt Sicker <boards@gmail.com> wrote:
> >>>
> >>> +1 binding
> >>>
> >>>> On Tue, 13 Nov 2018 at 11:09, Ryan Blue <blue@apache.org> wrote:
> >>>>
> >>>> +1 (binding)
> >>>>
> >>>>> On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue <blue@apache.org> wrote:
> >>>>>
> >>>>> The discuss thread seems to have reached consensus, so I propose
> >>>>> accepting the Iceberg project for incubation.
> >>>>>
> >>>>> The proposal is copied below and in the wiki:
> >>>>> https://wiki.apache.org/incubator/IcebergProposal
> >>>>>
> >>>>> Please vote on whether to accept Iceberg in the next 72 hours:
> >>>>>
> >>>>> [ ] +1, accept Iceberg for incubation
> >>>>> [ ] -1, reject the Iceberg proposal because . . .
> >>>>>
> >>>>> Thank you for reviewing the proposal and voting,
> >>>>>
> >>>>> rb
> >>>>> ------------------------------
> >>>>> Iceberg Proposal
> >>>>> Abstract
> >>>>>
> >>>>> Iceberg is a table format for large, slow-moving tabular data.
> >>>>>
> >>>>> It is designed to improve on the de-facto standard table layout built
> >>>>> into Apache Hive, Presto, and Apache Spark.
> >>>>> Proposal
> >>>>>
> >>>>> The purpose of Iceberg is to provide SQL-like tables that are backed by
> >>>>> large sets of data files. Iceberg is similar to the Hive table layout,
> >>>>> the de-facto standard structure used to track files in a table, but
> >>>>> provides additional guarantees and performance optimizations:
> >>>>>
> >>>>>  - Atomicity - Each change to the table will be complete or will
> >>>>>  fail. “Do or do not. There is no try.”
> >>>>>  - Snapshot isolation - Reads use one and only one snapshot of a table
> >>>>>  at some time without holding a lock.
> >>>>>  - Safe schema evolution - A table’s schema can change in well-defined
> >>>>>  ways, without breaking older data files.
> >>>>>  - Column projection - An engine may request a subset of the available
> >>>>>  columns, including nested fields.
> >>>>>  - Predicate pushdown - An engine can push filters into read planning
> >>>>>  to improve performance using partition data and file-level statistics.
> >>>>>
> >>>>> Iceberg does NOT define a new file format. All data is stored in Apache
> >>>>> Avro, Apache ORC, or Apache Parquet files.
> >>>>>
> >>>>> Additionally, Iceberg is designed to work well when data files are stored
> >>>>> in cloud blob stores, even when those systems provide weaker guarantees
> >>>>> than a file system, including:
> >>>>>
> >>>>>  - Eventual consistency in the namespace
> >>>>>  - High latency for directory listings
> >>>>>  - No renames of objects
> >>>>>  - No folder hierarchy
> >>>>>
> >>>>> Rationale
> >>>>>
> >>>>> Initial benchmarks show dramatic improvements in query planning. For
> >>>>> example, Netflix’s Atlas use case stores time-series metrics from
> >>>>> Netflix runtime systems; one month of data is stored across 2.7 million
> >>>>> files in 2,688 partitions:
> >>>>>
> >>>>>  - Hive table using Parquet:
> >>>>>     - 400k+ splits, not combined
> >>>>>     - Explain query: 9.6 minutes wall time (planning only)
> >>>>>  - Iceberg table with partition filtering:
> >>>>>     - 15,218 splits, combined
> >>>>>     - Planning: 10 seconds
> >>>>>     - Query wall time: 13 minutes
> >>>>>  - Iceberg table with partition and min/max filtering:
> >>>>>     - 412 splits
> >>>>>     - Planning: 25 seconds
> >>>>>     - Query wall time: 42 seconds
> >>>>>
> >>>>> These performance gains combined with the cross-engine compatibility
> >>>>> are a very compelling story.
> >>>>> Initial Goals
> >>>>>
> >>>>> The initial goal will be to move the existing codebase to Apache and
> >>>>> integrate with the Apache development process and infrastructure. A
> >>>>> primary goal of incubation will be to grow and diversify the Iceberg
> >>>>> community. We are well aware that the project community is largely
> >>>>> comprised of individuals from a single company. We aim to change that
> >>>>> during incubation.
> >>>>> Current Status
> >>>>>
> >>>>> As previously mentioned, Iceberg is under active development at Netflix,
> >>>>> and is being used in processing large volumes of data in Amazon EC2.
> >>>>>
> >>>>> Iceberg license documentation is already based on Apache guidelines for
> >>>>> LICENSE and NOTICE content.
> >>>>> Meritocracy
> >>>>>
> >>>>> We value meritocracy and we understand that it is the basis for an open
> >>>>> community that encourages multiple companies and individuals to
> >>>>> contribute and be invested in the project’s future. We will encourage
> >>>>> and monitor participation and make sure to extend privileges and
> >>>>> responsibilities to all contributors.
> >>>>> Community
> >>>>>
> >>>>> Iceberg is currently being used by developers at Netflix and a growing
> >>>>> number of users are actively using it in production environments.
> >>>>> Iceberg has received contributions from developers working at
> >>>>> Hortonworks, WeWork, and Palantir. By bringing Iceberg to Apache we aim
> >>>>> to assure current and future contributors that the Iceberg community is
> >>>>> meritocratic and open, in order to broaden and diversify the user and
> >>>>> developer community.
> >>>>> Core Developers
> >>>>>
> >>>>> Iceberg was initially developed at Netflix and is under active
> >>>>> development. We believe Iceberg will be of interest to a broad range of
> >>>>> users and developers and that incubating the project at the ASF will
> >>>>> help us build a diverse, sustainable community.
> >>>>> Alignment
> >>>>>
> >>>>> Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive, ORC,
> >>>>> Parquet, Pig, and Spark. We anticipate integration with additional
> >>>>> Apache projects as the Iceberg community and interest in the project
> >>>>> grow.
> >>>>> Known Risks
> >>>>> Orphaned Products
> >>>>>
> >>>>> Netflix is committed to the future development of Iceberg and
> >>>>> understands that graduation to a TLP, while preferable, is not the only
> >>>>> positive outcome of incubation.
> >>>>>
> >>>>> Should the Iceberg project be accepted by the Incubator, the prospective
> >>>>> PPMC would be willing to agree to a target incubation period of 2 years
> >>>>> or less, knowing that every Incubator project incurs a certain cost in
> >>>>> terms of ASF infrastructure and volunteer time.
> >>>>> Inexperience with Open Source
> >>>>>
> >>>>> Three of the initial committers are Apache members and Incubator PMC
> >>>>> members. They will work with the other community members to teach them
> >>>>> the Apache Way.
> >>>>> Homogenous Developers
> >>>>>
> >>>>> The majority of the committers work at Netflix, though we are committed
> >>>>> to recruiting and developing additional committers from a wide spectrum
> >>>>> of industries and backgrounds.
> >>>>> Reliance on Salaried Developers
> >>>>>
> >>>>> It is expected that Iceberg development will occur on both salaried
> >>>>> time and on volunteer time, after hours. Most of the initial committers
> >>>>> are paid by Netflix to contribute to this project. However, they are all
> >>>>> passionate about the project, and we are both confident and hopeful that
> >>>>> the project will continue even if no salaried developers contribute to
> >>>>> the project.
> >>>>> Relationships with Other Apache Products
> >>>>>
> >>>>> As mentioned in the Rationale section, Iceberg utilizes a number of
> >>>>> existing Apache projects (Avro, Hadoop, Hive, ORC, Parquet, Pig, &
> >>>>> Spark), and we expect that list to expand as the community grows and
> >>>>> diversifies. Any Apache project in the big data space that needs to
> >>>>> store or process tabular data would be potentially relevant.
> >>>>> An Excessive Fascination with the Apache Brand
> >>>>>
> >>>>> We are applying to the Incubator process because we think it is the next
> >>>>> logical step for the Iceberg project after open-sourcing the code. This
> >>>>> proposal is not for the purpose of generating publicity. Rather, we want
> >>>>> to make sure to create a very inclusive and meritocratic community,
> >>>>> outside the umbrella of a single company. Netflix has a long history of
> >>>>> contributing to Apache projects and the Iceberg developers and
> >>>>> contributors understand the implication of making it an Apache project.
> >>>>> Required Resources
> >>>>> Mailing lists
> >>>>>
> >>>>>  - dev@iceberg.incubator.apache.org
> >>>>>  - commits@iceberg.incubator.apache.org
> >>>>>  - private@iceberg.incubator.apache.org
> >>>>>
> >>>>> The podling may also create a user mailing list, if needed.
> >>>>> Source Control and Issue Tracking
> >>>>>
> >>>>> The Iceberg podling would use Apache’s gitbox integration to sync
> >>>>> between github and Apache infrastructure. The podling would use github
> >>>>> issues and pull requests for community engagement.
> >>>>> Current Resources
> >>>>>
> >>>>>  - Initial source: https://github.com/Netflix/iceberg
> >>>>>  - Java documentation:
> >>>>>  https://netflix.github.io/iceberg/current/javadoc/index.html?com/netflix/iceberg/package-summary.html
> >>>>>  - Table specification:
> >>>>>  https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit
> >>>>>
> >>>>> Source and Intellectual Property Submission Plan
> >>>>>
> >>>>> The Iceberg source code in Github is currently licensed under Apache
> >>>>> License v2.0 and the copyright is assigned to Netflix. If Iceberg
> >>>>> becomes an Incubator project at the ASF, Netflix will transfer the
> >>>>> source code and trademark ownership to the Apache Software Foundation
> >>>>> via a Software Grant Agreement.
> >>>>> External Dependencies
> >>>>>
> >>>>> External dependencies licensed under Apache License 2.0
> >>>>>
> >>>>>  - Guava https://github.com/google/guava
> >>>>>  - Jackson https://github.com/FasterXML/jackson-core
> >>>>>  - Joda-Time http://www.joda.org/joda-time/
> >>>>>
> >>>>> External dependencies licensed under the MIT License
> >>>>>
> >>>>>  - SLF4J https://www.slf4j.org/
> >>>>>  - Mockito https://github.com/mockito/mockito
> >>>>>
> >>>>> ASF Projects
> >>>>>
> >>>>>  - Apache Avro
> >>>>>  - Apache Hadoop
> >>>>>  - Apache Hive
> >>>>>  - Apache ORC
> >>>>>  - Apache Parquet
> >>>>>  - Apache Pig
> >>>>>  - Apache Spark
> >>>>>
> >>>>> Cryptography
> >>>>>
> >>>>> We do not expect Iceberg to be a controlled export item due to the use
> >>>>> of encryption.
> >>>>> Initial Committers and Affiliations
> >>>>>
> >>>>>  - Ryan Blue blue@apache.org (Netflix)
> >>>>>  - Parth Brahmbhatt parth@apache.org (Netflix)
> >>>>>  - Julien Le Dem julien@apache.org (WeWork)
> >>>>>  - Owen O’Malley omalley@apache.org (Hortonworks)
> >>>>>  - Daniel Weeks dweeks@apache.org (Netflix)
> >>>>>
> >>>>> Sponsors and Nominated Mentors
> >>>>>
> >>>>>  - Champion and mentor: Owen O’Malley omalley@apache.org
> >>>>>  - Mentor: Ryan Blue blue@apache.org
> >>>>>  - Mentor: Julien Le Dem julien@apache.org
> >>>>>
> >>>>> Sponsoring Entity
> >>>>>
> >>>>> The Apache Incubator
> >>>>> --
> >>>>> Ryan Blue
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>>
> >>>
> >>>
> >>> --
> >>> Matt Sicker <boards@gmail.com>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> >> For additional commands, e-mail: general-help@incubator.apache.org
> >>
> >>
