incubator-general mailing list archives

From Hugo Louro <hmclo...@gmail.com>
Subject Re: [VOTE] Accept the Iceberg project for incubation
Date Tue, 13 Nov 2018 17:24:27 GMT
+1 (non-binding)

> On Nov 13, 2018, at 9:19 AM, Owen O'Malley <owen.omalley@gmail.com> wrote:
> 
> +1 (binding)
> 
>> On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher <dave2wave@comcast.net> wrote:
>> 
>> +1 (binding)
>> 
>>> On Nov 13, 2018, at 9:10 AM, Matt Sicker <boards@gmail.com> wrote:
>>> 
>>> +1 binding
>>> 
>>>> On Tue, 13 Nov 2018 at 11:09, Ryan Blue <blue@apache.org> wrote:
>>>> 
>>>> +1 (binding)
>>>> 
>>>>> On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue <blue@apache.org> wrote:
>>>>> 
>>>>> The discuss thread seems to have reached consensus, so I propose accepting
>>>>> the Iceberg project for incubation.
>>>>> 
>>>>> The proposal is copied below and in the wiki:
>>>>> https://wiki.apache.org/incubator/IcebergProposal
>>>>> 
>>>>> Please vote on whether to accept Iceberg in the next 72 hours:
>>>>> 
>>>>> [ ] +1, accept Iceberg for incubation
>>>>> [ ] -1, reject the Iceberg proposal because . . .
>>>>> 
>>>>> Thank you for reviewing the proposal and voting,
>>>>> 
>>>>> rb
>>>>> ------------------------------
>>>>> Iceberg Proposal
>>>>> 
>>>>> Abstract
>>>>> 
>>>>> Iceberg is a table format for large, slow-moving tabular data.
>>>>> 
>>>>> It is designed to improve on the de-facto standard table layout built into
>>>>> Apache Hive, Presto, and Apache Spark.
>>>>> 
>>>>> Proposal
>>>>> 
>>>>> The purpose of Iceberg is to provide SQL-like tables that are backed by
>>>>> large sets of data files. Iceberg is similar to the Hive table layout, the
>>>>> de-facto standard structure used to track files in a table, but provides
>>>>> additional guarantees and performance optimizations:
>>>>> 
>>>>>  - Atomicity - Each change to the table will either complete or fail.
>>>>>  “Do or do not. There is no try.”
>>>>>  - Snapshot isolation - Reads use one and only one snapshot of a table
>>>>>  at some time without holding a lock.
>>>>>  - Safe schema evolution - A table’s schema can change in well-defined
>>>>>  ways, without breaking older data files.
>>>>>  - Column projection - An engine may request a subset of the available
>>>>>  columns, including nested fields.
>>>>>  - Predicate pushdown - An engine can push filters into read planning
>>>>>  to improve performance using partition data and file-level statistics.
>>>>> 
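As an illustration of the predicate pushdown item above, here is a minimal sketch of how a planner might skip data files using per-file min/max statistics. It is only a sketch under stated assumptions: `DataFile`, `plan_files`, and the `ts` column are hypothetical names chosen for the example, not Iceberg's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical per-file metadata: a path plus min/max of a `ts` column."""
    path: str
    ts_min: int
    ts_max: int


def plan_files(files, lo, hi):
    """Keep only files whose [ts_min, ts_max] range can overlap lo <= ts <= hi.

    Files whose statistics rule out any matching row are dropped during
    planning, so they are never opened or split.
    """
    return [f for f in files if f.ts_max >= lo and f.ts_min <= hi]


# Hypothetical table contents, assumed for illustration only.
files = [
    DataFile("a.parquet", 0, 99),
    DataFile("b.parquet", 100, 199),
    DataFile("c.parquet", 200, 299),
]

selected = plan_files(files, 150, 220)
print([f.path for f in selected])
```

With the filter `150 <= ts <= 220`, the first file is pruned because its maximum is below the range; the other two remain candidates and still have to be read and filtered row by row.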
>>>>> Iceberg does NOT define a new file format. All data is stored in Apache
>>>>> Avro, Apache ORC, or Apache Parquet files.
>>>>> 
>>>>> Additionally, Iceberg is designed to work well when data files are stored
>>>>> in cloud blob stores, even when those systems provide weaker guarantees
>>>>> than a file system, including:
>>>>> 
>>>>>  - Eventual consistency in the namespace
>>>>>  - High latency for directory listings
>>>>>  - No renames of objects
>>>>>  - No folder hierarchy
>>>>> 
>>>>> Rationale
>>>>> 
>>>>> Initial benchmarks show dramatic improvements in query planning. For
>>>>> example, Netflix’s Atlas use case stores time-series metrics from
>>>>> Netflix runtime systems, with 1 month of data stored across 2.7 million
>>>>> files in 2,688 partitions:
>>>>> 
>>>>>  - Hive table using Parquet:
>>>>>     - 400k+ splits, not combined
>>>>>     - Explain query: 9.6 minutes wall time (planning only)
>>>>>  - Iceberg table with partition filtering:
>>>>>     - 15,218 splits, combined
>>>>>     - Planning: 10 seconds
>>>>>     - Query wall time: 13 minutes
>>>>>  - Iceberg table with partition and min/max filtering:
>>>>>     - 412 splits
>>>>>     - Planning: 25 seconds
>>>>>     - Query wall time: 42 seconds
>>>>> 
>>>>> These performance gains combined with the cross-engine compatibility are a
>>>>> very compelling story.
>>>>> 
>>>>> Initial Goals
>>>>> 
>>>>> The initial goal will be to move the existing codebase to Apache and
>>>>> integrate with the Apache development process and infrastructure. A primary
>>>>> goal of incubation will be to grow and diversify the Iceberg community. We
>>>>> are well aware that the project community is largely comprised of
>>>>> individuals from a single company. We aim to change that during incubation.
>>>>> 
>>>>> Current Status
>>>>> 
>>>>> As previously mentioned, Iceberg is under active development at Netflix,
>>>>> and is being used in processing large volumes of data in Amazon EC2.
>>>>> 
>>>>> Iceberg license documentation is already based on Apache guidelines for
>>>>> LICENSE and NOTICE content.
>>>>> 
>>>>> Meritocracy
>>>>> 
>>>>> We value meritocracy and we understand that it is the basis for an open
>>>>> community that encourages multiple companies and individuals to contribute
>>>>> and be invested in the project’s future. We will encourage and monitor
>>>>> participation and make sure to extend privileges and responsibilities to
>>>>> all contributors.
>>>>> 
>>>>> Community
>>>>> 
>>>>> Iceberg is currently being used by developers at Netflix and a growing
>>>>> number of users are actively using it in production environments. Iceberg
>>>>> has received contributions from developers working at Hortonworks, WeWork,
>>>>> and Palantir. By bringing Iceberg to Apache we aim to assure current and
>>>>> future contributors that the Iceberg community is meritocratic and open, in
>>>>> order to broaden and diversify the user and developer community.
>>>>> 
>>>>> Core Developers
>>>>> 
>>>>> Iceberg was initially developed at Netflix and is under active
>>>>> development. We believe Iceberg will be of interest to a broad range of
>>>>> users and developers and that incubating the project at the ASF will help
>>>>> us build a diverse, sustainable community.
>>>>> 
>>>>> Alignment
>>>>> 
>>>>> Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive, ORC,
>>>>> Parquet, Pig, and Spark. We anticipate integration with additional Apache
>>>>> projects as the Iceberg community and interest in the project grow.
>>>>> 
>>>>> Known Risks
>>>>> 
>>>>> Orphaned Products
>>>>> 
>>>>> Netflix is committed to the future development of Iceberg and understands
>>>>> that graduation to a TLP, while preferable, is not the only positive
>>>>> outcome of incubation.
>>>>> 
>>>>> Should the Iceberg project be accepted by the Incubator, the prospective
>>>>> PPMC would be willing to agree to a target incubation period of 2 years or
>>>>> less, knowing that every Incubator project incurs a certain cost in terms
>>>>> of ASF infrastructure and volunteer time.
>>>>> 
>>>>> Inexperience with Open Source
>>>>> 
>>>>> Three of the initial committers are Apache members and Incubator PMC
>>>>> members. They will work with the other community members to teach them the
>>>>> Apache Way.
>>>>> 
>>>>> Homogenous Developers
>>>>> 
>>>>> The majority of the committers work at Netflix, though we are committed to
>>>>> recruiting and developing additional committers from a wide spectrum of
>>>>> industries and backgrounds.
>>>>> 
>>>>> Reliance on Salaried Developers
>>>>> 
>>>>> It is expected that Iceberg development will occur on both salaried time
>>>>> and on volunteer time, after hours. Most of the initial committers are paid
>>>>> by Netflix to contribute to this project. However, they are all passionate
>>>>> about the project, and we are both confident and hopeful that the project
>>>>> will continue even if no salaried developers contribute to the project.
>>>>> 
>>>>> Relationships with Other Apache Products
>>>>> 
>>>>> As mentioned in the Alignment section, Iceberg utilizes a number of
>>>>> existing Apache projects (Avro, Hadoop, Hive, ORC, Parquet, Pig, & Spark),
>>>>> and we expect that list to expand as the community grows and diversifies.
>>>>> Any Apache project in the big data space that needs to store or process
>>>>> tabular data would be potentially relevant.
>>>>> 
>>>>> An Excessive Fascination with the Apache Brand
>>>>> 
>>>>> We are applying to the Incubator process because we think it is the next
>>>>> logical step for the Iceberg project after open-sourcing the code. This
>>>>> proposal is not for the purpose of generating publicity. Rather, we want to
>>>>> make sure to create a very inclusive and meritocratic community, outside
>>>>> the umbrella of a single company. Netflix has a long history of
>>>>> contributing to Apache projects and the Iceberg developers and contributors
>>>>> understand the implication of making it an Apache project.
>>>>> 
>>>>> Required Resources
>>>>> 
>>>>> Mailing lists
>>>>> 
>>>>>  - dev@iceberg.incubator.apache.org
>>>>>  - commits@iceberg.incubator.apache.org
>>>>>  - private@iceberg.incubator.apache.org
>>>>> 
>>>>> The podling may also create a user mailing list, if needed.
>>>>> 
>>>>> Source Control and Issue Tracking
>>>>> 
>>>>> The Iceberg podling would use Apache’s gitbox integration to sync between
>>>>> GitHub and Apache infrastructure. The podling would use GitHub issues and
>>>>> pull requests for community engagement.
>>>>> 
>>>>> Current Resources
>>>>> 
>>>>>  - Initial source: https://github.com/Netflix/iceberg
>>>>>  - Java documentation:
>>>>>  https://netflix.github.io/iceberg/current/javadoc/index.html?com/netflix/iceberg/package-summary.html
>>>>>  - Table specification:
>>>>>  https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit
>>>>> 
>>>>> Source and Intellectual Property Submission Plan
>>>>> 
>>>>> The Iceberg source code in GitHub is currently licensed under Apache
>>>>> License v2.0 and the copyright is assigned to Netflix. If Iceberg becomes
>>>>> an Incubator project at the ASF, Netflix will transfer the source code and
>>>>> trademark ownership to the Apache Software Foundation via a Software Grant
>>>>> Agreement.
>>>>> 
>>>>> External Dependencies
>>>>> 
>>>>> External dependencies licensed under Apache License 2.0
>>>>> 
>>>>>  - Guava https://github.com/google/guava
>>>>>  - Jackson https://github.com/FasterXML/jackson-core
>>>>>  - Joda-Time http://www.joda.org/joda-time/
>>>>> 
>>>>> External dependencies licensed under the MIT License
>>>>> 
>>>>>  - SLF4J https://www.slf4j.org/
>>>>>  - Mockito https://github.com/mockito/mockito
>>>>> 
>>>>> ASF Projects
>>>>> 
>>>>>  - Apache Avro
>>>>>  - Apache Hadoop
>>>>>  - Apache Hive
>>>>>  - Apache ORC
>>>>>  - Apache Parquet
>>>>>  - Apache Pig
>>>>>  - Apache Spark
>>>>> 
>>>>> Cryptography
>>>>> 
>>>>> We do not expect Iceberg to be a controlled export item due to the use of
>>>>> encryption.
>>>>> 
>>>>> Initial Committers and Affiliations
>>>>> 
>>>>>  - Ryan Blue blue@apache.org (Netflix)
>>>>>  - Parth Brahmbhatt parth@apache.org (Netflix)
>>>>>  - Julien Le Dem julien@apache.org (WeWork)
>>>>>  - Owen O’Malley omalley@apache.org (Hortonworks)
>>>>>  - Daniel Weeks dweeks@apache.org (Netflix)
>>>>> 
>>>>> Sponsors and Nominated Mentors
>>>>> 
>>>>>  - Champion and mentor: Owen O’Malley omalley@apache.org
>>>>>  - Mentor: Ryan Blue blue@apache.org
>>>>>  - Mentor: Julien Le Dem julien@apache.org
>>>>> 
>>>>> Sponsoring Entity
>>>>> 
>>>>> The Apache Incubator
>>>>> --
>>>>> Ryan Blue
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Ryan Blue
>>>> 
>>> 
>>> 
>>> --
>>> Matt Sicker <boards@gmail.com>
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> For additional commands, e-mail: general-help@incubator.apache.org
>> 
>> 
