incubator-general mailing list archives

From Julian Hyde <jh...@apache.org>
Subject Re: [VOTE] Accept the Iceberg project for incubation
Date Tue, 13 Nov 2018 17:40:47 GMT
+1 (binding)

Julian


> On Nov 13, 2018, at 9:28 AM, Arthur Wiedmer <arthur@apache.org> wrote:
> 
> +1
> 
> (Non-binding)
> 
> Best,
> Arthur
> 
> On Tue, Nov 13, 2018, 09:24 Hugo Louro <hmclouro@gmail.com> wrote:
> 
>> +1 (non-binding)
>> 
>>> On Nov 13, 2018, at 9:19 AM, Owen O'Malley <owen.omalley@gmail.com> wrote:
>>> 
>>> +1 (binding)
>>> 
>>>> On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher <dave2wave@comcast.net> wrote:
>>>> 
>>>> +1 (binding)
>>>> 
>>>>> On Nov 13, 2018, at 9:10 AM, Matt Sicker <boards@gmail.com> wrote:
>>>>> 
>>>>> +1 binding
>>>>> 
>>>>>> On Tue, 13 Nov 2018 at 11:09, Ryan Blue <blue@apache.org> wrote:
>>>>>> 
>>>>>> +1 (binding)
>>>>>> 
>>>>>>> On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue <blue@apache.org> wrote:
>>>>>>> 
>>>>>>> The discuss thread seems to have reached consensus, so I propose
>>>>>>> accepting the Iceberg project for incubation.
>>>>>>> 
>>>>>>> The proposal is copied below and in the wiki:
>>>>>>> https://wiki.apache.org/incubator/IcebergProposal
>>>>>>> 
>>>>>>> Please vote on whether to accept Iceberg in the next 72 hours:
>>>>>>> 
>>>>>>> [ ] +1, accept Iceberg for incubation
>>>>>>> [ ] -1, reject the Iceberg proposal because . . .
>>>>>>> 
>>>>>>> Thank you for reviewing the proposal and voting,
>>>>>>> 
>>>>>>> rb
>>>>>>> ------------------------------
>>>>>>> Iceberg Proposal
>>>>>>> 
>>>>>>> Abstract
>>>>>>> 
>>>>>>> Iceberg is a table format for large, slow-moving tabular data.
>>>>>>> 
>>>>>>> It is designed to improve on the de-facto standard table layout built
>>>>>>> into Apache Hive, Presto, and Apache Spark.
>>>>>>> 
>>>>>>> Proposal
>>>>>>> 
>>>>>>> The purpose of Iceberg is to provide SQL-like tables that are backed by
>>>>>>> large sets of data files. Iceberg is similar to the Hive table layout,
>>>>>>> the de-facto standard structure used to track files in a table, but
>>>>>>> provides additional guarantees and performance optimizations:
>>>>>>> 
>>>>>>> - Atomicity - Each change to the table will either complete or fail.
>>>>>>> “Do or do not. There is no try.”
>>>>>>> - Snapshot isolation - Reads use one and only one snapshot of a table
>>>>>>> at some time without holding a lock.
>>>>>>> - Safe schema evolution - A table’s schema can change in well-defined
>>>>>>> ways, without breaking older data files.
>>>>>>> - Column projection - An engine may request a subset of the available
>>>>>>> columns, including nested fields.
>>>>>>> - Predicate pushdown - An engine can push filters into read planning
>>>>>>> to improve performance using partition data and file-level statistics.
>>>>>>> 
>>>>>>> Iceberg does NOT define a new file format. All data is stored in
>>>>>>> Apache Avro, Apache ORC, or Apache Parquet files.
>>>>>>> 
>>>>>>> Additionally, Iceberg is designed to work well when data files are
>>>>>>> stored in cloud blob stores, even when those systems provide weaker
>>>>>>> guarantees than a file system, including:
>>>>>>> 
>>>>>>> - Eventual consistency in the namespace
>>>>>>> - High latency for directory listings
>>>>>>> - No renames of objects
>>>>>>> - No folder hierarchy
>>>>>>> 
>>>>>>> Rationale
>>>>>>> 
>>>>>>> Initial benchmarks show dramatic improvements in query planning. For
>>>>>>> example, Netflix’s Atlas use case stores time-series metrics from
>>>>>>> Netflix runtime systems, with 1 month of data stored across 2.7 million
>>>>>>> files in 2,688 partitions:
>>>>>>> 
>>>>>>> - Hive table using Parquet:
>>>>>>>    - 400k+ splits, not combined
>>>>>>>    - Explain query: 9.6 minutes wall time (planning only)
>>>>>>> - Iceberg table with partition filtering:
>>>>>>>    - 15,218 splits, combined
>>>>>>>    - Planning: 10 seconds
>>>>>>>    - Query wall time: 13 minutes
>>>>>>> - Iceberg table with partition and min/max filtering:
>>>>>>>    - 412 splits
>>>>>>>    - Planning: 25 seconds
>>>>>>>    - Query wall time: 42 seconds
>>>>>>> 
>>>>>>> These performance gains combined with the cross-engine compatibility
>>>>>>> are a very compelling story.
>>>>>>> 
>>>>>>> Initial Goals
>>>>>>> 
>>>>>>> The initial goal will be to move the existing codebase to Apache and
>>>>>>> integrate with the Apache development process and infrastructure. A
>>>>>>> primary goal of incubation will be to grow and diversify the Iceberg
>>>>>>> community. We are well aware that the project community is composed
>>>>>>> largely of individuals from a single company. We aim to change that
>>>>>>> during incubation.
>>>>>>> 
>>>>>>> Current Status
>>>>>>> 
>>>>>>> As previously mentioned, Iceberg is under active development at
>>>>>>> Netflix, and is being used in processing large volumes of data in
>>>>>>> Amazon EC2.
>>>>>>> 
>>>>>>> Iceberg license documentation is already based on Apache guidelines
>>>>>>> for LICENSE and NOTICE content.
>>>>>>> 
>>>>>>> Meritocracy
>>>>>>> 
>>>>>>> We value meritocracy and we understand that it is the basis for an
>>>>>>> open community that encourages multiple companies and individuals to
>>>>>>> contribute and be invested in the project’s future. We will encourage
>>>>>>> and monitor participation and make sure to extend privileges and
>>>>>>> responsibilities to all contributors.
>>>>>>> 
>>>>>>> Community
>>>>>>> 
>>>>>>> Iceberg is currently being used by developers at Netflix, and a
>>>>>>> growing number of users are actively using it in production
>>>>>>> environments. Iceberg has received contributions from developers
>>>>>>> working at Hortonworks, WeWork, and Palantir. By bringing Iceberg to
>>>>>>> Apache we aim to assure current and future contributors that the
>>>>>>> Iceberg community is meritocratic and open, in order to broaden and
>>>>>>> diversify the user and developer community.
>>>>>>> 
>>>>>>> Core Developers
>>>>>>> 
>>>>>>> Iceberg was initially developed at Netflix and is under active
>>>>>>> development. We believe Iceberg will be of interest to a broad range
>>>>>>> of users and developers and that incubating the project at the ASF
>>>>>>> will help us build a diverse, sustainable community.
>>>>>>> 
>>>>>>> Alignment
>>>>>>> 
>>>>>>> Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive,
>>>>>>> ORC, Parquet, Pig, and Spark. We anticipate integration with
>>>>>>> additional Apache projects as the Iceberg community and interest in
>>>>>>> the project grow.
>>>>>>> 
>>>>>>> Known Risks
>>>>>>> 
>>>>>>> Orphaned Products
>>>>>>> 
>>>>>>> Netflix is committed to the future development of Iceberg and
>>>>>>> understands that graduation to a TLP, while preferable, is not the
>>>>>>> only positive outcome of incubation.
>>>>>>> 
>>>>>>> Should the Iceberg project be accepted by the Incubator, the
>>>>>>> prospective PPMC would be willing to agree to a target incubation
>>>>>>> period of 2 years or less, knowing that every Incubator project incurs
>>>>>>> a certain cost in terms of ASF infrastructure and volunteer time.
>>>>>>> 
>>>>>>> Inexperience with Open Source
>>>>>>> 
>>>>>>> Three of the initial committers are Apache members and Incubator PMC
>>>>>>> members. They will work with the other community members to teach
>>>>>>> them the Apache Way.
>>>>>>> 
>>>>>>> Homogenous Developers
>>>>>>> 
>>>>>>> The majority of the committers work at Netflix, though we are
>>>>>>> committed to recruiting and developing additional committers from a
>>>>>>> wide spectrum of industries and backgrounds.
>>>>>>> 
>>>>>>> Reliance on Salaried Developers
>>>>>>> 
>>>>>>> It is expected that Iceberg development will occur on both salaried
>>>>>>> time and on volunteer time, after hours. Most of the initial
>>>>>>> committers are paid by Netflix to contribute to this project. However,
>>>>>>> they are all passionate about the project, and we are both confident
>>>>>>> and hopeful that the project will continue even if no salaried
>>>>>>> developers contribute to the project.
>>>>>>> 
>>>>>>> Relationships with Other Apache Products
>>>>>>> 
>>>>>>> As mentioned in the Alignment section, Iceberg utilizes a number of
>>>>>>> existing Apache projects (Avro, Hadoop, Hive, ORC, Parquet, Pig, and
>>>>>>> Spark), and we expect that list to expand as the community grows and
>>>>>>> diversifies. Any Apache project in the big data space that needs to
>>>>>>> store or process tabular data would be potentially relevant.
>>>>>>> 
>>>>>>> An Excessive Fascination with the Apache Brand
>>>>>>> 
>>>>>>> We are applying to the Incubator process because we think it is the
>>>>>>> next logical step for the Iceberg project after open-sourcing the
>>>>>>> code. This proposal is not for the purpose of generating publicity.
>>>>>>> Rather, we want to make sure to create a very inclusive and
>>>>>>> meritocratic community, outside the umbrella of a single company.
>>>>>>> Netflix has a long history of contributing to Apache projects and the
>>>>>>> Iceberg developers and contributors understand the implications of
>>>>>>> making it an Apache project.
>>>>>>> 
>>>>>>> Required Resources
>>>>>>> 
>>>>>>> Mailing lists
>>>>>>> 
>>>>>>> - dev@iceberg.incubator.apache.org
>>>>>>> - commits@iceberg.incubator.apache.org
>>>>>>> - private@iceberg.incubator.apache.org
>>>>>>> 
>>>>>>> The podling may also create a user mailing list, if needed.
>>>>>>> 
>>>>>>> Source Control and Issue Tracking
>>>>>>> 
>>>>>>> The Iceberg podling would use Apache’s GitBox integration to sync
>>>>>>> between GitHub and Apache infrastructure. The podling would use GitHub
>>>>>>> issues and pull requests for community engagement.
>>>>>>> 
>>>>>>> Current Resources
>>>>>>> 
>>>>>>> - Initial source: https://github.com/Netflix/iceberg
>>>>>>> - Java documentation:
>>>>>>> https://netflix.github.io/iceberg/current/javadoc/index.html?com/netflix/iceberg/package-summary.html
>>>>>>> - Table specification:
>>>>>>> https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit
>>>>>>> 
>>>>>>> Source and Intellectual Property Submission Plan
>>>>>>> 
>>>>>>> The Iceberg source code on GitHub is currently licensed under Apache
>>>>>>> License v2.0 and the copyright is assigned to Netflix. If Iceberg
>>>>>>> becomes an Incubator project at the ASF, Netflix will transfer the
>>>>>>> source code and trademark ownership to the Apache Software Foundation
>>>>>>> via a Software Grant Agreement.
>>>>>>> 
>>>>>>> External Dependencies
>>>>>>> 
>>>>>>> External dependencies licensed under Apache License 2.0
>>>>>>> 
>>>>>>> - Guava https://github.com/google/guava
>>>>>>> - Jackson https://github.com/FasterXML/jackson-core
>>>>>>> - Joda-Time http://www.joda.org/joda-time/
>>>>>>> 
>>>>>>> External dependencies licensed under the MIT License
>>>>>>> 
>>>>>>> - SLF4J https://www.slf4j.org/
>>>>>>> - Mockito https://github.com/mockito/mockito
>>>>>>> 
>>>>>>> ASF Projects
>>>>>>> 
>>>>>>> - Apache Avro
>>>>>>> - Apache Hadoop
>>>>>>> - Apache Hive
>>>>>>> - Apache ORC
>>>>>>> - Apache Parquet
>>>>>>> - Apache Pig
>>>>>>> - Apache Spark
>>>>>>> 
>>>>>>> Cryptography
>>>>>>> 
>>>>>>> We do not expect Iceberg to be a controlled export item due to the
>>>>>>> use of encryption.
>>>>>>> 
>>>>>>> Initial Committers and Affiliations
>>>>>>> 
>>>>>>> - Ryan Blue blue@apache.org (Netflix)
>>>>>>> - Parth Brahmbhatt parth@apache.org (Netflix)
>>>>>>> - Julien Le Dem julien@apache.org (WeWork)
>>>>>>> - Owen O’Malley omalley@apache.org (Hortonworks)
>>>>>>> - Daniel Weeks dweeks@apache.org (Netflix)
>>>>>>> 
>>>>>>> Sponsors and Nominated Mentors
>>>>>>> 
>>>>>>> - Champion and mentor: Owen O’Malley omalley@apache.org
>>>>>>> - Mentor: Ryan Blue blue@apache.org
>>>>>>> - Mentor: Julien Le Dem julien@apache.org
>>>>>>> 
>>>>>>> Sponsoring Entity
>>>>>>> 
>>>>>>> The Apache Incubator
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Ryan Blue
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Matt Sicker <boards@gmail.com>
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>> 
>>>> 
>> 
>> 
>> 



