incubator-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luciano Resende <>
Subject [VOTE] Accept Crail into the Apache Incubator
Date Thu, 26 Oct 2017 15:31:38 GMT
Now that the discussion thread on the Crail proposal has ended, please vote
on accepting Crail into into the Apache Incubator.

The ASF voting rules are described at:

A vote for accepting a new Apache Incubator podling is a majority vote
for which only Incubator PMC member votes are binding.

Votes from other people are also welcome as an indication of peoples
enthusiasm (or lack thereof).

Please do not use this VOTE thread for discussions.
If needed, start a new thread instead.

This vote will run for at least 72 hours. Please VOTE as follows
[] +1 Accept Crail into the Apache Incubator
[] +0 Abstain.
[] -1 Do not accept Crail into the Apache Incubator because ...

The proposal below is also on the wiki:



Crail is a storage platform for sharing performance critical data in
distributed data processing jobs at very high speed. Crail is built
entirely upon principles of user-level I/O and specifically targets data
center deployments with fast network and storage hardware (e.g., 100Gbps
RDMA, plenty of DRAM, NVMe flash, etc.) as well as new modes of operation
such resource disaggregation or serverless computing. Crail is written in
Java and integrates seamlessly with the Apache data processing ecosystem.
It can be used as a backbone to accelerate high-level data operations such
as shuffle or broadcast, or as a cache to store hot data that is queried
repeatedly, or as a storage platform for sharing inter-job data in complex
multi-job pipelines, etc.


Crail enables Apache data processing frameworks to run efficiently in next
generation data centers using fast storage and network hardware in
combination with resource (e.g., DRAM, Flash) disaggregation.


Crail started as a research project at the IBM Zurich Research Laboratory
around 2014 aiming to integrate high-speed I/O hardware effectively into
large scale data processing systems.


During the last decade, I/O hardware has undergone rapid performance
improvements, typically in the order of magnitudes. Modern day networking
and storage hardware can deliver 100+ Gbps (10+ GBps) bandwidth with a few
microseconds of access latencies. However, despite such progress in raw I/O
performance, effectively leveraging modern hardware in data processing
frameworks remains challenging. In most of the cases, upgrading to high-end
networking or storage hardware has very little effect on the performance of
analytics workloads. The problem comes from heavily layered software
imposing overheads such as deep call stacks, unnecessary data copies,
thread contention, etc. These problems have already been addressed at the
operating system level with new I/O APIs such as RDMA verbs, NVMe, etc.,
allowing applications to bypass software layers during I/O operations.
Distributed data processing frameworks on the other hand, are typically
implemented on legacy I/O interfaces such as such as sockets or block
storage. These interfaces have been shown to be insufficient to deliver the
full hardware performance. Yet, to the best of our knowledge, there are no
active and systematic efforts to integrate these new user level I/O APIs
into Apache software frameworks. This problem affects all end-users and
organizations that use Apache software. We expect them to see
unsatisfactory small performance gains when upgrading their networking and
storage hardware.

Crail solves this problem by providing an efficient storage platform built
upon user-level I/O, thus, bypassing layers such as JVM and OS during I/O
operations. Moreover, Crail directly leverages the specific hardware
features of RDMA and NVMe to provide a better integration with high-level
data operations in Apache compute frameworks. As a consequence, Crail
enables users to run larger, more complex queries against ever increasing
amounts of data at a speed largely determined by the deployed hardware.
Crail is generic solution that integrates well with the Apache ecosystem
including frameworks like Spark, Hadoop, Hive, etc.

Initial Goals

The initial goals to move Crail to the Apache Incubator is to broaden the
community, and foster contributions from developers to leverage Crail in
various data processing frameworks and workloads. Ultimately, the goal for
Crail is to become the de-facto standard platform for storing temporary
performance critical data in distributed data processing systems.

Current Status

The initial code has been developed at the IBM Zurich Research Center and
has recently been made available in GitHub under the Apache Software
License 2.0. The Project currently has explicit support for Spark and
Hadoop. Project documentation is available on the website
There is also a public forum for discussions related to Crail available at!forum/zrlio-users.


The current developers are familiar with the meritocratic open source
development process at Apache. Over the last year, the project has gathered
interest at GitHub and several companies have already expressed interest in
the project. We plan to invest in supporting a meritocracy by inviting
additional developers to participate.


The need for a generic solution to integrate high-performance I/O hardware
in the open source is tremendous, so there is a potential for a very large
community. We believe that Crail’s extensible architecture and its
alignment with the Apache Ecosystem will further encourage community
participation. We expect that over time Crail will attract a large


Crail is written in Java and is built for the Apache data processing
ecosystem. The basic storage services of Crail can be used seamlessly from
Spark, Hadoop, Storm. The enhanced storage services require dedicated data
processing specific binding, which currently are available only for Spark.
We think that moving Crail to the Apache incubator will help to extend
Crail’s support for different data processing frameworks.

Known Risks

To-date, development has been sponsored by IBM and coordinated mostly by
the core team of researchers at the IBM Zurich Research Center. For Crail
to fully transition to an "Apache Way" governance model, it needs to start
embracing the meritocracy-centric way of growing the community of

Orphaned Products

The Crail developers have a long-term interest in use and maintenance of
the code and there is also hope that growing a diverse community around the
project will become a guarantee against the project becoming orphaned. We
feel that it is also important to put formal governance in place both for
the project and the contributors as the project expands. We feel ASF is the
best location for this.

Inexperience with Open Source

Several of the initial committers are experienced open source developers
(Linux Kernel, DPDK, etc.).

Relationships with Other Apache Products

As of now, Crail has been tested with Spark, Hadoop and Hive, but it is
designed to integrate with any of the Apache data processing frameworks.

Homogeneous Developers

The project already has a diverse developer base including contributions
from organizations and public developers.

An Excessive Fascination with the Apache Brand

Crail solves a real need for a generic approach to leverage modern network
and storage hardware effectively in the Apache Hadoop and Spark ecosystems.
Our rationale for developing Crail as an Apache project is detailed in the
Rationale section. We believe that the Apache brand and community process
will help to us to engage a larger community and facilitate closer ties
with various Apache data processing projects.


Documentation regarding Crail is available at

Initial Source

Initial source is available on GitHub under the Apache License 2.0:
External Dependencies

Crail is written in Java and currently supports Apache Hadoop MapReduce and
Apache Spark runtimes. To the best of our knowledge, all dependencies of
Crail are distributed under Apache compatible licenses.

Required Resource

Mailing lists
Git repository
Issue Tracking

JIRA (Crail)
Initial Committers

Patrick Stuedi <stu AT ibm DOT zurich DOT com>
Animesh Trivedi <atr AT ibm DOT zurich DOT com>
Jonas Pfefferle <jpf AT ibm DOT zurich DOT com>
Bernard Metzler <bmt AT ibm DOT zurich DOT com>
Michael Kaufmann <kau AT ibm DOT zurich DOT com>
Adrian Schuepbach <dri AT ibm DOT zurich DOT com>
Patrick McArthur <patrick AT patrickmcarthur DOT net>
Ana Klimovic <anakli AT stanford DOT edu>
Yuval Degani <yuvaldeg AT mellanox DOT com>
Vu Pham <vuhuong AT mellanox DOT com>

IBM (Patrick, Stuedi, Animesh Trivedi, Jonas Pfefferle, Bernard Metzler,
Michael Kaufmann, Adrian Schuepbach)
University of New Hampshire (Patrick McArthur)
Stanford University (Ana Klimovic)
Mellanox (Yuval Degani, Vu Pham)


Luciano Resende <lresende AT apache DOT org>

Nominated Mentors

Luciano Resende <lresende AT apache DOT org>

Raphael Bircher <rbircher AT apache DOT org>

Julian Hyde <jhyde AT apache DOT org>

Sponsoring Entity

We would like to propose the Apache Incubator to sponsor this project.

Luciano Resende

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message