incubator-general mailing list archives

From "Daniel B. Widdis" <>
Subject Re: [VOTE] Accept Wayang into the Apache Incubator
Date Fri, 11 Dec 2020 17:02:58 GMT
+1 (non-binding).  I'm interested in getting involved in this project!

On Fri, Dec 11, 2020 at 8:33 AM Christofer Dutz <>

> Hi all,
> following up the [DISCUSS] thread on Wayang (
> I would like to call a VOTE to accept Wayang (aka Rheem) into the Apache
> Incubator.
> Please cast your vote:
>   [ ] +1, bring Wayang into the Incubator
>   [ ] +0, I don't care either way
>   [ ] -1, do not bring Wayang into the Incubator, because...
> The vote will be open for at least 72 hours. Only votes from the Incubator
> PMC are binding, but votes from everyone are welcome.
> Chris
> -----
> Wayang Proposal (
> == Abstract ==
> Wayang is a cross-platform data processing system that aims at decoupling
> the business logic of data analytics applications from concrete data
> processing platforms, such as Apache Flink or Apache Spark. Hence, it tames
> the complexity that arises from the "Cambrian explosion" of novel data
> processing platforms that we currently witness.
> Note that the Wayang project is the Rheem project; we have renamed the
> project because of trademark issues.
> You can find the project web page at:
> == Proposal ==
> Wayang is a cross-platform system that provides an abstraction over data
> processing platforms to free users from the burdens of (i) performing
> tedious and costly data migration and integration tasks to run their
> applications, and (ii) choosing the right data processing platforms for
> their applications. To achieve this, Wayang: (1) provides an abstraction on
> top of existing data processing platforms that allows users to specify
> their data analytics tasks in the form of a DAG of operators; (2) comes with
> a cross-platform optimizer for automating the selection of
> suitable/efficient platforms; and (3) takes care of executing
> the optimized plan, including communication across platforms. In summary,
> Wayang has the following salient features:
> - Flexible Data Model - It considers a flexible and simple data model
> based on data quanta. A data quantum is an atomic processing unit in the
> system, which can represent a large spectrum of data formats, such as data
> points for a machine learning application, tuples for a database
> application, or RDF triples. Hence, Wayang is able to express a wide range
> of data analytics tasks.
> - Platform independence - It provides a simple interface (currently Java
> and Scala) that is inspired by established programming models, such as that
> of Apache Spark and Apache Flink. Users represent their data analytic tasks
> as a DAG (Wayang plan), where vertices correspond to Wayang operators and
> edges represent data flows (data quanta flowing) among these operators. A
> Wayang operator defines a particular kind of data transformation over an
> input data quantum, ranging from basic functionality (e.g.,
> transformations, filters, joins) to complex, extensible tasks (e.g.,
> PageRank).
> - Cross-platform execution - Besides running a data analytic task on any
> data processing platform, it also comes with an optimizer that can decide
> to execute a single data analytic task using multiple data processing
> platforms. This allows for exploiting the capabilities of different data
> processing platforms to perform complex data analytic tasks more
> efficiently.
> - Self-tuning UDF-based cost model - Its optimizer uses a cost model fully
> based on UDFs. This not only enables Wayang to learn the cost functions of
> newly added data processing platforms, but also allows developers to tune
> the optimizer at will.
> - Extensibility - It treats data processing platforms as plugins to allow
> users (developers) to easily incorporate new data processing platforms into
> the system. This is achieved by exposing the functionalities of data
> processing platforms as operators (execution operators). The same approach
> is followed at the Wayang interface, where users can also extend Wayang
> capabilities, i.e., the operators, easily.
> We plan to work on the stability of all these features as well as on
> extending Wayang with more advanced features. Furthermore, Wayang currently
> supports Apache Spark, Standalone Java, GraphChi, and relational databases
> (via JDBC). We plan to incorporate more data processing platforms, such as
> Apache Flink and Apache Hive.
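
The plan-as-DAG and cost-based platform selection described in the proposal can be illustrated with a toy sketch. This is not Wayang's API: the `Operator` class, the `choose_platforms` function, the per-platform cost functions, and all the numbers are hypothetical, and the plan is simplified to a linear chain of operators. Each platform supplies a cost UDF per operator, moving data quanta between platforms has a price, and a small dynamic program picks one platform per operator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical toy model for illustration only; not Wayang's actual API.
CostFn = Callable[[int], float]  # input cardinality -> estimated cost


@dataclass(frozen=True)
class Operator:
    name: str
    selectivity: float  # output cardinality / input cardinality


def choose_platforms(
    plan: List[Operator],
    costs: Dict[str, Dict[str, CostFn]],  # platform -> operator name -> cost UDF
    input_cardinality: int,
    move_cost_per_quantum: float,
) -> Tuple[float, List[str]]:
    """Pick one platform per operator of a linear plan, minimizing total cost."""
    # best[p] = (cost so far, assignment) for partial plans whose last operator ran on p
    best: Dict[str, Tuple[float, List[str]]] = {p: (0.0, []) for p in costs}
    card = input_cardinality
    for i, op in enumerate(plan):
        nxt: Dict[str, Tuple[float, List[str]]] = {}
        for p, udfs in costs.items():
            if op.name not in udfs:
                continue  # platform p has no execution operator for this step
            candidates = []
            for q, (cost, assignment) in best.items():
                # Pay for moving data quanta only when switching platforms.
                move = 0.0 if (i == 0 or q == p) else move_cost_per_quantum * card
                candidates.append((cost + move + udfs[op.name](card), assignment + [p]))
            nxt[p] = min(candidates, key=lambda t: t[0])
        if not nxt:
            raise ValueError(f"no platform can execute operator {op.name!r}")
        best = nxt
        card = round(card * op.selectivity)
    return min(best.values(), key=lambda t: t[0])


# Made-up cost UDFs: "spark" has a fixed start-up overhead but a low per-quantum
# cost; "java" has no overhead but a higher per-quantum cost.
costs = {
    "java": {"filter": lambda n: 1.0 * n, "map": lambda n: 1.0 * n},
    "spark": {"filter": lambda n: 5000 + 0.125 * n, "map": lambda n: 5000 + 0.125 * n},
}
plan = [Operator("filter", selectivity=0.001), Operator("map", selectivity=1.0)]

total, assignment = choose_platforms(plan, costs, 1_000_000, move_cost_per_quantum=0.5)
print(total, assignment)  # prints: 131500.0 ['spark', 'java']
```

With these made-up numbers the large filter runs on the platform with the low per-quantum cost, and the small intermediate result is moved to the platform without start-up overhead, which mirrors the idea of one task spanning several platforms.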
> === Background ===
> Many organizations and companies collect or produce a large variety of data
> and apply data analytics to it, because insights from data rapidly allow
> them to make better decisions. Thus, the pursuit of efficient and scalable
> data analytics, together with the one-size-does-not-fit-all philosophy,
> has given rise to a plethora of data
> processing platforms. Examples of these specialized processing platforms
> range from DBMSs to MapReduce-like platforms.
> However, today's data analytics are moving beyond the limits of a single
> data processing platform. More and more applications need to perform
> complex data analytics over several data processing platforms. For example,
> (i) IBM reported that North York hospital needs to process 50 diverse
> datasets, which are on a dozen different internal systems, (ii) oil & gas
> companies stated they need to process large amounts of data they produce
> every day,
> e.g., a single oil company can produce more than 1.5TB of diverse
> (structured and unstructured) data per day, (iii) Fortune magazine stated
> that airlines need to analyze large datasets, which are produced by
> different departments, are of different data formats, and reside on
> multiple data sources, to produce global reports for decision makers, and
> (iv) Hewlett Packard has claimed that, according to its customer portfolio,
> business intelligence typically requires a single analytics pipeline using
> different processing platforms at different parts of the pipeline. These
> are just a few examples of emerging applications that require a diversity
> of data processing platforms.
> Today, developers have to deal with this myriad of data processing
> platforms. That is, they have to choose the right data processing platform
> for their applications (or data analytic tasks) and familiarize themselves
> with the intricacies of the different platforms to achieve high efficiency
> and scalability. Several systems have also appeared with the goal of helping
> users to easily glue several platforms together, such as Apache Drill,
> PrestoDB, and Luigi. Nevertheless, all these systems still require
> considerable expertise from users to decide which data processing platforms
> to use for the data analytic task at hand. As a consequence, great
> engineering effort is required to unify the data from various sources, to
> combine the processing capabilities of different platforms, and to maintain
> those applications, so as to unleash the full potential of the data. In the
> worst case, such applications are not built in the first place, because the
> endeavor seems too daunting.
> === Rationale ===
> It is evident that there is an urgent need to release developers from the
> burden of knowing all the intricacies of choosing and gluing together data
> processing platforms for supporting their applications (data analytic
> tasks). Developers should be able to focus only on the logic of their
> applications. Surprisingly, there is no open source system trying to satisfy
> this need. Wayang aims at filling this gap by
> providing both a common interface over data processing platforms and an
> optimizer to execute data analytic tasks on the right data processing
> platform(s) seamlessly. As Apache is the place where most of the important
> big data systems are, we consider Apache the right place for Wayang.
> === Current Status ===
> The current version of Wayang (v0.5.0) was initially co-developed by
> staff, students, and interns at the Qatar Computing Research Institute
> (QCRI) and the Hasso-Plattner Institute (HPI). The project was initiated at
> and sponsored by QCRI in 2015 with the goal of freeing data scientists and
> developers from the intricacies of data processing platforms to support
> their analytic tasks. The first open source release of Wayang was made only
> a year and a half later, on June 13th, 2016, under the Apache Software
> License 2.0. Since then we have made several releases; the latest release
> was done on January 23rd, 2019.
> ** Meritocracy **
> All current Wayang developers are familiar with the development process
> at Apache and are currently trying to follow this meritocratic process as
> much as possible. For example, Wayang already follows a committer principle
> where any pull request is analyzed by at least one Wayang core developer.
> This was one of the reasons for choosing Apache for Wayang as we all want
> to encourage and keep this style of development for Wayang.
> ** Community **
> Wayang started as a pure research project, but it quickly started
> developing into a community. People from HPI quickly joined our efforts
> almost from the very beginning to make this project a reality. Recently,
> the Berlin Institute of Technology (TU Berlin) and the Pontifical Catholic
> University of Valparaiso (PUCV) in Chile have also joined our efforts for
> developing Wayang. A company, called Scalytics, has been created around
> Wayang. Currently, we are intensively seeking to further develop both
> developer and user communities. To keep broadening the community, we plan
> to also exploit our ongoing academic collaborations with multiple
> universities in Berlin and companies that we collaborate with. For
> instance, Wayang is already being utilized for accessing multiple data
> sources in the context of a large data analytics project led by TU Berlin
> and Huawei. We also believe that Wayang's extensible architecture (i.e.,
> adding new operators and platforms) will further encourage community
> participation. During incubation we plan to have Wayang adopted by at least
> one company and will explicitly seek more industrial participation.
> ** Core Developers **
> The initial developers of the project are diverse: they come from four
> different institutions (TU Berlin, Scalytics, PUCV, and HBKU). We will work
> aggressively to grow the community during the incubation by recruiting more
> developers from other institutions.
> ** Alignment **
> We believe Apache is the most natural home for taking Wayang to the next
> level. Apache is currently hosting the most important big data systems.
> Hadoop, Spark, Flink, HBase, Hive, Tez, Reef, Storm, Drill, and Ignite are
> just some examples of these technologies. Wayang fills a significant gap
> that exists in the big data open source world: it provides a common
> abstraction for all these platforms and decides on which platforms to run
> a single data analytic task. Wayang is now being developed following the
> Apache-style development model. Also, it is well-aligned with the Apache
> principle of building a community to impact the big data community.
> === Known Risks ===
> ** Orphaned Products **
> Currently, Wayang is the core technology behind Scalytics Inc. As a
> result, a team of two engineers is working full time on this
> project. Recently, three more developers have joined our efforts in
> building Wayang. Thus, the risk of Wayang becoming orphaned is
> very low. In addition, people outside Scalytics (from TU Berlin and HBKU)
> have also joined the project, which makes the risk of abandoning the project
> even lower. The PUCV in Chile is also beginning to contribute to the code
> base and to develop a declarative query language on top of Wayang. The
> project is constantly being monitored by email and frequent Skype meetings
> as well as by weekly meetings with Scalytics people. Additionally, at the
> end of each year, we meet to discuss the status of the project as well as
> to plan the most important aspects we should work on during the year after.
> ** Inexperience with Open Source **
> Wayang quickly started being developed in open source under the Apache
> Software License 2.0. The source code is available on GitHub. Also, a few of
> the initial committers have contributed to other open source projects:
> Hadoop and Flume.
> ** Homogeneous Developers **
> The initial committers are already geographically distributed among Chile,
> Germany, and Qatar. During incubation, one of our main goals is to increase
> the heterogeneity of the current community and we will work hard to achieve
> it.
> ** Reliance on salaried developers **
> Wayang is already being developed by a mix of full-time and volunteer
> work. Only 2 of the initial committers are working full time on this
> project (Scalytics). So, we are confident that the project will not
> decrease its development pace. Furthermore, we are committed to recruiting
> additional committers to significantly increase the development pace of the
> project.
> ** Relationships with other Apache products **
> Wayang is somewhat related to Apache Spark, as its development interface is
> inspired by Spark. In contrast to Spark, Wayang is not a data processing
> platform, but a mediator between user applications and data processing
> platforms. In this sense, Wayang is similar to the Apache Drill project,
> and Apache Beam. However, Wayang significantly differs from Apache Drill in
> two main aspects. First, Apache Drill provides only a common interface to
> query multiple data storages and hence users have to specify in their query
> the data to fetch. Then, Apache Drill translates the query to the
> processing platforms where the data is stored, e.g., into a MongoDB query
> representation. In contrast, in Wayang, users only specify the data path
> and Wayang decides which are the best (performance-wise) data processing
> platforms to use to process such data. Second, the query interface in
> Apache Drill is SQL. Wayang uses an interface based on operators forming
> DAGs. On this latter point, we are currently developing a Pig Latin-like
> query language for Wayang. In addition, in contrast to Apache Beam, Wayang
> not only allows users to use multiple data processing platforms at the same
> time, but also provides an optimizer to choose the most efficient
> platform for the task at hand. In Apache Beam, users have to specify an
> appropriate runner (platform).
> Given these similarities with the two Apache projects mentioned above, we
> are looking forward to collaborating with those communities. We would also
> love to collaborate with other Apache communities.
> ** An excessive fascination with the Apache Brand **
> Wayang solves a real problem that users and developers currently have to
> deal with at a high cost: monetary cost, high design and development
> effort, and a large time investment. Therefore, we believe that Wayang can be
> successful in building a large community around it. We are convinced that
> the Apache brand and community process will significantly help us in
> building such a community and establishing the project in the long term. We
> simply believe that ASF is the right home for Wayang to achieve this.
> === Documentation ===
> Further details, documentation, and publications related to Wayang can be
> found at
> === Initial Source ===
> The current source code of Wayang resides on GitHub:
> === External Dependencies ===
> Wayang depends on the following Apache projects:
> * Maven
> * HDFS
> * Hadoop
> * Spark
> Wayang depends on the following other open source projects organized by
> license:
> org.json.json: Json (
> SnakeYAML: Apache 2.0
> Java Unified Expression Language API (Juel): Apache 2.0
> ProfileDB Instrumentation: Apache 2.0
> Gson: Apache 2.0
> Hadoop: Apache 2.0
> Scala: Apache 2.0
> Antlr 4: BSD
> Jackson: Apache 2.0
> Junit 5: EPL 2.0
> Mockito: MIT
> Assertj: Apache 2.0
> logback-classic: EPL 1.0 / LGPL 2.1
> slf4j: MIT
> GNU Trove: LGPL 2.1
> graphchi: Apache 2.0
> SQLite JDBC: Apache 2.0
> PostgreSQL: BSD 2-clause
> jcommander: Apache 2.0
> Koloboke Collections API: Apache 2.0
> Snappy Java: Apache 2.0
> Apache Spark: Apache 2.0
> HyperSQL Database: BSD Modified (
> Apache Giraph: Apache 2.0
> Apache Flink: Apache 2.0
> Apache Commons IO: Apache 2.0
> Apache Commons Lang: Apache 2.0
> Apache Maven: Apache 2.0
> === Cryptography ===
> (not applicable)
> === Required Resources ===
> ** Mailing Lists **
> *
> *
> *
> ** Git repositories **
> git://
> ** Issue tracking **
> === Initial Committers ===
> The following list gives the planned initial committers (in alphabetical
> order):
> * Bertty Contreras-Rojas <bertty@>
> * Rodrigo Pardo-Meza <rodrigo@>
> * Alexander Alten-Lorenz <alo@>
> * Zoi Kaoudi <zoi.kaoudi@>
> * Haralampos Gavriilidis <gavriilidis@>
> * Jorge-Arnulfo Quiane-Ruiz <jorge.quiane@>
> * Anis Troudi <atroudi@>
> * Wenceslao Palma-Muñoz <wenceslao.palma@>
> ** Affiliations **
> * Scalytics Inc.
> ** Bertty Contreras-Rojas
> ** Rodrigo Pardo-Meza
> ** Alexander Alten-Lorenz
> * Berlin Institute of Technology (TU Berlin)
> ** Zoi Kaoudi
> ** Haralampos Gavriilidis
> ** Jorge-Arnulfo Quiane-Ruiz
> * Hamad Bin Khalifa University (HBKU)
> ** Anis Troudi
> * Pontifical Catholic University of Valparaiso, Chile (PUCV)
> ** Wenceslao Palma-Muñoz
> === Sponsors ===
> ** Champion **
> * Christofer Dutz (christofer.dutz at c-ware dot de)
> ** Mentors **
> . (cdutz) Christofer Dutz
> . (larsgeorge) Lars George
> . (berndf) Fondermann
> . (jbonofre) Jean-Baptiste Onofré
> ** Sponsoring Entity **
> The Apache Incubator

Dan Widdis
