incubator-general mailing list archives

From Dave Fisher <w...@apache.org>
Subject Re: [VOTE] Accept Wayang into the Apache Incubator
Date Fri, 11 Dec 2020 16:35:37 GMT
+1 (binding)

Sent from my iPhone

> On Dec 11, 2020, at 8:33 AM, Christofer Dutz <christofer.dutz@c-ware.de> wrote:
> 
> Hi all,
> 
> following up on the [DISCUSS] thread on Wayang (https://lists.apache.org/thread.html/r5fc03ae014f44c7c31a509a6db4ac07faedb2e1c6245cd917b744826%40%3Cgeneral.incubator.apache.org%3E),
I would like to call a VOTE to accept Wayang (aka Rheem) into the Apache Incubator.
> 
> Please cast your vote:
> 
>  [ ] +1, bring Wayang into the Incubator
>  [ ] +0, I don't care either way
>  [ ] -1, do not bring Wayang into the Incubator, because...
> 
> The vote will be open for at least 72 hours. Only votes from the Incubator PMC are binding,
but votes from everyone are welcome.
> 
> Chris
> 
> -----
> 
> Wayang Proposal (https://cwiki.apache.org/confluence/display/INCUBATOR/WayangProposal)
> 
> == Abstract ==
> 
> Wayang is a cross-platform data processing system that aims at decoupling the business
logic of data analytics applications from concrete data processing platforms, such as Apache
Flink or Apache Spark. Hence, it tames the complexity that arises from the "Cambrian explosion"
of novel data processing platforms that we currently witness.
> 
> Note that the Wayang project is the Rheem project; we have renamed it because of trademark issues.
> 
> You can find the project web page at: https://rheem-ecosystem.github.io/
> 
> = Proposal =
> 
> Wayang is a cross-platform system that provides an abstraction over data processing platforms
to free users from the burdens of (i) performing tedious and costly data migration and integration
tasks to run their applications, and (ii) choosing the right data processing platforms for
their applications. To achieve this, Wayang (1) provides an abstraction on top of existing
data processing platforms that allows users to specify their data analytics tasks in the form
of a DAG of operators; (2) comes with a cross-platform optimizer that automates the selection
of suitable and efficient platforms; and (3) takes care of executing the optimized
plan, including communication across platforms. In summary, Wayang has the following salient
features (see the sketch after this list):
> 
> - Flexible Data Model - It considers a flexible and simple data model based on data quanta.
A data quantum is an atomic processing unit in the system that can represent a large spectrum
of data formats, such as data points for a machine learning application, tuples for a database
application, or RDF triples. Hence, Wayang is able to express a wide range of data analytics
tasks.
> - Platform independence - It provides a simple interface (currently Java and Scala) that
is inspired by established programming models, such as those of Apache Spark and Apache Flink.
Users represent their data analytic tasks as a DAG (Wayang plan), where vertices correspond
to Wayang operators and edges represent data flows (data quanta flowing) among these operators.
A Wayang operator defines a particular kind of data transformation over an input data quantum,
ranging from basic functionality (e.g., transformations, filters, joins) to complex, extensible
tasks (e.g., PageRank).
> - Cross-platform execution - Besides running a data analytic task on any data processing
platform, it also comes with an optimizer that can decide to execute a single data analytic
task using multiple data processing platforms. This allows for exploiting the capabilities
of different data processing platforms to perform complex data analytic tasks more efficiently.
> - Self-tuning UDF-based cost model - Its optimizer uses a cost model fully based on UDFs.
This not only enables Wayang to learn the cost functions of newly added data processing platforms,
but also allows developers to tune the optimizer at will.
> - Extensibility - It treats data processing platforms as plugins to allow users (developers)
to easily incorporate new data processing platforms into the system. This is achieved by exposing
the functionalities of data processing platforms as operators (execution operators). The same
approach is followed at the Wayang interface, where users can also easily extend Wayang's
capabilities, i.e., its operators.
> 
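> To make the above concepts concrete, here is a minimal, self-contained Java sketch of the
ideas in this list: operators over data quanta forming a plan, platforms as plugins that expose
execution operators, and a UDF-based cost estimate that lets a (trivial) optimizer pick a
platform. It deliberately does not use the actual Wayang/Rheem API; every class and method
name below is hypothetical and purely illustrative.

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Stream;

    // Hypothetical sketch only -- these types are NOT the actual Wayang/Rheem API. Requires Java 11+.
    public class CrossPlatformSketch {

        // A logical operator transforms a stream of data quanta; such operators are the
        // vertices of a plan (DAG), and data flows are the edges.
        interface Operator<IN, OUT> {
            Stream<OUT> apply(Stream<IN> dataQuanta);
        }

        // A platform "plugin" exposes execution operators and a UDF-based cost estimate.
        interface Platform {
            String name();
            // UDF cost model: estimated cost of running an operator here for a given input cardinality.
            double estimateCost(Operator<?, ?> op, long inputCardinality);
            <I, O> Stream<O> execute(Operator<I, O> op, Stream<I> input);
        }

        // A trivial "optimizer": pick the platform whose UDF cost estimate is lowest.
        static <I, O> Stream<O> run(Operator<I, O> op, Stream<I> input,
                                    long cardinality, List<Platform> platforms) {
            Platform best = platforms.stream()
                    .min(Comparator.comparingDouble(p -> p.estimateCost(op, cardinality)))
                    .orElseThrow(() -> new IllegalStateException("no platform plugin registered"));
            System.out.println("Optimizer chose: " + best.name());
            return best.execute(op, input);
        }

        public static void main(String[] args) {
            // A single-process platform: no startup cost, linear cost per data quantum.
            Platform javaStreams = new Platform() {
                public String name() { return "java-streams"; }
                public double estimateCost(Operator<?, ?> op, long n) { return n; }
                public <I, O> Stream<O> execute(Operator<I, O> op, Stream<I> in) { return op.apply(in); }
            };
            // A cluster-like platform: high fixed startup cost, cheap per data quantum.
            Platform toyCluster = new Platform() {
                public String name() { return "toy-cluster"; }
                public double estimateCost(Operator<?, ?> op, long n) { return 10_000 + 0.01 * n; }
                public <I, O> Stream<O> execute(Operator<I, O> op, Stream<I> in) { return op.apply(in.parallel()); }
            };

            // A simple operator over String data quanta: drop blank lines, upper-case the rest.
            Operator<String, String> cleanAndUpper =
                    in -> in.filter(s -> !s.isBlank()).map(String::toUpperCase);

            // Tiny input, so the cost estimates make the single-process platform win.
            Stream<String> input = Stream.of("a", "", "b");
            run(cleanAndUpper, input, 3, List.of(javaStreams, toyCluster)).forEach(System.out::println);
        }
    }

> The real system is of course far richer (learned cost functions, plan enumeration, data
movement across platforms), but the shape of the abstraction is the same: users build a plan
out of operators, and platform plugins together with UDF cost functions let the optimizer
decide where each part of the plan runs.
> 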
> We plan to work on the stability of all these features as well as on extending Wayang with
more advanced features. Furthermore, Wayang currently supports Apache Spark, Standalone Java,
GraphChi, and relational databases (via JDBC). We plan to incorporate more data processing
platforms, such as Apache Flink and Apache Hive.
> 
> === Background ===
> 
> Many organizations and companies collect or produce a large variety of data and apply data
analytics over it, because insights from data allow them to rapidly make better
decisions. Thus, the pursuit of efficient and scalable data analytics, together with the one-size-does-not-fit-all
philosophy has given rise to a plethora of data processing platforms. Examples of these specialized
processing platforms range from DBMSs to MapReduce-like platforms.
> 
> However, today's data analytics are moving beyond the limits of a single data processing
platform. More and more applications need to perform complex data analytics over several data
processing platforms. For example, (i) IBM reported that North York hospital needs to process
50 diverse datasets, which reside on a dozen different internal systems; (ii) oil & gas companies
stated they need to process the large amounts of data they produce every day, e.g., a single oil
company can produce more than 1.5 TB of diverse (structured and unstructured) data per day;
(iii) Fortune magazine stated that airlines need to analyze large datasets, which are produced
by different departments, are of different data formats, and reside on multiple data sources,
to produce global reports for decision makers; and (iv) Hewlett Packard has claimed that,
according to its customer portfolio, business intelligence applications typically require a single
analytics pipeline using different processing platforms at different parts of the pipeline. These are
just a few examples of emerging applications that require a diversity of data processing platforms.
> 
> Today, developers have to deal with this myriad of data processing platforms. That is,
they have to choose the right data processing platform for their applications (or data analytic
tasks) and to familiarize themselves with the intricacies of the different platforms to achieve high
efficiency and scalability. Several systems have also appeared with the goal of helping users
to easily glue several platforms together, such as Apache Drill, PrestoDB, and Luigi. Nevertheless,
all these systems still require considerable expertise from users to decide which data processing
platforms to use for the data analytic task at hand. In consequence, great engineering effort
is required to unify the data from various sources, to combine the processing capabilities
of different platforms, and to maintain those applications, so as to unleash the full potential
of the data. In the worst case, such applications are not built in the first place, because
doing so seems too daunting an endeavor.
> 
> === Rationale ===
> 
> It is evident that there is an urgent need to release developers from the burden of knowing
all the intricacies of choosing and gluing together data processing platforms to support
their applications (data analytic tasks), and to let them focus only on the logic of their
applications. Surprisingly, there is no open source system trying to satisfy this urgent need.
Wayang aims at filling this gap. It copes with this urgent need by providing both a common
interface over data processing platforms and an optimizer to execute data analytic tasks on
the right data processing platform(s) seamlessly. As Apache is home to most of the important
big data systems, we consider it the right place for Wayang.
> 
> === Current Status ===
> 
> The current version of Wayang (v0.5.0) was initially co-developed by staff, students,
and interns at the Qatar Computing Research Institute (QCRI) and the Hasso-Plattner Institute
(HPI). The project was initiated at and sponsored by QCRI in 2015 with the goal of freeing
data scientists and developers from the intricacies of data processing platforms to support
their analytic tasks. The first open source release of Wayang was made only a year and a
half later, on June 13, 2016, under the Apache Software License 2.0. Since then, we have made
several releases; the latest was on January 23, 2019.
> 
> ** Meritocracy **
> 
> All current Wayang developers are familiar with the development process at Apache and
are already trying to follow the meritocratic process as much as possible. For example,
Wayang already follows a committer principle where every pull request is reviewed by at least
one Wayang core developer. This was one of the reasons for choosing Apache for Wayang, as we
all want to encourage and keep this style of development.
> 
> ** Community **
> 
> Wayang started as a pure research project, but it quickly grew into a community effort.
People from HPI joined our efforts almost from the very beginning to make this project
a reality. Recently, the Berlin Institute of Technology (TU Berlin) and the Pontifical Catholic
University of Valparaiso (PUCV) in Chile have also joined our efforts for developing Wayang.
A company, called Scalytics, has been created around Wayang. Currently, we are intensively
seeking to further develop both developer and user communities. To keep broadening the community,
we plan to also exploit our ongoing academic collaborations with multiple universities in
Berlin and companies that we collaborate with. For instance, Wayang is already being utilized
for accessing multiple data sources in the context of a large data analytics project led by
TU Berlin and Huawei. We also believe that Wayang's extensible architecture (i.e., adding
new operators and platforms) will further encourage community participation. During incubation
we plan to have Wayang adopted by at least one company and will explicitly seek more industrial
participation.
> 
> ** Core Developers **
> 
> The initial developers of the project are diverse: they come from four different institutions
(TU Berlin, Scalytics, PUCV, and HBKU). We will work aggressively to grow the community during
the incubation by recruiting more developers from other institutions.
> 
> ** Alignment **
> 
> We believe Apache is the most natural home for taking Wayang to the next level. Apache
is currently hosting the most important big data systems. Hadoop, Spark, Flink, HBase, Hive,
Tez, Reef, Storm, Drill, and Ignite are just some examples of these technologies. Wayang fills
a significant gap that exists in the big data open source world: it provides a common abstraction
for all these platforms and decides on which platforms to run a single data analytic task.
Wayang is already being developed following the Apache-style development model. It is also
well aligned with the Apache principle of building a community, which will let it have an
impact on the broader big data community.
> 
> === Known Risks ===
> 
> ** Orphaned Products **
> 
> Currently, Wayang is the core technology behind Scalytics Inc. As a result, a team of
two engineers is working full time on this project. Recently, three more developers
have joined our efforts in building Wayang. Thus, the risk of Wayang becoming orphaned is
very low. Moreover, people outside Scalytics (from TU Berlin and HBKU) have also joined
the project, which makes the risk of abandonment even lower. PUCV in Chile
is also beginning to contribute to the code base and to develop a declarative query language
on top of Wayang. The project is constantly coordinated via email and frequent Skype meetings,
as well as weekly meetings with Scalytics staff. Additionally, at the end of each year,
we meet to discuss the status of the project and to plan the most important aspects
we should work on during the following year.
> 
> ** Inexperience with Open Source **
> 
> Wayang has been developed in the open from early on, under the Apache Software License
2.0, and the source code is available on GitHub. Also, a few of the initial committers have
contributed to other open source projects, namely Hadoop and Flume.
> 
> ** Homogeneous Developers **
> 
> The initial committers are already geographically distributed among Chile, Germany, and
Qatar. During incubation, one of our main goals is to increase the heterogeneity of the current
community and we will work hard to achieve it.
> 
> ** Reliance on salaried developers **
> 
> Wayang is already being developed with a mix of full-time and volunteer effort. Only two of
the initial committers (at Scalytics) are working full time on this project. So, we are confident
that the project will not decrease its development pace. Furthermore, we are committed to
recruiting additional committers to significantly increase the development pace of the project.
> 
> ** Relationships with other Apache products **
> 
> Wayang is somewhat related to Apache Spark, as its programming interface is inspired by
Spark's. In contrast to Spark, Wayang is not a data processing platform, but a mediator between
user applications and data processing platforms. In this sense, Wayang is similar to the Apache
Drill project and Apache Beam. However, Wayang significantly differs from Apache Drill in
two main aspects. First, Apache Drill provides only a common interface to query multiple data
stores, and hence users have to specify in their query the data to fetch. Apache Drill then
translates the query for the processing platform where the data is stored, e.g., into MongoDB's
query representation. In contrast, in Wayang, users only specify the data path and Wayang
decides which are the best (performance-wise) data processing platforms to use to process
such data. Second, the query interface in Apache Drill is SQL, whereas Wayang uses an interface
based on operators forming DAGs. On this latter point, we are currently developing a Pig
Latin-like query language for Wayang. In addition, in contrast to Apache Beam, Wayang not only
allows users to use multiple data processing platforms at the same time, but also provides an
optimizer to choose the most efficient platform for the task at hand. In Apache Beam, users
have to specify an appropriate runner (platform); a toy sketch of this contrast appears below.
> 
> Given these similarities with the two Apache projects mentioned above, we are looking
forward to collaborating with those communities. Still, we are open and would also love to
collaborate with other Apache communities as well.
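> As a rough illustration of the contrast mentioned above, the following self-contained Java
sketch juxtaposes an explicitly user-chosen engine (the runner-style approach) with an engine
chosen by a toy cost-based optimizer (the Wayang-style approach). None of these types come
from Apache Beam, Apache Drill, or Wayang; they are hypothetical stand-ins.

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Stream;

    // Hypothetical stand-ins; not Apache Beam, Apache Drill, or Wayang code.
    public class RunnerVsOptimizerSketch {

        interface Engine {
            String name();
            long countLines(Stream<String> lines);
            double estimatedCost(long estimatedLines);  // toy UDF-style cost estimate
        }

        static final Engine SINGLE_NODE = new Engine() {
            public String name() { return "single-node"; }
            public long countLines(Stream<String> lines) { return lines.count(); }
            public double estimatedCost(long n) { return n; }                    // no startup cost
        };

        static final Engine CLUSTER = new Engine() {
            public String name() { return "cluster"; }
            public long countLines(Stream<String> lines) { return lines.parallel().count(); }
            public double estimatedCost(long n) { return 1_000_000 + 0.1 * n; }  // fixed startup cost
        };

        // Runner-style: the caller must pick the engine up front.
        static long runOnExplicitEngine(Engine chosenByUser, Stream<String> lines) {
            System.out.println("User picked: " + chosenByUser.name());
            return chosenByUser.countLines(lines);
        }

        // Wayang-style: the caller only registers candidates; a toy optimizer picks by estimated cost.
        static long runOnBestEngine(List<Engine> candidates, long estimatedLines, Stream<String> lines) {
            Engine chosen = candidates.stream()
                    .min(Comparator.comparingDouble(e -> e.estimatedCost(estimatedLines)))
                    .orElseThrow(() -> new IllegalStateException("no engine registered"));
            System.out.println("Optimizer picked: " + chosen.name());
            return chosen.countLines(lines);
        }

        public static void main(String[] args) {
            System.out.println(runOnExplicitEngine(CLUSTER, Stream.of("a", "b")));
            System.out.println(runOnBestEngine(List.of(SINGLE_NODE, CLUSTER), 2, Stream.of("a", "b")));
        }
    }

> 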
> ** An excessive fascination with the Apache Brand **
> 
> Wayang solves a real problem that users and developers currently have to deal with at
a high cost in terms of money, design and development effort, and time.
Therefore, we believe that Wayang can be successful in building a large community around it.
We are convinced that the Apache brand and community process will significantly help us in
building such a community and establishing the project in the long term. We simply believe
that the ASF is the right home for Wayang to achieve this.
> 
> === Documentation ===
> 
> Further details, documentation, and publications related to Wayang can be found at https://docs.rheem.io/rheem/
> 
> === Initial Source ===
> 
> The current source code of Wayang resides in Github:
> https://github.com/rheem-ecosystem/rheem
> 
> === External Dependencies ===
> 
> Wayang depends on the following Apache projects:
> 
> * Maven
> * HDFS
> * Hadoop
> * Spark
> 
> Wayang depends on the following other open source projects organized by license:
> 
> org.json:json: JSON License (http://json.org/license.html)
> SnakeYAML: Apache 2.0
> Java Unified Expression Language API (Juel): Apache 2.0
> ProfileDB Instrumentation: Apache 2.0
> Gson: Apache 2.0
> Hadoop: Apache 2.0
> Scala: Apache 2.0
> Antlr 4: BSD
> Jackson: Apache 2.0
> Junit 5: EPL 2.0
> Mockito: MIT
> Assertj: Apache 2.0
> logback-classic: EPL 1.0 / LGPL 2.1
> slf4j: MIT
> GNU Trove: LGPL 2.1
> graphchi: Apache 2.0
> SQLite JDBC: Apache 2.0
> PostgreSQL: BSD 2-clause
> jcommander: Apache 2.0
> Koloboke Collections API: Apache 2.0
> Snappy Java: Apache 2.0
> Apache Spark: Apache 2.0
> HyperSQL Database: BSD Modified (http://hsqldb.org/web/hsqlLicense.html) 
> Apache Giraph: Apache 2.0
> Apache Flink: Apache 2.0
> Apache Commons IO: Apache 2.0
> Apache Commons Lang: Apache 2.0
> Apache Maven: Apache 2.0
> 
> === Cryptography ===
> 
> (not applicable)
> 
> === Required Resources ===
> 
> ** Mailing Lists **
> 
> * mailto:private@wayang.incubator.apache.org
> * mailto:dev@wayang.incubator.apache.org
> * mailto:commits@wayang.incubator.apache.org
> 
> ** Git repositories **
> 
> git://git.apache.org/repos/asf/incubator/wayang
> 
> ** Issue tracking **
> 
> https://issues.apache.org/jira/browse/RHEEM
> 
> === Initial Committers ===
> 
> The following list gives the planned initial committers (in alphabetical order):
> 
> * Bertty Contreras-Rojas <bertty@scalytics.io>
> * Rodrigo Pardo-Meza <rodrigo@scalytics.io>
> * Alexander Alten-Lorenz <alo@scalytics.io>
> * Zoi Kaoudi <zoi.kaoudi@tu-berlin.de>
> * Haralampos Gavriilidis <gavriilidis@tu-berlin.de>
> * Jorge-Arnulfo Quiane-Ruiz <jorge.quiane@tu-berlin.de>
> * Anis Troudi <atroudi@hbku.edu.qa>
> * Wenceslao Palma-Muñoz <wenceslao.palma@pucv.cl>
> 
> ** Affiliations **
> 
> * Scalytics Inc.
> ** Bertty Contreras-Rojas
> ** Rodrigo Pardo-Meza
> ** Alexander Alten-Lorenz
> * Berlin Institute of Technology (TU Berlin)
> ** Zoi Kaoudi
> ** Haralampos Gavriilidis
> ** Jorge-Arnulfo Quiane-Ruiz
> * Hamad Bin Khalifa University (HBKU)
> ** Anis Troudi
> * Pontifical Catholic University of Valparaiso, Chile (PUCV)
> ** Wenceslao Palma-Muñoz
> 
> === Sponsors ===
> 
> ** Champion **
> 
> * Christofer Dutz (christofer.dutz at c-ware dot de)
> 
> ** Mentors **
> 
> * (cdutz) Christofer Dutz
> * (larsgeorge) Lars George
> * (berndf) Bernd Fondermann
> * (jbonofre) Jean-Baptiste Onofré
> 
> ** Sponsoring Entity **
> 
> The Apache Incubator
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

