incubator-general mailing list archives

From Julian Hyde <jh...@apache.org>
Subject Re: [VOTE] Hop proposal
Date Tue, 10 Nov 2020 20:42:04 GMT
I saw that the text of the proposal was not attached to the vote. So, for the historical record,
I am attaching the text here. The text is copied from version 10 of the proposal in the wiki,
dated 2020-09-17.



## Abstract

Hop is short for the Hop Orchestration Platform. Written
completely in Java, it aims to provide a wide range of data
orchestration tools, including a visual development
environment, servers, metadata analysis, auditing services
and so on. As a platform, Hop also aims to be a library that
other software can easily reuse.

## Proposal

Hop provides all the tools to build, maintain and deploy
data orchestration, ETL and data integration solutions. For
example, Hop allows you to diagram a data flow that
propagates changes from a database via Apache Kafka to a
data warehouse and deploy it as an Apache Beam pipeline. The
core concepts of Hop are Pipelines and Workflows.

* Pipelines do the core data manipulation work (read,
  manipulate, write data). The main items of work in
  pipelines are transforms. A pipeline consists of two or
  more (usually many) transforms that each perform a
  granular piece of work. The transforms in a pipeline run
  in parallel, and together create a powerful data
  processing tool.

* Workflows take care of the orchestration of actions:
  execute pipelines, run child workflows, environment
  checks, preparation, problem alerting and so on.  If these
  terms sound familiar it’s because they are taken from the
  Apache Beam and Apache Airflow projects.
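
To make the pipeline concept above concrete, here is a minimal
sketch in plain Java. It is not the Hop API: the class
`PipelineSketch`, the method `run` and the `END` marker are
hypothetical names chosen for illustration. It assumes each
transform is a simple row-to-row function and shows transforms
running in parallel, one thread per transform, with rows handed
from one transform to the next through queues.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical illustration only: none of these names come from the Hop API.
final class PipelineSketch {
    // Unique marker object (compared by identity) that signals the end of the row stream.
    private static final String END = new String("END");

    // Runs every transform in its own thread and returns the rows leaving the last one.
    static List<String> run(List<String> input, List<Function<String, String>> transforms) {
        // One queue in front of each transform, plus one that collects the final output.
        List<BlockingQueue<String>> queues = new ArrayList<>();
        for (int i = 0; i <= transforms.size(); i++) {
            queues.add(new LinkedBlockingQueue<>());
        }
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < transforms.size(); i++) {
            BlockingQueue<String> in = queues.get(i);
            BlockingQueue<String> out = queues.get(i + 1);
            Function<String, String> transform = transforms.get(i);
            Thread worker = new Thread(() -> {
                try {
                    // Process rows as they arrive; forward the end marker when done.
                    for (String row = in.take(); row != END; row = in.take()) {
                        out.put(transform.apply(row));
                    }
                    out.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
            threads.add(worker);
        }
        try {
            // Feed the input rows, then wait for all transforms to finish.
            BlockingQueue<String> first = queues.get(0);
            for (String row : input) {
                first.put(row);
            }
            first.put(END);
            for (Thread worker : threads) {
                worker.join();
            }
            // Drain the output queue up to the end marker.
            List<String> result = new ArrayList<>();
            BlockingQueue<String> last = queues.get(transforms.size());
            for (String row = last.take(); row != END; row = last.take()) {
                result.add(row);
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
    }
}
```

In this sketch a workflow would be the code that decides when to
call `run`, and what to do before and after it.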

The main components of the Hop platform are:

* hop-gui: a visual data orchestration IDE
* hop-run: a CLI tool to run workflows or pipelines
* hop-config: a CLI tool to configure Hop and its components
* hop-server: a light-weight web server to run and monitor
  workflows and pipelines
* hop-translator: a tool for translating the various parts
  of the Hop tools (i18n).
* hop-web: a thin client version of hop-gui for web browsers
  and mobile devices

The cornerstone of the Hop platform is extensibility: all
major components of the platform are designed to be
pluggable. This allows any possible missing functionality to
be created in a short amount of time.

## Background

The Hop Orchestration Platform has its origins in the Kettle
community. Kettle was acquired by Pentaho, and after
Pentaho’s acquisition by Hitachi in 2015 the community
struck out to solve problems less aligned with Hitachi’s
interests.

## Rationale

In the Hop community, we have always aimed to function as a
meritocracy, where contributions are accepted based on
merit, and individuals gain status in the community based on
their contributions (coding and otherwise). We’re proud to
have a diverse group of people doing all the required things
in a project: development, documentation, tutorials,
architecture, testing, graphics design and much
more. Bringing the project under the Apache Software
Foundation would allow us to continue and grow, but also
give our users confidence about the governance, IP status,
and future of the project.

## ASF Preparation Phase

The very first goal of project Hop is to find a good way to
cooperate on the development across wide geographical,
economic and social spectra. To make this possible, real
changes were needed to a codebase which is essentially 20
years old. Most of these changes have been tackled by
now. We think it’s fair to say that Hop is now a new
platform, even though it shares a common background with
Kettle, as it partly started from the Kettle code base.

Here are a few of the key focus areas we’re trying to
safeguard going forward:

* Plugins: lightweight plugins for all major
  functionality. This makes it possible to extend Hop or
  reduce Hop in size. It also allows people to implement or
  change functionality with minimal coding. In other words
  it makes it easier to contribute.

* Maintain an open and responsive community where every
  concern, feedback and contribution is welcome.

* Maintain a clear focus on data orchestration user
  requirements, not on “industry trends”

* Documentation: we set up a version controlled “adoc”
  system with automated builds which is open, controlled
  and reviewed. This is incredibly important for
  every Hop user and developer.

* Testing and stability: we want to massively increase
  stability by implementing integration tests beyond the
  standard Java unit testing because of the dynamic nature
  of data orchestration work. We still have a long way to
  go. This work will never be finished. It’s a clear and
  important goal nevertheless.

* Simplicity: things are complex enough. We follow the
  example of projects like Apache Spark and Flink, so that,
  for example, “hop-run.sh” does exactly what the name says
  without the need to dive into documentation. As much as
  possible we make things self-evident and re-use
  existing terminology.

For a list of the changes you can look at the monthly
roundup which has been compiled since February 2020. It
documents the hard work of our community so far:

* http://www.project-hop.org/news/roundup-2020-02/
* http://www.project-hop.org/news/roundup-2020-03/
* http://www.project-hop.org/news/roundup-2020-04/
* http://www.project-hop.org/news/roundup-2020-05/
* http://www.project-hop.org/news/roundup-2020-06/
* http://www.project-hop.org/news/roundup-2020-08/

## Goals

Here are a few more details and specifics of things we still
want to take on going forward:

* Add more plugin metadata to Transforms and Action plugins
  as well as their supported engines. This will make it
  easier to refine the user interface and make the user
  experience better by giving to-the-point feedback on what
  operations are supported and required. Example metadata
  to add: extra version and build information, dependencies,
  tags and labels (replacing categories), keywords,
  documentation links, input and output capabilities, engine
  capabilities and so on.

* SWT: While the Eclipse SWT project is still supported we
  want to make a list of all the commonly used API calls and
  stick to those with our own API. This will help the
  development of hop-web and allow us to possibly more
  easily migrate to different user interfaces later on.

* Integration testing: every transform and action should
  have an integration test before it is released to ensure
  quality. Java unit testing has proven insufficient to
  guard against regressions in backward compatibility,
  stability and functionality. We need to do better.

* Apache VFS: Hop makes extensive use of this API to handle
  files. As such we want to implement the various drivers
  for gs://, hdfs://, s3:// through standard Kettle plugins
  making it easier to choose which protocols to support.

* Variables & Parameters: make this experience more
  intuitive, clean up the underlying API and add more
  options to the various user interfaces responsible for
  setting and passing variables and parameters.

* Make Hop-Web an integral part of the Apache Hop project
  removing the code duplication (fork) we’re dealing with
  now. This includes the need to improve various user
  interfaces which were designed for non-web clients.

* Make best practices and governance functionality an
  integral part of the API of the project:

** Data sets and unit testing (already done)
** Environments and lifecycle management (partly done)
** Git support (partly done)
** Auditing and lineage
** Software policies and enforcement thereof
** Configuration management (partly done)
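
As an illustration of the first goal in the list above (richer
plugin metadata), such metadata could be expressed as a Java
annotation that the platform reads at runtime. This is only a
sketch under assumptions: the annotation `PluginInfo`, its
fields, the class `ExampleFilterTransform` and the engine names
are hypothetical and are not part of the Hop API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

// Hypothetical annotation: the name and fields are illustrative, not Hop's API.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PluginInfo {
    String id();
    String version() default "";
    String[] tags() default {};
    String[] keywords() default {};
    String documentationUrl() default "";
    String[] supportedEngines() default {};
}

// Hypothetical transform plugin carrying its metadata (engine names are made up).
@PluginInfo(
    id = "example-filter",
    version = "0.1",
    tags = {"flow"},
    keywords = {"filter", "rows"},
    supportedEngines = {"local", "beam"})
final class ExampleFilterTransform {}

final class PluginMetadataSketch {
    // Returns true if the class declares, via its metadata, support for the given engine.
    static boolean supportsEngine(Class<?> pluginClass, String engine) {
        PluginInfo info = pluginClass.getAnnotation(PluginInfo.class);
        if (info == null) {
            return false; // not a plugin, or no metadata declared
        }
        return Arrays.asList(info.supportedEngines()).contains(engine);
    }
}
```

A user interface could call a check like `supportsEngine` before
offering a transform for a given engine, which is the kind of
to-the-point feedback the goal describes.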

## Current Status

# Meritocracy

With Project Hop, we actively work to foster the existing
community and encourage community contributions. As of
September 1st 2020 we had received over 250 pull requests
and had around 600 tickets in our JIRA platform (many of
which were created by community members), and we have active
discussions in our Mattermost chat platform with over 80
members.

Over the last half year we started to ask users on our chat
server for specific feedback on terminology, features and so
on. It’s been a wonderfully positive experience to have
in-depth discussions on complex issues with industry
experts. We look forward to moving these discussions and
votes to an Apache mailing list.

# Community

Hop is developed, extended and maintained by a global
community of users and developers. The Hop community is what
has driven its development and growth.

The particular history of Hop has led to a lot of interest
in the project and has already led to a number of
contributions, including documentation and translations.

# Core Developers

We have a diverse group of core developers with people
joining on a regular basis. Matt Casters, Rodrigo Haces and
David Rosenblum are part time developers on Hop, salaried by
Neo Solutions. Bart Maertens, Hans Van Akelyen and Yannick
Mols are part time Hop developers paid by the company
know.bi. Doug and Gretchen Moran were Pentaho employees;
along with Rafael Valenzuela, Dan Keeley, Jason Chu, Sergio
Ramazzina and many others, they have been consultants and
community members for over a decade and joined the Hop
community in the last year or two.

# Alignment

We want to anchor and safeguard our development and
community building efforts for the future. We strongly
believe that as an Apache project this can be achieved in
the best possible way. The Hop project also started to align
with projects like Apache Beam, Spark and Flink in its use
of terminology, tools, manner of configuration and so on. As
mentioned elsewhere in this document Hop is a large user of
other Apache projects and libraries and we believe that
becoming an Apache project is mutually
beneficial. Specifically for Apache Beam we believe that
providing a visual pipeline development tool can be of great
value.

## Known Risks

While the Kettle code base from which we started is already
released under the Apache License 2.0, proper attribution
to Hitachi Vantara needs to happen.

We have no knowledge of existing patents on any part of the
Kettle codebase.

To further reduce the risk of any discussion about naming,
the Hop team decided to rename the project, its tools (to be
more self-evident as well), the Java API and even the main
concepts (Transformations are now called Pipelines, in line
with Apache Beam naming conventions).

# Orphaned products

There is little risk that the project will become
orphaned. The list of active developers is large, and
consists of a mix of developers who have been working on the
code for several years and recent arrivals in the community.

# Inexperience with Open Source

The project team has a long history in open source and has
contributed to Apache licensed open source projects, mostly
in the Kettle ecosystem such as Kettle itself and the many
plugins and projects surrounding it. The experience gained
there has allowed us to quickly set up all required build
tools and processes. In its fairly short history, Hop has
been advocating open source in all aspects of the
project. Our submission to the Apache Software Foundation is
a logical extension of our commitment to open source
software.

# Licensing

The original source code we started from (see below) has
been open source since December 2005, initially under the
Lesser GPL but since January 2012 all under the Apache
License version 2.0. All Hop code has been scanned for
compliance with the Apache License 2.0. We integrated
Apache Rat with our build process.

# Heterogeneous Developers

Hop is built, developed and maintained by a global community
of developers. Input comes from a large group of developers
and users from all over the world. At this moment over 7
companies contribute to Hop through their developers, along
with a number of individuals and consultants.

# Reliance on Salaried Developers

Hop developers are a mix of volunteers, enthusiasts and
people working for an employer. There is also a group of
consultants who want to be involved in Hop because it allows
them to do projects with it. They are in fact our most
important users and developers since they provide valuable
feedback from the trenches.

# Relationships with Other Apache Products

Hop is a heavy user of Apache software libraries.

Apache Commons usage:

* commons-beanutils
* commons-cli
* commons-codec
* commons-collections
* commons-collections4
* commons-compiler
* commons-compress
* commons-configuration
* commons-database-model
* commons-dbcp
* commons-digester
* commons-el
* commons-httpclient
* commons-io
* commons-lang and commons-lang3
* commons-logging
* commons-math and commons-math3-3.5.jar
* commons-net
* commons-pool
* commons-validator
* commons-vfs2

Other libraries:

* Apache Batik: front-end SVG drawing
* Apache Xerces: XSLT, XML processing


Other usage of Apache projects related to Hop (plugins):

* Apache Avro
* Apache Beam w/ Apache Spark, Apache Flink, …
* Apache Cassandra
* Apache CouchDB
* Apache Derby
* Apache Flume
* Apache Hadoop
* Apache Hive
* Apache Kafka
* Apache Solr
* Apache Subversion
* Apache Zookeeper

For the build process

* Apache Maven
* Jenkins

# An Excessive Fascination with the Apache Brand

With this proposal we are not seeking attention or
publicity. Rather, we firmly believe in Hop, visual data
pipeline development and the ability to treat the developed
data pipelines (ETL) as software code. While the original
Hop code has been open source for about 15 years, we believe
putting code on GitHub can only go so far. We see the Apache
community, processes, and mission as critical for ensuring
Hop is truly community-driven, positively impactful, and
innovative open source software. We believe Hop is a great
fit for the Apache Software Foundation due to its focus on
visual data processing and its relationships to existing ASF
projects.

## Documentation

Over the years, the community has contributed extensive
documentation to https://wiki.pentaho.com. Over time, areas
of the available information have become incomplete or
outdated. Most of this documentation has been reviewed and
updated, and will be contributed to the Apache Software
Foundation with the Hop source code. Documentation for the extensive
new functionality that was added to Hop in recent months is
being written.

We consider documentation to be a core piece of the Hop
platform and will treat documentation as any other item of
code.

## Initial Source

While there isn’t a Java class in Hop which is unchanged
from its origins, we should mention that we selected this
source code to form the base of Apache Hop:

* https://github.com/pentaho/pentaho-kettle/tree/8.2.0.7-R

We merged various changes from the WebSpoon fork found over
here:

* https://github.com/HiromuHota/pentaho-kettle

Various community driven Kettle plugins were written to
bypass bugs, slow down code-rot and implement missing
features. They were merged into Hop from these
locations:

* https://github.com/mattcasters/kettle-debug-plugin (better
  debugging)

* https://github.com/mattcasters/kettle-beam (Apache Beam
  support)

* https://github.com/mattcasters/pentaho-pdi-dataset (Unit
  Testing)

* https://github.com/mattcasters/kettle-needful-things (Bug
  fixes & workarounds)

* https://github.com/mattcasters/kettle-environment
  (Environment management)

The Hop repositories are currently hosted at:

* https://github.com/project-hop/

with the following repositories:

* Hop: source code for the Hop project
* Hop-doc: technical documentation for the Hop project
* Hop-website: Hop website and content repository
* Hop-docker: Docker containers, Kubernetes

## Source and Intellectual Property Submission Plan

The originating source code is already licensed under an
Apache 2 license:

* https://github.com/pentaho/pentaho-kettle/blob/8.2.0.7-R/LICENSE.txt
* https://github.com/HiromuHota/pentaho-kettle/blob/webspoon-8.3/LICENSE.txt
* https://github.com/mattcasters/kettle-debug-plugin/blob/master/LICENSE
* https://github.com/mattcasters/kettle-beam/blob/master/LICENSE
* https://github.com/mattcasters/pentaho-pdi-dataset/blob/master/LICENSE.txt
* https://github.com/mattcasters/kettle-needful-things/blob/master/LICENSE
* https://github.com/mattcasters/kettle-environment/blob/master/LICENSE

For all contributions we have an agreement in place:
https://cla-assistant.io/project-hop/hop

## External Dependencies

Over the course of the last year we removed non-essential
dependencies as much as possible and replaced them by
interfaces and plugin types. We did this to simplify the
architecture.

It’s important to note all external dependencies are
licensed under an Apache 2.0 or Apache-compatible
license. As we grow the Hop community we will configure our
build process to require and validate all contributions and
dependencies are licensed under the Apache 2.0 license or
are under an Apache-compatible license.

## Cryptography

## Required Resources

# Mailing lists

We currently use a mix of email and Mattermost. We will
migrate our existing mailing lists to the following:

* dev@hop.incubator.apache.org
* user@hop.incubator.apache.org
* private@hop.incubator.apache.org
* commits@hop.incubator.apache.org

# Git Repository

The Hop code is currently in git, and we’d like to keep it
that way. We request a git repository for incubator-hop with
mirroring to GitHub.

# Issue Tracking

We request the creation of an Apache-hosted JIRA.

Jira ID: HOP

# Other Resources

To allow other projects to use Hop as a library we would
love to publish artifacts on a Maven server like
maven.apache.org.

## Initial Committers

* Nicholas Adment <nadment@gmail.com>
* Hans Van Akelyen <hans.van.akelyen@know.bi>
* Lokke Bruyndonckx <lokke.bruyndonckx@know.bi>
* Matt Casters <matt.casters@neo4j.com>
* Jason Chu <jianjunchu@gmail.com>
* Peter Fabricius <info@peter-fabricius.de>
* Rodrigo Haces <rodrigo.haces@neo4j.com>
* Dave Henry <dshenry99@gmail.com>
* Hiromu Hota <hiromu.hota@gmail.com>
* Brandon Jackson <usbrandon@gmail.com>
* Dan Keeley <dan@dankeeley.co.uk>
* Bart Maertens <bart.maertens@know.bi>
* Yannick Mols <yannick.mols@know.bi>
* Doug Moran <doug@dougandgretchen.com>
* Gretchen Moran <gretchen@dougandgretchen.com>
* Sergio Ramazzina <sergio.ramazzina@serasoft.it>
* Maria Carina Roldan <maria.carina.roldan@gmail.com>
* David Rosenblum <david.rosenblum@neo4j.com>
* Rafael Valenzuela <ravamo@gmail.com>

# Affiliations

* Neo4J
** Matt Casters
** Rodrigo Haces
** David Rosenblum
* Know.bi
** Bart Maertens
** Hans Van Akelyen
** Lokke Bruyndonckx
** Yannick Mols
* eHealth Africa
** Doug & Gretchen Moran
* Schemetrica
** Dave Henry
* Beijing Auphi Data Co
** Jason Chu
* Serasoft Italy
** Sergio Ramazzina
* Hitachi Research
** Hiromu Hota

## Sponsors

Champion

* Maximilian Michels (mxm@apache.org)

Nominated Mentors

* Tom Barber (magicaltrout@apache.org)
* Julian Hyde (jhyde@apache.org)
* Maximilian Michels (mxm@apache.org)
* Francois Papon (fpapon@apache.org)
* Kevin Ratnasekera (djkevincr@apache.org)



