incubator-general mailing list archives

From Kevin Ratnasekera <djkevincr1...@gmail.com>
Subject Re: [DISCUSS] Hop proposal
Date Tue, 08 Sep 2020 10:06:59 GMT
+1 ( binding ) Interesting project. Please add me as a mentor to the
project.

On Tue, Sep 8, 2020 at 3:26 PM Matt Casters
<matt.casters@neotechnology.com.invalid> wrote:

> Hello Apache,
>
> Our community is eager to propose for Hop to join the Apache Incubator.
> The Hop Orchestration Platform aims to help people with complex data and
> metadata orchestration problems.
>
> Below is the complete text of the proposal but you can also find it here:
> https://cwiki.apache.org/confluence/display/INCUBATOR/HopProposal
>
> Any help with respect to the incubation is appreciated including help from
> a few more mentors to set us on the right track.  On behalf of my community
> I'd be happy to answer any questions you might have regarding Hop.  Our
> thanks go out to Max, Julian and Tom for helping us set up this proposal.
>
> Thanks in advance for your time!
>
> Best regards,
>
> Matt - Hop co-founder
> www.project-hop.org
> ---
>
> Abstract
> =========
> Hop is short for the Hop Orchestration Platform. Written completely in Java
> it aims to provide a wide range of data orchestration tools, including a
> visual development environment, servers, metadata analysis, auditing
> services and so on. As a platform, Hop also aims to be a library that other
> software can easily re-use.
>
> Proposal
> =========
> Hop provides all the tools to build, maintain and deploy data
> orchestration, ETL and data integration solutions. For example, Hop allows
> you to diagram a data flow that propagates changes from a database via
> Apache Kafka to a data warehouse and deploy it as an Apache Beam pipeline.
> The core concepts of Hop are Pipelines and Workflows.
> * Pipelines do the core data manipulation work (read, manipulate, write
> data). The main items of work in pipelines are transforms. A pipeline
> consists of two or more (usually many) transforms that each perform a
> granular piece of work. The transforms in a pipeline run in parallel, and
> together create a powerful data processing tool.
> * Workflows take care of the orchestration of actions: execute pipelines,
> run child workflows, environment checks, preparation, problem alerting and
> so on.
> If these terms sound familiar it’s because they are taken from the Apache
> Beam and Apache Airflow projects.
>
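To make the pipeline idea above concrete, here is a minimal sketch in plain Java, assuming each transform runs in its own thread and passes rows to the next one through a queue. It does not use the Hop API; the class name `PipelineSketch`, the `transform` and `run` helpers and the end-of-stream marker are all hypothetical and only illustrate transforms working in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical illustration only: not Hop code and not the Hop API.
class PipelineSketch {
    // Unique end-of-stream marker, compared by identity rather than by value.
    private static final String END = new String("<end>");

    // Start one thread that reads rows from 'in', applies 'fn' and writes to 'out'.
    static Thread transform(BlockingQueue<String> in, BlockingQueue<String> out,
                            Function<String, String> fn) {
        Thread t = new Thread(() -> {
            try {
                for (String row = in.take(); row != END; row = in.take()) {
                    out.put(fn.apply(row));
                }
                out.put(END); // pass the end marker downstream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    // Run a two-transform pipeline: upper-case each row, then append "!".
    static List<String> run(List<String> rows) {
        BlockingQueue<String> input = new LinkedBlockingQueue<>();
        BlockingQueue<String> middle = new LinkedBlockingQueue<>();
        BlockingQueue<String> output = new LinkedBlockingQueue<>();
        List<String> result = new ArrayList<>();
        try {
            // Both transforms are started before any row is fed in.
            Thread first = transform(input, middle, String::toUpperCase);
            Thread second = transform(middle, output, s -> s + "!");
            for (String row : rows) {
                input.put(row);
            }
            input.put(END);
            for (String row = output.take(); row != END; row = output.take()) {
                result.add(row);
            }
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hop", "beam"))); // prints [HOP!, BEAM!]
    }
}
```

Row order is preserved here because each transform is a single thread reading from a FIFO queue. A workflow, in these terms, would be the code that decides when to call `run` and what to do if it fails.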
>
> The main components of the Hop platform are:
> * hop-gui: a visual data orchestration IDE
> * hop-run: a CLI tool to run workflows or pipelines
> * hop-config: a CLI tool to configure Hop and its components
> * hop-server: a light-weight web server to run and monitor workflows and
> pipelines
> * hop-translator: a tool for translating the various parts of the Hop tools
> (i18n).
> * hop-web: a thin client version of hop-gui for web browsers and mobile
> devices
>
>
> The cornerstone of the Hop platform is extensibility: all major components
> of the platform are designed to be pluggable. This allows any possible
> missing functionality to be created in a short amount of time.
>
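As a rough illustration of what "pluggable" can mean in practice, the sketch below keeps a registry that maps a plugin id to a piece of behaviour, so new functionality is added by registering a plugin rather than changing the core. This is an assumption about the general pattern, not a description of Hop's actual plugin system; `PluginRegistrySketch` and its methods are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

// Hypothetical illustration only: not Hop code and not the Hop plugin API.
class PluginRegistrySketch {
    // Plugin id -> row transformation supplied by the plugin.
    private final Map<String, UnaryOperator<String>> plugins = new TreeMap<>();

    // Register a plugin under a unique id; duplicates are rejected.
    void register(String id, UnaryOperator<String> plugin) {
        if (plugins.putIfAbsent(id, plugin) != null) {
            throw new IllegalArgumentException("Duplicate plugin id: " + id);
        }
    }

    // Look up a plugin by id and apply it to one row.
    String apply(String id, String row) {
        UnaryOperator<String> plugin = plugins.get(id);
        if (plugin == null) {
            throw new IllegalArgumentException("Unknown plugin id: " + id);
        }
        return plugin.apply(row);
    }

    public static void main(String[] args) {
        PluginRegistrySketch registry = new PluginRegistrySketch();
        registry.register("trim", String::trim);
        registry.register("upper", String::toUpperCase);
        System.out.println(registry.apply("upper", registry.apply("trim", "  hop  "))); // prints HOP
    }
}
```

The core only knows the registry interface, which is what lets a platform grow or shrink by adding or removing plugins.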
> Background
> ===========
> The Hop Orchestration Platform has its origins in the Kettle community.
> Kettle was acquired by Pentaho, and after Pentaho’s acquisition by Hitachi
> in 2015 the community struck out to solve problems less aligned with
> Hitachi’s interests.
>
> Rationale
> ==========
> In the Hop community, we have always aimed to function as a meritocracy,
> where contributions are accepted based on merit, and individuals gain
> status in the community based on their contributions (coding and
> otherwise). We’re proud to have a diverse group of people doing all the
> required things in a project: development, documentation, tutorials,
> architecture, testing, graphics design and much more. Bringing the project
> under the Apache Software Foundation would allow us to continue and grow,
> but also give our users confidence about the governance, IP status, and
> future of the project.
>
> ASF Preparation Phase
> ======================
> The very first goal of project Hop is to find a good way to cooperate on
> the development across wide geographical, economical and social spectra. To
> make this possible, real changes were needed to a codebase which is
> essentially 20 years old. Most of these changes have been tackled by now.
> We think it’s fair to say that by now Hop is a new platform, even though it
> partly started from the Kettle code base.
> Here are a few of the key focus areas we’re trying to safeguard going
> forward:
> * Plugins: lightweight plugins for all major functionality. This makes it
> possible to extend Hop or reduce Hop in size.  It also allows people to
> implement or change functionality with minimal coding.  In other words it
> makes it easier to contribute.
> * Maintain an open and responsive community where every concern, feedback
> and contribution is welcome.
> * Maintain a clear focus on data orchestration user requirements, not on
> “industry trends”
> * Documentation: we set up a version controlled “adoc” system with
> automated builds which is open, controlled and reviewed.  This is
> incredibly important for every Hop user and developer.
> * Testing and stability: we want to massively increase stability by
> implementing integration tests beyond the standard Java unit testing
> because of the dynamic nature of data orchestration work.  We still have a
> long way to go.  This work will never be finished.  It’s a clear and
> important goal nevertheless.
> * Simplicity: things are complex enough.  We follow the example of projects
> like Apache Spark and Flink, so that, for example, “hop-run.sh” does exactly
> what the name says without the need to dive into documentation.  As much as
> possible we make things self-evident and re-use existing terminology.
>
>
> For a list of the changes you can look at the monthly roundup which has
> been compiled since February 2020.  It documents the hard work of our
> community so far:
>
>
>         http://www.project-hop.org/news/roundup-2020-02/
>         http://www.project-hop.org/news/roundup-2020-03/
>         http://www.project-hop.org/news/roundup-2020-04/
>         http://www.project-hop.org/news/roundup-2020-05/
>         http://www.project-hop.org/news/roundup-2020-06/
>         http://www.project-hop.org/news/roundup-2020-08/
>
>
> Goals
> ======
> Here are a few more details and specifics of things we still want to take
> on going forward:
> * Add more plugin metadata to Transforms and Action plugins as well as
> their supported engines.  This will make it easier to refine the user
> interface and make the user experience better by giving to-the-point
> feedback on what operations are supported and required.  Example metadata
> to add: extra version and build information, dependencies, tags and labels
> (replacing categories), documentation links, input and output capabilities,
> engine capabilities and so on.
> * SWT:  While the Eclipse SWT project is still supported we want to make a
> list of all the commonly used API calls and stick to those with our own
> API. This will help the development of hop-web and allow us to possibly
> more easily migrate to different user interfaces later on.
> * Integration testing: every transform and action should have an
> integration test before it is released to ensure quality.  Java unit
> testing has proven insufficient in guarding against regressions in backward
> compatibility, stability and functionality.  We need to do better.
> * Apache VFS: Hop makes extensive use of this API to handle files.  As such
> we want to implement the various drivers for gs://, hdfs://, s3:// through
> standard Kettle plugins making it easier to choose which protocols to
> support.
> * Variables & Parameters:  make this experience more intuitive, clean up
> the underlying API and add more options to the various user interfaces
> responsible for setting and passing variables and parameters.
> * Make Hop-Web an integral part of the Apache Hop project removing the code
> duplication (fork) we’re dealing with now.  This includes the need to
> improve various user interfaces which were designed for non-web clients.
> * Make best practices and governance functionality an integral part of the
> API of the project:
>    * Data sets and unit testing (already done)
>    * Environments and lifecycle management (partly done)
>    * Git support (partly done)
>    * Auditing and lineage
>    * Software policies and enforcement thereof
>    * Configuration management (partly done)
>
>
> Current Status
> ===============
>
> Meritocracy
> ------------
> With Project Hop, we actively work to foster the existing community and
> encourage community contributions. As of September 1st 2020 we have received
> over 250 pull requests and have around 600 tickets in our JIRA platform
> (many of which were created by community members), and we have active
> discussions in our Mattermost chat platform with over 80 members.
>
>
> Over the last half year we started to ask users on our chat server for
> specific feedback on terminology, features and so on.  It’s been a
> wonderfully positive experience to have in-depth discussions on complex
> issues with industry experts. We look forward to moving these discussions
> and votes to an Apache mailing list.
>
> Community
> ------------
> Hop is developed, extended and maintained by a global community of users
> and developers. The Hop community is what has driven its development and
> growth.
> Hop’s history has generated a lot of interest in the project and has
> already led to a number of contributions, including documentation and
> translations.
>
> Core Developers
> ----------------
> We have a diverse group of core developers with people joining on a regular
> basis.  Matt Casters, Rodrigo Haces and David Rosenblum are part time
> developers on Hop, salaried by Neo Solutions.  Bart Maertens, Hans Van
> Akelyen and Yannick Mols are part time Hop developers paid by know.bi.
> Doug and Gretchen Moran were Pentaho employees.  Along with Rafael
> Valenzuela, Dan Keeley, Jason Chu, Sergio Ramazzina and many others, they
> have been consultants and community members for over a decade and joined
> the Hop community in the last year or two.
>
>
> Alignment
> ----------
> We want to anchor and safeguard our development and community building
> efforts for the future. We strongly believe that as an Apache project this
> can be achieved in the best possible way. The Hop project also started to
> align with projects like Apache Beam, Spark and Flink in its use of
> terminology, tools, manner of configuration and so on.  As mentioned
> elsewhere in this document Hop is a large user of other Apache projects and
> libraries and we believe that becoming an Apache project is beneficial.
> Specifically for Apache Beam we believe that providing a visual pipeline
> development tool can be of great value.
>
> Known Risks
> ============
> While the Kettle code base from which we started is already released
> under the Apache License 2.0, proper attribution to Hitachi Vantara needs
> to happen.
> We have no knowledge of existing patents on any part of the Kettle
> codebase.
> To further reduce any risk of there even being any discussion on naming, the
> Hop team decided to rename the project, its tools (to be more self-evident
> as well), the Java API and even the main concepts (Transformations are now
> called Pipelines, in line with Apache Beam naming conventions).
>
> Orphaned products
> ------------------
> There is little risk that the project will become orphaned. The list of
> active developers is large, and consists of a mix of developers  who have
> been working on the code for several years and recent arrivals in the
> community.
>
> Inexperience with Open Source
> ------------------------------
> The project team has a long history in open source and has contributed to
> Apache licensed open source projects, mostly in the Kettle ecosystem such
> as Kettle itself and the many plugins and projects surrounding it. The
> experience gained there has allowed us to quickly set up all required build
> tools and processes.  In its fairly short history, Hop has been advocating
> open source in all aspects of the project. Our submission to the Apache
> Software Foundation is a logical extension of our commitment to open source
> software.
>
> Licensing
> ----------
> The original source code we started from (see below) has been open source
> since December 2005, initially under the Lesser GPL but since January 2012
> all under the Apache License version 2.0. All Hop code has been scanned for
> compliance with the Apache License 2.0. We integrated Apache Rat with our
> build process.
>
> Heterogeneous Developers
> -------------------------
> Hop is built, developed and maintained by a global community of
> developers.  Input comes from a large group of developers and users from
> all over the world.  At this moment over 7 companies contribute to Hop
> through their developers, along with a number of individuals and consultants.
>
> Reliance on Salaried Developers
> --------------------------------
> Hop developers are a mix of volunteers, enthusiasts and people working for
> an employer. There is also a group of consultants who want to be involved
> in Hop because it allows them to do projects with it.  They are in fact our
> most important users and developers since they provide valuable feedback
> from the trenches.
>
> Relationships with Other Apache Products
> -----------------------------------------
> Hop is a heavy user of Apache software libraries.
>
> Apache Commons usage:
> * commons-beanutils
> * commons-cli
> * commons-codec
> * commons-collections
> * commons-collections4
> * commons-compiler
> * commons-compress
> * commons-configuration
> * commons-database-model
> * commons-dbcp
> * commons-digester
> * commons-el
> * commons-httpclient
> * commons-io
> * commons-lang and commons-lang3
> * commons-logging
> * commons-math and commons-math3-3.5.jar
> * commons-net
> * commons-pool
> * commons-validator
> * commons-vfs2
>
>
> Other libraries:
> * Apache Batik : for the front-end SVG drawing
> * Apache Xerces (XSLT, XML processing)
>
>
> Other usage of Apache projects related to Hop (plugins):
> * Apache Avro
> * Apache Beam w/ Apache Spark, Apache Flink, …
> * Apache Cassandra
> * Apache CouchDB
> * Apache Derby
> * Apache Flume
> * Apache Hadoop
> * Apache Hive
> * Apache Kafka
> * Apache Solr
> * Apache Subversion
> * Apache Zookeeper
>
>
> For the build process
> * Apache Maven
> * Jenkins
>
> An Excessive Fascination with the Apache Brand
> -----------------------------------------------
> With this proposal we are not seeking attention or publicity. Rather, we
> firmly believe in Hop, visual data pipeline development and the ability to
> treat the developed data pipelines (ETL) as software code. While the
> original Hop code has been open source for about 15 years, we believe
> putting code on GitHub can only go so far. We see the Apache community,
> processes, and mission as critical for ensuring Hop is truly
> community-driven, positively impactful, and innovative open source
> software. We believe Hop is a great fit for the Apache Software Foundation
> due to its focus on visual data processing and its relationships to
> existing ASF projects.
>
> Documentation
> ==============
> Over the years, the community has contributed extensive documentation to
> wiki.pentaho.com. Over time, areas of the available information have
> become incomplete or outdated. Most of this documentation has been reviewed
> and updated, and will be contributed to the Apache Software Foundation with
> the Hop source code. Documentation for the extensive new functionality that
> was added to Hop in recent months is being written.
> We consider documentation to be a core piece of the Hop platform and will
> treat documentation as any other item of code.
>
> Initial Source
> ===============
> While there isn’t a Java class in Hop which is unchanged from its origins,
> we should mention we selected this source code to form the base of Apache
> Hop:
> https://github.com/pentaho/pentaho-kettle/tree/8.2.0.7-R
>
> We merged various changes from the WebSpoon fork found over here:
> https://github.com/HiromuHota/pentaho-kettle
>
>
> Various community driven Kettle plugins were written to bypass bugs, slow
> down code-rot and to implement missing features.  They were merged
> into Hop from these locations:
> https://github.com/mattcasters/kettle-debug-plugin (better debugging)
> https://github.com/mattcasters/kettle-beam (Apache Beam support)
> https://github.com/mattcasters/pentaho-pdi-dataset (Unit Testing)
> https://github.com/mattcasters/kettle-needful-things (Bug fixes &
> workarounds)
> https://github.com/mattcasters/kettle-environment (Environment management)
>
>
> The Hop repositories are currently hosted at:
> https://github.com/project-hop/
> * Hop: source code for the Hop project
> * Hop-doc: technical documentation for the Hop project
> * Hop-website: Hop website and content repository
> * Hop-docker: Docker containers, Kubernetes
>
> Source and Intellectual Property Submission Plan
> =================================================
> The originating source code is already licensed under an Apache 2 license:
> * https://github.com/pentaho/pentaho-kettle/blob/8.2.0.7-R/LICENSE.txt
> * https://github.com/HiromuHota/pentaho-kettle/blob/webspoon-8.3/LICENSE.txt
> * https://github.com/mattcasters/kettle-debug-plugin/blob/master/LICENSE
> * https://github.com/mattcasters/kettle-beam/blob/master/LICENSE
> * https://github.com/mattcasters/pentaho-pdi-dataset/blob/master/LICENSE.txt
> * https://github.com/mattcasters/kettle-needful-things/blob/master/LICENSE
> * https://github.com/mattcasters/kettle-environment/blob/master/LICENSE
>
>
> For all contributions we have an agreement in place:
> https://cla-assistant.io/project-hop/hop
>
> External Dependencies
> ======================
> Over the course of the last year we removed non-essential dependencies as
> much as possible and replaced them by interfaces and plugin types. We did
> this to simplify the architecture.
> It’s important to note all external dependencies are licensed under an
> Apache 2.0 or Apache-compatible license. As we grow the Hop community we
> will configure our build process to require and validate all contributions
> and dependencies are licensed under the Apache 2.0 license or are under an
> Apache-compatible license.
>
> Cryptography
> =============
>
> Required Resources
> ===================
>
> Mailing lists
> --------------
> We currently use a mix of email and Mattermost. We will migrate our
> existing mailing lists to the following:
>
> dev@hop.incubator.apache.org
> user@hop.incubator.apache.org
> private@hop.incubator.apache.org
> commits@hop.incubator.apache.org
>
> Git Repository
> ---------------
> The Hop code is currently in git and we’d like to keep it that way. We request
> a git repository for incubator-hop with mirroring to GitHub.
>
> Issue Tracking
> ---------------
> We request the creation of an Apache-hosted JIRA.
>
> Jira ID: HOP
>
>
> Other Resources
> ----------------
> To allow other projects to use Hop as a library we would love to publish
> artifacts on a Maven server like maven.apache.org.
>
> Initial Committers
> ===================
> * Nicholas Adment <nadment@gmail.com>
> * Hans Van Akelyen <hans.van.akelyen@know.bi>
> * Lokke Bruyndonckx <lokke.bruyndonckx@know.bi>
> * Matt Casters <matt.casters@neo4j.com>
> * Jason Chu <jianjunchu@gmail.com>
> * Peter Fabricius <info@peter-fabricius.de>
> * Rodrigo Haces <rodrigo.haces@neo4j.com>
> * Dave Henry <dshenry99@gmail.com>
> * Hiromu Hota <hiromu.hota@gmail.com>
> * Brandon Jackson <usbrandon@gmail.com>
> * Dan Keeley <dan@dankeeley.co.uk>
> * Bart Maertens <bart.maertens@know.bi>
> * Yannick Mols <yannick.mols@know.bi>
> * Doug Moran <doug@dougandgretchen.com>
> * Gretchen Moran <gretchen@dougandgretchen.com>
> * Sergio Ramazzina <sergio.ramazzina@serasoft.it>
> * Maria Carina Roldan <maria.carina.roldan@gmail.com>
> * David Rosenblum <david.rosenblum@neo4j.com>
> * Rafael Valenzuela <ravamo@gmail.com>
>
> Affiliations
> =============
> * Neo4J
>    * Matt Casters
>    * Rodrigo Haces
>    * David Rosenblum
> * Know.bi
>    * Bart Maertens
>    * Hans Van Akelyen
>    * Lokke Bruyndonckx
>    * Yannick Mols
> * eHealth Africa
>    * Doug & Gretchen Moran
> * Schemetrica
>    * Dave Henry
> * Beijing Auphi Data Co
>    * Jason Chu
> * Serasoft Italy
>    * Sergio Ramazzina
> * Hitachi Research
>    * Hiromu Hota
>
>
> Sponsors
> =========
> Champion
> ---------
> Maximilian Michels (mxm@apache.org)
>
> Nominated Mentors
> ------------------
> Tom Barber (magicaltrout@apache.org)
> Julian Hyde (jhyde@apache.org)
> Maximilian Michels (mxm@apache.org)
>
> Sponsoring Entity
> ==================
> The Apache Incubator
>
