incubator-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jean-Baptiste Onofré>
Subject Re: [discuss] Apache Gobblin Incubator Proposal
Date Wed, 15 Feb 2017 19:35:33 GMT
Thanks for the proposal Jim.


On Feb 15, 2017, 11:37, at 11:37, Jim Jagielski <> wrote:
>If you need/want another mentor, I volunteer
>> On Feb 14, 2017, at 3:53 PM, Olivier Lamy <> wrote:
>> Hi
>> Well I don't see issues as no one discuss the proposal.
>> So I will start the official vote tomorrow.
>> Cheers
>> Olivier
>> On 6 February 2017 at 14:08, Olivier Lamy <> wrote:
>>> Hello everyone,
>>> I would like to submit to you a proposal to bring Gooblin to the
>>> Software Foundation.
>>> The text of the proposal is included below and available as a draft
>>> in the Wiki:
>>> We will appreciate any feedback and input.
>>> Olivier on behalf of the Gobblin community
>>> = Apache Gobblin Proposal =
>>> == Abstract ==
>>> Gobblin is a distributed data integration framework that simplifies
>>> aspects of big data integration such as data ingestion, replication,
>>> organization and lifecycle management for both streaming and batch
>>> ecosystems.
>>> == Proposal ==
>>> Gobblin is a universal data integration framework. The framework has
>>> used to build a variety of big data applications such as ingestion,
>>> replication, and data retention. The fundamental constructs provided
>by the
>>> Gobblin framework are:
>>> 1. An expandable set of connectors that allow data to be integrated
>>> a variety of sources and sinks. The range of connectors already
>>> in Gobblin are quite diverse and are an ever expanding set. To
>>> just a few examples, connectors exist for databases (e.g., MySQL,
>>> Teradata, Couchbase etc.), web based technologies (REST APIs,
>>> servers, Filers), scalable storage (HDFS, S3, Ambry etc,), streaming
>>> (Kafka, EventHubs etc.), and a variety of proprietary data sources
>>> sinks (e.g.Salesforce, Google Analytics, Google Webmaster etc.).
>>> Gobblin has a rich library of converters that allow for conversion
>of data
>>> from one format to another as data moves across system boundaries
>>> AVRO in HDFS to JSON in another system).
>>> 2. Gobblin has a well defined and customizable state management
>>> that allows writing stateful applications. These are particularly
>>> when solving problems like bulk incremental ingest and keeping
>>> clusters replicated in sync. The ability to record work that has
>>> completed and what remains in a scalable manner is critical to
>writing such
>>> diverse applications successfully.
>>> 3. Gobblin is agnostic to the underlying execution engine. It can be
>>> tailored to run ontop of a variety of execution frameworks ranging
>>> multiple processes on a single node, to open source execution
>engines like
>>> MapReduce, Spark or Samza, natively on top of raw containers like
>Yarn or
>>> Mesos, and the public cloud like Amazon AWS or Microsoft Azure. We
>>> extending Gobblin to run on top of a self managed cluster when
>security is
>>> vital.  This allows different applications that require different
>>> of scalability, latency or security to be customized to for their
>>> needs. For example, highly latency sensitive applications can be
>>> in a streaming environment while batch based execution might benefit
>>> applications where the priority might be geared towards optimal
>>> utilization.
>>> 4.Gobblin comes out of the box with several diagnosability features
>>> Gobblin metrics and error handling. Collectively, these features
>>> Gobblin to operate at the scale of petabytes of data. To give just
>>> example, the ability to quarantine a few bad records from an
>isolated Kafka
>>> topic without stopping the entire flow from continued execution is
>>> when the number of Kafka topics range in the thousands and the
>>> data handled is in the petabytes.
>>> Gobblin thus provides crisply defined software constructs that can
>be used
>>> to build a vast array of data integration applications customizable
>>> varied user needs. It has become a preferred technology for data
>>> integration use-cases by many organizations worldwide (see a partial
>>> here).
>>> == Background ==
>>> Over the last decade, data integration has evolved use case by use
>case in
>>> most companies. For example, at LinkedIn, when Kafka became a
>>> part of the data ecosystem, a system called Camus was built to
>ingest this
>>> data for analytics processing on Hadoop. Similarly, we had custom
>>> to ingest data from Salesforce, Oracle and myriad other sources.
>>> pattern became the norm rather than the exception and one point,
>>> was running at least fifteen different types of ingestion pipelines.
>>> fragmentation has several unfortunate implications. Operational
>costs scale
>>> with the number of pipelines even if the myriad pipelines share a
>>> array of common features. Bug fixes and performance optimizations
>cannot be
>>> shared across the pipelines. A common set of practices around
>debugging and
>>> deployment does not emerge. Each pipeline operator will continue to
>>> in his little silo of the data integration world completely
>oblivious to
>>> the challenges of his fellow operator sitting five tables down.
>>> These experiences were the genesis behind the design and
>implementation of
>>> Gobblin. Gobblin thus started out as a universal data ingestion
>>> focussed on extracting, transforming, and synchronizing large
>volumes of
>>> data between different data sources and sinks. Not surprisingly,
>given its
>>> origins, the initial design of Gobblin placed great emphasis on
>>> abstractions that can be leveraged repeatedly. These abstractions
>>> stood the test of time at LinkedIn and we have been able to leverage
>>> constructs well beyond ingest. Gobblin's architecture has allowed us
>>> LinkedIn to use it for a variety of applications ranging from from
>>> format conversion to adhering to compliance policies set by European
>>> standards. Finally, as noted earlier, Gobblin can be deployed in a
>>> of execution environments: it can be deployed as a library embedded
>>> another application or can be used to execute jobs on a public
>cloud. A
>>> fluid architectural and execution design story has allowed Gobblin
>>> become a truly successful data integration platform.
>>> Gobblin has continued to evolve with a variety of utility packages
>>> Gobblin metrics and Gobblin config management. Collectively, these
>>> organizations utilizing Gobblin to use a system that has been battle
>>> at LinkedIn scale. This is something that its consumers have to come
>>> appreciate greatly.
>>> == Rationale ==
>>> Gobblin's entry to the Apache foundation is beneficial to both the
>>> and the Apache communities. Gobblin has greatly benefited from its
>>> source roots. Its community and adoption has grown greatly as a
>>> More importantly, the feedback from the community whether through
>>> interactions at meetups or through the mailing list have allowed for
>a rich
>>> exchange of ideas. In order to grow up the Gobblin community and
>>> the project, we would like to propose Gobblin to the Apache
>incubator. The
>>> Gobblin community will greatly benefit from the established
>development and
>>> consensus processes that have worked well for other projects. The
>>> process has served many other open source projects well and we
>believe that
>>> the Gobblin community will greatly benefit from these practices as
>>> == Initial Goals ==
>>> Migrate the existing codebase to Apache
>>> Study and Integrate with the Apache development process
>>> Ensure all dependencies are compliant with Apache License version
>>> Incremental development and releases per Apache guidelines
>>> Improve the relationship between Gobblin and other Apache projects
>>> == Current Status ==
>>> Gobblin has undergone five major releases (0.5, 0.6, 0.7, 0.8, 0.9)
>>> many minor ones. The latest version, Gobblin 0.9 has just been
>released in
>>> December, 2016. Gobblin is being used in production by over 20
>>> organizations. Gobblin codebase is currently hosted at,
>>> will seed the Apache git repository.
>>> === Meritocracy ===
>>> We plan to invest in supporting a meritocracy. We will discuss the
>>> requirements in an open forum. Several companies have already
>>> interest in this project, and we intend to invite additional
>developers to
>>> participate. We will encourage and monitor community participation
>so that
>>> privileges can be extended to those that contribute.
>>> === Community ===
>>> The need for a extensible and flexible data integration platform in
>>> open source is tremendous. Gobblin is currently being used by at
>least 20
>>> organizations worldwide (some examples are listed here). By bringing
>>> Gobblin into Apache, we believe that the community will grow even
>>> === Core Developers ===
>>> Gobblin was started by engineers at LinkedIn, and now has developers
>>> Google, Facebook, LinkedIn, Cloudera, Nerdwallet, Swisscom, and many
>>> companies.
>>> === Alignment ===
>>> Gobblin aligns exceedingly well with the Apache ecosystem. Gobblin
>>> built leveraging several existing Apache projects (Apache Helix,
>>> Zookeeper etc.). As Gobblin matures, we expect to leverage several
>>> Apache projects further. This leverage invariably results in
>>> back to these projects (e.g., a contribution to Helix was made
>during the
>>> Gobblin Yarn development). Finally, being an integration platform,
>>> serves as a bridge between several Apache projects like Apache
>Hadoop and
>>> Apache Kafka. This integration is highly desired and their
>>> through Gobblin will lead to a virtuous cycle of greater adoption
>and newer
>>> features in these projects. Thus, we believe that it will be a nice
>>> addition to the current set of big data projects under the auspices
>of the
>>> Apache foundation.
>>> == Known Risks ==
>>> === Orphaned Products ===
>>> The risk of the Gobblin project being abandoned is minimal. As noted
>>> earlier, there are many organizations that have already invested in
>>> significantly and are thus incentivized to continue development.
>Many of
>>> these organizations operate critical data ingest, compliance and
>>> pipelines built with Gobblin and are thus heavily invested in the
>>> success of Gobblin.
>>> === Inexperience with Open Source ===
>>> Gobblin has existed as a healthy open source project for several
>>> During that time, we have curated an open-source community
>>> Any risks that we foresee are ones associated with scaling our open
>>> communication and operation process rather than with inherent
>>> in operating an open source project.
>>> === Homogenous Developers ===
>>> Gobblin’s committers are employed by companies of varying sizes and
>>> industry. Committers come from well heeled internet companies like
>>> LinkedIn and Facebook. We also have developers from traditional
>>> companies like SwissCom. Well funded startups like Nerdwallet are
>active in
>>> the community of developers. We  plan to double our efforts in
>>> a diverse set of committers for Gobblin.
>>> === Reliance on Salaried Developers ===
>>> It is expected that Gobblin development will occur on both salaried
>>> and on volunteer time, after hours. The majority of initial
>committers are
>>> paid by their employer to contribute to this project. However, they
>are all
>>> passionate about the project, and we are confident that the project
>>> continue even if no salaried developers contribute to the project.
>We are
>>> committed to recruiting additional committers including non-salaried
>>> developers.
>>> === Relationships with Other Apache Products ===
>>> As noted earlier, Gobblin leverages several open source projects and
>>> contributes back to them. There is also overlap with aspects of
>>> Apache projects that we will discuss briefly here. Apache Nifi, like
>>> Gobblin aspires to reduce the operational overhead arising from data
>>> heterogeneity. Apache Nifi is structured as a visual flow based
>>> and provides built-in constructs for buffering data, prioritizing
>data, and
>>> understanding data lineage as data flows across systems. Apache Nifi
>>> its own dataflow based execution engine with buffering, scheduling
>>> streaming capabilities. Apache Falcon is a Hadoop centric data
>>> engine for defining, scheduling, and monitoring data management
>>> through flow definition typically for data that has been ingested
>>> Hadoop already. Apache Falcon generally delegates data management
>jobs to
>>> tools that already exist in the Hadoop ecosystem (e.g. Distcp,
>Sqoop, Hive
>>> etc). Apache Sqoop is primarily geared for bulk ingest especially
>>> databases which is one part of Gobblin’s feature set. Apache Flume
>>> primarily on streaming data movement. Finally, general purpose data
>>> processing engines like Apache Flink, Apache Samza, and Apache Spark
>>> on generic computation.
>>> Gobblin design choices intersect with specific features in all of
>>> systems, however in aggregate, it is a different point in the design
>>> It is designed to handle both streaming and batch data. It supports
>>> execution through a standalone cluster mode as well as through
>>> frameworks such as MR, Yarn, Hive, Samza etc allowing users to
>choose the
>>> deployment model that is optimal for the specific data integration
>>> challenge. It provides native optimized implementations for critical
>>> integrations such as Kafka, Hadoop - Hadoop copies etc. Gobblin also
>>> supports both Hadoop and non-Hadoop data, being able to ingest data
>>> Kafka as well as other key-value stores like Couchbase. Gobblin is
>also not
>>> just a generic computation framework, it has specific constructs for
>>> integration patterns such as data quality metrics and policies.
>>> configuration management system allows it to be fully multi-tenant
>and take
>>> advantage of grouped policies when required. For batch workloads,
>>> has a planning phase that provides for better resource utilization.
>>> In summary, there is healthy diversity in the number of systems
>>> approaching the interesting and pressing problem of big data
>>> We believe that Gobblin will provide another compelling choice in
>>> design space.
>>> === An Excessive Fascination with the Apache Brand ===
>>> Gobblin is already a healthy and well known open source project.
>>> proposal is not for the purpose of generating publicity. Rather, the
>>> primary benefits to joining Apache are already outlined in the
>>> section.
>>> == Documentation ==
>>> The reader will find these websites highly relevant:
>>> * Website:
>>> * Documentation:
>>> * Codebase:
>>> * User group:!forum/gobblin-users
>>> == Source and Intellectual Property Submission Plan ==
>>> The Gobblin codebase is currently hosted on Github. This is the
>>> codebase that we would migrate to the Apache foundation.The Gobblin
>>> code is already licensed under Apache License Version 2.0. Going
>>> we will continue to have all the contributions licensed directly to
>>> Apache foundation through our signed Individual Contributor License
>>> Agreements for all the committers on the project.
>>> == External Dependencies ==
>>> To the best of our knowledge, all of Gobblin dependencies are
>>> under Apache compatible licenses. Upon acceptance to the incubator,
>>> would begin a thorough analysis of all transitive dependencies to
>>> this fact and introduce license checking into the build and release
>>> (for instance integrating Apache Rat).
>>> == Cryptography ==
>>> We do not expect Gobblin to be a controlled export item due to the
>use of
>>> encryption.
>>> == Required Resources ==
>>> === Mailing lists ===
>>> * gobblin-user
>>> * gobblin-dev
>>> * gobblin-commits
>>> * gobblin-private for private PMC discussions (with moderated
>>> subscriptions)
>>> === Subversion Directory ===
>>> Git is the preferred source control system:
>>> === Issue Tracking ===
>>> JIRA Gobblin (GOBBLIN)
>>> === Other Resources ===
>>> The existing code already has unit and integration tests, so we
>>> like a Jenkins instance to run them whenever a new patch is
>submitted. This
>>> can be added after project creation.
>>> == Initial Committers ==
>>> * Abhishek Tiwari <abhishektiwari dot btech at gmail dot com>
>>> * Shirshanka Das <shirshanka at apache dot org>
>>> * Chavdar Botev <cbotev at gmail dot com>
>>> * Sahil Takiar <takiar.sahil at gmail dot com>
>>> * Yinan Li <liyinan926 at gmail dot com>
>>> * Ziyang Liu <>
>>> * Lorand Bendig <lbendig at gmail dot com>
>>> * Issac Buenrostro <ibuenros at linkedin dot com>
>>> * Hung Tran <hutran at linkedin dot com>
>>> * Olivier Lamy <olamy at apache dot org>
>>> * Jean-Baptiste Onofré <>
>>> == Affiliations ==
>>> * Abhishek Tiwari - LinkedIn
>>> * Shirshanka Das - LinkedIn
>>> * Chavdar Botev - Stealth Startup
>>> * Sahil Takiar - Cloudera
>>> * Yinan Li - Google
>>> * Ziyang Liu - Facebook
>>> * Lorand Bendig - Swisscom
>>> * Issac Buenrostro - LinkedIn
>>> * Hung Tran - LinkedIn
>>> * Olivier Lamy - Webtide
>>> * Jean-Baptiste Onofre - Talend
>>> == Sponsors ==
>>> === Champion ===
>>> Olivier Lamy < olamy at apache dot org>
>>> === Nominated Mentors ===
>>> * Olivier Lamy <olamy at apache dot org>
>>> * Jean-Baptiste Onofre <jbonofre at apache dot org>
>>> * ?
>>> * ?
>>> == Sponsoring Entity ==
>>> The Apache Incubator
>> -- 
>> Olivier Lamy
>> |
>To unsubscribe, e-mail:
>For additional commands, e-mail:

  • Unnamed multipart/alternative (inline, 7-Bit, 0 bytes)
View raw message