incubator-general mailing list archives

From Amareshwari Sriramdasu <>
Subject Re: [VOTE] Accept Eagle into Apache Incubation
Date Sun, 25 Oct 2015 01:31:54 GMT
+1 (binding)

On Fri, Oct 23, 2015 at 7:41 PM, Manoharan, Arun <>

> Hello Everyone,
> Thanks for all the feedback on the Eagle Proposal.
> I would like to call for a [VOTE] on Eagle joining the ASF as an
> incubation project.
> The vote is open for 72 hours:
> [ ] +1 accept Eagle in the Incubator
> [ ] ±0
> [ ] -1 (please give reason)
> Eagle is a monitoring solution for Hadoop that instantly identifies access
> to sensitive data, recognizes attacks and malicious activities, and takes
> action in real time. Eagle supports a wide variety of policies on HDFS data
> and Hive. Eagle also provides machine learning models for detecting
> anomalous user behavior in Hadoop.
> The proposal is available on the wiki here:
> The text of the proposal is also available at the end of this email.
> Thanks for your time and help.
> Thanks,
> Arun
> <COPY of the proposal in text format>
> Eagle
> Abstract
> Eagle is an open source monitoring solution for Hadoop that instantly
> identifies access to sensitive data, recognizes attacks and malicious
> activities in Hadoop, and takes action.
> Proposal
> Eagle audits access to HDFS files, Hive and HBase tables in real time,
> enforces policies defined on sensitive data access, and alerts on or blocks
> a user’s access to that sensitive data in real time. Eagle also creates
> user profiles based on typical access behaviour for HDFS and Hive and sends
> alerts when anomalous behaviour is detected. Eagle can also import
> sensitive data information classified by external classification engines to
> help define its policies.
> Overview of Eagle
> Eagle has 3 main parts.
> 1. Data collection and storage - Eagle collects data from various Hadoop
> logs in real time using Kafka and the YARN API, and uses HDFS and HBase for
> storage.
> 2. Data processing and policy engine - Eagle allows users to create
> policies based on various metadata properties on HDFS, Hive and HBase data.
> 3. Eagle services - Eagle services include the policy manager, the query
> service and the visualization component. Eagle provides an intuitive user
> interface to administer Eagle and an alert dashboard to respond to
> real-time alerts.
> Data Collection and Storage:
> Eagle provides a programming API for extending Eagle to integrate any data
> source into the Eagle policy evaluation framework. For example, Eagle HDFS
> audit monitoring collects data from Kafka, which is populated from a
> NameNode log4j appender or from a Logstash agent. Eagle Hive monitoring
> collects Hive query logs from running jobs through the YARN API, which is
> designed to be scalable and fault-tolerant. Eagle uses HBase as storage for
> metadata and metrics data, and also supports relational databases through a
> configuration change.
> Data Processing and Policy Engine:
> Processing Engine: Eagle provides a stream processing API which is an
> abstraction of Apache Storm. It can also be extended to other streaming
> engines. This abstraction allows developers to assemble data
> transformation, filtering, external data joins etc. without being
> physically bound to a specific streaming platform. The Eagle streaming API
> allows developers to easily integrate business logic with the Eagle policy
> engine; internally, the Eagle framework compiles the business logic
> execution DAG into program primitives of the underlying stream
> infrastructure, e.g. Apache Storm. For example, Eagle HDFS monitoring
> transforms audit logs from the NameNode into objects and joins sensitivity
> metadata and security zone metadata, which are generated by external
> programs or configured by the user. Eagle Hive monitoring filters running
> jobs to get the Hive query string, parses the query string into an object,
> and then joins sensitivity metadata.
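The transform-and-join step described above can be illustrated with a minimal Python sketch. This is not Eagle's API: the function names, the sample audit line and the prefix-to-label mapping are all hypothetical, and the sketch assumes audit fields appear as whitespace-separated key=value pairs and that sensitivity metadata is keyed by path prefix.

```python
import re

# Assumption: audit fields appear as whitespace-separated key=value pairs.
_KEY_VALUE = re.compile(r"(\w+)=(\S+)")

# Hypothetical sample line, written for illustration only.
SAMPLE_LINE = (
    "2015-10-23 19:41:00,123 INFO FSNamesystem.audit: allowed=true "
    "ugi=alice (auth:SIMPLE) ip=/10.0.0.1 cmd=open "
    "src=/data/pii/users.csv dst=null perm=null"
)


def parse_audit_line(line):
    """Transform one raw audit log line into a dict of its key=value fields."""
    return dict(_KEY_VALUE.findall(line))


def join_sensitivity(event, sensitivity_by_prefix):
    """Return a copy of the event tagged with the label of the longest
    matching path prefix, or None when no prefix matches."""
    src = event.get("src", "")
    matches = [prefix for prefix in sensitivity_by_prefix if src.startswith(prefix)]
    enriched = dict(event)
    if matches:
        enriched["sensitivity"] = sensitivity_by_prefix[max(matches, key=len)]
    else:
        enriched["sensitivity"] = None
    return enriched


if __name__ == "__main__":
    # Hypothetical sensitivity metadata, as if imported from an external classifier.
    metadata = {"/data/pii": "PII", "/data": "INTERNAL"}
    print(join_sensitivity(parse_audit_line(SAMPLE_LINE), metadata))
```

Choosing the longest matching prefix is one simple way to let a more specific classification override a broader one; the proposal does not say how Eagle resolves overlapping metadata.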
> Alerting Framework: The Eagle Alert Framework includes a stream metadata
> API, a scalable policy engine framework and an extensible policy engine
> framework. The stream metadata API allows developers to declare an event
> schema, including what attributes constitute an event, what the type of
> each attribute is, and how to dynamically resolve attribute values at
> runtime when a user configures a policy. The scalable policy engine
> framework allows policies to be executed on different physical nodes in
> parallel. It also lets you define your own policy partitioner class. The
> policy engine framework, together with the stream partitioning capability
> provided by all streaming platforms, makes sure policies and events can be
> evaluated in a fully distributed way. The extensible policy engine
> framework allows developers to plug in a new policy engine with a few lines
> of code. The WSO2 Siddhi CEP engine is the policy engine which Eagle
> supports as a first-class citizen.
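To make the idea of a policy partitioner concrete, here is a minimal Python sketch. It is an assumption about how such a partitioner could work, not Eagle's actual class: the function names are hypothetical, and hashing the policy id with CRC32 is simply one stable way to spread policies over evaluator nodes.

```python
import zlib


def partition_policy(policy_id, num_evaluators):
    """Map a policy id to an evaluator slot in [0, num_evaluators).

    CRC32 is used because it is stable across processes and runs,
    unlike Python's built-in hash() for strings.
    """
    if num_evaluators <= 0:
        raise ValueError("num_evaluators must be positive")
    return zlib.crc32(policy_id.encode("utf-8")) % num_evaluators


def assign_policies(policy_ids, num_evaluators):
    """Group policy ids by the evaluator slot that should execute them."""
    assignment = {slot: [] for slot in range(num_evaluators)}
    for policy_id in policy_ids:
        assignment[partition_policy(policy_id, num_evaluators)].append(policy_id)
    return assignment


if __name__ == "__main__":
    # Hypothetical policy names, for illustration only.
    policies = ["hdfs-pii-read", "hive-finance-query", "hdfs-delete-burst"]
    print(assign_policies(policies, 2))
```

Each evaluator then only loads the policies in its own slot, so adding evaluators spreads the policy evaluation load without any coordination beyond agreeing on the hash.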
> Machine Learning module: Eagle provides capabilities to define user
> activity patterns or user profiles for Hadoop users based on user
> behaviour in the platform. These user profiles are modeled using machine
> learning algorithms and used for detection of anomalous user activities.
> Eagle uses Eigenvalue Decomposition and Density Estimation algorithms for
> generating user profile models. Model generation reads data from HDFS
> audit logs, preprocesses and aggregates the data, and generates models
> using Spark programming APIs. Once models are generated, Eagle uses its
> stream processing engine for near real-time anomaly detection to determine
> whether any user’s activities are suspicious.
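As a rough illustration of density-estimation-based anomaly detection, the Python sketch below fits an independent Gaussian per feature to a user's historical activity counts and flags observations whose log density falls below a threshold. The proposal does not say which density estimator Eagle uses, so the Gaussian model, the function names and the sample numbers are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def fit_profile(history):
    """Fit a (mean, std) pair per feature from historical feature vectors.

    Each row of history holds one interval's activity counts for a user,
    e.g. [opens, deletes]. A small floor on std avoids division by zero.
    """
    columns = list(zip(*history))
    return [(mean(col), max(pstdev(col), 1e-6)) for col in columns]


def log_density(profile, observation):
    """Log density of an observation under independent per-feature Gaussians."""
    total = 0.0
    for (mu, sigma), x in zip(profile, observation):
        total += -0.5 * math.log(2 * math.pi * sigma ** 2)
        total -= (x - mu) ** 2 / (2 * sigma ** 2)
    return total


def is_anomalous(profile, observation, threshold):
    """Flag the observation when its log density is below the threshold."""
    return log_density(profile, observation) < threshold


if __name__ == "__main__":
    # Hypothetical per-interval counts of [opens, deletes] for one user.
    history = [[10, 1], [12, 0], [11, 1], [9, 2]]
    profile = fit_profile(history)
    # Assumed threshold: a fixed margin below the least likely past interval.
    threshold = min(log_density(profile, row) for row in history) - 5.0
    print(is_anomalous(profile, [10, 1], threshold))
    print(is_anomalous(profile, [500, 40], threshold))
```

In a pipeline like the one described above, the fitting step would run offline over aggregated audit logs, and only the cheap density check would run in the streaming path.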
> Eagle Services:
> Query Service: Eagle provides a SQL-like service API to support
> comprehensive computation over huge data sets on the fly, e.g.
> comprehensive filtering, aggregation, histogram, sorting, top, arithmetical
> expression, pagination etc. HBase is the data storage which Eagle supports
> as a first-class citizen; relational databases are supported as well. For
> HBase storage, the Eagle query framework compiles user-provided SQL-like
> queries into HBase native filter objects and executes them through HBase
> coprocessors on the fly.
> Policy Manager: The Eagle policy manager provides a UI and a RESTful API
> for users to define policies with just a few clicks. It includes a site
> management UI, policy editor, sensitivity metadata import, HDFS or Hive
> sensitive resource browsing, alert dashboards etc.
> Background
> Data is one of the most important assets for today’s businesses, which
> makes data security one of the top priorities of today’s enterprises.
> Hadoop is widely used across different verticals as a big data repository
> to store this data in most modern enterprises.
> At eBay we use the Hadoop platform extensively for our data processing
> needs. Our data in Hadoop is becoming bigger and bigger as our user base
> sees exponential growth. Today there is a variety of data sets available
> in our Hadoop clusters for our users to consume. eBay has around 120 PB of
> data stored in HDFS across 6 different clusters and 1800+ active Hadoop
> users consuming data through Hive, HBase and MapReduce jobs every day to
> build applications using this data. With this astronomical growth of data
> there are also challenges in securing sensitive data and monitoring access
> to it. Today in large organizations HDFS is the de facto standard for
> storing big data. Data sets including, but not limited to, consumer
> sentiment, social media data, customer segmentation, web clicks, sensor
> data, geo-location and transaction data get stored in Hadoop for
> day-to-day business needs.
> We at eBay want to make sure the sensitive data and data platforms are
> completely protected from security breaches. So we partnered very closely
> with our Information Security team to understand the requirements for Eagle
> to monitor sensitive data access on Hadoop:
> 1. Ability to identify and stop security threats in real time
> 2. Scale for big data (support PB scale and billions of events)
> 3. Ability to create data access policies
> 4. Support multiple data sources like HDFS, HBase, Hive
> 5. Visualize alerts in real time
> 6. Ability to block malicious access in real time
> We did not find any data access monitoring solution available today that
> can provide the features and functionality we need to monitor data access
> in the Hadoop ecosystem at our scale. Hence, with an excellent team of
> world-class developers and several users, we have been able to bring Eagle
> into production as well as open source it.
> Rationale
> In today’s world, data is an important asset for any company. Businesses
> are using data extensively to create amazing experiences for users. Data
> has to be protected and access to data should be secured from security
> breaches. Today Hadoop is not only used to store logs but also stores
> financial data, sensitive data sets, geographical data, user click stream
> data sets etc., which makes it more important to protect it from security
> breaches. To secure a data platform there are multiple things that need to
> happen. One is having a strong access control mechanism, which today is
> provided by Apache Ranger and Apache Sentry. These tools provide
> fine-grained access control for data sets on Hadoop. But there is a big gap
> in terms of monitoring all the data access events and activities in order
> to secure the Hadoop data platform. With strong access control, perimeter
> security and data access monitoring in place together, data in Hadoop
> clusters can be secured against breaches. We looked around and found the
> following:
> Existing data activity monitoring products are designed for traditional
> databases and data warehouses. Existing monitoring platforms cannot scale
> out to support fast-growing data and petabyte scale. The few products in
> the industry that support HDFS, Hive and HBase data access monitoring are
> still at a very early stage.
> As mentioned in the background, the business requirement and urgency to
> secure the data from users with malicious intent drove eBay to invest in
> building a real time data access monitoring solution from scratch to offer
> real time alerts and remediation features for malicious data access.
> With the power of open source distributed systems like Hadoop, Kafka and
> many more, we were able to develop a data activity monitoring system that
> can scale, and can identify and stop malicious access in real time.
> Eagle allows admins to create standard access policies and rules for
> monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box
> machine learning models for modeling user profiles based on user access
> behaviour and uses the models to alert on anomalies.
> Current Status
> Meritocracy
> Eagle has been deployed in production at eBay for monitoring billions of
> events per day from HDFS and Hive operations. From the start, the product
> has been built with a focus on high scalability and application
> extensibility, and Eagle has demonstrated great performance in responding
> to suspicious events instantly and great flexibility in defining policies.
> Community
> Eagle seeks to develop the developer and user communities during
> incubation.
> Core Developers
> Eagle is currently being designed and developed by engineers from eBay
> Inc. – Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang,
> Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of
> these core developers have deep expertise in developing monitoring products
> for the Hadoop ecosystem.
> Alignment
> The ASF is a natural host for Eagle given that it is already the home of
> Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data
> projects. Eagle leverages many Apache open source products. Eagle was
> designed to offer real-time insights into sensitive data access by actively
> monitoring data access on various data sets in Hadoop, together with an
> extensible alerting framework and a powerful policy engine. Eagle
> complements the existing Hadoop platform area by providing a comprehensive
> monitoring and alerting solution for detecting sensitive data access
> threats based on preset policies and machine learning models for user
> behaviour analysis.
> Known Risks
> Orphaned Products
> The core developers of the Eagle team work full time on this project. There
> is no risk of Eagle getting orphaned since eBay is using it extensively in
> its production Hadoop clusters and has plans to go beyond Hadoop. For
> example, currently there are 7 Hadoop clusters and 2 of them are being
> monitored using Eagle in production. We have plans to extend it to all
> Hadoop clusters and eventually other data platforms. There are tens of
> policies onboarded and actively monitored, with plans to onboard more use
> cases. We are very confident that every Hadoop cluster in the world will be
> monitored using Eagle for securing the Hadoop ecosystem by actively
> monitoring data access on sensitive data. We plan to extend and diversify
> this community further through Apache. We presented Eagle at the Hadoop
> Summit in China and garnered interest from different companies who use
> Hadoop extensively.
> Inexperience with Open Source
> The core developers are all active users and followers of open source.
> They are already committers and contributors to the Eagle Github project.
> All have been involved with source code that has been released under an
> open source license, and several of them also have experience developing
> code in an open source environment. Though the core set of developers do
> not have Apache open source experience, there are plans to onboard
> individuals with Apache open source experience on to the project. Apache
> Kylin PMC members are also in the same eBay organization. We work very
> closely with Apache Ranger committers and look forward to finding
> meaningful integrations to improve the security of the Hadoop platform.
> Homogenous Developers
> The core developers are from eBay. Today the problem of monitoring data
> activities to find and stop threats is a universal problem faced by all
> businesses. The Apache Incubation process encourages an open and diverse
> meritocratic community. Eagle intends to make every possible effort to
> build a diverse, vibrant and involved community and has already received
> substantial interest from various organizations.
> Reliance on Salaried Developers
> eBay invested in Eagle as the monitoring solution for Hadoop clusters and
> some of its key engineers are working full time on the project. In
> addition, since there is a growing need to secure sensitive data access
> and for a data activity monitoring solution for Hadoop, we look forward to
> other Apache developers and researchers contributing to the project.
> Additional contributors, including Apache committers, have plans to join
> this effort shortly. Also key to addressing the risk associated with
> relying on salaried developers from a single entity is to increase the
> diversity of the contributors and actively lobby for domain experts in the
> security space to contribute. Eagle intends to do this.
> Relationships with Other Apache Products
> Eagle has a strong relationship with and dependency on Apache Hadoop,
> HBase, Spark, Kafka and Storm. Being part of Apache’s Incubation community
> could help with closer collaboration among these projects as well as
> others.
> An Excessive Fascination with the Apache Brand
> Eagle is proposing to enter incubation at Apache in order to help efforts
> to diversify the committer base, not so much to capitalize on the Apache
> brand. The Eagle project is in production use already inside eBay, but is
> not expected to be an eBay product for external customers. As such, the
> Eagle project is not seeking to use the Apache brand as a marketing tool.
> Documentation
> Information about Eagle can be found at
> The following link provides more information about Eagle:
> Initial Source
> Eagle has been under development since 2014 by a team of engineers at eBay
> Inc. It is currently hosted under the Apache License 2.0 at
> Once in incubation we will be moving the
> code base to the Apache Git repository.
> External Dependencies
> Eagle has the following external dependencies.
> Basic
> •JDK 1.7+
> •Scala 2.10.4
> •Apache Maven
> •JUnit
> •Log4j
> •Slf4j
> •Apache Commons
> •Apache Commons Math3
> •Jackson
> •Siddhi CEP engine
> Hadoop
> •Apache Hadoop
> •Apache HBase
> •Apache Hive
> •Apache Zookeeper
> •Apache Curator
> Apache Spark
> •Spark Core Library
> REST Service
> •Jersey
> Query
> •Antlr
> Stream processing
> •Apache Storm
> •Apache Kafka
> Web
> •AngularJS
> •jQuery
> •Bootstrap V3
> •Moment JS
> •Admin LTE
> •html5shiv
> •respond
> •Fastclick
> •Date Range Picker
> •Flot JS
> Cryptography
> Eagle will eventually support encryption on the wire. This is not one of
> the initial goals, and we do not expect Eagle to be a controlled export
> item due to the use of encryption. Eagle supports but does not require the
> Kerberos authentication mechanism to access secured Hadoop services.
> Required Resources
> Mailing List
> •eagle-private for private PMC discussions
> •eagle-dev for developers
> •eagle-commits for all commits
> •eagle-users for all eagle users
> Subversion Directory
> •Git is the preferred source control system.
> Issue Tracking
> •JIRA Eagle (Eagle)
> Other Resources
> The existing code already has unit tests so we will make use of existing
> Apache continuous testing infrastructure. The resulting load should not be
> very large.
> Initial Committers
> •Seshu Adunuthula <sadunuthula at ebay dot com>
> •Arun Manoharan <armanoharan at ebay dot com>
> •Edward Zhang <yonzhang at ebay dot com>
> •Hao Chen <hchen9 at ebay dot com>
> •Chaitali Gupta <cgupta at ebay dot com>
> •Libin Sun <libsun at ebay dot com>
> •Jilin Jiang <jiljiang at ebay dot com>
> •Qingwen Zhao <qingwzhao at ebay dot com>
> •Hemanth Dendukuri <hdendukuri at ebay dot com>
> •Senthil Kumar <senthilkumar at ebay dot com>
> Affiliations
> The initial committers are employees of eBay Inc.
> Sponsors
> Champion
> •Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
> Nominated Mentors
> •Owen O’Malley <omalley at apache dot org> - Apache IPMC member,
> Hortonworks
> •Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
> •Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member,
> Hortonworks
> •Amareshwari Sriramdasu <amareshwari at apache dot org> - Apache IPMC
> member
> •Taylor Goetz <ptgoetz at apache dot org> - Apache IPMC member, Hortonworks
> Sponsoring Entity
> We are requesting the Incubator to sponsor this project.
