incubator-general mailing list archives

From Balaji Ganesan <>
Subject Re: [VOTE] Accept Eagle into Apache Incubation
Date Fri, 23 Oct 2015 20:14:01 GMT

On Fri, Oct 23, 2015 at 12:26 PM, Chris Nauroth <> wrote:

> +1 (binding)
> --Chris Nauroth
> On 10/23/15, 7:11 AM, "Manoharan, Arun" <> wrote:
> >Hello Everyone,
> >
> >Thanks for all the feedback on the Eagle Proposal.
> >
> >I would like to call for a [VOTE] on Eagle joining the ASF as an
> >incubation project.
> >
> >The vote is open for 72 hours:
> >
> >[ ] +1 accept Eagle in the Incubator
> >[ ] ±0
> >[ ] -1 (please give reason)
> >
> >Eagle is a monitoring solution for Hadoop that instantly identifies access
> >to sensitive data, recognizes attacks and malicious activities, and takes
> >action in real time. Eagle supports a wide variety of policies on HDFS data
> >and Hive. Eagle also provides machine learning models for detecting
> >anomalous user behavior in Hadoop.
> >
> >The proposal is available on the wiki here:
> >
> >
> >The text of the proposal is also available at the end of this email.
> >
> >Thanks for your time and help.
> >
> >Thanks,
> >Arun
> >
> ><COPY of the proposal in text format>
> >
> >Eagle
> >
> >Abstract
> >Eagle is an open source monitoring solution for Hadoop that instantly
> >identifies access to sensitive data, recognizes attacks and malicious
> >activities in Hadoop, and takes action.
> >
> >Proposal
> >Eagle audits access to HDFS files and to Hive and HBase tables in real
> >time, enforces policies defined on sensitive data access, and alerts on or
> >blocks a user's access to that sensitive data in real time. Eagle also
> >creates user profiles based on typical access behaviour for HDFS and Hive
> >and sends alerts when anomalous behaviour is detected. Eagle can also
> >import sensitive data information classified by external classification
> >engines to help define its policies.
> >
> >Overview of Eagle
> >Eagle has 3 main parts.
> >1. Data collection and storage - Eagle collects data from various Hadoop
> >logs in real time using Kafka and the YARN API, and uses HDFS and HBase
> >for storage.
> >2. Data processing and policy engine - Eagle allows users to create
> >policies based on various metadata properties of HDFS, Hive and HBase
> >data.
> >3. Eagle services - Eagle services include the policy manager, the query
> >service and the visualization component. Eagle provides an intuitive user
> >interface to administer Eagle and an alert dashboard to respond to
> >real-time alerts.
> >
> >Data Collection and Storage:
> >Eagle provides a programming API for integrating any data source into the
> >Eagle policy evaluation framework. For example, Eagle HDFS audit
> >monitoring collects data from Kafka, which is populated by a NameNode
> >log4j appender or a Logstash agent. Eagle Hive monitoring collects Hive
> >query logs from running jobs through the YARN API, which is designed to be
> >scalable and fault-tolerant. Eagle uses HBase to store metadata and
> >metrics data, and also supports relational databases through a
> >configuration change.
> >
> >Data Processing and Policy Engine:
> >Processing Engine: Eagle provides a stream processing API which is an
> >abstraction over Apache Storm. It can also be extended to other streaming
> >engines. This abstraction allows developers to assemble data
> >transformation, filtering, external data joins etc. without being
> >physically bound to a specific streaming platform. The Eagle streaming API
> >allows developers to easily integrate business logic with the Eagle policy
> >engine; internally, the Eagle framework compiles the business logic
> >execution DAG into program primitives of the underlying stream
> >infrastructure, e.g. Apache Storm. For example, Eagle HDFS monitoring
> >transforms audit logs from the NameNode into objects and joins them with
> >sensitivity metadata and security zone metadata, which are generated by
> >external programs or configured by the user. Eagle Hive monitoring
> >filters running jobs to get the Hive query string, parses the query
> >string into an object and then joins it with sensitivity metadata.
> >Alerting Framework: The Eagle Alert Framework includes a stream metadata
> >API, a scalable policy engine framework and an extensible policy engine
> >framework. The stream metadata API allows developers to declare an event
> >schema, including which attributes constitute an event, the type of each
> >attribute, and how to dynamically resolve attribute values at runtime
> >when a user configures a policy. The scalable policy engine framework
> >allows policies to be executed on different physical nodes in parallel.
> >It also lets you define your own policy partitioner class. The policy
> >engine framework, together with the stream partitioning capability
> >provided by all streaming platforms, makes sure policies and events can
> >be evaluated in a fully distributed way. The extensible policy engine
> >framework allows developers to plug in a new policy engine with a few
> >lines of code. The WSO2 Siddhi CEP engine is the policy engine which
> >Eagle supports as a first-class citizen.
> >Machine Learning module: Eagle provides capabilities to define user
> >activity patterns or user profiles for Hadoop users based on user
> >behaviour in the platform. These user profiles are modeled using machine
> >learning algorithms and used to detect anomalous user activities.
> >Eagle uses Eigenvalue Decomposition and Density Estimation algorithms
> >to generate user profile models. The module reads data from HDFS audit
> >logs, preprocesses and aggregates the data, and generates models using
> >Spark programming APIs. Once models are generated, Eagle uses its stream
> >processing engine for near-real-time anomaly detection to determine
> >whether any user's activities are suspicious.
> >
> >Eagle Services:
> >Query Service: Eagle provides a SQL-like service API to support
> >comprehensive computation over huge data sets on the fly, e.g.
> >comprehensive filtering, aggregation, histogram, sorting, top,
> >arithmetic expressions, pagination etc. HBase is the data storage which
> >Eagle supports as a first-class citizen; relational databases are
> >supported as well. For HBase storage, the Eagle query framework compiles
> >the user-provided SQL-like query into HBase native filter objects and
> >executes it through an HBase coprocessor on the fly.
> >Policy Manager: The Eagle policy manager provides a UI and a RESTful API
> >for users to define policies with just a few clicks. It includes a site
> >management UI, policy editor, sensitivity metadata import, HDFS or Hive
> >sensitive resource browsing, alert dashboards etc.
> >
> >Background
> >Data is one of the most important assets for today's businesses, which
> >makes data security one of the top priorities of today's enterprises.
> >Hadoop is widely used across different verticals as a big data repository
> >to store this data in most modern enterprises.
> >At eBay we use the Hadoop platform extensively for our data processing
> >needs. Our data in Hadoop is becoming bigger and bigger as our user base
> >grows exponentially. Today there is a variety of data sets available in
> >our Hadoop clusters for our users to consume. eBay has around 120 PB of
> >data stored in HDFS across 6 different clusters and more than 1800
> >active Hadoop users consuming data through Hive, HBase and MapReduce jobs
> >every day to build applications using this data. With this astronomical
> >growth of data there are also challenges in securing sensitive data and
> >monitoring access to it. Today in large organizations HDFS is the de
> >facto standard for storing big data. Data sets including, but not limited
> >to, consumer sentiment, social media data, customer segmentation, web
> >clicks, sensor data, geo-location and transaction data get stored in
> >Hadoop for day-to-day business needs.
> >We at eBay want to make sure the sensitive data and data platforms are
> >completely protected from security breaches. So we partnered very closely
> >with our Information Security team to understand the requirements for
> >Eagle to monitor sensitive data access on Hadoop:
> >1. Ability to identify and stop security threats in real time
> >2. Scale for big data (support PB scale and billions of events)
> >3. Ability to create data access policies
> >4. Support multiple data sources like HDFS, HBase, Hive
> >5. Visualize alerts in real time
> >6. Ability to block malicious access in real time
> >We did not find any data access monitoring solution available today that
> >can provide the features and functionality we need to monitor data
> >access in the Hadoop ecosystem at our scale. Hence, with an excellent
> >team of world-class developers and several users, we have been able to
> >bring Eagle into production as well as open source it.
> >
> >Rationale
> >In today's world, data is an important asset for any company. Businesses
> >are using data extensively to create amazing experiences for users. Data
> >has to be protected, and access to data should be secured against
> >security breaches. Today Hadoop is used not only to store logs but also
> >to store financial data, sensitive data sets, geographical data, user
> >click stream data sets etc., which makes it all the more important to
> >protect it from security breaches. To secure a data platform, multiple
> >things need to happen. One is having a strong access control mechanism,
> >which today is provided by Apache Ranger and Apache Sentry. These tools
> >provide fine-grained access control for data sets on Hadoop. But there
> >is a big gap in terms of monitoring all the data access events and
> >activities in order to secure the Hadoop data platform. With strong
> >access control, perimeter security and data access monitoring all in
> >place, data in Hadoop clusters can be secured against breaches. We looked
> >around and found the following:
> >Existing data activity monitoring products are designed for traditional
> >databases and data warehouses. Existing monitoring platforms cannot scale
> >out to support fast-growing data at petabyte scale. The few products in
> >the industry are still at a very early stage in supporting HDFS, Hive and
> >HBase data access monitoring.
> >As mentioned in the background, the business requirement and urgency to
> >secure the data from users with malicious intent drove eBay to invest in
> >building a real-time data access monitoring solution from scratch to
> >offer real-time alerts and remediation features for malicious data access.
> >With the power of open source distributed systems like Hadoop, Kafka and
> >many more, we were able to develop a data activity monitoring system that
> >can scale and can identify and stop malicious access in real time.
> >Eagle allows admins to create standard access policies and rules for
> >monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box
> >machine learning models that model user profiles based on user access
> >behaviour and use those models to alert on anomalies.
> >
> >Current Status
> >
> >Meritocracy
> >Eagle has been deployed in production at eBay for monitoring billions of
> >events per day from HDFS and Hive operations. From the start, the product
> >has been built with a focus on high scalability and application
> >extensibility, and Eagle has demonstrated great performance in
> >responding to suspicious events instantly and great flexibility in
> >defining policy.
> >
> >Community
> >Eagle seeks to develop the developer and user communities during
> >incubation.
> >
> >Core Developers
> >Eagle is currently being designed and developed by engineers from eBay
> >Inc.: Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang,
> >Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of
> >these core developers have deep expertise in developing monitoring
> >products for the Hadoop ecosystem.
> >
> >Alignment
> >The ASF is a natural host for Eagle given that it is already the home of
> >Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data
> >projects. Eagle leverages a lot of Apache open-source products. Eagle was
> >designed to offer real-time insights into sensitive data access by
> >actively monitoring data access on various data sets in Hadoop, and to
> >provide an extensible alerting framework with a powerful policy engine.
> >Eagle complements the existing Hadoop platform by providing a
> >comprehensive monitoring and alerting solution for detecting sensitive
> >data access threats based on preset policies and machine learning models
> >for user behaviour analysis.
> >
> >Known Risks
> >
> >Orphaned Products
> >The core developers of the Eagle team work full time on this project.
> >There is no risk of Eagle getting orphaned since eBay is extensively
> >using it in its production Hadoop clusters and has plans to go beyond
> >Hadoop. For example, currently there are 7 Hadoop clusters and 2 of them
> >are being monitored using Hadoop Eagle in production. We have plans to
> >extend it to all Hadoop clusters and eventually other data platforms.
> >There are tens of policies onboarded and actively monitored, with plans
> >to onboard more use cases. We are very confident that every Hadoop
> >cluster in the world will be monitored using Eagle to secure the Hadoop
> >ecosystem by actively monitoring access to sensitive data. We plan to
> >extend and diversify this community further through Apache. We presented
> >Eagle at the Hadoop Summit in China and garnered interest from different
> >companies who use Hadoop extensively.
> >
> >Inexperience with Open Source
> >The core developers are all active users and followers of open source.
> >They are already committers and contributors to the Eagle GitHub project.
> >All have been involved with source code that has been released under
> >an open source license, and several of them also have experience
> >developing code in an open source environment. Though the core set of
> >developers do not have Apache open source experience, there are plans to
> >onboard individuals with Apache open source experience onto the project.
> >Apache Kylin PMC members are also in the same eBay organization. We work
> >very closely with Apache Ranger committers and look forward to finding
> >meaningful integrations to improve the security of the Hadoop platform.
> >
> >Homogeneous Developers
> >The core developers are from eBay. Today the problem of monitoring data
> >activities to find and stop threats is a universal problem faced by all
> >businesses. The Apache Incubation process encourages an open and diverse
> >meritocratic community. Eagle intends to make every possible effort to
> >build a diverse, vibrant and involved community and has already received
> >substantial interest from various organizations.
> >
> >Reliance on Salaried Developers
> >eBay invested in Eagle as the monitoring solution for Hadoop clusters and
> >some of its key engineers are working full time on the project. In
> >addition, since there is a growing need to secure sensitive data access
> >and thus for a data activity monitoring solution for Hadoop, we look
> >forward to other Apache developers and researchers contributing to the
> >project. Additional contributors, including Apache committers, have plans
> >to join this effort shortly. Also key to addressing the risk associated
> >with relying on salaried developers from a single entity is to increase
> >the diversity of the contributors and actively lobby for domain experts
> >in the security space to contribute. Eagle intends to do this.
> >
> >Relationships with Other Apache Products
> >Eagle has a strong relationship with and dependency on Apache Hadoop,
> >HBase, Spark, Kafka and Storm. Being part of Apache's Incubation
> >community could help with closer collaboration among these projects as
> >well as others.
> >
> >An Excessive Fascination with the Apache Brand
> >Eagle is proposing to enter incubation at Apache in order to help efforts
> >to diversify the committer base, not so much to capitalize on the Apache
> >brand. The Eagle project is in production use already inside eBay, but is
> >not expected to be an eBay product for external customers. As such, the
> >Eagle project is not seeking to use the Apache brand as a marketing tool.
> >
> >Documentation
> >Information about Eagle can be found at
> >The following link provides more information about Eagle
> ><>.
> >
> >Initial Source
> >Eagle has been under development since 2014 by a team of engineers at
> >eBay Inc. It is currently hosted on under an Apache license
> >2.0 at Once in incubation we will be
> >moving the code base to an Apache Git repository.
> >
> >External Dependencies
> >Eagle has the following external dependencies.
> >Basic
> >• JDK 1.7+
> >• Scala 2.10.4
> >• Apache Maven
> >• JUnit
> >• Log4j
> >• Slf4j
> >• Apache Commons
> >• Apache Commons Math3
> >• Jackson
> >• Siddhi CEP engine
> >
> >Hadoop
> >• Apache Hadoop
> >• Apache HBase
> >• Apache Hive
> >• Apache Zookeeper
> >• Apache Curator
> >
> >Apache Spark
> >• Spark Core Library
> >
> >REST Service
> >• Jersey
> >
> >Query
> >• Antlr
> >
> >Stream processing
> >• Apache Storm
> >• Apache Kafka
> >
> >Web
> >• AngularJS
> >• jQuery
> >• Bootstrap V3
> >• Moment JS
> >• Admin LTE
> >• html5shiv
> >• respond
> >• Fastclick
> >• Date Range Picker
> >• Flot JS
> >
> >Cryptography
> >Eagle will eventually support encryption on the wire. This is not one of
> >the initial goals, and we do not expect Eagle to be a controlled export
> >item due to the use of encryption. Eagle supports but does not require
> >the Kerberos authentication mechanism to access secured Hadoop services.
> >
> >Required Resources
> >
> >Mailing List
> >• eagle-private for private PMC discussions
> >• eagle-dev for developers
> >• eagle-commits for all commits
> >• eagle-users for all Eagle users
> >
> >Subversion Directory
> >• Git is the preferred source control system.
> >
> >Issue Tracking
> >• JIRA Eagle (Eagle)
> >
> >Other Resources
> >The existing code already has unit tests so we will make use of existing
> >Apache continuous testing infrastructure. The resulting load should not
> >be very large.
> >
> >Initial Committers
> >• Seshu Adunuthula <sadunuthula at ebay dot com>
> >• Arun Manoharan <armanoharan at ebay dot com>
> >• Edward Zhang <yonzhang at ebay dot com>
> >• Hao Chen <hchen9 at ebay dot com>
> >• Chaitali Gupta <cgupta at ebay dot com>
> >• Libin Sun <libsun at ebay dot com>
> >• Jilin Jiang <jiljiang at ebay dot com>
> >• Qingwen Zhao <qingwzhao at ebay dot com>
> >• Hemanth Dendukuri <hdendukuri at ebay dot com>
> >• Senthil Kumar <senthilkumar at ebay dot com>
> >
> >
> >Affiliations
> >The initial committers are employees of eBay Inc.
> >
> >Sponsors
> >
> >Champion
> >• Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
> >
> >Nominated Mentors
> >• Owen O'Malley <omalley at apache dot org> - Apache IPMC member,
> >Hortonworks
> >• Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
> >• Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member,
> >Hortonworks
> >• Amareshwari Sriramdasu <amareshwari at apache dot org> - Apache IPMC
> >member
> >• Taylor Goetz <ptgoetz at apache dot org> - Apache IPMC member,
> >Hortonworks
> >
> >Sponsoring Entity
> >We are requesting the Incubator to sponsor this project.
> >
