incubator-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kenneth Knowles <>
Subject Re: [PROPOSAL] Apache DataSketches
Date Sat, 23 Feb 2019 06:03:58 GMT

I would very much like to help mentor this project, though you already have
a couple good ones.

I concur with incubator as sponsoring entity.

Kenn (VP Apache Beam)

On Fri, Feb 22, 2019 at 9:45 PM leerho <> wrote:

> I didn't realize that this mail list does not accept PDF files, apparently
> only text.  So let me try one more time ... :)  Please let me know if
> this works!
> = Apache DataSketches Proposal[1] =
> == Abstract ==
> is an open source, high-performance library of
> stochastic streaming algorithms commonly called "sketches" in the data
> sciences. Sketches are small, stateful programs that process massive data
> as a stream and can provide approximate answers, with mathematical
> guarantees, to computationally difficult queries orders-of-magnitude faster
> than traditional, exact methods.
> This proposal is to move DataSketches to the Apache Software
> Foundation(ASF) transferring ownership of its copyright intellectual
> property to the ASF.  Thereafter, DataSketches would be officially known as
> Apache DataSketches and its evolution and governance would come under the
> rules and guidance of the ASF.
> == Introduction ==
> The DataSketches library contains carefully crafted implementations of
> sketch algorithms that meet rigorous standards of quality and performance
> and provide capabilities required for large-scale production systems that
> must process and analyze massive data. The DataSketches core repository is
> written in Java with a parallel core repository written in C++ that
> includes Python wrappers. The DataSketches library also includes special
> repositories for extending the core library for Apache Hive and Apache Pig.
> The sketches developed in the different languages share a common binary
> storage format so that sketches created and stored in Java, for example,
> can be fully used in C++, and visa versa.  Because the stored sketch
> "images" are just a "blob" of bytes (similar to picture images), they can
> be shared across many different systems, languages and platforms.
> The DataSketches documentation website, ,
> includes general tutorials, a comprehensive research section with
> references to relevant academic papers, extensive examples for using the
> core library directly as well as examples for accessing the library in
> Hive, Pig, and Apache Spark.
> The DataSketches library also includes a characterization repository for
> long running test programs that are used for studying accuracy and
> performance of these sketches over wide ranges of input variables. The data
> produced by these programs is used for generating the many performance
> plots contained in the documentation website and for academic
> publications.
> The code repositories used for production are versioned and published to
> Maven Central on periodic intervals as the library evolves.
> The DataSketches library also includes several experimental repositories
> for use-cases outside the large-scale systems environments, such as
> sketches for mobile, IoT devices (Android), command-line access of the
> sketch library, and an experimental repository for vector-based sketches
> that performs approximate Singular Value Decomposition (SVD) analysis that
> could potentially be used in Machine Learning (ML) applications.
> == Background ==
> The DataSketches library was started in 2012 as internal Yahoo project to
> dramatically reduce time and resources required for distinct (unique)
> counting.  An extensive search on the Internet at the time yielded a number
> of theoretical papers on stochastic streaming algorithms with pseudocode
> examples, but we did not find any usable open-source code of the quality we
> felt we needed for our internal production systems.  So we started a small
> project (one person) to develop our own sketches working directly from
> published theoretical papers.
> The DataSketches library was designed from the start with the objective of
> making these algorithms, usually only described in theoretical papers,
> easily accessible to systems developers for use in our internal production
> systems. By necessity, the code had to be of the highest quality and
> thoroughly tested. The wide variety of our internal production systems
> drove the requirement that the sketch implementations had to have an
> absolute minimum of external, run-time dependencies in order to simplify
> integration and troubleshooting.
> Our internal experiments demonstrated dramatic positive impact on the
> performance of our systems.  As a result, the DataSketches library quickly
> evolved to include different types of sketches for different types of
> queries, such as frequent-items (a.k.a, heavy-hitters) algorithms,
> quantile/histogram algorithms, and weighted and unweighted sampling
> algorithms.
> We quickly discovered that developing these sketch algorithms to be truly
> robust in production environments is quite difficult and requires deep
> understanding of the underlying mathematics and statistics as well as
> extensive experience in developing high quality code for 24/7 production
> systems. This is a difficult combination of skills for any one organization
> to collect and maintain over time. It became clear that this technology
> needed a community larger than Yahoo to evolve.  In November, 2015, this
> factor, along with Yahoo’s strong experience and support of open source,
> led to the decision to open source this technology under an Apache 2.0
> license on GitHub. Since that time our community has expanded considerably
> and the key contributors to this effort includes leading research
> scientists from a number of universities as well as practitioners and
> researchers from a number of major corporations. The core of this group is
> very active as we meet weekly to discuss research directions and
> engineering priorities.
> It is important to note that our internal systems at Yahoo use the current
> public GitHub open source DataSketches library and not an internal version
> of the code.
> The close collaboration of scientific research and engineering development
> experience with actual massive-data processing systems has also produced
> new research publications in the field of stochastic streaming algorithms,
> for example:
> * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee Rhodes, and
> Justin Thaler. A high-performance algorithm for identifying frequent items
> in data streams. In ACM IMC 2017.
> * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A
> framework for estimating stream expression cardinalities. In *EDBT/ICDT
> Proceedings ‘16 *, pages 6:1–6:17, 2016.
> * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient Frequent
> Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings ‘16,
> pages 845-854, 2016.
> * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile
> approximation in streams. In IEEE FOCS Proceedings ‘16, pages 71–78, 2016.
> * Kevin J Lang. Back to the future: an even more nearly optimal cardinality
> estimation algorithm. arXiv preprint,
> 2017.
> * Edo Liberty. Simple and deterministic matrix sketching. In ACM KDD
> Proceedings ‘13, pages 581– 588, 2013.
> * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan Ullman.
> Space lower bounds for itemset frequency sketches. In ACM PODS Proceedings
> ‘16, pages 441–454, 2016.
> * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler. Hierarchical
> heavy hitters with the space saving algorithm. In SIAM ALENEX Proceedings
> ‘12, pages 160–174, 2012.
> == The Rationale for Sketches ==
> In the analysis of big data there are often problem queries that don’t
> scale because they require huge compute resources and time to generate
> exact results. Examples include count distinct, quantiles, most frequent
> items, joins, matrix computations, and graph analysis.
> If we can loosen the requirement of “exact” results from our queries and be
> satisfied with approximate results, within some well understood bounds of
> error, there is an entire branch of mathematics and data science that has
> evolved around developing algorithms that can produce approximate results
> with mathematically well-defined error properties.
> With the additional requirements that these algorithms must be small
> (compared to the size of the input data), sublinear (the size of the sketch
> must grow at a slower rate than the size of the input stream), streaming
> (they can only touch each data item once), and mergeable (suitable for
> distributed processing), defines a class of algorithms that can be
> described as small, stochastic, streaming, sublinear mergeable algorithms,
> commonly called sketches (they also have other names, but we will use the
> term sketches from here on).
> To be truly streaming and be able to process data in a single pass,
> sketches must make absolute minimum assumptions about the input stream.
> This is critically important, as there is no “second chance” to process the
> data.
> For example, sketches should not make assumptions about the order of stream
> items, the stream length, the dynamic range of values, or the distribution
> of item occurrence frequencies. Sketches should be tolerant of NaNs, Nulls
> and empty objects. About the only thing that the sketch needs to know about
> the stream is how to extract items from it and what type the item is, e.g.,
> is it a numeric value or a string.
> As far as the sketch is concerned, the input stream is a sequence of items
> in some unknown random order with unknown random values.
> The sketch is essentially a complex state machine and combined with the
> random input stream defines a stochastic process. We then apply
> probabilistic methods to interpret the states of the stochastic process in
> order to extract useful information about the input stream itself. The
> resulting information will be approximate, but we also use additional
> probabilistic methods to extract an estimate of the likely probability
> distribution of error.
> There is a significant scientific contribution here that is defining the
> state machine, understanding the resulting stochastic process, developing
> the probabilistic methods, and proving mathematically, that it all works!
> This is why the scientific contributors to this project are a critical and
> strategic component to our success.  The development engineers translate
> the concepts of the proposed state machine and probabilistic methods into
> production-quality code. Even more important, they work closely with the
> scientists, feeding back system and user requirements, which leads not only
> to superior product design, but to new science as well.  A number of
> scientific papers our members have published (see above) is a direct result
> of this close collaboration.
> Because sketches are small they can be processed extremely fast, often many
> orders-of-magnitude faster than traditional exact computations. For
> interactive queries there may not be other viable alternatives, and in the
> case of real-time analysis, sketches are the only known solution.
> For any system that needs to extract useful information from massive data
> sketches are essential tools that should be tightly integrated into the
> system’s analysis capabilities. This technology has helped Yahoo
> successfully reduce data processing times from days to hours or minutes on
> a number of its internal platforms and has enabled subsecond queries on
> real-time platforms that would have been infeasible without sketches.
> The Rationale for Apache DataSketches
> Other open source implementations of sketch algorithms can be found on the
> Internet. However, we have not yet found any open source implementations
> that are as comprehensive, engineered with the quality required for
> production systems, and with usable and guaranteed error properties.  Large
> Internet companies, such as Google and Facebook, have published papers on
> sketching, however, their implementations of their published algorithms are
> proprietary and not available as open source.
> The DataSketches library already provides integrations with a number of
> major Apache data processing platforms such as Apache Hive, Apache Pig,
> Apache Spark and Apache Druid, and is also integrated with a number of
> other open source data processing platforms such as Splice Machine, GCHQ
> Gaffer and PostgreSQL.
> We believe that having DataSketches as an Apache project will provide an
> immediate, worthwhile, and substantial contribution to the open source
> community, will have a better opportunity to provide a meaningful
> contribution to both the science and engineering of sketching algorithms,
> and integrate with other Apache projects.  In addition, this is a
> significant opportunity for Apache to be the "go-to" destination for users
> that want to leverage this exciting technology.
> == Initial Goals ==
> We are breaking our initial goals into short-term (2-6 months) and
> intermediate to long-term ( 6 months to 2 years):
> Our short-term goals include:
> * Understanding and adapting to the Apache development process and
> structures.
> * Start refactoring codebase and move various DataSketches repositories
> code to Apache Git repository.
> * Continue development of new features, functions, and fixes.
> * Specific sub-projects (e.g., C++ and Python) will continue to be
> developed and expanded.
> The intermediate to long term goals include:
> * Completing the design and implementation of the C++ sketches to
> complement what is already available in Java, and the Python wrappers of
> those C++ sketches.
> * Expanding the C++ build framework to include Windows and the popular
> Linux variants.
> * Continued engagement with the scientific research community on the
> development of new algorithms for computationally difficult problems that
> heretofore have not had a sketching solution.
> == Current Status ==
> The DataSketches GitHub project has been quite successful.  As of this
> writing (Feb, 2019) the number of downloads measured by the Nexus
> Repository Manager at has grown by nearly a
> factor
> of 10 over the past year to about 55 thousand per month. The
> DataSketches/sketches-core repository has about 560 stars and 141 forks,
> which is pretty good for a highly specialized library.
> === Development Practices ===
> ==== Source Control ====
> All of our developers have extensive experience with Git version control
> and follow accepted practices for use of Pull Requests (PRs), code reviews
> and commits to master, for example.
> ==== Testing ====
> Sketches, by their nature are probabilistic programs and don’t necessarily
> behave deterministically.  For some of the sketches we intentionally insert
> random noise into the code as this gives us the mathematical properties
> that we need to guarantee accuracy.  This can make the behavior of these
> algorithms quite unintuitive and provides significant challenges to the
> developer who wishes to test these algorithms for correctness. As a result,
> our testing strategy includes two major components: unit tests, and
> characterization tests.
> ===== Unit Testing =====
> Our unit tests are primarily quick tests to make sure that we exercise all
> critical paths in the code and that key branches are executed correctly. It
> is important that they execute relatively fast as they are generally run on
> every code build. The sketches-core repository alone has about 22 thousand
> statements, over 1300 unit tests and code coverage of about 98.2% as
> measured by Atlassian/Clover.  It is our goal for all of our code
> repositories that are used in production that they have code coverage
> greater than 90%.
> ===== Characterization Testing =====
> In order to test the probabilistic methods that are used to interpret the
> stochastic behaviors of our sketches we have a separate characterization
> repository that is dedicated to this.  To measure accuracy, for example,
> requires running thousands of trials at each of many different points along
> the domain axis. Each trial compares its estimated results against a known
> exact result producing an error for that trial.  These error measurements
> are then fed into our Quantiles sketch to capture the actual distribution
> of error at that point along the axis. We then select quantile contours
> across all the distributions at points along the axis.  These contours can
> then be plotted to reveal the shape of the actual error distribution. These
> distributions are not at all Gaussian, in fact they can be quite complex.
> Nonetheless, these distributions are then checked against our statistical
> guarantees inherent to the specific sketch algorithm and its parameters.
> There are many examples of these characterization error distributions on
> our website. The runtimes of these tests can be very long and can range
> from many minutes to hours, and some can run for days.  Currently, we have
> separate characterization repositories for Java and C++ / Python.
> It is our goal that we perform this characterization analysis for all of
> our sketches.  By definition, the code that runs these characterization
> tests is open-source so others can run these tests as well.  We do not have
> formal releases of this code (because it is not production code) and it is
> not published to Maven Central.
> === Meritocracy ===
> DataSketches was initially developed based on requirements within Yahoo. As
> a project on GitHub, DataSketches has received contributions from numerous
> individual developers from around the world, dedicated research work from
> senior scientists at Amazon and Visa, and academic researchers from
> Georgetown University, Princeton, and MIT.
> As a project under incubation, we are committed to expanding our effort to
> build an environment which supports a meritocracy. We are focused on
> engaging the community and other related projects for support and
> contributions. Moreover, we are committed to ensure contributors and
> committers to DataSketches come from a broad mix of organizations through a
> merit-based decision process during incubation. We believe strongly in the
> DataSketches premise that fulfills the concept of a well engineered and
> scientifically rigorous library that implements these powerful algorithms
> and are committed to growing an inclusive community of DataSketches
> contributors and users.
> === Community ===
> Yahoo has a long history and active engagement in the Open Source
> community. Major projects include:, Bullet, Moloch, Panoptes,
>, Athenz, HaloDB, Maha, Mendel, TensorFlowOnSpark, gifshot,
> fluxible, as well as the creation, contribution and incubation of many
> Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie, Zookeeper,
> Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> Every day, DataSketches is actively used by a organizations and
> institutions around the world for batch and stream processing of data. We
> believe acceptance will allow us to consolidate existing
> DataSketches-related work, grow the DataSketches community, and deepen
> connections between DataSketches and other open source projects.
> === Introduction to the Core Developers & Contributors ===
> The core developers and contributors for DataSketches are from diverse
> backgrounds, but primarily are scientists that love engineering and
> engineers that love science. A large part of the value we bring comes from
> this synthesis.  These individuals have already contributed substantially
> to the code, algorithms, and/or mathematical proofs that form the basis of
> the library.
> This core group also form the Initial Committers with write permissions to
> the repository. Those marked with (*) Meet weekly to plan the research and
> engineering direction of the project.
> ==== Scientists That Love Engineering ====
> * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel. Interests:
> distributed systems, scalable systems and platforms for big data
> processing, concurrent algorithms and data structures,
> * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs, Sunnyvale,
> California. Interests: algorithms, theoretical and applied mathematics,
> encoding and compression theory, theoretical and applied performance
> optimization.
> * Edo Liberty: (*) Director of Research, Head of Amazon AI Labs, Palo Alto,
> California. Manages the algorithms group at Amazon AI. We build scalable
> machine learning systems and algorithms which are used both internally and
> externally by customers of SageMaker, AWS's flagship machine learning
> platform.
> * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale. Interests:
> Computational advertising, machine learning, speech recognition,
> data-driven analysis, large scale experimentation, big data, stream/complex
> event processing
> * Justin Thaler: (*) Assistant Professor, Department of Computer Science,
> Georgetown University, Washington D.C. Interests: algorithms and
> computational complexity, complexity theory, quantum algorithms, private
> data analysis, and learning theory, developing efficient streaming and
> sketching algorithms
> ==== Engineers That Love Science ====
> * Roman Leventov: Senior Software Engineer,  Metamarkets / Snap. Interests:
> design and implementation of data storing and data processing (distributed)
> systems, performance optimization, CPU performance, mechanical sympathy,
> JVM performance, API design, databases, (concurrent) data structures,
> memory management, garbage collection algorithms, language design and
> runtimes (their tradeoffs), distributed systems (cloud) efficiency, Linux,
> code quality, code transformation, pure functional programming models,
> Haskell.
> * Lee Rhodes: (*) Distinguished Architect, lead developer and founder of
> the DataSketches project, Yahoo, Sunnyvale, California.  Interests:
> streaming algorithms, mathematics, computer science, high quality and high
> performance code for the analysis of massive data, bridging the divide
> between theory and practice.
> * Alexander Saydakov: (*) Senior Software Engineer, Yahoo, Sunnyvale,
> California. Interests: applied mathematics, computer science, big data,
> distributed systems.
> === Introduction to Additional Interested Contributors ===
> These folks have been intermittently involved and contributed, but are
> strong supporters of this project.
> * Frank Grimes: GitHub ID: frankgrimes97
> * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer Science,
> Univ of Utah. Interests: Machine Learning, Data Mining, matrix
> approximation, streaming algorithms, randomized linear algebra.
> * Christopher Musco: [christopher.musco at gmail dot com] Ph.D. Computer
> Science, Research Instructor, Princeton University. Interests: algorithmic
> foundations of data science and machine learning, efficient methods for
> processing and understanding large datasets, often working at the
> intersection of theoretical computer science, numerical linear algebra, and
> optimization.
> * Graham Cormode: [g.cormode at dot uk] Ph.D. Computer Science,
> Professor, Warwick University, Warwick, England. Interests: all aspects of
> the "data lifecycle", from data collection and cleaning, through mining and
> analytics. (Professor Cormode is one of the world’s leading scientists in
> sketching algorithms)
> === Alignment ===
> The DataSketches library already provides integrations and example code for
> Apache Hive, Apache Pig, Apache Spark and is deeply integrated into Apache
> Druid.
> == Known Risks ==
> The following subsections are specific risks that have been identified by
> the ASF that need to be addressed.
> === Risk: Orphaned Products ===
> The DataSketches library is presently used by a number of organizations,
> from small startups to Fortune 100 companies, to construct production
> pipelines that must process and analyze massive data. Yahoo has a long-term
> commitment to continue to advance the DataSketches library; moreover,
> DataSketches is seeing increasing interest, development, and adoption from
> many diverse organizations from around the world. Due to its growing
> adoption, we feel it is quite unlikely that this project would become
> orphaned.
> === Risk: Inexperience with Open Source ===
> Yahoo believes strongly in open source and the exchange of information to
> advance new ideas and work. Examples of this commitment are active open
> source projects such as those mentioned above. With DataSketches, we have
> been increasingly open and forward-looking; we have published a number of
> papers about breakthrough developments in the science of streaming
> algorithms (mentioned above) that also reference the DataSketches library.
> Our submission to the Apache Software Foundation is a logical extension of
> our commitment to open source software.
> Key committers at Yahoo with strong open source backgrounds include Aaron
> Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky, Andrews
> Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan Call, Daryn
> Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar Hillel,
> Ethan Li, Fei Deng, Francis Christopher Liu, Francisco Perez-Sorrosal, Gil
> Yehuda. Govind Menon, Hang Yang, Jacob Estelle, Jai Asher, James Penick,
> Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon Eagles, Kihwal
> Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael Trelinski,
> Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L. Natkovich,
> Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby Loo, Ryan
> Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit Chan, Sri
> Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many more.
> All of our core developers are committed to learn about the Apache process
> and to give back to the community.
> === Risk: Homogeneous Developers ===
> The majority of committers in this proposal belong to Yahoo due to the fact
> that DataSketches has emerged from an internal Yahoo project. This proposal
> also includes developers and contributors from other companies, and who are
> actively involved with other Apache projects, such as Druid.  We expect our
> entry into incubation will allow us to expand the number of individuals and
> organizations participating in DataSketches development.
> === Risk: Reliance on Salaried Developers ===
> Because the DataSketches library originated within Yahoo, it has been
> developed primarily by salaried Yahoo developers and we expect that to
> continue to be the case near term. However, since we placed this library
> into open-source we have had a number of significant contributions from
> engineers and scientists from outside of Yahoo. We expect our reliance on
> Yahoo salaried developers will decrease over time. Nonetheless, Yahoo is
> committed to continue its strong support of this important project.
> === Risk: Lack of Relationship to other Apache Products ===
> DataSketches already directly interoperates with or utilizes several
> existing Apache projects.
> * Build
>    * Apache Maven
> * Integrations and adaptors for the following projects naturally have them
> as dependencies
>    * Apache Hive
>    * Apache Pig
>    * Apache Druid
>    * Apache Spark
> * Additional dependencies for the above integrations and adaptors include
>    * Apache Hadoop
>    * Apache Commons (Math)
> There is no other Apache project that we are aware of that duplicates the
> functionality of the DataSketches library.
> === Risk: An Excessive Fascination with the Apache Brand ===
> With this proposal we are not seeking attention or publicity. Rather, we
> firmly believe in the DataSketches library and concept and the ability to
> make the DataSketches library a powerful, yet simple-to-use toolkit for
> data processing. While the DataSketches library has been open source, we
> believe putting code on GitHub can only go so far. We see the Apache
> community, processes, and mission as critical for ensuring the DataSketches
> library is truly community-driven, positively impactful, and innovative
> open source software. While Yahoo has taken a number of steps to advance
> its various open source projects, we believe the DataSketches library
> project is a great fit for the Apache Software Foundation due to its focus
> on data processing and its relationships to existing ASF projects.
> === Risk: Cryptography ===
> DataSketches does not contain any cryptographic code and is not a
> cryptographic product.
> == Documentation ==
> The following documentation is relevant to this proposal. Relevant portions
> of the documentation will be contributed to the Apache DataSketches
> project.
> * DataSketches website:
> * DataSketches website repository:
> We will need an apache website for this documentation similar to
> *
> == Initial Source ==
> The initial source for DataSketches which we will submit to the Apache
> Foundation will include a number of repositories which are currently hosted
> under the organization:
> All repositories including:
> * Java
>    * sketches-core: This repository has the core sketching classes, which
> are leveraged by some of the other repositories. This repository has no
> external dependencies outside of the DataSketches/memory repository, Java
> and TestNG for unit tests. This code is versioned and the latest release
> can be obtained from Maven Central.
>    * memory: Low level, high-performance memory data-structure management
> primarily for off-heap.
>    * sketches-android: This is a new repository dedicated to sketches
> designed to be run in a mobile client, such as a cell phone. It is still in
> development and should be considered experimental.
>    * sketches-hive: This repository contains Hive UDFs and UDAFs for use
> within Hadoop grid environments. This code has dependencies on
> sketches-core as well as Hadoop and Hive. Users of this code are advised to
> use Maven to bring in all the required dependencies. This code is versioned
> and the latest release can be obtained from Maven Central.
>    * sketches-pig: This repository contains Pig User Defined Functions
> (UDF) for use within Hadoop grid environments. This code has dependencies
> on sketches-core as well as Hadoop and Pig. Users of this code are advised
> to use Maven to bring in all the required dependencies. This code is
> versioned and the latest release can be obtained from Maven Central.
>    * sketches-vector: This is a new repository dedicated to sketches for
> vector and matrix operations. It is still somewhat experimental.
>    * characterization: This relatively new repository is for code that we
> use to characterize the accuracy and speed performance of the sketches in
> the library and is constantly being updated. Examples of the job command
> files used for various tests can be found in the src/main/resources
> directory. Some of these tests can run for hours depending on its
> configuration.
>    * experimental: This repository is an experimental staging area for code
> that will eventually end up in another repository. This code is not
> versioned and not registered with Maven Central.
>    * sketches-misc: Demos and other code not related to production
> deployment
> * C++ and Python
>    * sketches-core-cpp: This is the C++/Python companion to the Java
> sketches-core. These implementations are binary compatible with their
> counterparts in Java. In other words, a sketch created and stored in C++
> can be opened and read in Java and visa-versa. This site also has our
> Python adaptors that basically wrap the C++ implementations, making the
> high performance C++ implementations available from Python.
>    * sketches-postgres: This site provides the postgres-specific adaptors
> that wrap the C++ implementations making them available to the Postgres
> database users.
>    * characterization-cpp: This is the C++/Python companion to the Java
> characterization repository.
>    * experimental-cpp: This repository is an experimental staging area for
> C++ code that will eventually end up in another repository.
> * Command-Line Tools
>    * sketches-cmd
>    * homebrew-sketches
>    * homebrew-sketches-cmd
> These projects have always been Apache 2.0 licensed. We intend to bundle
> all of these repositories since they are all complementary and should be
> maintained in one project. Prior to our submission, we will combine all of
> these projects into a new git repository.
> == Source and Intellectual Property Submission Plan ==
> Contributors to the DataSketches project have also signed the Yahoo
> Individual Contributor License Agreement (
> in order to contribute to the project.
> With respect to trademark rights, Yahoo does not hold a trademark on the
> phrase “DataSketches.” Based on feedback and guidance we receive during the
> incubation process, we are open to renaming the project if necessary for
> trademark or other concerns, but we would prefer not to have to do that.
> == External Dependencies ==
> All external dependencies are licensed under an Apache 2.0 or
> Apache-compatible license. As we grow the DataSketches community we will
> configure our build process to require and validate all contributions and
> dependencies are licensed under the Apache 2.0 license or are under an
> Apache-compatible license.
> == Required Resources ==
> === Mailing Lists ===
> We currently use a mix of mailing lists. We will migrate our existing
> mailing lists to the following:
> *
> *
> *
> *
> === Source Control ===
> The DataSketches team currently uses Git and would like to continue to do
> so. We request a Git repository for DataSketches with mirroring to GitHub
> enabled similar the following:
> *
> === Issue Tracking ===
> We request the creation of an Apache-hosted JIRA. The DataSketches project
> is currently using the public GitHub issue tracker and the public Google
> Groups forum/sketches-user for issue tracking and discussions. We will
> migrate and combine from these two sources to the Apache JIRA.
> Proposed Jira ID: DATASKETCHES
> == Initial Committers ==
> The following list of individuals have been extremely active in our
> community and should have write (commit) permissions to the repository.
> * Eshcar Hillel                      [eshcar at verizonmedia dot com]
> * Kevin Lang                    [langk at verizonmedia dot com]
> * Roman Leventov              [roman.leventov at c.metamarkets dot com]
> * Edo Liberty                   [libertye at amazon dot com]
> * Jon Malkin                    [jmalkin at verizonmedia dot com]
> * Lee Rhodes                  [lrhodes at verizonmedia dot com] & [leerho
> at gmail dot com]
> * Alexander Saydakov         [saydakov at verizonmedia dot com]
> * Justin Thaler                 [justin.thaler at georgetown dot edu]
> == Affiliations ==
> The initial committers are from four organizations: Yahoo, Amazon,
> Georgetown University, and Metamarkets/Snap.
> === Champion ===
> (Recommended to me: )
> Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache
> dot org]
> Jean-Baptiste Onofré,[[jb at nanthrax dot net]
> === Nominated Mentors ===
> (Recommended to me: )
> Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache
> dot org]
> Jean-Baptiste Onofré, jb at nanthrax dot net
> Gil Yehuda, gyehuda at verizonmedia dot com
> === Sponsoring Entity ===
> * The Apache Incubator    **** This is our 1st choice ****
> * Apache Druid. The incubating Apache Druid project might also be a logical
> sponsor. However, DataSketches has applications in many areas of computing
> outside of Druid so our preference and recommendation is that DataSketches
> would ultimately be a top-level Apache project.
> ________________
> [1] In 2017 Verizon acquired Yahoo and merged it with previously acquired
> AOL. The merged entity was originally called Oath, Inc., but has recently
> been renamed Verizon Media, Inc., a wholly-owned subsidiary of Verizon,
> Inc.  Since Yahoo is the more recognized name, references in this document
> to Yahoo, are also a reference to Verizon Media, Inc.
> On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <> wrote:
> > The subject line has me interested already. Follow examples like this
> > maybe?
> >
> > 1.
> >
> >
> > 2.
> >
> >
> >
> > Kenn
> >
> > On Fri, Feb 22, 2019 at 8:05 PM leerho <> wrote:
> >
> > > I'll try again ... :)
> > >
> > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <>
> > wrote:
> > >
> > >> It didn't make it again
> > >>
> > >> On Fri, Feb 22, 2019, 8:35 PM leerho <> wrote:
> > >>
> > >> > I'm not sure the attached document made it through.
> > >> >
> > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <> wrote:
> > >> >
> > >> > >
> > >> > >
> > >> >
> > >>
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail:
> > > For additional commands, e-mail:
> >

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message