incubator-general mailing list archives

From Jean-Baptiste Onofré
Subject Re: [VOTE] Accept Mnemonic into the Apache Incubator
Date Fri, 04 Mar 2016 09:14:02 GMT
+1 (binding)


On 03/03/2016 08:19 PM, P. Taylor Goetz wrote:
> +1 (binding)
> -Taylor
>> On Feb 29, 2016, at 12:37 PM, Patrick Hunt wrote:
>> Hi folks,
>> OK the discussion is now completed. Please VOTE to accept Mnemonic
>> into the Apache Incubator. I’ll leave the VOTE open for at least
>> the next 72 hours, with hopes to close it Thursday the 3rd of
>> March, 2016 at 10am PT.
>> [ ] +1 Accept Mnemonic as an Apache Incubator podling.
>> [ ] +0 Abstain.
>> [ ] -1 Don’t accept Mnemonic as an Apache Incubator podling because..
>> Of course, I am +1 on this. Please note VOTEs from Incubator PMC
>> members are binding but all are welcome to VOTE!
>> Regards,
>> Patrick
>> --------------------
>> = Mnemonic Proposal =
>> === Abstract ===
>> Mnemonic is a Java based non-volatile memory library for in-place
>> structured data processing and computing. It is a solution for generic
>> object and block persistence on heterogeneous block and
>> byte-addressable devices, such as DRAM, persistent memory, NVMe, SSD,
>> and cloud network storage.
>> === Proposal ===
>> Mnemonic is a structured data persistence in-memory in-place library
>> for Java-based applications and frameworks. It provides unified
>> interfaces for data manipulation on heterogeneous
>> block/byte-addressable devices, such as DRAM, persistent memory, NVMe,
>> SSD, and cloud network devices.
>> The design motivation for this project is to create a non-volatile
>> programming paradigm for in-memory data object persistence, in-memory
>> data objects caching, and JNI-less IPC.
>> Mnemonic simplifies the usage of data object caching, persistence, and
>> JNI-less IPC for massive object oriented structural datasets.
>> Mnemonic defines Non-Volatile Java objects that store data fields in
>> persistent memory and storage. At run time, only methods and
>> volatile fields are instantiated on the Java heap; Non-Volatile data
>> fields are accessed directly via GET/SET operations to and from
>> persistent memory and storage. Mnemonic avoids SerDes and
>> significantly reduces the amount of garbage on the Java heap.
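
To make the GET/SET access model above concrete, here is a minimal Java
sketch. It is an illustration only, not Mnemonic's API: the class
OffHeapPoint, its field layout, and the use of a direct ByteBuffer as a
stand-in for a persistent memory region are all assumptions made for this
example. Each accessor reads or writes the field in place, so no
serialized copy of the object is built on the Java heap.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; this is NOT Mnemonic's API.
// A tiny "object" whose data fields live outside the Java heap.
final class OffHeapPoint {
    private static final int X_OFFSET = 0;          // long, 8 bytes
    private static final int Y_OFFSET = Long.BYTES; // long, 8 bytes
    static final int SIZE = 2 * Long.BYTES;         // bytes per point

    // Assumption: a direct buffer stands in for a persistent memory region.
    private final ByteBuffer region;
    private final int base; // byte offset of this point inside the region

    OffHeapPoint(ByteBuffer region, int base) {
        this.region = region;
        this.base = base;
    }

    // GET/SET go straight to the backing region; nothing is serialized.
    long getX() { return region.getLong(base + X_OFFSET); }
    void setX(long value) { region.putLong(base + X_OFFSET, value); }
    long getY() { return region.getLong(base + Y_OFFSET); }
    void setY(long value) { region.putLong(base + Y_OFFSET, value); }
}

final class OffHeapPointDemo {
    public static void main(String[] args) {
        // Room for four points outside the Java heap.
        ByteBuffer region = ByteBuffer.allocateDirect(4 * OffHeapPoint.SIZE);

        // Write through one handle at the second slot.
        OffHeapPoint p = new OffHeapPoint(region, OffHeapPoint.SIZE);
        p.setX(3);
        p.setY(4);

        // A second handle over the same slot sees the same data in place.
        OffHeapPoint view = new OffHeapPoint(region, OffHeapPoint.SIZE);
        System.out.println(view.getX() + "," + view.getY()); // prints 3,4
    }
}
```

Two handles over the same region and offset observe the same values,
which is the point of in-place access. A real non-volatile region would
also have to handle durability and allocation, which this sketch ignores.
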
>> Major features of Mnemonic:
>> * Provides an abstract level of viewpoint to utilize heterogeneous
>> block/byte-addressable device as a whole (e.g., DRAM, persistent
>> memory, NVMe, SSD, HD, cloud network Storage).
>> * Provides seamless support for object-oriented design and
>> programming without the burden of transferring object data to a
>> different form.
>> * Avoids the object data serialization/de-serialization for data
>> retrieval, caching and storage.
>> * Reduces the consumption of on-heap memory, which in turn reduces and
>> stabilizes Java Garbage Collection (GC) pauses for latency-sensitive
>> applications.
>> * Overcomes current limitations of Java GC to manage much larger
>> memory resources for massive dataset processing and computing.
>> * Supports migrating the data usage model from traditional NVMe/SSD/HD
>> to non-volatile memory with ease.
>> * Uses a lazy loading mechanism to avoid unnecessary memory consumption
>> when some data is not needed for computing immediately.
>> * Bypasses JNI calls for interaction between a Java runtime
>> application and its native code.
>> * Provides an allocation-aware auto-reclaim mechanism to prevent
>> external memory resource leaks.
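
The lazy loading item in the list above can be sketched as follows. This
is a generic illustration, not Mnemonic code: LazyField and its loader
callback are hypothetical names, and the loader simply stands in for
whatever would fetch the data from external memory or storage.

```java
import java.util.function.Supplier;

// Hypothetical illustration only; this is NOT Mnemonic's API.
// The value is fetched on first use and then reused.
final class LazyField<T> {
    private final Supplier<T> loader; // stands in for a read from storage
    private T value;
    private boolean loaded;

    LazyField(Supplier<T> loader) {
        this.loader = loader;
    }

    // Runs the loader at most once, on the first call.
    synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    synchronized boolean isLoaded() {
        return loaded;
    }
}

final class LazyFieldDemo {
    public static void main(String[] args) {
        int[] loads = {0}; // counts how often the loader actually runs

        LazyField<long[]> column = new LazyField<>(() -> {
            loads[0]++;
            return new long[] {1, 2, 3};
        });

        // Nothing has been loaded yet, so no memory is spent on the data.
        System.out.println(column.isLoaded()); // prints false

        long sum = 0;
        for (long v : column.get()) { // first access triggers the load
            sum += v;
        }
        column.get(); // second access reuses the loaded value

        System.out.println(sum + " " + loads[0]); // prints 6 1
    }
}
```

Data that is never touched is never loaded, so the memory cost is only
paid for fields the computation actually reads.
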
>> === Background ===
>> Big Data and Cloud applications increasingly require both high
>> throughput and low latency processing. Java-based applications
>> targeting the Big Data and Cloud space should be tuned for better
>> throughput, lower latency, and more predictable response time.
>> Typically, there are some issues that impact BigData applications'
>> performance and scalability:
>> 1) The Complexity of Data Transformation/Organization: In most cases,
>> during data processing, applications use their own complicated data
>> caching mechanisms for SerDes of data objects, spilling to different
>> storage, and evicting large amounts of data. Some data objects contain
>> complex values and structures that make data organization much more
>> difficult. Loading and then parsing/decoding datasets from storage
>> consumes significant system resources and computation power.
>> 2) Lack of Caching, Burst Temporary Object Creation/Destruction Causes
>> Frequent Long GC Pauses: Big Data computing/syntax generates a large
>> number of temporary objects during processing, e.g. lambdas, SerDes,
>> copying, etc. This triggers frequent long Java GC pauses to scan
>> references, update reference lists, and copy live objects from one
>> memory location to another blindly.
>> 3) The Unpredictable GC Pause: Latency-sensitive applications, such
>> as databases, search engines, web queries, and real-time/streaming
>> computing, need to keep latency/request-response times under control,
>> but current Java GC does not provide predictable GC activity when
>> managing large on-heap memory.
>> 4) High JNI Invocation Cost: JNI calls are expensive, yet
>> high-performance applications usually try to leverage native code to
>> improve performance. JNI calls need to convert Java objects into
>> something that C/C++ can understand. In addition, some comprehensive
>> native code needs to communicate with the Java-based application,
>> which causes frequent JNI calls along with stack marshalling.
>> The Mnemonic project provides a solution to address the above issues
>> and performance bottlenecks for structured data processing and
>> computing. It also simplifies massive data handling with much reduced
>> GC activity.
>> === Rationale ===
>> There is a strong need for a cohesive, easy-to-use non-volatile
>> programming model for unified heterogeneous memory resource management
>> and allocation. The Mnemonic project provides a reusable and flexible
>> framework to accommodate other special types of memory/block devices
>> for better performance without changing client code.
>> Most of the BigData frameworks (e.g., Apache Spark™, Apache™ Hadoop®,
>> Apache HBase™, Apache Flink™, Apache Kafka™, etc.) have their own
>> complicated memory management modules for caching and checkpointing.
>> Many of these approaches increase complexity and make the code
>> error-prone to maintain.
>> We have observed heavy overheads during data parsing, SerDes,
>> packing/unpacking, and encoding/decoding for data loading, storage,
>> checkpointing, caching, marshalling and transfer. Mnemonic provides a
>> generic in-memory persistence object model to address those overheads
>> for better performance. In addition, it manages its in-memory
>> persistence objects and blocks the way GC does, which means their
>> underlying memory resources can be reclaimed without being explicitly
>> released.
>> Some existing Big Data applications suffer from poor Java GC behaviors
>> when they process their massive unstructured datasets.  Those
>> behaviors either cause very long stop-the-world GC pauses or take
>> significant system resources during computing, which impacts
>> throughput and incurs significant perceivable pauses for interactive
>> analytics.
>> More and more compute-intensive Big Data applications rely on JNI to
>> offload their computing tasks to native code, which dramatically
>> increases the cost of JNI invocation and IPC.
>> Mnemonic provides a mechanism to communicate with native code directly
>> through in-place object data update to avoid complex object data type
>> conversion and stack marshaling. In addition, this project can be
>> extended to support various lockers for threads between Java code and
>> native code.
>> === Initial Goals ===
>> Our initial goal is to bring Mnemonic into the ASF and transition the
>> engineering and governance processes to the "Apache Way."  We would
>> like to enrich a collaborative development model that closely aligns
>> with current and future industry memory and storage technologies.
>> Another important goal is to encourage efforts to integrate
>> non-volatile programming model into data centric processing/analytics
>> frameworks/applications, (e.g., Apache Spark™, Apache HBase™, Apache
>> Flink™, Apache™ Hadoop®, Apache Cassandra™,  etc.).
>> We expect the Mnemonic project to continuously develop new
>> functionality in an open, community-driven way. We envision
>> accelerating innovation under ASF governance in order to meet the
>> requirements of a wide variety of use cases for in-memory non-volatile
>> and volatile data caching programming.
>> === Current Status ===
>> The Mnemonic project is available in Intel’s internal repository and
>> managed by its designers and developers. It is also temporarily hosted
>> on GitHub for general viewing.
>> We have integrated this project with Apache Spark™ 1.5.0 and measured
>> a 2X performance improvement for the Spark™ MLlib k-means workload. We
>> also observed the expected benefits of removing SerDes and a 40%
>> reduction in total GC pause time in our experiments.
>> ==== Meritocracy ====
>> Mnemonic was originally created by Gang (Gary) Wang and Yanping Wang
>> in early 2015. The initial committers are the current Mnemonic R&D
>> team members from US, China, and India Big Data Technologies Group at
>> Intel. This group will form a base for much broader community to
>> collaborate on this code base.
>> We intend to radically expand the initial developer and user community
>> by running the project in accordance with the "Apache Way." Users and
>> new contributors will be treated with respect and welcomed. By
>> participating in the community and providing quality patches/support
>> that move the project forward, they will earn merit. They also will be
>> encouraged to provide non-code contributions (documentation, events,
>> community management, etc.) and will gain merit for doing so. Those
>> with a proven support and quality track record will be encouraged to
>> become committers.
>> ==== Community ====
>> If Mnemonic is accepted for incubation, the primary initial goal is to
>> transition the core community towards embracing the Apache Way of project
>> governance. We would solicit major existing contributors to become
>> committers on the project from the start.
>> ==== Core Developers ====
>> Mnemonic core developers are all skilled software developers and
>> system performance engineers at Intel Corp with years of experience
>> in their fields. They have contributed a great deal of code to Apache
>> projects. PMC members and experienced committers from Apache Spark™,
>> Apache HBase™, Apache Phoenix™, and Apache™ Hadoop® have been working
>> with us on this project's open source efforts.
>> === Alignment ===
>> The initial code base is targeted to data centric processing and
>> analyzing in general. Mnemonic has been building the connection and
>> integration for Apache projects and other projects.
>> We believe Mnemonic will evolve into a promising project for
>> real-time processing, in-memory streaming analytics and more, along
>> with current and future new server platforms with persistent memory as
>> base storage devices.
>> === Known Risks ===
>> ==== Orphaned products ====
>> Intel’s Big Data Technologies Group is actively working with the
>> community on integrating this project into Big Data frameworks and
>> applications. We are continuously adding new concepts and code to this
>> project and supporting new use cases and features for the Apache Big
>> Data ecosystem.
>> The project contributors are leading contributors of Hadoop-based
>> technologies and have a long standing in the Hadoop community. As we
>> are addressing major Big Data processing performance issues, there is
>> minimal risk of this work becoming non-strategic and unsupported.
>> Our contributors are confident that a larger community will be formed
>> within the project in a relatively short period of time.
>> ==== Inexperience with Open Source ====
>> This project has long-standing, experienced mentors and interested
>> contributors from Apache Spark™, Apache HBase™, Apache Phoenix™, and
>> Apache™ Hadoop® to help us move through the open source process. We
>> are actively working with experienced Apache community PMC members and
>> committers to improve our project and further testing.
>> ==== Homogeneous Developers ====
>> All initial committers and interested contributors are employed at
>> Intel. As this is an infrastructure memory project, a wide range of
>> Apache projects are interested in an innovative memory project that
>> fits large persistent memory and storage devices. Various Apache
>> projects such as Apache Spark™, Apache HBase™, Apache Phoenix™, Apache
>> Flink™, Apache Cassandra™ etc. can take good advantage of this project
>> to overcome serialization/de-serialization, Java GC, and caching
>> issues. We expect a wide range of interest to be generated after we
>> open source this project to Apache.
>> ==== Reliance on Salaried Developers ====
>> All developers are paid by their employers to contribute to this
>> project. We welcome all others to contribute to this project after it
>> is open sourced.
>> ==== Relationships with Other Apache Products ====
>> Relationship with Apache™ Arrow:
>> Arrow's columnar data layout allows great use of CPU caches & SIMD. It
>> places all data that is relevant to a column operation in a compact
>> format in memory.
>> Mnemonic directly puts the whole business object graphs on external
>> heterogeneous storage media, e.g. off-heap, SSD. It is not necessary
>> to normalize the structures of object graphs for caching, checkpoint
>> or storing. It doesn’t require developers to normalize their data
>> object graphs. Mnemonic applications can avoid indexing & join
>> datasets compared to traditional approaches.
>> Mnemonic can leverage Arrow to transparently re-layout qualified data
>> objects or create special containers that are able to efficiently hold
>> those data records in columnar form as one of the major performance
>> optimization constructs.
>> Mnemonic can be integrated into various Big Data and Cloud frameworks
>> and applications.
>> We are currently working on several Apache projects with Mnemonic:
>> For Apache Spark™ we are integrating Mnemonic to improve:
>> a) Local checkpoints
>> b) Memory management for caching
>> c) Persistent memory datasets input
>> d) Non-Volatile RDD operations
>> The best use case for Apache Spark™ computing is that the input data
>> is stored in form of Mnemonic native storage to avoid caching its row
>> data for iterative processing. Moreover, Spark applications can
>> leverage Mnemonic to perform data transforming in persistent or
>> non-persistent memory without SerDes.
>> For Apache™ Hadoop®, we are integrating HDFS Caching with Mnemonic
>> instead of mmap. This will take advantage of persistent memory related
>> features. We also plan to evaluate integrating NameNode EditLog and
>> FSImage persistent data into the Mnemonic persistent memory area.
>> For Apache HBase™, we are using Mnemonic for BucketCache and
>> evaluating performance improvements.
>> We expect Mnemonic will be further developed and integrated into many
>> Apache BigData projects and so on, to enhance memory management
>> solutions for much improved performance and reliability.
>> ==== An Excessive Fascination with the Apache Brand ====
>> While we expect the Apache brand to help attract more contributors,
>> our interest in starting this project is based on the factors
>> mentioned in the Rationale section.
>> We would like Mnemonic to become an Apache project to further foster a
>> healthy community of contributors and consumers in BigData technology
>> R&D areas. Since Mnemonic can directly benefit many Apache projects
>> and solves major performance problems, we expect the Apache Software
>> Foundation to increase interaction with the larger community as well.
>> === Documentation ===
>> The documentation is currently available at Intel and will be posted
>> under:
>> === Initial Source ===
>> Initial source code is temporarily hosted on GitHub for general viewing:
>> It will be moved to Apache after podling.
>> The initial source is written in Java (88%), mixed with JNI C code
>> (11%) and shell script (1%) for the underlying native allocation
>> libraries.
>> === Source and Intellectual Property Submission Plan ===
>> As soon as Mnemonic is approved to join the Incubator, the source code
>> will be transitioned via the Software Grant Agreement onto ASF
>> infrastructure and in turn made available under the Apache License,
>> version 2.0.
>> === External Dependencies ===
>> The required external dependencies are all under Apache or other
>> compatible licenses.
>> Note: The licenses of Mnemonic's runtime dependencies are all declared
>> as Apache 2.0; the GNU-licensed components are used for Mnemonic build
>> and deployment. The Mnemonic JNI libraries are built using the GNU
>> tools.
>> Maven and its plugins [Apache 2.0]
>> JDK8 or OpenJDK 8 [Oracle or Openjdk JDK License]
>> Nvml [optional] [Open Source]
>> PMalloc [optional] [Apache 2.0]
>> Build and test dependencies:
>> org.testng.testng v6.8.17 [Apache 2.0]
>> org.flowcomputing.commons.commons-resgc v0.8.7 [Apache 2.0]
>> org.flowcomputing.commons.commons-primitives v.0.6.0 [Apache 2.0]
>> com.squareup.javapoet v1.3.1-SNAPSHOT [Apache 2.0]
>> JDK8 or OpenJDK 8 [Oracle or Openjdk JDK License]
>> === Cryptography ===
>> Project Mnemonic does not use cryptography itself; however, Hadoop
>> projects use standard APIs and tools for SSH and SSL communication
>> where necessary.
>> === Required Resources ===
>> We request that the following resources be created for the project to use:
>> ==== Mailing lists ====
>> (moderated subscriptions)
>> ==== Git repository ====
>> ==== Documentation ====
>> ==== JIRA instance ====
>> === Initial Committers ===
>> * Gang (Gary) Wang (gang1 dot wang at intel dot com)
>> * Yanping Wang (yanping dot wang at intel dot com)
>> * Uma Maheswara Rao G (umamahesh at apache dot org)
>> * Kai Zheng (drankye at apache dot org)
>> * Rakesh Radhakrishnan Potty  (rakeshr at apache dot org)
>> * Sean Zhong  (seanzhong at apache dot org)
>> * Henry Saputra  (hsaputra at apache dot org)
>> * Hao Cheng (hao dot cheng at intel dot com)
>> === Additional Interested Contributors ===
>> * Debo Dutta (dedutta at cisco dot com)
>> * Liang Chen (chenliang613 at Huawei dot com)
>> === Affiliations ===
>> * Gang (Gary) Wang, Intel
>> * Yanping Wang, Intel
>> * Uma Maheswara Rao G, Intel
>> * Kai Zheng, Intel
>> * Rakesh Radhakrishnan Potty, Intel
>> * Sean Zhong, Intel
>> * Henry Saputra, Independent
>> * Hao Cheng, Intel
>> === Sponsors ===
>> ==== Champion ====
>> Patrick Hunt
>> ==== Nominated Mentors ====
>> * Patrick Hunt <phunt at apache dot org> - Apache IPMC member
>> * Andrew Purtell <apurtell at apache dot org> - Apache IPMC member
>> * James Taylor <jamestaylor at apache dot org> - Apache IPMC member
>> * Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>> ==== Sponsoring Entity ====
>> Apache Incubator PMC

Jean-Baptiste Onofré
Talend
