incubator-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From lewis john mcgibbney <lewi...@apache.org>
Subject [VOTE] Accept Science Data Analytics Platform (SDAP) into Apache Incubator WAS Re: [DISCUSS] Accept Science Data Analytics Platform (SDAP) into Apache Incubator
Date Tue, 17 Oct 2017 21:04:05 GMT
Hi Folks,
Having secured a mentorship team consisting of the following IPMC Members,
I am happy to open a formal VOTE thread on accepting the Science Data
Analytics Platform (SDAP) into Apache Incubator.

   - Lewis John McGibbney (lewismc@apache.org)
   - Raphael Bircher (bircher at apace dot org)
   - Suneel Marthi (smarthi at apache dot org)

Thank you to both Raphael and Suneel for coming forward. :)
The VOTE will be open for at least 72 hours.

[ ] +1 Accept Science Data Analytics Platform (SDAP) into Apache Incubator
[ ] +/-0 ... just because
[ ] -1 Do NOT Accept Science Data Analytics Platform (SDAP) into Apache
Incubator... because

Thanks in advance to all participants.
Lewis

P.S. Here is a binding +1 from me

On Wed, Oct 11, 2017 at 11:22 AM, lewis john mcgibbney <lewismc@apache.org>
wrote:

> Hi Folks,
> I would like to open a DISCUSS thread on the topic of accepting the
> Science Data Analytics Platform (SDAP) <https://wiki.apache.org/
> incubator/SDAPProposal> Project into the Incubator.
> I am CC'ing Thomas Huang from NASA JPL who I have been working with to
> build community around a kick-ass set of software projects under the SDAP
> umbrella.
> At this stage we would very much appreciate critical feedback from general@
> community. We are also open to mentors who may have an interest in the
> project proposal.
> The proposal is pasted below.
> Thanks in advance,
> Lewis
>
> = Abstract =
> The Science Data Analytics Platform (SDAP) establishes an integrated data
> analytic center for Big Science problems. It focuses on technology
> integration, advancement and maturity.
>
> = Proposal =
> SDAP currently represents a collaboration between NASA Jet Propulsion
> Laboratory (JPL), Florida State University (FSU), the National Center for
> Atmospheric Research (NCAR), and George Mason University (GMU). SDAP brings
> together a number of big data technologies including a NASA funded
> OceanXtremes (Anomaly detection and ocean science), NEXUS (Deep data
> analytic platform), DOMS (Distributed in-situ to satellite matchup), MUDROD
> (Search relevancy and discovery) and VQSS (Virtualized Quality Screening
> Service) under a single umbrella. Within the original Incubator proposal,
> VQSS will not be included however it is anticipated that a future source
> code donation will cover VQSS.
>
> = Background and Rationale =
> SDAP is a technology software solution currently geared to better enable
> scientists involved in advancing the study of the Earth's physical
> oceanography. With increasing global temperature, warming of the ocean, and
> melting ice sheets and glaciers, the impacts can be observed from changes
> in anomalous ocean temperature and circulation patterns, to increasing
> extreme weather events and stronger/more frequent hurricanes, sea level
> rise and storm surges affecting coastlines, and may involve drastic changes
> and shifts in marine ecosystems. Ocean science communities are relying on
> data distributed through data centers such as the JPL's Physical
> Oceanographic Data Active Archive Center (PO.DAAC) to conduct their
> research. In typical investigations, oceanographers follow a traditional
> workflow for using datasets: search, evaluate, download, and apply tools
> and algorithms to look for trends. While this workflow has been working
> very well historically for the oceanographic community, it cannot scale if
> the research involves massive amount of data. NASA's Surface Water and
> Ocean Topography (SWOT) mission, scheduled to launch in April of 2021, is
> expected to generate over 20PB data for a nominal 3-year mission. This will
> challenge all existing NASA Earth Science data archival/distribution
> paradigms. It will no longer be feasible for Earth scientists to download
> and analyze such volumes of data. SDAP was therefore developed primarily as
> a Web-service platform for big ocean data science at the PO.DAAC with open
> source solutions used to enable fast analysis of oceanographic data. SDAP
> has been developed collaboratively between JPL, FSU, NCAR, and GMU and is
> rapidly maturing to become the generic platform for the next generation of
> big science data solutions. The platform is an orchestration of several
> previously funded NASA big ocean data solutions using cloud technology,
> which include data analysis (NEXUS), anomaly detection (OceanXtremes),
> matchup (DOMS), subsetting, discovery (MUDROD), and visualization (VQSS).
> SDAP will enable web-accessible, fast data analysis directly on huge
> scientific data archives to minimize data movement and provide access,
> including subset, only to the relevant data.
>
> = Science Data Analytics Platform Project Overview =
> SDAP consists of several loosely coupled, independently functioning
> sub-projects. The graphic below displays an overview of how these
> sub-projects fuse together. N.B., although the graphic uses terminology
> relating to OceanWorks, essentially the SDAP architecture is identical.
>
> {{attachment:sdap.png}}
>
> == OceanXtremes ==
> Oceanographic Data-Intensive Anomaly Detection and Analysis Portal. An
> application that allows you to view imagery and perform analysis on sea
> level rise data.
>
> '''Objective'''
> Develop an anomaly detection system which identifies items, events or
> observations which do not conform to an expected pattern.
>  * Mature and test domain-specific, multi-scale anomaly and feature
> detection algorithms.
>  * Identify unexpected correlations between key measured variables.
>
> Demonstrate value of technologies in this service:
>  * Adapted Map-Reduce data mining.
>  * Algorithm profiling service.
>  * Shared discovery and exploration search tools.
>  * Automatic notification of events of interest.
>
> == NEXUS ==
> NEXUS is an emerging technology developed at JPL
>  * A Cloud-based/Cluster-based data platform that performs scalable
> handling of observational parameters analysis designed to scale horizontally
>  * Leveraging high-performance indexed, temporal, and geospatial search
> solution
>  * Breaks data products into small chunks and stores them in a Cloud-based
> data store
>
> ''Data Volumes Exploding''
>  * SWOT mission is coming
>  * File I/O is slow
>
> ''Scalable Store & Compute is Available''
>  * NoSQL cluster databases
>  * Parallel compute, in-memory map-reduce
>  * Bring Compute to Highly-Accessible Data (using Hybrid Cloud)
>
> ''Pre-Chunk and Summarize Key Variables''
>  * Easy statistics instantly (milliseconds)
>  * Harder statistics on-demand (in seconds)
>  * Visualize original data (layers) on a map quickly
>
> == DOMS ==
> The Distributed Oceanographic Match-Up Service
> DOMS is designed to reconcile satellite and in situ datasets in support of
> NASA's Earth Science mission. The service will provide a mechanism for
> users to input a series of geospatial references for satellite observations
> and receive the in situ observations that are matched to the satellite data
> within a selectable temporal and spatial domain. DOMS includes several
> characteristic in situ and satellite observation datasets - with an initial
> focus on salinity, sea temperature, and winds. DOMS will be used by the
> marine and satellite research communities to support a range of activities
> and several use cases will be described. The service is designed to provide
> a community-accessible tool that dynamically delivers matched data and
> allows the scientist to only work with the subset of data where the matches
> exist.
>
> == MUDROD ==
> Mining and Utilizing Dataset Relevancy from Oceanographic Datasets to
> Improve Data Discovery and Access
> Data discovery accuracy is a challenging topic for both Earth science and
> other domains. It is especially true for scientific data sets that are not
> as popular as Amazon or Google data. MUDROD is focused on mining oceanic
> knowledge from the PO.DAAC user log files to improve the end user data
> discovery experience at PO.DAAC. There are three steps in the research: a)
> the oceanographic semantics were extracted from three resources of SWEET,
> GCMD ontology, and the keywords used by end users for searching PO.DAAC
> datasets, b) mining the linkage among different vocabularies based on user
> data discvoery sessions, and c) build the linkage among vocabularies based
> on a comprehensive approach by considering domain de facto standard, e.g.,
> SWEET and GCMD, and the knowledge mined from the log files. The semantics
> is used to improve data discovery for ranking results, navigating among
> vocabularies, and recommending data based on user searchers.
>
> = Current Status =
> All components of SDAP were originally designed and developed under grants
> from the NASA-funded Advanced Information Systems and Technologies (AIST)
> program. The initiative to bring them the components together under the
> SDAP umbrella was granted through an AIST-funded follow-on grant which will
> run for another ~18 or so months.
> Currently no projects have made official releases so outside of community
> building, this will be our primary Incubating goal. All SDAP source code is
> currently publicly available and licensed under the ALv2.0.
>
> = Meritocracy =
> The current developers are familiar with meritocratic open source
> development at Apache. The SDAP team consumes Apache products heavily with
> members being part of several Apache user communities. SDAP itself has
> critical dependencies upon Apache products. Lewis McGibbney (JPL employee),
> a Member of the ASF and V.P. of Apache Any23, Gora PMC Nutch, Tika, OODT,
> OCW, etc., is championing the effort to bring SDAP into and through the
> Apache Incubator and has been evangelizing the Apache Way to the current
> SDAP contributors such that the meritocratic process is well understood and
> followed. Apache was chosen specifically because we want to encourage this
> style of community development for the project and for it to sustain SDAP
> forward to become the generic platform for the next generation of big
> science data solutions
>
> = Community =
> The SDAP project is a fairly new effort and our community is not yet
> fully/firmly established. Initial committers comprising the SDAP roster
> have only recently fully come together as a unified team however there is a
> large degree of synergy between constituent members at JPL, FSU, NCAR, and
> GMU. Therefore, community building and publicity continues to be a major
> thrust. With the activity and exposure regularly attained by several
> community members, we hope to grow the SDAP presence in and across several
> (scientific) forums. The SDAP technology is generating interest within
> communities such as the Earth Science Information Partnership (ESIP),
> American Geophysical Union (AGU) and plethora or science meetings around
> the globe. This in effect, we hope, will further contribute towards the
> possibility of SDAP being used across Government Agencies such as NASA,
> NOAA, USGS, EPA, DOI, etc. as well as by researchers and students in
> academic institutions around the globe.
> During incubation, we will explicitly seek to increase our adoption, with
> SDAP already being featured on the agenda for several high profile globally
> significant scientific conferences and meetings.
>
> = Core Developers =
> The current set of core developers is relatively small, including
> full-time and students from across JPL, FSU, NCAR, and GMU. Initial
> community management and participation will be distributed across the
> entire team, most of which have been involved with the constituent projects
> for <2 years.
>
> = Alignment =
> All SDAP code is licensed under Apache v2.0.
>
> = Known Risks =
>
> == Orphaned products ==
> There are currently no orphaned products. Each component of SDAP has
> dedicated personnel leading and participating in its ongoing development.
> Additionally, there is substantial collaboration between projects
> facilitated by regular project meetings which are specific the the initial
> member entities and focused on advancing physical oceanographic science.
>
> == Inexperience with Open Source ==
> JPL (in particular Lewis McGibbney) has been part of several efforts to
> transition to and grow projects communities at Apache e.g. Apache OODT,
> Apache Open Climate Workbench, Apache Joshua (Incubating), Apache SensSoft
> (Incubating), Apache DRAT (Incubating). Most of the code developed under
> the SDAP umbrella was and is open source prior to the Incubator effort so
> we are well familiarized with the nuances of open source software.
>
> = Relationships with Other Apache Products =
> SDAP has strong dependency upon a number of high profile and smaller
> profile Apache products. Examples can be seen in the breakdown of External
> Dependencies. As we continue to grow SDAP within the Incubator, we will
> make efforts to share community stories, software advancements and possible
> improvements in our use of our Apache dependencies back to those project
> communities.
>
> = Developers =
> The SDAP project and hence developers is currently funded through a NASA
> AIST follow-on grant with funding secured for the next ~18 months. There
> are currently no 100% time dedicated developers, however, the same core
> team that does work currently will continue to work on the project
> throughout the next current funding period and after. There is currently no
> business strategy aligned with SDAP however it is perceived that future,
> yet unsecured funding may by directed to further feature advancement and
> project evangelism.
>
> = Documentation =
> Documentation is currently available in a number of locations e.g. Github
> wiki, Github pages, etc. with each repository under the oceanworks-aist
> Github Org maintaining documentation available through wiki’s attached to
> the repositories. Additionally, most of the SDAP sub-projects have been
> extensively documented within plethora of formal academic publications
> across several academic communities. It would be our intention, certainly
> atleast to unify the Github wiki ad Github pages documentation most likely
> to make up the sdap.apache.org Website content.
>
> = Initial Source =
> Current source resides in several locations Github:
>  * https://github.com/dataplumber/nexus (NEXUS, OceanXtremes, DOMS)
>  * https://github.com/dataplumber/edge (EDGE)
>  * https://github.com/aist-oceanworks/mudrod (MUDROD)
>  * https://bitbucket.org/coaps_mdc/doms/src (DOMS)
>
> = External Dependencies =
> Each component of the Science Data Analytics Platform has its own
> dependencies. Documentation will be available for integrating them.
>
> == MUDROD ==
> '''Core'''
> com.google.code.gson gson 2.5 compile
> jar false
> org.jdom jdom 2.0.2 compile
> jar false
> org.elasticsearch elasticsearch 5.2.0 compile
> jar false
> org.elasticsearch elasticsearch-spark-20_2.11 5.2.0 compile
> jar false
> joda-time joda-time 2.9.4 compile
> jar false
> com.carrotsearch hppc 0.7.1 compile
> jar false
> org.apache.spark spark-core_2.11 2.1.0 compile
> jar false
> org.apache.spark spark-sql_2.11 2.1.0 compile
> jar false
> org.apache.spark spark-mllib_2.11 2.1.0 compile
> jar false
> org.scala-lang scala-library 2.11.8 compile
> jar false
> org.codehaus.jettison jettison 1.3.8 compile
> jar false
> commons-cli commons-cli 1.2 compile
> jar false
> net.sf.opencsv opencsv 2.3 compile
> jar false
> org.apache.jena jena-core 3.3.0 compile
> jar false
> junit junit 4.12 test
> jar false
>
> '''Service'''
> gov.nasa.jpl.mudrod mudrod-core 0.0.1-SNAPSHOT compile
> jar false
> javax.servlet javax.servlet-api 3.1.0 provided
> jar false
> com.google.code.gson gson 2.5 compile
> jar false
>
> '''Web'''
>  * AngularJS - MIT License
>  * BootstrapJS - MIT License
>  * jQueryJS - MIT License
>  * Underscore JS - MIT License
>
> == DOMS ==
>  * Apache Solr version 5.5.1http://lucene.apache.org/solr/
>  * EDGE https://github.com/dataplumber/edge
>  * NetCDF4 http://unidata.github.io/netcdf4-python/
>  * Python 3.5 (NOTE: only partial support for py2.7)
>
> Non stdlib Python dependencies:
>  * Jinja2==2.9.5
>  * python-dateutil==2.6.0
>  * cython==0.25.2
>  * numpy==1.12.0
>  * scipy==0.18.1
>  * netCDF4==1.2.7
>  * solrpy3
>  * siphon==0.4.0
>  * neo4j-driver==1.1.0
>  * matplotlib==2.0.0
>  * requests==2.13.0
>  * shapely==1.5.17
>  * flask==0.12
>  * networkx==1.11
>  * pyproj==1.9.5.1
>  * blist==1.3.6
>
> == NEXUS ==
> '''Analysis'''
>  * https://github.com/dataplumber/nexus/blob/master/
> analysis/package-list.txt
>  * https://github.com/dataplumber/nexus/blob/master/
> analysis/requirements.txt
>
> '''Client'''
>  * https://github.com/dataplumber/nexus/blob/master/
> client/requirements.txt
>
> '''Climatology'''
>  * matplotlib
>  * numpy
>  * netCDF4
>  * pathos (https://pypi.python.org/pypi/pathos)
>
> '''Data-access'''
>  * https://github.com/dataplumber/nexus/blob/master/
> data-access/requirements.txt
>
> '''Nexus-ingest'''
> ''Dataset-tiler''
>  * https://github.com/dataplumber/nexus/tree/master/
> nexus-ingest/dataset-tiler/build/reports
>
> ''developer-box''
>  * Just a collection of scripts/vagrant file used to stand up a developer
> instance of nexus ingestion. No dependencies to report
>
> ''Groovy-scripts''
>  * Collection of Groovy scripts that can be used as part of data
> ingestion. They only rely on the standard Groovy library and the
> ‘nexus-messages’ project
>
> ''Nexus-messages''
>  * https://github.com/dataplumber/nexus/tree/master/
> nexus-ingest/nexus-messages/build/reports
>
> ''nexus-sink''
>  * https://github.com/dataplumber/nexus/tree/master/
> nexus-ingest/nexus-sink/build/reports
>
> ''nexus-xd-python-modules''
>  * https://github.com/dataplumber/nexus/blob/master/
> nexus-ingest/nexus-xd-python-modules/package-list.txt
>  * https://github.com/dataplumber/nexus/blob/master/
> nexus-ingest/nexus-xd-python-modules/requirements.txt
>
> ''spring-xd-python''
>  * only python standard libraries are used
>
> ''tcp-shell''
>  * https://github.com/dataplumber/nexus/tree/master/
> nexus-ingest/tcp-shell/build/reports
>
> '''tools/deletebyquery'''
>  * https://github.com/dataplumber/nexus/blob/master/tools/deletebyquery/
> requirements.txt
>
> = Required Resources =
> Mailing Lists
>  * private@sdap.incubator.apache.org
>  * dev@sdap.incubator.apache.org
>  * commits@sdap.incubator.apache.org
>
> Git Repos
>  * https://git-wip-us.apache.org/repos/asf/incubator-nexus.git
>  * https://git-wip-us.apache.org/repos/asf/incubator-doms.git
>  * https://git-wip-us.apache.org/repos/asf/incubator-mudrod.git
>
> Issue Tracking
>  * JIRA Science Data Analytics Platform (SDAP)
>
> Continuous Integration
>  * Jenkins builds on https://builds.apache.org/
>
> Web
>  * http://sdap.incubator.apache.org/
>  * wiki at http://cwiki.apache.org
>
> = Initial Committers =
> The following is a list of the planned initial Apache committers (the
> active subset of the committers for the current repository on Github).
>  * Lewis John McGibbney (lewismc@apache.org)
>  * Vardis M. Tsontos (vardis.m.tsontos@jpl.nasa.gov)
>  * Joseph C. Jacob (Joseph.C.Jacob@jpl.nasa.gov)
>  * Ed Armstrong (edward.m.armstrong@jpl.nasa.gov)
>  * Frank Greguska (greguska@jpl.nasa.gov)
>  * Brian Wilson (brian.wilson@jpl.nasa.gov)
>  * Chaowe Phil Yang (cyang3@gmu.edu)
>  * Yongyao Jiang (yjiang8@gmu.edu)
>  * Yun Li (yli38@gmu.edu)
>  * Shawn R. Smith (smith@coaps.fsu.edu)
>  * Jocelyn Elya (jelya@coaps.fsu.edu)
>  * Mark Bourassa (bourassa@coaps.fsu.edu)
>  * Thomas Cram (tcram@ucar.edu)
>  * Thomas Huang (thomas.huang@jpl.nasa.gov)
>  * Steven Worley (worley@ucar.edu)
>  * Zaihua Ji (zji@ucar.edu)
>
> = Affiliations =
> NASA JPL
>  * Lewis John McGibbney (lewismc@apache.org)
>  * Vardis M. Tsontos (vardis.m.tsontos@jpl.nasa.gov)
>  * Joseph C. Jacob (Joseph.C.Jacob@jpl.nasa.gov)
>  * Ed Armstrong (edward.m.armstrong@jpl.nasa.gov)
>  * Frank Greguska (greguska@jpl.nasa.gov)
>  * Thomas Huang (thomas.huang@jpl.nasa.gov)
>  * Brian Wilson (brian.wilson@jpl.nasa.gov)
>
> George Mason University
>  * Chaowe Phil Yang (cyang3@gmu.edu)
>  * Yongyao Jiang (yjiang8@gmu.edu)
>  * Yun Li (yli38@gmu.edu)
>
> Center for Ocean-Atmospheric Prediction Studies, Florida State University
>  * Shawn R. Smith (smith@coaps.fsu.edu)
>  * Jocelyn Elya (jelya@coaps.fsu.edu)
>  * Mark Bourassa (bourassa@coaps.fsu.edu)
>
> Computational Information Systems Laboratory (CISL) / National Center for
> Atmospheric Research (NCAR)
>  * Thomas Cram (tcram@ucar.edu)
>  * Zaihua Ji (zji@ucar.edu)
>  * Steven Worley (worley@ucar.edu)
>
> = Sponsors =
>
> = Champion =
> * Lewis McGibbney (NASA/JPL)
>
> = Nominated Mentors =
>  * TBD
>  * TBD
>  * TBD
>
> = Sponsoring Entity =
> The Apache Incubator
>
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>



-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message