incubator-general mailing list archives

From "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: [VOTE] Accept Joshua as an Apache Incubator Podling
Date Fri, 12 Feb 2016 19:33:19 GMT
Yep, will send a result shortly.

Lewis, after that, can you help me get the podling bootstrap tasks
started?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>
Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
Date: Friday, February 12, 2016 at 11:31 AM
To: "general@incubator.apache.org" <general@incubator.apache.org>
Subject: Re: [VOTE] Accept Joshua as an Apache Incubator Podling

>Hi Chris,
>Is it time to close out this VOTE and bring Joshua on board?
>Lewis
>
>On Wed, Feb 3, 2016 at 4:01 PM, <general-digest-help@incubator.apache.org>
>wrote:
>
>>
>> From: Danese Cooper <danese@gmail.com>
>> To: "general@incubator.apache.org" <general@incubator.apache.org>
>> Cc: "post@cs.jhu.edu" <post@cs.jhu.edu>
>> Date: Wed, 3 Feb 2016 07:43:11 -0800
>> Subject: Re: [VOTE] Accept Joshua as an Apache Incubator Podling
>> +1 (binding) Accept Joshua as an Apache Incubator podling.
>>
>> D
>>
>> > On Jan 30, 2016, at 12:00 PM, Mattmann, Chris A (3980) <chris.a.mattmann@jpl.nasa.gov> wrote:
>> >
>> > Hi Everyone,
>> >
>> > OK the discussion is now completed. Please VOTE to accept Joshua
>> > into the Apache Incubator. I’ll leave the VOTE open for at least
>> > the next 72 hours, with hopes to close it next Friday the 5th of
>> > February, 2016.
>> >
>> > [ ] +1 Accept Joshua as an Apache Incubator podling.
>> > [ ] +0 Abstain.
>> > [ ] -1 Don’t accept Joshua as an Apache Incubator podling because...
>> >
>> > Of course, I am +1 on this. Please note VOTEs from Incubator PMC
>> > members are binding but all are welcome to VOTE!
>> >
>> > Cheers,
>> > Chris
>> >
>> >
>> > -----Original Message-----
>> > From: jpluser <chris.a.mattmann@jpl.nasa.gov>
>> > Date: Tuesday, January 12, 2016 at 10:56 PM
>> > To: "general@incubator.apache.org" <general@incubator.apache.org>
>> > Cc: "post@cs.jhu.edu" <post@cs.jhu.edu>
>> > Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
>> >
>> >> Hi Everyone,
>> >>
>> >> Please find attached for your viewing pleasure a proposed new project,
>> >> Apache Joshua, a statistical machine translation toolkit. The proposal
>> >> is in wiki draft form at: https://wiki.apache.org/incubator/JoshuaProposal
>> >>
>> >> Proposal text is copied below. I’ll leave the discussion open for a week
>> >> and we are interested in folks who would like to be initial committers
>> >> and mentors. Please discuss here on the thread.
>> >>
>> >> Thanks!
>> >>
>> >> Cheers,
>> >> Chris (Champion)
>> >>
>> >> ———
>> >>
>> >> = Joshua Proposal =
>> >>
>> >> == Abstract ==
>> >> [[joshua-decoder.org|Joshua]] is an open-source statistical machine
>> >> translation toolkit. It includes a Java-based decoder for translating with
>> >> phrase-based, hierarchical, and syntax-based translation models, a
>> >> Hadoop-based grammar extractor (Thrax), and an extensive set of tools and
>> >> scripts for training and evaluating new models from parallel text.
>> >>
>> >> == Proposal ==
>> >> Joshua is a state of the art statistical machine translation system that
>> >> provides a number of features:
>> >>
>> >> * Support for the two main paradigms in statistical machine translation:
>> >> phrase-based and hierarchical / syntactic.
>> >> * A sparse feature API that makes it easy to add new feature templates
>> >> supporting millions of features
>> >> * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
>> >> * Support for lattice decoding, allowing upstream NLP tools to expose
>> >> their hypothesis space to the MT system
>> >> * An efficient representation for models, allowing for quick loading of
>> >> multi-gigabyte model files
>> >> * Fast decoding speed (on par with Moses and mtplz)
>> >> * Language packs: precompiled models that allow the decoder to be run as
>> >> a black box
>> >> * Thrax, a Hadoop-based tool for learning translation models from
>> >> parallel text
>> >> * A suite of tools for constructing new models for any language pair for
>> >> which sufficient training data exists
>> >>
>> >> == Background and Rationale ==
>> >> A number of factors make this a good time for an Apache project focused on
>> >> machine translation (MT): the quality of MT output (for many language
>> >> pairs); the average computing resources available on computers, relative
>> >> to the needs of MT systems; and the availability of a number of
>> >> high-quality toolkits, together with a large base of researchers working
>> >> on them.
>> >>
>> >> Over the past decade, machine translation (MT; the automatic translation
>> >> of one human language to another) has become a reality. The research into
>> >> statistical approaches to translation that began in the early nineties,
>> >> together with the availability of large amounts of training data, and
>> >> better computing infrastructure, have all come together to produce
>> >> translation results that are “good enough” for a large set of language
>> >> pairs and use cases. Free services like
>> >> [[https://www.bing.com/translator|Bing Translator]] and
>> >> [[https://translate.google.com|Google Translate]] have made these services
>> >> available to the average person through direct interfaces and through
>> >> tools like browser plugins, and sites across the world with higher
>> >> translation needs use them to translate their pages automatically.
>> >>
>> >> MT does not require the infrastructure of large corporations in order to
>> >> produce feasible output. Machine translation can be resource-intensive,
>> >> but need not be prohibitively so. Disk and memory usage are mostly a
>> >> matter of model size, which for most language pairs is a few gigabytes at
>> >> most, at which size models can provide coverage on the order of tens or
>> >> even hundreds of thousands of words in the input and output languages. The
>> >> computational complexity of the algorithms used to search for translations
>> >> of new sentences is typically linear in the number of words in the input
>> >> sentence, making it possible to run a translation engine on a personal
>> >> computer.
>> >>
>> >> The research community has produced many different open source translation
>> >> projects for a range of programming languages and under a variety of
>> >> licenses. These projects include the core “decoder”, which takes a model
>> >> and uses it to translate new sentences between the language pair the model
>> >> was defined for. They also typically include a large set of tools that
>> >> enable new models to be built from large sets of example translations
>> >> (“parallel data”) and monolingual texts. These toolkits are usually built
>> >> to support the agendas of the (largely) academic researchers that build
>> >> them: the repeated cycle of building new models, tuning model parameters
>> >> against development data, and evaluating them against held-out test data,
>> >> using standard metrics for testing the quality of MT output.
>> >>
>> >> Together, these three factors (the quality of machine translation output,
>> >> the feasibility of translating on standard computers, and the availability
>> >> of tools to build models) make it reasonable for the end users to use MT as
>> >> a black-box service, and to run it on their personal machine.
>> >>
>> >> These factors make it a good time for an organization with the status of
>> >> the Apache Foundation to host a machine translation project.
>> >>
>> >> == Current Status ==
>> >> Joshua was originally ported from David Chiang’s Python implementation of
>> >> Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
>> >> University. The current version is maintained by Matt Post at Johns
>> >> Hopkins’ Human Language Technology Center of Excellence. Joshua has made
>> >> many releases with a list of over 20 source code tags. The last release of
>> >> Joshua was 6.0.5 on November 5th, 2015.
>> >>
>> >> == Meritocracy ==
>> >> The current developers are familiar with meritocratic open source
>> >> development at Apache. Apache was chosen specifically because we want to
>> >> encourage this style of development for the project.
>> >>
>> >> == Community ==
>> >> Joshua is used widely across the world. Perhaps its biggest (known)
>> >> research / industrial user is the Amazon research group in Berlin. Another
>> >> user is the US Army Research Lab. No formal census has been undertaken,
>> >> but posts to the Joshua technical support mailing list, along with the
>> >> occasional contributions, suggest small research and academic communities
>> >> spread across the world, many of them in India.
>> >>
>> >> During incubation, we will explicitly seek to increase our usage across
>> >> the board, including academic research, industry, and other end users
>> >> interested in statistical machine translation.
>> >>
>> >> == Core Developers ==
>> >> The current set of core developers is fairly small, having fallen with the
>> >> graduation from Johns Hopkins of some core student participants. However,
>> >> Joshua is used fairly widely, as mentioned above, and there remains a
>> >> commitment from the principal researcher at Johns Hopkins to continue to
>> >> use and develop it. Joshua has seen a number of new community members
>> >> become interested recently due to a potential for its projected use in a
>> >> number of ongoing DARPA projects such as XDATA and Memex.
>> >>
>> >> == Alignment ==
>> >> Joshua is currently Copyright (c) 2015, Johns Hopkins University All
>> >> rights reserved and licensed under BSD 2-clause license. It would of
>> >> course be the intention to relicense this code under AL2.0 which would
>> >> permit expanded and increased use of the software within Apache projects.
>> >> There is currently an ongoing effort within the Apache Tika community to
>> >> utilize Joshua within Tika’s Translate API, see
>> >> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>> >>
>> >> == Known Risks ==
>> >>
>> >> === Orphaned products ===
>> >> At the moment, regular contributions are made by a single contributor, the
>> >> lead maintainer. He (Matt Post) plans to continue development for the next
>> >> few years, but it is still a single point of failure, since the graduate
>> >> students who worked on the project have moved on to jobs, mostly in
>> >> industry. However, our goal is to help that process by growing the
>> >> community in Apache, and at least in growing the community with users and
>> >> participants from NASA JPL.
>> >>
>> >> === Inexperience with Open Source ===
>> >> The team both at Johns Hopkins and NASA JPL have experience with many OSS
>> >> software projects at Apache and elsewhere. We understand "how it works"
>> >> here at the foundation.
>> >>
>> >>
>> >> == Relationships with Other Apache Products ==
>> >> Joshua includes dependencies on Hadoop, and also is included as a plugin in
>> >> Apache Tika. We are also interested in coordinating with other projects
>> >> including Spark, and other projects needing MT services for language
>> >> translation.
>> >>
>> >> == Developers ==
>> >> Joshua only has one regular developer who is employed by Johns Hopkins
>> >> University. NASA JPL (Mattmann and McGibbney) have been contributing
>> >> lately including a Brew formula and other contributions to the project
>> >> through the DARPA XDATA and Memex programs.
>> >>
>> >> == Documentation ==
>> >> Documentation and publications related to Joshua can be found at
>> >> joshua-decoder.org. The source for the Joshua documentation is currently
>> >> hosted on Github at
>> >> https://github.com/joshua-decoder/joshua-decoder.github.com
>> >>
>> >> == Initial Source ==
>> >> Current source resides at Github: github.com/joshua-decoder/joshua (the
>> >> main decoder and toolkit) and github.com/joshua-decoder/thrax (the grammar
>> >> extraction tool).
>> >>
>> >> == External Dependencies ==
>> >> Joshua has a number of external dependencies. Only BerkeleyLM (Apache 2.0)
>> >> and KenLM (LGPL 2.1) are run-time decoder dependencies (one of which is
>> >> needed for translating sentences with pre-built models). The rest are
>> >> dependencies for the build system and pipeline, used for constructing and
>> >> training new models from parallel text.
>> >>
>> >> Apache projects:
>> >> * Ant
>> >> * Hadoop
>> >> * Commons
>> >> * Maven
>> >> * Ivy
>> >>
>> >> There are also a number of other open-source projects with various
>> >> licenses that the project depends on both dynamically (runtime), and
>> >> statically.
>> >>
>> >> === GNU GPL 2 ===
>> >> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>> >>
>> >> === LGPL 2.1 ===
>> >> * KenLM: github.com/kpu/kenlm
>> >>
>> >> === Apache 2.0 ===
>> >> * BerkeleyLM: https://code.google.com/p/berkeleylm/
>> >>
>> >> === GNU GPL ===
>> >> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>> >>
>> >> == Required Resources ==
>> >> * Mailing Lists
>> >>  * private@joshua.incubator.apache.org
>> >>  * dev@joshua.incubator.apache.org
>> >>  * commits@joshua.incubator.apache.org
>> >>
>> >> * Git Repos
>> >>  * https://git-wip-us.apache.org/repos/asf/joshua.git
>> >>
>> >> * Issue Tracking
>> >>  * JIRA Joshua (JOSHUA)
>> >>
>> >> * Continuous Integration
>> >>  * Jenkins builds on https://builds.apache.org/
>> >>
>> >> * Web
>> >>  * http://joshua.incubator.apache.org/
>> >>  * wiki at http://cwiki.apache.org
>> >>
>> >> == Initial Committers ==
>> >> The following is a list of the planned initial Apache committers (the
>> >> active subset of the committers for the current repository on Github).
>> >>
>> >> * Matt Post (post@cs.jhu.edu)
>> >> * Lewis John McGibbney (lewismc@apache.org)
>> >> * Chris Mattmann (mattmann@apache.org)
>> >>
>> >> == Affiliations ==
>> >>
>> >> * Johns Hopkins University
>> >>  * Matt Post
>> >>
>> >> * NASA JPL
>> >>  * Chris Mattmann
>> >>  * Lewis John McGibbney
>> >>
>> >>
>> >> == Sponsors ==
>> >> === Champion ===
>> >> * Chris Mattmann (NASA/JPL)
>> >>
>> >> === Nominated Mentors ===
>> >> * Paul Ramirez
>> >> * Lewis John McGibbney
>> >> * Chris Mattmann
>> >>
>> >> == Sponsoring Entity ==
>> >> The Apache Incubator
>> >>
>> >>
>> >>
>> >>
