incubator-general mailing list archives

From Seetharam Venkatesh <venkat...@innerzeal.com>
Subject Re: [VOTE] Accept Joshua as an Apache Incubator Podling
Date Tue, 02 Feb 2016 22:48:17 GMT
+1 (binding).

Thanks!

On Tue, Feb 2, 2016 at 2:06 PM Henri Yandell <bayard@apache.org> wrote:

> I'm more likely to guide contributions from my employer. There have been some
> contributions thus far, and there is interest in putting more dayjob time into
> contributing, but currently there's no coder who is personally committed to
> the project.
>
> Hen
>
> On Mon, Feb 1, 2016 at 7:20 AM, Mattmann, Chris A (3980) <
> chris.a.mattmann@jpl.nasa.gov> wrote:
>
> > Hey Jim,
> >
> > This is a valid concern, one that I hope is mitigated by taking
> > however long it takes in Incubation to attract some new committers
> > to work on the project. Hopefully you also saw how long I took to
> > allow the discussion to occur and so forth.
> >
> > Lewis has already actively contributed to Joshua, as you can see
> > from the Homebrew package he created:
> >
> > https://github.com/Homebrew/homebrew/pull/45746
> >
> >
> > You can also see that it wasn’t something recent or something
> > super quick; it’s something he had to work at.
> >
> > As for me, my involvement is going to be limited, but I am
> > actively pursuing Tika’s integration with Joshua as part of
> > TIKA-1343: http://issues.apache.org/jira/browse/TIKA-1343.
> >
> > Finally my suspicion is that Tom, Henry and Tommaso will
> > contribute a lot as well.
> >
> > Thanks for listening.
> >
> > Cheers,
> > Chris
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattmann@nasa.gov
> > WWW:  http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Jim Jagielski <jim@jaguNET.com>
> > Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
> > Date: Monday, February 1, 2016 at 4:20 AM
> > To: "general@incubator.apache.org" <general@incubator.apache.org>
> > Cc: "post@cs.jhu.edu" <post@cs.jhu.edu>
> > Subject: Re: [VOTE] Accept Joshua as an Apache Incubator Podling
> >
> > >I know this is specifically called-out in the proposal, but it
> > >does seem worthy of further discussion.
> > >
> > >This has a pretty small list of initial committers, esp when one considers
> > >how over-booked 2 of them appear to be.
> > >
> > >So, realistically, how active do both Chris and Lewis expect
> > >to be?
> > >
> > >> On Jan 30, 2016, at 3:00 PM, Mattmann, Chris A (3980)
> > >><chris.a.mattmann@jpl.nasa.gov> wrote:
> > >>
> > >> Hi Everyone,
> > >>
> > >> OK the discussion is now completed. Please VOTE to accept Joshua
> > >> into the Apache Incubator. I’ll leave the VOTE open for at least
> > >> the next 72 hours, with hopes to close it next Friday the 5th of
> > >> February, 2016.
> > >>
> > >> [ ] +1 Accept Joshua as an Apache Incubator podling.
> > >> [ ] +0 Abstain.
> > >> [ ] -1 Don’t accept Joshua as an Apache Incubator podling because..
> > >>
> > >> Of course, I am +1 on this. Please note VOTEs from Incubator PMC
> > >> members are binding but all are welcome to VOTE!
> > >>
> > >> Cheers,
> > >> Chris
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: jpluser <chris.a.mattmann@jpl.nasa.gov>
> > >> Date: Tuesday, January 12, 2016 at 10:56 PM
> > >> To: "general@incubator.apache.org" <general@incubator.apache.org>
> > >> Cc: "post@cs.jhu.edu" <post@cs.jhu.edu>
> > >> Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
> > >>
> > >>> Hi Everyone,
> > >>>
> > >>> Please find attached for your viewing pleasure a proposed new project,
> > >>> Apache Joshua, a statistical machine translation toolkit. The proposal
> > >>> is in wiki draft form at:
> > >>> https://wiki.apache.org/incubator/JoshuaProposal
> > >>>
> > >>> Proposal text is copied below. I’ll leave the discussion open for a week
> > >>> and we are interested in folks who would like to be initial committers
> > >>> and mentors. Please discuss here on the thread.
> > >>>
> > >>> Thanks!
> > >>>
> > >>> Cheers,
> > >>> Chris (Champion)
> > >>>
> > >>> ———
> > >>>
> > >>> = Joshua Proposal =
> > >>>
> > >>> == Abstract ==
> > >>> [[joshua-decoder.org|Joshua]] is an open-source statistical machine
> > >>> translation toolkit. It includes a Java-based decoder for translating with
> > >>> phrase-based, hierarchical, and syntax-based translation models, a
> > >>> Hadoop-based grammar extractor (Thrax), and an extensive set of tools and
> > >>> scripts for training and evaluating new models from parallel text.
> > >>>
> > >>> == Proposal ==
> > >>> Joshua is a state-of-the-art statistical machine translation system that
> > >>> provides a number of features:
> > >>>
> > >>> * Support for the two main paradigms in statistical machine translation:
> > >>> phrase-based and hierarchical / syntactic.
> > >>> * A sparse feature API that makes it easy to add new feature templates
> > >>> supporting millions of features
> > >>> * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
> > >>> * Support for lattice decoding, allowing upstream NLP tools to expose
> > >>> their hypothesis space to the MT system
> > >>> * An efficient representation for models, allowing for quick loading of
> > >>> multi-gigabyte model files
> > >>> * Fast decoding speed (on par with Moses and mtplz)
> > >>> * Language packs: precompiled models that allow the decoder to be run as
> > >>> a black box
> > >>> * Thrax, a Hadoop-based tool for learning translation models from
> > >>> parallel text
> > >>> * A suite of tools for constructing new models for any language pair for
> > >>> which sufficient training data exists
> > >>>
> > >>> == Background and Rationale ==
> > >>> A number of factors make this a good time for an Apache project focused on
> > >>> machine translation (MT): the quality of MT output (for many language
> > >>> pairs); the average computing resources available on computers, relative
> > >>> to the needs of MT systems; and the availability of a number of
> > >>> high-quality toolkits, together with a large base of researchers working
> > >>> on them.
> > >>>
> > >>> Over the past decade, machine translation (MT; the automatic translation
> > >>> of one human language to another) has become a reality. The research into
> > >>> statistical approaches to translation that began in the early nineties,
> > >>> together with the availability of large amounts of training data and
> > >>> better computing infrastructure, have all come together to produce
> > >>> translation results that are “good enough” for a large set of language
> > >>> pairs and use cases. Free services like
> > >>> [[https://www.bing.com/translator|Bing Translator]] and
> > >>> [[https://translate.google.com|Google Translate]] have made these services
> > >>> available to the average person through direct interfaces and through
> > >>> tools like browser plugins, and sites across the world with higher
> > >>> translation needs use them to translate their pages automatically.
> > >>>
> > >>> MT does not require the infrastructure of large corporations in order to
> > >>> produce feasible output. Machine translation can be resource-intensive,
> > >>> but need not be prohibitively so. Disk and memory usage are mostly a
> > >>> matter of model size, which for most language pairs is a few gigabytes at
> > >>> most, at which size models can provide coverage on the order of tens or
> > >>> even hundreds of thousands of words in the input and output languages. The
> > >>> computational complexity of the algorithms used to search for translations
> > >>> of new sentences is typically linear in the number of words in the input
> > >>> sentence, making it possible to run a translation engine on a personal
> > >>> computer.
> > >>>
> > >>> The research community has produced many different open source translation
> > >>> projects for a range of programming languages and under a variety of
> > >>> licenses. These projects include the core “decoder”, which takes a model
> > >>> and uses it to translate new sentences between the language pair the model
> > >>> was defined for. They also typically include a large set of tools that
> > >>> enable new models to be built from large sets of example translations
> > >>> (“parallel data”) and monolingual texts. These toolkits are usually built
> > >>> to support the agendas of the (largely) academic researchers that build
> > >>> them: the repeated cycle of building new models, tuning model parameters
> > >>> against development data, and evaluating them against held-out test data,
> > >>> using standard metrics for testing the quality of MT output.
> > >>>
> > >>> Together, these three factors (the quality of machine translation output,
> > >>> the feasibility of translating on standard computers, and the availability
> > >>> of tools to build models) make it reasonable for end users to use MT as
> > >>> a black-box service, and to run it on their personal machines.
> > >>>
> > >>> These factors make it a good time for an organization with the status of
> > >>> the Apache Foundation to host a machine translation project.
> > >>>
> > >>> == Current Status ==
> > >>> Joshua was originally ported from David Chiang’s Python implementation of
> > >>> Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
> > >>> University. The current version is maintained by Matt Post at Johns
> > >>> Hopkins’ Human Language Technology Center of Excellence. Joshua has made
> > >>> many releases with a list of over 20 source code tags. The last release of
> > >>> Joshua was 6.0.5 on November 5th, 2015.
> > >>>
> > >>> == Meritocracy ==
> > >>> The current developers are familiar with meritocratic open source
> > >>> development at Apache. Apache was chosen specifically because we want to
> > >>> encourage this style of development for the project.
> > >>>
> > >>> == Community ==
> > >>> Joshua is used widely across the world. Perhaps its biggest (known)
> > >>> research / industrial user is the Amazon research group in Berlin. Another
> > >>> user is the US Army Research Lab. No formal census has been undertaken,
> > >>> but posts to the Joshua technical support mailing list, along with the
> > >>> occasional contributions, suggest small research and academic communities
> > >>> spread across the world, many of them in India.
> > >>>
> > >>> During incubation, we will explicitly seek to increase our usage across
> > >>> the board, including academic research, industry, and other end users
> > >>> interested in statistical machine translation.
> > >>>
> > >>> == Core Developers ==
> > >>> The current set of core developers is fairly small, having fallen with the
> > >>> graduation from Johns Hopkins of some core student participants. However,
> > >>> Joshua is used fairly widely, as mentioned above, and there remains a
> > >>> commitment from the principal researcher at Johns Hopkins to continue to
> > >>> use and develop it. Joshua has seen a number of new community members
> > >>> become interested recently due to its potential use in a
> > >>> number of ongoing DARPA projects such as XDATA and Memex.
> > >>>
> > >>> == Alignment ==
> > >>> Joshua is currently Copyright (c) 2015, Johns Hopkins University, all
> > >>> rights reserved, and licensed under the BSD 2-clause license. The intention
> > >>> would of course be to relicense this code under AL2.0, which would
> > >>> permit expanded and increased use of the software within Apache projects.
> > >>> There is currently an ongoing effort within the Apache Tika community to
> > >>> utilize Joshua within Tika’s Translate API, see
> > >>> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
> > >>>
> > >>> == Known Risks ==
> > >>>
> > >>> === Orphaned products ===
> > >>> At the moment, regular contributions are made by a single contributor, the
> > >>> lead maintainer. He (Matt Post) plans to continue development for the next
> > >>> few years, but it is still a single point of failure, since the graduate
> > >>> students who worked on the project have moved on to jobs, mostly in
> > >>> industry. However, our goal is to help that process by growing the
> > >>> community in Apache, and at least in growing the community with users and
> > >>> participants from NASA JPL.
> > >>>
> > >>> === Inexperience with Open Source ===
> > >>> The teams at both Johns Hopkins and NASA JPL have experience with many OSS
> > >>> projects at Apache and elsewhere. We understand "how it works"
> > >>> here at the foundation.
> > >>>
> > >>>
> > >>> == Relationships with Other Apache Products ==
> > >>> Joshua includes dependencies on Hadoop, and also is included as a plugin in
> > >>> Apache Tika. We are also interested in coordinating with other projects
> > >>> including Spark, and other projects needing MT services for language
> > >>> translation.
> > >>>
> > >>> == Developers ==
> > >>> Joshua only has one regular developer who is employed by Johns Hopkins
> > >>> University. NASA JPL (Mattmann and McGibbney) have been contributing
> > >>> lately including a Brew formula and other contributions to the project
> > >>> through the DARPA XDATA and Memex programs.
> > >>>
> > >>> == Documentation ==
> > >>> Documentation and publications related to Joshua can be found at
> > >>> joshua-decoder.org. The source for the Joshua documentation is currently
> > >>> hosted on Github at
> > >>> https://github.com/joshua-decoder/joshua-decoder.github.com
> > >>>
> > >>> == Initial Source ==
> > >>> Current source resides at Github: github.com/joshua-decoder/joshua (the
> > >>> main decoder and toolkit) and github.com/joshua-decoder/thrax (the grammar
> > >>> extraction tool).
> > >>>
> > >>> == External Dependencies ==
> > >>> Joshua has a number of external dependencies. Only BerkeleyLM (Apache 2.0)
> > >>> and KenLM (LGPL 2.1) are run-time decoder dependencies (one of which is
> > >>> needed for translating sentences with pre-built models). The rest are
> > >>> dependencies for the build system and pipeline, used for constructing and
> > >>> training new models from parallel text.
> > >>>
> > >>> Apache projects:
> > >>> * Ant
> > >>> * Hadoop
> > >>> * Commons
> > >>> * Maven
> > >>> * Ivy
> > >>>
> > >>> There are also a number of other open-source projects with various
> > >>> licenses that the project depends on both dynamically (runtime), and
> > >>> statically.
> > >>>
> > >>> === GNU GPL 2 ===
> > >>> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
> > >>>
> > >>> === LGPL 2.1 ===
> > >>> * KenLM: github.com/kpu/kenlm
> > >>>
> > >>> === Apache 2.0 ===
> > >>> * BerkeleyLM: https://code.google.com/p/berkeleylm/
> > >>>
> > >>> === GNU GPL ===
> > >>> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
> > >>>
> > >>> == Required Resources ==
> > >>> * Mailing Lists
> > >>>  * private@joshua.incubator.apache.org
> > >>>  * dev@joshua.incubator.apache.org
> > >>>  * commits@joshua.incubator.apache.org
> > >>>
> > >>> * Git Repos
> > >>>  * https://git-wip-us.apache.org/repos/asf/joshua.git
> > >>>
> > >>> * Issue Tracking
> > >>>  * JIRA Joshua (JOSHUA)
> > >>>
> > >>> * Continuous Integration
> > >>>  * Jenkins builds on https://builds.apache.org/
> > >>>
> > >>> * Web
> > >>>  * http://joshua.incubator.apache.org/
> > >>>  * wiki at http://cwiki.apache.org
> > >>>
> > >>> == Initial Committers ==
> > >>> The following is a list of the planned initial Apache committers (the
> > >>> active subset of the committers for the current repository on Github).
> > >>>
> > >>> * Matt Post (post@cs.jhu.edu)
> > >>> * Lewis John McGibbney (lewismc@apache.org)
> > >>> * Chris Mattmann (mattmann@apache.org)
> > >>>
> > >>> == Affiliations ==
> > >>>
> > >>> * Johns Hopkins University
> > >>>  * Matt Post
> > >>>
> > >>> * NASA JPL
> > >>>  * Chris Mattmann
> > >>>  * Lewis John McGibbney
> > >>>
> > >>>
> > >>> == Sponsors ==
> > >>> === Champion ===
> > >>> * Chris Mattmann (NASA/JPL)
> > >>>
> > >>> === Nominated Mentors ===
> > >>> * Paul Ramirez
> > >>> * Lewis John McGibbney
> > >>> * Chris Mattmann
> > >>>
> > >>> == Sponsoring Entity ==
> > >>> The Apache Incubator
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > >> For additional commands, e-mail: general-help@incubator.apache.org
> > >
> > >
> > >
> >
> >
> >
>
