incubator-general mailing list archives

From Alex Harui <aha...@adobe.com>
Subject Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
Date Wed, 20 Jan 2016 15:46:17 GMT
External is good news.  I'm not sure how much leeway there is in the
following quote from [1], but what percentage of your users are currently
using an all-ASF-compatible set of projects?
 
    The question to ask yourself in this situation is:
        * "Will the majority of users want to use my
           product without adding the optional components?"

-Alex

[1] http://www.apache.org/legal/resolved.html


On 1/20/16, 7:17 AM, "Matt Post" <post@cs.jhu.edu> wrote:

>The dependencies can be split into two kinds: ones required for building
>new models, and ones needed by the decoder to translate new sentences
>with a pre-built model (i.e., black-box translation with the language
>packs).
>
>1. For building new models, you need a way to align the words between
>sentences in parallel text. Both the aligners used by Joshua (GIZA++ and
>the Berkeley aligner) are GPL of some form. These can be implemented as
>external dependencies, or can be replaced with another aligner, like
>fast_align (https://github.com/clab/fast_align), which is
>Apache-licensed. There are many other options, in fact. So this should
>not be a worry.
>
>2. For doing black-box translation, one needs to represent the language
>model, which is very large. The best tool for this is KenLM
>(github.com/kpu/kenlm), which is LGPL 2.1. There is also BerkeleyLM,
>which is just as good for practical purposes and is Apache-licensed.
>KenLM is C++ and is loaded via the JNI, whereas BerkeleyLM is written in
>Java. I have moved to including BerkeleyLM in language packs, because I
>can then include the Joshua-runtime, and people can translate without
>even having to compile anything.
>
>So in short, there are no hard dependencies on unfavorably-licensed
>external projects.
>
>matt
>
>
>
>
>> On Jan 20, 2016, at 10:08 AM, Mattmann, Chris A (3980)
>><chris.a.mattmann@jpl.nasa.gov> wrote:
>> 
>> Hey Hen,
>> 
>> Matt Post (who I believe is monitoring this list, and who has
>> been one of the key Joshua developers) and I have discussed this,
>> and we believe that the GPL/LGPL dependencies can potentially:
>> 
>> 1. be replaced with category-A or category-B alternatives. Matt
>> mentioned one already to me which has slipped my mind.
>> 2. be made in such a way that they are external tools and the
>> bindings exist in Joshua to call those external tools (aka runtime
>> deps akin to depending on a C compiler, etc.)
>> 
>> Cheers,
>> Chris
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattmann@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Henri Yandell <bayard@apache.org>
>> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
>> Date: Tuesday, January 19, 2016 at 7:38 PM
>> To: "general@incubator.apache.org" <general@incubator.apache.org>
>> Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine
>> Translation Toolkit
>> 
>>> License-wise, any expectation of problems from the GPL and LGPL
>>> dependencies?
>>> 
>>> On Mon, Jan 18, 2016 at 9:58 PM, Mattmann, Chris A (3980) <
>>> chris.a.mattmann@jpl.nasa.gov> wrote:
>>> 
>>>> Great Hen, we’d love to have you on board as a mentor! Please
>>>> add yourself to the proposal on the wiki.
>>>> 
>>>> Anyone else have interest in Machine Translation? Any OpenNLP folks,
>>>> Hadoop folks, Tika, or Lucene folks? CC’ing the dev lists for
>>>> visibility; please feel free to reply to general@i.a.o.
>>>> 
>>>> I’ll leave the DISCUSS thread open for a few more days.
>>>> 
>>>> Cheers,
>>>> Chris
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Henri Yandell <bayard@apache.org>
>>>> Reply-To: "general@incubator.apache.org"
>>>> <general@incubator.apache.org>
>>>> Date: Monday, January 18, 2016 at 7:57 PM
>>>> To: jpluser <chris.a.mattmann@jpl.nasa.gov>,
>>>> "general@incubator.apache.org" <general@incubator.apache.org>
>>>> Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine
>>>> Translation Toolkit
>>>> 
>>>>> Non-binding +1 to Joshua joining the Incubator. I'd be interested in
>>>>> mentoring.
>>>>> 
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: jpluser <chris.a.mattmann@jpl.nasa.gov>
>>>>>> Reply-To: "general@incubator.apache.org"
>>>>>> <general@incubator.apache.org>
>>>>>> Date: Tuesday, January 12, 2016 at 10:56 PM
>>>>>> To: "general@incubator.apache.org" <general@incubator.apache.org>
>>>>>> Cc: "post@cs.jhu.edu" <post@cs.jhu.edu>
>>>>>> Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine
>>>>>> Translation Toolkit
>>>>>> 
>>>>>>> Hi Everyone,
>>>>>>> 
>>>>>>> Please find attached for your viewing pleasure a proposed new project,
>>>>>>> Apache Joshua, a statistical machine translation toolkit. The proposal
>>>>>>> is in wiki draft form at:
>>>>>>> https://wiki.apache.org/incubator/JoshuaProposal
>>>>>>> 
>>>>>>> Proposal text is copied below. I’ll leave the discussion open for a
>>>>>>> week, and we are interested in folks who would like to be initial
>>>>>>> committers and mentors. Please discuss here on the thread.
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Chris (Champion)
>>>>>>> 
>>>>>>> ———
>>>>>>> 
>>>>>>> = Joshua Proposal =
>>>>>>> 
>>>>>>> == Abstract ==
>>>>>>> [[joshua-decoder.org|Joshua]] is an open-source statistical machine
>>>>>>> translation toolkit. It includes a Java-based decoder for translating
>>>>>>> with phrase-based, hierarchical, and syntax-based translation models, a
>>>>>>> Hadoop-based grammar extractor (Thrax), and an extensive set of tools
>>>>>>> and scripts for training and evaluating new models from parallel text.
>>>>>>> 
>>>>>>> == Proposal ==
>>>>>>> Joshua is a state-of-the-art statistical machine translation system
>>>>>>> that provides a number of features:
>>>>>>> 
>>>>>>> * Support for the two main paradigms in statistical machine
>>>>>>> translation: phrase-based and hierarchical / syntactic
>>>>>>> * A sparse feature API that makes it easy to add new feature templates
>>>>>>> supporting millions of features
>>>>>>> * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
>>>>>>> * Support for lattice decoding, allowing upstream NLP tools to expose
>>>>>>> their hypothesis space to the MT system
>>>>>>> * An efficient representation for models, allowing for quick loading of
>>>>>>> multi-gigabyte model files
>>>>>>> * Fast decoding speed (on par with Moses and mtplz)
>>>>>>> * Language packs: precompiled models that allow the decoder to be run
>>>>>>> as a black box
>>>>>>> * Thrax, a Hadoop-based tool for learning translation models from
>>>>>>> parallel text
>>>>>>> * A suite of tools for constructing new models for any language pair
>>>>>>> for which sufficient training data exists
>>>>>>> 
>>>>>>> == Background and Rationale ==
>>>>>>> A number of factors make this a good time for an Apache project
>>>>>>> focused on machine translation (MT): the quality of MT output (for many
>>>>>>> language pairs); the average computing resources available on
>>>>>>> computers, relative to the needs of MT systems; and the availability of
>>>>>>> a number of high-quality toolkits, together with a large base of
>>>>>>> researchers working on them.
>>>>>>> 
>>>>>>> Over the past decade, machine translation (MT; the automatic
>>>>>>> translation of one human language to another) has become a reality. The
>>>>>>> research into statistical approaches to translation that began in the
>>>>>>> early nineties, together with the availability of large amounts of
>>>>>>> training data and better computing infrastructure, have all come
>>>>>>> together to produce translation results that are “good enough” for a
>>>>>>> large set of language pairs and use cases. Free services like
>>>>>>> [[https://www.bing.com/translator|Bing Translator]] and
>>>>>>> [[https://translate.google.com|Google Translate]] have made these
>>>>>>> services available to the average person through direct interfaces and
>>>>>>> through tools like browser plugins, and sites across the world with
>>>>>>> higher translation needs use them to translate their pages
>>>>>>> automatically.
>>>>>>> 
>>>>>>> MT does not require the infrastructure of large corporations in order
>>>>>>> to produce feasible output. Machine translation can be
>>>>>>> resource-intensive, but need not be prohibitively so. Disk and memory
>>>>>>> usage are mostly a matter of model size, which for most language pairs
>>>>>>> is a few gigabytes at most, at which size models can provide coverage
>>>>>>> on the order of tens or even hundreds of thousands of words in the
>>>>>>> input and output languages. The computational complexity of the
>>>>>>> algorithms used to search for translations of new sentences is
>>>>>>> typically linear in the number of words in the input sentence, making
>>>>>>> it possible to run a translation engine on a personal computer.
>>>>>>> 
>>>>>>> The research community has produced many different open source
>>>>>>> translation projects for a range of programming languages and under a
>>>>>>> variety of licenses. These projects include the core “decoder”, which
>>>>>>> takes a model and uses it to translate new sentences between the
>>>>>>> language pair the model was defined for. They also typically include a
>>>>>>> large set of tools that enable new models to be built from large sets
>>>>>>> of example translations (“parallel data”) and monolingual texts. These
>>>>>>> toolkits are usually built to support the agendas of the (largely)
>>>>>>> academic researchers that build them: the repeated cycle of building
>>>>>>> new models, tuning model parameters against development data, and
>>>>>>> evaluating them against held-out test data, using standard metrics for
>>>>>>> testing the quality of MT output.
>>>>>>> 
>>>>>>> Together, these three factors (the quality of machine translation
>>>>>>> output, the feasibility of translating on standard computers, and the
>>>>>>> availability of tools to build models) make it reasonable for end users
>>>>>>> to use MT as a black-box service, and to run it on their personal
>>>>>>> machines.
>>>>>>> 
>>>>>>> These factors make it a good time for an organization with the status
>>>>>>> of the Apache Foundation to host a machine translation project.
>>>>>>> 
>>>>>>> == Current Status ==
>>>>>>> Joshua was originally ported from David Chiang’s Python implementation
>>>>>>> of Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
>>>>>>> University. The current version is maintained by Matt Post at Johns
>>>>>>> Hopkins’ Human Language Technology Center of Excellence. Joshua has
>>>>>>> made many releases with a list of over 20 source code tags. The last
>>>>>>> release of Joshua was 6.0.5 on November 5th, 2015.
>>>>>>> 
>>>>>>> == Meritocracy ==
>>>>>>> The current developers are familiar with meritocratic open source
>>>>>>> development at Apache. Apache was chosen specifically because we want
>>>>>>> to encourage this style of development for the project.
>>>>>>> 
>>>>>>> == Community ==
>>>>>>> Joshua is used widely across the world. Perhaps its biggest (known)
>>>>>>> research / industrial user is the Amazon research group in Berlin.
>>>>>>> Another user is the US Army Research Lab. No formal census has been
>>>>>>> undertaken, but posts to the Joshua technical support mailing list,
>>>>>>> along with the occasional contributions, suggest small research and
>>>>>>> academic communities spread across the world, many of them in India.
>>>>>>> 
>>>>>>> During incubation, we will explicitly seek to increase our usage across
>>>>>>> the board, including academic research, industry, and other end users
>>>>>>> interested in statistical machine translation.
>>>>>>> 
>>>>>>> == Core Developers ==
>>>>>>> The current set of core developers is fairly small, having fallen with
>>>>>>> the graduation from Johns Hopkins of some core student participants.
>>>>>>> However, Joshua is used fairly widely, as mentioned above, and there
>>>>>>> remains a commitment from the principal researcher at Johns Hopkins to
>>>>>>> continue to use and develop it. Joshua has seen a number of new
>>>>>>> community members become interested recently due to its potential use
>>>>>>> in a number of ongoing DARPA projects such as XDATA and Memex.
>>>>>>> 
>>>>>>> == Alignment ==
>>>>>>> Joshua is currently Copyright (c) 2015, Johns Hopkins University, all
>>>>>>> rights reserved, and licensed under the BSD 2-clause license. It would
>>>>>>> of course be the intention to relicense this code under AL2.0, which
>>>>>>> would permit expanded and increased use of the software within Apache
>>>>>>> projects. There is currently an ongoing effort within the Apache Tika
>>>>>>> community to utilize Joshua within Tika’s Translate API; see
>>>>>>> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>>>>>>> 
>>>>>>> == Known Risks ==
>>>>>>> 
>>>>>>> === Orphaned products ===
>>>>>>> At the moment, regular contributions are made by a single contributor,
>>>>>>> the lead maintainer. He (Matt Post) plans to continue development for
>>>>>>> the next few years, but it is still a single point of failure, since
>>>>>>> the graduate students who worked on the project have moved on to jobs,
>>>>>>> mostly in industry. However, our goal is to address that by growing the
>>>>>>> community in Apache, and at least by growing it with users and
>>>>>>> participants from NASA JPL.
>>>>>>> 
>>>>>>> === Inexperience with Open Source ===
>>>>>>> The teams at both Johns Hopkins and NASA JPL have experience with many
>>>>>>> OSS projects at Apache and elsewhere. We understand "how it works"
>>>>>>> here at the foundation.
>>>>>>> 
>>>>>>> 
>>>>>>> == Relationships with Other Apache Products ==
>>>>>>> Joshua includes dependencies on Hadoop, and is also included as a
>>>>>>> plugin in Apache Tika. We are also interested in coordinating with
>>>>>>> other projects, including Spark, and with other projects needing MT
>>>>>>> services for language translation.
>>>>>>> 
>>>>>>> == Developers ==
>>>>>>> Joshua has only one regular developer, who is employed by Johns Hopkins
>>>>>>> University. NASA JPL (Mattmann and McGibbney) have been contributing
>>>>>>> lately, including a Brew formula and other contributions to the project
>>>>>>> through the DARPA XDATA and Memex programs.
>>>>>>> 
>>>>>>> == Documentation ==
>>>>>>> Documentation and publications related to Joshua can be found at
>>>>>>> joshua-decoder.org. The source for the Joshua documentation is
>>>>>>> currently hosted on Github at
>>>>>>> https://github.com/joshua-decoder/joshua-decoder.github.com
>>>>>>> 
>>>>>>> == Initial Source ==
>>>>>>> Current source resides at Github: github.com/joshua-decoder/joshua
>>>>>>> (the main decoder and toolkit) and github.com/joshua-decoder/thrax (the
>>>>>>> grammar extraction tool).
>>>>>>> 
>>>>>>> == External Dependencies ==
>>>>>>> Joshua has a number of external dependencies. Only BerkeleyLM (Apache
>>>>>>> 2.0) and KenLM (LGPL 2.1) are run-time decoder dependencies (one of
>>>>>>> which is needed for translating sentences with pre-built models). The
>>>>>>> rest are dependencies for the build system and pipeline, used for
>>>>>>> constructing and training new models from parallel text.
>>>>>>> 
>>>>>>> Apache projects:
>>>>>>> * Ant
>>>>>>> * Hadoop
>>>>>>> * Commons
>>>>>>> * Maven
>>>>>>> * Ivy
>>>>>>> 
>>>>>>> There are also a number of other open-source projects with various
>>>>>>> licenses that the project depends on, both dynamically (at runtime) and
>>>>>>> statically.
>>>>>>> 
>>>>>>> === GNU GPL 2 ===
>>>>>>> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>>>>>>> 
>>>>>>> === LGPL 2.1 ===
>>>>>>> * KenLM: github.com/kpu/kenlm
>>>>>>> 
>>>>>>> === Apache 2.0 ===
>>>>>>> * BerkeleyLM: https://code.google.com/p/berkeleylm/
>>>>>>> 
>>>>>>> === GNU GPL ===
>>>>>>> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>>>>>>> 
>>>>>>> == Required Resources ==
>>>>>>> * Mailing Lists
>>>>>>>  * private@joshua.incubator.apache.org
>>>>>>>  * dev@joshua.incubator.apache.org
>>>>>>>  * commits@joshua.incubator.apache.org
>>>>>>> 
>>>>>>> * Git Repos
>>>>>>>  * https://git-wip-us.apache.org/repos/asf/joshua.git
>>>>>>> 
>>>>>>> * Issue Tracking
>>>>>>>  * JIRA Joshua (JOSHUA)
>>>>>>> 
>>>>>>> * Continuous Integration
>>>>>>>  * Jenkins builds on https://builds.apache.org/
>>>>>>> 
>>>>>>> * Web
>>>>>>>  * http://joshua.incubator.apache.org/
>>>>>>>  * wiki at http://cwiki.apache.org
>>>>>>> 
>>>>>>> == Initial Committers ==
>>>>>>> The following is a list of the planned initial Apache committers (the
>>>>>>> active subset of the committers for the current repository on Github).
>>>>>>> 
>>>>>>> * Matt Post (post@cs.jhu.edu)
>>>>>>> * Lewis John McGibbney (lewismc@apache.org)
>>>>>>> * Chris Mattmann (mattmann@apache.org)
>>>>>>> 
>>>>>>> == Affiliations ==
>>>>>>> 
>>>>>>> * Johns Hopkins University
>>>>>>>  * Matt Post
>>>>>>> 
>>>>>>> * NASA JPL
>>>>>>>  * Chris Mattmann
>>>>>>>  * Lewis John McGibbney
>>>>>>> 
>>>>>>> 
>>>>>>> == Sponsors ==
>>>>>>> === Champion ===
>>>>>>> * Chris Mattmann (NASA/JPL)
>>>>>>> 
>>>>>>> === Nominated Mentors ===
>>>>>>> * Paul Ramirez
>>>>>>> * Lewis John McGibbney
>>>>>>> * Chris Mattmann
>>>>>>> 
>>>>>>> == Sponsoring Entity ==
>>>>>>> The Apache Incubator
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>> 
