+1
On Sep 26, 2006, at 7:17 PM, Ian Holsman wrote:
>
>
> issues addressed in this release:
> 1. updated proposal included
> 2. The first paragraph explains it to a layperson
> 3. OASIS issue addressed
>
>
> [ ] +1 Accept UIMA as an Incubator podling
> [ ] 0 Don't care
> [ ] -1 Reject this proposal for the following reason:
>
>
> ----8<-------Proposal------8<------
>
>
> Hello everyone -
>
> We are submitting this proposal to the community for a
> new project in the incubator, and look forward to starting to work
> with
> this community.
>
> This is a slightly modified and extended version of the proposal
> that has
> already been posted to general@incubator.apache.org. The whole
> mail thread
> can be found [http://www.nabble.com/Proposal-for-a-new-incubation-
> project%3A-Unstructured-Information-Management-Architecture---UIMA-
> tf2154324.html here].
>
> If you don't feel like reading the whole thread, the main question
> that came up was:
> this is all very well, but what does it really '''do'''? Attempts
> to answer that question
> where made [http://www.nabble.com/Re%3A-Proposal-for-a-new-
> incubation-project%3A-Unstructured-Information-Management-
> Architecture---UIMA-p5986403.html here] and [http://www.nabble.com/
> Re%3A-Proposal-for-a-new-incubation-project%3A-Unstructured-
> Information-Management-Architecture---UIMA-p5987788.html here]. We
> have since worked some of these into the proposal itself.
>
> ----
>
> = Proposal for Incubation Project: Unstructured Information
> Management Architecture - UIMA =
>
> == Abstract ==
>
> UIMA is a component framework for the analysis of unstructured
> content such as text, audio and video. It comprises an SDK and
> tooling for composing and running analytic components written in
> Java and C++.
>
>
> == Proposal: Unstructured Information Management Architecture
> framework ==
>
> Unstructured Information Management applications are software
> systems that analyze large volumes of unstructured information in
> order to discover knowledge that is relevant to an end user. We
> propose UIMA, a framework and SDK for developing such
> applications. An example UIM application might ingest plain text
> and identify entities, such as persons, places, organizations; or
> relations, such as works-for or located-at. UIMA enables such an
> application to be decomposed into components, for example
> ''"language identification"'' -> ''"language specific
> segmentation"'' -> ''"sentence boundary detection"'' -> ''"entity
> detection (person/place names etc.)"''. Each component must
> implement interfaces defined by the framework and must provide self-
> describing metadata via XML descriptor files. The framework
> manages these components and the data flow between them.
> Components are written in Java or C++; the data that flows between
> components is designed for efficient mapping between these
> languages. UIMA additionally provides capabilities to wrap
> components as network services, and can scale to very large volumes
> by replicating processing pipelines over a cluster of networked nodes.
>
> This framework has already attracted a following among government,
> commercial, and academic institutions who previously developed
> analysis algorithms, but were unable to easily build on each
> other's works, and who want to be able to evolve their applications
> by independently upgrading parts, as better technology becomes
> available. Applications built with this framework are being used
> with plain text, audio streams, and image/video streams,
> identifying entities and relations, converting speech to text,
> translating into different languages, and determining properties of
> images.
>
> The UIMA framework runs components in a flow, passing a common data
> object containing unstructured information (free text, audio,
> video, etc.) through the components. Each component examines the
> unstructured information and data added by other components, and
> adds data of its own. The framework mandates a standardized form
> of the data being passed, and a standardized form of the interfaces
> to the components.
>
> We propose a project to develop, implement, support and enhance
> this framework (and, over time, other implementations) that comply
> with the UIMA standard (which has been submitted for
> standardization work within [http://www.oasis-open.org OASIS].
> Members of this community are encouraged to participate in that
> effort, as well; OASIS has an open approach to granting Technical
> Committee voting rights to members of OASIS, described here: http://
> www.oasis-open.org/committees/process.php#2.4.
>
> The proposal includes both the framework, as well as tools to
> develop, describe, compose and deploy UIMA-based components and
> applications. The initial work will be based on the UIMA Version 2
> framework code developed by IBM; snapshots of each release of this
> code are currently made available on [http://sourceforge.net/
> projects/uima-framework SourceForge]. The Source``Forge versions
> would be stabilized in maintenance mode, if we are successful in
> moving to Apache.
> The framework is not specific to any IDE or platform, and does not
> depend on other middleware.
> Background:
>
> Databases are core components of nearly all applications; they
> store information in structured tables. But more and more of the
> available digital data is unstructured (e.g. email, web documents,
> images, audio clips, video streams) with little information
> (metadata) attached to explain its content or context. Although
> many applications have been built to process unstructured data,
> they have either managed it as a BLOB or they have developed
> isolated applications for analyzing the content. In the absence of
> a standardized means for analytical applications to share insights
> extracted from the content, analytical applications cannot build
> upon one another. As a result, the industry has barely begun to tap
> the value locked in unstructured information.
>
> Standardization is key to achieving component interoperability,
> with capabilities to mix components developed in different places
> and in Java, C++ and other languages. The Unstructured Information
> Management Architecture defines standards for component
> interoperability and application composition that will provide this
> needed unifying standard, and allow a variety of framework
> implementations to exist, while preserving the goal of unstructured
> information analytic component reuse.
>
> UIMA was built to help developers create solutions that get more value
> from unstructured information more quickly and at lower cost by making
> it easy to reuse and combine analytic modules from different
> sources into new analytic applications. The architecture and the
> framework have been validated through work with USA's DARPA which
> is using it as a standard for key projects with several
> universities involved in advanced linguistics analysis, such as
> Carnegie Mellon, Columbia, Stanford and University of
> Massachusetts. Other companies, such as the Mayo Clinic and Sloan
> Kettering, are also building efforts around UIMA. In addition,
> over 15 software vendors, including companies such as Inxight,
> Attensity, Clear``Forest, Temis, SPSS, SAS, Cognos, Endeca, Factiva
> and others, announced plans to support UIMA.
>
> The UIMA framework (binary and/or source code) has been downloaded
> over 8000 times from IBM alphaWorks (http://www.alphaworks.ibm.com/
> tech/uima) or Source``Forge (http://uima-framework.sourceforge.net).
>
> == Rationale ==
>
> We believe that moving the UIMA framework development to the Apache
> development community will lead to faster innovation, better
> integration with other open source software, and broader adoption
> of UIMA, accelerating the industry's ability to get the most value
> from text, audio, and video content. The UIMA framework is becoming
> attractive to developers who want to build components; we believe
> that having UIMA on Apache will encourage the development of a
> basic set of open source components that will jumpstart these
> developers' efforts. One of the first components we see possible
> synergy with is a search component based on Apache Lucene that
> would enable semantic search. We like the concept of the Lucene
> Sandbox as a way to encourage innovation around UIMA, and would
> envision something similar for this project.
>
>
> == Initial Goals ==
>
> Some initial work we see in the incubator includes the following:
>
> * redoing the parts of the tooling that were done as derivative
> works of Eclipse source code, to
> enable everything to be licensable under the Apache license
> * extending the framework to better support "scale-out"
> * extending the framework to better align with the emerging UIMA
> Standards work
> * extending the framework to support XMI-based SOAP and/or other
> service interfaces
> * extending the framework to support OSGi-based approaches to
> componentization and packaging
> * exploring embeddings of the framework within other interested
> Apache projects, including synergies with Lucene
> * providing aids to the community to migrate from previous versions
> of the framework to the Apache version
> * setting up community support: hosting a facility similar to the
> Lucene Sandbox to encourage innovation and
> experimentation; establishing a wiki and some process to allow
> better documentation to be developed by the community,
> and linking our existing XHTML documentation via an XSL transform
> to Apache FOP
>
>
> == Current Status ==
> === Meritocracy ===
>
> Meritocracy seems to us an ideal way to grow the community of
> developers around UIMA, it being a controlled, rational way to give
> those who positively contribute, more ability to directly
> contribute. This approach also gives contributors one of the best
> reasons to join the community of volunteers - to be recognized for
> the merit of their contributions.
>
> === Community ===
> Currently, the UIMA Framework development is being done by IBM,
> with input from a group of early adopters in industry and
> government. Going forward, we see IBM continuing to support
> several committers working on it. We have already begun talking
> with other people outside of IBM that have expressed interest in
> contributing towards the development. This includes members of
> academic institutions, people working for some of the software
> vendors that have announced plans to support UIMA, and others from
> companies that have expressed interest since initial announcements
> about our open source plans. Multiple non-IBM people have already
> expressed desires to become committers.
>
> === Core Developers ===
> The previous core developers of UIMA are Adam Lally, Thilo Goetz,
> Marshall Schor, Edward Epstein, Jaroslaw Cwiklik and Thomas
> Hampp. Many others have also contributed. The developers come
> from both the Research and Development parts of IBM.
>
> === Alignment ===
> UIMA has significant synergy with search applications, and we
> expect to see integration with Lucene in the future. UIMA makes use
> of the Apache Portable Runtime (APR) for C++ support. It is
> designed to be embeddable into other frameworks, such as web
> application servers. Part of UIMA is Eclipse-based tooling. We
> use ANT for build scripting. UIMA has support for various
> language bindings including C++ and Java; we also have more limited
> bindings for Perl, Python, and TCL. UIMA uses Web Services as part
> of its approach to wiring up components in its domain. It makes
> use of XML services such as Xerces and Xalan.
>
> The development of UIMA has been based on merit with open
> discussion among a distributed team of developers, from both
> Research and Development organizations.
>
> === License ===
>
> The current license for the source code is CPL, with a small number
> of files licensed under the EPL (Eclipse Public License), because
> these were created as "derivative works" of existing Eclipse open
> source code. When the code base is moved to Apache, it will be
> relicensed under the Apache license, except for the small number of
> files licensed under the EPL as derivative works of Eclipse source
> files. We plan to work in the incubator to redo these parts, so
> the entire offering can be licensed under the Apache license.
>
> The distribution for the C++ enablement layer includes open source
> components ICU (a Unicode package) which has its own license. We
> plan to work with community to properly make use of this non-Apache
> licensed component. Our current vision for the future of UIMA has
> it aligning with and incorporating other standards-based open
> source components/protocols, some of which may have licensing other
> than the Apache license (for example, the Xml Metadata Interchange
> (XMI), and the EMF ECore Model from Eclipse); we will work with the
> community in figuring out how to move forward on this.
>
> === Other IP ===
>
> When we requested OASIS to set up a Technical Committee chartered
> to develop a platform-independent specification for text and multi-
> modal analysis, we specified that it be set up under the "RF on
> Limited Terms" mode of the OASIS IP Policy. "RF" means Royalty
> Free, and the Limited Terms means companies that are working with
> us on the Technical Committee are restricted in adding additional
> terms.
>
> These are the most liberal terms and make any Essential Claims
> available to ALL and ROYALTY FREE.
> For the details please refer to:
>
> * http://www.oasis-open.org/who/ipr/ipr_faq.php
> * http://www.oasis-open.org/who/intellectualproperty.php
>
> Ultimately of course, there is always a risk that someone in the
> world holds a patent that can be claimed as Essential. The most any
> standards organization can do is govern the behavior of those who
> participate in its work and publicly document the licensing
> commitment of all participants.
>
> == Known Risks ==
>
> === Orphaned Software ===
>
> UIMA has been in active development for 5 years. The community of
> users has steadily grown, and there are now significant commercial
> and research organizations actively using it. UIMA is embedded in
> IBM software products and is delivered through IBM services
> engagements. IBM has developers assigned to it, and is continuing
> to support its development. In addition, several people outside of
> IBM have already expressed interest in working on UIMA, and have
> been providing IBM with initial feedback. One of the objectives of
> starting this Apache project is to provide a meritocratic structure
> for those people to begin more actively contributing to UIMA.
>
> === Inexperience with Open Source ===
>
> The individuals working on this software have background as IBM
> software developers. While many of them have experience working
> with open source software, none of them has had extensive
> experience contributing to other open source software. However,
> IBM as an organization has extensive experience contributing to
> open source projects and will make available resources to provide
> guidance to the developers working on this project.
>
> === Homogenous Developers (work for same company?) ===
>
> Currently all the developers work for IBM, although they come from
> different geographically dispersed organizations within IBM. We
> will reach out during the incubation time to get others to
> contribute; we have already received interest from several parties.
>
> === Reliance on salaried developers ===
>
> Currently the developers are paid employees of IBM.
>
> === Relationships with Other Apache Products ===
>
> We make use of several Apache components (SOAP / Web Services, XML
> (Xerces, Xalan), languages (Perl), scripting languages (ANT),
> Apache Portable Runtime. In addition, UIMA has been embedded in
> other frameworks, such as web application servers, and integrated
> with search engines. We are exploring Lucene extensions that could
> take advantage of UIMA processed data. We are currently
> investigating and prototyping some software packaging concepts
> based on OSGi; the Apache Incubator project Felix may have
> relevance as we go forward. The documentation is being moved to
> XHTML and plans to use Apache FOP for producing PDF reference
> materials.
>
> === An Excessive Fascination with the Apache Brand ===
> UIMA is already being adopted by a wide cross section of users,
> both commercial and academic, world-wide. Our experience shows that
> analytic modules can be reused and combined through UIMA making it
> easier and faster for developers to build new analytic applications
> for specific industries or domains. Given the diversity of content
> and analytics that will be required to address the multitude of
> opportunities - from military intelligence to quality assurance to
> contact center analytics -- growing this infrastructure so that it
> better aligns with other major Open Source communities should help
> accelerate industry's ability to get value from content assets.
>
> We believe that the Apache community of developers has the
> experience, background, visibility, and synergistic resources to
> encourage and foster a vibrant developer community around this
> project.
>
>
> == Documentation ==
>
> There is a combination Introduction, Conceptual Overview, Tutorial,
> Tools and Framework User's Guides and References, downloadable from
> http://dl.alphaworks.ibm.com/technologies/uima/
> UIMA_SDK_Users_Guide_Reference_2.0.pdf
>
>
> == Scope of the project ==
>
> The project will develop implementations of the UIMA architecture
> (which is concurrently being submitted to the OASIS standards
> process), supporting the breadth of platforms that developers
> working in this field are using, including Java, C++, Perl, Python
> and TCL; and utilities and tooling to support component and
> application developers and assemblers / packagers. It will
> initially include the Java UIMA framework for UIMA Version 2 (you
> can see a snap shot of the Version 2 release Source``Forge; the
> delivered code would this code base plus normal incremental bug
> fixes and improvements), plus additional components (mainly
> documentation and test cases, which are not currently on
> Source``Forge). Over time, the project is expected grow to include
> supporting various embeddings and integrations with other Apache
> components such as search engines and web application frameworks.
>
> Over time, we envision the project becoming an umbrella for related
> open-source around UIMA, including things like open-source pre-
> annotated corpora, and hosting a facility similar to the Lucene
> Sandbox to encourage innovation and experimentation.
>
> The UIMA framework is primarily a set of libraries (in Java, C++,
> Perl, etc.), test cases, and UIMA utilities and tools (scripts,
> plugins, executables, etc.) used to build, test and debug UIMA
> analytic components. The tooling includes several Eclipse platform
> plugins.
>
> == Initial source ==
>
> The source currently is maintained in IBM internal software control
> systems, with a copy of each release placed on SourceForge. At the
> time of launch, we plan to contribute the latest version of the
> code base (with some renaming of package prefixes to reflect
> apache.org), test cases, build files, and documentation, under the
> terms specified in the ASF Corporate Contributor License. We plan
> to donate the existing C++ enablement layer and the support for
> Perl, Python, and TCL a few months later than the initial donation;
> this delay is to give us time to finish preparing that code base
> for Open Source.
>
> == ASF resources to be created ==
>
> Mailing lists:
> * uima-dev
> * uima-commits
> * uima-user (we already have a substantial user community and
> expect them to turn up at Apache
> soon after we've hopefully been accepted into the incubator)
>
> For other resources such as Subversion repository, JIRA etc. we
> hope for guidance from our mentors.
>
> == Initial Set of Committers ==
>
> * Michael Baessler (mba@michael-baessler.de)
> * Edward Epstein (eddie@aewatercolors.com)
> * Thilo Goetz (twgoetz@gmx.de)
> * Adam Lally (alally@alum.rpi.edu)
> * Marshall Schor (msa@schor.com)
>
> === Sponsor ===
>
> We are requesting the Incubator to sponsor this. Our current
> vision is that it will become a top level project (other projects
> that develop UIMA components could become subprojects, for instance).
>
> === Mentors ===
>
> * Sam Ruby (ruby@apache.org)
> * Ken Coar (coar@apache.org)
> * Ian Holsman (lists@holsman.net)
>
> === Section 6: Open Issues for Discussion ===
>
>
> --
> Ian Holsman
> Ian@Holsman.net
> http://garden-gossip.com/ -- what's in your garden?
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org
|