lucenenet-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [lucenenet] NightOwl888 opened a new issue #460: NLP Support (OpenNLP)
Date Thu, 01 Apr 2021 04:26:14 GMT

NightOwl888 opened a new issue #460:
URL: https://github.com/apache/lucenenet/issues/460


   I don't know if this is an issue or a discussion yet, but it seems logical to document
this somewhere in case we make it to release with gaps in support for NLP.
   
   First of all, Lucene 4.8.0 didn't support [Apache OpenNLP](https://opennlp.apache.org/),
it supported [Apache UIMA](https://uima.apache.org/). So, we picked a newer Lucene version
(8.2.0) and did what it did, choosing OpenNLP instead of UIMA (which is [seemingly now part
of the OpenNLP package](https://github.com/apache/opennlp/tree/master/opennlp-uima)).
   
   ## Options for NLP Support in .NET
   
   <table>
   <tbody>
   <tr>
   <th>Option</th>
   <th>Issues</th>
   <th>Notes</th>
   </tr>
   <tr>
   <td>Port OpenNLP from <a href="https://github.com/apache/opennlp/releases/tag/opennlp-1.9.1-rc2">version
1.9.1 tag</a> to .NET</td>
   <td>
   <ul>
   <li>The project is large and would take a lot of effort to port and maintain.</li>
   </ul>
   </td>
   <td></td>
   </tr>
   <tr>
   <td>Use <a href="https://github.com/AlexPoint/OpenNlp">AlexPoint/OpenNlp</a>
from <a href="https://www.nuget.org/packages/OpenNLP/">NuGet</a></td>
   <td>
   <ul>
   <li>The API has been refactored significantly from OpenNLP, so we would need a high-level
analysis to map Lucene's usage onto the new API</li>
   <li>It isn't clear what version of OpenNLP this corresponds to, as the version number doesn't
seem to track the one in Java, but it is probably from long before 1.9.1 and seems to be missing
features Lucene uses</li>
   <li>Currently only supports .NET Framework 4.5+</li>
   </ul>
   </td>
   <td></td>
   </tr>
   <tr>
   <td>Use <a href="https://sergey-tihon.github.io/Stanford.NLP.NET/#/">Stanford.NLP.NET</a></td>
   <td>
   <ul>
   <li>The API is significantly different from OpenNLP and it would take a high-level
analysis to determine whether it has the features we need</li>
   <li>It is an IKVM port, which currently only supports .NET Framework 3.5</li>
   <li>Its GNU GPL v2 license is <a href="https://apache.org/legal/resolved.html#category-x">too
restrictive to use in an Apache project</a> (we can depend on it, but not import its code)</li>
   </ul>
   </td>
   <td>There is a project called <a href="http://www.cs.cmu.edu/~ark/TweetNLP/index.html">Tweet
NLP</a> that extends it and seems to supply much of the functionality Lucene uses</td>
   </tr>
   <tr>
   <td>Use <a href="https://github.com/IanMercer/AboditNaturalLanguage">AboditNLP</a>
from <a href="https://www.nuget.org/packages/AboditNLP/">NuGet</a></td>
   <td>
   <ul>
   <li>A high-level analysis is required to determine if it supports the functionality
Lucene uses.</li>
   <li>Closed-source, only demos and the NuGet package are available.</li>
   </ul>
   </td>
   <td>Targets .NET Framework 4.7.2, .NET Standard 2.0, and .NET Standard 2.1.</td>
   </tr>
   <tr>
   <td>Use <a href="https://github.com/SciSharp/CherubNLP">CherubNLP</a>
from <a href="https://www.nuget.org/packages/CherubNLP/">NuGet</a></td>
   <td>
   <ul>
   <li>Would require a high-level analysis to determine if it supports the functionality
Lucene uses</li>
   </ul>
   </td>
   <td>Targets .NET Standard 2.0.</td>
   </tr>
   <tr>
   <td>Use <a href="https://github.com/sergey-tihon/OpenNLP.NET">OpenNLP.NET</a>
from <a href="https://www.nuget.org/packages/OpenNLP.NET/">NuGet</a></td>
   <td>
   <ul>
   <li>It is an IKVM port, which currently only supports .NET Framework 3.5</li>
   </ul>
   </td>
   <td><b>This is the option we currently use. </b>Someone created a strong-named
package named <a href="https://www.nuget.org/packages/OpenNLP.NET.Signed/">OpenNLP.NET.Signed</a>.
It would be preferable to get the original package owner to strong-name, but I suppose that
would mean incrementing to at least version 1.9.1.1, or upgrading to a newer version of OpenNLP.</td>
   </tr>
   </tbody>
   </table>
   
   There are some other options, but the ones listed above seem to be the most "official". However,
there are currently no options for .NET Core/.NET 5+ support of OpenNLP with the same API
as OpenNLP 1.9.1.
   
   ## IKVM
   
   Unfortunately, while IKVM has been a reasonable go-to way to quickly support Java-based
apps in the past, it was [abandoned by its main contributor](http://weblog.ikvm.net/)
in 2017 and has no .NET Core/.NET Standard support.
   
   There is an effort to get it working on .NET Core named [ikvm-revived](https://github.com/ikvm-revived/ikvm)
(to which I have contributed) but it seems to have been stalled for about a year and, as of
the date of this writing, there isn't even a pre-release on NuGet. There is some debate whether
they should support .NET Framework, but if they didn't we would still be able to target the
current OpenNLP.NET version on .NET Framework.
   
   See [NuGet Repository?](https://github.com/ikvm-revived/ikvm/issues/8)
   
   ## Alternatives to IKVM
   
   There was an announcement on the Microsoft Blog about .NET 5 supporting interoperability
with Java, but it isn't clear what they meant by that.
   
   https://devblogs.microsoft.com/dotnet/announcing-net-5-0-preview-1/#comment-4932
   
   In fact, others are mentioning in the comments they cannot use NLP on .NET Core and are
hoping to resolve that in .NET 5.
   
   I have searched but cannot find any examples of how .NET 5 supports Java interop. If it
does, that would probably be a better path forward than IKVM for NLP support. However,
it sounds as if this feature was punted from the official .NET 5 release.
   
   ## Current Support for NLP in Lucene.NET
   
   Since we are depending on the IKVM-based [OpenNLP.NET](https://github.com/sergey-tihon/OpenNLP.NET)
project, our current support is limited to .NET Framework 4.5.1+.
   
   We do have some minor issues (namely lack of `InternalsVisibleTo` support) due to the fact
that the library is not strong-named, but these are internal. Time will tell if lack of strong-naming
is going to be an issue for end users, but ideally to get strong naming we should contribute
to OpenNLP.NET rather than using the strong-named clone named [OpenNLP.NET.Signed](https://www.nuget.org/packages/OpenNLP.NET.Signed/).
   
   Most options for supporting NLP on .NET Core would require some work to put into play,
and it isn't clear how much work is involved to analyze this at a high level. It also isn't
clear how big the demand for this functionality will be.
   
   If we do make an effort to change dependencies, it would be sensible to create a
new assembly named after the new dependency (in the `src/dotnet` folder) so it is clear what
it depends on, and to leave the existing Lucene.Net.Analysis.OpenNLP project as-is.
   
   Another option is just to wait to see whether `ikvm-revived` releases a .NET Core targeted
package on NuGet and then support it when they finally do.
   
   > NOTE: If we bring back support for native .NET Collation in Lucene.Net.Analysis.Common,
it is possible that its `SortKey`s would not be portable between .NET Framework and .NET Core/.NET
5+ (see [Caveats and Comparisons](https://lucene.apache.org/core/4_8_0/analyzers-common/org/apache/lucene/collation/package-summary.html)).
If we don't have .NET Core/.NET 5 support for Lucene.Net.Analysis.OpenNLP, that collator option could
cause some issues if indexing can only be done on .NET Framework, but searching is done on
.NET Core or .NET 5. However, we have a [collator in Lucene.Net.ICU](https://lucenenet.apache.org/docs/4.8.0-beta00014/api/icu/Lucene.Net.Analysis.Icu.html#collation)
that is stable across .NET target frameworks that could be used instead in that scenario.
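
   To check the `SortKey` portability concern in the note above, a minimal sketch like the following could be compiled for both .NET Framework and .NET Core/.NET 5+ and the two outputs compared. The class name `SortKeyDump` and the helper `ToHex` are hypothetical names used only for illustration; the calls themselves (`CompareInfo.GetSortKey`, `SortKey.KeyData`, `BitConverter.ToString`) are standard BCL APIs. No expected output is shown because the bytes depend on the runtime's collation data.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of Lucene.NET: dumps the raw bytes of a
// culture-sensitive sort key so they can be compared across runtimes.
public static class SortKeyDump
{
    // Returns the sort-key bytes for 'text' under 'culture' as a hex string.
    public static string ToHex(string text, CultureInfo culture)
    {
        SortKey key = culture.CompareInfo.GetSortKey(text, CompareOptions.None);
        return BitConverter.ToString(key.KeyData);
    }

    public static void Main()
    {
        // "r\u00E9sum\u00E9" is "resume" with accented e's, written with
        // escapes so the source file encoding does not matter.
        Console.WriteLine(ToHex("r\u00E9sum\u00E9", CultureInfo.InvariantCulture));
    }
}
```

   If the two runtimes print different bytes for the same input, sort keys indexed on one would not compare correctly against keys produced on the other, which is the scenario the note describes.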


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


