lucenenet-dev mailing list archives

From Shad Storhaug <s...@shadstorhaug.com>
Subject RE: State / Future of the Lucene.Net Project
Date Thu, 21 Jun 2018 05:56:56 GMT
Hi,

Actually, there is already an optimized Chinese word segmentation tool in the Lucene.Net.ICU
project (https://lucene.apache.org/core/4_8_0/analyzers-icu/index.html), which is still a
work in progress. We have Lucene.Net.ICU 100% ported with all tests passing (see https://github.com/NightOwl888/lucenenet/tree/icu4n-migration),
but we could definitely use some help getting the dependent ICU functionality finished. 

There are still many undecided issues regarding the ICU functionality. For example:

1. Should we use the newly ported ICU4N (https://github.com/NightOwl888/ICU4N) project or
try to add the functionality to the already existing icu.net project (https://github.com/sillsdev/icu-dotnet)?
Note the latter has been attempted, but there are several issues (missing functionality, incompatibilities,
problems loading data) that make it very challenging to provide all of the Lucene.Net.ICU
functionality - it was easier to get it working by porting from ICU4J, but will require maintaining
the ICU4N project.
2. If we use ICU4N, should we make it into a general library that benefits all of the .NET
ecosystem, or should we limit it to primarily support Lucene.NET?
3. If we use ICU4N, how should we best allow the user to load a customized version of the
ICU data? 

If we make ICU4N into a general library, it would probably be best to contribute it back to
the ICU project: http://site.icu-project.org/ so it is maintained and released on the same
schedule and documented there, too. Do note that ICU releases very often to keep up with the
changes to the Unicode standard - we have ported ICU4J from version 60.1 (released November
1, 2017) and they just released version 62.1 yesterday (June 20, 2018). So one of the first
orders of business would be to upgrade the existing ICU4N features to version 62.1 if we go
that route.

Also note that we only have about 40% of ICU4J ported, which is just enough to support Lucene.Net.ICU.
There are several APIs that still need to be refactored to fit into the .NET paradigm, as
well as some gaps in functionality to work out before proceeding with any more porting work.

My hope was to make ICU4N into a first-rate .NET component to add complete Unicode support
to the .NET framework with fully .NET-like APIs; however, we also have the option of limiting
the scope of the project to just what is needed to support Lucene.Net.ICU in order to get
the 4.8.0 release done quicker. Either way, there is still work to be done to make the APIs
of the project consistent if we use ICU4N, and there is quite a bit of missing functionality
to add to icu.net if we use that instead. Basically, there are 3 ways to complete this:

1. Add the required functionality to the icu.net project in order to support the Lucene.Net.ICU
features, port the missing Lucene.Net.ICU features to the current master branch and abandon
work on ICU4N.
2. Finish up the API and fix 19 failing tests to make ICU4N good enough to support Lucene.Net.ICU
without making it into a first-rate component that supports all ICU features.
3. Contact the ICU team about contributing ICU4N to their repository and if they agree, allow
them to lead the direction of the API and features (with the added possibility of their help
and Unicode expertise).

#1 would be the lowest-maintenance long-term solution, but I have doubts we can get more than
about 50% of the Lucene.Net.ICU features to function if we go that route. Failing that, the
preference is to go with option #3 so the whole .NET ecosystem benefits (and contributes)
and we will be able to release 100% of the Lucene.Net.ICU functionality. Would you be interested
in helping out in order to make the word segmentation functionality production-ready, and
if so, for which of these options?

Let me know, and I will start putting together a prioritized list of items that are incomplete
to get you started.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: 小康 [mailto:xiaokang@cnblogs.com] 
Sent: Thursday, June 21, 2018 9:00 AM
To: user@lucenenet.apache.org
Cc: dev@lucenenet.apache.org
Subject: Re: State / Future of the Lucene.Net Project

I want to add a Chinese word segmentation tool with good performance in
lucenenet.

I think this will be helpful to Chinese developers.

Can I do this job?

2018-06-21 5:19 GMT+08:00 Shad Storhaug <shad@shadstorhaug.com>:

> Hello. Thanks for the heads up. For code optimizations, you will need to
> locate the areas that need fixing, patch them, and then submit a separate
> pull request on GitHub for each one. Please provide a small standalone
> piece of code (a console app works great) we can run before and after the
> patch to demonstrate exactly how the fix affects performance.
>
> We will definitely welcome the help.
>
> -----Original Message-----
> From: 小康 [mailto:xiaokang@cnblogs.com]
> Sent: Wednesday, June 20, 2018 8:03 PM
> To: user@lucenenet.apache.org
> Cc: dev@lucenenet.apache.org
> Subject: Re: State / Future of the Lucene.Net Project
>
> I am willing to contribute to lucene.net because I am creating a vertical
> search engine with lucene.net.
>
> I want to make lucene.net faster and better.
>
> I can do some contribution on weekends.
>
> Thank you.
>
> 2018-05-28 23:48 GMT+08:00 Stefan Bodewig <bodewig@apache.org>:
>
> > Hi all
> >
> > it is pretty difficult to write a message like this. I've been one of
> > Lucene.Net's mentors during Apache incubation and even though I never
> > contributed anything significant (at least code-wise) I really care for
> > the project and its community.
> >
> > For more than a year Shad has been the only committer who actually
> > committed to the code base but despite his herculean effort we haven't
> > been able to attract new contributors.
> >
> > Of the project management committee most people seem to be absent by now
> > and the project has rightfully raised concerns by the board [1][2]
> >
> > There really are only two options.
> >
> > * we create a credible plan how to get Lucene.Net back into a healthy
> >   state with multiple contributors and a more active PMC and execute on
> >   it
> >
> > * we start the process of sending the project to the Apache Attic
> >   http://attic.apache.org/ (which is not a one-way road, projects can be
> >   resurrected if a new community emerges).
> >
> > We probably should start with trying the first option. We have tried to
> > find new contributors in the past but haven't been successful, let's give
> > it one more try.
> >
> > What we need are people who are willing to contribute for more than a
> > single pull request or two and who are willing to become members of the
> > developer community here at Apache. If you think this description fits
> > you, please raise your hand :-)
> >
> > Stefan
> >
> > [1] https://lists.apache.org/thread.html/c44ef94020271b3823fe356a255d693a76287c1214743dfc074621de@%3Cdev.lucenenet.apache.org%3E
> > [2] https://lists.apache.org/thread.html/70a34c2cd3298afe02827c219e2dc2b66ae594aabcbaa33265301a44@%3Cdev.lucenenet.apache.org%3E
> >
>