lucenenet-dev mailing list archives

From "Christopher Currens (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENENET-466) optimisation for the GermanStemmer.vb
Date Mon, 26 Mar 2012 17:02:30 GMT

     [ https://issues.apache.org/jira/browse/LUCENENET-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Currens updated LUCENENET-466:
------------------------------------------

    Attachment: DIN2Stemmer.patch

Bjorn,

I've made this patch from the src/contrib/Analyzers folder, on top of the DIN2 changes already
committed to trunk.  Since the extent of my German is "danke!", I was hoping you could check
whether this stemmer is working properly before I commit it to trunk.

These are the test cases I wrote, which should emulate the results of the normal DIN1
stemmer. On each line, the word to the left of the semicolon is the input and the word to
the right is the expected result.

{noformat}
# Test cases for words with ae, ue, or oe in them
Haus;hau
Hauses;hau
Haeuser;hau
Haeusern;hau
steuer;steur
rueckwaerts;ruckwar
geheimtuer;geheimtur
{noformat}

The last word in particular produces fairly different results in each stemmer, though
I think the differences are expected, due to the different DIN.

The DIN2 stemmer will also translate 'Häuser' and 'Häusern' properly (to hau), so
there is support for both umlauts and the expanded 'ae', 'oe' and 'ue' forms.
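
To illustrate why both spellings can end up at the same stem, here is a minimal sketch of the kind of folding the test cases above suggest. It is not the code from DIN2Stemmer.patch; the class and method names (UmlautFolding, Fold) are made up for this example, and the rule "drop an 'e' that follows a, o or u" is only an assumption inferred from results such as steuer;steur.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of Lucene.Net.
public static class UmlautFolding
{
    // Folds umlauts and their expanded forms to the bare vowel:
    // 'ä' and "ae" -> 'a', 'ö' and "oe" -> 'o', 'ü' and "ue" -> 'u'.
    public static string Fold(string word)
    {
        string lower = word.ToLowerInvariant();
        var result = new StringBuilder(lower.Length);

        for (int i = 0; i < lower.Length; i++)
        {
            char ch = lower[i];

            if (ch == 'ä') { result.Append('a'); continue; }
            if (ch == 'ö') { result.Append('o'); continue; }
            if (ch == 'ü') { result.Append('u'); continue; }

            result.Append(ch);

            // Expanded form: skip the 'e' directly after a, o or u.
            bool isBaseVowel = ch == 'a' || ch == 'o' || ch == 'u';
            if (isBaseVowel && i + 1 < lower.Length && lower[i + 1] == 'e')
            {
                i++;
            }
        }

        return result.ToString();
    }
}
```

With this sketch, Fold("Häuser") and Fold("Haeuser") give the same string, so a suffix-stripping step applied afterwards would treat them identically. The trade-off is that a "ue" that is not an umlaut gets folded too, which is consistent with the steuer case in the list above.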
                
> optimisation for the GermanStemmer.vb
> --------------------------------------
>
>                 Key: LUCENENET-466
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-466
>             Project: Lucene.Net
>          Issue Type: Improvement
>          Components: Lucene.Net Contrib
>    Affects Versions: Lucene.Net 2.9.4, Lucene.Net 2.9.4g, Lucene.Net 3.0.3
>            Reporter: Prescott Nasser
>            Priority: Minor
>             Fix For: Lucene.Net 3.0.3
>
>         Attachments: DIN2Stemmer.patch
>
>
> I have a little optimisation for the GermanStemmer.vb (in 
> Contrib.Analyzers) class. At the moment the function "Substitute" 
> converts the German umlauts "ä" to "a", "ö" to "o" and "ü" to "u". This 
> is not the correct German transliteration. They must be converted to "ae", 
> "oe" and "ue". So the name can be written "Björn" or "Bjoern" but not 
> "Bjorn". With this optimisation a user can search for "Björn" and also 
> find "Bjoern".
>  
> Here is the optimized code snippet:
>  
> else if ( buffer[c] == 'ä' )
> {
>     buffer[c] = 'a';
>     buffer.Insert(c + 1, 'e');
> }
> else if ( buffer[c] == 'ö' )
> {
>     buffer[c] = 'o';
>     buffer.Insert(c + 1, 'e');
> }
> else if ( buffer[c] == 'ü' )
> {
>     buffer[c] = 'u';
>     buffer.Insert(c + 1, 'e');
> }
>  
> Thank You
> Björn
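
For reference, the fragment quoted above could be completed roughly as follows. This is only a sketch: the surrounding loop, the class name UmlautExpansion and the method name Expand are assumptions made so the example is self-contained, since the original shows just the else-if branches of the Substitute function. Only the lowercase umlauts from the snippet are handled.

```csharp
using System;
using System.Text;

// Hypothetical wrapper for illustration only; not part of Lucene.Net.
public static class UmlautExpansion
{
    // Expands 'ä', 'ö', 'ü' to "ae", "oe", "ue" as in the quoted snippet.
    public static string Expand(string word)
    {
        var buffer = new StringBuilder(word);

        for (int c = 0; c < buffer.Length; c++)
        {
            char replacement;
            switch (buffer[c])
            {
                case 'ä': replacement = 'a'; break;
                case 'ö': replacement = 'o'; break;
                case 'ü': replacement = 'u'; break;
                default: continue; // not an umlaut, move to the next character
            }

            buffer[c] = replacement;
            buffer.Insert(c + 1, 'e');
            c++; // step over the 'e' that was just inserted
        }

        return buffer.ToString();
    }
}
```

Here Expand("Björn") yields "Bjoern", so the umlaut and expanded spellings index to the same term. Because Insert lengthens the buffer, the loop re-reads buffer.Length on every iteration.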


       
