lucenenet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Garski (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENENET-190) 2.4.0 Performance in TermInfosReader term caching (New implementation of SimpleLRUCache)
Date Mon, 10 Aug 2009 21:04:15 GMT

    [ https://issues.apache.org/jira/browse/LUCENENET-190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741549#action_12741549
] 

Michael Garski edited comment on LUCENENET-190 at 8/10/09 2:02 PM:
-------------------------------------------------------------------

Digy,

In taking a look at the patch, I don't believe it will solve the performance issue.

The original Java implementation of the SimpleLRUCache class uses a LinkedHashMap, and from
what I can tell, would perform similar to the currently committed Lucene.Net version which
uses a Hashtable to maintain the cached items and a LinkedList to manage the contents via
LRU.  The Hashtable operations get & set operations are both O(1), however the LinkedList
operations are where the performance hit is.  Adding an item to the head of the list is an
O(1), but removing an item is O[n].  I believe that maintaining the LRU state ofthe LinkedHashMap
would also be an O[n] - someone please correct me if I am wrong.

In the patch you provided, a LinkedList is no longer used but rather a SortedList, however
both the Add and Remove methods are O[n], which would more than likely degrade performance
even further.  One other thing I noticed is that the use of the incrementing long TimeStamp
value could lead to collisions in the SortedList, resulting in an incorrect LRU state in the
cache.

In looking at the collections provided in the .Net Framework, I don't see any off the shelf
that provide adequate performance, at least not under heavy query load.  The only collection
I've seen so far that would offer ideal performance would be the HasedLinkedList<T>
in the C5 collections [http://www.itu.dk/research/c5/].  The performance on the InsertFirst,
RemoveLast, and Get methods are all O(1), which is pretty hard to beat.

Michael

      was (Author: mgarski):
    Digy,

In taking a look at the patch, I don't believe it will solve the performance issue.

The original Java implementation of the SimpleLRUCache class uses a LinkedHashMap, and from
what I can tell, would perform similar to the currently committed Lucene.Net version which
uses a Hashtable to maintain the cached items and a LinkedList to manage the contents via
LRU.  The Hashtable operations get & set operations are both O(1), however the LinkedList
operations are where the performance hit is.  Adding an item to the head of the list is an
O(1), but removing an item is O(n).  I believe that maintaining the LRU state ofthe LinkedHashMap
would also be an O[n] - someone please correct me if I am wrong.

In the patch you provided, a LinkedList is no longer used but rather a SortedList, however
both the Add and Remove methods are O(n), which would more than likely degrade performance
even further.  One other thing I noticed is that the use of the incrementing long TimeStamp
value could lead to collisions in the SortedList, resulting in an incorrect LRU state in the
cache.

In looking at the collections provided in the .Net Framework, I don't see any off the shelf
that provide adequate performance, at least not under heavy query load.  The only collection
I've seen so far that would offer ideal performance would be the HasedLinkedList<T>
in the C5 collections [http://www.itu.dk/research/c5/].  The performance on the InsertFirst,
RemoveLast, and Get methods are all O(1), which is pretty hard to beat.

Michael
  
> 2.4.0 Performance in TermInfosReader term caching (New implementation of SimpleLRUCache)
> ----------------------------------------------------------------------------------------
>
>                 Key: LUCENENET-190
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-190
>             Project: Lucene.Net
>          Issue Type: Improvement
>         Environment: v2.4.0
>            Reporter: Digy
>            Priority: Minor
>         Attachments: SimpleLRUCache.rar
>
>
> Below is the mail from Michael Garski about the Performance in TermInfosReader term caching.
It would be good to have a faster LRUCache implementation in Lucene.Net
> DIGY
> {quote}
> Doug did an amazing job of porting 2.4.0, doing it mostly on his own!  
> Hooray Doug!
> We are using the committed version of 2.4.0 in production and I wanted to share a performance
issue we discovered and what we've done to work around it.  From the Java Lucene change log:
 "LUCENE-1195: Improve term lookup performance by adding a LRU cache to the TermInfosReader.
In performance experiments the speedup was about 25% on average on mid-size indexes with ~500,000
documents for queries with 3 terms and about 7% on larger indexes with ~4.3M documents."
> The Java implementation uses a LinkedHashMap within the class org.apache.lucene.util.cache.SimpleLRUCache,
which is very efficient at maintaining the cache.  As there is no equivalent collection in
.Net The current 2.4.0 port uses a combination of a LinkedList to maintain LRU state and a
HashTable to provide lookups.  While this implementation works, maintaining the LRU state
via the LinkedList creates a fair amount of overhead and can result in a significant reduction
of performance, most likely attributed to the LinkedList.Remove method being O(n).  As each
thread maintains its own cache of 1024 terms, these overhead in performing the removal is
a drain on performance.
> At this time we have disabled the cache in the method TermInfosReader.TermInfo Get(Term
term, bool useCache) by always setting the useCache parameter to false inside the body of
the method.  After doing this we saw performance return back to the 2.3.2 levels.  I have
not yet had the opportunity to experiment with other implementations within the SimpleLRUCache
to address the performance issue.  One approach that would might solve the issue is to use
the HashedLinkedList<T> class provided in the C5 collection library [http://www.itu.dk/research/c5/].
> Michael
> Michael Garski
> Search Architect
> MySpace.com
> www.myspace.com/michaelgarski <http://%27www.myspace.com/mgarski>
> {quote}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message