lucenenet-dev mailing list archives

From "Neil Carson" <ncar...@everdreamcorp.com>
Subject RE: Storing primary key / Change lucene's document ID
Date Mon, 30 Oct 2006 15:28:36 GMT
We are going through this now.
 
Having Lucene retrieve the docs is slow.
 
The recommendation from Doug on some old mailing lists I found was to use a HitCollector
(since the standard search mechanism re-queries after accessing more than doc 100), and to
use the FieldCache to maintain a mapping of Lucene document ID <-> your primary key.
 
We are planning to do this soon, for the same reason - search is fast, document retrieval is very
slow.
 
I noticed that in the Java version, the FieldCache is implemented with a weak hashmap. I don't know
if this is the case in .NET or not (it looked more like a regular one on a quick initial inspection).
 
Hope this helps.
 
    Neil
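
The approach Neil describes (collect raw document numbers during the search, then translate them through an in-memory array instead of loading each stored document) can be sketched without any Lucene dependency. This is only a toy model under stated assumptions: `ToyIndex`, `build_key_cache` and `collect_keys` are hypothetical names invented for illustration, not Lucene or Lucene.Net classes, and the real FieldCache builds its array from indexed terms rather than from stored documents. As George notes elsewhere in the thread, internal document numbers change when the index changes, so such a cache would have to be rebuilt for each newly opened index.

```python
class ToyIndex:
    """Hypothetical stand-in for an index reader; not a Lucene class."""

    def __init__(self, docs):
        # The position in the list plays the role of the internal document number.
        self._docs = docs
        self.stored_loads = 0  # counts the expensive stored-document fetches
        self._postings = {}
        for doc_id, doc in enumerate(docs):
            for word in set(doc["text"].split()):
                self._postings.setdefault(word, []).append(doc_id)

    def max_doc(self):
        return len(self._docs)

    def load_document(self, doc_id):
        # Models the slow path: hits.doc(i).get("id")
        self.stored_loads += 1
        return self._docs[doc_id]

    def search(self, word, collect):
        # Models a collector-style search: one callback per matching document.
        for doc_id in self._postings.get(word, []):
            collect(doc_id, 1.0)


def build_key_cache(index, field):
    """One pass over the index: array position = doc number, value = primary key."""
    return [index.load_document(d)[field] for d in range(index.max_doc())]


def collect_keys(index, word, key_cache):
    """Translate each hit through the cache; no stored document is loaded."""
    keys = []
    index.search(word, lambda doc_id, score: keys.append(key_cache[doc_id]))
    return keys


index = ToyIndex([
    {"id": "PK-17", "text": "fast search"},
    {"id": "PK-42", "text": "slow retrieval"},
    {"id": "PK-99", "text": "fast retrieval"},
])
key_cache = build_key_cache(index, "id")   # paid once per opened index
loads_after_build = index.stored_loads
result = collect_keys(index, "fast", key_cache)
print(result)
```

The point of the design is that the per-hit cost becomes an array lookup; the one-time cost of filling the array is paid when the index is opened, not on every search.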

________________________________

From: Kaufmann M. [mailto:kaufmannma@gmail.com]
Sent: Mon 10/30/2006 6:44 AM
To: lucene-net-dev@incubator.apache.org
Subject: Re: Storing primary key / Change lucene's document ID



Hello Jon,
The biggest difference in time I have found was between:
console.writeline(hits.id(i))
and
console.writeline(hits.doc(i).get(fieldName))

If I return the internal ID within this code, it is a lot faster than
returning a field value through ...get().

Overview of the current code:
dim qry as search.query=(...)
dim sw as new io.streamwriter(...)
dim hits as search.hits
hits=lis.search(qry) ' lis is defined once at the start of the code
console.write(hits.length)
console.write(" writing file ")
dim intposmax as integer=hits.length-1
for intpos as integer=0 to intposmax
  if not intpos=0 then sw.write(",") ' comma-separate every ID after the first
  sw.write(hits.doc(intpos).get("id").tostring)
next
sw.close
console.write(" - bulk insert ")

... bulk insert from sw.write file

So you can see the time needed for the search and for the bulk insert in the console.
Bulk insert is not that fast on large result sets, but the search is still
slower - so it is my primary bottleneck :).

I already ran some tests comparing hits.id(intPos) with hits.doc(intpos).get("id")
- the two differ a lot in the time they take...

Best Regards, Marc



On 10/30/06, Jon Palmer <jpalmer@contactnetworks.com> wrote:
>
> Marc,
>
>
>
> Can you give a few more details of how you are searching Lucene? Maybe
> some pseudo code of the method that is fast and the one that is slow. I
> think you are suggesting that there is a very large performance hit for
> doing this:
>
>
>
> DocID = Hits.Doc(i).Get("ID")
>
>
>
> rather than:
>
>
>
> DocID = Hits.ID(i)
>
>
>
>
>
> JP
>
>
>
> P.S. Your numbers suggest that your problem is mostly linear. It looks
> like your method has some setup cost and then processes approx 300 IDs a
> second
>
>
>
> 18260 ID's - 72.2 s  -avg 253/s
>
> 3000 ID's - 10.02s  -avg 299/s
>
> 830 ID's - 2.25s  -avg 368/s
>
> 352 ID's - 1.08s  -avg 325/s
>
> 350 ID's - 0.98s  -avg 357/s
>
> 278 ID's - 0.48s  -avg 579/s
>
> 96 ID's - 1.05s  -avg 91/s
>
> 29 ID's - 0.66s  -avg 43/s
>
>
>
> Given this linear-ish behavior, are you sure that the bottleneck is not
> writing back to file or to SQL?
>
>
>
>
>
>
>
> -----Original Message-----
> From: Kaufmann M. [mailto:kaufmannma@gmail.com]
> Sent: Monday, October 30, 2006 5:11 AM
> To: lucene-net-dev@incubator.apache.org
> Subject: Re: Storing primary key / Change lucene's document ID
>
>
>
> Hello George,
>
> The problem is the speed, some samples:
>
> All counts include writing IDs to file and BULK INSERT to SQL:
>
> 18260 ID's - 72.2 s
> 352 ID's - 1.08s
> 96 ID's - 1.05s
> 29 ID's - 0.66s
> 3000 ID's - 10.02s
> 350 ID's - 0.98s
> 278 ID's - 0.48s
> 830 ID's - 2.25s
>
> As you can see - the time it takes for more than 500 records is really
> slow...
> If I write back the internal ID - it's a LOT faster...
>
> I'm not using the lucene-ordering because this also slowed down the
> returning process a lot.
> And I'd like to count the results in different ways (which I was not able to
> do in lucene) so I have to give back all ID's into SQL...
>
> Thanks for helpin'!
>
>
>
>
>
> On 10/30/06, George Aroush <george@aroush.net> wrote:
> >
> > Hi Marc,
> >
> > You can't depend on Lucene's internal ID, it will change every time you
> > update the index -- this is something you can't control.  The way you are
> > currently doing it, by storing an ID in a field named "id", is the right way
> > to do it.  Don't worry about slowing down Lucene if you call the API to get
> > the ID of your field "id".  Lucene is super fast.
> >
> > Regards,
> >
> > -- George Aroush
> >
> > -----Original Message-----
> > From: Kaufmann M. [mailto:kaufmannma@gmail.com]
> > Sent: Friday, October 27, 2006 4:20 PM
> > To: lucene-net-dev@incubator.apache.org
> > Subject: Storing primary key / Change lucene's document ID
> >
> > Hello everybody,
> > I've got a little question concerning the unique ID stored in the Lucene
> > index (hits.ID(i)).
> > Is it possible to change this ID, or set it on doc.add?
> >
> > Currently I'm running a test-project which stores an external primary key in
> > a field named 'id', but if I call it from the search-engine I have to use
> > the get-method - which slows it down.
> > If I could use this primary key as lucene-ID the whole engine would be a lot
> > faster because I just need the ID's returned...
> >
> > Does anybody know if this is possible?
> >
> > Thanks!
> > Best Regards, Marc
> >
>
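
Jon's per-second averages and his "linear-ish" reading of Marc's timings, both quoted above, can be rechecked with a few lines of arithmetic. The sketch below recomputes the rate for each row from the quoted counts and times and fits a straight line (seconds = intercept + slope * IDs) by ordinary least squares. The helper names `ids_per_second` and `fit_line` are made up for this illustration; the only data used are the eight count/time pairs from the thread.

```python
# (id_count, seconds) pairs copied from the timings quoted in the thread
measurements = [
    (18260, 72.2), (3000, 10.02), (830, 2.25), (352, 1.08),
    (350, 0.98), (278, 0.48), (96, 1.05), (29, 0.66),
]


def ids_per_second(id_count, seconds):
    """Average throughput for one run."""
    return id_count / seconds


def fit_line(points):
    """Ordinary least-squares fit of seconds = intercept + slope * id_count."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rates = [ids_per_second(n, t) for n, t in measurements]
intercept, slope = fit_line(measurements)

for (n, t), rate in zip(measurements, rates):
    print(f"{n:>6} IDs  {t:>6.2f} s  {rate:7.1f} IDs/s")
print(f"fit: {intercept:.2f} s + {slope * 1000:.2f} ms per ID")
```

With only eight runs, the fitted slope is dominated by the largest one (18260 IDs), so this is a rough sanity check rather than a measurement of where the time goes; timing the search loop and the bulk insert separately would answer Jon's question more directly.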


