lucenenet-dev mailing list archives

From "Kaufmann M." <kaufman...@gmail.com>
Subject Re: Storing primary key / Change lucene's document ID
Date Tue, 31 Oct 2006 10:50:41 GMT
Hello Neil,
Could you point me to a sample (or anything similar) that shows how to use
HitCollector and FieldCache?
I can't find anything beyond the bare API reference in the DotLucene
documentation.

Thanks!
Best Regards, Marc
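
A minimal sketch of the approach Neil describes in the quoted message below, written as a self-contained Python toy model and not against the real Lucene.Net API. `ToyIndex`, `IdCollector` and `build_key_cache` are hypothetical names: they only mirror the roles of the searcher, a HitCollector and the FieldCache. The idea is to read the "id" field once into an array indexed by internal document number, and then let the collector translate each matching document number through that array instead of loading the stored document for every hit.

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]


class ToyIndex:
    """Hypothetical stand-in for an index; list position = internal doc ID."""

    def __init__(self, docs: List[Doc]) -> None:
        self.docs = docs
        self.doc_loads = 0  # how often the slow stored-document path was used

    def load_doc(self, doc_id: int) -> Doc:
        """Slow path, analogous to hits.doc(i): loads a whole stored document."""
        self.doc_loads += 1
        return self.docs[doc_id]

    def search(self, matches: Callable[[Doc], bool],
               collect: Callable[[int, float], None]) -> None:
        """Reports every matching internal doc ID to the collector callback."""
        for doc_id, doc in enumerate(self.docs):
            if matches(doc):
                collect(doc_id, 1.0)  # constant score; ranking is not modelled


def build_key_cache(index: ToyIndex, field: str) -> List[str]:
    """One pass over the index: cache[internal doc ID] -> primary key.

    Internal IDs can change when the index is updated, so the cache must be
    rebuilt for every new index snapshot.
    """
    return [doc[field] for doc in index.docs]


class IdCollector:
    """Translates each collected internal doc ID via the cached array."""

    def __init__(self, key_cache: List[str]) -> None:
        self.key_cache = key_cache
        self.keys: List[str] = []

    def collect(self, doc_id: int, score: float) -> None:
        self.keys.append(self.key_cache[doc_id])


index = ToyIndex([
    {"id": "pk-17", "body": "apple pie"},
    {"id": "pk-42", "body": "banana bread"},
    {"id": "pk-99", "body": "apple tart"},
])
key_cache = build_key_cache(index, "id")   # built once, reused per search
collector = IdCollector(key_cache)
index.search(lambda doc: "apple" in doc["body"], collector.collect)
print(",".join(collector.keys))
```

In this model the search never calls `load_doc`, which is the point of the technique: the per-hit cost becomes an array lookup. How the real FieldCache is keyed and when it is invalidated is not modelled here.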

On 10/30/06, Neil Carson <ncarson@everdreamcorp.com> wrote:
>
> We are going through this now.
>
> Having Lucene retrieve the docs is slow.
>
> The recommendation from Doug, which I found on some old mailing lists, was
> to use a HitCollector (since the standard search mechanism re-runs the query
> once you access hits beyond the first 100) and to use the FieldCache to
> maintain a mapping of Lucene document ID <-> your primary key.
>
> We are planning to do this soon, for the same reason: search is fast, but
> document retrieval is very slow.
>
> I noticed that in the Java version the FieldCache is implemented with a weak
> hash map. I don't know whether this is the case in .NET (on a quick initial
> inspection it looked more like a regular one).
>
> Hope this helps.
>
>     Neil
>
> ________________________________
>
> From: Kaufmann M. [mailto:kaufmannma@gmail.com]
> Sent: Mon 10/30/2006 6:44 AM
> To: lucene-net-dev@incubator.apache.org
> Subject: Re: Storing primary key / Change lucene's document ID
>
>
>
> Hello Jon,
> The biggest difference in time I have found was between:
> Console.WriteLine(hits.Id(i))
> and
> Console.WriteLine(hits.Doc(i).Get(fieldName))
>
> If I return the internal ID within this code, it is a lot faster than
> returning a field value through ...Get().
>
> Overview of the current code:
> Dim qry As Search.Query = (...)
> Dim sw As New IO.StreamWriter(...)
> ' lis (the searcher) is defined once at the start of the code
> Dim hits As Search.Hits = lis.Search(qry)
> Console.Write(hits.Length())
> Console.Write(" writing file ")
> Dim intPosMax As Integer = hits.Length() - 1
> For intPos As Integer = 0 To intPosMax
>   ' Separate the IDs with commas
>   If intPos <> 0 Then sw.Write(",")
>   ' Loads the stored document for every hit; this is the slow call
>   sw.Write(hits.Doc(intPos).Get("id"))
> Next
> sw.Close()
> Console.Write(" - bulk insert ")
>
> ... bulk insert from sw.write file
>
> This way the console shows the time needed for the search and for the bulk
> insert. The bulk insert slows down on large result sets, but the search is
> still slower, so it is my primary bottleneck :).
>
> I already compared hits.Id(intPos) against hits.Doc(intPos).Get("id");
> the two differ greatly in the time they take...
>
> Best Regards, Marc
>
>
>
> On 10/30/06, Jon Palmer <jpalmer@contactnetworks.com> wrote:
> >
> > Marc,
> >
> >
> >
> > Can you give a few more details of how you are searching Lucene? Maybe
> > some pseudo code of the method that is fast and the one that is slow. I
> > think you are suggesting that there is a very large performance hit for
> > doing this:
> >
> >
> >
> > DocID = Hits.Doc(i).Get("ID")
> >
> >
> >
> > rather than:
> >
> >
> >
> > DocID = Hits.ID(i)
> >
> >
> >
> >
> >
> > JP
> >
> >
> >
> > P.S. Your numbers suggest that your problem is mostly linear. It looks
> > like your method has some setup cost and then processes approx. 300 IDs a
> > second:
> >
> > 18260 IDs - 72.2s  - avg 253/s
> > 3000 IDs - 10.02s  - avg 299/s
> > 830 IDs - 2.25s  - avg 368/s
> > 352 IDs - 1.08s  - avg 325/s
> > 350 IDs - 0.98s  - avg 357/s
> > 278 IDs - 0.48s  - avg 579/s
> > 96 IDs - 1.05s  - avg 91/s
> > 29 IDs - 0.66s  - avg 43/s
> >
> > Given this linear-ish behavior, are you sure that the bottleneck is not
> > writing back to file or to SQL?
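
The per-row averages above can be recomputed directly from Marc's timings. The following is a minimal Python sketch, with hypothetical names (`timings`, `fit_line`), that prints the IDs-per-second rate for each sample and fits a least-squares line `seconds = intercept + slope * ids` as a rough check of the "mostly linear" reading. The timings include writing the file and the bulk insert, so the fit says nothing about which step dominates, and the largest result set weighs heavily on it, so the intercept in particular should be read with caution.

```python
from typing import List, Tuple

# (number of IDs, seconds) as reported in the thread
timings: List[Tuple[int, float]] = [
    (18260, 72.2), (3000, 10.02), (830, 2.25), (352, 1.08),
    (350, 0.98), (278, 0.48), (96, 1.05), (29, 0.66),
]


def fit_line(points: List[Tuple[int, float]]) -> Tuple[float, float]:
    """Ordinary least squares for seconds = intercept + slope * ids."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Throughput per sample: IDs processed per second
rates = [ids / seconds for ids, seconds in timings]
for (ids, seconds), rate in zip(timings, rates):
    print(f"{ids:>6} IDs  {seconds:>6.2f} s  {rate:7.1f} IDs/s")

intercept, slope = fit_line(timings)
print(f"fit: {intercept:.3f} s + {slope * 1000:.3f} ms per ID")
```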
> >
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Kaufmann M. [mailto:kaufmannma@gmail.com]
> > Sent: Monday, October 30, 2006 5:11 AM
> > To: lucene-net-dev@incubator.apache.org
> > Subject: Re: Storing primary key / Change lucene's document ID
> >
> >
> >
> > Hello George,
> >
> > The problem is speed. Some samples:
> >
> > All timings include writing the IDs to file and the BULK INSERT into SQL:
> >
> > 18260 IDs - 72.2s
> > 352 IDs - 1.08s
> > 96 IDs - 1.05s
> > 29 IDs - 0.66s
> > 3000 IDs - 10.02s
> > 350 IDs - 0.98s
> > 278 IDs - 0.48s
> > 830 IDs - 2.25s
> >
> > As you can see, result sets of more than 500 records take far too long.
> > If I write back the internal ID, it's a LOT faster...
> >
> > I'm not using Lucene's ordering because it also slowed down returning the
> > results a lot. And I'd like to count the results in different ways (which
> > I was not able to do in Lucene), so I have to pass all the IDs back to
> > SQL...
> >
> >
> >
> > Thanks for helpin'!
> >
> >
> >
> >
> >
> > On 10/30/06, George Aroush <george@aroush.net> wrote:
> >
> > >
> >
> > > Hi Marc,
> > >
> > > You can't depend on Lucene's internal ID; it can change whenever you
> > > update the index, and this is something you can't control.  The way you
> > > are currently doing it, by storing an ID in a field named "id", is the
> > > right way to do it.  Don't worry about slowing down Lucene if you call
> > > the API to get the value of your field "id".  Lucene is super fast.
> > >
> > > Regards,
> > >
> > > -- George Aroush
> >
> > >
> >
> > > -----Original Message-----
> >
> > > From: Kaufmann M. [mailto:kaufmannma@gmail.com]
> >
> > > Sent: Friday, October 27, 2006 4:20 PM
> >
> > > To: lucene-net-dev@incubator.apache.org
> >
> > > Subject: Storing primary key / Change lucene's document ID
> >
> > >
> >
> > > Hello everybody,
> > >
> > > I've got a little question concerning the unique ID stored in the Lucene
> > > index (hits.ID(i)).
> > > Is it possible to change this ID, or set it on doc.add?
> > >
> > > Currently I'm running a test project which stores an external primary
> > > key in a field named 'id', but to read it back from the search engine I
> > > have to use the get method, which slows it down.
> > > If I could use this primary key as the Lucene ID, the whole engine would
> > > be a lot faster, because I only need the IDs returned...
> > >
> > > Does anybody know if this is possible?
> > >
> > > Thanks!
> > >
> > > Best Regards, Marc
> >
> > >
> >
> > >
> >
> >
> >
> >
> >
>
>
>
>
