lucenenet-dev mailing list archives

From "George Aroush" <geo...@aroush.net>
Subject RE: Storing primary key / Change lucene's document ID
Date Wed, 01 Nov 2006 04:32:53 GMT
Hi all,

A good place to look for examples of the many features of Lucene.Net is the
NUnit test code.  Bring up the "Test" project in VS.NET and do a search in
the "Demo" folder on the string "HitCollector" and you will find examples on
how to use it.

Also, may I suggest the "Lucene in Action" book?  At least visit
http://lucenebook.com/ and download the Java code of the book which has a
lot of examples about Lucene.

Regards,

-- George Aroush
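
For readers following the thread, here is a minimal, hedged sketch of the
pattern discussed in the quoted messages below: receive each matching internal
document number through a collector callback and translate it to a primary key
through a preloaded array, instead of loading every stored document. It is
plain Java with no Lucene dependency; DocCollector, search, collectKeys, and
the sample keys are hypothetical stand-ins for illustration, not Lucene or
Lucene.Net APIs. In a real index, the callback role would be played by a
HitCollector and the array would come from the FieldCache.

```java
import java.util.ArrayList;
import java.util.List;

class CollectorSketch {
    /** Hypothetical stand-in for a HitCollector-style callback. */
    interface DocCollector {
        void collect(int doc, float score);
    }

    /** Stand-in for a search: reports each matching internal doc number. */
    static void search(int[] matchingDocs, DocCollector collector) {
        for (int doc : matchingDocs) {
            collector.collect(doc, 1.0f); // the score is a dummy value here
        }
    }

    /**
     * Maps internal doc numbers to primary keys with one array lookup each.
     * keyByDoc plays the role of a field cache: index = internal doc number,
     * value = the externally stored primary key.
     */
    static List<String> collectKeys(int[] matchingDocs, String[] keyByDoc) {
        List<String> keys = new ArrayList<>();
        search(matchingDocs, (doc, score) -> keys.add(keyByDoc[doc]));
        return keys;
    }

    public static void main(String[] args) {
        String[] keyByDoc = {"A-100", "A-101", "A-102", "A-103"};
        int[] matches = {0, 2, 3};
        // Prints the comma-separated keys of the matching documents.
        System.out.println(String.join(",", collectKeys(matches, keyByDoc)));
    }
}
```

The point of the design is that the per-hit cost is an array lookup rather
than a stored-document read. The array has to be rebuilt whenever the index
changes, because internal document numbers are not stable.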


-----Original Message-----
From: Neil Carson [mailto:ncarson@everdreamcorp.com]
Sent: Tuesday, October 31, 2006 11:28 AM
To: lucene-net-dev@incubator.apache.org; lucene-net-dev@incubator.apache.org
Subject: RE: Storing primary key / Change lucene's document ID

Sorry, no, haven't written it yet.

________________________________

From: Kaufmann M. [mailto:kaufmannma@gmail.com]
Sent: Tue 10/31/2006 2:50 AM
To: lucene-net-dev@incubator.apache.org
Subject: Re: Storing primary key / Change lucene's document ID



Hello Neil,
Can you send me a link to a sample or something similar that uses
HitCollector and FieldCache?
I can't seem to find anything but the API documentation (simple links) in
the DotLucene documentation.

Thanks!
Best Regards, Marc

On 10/30/06, Neil Carson <ncarson@everdreamcorp.com> wrote:
>
> We are going through this now.
>
> Having Lucene retrieve the docs is slow.
>
> The recommendation from Doug on some old mailing lists I found was to
> use a HitCollector (since the standard search mechanism re-queries
> after accessing more than 100 docs), and to use the FieldCache to
> maintain a mapping of Lucene document ID <-> your primary key.
>
> We are planning to do this soon, for the same reason - search is fast,
> document retrieval is very slow.
>
> I noticed that in the Java version, the FieldCache is implemented with a
> weak hashmap. I don't know if this is the case in .NET or not (it looked
> more like a regular one on a quick initial inspection).
>
> Hope this helps.
>
>     Neil
>
> ________________________________
>
> From: Kaufmann M. [mailto:kaufmannma@gmail.com]
> Sent: Mon 10/30/2006 6:44 AM
> To: lucene-net-dev@incubator.apache.org
> Subject: Re: Storing primary key / Change lucene's document ID
>
>
>
> Hello Jon,
> The biggest difference in time I have found was between:
> console.writeline(hits.id(i))
> and
> console.writeline(hits.doc(i).get(fieldName))
>
> If I return the internal ID within this code, it is a lot faster than
> returning a field value through ...get().
>
> Overview of the current code:
> dim qry as search.query=(...)
> dim sw as new io.streamwriter(...)
> dim hits as search.hits
> hits=lis.search(qry) ' lis is defined once at the start of the code
> console.write(hits.length)
> console.write(" writing file ")
> dim intposmax as integer=hits.length-1
> for intpos as integer=0 to intposmax
>   if not intPos=0 then sw.write(",")
>   sw.write(hits.doc(intpos).get("id").tostring)
> next
> sw.close
> console.write(" - bulk insert ")
>
> ... bulk insert from sw.write file
>
> so you can see the time needed for the search and the bulk insert in the
> console. Bulk insert is not as fast on large result sets, but the search
> is still slower - so it is my primary bottleneck :).
>
> I already ran some tests comparing hits.id(intPos) with
> hits.doc(intpos).get("id") - those two differed a lot in the time they
> took...
>
> Best Regards, Marc
>
>
>
> On 10/30/06, Jon Palmer <jpalmer@contactnetworks.com> wrote:
> >
> > Marc,
> >
> >
> >
> > Can you give a few more details of how you are searching Lucene?
> > Maybe some pseudocode of the method that is fast and the one that
> > is slow. I think you are suggesting that there is a very large
> > performance hit for doing this:
> >
> >
> >
> > DocID = Hits.Doc(i).Get("ID")
> >
> >
> >
> > rather than:
> >
> >
> >
> > DocID = Hits.ID(i)
> >
> >
> >
> >
> >
> > JP
> >
> >
> >
> > P.S. Your numbers suggest that your problem is mostly linear. It
> > looks like your method has some setup cost and then processes approx.
> > 300 IDs a second:
> >
> >
> >
> > 18260 ID's - 72.2 s  -avg 253/s
> >
> > 3000 ID's - 10.02s  -avg 299/s
> >
> > 830 ID's - 2.25s  -avg 368/s
> >
> > 352 ID's - 1.08s  -avg 325/s
> >
> > 350 ID's - 0.98s  -avg 357/s
> >
> > 278 ID's - 0.48s  -avg 579/s
> >
> > 96 ID's - 1.05s  -avg 91/s
> >
> > 29 ID's - 0.66s  -avg 43/s
> >
> >
> >
> > Given this linear-ish behavior, are you sure that the bottleneck is
> > not writing back to file or to SQL?
> >
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Kaufmann M. [mailto:kaufmannma@gmail.com]
> > Sent: Monday, October 30, 2006 5:11 AM
> > To: lucene-net-dev@incubator.apache.org
> > Subject: Re: Storing primary key / Change lucene's document ID
> >
> >
> >
> > Hello George,
> >
> > The problem is the speed. Some samples:
> >
> >
> >
> > All counts include writing the IDs to a file and the BULK INSERT into SQL:
> >
> > 18260 ID's - 72.2 s
> >
> > 352 ID's - 1.08s
> >
> > 96 ID's - 1.05s
> >
> > 29 ID's - 0.66s
> >
> > 3000 ID's - 10.02s
> >
> > 350 ID's - 0.98s
> >
> > 278 ID's - 0.48s
> >
> > 830 ID's - 2.25s
> >
> >
> >
> > As you can see, the time it takes for more than 500 records is very
> > slow...
> >
> > If I write back the internal ID, it's a LOT faster...
> >
> >
> >
> > I'm not using the Lucene ordering because this also slowed down the
> > returning process a lot.
> >
> > And I'd like to count the results in different ways (which I was not
> > able to do in Lucene), so I have to give back all IDs into SQL...
> >
> >
> >
> > Thanks for helpin'!
> >
> >
> >
> >
> >
> > On 10/30/06, George Aroush <george@aroush.net> wrote:
> > >
> > > Hi Marc,
> > >
> > > You can't depend on Lucene's internal ID; it can change whenever you
> > > update the index -- this is something you can't control.  The way you
> > > are currently doing it, by storing an ID in a field named "id", is the
> > > right way to do it.  Don't worry about slowing down Lucene if you call
> > > the API to get the value of your field "id".  Lucene is super fast.
> > >
> > > Regards,
> > >
> > > -- George Aroush
> > >
> > > -----Original Message-----
> > > From: Kaufmann M. [mailto:kaufmannma@gmail.com]
> > > Sent: Friday, October 27, 2006 4:20 PM
> > > To: lucene-net-dev@incubator.apache.org
> > > Subject: Storing primary key / Change lucene's document ID
> > >
> > > Hello everybody,
> > > I've got a little question concerning the unique ID stored in the
> > > Lucene index (hits.ID(i)).
> > > Is it possible to change this ID, or set it on doc.add?
> > >
> > > Currently I'm running a test project which stores an external primary
> > > key in a field named 'id', but if I call it from the search engine I
> > > have to use the get method, which slows it down.
> > > If I could use this primary key as the Lucene ID, the whole engine
> > > would be a lot faster because I just need the IDs returned...
> > >
> > > Does anybody know if this is possible?
> > >
> > > Thanks!
> > > Best Regards, Marc
> > >
> >
> >
> >
> >
>
>
>
>



