lucenenet-dev mailing list archives

From "Kaufmann M." <kaufman...@gmail.com>
Subject Re: Storing primary key / Change lucene's document ID
Date Mon, 30 Oct 2006 14:44:53 GMT
Hello Jon,
The biggest difference in time that I have found was between:
Console.WriteLine(hits.Id(i))
and
Console.WriteLine(hits.Doc(i).Get(fieldName))

Returning the internal ID in this code is a lot faster than
returning a field value through ...Get().

Overview of the current code:
Dim qry As Search.Query = (...)
Dim sw As New IO.StreamWriter(...)
Dim hits As Search.Hits
hits = lis.Search(qry) ' lis is defined once at the start of the code
Console.Write(hits.Length)
Console.Write(" writing file ")
Dim intPosMax As Integer = hits.Length - 1
For intPos As Integer = 0 To intPosMax
  ' separate the IDs with commas
  If intPos > 0 Then sw.Write(",")
  ' Get() already returns a string, so no ToString is needed
  sw.Write(hits.Doc(intPos).Get("id"))
Next
sw.Close()
Console.Write(" - bulk insert ")

' ... bulk insert from the file written through sw

So you can see in the console the time needed for the search and for the
bulk insert. The bulk insert is not that fast on large result sets, but the
search is still slower, so it is my primary bottleneck :).

I have already compared hits.Id(intPos) with hits.Doc(intPos).Get("id");
the two differ greatly in the time they take...
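
The per-run averages discussed in the quoted messages below can be re-derived from the raw timings. What follows is a minimal sketch in Python (not the language used in this thread, chosen only for the arithmetic). The names `measurements` and `ids_per_second` are made up for illustration, and the figures are the ones reported in the thread.

```python
# (IDs returned, seconds) pairs as reported in this thread. Each timing
# includes writing the IDs to a file and the SQL bulk insert.
measurements = [
    (18260, 72.2), (3000, 10.02), (830, 2.25), (352, 1.08),
    (350, 0.98), (278, 0.48), (96, 1.05), (29, 0.66),
]


def ids_per_second(count, seconds):
    """Average throughput for one run."""
    return count / seconds


# Map each result-set size to its average throughput.
rates = {count: ids_per_second(count, seconds) for count, seconds in measurements}

for count, seconds in measurements:
    print(f"{count:>6} IDs  {seconds:>6.2f} s  {rates[count]:>6.0f} IDs/s")
```

Throughput alone does not separate search time from file and SQL time; telling those apart would need a separate timer around each step.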

Best Regards, Marc



On 10/30/06, Jon Palmer <jpalmer@contactnetworks.com> wrote:
>
> Marc,
>
>
>
> Can you give a few more details of how you are searching Lucene? Maybe
> some pseudocode of the method that is fast and the one that is slow. I
> think you are suggesting that there is a very large performance hit for
> doing this:
>
>
>
> DocID = Hits.Doc(i).Get("ID")
>
>
>
> rather than:
>
>
>
> DocID = Hits.ID(i)
>
>
>
>
>
> JP
>
>
>
> P.S. Your numbers suggest that your problem is mostly linear. It looks
> like your method has some setup cost and then processes approx. 300 IDs a
> second:
>
>
>
> 18260 IDs - 72.2 s  - avg 253/s
> 3000 IDs  - 10.02 s - avg 299/s
> 830 IDs   - 2.25 s  - avg 368/s
> 352 IDs   - 1.08 s  - avg 325/s
> 350 IDs   - 0.98 s  - avg 357/s
> 278 IDs   - 0.48 s  - avg 579/s
> 96 IDs    - 1.05 s  - avg 91/s
> 29 IDs    - 0.66 s  - avg 43/s
>
>
>
> Given this linear-ish behavior, are you sure that the bottleneck is not
> writing back to file or to SQL?
>
>
>
>
>
>
>
> -----Original Message-----
> From: Kaufmann M. [mailto:kaufmannma@gmail.com]
> Sent: Monday, October 30, 2006 5:11 AM
> To: lucene-net-dev@incubator.apache.org
> Subject: Re: Storing primary key / Change lucene's document ID
>
>
>
> Hello George,
>
> The problem is the speed. Some samples:
>
> All timings include writing the IDs to file and the BULK INSERT into SQL:
>
> 18260 ID's - 72.2 s
>
> 352 ID's - 1.08s
>
> 96 ID's - 1.05s
>
> 29 ID's - 0.66s
>
> 3000 ID's - 10.02s
>
> 350 ID's - 0.98s
>
> 278 ID's - 0.48s
>
> 830 ID's - 2.25s
>
>
>
> As you can see, with more than 500 records it gets really slow...
>
> If I write back the internal ID, it's a LOT faster...
>
> I'm not using Lucene's ordering because it also slowed down the
> returning process a lot.
>
> And I'd like to count the results in different ways (which I was not
> able to do in Lucene), so I have to pass all IDs back into SQL...
>
>
>
> Thanks for helpin'!
>
>
>
>
>
> On 10/30/06, George Aroush <george@aroush.net> wrote:
>
> >
>
> > Hi Marc,
>
> >
>
> > You can't depend on Lucene's internal ID; it can change whenever you
> > update the index, and this is something you can't control.  The way you
> > are currently doing it, by storing an ID in a field named "id", is the
> > right way to do it.  Don't worry about slowing down Lucene if you call
> > the API to get the ID from your field "id".  Lucene is super fast.
>
> >
>
> > Regards,
>
> >
>
> > -- George Aroush
>
> >
>
> > -----Original Message-----
>
> > From: Kaufmann M. [mailto:kaufmannma@gmail.com]
>
> > Sent: Friday, October 27, 2006 4:20 PM
>
> > To: lucene-net-dev@incubator.apache.org
>
> > Subject: Storing primary key / Change lucene's document ID
>
> >
>
> > Hello everybody,
>
> > I've got a little question concerning the unique ID stored in the
> > Lucene index (hits.ID(i)).
> > Is it possible to change this ID, or to set it on doc.add?
>
> >
>
> > Currently I'm running a test project which stores an external primary
> > key in a field named 'id', but to read it from the search engine I have
> > to use the get method, which slows it down.
> > If I could use this primary key as the Lucene ID, the whole engine would
> > be a lot faster because I just need the IDs returned...
>
> >
>
> > Does anybody know if this is possible?
>
> >
>
> > Thanks!
>
> > Best Regards, Marc
>
> >
>
> >
>
>
>
>
>
