lucenenet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hongwei Shen <Hongwei.S...@emedia.com>
Subject RE: Question(problem?) about PriorityQueue
Date Tue, 25 Sep 2007 22:32:12 GMT
Ofcourse I can calculate distance and store in the field, but the problem is that I don't really
want to sort by distance, I want to sort by a "complicated" calculation of score and distance.
I used the word "complicated" here to mean that it is not sort by distance and then score
or sort by score then distance. The problem is that at the time of calculating distance, score
is not available yet.

The solution here is to create my own sort comparator by extending SortComparator class. One
inconvenient thing is that in SortComparator class's  new Comparator method(interface) returns
a instance of a private class and I really need to override this method of the private class
to do my business:

            public virtual int Compare(ScoreDoc i, ScoreDoc j)
            {
                MyScore tl1 = (MyScore)cachedValues[i.doc];
                tl1.Score = i.score;
                MyScore tl2 = (MyScore)cachedValues[j.doc];
                tl2.Score = j.score;
                return tl1.MyComplicatedScore.CompareTo(tl2. MyComplicatedScore);
            }

Although the Compare method is declared as virtual, however the class is private to SortComparator,
there is nothing I can do extend it, so I have to create my own class in my sort comparator
one way or the other. In my opinion, the private class should really be declared as public
or internal:

This code:
    [Serializable]
    public class TrueLocalScoreComparator : SortComparator
    {
        private class AnonymousClassScoreDocComparator : ScoreDocComparator
        {
            ...
            public virtual int Compare(ScoreDoc i, ScoreDoc j)
            {
            }
           ...
        }

          ...
        // inherit javadocs
        public override ScoreDocComparator NewComparator(IndexReader reader, System.String
fieldname)
        {
            System.String field = String.Intern(fieldname);
            System.IComparable[] cachedValues = Lucene.Net.Search.FieldCache_Fields.DEFAULT.GetCustom(reader,
field, this);

            return new AnonymousClassScoreDocComparator(cachedValues, this);
        }
        ...
    }

Should be like this:

   public class AnonymousClassScoreDocComparator : ScoreDocComparator
   {
       ...
       public virtual int Compare(ScoreDoc i, ScoreDoc j)
       {
       }
           ...
   }

    [Serializable]
    public class TrueLocalScoreComparator : SortComparator
    {
          ...
        // inherit javadocs
        public override ScoreDocComparator NewComparator(IndexReader reader, System.String
fieldname)
        {
            System.String field = String.Intern(fieldname);
            System.IComparable[] cachedValues = Lucene.Net.Search.FieldCache_Fields.DEFAULT.GetCustom(reader,
field, this);

            return new AnonymousClassScoreDocComparator(cachedValues, this);
        }
        ...
    }

Then I can just extend AnonymousClassScoreDocComparator to accomplish what I need.

The obstacle I have right now is not how to do it. I already make all this functioning properly,
it is that there are top results missing. If I sort in reversed direction, result is perfect
except that most wanted is at the bottom, logically in the case if I don't do reverse, then
the most wanted should be at the top, right? The fact is that if I do that most wanted is
not at the top, not even at anywhere of the result set. That is the reason I strongly suspected
(am still suspecting) that something wrong in PriorityQueue. Let me tell you one more thing
that I don't understand:

This is the code of DownHeap method of PriorityQueue:

                private void  DownHeap()
                {
                        int i = 1;
                        System.Object node = heap[i]; // save top node
                        int j = i << 1; // find smaller child
                        int k = j + 1;
                        if (k <= size && LessThan(heap[k], heap[j]))
                        {
                                j = k;
                        }
                        while (j <= size && LessThan(heap[j], node))
                        {
                                heap[i] = heap[j]; // shift up child
                                i = j;
                                j = i << 1;
                                k = j + 1;
                                if (k <= size && LessThan(heap[k], heap[j]))
                                {
                                        j = k;
                                }
                        }
                        heap[i] = node; // install saved node
                }

Since the queue is sorted every time a new element is inserted (we only have insert method
here and there is no modify method), so the queue should be in order after an element is inserted
and before inserting a new element, right? If this is the case, then LessThan(heap[k], heap[j])
will never be true because k is always one bigger than j and the less one is at the smaller
index. However, when I traced into it, j=k is always executed, the only explaination to me
is that the queue is not sorted properly.

Any thought about this?

Hongwei


-----Original Message-----
From: DIGY [mailto:digydigy@gmail.com]
Sent: Tuesday, September 25, 2007 4:47 PM
To: lucene-net-dev@incubator.apache.org
Subject: RE: Question(problem?) about PriorityQueue

Hi,
I am not sure that I understand exactly what you are trying to do.

Is it possible, for example, to precalculate the "distance" at index
time(may require two passes on the document) and store it in a field which
can later be used in searching?

In addition, changing the Lucene codes will make you stick to one version,
and with every version changes you will have to update your code
accordingly.

DIGY.



-----Original Message-----
From: Hongwei Shen [mailto:Hongwei.Shen@emedia.com]
Sent: Tuesday, September 25, 2007 4:51 PM
To: lucene-net-dev@incubator.apache.org
Subject: RE: Question(problem?) about PriorityQueue

Thank you for checking. I missed it. The code doesn't have problem, the
least relevant element is at the top and when Doc Collector collecting them,
it is reversed.

We are using a custom sorting on a field that is string type, however the
comparison is based on parsing the string with some calculation. I figured
out that it is the searching which used FieldSortedHitQueue sorts the result
by string instead of by calculation and thus only return top 100 result
sorted by string order is returned from search to the TopFieldDocCollector
which uses FieldDocSortedHitQueue which I modified to use the calculation to
sort.

I have to find a way to do it.

Below is my code to sort by distance which works fine. What we want to
achieve is to sort by the combination of distance and score. This means that
we cannot calculate the distance in GetComparable(), instead, we have to
retain all the information and wait until score is available.

    [Serializable]
    public class DistanceSortComparator : SortComparator
    {
        private float longitude;
        private float latitude;

        /// <summary>
        /// Constructor that take the geographic loation of the center
        /// </summary>
        /// <param name="longitude">
        /// The lontitude of the geo-center
        /// </param>
        /// <param name="latitude">
        /// The latitude of the geo-center
        /// </param>
        public DistanceSortComparator(float latitude, float longitude)
        {
            this.longitude = longitude;
            this.latitude = latitude;
        }

        #region Overriden methods

        /// <summary>
        /// Returns the distance
        /// </summary>
        /// <param name="termtext"></param>
        /// <returns></returns>
        public override IComparable GetComparable(string termtext)
        {
            string[] loc = termtext.Split(',');
            double lat = double.Parse(loc[0]);
            double lon = double.Parse(loc[1]);
            return calculateDistance(this.latitude, this.longitude, lat,
lon);
        }

        override public string ToString()
        {
            return "Distance from (" + longitude + "," + latitude + ")";
        }
}


-----Original Message-----
From: DIGY [mailto:digydigy@gmail.com]
Sent: Monday, September 24, 2007 3:38 PM
To: lucene-net-dev@incubator.apache.org
Subject: RE: Question(problem?) about PriorityQueue

Hi,
I checked the java and .net code and they look the same. It seems like it is
a coding preference not to use the index 0.

                protected internal void  Initialize(int maxSize)
                {
                        size = 0;
                        int heapSize = maxSize + 1; <<<<<<<<<
                        heap = new System.Object[heapSize];
                        this.maxSize = maxSize;
                }


Can you send a sample code where "some top results are missing"?

DIGY

-----Original Message-----
From: Hongwei Shen [mailto:Hongwei.Shen@emedia.com]
Sent: Monday, September 24, 2007 9:06 PM
To: lucene-net-dev@incubator.apache.org
Subject: Question(problem?) about PriorityQueue

Hello there,

The problem we have is that some top results are missing. My debugging led
me to the following piece of code in the PriorityQueue.cs file(line 69). I
simply cannot believe this might be wrong, so I'd like somebody to verify
it.

                public virtual bool Insert(System.Object element)
                {
                        if (size < maxSize)
                        {
                                Put(element);
                                return true;
                        }
                        else if (size > 0 && !LessThan(element, Top()))
                        {
                                heap[1] = element;
                                AdjustTop();
                                return true;
                        }
                        else
                                return false;
                }

Let's assume that maxSize is 100, when size is larger or equal to 100, the
element is compared with the top element which is heap[1], if it is not less
than the top, then the top is being replaced by the element instead of being
bumped down. It seems to me that this is not the right logic here.

If maxSize is 100, the actual heap size is 101 and the document collector
will collect top docs starting from index 1, so index 0 is never used. I
suspect that the original design of the queue is to insert the new element
in the index 0 and then sort it down.

Please let me know what do you think.

Hongwei


Mime
View raw message