madlib-user mailing list archives

From Frank McQuillan <fmcquil...@pivotal.io>
Subject Re: Performance of array_dot vs cosine_similarity
Date Fri, 10 Feb 2017 18:15:37 GMT
Thanks, please keep the mailing list posted on your findings.

On Fri, Feb 10, 2017 at 2:31 AM, Mauricio Scheffer <
mauricioscheffer@gmail.com> wrote:

> Hi Nandish,
>
> Thanks for looking into this. I just created a new issue on JIRA about
> this: https://issues.apache.org/jira/browse/MADLIB-1067
>
> Frank: for now we're only looking into using MADlib for cosine
> similarity. If that goes well, we might use it for other operations that
> need fast dot products.
>
> Cheers,
> Mauricio
>
>
>
>
> --
> Mauricio
>
> On Thu, Feb 9, 2017 at 10:29 PM, Frank McQuillan <fmcquillan@pivotal.io>
> wrote:
>
>> Mauricio,
>>
>> Is the time difference that you observed material? i.e., is that an
>> important difference for your use case?
>>
>> Frank
>>
>> On Thu, Feb 9, 2017 at 11:07 PM, Nandish Jayaram <njayaram@pivotal.io>
>> wrote:
>>
>>> Hi Mauricio,
>>>
>>> I briefly looked through the code, and it seems the dot product in
>>> cosine_similarity is based on the one in the Eigen library, while
>>> array_dot seems to use its own native implementation. Apparently, the
>>> dot product in Eigen is faster than the native implementation. It looks
>>> like it might be a good idea to move array_dot to an Eigen-based dot
>>> product as well!
>>>
>>> NJ
>>>
>>> On Thu, Feb 9, 2017 at 10:19 AM, Mauricio Scheffer <
>>> mauricioscheffer@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I just started evaluating MADlib, and one of the first things I tried
>>>> was how it performs for dot product and cosine similarity.
>>>>
>>>> First I set up some test data (1000000 rows of 150-element float8[]).
>>>> Then I ran array_dot and cosine_similarity on it:
>>>>
>>>> select * from (
>>>>   select cosine_similarity -- or array_dot
>>>>     (a_vector,
>>>>      -- generate_series(1, 150) yields 150 values, matching a_vector
>>>>      (select array_agg(random()::float8) from generate_series(1, 150))) c
>>>>     from vectors
>>>> ) x
>>>> order by c desc
>>>> limit 10
>>>>
>>>> On my machine, cosine_similarity takes 1.3s while array_dot takes 3s,
>>>> which is rather unexpected... I would have expected a dot product to be
>>>> much faster than calculating cosine similarity.
>>>> Can anyone shed some light on this?
>>>>
>>>> Thanks,
>>>> Mauricio
>>>>
>>>
>>>
>>
>
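
Note on the expectation in the original question: cosine similarity is
defined in terms of the dot product, so with the same underlying kernel it
should cost at least as much as the dot product alone. As a sketch, for two
vectors a and b of length n (n = 150 in the test described above; the
symbols a, b and n are notation introduced here, not taken from the thread):

```latex
\[
  \operatorname{cosine\_similarity}(a, b)
    = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
    = \frac{\sum_{i=1}^{n} a_i b_i}
           {\sqrt{\sum_{i=1}^{n} a_i^2} \; \sqrt{\sum_{i=1}^{n} b_i^2}}
\]
```

Evaluating this takes three sums of the same shape as a single dot product
(a.b, a.a and b.b), plus two square roots and a division. The reported
1.3s versus 3s therefore points to a difference in implementation rather
than in the math, which is consistent with the Eigen versus native
explanation given in the thread.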
