madlib-user mailing list archives

From Mauricio Scheffer <mauricioschef...@gmail.com>
Subject Re: Performance of array_dot vs cosine_similarity
Date Fri, 10 Feb 2017 10:31:22 GMT
Hi Nandish,

Thanks for looking into this. I just created a new issue on JIRA about this:
https://issues.apache.org/jira/browse/MADLIB-1067

Frank: for now we're looking into using MADlib for cosine similarity only.
If that goes well, we might use it for other operations that need fast
dot products.

Cheers,
Mauricio

--
Mauricio

On Thu, Feb 9, 2017 at 10:29 PM, Frank McQuillan <fmcquillan@pivotal.io>
wrote:

> Mauricio,
>
> Is the time difference that you observed material? i.e., is that an
> important difference for your use case?
>
> Frank
>
> On Thu, Feb 9, 2017 at 11:07 PM, Nandish Jayaram <njayaram@pivotal.io>
> wrote:
>
>> Hi Mauricio,
>>
>> I briefly looked through the code and it seems like the dot product in
>> cosine_similarity is based on what is there in the Eigen library.
>> The dot product in array_dot seems to be using a native implementation of
>> the same. Apparently, dot product in Eigen is faster than
>> the native implementation. Looks like it might be a good idea to move
>> array_dot also to Eigen based dot product!
>>
>> NJ
>>
>> On Thu, Feb 9, 2017 at 10:19 AM, Mauricio Scheffer <
>> mauricioscheffer@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I just started evaluating MADlib and one of the first things I tried is
>>> how it performs for dot product and cosine similarity.
>>>
>>> So first I set up some test data (1000000 rows of 150-element float8[])
>>> Then I ran array_dot and cosine_similarity on it:
>>>
>>> select * from (
>>>   select cosine_similarity -- or array_dot
>>>     (a_vector,
>>>      (select array_agg(random()::float8)
>>>       from generate_series(0, 150))) c
>>>   from vectors
>>> ) x
>>> order by c desc
>>> limit 10
>>>
>>> On my machine, cosine_similarity takes 1.3s while array_dot takes 3s,
>>> which is rather unexpected... I would have expected a dot product to be
>>> much faster than calculating cosine similarity.
>>> Can anyone shed some light on this?
>>>
>>> Thanks,
>>> Mauricio
>>>
>>
>>
>
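
For reference, the expectation in the original message (that a dot product should be cheaper than cosine similarity) follows from the definition: cosine similarity is a dot product divided by the product of the two vector norms, so it does strictly more arithmetic per pair of vectors. The following is a minimal pure-Python sketch of that relationship only. It is not MADlib's or Eigen's code, and the function names are illustrative assumptions.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||).

    Built from three dot products (a.b, a.a, b.b) plus two square roots,
    so it cannot do less arithmetic than a single dot product.
    """
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)
```

Given equally optimized implementations, array_dot should therefore be at least as fast as cosine_similarity, which is consistent with the explanation in the thread that the gap comes from the two functions using different dot product implementations.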
