madlib-user mailing list archives

From Frank McQuillan <fmcquil...@pivotal.io>
Subject Re: Multiplying a large sparse matrix by a vector
Date Thu, 04 Jan 2018 20:12:28 GMT
Anthony,

In that case, I think you are hitting the 1GB PostgreSQL limit.

Operations on the sparse matrix format require loading into memory 2 INTEGERs
for row/col plus the value (INTEGER, DOUBLE PRECISION, whatever size it is).

For your matrix (about 1.25e8 nonzero entries), the 2 INTEGERs alone come to
~1.00E+09 bytes, which is already at the limit before even counting the values.
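
A rough sketch of the arithmetic behind that estimate, using the dimensions
from the original post and assuming 4-byte INTEGERs for row_id and col_id:

```python
# Back-of-the-envelope size of the row/col indices for the matrix in this thread.
rows = 125_000_000               # 1.25e8 rows (from the original post)
cols = 100                       # 100 columns (from the original post)
nonzeros = rows * cols // 100    # roughly 1% of the entries are nonzero

int_bytes = 4                    # assumption: row_id and col_id are 4-byte INTEGERs
index_bytes = nonzeros * 2 * int_bytes   # two indices per stored entry

failed_request = 1_073_741_824   # size quoted in the "invalid memory alloc request" error

print(index_bytes)                    # 1000000000
print(index_bytes / failed_request)   # fraction of the failed request used by indices alone
```

The indices alone already account for most of the request that failed, before
any of the values are added.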

So I would suggest you do the computation in blocks.  One approach to this:

* chunk your long matrix into n smaller VIEWs, say n=10 (MADlib matrix
operations do work on VIEWs)
* call matrix*vector for each chunk
* reassemble the n result vectors into the final vector

You could do this in a PL/pgSQL or PL/Python function.
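
A minimal sketch of the chunking steps above, written as plain Python that only
builds the SQL strings (it does not connect to a database). The table name
mat_a and the view names are hypothetical, the column names row_id, col_id and
value follow the sparse format example in this thread, and renumbering each
chunk's rows to start at 1 is an assumption of this sketch so that each view
reads as a standalone matrix. The NULL in_args mirrors the call shown later in
this message.

```python
def chunk_bounds(n_rows, n_chunks):
    """Split row ids 1..n_rows into contiguous (lo, hi) ranges."""
    size = -(-n_rows // n_chunks)  # ceiling division
    return [(lo, min(lo + size - 1, n_rows))
            for lo in range(1, n_rows + 1, size)]


def chunk_view_sql(src, view, lo, hi):
    """CREATE VIEW statement for one chunk, renumbering rows to start at 1."""
    return (
        f"CREATE VIEW {view} AS "
        f"SELECT row_id - {lo - 1} AS row_id, col_id, value "
        f"FROM {src} WHERE row_id BETWEEN {lo} AND {hi};"
    )


def mult_sql(view, vector):
    """matrix_vec_mult call for one chunk view."""
    vec = ",".join(str(v) for v in vector)
    return f"SELECT madlib.matrix_vec_mult('{view}', NULL, array[{vec}]);"


def reassemble(parts):
    """Concatenate the per-chunk result vectors in chunk order."""
    return [x for part in parts for x in part]


# Hypothetical usage: 10 chunks over the 1.25e8-row matrix from this thread.
bounds = chunk_bounds(125_000_000, 10)
statements = [chunk_view_sql("mat_a", f"mat_a_chunk_{k}", lo, hi)
              for k, (lo, hi) in enumerate(bounds, start=1)]
```

Inside a PL/Python function the generated statements would be run one chunk at
a time and the result arrays passed to reassemble in the same order.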

There is one subtlety to be aware of, though, because you are working with
sparse matrices. If any of the n chunks has no non-zero value in the 100th
column, you will get an error that looks like this:

madlib=# SELECT madlib.matrix_vec_mult('mat_a_view',
                              NULL,
                              array[1,2,3,4,5,6,7,8,9,10]
                              );
ERROR:  plpy.Error: Matrix error: Dimension mismatch between matrix (1 x 9) and vector (10 x 1)
CONTEXT:  Traceback (most recent call last):
  PL/Python function "matrix_vec_mult", line 24, in <module>
    matrix_in, in_args, vector)
  PL/Python function "matrix_vec_mult", line 2031, in matrix_vec_mult
  PL/Python function "matrix_vec_mult", line 77, in _assert
PL/Python function "matrix_vec_mult"

See the explanation at the top of
http://madlib.apache.org/docs/latest/group__grp__matrix.html
regarding dimensionality of sparse matrices.

One way around this is to add a (fake) row to the bottom of your VIEW with
a 0 in the 100th column.  But if you do this, be sure to drop the last
(fake) entry of each of the n intermediate vectors before you assemble into
the final vector.
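
To illustrate the fake-row workaround, here is a toy model in plain Python. It
is not MADlib's code: it simply assumes, as the error message above suggests,
that the dimensions of a sparse matrix are taken from the largest row_id and
col_id present. The chunk data and the 10-element vector are made up for the
example.

```python
def sparse_dims(entries):
    """Infer (rows, cols) from the largest row_id and col_id present."""
    return (max(r for r, _, _ in entries),
            max(c for _, c, _ in entries))


def sparse_matvec(entries, vector):
    """Multiply a sparse matrix, given as (row_id, col_id, value), by a vector."""
    n_rows, n_cols = sparse_dims(entries)
    if n_cols != len(vector):
        raise ValueError(
            f"Dimension mismatch between matrix ({n_rows} x {n_cols}) "
            f"and vector ({len(vector)} x 1)")
    out = [0.0] * n_rows
    for r, c, v in entries:
        out[r - 1] += v * vector[c - 1]   # ids are 1-based
    return out


# A chunk with nothing stored in column 10, so it looks only 5 columns wide.
chunk = [(1, 1, 9.0), (1, 5, 6.0), (2, 1, 8.0)]
vector = [1.0] * 10

# Add a fake row below the chunk with a 0 in the last column ...
padded = chunk + [(3, 10, 0.0)]
# ... then drop the fake row's entry from the result.
result = sparse_matvec(padded, vector)[:-1]
print(result)   # [15.0, 8.0]
```

Calling sparse_matvec(chunk, vector) without the fake row raises the dimension
mismatch, which is the same situation as the error shown above.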

Frank





On Wed, Jan 3, 2018 at 8:15 PM, Anthony Thomas <ahthomas@eng.ucsd.edu>
wrote:

> Thanks Frank - the answer to both your questions is "yes"
>
> Best,
>
> Anthony
>
> On Wed, Jan 3, 2018 at 3:13 PM, Frank McQuillan <fmcquillan@pivotal.io>
> wrote:
>
>>
>> Anthony,
>>
>> Correct, the install check error you are seeing is not related.
>>
>> A couple of questions:
>>
>> (1)
>> Are you using:
>>
>> -- Multiply matrix with vector
>>   matrix_vec_mult( matrix_in, in_args, vector)
>>
>> (2)
>> Is matrix_in encoded in sparse format like at the top of
>> http://madlib.apache.org/docs/latest/group__grp__matrix.html
>>
>> e.g., like this?
>>
>> row_id | col_id | value
>> --------+--------+-------
>>       1 |      1 |     9
>>       1 |      5 |     6
>>       1 |      6 |     6
>>       2 |      1 |     8
>>       3 |      1 |     3
>>       3 |      2 |     9
>>       4 |      7 |     0
>>
>>
>> Frank
>>
>>
>> On Wed, Jan 3, 2018 at 2:52 PM, Anthony Thomas <ahthomas@eng.ucsd.edu>
>> wrote:
>>
>>> Okay - thanks Ivan, and good to know about support for Ubuntu from
>>> Greenplum!
>>>
>>> Best,
>>>
>>> Anthony
>>>
>>> On Wed, Jan 3, 2018 at 2:38 PM, Ivan Novick <inovick@pivotal.io> wrote:
>>>
>>>> Hi Anthony, this does NOT look like an Ubuntu problem, and in fact there
>>>> is OSS Greenplum officially on Ubuntu, as you can see here:
>>>> http://greenplum.org/install-greenplum-oss-on-ubuntu/
>>>>
>>>> Greenplum and PostgreSQL do limit to 1 Gig for each field (row/col
>>>> combination) but there are techniques to manage data sets working within
>>>> these constraints.  I will let someone else who has more experience than
>>>> me working with matrices answer how best to do so in a case like
>>>> the one you have provided.
>>>>
>>>> Cheers,
>>>> Ivan
>>>>
>>>> On Wed, Jan 3, 2018 at 2:22 PM, Anthony Thomas <ahthomas@eng.ucsd.edu>
>>>> wrote:
>>>>
>>>>> Hi Madlib folks,
>>>>>
>>>>> I have a large tall and skinny sparse matrix which I'm trying to
>>>>> multiply by a dense vector. The matrix is 1.25e8 by 100 with approximately
>>>>> 1% nonzero values. This operation always triggers an error from Greenplum:
>>>>>
>>>>> plpy.SPIError: invalid memory alloc request size 1073741824 (context
>>>>> 'accumArrayResult') (mcxt.c:1254) (plpython.c:4957)
>>>>> CONTEXT:  Traceback (most recent call last):
>>>>>   PL/Python function "matrix_vec_mult", line 24, in <module>
>>>>>     matrix_in, in_args, vector)
>>>>>   PL/Python function "matrix_vec_mult", line 2044, in matrix_vec_mult
>>>>>   PL/Python function "matrix_vec_mult", line 2001, in _matrix_vec_mult_dense
>>>>> PL/Python function "matrix_vec_mult"
>>>>>
>>>>> Some Googling suggests this error is caused by a hard limit from
>>>>> Postgres which restricts the maximum size of an array to 1GB. If this is
>>>>> indeed the cause of the error I'm seeing does anyone have any suggestions
>>>>> about how to circumvent this issue? This comes up in other cases as well
>>>>> like transposing a tall and skinny matrix. MVM with smaller matrices works
>>>>> fine.
>>>>>
>>>>> Here is relevant version information:
>>>>>
>>>>> SELECT VERSION();
>>>>> PostgreSQL 8.3.23 (Greenplum Database 5.1.0 build dev) on
>>>>> x86_64-pc-linux-gnu, compiled by GCC gcc
>>>>>  (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609 compiled on Dec 21
>>>>> 2017 09:09:46
>>>>>
>>>>> SELECT madlib.version();
>>>>> MADlib version: 1.12, git revision: unknown, cmake configuration time:
>>>>> Thu Dec 21 18:04:47 UTC 2017, build type: RelWithDebInfo, build system:
>>>>> Linux-4.4.0-103-generic, C compiler: gcc 4.9.3, C++ compiler: g++ 4.9.3
>>>>>
>>>>> Madlib install-check reported one error in the "convex" module related
>>>>> to "loss too high" which seems unrelated to the issue described above. I
>>>>> know Ubuntu isn't officially supported by Greenplum so I'd like to be
>>>>> confident this issue isn't just the result of using an unsupported OS.
>>>>> Please let me know if any other information would be helpful.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Anthony
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ivan Novick, Product Manager Pivotal Greenplum
>>>> inovick@pivotal.io --  (Mobile) 408-230-6491
>>>> https://www.youtube.com/GreenplumDatabase
>>>>
>>>>
>>>
>>
>
