madlib-user mailing list archives

From "FENG, Xixuan (Aaron)" <xixuan.f...@gmail.com>
Subject Re: Long execution time on MADlib
Date Fri, 16 Jul 2021 00:16:51 GMT
My guess is that it is because logregr computes a matrix X’AX, which is big
when you have 2000 features. The matrix is not needed for training the model
but only for computing stderr after training. You could probably remove the
matrix entirely, but from an engineering perspective that is more difficult
than just changing the step size, since you may need to take care of
serializing the user-defined function states…
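
Rough arithmetic: with 2,000 features that matrix holds 2,000 x 2,000 =
4,000,000 doubles, roughly 32 MB of aggregate state, and every row update
touches all of it, so the per-row cost is O(d^2) rather than the O(d) needed
for the gradient alone.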

https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L851

https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L821

https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L930

https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L1052



On Fri, Jul 16, 2021 at 1:43 AM Lijie Xu <csxulijie@gmail.com> wrote:

> Dear Aaron,
>
> Thanks for your advice. I will try it.
>
> In addition, after following Frank's guide, I found that MADlib LR and SVM
> work normally on some low-dimensional (e.g., 18-28 features) datasets, even
> with >1 million tuples. However, on a high-dimensional dataset such as the
> epsilon dataset with 400,000 tuples and 2,000 features (
> https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html),
> MADlib SVM can finish 20 iterations in a reasonable time, but MADlib LR
> (with IGD) cannot finish 2 iterations in several hours. Any ideas about
> this problem? Thanks!
>
> Best,
> Lijie
>
>
>
> On Thu, Jul 15, 2021 at 4:03 PM FENG, Xixuan (Aaron) <
> xixuan.feng@gmail.com> wrote:
>
>> Hi Lijie,
>>
>> I implemented logregr with incremental gradient descent a few years
>> ago. Unfortunately, at that time we chose to hard-code the constant
>> step size. But luckily you can edit the code as you need.
>>
>> Here are the pointers:
>>
>> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L818
>>
>>
>> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L918
>>
>> Good luck!
>> Aaron
>>
>> On Thu, Jul 15, 2021 at 10:14 PM Lijie Xu <csxulijie@gmail.com> wrote:
>>
>>> Dear Frank,
>>>
>>> Sorry for the late reply and thanks for your great help. I'm doing some
>>> research work on MADlib. I will follow your advice to test MADlib again.
>>> Another question: does MADlib LR support tuning the learning rate?
>>>
>>> In MADlib SVM, there is a 'params' argument in 'svm_classification' for
>>> tuning 'init_stepsize' and 'decay_factor', as follows.
>>>
>>> svm_classification(
>>>     source_table,
>>>     model_table,
>>>     dependent_varname,
>>>     independent_varname,
>>>     kernel_func,
>>>     kernel_params,
>>>     grouping_col,
>>>     params,
>>>     verbose
>>>     )
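>>>
>>> For example, I tune them by passing a string such as
>>> 'init_stepsize=0.01, decay_factor=0.9' (the values here are just
>>> placeholders I picked) as the 'params' argument of svm_classification().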
>>>
>>> However, I did not see such a 'params' argument in LR:
>>>
>>> logregr_train( source_table,
>>>                out_table,
>>>                dependent_varname,
>>>                independent_varname,
>>>                grouping_cols,
>>>                max_iter,
>>>                optimizer,
>>>                tolerance,
>>>                verbose
>>>              )
>>>
>>> In addition, I checked the Generalized Linear Models module, and
>>> its 'optim_params' parameter seems to support tuning only 'tolerance',
>>> 'max_iter', and 'optimizer'.
>>> Is there a way to tune 'init_stepsize' and 'decay_factor' in LR?
>>> Thanks!
>>>
>>> Best,
>>> Lijie
>>>
>>> On Tue, Jul 6, 2021 at 9:04 PM Frank McQuillan <fmcquillan@vmware.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> Thank you for the questions.
>>>>
>>>> (0)
>>>> Not sure if you are using Postgres just for development or production,
>>>> but keep in mind that MADlib is designed to run on a distributed MPP
>>>> database (Greenplum) with large datasets. It runs fine on Postgres, but
>>>> obviously Postgres won't scale to very large datasets or it will just be
>>>> too slow.
>>>>
>>>> Also see jupyter notebooks here
>>>>
>>>> https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Supervised-learning
>>>> for other examples, in case they are of use.
>>>>
>>>>
>>>> (1)
>>>> - there are 2 problems with your dataset for logistic regression:
>>>>
>>>> (i)
>>>> - as per
>>>> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html
>>>>
>>>>
>>>> the dependent variable is a boolean or an expression that evaluates to
>>>> boolean
>>>> - your data has a dependent variable of -1, but Postgres does not evaluate
>>>> -1 to FALSE, so you should change the -1 to 0
>>>> - i.e., use 0 for FALSE and 1 for TRUE in Postgres
>>>> https://www.postgresql.org/docs/12/datatype-boolean.html
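>>>>
>>>> For example, something along these lines should work (an untested sketch
>>>> against your 'forest' table):
>>>>
>>>> UPDATE forest SET labeli = 0 WHERE labeli = -1;
>>>>
>>>> or keep the -1/1 labels and pass a boolean expression such as 'labeli = 1'
>>>> as the dependent variable in the logregr_train() call.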
>>>>
>>>>
>>>>
>>>> (ii)
>>>> - an intercept variable is not assumed so it is common to provide an
>>>> explicit intercept term by including a single constant 1 term in the
>>>> independent variable list
>>>> - see the example here
>>>>
>>>> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html#examples
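>>>>
>>>> Since your features are already in an array column, one way (again an
>>>> untested sketch) is to prepend the constant 1 in the independent variable
>>>> expression instead of changing the table, e.g.:
>>>>
>>>> SELECT madlib.logregr_train(
>>>>     'forest',
>>>>     'forest_logregr_out',
>>>>     'labeli = 1',                       -- boolean expression as the label
>>>>     'array_prepend(1.0::float8, vec)',  -- explicit intercept term
>>>>     NULL, 20, 'igd');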
>>>>
>>>>
>>>>
>>>> That is why the log_likelihood value is so large; that model is not
>>>> right.
>>>>
>>>>
>>>> (2)
>>>> if you make the fixes above in (1), it should run OK.  Here are my
>>>> results on PostgreSQL 11.6 using MADlib 1.18.0 on the dataset with
>>>> 10 tuples:
>>>>
>>>>
>>>> DROP TABLE IF EXISTS epsilon_sample_10v2 CASCADE;
>>>>
>>>>         CREATE TABLE epsilon_sample_10v2 (
>>>>        did serial,
>>>>        vec double precision[],
>>>>        labeli integer
>>>>         );
>>>>
>>>>         COPY epsilon_sample_10v2 (vec, labeli) FROM STDIN;
>>>>         {1.0,-0.0108282,-0.0196004,0.0422148,...} 0
>>>>         {1.0,0.00250835,0.0168447,-0.0102934,...} 1
>>>>         etc.
>>>>
>>>> SELECT madlib.logregr_train('epsilon_sample_10v2',
>>>> 'epsilon_sample_10v2_logregr_out', 'labeli', 'vec', NULL, 1, 'irls');
>>>>
>>>>  logregr_train
>>>> ---------------
>>>>
>>>> (1 row)
>>>>
>>>> Time: 317046.342 ms (05:17.046)
>>>>
>>>> madlib=# select log_likelihood from epsilon_sample_10v2_logregr_out;
>>>>   log_likelihood
>>>> -------------------
>>>>  -6.93147180559945
>>>> (1 row)
>>>>
>>>>
>>>> (3)
>>>> - the dataset is not scanned again at the end of every iteration to compute
>>>> training loss/accuracy. It is only scanned once per iteration for
>>>> optimization
>>>>
>>>>
>>>> (4)
>>>> - I thought the verbose parameter should do that, but it does not seem
>>>> to be working for me.  Will need to look into it more.
>>>>
>>>>
>>>> (5)
>>>> -logistic regression and SVM do not currently support sparse matrix
>>>> format
>>>> http://madlib.incubator.apache.org/docs/latest/group__grp__svec.html
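>>>>
>>>> If you want to run LR/SVM on those datasets anyway, one option is to
>>>> densify them first. A rough, untested sketch, assuming a hypothetical
>>>> table sparse_forest(did, idx int[], vals float8[], labeli int) and 2,000
>>>> total features:
>>>>
>>>> CREATE TABLE dense_forest AS
>>>> SELECT did,
>>>>        (SELECT array_agg(COALESCE(vals[array_position(idx, d)], 0.0)
>>>>                          ORDER BY d)   -- 0.0 where a feature is absent
>>>>           FROM generate_series(1, 2000) AS d) AS vec,
>>>>        labeli
>>>> FROM sparse_forest;
>>>>
>>>> and then train on 'dense_forest' as usual.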
>>>>
>>>>
>>>> Frank
>>>>
>>>> ------------------------------
>>>> *From:* Lijie Xu <csxulijie@gmail.com>
>>>> *Sent:* Saturday, July 3, 2021 1:21 PM
>>>> *To:* user@madlib.apache.org <user@madlib.apache.org>
>>>> *Subject:* Long execution time on MADlib
>>>>
>>>>
>>>>
>>>> Hi All,
>>>>
>>>>
>>>>
>>>> I’m Lijie and now performing some experiments on MADlib. I found that
>>>> MADlib runs very slowly on some datasets, so I would like to justify my
>>>> settings. Could you help me check the following settings and codes? Sorry
>>>> for this long email. I used the latest MADlib 1.18 on PostgreSQL 12.
>>>>
>>>>
>>>>
>>>> *(1)  **Could you help check whether the data format and scripts I
>>>> used are right for an n-dimensional dataset?*
>>>>
>>>>
>>>>
>>>> I have some training datasets, and each of them has a dense feature
>>>> array (like [0.1, 0.2, …, 1.0]) and a class label (+1/-1). For example, for
>>>> the ‘forest’ dataset (581K tuples) with a 54-dimensional feature array and
>>>> a class label, I first stored it into PostgreSQL using
>>>>
>>>>
>>>>
>>>> <code>
>>>>
>>>>      CREATE TABLE forest (
>>>>
>>>>           did serial,
>>>>
>>>>           vec double precision[],
>>>>
>>>>           labeli integer);
>>>>
>>>>
>>>>
>>>>       COPY forest (vec, labeli) FROM STDIN;
>>>>
>>>>       ‘[0.1, 0.2, …, 1.0], -1’
>>>>
>>>>       ‘[0.3, 0.1, …, 0.9], 1’
>>>>
>>>>       …
>>>>
>>>> </code>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>         Then, to run the Logistic Regression on this dataset, I use the
>>>> following code:
>>>>
>>>>
>>>>
>>>> <code>
>>>>
>>>> mldb=# \d forest
>>>>
>>>>                                Table "public.forest"
>>>>
>>>>  Column |        Type        |
>>>> Modifiers
>>>>
>>>>
>>>> --------+--------------------+------------------------------------------------------
>>>>
>>>>  did    | integer            | not null default
>>>> nextval('forest_did_seq'::regclass)
>>>>
>>>>  vec    | double precision[] |
>>>>
>>>>  labeli | integer            |
>>>>
>>>>
>>>>
>>>> mldb=# SELECT madlib.logregr_train(
>>>>
>>>> mldb(#     'forest',                                 -- source table
>>>>
>>>> mldb(#     'forest_logregr_out',                     -- output table
>>>>
>>>> mldb(#     'labeli',                                 -- labels
>>>>
>>>> mldb(#     'vec',                                    -- features
>>>>
>>>> mldb(#     NULL,                                     -- grouping columns
>>>>
>>>> mldb(#     20,                                       -- max number of
>>>> iteration
>>>>
>>>> mldb(#     'igd'                                     -- optimizer
>>>>
>>>> mldb(#     );
>>>>
>>>>
>>>>
>>>> Time: 198911.350 ms
>>>>
>>>> </code>
>>>>
>>>>
>>>>
>>>> After about 199s, I got the output table as:
>>>>
>>>> <code>
>>>>
>>>> mldb=# \d forest_logregr_out
>>>>
>>>>              Table "public.forest_logregr_out"
>>>>
>>>>           Column          |        Type        | Modifiers
>>>>
>>>> --------------------------+--------------------+-----------
>>>>
>>>>  coef                     | double precision[] |
>>>>
>>>>  log_likelihood           | double precision   |
>>>>
>>>>  std_err                  | double precision[] |
>>>>
>>>>  z_stats                  | double precision[] |
>>>>
>>>>  p_values                 | double precision[] |
>>>>
>>>>  odds_ratios              | double precision[] |
>>>>
>>>>  condition_no             | double precision   |
>>>>
>>>>  num_rows_processed       | bigint             |
>>>>
>>>>  num_missing_rows_skipped | bigint             |
>>>>
>>>>  num_iterations           | integer            |
>>>>
>>>>  variance_covariance      | double precision[] |
>>>>
>>>>
>>>>
>>>> mldb=# select log_likelihood from forest_logregr_out;
>>>>
>>>>   log_likelihood
>>>>
>>>> ------------------
>>>>
>>>>  -426986.83683879
>>>>
>>>> (1 row)
>>>>
>>>> </code>
>>>>
>>>>
>>>>
>>>> Is this procedure correct?
>>>>
>>>>
>>>>
>>>> *(2)  **Training on a 2,000-dimensional dense dataset (epsilon) is
>>>> very slow:*
>>>>
>>>>
>>>>
>>>>            While training on a 2,000-dimensional dense dataset
>>>> (epsilon_sample_10) with only *10 tuples* as follows, MADlib does not
>>>> finish in 5 hours *for only 1 iteration*. The CPU usage is always 100%
>>>> during the execution. The dataset is available at
>>>> https://github.com/JerryLead/Misc/blob/master/MADlib/train.sql.
>>>>
>>>>
>>>>
>>>> <code>
>>>>
>>>> mldb=# \d epsilon_sample_10
>>>>
>>>>                                Table "public.epsilon_sample_10"
>>>>
>>>>  Column |        Type        |
>>>>             Modifiers
>>>>
>>>>
>>>> --------+--------------------+-----------------------------------------------------------------
>>>>
>>>>  did    | integer            | not null default
>>>> nextval('epsilon_sample_10_did_seq'::regclass)
>>>>
>>>>  vec    | double precision[] |
>>>>
>>>>  labeli | integer            |
>>>>
>>>>
>>>>
>>>> mldb=# SELECT count(*) from epsilon_sample_10;
>>>>
>>>>  count
>>>>
>>>> -------
>>>>
>>>>     10
>>>>
>>>> (1 row)
>>>>
>>>>
>>>>
>>>> Time: 1.456 ms
>>>>
>>>>
>>>>
>>>> mldb=# SELECT madlib.logregr_train('epsilon_sample_10',
>>>> 'epsilon_sample_10_logregr_out', 'labeli', 'vec', NULL, 1, 'igd');
>>>>
>>>> </code>
>>>>
>>>>
>>>>
>>>> *In this case, it is not possible to train the whole epsilon dataset
>>>> (with 400,000 tuples) in a reasonable time. I guess that this problem is
>>>> related to TOAST, since epsilon has a high dimension and it is compressed
>>>> by TOAST. However, are there any other reasons for such slow execution?*
>>>>
>>>>
>>>>
>>>> *(3)  **For MADlib, is the dataset table scanned once or twice in each
>>>> iteration?*
>>>>
>>>> I know that, in each iteration, MADlib needs to scan the dataset table
>>>> once to perform IGD/SGD on the whole dataset. My question is that, *at
>>>> the end of each iteration*, will MADlib scan the table again to
>>>> compute the training loss/accuracy?
>>>>
>>>>
>>>>
>>>> *(4)  **Is it possible to output the training metrics, such as
>>>> training loss and accuracy after each iteration?*
>>>>
>>>> Currently, it seems that MADlib only outputs the log-likelihood at the
>>>> end of the SQL execution.
>>>>
>>>>
>>>>
>>>> *(5)  **Do MADlib’s Logistic Regression and SVM support sparse
>>>> datasets?*
>>>>
>>>> I also have some sparse datasets denoted as ‘feature_index_vec_array,
>>>> feature_value_array, label’, such as ‘[1, 3, 5], [0.1, 0.2, 0.3], -1’. Can
>>>> I train these sparse datasets on MADlib using LR and SVM?
>>>>
>>>>
>>>>
>>>> Many thanks for reviewing my questions.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Best regards,
>>>>
>>>>
>>>>
>>>> Lijie
>>>>
>>>
