madlib-user mailing list archives

From Lijie Xu <csxuli...@gmail.com>
Subject Re: Long execution time on MADlib
Date Thu, 15 Jul 2021 16:43:31 GMT
Dear Aaron,

Thanks for your advice. I will try it.

In addition, after following Frank's guide, I found that MADlib LR and SVM
can work normally on some low-dimensional (e.g., 18-28 dimensions) datasets,
even with >1 million tuples. However, on a high-dimensional dataset such as
the epsilon dataset with 400,000 tuples and 2,000 features (
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html),
MADlib SVM can finish 20 iterations in a reasonable time, but MADlib LR
(with IGD) cannot finish 2 iterations in several hours. Any ideas about
this problem? Thanks!

Best,
Lijie



On Thu, Jul 15, 2021 at 4:03 PM FENG, Xixuan (Aaron) <xixuan.feng@gmail.com>
wrote:

> Hi Lijie,
>
> I implemented the logregr with incremental gradient descent a few years
> ago. Unfortunately, at that time we chose to hard-code the constant
> step size, but luckily you can edit the code as you need.
>
> Here are the pointers:
>
> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L818
>
>
> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L918
>
> Good luck!
> Aaron
>
> On Thu, Jul 15, 2021 at 10:14 PM Lijie Xu <csxulijie@gmail.com> wrote:
>
>> Dear Frank,
>>
>> Sorry for the late reply and thanks for your great help. I'm doing some
>> research work on MADlib. I will follow your advice to test MADlib again.
>> Another question: does MADlib LR support tuning the learning rate?
>>
>> In MADlib SVM, there is a 'params' argument in 'svm_classification' for
>> tuning 'init_stepsize' and 'decay_factor', as follows.
>>
>> svm_classification(
>>     source_table,
>>     model_table,
>>     dependent_varname,
>>     independent_varname,
>>     kernel_func,
>>     kernel_params,
>>     grouping_col,
>>     params,
>>     verbose
>>     )
>>
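>> For reference, the tuning I mean looks like the following (an illustrative
>> call only; the table/column names and values are placeholders):
>>
>> SELECT madlib.svm_classification(
>>     'epsilon',                                -- source_table
>>     'epsilon_svm_out',                        -- model_table
>>     'labeli',                                 -- dependent_varname
>>     'vec',                                    -- independent_varname
>>     'linear',                                 -- kernel_func
>>     NULL,                                     -- kernel_params
>>     NULL,                                     -- grouping_col
>>     'init_stepsize=0.01, decay_factor=0.9'    -- params
>>     );
>>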
>> However, I did not see such a 'params' argument in LR:
>>
>> logregr_train( source_table,
>>                out_table,
>>                dependent_varname,
>>                independent_varname,
>>                grouping_cols,
>>                max_iter,
>>                optimizer,
>>                tolerance,
>>                verbose
>>              )
>>
>> In addition, I checked the Generalized Linear Models, and
>> its 'optim_params' parameter seems to support tuning only 'tolerance',
>> 'max_iter', and 'optimizer'.
>> Is there a way to tune the 'init_stepsize' and 'decay_factor' in LR?
>> Thanks!
>>
>> Best,
>> Lijie
>>
>> On Tue, Jul 6, 2021 at 9:04 PM Frank McQuillan <fmcquillan@vmware.com>
>> wrote:
>>
>>> Hello,
>>>
>>> Thank you for the questions.
>>>
>>> (0)
>>> Not sure if you are using Postgres just for development or production,
>>> but keep in mind that MADlib is designed to run on a distributed MPP
>>> database (Greenplum) with large datasets. It runs fine on Postgres, but
>>> obviously Postgres won't scale to very large datasets or it will just be
>>> too slow.
>>>
>>> Also see jupyter notebooks here
>>>
>>> https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Supervised-learning
>>> for other examples, in case they are of use.
>>>
>>>
>>> (1)
>>> - there are 2 problems with your dataset for logistic regression:
>>>
>>> (i)
>>> - as per
>>> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html
>>> the dependent variable is a boolean or an expression that evaluates to
>>> boolean
>>> - your data has a dependent variable of -1, but Postgres does not
>>> evaluate -1 to FALSE, so you should change the -1 to 0
>>> - i.e., use 0 for FALSE and 1 for TRUE in Postgres
>>> https://www.postgresql.org/docs/12/datatype-boolean.html
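>>>
>>> For example, on the 'forest' table from your mail (a minimal sketch):
>>>
>>> -- map the -1/+1 labels to 0/1 so the dependent variable evaluates to boolean
>>> UPDATE forest SET labeli = 0 WHERE labeli = -1;
>>>
>>> -- alternatively, leave the data as-is and pass a boolean expression such as
>>> -- 'labeli = 1' as the dependent variable argument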
>>>
>>>
>>>
>>> (ii)
>>> - an intercept variable is not assumed, so it is common to provide an
>>> explicit intercept term by including a single constant 1 term in the
>>> independent variable list
>>> - see the example here
>>>
>>> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html#examples
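>>>
>>> With an array-valued feature column like yours, the constant 1 term can be
>>> prepended on the fly; an illustrative call (adjust arguments as needed):
>>>
>>> SELECT madlib.logregr_train(
>>>     'forest', 'forest_logregr_out',
>>>     'labeli = 1',                       -- dependent variable as boolean expression
>>>     'array_prepend(1.0::float8, vec)',  -- constant 1 term = explicit intercept
>>>     NULL, 20, 'igd');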
>>>
>>>
>>>
>>> That is why the log_likelihood value is so large; that model is not right.
>>>
>>>
>>> (2)
>>> If you make the fixes above in (1), it should run OK. Here are my
>>> results on PostgreSQL 11.6 using MADlib 1.18.0 on the dataset with
>>> 10 tuples:
>>>
>>>
>>> DROP TABLE IF EXISTS epsilon_sample_10v2 CASCADE;
>>>
>>> CREATE TABLE epsilon_sample_10v2 (
>>>     did serial,
>>>     vec double precision[],
>>>     labeli integer
>>> );
>>>
>>> COPY epsilon_sample_10v2 (vec, labeli) FROM STDIN;
>>> {1.0,-0.0108282,-0.0196004,0.0422148,...} 0
>>> {1.0,0.00250835,0.0168447,-0.0102934,...} 1
>>> etc.
>>>
>>> SELECT madlib.logregr_train('epsilon_sample_10v2',
>>> 'epsilon_sample_10v2_logregr_out', 'labeli', 'vec', NULL, 1, 'irls');
>>>
>>>  logregr_train
>>> ---------------
>>>
>>> (1 row)
>>>
>>> Time: 317046.342 ms (05:17.046)
>>>
>>> madlib=# select log_likelihood from epsilon_sample_10v2_logregr_out;
>>>   log_likelihood
>>> -------------------
>>>  -6.93147180559945
>>> (1 row)
>>>
>>>
>>> (3)
>>> - the dataset is not scanned again at the end of every iteration to compute
>>> training loss/accuracy. It should only be scanned once per iteration, for
>>> the optimization.
>>>
>>>
>>> (4)
>>> - I thought the verbose parameter should do that, but it does not seem
>>> to be working for me. I will need to look into it more.
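>>>
>>> (For reference, verbose is the last argument of logregr_train in the
>>> signature quoted earlier in this thread, so enabling it would look
>>> roughly like this -- an illustrative call, not verified:)
>>>
>>> SELECT madlib.logregr_train('forest', 'forest_logregr_out',
>>>                             'labeli = 1', 'vec',
>>>                             NULL, 20, 'igd', 0.0001, TRUE);
>>>                             -- ..., tolerance, verbose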
>>>
>>>
>>> (5)
>>> - logistic regression and SVM do not currently support a sparse matrix
>>> format:
>>> http://madlib.incubator.apache.org/docs/latest/group__grp__svec.html
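>>>
>>> One workaround is to densify the sparse representation in SQL before
>>> training; a minimal sketch, assuming a hypothetical table
>>> sparse_data(did, idx int[], val float8[], labeli int) with 1-based
>>> feature indices and 2,000 features:
>>>
>>> CREATE TABLE dense_data AS
>>> SELECT did,
>>>        ARRAY(SELECT coalesce(val[array_position(idx, d)], 0.0)
>>>              FROM generate_series(1, 2000) AS d
>>>              ORDER BY d) AS vec,   -- dense 2000-element feature array
>>>        labeli
>>> FROM sparse_data;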
>>>
>>>
>>> Frank
>>>
>>> ------------------------------
>>> *From:* Lijie Xu <csxulijie@gmail.com>
>>> *Sent:* Saturday, July 3, 2021 1:21 PM
>>> *To:* user@madlib.apache.org <user@madlib.apache.org>
>>> *Subject:* Long execution time on MADlib
>>>
>>>
>>>
>>> Hi All,
>>>
>>>
>>>
>>> I’m Lijie and now performing some experiments on MADlib. I found that
>>> MADlib runs very slowly on some datasets, so I would like to justify my
>>> settings. Could you help me check the following settings and codes? Sorry
>>> for this long email. I used the latest MADlib 1.18 on PostgreSQL 12.
>>>
>>>
>>>
>>> *(1)  **Could you help check whether the data format and scripts I used
>>> are right for an n-dimensional dataset?*
>>>
>>>
>>>
>>> I have some training datasets, and each of them has a dense feature
>>> array (like [0.1, 0.2, …, 1.0]) and a class label (+1/-1). For example, for
>>> the ‘forest’ dataset (581K tuples) with a 54-dimensional feature array and
>>> a class label, I first stored it into PostgreSQL using
>>>
>>>
>>>
>>> <code>
>>> CREATE TABLE forest (
>>>     did serial,
>>>     vec double precision[],
>>>     labeli integer);
>>>
>>> COPY forest (vec, labeli) FROM STDIN;
>>> ‘[0.1, 0.2, …, 1.0], -1’
>>> ‘[0.3, 0.1, …, 0.9], 1’
>>> …
>>> </code>
>>>
>>>
>>>
>>>
>>>
>>>         Then, to run the Logistic Regression on this dataset, I use the
>>> following code:
>>>
>>>
>>>
>>> <code>
>>> mldb=# \d forest
>>>                                Table "public.forest"
>>>  Column |        Type        |                      Modifiers
>>> --------+--------------------+------------------------------------------------------
>>>  did    | integer            | not null default nextval('forest_did_seq'::regclass)
>>>  vec    | double precision[] |
>>>  labeli | integer            |
>>>
>>> mldb=# SELECT madlib.logregr_train(
>>> mldb(#     'forest',                                 -- source table
>>> mldb(#     'forest_logregr_out',                     -- output table
>>> mldb(#     'labeli',                                 -- labels
>>> mldb(#     'vec',                                    -- features
>>> mldb(#     NULL,                                     -- grouping columns
>>> mldb(#     20,                                       -- max number of iteration
>>> mldb(#     'igd'                                     -- optimizer
>>> mldb(#     );
>>>
>>> Time: 198911.350 ms
>>> </code>
>>>
>>>
>>>
>>> After about 199s, I got the output table as:
>>>
>>> <code>
>>> mldb=# \d forest_logregr_out
>>>              Table "public.forest_logregr_out"
>>>           Column          |        Type        | Modifiers
>>> --------------------------+--------------------+-----------
>>>  coef                     | double precision[] |
>>>  log_likelihood           | double precision   |
>>>  std_err                  | double precision[] |
>>>  z_stats                  | double precision[] |
>>>  p_values                 | double precision[] |
>>>  odds_ratios              | double precision[] |
>>>  condition_no             | double precision   |
>>>  num_rows_processed       | bigint             |
>>>  num_missing_rows_skipped | bigint             |
>>>  num_iterations           | integer            |
>>>  variance_covariance      | double precision[] |
>>>
>>> mldb=# select log_likelihood from forest_logregr_out;
>>>   log_likelihood
>>> ------------------
>>>  -426986.83683879
>>> (1 row)
>>> </code>
>>>
>>>
>>>
>>> Is this procedure correct?
>>>
>>>
>>>
>>> *(2)  **Training on a 2,000-dimensional dense dataset (epsilon) is very
>>> slow:*
>>>
>>>
>>>
>>>            While training on a 2,000-dimensional dense dataset
>>> (epsilon_sample_10) with only *10 tuples* as follows, MADlib does not
>>> finish *even 1 iteration* in 5 hours. The CPU usage is always 100%
>>> during the execution. The dataset is available at
>>> https://github.com/JerryLead/Misc/blob/master/MADlib/train.sql.
>>>
>>>
>>>
>>> <code>
>>> mldb=# \d epsilon_sample_10
>>>                       Table "public.epsilon_sample_10"
>>>  Column |        Type        |                            Modifiers
>>> --------+--------------------+-----------------------------------------------------------------
>>>  did    | integer            | not null default nextval('epsilon_sample_10_did_seq'::regclass)
>>>  vec    | double precision[] |
>>>  labeli | integer            |
>>>
>>> mldb=# SELECT count(*) from epsilon_sample_10;
>>>  count
>>> -------
>>>     10
>>> (1 row)
>>>
>>> Time: 1.456 ms
>>>
>>> mldb=# SELECT madlib.logregr_train('epsilon_sample_10',
>>> 'epsilon_sample_10_logregr_out', 'labeli', 'vec', NULL, 1, 'igd');
>>> </code>
>>>
>>>
>>>
>>> *In this case, it is not possible to train the whole epsilon dataset
>>> (with 400,000 tuples) in a reasonable time. I guess that this problem is
>>> related to TOAST, since epsilon has a high dimension and is compressed
>>> by TOAST. However, are there any other reasons for such slow execution?*
>>>
>>>
>>>
>>> *(3)  **For MADlib, is the dataset table scanned once or twice in each
>>> iteration?*
>>>
>>> I know that, in each iteration, MADlib needs to scan the dataset table
>>> once to perform IGD/SGD on the whole dataset. My question is: *at
>>> the end of each iteration*, will MADlib scan the table again to compute
>>> the training loss/accuracy?
>>>
>>>
>>>
>>> *(4)  **Is it possible to output the training metrics, such as training
>>> loss and accuracy after each iteration?*
>>>
>>> Currently, it seems that MADlib only outputs the log-likelihood at the
>>> end of the SQL execution.
>>>
>>>
>>>
>>> *(5)  **Do MADlib’s Logistic Regression and SVM support sparse
>>> datasets?*
>>>
>>> I also have some sparse datasets denoted as ‘feature_index_vec_array,
>>> feature_value_array, label’, such as ‘[1, 3, 5], [0.1, 0.2, 0.3], -1’.
>>> Can I train these sparse datasets on MADlib using LR and SVM?
>>>
>>>
>>>
>>> Many thanks for reviewing my questions.
>>>
>>>
>>>
>>>
>>>
>>> Best regards,
>>>
>>>
>>>
>>> Lijie
>>>
>>
