Dear Aaron,

Thanks for your advice. I will try it. 

In addition, after following Frank's guide, I found that MADlib LR and SVM work normally on some low-dimensional (e.g., 18-28 features) datasets, even those with more than 1 million tuples. However, on a high-dimensional dataset such as epsilon, with 400,000 tuples and 2,000 features (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), MADlib SVM can finish 20 iterations in a reasonable time, but MADlib LR (with IGD) cannot finish 2 iterations in several hours. Any ideas about this problem? Thanks!

Best,
Lijie 



On Thu, Jul 15, 2021 at 4:03 PM FENG, Xixuan (Aaron) <xixuan.feng@gmail.com> wrote:
Hi Lijie,

I implemented logregr with incremental gradient descent a few years ago. Unfortunately, at that time we chose to hard-code a constant step size. But luckily you can edit the code as you need.

Here are the pointers:
https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L818

https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L918

Good luck!
Aaron

On Thu, Jul 15, 2021 at 10:14 PM Lijie Xu <csxulijie@gmail.com> wrote:
Dear Frank,

Sorry for the late reply, and thanks for your great help. I'm doing some research work on MADlib, and I will follow your advice to test MADlib again. Another question: does MADlib LR support tuning the learning rate?

In MADlib SVM, there is a 'params' argument in 'svm_classification' for tuning 'init_stepsize' and 'decay_factor', as shown below.
svm_classification(
    source_table,
    model_table,
    dependent_varname,
    independent_varname,
    kernel_func,
    kernel_params,
    grouping_col,
    params,
    verbose
    )
However, I did not see such a 'params' argument in LR:
logregr_train( source_table,
               out_table,
               dependent_varname,
               independent_varname,
               grouping_cols,
               max_iter,
               optimizer,
               tolerance,
               verbose
             )
In addition, I checked the Generalized Linear Models module, and its 'optim_params' parameter seems to support tuning only 'tolerance', 'max_iter', and 'optimizer'.
Is there a way to tune 'init_stepsize' and 'decay_factor' in LR? Thanks!
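For reference, this is the kind of call I use to set these in SVM (a sketch only; 'forest' and its columns come from my earlier email below, and I have checked the 'params' syntax only against the SVM documentation):
SELECT madlib.svm_classification(
    'forest',             -- source_table
    'forest_svm_out',     -- model_table
    'labeli',             -- dependent_varname
    'vec',                -- independent_varname
    'linear',             -- kernel_func
    NULL,                 -- kernel_params
    NULL,                 -- grouping_col
    'init_stepsize=0.01, decay_factor=0.9'  -- params
    );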

Best,
Lijie

On Tue, Jul 6, 2021 at 9:04 PM Frank McQuillan <fmcquillan@vmware.com> wrote:
Hello,

Thank you for the questions.

(0)
Not sure if you are using Postgres just for development or for production, but keep in mind that MADlib is designed to run on a distributed MPP database (Greenplum) with large datasets. It runs fine on Postgres, but Postgres obviously won't scale to very large datasets and will just be too slow.

Also see the Jupyter notebooks in the MADlib community artifacts for other examples, in case they are of use.


(1)
- there are 2 problems with your dataset for logistic regression:

(i)
- as per the logregr_train documentation, the dependent variable must be a boolean or an expression that evaluates to a boolean
- your data has a dependent variable of -1, but Postgres does not evaluate -1 to FALSE, so you should change the -1 to 0
- i.e., use 0 for FALSE and 1 for TRUE in Postgres (a minimal sketch of the relabeling follows)
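For example (a minimal sketch, reusing the 'forest' table and 'labeli' column from your email):

UPDATE forest SET labeli = 0 WHERE labeli = -1;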



(ii)
- an intercept variable is not assumed, so it is common to provide an explicit intercept term by including a single constant 1 term in the independent variable list
- see the example in the logregr_train documentation, and the minimal sketch below
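Again a sketch with your table names; array_prepend is standard Postgres:

UPDATE forest SET vec = array_prepend(1.0::double precision, vec);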



That is why the log_likelihood value is far too large; that model is not right.


(2)
If you make the fixes above in (1), it should run OK. Here are my results on PostgreSQL 11.6 with MADlib 1.18.0 on the dataset with 10 tuples:


DROP TABLE IF EXISTS epsilon_sample_10v2 CASCADE;

CREATE TABLE epsilon_sample_10v2 (
    did serial,
    vec double precision[],
    labeli integer
);

COPY epsilon_sample_10v2 (vec, labeli) FROM STDIN;
{1.0,-0.0108282,-0.0196004,0.0422148,...} 0
{1.0,0.00250835,0.0168447,-0.0102934,...} 1
etc.

SELECT madlib.logregr_train('epsilon_sample_10v2', 'epsilon_sample_10v2_logregr_out', 'labeli', 'vec', NULL, 1, 'irls');

 logregr_train
---------------
 
(1 row)

Time: 317046.342 ms (05:17.046)

madlib=# select log_likelihood from epsilon_sample_10v2_logregr_out;
  log_likelihood  
-------------------
 -6.93147180559945
(1 row)


(3)
- the dataset is not scanned again at the end of every iteration to compute training loss/accuracy; it is scanned only once per iteration, for the optimization itself
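One way to confirm the scan count yourself (a sketch using standard Postgres statistics, not a MADlib feature):

SELECT seq_scan FROM pg_stat_user_tables WHERE relname = 'forest';
-- run logregr_train with max_iter = N, then repeat the query above;
-- the counter should grow by roughly N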


(4)
- I thought the verbose parameter should do that, but it does not seem to be working for me.  Will need to look into it more.
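For reference, verbose is the last argument of logregr_train, e.g. (a sketch; 0.0001 is the documented default tolerance):

SELECT madlib.logregr_train('forest', 'forest_logregr_out', 'labeli', 'vec',
                            NULL, 20, 'igd', 0.0001, TRUE);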


(5)
- logistic regression and SVM do not currently support a sparse matrix format, so sparse data would need to be densified first (a sketch follows)
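One way to densify inside Postgres (a sketch; the table 'epsilon_sparse' and its columns idx integer[], val double precision[], labeli are hypothetical, and it assumes 1-based feature indices and 2,000 dimensions):

CREATE TABLE epsilon_dense AS
SELECT did,
       ARRAY(SELECT coalesce(val[array_position(idx, d)], 0.0::double precision)
             FROM generate_series(1, 2000) AS d
             ORDER BY d) AS vec,
       labeli
FROM epsilon_sparse;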


Frank


From: Lijie Xu <csxulijie@gmail.com>
Sent: Saturday, July 3, 2021 1:21 PM
To: user@madlib.apache.org <user@madlib.apache.org>
Subject: Long execution time on MADlib
 


Hi All,

 

I’m Lijie, and I am running some experiments on MADlib. I found that MADlib runs very slowly on some datasets, so I would like to verify my settings. Could you help me check the following settings and code? Sorry for the long email. I am using the latest MADlib 1.18 on PostgreSQL 12.

 

(1)  Could you help check whether the data format and scripts I used are right for an n-dimensional dataset?

 

I have some training datasets, and each of them has a dense feature array (like [0.1, 0.2, …, 1.0]) and a class label (+1/-1). For example, for the ‘forest’ dataset (581K tuples) with a 54-dimensional feature array and a class label, I first stored it into PostgreSQL using

 

<code>
CREATE TABLE forest (
    did serial,
    vec double precision[],
    labeli integer);

COPY forest (vec, labeli) FROM STDIN;
‘[0.1, 0.2, …, 1.0], -1’
‘[0.3, 0.1, …, 0.9], 1’
…
</code>

 

 

Then, to run Logistic Regression on this dataset, I use the following code:

 

<code>
mldb=# \d forest
                               Table "public.forest"
 Column |        Type        |                      Modifiers
--------+--------------------+------------------------------------------------------
 did    | integer            | not null default nextval('forest_did_seq'::regclass)
 vec    | double precision[] |
 labeli | integer            |

mldb=# SELECT madlib.logregr_train(
mldb(#     'forest',                                 -- source table
mldb(#     'forest_logregr_out',                     -- output table
mldb(#     'labeli',                                 -- labels
mldb(#     'vec',                                    -- features
mldb(#     NULL,                                     -- grouping columns
mldb(#     20,                                       -- max number of iterations
mldb(#     'igd'                                     -- optimizer
mldb(#     );

Time: 198911.350 ms
</code>

 

After about 199 seconds, I got the output table:

<code>
mldb=# \d forest_logregr_out
             Table "public.forest_logregr_out"
          Column          |        Type        | Modifiers
--------------------------+--------------------+-----------
 coef                     | double precision[] |
 log_likelihood           | double precision   |
 std_err                  | double precision[] |
 z_stats                  | double precision[] |
 p_values                 | double precision[] |
 odds_ratios              | double precision[] |
 condition_no             | double precision   |
 num_rows_processed       | bigint             |
 num_missing_rows_skipped | bigint             |
 num_iterations           | integer            |
 variance_covariance      | double precision[] |

mldb=# select log_likelihood from forest_logregr_out;
  log_likelihood
------------------
 -426986.83683879
(1 row)
</code>

 

Is this procedure correct?

 

(2)  Training on a 2,000-dimensional dense dataset (epsilon) is very slow:

 

While training on a 2,000-dimensional dense dataset (epsilon_sample_10) with only 10 tuples, as shown below, MADlib did not finish even 1 iteration in 5 hours. The CPU usage was always 100% during the execution. The dataset is available at https://github.com/JerryLead/Misc/blob/master/MADlib/train.sql.

 

<code>
mldb=# \d epsilon_sample_10
                               Table "public.epsilon_sample_10"
 Column |        Type        |                            Modifiers
--------+--------------------+-----------------------------------------------------------------
 did    | integer            | not null default nextval('epsilon_sample_10_did_seq'::regclass)
 vec    | double precision[] |
 labeli | integer            |

mldb=# SELECT count(*) from epsilon_sample_10;
 count
-------
    10
(1 row)

Time: 1.456 ms

mldb=# SELECT madlib.logregr_train('epsilon_sample_10', 'epsilon_sample_10_logregr_out', 'labeli', 'vec', NULL, 1, 'igd');
</code>

 

In this case, it is not possible to train the whole epsilon dataset (with 400,000 tuples) in a reasonable time. I guess that this problem is related to TOAST, since the epsilon vectors are high-dimensional and are therefore compressed and stored out of line by TOAST. However, are there any other reasons for such slow execution?
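To test the TOAST guess, I plan to try something like the following (a sketch; SET STORAGE EXTERNAL is standard Postgres, disables compression for the column, and affects only newly inserted rows, so the data must be reloaded afterwards):

<code>
SELECT pg_size_pretty(pg_total_relation_size('epsilon_sample_10'));
ALTER TABLE epsilon_sample_10 ALTER COLUMN vec SET STORAGE EXTERNAL;
-- reload the data, then re-run logregr_train
</code>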

 

(3)  For MADlib, is the dataset table scanned once or twice in each iteration?

I know that, in each iteration, MADlib needs to scan the dataset table once to perform IGD/SGD on the whole dataset. My question is: at the end of each iteration, does MADlib scan the table again to compute the training loss/accuracy?

 

(4)  Is it possible to output training metrics, such as training loss and accuracy, after each iteration?

Currently, it seems that MADlib only outputs the log-likelihood at the end of the SQL execution.

 

(5)  Do MADlib’s Logistic Regression and SVM support sparse datasets?

I also have some sparse datasets represented as ‘feature_index_vec_array, feature_value_array, label’, such as ‘[1, 3, 5], [0.1, 0.2, 0.3], -1’. Can I train on these sparse datasets in MADlib using LR and SVM?

 

Many thanks for reviewing my questions.

 

 

Best regards,

 

Lijie