madlib-user mailing list archives

From Tetsuo Kobayashi <tkobaya...@pivotal.io>
Subject MADlib 1.8 Random Forest error (array_of_bigint)
Date Sun, 29 Nov 2015 08:41:22 GMT
Hi,

I am getting an error from the MADlib Random Forest function (forest_train)
in MADlib 1.8.0.  Below is the code I tried.

DROP TABLE IF EXISTS rf_output, rf_output_group, rf_output_summary;
SELECT madlib.forest_train('test_rf_data', -- input table name
                           'rf_output', -- output table name
                           'id', -- id column
                           'duration', -- dependent variable
                           '*',  -- list of features
                           NULL,-- exclude columns
                           'linkid', -- grouping column
                           2::integer, -- # of trees
                           5::integer,  -- # of random features
                           TRUE::boolean, -- importance
                           1,  -- # of permutations
                           5, -- max_tree_depth
                           10,  -- min_split
                           3,  -- min_bucket
                           10  -- number of splits per continuous variable
                           );

When I ran this over all linkids (linkid is the grouping column and has 362
distinct values), I got the error shown in the attached
"error_random_forest.txt". The error message says there is an invalid array
length, but it gives no details about which features or groups in the data
cause it.

ERROR:  plpy.SPIError: invalid array length (plpython.c:4648)
DETAIL:  array_of_bigint: Size should be in [1, 1e7], 0 given

I guessed this error relates to bigint columns, but the only bigint column
is "id". I once had an error caused by some features having identical values
in all records, but that is not the case this time because I changed the
sample size for each linkid to 1000 or above.
The "0 given" in the DETAIL line suggests that something has size zero, but I
have no idea what in the data this refers to.
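
One way to narrow this down might be a per-group summary along the lines of the sketch below. It is only a diagnostic under assumptions: it uses the table name test_rf_data from the forest_train call and the columns from the schema in this message, and it assumes that groups with NULL duration values or with a feature that is constant within the group are worth inspecting. Whether either condition actually triggers the array_of_bigint error is not confirmed.

```sql
-- Diagnostic sketch (assumption: test_rf_data has the columns listed in the
-- schema in this message). For each linkid, report the row count, the number
-- of non-NULL dependent values, and whether each feature is constant in the group.
SELECT linkid,
       count(*)                                      AS n_rows,
       count(duration)                               AS n_duration_not_null,
       (min(sat_flg)          = max(sat_flg))          AS sat_flg_constant,
       (min(sun_flg)          = max(sun_flg))          AS sun_flg_constant,
       (min(holiday_flg)      = max(holiday_flg))      AS holiday_flg_constant,
       (min(semi_holiday_flg) = max(semi_holiday_flg)) AS semi_holiday_flg_constant,
       (min(renkyu_flg)       = max(renkyu_flg))       AS renkyu_flg_constant,
       (min(ave_temp)         = max(ave_temp))         AS ave_temp_constant,
       (min(ave_wind)         = max(ave_wind))         AS ave_wind_constant,
       (min(precip)           = max(precip))           AS precip_constant,
       (min(radiation)        = max(radiation))        AS radiation_constant,
       (min(ave_speed)        = max(ave_speed))        AS ave_speed_constant,
       (min(travel_time)      = max(travel_time))      AS travel_time_constant
FROM test_rf_data
GROUP BY linkid
ORDER BY n_rows, linkid;
```

A *_constant column that is NULL means the feature is NULL in every row of that group, since min and max ignore NULLs. Running forest_train on a single suspicious linkid afterwards would show whether that group alone reproduces the error.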


The schema of the input table is as follows:
CREATE TABLE input_table (
id bigint,
linkid varchar(32),
duration double precision,
sat_flg int,
sun_flg int,
holiday_flg int,
semi_holiday_flg int,
renkyu_flg int,
ave_temp numeric,
ave_wind numeric,
precip numeric,
radiation numeric,
ave_speed numeric,
travel_time numeric
);

Can anybody please let me know what the possible cause of this error is? The
MADlib linear regression worked without any problems on the same data.

I am using MADlib 1.8.0 on GPDB 4.3.6.1. The OS is CentOS.


Thank you,

Tetsuo
