madlib-user mailing list archives

From Tetsuo Kobayashi <tkobaya...@pivotal.io>
Subject MADlib 1.8 Random Forest error
Date Wed, 18 Nov 2015 23:22:57 GMT
I am currently having an error with the MADlib Random Forest function in
MADlib 1.8.0.  Below is the code I tried.

SELECT madlib.forest_train('data_random_forest', -- input table name
                           'rf_output', -- output table name
                           'id', -- id column
                           'duration', -- dependent variable
                           '*',  -- list of features
                           'start_time, end_time, date_only, loc_code, amedas_code',  -- exclude columns
                           'linkid' -- grouping column
                           -- ,10::integer -- # of trees
                           -- ,2::integer,  -- # of random features
                           -- TRUE::boolean, -- importance
                           -- 1,  -- # of permutations
                           -- 10, -- max_tree_depth
                           -- 8,  -- min_split
                           -- 3,  -- min_bucket
                           -- 10  -- number of splits per continuous variable
                           );

When I ran this on all linkids (the grouping column has 3,043 distinct
linkids), I got the error shown in the attached "error_all_sample.txt". I
guessed this was due to the number of samples in each linkid. The minimum
sample size in a linkid is 62.
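
For reference, the per-linkid sample counts can be checked with a query
along the following lines. This is only a sketch: it assumes the input table
is 'data_random_forest' as in the call above, and the alias n_samples is just
an illustrative name.

```sql
-- Count rows per linkid and list the smallest groups first.
SELECT linkid,
       count(*) AS n_samples   -- n_samples: illustrative alias
FROM data_random_forest
GROUP BY linkid
ORDER BY n_samples ASC
LIMIT 10;
```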

Therefore, I removed all linkids with fewer than 200 samples and ran the
same function again. This time I got a different error, shown in
"error_sample_above_200". The error message indicates that something is
wrong with the feature "holiday_flg", which is a dummy variable. When I
excluded "holiday_flg", the error became "KeyError: 'sun_flg'", which names
another dummy variable. When I excluded "sun_flg", the KeyError named yet
another dummy variable. It appears that whenever I exclude one dummy
variable, the error message simply picks up another one from the model.
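
The filtering step described above can be sketched roughly as follows. The
exact query is not included in this message, so this is an illustration
only: the table name 'data_random_forest_200' is hypothetical, and the source
table is assumed to be 'data_random_forest' as in the forest_train call.

```sql
-- Keep only the linkids that have at least 200 rows.
-- data_random_forest_200 is a hypothetical table name.
CREATE TABLE data_random_forest_200 AS
SELECT t.*
FROM data_random_forest t
JOIN (
    SELECT linkid
    FROM data_random_forest
    GROUP BY linkid
    HAVING count(*) >= 200
) g ON t.linkid = g.linkid;
```

The resulting table would then be passed to forest_train as the input table
in place of the full one.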

The schema of the input table is as follows:
CREATE TABLE input_table (
id bigint,
linkid varchar(32),
start_time timestamp,
end_time timestamp,
duration double precision,
dow_code int,
date_only date,
mth int,
hod int,
sat_flg int,
sun_flg int,
time_of_day int,
holiday_flg int,
semi_holiday_flg int,
renkyu_flg int,
loc_code numeric,
amedas_code numeric,
ave_temp numeric,
high_temp numeric,
low_temp numeric,
ave_wind numeric,
precip numeric,
radiation numeric,
ave_speed numeric,
travel_time numeric
);

Can you please let me know what the possible cause of this error is? The
MADlib linear regression worked without any problems on the same data.

I am using MADlib 1.8.0 on GPDB 4.3.6.1. The OS is CentOS.


Thank you,

Tetsuo



-- 
----------------------------------------
Pivotal Japan K.K.
Tetsuo Kobayashi
Senior Data Scientist
E-mail: tkobayashi@pivotal.io
TEL: 080-9979-0757 (mobile)
----------------------------------------
