madlib-user mailing list archives

From Tetsuo Kobayashi <tkobaya...@pivotal.io>
Subject Re: MADlib 1.8 Random Forest error (array_of_bigint)
Date Tue, 01 Dec 2015 01:46:12 GMT
Hi Rahul,

Thank you for your comment. It seems I need to investigate the continuous
features more to find out what the issue is.

Based on your comment, I understand that madlib.forest_train() separates the
continuous features from the categorical features, but are there any rules for
how the function separates the two? Some continuous features are
recognized as categorical when I look at cat_features in the
output_summary table.
Is there any way to manually specify which features are continuous and
which are categorical?
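
For reference, here is a minimal sketch of how the split can be compared against the declared column types. It assumes the summary table is rf_output_summary (the name used in the DROP TABLE statement quoted below) and that it exposes a con_features column alongside cat_features; only cat_features is confirmed by what I have seen so far, so the second column name is an assumption.

```sql
-- Assumption: rf_output_summary is the summary table written by
-- madlib.forest_train() for the output table 'rf_output'.
-- Assumption: con_features exists next to cat_features in that table.
SELECT cat_features,   -- features the function treated as categorical
       con_features    -- features the function treated as continuous
FROM rf_output_summary;

-- Declared type of each column in the input table, to compare with the
-- lists above. information_schema.columns is standard in PostgreSQL/GPDB.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'test_rf_data'
ORDER BY ordinal_position;
```

Putting the two results side by side should show whether the classification follows the column data type.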

Thank you,

Tetsuo

2015-12-01 4:09 GMT+09:00 Rahul Iyer <riyer@pivotal.io>:

> Hi Tetsuo,
>
> I don't think it's the 'id' that is causing this issue, rather the array of
> features. Decision tree combines the continuous and categorical features in
> two separate arrays - one of those (most probably the continuous feature)
> is empty for a particular tuple. I can't comment more without looking at
> the dataset.
>
> Within the array operations module, we're returning the message as
> "array_of_bigint" for a float array. That's a minor messaging bug; I'll fix
> that as part of the next commit.
>
> Best,
> Rahul
>
> On Sun, Nov 29, 2015 at 12:41 AM, Tetsuo Kobayashi <tkobayashi@pivotal.io>
> wrote:
>
> > Hi,
> >
> > I am currently getting an error with the MADlib Random Forest function in
> > MADlib 1.8.0. Below is the code I tried.
> >
> > DROP TABLE IF EXISTS rf_output, rf_output_group, rf_output_summary;
> > SELECT madlib.forest_train('test_rf_data', -- input table name
> >                            'rf_output', -- output table name
> >                            'id', -- id column
> >                            'duration', -- dependent variable
> >                            '*',  -- list of features
> >                            NULL,-- exclude columns
> >                            'linkid', -- grouping column
> >                            2::integer, -- # of trees
> >                            5::integer,  -- # of random features
> >                            TRUE::boolean, -- importance
> >                            1,  -- # of permutations
> >                            5, -- max_tree_depth
> >                            10,  -- min_split
> >                            3,  -- min_bucket
> >                            10  -- number of splits per continuous variable
> >                            );
> >
> > When I tried this with all linkids (the grouping column has 362 linkids),
> > I got the error shown in the attached "error_random_forest.txt". The error
> > message says I have an invalid array length but gives no
> > details about which features in the data have this issue.
> >
> > ERROR:  plpy.SPIError: invalid array length (plpython.c:4648)
> > DETAIL:  array_of_bigint: Size should be in [1, 1e7], 0 given
> >
> > I guessed this is an error for the bigint columns, but the only bigint
> > column is the "id" column. I once had an error because some features had
> > identical values in all records, but that is not the case this time because I
> > changed the sample size for each linkid to 1000 or above.
> > The "0 given" in the DETAIL suggests something is zero, but I have no
> > idea what in the data this refers to.
> >
> >
> > The schema of the input table is as follows:
> > CREATE TABLE input_table (
> > id bigint,
> > linkid varchar(32),
> > duration double precision,
> > sat_flg int,
> > sun_flg int,
> > holiday_flg int,
> > semi_holiday_flg int,
> > renkyu_flg int,
> > ave_temp numeric,
> > ave_wind numeric,
> > precip numeric,
> > radiation numeric,
> > ave_speed numeric,
> > travel_time numeric
> > );
> >
> > Can anybody please let me know what the possible cause of this error is? The
> > MADlib linear regression worked without any problems.
> >
> > I am using MADlib 1.8.0 on GPDB 4.3.6.1. The OS is CentOS.
> >
> >
> > Thank you,
> >
> > Tetsuo
> >
>



-- 
----------------------------------------
Pivotal Japan K.K.
Tetsuo Kobayashi
Senior Data Scientist
E-mail: tkobayashi@pivotal.io
TEL: 080-9979-0757 (mobile)
----------------------------------------
