madlib-user mailing list archives

From Rahul Iyer <ri...@pivotal.io>
Subject Re: MADlib 1.8 Random Forest error (array_of_bigint)
Date Tue, 01 Dec 2015 17:45:14 GMT
Hi Tetsuo,

Random forest uses the decision tree module, which builds the features. The DT
doc page <http://doc.madlib.net/latest/group__grp__decision__tree.html>
says: "... boolean, integer, and text columns are considered categorical
and double precision columns are considered continuous".
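
One way to see which columns will fall into each group is to list the
column types before training. This is only a sketch: it assumes the input
table is named test_rf_data, as in the forest_train() call quoted below, and
it uses the standard information_schema.columns view rather than anything
MADlib-specific.

```sql
-- List each column's declared type. Per the doc quote above, only
-- double precision columns would be treated as continuous.
-- Assumption: the table is named test_rf_data (taken from the quoted call).
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'test_rf_data'
ORDER BY ordinal_position;
```

Columns declared as numeric, such as ave_temp in the quoted schema, are not
covered by that sentence of the doc page, so they are worth checking first.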

Casting your continuous features to double precision should force them to
be used as continuous.
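
A minimal sketch of that cast, assuming the table name test_rf_data from the
call quoted below and the numeric columns from the quoted schema. The new
table name test_rf_data_dp is hypothetical.

```sql
-- Hypothetical output table name; column names come from the quoted schema.
DROP TABLE IF EXISTS test_rf_data_dp;
CREATE TABLE test_rf_data_dp AS
SELECT id,
       linkid,
       duration,
       -- integer flags are left as-is so they stay categorical
       sat_flg,
       sun_flg,
       holiday_flg,
       semi_holiday_flg,
       renkyu_flg,
       -- numeric measurements cast so they can be treated as continuous
       ave_temp::double precision    AS ave_temp,
       ave_wind::double precision    AS ave_wind,
       precip::double precision      AS precip,
       radiation::double precision   AS radiation,
       ave_speed::double precision   AS ave_speed,
       travel_time::double precision AS travel_time
FROM test_rf_data;
```

Passing 'test_rf_data_dp' as the first argument to forest_train() should
then let those six columns be picked up as continuous.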

Best,
Rahul


On Mon, Nov 30, 2015 at 5:46 PM, Tetsuo Kobayashi <tkobayashi@pivotal.io>
wrote:

> Hi Rahul,
>
> Thank you for your comment. It seems I need to investigate the continuous
> features more to find out what the issue is.
>
> Based on your comment, I understand that madlib.forest_train() separates
> the continuous features from the categorical features, but are there any
> rules for how the function separates the two? I see that some continuous
> features are recognized as categorical when I look at cat_features in the
> output_summary table.
> Is there any way I can manually specify which features are continuous and
> which are categorical?
>
> Thank you,
>
> Tetsuo
>
> 2015-12-01 4:09 GMT+09:00 Rahul Iyer <riyer@pivotal.io>:
>
>> Hi Tetsuo,
>>
>> I don't think it's the 'id' column that is causing this issue, but rather
>> the array of features. Decision tree collects the continuous and
>> categorical features into two separate arrays, and one of those (most
>> probably the continuous-feature array) is empty for a particular tuple.
>> I can't comment further without looking at the dataset.
>>
>> Within the array operations module, we're returning the message as
>> "array_of_bigint" for a float array. That's a minor messaging bug; I'll
>> fix that as part of the next commit.
>>
>> Best,
>> Rahul
>>
>> On Sun, Nov 29, 2015 at 12:41 AM, Tetsuo Kobayashi <tkobayashi@pivotal.io>
>> wrote:
>>
>> > Hi,
>> >
>> > I am currently getting an error from the MADlib Random Forest function
>> > in MADlib 1.8.0. Below is the code I tried.
>> >
>> > DROP TABLE IF EXISTS rf_output, rf_output_group, rf_output_summary;
>> > SELECT madlib.forest_train('test_rf_data',  -- input table name
>> >                            'rf_output',     -- output table name
>> >                            'id',            -- id column
>> >                            'duration',      -- dependent variable
>> >                            '*',             -- list of features
>> >                            NULL,            -- exclude columns
>> >                            'linkid',        -- grouping column
>> >                            2::integer,      -- # of trees
>> >                            5::integer,      -- # of random features
>> >                            TRUE::boolean,   -- importance
>> >                            1,               -- # of permutations
>> >                            5,               -- max_tree_depth
>> >                            10,              -- min_split
>> >                            3,               -- min_bucket
>> >                            10               -- number of splits per continuous variable
>> >                            );
>> >
>> > When I ran this over all linkids (the grouping column has 362 distinct
>> > linkids), I got the error shown in the attached
>> > "error_random_forest.txt". The error message says I have an invalid
>> > array length but does not give any details about which features in the
>> > data have this issue.
>> >
>> > ERROR:  plpy.SPIError: invalid array length (plpython.c:4648)
>> > DETAIL:  array_of_bigint: Size should be in [1, 1e7], 0 given
>> >
>> > I guessed this error concerns the bigint columns, but the only bigint
>> > column is the "id" column. I once had an error because some features
>> > had identical values in all records, but that is not the case this time
>> > because I changed the sample size for each linkid to 1000 or above.
>> > The "0 given" in the DETAIL suggests that something is zero, but I have
>> > no idea what in the data this refers to.
>> >
>> >
>> > The schema of the input table is as follows:
>> > CREATE TABLE input_table (
>> > id bigint,
>> > linkid varchar(32),
>> > duration double precision,
>> > sat_flg int,
>> > sun_flg int,
>> > holiday_flg int,
>> > semi_holiday_flg int,
>> > renkyu_flg int,
>> > ave_temp numeric,
>> > ave_wind numeric,
>> > precip numeric,
>> > radiation numeric,
>> > ave_speed numeric,
>> > travel_time numeric
>> > );
>> >
>> > Can anybody please let me know what the possible cause of this error
>> > is? The MADlib linear regression worked without any problems.
>> >
>> > I am using MADlib 1.8.0 on GPDB 4.3.6.1. The OS is CentOS.
>> >
>> >
>> > Thank you,
>> >
>> > Tetsuo
>> >
>>
>
>
>
> --
> ----------------------------------------
> Pivotal Japan K.K.
> 小林哲郎 (Tetsuo Kobayashi)
> Senior Data Scientist
> E-mail: tkobayashi@pivotal.io
> TEL: 080-9979-0757 (mobile)
> ----------------------------------------
>
