madlib-user mailing list archives

From Tetsuo Kobayashi <tkobaya...@pivotal.io>
Subject Re: MADlib 1.8 Random Forest error (array_of_bigint)
Date Tue, 01 Dec 2015 22:16:08 GMT
Hi Rahul,


This helps a lot. Thank you for your support.

Tetsuo

On Wednesday, December 2, 2015, Rahul Iyer <riyer@pivotal.io> wrote:

> Hi Tetsuo,
>
> Random forest uses the decision tree module, which builds the features. The DT
> doc page <http://doc.madlib.net/latest/group__grp__decision__tree.html>
> says: "... boolean, integer, and text columns are considered categorical
> and double precision columns are considered continuous".
>
> Casting your continuous features to double precision should force them to
> be used as continuous.
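>
> As a minimal sketch of such a cast (the table name test_rf_data_cast is
> hypothetical, and it assumes test_rf_data has the schema shown in the
> original message, with the numeric columns being the ones meant as
> continuous), the features could be copied into a new table:
>
> ```sql
> -- Hypothetical table name; assumes test_rf_data matches the posted schema.
> DROP TABLE IF EXISTS test_rf_data_cast;
> CREATE TABLE test_rf_data_cast AS
> SELECT id,
>        linkid,
>        duration,
>        -- integer flags are left as-is, so they stay categorical
>        sat_flg, sun_flg, holiday_flg, semi_holiday_flg, renkyu_flg,
>        -- numeric columns are cast so they are treated as continuous
>        ave_temp::double precision    AS ave_temp,
>        ave_wind::double precision    AS ave_wind,
>        precip::double precision      AS precip,
>        radiation::double precision   AS radiation,
>        ave_speed::double precision   AS ave_speed,
>        travel_time::double precision AS travel_time
> FROM test_rf_data;
> ```
>
> Passing 'test_rf_data_cast' as the input table name to
> madlib.forest_train() would then train on the cast columns.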
>
> Best,
> Rahul
>
>
> On Mon, Nov 30, 2015 at 5:46 PM, Tetsuo Kobayashi <tkobayashi@pivotal.io> wrote:
>
>> Hi Rahul,
>>
>> Thank you for your comment. It seems I need to investigate the continuous
>> features more to find out what the issue is.
>>
>> Based on your comment, I understand that madlib.forest_train() separates the
>> continuous and categorical features, but are there rules for how the
>> function tells the two apart? Some continuous features are listed as
>> categorical in cat_features in the output_summary table.
>> Is there a way to manually specify which features are continuous
>> and which are categorical?
>>
>> Thank you,
>>
>> Tetsuo
>>
>>
>>
>>
>>
>>
>> 2015-12-01 4:09 GMT+09:00 Rahul Iyer <riyer@pivotal.io>:
>>
>>> Hi Tetsuo,
>>>
>>> I don't think it's the 'id' that is causing this issue, rather the array of
>>> features. Decision tree combines the continuous and categorical features in
>>> two separate arrays - one of those (most probably the continuous feature)
>>> is empty for a particular tuple. I can't comment more without looking at
>>> the dataset.
>>>
>>> Within the array operations module, we're returning the message as
>>> "array_of_bigint" for a float array. That's a minor messaging bug; I'll fix
>>> that as part of the next commit.
>>>
>>> Best,
>>> Rahul
>>>
>>> On Sun, Nov 29, 2015 at 12:41 AM, Tetsuo Kobayashi <tkobayashi@pivotal.io> wrote:
>>>
>>> > Hi,
>>> >
>>> > I am currently having an error with the MADlib Random Forest function in
>>> > MADlib 1.8.0. Below is the code I tried.
>>> >
>>> > DROP TABLE IF EXISTS rf_output, rf_output_group, rf_output_summary;
>>> > SELECT madlib.forest_train('test_rf_data',  -- input table name
>>> >                            'rf_output',     -- output table name
>>> >                            'id',            -- id column
>>> >                            'duration',      -- dependent variable
>>> >                            '*',             -- list of features
>>> >                            NULL,            -- exclude columns
>>> >                            'linkid',        -- grouping column
>>> >                            2::integer,      -- # of trees
>>> >                            5::integer,      -- # of random features
>>> >                            TRUE::boolean,   -- importance
>>> >                            1,               -- # of permutations
>>> >                            5,               -- max_tree_depth
>>> >                            10,              -- min_split
>>> >                            3,               -- min_bucket
>>> >                            10               -- number of splits per continuous variable
>>> >                            );
>>> >
>>> > When I tried this with all linkids (the grouping column has 362 linkids),
>>> > I got the error in the attached "error_random_forest.txt". The error
>>> > message says the array length is invalid but does not give any
>>> > details about which features in the data have this issue.
>>> >
>>> > ERROR:  plpy.SPIError: invalid array length (plpython.c:4648)
>>> > DETAIL:  array_of_bigint: Size should be in [1, 1e7], 0 given
>>> >
>>> > I guessed this is an error for the bigint columns, but the only bigint
>>> > column is the "id" column. I once had an error because some features had
>>> > identical values in all records, but that is not the case this time because
>>> > I changed the sample size for each linkid to 1000 or above.
>>> > The "0 given" in the DETAIL suggests that something is zero, but I have no
>>> > idea what in the data this refers to.
>>> >
>>> >
>>> > The schema of the input table is as follows:
>>> > CREATE TABLE input_table (
>>> > id bigint,
>>> > linkid varchar(32),
>>> > duration double precision,
>>> > sat_flg int,
>>> > sun_flg int,
>>> > holiday_flg int,
>>> > semi_holiday_flg int,
>>> > renkyu_flg int,
>>> > ave_temp numeric,
>>> > ave_wind numeric,
>>> > precip numeric,
>>> > radiation numeric,
>>> > ave_speed numeric,
>>> > travel_time numeric
>>> > );
>>> >
>>> > Can anybody please let me know the possible cause of this error? The
>>> > MADlib linear regression worked without any problems.
>>> >
>>> > I am using MADlib 1.8.0 on GPDB 4.3.6.1. The OS is CentOS.
>>> >
>>> >
>>> > Thank you,
>>> >
>>> > Tetsuo
>>> >
>>>
>>
>>
>>
>> --
>> ----------------------------------------
>> Pivotal Japan K.K.
>> 小林哲郎 (Tetsuo Kobayashi)
>> Senior Data Scientist
>> E-mail: tkobayashi@pivotal.io
>> TEL: 080-9979-0757 (mobile)
>> ----------------------------------------
>>
>
>

-- 
----------------------------------------
Pivotal Japan K.K.
小林哲郎 (Tetsuo Kobayashi)
Senior Data Scientist
E-mail: tkobayashi@pivotal.io
TEL: 080-9979-0757 (mobile)
----------------------------------------
