madlib-user mailing list archives

From Rahul Iyer <rahulri...@gmail.com>
Subject Re: MADlib 1.8 Random Forest error
Date Wed, 18 Nov 2015 23:49:16 GMT
Hi Tetsuo,

Please find inline the explanation for the errors.

> When I tried this with all linkid (the grouping column with 3043 linkids),
> I got an error as in "error_all_sample.txt" attached here. I guessed this
> is due to the number of samples in each linkid. The minimum sample size in
> a linkid is 62.
>

As you guessed, the error occurs because the number of splits is greater
than the number of samples in at least one of the groups. You can avoid it
by passing an explicit value for the number-of-splits parameter. In your
query it is the last parameter, so you will also have to provide values for
the parameters in between (or set them to NULL where you want the default
value), since named parameters are not yet available for GPDB.
For this case, you can set num_splits to any value less than 62, since
that is the minimum number of samples in a group.
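
To pick a safe value without counting by hand, the smallest group size can
be queried directly. The sketch below is only an illustration: the table
name "samples", the column "y", and the toy row counts are assumptions, and
the sqlite3 wrapper is there purely to make the example self-contained. The
SQL itself is plain aggregate SQL that should carry over to GPDB with your
own table and grouping column.

```python
import sqlite3

# Hypothetical toy data: the table name "samples", the column "y", and the
# row counts are assumptions for illustration, not the original data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (linkid INTEGER, y REAL)")
rows = [(1, 0.1)] * 62 + [(2, 0.2)] * 200 + [(3, 0.3)] * 75
conn.executemany("INSERT INTO samples VALUES (?, ?)", rows)

# Smallest number of rows found in any single group.
min_group_size = conn.execute(
    """
    SELECT min(cnt)
    FROM (SELECT count(*) AS cnt FROM samples GROUP BY linkid) AS g
    """
).fetchone()[0]

# Any num_splits strictly below the smallest group size meets the constraint.
max_num_splits = min_group_size - 1
print(min_group_size, max_num_splits)
```

With the toy data above, the smallest group has 62 rows, so the largest
value that stays below it is 61.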


> I got another error as in "error_sample_above_200." The error message this
> time tells me something is wrong with the feature "holiday_flg," which is a
> dummy variable. When I excluded "holiday_flg," the error message tells me
> "KeyError: 'sun_flg'," which is another dummy variable. I got
> another dummy variable in "KeyError:" when I excluded "sun_flg." I can tell
> one of the dummy variables in the model is picked up in the error message
> when I exclude one.
>

Looking at the code and performing some simulations, this is most probably
because at least one group has only one level for some of the categorical
variables. That is, the value of each of these categorical variables is
constant in at least one of the groups (you should be able to verify this
easily by doing a "select count(...) from ... group by ... ").
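
One possible form of that check is sketched below. It is an assumption
about how the query could be written, not the exact query intended above:
the table name "samples" and the toy rows are hypothetical, while linkid,
holiday_flg and sun_flg are the names from your message. sqlite3 is used
only so the example runs on its own; the SELECT is standard SQL.

```python
import sqlite3

# Hypothetical toy data; only the column names come from the thread.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples (linkid INTEGER, holiday_flg INTEGER, sun_flg INTEGER)"
)
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [
        (1, 0, 0), (1, 1, 0), (1, 0, 1),  # both flags vary in group 1
        (2, 0, 0), (2, 0, 1),             # holiday_flg is constant in group 2
        (3, 1, 1), (3, 1, 1),             # both flags are constant in group 3
    ],
)

# Groups in which at least one categorical column has fewer than 2 levels.
single_level_groups = conn.execute(
    """
    SELECT linkid,
           count(DISTINCT holiday_flg) AS holiday_levels,
           count(DISTINCT sun_flg)     AS sun_levels
    FROM samples
    GROUP BY linkid
    HAVING count(DISTINCT holiday_flg) < 2
        OR count(DISTINCT sun_flg) < 2
    ORDER BY linkid
    """
).fetchall()
print(single_level_groups)
```

On the toy rows this reports groups 2 and 3. Any group returned by such a
query is a candidate for triggering the KeyError.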

This is an easily fixable bug and I urge you to file a JIRA
<https://issues.apache.org/jira/browse/MADLIB> (note the new Apache JIRA
link). Until that fix is completed, the only workaround is to either not
use those columns or ensure that each group has at least 2 distinct values
for every categorical variable.

Feel free to drop a note here if you have further questions.

Best,
Rahul
