madlib-user mailing list archives

From Frank McQuillan <fmcquil...@pivotal.io>
Subject Re: PostgreSQL crashed during random forest training
Date Sat, 28 Jul 2018 11:28:09 GMT
Thanks, I added this info to the JIRA.

On Fri, Jul 27, 2018 at 7:23 AM, LUYAO CHEN <luyao_chen@hotmail.com> wrote:

> A similar problem occurred with decision tree training (using the same
> data set).
>
> dmesg reported the following error:
> [ 4289.020198] postmaster[1840]: segfault at 0 ip 00007f17cd5f4ea3 sp 00007ffdf867dd50 error 4 in libmadlib.so[7f17cd2ec000+64a000]
>
>
>
>
> Regards,
> Luyao Chen
>
> ------------------------------
> *From:* Frank McQuillan <fmcquillan@pivotal.io>
> *Sent:* Tuesday, July 24, 2018 2:13 PM
>
> *To:* user@madlib.apache.org
> *Subject:* Re: PostgreSQL crashed during random forest training
>
> Thank you, we created a JIRA to investigate this
> https://issues.apache.org/jira/browse/MADLIB-1257
>
> On Tue, Jul 24, 2018 at 10:31 AM, LUYAO CHEN <luyao_chen@hotmail.com>
> wrote:
>
> Another observation: it also crashed with 84 groups and 73K instances. In
> this scenario I should have more than enough memory and disk.
>
> It also seems that, as the number of groups increases, it uses a lot of
> temporary disk space once the data exceeds a certain number of groups.
>
>
> Regards,
>
> ------------------------------
> *From:* LUYAO CHEN <luyao_chen@hotmail.com>
> *Sent:* Tuesday, July 24, 2018 9:15 AM
> *To:* user@madlib.apache.org
> *Subject:* Re: PostgreSQL crashed during random forest training
>
>
> Hi Frank,
>
>
> You may refer to the enclosed dump of the training table. I used the SQL
> below for random forest training.
>
>
> DROP TABLE IF EXISTS train_output, train_output_group,
> train_output_summary;
> SELECT madlib.forest_train('train_data',         -- source table
>                            'train_output',    -- output model table
>                            'rowid',              -- id column
>                            'positive',           -- response
>                            'features',   -- features
>                            NULL,              -- exclude columns
>                            'caseid',              -- grouping columns
>                            30::integer,       -- number of trees
>                            30::integer,        -- number of random features
>                            TRUE::boolean,     -- variable importance
>                            1::integer,        -- num_permutations
>                            10::integer,        -- max depth
>                            3::integer,        -- min split
>                            1::integer,        -- min bucket
>                            10::integer,        -- number of splits per continuous variable
>                            NULL,         -- null handling parameter
>                            TRUE          --   verbose
>                            );
>
> Regards,
> Luyao Chen
>
> ------------------------------
> *From:* Frank McQuillan <fmcquillan@pivotal.io>
> *Sent:* Monday, July 23, 2018 4:59 PM
> *To:* user@madlib.apache.org
> *Subject:* Re: PostgreSQL crashed during random forest training
>
> Hi Luyao Chen
>
> It's hard to debug just looking at that trace.
>
> 1) If you increase your data size to more than 56K instances in 56
> groups, does it work?  e.g., double it to approx 112K instances and 112
> groups.
>
> 2) Is it possible for you to share a sample of your data so that we
> could try?  If not, perhaps anonymize a sample of the data so that we can
> multiply it out to make it bigger?  Then we could take a closer look.
>
> Frank
>
> On Mon, Jul 23, 2018 at 12:34 PM, LUYAO CHEN <luyao_chen@hotmail.com>
> wrote:
>
> Dear user group,
>
>
> I ran into a problem when training random forest on grouped data (300
> features). Small data was fine (e.g., 56K instances in 56 groups), but
> training failed for 240K instances in 250 groups. Postgres forcibly
> disconnected the session after showing the message below in verbose mode:
>
>
> NOTICE:  view "__madlib_temp_60124179_1532371657_7130296__" will be a temporary view
> NOTICE:  sql_create_empty_result_table:
>
>             CREATE TABLE analysis.dx_rf_train_output_1 (
>                 gid         integer,
>                 sample_id   integer,
>                 tree        madlib.bytea8);
>
> NOTICE:  sql_refresh_training_pois_cnt:
>
>                             TRUNCATE TABLE __madlib_temp_91155016_1532371657_5660955__ CASCADE;
>                             INSERT INTO __madlib_temp_91155016_1532371657_5660955__
>                             SELECT
>                                 *,
>                                 madlib.poisson_random(1) AS poisson_count
>                             FROM
>                             (
>                                 SELECT
>                                     *,
>                                     0.::double precision AS __madlib_temp_14328459_1532371657_7318497__
>                                 FROM analysis.dxpredict_svec
>                             ) subq
>                             WHERE __madlib_temp_14328459_1532371657_7318497__ < 1
>
> NOTICE:
>                         src_cnt: 158360,
>                         oob_cnt: 92418,
>                         dup_cnt: 250617.
>
> NOTICE:  Started tree building for all groups
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> PostgreSQL did not capture a detailed log even after I set
> log_statement to "all":
> 2018-07-23 14:47:50.229 EDT [1090] LOG:  server process (PID 1980) was terminated by signal 11: Segmentation fault
> 2018-07-23 14:47:50.229 EDT [1090] DETAIL:  Failed process was running:
> SELECT madlib.forest_train('analysis.dxpredict_svec',
>                                    'analysis.dx_rf_train_output_1',
>                                    'rowid',
>                                    'positive',
>                                    '*',
>                                    'rowid,positive,case_icd',
>                                    'case_icd',
>                                    30::integer,
>                                    30::integer,
>                                    TRUE::boolean,
>                                    1::integer,
>                                    10::integer,
>                                    3::integer,
>                                    1::integer,
>                                    10::integer,
>                                    NULL,
>                                    TRUE
>                                    );
> 2018-07-23 14:47:50.229 EDT [1090] LOG:  terminating any other active server processes
> 2018-07-23 14:47:50.229 EDT [1401] WARNING:  terminating connection because of crash of another server process
>
>
>
>
>
>
>
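The dmesg line quoted in this thread gives an instruction pointer and the libmadlib.so mapping, which may help narrow down where the crash occurs. Below is a minimal Python sketch (not part of MADlib; the function name parse_segfault is hypothetical) that parses that line and computes the offset of the faulting instruction from the start of the mapping. It assumes the bracketed value is the mapping's start address followed by its size, as Linux x86 kernels usually print it.

```python
import re

# Kernel log line quoted in this thread, joined onto a single line.
DMESG_LINE = (
    "[ 4289.020198] postmaster[1840]: segfault at 0 ip 00007f17cd5f4ea3 "
    "sp 00007ffdf867dd50 error 4 in libmadlib.so[7f17cd2ec000+64a000]"
)

# Assumed layout: "segfault at ADDR ip IP sp SP error CODE in LIB[BASE+SIZE]",
# with all numeric fields printed in hexadecimal.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Extract the faulting library and the instruction offset within it."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("line does not look like a kernel segfault report")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    return {
        "library": m.group("lib"),
        "fault_address": int(m.group("addr"), 16),
        "offset": ip - base,  # distance of the instruction from the mapping start
        "inside_mapping": base <= ip < base + size,
    }


info = parse_segfault(DMESG_LINE)
print(info["library"], hex(info["offset"]))
```

For the line above this yields offset 0x308ea3 inside libmadlib.so, and the fault address of 0 is consistent with a null pointer access. If libmadlib.so was built with debug symbols, an offset like this can be given to a tool such as addr2line (with -e pointing at the library) to look for a source location; whether the offset maps directly to a file address depends on how the library was linked and mapped, so treat the result as a hint.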
