madlib-user mailing list archives

From LUYAO CHEN <luyao_c...@hotmail.com>
Subject Re: PostgreSQL crashed during random forest training
Date Fri, 27 Jul 2018 14:23:53 GMT
A similar problem occurred with decision tree training (using the same data set).

dmesg shows this error:

[ 4289.020198] postmaster[1840]: segfault at 0 ip 00007f17cd5f4ea3 sp 00007ffdf867dd50 error 4 in libmadlib.so[7f17cd2ec000+64a000]
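
A minimal Python sketch for reading that dmesg line, assuming the usual Linux x86
"segfault at ADDR ip IP sp SP error CODE in LIB[BASE+SIZE]" layout with all numbers in hex.
The function name decode_segfault and the dictionary keys are made up for illustration. It
computes the offset of the faulting instruction inside libmadlib.so and decodes the low
bits of the page-fault error code:

```python
import re

# Kernel segfault report copied from the dmesg output quoted above.
DMESG_LINE = (
    "postmaster[1840]: segfault at 0 ip 00007f17cd5f4ea3 "
    "sp 00007ffdf867dd50 error 4 in libmadlib.so[7f17cd2ec000+64a000]"
)

# Assumed layout of an x86 Linux segfault line; every number is hexadecimal.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Hypothetical helper: split a segfault line into its parts."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("line does not look like an x86 segfault report")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    err = int(m["err"], 16)
    offset = ip - base  # position of the faulting instruction in the mapping
    return {
        "library": m["lib"],
        "fault_address": int(m["addr"], 16),
        "offset": offset,
        "inside_mapping": 0 <= offset < size,
        # x86 page-fault error code bits: 1 = page present, 2 = write, 4 = user mode
        "page_present": bool(err & 1),
        "write_access": bool(err & 2),
        "user_mode": bool(err & 4),
    }


info = decode_segfault(DMESG_LINE)
print(f"{info['library']} + {info['offset']:#x}")  # libmadlib.so + 0x308ea3
```

Read this way, the line describes a user-mode read of address 0 from code inside
libmadlib.so, which is what a NULL pointer dereference looks like. If the library was built
with debug information, and assuming the mapping reported by the kernel starts at the
library's load address, the offset can be passed to a symbolizer such as
addr2line -e <path to libmadlib.so> 0x308ea3 (the path depends on the installation).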




Regards,
Luyao Chen

________________________________
From: Frank McQuillan <fmcquillan@pivotal.io>
Sent: Tuesday, July 24, 2018 2:13 PM
To: user@madlib.apache.org
Subject: Re: PostgreSQL crashed during random forest training

Thank you, we created a JIRA to investigate this
https://issues.apache.org/jira/browse/MADLIB-1257

On Tue, Jul 24, 2018 at 10:31 AM, LUYAO CHEN <luyao_chen@hotmail.com> wrote:

Another observation: it also crashed with 84 groups and 73K instances. In this scenario I
should have plenty of memory and disk space.

It also seems that, as the number of groups increases, training uses a lot of temporary
disk space once the data exceeds a certain number of groups.


Regards,

________________________________
From: LUYAO CHEN <luyao_chen@hotmail.com>
Sent: Tuesday, July 24, 2018 9:15 AM
To: user@madlib.apache.org
Subject: Re: PostgreSQL crashed during random forest training


Hi Frank,


Please see the enclosed dump of the training table. I used the SQL below to train the
random forest.


DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
SELECT madlib.forest_train('train_data',      -- source table
                           'train_output',    -- output model table
                           'rowid',           -- id column
                           'positive',        -- response
                           'features',        -- features
                           NULL,              -- exclude columns
                           'caseid',          -- grouping columns
                           30::integer,       -- number of trees
                           30::integer,       -- number of random features
                           TRUE::boolean,     -- variable importance
                           1::integer,        -- num_permutations
                           10::integer,       -- max depth
                           3::integer,        -- min split
                           1::integer,        -- min bucket
                           10::integer,       -- number of splits per continuous variable
                           NULL,              -- null handling parameter
                           TRUE               -- verbose
                           );


Regards,
Luyao Chen

________________________________
From: Frank McQuillan <fmcquillan@pivotal.io>
Sent: Monday, July 23, 2018 4:59 PM
To: user@madlib.apache.org
Subject: Re: PostgreSQL crashed during random forest training

Hi Luyao Chen

It's hard to debug just looking at that trace.

1) If you increase your data size to more than 56K instances in 56 groups, does it work?
For example, double it to approximately 112K instances in 112 groups.

2) Would it be possible for you to share a sample of your data so that we could try it?
If not, perhaps you could anonymize a sample of the data so that we can multiply it out to
make it bigger. Then we could take a closer look.

Frank

On Mon, Jul 23, 2018 at 12:34 PM, LUYAO CHEN <luyao_chen@hotmail.com> wrote:

Dear user group,


I ran into a problem when training a random forest on grouped data (300 features). A small
data set worked fine (e.g., 56K instances in 56 groups), but training failed for 240K
instances in 250 groups. PostgreSQL dropped the session after showing the messages below in
verbose mode:


NOTICE:  view "__madlib_temp_60124179_1532371657_7130296__" will be a temporary view
NOTICE:  sql_create_empty_result_table:

            CREATE TABLE analysis.dx_rf_train_output_1 (
                gid         integer,
                sample_id   integer,
                tree        madlib.bytea8);

NOTICE:  sql_refresh_training_pois_cnt:

                            TRUNCATE TABLE __madlib_temp_91155016_1532371657_5660955__ CASCADE;
                            INSERT INTO __madlib_temp_91155016_1532371657_5660955__
                            SELECT
                                *,
                                madlib.poisson_random(1) AS poisson_count
                            FROM
                            (
                                SELECT
                                    *,
                                    0.::double precision AS __madlib_temp_14328459_1532371657_7318497__
                                FROM analysis.dxpredict_svec
                            ) subq
                            WHERE __madlib_temp_14328459_1532371657_7318497__ < 1

NOTICE:
                        src_cnt: 158360,
                        oob_cnt: 92418,
                        dup_cnt: 250617.

NOTICE:  Started tree building for all groups
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.


The PostgreSQL log did not capture any further detail, even after I set log_statement to "all":

2018-07-23 14:47:50.229 EDT [1090] LOG:  server process (PID 1980) was terminated by signal 11: Segmentation fault
2018-07-23 14:47:50.229 EDT [1090] DETAIL:  Failed process was running: SELECT madlib.forest_train('analysis.dxpredict_svec',
                                   'analysis.dx_rf_train_output_1',
                                   'rowid',
                                   'positive',
                                   '*',
                                   'rowid,positive,case_icd',
                                   'case_icd',
                                   30::integer,
                                   30::integer,
                                   TRUE::boolean,
                                   1::integer,
                                   10::integer,
                                   3::integer,
                                   1::integer,
                                   10::integer,
                                   NULL,
                                   TRUE
                                   );
2018-07-23 14:47:50.229 EDT [1090] LOG:  terminating any other active server processes
2018-07-23 14:47:50.229 EDT [1401] WARNING:  terminating connection because of crash of another server process






