madlib-user mailing list archives

From Frank McQuillan <fmcquil...@pivotal.io>
Subject Re: Time dependent variables in Cox regression model
Date Tue, 15 Nov 2016 18:48:23 GMT
Thanks for opening the JIRA, Pietro.

It would be great if a developer in the MADlib community could work on this
since it is a valuable feature.

Frank

On Tue, Nov 15, 2016 at 3:46 AM, Pietro Pugni <pietro.pugni@gmail.com>
wrote:

> Hi all,
> I opened a JIRA on this topic,
> https://issues.apache.org/jira/browse/MADLIB-1040, as suggested by Raul
> Iyer, but I can’t help with development. I will be happy to do some testing
> if needed. I usually work with very big cohorts (potentially 4 to 10
> million subjects followed for 1 to 9 years). This means potentially
> billions of person-days in total.
>
> Thank you
>  Pietro
>
> On 1 Nov 2016, at 11:56, Pietro Pugni <pietro.pugni@gmail.com> wrote:
>
> I’m sorry, the dataset from the R vignette isn’t 1 row per subject. I
> mistook the dataframe row id for the subject id.
> In that dataframe subject 1 has three rows while subject 2 has two
> rows. It’s quite intuitive, but:
>  - subject 1 got infected on day 219 and day 373; his follow-up time ends
> at day 414 with no infection
>  - for subject 2 the R vignette prints only 2 rows, but according to the
> initial description he got 7 infections
> So, the counting process format is a way to represent changes over time for
> each subject.
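
The three rows for subject 1 can be reproduced with a small sketch. This is only an illustration in Python, not code from MADlib or from the R survival package; the function name to_counting_process and the assumption that follow-up starts at day 0 are mine.

```python
def to_counting_process(subject_id, event_days, end_of_followup):
    """Build (id, start, stop, event) rows for one subject with recurrent events.

    Assumption: follow-up starts at day 0 and every event day is at or
    before end_of_followup.
    """
    rows = []
    start = 0
    for day in sorted(event_days):
        # Each interval that ends with an infection gets event = 1.
        rows.append((subject_id, start, day, 1))
        start = day
    if end_of_followup > start:
        # Final interval: followed until the end with no further event.
        rows.append((subject_id, start, end_of_followup, 0))
    return rows


# Subject 1 from the message above: infections on days 219 and 373,
# follow-up ends on day 414 without a further infection.
subject_1 = to_counting_process(1, [219, 373], 414)
for row in subject_1:
    print(row)
```

Each infection closes one interval and opens the next, so two infections plus an event-free tail give three rows.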
>
>
> On 1 Nov 2016, at 11:47, Pietro Pugni <pietro.pugni@gmail.com> wrote:
>
> Hi there,
> I’m sorry for being so late, but I was very busy.
> Thank you for the responses and for your interest in the development
> process of time-dependent Cox.
> I’m not able to help you with the coding part, but I can give you some
> advice.
>
> First, take a look at this document, which is related to SAS (the
> enterprise counterpart of R) and talks about the counting process format
> needed for time-dependent analysis (page 7 of 10):
> http://support.sas.com/resources/papers/proceedings12/168-2012.pdf
>
> The R vignette linked by Woo is another good place to look.
>
> I suggest reading “Survival Analysis Using SAS: A Practical Guide, Second
> Edition” by Paul D. Allison (SAS Publishing), ISBN 978-1-59994-640-5,
> in particular Chapter 5, starting from page 153. It contains the formulas
> and related material, and it discusses the counting process method.
>
> Generally, the non-counting process method uses a longitudinal dataset with
> one row per subject and a separate column for each change over time in each
> variable. The counting process format verticalizes this kind of data: each
> row represents a period of time for a subject during which all covariates
> are constant. If a subject has more than one row, it means that one or more
> covariates change between two adjacent rows. The time interval length can
> vary from row to row. So, the basic information is: subject id, interval
> start time, interval stop time, outcome (dichotomous), and a set of
> covariates.
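
As a rough sketch of that verticalization, the Python below splits one subject's follow-up at every day on which a covariate changes. The function name split_at_changes, the covariate names (treated, dose) and all values are hypothetical and serve only to show the row layout; they do not come from any dataset mentioned in this thread.

```python
def split_at_changes(subject_id, followup_end, event, baseline, changes):
    """Return counting-process rows for one subject.

    baseline: covariate values at day 0.
    changes:  {day: {covariate: new_value}}; each day must lie strictly
              inside the follow-up (assumption made for this sketch).
    event:    1 if the outcome occurs at followup_end, else 0.
    """
    rows = []
    covariates = dict(baseline)
    start = 0
    for day in sorted(changes):
        if not 0 < day < followup_end:
            raise ValueError("change days must fall strictly inside the follow-up")
        # Covariates were constant on this interval; no outcome yet.
        rows.append({"id": subject_id, "start": start, "stop": day,
                     "event": 0, **covariates})
        covariates.update(changes[day])
        start = day
    # Only the last interval can carry the outcome.
    rows.append({"id": subject_id, "start": start, "stop": followup_end,
                 "event": event, **covariates})
    return rows


# Hypothetical subject: treatment starts on day 90, dose is raised on day 200,
# and the outcome occurs on day 300.
rows = split_at_changes(
    subject_id=7, followup_end=300, event=1,
    baseline={"treated": 0, "dose": 0},
    changes={90: {"treated": 1, "dose": 10}, 200: {"dose": 20}},
)
for row in rows:
    print(row)
```

Two change days produce three rows of different lengths (90, 110 and 100 days), and each row carries the five pieces of information listed above.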
>
> I took two screenshots from the R vignette showing a counting process
> dataset with 1 row per subject (page 7 of
> https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf ):
>
> <coxph_counting process dataframe.png>
>
> and here’s the coxph() invocation:
>
> <coxph_time dependent covariates estimates.png>
>
>
> Here, cluster(id) specifies the subject clustering variable, Surv() builds
> the survival response over the time interval (start, stop] for the outcome
> infect, while treat, inherit and steroids are the covariates.
>
> The above example has only 1 row per subject, but as I said the counting
> process usually involves more than 1 row per subject. You can also build a
> dataset where each row represents one day for one subject (this is very
> inefficient, but it is possible and works too). You can have rows where
> nothing happens (no covariate changes value), etc. The counting process
> format is very flexible.
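
To give a feel for why the one-row-per-day layout is inefficient, here is a minimal Python sketch that expands interval rows into daily rows. The function name expand_to_daily and the example rows (a single covariate, dose, for a made-up subject) are hypothetical.

```python
def expand_to_daily(rows):
    """Split each (id, start, stop, event, dose) row into one-day rows.

    The event flag is kept only on the last day of the original interval,
    so the total number of events is unchanged.
    """
    daily = []
    for subject_id, start, stop, event, dose in rows:
        for day in range(start, stop):
            is_last_day = (day + 1 == stop)
            daily.append((subject_id, day, day + 1,
                          event if is_last_day else 0, dose))
    return daily


# Hypothetical subject with three intervals covering 300 days of follow-up.
intervals = [
    (7, 0, 90, 0, 0),
    (7, 90, 200, 0, 10),
    (7, 200, 300, 1, 20),
]
daily = expand_to_daily(intervals)
print(len(intervals), "interval rows become", len(daily), "daily rows")
```

The daily table holds the same information as the three interval rows but needs one row per person-day, which is what makes it so costly for large cohorts.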
>
> From a performance point of view, I’ve seen poor results from the survival
> package. With big cohort datasets (a lot of subjects, more than 1 million,
> and more than 1 year of follow-up) the memory usage is massive and the
> time needed to estimate the model increases a lot. The advantage of
> running the Cox model inside the database is probably memory management,
> which is handled automatically by the DB. In many cases, R runs out of
> memory.
>
> Hope this helps and sorry again for the late response.
>
> I appreciate your work and your interest
> Thank you
>  Pietro Pugni
>
>
>
> On 7 Oct 2016, at 00:29, Frank McQuillan <fmcquillan@pivotal.io> wrote:
>
> Re-posting Woo's comments to the list since it bounced for him...
>
> "Hi Pietro,
>
> Many thanks for your comments and questions!  I agree that it would be
> great to see support for time-dependent effects in the MADlib coxph
> module.  I think it would be good to have items in the roadmap for
> 'time-dependent covariates' and also 'time-dependent coefficients', and I
> believe Frank has already started the process of creating stories for these
> features.  You've mentioned R's implementation, and I think R's survival
> package vignette
> <https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf> has
> some nice info on usage of these two flavors of time-dependent effects,
> which I believe will be good starting points for the team.
>
> Hope this helps, and please do keep the feedback coming!
>
> Thanks,
> Woo"
>
> On Thu, Oct 6, 2016 at 2:40 PM, Woo Jae Jung <wjung@pivotal.io> wrote:
>
>> Hi Pietro,
>>
>> Many thanks for your comments and questions!  I agree that it would be
>> great to see support for time-dependent effects in the MADlib coxph
>> module.  I think it would be good to have items in the roadmap for
>> 'time-dependent covariates' and also 'time-dependent coefficients', and I
>> believe Frank has already started the process of creating stories for these
>> features.  You've mentioned R's implementation, and I think R's survival
>> package vignette
>> <https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf> has
>> some nice info on usage of these two flavors of time-dependent effects,
>> which I believe will be good starting points for the team.
>>
>> Hope this helps, and please do keep the feedback coming!
>>
>> Thanks,
>> Woo
>>
>>
>>
>>
>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Pietro Pugni <pietro.pugni@gmail.com>
>>> Date: Wed, Oct 5, 2016 at 11:10 AM
>>> Subject: Time dependent variables in Cox regression model
>>> To: user@madlib.incubator.apache.org
>>>
>>>
>>> Hi there,
>>> I just found this amazing library and was wondering if it’s possible to
>>> estimate a Cox model using time-dependent variables. I’m used to the
>>> survival and rms packages available in R. Those libraries ingest datasets
>>> built using the counting process method.
>>> From the docs
>>> http://madlib.incubator.apache.org/docs/latest/group__grp__cox__prop__hazards.html
>>> this doesn’t seem possible. Do you plan to add this feature in the
>>> future?
>>>
>>> Thank you
>>>  Pietro Pugni
>>>
>>>
>>
>
>
>
>
