madlib-user mailing list archives

From Pietro Pugni <pietro.pu...@gmail.com>
Subject Re: Time dependent variables in Cox regression model
Date Tue, 01 Nov 2016 10:47:36 GMT
Hi there,
I’m sorry for the late reply; I have been very busy.
Thank you for the responses and for your interest in developing time-dependent Cox
regression.
I’m not able to help with the coding part, but I can give you some advice.

First, take a look at this document, which is about SAS (the enterprise counterpart
of R) and describes the counting process format needed for time-dependent analysis (page
7 of 10): http://support.sas.com/resources/papers/proceedings12/168-2012.pdf

The R vignette linked by Woo is another good place to look.

I suggest reading “Survival Analysis Using SAS - A Practical Guide - Second Edition - Paul
D. Allison - SAS Publishing”, ISBN 978-1-59994-640-5, in particular Chapter 5 starting from
page 153. It gives the relevant formulas and discusses the counting process method.

Generally, the non-counting-process method uses a longitudinal dataset with one row per
subject and a separate column for each time at which a variable changes. The counting
process format verticalizes this kind of data: each row represents a period of time during
which a subject’s covariates are constant. If a subject has more than one row, it means that
one or more covariates change between two adjacent rows. The interval length can vary from
row to row. So the basic information is: subject id, interval start time, interval stop
time, outcome (dichotomous), and a set of covariates.
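
To make the format concrete, here is a minimal Python sketch of the verticalization for a
single subject with one covariate. It is only an illustration and not MADlib or R code: the
function name, the field names (id, start, stop, event, x) and the (time, value) input
layout are my own assumptions.

```python
from typing import Dict, List, Tuple


def to_counting_process(
    subject_id: int,
    end_time: float,
    event: int,
    changes: List[Tuple[float, float]],
) -> List[Dict[str, float]]:
    """Split one subject's follow-up into rows with a constant covariate value.

    `changes` is a list of (time, value) pairs: the covariate takes `value`
    from `time` onward. The history must start at time 0 (an assumption of
    this sketch), and change times are assumed to be distinct.
    """
    changes = sorted(changes)
    if not changes or changes[0][0] != 0:
        raise ValueError("covariate history must start at time 0")

    rows = []
    for i, (start, value) in enumerate(changes):
        if start >= end_time:
            break  # changes after the end of follow-up are ignored
        # The interval ends at the next change or at the end of follow-up.
        stop = changes[i + 1][0] if i + 1 < len(changes) else end_time
        stop = min(stop, end_time)
        rows.append({
            "id": subject_id,
            "start": start,
            "stop": stop,
            # The outcome can only be recorded on the subject's last interval.
            "event": event if stop == end_time else 0,
            "x": value,
        })
    return rows


# Hypothetical subject: followed for 10 time units, event observed at the end,
# covariate x switches from 0.0 to 1.0 at time 4.
for row in to_counting_process(1, 10, 1, [(0, 0.0), (4, 1.0)]):
    print(row)
```

The subject ends up with two rows, one per period of constant x, and only the last row
carries the outcome.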

I took two screenshots from the R vignette (page 7 of
https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf). The first shows a
counting process dataset with 1 row per subject:

[screenshot not preserved in the archive]

and here’s the coxph() invocation:

[screenshot not preserved in the archive]

Here, cluster(id) specifies the subject clustering variable, Surv() builds the survival
response over the time interval (Start, Stop] for the outcome infect, while treat, inherit and
steroids are the time-dependent covariates.

The above example has only 1 row per subject, but as I said the counting process format
allows more than 1 row per subject. You can also build a dataset where each row represents
one day for each subject (this is very inefficient, but it is possible and works too). You can
have rows where nothing happens (no covariate changes from the previous row), etc. The
counting process format is very flexible.
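
As an illustration of the one-row-per-day variant, the following Python sketch splits a
single counting process row into consecutive daily rows. The function name and the field
names (id, start, stop, event, x) are hypothetical, and time is assumed to be measured in
days.

```python
def split_daily(row):
    """Split one start/stop row into consecutive one-day rows.

    Covariates are copied unchanged to every piece; the outcome is kept only
    on the last piece, because the event (if any) happens at the original
    stop time.
    """
    rows = []
    day = row["start"]
    while day < row["stop"]:
        nxt = min(day + 1, row["stop"])
        rows.append(dict(
            row,
            start=day,
            stop=nxt,
            event=row["event"] if nxt == row["stop"] else 0,
        ))
        day = nxt
    return rows


# Hypothetical row covering days 0 to 3 with an event at the end.
for piece in split_daily({"id": 1, "start": 0, "stop": 3, "event": 1, "x": 0.5}):
    print(piece)
```

The split rows describe the same follow-up as the original row, only with more rows to
store and process, which is why the daily layout is inefficient.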

From a performance point of view, I’ve seen poor results from the survival package. With
big cohort datasets (more than 1 million subjects and more than 1 year of follow-up)
the memory usage is massive and the time needed to estimate the model increases a lot.
The advantage of running the Cox model inside the database is probably memory management,
which is handled automatically by the DB. In many cases, R runs out of memory.

Hope this helps and sorry again for the late response.

I appreciate your work and your interest 
Thank you
 Pietro Pugni



> On 07 Oct 2016, at 00:29, Frank McQuillan <fmcquillan@pivotal.io> wrote:
> 
> Re-posting Woo's comments to the list since it bounced for him...
> 
> "Hi Pietro,
> 
> Many thanks for your comments and questions!  I agree that it would be great to see support
> for time-dependent effects in the MADlib coxph module.  I think it would be good to have items
> in the roadmap for 'time-dependent covariates' and also 'time-dependent coefficients', and
> I believe Frank has already started the process of creating stories for these features.  You've
> mentioned R's implementation, and I think R's survival package vignette
> <https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf>
> has some nice info on usage of these two flavors of time-dependent effects, which I believe
> will be good starting points for the team.
> 
> Hope this helps, and please do keep the feedback coming!
> 
> Thanks,
> Woo"
> 
> On Thu, Oct 6, 2016 at 2:40 PM, Woo Jae Jung <wjung@pivotal.io> wrote:
> Hi Pietro,
> 
> Many thanks for your comments and questions!  I agree that it would be great to see support
> for time-dependent effects in the MADlib coxph module.  I think it would be good to have items
> in the roadmap for 'time-dependent covariates' and also 'time-dependent coefficients', and
> I believe Frank has already started the process of creating stories for these features.  You've
> mentioned R's implementation, and I think R's survival package vignette
> <https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf>
> has some nice info on usage of these two flavors of time-dependent effects, which I believe
> will be good starting points for the team.
> 
> Hope this helps, and please do keep the feedback coming!
> 
> Thanks,
> Woo
> 
> 
> 
> 
> 
> 
> ---------- Forwarded message ----------
> From: Pietro Pugni <pietro.pugni@gmail.com>
> Date: Wed, Oct 5, 2016 at 11:10 AM
> Subject: Time dependent variables in Cox regression model
> To: user@madlib.incubator.apache.org
> 
> 
> Hi there,
> I just found this amazing library and was wondering if it’s possible to estimate a
> Cox model using time-dependent variables. I’m used to the survival and rms packages available
> in R. Those libraries ingest datasets built using the counting process method.
> From the docs (http://madlib.incubator.apache.org/docs/latest/group__grp__cox__prop__hazards.html)
> this doesn’t seem possible. Do you plan to add this feature in the future?
> 
> Thank you
>  Pietro Pugni
> 
> 
> 

