madlib-user mailing list archives

From Pietro Pugni <pietro.pu...@gmail.com>
Subject Re: Time dependent variables in Cox regression model
Date Tue, 01 Nov 2016 10:56:07 GMT
I’m sorry, the dataset from the R vignette isn’t 1 row per subject. I mistook the
dataframe row id for the subject id.
In that dataframe subject 1 has three rows while subject 2 has two rows. It’s quite
intuitive, but:
 - subject 1 got infected on day 219 and on day 373; his follow-up ends at day 414 with
no further infection
 - for subject 2 the R vignette prints only 2 rows, but according to the initial description he had
7 infections
So, the counting process format is a way to represent changes over time for each subject.
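
To make the layout concrete, here is a minimal Python sketch that builds the three rows for subject 1 from the days listed above. It is only an illustration: the names Interval and to_counting_process are hypothetical (they come from neither R nor MADlib), and it assumes follow-up starts at day 0.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    """One counting-process row: a period during which nothing changes."""
    subject_id: int
    start: int  # day the interval begins
    stop: int   # day the interval ends
    event: int  # 1 if an infection is recorded at `stop`, 0 if censored


def to_counting_process(subject_id: int, event_days: List[int],
                        end_of_followup: int) -> List[Interval]:
    """Split one subject's follow-up into consecutive intervals.

    Assumption: follow-up starts at day 0. Each event closes one interval
    and opens the next; a final censored row covers the remaining time.
    """
    rows = []
    start = 0
    for day in sorted(event_days):
        rows.append(Interval(subject_id, start, day, 1))
        start = day
    if start < end_of_followup:
        rows.append(Interval(subject_id, start, end_of_followup, 0))
    return rows


# Subject 1: infections on day 219 and day 373, follow-up ends at day 414.
rows = to_counting_process(1, [219, 373], 414)
for row in rows:
    print(row)
```

Each row carries the basic fields of the format (subject id, start, stop, outcome); covariate columns would simply be added to every row with the values valid during that interval.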


> On 01 Nov 2016, at 11:47, Pietro Pugni <pietro.pugni@gmail.com> wrote:
> 
> Hi there,
> I’m sorry for being so late, but I was very busy.
> Thank you for the responses and for your interest in developing time-dependent
Cox regression.
> I’m not able to help with the coding part, but I can give you some advice.
> 
> First, take a look at this document, which covers SAS (the enterprise counterpart
of R) and describes the counting process format needed for time-dependent analysis (page
7 of 10): http://support.sas.com/resources/papers/proceedings12/168-2012.pdf
> 
> The R vignette linked by Woo is another good place to look.
> 
> I suggest reading “Survival Analysis Using SAS: A Practical Guide, Second Edition”
by Paul D. Allison (SAS Publishing, ISBN 978-1-59994-640-5), in particular Chapter 5 starting
from page 153. It gives the formulas and related material, and it covers the counting
process method.
> 
> Generally, the non-counting-process method uses a longitudinal (wide) dataset with one column
for each time at which a variable changes. The counting process format verticalizes this kind of data:
each row represents a period of time during which a subject’s covariates are constant. If a subject has several rows,
it means that one or more covariates change between two adjacent rows. The interval
length can vary from row to row. So, the basic fields are: subject id, interval start time,
interval stop time, outcome (dichotomous), and a set of covariates.
> 
> I took two screenshots from the R vignette showing a counting process dataset
with 1 row per subject (page 7 of https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf):
> 
> <coxph_counting process dataframe.png>
> 
> and here’s the coxph() invocation:
> 
> <coxph_time dependent covariates estimates.png>
> 
> 
> Here, cluster(id) specifies the subject clustering variable, Surv() builds the survival response
over the time interval (Start, Stop] for the outcome infect, while treat, inherit and
steroids are the time-dependent covariates.
> 
> The above example has only 1 row per subject, but as I said the counting process format generally involves
more than 1 row per subject. You can also build a dataset where each row represents one day for
each subject (this is very inefficient, but it is possible and works too). You can have rows
where nothing happens (no covariate changes from the previous row), etc. The counting process format
is very flexible.
> 
> From a performance point of view, I’ve seen poor results from the survival package.
With big cohort datasets (more than 1 million subjects and more than 1 year of follow-up)
memory usage is massive and the time needed to estimate the model increases a lot.
The advantage of running the Cox model inside the database is probably memory management,
which is handled automatically by the DB. In many cases, R runs out of memory.
> 
> Hope this helps and sorry again for the late response.
> 
> I appreciate your work and your interest 
> Thank you
>  Pietro Pugni
> 
> 
> 
>> On 07 Oct 2016, at 00:29, Frank McQuillan <fmcquillan@pivotal.io> wrote:
>> 
>> Re-posting Woo's comments to the list since it bounced for him...
>> 
>> "Hi Pietro,
>> 
>> Many thanks for your comments and questions!  I agree that it would be great to see
support for time-dependent effects in the MADlib coxph module.  I think it would be good to
have items in the roadmap for 'time-dependent covariates' and also 'time-dependent coefficients',
and I believe Frank has already started the process of creating stories for these features.
 You've mentioned R's implementation, and I think R's survival package vignette <https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf>
has some nice info on usage of these two flavors of time-dependent effects, which I believe
will be good starting points for the team.  
>> 
>> Hope this helps, and please do keep the feedback coming!
>> 
>> Thanks,
>> Woo"
>> 
>> On Thu, Oct 6, 2016 at 2:40 PM, Woo Jae Jung <wjung@pivotal.io> wrote:
>> Hi Pietro,
>> 
>> Many thanks for your comments and questions!  I agree that it would be great to see
support for time-dependent effects in the MADlib coxph module.  I think it would be good to
have items in the roadmap for 'time-dependent covariates' and also 'time-dependent coefficients',
and I believe Frank has already started the process of creating stories for these features.
 You've mentioned R's implementation, and I think R's survival package vignette <https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf>
has some nice info on usage of these two flavors of time-dependent effects, which I believe
will be good starting points for the team.  
>> 
>> Hope this helps, and please do keep the feedback coming!
>> 
>> Thanks,
>> Woo
>> 
>> 
>> 
>> 
>> 
>> 
>> ---------- Forwarded message ----------
>> From: Pietro Pugni <pietro.pugni@gmail.com>
>> Date: Wed, Oct 5, 2016 at 11:10 AM
>> Subject: Time dependent variables in Cox regression model
>> To: user@madlib.incubator.apache.org
>> 
>> 
>> Hi there,
>> I just found this amazing library and was wondering if it’s possible to estimate
a Cox model using time-dependent variables. I’m used to the survival and rms packages available
in R. Those libraries ingest datasets built using the counting process method.
>> From the docs http://madlib.incubator.apache.org/docs/latest/group__grp__cox__prop__hazards.html
this doesn’t seem possible. Do you plan to add this feature in the future?
>> 
>> Thank you
>>  Pietro Pugni
>> 
>> 
>> 
> 

