# load rstpm2
library(rstpm2)
# install and use the entjosR package
::install_github("entjos/entjosR")
remoteslibrary(entjosR)
Have you ever encountered a situation in which you wanted to test different degrees of freedom (DF) for a generalised survival model? I encounter this situation quite often and after writing for loops for looping through the different combinations of DFs for different spline functions, I decided to write a R function that does this for me instead.
In today’s blog post I would like to introduce you to fpm_test_dfs()
, which does exactly that. The function can be used to test combinations of DFs for different spline functions in a generalised survival model fitted with rstpm2::stpm()
. However, before discussing about the function, I would like to quickly introduce the generalised survival model modeled on the log cumulative hazard function (
rstpm2::stpm2()
function uses restricted cubic splines, which are natural cubic splines in which the first and second order derivatives of the spline function at the knots are restricted to be equal:
The number of knots and DFs for restricted cubic splines are related through
However, we cannot only use spline function for modelling the baseline hazard function, but also for modelling time-varying covariate effects. In this case the above model can be generalised for including
In the model above we need to specify a lot of different DFs one for the baseline hazard function and one for each time-varying covariate effect that we would like to include in our model. Finding the right combination of DFs for a model can be a challenging task. A common approach is to fit models with different DFs and compare their AIC or BIC value in order to identify the best fitting model. This is where fpm_test_dfs()
comes in handy. fpm_test_dfs()
allows you to specify which DFs you would like to test for each spline function included in your model and returns a data frame including the corresponding AIC and BIC values for each possible combination of DFs.
For this, fpm_test_dfs()
requires the following 5 arguments:
formula
: A formular describing the model that you would like to test. This formula will be passed directly torstpm2::stpm()
.dfs_bh
: An integer vector of DFs used for the spline function of the baseline hazard.dfs_tvc
: A list of named integer vectors of DFs used for the spline function of a time-varying covariate effect. The name of the vector should correspond to the name of the variable that should be included in the model using a time-varying effect.by_vars
: A character vector including the names of variables that should be used for stratifying the model, if needed.data
: A dataset which is passed torstpm2::stpm2()
.
Let’s test the function using the breast cancer dataset included in the rstpm2
package. For this, we first need to load the rstpm2
package and install the fpm_test_dfs()
function, which is included in the entjosR
package on GitHub.
Now that we installed the entjosR
package, we can use the fpm_test_dfs()
. We might for example be interested in fitting a crude model for estimating survival after breast cancer, e.g.,:
fpm_test_dfs()
as follows. Let’s for instance test one to ten DFs for the baselien hazard function.
# Test for best fitting baseline hazard function
<- fpm_test_dfs(Surv(rectime, censrec) ~ 1,
dfs_test_model1 dfs_bh = 1:10,
data = brcancer)
dfs_test_model1
df_bh df_tvc aic bic
1 1 0 5278.553 5287.614
2 2 0 5233.610 5247.203
3 3 0 5232.631 5250.754
4 4 0 5231.785 5254.440
5 5 0 5233.910 5261.095
6 6 0 5235.563 5267.279
7 7 0 5234.047 5270.294
8 8 0 5238.742 5279.520
9 9 0 5231.902 5277.210
10 10 0 5238.197 5288.037
We now got a dataframe with one row for each DF. However, what we are interested in is to find the best fitting model. For this we can pass the result of fpm_test_dfs()
to fpm_get_best_fit()
.
# Check which model fits best
fpm_get_best_fit(dfs_test_model1)
df_bh df_tvc aic bic
4 4 0 5231.785 5254.440
2 2 0 5233.610 5247.203
We now see that based on the AIC we would chose 4 DFs as the best fitting model and based on the BIC we would chose 2 DFs for the baseline hazard function as the best fitting model. Now you might think “Why would I need a function that does this for me when I can just write a simple for-loop”. I agree with you in principle, but adding more time-depending co variate effects to the model increases the complexity quite quickly. Let’s assume we would like to add the following two variables to our model and add time-varying effects for those:
x1
: Age at diagnosis in years,hormon
: Indicator for receiving hormonal therapy.
Our model would then formally look like this
For simplicity, let’s assume that we stick to 4 DFs for the baseline hazard function and test for different DFs for the time-varying effect of age and hormonal therapy.
# Test for best fitting spline fuction for time-varying effects of hormonal
# therapy and age at diagnosis
<- fpm_test_dfs(Surv(rectime, censrec) ~ hormon + x1,
dfs_test_model2 dfs_bh = 4,
dfs_tvc = list(hormon = 1:4,
x1 = 1:4),
data = brcancer)
dfs_test_model2
df_bh df_tvc_hormon df_tvc_x1 aic bic
1 4 1 1 5224.228 5265.006
2 4 2 1 5226.164 5271.472
3 4 3 1 5228.085 5277.924
4 4 4 1 5228.764 5283.134
5 4 1 2 5226.157 5271.465
6 4 2 2 5228.055 5277.895
7 4 3 2 5229.965 5284.336
8 4 4 2 5230.637 5289.539
9 4 1 3 5227.309 5277.149
10 4 2 3 5229.188 5283.559
11 4 3 3 5230.882 5289.784
12 4 4 3 5231.643 5295.075
13 4 1 4 5225.276 5279.646
14 4 2 4 5227.152 5286.054
15 4 3 4 5228.893 5292.326
16 4 4 4 5230.417 5298.380
We now receive a dataset with 16 rows, one row for each model that we tested as we tested for all different combinations of DFs for the two time-varying effects.
# Check which model fits best
fpm_get_best_fit(dfs_test_model2)
df_bh df_tvc_hormon df_tvc_x1 aic bic
1 4 1 1 5224.228 5265.006
2 4 1 1 5224.228 5265.006
Checking the best fitting model suggests that one DF is sufficient for modelling the time-varying effect of both hormonal therapy and age. However, one DF is equivalent to not using any knots for the spline function (see above). Hence, using one DF for the time-varying effect is equivalent to only using a linear interaction between log time and hormonal therapy, and age at diagnosis, respectively.
Another thing that we might want to do is to fit separate models for those that received hormonal therapy and those that did not, aka, stratifying by hormonal therapy. In this situation we need to test two separate models. We can easily do this by using the by_vars
option in fpm_test_dfs()
. Let’s assume that we stick with only one DF for the time-varying effect of age, but we would like to test for different DFs for the baseline hazard function in our two models.
# Test for best fitting spline fuction for the baseline hazard stratified
# by hormonal therapy use
<- fpm_test_dfs(Surv(rectime, censrec) ~ x1,
dfs_test_model3 dfs_bh = 1:5,
dfs_tvc = list(x1 = 1),
by_vars = "hormon",
data = brcancer)
dfs_test_model3
data$filter_vars: 0
df_bh dfs_tvc_x1 aic bic
1 1 1 3569.536 3585.883
2 2 1 3535.832 3556.266
3 3 1 3536.267 3560.788
4 4 1 3537.851 3566.458
5 5 1 3539.667 3572.361
------------------------------------------------------------
data$filter_vars: 1
df_bh dfs_tvc_x1 aic bic
1 1 1 1707.547 1721.569
2 2 1 1696.212 1713.739
3 3 1 1695.230 1716.262
4 4 1 1695.250 1719.787
5 5 1 1696.354 1724.397
We know receive a list of dataframes as output of fpm_test_dfs()
. The first dataframe includes the AIC and BIC values for the model for those that did not receive hormonal therapy (hormon == 0
), and the second dataframe include those values for patient that did receive hormonal therapy (hormon == 1
). We can also pass this list to fpm_get_best_fit()
to get the two best fitting models.
# Check which models fit best
fpm_get_best_fit(dfs_test_model3)
$`0`
df_bh dfs_tvc_x1 aic bic
2 2 1 3535.832 3556.266
21 2 1 3535.832 3556.266
$`1`
df_bh dfs_tvc_x1 aic bic
3 3 1 1695.230 1716.262
2 2 1 1696.212 1713.739
The results shows that among those that did not receive hormonal therapy 2 DFs are sufficient for modelling the baseline hazard function according to both AIC and BIC. However, among those that did receive hormonal therapy the best fitting model according to the AIC value includes 3 DFs for the baseline hazard function whereas the best fitting model according to the BIC includes 2 DFs for the baseline hazard function.
I hope this function might be of use for you and please let me know if you should have any feedback or suggestions for it.