Optimisation of a Logistic Regression Model using Optimx in R

This blog post explains how one can manually optimise a logistic regression model using maximum likelihood estimation (MLE) in R, without the use of pre-defined model-fitting functions.
Optimisation
R
Author

Joshua Philipp Entrop

Published

May 27, 2020


In my last post I used the optim() command to optimise a linear regression model. In this post, I am going to take that approach a little further and optimise a logistic regression model in the same manner. Thanks to John C. Nash, I got a first glimpse into the world of optimisation functions in R. His book showed me how important it is to compare the results of different optimisation algorithms. Where I used the base optimisation function optim() in my last post, I will use optimx() from the optimx package in this post. The optimx package and function were developed by Nash and colleagues as a wrapper around the stats::optim() function. There are numerous advantages in using optimx() instead of optim(); in my opinion, one of the most important is that it makes it easier to compare different optimisation methods. If you are more interested in the variety of optimisation functions and the problems that come with them, I can warmly recommend John C. Nash’s book Nonlinear Parameter Optimisation Using R Tools.

However, coming back to my main focus: the optimisation of a logistic regression model using the optimx() function in R. For this, I would like to use the icu data set from the package aplore3. The data set contains data from 200 patients in an intensive care unit (ICU) and provides information on whether each patient survived their stay or died. The particular question I would like to look at is whether the probability of dying during the ICU stay \(P(Y = 1)\) is related to age \((x_1)\) and sex \((x_2)\). To do so, I first load the data set and set up the variables:

# 1. Prefix -------------------------------------------------------------------

# Remove all objects from the workspace
rm(list = ls())

# Loading packages
require(aplore3)
require(optimx)
require(numDeriv)
require(dplyr)

# 2. Loading dataset ----------------------------------------------------------

#Reading the example data set icu from the package aplore3
icu <- as.data.frame(icu)
icu$sta_n <- ifelse(icu$sta == "Died", 1, 0)
icu$female <- ifelse(icu$gender == "Female", 1, 0)
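
Before moving on, it does no harm to verify that the recoding worked as intended. A quick optional check (not part of the original code) is to cross-tabulate the new indicator variables against the original factors:

# Optional check of the recoding
table(icu$sta, icu$sta_n)
table(icu$gender, icu$female)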

The specific model that I would like to use is:

\[ P(Y|x_1, x_2) \sim \alpha + \beta_1 x_1 + \beta_2 x_2\] Using the logit link function, we can express this model as an equation that is linear in the parameters:

\[ \ln \Bigg[ \frac{P(Y|x_1, x_2)}{1 - P(Y|x_1, x_2)} \Bigg] = \alpha + \beta_1 x_1 + \beta_2 x_2\]

For this model, I would like to find the values of \(\alpha\), \(\beta_1\) and \(\beta_2\) that maximise the log-likelihood function and hence provide the best fit to the empirical data in the icu data set. To do this, we need to define the log-likelihood function for our logistic regression model. According to Hosmer and Lemeshow, the log-likelihood function for a logistic regression model can be defined as

\[ \sum_{i = 1}^{n} \Big( y_i \ln(\pi_i) + (1 - y_i) \ln(1 - \pi_i) \Big), \] where \(\pi\) is defined using the sigmoid function as

\[ P(Y|x_1, x_2) = \pi = \frac{\exp(\alpha + \beta_1 x_1 + \beta_2 x_2)}{1 + \exp(\alpha + \beta_1 x_1 + \beta_2 x_2)}. \]
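
As a side note, base R already provides this sigmoid function and its inverse, the logit, as plogis() and qlogis(). A small sketch showing that they match the formulas above:

# plogis() is the sigmoid, qlogis() the logit (both base R)
lp <- c(-2, 0, 2)                                # some example values of the linear predictor
all.equal(plogis(lp), exp(lp) / (1 + exp(lp)))   # sigmoid
all.equal(qlogis(plogis(lp)), lp)                # logit inverts the sigmoid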

For the optimisation in R we need to define the log-likelihood function as a function in R. Additionally, we need to add the constraint \(0 < \pi < 1\) to our log-likelihood function, since we are interested in a probability \(\pi\), which needs to lie between \(0\) and \(1\). We can use an if statement in R to include this constraint in our function. For all parameter values that yield a value of \(\pi\) outside these bounds, we return a very high number, for instance \(10^{200}\). Because of these high values, the optimisation algorithm will dismiss such parameter values as candidate solutions. Note that we calculate -sum(), i.e. the negative log-likelihood, since the optimisation routines minimise by default, and minimising the negative log-likelihood is equivalent to maximising the log-likelihood.

# 3. Define log-likelihood function for logistic regression model -------------
# (see applied logistic regression)
negll <- function(par){
  
  #Extract guesses for alpha, beta1 and beta2
  alpha <- par[1]
  beta1 <- par[2]
  beta2 <- par[3]
  
  #Define dependent and independent variables
  y  <- icu$sta_n
  x1 <- icu$age
  x2 <- icu$female
  
  #Calculate pi and xb
  xb <- alpha + beta1 * x1 + beta2 * x2
  pi <- exp(xb) / (1 + exp(xb))
  
  #Return a high value if pi is outside (0, 1)
  if(any(pi > 1) | any(pi < 0)) {
    val <- 1e+200
  } else {
    val <- -sum(y * log(pi) + (1 - y) * log(1 - pi))
  }
  val
}
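
A quick way to sanity-check this function is to evaluate it at the starting values \((0, 0, 0)\): there the linear predictor is \(0\) and \(\pi = 0.5\) for every patient, so the negative log-likelihood should equal \(200 \cdot \ln(2) \approx 138.63\), regardless of the observed outcomes.

# Sanity check: at (0, 0, 0) every pi equals 0.5,
# so the negative log-likelihood is n * log(2)
negll(c(0, 0, 0))
200 * log(2)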

In addition to our log-likelihood function, it is also useful to specify its gradient function. This is not necessary for all optimisation algorithms, but it improves the testing for convergence and hence the estimates we obtain. According to Hosmer and Lemeshow, the gradient of the log-likelihood function is defined as

\[ g_\alpha(\pi) = \sum(y_i - \pi_i) \] \[ g_{x_j}(\pi) = \sum(x_{ji} * (y_i - \pi_i)). \]

In our case this yields three gradient functions

\[ g_\alpha(\pi) = \sum(y_i - \pi_i) \] \[ g_{x_1}(\pi) = \sum(x_{1_i} * (y_i - \pi_i)) \]

\[ g_{x_2}(\pi) = \sum(x_{2_i} * (y_i - \pi_i)). \]
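
For completeness, here is a brief sketch of where these expressions come from. Writing the linear predictor as \(\eta_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i}\), the log-likelihood above simplifies to \(\sum_i \big( y_i \eta_i - \ln(1 + \exp(\eta_i)) \big)\), and differentiating with respect to \(\beta_j\) gives

\[ \frac{\partial}{\partial \beta_j} \sum_{i = 1}^{n} \Big( y_i \eta_i - \ln(1 + \exp(\eta_i)) \Big) = \sum_{i = 1}^{n} x_{ji} \Big( y_i - \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} \Big) = \sum_{i = 1}^{n} x_{ji} (y_i - \pi_i), \]

with \(x_{ji} = 1\) for the intercept \(\alpha\).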

We can then use these three functions to calculate the gradients in R. Here, too, we need to use -sum(), since we work with the negative log-likelihood.

# 4. Define gradient function for logistic regression model -------------------
# (see applied logistic regression)
negll.grad <- function(par){
  
  #Extract guesses for alpha, beta1 and beta2
  alpha <- par[1]
  beta1 <- par[2]
  beta2 <- par[3]
  
  #Define dependent and independent variables
  y  <- icu$sta_n
  x1 <- icu$age
  x2 <- icu$female
  
  #Create output vector
  n <- length(par)
  gg <- as.vector(rep(0, n))
  
  #Calculate pi and xb
  xb <- alpha + beta1 * x1 + beta2 * x2
  pi <- exp(xb) / (1 + exp(xb))
  
  #Calculate gradients for alpha, beta1 and beta2
  gg[1] <- -sum(y - pi)
  gg[2] <- -sum(x1 * (y - pi))
  gg[3] <- -sum(x2 * (y - pi))
  
  return(gg)
}

R also provides functions to compute a numerical approximation of the gradient. One of these functions is grad() from the numDeriv package. It is useful to double-check your analytic gradient function against such a numerical approximation. Since optimx() itself uses the grad() function for this purpose, we are also going to use it here.

# 4.1 Compare gradient function with numeric approximation of gradient ========
# compare gradient at 0, 0, 0
mygrad <- negll.grad(c(0, 0, 0))
numgrad <- grad(x = c(0, 0, 0), func = negll)

all.equal(mygrad, numgrad)
[1] TRUE
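
The point \((0, 0, 0)\) is somewhat special, so it does no harm to repeat the comparison at another, arbitrary set of parameter values:

# Compare gradients at a second, arbitrary point
mygrad2  <- negll.grad(c(-3, 0.03, -0.01))
numgrad2 <- grad(x = c(-3, 0.03, -0.01), func = negll)
all.equal(mygrad2, numgrad2)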

We see that the results from our analytic gradient function are identical to the results from the grad() function. So we can proceed and use the optimx() function to find the maximum of our log-likelihood function. As a first guess, we use \(0\) as the initial value for all unknown parameters.

# 5. Find maximum of log-likelihood function ----------------------------------
opt <- optimx(par = c(alpha  = 0,
                      beta_1 = 0, 
                      beta_2 = 0), 
              fn = negll, 
              gr = negll.grad, 
              control = list(trace = 0, 
                             all.methods = TRUE))

# Print results of the optimisation
# Remove information not needed for presentation
summary(opt, order = "convcode") %>% 
  select(-value, -niter, -gevals, -fevals)
                alpha     beta_1      beta_2 convcode  kkt1 kkt2 xtime
Nelder-Mead -3.055741 0.02756878 -0.01126589        0 FALSE TRUE  0.00
L-BFGS-B    -3.056762 0.02758493 -0.01130317        0  TRUE TRUE  0.00
nlm         -3.056691 0.02758409 -0.01131100        0  TRUE TRUE  0.01
nlminb      -3.056690 0.02758409 -0.01131159        0  TRUE TRUE  0.00
Rcgmin      -3.056691 0.02758409 -0.01131098        1  TRUE TRUE  0.03
Rvmmin       0.000000 0.00000000  0.00000000       21 FALSE TRUE  0.00
BFGS               NA         NA          NA     9999    NA   NA  0.00
CG                 NA         NA          NA     9999    NA   NA  0.00
spg                NA         NA          NA     9999    NA   NA  0.00
ucminf             NA         NA          NA     9999    NA   NA  0.00
newuoa             NA         NA          NA     9999    NA   NA  0.05
bobyqa             NA         NA          NA     9999    NA   NA  0.00
nmkb               NA         NA          NA     9999    NA   NA  0.00
hjkb               NA         NA          NA     9999    NA   NA  0.00

A value of \(0\) in the convcode column of the output indicates that the algorithm converged. Even though multiple algorithms converged and gave us values for our three unknown parameters, they all provide slightly different estimates. Therefore, I think it would be interesting to compare our estimates with the estimates from the commonly used glm() function. Below, I wrote a small piece of code that computes the mean difference between the estimates from each optimisation method and the glm model.

# 6. Estimate regression coefficients using glm --------------------------------
glm_model <- glm(sta_n ~ age + female, 
                 data = icu,
                 family = binomial(link = "logit"))

# Print coefficients
coef(glm_model)
(Intercept)         age      female 
-3.05669068  0.02758409 -0.01131098 
# 7. Comparing results from optimx and glm ------------------------------------
glm_results <- unname(coef(glm_model))
coef_opt <- coef(opt)

lapply(1:nrow(coef_opt), function(i){
    
    optimisation_algorithm <- attributes(coef_opt)$dimnames[[1]][i]

    mle_glm1 <- (coef_opt[i, "alpha" ] - glm_results[1])
    mle_glm2 <- (coef_opt[i, "beta_1"] - glm_results[2])
    mle_glm3 <- (coef_opt[i, "beta_2"] - glm_results[3])
    
    mean_difference <- mean(c(mle_glm1, mle_glm2, mle_glm3), na.rm = TRUE)
    
    data.frame(optimisation_algorithm, mean_difference)
    
  }) %>% 
    bind_rows() %>% 
  filter(!is.na(mean_difference)) %>% 
  mutate(mean_difference = abs(mean_difference)) %>% 
  arrange(mean_difference)
  optimisation_algorithm mean_difference
1                 Rcgmin    1.887799e-08
2                    nlm    5.856574e-08
3                 nlminb    2.071622e-07
4               L-BFGS-B    7.134885e-05
5            Nelder-Mead    9.493966e-04
6                 Rvmmin    3.056691e+00

This shows that the Rcgmin algorithm yields the estimates most similar to those from the glm model. However, most of the algorithms in the table provide estimates close to the glm estimates, which indicates that our optimisation of the logistic regression model using the log-likelihood function and its gradient worked out well.
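
If we want to continue working with one set of estimates, say those from the nlm algorithm, which converged and is among the closest to the glm results, we can extract them from the optimx object. As a small illustration, we can also exponentiate the slope estimates to read them as odds ratios for age and sex:

# Extract the estimates of one converged method (here nlm)
est <- coef(opt)["nlm", ]
est

# Odds ratios for age and female sex
exp(est[c("beta_1", "beta_2")])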