#Reading the example data set icu from the package aplore3
library(aplore3)
y = icu$sys #Set our dependent variable
x1 = icu$age #Set our first independent variable
x2 = as.numeric(icu$gender) #Set our second independent variable

#Define the likelihood function we would like to optimise
ll_lm <- function(par, y, x1, x2){
  alpha <- par[1]
  beta1 <- par[2]
  beta2 <- par[3]
  sigma <- par[4]

  R = y - alpha - beta1 * x1 - beta2 * x2

  -sum(dnorm(R, mean = 0, sd = sigma, log = TRUE))
}
If you want to optimise a function, the most important question is, of course, which function should be optimised. As taught in most statistics classes, in the case of a linear regression model this function is the distribution of the residuals \((R)\). We know that the residuals follow a normal distribution with mean 0 and an unknown standard deviation \(\sigma\):
\[ R_i \sim N(0, \sigma), \quad i = 1, \dots, n \]
where \(R_i\) equals:
\[ R_i = y_i - \hat{y_i} \]
Thus, we are interested in a function for \(\hat{y_i}\) which minimises the residuals and hence gives us the best estimates for the function of interest. Taking all this information together, we can write the function we would like to optimise, shown in the code at the beginning of this section.
A function that can be used with the optim() command needs to have a par argument, which includes the unknown parameters. The par argument needs a vector with initial values, or guesses, for all unknown parameters. As shown in the example above, the par argument includes initial values for all 4 unknown parameters. In this example, the first value of the par argument equals \(\alpha\), the second \(\beta_1\), the third \(\beta_2\) and the fourth \(\sigma\), which is the unknown standard deviation of the normal distribution of our residuals. Additionally, we included our two independent variables \(x_1\) and \(x_2\) and our dependent variable \(y\) as function arguments in the optim() call.
Also note that I used the log argument of the dnorm() function to get the logarithm of the normal density values. This is necessary if we would like to sum the individual log-likelihood contributions instead of taking the product of the likelihood values.
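To see why summing the logarithms is preferable, note that the logarithm turns the product of the single likelihood values into a sum:
\[\log \prod_{i = 1}^{n} f(R_i) = \sum_{i = 1}^{n} \log f(R_i)\]
The small sketch below (using a made-up residual vector R_demo, not the icu data) shows that the product itself quickly underflows to zero in floating point arithmetic, while the sum of the log-densities stays finite:
set.seed(1)
R_demo <- rnorm(1000, mean = 0, sd = 30) #Hypothetical residuals
log(prod(dnorm(R_demo, mean = 0, sd = 30))) #prod() underflows to 0, so this returns -Inf
sum(dnorm(R_demo, mean = 0, sd = 30, log = TRUE)) #Returns a finite log-likelihood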
The linear model we would like to fit in this example is:
\[ E(Y|\mathbf{X}) = \alpha + \beta_1 x_1 + \beta_2 x_2\]
Hence, the residuals for this model can be calculated as:
\[R = y - \alpha - \beta_1 x_1 - \beta_2 x_2\]
Since we know that these residuals follow a normal distribution with mean 0, we just need to find the standard deviation of this normal distribution and the values of our coefficients that fit our data best. To do this, we would like to maximise the likelihood of our residuals. This can be done with the optim() command. However, since the optim() command minimises functions by default, we just need to put a minus in front of our summation: minimising the negative log-likelihood is the same as maximising the log-likelihood.
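As an aside, optim() can also maximise directly if you set control = list(fnscale = -1), in which case the minus in the likelihood function is no longer needed. A minimal sketch, assuming a hypothetical variant ll_lm_pos() of our likelihood function without the leading minus and some made-up starting values:
#Hypothetical variant of ll_lm() without the leading minus
ll_lm_pos <- function(par, y, x1, x2){
  R <- y - par[1] - par[2] * x1 - par[3] * x2
  sum(dnorm(R, mean = 0, sd = par[4], log = TRUE))
}
#fnscale = -1 turns the default minimisation into a maximisation
optim(par = c(alpha = 120, beta1 = 0, beta2 = 0, sigma = 30), #Made-up starting values
      fn = ll_lm_pos, y = y, x1 = x1, x2 = x2,
      control = list(fnscale = -1))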
Before we run the optim() command, we also need to find good guesses for our estimates, since the initial parameter values chosen for the optimisation influence our estimates. In this case we just calculate the conditional means of our subgroups and use them as guesses for our coefficients.
est_alpha <- mean(icu$sys)
est_beta1 <- mean(icu$sys[icu$age >= 40 & icu$age <= 41]) - mean(icu$sys[icu$age >= 41 & icu$age <= 52])
est_beta2 <- mean(icu$sys[icu$gender == "Male"]) - mean(icu$sys[icu$gender == "Female"])
est_sigma <- sd(icu$sys)
Now we can use the optim() function to search for our maximum likelihood estimates (MLEs) of the different coefficients. The optim() function needs the function we would like to optimise, in this case ll_lm(), our guesses for the estimates and the empirical data we want to use for the optimisation.
mle_par <- optim(fn = ll_lm, #Function to be optimised
                 par = c(alpha = est_alpha, #Initial values
                         beta1 = est_beta1,
                         beta2 = est_beta2,
                         sigma = est_sigma),
                 y = icu$sys, #Empirical data from the data set icu
                 x1 = icu$age,
                 x2 = as.numeric(icu$gender))

mle_par$par #Showing estimates for unknown parameters
       alpha        beta1        beta2        sigma 
123.99854056   0.06173058   3.37040256  32.77094809 
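If you would also like approximate standard errors to compare with the lm() output below, one common approach (a sketch, not part of the original call) is to rerun optim() with hessian = TRUE. Since we minimised the negative log-likelihood, the inverse of the returned Hessian approximates the variance-covariance matrix of the estimates:
mle_par2 <- optim(fn = ll_lm,
                  par = c(alpha = est_alpha, beta1 = est_beta1,
                          beta2 = est_beta2, sigma = est_sigma),
                  y = icu$sys, x1 = icu$age, x2 = as.numeric(icu$gender),
                  hessian = TRUE) #Also return the Hessian at the solution
sqrt(diag(solve(mle_par2$hessian))) #Approximate standard errors of the estimates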
If we now compare our estimates with the results of the lm() command for the same model, we can see slight differences in the estimates, which may be due to the different optimisation algorithms used or to our initial guesses for the parameters.
summary(lm(sys ~ age + as.numeric(gender), data = icu))
Call:
lm(formula = sys ~ age + as.numeric(gender), data = icu)
Residuals:
Min 1Q Median 3Q Max
-98.544 -20.510 -1.445 18.884 122.272
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 124.39209 9.32708 13.337 <2e-16 ***
age 0.06276 0.11738 0.535 0.593
as.numeric(gender) 3.09867 4.83773 0.641 0.523
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 33.05 on 197 degrees of freedom
Multiple R-squared: 0.003889, Adjusted R-squared: -0.006224
F-statistic: 0.3845 on 2 and 197 DF, p-value: 0.6813
This approach for a manual optimisation of a linear regression model can also be adapted to other linear regression models with more coefficients. In this case the formula for the residuals needs to be adjusted to the structure of the model that you would like to fit, as sketched below.
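For models with many coefficients, writing out the residual formula term by term becomes tedious. One way to generalise the idea (a sketch; ll_lm_general() is a hypothetical helper, not part of the code above, and assumes the design matrix X contains an intercept column) is to pass a model matrix and let the likelihood function handle any number of coefficients:
#Generalised likelihood function for an arbitrary design matrix X
ll_lm_general <- function(par, y, X){
  k <- ncol(X) #Number of coefficients, including the intercept
  beta <- par[1:k]
  sigma <- par[k + 1]
  R <- y - X %*% beta #Residuals for the full design matrix
  -sum(dnorm(R, mean = 0, sd = sigma, log = TRUE))
}
X <- model.matrix(~ age + as.numeric(gender), data = icu) #Design matrix with intercept
optim(fn = ll_lm_general,
      par = c(mean(icu$sys), rep(0, ncol(X) - 1), sd(icu$sys)), #Naive initial values
      y = icu$sys, X = X)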