Last updated: 2025-01-09
Checks: 7 passed, 0 failed
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 4d7cdb8. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/pbmc3k.csv
Ignored: data/pbmc3k.csv.gz
Ignored: data/pbmc3k/
Ignored: r_packages_4.4.0/
Ignored: r_packages_4.4.1/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/lm.Rmd) and HTML (docs/lm.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 4d7cdb8 | Dave Tang | 2025-01-09 | Multiple linear regression |
html | 243642c | Dave Tang | 2025-01-09 | Build site. |
Rmd | 0981d28 | Dave Tang | 2025-01-09 | Perfect fit |
html | 9a03e0f | Dave Tang | 2025-01-09 | Build site. |
Rmd | ae8a6bd | Dave Tang | 2025-01-09 | Vertical line |
html | 1d0b364 | Dave Tang | 2025-01-08 | Build site. |
Rmd | de97616 | Dave Tang | 2025-01-08 | Manual linear regression analysis using R |
On Wikipedia, linear regression is described as:
In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression.
Linear simply means a straight line, regression in this context means relationship, and modelling means fitting. Therefore, simple linear regression modelling can be interpreted as finding the equation of the straight line that best fits the data points and captures the relationship between the variables; the best line is the one that minimises the sum of squared residuals of the linear regression model. The equation of a straight line is:
\[y = mx + b\]
where \(m\) is the slope/gradient and \(b\) is the y-intercept.
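To make “best” precise: the fitted line is the pair \((m, b)\) that minimises the residual sum of squares,

\[RSS(m, b) = \sum_{i=1}^{n}\left(y_i - (mx_i + b)\right)^2\]

and the formulas for \(m\) and \(b\) used later in this post are the closed-form solution to this minimisation.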
Here’s a quick example using the “women” dataset that comes with R. I will be using {ggplot2} for plotting, so install it if you want to follow along.
install.packages('ggplot2')
The women dataset gives the average heights and weights for American women aged 30–39.
library(ggplot2)
head(women)
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
Plot height vs. weight and fit a linear model.
ggplot(women, aes(height, weight)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
There is a clear relationship between height and weight. How do we figure out this line of best fit manually? We can work out \(m\) for the women dataset using the following formula:
\[m = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n}(x_i - \overline{x})^2}\]
where \(n\) is the number of observations in the dataset, \(\sum_{i=1}^n\) means the sum of all the observations, \(x_i\) and \(y_i\) are the \(i\)th observation of \(x\) and \(y\), and \(\overline{x}\) and \(\overline{y}\) are the means of \(x\) and \(y\), respectively.
Once we have \(m\), we can work out \(b\) with:
\[b = \overline{y} - m\overline{x}\]
which is just a rearrangement of:
\[y = mx + b\]
but using the mean values.
We can express these functions in R as:
slope <- function(x, y){
  mean_x <- mean(x)
  mean_y <- mean(y)
  top <- sum((x - mean_x) * (y - mean_y))
  bottom <- sum((x - mean_x)^2)
  top / bottom
}

# the slope formula is just
# covariance(x, y) / variance(x)
slope2 <- function(x, y){
  cov(x, y) / var(x)
}

intercept <- function(x, y, m){
  mean(y) - (m * mean(x))
}
Before we work out \(m\) and \(b\), a word on dependent and independent variables. When modelling, we want to work out the dependent variable from an independent variable (or variables). For this post, we would like to work out weight from height, therefore our independent variable is height and our dependent variable is weight. It can be the other way around too. When plotting, the independent variable is usually on the x-axis, while the dependent variable is on the y-axis (as I have done above). Now to use our slope() and intercept() functions.
(my_slope <- slope(women$height, women$weight))
[1] 3.45
(my_intercept <- intercept(women$height, women$weight, my_slope))
[1] -87.51667
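As a sanity check, the covariance/variance shortcut implemented in slope2() above gives the same result:

slope2(women$height, women$weight)
[1] 3.45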
Let’s plot this using base R functions.
plot(women$height, women$weight, pch=16, xlab = "Height", ylab = "Weight", main = "Manual linear regression!")
abline(my_intercept, my_slope)
The lm() function (used for fitting linear models) in R can work out the intercept and slope for us (and other things too). The arguments for lm() are a formula and the data; the formula starts with the dependent variable, followed by a tilde (~), followed by the independent variable.
(fit <- lm(weight ~ height, women))
Call:
lm(formula = weight ~ height, data = women)
Coefficients:
(Intercept) height
-87.52 3.45
Note that the coefficients (Intercept) and height are the same as what we calculated manually for the intercept and slope.
There are other values stored in the fit object.
names(fit)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
The residuals are the differences between the observed values of the dependent variable and the fitted values. We can plot the residuals against the observed values to see how well the regression model fits.
plot(women$weight, fit$residuals, pch=16)
abline(h = 0, lty = 2)
The fourth point from the left has a residual that is close to zero; this is because the fitted weight (122.9333 pounds) is almost the same as the real weight (123 pounds) for someone who is 61 inches tall.
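We can reproduce this fitted value by plugging 61 into \(y = mx + b\) with our manually calculated slope and intercept:

# fitted weight for a height of 61 inches
my_slope * 61 + my_intercept
[1] 122.9333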
We can use the predict() function to predict someone’s weight based on their height. Below we predict someone’s weight when they are 61 and 80 inches tall.
predict(fit, data.frame(height = 61))
1
122.9333
predict(fit, data.frame(height = 80))
1
188.4833
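Note that 80 inches falls outside the range of heights in the dataset, so the second prediction is an extrapolation and should be treated with caution:

range(women$height)
[1] 58 72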
There’s a nice example of linear regression in the “Statistics For Dummies” book that also demonstrates how to manually calculate the slope and intercept. Here I’ll perform the analysis using R on data containing the number of cricket (the insect) chirps versus temperature. First we will create the vectors containing the chirps and temperatures.
chirp <- c(18, 20, 21, 23, 27, 30, 34, 39)
temp <- c(57, 60, 64, 65, 68, 71, 74, 77)
The five summaries required for calculating the slope and intercept of the best-fitting line are: the mean of x, the mean of y, the standard deviation of x, the standard deviation of y, and the correlation between x and y. We can work out these five summaries using built-in functions in R.
mean(chirp)
[1] 26.5
mean(temp)
[1] 67
sd(chirp)
[1] 7.387248
sd(temp)
[1] 6.845228
cor(chirp,temp)
[1] 0.9774791
Note that in the “Statistics For Dummies” book, dependent variables are referred to as response variables and independent variables are referred to as explanatory variables. You can think of the response variable as the variable that responds to a set of variables (the explanatory variables) that explain the response.
# formula for the slope m: correlation times the ratio of standard
# deviations, with the response variable on top
# m = r * (sd(response) / sd(explanatory))
(m <- cor(chirp, temp) * (sd(chirp) / sd(temp)))
[1] 1.054878
# formula for intercept as above
# b = mean(y) - (m * mean(x))
mean(chirp) - (m * mean(temp))
[1] -44.17683
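This matches the covariance/variance form of the slope used earlier, with the explanatory variable in the denominator:

# same slope as cor(chirp, temp) * (sd(chirp) / sd(temp))
cov(chirp, temp) / var(temp)
[1] 1.054878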
We can use the lm() function in R to check our calculation. Models for lm() are specified symbolically and a typical model has the form response ~ terms, where response is the response vector and terms is a series of terms which specifies a linear predictor for the response.
We’ll make chirps the response (dependent variable) and temperature the explanatory variable (independent variable), i.e. predicting the number of chirps from the temperature.
lm(chirp ~ temp)
Call:
lm(formula = chirp ~ temp)
Coefficients:
(Intercept) temp
-44.177 1.055
We see that the coefficients are the same as what we had calculated. We can use the slope and intercept to write a function to predict the number of chirps based on temperature.
# in the form y = mx + b
# y = 1.055x - 44.177
model <- function(x){
1.055 * x - 44.177
}
# how many chirps at 68 Fahrenheit using our linear model?
model(68)
[1] 27.563
model(77)
[1] 37.058
We can check our function using the built-in R function predict().
reg <- lm(chirp ~ temp)
predict(reg, data.frame(temp=68))
1
27.55488
predict(reg, data.frame(temp=77))
1
37.04878
The predictions differ slightly from our model() function because we rounded the coefficients to three decimal places. The linear model generated by R also contains some useful information, such as the coefficient of determination (i.e., R-squared).
summary(reg)$r.squared
[1] 0.9554655
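For simple linear regression, the coefficient of determination is simply the squared correlation between the two variables:

cor(chirp, temp)^2
[1] 0.9554655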
Use the summary() function to get a full summary.
summary(reg)
Call:
lm(formula = chirp ~ temp)
Residuals:
Min 1Q Median 3Q Max
-2.3354 -0.8872 -0.2195 1.1509 2.0488
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -44.17683 6.25773 -7.06 0.000404 ***
temp 1.05488 0.09298 11.35 2.81e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.684 on 6 degrees of freedom
Multiple R-squared: 0.9555, Adjusted R-squared: 0.948
F-statistic: 128.7 on 1 and 6 DF, p-value: 2.808e-05
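As a quick check of the coefficients table, each t value is the estimate divided by its standard error; for example, for temp:

1.05488 / 0.09298
[1] 11.34524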
Finally, we can plot the data and add the line of best fit. As per convention, the response variable is on the y-axis and the explanatory variable is on the x-axis.
plot(temp, chirp, pch=16)
abline(reg)
The same plot can be generated using ggplot2.
library(ggplot2)
data.frame(temp = temp, chirp = chirp) |>
ggplot(aes(temp, chirp)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
What happens if you try to perform linear regression on points that form a vertical line?
Predict x using y.
n <- 10
my_df <- data.frame(
  x = rep(5, n),
  y = 1:n
)
fit <- lm(x ~ y, my_df)
with(data = my_df, plot(x, y, pch=16))
abline(fit)
Predict y using x.
fit <- lm(y ~ x, my_df)
with(data = my_df, plot(x, y, pch=16))
abline(fit)
Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): 'a' and 'b' must be finite
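This happens because x is constant: its column in the model matrix is collinear with the intercept, so lm() drops it and reports NA for the slope, and abline() cannot draw a line with a non-finite slope.

coef(fit)
(Intercept)           x 
        5.5          NA 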
What happens if there’s a perfect fit?
n <- 10
my_df <- data.frame(
  x = 1:n,
  y = 1:n
)
fit <- lm(x ~ y, my_df)
with(data = my_df, plot(x, y, pch=16))
abline(fit)
Summary.
summary(fit)
Warning in summary.lm(fit): essentially perfect fit: summary may be unreliable
Call:
lm(formula = x ~ y, data = my_df)
Residuals:
Min 1Q Median 3Q Max
-5.661e-16 -1.157e-16 4.273e-17 2.153e-16 4.167e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.123e-15 2.458e-16 4.571e+00 0.00182 **
y 1.000e+00 3.961e-17 2.525e+16 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.598e-16 on 8 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.374e+32 on 1 and 8 DF, p-value: < 2.2e-16
There are floating-point precision errors in the residuals, which should be exactly zero, and in the intercept, which should also be zero. The slope is correct: when y increases by 1, x increases by 1.
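One way to confirm that the residuals are zero up to floating-point error is to compare them against a small tolerance:

all(abs(resid(fit)) < 1e-10)
[1] TRUE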
Predict x using y and z.
n <- 10
set.seed(1984)
my_df <- data.frame(
  x = (1:n),
  y = (1:n) + rnorm(n, sd = 0.5),
  z = rnorm(n)
)
fit <- lm(x ~ y + z, my_df)
summary(fit)
Call:
lm(formula = x ~ y + z, data = my_df)
Residuals:
Min 1Q Median 3Q Max
-0.4473 -0.3698 -0.1993 0.3606 0.9200
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.28067 0.43138 0.651 0.536
y 0.93656 0.06898 13.577 2.77e-06 ***
z -0.08051 0.19662 -0.409 0.694
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5473 on 7 degrees of freedom
Multiple R-squared: 0.9746, Adjusted R-squared: 0.9673
F-statistic: 134.2 on 2 and 7 DF, p-value: 2.618e-06
Note that z is not a significant predictor, which is expected because x was generated independently of z. The intercept represents the predicted value of x when y and z (the predictors) are equal to zero. The intercept often serves as a baseline or reference point for the model. For example, suppose we are modelling house prices based on area and number of bedrooms. If both predictors are zero (a house with zero area and zero bedrooms), the intercept represents the baseline price of a property.
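We can confirm this interpretation with predict(): setting both predictors to zero returns the intercept.

# prediction with y = 0 and z = 0 recovers the intercept
predict(fit, data.frame(y = 0, z = 0))  # about 0.28067, i.e. coef(fit)[1]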
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[9] ggplot2_3.5.1 tidyverse_2.0.0 workflowr_1.7.1
loaded via a namespace (and not attached):
[1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 lattice_0.22-6
[5] stringi_1.8.4 hms_1.1.3 digest_0.6.37 magrittr_2.0.3
[9] timechange_0.3.0 evaluate_1.0.1 grid_4.4.1 fastmap_1.2.0
[13] Matrix_1.7-0 rprojroot_2.0.4 jsonlite_1.8.9 processx_3.8.4
[17] whisker_0.4.1 ps_1.8.1 promises_1.3.0 mgcv_1.9-1
[21] httr_1.4.7 fansi_1.0.6 scales_1.3.0 jquerylib_0.1.4
[25] cli_3.6.3 rlang_1.1.4 splines_4.4.1 munsell_0.5.1
[29] withr_3.0.2 cachem_1.1.0 yaml_2.3.10 tools_4.4.1
[33] tzdb_0.4.0 colorspace_2.1-1 httpuv_1.6.15 vctrs_0.6.5
[37] R6_2.5.1 lifecycle_1.0.4 git2r_0.35.0 fs_1.6.4
[41] pkgconfig_2.0.3 callr_3.7.6 pillar_1.9.0 bslib_0.8.0
[45] later_1.3.2 gtable_0.3.6 glue_1.8.0 Rcpp_1.0.13
[49] highr_0.11 xfun_0.48 tidyselect_1.2.1 rstudioapi_0.17.1
[53] knitr_1.48 farver_2.1.2 nlme_3.1-164 htmltools_0.5.8.1
[57] labeling_0.4.3 rmarkdown_2.28 compiler_4.4.1 getPass_0.2-4