# Simple Linear Regression

Simple Linear Regression in R (2018) [STHDA]

## Formula and basics

Simple linear regression predicts a quantitative outcome variable `y` on the basis of a single predictor variable `x`.

**An important assumption** of the linear regression model is that the relationship between the predictor variables and the outcome is **linear** and **additive**.

**Residual Sum of Squares (RSS)**

The sum of the squares of the residual errors

**Ordinary least squares regression (OLS)**

$b_0$ and $b_1$ are determined so that the RSS is minimized

**Residual Standard Error (RSE)**

The average variation of points around the fitted regression line

The lower the RSE, the better the fitted regression model
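The three quantities above can be written out for the simple model $y = b_0 + b_1 x + \epsilon$ (the formulas are standard, but the notation here is assumed since the source does not write them explicitly):

```latex
\text{Model:}\quad y = b_0 + b_1 x + \epsilon
\qquad
\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad
\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - 2}}
```

OLS chooses $(b_0, b_1)$ as the pair that minimizes the RSS; the $n - 2$ in the RSE accounts for the two estimated coefficients.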

## Load & preview the data

```r
library(tidyverse)  ## For data manipulation and visualization
library(ggpubr)     ## For publication-ready plots
theme_set(theme_pubr())
```

The `marketing` data set describes the impact of YouTube, Facebook, and newspaper advertising on sales.

```r
# devtools::install_github("kassambara/datarium")
library(datarium)
data("marketing", package = "datarium")
head(marketing)
```

### Visualize the data

Create a scatterplot displaying the sales in thousands of dollars vs. the YouTube advertising budget:

```r
ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth()
```

### Correlation coefficient

Measures the level of association between X and Y:

- -1: perfect negative correlation
- +1: perfect positive correlation
- ~0: weak relationship between the variables

```r
cor(marketing$sales, marketing$youtube)
```

## Fit a linear regression model

The `jtools` package is very useful for summarizing and visualizing regression results.

```r
library(jtools)
model <- lm(sales ~ youtube, data = marketing)
summ(model)
```

An intercept of 8.43 means that when the YouTube advertising budget is $0, we can expect about $8,430 in sales
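As a sketch of using the fitted model (assuming the `model` object fitted above), `predict()` returns the expected sales for new budget values:

```r
# Expected sales for new YouTube budgets; predictions follow
# sales = b0 + b1 * youtube for each value of the budget
new_budgets <- data.frame(youtube = c(0, 100, 200))
predict(model, newdata = new_budgets)
```

At `youtube = 0` the prediction equals the intercept discussed above.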

### Plot the regression line

The confidence band reflects the uncertainty about the fitted line

```r
ggplot(marketing, aes(youtube, sales)) +
  geom_point() +
  stat_smooth(method = lm)
```

## Model assessment

### Model summary

- **Call**: the function used to compute the regression model
- **Residuals**: the distribution of the residuals (the median should be ~0, and the min and max roughly symmetric)
- **Coefficients**: the regression beta coefficients and their statistical significance (indicated by asterisks)
- **RSE**, $R^2$, and **F-statistic**: metrics of how well the model fits the data

```r
summary(model)
```

### Coefficients significance

#### Standard errors

Measures the variability/accuracy of the beta coefficients. Can be used to compute the confidence intervals of the coefficients.
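A minimal sketch, assuming the `model` fitted earlier: `confint()` computes these intervals (estimate ± t-quantile × standard error) directly from the fitted model:

```r
# 95% confidence intervals for the intercept and the youtube slope
confint(model, level = 0.95)
```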

#### t-value

A t-test is performed to check whether or not these coefficients are significantly different from zero. **High t-statistics (which go with low p-values near 0)** indicate that a predictor should be retained in a model.

#### Model accuracy

**RSE** (the closer to zero, the better)

Whether the RSE is acceptable depends on the problem

$R^2$ (the higher, the better)

Represents the proportion of information (i.e. variation) in the data that can be explained by the model, or how well the model fits the data

For a simple linear regression, $R^2$ is the square of the Pearson correlation coefficient
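This identity can be verified directly on the fitted model (a quick check, assuming the `model` and `marketing` objects from above):

```r
r2 <- summary(model)$r.squared
r  <- cor(marketing$sales, marketing$youtube)
all.equal(r2, r^2)  # TRUE: R^2 equals the squared Pearson correlation
```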

The adjusted $R^2$ is adjusted for the degrees of freedom. Because $R^2$ tends to increase as more predictors are added, the adjusted $R^2$ should be preferred for multiple linear regression models

**F-statistic** (the higher, the better)

Gives the overall significance of the model, assessing whether at least one predictor variable has a non-zero coefficient

For a simple linear regression, this just duplicates the information of the t-test in the coefficient table

More important in multiple linear regression

A large F-statistic corresponds to a statistically significant p-value
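That duplication can be checked on the fitted model: in a simple linear regression, the F-statistic is exactly the square of the slope's t-statistic (a sketch, assuming the `model` object from above):

```r
f_stat <- summary(model)$fstatistic["value"]
t_stat <- summary(model)$coefficients["youtube", "t value"]
all.equal(unname(f_stat), t_stat^2)  # TRUE: the F-test and t-test agree here
```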