Updated Apr 2015:
Sometimes it's nice to quickly visualise the data that went into a simple linear regression, especially when you are performing lots of tests at once. Here is a quick and dirty solution with ggplot2 to create the following plot:
Let's try it out using the iris
dataset in R:
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Create fit1
, a linear regression of Sepal.Length and Petal.Width. Normally we would quickly plot the data in R base graphics:
fit1 <- lm(Sepal.Length ~ Petal.Width, data = iris)
summary(fit1)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3882 -0.2936 -0.0439 0.2643 1.3452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7776 0.0729 65.5 <2e-16 ***
## Petal.Width 0.8886 0.0514 17.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.478 on 148 degrees of freedom
## Multiple R-squared: 0.669, Adjusted R-squared: 0.667
## F-statistic: 299 on 1 and 148 DF, p-value: <2e-16
##
plot(Sepal.Length ~ Petal.Width, data = iris)
abline(fit1)
This can be plotted in ggplot2 using stat_smooth(method = "lm")
:
library(ggplot2)
ggplot(iris, aes(x = Petal.Width, y = Sepal.Length)) +
geom_point() +
stat_smooth(method = "lm", col = "red")
However, we can create a quick function that will pull the data out of a linear regression, and return important values (R-squares, slope, intercept and P value) at the top of a nice ggplot graph with the regression line.
ggplotRegression <- function (fit) {
require(ggplot2)
ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
geom_point() +
stat_smooth(method = "lm", col = "red") +
labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
"Intercept =",signif(fit$coef[[1]],5 ),
" Slope =",signif(fit$coef[[2]], 5),
" P =",signif(summary(fit)$coef[2,4], 5)))
}
After specifying this function, all you would have to run is:
fit1 <- lm(Sepal.Length ~ Petal.Width, data = iris)
ggplotRegression(fit1)
or even just
ggplotRegression(lm(Sepal.Length ~ Petal.Width, data = iris))
Disclaimer: This is not suitable for non-normal distributions or categorical variables!
This is very useful. Thank you. Just one question, is there a way of putting the model descriptive inside the graph?
Extremely thankful to you Susane Johnston for this post… you made my work very easy..
Hi, This function has worked earlier but it seems some of the changes in the theming system in ggplot2 necessitate updating it. Can you please post the updated version? I am an R beginner and my trials in doing this have been unsuccessful.
Reference: https://github.com/wch/ggplot2/wiki/New-theme-system
Thanks a lot!
M
You can fix this by simply replacing:
opts(title =
with:
ggtitle(
So that the last element of the function reads:
ggtitle(paste(“Adj R2 = “,signif(summary(fit)$adj.r.squared, 5),
“; Intercept =”,signif(fit$coef[[1]],5 ),
“; Slope =”,signif(fit$coef[[2]], 5),
“; P =”,signif(summary(fit)$coef[2,4], 5)))
Thanks for posting this Susan. Very helpful!
Thanks for posting this, it’s so useful! ANy ideas on what to do if you want to have a facet on your plot, with seperate lm’s for each facet?
What a help this is! I immediately adapted ggPlotRegression() for my particular needs and it worked perfectly. Until I had your code to use as an example I didn’t know how to write a function; didn’t even know I *could* write a function. Thank you very much, Susan!
I was looking for method to obtain residuals and do other kind of regression using ggplot, which brought me here, I learned few things about regression. But my goal is still unfulfilled, you have not mentioned anywhere, how to find residual and plot residuals using ggplot without taking using ‘lm’ command. Would you throw some light on it. While going through your post meticulously, I found that there is no need to define a function and this line // ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) // is redudant, as you have used the fit$model in place of iris, (fit$model)[2] in place of iris$Sepal.Length ……In spite of going through a cumbersome function method, we can simply write // ggplot(iris,aes( Petal.Width,Sepal.Length))+geom_point()+geom_smooth(method=lm, col=’red’)// . Please correct me, if have missed something. Thank you
Sam, the function is plotting based on the model object, not the data itself, that is why aes_string and the model parameters are in there. I have outlined in the post already the code to plot with the data alone. This post is not for the residuals, merely visualisation of the regression itself. A quick Google of plotting residuals in ggplot2 gives plenty of hits, for example: https://rpubs.com/therimalaya/43190
Dear Susan,
thanks for illusteration, its very useful.
Can we plot a Multiple regression plot using the above code ? if not – can you please guide me how to do that.
Thanks.
how does it work in facets? How to work on facets?
Pingback: Validity by Linear Regression in R | Sport Statistics R Sweet!
Very nice! Never thought about writing functions for plots – inspiring 🙂
Great work Susan! Any advice on how to adapt it to a multiple regression model (e.g. y~x+z+c)?
Hi, nice post and thanks for the info.
I have problem to plot dose-response curves from a create model of class “drm” from a “drc” package using “ggplot”.
do you have any tips?
Thanks,
Danilo
This is great, thanks! It may be inappropriate statistically to use lm, but just for plotting categorical variables, I modified it thus:
ggplotRegression <- function (fit, jit=FALSE) {
require(ggplot2)
if(jit){
fit$model[,1] <- jitter(fit$model[,1])
fit$model[,2] <- jitter(fit$model[,2])
}
ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
geom_point() +
stat_smooth(method = "lm", col = "red") +
labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
"Intercept =",signif(fit$coef[[1]],5 ),
" Slope =",signif(fit$coef[[2]], 5),
" P =",signif(summary(fit)$coef[2,4], 5)))
}
Hi,
What if you want to use your own defined fitted model which includes more than 1 input variable instead of the geom_smooth?
(i.e. myfit = lm(y ~ x + z + d + x:d, mydata)
Is there a patch or equivalent function that allows using “myfit” in geom_smooth instead of formula = …. ?
Hey, you can specify formulas in geom_smooth(), see http://docs.ggplot2.org/0.9.3.1/stat_smooth.html
This is easy to understand and apply. Great illustration. Thanks.
What about if you have two different datasets that have differnt variables and you need all the variables to be in your regression, how would you do it?
I tired cbind function but when I run my regression, I had same names for all of the varialbles that are in my regrssion.
Thanks again
Hisham Alhawal
Reblogged this on it 4 developers.
Thanks for a useful script. One suggestion: unless you’re comparing models with different numbers of predictors, you should display R^2 rather than adjusted R^2. R^2 has a nice interpretation which adjusted R^2 does not have.
Thanks Susane! I’m a beginner at R and I was wondering Dave’s same question.
How can the code be modified to show R2 and not adj R2?
You made my day!!!! Thanks a lot!
Hi Susane,
Very nice example. I particularly like the way to automatically label the plot with the coefficient of the regression. I have used it in an study I am doing on F1 races. Thanks a lot!
Nice very helpful!
Hi. This changes values from R2 when i try to use cubic regressions.
Extremely awesome fzslq http://fzslq.ga/ for the winter! They continue to keep me vogue and toasty!
Is there any way to print the results instead of putting them in the label?
Hello, is there a way to change the size of the label title? I have tried to adjust by changing the number “5”, or by putting in a “size =” but have been unsuccessful. I am also having problems with including the “slope =”, where my console warning is that the dimensions “[2,4]” are incorrect. Here is my code:
KCCPIexportcorn %>% ggplot(aes(y = SIYieldNonirr, x = OM)) +
geom_point(size = 2, color = “purple4″) + geom_smooth(method=”lm”, se = FALSE, color = “black”, size = 1) +
labs(x = ‘OM Subrule’) + labs(y = expression(bold(paste(‘Dryland Corn Yield (kg ha’^-1,’)’))) ) +
labs(title = paste(“Adj R2 =”,signif(summary(OMsubrule)$adj.r.squared,5))) +
theme_classic() +
theme(
axis.title.x = element_text(color=”black”, size=8, face=”bold”),
axis.title.y = element_text(color=”black”, size=8, face=”bold”),
axis.text = element_text(color=”black”, size = 7, face=”bold”),
axis.line = element_line(color = “black”, size = 1))
Nevermind, solved! Thank you for the great post!
How could you remove the grey area around the lm regression line?
Thank you
Sarah
What does the grey area represent? Does it represent anything like standard error, or…? What does it visualise?
Thanks. It’s very helpful.
Hi, nice post and thanks for the info.
I have problem to plot dose-response curves from a create model of class “drm” from a “drc” package using “ggplot”.
do you have any tips?
Thanks,
Dong-Mei
Great tutorial! Does what I need, but can anyone tell me how to swap out the legend from geom_smooth back to geom_point?
What if I need a polynomial function instead of just linear? And I have a group by variables, which means I will show several functions, each for one group?
Hello, and thank you for this helpful code. Why is the p-value: <2e-16 in the raw output above, but P=2.3255e-37 at the top of the graph in the final output? Thank you.
Gracias amigo, It is really helpful