r - Unexpected behavior in dplyr::group_by_ and dplyr::summarise_
I wrote a little function to find the R-squared value of a regression performed on two variables in the mtcars
data set, which is included in R by default:
get_r_squared = function(x) summary(lm(mpg ~ hp, data = x))$r.squared
It seems to work as expected when I give it the full data set:
get_r_squared(mtcars)
# [1] 0.6024373
However, if I try to use it as part of a dplyr pipeline on subsets of the data, it returns the same answer as above three times, when I expected it to return a different value for each subset.
library(dplyr)
mtcars %>%
  group_by_("cyl") %>%
  summarise_(r_squared = get_r_squared(.))
## Source: local data frame [3 x 2]
##
##   cyl r_squared
## 1   4 0.6024373
## 2   6 0.6024373
## 3   8 0.6024373
I was expecting these values instead:
sapply(
  unique(mtcars$cyl),
  function(cyl) {
    get_r_squared(mtcars[mtcars$cyl == cyl, ])
  }
)
# [1] 0.01614624 0.27405583 0.08044919
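The same per-subset values can also be computed with split(), which keeps each result labelled by its cyl value instead of relying on the order returned by unique(). This is only a base R sketch; the name r2_by_cyl is made up for illustration, and the helper is repeated so the snippet runs on its own.

```r
# Fit mpg ~ hp on a data frame and return the R-squared (same helper as above).
get_r_squared <- function(x) summary(lm(mpg ~ hp, data = x))$r.squared

# split() breaks mtcars into a list of data frames, one per cyl value;
# sapply() applies the helper to each piece and keeps the list names.
r2_by_cyl <- sapply(split(mtcars, mtcars$cyl), get_r_squared)
r2_by_cyl
```

The result is a named numeric vector with names "4", "6" and "8", so each R-squared can be matched to its group directly.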
I've confirmed that this is not a plyr namespace issue: that package is not loaded.
search()
##  [1] ".GlobalEnv"        "package:knitr"     "package:dplyr"
##  [4] "tools:rstudio"     "package:stats"     "package:graphics"
##  [7] "package:grDevices" "package:utils"     "package:datasets"
## [10] "package:methods"   "Autoloads"         "package:base"
I'm not sure what's going on here. Is it related to nonstandard evaluation in the lm
function? Or am I misunderstanding how group_by
works? Or is it perhaps something else?
I think you've misunderstood how summarise() works - it doesn't do anything special with ., and the fact that it works at all is a happy chance. Instead, try this:
library(dplyr)
get_r_squared <- function(x, y) summary(lm(x ~ y))$r.squared
mtcars %>%
  group_by(cyl) %>%
  summarise(r_squared = get_r_squared(mpg, wt))
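To see why the original pipeline repeated one value, it may help to check what . actually refers to. The sketch below assumes a reasonably recent dplyr with the magrittr pipe; the names rows_seen, n_in_group, n_in_dot and r2_by_cyl are made up for illustration. It regresses on hp rather than wt so that the numbers line up with the values the question expected.

```r
library(dplyr)

# Inside the pipe, `.` is the whole grouped data frame, not the current group,
# so nrow(.) reports all 32 rows of mtcars for every group, while n() counts
# only the rows of the group being summarised.
rows_seen <- mtcars %>%
  group_by(cyl) %>%
  summarise(n_in_group = n(), n_in_dot = nrow(.))

# Bare column names are evaluated per group, so passing the columns
# themselves gives one R-squared per cyl value.
get_r_squared <- function(x, y) summary(lm(x ~ y))$r.squared
r2_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(r_squared = get_r_squared(mpg, hp))

rows_seen
r2_by_cyl
```

Because get_r_squared(.) was handed the full data frame each time, lm() fitted the same 32-row model three times, which is where the repeated 0.6024373 came from.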