machine learning - How to calculate KNN Variable Importance in R


I implemented an authorship attribution project in which I was able to train a KNN model on articles from two authors. I then classify the author of a new article as either author A or author B. I use the knn() function to generate the model. The output of the model is the table below.

       word1 word2 word3 author
    11     1    48     8      A
    2      2     0     0      B
    29     1    45     9      A
    1      2     0     0      B
    4      0     0     0      B
    28     3     1     1      B

As seen from the model, it is obvious that word2 and word3 are the significant variables that drive the classification between author A and author B.

My question is how I can identify this using R.

Basically, your question boils down to having some variables (word1, word2, and word3 in your example) and a binary outcome (author in your example), and wanting to know the importance of the different variables in determining that outcome. A natural approach is to train a regression model to predict the outcome from the variables and then check the variable importance in that model. I'll include two approaches here (logistic regression and random forest), but many others could be used.

Let's start with a larger example, in which the outcome depends only on word2 and word3, and word2 has a much larger effect than word3:

    set.seed(144)
    dat <- data.frame(word1=rnorm(10000), word2=rnorm(10000), word3=rnorm(10000))
    dat$author <- ifelse(runif(10000) < 1/(1+exp(-10*dat$word2+dat$word3)), "a", "b")

We can use the summary of a logistic regression model predicting the author to determine the most important variables:

    summary(glm(I(author=="a")~., data=dat, family="binomial"))
    # [snip]
    # Coefficients:
    #             Estimate Std. Error z value Pr(>|z|)
    # (Intercept)  0.05117    0.04935   1.037    0.300
    # word1       -0.02123    0.04926  -0.431    0.666
    # word2        9.52679    0.26895  35.422   <2e-16 ***
    # word3       -0.97022    0.05629 -17.236   <2e-16 ***

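From the p-values, we can see that word2 and word3 both have a statistically significant effect, while word1 does not; the coefficient signs show that the effect of word2 is positive and the effect of word3 is negative. From the coefficients, we can also see that word2 has a much higher magnitude of effect on the outcome (a fair comparison here, since by construction we know all the variables are on the same scale).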
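When the variables are not on the same scale, the raw coefficients are no longer directly comparable. One common workaround is to standardize each predictor before fitting. The following is only a sketch under my own assumptions: the factor of 100 applied to word3 is a made-up change of units, and the names `predictors`, `scaled`, `fit_scaled`, and `rank_scaled` are introduced purely for illustration.

```r
# Sketch: standardize predictors so coefficient magnitudes can be compared.
set.seed(144)
n <- 10000
dat <- data.frame(word1 = rnorm(n), word2 = rnorm(n), word3 = rnorm(n))
dat$author <- ifelse(runif(n) < 1/(1 + exp(-10*dat$word2 + dat$word3)), "a", "b")

# Hypothetical change of units: pretend word3 was recorded on a 100x larger scale.
dat$word3 <- dat$word3 * 100

predictors <- c("word1", "word2", "word3")

# Center each predictor to mean 0 and rescale it to standard deviation 1.
scaled <- dat
scaled[predictors] <- lapply(dat[predictors], function(x) (x - mean(x)) / sd(x))

fit_scaled <- glm(I(author == "a") ~ word1 + word2 + word3,
                  data = scaled, family = "binomial")

# Rank the variables by absolute standardized coefficient (intercept excluded).
rank_scaled <- sort(abs(coef(fit_scaled)[predictors]), decreasing = TRUE)
rank_scaled
```

After standardizing, each coefficient describes the effect of a one standard deviation change in that variable, so the ranking no longer depends on the units each variable happened to be measured in.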

We can use the variable importance from a random forest predicting the author outcome in a similar way:

    library(randomForest)
    rf <- randomForest(as.factor(author)~., data=dat)
    rf$importance
    #       MeanDecreaseGini
    # word1         294.9039
    # word2        4353.2107
    # word3         351.3268

We can again identify word2 as by far the most important variable. This tells us something else that's interesting: given that we already know word2, word3 isn't much more useful than word1 in predicting the outcome (and word1 shouldn't be useful at all, because it wasn't used to compute the outcome).
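If you would rather measure importance with KNN itself, one of the "many others" is a leave-one-variable-out check: fit KNN with all variables, refit it with each variable removed, and see how much hold-out accuracy drops. This is a rough sketch rather than a standard KNN feature. It assumes the class package (which provides knn()), a smaller simulated sample to keep it fast, and an arbitrary k = 15; `knn_accuracy`, `full_acc`, and `drop_importance` are names I made up for illustration.

```r
library(class)  # provides knn()

set.seed(144)
n <- 2000  # smaller than the example above so knn() runs quickly
dat <- data.frame(word1 = rnorm(n), word2 = rnorm(n), word3 = rnorm(n))
dat$author <- factor(ifelse(runif(n) < 1/(1 + exp(-10*dat$word2 + dat$word3)), "a", "b"))

predictors <- c("word1", "word2", "word3")
train_idx <- sample(n, n / 2)  # random half for training, the rest for testing

# Hold-out accuracy of a KNN classifier that uses only the columns in `cols`.
knn_accuracy <- function(cols, k = 15) {
  pred <- knn(train = dat[train_idx, cols, drop = FALSE],
              test  = dat[-train_idx, cols, drop = FALSE],
              cl    = dat$author[train_idx],
              k     = k)
  mean(pred == dat$author[-train_idx])
}

full_acc <- knn_accuracy(predictors)

# Accuracy lost when each variable is left out; a larger drop suggests
# that the variable matters more to the KNN classifier.
drop_importance <- sapply(predictors, function(v) {
  full_acc - knn_accuracy(setdiff(predictors, v))
})
drop_importance
```

A drop near zero (or slightly negative) means KNN does about as well without that variable. Keep in mind that KNN is distance-based, so the variables should be on comparable scales before running a check like this.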

