R: dismo::gbm.step parameter selection function in parallel
I have a working function coded to optimize parallel processing (hopefully). I am still not proficient with R, functions, or iterating.
I am hoping someone out there can help me optimize the function I have written, along with the surrounding code, to reduce computing time and improve the parallel processing options. Specifically, I have questions about using %do% vs. %dopar%, and about moving additional code and parallel processing functions inside of the function. I cannot seem to get %dopar% to work, and I am not sure if the issue is my code, my R version, or conflicting libraries.
I would greatly appreciate any suggestions on possible ways to get the same results in a more efficient manner.
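For what it's worth, a common reason %dopar% fails where %do% works is that each worker is a fresh R session that does not inherit your attached packages; foreach's .packages argument fixes that. A minimal runnable sketch (a toy loop body stands in for the gbm.step call, and the package names are illustrative):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# Each worker is a clean R session: packages used inside the loop body
# must be listed in .packages (for the real problem this would be
# c("dismo", "gbm")). Objects from the calling environment are exported
# automatically only when foreach can detect them in the loop body.
out <- foreach(lr = c(0.01, 0.005), .combine = c,
               .packages = "stats") %dopar% {
  mean(rnorm(10, mean = lr))  # stand-in for a gbm.step() fit
  lr                          # return the learning rate tried
}

stopCluster(cl)
registerDoSEQ()
out  # c(0.01, 0.005)
```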
Background:
I am using dismo::gbm.step to build gbm models. gbm.step selects the optimal number of trees through k-fold cross validation. However, the parameters tree complexity and learning rate still need to be set. I understand that caret::train is built for this task, and I have had a lot of fun learning caret and its adaptive resampling capabilities. However, my response is binomial and caret does not have an option to return AUC for binomial distributions; I want to use AUC to replicate similar published studies in my field (ecology).
I am also using dismo::gbm.simplify later in the analysis to identify possible reduced models. gbm.simplify relies on data created when building models in dismo and cannot work on models built in caret.
Finally, much of the gbm literature in ecology follows the methods described in Elith et al. 2008, "A working guide to boosted regression trees", which the BRT functions in dismo are based on. For the purposes of this study, I want to keep using dismo to build gbm models.
The function I wrote tests several combinations of tree.complexity and learning.rate, and returns a list of several performance metrics for each model. I then combine all of the lists into a data.frame for easy sorting.
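Since data.table is already loaded in the example below, one lightweight way to do that combine step is data.table::rbindlist, which stacks same-named lists into one row per model. A sketch, where `results` is a hypothetical stand-in for a list of the per-model output lists:

```r
library(data.table)

# Hypothetical stand-in for two of the k.out lists the training
# function returns (values are made up for illustration)
results <- list(
  list(interaction.depth = 1, shrinkage = 0.01,  n.trees = 700, cv.deviance = 0.85),
  list(interaction.depth = 2, shrinkage = 0.005, n.trees = 650, cv.deviance = 0.82)
)

# One row per model; columns come from the list names
train.results <- rbindlist(results)

# Sort ascending by cross-validated deviance (in place)
setorder(train.results, cv.deviance)
train.results
```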
The goals of the function:

- Create a gbm model for each iteration of tree.complexity and learning.rate.
- Store $self.statistics$discrimination, $cv.statistics$discrimination.mean, $self.statistics$mean.resid, and $cv.statistics$deviance.mean in a list for each gbm model created.
- Remove each gbm model to save space.
- Combine each of the lists into a format that enables easy sorting, then remove each list.
- Do all of the above in a manner that optimizes parallel processing, reducing computing time and memory used.
Reproducible example using the Anguilla_train dataset from the dismo package:
```r
# Load libraries
require(pacman)
p_load(gbm, dismo, TeachingDemos, foreach, doParallel, data.table)

data(Anguilla_train)

# Identify cores on the current system
cores <- detectCores(all.tests = FALSE, logical = FALSE)
cores

# Create the training function for gbm.step
step.train.fx <- function(tree.com, learn) {
  # Set seed for reproducibility
  char2seed("StackOverflow", set = TRUE)
  k1 <- gbm.step(data = Anguilla_train,
                 gbm.x = 3:13,
                 gbm.y = 2,
                 family = "bernoulli",
                 tree.complexity = tree.com,
                 learning.rate = learn,
                 bag.fraction = 0.7,
                 prev.stratify = TRUE,
                 n.folds = 10,
                 n.trees = 700,
                 step.size = 25,
                 silent = TRUE,
                 plot.main = FALSE,
                 n.cores = cores)

  k.out <- list(interaction.depth = k1$interaction.depth,
                shrinkage = k1$shrinkage,
                n.trees = k1$n.trees,
                AUC = k1$self.statistics$discrimination,
                cv.AUC = k1$cv.statistics$discrimination.mean,
                deviance = k1$self.statistics$mean.resid,
                cv.deviance = k1$cv.statistics$deviance.mean)
  return(k.out)
}

# Define tree complexity and learning rate values to test
tree.complexity <- c(1:5)
learning.rate <- c(0.01, 0.025, 0.005, 0.0025, 0.001)

# Set up the parallel backend to use n processors
cl <- makeCluster(cores)
registerDoParallel(cl)

# Run the actual function
foreach(i = tree.complexity) %do% {
  foreach(j = learning.rate) %do% {
    nam <- paste0("gbm_tc", i, "lr", j)
    assign(nam, step.train.fx(tree.com = i, learn = j))
  }
}

# Stop the parallel backend
stopCluster(cl)
registerDoSEQ()

# Disable scientific notation
options(scipen = 999)

# Find items in the workspace that contain "gbm_tc"
train.all <- ls(pattern = "gbm_tc")

# cbind each list whose name contains "gbm_tc"
train.results <- list(do.call(cbind, mget(train.all)))

# Place in a data frame
train.results <- do.call(rbind, lapply(train.results, rbind))
train.results <- data.frame(matrix(unlist(train.results), ncol = 7, byrow = TRUE))

# Change column names
colnames(train.results) <- c("tc", "lr", "n.trees", "AUC", "cv.AUC", "dev", "cv.dev")

# Round columns 4:7
train.results[, 4:7] <- round(train.results[, 4:7], digits = 3)

# Sort by cv.dev, then cv.AUC, then AUC
train.results <- train.results[order(train.results$cv.dev,
                                     -train.results$cv.AUC,
                                     -train.results$AUC), ]
train.results
```
I'm still trying to work this out myself, and you've gotten a lot further than me! One thing that occurs to me is that the problem might be in the nested %do%s? To test, why not try %dopar% on just the j loop, or see if you can collapse the j & k matrix into a single vector, possibly a list containing the permutations of both terms, to be passed to gbm.step? E.g.

```r
tree.complexity = i[1], learning.rate = i[2],
```

Please let me know if you have success!
Edit: Also, a potential route is %:% here:

```r
foreach(tree.com = 1:5) %:%
  foreach(learn = c(0.01, 0.025, 0.005, 0.0025, 0.001)) %dopar% {
    gbm.step(...)
    return(list(...))
  }
```
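A fuller, runnable sketch of that %:% route, with the real gbm.step call stubbed out by a placeholder data.frame so the structure can be seen (in practice the body would call gbm.step and you would add .packages = c("dismo", "gbm"); also, gbm.step's own n.cores is probably best left alone once the outer loop is parallel):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# %:% crosses the two parameter sets (5 x 5 = 25 models);
# .combine = rbind stacks the one-row data.frames into a sortable table.
train.grid <- foreach(tree.com = 1:5, .combine = rbind) %:%
  foreach(learn = c(0.01, 0.025, 0.005, 0.0025, 0.001),
          .combine = rbind) %dopar% {
    # k1 <- gbm.step(..., tree.complexity = tree.com, learning.rate = learn, ...)
    data.frame(tc = tree.com, lr = learn)  # plus metrics pulled from k1 in practice
  }

stopCluster(cl)
registerDoSEQ()

nrow(train.grid)  # 25
```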
If you added tree.com and learn to the returned list, it could potentially spit out a nice matrix of values. Another option:

```r
foreach(tree.com = 1:5, learn = c(0.01, 0.025, 0.005, 0.0025, 0.001)) %dopar% {
  gbm.step(...)
  return(list(...))
}
```
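One caution on that last form: when foreach is given two iterators it steps through them together like mapply, yielding only the 5 zipped pairs rather than all 25 combinations. Building the grid first sidesteps that (a runnable sketch, with a placeholder body where the gbm.step call would go):

```r
library(foreach)

# All 25 tc/lr combinations, one row each
grid <- expand.grid(tree.com = 1:5,
                    learn = c(0.01, 0.025, 0.005, 0.0025, 0.001))

# Iterate over the grid columns in lockstep; each of the 25
# iterations now sees one (tree.com, learn) pair
res <- foreach(tree.com = grid$tree.com, learn = grid$learn,
               .combine = rbind) %do% {
  data.frame(tc = tree.com, lr = learn)  # gbm.step() call would go here
}

nrow(res)  # 25
```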