R: dismo::gbm.step parameter selection function in parallel


I have a working function that I have coded to (hopefully) take advantage of parallel processing. I am still not proficient in R, with writing functions, or with iterating.

I am hoping someone out there can help me optimize the function I have written, along with the surrounding code, to reduce computing time and make better use of the parallel processing options.

Specifically: using %do% vs. %dopar%, and moving additional code and the parallel processing functions inside of the function. I cannot seem to get %dopar% to work, and I am not sure if the issue is my code, my R version, or conflicting libraries.
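For reference, this is the basic %dopar% pattern I have been trying to follow. It is a minimal sketch, separate from my actual code: as I understand it, each worker is a fresh R session, so any packages the loop body needs have to be declared with .packages, and the value of the last expression in the body is what gets collected.

#minimal %dopar% sketch, separate from my real code below
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

out <- foreach(i = 1:4, .combine = c, .packages = "stats") %dopar% {
  #runs on a worker; the last value is returned and collected via .combine
  median(rnorm(100, mean = i))
}

stopCluster(cl)
out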

I would greatly appreciate any suggestions on possible ways to achieve the same results in a more efficient manner.

Background:

I am using dismo::gbm.step to build GBM models. gbm.step selects the optimal number of trees through k-fold cross-validation; however, the tree complexity and learning rate parameters still need to be set. I understand that caret::train was built for exactly this task, and I have had a lot of fun learning caret, especially its adaptive resampling capabilities. However, my response is binomial and caret does not have an option to return AUC for binomial distributions; I would like to use AUC to replicate similar published studies in my field (ecology).

I am also using dismo::gbm.simplify later in the analysis to identify possible reduced models. gbm.simplify relies on data created when building models in dismo, and it cannot work on models built in caret.

Finally, most of the GBM literature in ecology follows the methods described in Elith et al. 2008, "A working guide to boosted regression trees", which the BRT functions in dismo are based on. For the purposes of this study, I would like to keep using dismo to build my GBM models.

The function I wrote tests several combinations of tree.complexity and learning.rate and returns a list of several performance metrics for each model. I then combine all of the lists into a data.frame for easy sorting.
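As an aside, I think the combining step could be done in a single data.table::rbindlist call. Here is a sketch with made-up numbers in the same shape as the lists my function returns:

#sketch only: 'results' stands in for the per-model lists my function
#returns; the numbers here are invented for illustration
library(data.table)
results <- list(
  list(interaction.depth = 1, shrinkage = 0.01, n.trees = 700,
       auc = 0.91, cv.auc = 0.86, deviance = 0.55, cv.deviance = 0.62),
  list(interaction.depth = 2, shrinkage = 0.01, n.trees = 650,
       auc = 0.93, cv.auc = 0.87, deviance = 0.51, cv.deviance = 0.60)
)
train.results <- rbindlist(results)                   #one row per model
setorder(train.results, cv.deviance, -cv.auc, -auc)   #sort as in my code below
train.results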

Goals of the function:

  1. Create a GBM model for each combination of tree.complexity and learning.rate.
  2. Store self.statistics$discrimination, cv.statistics$discrimination.mean, self.statistics$mean.resid, and cv.statistics$deviance.mean in a list for each GBM model created.
  3. Remove each GBM model to save space.
  4. Combine each of the lists into a format that enables easy sorting, then remove each list.
  5. Do all of the above in a manner that optimizes parallel processing and reduces computing time and memory used.

Reproducible example using the Anguilla_train dataset from the dismo package

#load libraries
require(pacman)
p_load(gbm, dismo, TeachingDemos, foreach, doParallel, data.table)

data(Anguilla_train)

#identify physical cores on the current system
cores <- detectCores(all.tests = FALSE, logical = FALSE)
cores

#create training function for gbm.step
step.train.fx = function(tree.com, learn){
  #set seed for reproducibility
  char2seed("stackoverflow", set = TRUE)
  k1 <- gbm.step(data = Anguilla_train,
                 gbm.x = 3:13,
                 gbm.y = 2,
                 family = "bernoulli",
                 tree.complexity = tree.com,
                 learning.rate = learn,
                 bag.fraction = 0.7,
                 prev.stratify = TRUE,
                 n.folds = 10,
                 n.trees = 700,
                 step.size = 25,
                 silent = TRUE,
                 plot.main = FALSE,
                 n.cores = cores)

  k.out = list(interaction.depth = k1$interaction.depth,
               shrinkage = k1$shrinkage,
               n.trees = k1$n.trees,
               auc = k1$self.statistics$discrimination,
               cv.auc = k1$cv.statistics$discrimination.mean,
               deviance = k1$self.statistics$mean.resid,
               cv.deviance = k1$cv.statistics$deviance.mean)
  return(k.out)
}

#define tree complexity and learning rate values to test
tree.complexity <- c(1:5)
learning.rate <- c(0.01, 0.025, 0.005, 0.0025, 0.001)

#set up parallel backend to use n processors
cl <- makeCluster(cores)
registerDoParallel(cl)

#run the actual function
foreach(i = tree.complexity) %do% {
  foreach(j = learning.rate) %do% {
    nam = paste0("gbm_tc", i, "lr", j)
    assign(nam, step.train.fx(tree.com = i, learn = j))
  }
}

#stop parallel backend
stopCluster(cl)
registerDoSEQ()

#disable scientific notation
options(scipen = 999)

#find items in the workspace that contain "gbm_tc"
train.all <- ls(pattern = "gbm_tc")

#cbind each list that contains "gbm_tc"
train.results <- list(do.call(cbind, mget(train.all)))

#place in a data frame
train.results <- do.call(rbind, lapply(train.results, rbind))
train.results <- data.frame(matrix(unlist(train.results), ncol = 7, byrow = TRUE))

#change column names
colnames(train.results) <- c("tc", "lr", "n.trees", "auc", "cv.auc", "dev", "cv.dev")

#round columns 4:7
train.results[, 4:7] <- round(train.results[, 4:7], digits = 3)

#sort by cv.dev, then cv.auc, then auc
train.results <- train.results[order(train.results$cv.dev,
                                     -train.results$cv.auc,
                                     -train.results$auc), ]

train.results
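One thing I suspect, but have not been able to confirm, is that assign() is part of my %dopar% problem: each worker has its own workspace, so objects assigned there would never show up in my session, which could make %dopar% look like it is doing nothing. A small sketch of what I mean:

#sketch of the suspected problem: assign() inside %dopar% happens on the
#workers, so nothing appears in the master workspace afterwards
cl <- makeCluster(2)
registerDoParallel(cl)

foreach(i = 1:2) %dopar% assign(paste0("x", i), i)
exists("x1")   #FALSE: the assignment stayed on the worker

res <- foreach(i = 1:2) %dopar% i   #return values ARE collected
res   #a list containing 1 and 2

stopCluster(cl)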

I'm still trying to work out how to do this myself, and you've gotten a lot further than me! One thing that occurs to me is that the problem might be in the nested %do%s. As a test, why not try %dopar% on j, or see if you can collapse i & j into a matrix, a single vector, or possibly a list containing the permutations of both terms, to be passed to gbm.step? e.g.

tree.complexity = i[1], learning.rate = i[2], 
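Spelled out as an untested sketch, using expand.grid to build the permutations (step.train.fx is your function from the question, and a cluster is assumed to be registered with registerDoParallel):

#untested: build every tree.complexity/learning.rate pair, then use a
#single, non-nested loop over the rows of the grid
params <- expand.grid(tree.com = 1:5,
                      learn = c(0.01, 0.025, 0.005, 0.0025, 0.001))

train.list <- foreach(k = 1:nrow(params),
                      .packages = c("gbm", "dismo", "TeachingDemos")) %dopar% {
  step.train.fx(tree.com = params$tree.com[k], learn = params$learn[k])
}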

Please let me know if you have success!

Edit: also, a potential route is the %:% nesting operator here.

foreach(tree.com = 1:5) %:%
  foreach(learn = c(0.01, 0.025, 0.005, 0.0025, 0.001)) %dopar% {
    gbm.step ...
    return(list(...))
  }

If you added tree.com & learn to the returned list, it would potentially spit out a nice matrix of values. Another option:

foreach(tree.com = 1:5,
        learn = c(0.01, 0.025, 0.005, 0.0025, 0.001)) %dopar% {
  gbm.step ...
  return(list(...))
}

One caveat I just noticed with this second form: when a single foreach is given two vectors it steps through them together, elementwise, so it would only test five pairs rather than all 25 combinations; you would need something like expand.grid to generate every pair first.
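For what it's worth, here is roughly how I would expect the full %:% version to look. This is an untested sketch: it reuses your step.train.fx and cores, collects the return values with .combine instead of assign(), and declares the worker packages with .packages (package names taken from your p_load call):

#untested sketch: nested foreach via %:%, collecting return values
#instead of assign(); reuses step.train.fx and cores from your post
cl <- makeCluster(cores)
registerDoParallel(cl)

train.list <- foreach(tree.com = 1:5, .combine = c) %:%
  foreach(learn = c(0.01, 0.025, 0.005, 0.0025, 0.001),
          .packages = c("gbm", "dismo", "TeachingDemos")) %dopar% {
    step.train.fx(tree.com = tree.com, learn = learn)
  }

stopCluster(cl)
registerDoSEQ()

#one row per model, then sort the same way you do
train.results <- data.frame(do.call(rbind, lapply(train.list, unlist)))
train.results <- train.results[order(train.results$cv.deviance,
                                     -train.results$cv.auc,
                                     -train.results$auc), ]

You might also want to set n.cores = 1 inside gbm.step so the workers don't oversubscribe the CPU, but I haven't tested that either.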
