Issue 3: support parallel processing
6 people starred this issue and may be notified of changes.
What steps will reproduce the problem?
1. Train a large data set with 5-fold CV on a machine with 2+ CPUs and adequate

What is the expected output? What do you see instead?
I expect all the CPUs to be used for CV, but instead it is slow. Some of my models take many hours, and after trying different interaction levels and maximum tree counts, it can take days.

What version of the product are you using? On what operating system?
gbm 1.6 (and probably 2.0)

Please provide any additional information below.
Please support %dopar% like the caret package does.
Jan 20, 2013
#1
ahz...@gmail.com
Jan 23, 2013
I just started looking at the documentation for the 'parallel' package, which comes with a base installation of R. So far as I can tell, it ought to be straightforward to use parLapply to parallelize the loop that does the cross-validation. Does anyone know differently? In principle, it might be better to parallelize the tree-building in the C++ code (except for runs in which stumps are used), but I'm an awful lot more familiar with R than C++, so I would choose the R route.
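The parLapply idea can be sketched with base R alone. This is only an illustration of the fold bookkeeping, not gbm's actual code: `fit_one_fold` is a hypothetical stand-in (here using lm as a placeholder model) for the work gbm would do on one CV fold.

```r
library(parallel)

# Hypothetical stand-in for fitting one CV fold: train on the rows outside
# the fold, return the held-out error. A real version would call gbm's
# tree-building routine instead of lm().
fit_one_fold <- function(fold, data, folds) {
  train <- data[folds != fold, , drop = FALSE]
  test  <- data[folds == fold, , drop = FALSE]
  m <- lm(y ~ x, data = train)            # placeholder model
  mean((test$y - predict(m, test))^2)     # held-out MSE for this fold
}

set.seed(1)
d <- data.frame(x = rnorm(100), y = rnorm(100))
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(d)))  # random fold labels

cl <- makeCluster(2)                      # or detectCores()
cv_err <- parLapply(cl, seq_len(k), fit_one_fold, data = d, folds = folds)
stopCluster(cl)

mean(unlist(cv_err))                      # overall CV error estimate
```

parLapply splits the k fold indices across however many workers the cluster has, so the number of workers need not equal the number of folds.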
Jan 23, 2013
The benefit of doing the parallelization with %dopar% instead of in C++ is that %dopar% abstracts the backend, so it works in different environments (such as Linux, Windows, and SNOW clusters). This vignette gives an example of building a random forest in parallel, so doing CV should be similar: http://cran.r-project.org/web/packages/foreach/vignettes/foreach.pdf
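The backend abstraction looks roughly like this (assuming the foreach and doParallel packages are installed; doParallel is just one interchangeable backend behind %dopar%). The loop body here is a trivial placeholder standing in for fitting one CV fold.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)        # swap in a different backend without
registerDoParallel(cl)      # touching the loop below

# Each iteration could be one CV fold; a trivial computation stands in.
res <- foreach(k = 1:5, .combine = c) %dopar% {
  k^2   # placeholder for fitting fold k and returning its error
}

stopCluster(cl)
res   # 1 4 9 16 25, in fold order
```

Because %dopar% only describes the iteration, the same loop runs on multicore, SNOW, or MPI backends depending on which register* call was made.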
Jan 23, 2013
I have been parallel processing with gbm, but in a slightly different context. From what I read in the manual, part of what makes the CV calculations slower is that one needs to use gbm() instead of gbm.fit(); the former relies on model.frame, which slows things down. I have used gbm.fit() for repeated bootstrap-type evaluations. This works fine in parallel - I've used Revolution R, but any of the parallel routines your OS supports should work.

I have been stymied trying to write my own CV routines with gbm.fit(). I have noticed in the manual that the routines internally shuffle the records prior to training (in case targets are grouped together). This means one cannot take advantage of gbm$valid.error for the calculations: the CV holdout needs to be scored separately from the training runs. Thus one has to do 10 gbm.fit() and 10 predict.gbm() calls to find the appropriate number of trees, and then train on the entire data set for the final model. At least with just a dozen or so predictors it does not end up any faster; I have yet to try this with a few hundred predictors, where it might be an advantage. One could send each fold to a separate core, which theoretically would reduce the time by a factor of 8-10.

I recall reading that the caret package has some functionality for parallel processing with gbm - I believe it was in the setting of exploring parameter optimization - but I have not really explored that avenue. I tend to tune parameters with gbm.fit() and then train the model with gbm() using 10-fold CV. Probably not the most correct approach, but it has worked out okay so far.
Jan 23, 2013
Yes, I am suggesting sending "each fold to a separate core" - or more specifically, splitting them up using %dopar% (the number of workers need not equal the number of folds). That is how I understand caret does it, and it should help a lot. Actually, I would use caret, except caret seems to have a worse way of calculating the optimum number of trees: it requires each evaluation point to be given separately, so evaluating a grid of c(1:4000) could be slow(?) compared to gbm.perf(). Right now I am doing 5-fold CV with 4000 trees on a data set that is about 200K x 200, and it takes about 12 hours while my 8-core machine runs at 12% CPU (i.e., there are unused resources). If I had enough RAM to use all 8 CPUs, I should be done in ~2 hours.
Jan 24, 2013
I hear you. The caret approach is probably the only way to write your own CV routine. Since gbm.fit() shuffles the records internally, one cannot use the nTrain parameter and be certain of which rows are used for training and validation. If it didn't shuffle, things would be much easier/faster: you could combine the $valid.error vectors from the 10 runs without having to iterate over the 4000 trees 5 times - it would already be done. On my wish list for gbm.fit() is a boolean parameter that would let one turn off this shuffling. I am not the most advanced programmer, and perhaps others have a better approach. Good luck.
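The combining step being described can be sketched in base R. The fold curves below are fabricated noisy U-shapes standing in for the per-tree held-out error that gbm's $valid.error would supply for one run; the point is only the element-wise averaging and which.min selection, which is what gbm.perf-style selection amounts to.

```r
set.seed(2)
k <- 5
n.trees <- 4000

# Fake per-fold error curves: one held-out error per tree count 1..n.trees,
# as $valid.error would give for each of the k runs. U-shape plus noise.
fold_curves <- lapply(seq_len(k), function(i) {
  t <- seq_len(n.trees)
  (t - 1500)^2 / 1e6 + rnorm(n.trees, sd = 0.05)
})

# Average the k curves element-wise, then pick the minimizing tree count.
cv_curve  <- Reduce(`+`, fold_curves) / k
best.iter <- which.min(cv_curve)
```

This is a single pass over vectors that already exist, which is why not having to re-evaluate the grid per tree count matters.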
Jan 24, 2013
Attached is an R script that computes a cross-validation model using gbm.fit() and parallel processing via doParallel and the foreach/%dopar% construct. I keep finding little things to improve, so I've reposted this code several times. This looks like a fairly solid version; feel free to check for errors. It is working on my machine.
Jan 27, 2013
I've started work on this. I want to use the parallel package because it comes with the base install. Take a look at the 'parallel' branch on the source tree if you're interested.
Status:
Started
Jan 28, 2013
Please see version 2.0-9, downloadable from the project's home page. This passes R CMD check on my Linux system and on Windows. I still need to edit some of the examples in the help files before considering submitting to CRAN. I'll also attempt to address some of the other issues raised before doing that. Please test this out and let me know if you encounter any issues. Harry
Jan 28, 2013
I tried v2.0-9 on my home laptop (Win7 64-bit, 2 cores, RStudio) with the robusrReg example. It seems to recruit cores just fine, and the plots appear, but I do get an error:

Error in eval(expr, envir, enclos) : object 'rb8' not found
In addition: Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
4: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

I don't think this is from the parallel portion; I defer to Harry. On Wednesday I will be back in the office, where I can test on a more robust machine and also in the Revolution R environment. Thanks for the improvements!!!!! Bob
Jan 29, 2013
Hmm, that script is out of date and expects residuals to be available when they're not. I'll have a think about what to do with it.
Apr 15, 2013
I used the code above for 0-1 classification with the bernoulli distribution and it worked very well, but now I need to do a k-class classification (factors 0 to 6), thus using the multinomial distribution. I am not sure how to change the error function.
Apr 16, 2013
If your response is a factor, gbm should automatically decide it is multinomial and tell you. Do myresp <- factor(myresp) to be sure.
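For the error-function question, one natural replacement for the Bernoulli holdout error is the multinomial deviance. The sketch below assumes the holdout predictions have been turned into an n x K matrix of class probabilities (predict.gbm on a multinomial fit returns per-class values that may need normalizing first); `multinomial_deviance` is a hypothetical helper, not part of gbm.

```r
# Multinomial deviance: -2 * mean log predicted probability of the true
# class. 'probs' is an n x K probability matrix, 'y' the factor of truths.
multinomial_deviance <- function(probs, y) {
  idx <- cbind(seq_along(y), as.integer(y))  # (row, column) of true class
  p <- pmax(probs[idx], 1e-15)               # guard against log(0)
  -2 * mean(log(p))
}

# Tiny example: 3 observations, 3 classes
probs <- rbind(c(0.7, 0.2, 0.1),
               c(0.1, 0.8, 0.1),
               c(0.2, 0.3, 0.5))
y <- factor(c("a", "b", "c"))
multinomial_deviance(probs, y)
```

Averaging this quantity over the CV folds at each tree count gives a curve to minimize, just as with the Bernoulli case.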
Nov 26, 2013
Status:
Fixed