SASHELP.CARS, with 428 observations and 15 variables, is a free dataset shipped with SAS that I use to try out classification methods. I have always wanted to predict which region a randomly chosen car comes from (the Origin variable: USA, Asia or Europe). After trying many methods in SAS, including decision trees, logistic regression, k-NN and SVM, I eventually found that random forest, an ensemble classifier built from many decision trees [Ref. 1], can cut the overall misclassification rate to around 25%. The SAS macro below does the work by calling R's 'randomForest' package. In my tiny experiment, an ensemble of 100 trees seemed to give the best result.
The concept of random forest was introduced by Leo Breiman and Adele Cutler [Ref. 2], who also wrote elegant Fortran code for it. Andy Liaw at Merck did a fantastic job of porting that Fortran code to R [Ref. 3]. Now everybody with a computer can use this state-of-the-art classification method for fun or work.
References:
1. Albert Montillo. ‘Random Forest’. http://www.ist.temple.edu/
2. Leo Breiman and Adele Cutler. http://stat-www.berkeley.edu/users/breiman/RandomForests/
3. Andy Liaw. ‘randomForest: Breiman and Cutler's random forests for classification and regression’. http://cran.r-project.org/web/packages/randomForest/index.html
/*******************READ ME*********************************************
* - A macro calls random forest in SAS by R -
*
* SAS VERSION: 9.1.3
* R VERSION: 2.13.0 (library: 'randomForest', 'foreign')
* DATE: 18may2011
* AUTHOR: hchao8@gmail.com
*
****************END OF READ ME******************************************/
****************(1) MODULE-BUILDING STEP********************************;
%macro rf(train = , validate = , result = , targetvar = , ntree = ,
tmppath = , rpath = );
/*****************************************************************
* MACRO: rf()
* GOAL: invoke randomForest in R to perform random forest
* classification in SAS
* PARAMETERS: train = dataset for training
* validate = dataset for validation
* result = dataset after prediction
* ntree = number of trees specified
* targetvar = target variable
* tmppath = temporary path for exchange files
* rpath = installation path for R
*****************************************************************/
proc export data = &train outfile = "&tmppath\sas2r_train.csv" dbms = csv replace;
run;
proc export data = &validate outfile = "&tmppath\sas2r_validate.csv" dbms = csv replace;
run;
proc sql;
create table _tmp0 (string char(200));
insert into _tmp0
set string = 'train=read.csv("sas_path/sas2r_train.csv",header=T)'
set string = 'validate=read.csv("sas_path/sas2r_validate.csv",header=T)'
set string = 'sink("sas_path/result.txt", append=T, split=F)'
set string = 'require(randomForest,quietly=T)'
set string = 'model=randomForest(sas_targetvar~ .,data=train,'
set string = 'do.trace=10,ntree=sas_treenumber,importance=T)'
set string = 'predicted = predict(model,newdata=validate,type="class")'
set string = 'result=as.data.frame(predicted)'
set string = 'importance(model)'
set string = 'table(validate$sas_targetvar, predicted)'
set string = 'require(foreign, quietly=T)'
set string = 'write.foreign(result,"sas_path/r2sas_tmp.dat",'
set string = '"sas_path/r2sas_tmp.sas",package="SAS")';
quit;
data _tmp1;
set _tmp0;
string = tranwrd(string, "sas_treenumber", "&ntree");
string = tranwrd(string, "sas_targetvar", propcase("&targetvar"));
string = tranwrd(string, "sas_path", translate("&tmppath", "/", "\"));
run;
data _null_;
set _tmp1;
file "&tmppath\sas_r.r";
put string;
run;
options xsync xwait;
x "cd &rpath";
x "R.exe CMD BATCH --vanilla --slave &tmppath\sas_r.r";
data _null_;
infile "&tmppath\result.txt";
input;
if _n_ = 1 then put "NOTE: Statistics by R";
put _infile_;
run;
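/* r2sas_tmp.sas is not shipped with this macro: write.foreign() in R */
/* generates it (with r2sas_tmp.dat) on every run; it builds dataset rdata */
%include "&tmppath\r2sas_tmp.sas";
/* One-to-one merge: attach the predicted class to the validation rows */
data &result;
set &validate;
set rdata;
run;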
%mend rf;
****************(2) TESTING STEP****************************************;
/* cars_train and cars_validate must already exist before this call */
%rf(train = cars_train, validate = cars_validate, result = cars_result,
targetvar = origin, ntree = 100, tmppath = c:\tmp,
rpath = D:\Program Files\R\R-2.13.0\bin);
****************END OF ALL CODING***************************************;
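For readers who want to try the R side on its own, here is a minimal sketch of the same kind of call the macro writes into sas_r.r. It is not the macro's exact script: SASHELP.CARS is not available outside SAS, so the built-in iris data stands in for it, and the 70/30 split, the seed and the variable name misclass are my own assumptions for illustration. It only needs the 'randomForest' package to be installed.

```r
# Sketch only: iris is a stand-in for SASHELP.CARS, which exists only in SAS.
library(randomForest)

set.seed(2011)  # arbitrary seed (assumption) so the split is repeatable

# Hypothetical 70/30 split into training and validation rows
idx      <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train    <- iris[idx, ]
validate <- iris[-idx, ]

# Same call shape as in the macro: formula, ntree, importance
model <- randomForest(Species ~ ., data = train,
                      ntree = 100, importance = TRUE)

# Predict the class of each validation row
predicted <- predict(model, newdata = validate, type = "response")

# Confusion table and overall misclassification rate
confusion <- table(actual = validate$Species, predicted = predicted)
misclass  <- 1 - sum(diag(confusion)) / sum(confusion)

print(confusion)
print(importance(model))
cat("Overall misclassification rate:", misclass, "\n")
```

The misclassification rate is simply one minus the share of validation rows on the diagonal of the confusion table, which is how a figure like the 25% quoted above can be read off the table that the macro prints to the SAS log.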
This raises an important question: If I need to do RF, SVM, GBM etc. in R to get much better predictors, why should my company pay a fortune for SAS?
I like this post and am trying to learn how to connect SAS and R. However, I was unable to run the sample code because I cannot find the included file r2sas_tmp.sas (%include "&tmppath\r2sas_tmp.sas";).
Can you share that code?
Thanks,
Abdul
The RF in R is limited to 32 levels in each categorical variable. R (S) is an object-oriented language, but SAS is not. There are two software packages that can handle more than 32 levels, but with non-optimal solutions.