Tuesday, May 3, 2011

Support vector machine in SAS by R


I just recently discovered endless fun to synchronize SAS and R to do something meaningful. Yep, I am a SAS programmer: during the day time, I use SAS for my work; at the evening, I use R for entertainment. It is always exciting to hook up them together. How about a SAS/R module, like SAS/STAT or SAS/BASE, in the future?

Some SAS programmers or SAS ‘developers’ already utilized coding to communicate SAS and R [Ref. 1 and 2] (thanks to Rick Wicklin’s mentioning). Since R can write dataset in SAS code (the ‘foreign’ package) and SAS can use call R in X command to do batch execution, so far I didn’t find much difficulty to use SAS as a GUI for R.

Support vector machine (SVM) is a cool and fancy classification method. It is said that a secret SVM procedure is already running under SAS Enterpriser Miner (I did not get a chance to try it yet). The package ‘e1071’ in R provides the state-of-art protocols for SVM classification. Then I made a small macro to call it in SAS, which performs like a typical SAS procedure. Hope everyone who is doing data could enjoy it.

Reference:
1.Philip R Holland ‘SAS to R to SAS’. Holland Numerics Limited.
2.Phil Rack. ‘A Bridge to R for SAS Users’. www.MineQuest.com

/*******************READ ME*********************************************
* - SUPPORT VECTOR MACHINE FOR CLASSIFICATION IN SAS BY R -
*
* SAS VERSION: SAS 9.1.3
* R VERSION: R 2.13.0 (library: 'e1071', 'foreign')
* DATE: 03may2011
* AUTHOR: hchao8@gmail.com
*
****************END OF READ ME******************************************/

****************(1) MODULE-BUILDING STEP******************;
%macro svm(train = , validate = , result = , targetvar = , tmppath = , rpath = );
/*****************************************************************
* MACRO: svm()
* GOAL: invoke e1071 in R to perform support vector machine
* classification in SAS
* PARAMETERS: train = dataset for training
* validate = dataset for validation
* result = dataset after prediction
* targetvar = target variable
* tmppath = temporary path for exchagne files
* rpath = installation path for R
*****************************************************************/
proc export data = &train outfile = "&tmppath\sas2r_train.csv" replace;
run;
proc export data = &validate outfile = "&tmppath\sas2r_validate.csv" replace;
run;

proc sql;
create table _tmp0 (string char(80));
insert into _tmp0
set string = 'train=read.csv("sas_path/sas2r_train.csv",header=T)'
set string = 'validate=read.csv("sas_path/sas2r_validate.csv",header=T)'
set string = 'require(e1071,quietly=T)'
set string = 'model=svm(sas_targetvar ~ . ,data=train)'
set string = 'predicted=predict(model,newdata=validate,type="class")'
set string = 'result=as.data.frame(predicted)'
set string = 'require(foreign, quietly=T)'
set string = 'write.foreign(result,"sas_path/r2sas_tmp.dat",'
set string = '"sas_path/r2sas_tmp.sas",package="SAS")';
quit;
data _tmp1;
set _tmp0;
string = tranwrd(string, "sas_targetvar", propcase("&targetvar"));
string = tranwrd(string, "sas_path", translate("&tmppath", "/", "\"));
run;
data _null_;
set _tmp1;
file "&tmppath\sas_r.r";
put string;
run;

options xsync xwait;
x "cd &rpath";
x "R.exe CMD BATCH --vanilla --slave &tmppath\sas_r.r";

data _null_;
infile "&tmppath\sas_r.r.rout";
input;
if _n_ = 1 then put "NOTE: Time used by R";
put _infile_;
run;

%include "&tmppath\r2sas_tmp.sas";
data &result;
set &validate;
set rdata;
run;

proc datasets nolist;
delete _: rdata;
quit;
%mend;

****************(2) TESTING STEP******************;
******(2.1) BUILD A PARTITION MACRO TO SEPARATE TESTING DATASET*************;
%macro partbyprop2(data = , targetvar = , samprate = , seed = , train = , validate = );
/**************************************************************
* MACRO: partbyprop2()
* GOAL: partition dataset by target variable's proportion
* and choose numerical variables for classification
* PARAMETERS: data = input dataset
* targetvar = target variable
* samprate = ratio of train v.s. validate datasets
* seed = random seed for sampling
**************************************************************/
ods listing close;
ods output variables = _varlist;
proc contents data = &data;
run;
proc sql;
select variable into: num_var separated by ' '
from _varlist
where lowcase(type) = "num";
quit;

proc sort data = &data out = _tmp1;
by &targetvar;
run;

proc surveyselect data = _tmp1 samprate = &samprate
out = _tmp2 seed = &seed outall;
strata &targetvar / alloc = prop;
run;

data &train &validate;
set _tmp2;
format _numeric_ best13.;
keep &num_var &targetvar;
if selected = 0 then output &train;
else output &validate;
run;

proc datasets;
delete _:;
quit;
ods listing;
%mend;

******(2.2) DIVIDE SASHELP.IRIS DATASET INTO TWO EQUAL PARTS*************;
%partbyprop2(data = sashelp.iris, targetvar = species, samprate = 0.5, seed = 20110503,
train = iris_train, validate = iris_validate);

******(2.3) USE THE SVM MACRO*************;
%svm(train = iris_train, validate = iris_validate, result = iris_result,
targetvar = species, tmppath = c:\tmp, rpath = c:\Program Files\R\R-2.13.0\bin);

****************(3) VISUALIZATION STEP******************;
data iris_visual;
set iris_result;
length color shape $8.;
predvalue = put(predicted, predictd.);
if species = "Setosa" then shape = "club";
if species = "Versicolor" then shape = "diamond";
if species = "Virginica" then shape = "spade";
if predvalue = "Setosa" then color = "blue";
if predvalue = "Versicolor" then color = "red";
if predvalue = "Virginica" then color = "green";
run;

ods html style = harvest
proc g3d data = iris_visual;
scatter PetalLength * PetalWidth = SepalLength /
color = color shape = shape;
run;quit;
ods html close;

****************END OF ALL CODING***************************************;
 
Link of r2sas_tmp.sas

No comments:

Post a Comment