Friday, February 18, 2011

Visualizing a decision tree by coding Proc Arboretum


Decision trees (tree-based or recursive partitioning) have taken many of the top positions in recent data mining competitions. Like logistic regression, a tree is easy to build and to explain, but it usually delivers more predictive power (a higher AUC). Unlike SVMs, neural networks or random forests, a decision tree is quick to train and light on resources, which is a real blessing for big data. No wonder regression trees and classification trees are widely used in industry: thanks to Google’s spam filtering in Gmail, I am seldom harassed by spam.

Documentation for Proc Arboretum is still scarce. In my experience, Proc Arboretum is robust and powerful. It sorts input variables into three measurement levels: nominal, ordinal and interval. It lets users prune the tree interactively. It also generates a number of statistics about the partitioning criteria. And it supports an integrated training-validation-scoring flow and even outputs scoring code. Overall, it satisfies my wildest dreams about decision trees. However, since it is one of the pillars of SAS Enterprise Miner, SAS Institute is probably reluctant to disclose more details of this procedure to those who have the license but would rather code by hand. Without Enterprise Miner, SAS programmers can hardly draw the physical tree. Some resort to R instead, because R’s package ‘rpart’ is now stable enough for production use and provides convenient functions to display trees.

SAS’s plotting procedures can visualize the results from Proc Arboretum. In this example, I again used the SASHELP.CARS data set to explore whether a decision tree can recognize the origin of a car (Asia, Europe or USA). With an ancient procedure, Proc Netdraw, I built a not-so-good-looking tree. With the higher-level SG plotting procedures, I displayed some deeper information from the Proc Arboretum results, such as the importance of the variables and the prediction accuracy.

Reference: The ARBORETUM Procedure. 'www.sasenterpriseminer.com/documents/proc_arbor.pdf'.

********(1) CONSTRUCT DECISION TREE AND OUTPUT DATASETS********;
filename outcode 'h:\outcode.txt';
proc arboretum data=sashelp.cars ;
target origin / level=nominal;
input MSRP Cylinders Length Wheelbase MPG_City
MPG_Highway Invoice Weight Horsepower/ level=interval;
input EngineSize/level=ordinal;
input DriveTrain Type /level=nominal;
code file=outcode;
save IMPORTANCE=imp1 MODEL=model1 NODESTATS=nodstat1
RULES=rul1 SEQUENCE=seq1 STATSBYNODE= statb1 SUM=sum1;
run;
quit;
********END OF STEP(1)***********;
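
****(1.1) OPTIONAL CHECK, NOT PART OF THE ORIGINAL POST****;
/* A minimal sketch: list the columns of the saved data sets before plotting.
   The steps in (2) assume columns such as NAME, IMPORTANCE, NODETEXT and
   U_Origin. Those names come from the run described in this post and may
   differ in other SAS releases, so it is worth confirming them first. */
proc contents data=nodstat1 varnum;
run;

proc print data=imp1(obs=10);
run;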

********(2) VISUALIZE DECISION TREE RESULTS************;
****(2.1) IMPORTANCE OF VARIABLES*****;
proc sgplot data=imp1;
vbar name/response=importance;
run;
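
****(2.1a) OPTIONAL SKETCH, NOT PART OF THE ORIGINAL POST****;
/* A hedged variation: rank the bars from most to least important.
   IMP1_S is a hypothetical data set name introduced here. DISCRETEORDER=DATA
   is assumed to be available in your SAS release. It may be missing in early
   9.2 builds, in which case inspect IMP1_S with PROC PRINT instead. */
proc sort data=imp1 out=imp1_s;
by descending importance;
run;

proc sgplot data=imp1_s;
vbar name/response=importance;
xaxis discreteorder=data;
run;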

****(2.2) INTERACTION AMONG THE THREE MOST SIGNIFICANT VARIABLES****;
proc sgscatter data=sashelp.cars;
plot invoice*(wheelbase length)/group=origin;
run;

****(2.3) CONSTITUENTS OF EACH NODE****;
proc sgplot data=statb1;
vbar node/response=STATVALUE group=CATEGORY;
run;

****(2.4) BUILD PHYSICAL TREE****;
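proc sql;
/* One row per parent-child edge, labeled with the parent's text and class */
create table treedata as
select a.parent as act1, a.node, b.NODETEXT, b.U_Origin
from nodstat1 as a, nodstat1 as b
where a.parent=b.node
union
/* One row per node with no successor, so that every leaf is drawn too */
select c.node as act1, . as node, c.nodetext, c.U_Origin
from nodstat1 as c
;
quit;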

data treedata1;
set treedata;
if U_Origin='Asia' then _pattern=1;
else if U_Origin='Europe' then _pattern=2;
else _pattern=3;
run;

pattern1 c=green;
pattern2 v=s c=red;
pattern3 v=s c=blue;
/*NOTE: USE PROC NETDRAW TO DRAW THE PHYSICAL TREE*/
footnote c=green 'Asia ' c=red 'Europe ' c=blue 'USA';
proc netdraw data=treedata1 graphics;
actnet / activity=act1 successor=NODE id=(NODETEXT)
tree compress rotate rotatetext
font=simplex arrowhead=0 htext=6;
run;
footnote ' ';

****(2.5) SHOW ALL PARTITION STATISTICS *****;
proc transpose data=seq1 out=seq1_t(rename=(col1=value));
var _ASSESS_ _MISC_ _MAX_ _SSE_ _ASE_;
by _NW_ notsorted;
run;

proc sgpanel data=seq1_t;
panelby _name_/UNISCALE=column COLUMNS=4 rows=2 SPACING=5 NOVARNAME;
step x=_NW_ y=value;
colaxis TYPE= DISCRETE grid;
run;

****(2.6) SHOW FINAL PREDICTION ACCURACY****;
proc sort data=sum1( drop=_total_) out=sum1_s;
by _TARGET_;
where _STAT_='N' AND _TARGET_ ^= 'TOTAL';
run;

proc transpose data=sum1_s out=sum1_t(rename=(col1=Number));
var _numeric_;
by _TARGET_;
run;

proc sgplot data=sum1_t;
vbar _LABEL_/response=Number group=_TARGET_;
run;
********END OF STEP(2)*********;

*********END OF ALL CODING*****TESTED ON PC SAS 9.2 ***********;
