Friday, September 30, 2011

Modeling loss given default (LGD) by finite mixture model

The 'highly skewed' and 'highly irregular' loss data from the insurance and banking world is routinely fitted with a simple beta, lognormal, gamma, or Pareto distribution. Looking at the distribution plots, I bet many people don't buy this story and are willing to explore better ways. A finite mixture model, which combines multiple distributions, is a good option to keep on the radar. For example, Matt Flynn will present how to use PROC NLMIXED to build finite mixture models for insurance loss data at the upcoming SAS ANALYTICS 2011 conference. Finally, the revolutionary FMM procedure shipped with SAS 9.3 makes building finite mixture models easy.
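For reference, a finite mixture model represents the response density as a weighted sum of K component densities, where K and the mixing weights are estimated from the data:

$$f(y) = \sum_{k=1}^{K} \pi_k \, f_k(y \mid \theta_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1$$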



For example, I have a sample loss given default dataset with 317 observations: lgd (realized loss given default) is the dependent variable; lgd_a (mean default rate by industry), lev (leverage coefficient by firm), and i_def (mean default rate by year) are the independent variables. The kernel density is plotted below and is hard to characterize by eye.

data lgddata;
   informat lgd lev 12.9 lgd_a 6.4 i_def 4.3;
   input lgd lev lgd_a i_def;
   label lgd   = 'Real loss given default'
         lev   = 'Leverage coefficient by firm'
         lgd_a = 'Mean default rate by industry'
         i_def = 'Mean default rate by year';
cards;
0.747573451 0.413989786 0.6261 1.415
/* Other data*/
0.748255544 0.607452819 0.3645 3.783
;
run;

proc kde data = lgddata;
   univar lgd / plots = all;
run;

/* Reshape the data to draw box plots of all variables in one graph */
data _lgddata01;
   set lgddata;
   id + 1;
run;

proc transpose data = _lgddata01 out = _lgddata02;
   by id;
run;

proc sgplot data = _lgddata02;
   hbox col1 / category = _LABEL_;
   xaxis label = ' ';
run;
What I need PROC FMM to do is to determine: 1. which distribution fits best among the beta, lognormal, and gamma distributions; 2. how many components (ranging from 1 to 10) are optimal for each distribution. To automate and visualize the process, I designed a macro. From the plots above, all penalized criteria (AIC, BIC, etc.) indicate that the beta distribution is better than the other two. The beta distribution also has a higher Pearson statistic and fewer parameters.
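As a reminder, both criteria penalize the log likelihood by the number of parameters p (with sample size n), which is why a model with fewer parameters is rewarded:

$$\mathrm{AIC} = -2\log L + 2p, \qquad \mathrm{BIC} = -2\log L + p\,\log n$$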

ods html style = money;
%macro modselect(data = , depvar = , kmin = , kmax = , modlist = );
   /* Count the models in the space-delimited list */
   %let modcnt = %eval(%sysfunc(count(%cmpres(&modlist), %str( ))) + 1);
   %do i = 1 %to &modcnt;
      %let modelnow = %scan(&modlist, &i);
      /* Capture the fit statistics table for each candidate distribution */
      ods output fitstatistics = &modelnow(rename = (value = &modelnow));
      ods select densityplot fitstatistics;
      proc fmm data = &data;
         model &depvar = / kmin = &kmin kmax = &kmax dist = &modelnow;
      run;
   %end;
   /* Combine the fit statistics side by side */
   data _tmp01;
      %do i = 1 %to &modcnt;
         set %scan(&modlist, &i);
      %end;
   run;
   /* Compare the penalized criteria across distributions */
   proc sgplot data = _tmp01;
      %do i = 1 %to &modcnt;
         series x = descr y = %scan(&modlist, &i);
      %end;
      where descr ne :'E' and descr ne :'P';
      yaxis label = ' ' grid;
   run;
   /* Compare effective parameters, components, and Pearson statistics */
   proc transpose data = _tmp01 out = _tmp02;
      where descr = :'E' or descr = :'P';
      id descr;
   run;
   proc sgplot data = _tmp02;
      bubble x = effective_parameters y = effective_components
             size = pearson_statistic / datalabel = _name_;
      xaxis grid; yaxis grid;
   run;
%mend;
%modselect(data = lgddata, depvar = lgd, kmin = 1,
   kmax = 10, modlist = beta lognormal gamma);


The optimal number of components for the beta distribution is 5, with a beautifully matching curve. The lognormal distribution exhausted the maximum of 10 components and fits the kernel density rather awkwardly. The gamma distribution used 9 components and fits relatively well.


Then I chose the 5-component homogeneous beta distribution to model the LGD data. PROC FMM provided the parameter estimates for all 5 components. From the plot above, the intercepts and the scale parameters differ as expected. Interestingly, the parameters of lgd_a (mean default rate by industry) vary widely across components, while the parameters of i_def (mean default rate by year) converge toward zero.

ods output parameterestimates = parmds;
proc fmm data = lgddata;
   model lgd = lev lgd_a i_def / k = 5 dist = beta;
run;

proc sgplot data = parmds;
   series x = Effect y = Estimate / group = Component;
   xaxis grid label = ' '; yaxis grid;
run;
ods html style = htmlbluecml;
In conclusion, although PROC FMM is still an experimental procedure, its powerful model selection features could significantly change the way people view and use loss data in the risk management industry.

Thursday, September 29, 2011

An easy solution for Multi-Sheet EXCEL reporting

Currently the standard way to output SAS datasets as a multi-sheet EXCEL workbook for reporting is the ExcelXP ODS tagset. I like this method a lot, because it generates stylish multi-sheet EXCEL workbooks and is highly customizable. However, in practice it has some weaknesses. 1 - Running the tagset is resource-costly, since it depends on an 8,000-line SAS program, ExcelXP.sas; with a large SAS dataset, it often gets jammed. 2 - It only allows one grouping variable through the BY statement inside the output procedures (PROC REPORT, PROC PRINT, etc.). 3 - The user often has to estimate the width of each column in EXCEL by hand.
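For comparison, a typical ExcelXP run looks something like the sketch below; it assumes the tagset is available (for example, compiled by %including ExcelXP.sas) and uses the same SASHELP.PRDSAL2 data discussed next.

proc sort data = sashelp.prdsal2 out = prdsal2;
   by state;
run;
ods listing close;
ods tagsets.excelxp file = 'C:\test1\prdsal2_excelxp.xls'
    options(sheet_interval = 'bygroup' sheet_label = 'State');
/* One sheet per BY group; only a single BY variable drives the sheets */
proc print data = prdsal2 noobs label;
   by state;
run;
ods tagsets.excelxp close;
ods listing;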

Actually we can use a SAS macro and a VBA macro together to obtain a high-quality multi-sheet EXCEL workbook. The workflow is pretty simple: first a SAS macro splits a SAS dataset into many XLS files in a folder through the ODS HTML tagset; second a VBA macro merges those single XLS files as sheets into one workbook. For example, SAS ships with a sample dataset, SASHELP.PRDSAL2, with 23,040 observations and 11 variables. If we want to generate a multi-sheet EXCEL workbook grouped by two variables such as 'state' and 'year', we can set up an empty directory on the hard disk and run a macro like the one below. As a result, we will have a number of small XLS files.

%macro split(data = , folder = , clsvar1 = , clsvar2 = );
   options nocenter nodate nonumber ps = 9000;
   title; footnote;
   ods listing close;
   /* Find all combinations of the two grouping variables */
   proc sql noprint;
      create table _tmp01 as
      select &clsvar1, &clsvar2, count(*) as number
      from &data
      group by &clsvar1, &clsvar2
      order by &clsvar1, &clsvar2
      ;quit;
   data _tmp02;
      set _tmp01 nobs = nobs;
      where number gt 0;
      index = _n_;
      call symput('nobs', nobs);
   run;
   /* Write one XLS file per combination */
   %do i = 1 %to &nobs;
      proc sql noprint;
         select &clsvar1, &clsvar2
         into :clsvar1name trimmed, :clsvar2name trimmed
         from _tmp02
         where index = &i
         ;quit;
      %let filepath = &folder\%sysfunc(dequote(&clsvar1name))_%sysfunc(dequote(&clsvar2name)).xls;
      ods html file = "&filepath" style = minimal;
      proc print data = &data noobs label;
         where &clsvar1 = "&clsvar1name" and &clsvar2 = &clsvar2name;
      run;
   %end;
   ods listing;
   ods html close;
%mend;
%split(data = sashelp.PRDSAL2, folder = C:\test1, clsvar1 = state, clsvar2 = year)
Then we can open EXCEL, press ALT+F11, paste the VBA code below, and run it to get a decent multi-sheet EXCEL workbook. The biggest strength of this method is speed: the overall process (running the SAS macro and the VBA macro) takes less than a minute for the relatively large dataset SASHELP.PRDSAL2. It can also be expanded to more grouping variables by modifying the SAS macro a little. In conclusion, for big-data EXCEL reporting, combining a SAS macro and a VBA macro is a good alternative to the ExcelXP ODS tagset.
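A minimal VBA sketch of that merging step, assuming the XLS files sit in C:\test1 as produced by the %split call above:

Sub MergeXlsFiles()
    ' Merge every single-sheet XLS file in the folder into one workbook
    Dim fname As String
    Dim wb As Workbook, src As Workbook
    Application.ScreenUpdating = False
    Set wb = Workbooks.Add        ' the default empty sheets can be deleted afterward
    fname = Dir("C:\test1\*.xls")
    Do While fname <> ""
        Set src = Workbooks.Open("C:\test1\" & fname)
        ' Copy the first sheet and name it after the source file
        src.Sheets(1).Copy After:=wb.Sheets(wb.Sheets.Count)
        wb.Sheets(wb.Sheets.Count).Name = Replace(fname, ".xls", "")
        src.Close SaveChanges:=False
        fname = Dir()
    Loop
    Application.ScreenUpdating = True
End Sub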

Monday, September 19, 2011

A test to count missing values for large data

This morning Rick introduced how to count the missing frequencies for all variables in a dataset, including character and numeric variables. He provided two solutions, with either PROC FREQ or PROC IML. I have a small macro based on PROC SQL's NMISS() function that does the same job. In this big-data era, I am interested in how efficiently these approaches check large data.

I then simulated a dataset with 1e8 observations (5 numeric and 5 character variables) and tested the three methods. According to the results, PROC FREQ is slightly better than PROC SQL in both memory usage and speed. However, a macro may be needed to integrate the many output tables from PROC FREQ for reporting (the demo macro from the SAS Knowledge Base is an option and can be improved). Sadly, PROC IML used up all of the 250 MB of memory the server allocated to me and could not finish the test. The memory footprint should similarly be considered for other matrix languages such as R and Matlab.

P.S.
Thanks to Rick's new code, the missing values were finally counted successfully by PROC IML. The real time spent by PROC IML was 150 seconds, almost equal to running the macro with PROC SQL. It is interesting to compare the functions from these two procedures: the COUNT() and NMISS() functions in PROC SQL and the COUNTN() and COUNTMISS() functions in PROC IML all work for both numeric and character variables.
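As a quick illustration of that symmetry, here is a sketch against the SASHELP.CLASS sample data (NAME is character, HEIGHT is numeric), relying on the observation above that PROC SQL's summary functions accept both types:

proc sql;
   /* NMISS() counts missing values, COUNT() counts nonmissing values */
   select nmiss(name)   as miss_char,  count(name)   as nonmiss_char,
          nmiss(height) as miss_num,   count(height) as nonmiss_num
   from sashelp.class;
quit;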


**********************(1) SIMULATION STEP**************************************;
data test;
   array _c{*} $ c1-c5;
   array _n{*} n1-n5;
   do i = 1 to 1e8;
      do j = 1 to 5;
         /* About 20% missing in the character variables and 30% in the numeric ones */
         if ranuni(1234) > 0.2 then _c[j] = 'A';
         else call missing(_c[j]);
         if ranuni(4321) > 0.3 then _n[j] = rannor(0);
         else call missing(_n[j]);
      end;
      output;
   end;
   drop i j;
run;

**********************(2) MODULE-BUILDING STEP*********************************;
%macro countmiss(data = );
   ods listing close;
   /* Get the variable list of the input dataset */
   ods output variables = _varlist;
   proc contents data = &data;
   run;
   proc sql;
      alter table _varlist
      add nmiss num
      ;
      select count(*) into :nvar
      from _varlist
      ;
      select count(*) into :nobs
      from &data
      ;
   quit;
   /* Count the missing values variable by variable */
   %do i = 1 %to &nvar;
      proc sql;
         select cats('nmiss(', variable, ')') into :miss_var
         from _varlist
         where num = &i
         ;
         select &miss_var into :miss_num
         from &data
         ;
         update _varlist
         set nmiss = &miss_num
         where num = &i
         ;
      quit;
   %end;
   proc sort data = _varlist;
      by num;
   run;
   ods listing;
   proc report data = _varlist nowd split = '*';
      columns type num variable label nmiss pcnmiss;
      define type / order;
      define variable / width = 15;
      define nmiss / display 'Missing*Number';
      define pcnmiss / computed format = percent8.3
             'Missing*Percentage' width = 10;
      compute pcnmiss;
         pcnmiss = nmiss / &nobs;
      endcomp;
   run;
%mend countmiss;

/* Formats used by the PROC FREQ method below */
proc format;
   value $missfmt ' ' = 'Missing' other = 'Not Missing';
   value missfmt   .  = 'Missing' other = 'Not Missing';
run;

**********************(3) TESTING STEP***************************************;
options fullstimer;
*******************METHOD(1): PROC SQL***************************************;
%countmiss(data = test);

*******************METHOD(2): PROC FREQ*************************************;
proc freq data = test;
   format _CHAR_ $missfmt.;
   tables _CHAR_ / missing missprint nocum nopercent;
   format _NUMERIC_ missfmt.;
   tables _NUMERIC_ / missing missprint nocum nopercent;
run;
*******************METHOD(3): PROC IML***************************************;
proc iml;
   use test;
   read all var _NUM_ into x[colname = nNames];
   n = countn(x, "col");
   nmiss = countmiss(x, "col");
   read all var _CHAR_ into x[colname = cNames];
   close test;
   c = countn(x, "col");
   cmiss = countmiss(x, "col");
   /* Combine the counts into one table for printing */
   Names = cNames || nNames;
   rNames = {" Missing", "Not Missing"};
   cnt = (cmiss // c) || (nmiss // n);
   print cnt[r = rNames c = Names label = ""];
quit;
********************END OF ALL CODING*****************************************;

Sunday, September 11, 2011

Add 10 buttons to enhance SAS 9.3 environment

One striking change in SAS 9.3 is that everything is HTMLized. Adding user-defined buttons to the toolbar of SAS’s windowing environment would make coding more efficient.
1. Clear log and result
The most frequently used SAS commands are now just one button click away.

output;clear;log;clear;wpgm;
2. Clear files left by results viewer
Lots of trash (mostly HTML files and PNG images created by SAS 9.3's Results Viewer) would otherwise remain permanently in the current folder after each SAS session (the current folder is shown on the window bar at the bottom). Clearing them out from time to time is a good habit that keeps the hard disk from filling up.

gsubmit "options noxsync noxwait;x 'del *.htm';x 'del *.png';"
3. Clear work directory
Everything in the WORK directory can be cleared out during a SAS session by such an approach.

gsubmit "proc catalog c=work.sasmacr kill force;proc datasets kill nolist;quit;"
4. Turn off ODS graphics
This code line closes ODS graphics for the statistics procedures if needed, and then saves resource expense.

gsubmit "ods graphics off;"
5. Separate html
SAS 9.3 stores all results in one very long HTML file, which is what we see in the Results Viewer. Ken Kleinman mentioned that running this code would start a new HTML file.

gsubmit "ods html close; ods html;"
6. Listing on
One click goes back to the LISTING output for an output-intensive job.

gsubmit "ods _all_ close; ods listing;"
7. Increase DPI
The default image resolution in SAS 9.3 is set at a low level for processing efficiency. Actually, the DPI (dots per inch) in SAS can be raised as high as 600. Graphs at 300 DPI are decent for regular paperwork.

gsubmit"ods html image_dpi=300;"

8. Change style
The default style in SAS 9.3 is htmlblue, an elegant and smooth HTML template. However, it has a weakness: in scatter plots, the overlapping dots tend to be indistinguishable. My favorite for scatter plots is 'harvest'; putting the two side by side shows the difference. SAS 9.3 has many other styles. I like 'navy' or 'money' for pictures that need high contrast. The styles 'journal', 'journal2', and 'journal3' may be considered for publishing purposes.

gsubmit"ods html style=harvest;"
9. Recover formchar
If you see tables in the LISTING output separated by weird characters, it is time to click this button.

gsubmit "options formchar='|----|+|---+=|-/\<>*';"
10. Community help
Searching keywords on SAS-L is probably the most efficient way to find answers to any SAS question. Another option is Lex Jansen's site, which searches across 15,925 SAS conference proceedings papers.

wbrowse 'http://listserv.uga.edu/archives/sas-l.html'

Those user-defined buttons can be exported and imported by PROC CATALOG.

********EXPORT****************************;
proc catalog;
copy in=sasuser.profile out=work.backup;
select sasedit_main / et=toolbox;
quit;
********IMPORT****************************;
proc catalog;
copy in=work.backup out=sasuser.profile;
select sasedit_main / et=toolbox;
quit;