Thursday, March 22, 2012

Multicollinearity and the solutions


In his book, Rudolf Freund described a confounding phenomenon while fitting a linear regression. Given a small data set below, there are three variables - dependent variable(y) and independent variables(x1 and x2). Using x2 to fit y alone, the estimated parameter of x2 f is positive that is 0.78. Then using x1 and x2 together to fit y, the parameter of x2 becomes -1.29, which is hard to explain since clearly x2 and y has a positive correlation.

data raw;
input y x1 x2;
cards;
2 0 2
3 2 6
2 2 7
7 2 5
6 4 9
8 4 8
10 4 7
7 6 10
8 6 11
12 6 9
11 8 15
14 8 13
;;;
run;

ods graphics on / border = off;
proc sgplot data = raw;
reg x = x2 y = y;
reg x = x2 y = y / group = x1 datalable = x1;
run;


The reason is that x1 and x2 have strong correlation each other. Diagnostics are well when using x2 to fit y. However, counting x1 and x2 together into the regression model causes multicollinearity, and therefore demonstrates severe heteroskedasticity and a skewed distribution of the residuals, which violates the assumptions for OLS regressions. Shown in the top scatter plot, 0.78 is the slope of the regression line by y ~ x2 (the longest straight line), while -1.29 is actually the slope of the partial regression lines by y ~ x2|x1 (four short segments).

proc reg data = raw;
model y = x2;
ods select parameterestimates diagnosticspanel;
run;

proc reg data = raw;
model y = x1 x2;
ods select parameterestimates diagnosticspanel;
run;
Solutions:

1. Drop a variable
Standing alone, x1 seems like a better predictor (higher R-square and lower MSE) than x2. The easiest way to remove this multicollinearity is to keep only x1 in the model.

proc reg data = raw;
model y = x1;
ods select parameterestimates diagnosticspanel;
run;

2. Principle component regression
If we want to keep both variables to avoid information loss, principle component regression is a good option. PCA would transform the correlated variables to the orthogonal factors. In this case, the 1st eigenvector explains 97.77% of the total variance, which is fairly enough for the following regression. SAS's PLS procedure can also perform the principle component regression.


proc princomp data = raw out = pca;
ods select screeplot corr eigenvalues eigenvectors;
var x1 x2;
run;

proc reg data = pca;
model y = prin1;
ods select parameterestimates diagnosticspanel;
run;
3. Repeated measure
We may apply the mixed model and compute the R matrix using x1 as covariate.

proc mixed data = raw plots(only) = ResidualPanel method = ml;
model y = x2 / s;
repeated / type=ar(1) subject=x1 r;
run;

Thursday, March 15, 2012

Make a frequency function in SAS/IML


Aggregation is probably the most popular operation in the data world. R comes with a handy table() function. Usually in SAS, the FREQ procedure would deal with this job. It will be great if SAS/IML has an equivalent function. I just created a user-defined function or module for such a purpose. Since it contains a DO loop, the efficiency is not very ideal -- always 10 times slower than PROC FREQ for a simulated data set of one million records.

/* 1 - Use IML for simulation and aggregation */
proc iml;
start freq(invec);
x = t(unique(invec));
y = repeat(x, 1, 2);
do i = 1 to nrow(x);
y[i, 2] = ncol(loc(invec=y[i, 1]));
end;
return(y);
finish;
store module = freq;
quit;

proc iml;
load module = freq;
/* Simulate a vector with 1 million values */
test = abs(floor(rannor(1:1e6)*100));
t0 = time();
result = freq(test);
timer = time() - t0;
print timer;
/* Output the result matrix as SAS data set */
create a var{"level" "frequency"};
append from result;
close a;
quit;

proc sgplot data = a;
series x = level y = frequency;
run;

/* 2 - Use PROC FREQ for simulation and aggregation */
data test;
do i = 1 to 1e6;
test = abs(floor(rannor(1234)*100));
output;
end;
drop i;
run;

options fullstimer;
proc freq data = test noprint;
table test / out = b;
run;

Thursday, March 8, 2012

Top 10 reasons for a modeler to learn PROC FCMP


10.  New financial functions
34 pre-compiled financial functions were shipped with SAS for free, which can be called in DATA step.

9. A management GUI
There is a nice-looking Java-powered GUI to manage the user-defined functions: SAS FCmp function editor.

8. Dynamic parameters
Parameters from other statistical procedures in SAS can be passed to a function. We can use a macro to wrap PROC FCMP to create tons of data-driven functions (temporary or permanent).

7. Migration from VBA
Excel dominates every desktop. However, using VBA to process large data is just a pain, such as calculating a transition matrix for probability of default. PROC FCMP is built on the logic of VBA -- easy to port any Excel macro/functions to SAS.

6. Backtesting
PROC FCMP loads data into memory. For backtesting task, it is much more efficient than hard-disk based DATA step.

5. Function/subroutine compiler
A lot of functions/subroutines SAS doesn’t have can be built, such as Binomial tree function or an asset price simulation subroutine.

4.  Encapsulation
To combine a lot of macros/nested macros for a complicated system, encapsulating the macro variables is a serious concern – PROC FCMP may be a rescue.

3. Monte Carlo simulation
The MCMC procedure, the designated procedure for Monte Carlo simulation in SAS, fully supports the syntax of PROC FCMP with its BEGINCNST and ENDCNST statements.

2. Deal with matrix
As part of SAS/BASE, PROC FCMP, allows all kinds of matrix operations everywhere with a SAS system.

1. Low level computation
The combination of PROC FCMP and PROC PROTO together wraps C functions in SAS. It is interesting to compare the efficiency between a C function and a SAS function.

Thursday, March 1, 2012

Rolling regressions for backtesting




Market always generates huge volume time series data with millions of records. Running regressions to obtain the coefficients in a rolling time window is common for many backtesing jobs. In SAS, writing a macro based on the GLM procedures such as PROC REG is not an efficient option. We can imagine the situations: calling PROC REG thousands of times in a big loop would easily petrify any system.

The better way is to go down to the bottom to re-implement the OLS clichés: inverse, transpose and multiply the vectors and matrices. We can do it in either PROC IML, DATA step array or PROC FCMP. For such attempts PROC IML is really powerful but needs extra license. DATA step array would require very high data manipulation skills, since it is not designed for matrix operations. PROC FCMP, a part of SAS/BASE, seems like a portable solution for SAS 9.1 or later. To test this method, I simulated a two-asset portfolio with 100k records, and under a 1000-obs long rolling window, eventually ran 99,001 regressions. The time cost was just 10 seconds on an old laptop. Overall, the speed is quite satisfying.


/* 1 -- Simulate a fund and russell2000 index */
data simuds;
_beta0 = 0.2;
_beta1 = 0.6;
_mse = 0.8;
format day date9.;
do day = today() - 20000 to today();
russell2000 = rannor(1234);
myfund = _beta0 + _beta1*russell2000 + _mse*rannor(3421);
output;
end;
drop _:;
run;

proc print data = simuds(obs = 100);
run;

/* 2 -- Decide length of rolling window */
%macro rollreg(data = , wlength = , benchmark = , fund = , out = , rfree = );
data _1;
set &data nobs = nobs;
&benchmark = &benchmark - &rfree;
&fund = &fund - &rfree;
call symput('nloop', nobs - &wlength + 1);
call symput('nobs', nobs);
run;
%put &nloop &nobs;
/* 3 -- Manipulate matrices */
proc fcmp;
/* Allocate spaces for matrices */
array input[&nobs, 2] / nosym;
array y[&wlength] / nosym;
array ytrans[1, &wlength] / nosym;
array xtrans[2, &wlength] / nosym;
array x[&wlength, 2] / nosym;
array sscp[2, 2] / nosym;
array sscpinv[2, 2] / nosym;
array beta[2] / nosym;
array result[&nloop, 4] / nosym;
array beta_xtrans_y[1] / nosym;
array xtrans_y[2, 1] / nosym;
array ytrans_y[1] / nosym;

/* Input simulation dataset */
rc1 = read_array("_1", input, "&benchmark", "&fund");

/* Calculate OLS regression coefficients and r square */
do j = 1 to &nloop;
ytotal = 0;
do i = 1 to &wlength;
xtrans[2, i] = input[i+j-1, 1];
xtrans[1, i] = 1;
y[i] = input[i+j-1, 2];
ytotal = ytotal + y[i];
end;
call transpose(y, ytrans);
call mult(ytrans, y, ytrans_y);
call transpose(xtrans, x);
call mult(xtrans, x, sscp);
call inv(sscp, sscpinv);
call mult(xtrans, y, xtrans_y);
call mult(sscpinv, xtrans_y, beta);
call mult(beta, xtrans_y, beta_xtrans_y);
nymeansquare = ytotal**2 / &wlength;
result[j, 1] = beta[1];
result[j, 2] = beta[2];
result[j, 3] = (beta_xtrans_y[1]-nymeansquare) / (ytrans_y[1]-nymeansquare);
result[j, 4] = j;
end;

/* Output resulting matrix as dataset */
rc2 = write_array("&out", result, 'beta0', 'beta1', 'rsquare', 'day');
if rc1 + rc2 > 0 then put 'ERROR: I/O error';
else put 'NOTE: I/O was successful';
quit;
%mend;

%rollreg(data = simuds, wlength = 50, benchmark = russell2000, fund = myfund, out = result, rfree = 0.005);

ods graphics on / noborder;
/* 4 -- Visualize result */
proc sgplot data = result;
needle x = day y = beta1;
yaxis label = 'Systematic Risk';
run;

proc sgplot data = result;
needle x = day y = beta0;
yaxis label = "Jensen's alpha";
run;

proc sgplot data = result;
needle x = day y = rsquare;
yaxis label = "Coefficient of Determination";
run;