Tuesday, January 10, 2012

Benchmarking of the CUSUM function in SAS/IML


Cumulative sums can be obtained in SAS’s DATA step by the RETAIN statement. As the codes below, a new variable of the cumulative values will be returned by DATA step’s implicit DO loop.

data one;
do i = 1 to 1e6;
z = ranuni(0);
output;
end;
drop i;
run;

data two;
set one;
retain y 0;
y + z;
run;

The same logic can be realized by an explicit DO loop in SAS/IML. However, SAS/IML has a function CUSUM which is specially made for cumulative sums. To compare the efficiency of the two methods, I tried an experiment: calculating the cumulative sums for a random vector with incremental sizes from 1 million to 20 million by 1 million. The result shows that the CUSUM function is always 100 times faster than a raw DO loop. With the increase of the vector size, the gap gets wider. When the number of elements in the vector reaches 20 million, a Do loop would ask almost 40 seconds, while the CUSUM function requires only 0.35 seconds. Therefore, the vectorized CUSUM function is going to be a good candidate to do such jobs in SAS/IML.

proc iml;
a = t(do(1e6, 2e7, 1e6));
timer = j(nrow(a), 2);
do p = 1 to nrow(a);
n = a[p];
z = t(ranuni(1:n));

t0 = time();
x = cusum(z);
timer[p, 1] = time() - t0;

y = j(n, 1, .);
t0 = time();
do i = 1 to n;
if i = 1 then y[i] = z[i];
else y[i] = y[i-1] + z[i];
end;
timer[p, 2] = time() - t0;
end;

t = a||timer;
create _1 from t;
append from t;
close _1;
quit;

data _2;
set _1;
length test $100.;
label col1 = "Number of observations"
time = "Time by seconds for cumulative summation";
test = "The CUSUM function"; time = col2; output;
test = "DO loop"; time = col3; output;
keep test time col1;
run;

proc sgplot data = _2;
series x = col1 y = time / curvelabel group = test;
yaxis grid;
run;
P.S.
Thanks to Rick’s correction, I included his improved DO loop. Comparing with my previous DO loop with a conditional statement, this new DO loop is obviously more efficient. The conclusion is still the same: the CUSUM function outperforms the DO loops in this competition. And with more data involved, the contrast gets even more striking.


proc iml;
a = t(do(1e6, 2e7, 1e6));
timer = j(nrow(a), 3);
do p = 1 to nrow(a);
n = a[p];
z = t(ranuni(1:n));
/* 1 -- The CUSUM function */
t0 = time();
x = cusum(z);
timer[p, 1] = time() - t0;
/* 2 -- Bad DO loop*/
y = j(n, 1, .);
t0 = time();
do i = 1 to n;
if i = 1 then y[i] = z[i];
else y[i] = y[i-1] + z[i];
end;
timer[p, 2] = time() - t0;
/* 3 -- Good DO loop */
w = z;
t0 = time();
do i = 2 to n;
w[i] = w[i-1] + z[i];
end;
timer[p, 3] = time() - t0;
end;

t = a||timer;
create _1 from t;
append from t;
close _1;
quit;

data _2;
set _1;
length test $100.;
label col1 = "Number of observations"
time = "Time by seconds for cumulative summation";
test = "The CUSUM function"; time = col2; output;
test = "bad DO loop"; time = col3; output;
test = "good DO loop"; time = col4; output;
keep test time col1;
run;

proc sgplot data = _2;
series x = col1 y = time / curvelabel group = test;
yaxis grid;
run;

No comments:

Post a Comment