SAS/IML provides many vector-wise subscripts, operators, and functions that make common tasks easy. Rick Wicklin’s blog has a cheat sheet for them.
To try out these features and their combinations, I designed a test that uses them to count the odd numbers in a random numeric sequence. In many programming languages the typical solution is a counter incremented inside a loop, so I used a basic DO loop in SAS/IML as the benchmark.
Every method starts with the MOD function, which computes the remainder modulo 2 (1 for an odd number, 0 for an even one). The simplest approach is then to add up the remainders with the SUM function. A subscript reduction operator, such as [, +], serves the same purpose. The CHOOSE + SUM method closely resembles the DO loop and is a generalized form of the SUM-only method, at the cost of extra overhead. The LOC function returns the positions of the elements that satisfy a condition, so it can subset or index a vector; here it is combined with two other functions, NCOL and COUNTMISS.
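To make the building blocks concrete, here is a minimal sketch on a tiny hand-made vector (the values and the variable names m, idx, nOdd and odds are my own, chosen only for illustration). It shows what MOD, SUM and LOC each return before they are combined in the benchmark.

```sas
proc iml;
   x = {3 8 5 10 7};     /* small illustrative row vector (assumed data) */
   m = mod(x, 2);        /* remainders: 1 marks an odd value -> {1 0 1 0 1} */
   nOdd = sum(m);        /* adding the remainders counts the odd values -> 3 */
   idx = loc(m = 1);     /* positions of the odd values -> {1 3 5} */
   odds = x[idx];        /* the odd values themselves -> {3 5 7} */
   print m, nOdd, idx, odds;
quit;
```

Because every remainder is either 0 or 1 for positive integers, summing the remainders and counting the positions returned by LOC give the same answer.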
Observations:
1. All vector-wise methods beat the DO loop, especially on large vectors. The simpler the method, the faster it runs.
2. The CHOOSE function is robust. For binary conditions, it can replace IF-THEN/ELSE statements inside a DO loop and is much more efficient. Rick Wicklin has an article about this function.
3. The LOC function is very handy and plays a role like the WHERE statement in SAS’s DATA step or the which()/subset() functions in R.
4. The SUM function seems slightly faster than the subscript [, +] for a vector.
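Observations 2 and 3 can be sketched on a small vector. The example below is only an illustration under my own assumptions (the data, the labels 1 and -1, and the names flagLoop, flagVec and oddValues are hypothetical): it flags each element with a loop and with CHOOSE, then uses LOC as a WHERE-like filter.

```sas
proc iml;
   x = {3 8 5 10 7};                    /* small illustrative row vector */

   /* Loop version: test and flag one element at a time */
   flagLoop = j(1, ncol(x), 0);
   do i = 1 to ncol(x);
      if mod(x[i], 2) = 1 then flagLoop[i] = 1;
      else flagLoop[i] = -1;
   end;

   /* Vectorized version: CHOOSE applies the same rule elementwise */
   flagVec = choose(mod(x, 2) = 1, 1, -1);

   /* LOC as a filter: keep only the elements that satisfy the condition */
   oddValues = x[loc(mod(x, 2) = 1)];

   print flagLoop, flagVec, oddValues;
quit;
```

One caveat with the LOC pattern: if no element satisfies the condition, LOC returns an empty matrix, and using it as a subscript raises an error, so production code would normally check ncol(idx) > 0 first.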
proc iml;
   /* Vector lengths to test: 1e6 to 2e7 in steps of 1e6 */
   a = t(do(1e6, 2e7, 1e6));
   /* One row per vector length, one column per method */
   timer = j(nrow(a), 6);
   do p = 1 to nrow(a);
      n = a[p];
      /* Simulate a numeric sequence */
      x = ceil(ranuni(1:n)*100000);
      /* 1 -- SUM function */
      t0 = time();
      r1 = sum(mod(x, 2));
      timer[p, 1] = time() - t0;
      /* 2 -- Subscript + */
      t0 = time();
      r2 = mod(x, 2)[ , +];
      timer[p, 2] = time() - t0;
      /* 3 -- SUM + CHOOSE functions */
      t0 = time();
      r3 = sum(choose(mod(x, 2), 1, 0));
      timer[p, 3] = time() - t0;
      /* 4 -- NCOL + LOC functions */
      t0 = time();
      r4 = ncol(loc(mod(x, 2) = 1));
      timer[p, 4] = time() - t0;
      /* 5 -- DO loop */
      t0 = time();
      r5 = 0;
      do i = 1 to ncol(x);
         if mod(x[i], 2) = 1 then r5 = r5 + 1;
      end;
      timer[p, 5] = time() - t0;
      /* 6 -- COUNTMISS + LOC functions (overwrites x, so it runs last) */
      t0 = time();
      x[loc(mod(x, 2) = 1)] = .;
      r6 = countmiss(x);
      timer[p, 6] = time() - t0;
      /* Validate all results */
      print r1 r2 r3 r4 r5 r6;
   end;
   /* Save the vector lengths and timings to a data set */
   t = a||timer;
   create _1 from t;
   append from t;
   close _1;
quit;
/* Reshape the timings from wide to long: one row per length and method */
data _2;
   set _1;
   length test $100.;
   label col1 = "Number of observations"
         time = "Time in seconds to count odd numbers";
   test = "SUM function"; time = col2; output;
   test = "Subscript + "; time = col3; output;
   test = "SUM + CHOOSE functions"; time = col4; output;
   test = "NCOL + LOC functions"; time = col5; output;
   test = "DO loop"; time = col6; output;
   test = "COUNTMISS + LOC functions"; time = col7; output;
   keep test time col1;
run;
/* Plot one timing curve per method */
proc sgplot data = _2;
   series x = col1 y = time / curvelabel group = test;
   yaxis grid;
run;
This is a good way to build intuition for efficient programming. Well done.
One observation I have is that the following vectors are always equal:
m = mod(x, 2);
c = choose(mod(x, 2), 1, 0);
Consequently, the CHOOSE call only repeats what MOD has already computed, so the SUM + CHOOSE method is always slower than the plain SUM method.
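The equality is easy to check directly. The snippet below is a small sketch, assuming positive integer data like the sequence simulated in the post; the seed, the vector length and the name allEqual are arbitrary choices of mine.

```sas
proc iml;
   call randseed(123);             /* arbitrary seed, for reproducibility */
   x = j(1, 1000);                 /* allocate a 1 x 1000 row vector */
   call randgen(x, "Uniform");     /* fill it with uniform random numbers */
   x = ceil(100000 * x);           /* positive integers, as in the post */

   m = mod(x, 2);                  /* remainders: 0 or 1 */
   c = choose(mod(x, 2), 1, 0);    /* 1 where the remainder is nonzero, else 0 */

   allEqual = all(m = c);          /* 1 if every element matches */
   print allEqual;
quit;
```

Since the remainders are already 0 or 1 for such data, CHOOSE maps them to themselves, which is why it adds time without changing the result.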