backup: January 2012

Monday, January 23, 2012

A test for memory management of SAS/IML

Programming always involves the considerations for the efficiency and the memory usage. For efficient programming in SAS/IML, my shortcut is to look at the tip sheet from Rick Wicklin and search ways to simplify the codes. As for the memory management mechanism of SAS/IML, I only found one page of SAS/IML 9.2 User’s Guide on the Internet.

To see the performance of SAS/IML ‘s memory management, I designed a simple test, since the SHOW SPACE statement would indicate the memory details of SAS/IML. A simulated 200 rows * 300 columns matrix occupies about 400k memory. I just requested 1MB memory by specifying the WORKSIZE at the beginning, which means that three matrices of such size would blow the tiny work space away. First I assigned multiple references toward this matrix, and then changed one value in this matrix. Finally I cleared all of the objects in the memory and generated a matrix with the same size. The memory change is shown in the plot above.


/* Only request 1MB memory for this test*/
proc iml worksize=1048576;
    /* 0 -- State 0*/
     show space;
    /* 1 -- State 1 (1 reference on 1 object)*/
        x = j(200, 300, 0);
        call randgen(x, "Normal");
     show space;
    /* 2 -- State 2 (2 references on 1 object)*/
       y = x;
    show space;
    /* 3 -- State 3 (3 references on 1 object)*/
       z = x;
    show space;
    /* 4 -- State 4 (2 refernces on 1 object and 2 refernces on another object)*/
       x[1, 1] = 5;
       w = x;
    show space;
    /* 5 -- State 5 (0 references on 2 objects) */
       free x y z w;
    show space;
    /* 6 -- State 1 again (1 reference on 1 object)*/
       x = j(200, 300, 0);
       call randgen(x, "Normal");
    show space;
quit;

Conclusions:
1. Interestingly, an object with two references costs more memory than the same object with three references. Apparently two identical copies of the same matrix exist at State 2, while three references all point to a single matrix at State 3.

SAS/IML seems to have a unique memory management system. The memory manager of IML first checks the size available of the workspace (memory allocated to IML). If the memory usage doesn’t pass the alarming line, it will generously make copies. Otherwise, it will become very aggressive and start to compress the memory frenetically.

2. Two similar matrices stay in the memory at State 4, although the difference between them is just one cell. Each of the matrices has 2 references.

SAS/IML follows the rule of copy-on-change. Once values in a matrix are changed, the memory manager will make a copy for the changed object.

3. Running the FREE Statement doesn't immediate clear the memory at State 5.

SAS/IML’s FREE statement only kills the references. The unreferenced objects will be discarded during the ensuing operations.

Thursday, January 19, 2012

6 ways to count odd numbers in SAS/IML

SAS/IML has a number of vector-wise subscripts/operators/functions available, which can make many things easy. A cheat sheet about them can be found at Rick Wicklin’s blog.

To try out those wonderful features( and their combinations?), I designed a test to use them for the total of the odd numbers in a random numeric sequence. A typical solution is always placing a cumulative counter in a looping structure for many programming languages. Therefore I just used such basic DO loop in SAS/IML as a benchmark.

At the beginning, the MOD function computes the modulo. Then the simplest ways is the SUM function to aggregate all odd number. Subscript, like [, +], serves the same purpose. The method of the CHOOSE + SUM functions quite assembles the DO loop and is the generalized form of the SUM only method(also brings overhead on resources). The LOC function can subset or index a vector and is combined with two other functions.

Observations:
1. All vector-wise methods beat the DO loop, especially with big dataset. The simpler the method is, the faster the result is.
2. The robust CHOOSE function. For binary conditions, the CHOOSE function can replace the if-else-then statements plus a DO loop and is much more efficient. Rick Wicklin has an article about this function.
3. The LOC function is very handy and plays a role like the WHERE statement in SAS’s DATA step or the which()/subset() functions in R.
4. The SUM function seems slightly faster than the subscript [, +] for a vector.


proc iml;
   a = t(do(1e6, 2e7, 1e6));
   timer = j(nrow(a), 6);
   do p = 1 to nrow(a);
      n = a[p];
   /* Simulate a numeric sequence */
      x = ceil(ranuni(1:n)*100000); 
   
      /* 1 -- SUM function*/
      t0 = time(); 
         r1 = sum(mod(x, 2)); 
      timer[p, 1] = time() - t0;
   
      /* 2 -- Subscript + */          
      t0 = time();
         r2 = mod(x, 2)[ , +];
      timer[p, 2] = time() - t0;
   
      /* 3 -- SUM + CHOOSE functions*/  
      t0 = time(); 
         r3 = sum(choose(mod(x, 2), 1, 0)); 
      timer[p, 3] = time() - t0;
   
      /* 4 -- NCOL + LOC functions */        
      t0 = time();
         r4 = ncol(loc(mod(x, 2) = 1));
      timer[p, 4] = time() - t0;
   
      /* 5 -- DO loop */  
      t0 = time();
         r5 = 0;
         do i = 1 to ncol(x);
            if mod(x[i], 2) = 1 then r5 = r5 + 1;
         end;
      timer[p, 5] = time() - t0;
   
      /* 6 -- COUNTMISS + LOC functions */        
      t0 = time();
         x[loc(mod(x, 2) = 1)] = .;
         r6 = countmiss(x);
      timer[p, 6] = time() - t0;
   
      /* Validate all results */
      print r1 r2 r3 r4 r5 r6;
   end;
   t =  a||timer;
   create _1 from t;
      append from t;
   close _1;
quit;

data _2;
   set _1;
   length test $100.;
   label col1 = "Number of observations" 
      time = "Time by seconds to count odd numbers";
   test = "SUM function"; time = col2; output;
   test = "Subscript + "; time = col3; output;
   test = "SUM + CHOOSE functions"; time = col4; output;
   test = "NCOL + LOC functions"; time = col5; output;
   test = "DO loop"; time = col6; output;
   test = "COUNTMISS + LOC functions"; time = col7; output;
   keep test time col1;
run;

proc sgplot data = _2;
   series x = col1 y = time / curvelabel group = test;
   yaxis grid;
run;

Thursday, January 12, 2012

Random seeds

A footnote toward Rick Wilkin’s recent article “How to Lie with a Simulation”. (Sit in front of a laptop w/o SAS; have to port all SAS/IML codes into R)

Generated 10 seeds randomly to run Stochastic simulation of Buffon's needle experiment by Rick's method. Hardly converge any of them ….


# Replicate Rick Wicklin's SAS/IML codes for Buffon's needle experiment 
simupi <- function(N, seed) {
  set.seed(seed)
  z <- matrix(runif(N*2, 0, 1), N, 2)
  theta <- pi*z[, 1]
  y <- z[, 2] / 2
  P <- sum(y < sin(theta)/2) / N
  piEst <- 2/P
  Trials <- 1:N
  Hits <- (y < sin(theta)/2)
  Pr <- cumsum(Hits)/Trials
  Est <- 2/Pr
  cbind(Est, Trials, seed)
}

# Generated 10 seeds randomly
seed_vector <- floor(runif(1:10)*10000)

# Each simulation with 50000 iterations
N <- 50000

# Run these 10 simulations
rt <- list()
for (i in 1:length(seed_vector)) {
  rt[[i]] <- simupi(N, seed_vector[i])
}
results <- as.data.frame(do.call("rbind", rt))
results$seed <- as.factor(results$seed)

# Visualize individual simulation results
library(ggplot2)
p <- qplot(x = Trials, y = Est, data = results, geom = "line", 
           color = seed, ylim = c(2.9, 3.5))
p + geom_line(aes(x = Trials, y = pi), color = "Black") 
ggsave("c:/plot1.png")

Averaging all results of the 10 simulations out. Then the curve converges easily. The application of this Monte Carlo simulation in Buffon's needle experiment is explained here by Rick Wicklin.


# Visuazlie the average result
rtmean <- aggregate(Est ~ Trials, data = results, mean)
p <- qplot(x = Trials, y = Est, data = rtmean, geom = "line")
p + geom_line(aes(x = Trials, y = pi), color = "Red")
ggsave("c:/plot2.png")

Tuesday, January 10, 2012

Benchmarking of the CUSUM function in SAS/IML

Cumulative sums can be obtained in SAS’s DATA step by the RETAIN statement. As the codes below, a new variable of the cumulative values will be returned by DATA step’s implicit DO loop.


data one;
   do i = 1 to 1e6;
      z = ranuni(0);
      output;
   end;
   drop i;
run;

data two;
   set one;
   retain y 0;
   y + z;
run;

The same logic can be realized by an explicit DO loop in SAS/IML. However, SAS/IML has a function CUSUM which is specially made for cumulative sums. To compare the efficiency of the two methods, I tried an experiment: calculating the cumulative sums for a random vector with incremental sizes from 1 million to 20 million by 1 million. The result shows that the CUSUM function is always 100 times faster than a raw DO loop. With the increase of the vector size, the gap gets wider. When the number of elements in the vector reaches 20 million, a Do loop would ask almost 40 seconds, while the CUSUM function requires only 0.35 seconds. Therefore, the vectorized CUSUM function is going to be a good candidate to do such jobs in SAS/IML.


proc iml;
   a = t(do(1e6, 2e7, 1e6));
   timer =  j(nrow(a), 2);
   do p = 1 to nrow(a);
      n = a[p];
      z = t(ranuni(1:n));
      
      t0 = time(); 
          x = cusum(z);
      timer[p, 1] = time() - t0;

      y = j(n, 1, .);
      t0 = time(); 
         do i = 1 to n;
            if i = 1 then y[i] = z[i];
            else y[i] = y[i-1] + z[i];
         end;
      timer[p, 2] = time() - t0;
   end;

   t =  a||timer;
   create _1 from t;
      append from t;
   close _1;
quit;

data _2;
   set _1;
   length test $100.;
   label col1 = "Number of observations" 
      time = "Time by seconds for cumulative summation";
   test = "The CUSUM function"; time = col2; output;
   test = "DO loop"; time = col3; output;
   keep test time col1;
run;

proc sgplot data = _2;
   series x = col1 y = time / curvelabel group = test;
   yaxis grid;
run;

P.S.
Thanks to Rick’s correction, I included his improved DO loop. Comparing with my previous DO loop with a conditional statement, this new DO loop is obviously more efficient. The conclusion is still the same: the CUSUM function outperforms the DO loops in this competition. And with more data involved, the contrast gets even more striking.


proc iml;
   a = t(do(1e6, 2e7, 1e6));
   timer =  j(nrow(a), 3);
   do p = 1 to nrow(a);
      n = a[p];
      z = t(ranuni(1:n));
      /* 1 -- The CUSUM function */
      t0 = time(); 
          x = cusum(z);
      timer[p, 1] = time() - t0;
      /* 2 -- Bad DO loop*/
      y = j(n, 1, .);
      t0 = time(); 
         do i = 1 to n;
            if i = 1 then y[i] = z[i];
            else y[i] = y[i-1] + z[i];
         end;
      timer[p, 2] = time() - t0;
      /* 3 -- Good DO loop */
      w = z;
      t0 = time(); 
         do i = 2 to n;
            w[i] = w[i-1] + z[i];
         end;
      timer[p, 3] = time() - t0;
   end;

   t =  a||timer;
   create _1 from t;
      append from t;
   close _1;
quit;

data _2;
   set _1;
   length test $100.;
   label col1 = "Number of observations" 
      time = "Time by seconds for cumulative summation";
   test = "The CUSUM function"; time = col2; output;
   test = "bad DO loop"; time = col3; output;
   test = "good DO loop"; time = col4; output;
   keep test time col1;
run;

proc sgplot data = _2;
   series x = col1 y = time / curvelabel group = test;
   yaxis grid;
run;

Saturday, January 7, 2012

DO loop vs. vectorization in SAS/IML

Vectorization is an important skill for many matrix languages. From Rick Wiklin’s book about SAS/IML and his recent cheat sheet, I found a few new vector-wise functions since SAS 9.22. To compare the computation efficiency between the traditional do loop style and the vectorization style, I designed a simple test in SAS/IML: square a number sequence(from 1 to 10000) and calculate the time used.

Two modules were written according to these two coding styles. Each module was repeated 100 times, and system time consumed was recorded by SAS/IML’s time() function.


proc iml;
   start module1; * Build the first module;
      result1 = j(10000, 1, 1); * Preallocate memory to the testing vector;
      do i = 1 to 10000;  * Use a do-loop to square the sequence;
         result1[i] = i**2; 
      end;
      store result1; * Return the resulting object;
   finish;   
   t1 = j(100, 1, 1); * Run the first test;
   do m = 1 to 100;
      t0 = time(); * Set a timer;
         call module1;
      t1[m] =  time() - t0;
   end;
   store t1;
quit;

proc iml;
   start module2; * Build the second module;
      result2 = t(1:10000)##2; * Vectorise the sequence;
      store result2; * Return the resulting object;
   finish;   
   t2 = j(100, 1, 1); * Run the second test;
   do m = 1 to 100;
      t0 = time(); * Set a timer;
         call module2;
      t2[m] =  time() - t0;
   end;
   store t2;
quit;

proc iml;
   load result1 result2; * Validate the results;
   print result1 result2;
quit;

Then the results were released to Base SAS and visualized by a box plot with the SG procedures. In this experiment, the winner is the vectorizing method: vectorization seems much faster than do loop in SAS/IML. Therefore, my conclusions are: (1) avoid the do loop if possible; (2)use those vector-wise functions/operators in SAS/IML; (3) always test the speed of modules/functions by SAS/IML’s time() function.


proc iml;
   load t1 t2;
   t = t1||t2;
   create _1 from t;
      append from t;
   close _1;
   print t;
quit;

data _2;
   set _1;
   length test $25.;
   test = "do_loop"; time = col1; output;
   test = "vectorization"; time = col2; output;
   keep test time;
run;

proc sgplot data = _2;
   vbox time / category = test;
   yaxis grid;
run;