SVM is a popular statistical learning method for either classification or regression. For classification, a linear classifier, i.e. a hyperplane $f(x) = w^{T}x + b$ with $w$ as the weight vector and $b$ as the bias, labels data into two categories. The geometric margin is defined as $\gamma = \frac{2}{\lVert w\rVert}$. For SVM, the maximum margin approach is equivalent to minimizing $\frac{1}{2}\lVert w\rVert^{2}$. However, with the introduction of slack variables $\xi_i$ and regularization to inhibit complexity, the optimization becomes minimizing $\frac{1}{2}\lVert w\rVert^{2} + C\sum_{i}\xi_i$, where $C$ is the regularization parameter. Eventually the solution for SVM turns out to be a quadratic optimization problem over $w$ and $\xi$ with the constraints $y_i(w^{T}x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.
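Written out in full, this is the standard soft-margin primal problem, stated here for reference:

$$
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{subject to}\quad & y_i\left(w^{T}x_i + b\right) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n
\end{aligned}
$$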
Since the sashelp.class dataset is extremely simple, I attempt to use its variables age and weight to predict sex, which is just for demonstration purposes. According to the plot, the data points are linearly non-separable. Kernel methods have to be applied to map the input data to a high-dimensional space so that they become linearly separable. To harness SVM in SAS, three procedures are commonly used under a SAS Enterprise Miner license: PROC DMDB is used to recode the categorical data and set up the working catalog, PROC SVM is used to build the model, and PROC SVMSCORE is applied to score new data with the fitted model.

proc sgplot data=sashelp.class;
scatter x = weight y = age / group = sex;
run;
proc dmdb batch data=sashelp.class dmdbcat=_cat out=_class;
var weight age;
class sex;
run;
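As an optional sanity check, printing a few rows of the DMDB output dataset confirms what the later steps consume; this is just a quick look, nothing model-specific:

/* peek at the recoded training data produced by PROC DMDB */
proc print data=_class (obs=5);
title 'first rows of the DMDB output dataset';
run;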
Hard margin
If we let C be infinitely large, any constraint violation is penalized without bound, so the slack variables are forced to zero and every constraint must be satisfied exactly. Therefore, the margin is narrowed down.
proc svm data=_class dmdbcat=_cat c=1e11 kernel=linear out=_1;
title 'hard margin';
ods output restab = restab1;
var weight age;
target sex;
run;
The accuracy is 63.16%, since 7 of the 19 observations are misclassified. Overall, the result is below.
| Name | Value |
|---|---|
| Regularization Parameter C | 100000000000 |
| Classification Error (Training) | 7.000000 |
| Geometric Margin | 1.624447E-10 |
| Number of Support Vectors | 17 |
| Estimated VC Dim of Classifier | 3.4494098E24 |
| Number of Kernel Calls | 74 |
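To see where those 7 training errors fall, the observed class can be cross-tabulated against the predicted class in the scored dataset _1. A minimal sketch follows; the name of the predicted-class variable in the OUT= dataset (_P_ below) is an assumption and may differ by release, so verify it with PROC CONTENTS first.

/* confusion matrix for the hard-margin model; the predicted-class */
/* variable name _P_ is assumed - check the OUT= dataset contents  */
proc freq data=_1;
tables sex * _P_ / norow nocol nopercent;
run;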
Soft margin
On the contrary, a small C makes constraint violations cheap to ignore, since the slack variables are barely penalized, which leads to the desired large margin.
proc svm data=_class dmdbcat=_cat kernel=linear out=_2;
title 'soft margin';
ods output restab = restab2;
var weight age;
target sex;
run;
The accuracy, or the misclassification rate, stays the same, since the dataset is so small. In PROC SVM, without an explicit specification, the C value is solved to be nearly zero, and the margin is huge.

| Name | Value |
|---|---|
| Regularization Parameter C | 0.000098161 |
| Classification Error (Training) | 7.000000 |
| Geometric Margin | 158.553426 |
| Number of Support Vectors | 18 |
| Estimated VC Dim of Classifier | 3.850370 |
| Number of Kernel Calls | 76 |
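Both runs above use kernel=linear even though the scatter plot suggests the classes are not linearly separable. A nonlinear kernel is the natural next step; the sketch below assumes that KERNEL=RBF and its K_PAR parameter are supported by the installed PROC SVM release, which is worth verifying in the Enterprise Miner documentation.

/* c=1 and k_par=0.001 are arbitrary illustrative values */
proc svm data=_class dmdbcat=_cat c=1 kernel=RBF k_par=0.001 out=_3;
title 'rbf kernel';
ods output restab = restab3;
var weight age;
target sex;
run;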
Conclusion
- For the SVM procedure, besides the training data, adding a validation dataset through the testdata option on the PROC statement could effectively increase the C parameter and decrease the possibility of overfitting; see the sketch after this list.
- There are a few advantages of SVM over other data mining methods. First, SVM is suitable for high-dimensional data; more importantly, its complexity can be easily controlled by adjusting the regularization parameter C.
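A minimal sketch of that validation setup, assuming a holdout dataset named _valid (a hypothetical name here) that has been prepared through the same PROC DMDB catalog as the training data:

/* _valid is a hypothetical holdout set registered in the same DMDB catalog */
proc svm data=_class dmdbcat=_cat testdata=_valid kernel=linear out=_4;
title 'with validation data';
var weight age;
target sex;
run;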