Thursday, April 26, 2012

10 keywords taken away from SAS Global Forum 2012


1. In-memory
SAS is famous for hitting the hard disk at every operation, a proven strategy to save memory. To speed up the processing of ‘Big Data’, SAS on the server side will aggregate memory across machines, load data into memory and then deal with the data there, which is claimed to be 1,000 times faster than the hard-disk-based operation.

2. Hadoop
InformationWeek reported that Dr. Goodnight, CEO of SAS, loathes Hadoop, the distributed open-source platform. However, this time SAS presented its DI Studio and SAS/ACCESS interface, which now allow data access through Hive and Pig. It looks like a challenge for SAS to run its statistical procedures on HDFS.

3. Web application
SAS’s applications are clearly moving toward the web, like SAS Visual Analytics, which also fits various mobile devices. There is an easy way to distinguish a desktop application from a web application in SAS: the former’s default background color is white and the latter’s is black.

4. High performance procedures
SAS is vigorously developing procedures with HP as a prefix, mostly for the servers. Currently 10-20 procedures in SAS/BASE and SAS/STAT can find their counterparts, such as PROC HPSUMMARY and PROC HPLOGISTIC. Those procedures can also run locally, but won't improve efficiency significantly there.

5. JMP
JMP remains a lean desktop analysis package while SAS evolves toward gigantic enterprise solution platforms. One interesting thing -- you can always find the motion chart (which JMP can do and SAS can’t) and John Sall at the demo area.

6. SAS 12.1?
The next release of SAS is not 9.4 but 12.1. SAS version 9, including 9.1, 9.2 and 9.3, has dwelt for almost a decade. Thus, the version update from 9.3 to 12.1 is quite a great leap forward.
Correction - thanks to Chris Hemedinger, 'the new release numbering applies only to the analytical products (STAT, ETS, and so on)'.

7. Data Step 2?
To support data management along with the high-performance procedures on the servers, a language called DS2 is under development. It is a strongly typed language, more like Java or C++ than the Data Step. However, SAS has a macro which can transform Data Step code to DS2.

Thanks to the corrections by Jason Secosky, who is the development manager for DS2 --
"While DS2 is based on the DATA step, its name is just "DS2" not "DATA step 2". DS2 is statically typed, not strongly typed. Ok, ok, there is no implicit type conversion between some types, like double and timestamp, yet there are functions to explicitly convert these values.
And, there is a PROC that can be used to translate DATA steps generated by SAS Enterprise Miner to DS2. The PROC isn't intended to convert *any* DATA step to DS2."

8. In-database
SAS’s in-database technology now supports more database systems beyond Teradata and Greenplum. To avoid compilation errors, it is better to apply ANSI SQL functions instead of SAS’s own functions. As for me, it is still not very clear how SAS pushes its statistical procedures into the relational database systems.

9. Risk management
SAS’s risk management platform is quite mature and implements the latest procedures like PROC COPULA. It seems that the end users have to own a Bloomberg or other vendor’s license to update the market data.

10. Statistical graph
More of SAS’s procedures and solutions have integrated the layer-based statistical graphics technology to visualize results. Still, SAS’s windowing environment doesn’t support syntax highlighting for the Graph Template Language and the SG procedures: it always shows red fonts warning of syntax errors.

Thursday, April 19, 2012

Stored Processes: SAS's voice on Business Intelligence

Every day I write SAS scripts to extract, transform and load data from various sources, which is a step before the database, and also pull data out to do analysis such as aggregation and regression in SAS, which is a step after the database. According to Norman Nie, a data shop has a four-layer structure: an ETL layer, a database layer, an analytics layer and a BI layer. It seems that recently the database layer and the analytics layer look more and more identical: the relational databases have started to fiercely build their statistics arms, such as SQL Server Analysis Services and Oracle R Enterprise. Tricia and Angela’s new book, The 50 Keys to Learning SAS Stored Processes, discloses SAS’s revenge on the Business Intelligence layer with its stored processes, which are similar to the stored procedures usually carried out by the database systems.

This book has 12 chapters, covering how to create, optimize and implement stored processes. The book is easy to read and contains detailed illustrations with lots of colorful graphics. Following the examples, I can create my own stored processes in SAS, for example, a simple query to count the numbers of a small data set, SASHELP.CLASS, by gender. It is also interesting to compare SAS’s stored processes with the databases’ stored procedures, such as SQL Server’s -- they have the same functionality. Therefore, on a Windows server, a developer will be able to choose either SQL Server or SAS to build web applications.

Now with the stored processes, a SAS programmer may evolve toward a SAS developer. Tricia and Angela’s book will be a good reference for this role transition.

/*********(1) LOAD DATA(SASHELP.CLASS) TO SQL SERVER OR SAS *************/
create table class (name char(8), sex char(1),
age numeric, height numeric, weight numeric );
insert into class values('ALFRED','M',14,69,112.5);
insert into class values('ALICE','F',13,56.5,84);
insert into class values('BARBARA','F',13,65.3,98);
insert into class values('CAROL','F',14,62.8,102.5);
insert into class values('HENRY','M',14,63.5,102.5);
insert into class values('JAMES','M',12,57.3,83);
insert into class values('JANE','F',12,59.8,84.5);
insert into class values('JEFFREY','M',13,62.5,84);
insert into class values('JOHN','M',12,59,99.5);
insert into class values('JOYCE','F',11,51.3,50.5);
insert into class values('JUDY','F',14,64.3,90);
insert into class values('LOUISE','F',12,56.3,77);
insert into class values('MARY','F',15,66.5,112);
insert into class values('PHILIP','M',16,72,150);
insert into class values('ROBERT','M',12,64.8,128);
insert into class values('RONALD','M',15,67,133);
insert into class values('THOMAS','M',11,57.5,85);
insert into class values('WILLIAM','M',15,66.5,112);

/**********************(2) A STORED PROCEDURE IN SQL SERVER **************/
IF OBJECT_ID('CntBySex') IS NOT NULL
drop proc CntBySex;
go

create proc CntBySex @sex char(1) = null, @Cnt int output
as
select @Cnt = COUNT(*)
from class
where sex = ISNULL(@sex, sex)
go

declare @OutCnt int
exec CntBySex @Cnt = @OutCnt output
select @OutCnt as AllCnt

exec CntBySex @sex = 'F', @Cnt = @OutCnt output
select @OutCnt as FemaleCnt

exec CntBySex @sex = 'M', @Cnt = @OutCnt output
select @OutCnt as MaleCnt
go


/*************(3) A STORED PROCESS IN SAS *********************/
%macro query;
proc sql;
select count(*)
from class
%if %length(&sex) > 0 %then %do;
where sex = "&sex"
%end;
;
quit;
%mend query;

%stpbegin;
%query;
%stpend;

Thursday, April 12, 2012

Correlations of three variables

Question
There is an interesting question in statistics --
“There are 3 random variables X, Y and Z. The correlation between X and Y is 0.8 and the
correlation between X and Z is 0.8. What is the maximum and minimum correlation between Y and Z?”

Solutions

1. Geometric illustration
The value of corr(Y, Z) is the cosine of the angle between Y and Z when the variables are viewed as vectors. Since corr(X, Y) = corr(X, Z) = 0.8, both Y and Z make an angle of arccos(0.8) with X. In this particular case, the angle between Y and Z can be zero, which suggests Y and Z are identical and the max value of corr(Y, Z) is 1. The min value of corr(Y, Z) comes from the biggest possible angle between Y and Z, twice arccos(0.8), so the min is cos(2*arccos(0.8)) = 2*(0.8)**2 - 1 = 0.28.

2. Positive semi-definiteness property of the correlation matrix
Due to this property, the determinant of the correlation matrix must be greater than or equal to zero. Thus we are able to construct a quadratic inequality to evaluate the boundaries, which again run from 0.28 to 1.
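Writing the determinant out gives a quadratic in x = corr(Y, Z), the same expression used in the corrdet function below:

```latex
% Determinant condition, with a = corr(X,Y), b = corr(X,Z), x = corr(Y,Z)
\det \begin{pmatrix} 1 & a & b \\ a & 1 & x \\ b & x & 1 \end{pmatrix}
  = 1 - a^2 - b^2 - x^2 + 2abx \ge 0
% The roots of the quadratic give the boundaries:
ab - \sqrt{(1-a^2)(1-b^2)} \;\le\; x \;\le\; ab + \sqrt{(1-a^2)(1-b^2)}
% With a = b = 0.8: 0.64 - 0.36 = 0.28 and 0.64 + 0.36 = 1
```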

proc fcmp outlib=work.funcs.test1;
function corrdet(x, a, b);
return(-x**2 + 2*a*b*x - a**2 -b**2 +1);
endsub;
function solvecorr(ini, a, b);
array solvopts[5] initial abconv relconv
maxiter solvstat (.5 .001 1.0e-6 100);
initial = ini;
x = solve('corrdet', solvopts, 0, ., a, b);
return(x);
endsub;
quit;

options cmplib = work.funcs;
data one;
* Max value;
upper = solvecorr(1, 0.8, 0.8);
upper_check = corrdet(upper,0.8,0.8);
* Min value;
lower = solvecorr(-1, 0.8, 0.8);
lower_check = corrdet(lower,0.8,0.8);
run;




Generalization
We can generalize the question to all possibilities of corr(X, Y) and corr(X, Z). First we need to create two user-defined functions to solve for the maximum and the minimum values. Then we will be able to draw the max values and min values in the same plot. It is very interesting to see that only at four points do the upper surface and the lower surface converge, which are (1, 1, 1), (-1, 1, -1), (1, -1, 1) and (-1, -1, -1).

A lot of other phenomena can be seen from this plot, such as that when corr(X, Y) = corr(X, Z), the max value of corr(Y, Z) is always equal to 1.
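As a quick sanity check independent of SAS, here is a small Python sketch of the closed-form roots of the determinant quadratic (the function name corr_bounds is mine, for illustration):

```python
import math

def corr_bounds(a, b):
    """Bounds of corr(Y, Z) given corr(X, Y) = a and corr(X, Z) = b,
    from the roots of x**2 - 2*a*b*x + (a**2 + b**2 - 1) = 0."""
    disc = (1 - a**2) * (1 - b**2)
    if disc < 0:
        return None  # no valid correlation matrix exists
    r = math.sqrt(disc)
    return a * b - r, a * b + r  # (min, max)

lo, hi = corr_bounds(0.8, 0.8)
print(lo, hi)  # approximately 0.28 and 1.0
```

Note that when a equals b the discriminant term equals 1 - a**2, so the upper bound is a**2 + (1 - a**2) = 1, which matches the observation about the plot above.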


proc fcmp outlib = work.funcs.test2;
function upper(a, b);
* upper bound: a*b + sqrt((1-a**2)*(1-b**2));
x = 4*(a**2)*(b**2) - 4*(a**2+b**2-1);
if x ge 0 then y = 0.5*(sqrt(x) + 2*a*b);
else y = .;
return(y);
endsub;
function lower(a, b);
* lower bound: a*b - sqrt((1-a**2)*(1-b**2));
x = 4*(a**2)*(b**2) - 4*(a**2+b**2-1);
if x ge 0 then y = 0.5*(2*a*b - sqrt(x));
else y = .;
return(y);
endsub;
quit;

data two;
do xy = -.99 to .99 by 0.01;
do xz = -.99 to .99 by 0.01;
upper = upper(xy, xz);
lower = lower(xy, xz);
output;
end;
end;
run;

proc template;
define statgraph surface001;
begingraph;
layout overlay3d / cube = false rotate = 150 tilt = 30
xaxisopts = (label="Correlation between X and Y")
yaxisopts = (label="Correlation between X and Z")
zaxisopts = (label="Boundaries of correlation between Y and Z") ;
surfaceplotparm x = xy y = xz z = upper;
surfaceplotparm x = xy y = xz z = lower;
endlayout;
endgraph;
end;
run;

proc sgrender data = two template = surface001;
run;

Monday, April 2, 2012

The years to get green card for Indian and Chinese

The path toward a green card is especially difficult for Indian and Chinese nationals working in the US as EB2 workers, since they have to wait years to submit an I-485 form after their I-140 petition is approved. Once the cut-off date set by USCIS each month catches up with the priority date (the date of filing the labor certification, which decides one's position in the long line), the applicants holding that priority date will be able to file the I-485 form. However, the cut-off dates are unpredictable for the public, and “it is impossible to accurately estimate how long that may take”.

Situation
The recent cut-off date data can be copied and pasted from Wikipedia. I then adjusted the cut-off dates within the retrogression period. In the past decade, the waiting years range from about 0 to 4.25 years. July 2007 looks like an amnesty for those who had a priority date before it; otherwise people have to wait at least 1.85 years. Applicants from the adjacent 3 or 4 years usually wait in line for the door to open. The door was recently shut in June 2012 and hopefully will open again some day in 2014.
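The dif_year variable in the DATA step below is simply (cut-off date - priority date) / 365. As a Python sketch of that calculation, using one row of the pasted table (the Apr 2012 bulletin, whose cut-off date is May 1, 2010):

```python
from datetime import date

# One row of the cut-off table: bulletin month Apr 2012,
# cut-off (priority) date May 1, 2010
c_time = date(2012, 4, 1)    # first day of the bulletin month
pd_time = date(2010, 5, 1)   # cut-off date printed in that bulletin

dif_day = (c_time - pd_time).days
dif_year = dif_day / 365     # same 365-day approximation as the DATA step
print(dif_day, round(dif_year, 2))
```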

data pd;
input @1 _cmon $3. @5 _cyear $4. @13 _pdmon $3. @17 _pddate : @21 _pdyear $2.;
format c_time pd_time date9.;
c_time = input(cats('01', _cmon, _cyear), date9.);
pd_time = input(cats(_pddate, _pdmon,_pdyear), date9.);
dif_day = c_time - pd_time;
dif_year = dif_day/365;
drop _:;
cards;
May 2012 Aug 15 07
Apr 2012 May 1 10
Mar 2012 May 1 10
Feb 2012 Jan 1 10
Jan 2012 Jan 1 09
Dec 2011 Mar 15 08
Nov 2011 Nov 1 07
Oct 2011 Jul 15 07
Sep 2011 Apr 15 07
Aug 2011 Apr 15 07
Jul 2011 Mar 8 07
Jun 2011 Oct 15 06
May 2011 Jul 1 06
Apr 2011 May 8 06
Mar 2011 May 8 06
Feb 2011 May 8 06
Jan 2011 May 8 06
Dec 2010 May 8 06
Nov 2010 May 8 06
Oct 2010 May 8 06
Sep 2010 May 8 06
Aug 2010 Mar 1 06
Jul 2010 Oct 1 05
Jun 2010 Feb 1 05
May 2010 Feb 1 05
Apr 2010 Feb 1 05
Mar 2010 Feb 1 05
Feb 2010 Jan 22 05
Jan 2010 Jan 22 05
Dec 2009 Jan 22 05
Nov 2009 Jan 22 05
Oct 2009 Jan 22 05
Sep 2009 Jan 8 05
Aug 2009 Oct 1 03
Jul 2009 Jan 1 00
Jun 2009 Jan 1 00
May 2009 Feb 15 04
Apr 2009 Feb 15 04
Mar 2009 Feb 15 04
Feb 2009 Jan 1 04
Jan 2009 Jul 1 03
Dec 2008 Jun 1 03
Nov 2008 Jun 1 03
Oct 2008 Apr 1 03
Sep 2008 Aug 1 06
Aug 2008 Jun 1 06
Jul 2008 Apr 1 04
Jun 2008 Apr 1 04
May 2008 Jan 1 04
Jul 2007 Jul 1 07
Jun 2007 Apr 1 04
May 2007 Jan 8 03
Apr 2007 Jan 8 03
Mar 2007 Jan 8 03
Feb 2007 Jan 8 03
Jan 2007 Jan 8 03
Dec 2006 Jan 8 03
Nov 2006 Jan 8 03
;;;
run;

proc sql;
create table pd1 as
select a.pd_time 'priority date', a.c_time, (select min(c_time) from pd as b
where a.pd_time le b.c_time and a.pd_time le b.pd_time) as i140_submit_date format date9.,
(calculated i140_submit_date - a.pd_time) / 365 as i140_waiting_year
from pd as a
where year(pd_time) gt 2002
order by a.pd_time
;quit;

proc sgplot data = pd1;
series x = pd_time y = i140_submit_date;
xaxis grid;
run;

proc sgplot data = pd1;
series x = pd_time y = i140_waiting_year;
yaxis grid;
run;


Possible change
The cut-off dates depend much on changes in law and policy, which reflect the economic environment. Therefore the US unemployment rate is possibly helpful in predicting the fluctuation of the cut-off dates. First I imported such data, and then transformed the difference between the priority date and the cut-off date into a 6-month moving average. Then the two curves match well. I can even fit the relationship with a simple linear regression. In conclusion, a decrease of one percent in the unemployment rate may shorten the waiting time from I-140 to I-485 by about 1/3 of a year.
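The transform=(movave 6) option of PROC EXPAND used below is a trailing 6-period mean. A minimal Python sketch of that smoothing, with a made-up series rather than the actual data (edge handling at the start of the series may differ slightly from PROC EXPAND's):

```python
def movave(series, k=6):
    """Trailing k-period moving average: the mean of the current value
    and the k-1 preceding ones (fewer at the start of the series)."""
    out = []
    for i in range(len(series)):
        window = series[max(0, i - k + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

print(movave([1, 2, 3, 4, 5, 6, 7, 8], k=6))
```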

filename _infile url "http://research.stlouisfed.org/fred2/data/UNRATE.txt" debug lrecl=100;
data unempl;
infile _infile missover firstobs = 22;
format date date9.;
input @1 date yymmdd10. @13 unemployement_rate 4.1;
run;

proc sql;
create table combine as
select a.dif_year, b.*
from pd as a, unempl as b
where a.c_time = b.date
;quit;

proc expand data = combine out = combine1 method=none;
convert dif_year = ma_dif_year / transform = (movave 6);
run;

proc sgplot data = combine1;
series x = date y = ma_dif_year;
series x = date y = unemployement_rate;
xaxis grid; yaxis grid label = ' ';
run;

proc reg data = combine1;
model ma_dif_year = unemployement_rate;
ods select ParameterEstimates FitPlot;
run;