Tuesday, October 25, 2011

NCAA football and computer rankings

I am a big fan of NCAA football. I found that in the past weeks the cold-blooded computer rankings are more accurate than the poll rankings(BCS, Harris Poll and USA Today). And they are pretty good in predicting the game results, such as the fall of Oklahoma last week.

Data and plotting
Those ranking data are available on ESPN’s website (and they are well structured data and easy to grab). I subtracted the 6 computer rankings by the overall ranking and drew those differences on a scatter plot.
-Alabama seems to have more chance to take LSU’s place.
-Although Michigan State beat Wisconsin last week, the computers still don’t favor it.
-Auburn looks very promising.
-Oklahoma State is highly possible to become #2.
One complaint about the computer ranking
I don’t like the averaging method of the 6 computer rankings used by BCS. Transformation and factor analysis may make full use of the information - SAS’s user guide provided a detailed solution by PROC PRINQUAL and PROC FACTOR.

data week9;
input @1 RK: 20. @5 TEAM: $40. AVG_bcs: PVS_bcs: $2. RK_hp: $2. PTS_hp pct_hp RK_usa: $2.
PTS_usa pct_usa AVG_computer AH RB CM KM JS PW;
cards;
/* COPY AND PASTE DATA FROM http://espn.go.com/college-football/bcs
*/
;;;
run;

data _tmp01;
set week9;
array rank[6] ah--pw;
do i = 1 to 6;
if rank[i]= 0 then rank[i] = 25;
rank[i] = rk - rank[i];
drop i;
end;
run;
proc sort data=_tmp01;
by team;
run;
proc transpose data=_tmp01 out=_tmp02;
var ah--pw;
by team;
run;

proc template;
define Style HeatMapStyle;
parent = styles.htmlblue;
style GraphFonts from GraphFonts /
'GraphLabelFont' = (", ",6pt)
'GraphValueFont' = (", ",6pt)
'GraphDataFont' = (", ",6pt);
end;
run;
proc template;
define statgraph HeatMap.Grid;
begingraph;
layout overlay / border=true xaxisopts=(label='TEAM') yaxisopts=(label='COMPUTER ALGORITHM');
scatterplot x=team y=_name_ / markercolorgradient=col1 markerattrs=(symbol=squarefilled size=32)
colormodel=threecolorramp name='s';
continuouslegend 's' / orient=vertical location=outside valign=center halign=right;
endlayout;
endgraph;
end;
run;
ods html style=HeatMapStyle image_dpi=300 ;
proc sgrender data=_tmp02 template=HeatMap.grid;
run;

Thursday, October 20, 2011

Rick Wicklin’s 195th blog post

Today I ran a SAS routine to check the KPIs for a few websites I am interested in. I accidentally found the total number of posts on Rick Wicklin’s blog is going to approach 200 pretty soon. I followed his blog since its creation. It is an amazing number in a little more than one year. Rick is a unique blogger: he is a statistician who does programming; he is a programmer who plots data; he is a data analyst who is a good writer. As for me, it’s meaningful to summarize what I have learned from his blog.
Data extracted from The Do Loop
SAS official blogs have been restructured this summer. Since I can’t find the previous XML button on the website, I rewrote a program to directly extract HTML data to drive the KPI. Jiangtang Hu also created a program to extract data from The Do Loop, and mentioned that Rick is an incredibly productive writer.

%macro extract(page = );
options mlogic mprint;
%do index = 1 %to &page;
filename raw url "http://blogs.sas.com/content/iml/page/&index/";
data _tmp01;
infile raw lrecl= 550 pad ;
input record $550. ;
if find(record, 'id="post') gt 0 or find(record, 'class="post') gt 0;
run;
data _tmp02;
set _tmp01;
_n + 1;
_j = int((_n+2) / 3);
run;
proc transpose data=_tmp02 out=_tmp03;
by _j;
var record;
run;
data _&index;
set _tmp03;
array out[3] $100. title time pageview;
array in[3] col1-col3;
do i = 1 to 3;
if i = 1 then do; _str1 = 'rel="bookmark">'; _str2 = "</a></"; end;
if i = 2 then do; _str1 = '+0000">'; _str2 = '</abbr>'; end;
if i = 3 then do; _str1 = '="postviews">'; _str2 = "</span>"; end;
_len = length(compress(_str1));
_start = find(in[i], compress(_str1)) + _len ;
_end = find(in[i], _str2, _start);
out[i] = substr(in[i] , _start , _end - _start);
end;
drop _: col: i;
run;
%end;
data out;
set %do n = 1 %to &page;
_&n
%end;;
run;
proc datasets nolist;
delete _:;
quit;
%mend;
%extract(page = 20);
data out1;
set out nobs = nobs;
j + 1;
n = nobs - j + 1;
length level $20.;
label pageview1 = 'PAGEVIEW' time1 = 'TIME' n = 'TOTAL POSTS';
pageview1 = input(pageview, 5.);
_month = scan(time, 1);
_date = scan(time, 2);
_year = scan(time, 3);
time1 = input(cats(_date, substr(_month, 1, 3), _year), date9. );
weekday = weekday(time1);
drop _:;
format time1 date9.;
run;

ods html style = htmlbluecml;
proc sql noprint;
select count(*), sum(pageview1) into: nopost, :noview
from out1
;quit;
proc gkpi mode=basic;
dial actual = &nopost bounds = (0 100 200 300 400) /
target=200 nolowbound
afont=(f="Garamond" height=.6cm)
bfont=(f="Garamond" height=.7cm) ;
proc gkpi mode=basic;
dial actual = &noview bounds = (0 2e4 4e4 6e4 8e4) /
afont=(f="Garamond" height=.6cm)
bfont=(f="Garamond" height=.7cm) ;
quit;
What I learned
I accumulated all the 195 titles, replaced/removed some words and processed them with Wordle. As I expected, Rick’s blog is mainly about ‘Matrix’, ‘Statistics’ and ‘Data’. It is interesting to learn how to create ‘Function’ in SAS/IML, which involves a lot of programming skills. I also enjoyed his topics about ‘Simulating’ and ‘Computing’ with ‘Random’ numbers. He also has exciting articles about how to deal with ‘Missing’ values and ‘Curve’.


data word_remove;
input word : $15. @@;
cards;
sas iml using use creating create proc blog vesus
;;;

proc sql noprint;
select quote(upcase(compress(word))) into :wordlist separated by ','
from word_remove
;quit;

data _null_;
set out(keep=title);
title =tranwrd(upcase(title), 'MATRICES', 'MATRIX');
title =tranwrd(upcase(title), 'FUNCTIONS', 'FUNCTION');
title =tranwrd(upcase(title), 'STATISTICAL', 'STATISTICS');
length i $8.;
do i = &wordlist;
title =tranwrd(upcase(title), compress(i), ' ');
end;
file 'c:\tmp\output1.txt';
put title;
run;
When the number reaches 200
Except the holidays (those gaps in the finger plot above), Rick keeps a constant rate in writing articles (approximately 3 posts a week).

No double the OLS regression gives a straight line. It seems that the total number will hit the 200 target pretty soon: next next week I believe.

proc sgplot data=out1;
needle x = time1 y = n;
yaxis grid max = 300;
run;

proc sgplot data = out1;
reg x =time1 y = n;
refline 200/ axis=y ;
yaxis max = 300;
run;


What a SAS user likes to know
From my experience, clicks in a web browser are mostly originated form search engines, while a regular reader would like to use feeds instead. The page views recorded on the website of The Do Loop can reflect what SAS users try to find. Rick follows his pattern -- introductory tips on Monday, intermediate techniques for Wednesday, and topics for experienced programmers Friday. If we separate the page view trends at the three levels, we can see that the intermediate and advanced posts attract more page views than the basic ones.

data out2;
set out1;
if weekday = 2 then level = '1-Basic';
else if weekday in (3, 4) then level = '2-Intermediate';
else level = '3-Advanced';
output;
set out1;
level = '4-Overall';
output;
run;

proc sgpanel data = out2;
panelby level / spacing=5 columns = 2 rows = 2 novarname;
series x = time1 y = pageview1;
rowaxis grid; colaxis grid;
run;
Conclusion
I agree with what Rick Wicklin said: blogging helps us to become more aware of what we know and what we don't know. I benefited a lot from his book and his resourceful blog in the past year. Cheers on Rick’s incoming 200th post!

Friday, October 14, 2011

What are those SAS jobs around Cary, NC?


SAS Institute is located in Cary, NC. In this job-scarce economy, an interesting question is: what job opportunities are available for a SAS user around this great company which created SAS, say, in an area of 150-mile radius. Fortunately, I found that the returned values from the omnipotent job search engine, Indeed.com, are highly digestible, although this website doesn’t provide analytics service to general public. To integrate the data from Indeed.com, I designed a macro to extract essential variables from the returned HTML pages. Then I set the time limit for the opening as the past 30 days, ‘SAS’ as keyword to search, and 100 openings to show on each returned page. I extracted 5 such pages by running this macro and eventually I obtained 500 job openings to conduct this experiment.

%macro extract(city =, state = , radius = , page = );
options mlogic mprint;
%do i = 1 %to &page;
%let j = %eval((&i-1)*100);
filename raw url
"http://www.indeed.com/jobs?q=sas%nrstr(&l)=&city+,&state%nrstr(&radius)=&radius%nrstr(&limit)=100%nrstr(&fromage)=30%nrstr(&start)=&j";
data _tmp01;
infile raw lrecl= 500 pad ;
input record $500. ;
if find(record, 'jobmap[') gt 0 and find(record, 'srcname') gt 0;
run;
data _&i;
set _tmp01;
array id[5] $50. indeed_id srcname corpname location title;
length _str $8.;
do _k = 1 to 5;
if _k = 1 then _str = "jk:";
else if _k = 2 then _str = "srcname:";
else if _k = 3 then _str = "cmpesc:";
else if _k = 4 then _str = "loc:";
else _str = "title:";
_len = length(compress(_str));
_start = find(record, compress(_str)) + _len ;
_end = find(record, ",", _start) ;
id[_k] = compress(substr(record , _start , _end - _start), "'");
end;
extract_time = datetime(); format extract_time datetime.;
drop record _:;
run;
%end;
data &city._&state;
set %do n = 1 %to &page;
_&n
%end;;
run;
data &city._&state;
retain obs extract_time corpname location title indeed_id srcname;
set &city._&state;
obs + 1;
run;
proc datasets nolist;
delete _:;
quit;
%mend;
%extract(city = cary, state = nc, radius = 150, page = 5);



Popular job titles
I used Wordle, a text cloud website to summarize all job titles. ‘Analyst’, ‘Developer’ and ‘Programmer’ seem to be quite common titles for a job related to SAS. Obviously many openings ask for experience, since they frequently mentioned ‘Senior’, ‘Manager’, or‘Management’. You can also predict the daily routines of those jobs by ‘Data’, ‘Marketing’, ‘Statistical’, ‘Risk’, ‘Database’, ‘Research’ and ‘Clinical’.

data _null_;
set cary_nc(keep=title);
string =tranwrd(upcase(title), 'SAS', ' ');
string =tranwrd(upcase(string), 'SR', 'SENIOR ');
file 'c:\tmp\output.txt';
put string;
run;
Top 10 job providers
As I expected, SAS is the biggest job provider with 18 openings in the past month, followed by Bank of America (14), Ettain Group(12) and other companies. The top job providers don’t include insurance companies, CRO or pharmaceuticals, which suggests that there are fewer local companies or they may transfer the recruiting task to staffing companies.

proc sql outobs=10;
create table _1 as
select upcase(corpname) as name, count(*) as freq
from cary_nc
where corpname is not missing
group by name
order by freq desc
;quit;

data _2;
set _1;
n + 1;
length category $10.;
if n = 1 then category = 'SAS';
else if n in (2, 6) then category = 'Bank';
else if n in(4, 9, 10) then category = 'Consulting';
else category = 'Staffing';
run;

ods html gpath = 'c:\tmp\' style = harvest;
goptions device=javaimg ftitle="arial/bold" ftext="arial"
htitle=.15in htext=.2in xpixels=600 ypixels=500;
proc gtile data = _2;
tile freq tileby = (name, freq) / colorvar = category;
run;


Location, location, location
Job accumulated in five cities for those talents who have SAS skills: Charlotte, Raleigh, Cary, Durham and Richmond. Except working for SAS Institute, Charlotte and Raleigh are the best cities for a job seeker in Center North Carolina to consider.

proc gchart data= cary_nc;
pie location;
run;


Source of job posts
Nowadays most company websites require the applicants to disclose where they get the information. In this experiment, the top 3 sites are DICE.com, Careebuilder.com and Corp-to-corp.com.

proc sql outobs=10;
create table _3 as
select upcase(srcname) as name, count(*) as freq
from cary_nc
where corpname is not missing
group by name
order by freq desc
;quit;
proc sgplot data = _3;
waterfall category = name response = freq ;
xaxis label= ' '; yaxis label=' ';
run;

Lesson learned from this weekly project
1. PROC GMAP in SAS 9.3 now supports alpha-transparency feature to draw a map -- a significant improvement.
2. Parsing online data doesn’t need to use the awesome Regular Expression syntax, although SAS has a few RE functions. SAS’s character functions are pretty robust.
3. Analyzing the online job information is quite entertaining for me. I believe that mining of those texts deeper would lead to more discoveries of secrets.

Friday, October 7, 2011

Create Nelson-Siegel function for yield curve

U.S. Treasury bonds with maturity ranging from 1 year to 30 years are daily updated on the Treasury’s website. However, some yields, such as from 4 years maturity bond, have to be inferred. The Nelson-Siegel function is probably one of the most important formulas to construct the yield curve. Many people use EXCEL’s Solver to find the nonlinear estimates of the four parameters (a1, a2, a3 and beta) for the Nelson-Siegel function. SAS’s PROC NLIN can also do this job smoothly. I copied today’s yield data and imported into SAS. Then I assigned some initial values to the unknown parameters and solved those parameters. With the PROC FCMP, I was able to build a function to predict the yields at any given year. Finally I plotted the yield curve (I also fitted the real yields with a quadratic regression line). The results showed that the nonlinear Nelson-Siegel function behaves better at longer maturities than OLS regression.


**********************(1) IMPORT STEP******************************************;
data _1;
input ytm : $4. @@;
cards;
1mo 3mo 6mo 1yr 2yr 3yr 5yr 7yr 10yr 20yr 30yr
;;;

data _2;
input yield : 3.2 @@;
cards;
0.01 0.01 0.03 0.09 0.29 0.46 1.01 1.52 2.01 2.71 2.96
;;;

data _3(keep=yield years);
set _1;
set _2;
if find(ytm, 'yr') gt 0;
years = input(compress(ytm, 'yr'),3.);
yield = yield / 100;
run;

**********************(2) SOLVE PARAMETERS************************************;
proc nlin data=_3 method=newton;
parms a1 = 0.05 a2 = 0.05
a3 = 0.05 beta = 1;
model yield=a1+(a2+a3)*(beta/years)*(1-exp(-years/beta))-a3*exp(-years/beta);
ods output parameterestimates = parmds;
run;

data _null_;
set parmds;
call symput(parameter, estimate);
run;

**********************(3) BUILD FUNCTION AND APPLY IT****************************;
proc fcmp outlib = work.func.finance;
function nelson_siegel(years);
return(
&a1 + (&a2+&a3)*(&beta/years)*(1-exp(-years/&beta))
- &a3*exp(-years/&beta)
);
endsub;
run;
options cmplib=work.func;
data _4;
do years = 1 to 30;
predyield = nelson_siegel(years);
output;
end;
run;

data _5;
merge _3(rename=(yield=realyield)) _4(in=a);
by years;
if a;
run;

ods html style = money;
proc sgplot data = _5;
scatter x = years y = predyield;
series x = years y = realyield ;
reg x = years y = realyield / degree = 2;
xaxis grid; yaxis grid label = ' ';
format realyield percent8.2;
run;
ods html style = htmlbluecml;
********************END OF ALL CODING*****************************************;