Thursday, March 31, 2011

Optimize many-to-one mapping by user-defined functions


In many occasions, fast access into a lookup table to find desired value is necessary. In computer science, linked list, associative array, and hash table are widely used to construct the relationship between values and keys. Hash function, like value <-- index = Function(key), is essential to build such a hash table. Improving the hash function’s performance is pretty challenging and rewarding [Ref. 1]. In SAS, macro may be utilized to substitute function. However, macro would be failed in front of some cases, such as f(x1) + g(x2) or f(g(x)). Function or functional programming is still a better choice. With the user-defined function complier, Proc Fcmp, building reliable functions in SAS to map many keys to a single value looks promising.

However, SAS’s hash object is not applicable for coding interactive or reusable hash functions. The concept of hash object is introduced since SAS version 9, and it is ‘the first truly runtime dynamic, memory-resident DATA step structure’ [Ref. 2]. Evidence shows that it is robust, and most importantly, faster than other hard-disk based lookup solutions. And this technology is vigorously growing: SAS’s hash object has more methods, and can even accept duplicate keys [Ref. 3]. The biggest problem for SAS’s hash object is that it is transient and not savable. Hash object has to be declared within a data step to load data. If there is any other query, it has to be loaded again. If with frequently invocation, the loading factor will be formidable. As the result, SAS’s hash object is particularly useful for batch processing of lookup tasks, say, finding many key-value pairs simultaneously. That is probably why SAS programmers like to use hash object in merging datasets.

Thus, mapping many keys to one value by user-defined functions has to rely on other methods. Senior SAS programmers may prefer SAS arrays with help of the POINT option. However, SAS’s array is still transient. SQL and Proc Format are the options available. SQL is interactive and does not rely on any particular data type. Format is a unique data type in SAS created by Proc Format, which is permanent and exchangeable with a common SAS dataset. To test the efficiency of many-to-one mapping by user-defined functions, first a dataset of a million records with 2 keys and 1 value was simulated, and three functions were defined and complied by Proc Fcmp. Then those functions were repeatedly called for 10000 times. For SQL-based function, invoking each time is equivalent to querying the raw dataset once. On my notebook, the total time cost of running SQL-based function is more than 3 minutes, which is intolerable. Adding indexes to both keys of the lookup table will largely decrease the time expense to 1/6 of the original time. In addition, loading the lookup table into the memory, by the SASFILE statement, would improve the efficiency of the function a little further. Since Proc Format only allows one key, the two keys need to be concatenated as a composite key before finalizing the lookup format. Amazingly, calling of Format-based function 10000 times only spends less than 2 seconds. As the result, the Format-based function is the champion of this test. Understandably, the memory consumption is proportional to the functions’ efficiency: faster speed means more memory requirement. Much more than other methods, the Format-based function used 200 MB memories. I guess that the ‘format’ data type may be loaded into memory and stayed there during the processing time, which slashed the loading factor. In conclusion, Proc Format produced ‘format’ is not only an alternative solution to merge large datasets, but also is a reliable foundation to define key-value pairs by many-to-one mapping functions.

References:
1. Hash table on Wikimedia. http://en.wikipedia.org/wiki/Hash_table
2. Elena Muriel. Hashing Performance Time with Hash Tables. SAS Global 2007.
3. Robert Ray and Jason Secosky. Better Hashing in SAS 9.2. SAS Global 2008.

/*******************READ ME****************************************
* --- OPTIMIZE MANY-TO-ONE MAPPING BY USER-DEFINED FUNCTIONS ----
*
* HARDWARE: a notebook with amd64 cpu 2.0g, 3g ram
* SAS: SAS 9.2(TS2M0), pc 64bit
*
* TEST PASSED: 31mar2011
* AUTHOR: hchao8@gmail.com
*
****************END OF READ ME************************************/

************(1) GENERATE TEST DATASET*********************;
******(1.1) SIMULATE A DATASET OF 1M RECORDS WITH 2 KEYS*********;
data raw;
do key1 =1 to 1000;
do key2 = 1 to 1000;
value = ranuni(20100322) + rannor(20100322);
output;
end;
end;
run;

******(1.2) DERIVE INDEXED DATASET*********;
proc sql;
create table indexed as select * from raw;
create index key1 on indexed(key1);
create index key2 on indexed(key2);
quit;

******(1.3) TRANSFORM DATASET INTO FORMAT*********;
****(1.3.1) CAST VALUE FROM NUMBER TO CHARACTER*****;
proc format;
picture key low-high = '0000'(fill = '0');
run;
****(1.3.2) COMBINE 2 KEYS TO 1 COMPOSITE KEY******;
data keycombined;
set raw;
length keystr $8;
keystr = cats(put(key1, key.), put(key2, key.));
drop key1 key2;
run;
****(1.3.3) DERIVE FORMAT DATASET*****;
data fmtds;
set keycombined;
rename keystr = start
value = label;
fmtname = '$myfmt';
type = 'C';
run;
****(1.3.4) INCORPORATE FORMAT DATASET TO BUILD FORMAT *****;
proc format cntlin = fmtds;
run;

**************END OF STEP (1)*****************;

**************(2) CREATE USER-DEFINED FUNCTIONS***********;
*******(2.1) BUILD THE FIRST FUNCTION************;
****(2.1.1) THE EMBEDDED MACRO************;
option mstored sasmstore = sasuser;
%macro myfunc1_macro / store source;
%let key1 = %sysfunc(dequote(&key1));
%let key2 = %sysfunc(dequote(&key2));
proc sql noprint;
select value into: value
from raw
where key1 = &key1
and key2 = &key2
;quit;
%mend;
****(2.1.2) CREATE THE FUNCTION AND OUTPUT***********;
proc fcmp outlib = sasuser.keyvalue.funcs;
function myfunc1(key1, key2);
rc = run_macro('myfunc1_macro', key1, key2, value);
if rc eq 0 then return(value);
else return(.);
endsub;
run;

*******(2.2) BUILD THE SECOND FUNCTION************;
****(2.2.1) THE EMBEDDED MACRO************;
option mstored sasmstore = sasuser;
%macro myfunc2_macro / store source;
%let key1 = %sysfunc(dequote(&key1));
%let key2 = %sysfunc(dequote(&key2));
proc sql noprint;
select value into: value
from indexed
where key1 = &key1
and key2 = &key2
;quit;
%mend;
****(2.2.2) CREATE THE FUNCTION AND OUTPUT***********;
proc fcmp outlib = sasuser.keyvalue.funcs;
function myfunc2(key1, key2);
rc = run_macro('myfunc2_macro', key1, key2, value);
if rc eq 0 then return(value);
else return(.);
endsub;
run;

*******(2.3) BUILD THE THIRD FUNCTION************;
proc fcmp outlib = sasuser.keyvalue.funcs;
function myfunc3(key1, key2);
key = put(cats(put(key1, key.), put(key2, key.)), $8.);
value = put(key, $myfmt.);
return(value);
endsub;
run;

**************END OF STEP (2)*****************;

**************(3) TEST USER-DEFINED FUNCTIONS***********;
****(3.1) CREATE A TEST MACRO TO RUN FUNCTIONS 10000 TIMES******;
%macro test(num);
option cmplib = (sasuser.keyvalue) mstored sasmstore = sasuser;
data test;
do x = 101 to 200;
do y = 301 to 400;
z = myfunc&num(x, y);
output;
end;
end;
run;
%mend;

****(3.2) TEST THE FIRST FUNCTION********;
%test(1);

****(3.3) TEST THE SECOND FUNCTION********;
%test(2);

****(3.4) TEST THE SECOND FUNCTION WITH IN-MEMORY LOOKUP TABLE********;
sasfile indexed open;
%test(2);
sasfile indexed close;

****(3.5) TEST THE THIRD FUNCTION********;
%test(3);

**************END OF STEP (3)*****************;

*************************END OF ALL CODING***************************;

Friday, March 18, 2011

Array 2.0: matrix-friendly array in Proc Fcmp


Array is probably the only number-indexed data type in SAS. Interestingly, SAS and R, the major statistical softwares, use 1 instead of 0 to specify the first element. SAS programmers adopt array mostly for multiple-variable batch-processing. For example, longitudinal summation can be achieved by specifying a one-dimensional array and then adding all array elements together. On the contrary, general-purpose programming languages, like Java, Python, C++,  widely use array or list to store and index data. The programmers who newly switch into SAS often feel confused about the limitation of SAS’ array, which is transient and need output to the parental data step to save result. Since data step does not memorize the array’s original structure, multiple dimensional arrays are almost useless in creating more complex data structures. In SAS, those multiple dimensional arrays are occasionally implemented to reshape data [Ref. 1]. If a matrix, such as a big one with random numbers of 5000 rows by 5000 columns [Example 1], is needed, Other programming languages would use int[][] to declare a matrix. In SAS, 2D array, such as a [5000, 5000] array, may be intuitively chosen to perform this task. However, the result is not desired, and the right way is still to apply a one-dimensional array. 

Array in Proc Fcmp is an upgraded array data type other than data step array. This new version of array allows not only creating matrices but also indexing matrix. In addition, the communication between data step and Proc Fcmp is fast and convenient: this procedure supplies the READ_ARRAY() and WRITE_ARRAY() functions, which transform a dataset to an array and vice versa. For example, to test an interviewee’s knowledge on SAS’s data step array, a typical interview question is to ask her/him to write a 9X9 multiplication table [Example 2]. The expected solution is to place the OUTPUT statement between the inner Do-loop and outer Do-loop. Placing OUTPUT into the inner layer of Do-loops builds an 81*9 monster, while ignoring OUPTUT would only generate the last row of multiplication table. This task is much simpler in Proc Fcmp: just produce a matrix and write it to a dataset. Proc Fcmp is a nice alternative to SAS/IML, a specialized matrices language module in SAS. For instance, a typical SAS/IML kind of job, such as filling missing cells in a matrix (or a dataset) with its diagonal elements, can be fulfilled by arrays in Proc Fcmp [Example 3]. Proc Fcmp is shipped in SAS/BASE, which means that no extra out-of-pocket money is needed for another license. Another concern is the efficiency of SAS/IML module. It is reported that frequently calling of SAS/IML's functions would decrease system speed dramatically[Ref. 3].

The matrix feature of Proc Fcmp’s array benefits other SAS programming areas. For example, Proc Transpose and data step array usually reshape data structure from either long to wide or wide to long. However, in many cases that positions between observation and variable in a dataset have to be exchanged, the two methods may require several try-and-error steps. The TRANSPOSE() subroutine in Proc Fcmp solves such a problem easily [Example 4]. Currently there are 13 functions or subroutines available in Proc Fcmp for matrices operation[Ref. 2]. Some may complain they are still not enough for their particular need. Don’t forget: Proc Fcmp makes user-defined functions and subroutines! For example, to simulate set() function in Python, a deldup_array() function, based on encapsulating Proc SQL in a macro by RUN_MACRO(), can delete duplicate elements in array[Example 5]. Therefore, users of Proc Fcmp’s array can always construct and accumulate their tools to suit their purpose.

References: 1. UCLA Statistics Course. http://www.ats.ucla.edu/stat/sas/library/multidimensional_arrays.htm
2. SAS 9.2 Online Help. http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a003193719.htm
3. Wang, Songfeng; Zhang, Jiajia. Developing User-Defined Functions in SAS®: A Summary and Comparison. SAS Global 2011.

******(1) EXAMPLE 1: CREATE A 5000X5000 RANDOM MATRIX****;
****(1.1) WRONG 2D ARRAY SOLUTION******;
data matrix1;
array v[5000, 5000];
do i = 1 to 5000;
do j = 1 to 5000;
v[i, j] = ranuni(0);
end;
output;
end;
run;

****(1.2) CORRECT 1D ARRAY SOLUTION*****;
data matrix2;
array x[5000];
do i = 1 to 5000;
do j = 1 to 5000;
x[j] = ranuni(0);
end;
output;
end;
run;

******(2) EXAMPLE 2: CREATE A MULTIPLICATION TABLE******;
****(2.1) DATA STEP ARRAY SOLUTION ********;
data mt1;
array a[9] a1-a9;
do row = 1 to 9;
do col= 1 to 9;
if row ge col then a[col]=row*col;
end;
output;
end;
drop row col;
run;

****(2.2) PROC FCMP ARRAY SOLUTION*****;
proc fcmp;
array a[9, 9] / nosymbols;
do row =1 to 9;
do col = 1 to 9;
if row ge col then a[row, col] = row*col;
end;
end;
rc1 = write_array('mt2', a);
quit;

****(3) EXAMPLE 3: FILL MISSING CELL WITH DIAGONAL ELEMENT******;
****(3.0) INPUT RAW DATA****;
data have;
input x1-x4;
datalines;
1 . . .
2 1 . .
3 4 1 .
7 6 5 1
;
run;

****(3.1) PROC FCMP TRANSPOSITION******;
proc fcmp;
array a[4, 4] / nosymbols;
rc1 = read_array('have', a);
do i = 1 to 4;
do j = 1 to 4;
if missing(a[i, j]) = 1 then a[i, j] = a[j, i];
end;
end;
rc2 = write_array('want', a);
quit;

*****(4) EXAMPLE 4: RESHAPE SQUARE-SHAPE DATA********;
****(4.0) INPUT RAW DATA********;
data have1;
input x1-x5;
cards;
1 . 0 1 1
0 1 . 0 0
. . 1 1 1
0 0 0 1 .
. 0 0 1 1
;
run;

****(4.1) PROC TRANSPOSE SOLUTION******;
data trps1;
set have1;
obs = _n_;
run;

proc transpose data = trps1 out = trps2;
by obs;
var x1-x5;
run;

proc sort data = trps2 out = trps3;
by _name_;
run;

proc transpose data = trps3 out = want1_1;
by _name_;
var col1;
run;

****(4.2) PROC FCMP SOLUTION********;
proc fcmp;
array a[5, 5] / nosymbols;
rc1 = read_array('have1', a);
array b[5, 5] ;
call transpose(a, b);
rc2 = write_array('want1_2', b);
quit;

******(5) EXAMPLE 5: FCMP DEDUPLICATION FUNCTION FOR FCMP ARRAY*******;
****(5.1) ENCAPSULATE DEDUPLICATIONA UTILITY OF PROC SQL IN A MACRO ******;
%macro deldup_array;
%let dsname = %sysfunc(dequote(&dsname));
%let arrayname = %sysfunc(dequote(&arrayname));
/*(5.1.1) GENERATE UNIQUE DATASET AND ITS OBSERVATION NUMBER*/
proc sql noprint;
select count(unique(&arrayname.1)) into: obs_num
from &dsname;
create table _temp as
select distinct *
from &dsname;
quit;
/*(5.1.2) USE TEMP DATASET TO REPLACE RAW DATASET*/
data &dsname;
set _temp;
run;
/*(5.1.3) DELETE TEMP DATASET*/
proc datasets;
delete _temp;
run;
%mend deldup_array;

****(5.2) ENCAPSULATE MACRO ABOVE IN A FUNCTION*****;
proc fcmp outlib=work.func.practice;
function deldup_array(dsname $, arrayname $);
rc = run_macro('deldup_array', dsname, arrayname, obs_num);
if rc eq 0 then return(obs_num);
else return(.);
endsub;
run;

****(5.3) USE THIS FUNCION TO DELETE DUPLICATES IN ARRAY*****;
option cmplib = (work.func) mlogic mprint symbolgen;
proc fcmp;
array a[1000] /nosymbols;
do j = 1 to 1000;
a[j] = ceil((ranuni(12345)*100) + rannor(12345));
end;

dsname = 'numbers';
rc1 = write_array(dsname, a);

n = deldup_array(dsname, %sysfunc(quote(a)));

call dynamic_array(a, n);
rc2 = read_array(dsname, a);
put a = ;
quit;

*****************END OF ALL CODING****************************;

Friday, March 11, 2011

Effectiveness of two mail list groups: SAS-L and R-help


Software’s strength depends on the cohesion of the community backing it. Though a commercial package comes with technique support guarantee, the speed and efficiency of telephone wired customer service may not suit the fast-evolving programming need. Especially for a statistical package, such as SAS and R, which typically deals with many small extracting, loading, transformation and analysis tasks, quick short answer to a tricky question is desired. Community based mail list is a fast approach to get question posted and solved. With the help of Google’s Gmail, huge volume of emails generated by such mail lists can be collected and sorted conveniently. As for me, R-help and SAS-L are probably the most popular user groups for R and SAS, separately. And as a learner, I constantly gain kind help from those SAS or R gurus in the two communities. To compare the two user groups, I gathered threads from Gmail up-to-March 7th this year and parsed them to digestible data.

R’s ever-increasing popularity is reflected by enormous number of topics to be discussed. Averagely 37.4 questions are posted on R-help every day, compared with meagerly 8.8 a day on SAS-L. R-help Users tend to discuss a wide range of questions, most likely focusing on modeling and visualization on a specific package, while SAS-L users are mainly interested in data integration and management, such as Proc SQL, data step and macro. The usage deviation may demonstrate the distinctive fields for SAS and R in daily practice. Most SAS-L users are more familiar with the background involved for the questions others ask: a question usually gets 5.0 follow-ups. On the contrary, R-help users expect 2.8 follow-ups for the question they posted, and the chance of no-reply-at-all is also high. In SAS-L, information concentrates on several senior SAS experts who are experienced and ready to provide clues. R-help users are more diverse, possibly because that R has more than 3000 packages and is difficult for an R user to be an all-around player.

SAS and R also have other active mail lists and forum for technical discussion. For those who love SAS-L and R-help mailing lists and wish to become a two-way statistical programmer, getting to know some aspects of the two user groups could benefit us in using them effectively.

****(0) EXPORT THE HEADS FROM MY GMAIL TO FLAT TEXT FILES BY DIFFERENT GROUP TAGS****;
********(1) PREPARE A MACRO TO EXTRACT KEY WORDS FROM TEXTS***************;
%macro extract(group);
/*(1.1) INTEGRATE LINES FROM RAW TEXT*/
data &group.ug;
infile "H:\Regular_expression\&group._group.txt" truncover dsd lrecl=400;
input string $400.;
run;
/*(1.2) REMOVE GMAIL-TAG-CAUSING REDUNDANT LINES AND INDEX EACH LINES BY THREAD*/
data &group.ug_c;
set &group.ug;
where string is not missing
and string not in ('LinkedIn', 'R-Group', 'Inbox', 'SAS(noreply)', 'SAS-L');
thread = ceil(_n_/3);
run;
/*(1.3) TRANSPOSE TOPIC|POST AUTHORS|TIME*/
proc transpose data=&group.ug_c out=&group.ug_s(rename=(col1 = writer col2 = topic col3 = time));
by thread;
var string;
run;
/*(1.4) PARSE TARGET INFORMATION FOR TOPIC|POST AUTHORS|TIME*/
data &group.ug1;
set &group.ug_s(drop=thread _name_);
/*(1.4.1) PARSE POST AUTHOR */
position = prxmatch("/\(\d+/",writer);
if position ne 0 then
rep_num = input(compress(substr(writer, position+1, 2), ')'), 2.) ;
else rep_num = 0;
/*(1.4.2) PARSE THREAD TOPIC*/
idx = index(topic, '? -');
topic_str= input(substr(topic, 1, idx), $200.);
/*(1.4.3) PARSE TIME*/
length time1 $6.;
if sum(index(time, 'pm'), index(time, 'am')) > 0 then time1 = 'Mar 7';
else time1 = strip(time);
date = input(cats(substr(time1, 5, 2), substr(time1, 1, 3), '2011'), date9.);
format date date9.;
drop idx writer topic position time:;
run;
/*(1.5)* SUMMARIZE BY EACH DATE*/
proc sql;
create table &group.ug2 as
select mean(rep_num) as &group._mean_reply format=3.1,
sum(ifn(rep_num=0, 1, 0)) as &group._zero_reply,
max(rep_num) as &group._max_reply, count(topic_str) as &group._topicnum,
calculated &group._zero_reply / calculated &group._topicnum
as &group._zero_reply_ratio format=3.1,
date
from &group.ug1
group by date
;quit;
%mend extract;

**********(2) EXECUATE MACRO TO BUILD DATASETS FOR 2 GROUPS*****;
option mprint symbolgen;
%extract(group=sas);
%extract(group=R);

*********(3) CONCATENATE DATA AND REPORT****;
****(3.1) CONCATENATE 2 DATASETS FROM SAS AND R******;
proc sql;
create table reportdata as
select mean(rep_num) as avg_rep_num, count(topic_str) as topicnum, date, 'SAS' as ug length=3
from SASug1
group by date
union
select mean(rep_num) as avg_rep_num, count(topic_str) as topicnum, date, 'R' as ug length=3
from Rug1
group by date
order by date
;quit;

****(3.2) REPORT MEAN AND STD FOR MEAN NUMBERS OF TOPIC AND REPLY****;
proc report data=reportdata nowd headline split='|';
column ug topicnum,(mean std) avg_rep_num,(mean std);
define ug/group 'USER|GROUP' width=6;
define topicnum/'AVERAGE|TOPIC NUMBER';
define avg_rep_num/'AVERAGE|REPLY NUMBER';
define mean/format=5.1 'MEAN';
define std/format=5.2 'STD';
run;

***********(4) VISUALIZE THE COMPARISON RESULT IN TIME SERIES*******;
****(4.1) PREPARE DATASET FOR PLOTTING******;
proc sql;
create table combine as
select a.date, a.*, b.*
from rug2 as a left join sasug2 as b
on a.date = b.date
;quit;

****(4.2) BUILD PLOTTING TEMPLATE*******;
proc template;
define statgraph compplot;
begingraph / designwidth=1000px designheight=800px;
entrytitle "Comparison of user groups between SAS and R";
layout lattice / columns=1 columndatarange=union rowweights=(0.3 0.3 0.4);
columnaxes;
columnaxis / offsetmin=0.02 griddisplay=on;
endcolumnaxes;
/*(4.2.1)DEFINITION OF TOP PANEL*/
layout overlay / cycleattrs=true yaxisopts=(griddisplay=on label=" "
display=(line) displaysecondary=all);
seriesplot x=date y=R_topicnum / lineattrs=(thickness=2px)
legendlabel="R's topics" name="d25";
seriesplot x=date y=SAS_topicnum / lineattrs=(pattern=solid thickness=2px)
legendlabel="SAS's topics" name="d50";
discretelegend "d25" "d50" / across=1 border=on valign=top halign=left
location=inside opaque=true;
endlayout;
/*(4.2.2)DEFINITION OF MIDDLE PANEL*/
layout overlay / cycleattrs=true yaxisopts=(griddisplay=on label=" "
display=(line) displaysecondary=all);
seriesplot x=date y=R_zero_reply_ratio/ lineattrs=(thickness=2px)
legendlabel="R's zero reply ratio" name="d2";
seriesplot x=date y=SAS_zero_reply_ratio/ lineattrs=(pattern=solid thickness=2px)
legendlabel="SAS's zero reply ratio" name="d5";
discretelegend "d2" "d5" / across=1 border=on valign=top halign=left
location=inside opaque=true;
endlayout;
/*(4.2.3)DEFINITION OF BOTTOM PANEL*/
layout overlay / yaxisopts=(griddisplay=on display=(line) label=" " displaysecondary=all)
cycleattrs=true xaxisopts=(griddisplay=on);
seriesplot x=date y=R_mean_reply / lineattrs=(thickness=2px)
legendlabel="R's mean reply" name="ser1";
seriesplot x=date y=sas_mean_reply / lineattrs=(pattern=solid thickness=2px)
legendlabel="SAS's mean reply" name="ser2";
seriesplot x=date y=R_max_reply / lineattrs=(thickness=2px)
legendlabel="R's max reply" name="ser3";
seriesplot x=date y=sas_max_reply / lineattrs=(pattern=solid thickness=2px)
legendlabel="SAS's max reply" name="ser4";
discretelegend "ser1" "ser2" "ser3" "ser4"/ across=1 border=on valign=top halign=left
location=inside opaque=true;
endlayout;
endlayout;
endgraph;
end;
run;

****(4.3) RENDER IMAGES BY USING TEMPLATE******;
proc sgrender data=combine template=compplot;
run;

*************************END OF ALL CODING*****************************************;

Friday, March 4, 2011

Music social network on DNA microarray


The incoming 2011 KDD Cup data mining competition [1] by Yahoo! Lab posts an interesting challenge to predict the users' ratings for individual songs out of this company’s huge music database. Unlike previous KDD Cups projects filled by tons of variables that make dimension reduction a serious concern, Yahoo! Lab provides few variables: artist/genre/album. No demographic or geographic information is disclosed. It is interesting to forecast the behavior of a web user by limited web records. Digging valuable clues out for potential following direct marketing is also rewarding. Especially while the competition datasets contain up to 1 million users, 600 thousand songs, the project is a real world web-scale analytics question.

To predict song’s rating may need two-stage modeling. The relationship looks straight-forward: rating = (genre + album + artist)* user. Building 1millinon multiply by 600 thousand models to suit each of the 1 million users on each of the 600K songs would be a formidable job. The first stage has to group users and songs to reasonable levels by unsupervised modeling methods, such as clustering. At the second stage, supervised models are to be trained for the songs and users separately. The relationship among rate, genre and artist, in 3D cubes, would decide to which group those songs or users are most likely to fit in. After scoring, in the test dataset, each song would have a song group ID and each user has a user group ID. The association between song group and user group predicts the rating values.

Song listeners get connected by their preference, purposely or not purposely. In a broad picture, those relationships formed an artist-centered social network and listeners scattered around those axis with various distances. As a result, the music fans interact each other and dwell in many social neighborhoods. In this project, the scoring card to predict song's rating will be like a huge matrix between song groups and user groups with rating as values. Biologists tackling the interaction among thousands of genes by DNA microarray would be familiar with this scenario. Usually DNA microarray is used to discover high-throughput information of relationship of multiple genes. Similarly, for music social network, this DNA microarray like matrix would show the affinity between users and songs. To increase the number of cells in this matrix, or overall number of song groups or user groups, is a good idea to bring about higher precision. However, prediction accuracy and computer resource consumption of the models are the first and foremost considerations for this particular question.

Statistics persons call data mining as statistical learning, while computer persons refer it as machine learning. Eventually both roads lead to the same way. The datasets for KDD cup 2011 data mining competition will be released on March 15. Though its 263m rating data for track1 and 63m rating data for track looks quite a lot, statistics guys probably can gear up with their high-level weapons, such as SAS and R, to explore this usually computer-geek-dominated field.

Reference: 1. 'KDD CUP 2011 from Yahoo! Lab'. http://kddcup.yahoo.com/
********(0) DOWNLOAD PREVIEW DATASET TRACK2*****;
http://kddcup.yahoo.com/dataset.track2.sample.tar.bz2

******(1) INTEGRATE PREVIEW SONG DATA*********;
data trackdata;
infile 'C:\trackData2.txt' dlm='|' missover dsd;
informat TrackId AlbumId ArtistId GenreId1-GenreId15 $6.;
input TrackId AlbumId ArtistId GenreId1-GenreId15 ;
if AlbumId = 'None' then call missing(AlbumId);
if ArtistId = 'None' then call missing(ArtistId);
run;

proc sql;
select count(unique(ArtistId))
from trackdata
;quit;

********(2) NORMALIZE SONG DATA*********;
data three;
set trackdata (where=(missing(ArtistId)=0));
array genre[15] GenreId1-GenreId15;
do i=1 to 15;
if missing(genre[i])= 0 then do;
songgenre=genre[i]; output;
end;
end;
keep ArtistId songgenre;
run;

*******(3) FAST CLUSTERING THE SONGS ONLY BY GENRES******;
proc sort data=three out=four;
by ArtistId;
run;

proc sql;
create table five as
select ArtistId, songgenre, count(songgenre) as freq
from four
group by ArtistId, songgenre
order by ArtistId, songgenre
;quit;

proc transpose data=five out=five_t(drop=_name_) prefix=genre;
by ArtistId;
id songgenre;
var freq;
run;

data six;
set five_t;
array x[*] _numeric_;
do i=1 to dim(x);
if missing(x[i]) = 1 then x[i]=0;
end;
drop i;
run;

proc fastclus data=six maxc=10 maxiter=100 out=clus;
var genre:;
run;

*****(4) Simulate data for Genomic Heat Map********;
data music (keep=songgroup usergroup rate);
do i=1 to 20;
songgroup = cats('song', i);
do j=1 to 20;
usergroup=cats('user', j);
rate= ranuni(0)*100;
output;
end;
end;
run;

proc template;
define statgraph myheat.Grid;
begingraph;
layout overlay / border=true xaxisopts=(label='Song groups' )
yaxisopts=(label='User groups' ) pad=(top=5px bottom=0px right=15px);
scatterplot x=songgroup y=usergroup / markercolorgradient=rate
markerattrs=(symbol=squarefilled size=30) colormodel=threecolorramp name='s2';
continuouslegend 's2' / orient=vertical location=outside valign=center
halign=right valuecounthint=10;
endlayout;
endgraph;
end;
run;

proc template;
define Style HeatMapStyle;
parent = styles.harvest;
style GraphFonts from GraphFonts "Fonts used in graph styles" /
'GraphFootnoteFont' = (", ",8pt)
'GraphLabelFont' = (", ",7pt)
'GraphValueFont' = (", ",10pt)
'GraphDataFont' = (", ",5pt);
style GraphColors from graphcolors /
"gcdata1" = cxaf1515
"gcdata2" = cxeabb14
"gcdata3" = cxffffff
"gramp3cend" = cxaa081b
"gramp3cneutral" = cx000000
"gramp3cstart" = cx1ba808;
end;
run;

ods html style=HeatMapStyle image_dpi=300 file='heatmap.html' path='d:\myfun';

ods graphics on / reset imagename='GTLHandout_Heatmaps' imagefmt=gif;
proc sgrender data=music template=myheat.grid;
run;

ods html close;