Friday, May 6, 2011

Multidimensional scaling for ZIP codes clustering


Multidimensional scaling (MDS) maps the pairwise distances among a set of objects into a space of two or more dimensions. The method is becoming popular for analyzing social networks, since many social networking sites now offer handy tools that let users visualize their connections. SAS's MDS procedure, built on this technique, is a fascinating tool. Larry Hoyle [Ref. 1] used it to map SAS-L, an email list, onto a circular layout by extracting threads and email addresses. Proc MDS is also used to turn customers' perceptions into perceptual maps.
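To get a feel for what Proc MDS does before scaling up, a minimal sketch on a toy distance matrix may help. The four points and their distances are made up (corners of a 3-by-4 rectangle), and the dataset names rect_dist and rect_coords are hypothetical. Only the lower triangle of the matrix is filled in, which is the layout Proc MDS reads by default.

```sas
/* Toy data (made up): pairwise distances among the corners of a 3-by-4 rectangle */
data rect_dist;
    input point $ a b c d;
    datalines;
A 0 . . .
B 3 0 . .
C 5 4 0 .
D 4 5 3 0
;
run;

/* Fit a two-dimensional configuration from the distances */
proc mds data = rect_dist level = absolute dimension = 2 out = rect_coords;
    id point;
run;

/* Inspect the output dataset, which includes the fitted coordinates */
proc print data = rect_coords;
run;
```

The Dim1 and Dim2 values in the CONFIG rows should reproduce the rectangle's shape, though not necessarily its original orientation.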

In business, some direct marketing activities need to reduce ZIP codes to a coarser level. For example, Texas has 2650 ZIP codes, and it is sometimes useful to divide them into manageable sectors. SAS 9.2 has built-in ZIP code and map datasets, and the ‘zipcitydistance’ function returns the distance between any pair of ZIP codes. First, I constructed a 2650*2650 matrix holding the distance between every pair of ZIP codes in Texas. Proc MDS then computed two dimension variables from this matrix, and a subsequent clustering step (Proc FASTCLUS) separated the ZIP codes into 5 clusters. Proc GMAP can generate customized maps [Ref. 2], so I used it to annotate the clustered dots back onto a physical map. Comparing the two images shows that the reconstructed relative locations are pretty accurate, although the orientation of the Proc MDS map is off. That is expected: distances alone do not change under rotation or reflection, so MDS cannot recover the map's true angle.
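Before building the full matrix, it can be worth checking ‘zipcitydistance’ on a single pair. The sketch below is only an illustration: the two ZIP codes (downtown Houston and downtown Dallas) are my own picks, not values from the analysis. The function returns the distance in miles between the two ZIP code locations stored in sashelp.zipcode.

```sas
data _null_;
    /* Example ZIP codes chosen for illustration only */
    d = zipcitydistance(77002, 75201);
    put 'Distance in miles between 77002 and 75201: ' d;
run;
```

Calling the function for every pair of the 2650 Texas ZIP codes yields roughly 7 million rows, so the full cross join may take a while to run.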

References:
1. Larry Hoyle. ‘Visualizing Two Social Networks Across Time with SAS’. SAS Global Forum 2009.
2. Darrell Massengill and Jeff Phillips. ‘Tips and Tricks IV: More SAS/GRAPH Map Secrets’. SAS Global Forum 2009.

/*******************READ ME*********************************************
* - MULTIDIMENSIONAL SCALING FOR ZIP CODES CLUSTERING -
*
* SAS VERSION: SAS 9.2.2
* DATE: 07may2011
* AUTHOR: hchao8@gmail.com
*
****************END OF READ ME******************************************/

****************(1) RETRIEVE MAP AND ZIP CODE DATA IN TEXAS FROM SAS ***;
data txzip;
    set sashelp.zipcode;
    where statecode = "TX";
run;

/* State FIPS code 48 is Texas */
data txmap;
    length x y 8;
    set maps.counties;
    where state = 48;
run;

****************(2) CREATE A ZIP-TO-ZIP DISTANCE MATRIX*****************;
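proc sql;
    create table zip01 as
    select a.zip as zipa, b.zip as zipb,
           /* Use the source columns: an alias cannot be reused in the same SELECT without CALCULATED */
           zipcitydistance(a.zip, b.zip) as distance
    from txzip as a, txzip as b
    /* Sort so that the BY statement in Proc TRANSPOSE works */
    order by zipa, zipb
;quit;

/* Reshape the long pair list into a square matrix, one row per ZIP code */
proc transpose data = zip01 out = zip02 prefix = var;
    by zipa;
    id zipb;
    var distance;
run;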

****************(3) CONDUCT MDS AND CLUSTERING WITH 5 CLUSTERS**********;
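proc mds data = zip02 level = absolute out = mds_done;
    id zipa;
run;

/* Cluster on the MDS dimensions, skipping rows of mds_done that have no _name_ value */
proc fastclus data = mds_done maxc = 5 out = clus_done;
    var dim:;
    where _name_ is not missing;
run;

/* Plot from clus_done, which holds dim1, dim2 and cluster (clus_zip is not created until step 4) */
proc sgscatter data = clus_done;
    plot dim1 * dim2 / group = cluster grid;
run;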

****************(4) TRANSFORM ZIP CODE TO GEOGRAPHIC DATA FOR ANNOTATION**;
proc sql;
    create table clus_zip as
    select a.zipa as zip, a.cluster, b.x, b.y
    from clus_done as a left join txzip as b
    on a.zipa = b.zip
;quit;

data clus_zip1;
    /* Flag the annotation rows so they can be separated from the map rows after projection */
    retain anno_flag 1;
    set clus_zip;
    /* Convert degrees to radians (atan(1)/45 = pi/180); negate longitude to match maps.counties */
    x = -x * atan(1) / 45;
    y = y * atan(1) / 45;
    length function style color text $8;
    function = 'label';
    xsys='2'; ysys='2'; hsys='3';
    when='A'; style='special'; text='L';
    size = 2; position = 'E';
    /* One color per cluster */
    if cluster = 1 then color = 'blue';
    else if cluster = 2 then color = 'red';
    else if cluster = 3 then color = 'purple';
    else if cluster = 4 then color = 'green';
    else color = 'yellow';
run;

****************(5) MAP CLUSTERED ZIP CODES GEOGRAPHICALLY ******************;
/* Stack map and annotation rows so both are projected together */
data combine;
    set txmap clus_zip1;
run;

proc gproject data = combine out = combined dupok;
    id county;
run;

/* Split the projected rows back into the map and the annotation dataset */
data txmap anno_dots;
    set combined;
    if anno_flag > 0 then output anno_dots;
    else output txmap;
    drop anno_flag;
run;

ods html gpath = 'c:\';
goptions reset=all dev=gif xpixels = 1280 ypixels = 1024;
proc gmap data = txmap map = txmap anno = anno_dots;
    id county;
    choro state / nolegend;
run;
quit;
ods html close;

****************END OF ALL CODING***************************************;

1 comment:

  1. Once you have the distance matrix, why do we need to run PROC MDS? We can use PROC CLUSTER directly.
