Saturday, December 31, 2011

Using SQL for data management


Every weekday I pull out data from IBM DB2 database and do analysis in SAS. I use SQLite to record my personal chores, such as credit card transactions. And I am taking courses of database development by Microsoft SQL Server. Then I have pretty enough time to compare the 4 softwares. Recently I have the rough feeling -- the relational database end and the analytics end look more and more similar, while the databases brings more analysis procedures and every statically packages is eager to introduce SQL feature as enhancement. Thus, Structured Query Language (SQL) tends to be the universal language in the data world.

SQLs in DB2, Microsoft SQL Server (T-SQL), SQLite and SAS (PROC SQL) follow the same logic but contain different characteristics. We can treat them as the branches originated from the ANSI SQL standard. I play around their own SQL dialects and try to learn how to manage data in different ways.

Data
Using the SQL syntax below,  the SASHELP.CLASS data set, which contains the gender, name, age, weight and height information of 18 fake teenagers, will be able to be transferred to any database system.
create table class (name char(8), sex char(1), 
age numeric, height numeric, weight numeric );
insert into class values('ALFRED','M',14,69,112.5);
insert into class values('ALICE','F',13,56.5,84);
insert into class values('BARBARA','F',13,65.3,98);
insert into class values('CAROL','F',14,62.8,102.5);
insert into class values('HENRY','M',14,63.5,102.5);
insert into class values('JAMES','M',12,57.3,83);
insert into class values('JANE','F',12,59.8,84.5);
insert into class values('JEFFREY','M',13,62.5,84);
insert into class values('JOHN','M',12,59,99.5);
insert into class values('JOYCE','F',11,51.3,50.5);
insert into class values('JUDY','F',14,64.3,90);
insert into class values('LOUISE','F',12,56.3,77);
insert into class values('MARY','F',15,66.5,112);
insert into class values('PHILIP','M',16,72,150);
insert into class values('ROBERT','M',12,64.8,128);
insert into class values('RONALD','M',15,67,133);
insert into class values('THOMAS','M',11,57.5,85);
insert into class values('WILLIAM','M',15,66.5,112);

Random sampling
I randomly select 8 out of the 18 persons. The realization of this purpose mainly relies on the software's' random function and output control.

/********************** a. SQL Server **********************/
select top (8) *
from class
order by NEWID();

/********************** b. DB2 **********************/
select *
from class
order by rand()
fetch first 8 rows only;

/********************** c. SQLite **********************/
select *
from class
order by random()
limit 8;

/********************** d. SAS **********************/
proc sql outobs = 8;
select *
from class
order by ranuni(0);

Report subtotal of a variable as the last observation
It is a quite common task to make a table with subtotal for reporting. DB2 and SQL Server both have ROLLUP operator/function, while SAS may utilize its macro facility.

/********************** a. SQL Server **********************/
select coalesce(name, 'Total') as name, SUM(weight) as weight
from class
group by name with rollup;

/********************** b. DB2 **********************/
select case grouping(name)
when 0 then name
else 'Total'
end name,
sum(weight) as weight
from class
group by rollup(name);

/********************** c. SQLite **********************/
select name, sum(weight) as weight
from class
group by name
union all
select 'Total', sum(weight)
from class;

/********************** d. SAS **********************/
select sum(weight) into: sum_weight from class;
create table class2 as select name, weight from class;
insert into class2 values('Total', &sum_weight);

Retrieve column information of a table
Each database tries to store the metadata somewhere, so that we are able to retrieve the information of the columns/variables, such as their name, data type and position at the table. Only SQLite, a portable database, has no designated table to store this information. However, running the command line shell of SQLite has the same effect.

/********************** a. SQL Server **********************/
select column_name, data_type, ordinal_position
from information_schema.columns
where table_name = 'class';

/********************** b. DB2 **********************/
select colname, typename, colno
from syscat.columns
where tabname = 'class';

/********************** c. SQLite **********************/
pragma table_info(class);

/********************** d. SAS **********************/
select name, type, varnum
from sashelp.vcolumn
where libname = 'WORK' and memname = 'CLASS';

Obtain the running total of a variable
It seems like a difficult job for native SQL syntax. SQL Server and DB2 evolve with their ranking functions -- DB2 has sum() over() and SQL Server has row_number() over(), which can be used to solve this problem. In SAS, the better way is to use DATA step instead.

/********************** a. SQL Server **********************/
select name, weight, sum(weight) over (order by weight) as running_total
from class
order by weight

/********************** b. DB2 **********************/
create view v_class as
select ROW_NUMBER() OVER(ORDER BY name) as rownum, *
from class;
select a.name, a.weight,
(select SUM(b.weight) from v_class b
where b.rownum <= a.rownum) as running_total
from v_class as a

/********************** d. SAS **********************/
data class2;
class;
retain running_total;
running_total + weight;
keep name weight running_total;
run;

Sunday, December 25, 2011

Four ways to add row number in SQL Server



It is easy to sort any data in SQL Server, while it is not a trivial job to add a new variable of row number to an existent table. Here I come with four ways to realize the purpose.


/********** 0. Input data source *********************/
use tempdb
if object_id('class') is not null
drop table class;
go

-- Use a table with 5 variables from 18 teenagers
create table class (name char(8), sex char(1),
age numeric, height numeric, weight numeric );
insert into class values('ALFRED','M',14,69,112.5);
insert into class values('ALICE','F',13,56.5,84);
insert into class values('BARBARA','F',13,65.3,98);
insert into class values('CAROL','F',14,62.8,102.5);
insert into class values('HENRY','M',14,63.5,102.5);
insert into class values('JAMES','M',12,57.3,83);
insert into class values('JANE','F',12,59.8,84.5);
insert into class values('JEFFREY','M',13,62.5,84);
insert into class values('JOHN','M',12,59,99.5);
insert into class values('JOYCE','F',11,51.3,50.5);
insert into class values('JUDY','F',14,64.3,90);
insert into class values('LOUISE','F',12,56.3,77);
insert into class values('MARY','F',15,66.5,112);
insert into class values('PHILIP','M',16,72,150);
insert into class values('ROBERT','M',12,64.8,128);
insert into class values('RONALD','M',15,67,133);
insert into class values('THOMAS','M',11,57.5,85);
insert into class values('WILLIAM','M',15,66.5,112);

/********** 1. Primary key with auto-increment *********************/
create table class2 (row_num smallint identity(1,1) primary key clustered,
name char(8), sex char(1), age numeric, height numeric, weight numeric );
insert into class2(name, sex, age, height, weight)
select * from class
order by weight, name;

select * from class2;
drop table class2;

/********** 2. Subquery *********************/
select name, weight, (
select count(*) from class as b
where b.weight < a.weight or (b.weight = a.weight and b.name <= a.name)
) as row_num
from class as a
order by weight , name

/********** 3. Cursor *********************/
create table classcursor(name char(8), weight int, row_num int);
declare @name as char(8), @weight as int, @row_num as int

begin tran
declare rncursor cursor fast_forward for
select name, weight from class order by weight , Name;

open rncursor;

-- Set the initial value
set @row_num = 0;
fetch next from rncursor into @name, @weight;

-- Here come a loop
while @@fetch_status = 0
begin
set @row_num = @row_num + 1;
insert into classcursor(name, weight, row_num)
values(@name, @weight, @row_num)
fetch next from rncursor into @name, @weight
end
close rncursor;
deallocate rncursor;
commit tran

select * from classcursor
drop table classcursor

/********** 4. Row_number() function *********************/
select name, weight, row_number() over(order by weight, name) as rownum
from class

Friday, December 16, 2011

Merry Christmas!

Just for fun.

data cars;
set sashelp.cars;
length label $10.;
keep invoice weight enginesize type label;
if _n_ = 1 then label = 'Merry';
if _n_ = 50 then label = 'Christmas';
run;

proc sgplot data = cars noautolegend;
bubble x = invoice y = weight size = enginesize / group = type datalabel = label datalabelattrs=(color=red
family="garamond" style=italic size=45 weight= bold) transparency=0.4;
xaxis max = 80000 label = ' '; yaxis label = ' ' max = 6000;
run;

Monday, December 12, 2011

10 toughest interview questions for R developers (1)

I recently discovered 10 questions most daunting in R development, and I am trying to find the answers below.

1. Data structure -- How many data structures R has? How do you build a binary search tree in R?
2. Sorting -- How many sorting algorithms are available? Show me an example in R.
3. Low level -- How do you build a R function powered by C?
4. String -- How do you implement string operation in R?
5. Vectorization -- If you want to do Monte Carlo simulation by R, how do you improve the efficiency?
6. Function -- How do you take function as argument of another function? What is the apply() function family?
7. Threading -- How do you do multi-threading in R?
8. Memory limit and database -- What is the memory limit of R? How do you avoid it? How do you use SQL in R?
9. Testing -- How do you do testing and debugging in R?
10. Software development -- How do you develop a package? How do you do version control?

Q1. Data structure -- How many data structures R has? How do you build a binary search tree in R?
My answer: R mainly has 5 data structures.
Homogeneous(contain the same type of objects): vector --> matrix --> array
Heterogeneous(allow different type of objects): list --> data frame

A binary search tree requires several actions: implement a tree, insert nodes and delete nodes. We should create individual routines in R.

In R, a matrix is the ideal data structure to contain the linked elements. Then a list should be used to wrap the matrix and pass the arguments.

For insert-node routine, there is the pseudocode for it. The key point here is to use recursion in R’s function. Norman Matloff gave a complete example at page 180 of his book. 

insert(X, node){
if(node = NULL)
node = new binaryNode(X,NULL,NULL)
return
}
if(X = node:data)
return
else if(X < node:data)
insert(X, node:leftChild)
else // X > node:data
insert(X, node:rightChild)
}

Q2. Sorting -- How many sorting algorithms are available? Show me an example in R.
My answer: there are mainly 5 kinds of sorting algorithms(with their complexity):
Bubble Sort - O(n^2);
Selection Sort - O(n^2)
Merge Sort - O(n log n)
Quick Sort - from O(n log n) to O(n^2)
Bucket Sort - O(n + m)

R has a native sort function sort(), which is written by C and uses Quick Sort.

There is an example of Quick Sort in R by recursion

qs <- function(x) {
if (length(x) <= 1) return(x)
seed <- x[1]
rest <- x[-1]
sv1 <- rest[rest < seed]
sv2 <- rest[rest >= seed]
sv1 <- qs(sv1)
sv2 <- qs(sv2)
return(c(sv1,seed,sv2))
}

Those most productive R developers



The number of R packages on CRAN is 3,483 on 2011-12-12. The growth of R package in the past years can be fitted by a quadratic regression perfectly.

I am always interested in who are maintaining those packages. Then I wrote an R script to extract the package head information from CRAN’s website and stored them in a SQLite database. Most R developers are maintaining 1-3 R packages. Some of them are really productive. By the correspondence addresses (Email), the top 50 R developers are listed below:

developer package
1 Kurt Hornik 23
2 Martin Maechler 23
3 Hadley Wickham 21
4 Rmetrics Core Team 19
5 Achim Zeileis 17
6 Henrik Bengtsson 17
7 Paul Gilbert 17
8 Brian Ripley 14
9 Roger D. Peng 13
10 Torsten Hothorn 13
11 Karline Soetaert 12
12 Philippe Grosjean 12
13 Robin K. S. Hankin 12
14 Charles J. Geyer 11
15 Matthias Kohl 11
16 Charlotte Maia 10
17 Mikis Stasinopoulos 10
18 Simon Urbanek (1) 10
19 Thomas Lumley 10
20 Arne Henningsen 9
21 Gregory R. Warnes 9
22 Jonathan M. Lees 9
23 Michael Hahsler 9
24 Peter Ruckdeschel 9
25 A.I. McLeod 8
26 Brian Lee Yung Rowe 8
27 Dirk Eddelbuettel 8
28 John Fox 8
29 Kaspar Rufibach 8
30 Korbinian Strimmer 8
31 Michael Friendly 8
32 Peter Solymos 8
33 Roger Bivand 8
34 Simon Urbanek (2) 8
35 Christopher Brown 7
36 David Meyer 7
37 ORPHANED 7
38 Revolution Analytics 7
39 Rob J Hyndman 7
40 Romain Francois 7
41 Ulrike Groemping 7
42 Christophe Genolini 6
43 Frank Schaarschmidt 6
44 G. Grothendieck 6
45 Hana Sevcikova 6
46 Jeffrey A. Ryan 6
47 Kjetil Halvorsen 6
48 Pei Wang 6
49 Trevor Hastie 6
50 Yihui Xie 6


### A script of R to extract R package information and
### build a SQLite databse by hchao8@gmail.com
library(ggplot2)
library(XML)
library(RSQLite)

# Create and connect a SQLite database
conn <- dbConnect("SQLite", dbname = "c:/Rpackage.db")

# Extract names of R packages available from web
allPackageURL <-
"http://cran.r-project.org/web/packages/available_packages_by_name.html"
allPackage <- na.omit(melt(readHTMLTable(allPackageURL))[, c("V1")])

# Extract individual package information from web and store data in SQLite
for (i in 1:length(allPackage)){
packageName <- allPackage[i]
packageURL <- paste("http://cran.r-project.org/web/packages/",packageName,
"/index.html", sep="")
y <- melt(readHTMLTable(packageURL))
y$L1 <- packageName
if(dbExistsTable(conn, "Rpackage")) {
dbWriteTable(conn, "Rpackage", y, append = TRUE)
} else {
dbWriteTable(conn, "Rpackage", y)
}
}
# Pull out maintainer information from SQLite database
all <- fetch(dbSendQuery(conn, "
select v2 as author, count(v2) as package
from rpackage
where v1 = 'Maintainer:'
group by v2
order by package desc
;"))

# Disconnect SQLite database
dbDisconnect(conn)

# Draw a histogram
qplot(package, data = all, binwidth = 1, ylab = "Frequency",
xlab = "R packages maintained by individual developer")
ggsave("c:/Rlist.png")

# Find 50 most productive developers
head(all, 50)

Tuesday, December 6, 2011

A new way to draw maps in SAS


SAS’s ODS Graphics technology brought the concept of layer into data visualization. We can use those SG procedures to do many tricks. Previously in SAS, a map has to be drawn from its GMAP procedure. Now we can simply use 3-4 lines of codes to sketch some maps by the scatter statement in PROC SGPLOT, such as North America or Asia.

ods html style = money;
proc sgplot data = maps.namerica noautolegend;
scatter x = x y = y / group = id markerattrs=(size=1);
xaxis grid label = ' '; yaxis grid label = ' ';
run;
proc sgplot data = maps.china ;
scatter x = x y = y /markerattrs=(size=2);
xaxis grid label = ' '; yaxis grid label = ' ';
run;



We can apply it to single countries, like China or India. For India, the aspect has to be modified a little. It can be aslo done by PROC SGSCATTER. My friend Xiangxiang has a great tutorial for this procedure.

ods html style = harvest;
proc sgplot data = maps.asia noautolegend;
scatter x = x y = y / group = id markerattrs=(size=1);
xaxis grid label = ' '; yaxis grid label = ' ';
run;

ods html style = htmlbluecml;
ods graphics on / width=6in height = 6in;
proc sgplot data = maps.india;
scatter x = x y = y /markerattrs=(size=2);
xaxis grid label = ' '; yaxis grid label = ' ';
run;
ods graphics / reset;
Thanks to SAS’s ODS group, those SG procedures always give me limitless fun