Thursday, June 27, 2013

The complexity for Fibonacci numbers in SAS

Fibonacci numbers have a lot of applications. There has been a lot of interesting research on Fibonacci numbers over the past 800 years. Mathematically they are the terms of the sequence defined by F_n = F_{n-1} + F_{n-2} with F_0 = 0 and F_1 = 1.

Experiment and Result

Fibonacci numbers can be derived either from the original definition, which is a typical example of recursion, or from a mathematically simplified form called the closed-form expression. In SAS, a user-defined function Fib1 in PROC FCMP implements the recursive definition, and a second function Fib2 implements the closed-form expression. I call both functions inside a loop with N running from 10 to 40 with a step size of 5 and record the real time spent.
According to the figure above, the two algorithms show no real difference while N is relatively small. When N reaches 30 or more, the recursive function starts to consume a lot of system time, and once N goes beyond 40, SAS freezes and I have to break out of the run manually. The timing curve for the recursive expression fits an exponential relationship fairly well. In contrast, the closed-form algorithm takes almost constant time to execute.

Complexity

  • Recursive expression
    T(n) = T(n-1) + T(n-2) + O(1), which implies an exponential running time bounded above by O(2^n) (see the sketch after this list).
  • Closed-form expression
    The equation is fairly straightforward: F_n = {\phi^n - (-\phi)^{-n} \over \sqrt{5}} with \phi = {1+\sqrt{5} \over 2}. Here \phi is the well-known golden ratio. Evaluating it takes a fixed number of arithmetic operations, so its complexity is constant: T(n) = O(1).
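
To see why these bounds hold: since T(n) \ge T(n-1) + T(n-2), the running time of the recursive version grows at least as fast as F_n itself, and F_n \approx \phi^n/\sqrt{5} \approx 1.618^n/\sqrt{5}, so it is genuinely exponential; the O(2^n) upper bound follows from T(n) \le 2T(n-1) + O(1). The closed-form version, in contrast, evaluates a fixed handful of power and square-root operations regardless of n.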

Conclusion

Mathematics sometimes significantly decreases the complexity of a computation, in this case from O(2^n) to O(1). Complexity analysis is also useful for estimating how many resources are needed for a specific task.

****(1) Build user-defined functions *****************;
proc fcmp outlib = work.test.fibonacci;
   * Create the 1st function based on the recursive definition;
   function fib1(n);
      if n <= 0 then return(0);
      else if n = 1 or n = 2 then return(1);
      else return(fib1(n-1) + fib1(n-2));
   endsub;
   * Create the 2nd function based on the closed form;
   function fib2(n);
      phi = (1 + sqrt(5))/2;
      return(round((phi**n - (1-phi)**n) / sqrt(5)));
   endsub;
quit;

****(2) Run the tests sequentially********************;
options cmplib = work.test fullstimer;
%macro test();
   * Create datasets for verification;
   %do j = 10 %to 40 %by 5;
      data _fib1_&j;
         do i = 1 to &j;
            x = fib1(i);
            output;
         end;
         %put Recursion method at &j;
      run;
      data _fib2_&j;
         do i = 1 to &j;
            x = fib2(i);
            output;
         end;
         %put Closed form method at &j;
      run;
   %end;
%mend;

* Output SAS log to text file;
proc printto log = "c:\tmp\test.log" new;
run;
%test()
proc printto;
run;

****(3) Display the results***************************;
proc datasets nolist;
delete _:;
quit;

data one;
infile "c:\tmp\test.log" truncover;
input @1 line $80.;
if index(line, 'real') > 0;
run;

data two;
   set one;
   length method $20;
   retain number 5;
   * Ignore the time used by PROC PRINTTO;
   if _n_ > 1;
   if mod(_n_, 2) = 0 then do;
      method = "Recursion";
      number + 5;
   end;
   else method = "Closed form";
   time = input(substr(line, 21, 5), 5.2);
run;

proc sgplot data = two;
series x = number y = time / group = method;
yaxis label = "Time elapsed in seconds" grid;
run;

Tuesday, June 25, 2013

Use doParallel for rolling regression in R

Rolling regression for a large data set costs a lot of resources, and there are a few strategies to speed up the process. For example, in R there is the rollapply function in the zoo package, and in SAS, PROC FCMP is one of the options for optimization.
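For reference, a minimal sketch of that rollapply route (this example is not part of the original post and assumes the zoo package is installed; the variable names here are made up for illustration):

# Rolling 20-observation regressions with zoo::rollapply on a small simulated series
library(zoo)
set.seed(1234)
x_small <- rnorm(200)/100
y_small <- 0.01 + 0.4*x_small + rnorm(200, 0, 0.9)/100
roll_betas <- rollapply(zoo(cbind(y = y_small, x = x_small)), width = 20,
  FUN = function(w) coef(lm(y ~ x, data = as.data.frame(w))),
  by.column = FALSE, align = "right")
head(roll_betas)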
Revolution Analytics' doParallel package builds on the foreach package to enable parallel processing. The recent update, version 1.0.3, seems to have fixed a bug on Windows. I am interested in seeing whether the doParallel package provides any efficiency boost for rolling regression on my old Windows computer, which has two cores (Intel Core 2 Duo 3.06GHz).
I first created random vectors of length 20,000 for x and y and set the rolling window size to 20. I tested three scenarios: sequential processing, parallel processing with 2 cores, and parallel processing with 4 cores. The parameters estimated by the rolling regressions are shown in the picture above; the coefficient matrices produced by the three methods are identical.
It turned out that the time cost improved significantly under the parallel mode. Since I actually have no more than 2 cores on this computer, registerDoParallel(cores=4) automatically killed the redundant connections and performed the same as the cores=2 mode. Since most computers today have multiple cores, the doParallel package has a lot of potential in statistics.
####(1) Import libraries and set up directory#################
library(doParallel)
library(ggplot2)
setwd('C:/Google Drive/r')
getwd()

####(2) Simulate the data set ################################
# Set a few parameters for regressions
n <- 20000
beta0 <- 0.01; beta1 <- 0.4
mse <- 0.9
set.seed(1234)

# Generate x and y variables
x <- rnorm(n)/100
y <- beta0 + beta1*x + rnorm(n, 0, mse)/100

####(3) Run the tests #######################################
# Set the rolling window size to 20
windowSize <- 20

# Run sequentially
time1 <- system.time(
  z1 <- foreach(i = seq(1, (n - windowSize + 1), 1), .combine = rbind) %do% {
    idx <- i:(i + windowSize - 1)
    coefficients(lm(y[idx] ~ x[idx]))
  }
)

# Run with 2 cores registered
registerDoParallel(cores=2)
time2 <- system.time(
  z2 <- foreach(i = seq(1, (n - windowSize + 1), 1), .combine = rbind) %dopar% {
    idx <- i:(i + windowSize - 1)
    coefficients(lm(y[idx] ~ x[idx]))
  }
)

# Run with 4 cores registered
registerDoParallel(cores=4)
time3 <- system.time(
  z3 <- foreach(i = seq(1, (n - windowSize + 1), 1), .combine = rbind) %dopar% {
    idx <- i:(i + windowSize - 1)
    coefficients(lm(y[idx] ~ x[idx]))
  }
)
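
# (Aside, not part of the original script) getDoParWorkers() from the foreach package
# reports how many workers the registered backend will use
getDoParWorkers()

# doParallel also accepts an explicit cluster object, which makes it easy to shut the
# workers down with stopCluster() once the tests are finished; the coefficients should
# match the runs above
cl <- makeCluster(2)
registerDoParallel(cl)
z4 <- foreach(i = seq(1, (n - windowSize + 1), 1), .combine = rbind) %dopar% {
  idx <- i:(i + windowSize - 1)
  coefficients(lm(y[idx] ~ x[idx]))
}
stopCluster(cl)
all.equal(z2, z4)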

####(4) Display the results #################################
# Show the parameters from rolling regression
z <- data.frame(cbind(z1, 1:nrow(z1)))
names(z) <- c('beta0', 'beta1', 'exp')
ggplot(data = z, aes(x = exp, y = beta1)) + geom_line(alpha = 0.5) +
  xlab("Number of regressions") + geom_hline(yintercept = 0.4, color = 'red3') +
  theme(legend.position = "none")
ggplot(data = z, aes(x = exp, y = beta0)) + geom_line(alpha = 0.5) +
  xlab("Number of regressions") + geom_hline(yintercept = 0.01, color = 'red3') +
  theme(legend.position = "none")

# Show the time elapsed
time <- data.frame(rbind(time1, time2, time3))
row.names(time) <- c('No parallel', '2 cores', '4 cores')
ggplot(data = time, aes(x = row.names(time), y = elapsed)) +
  geom_bar(aes(fill = row.names(time)), stat = "identity") +
  ylab("Total time elapsed in seconds") + xlab("")

Thursday, June 13, 2013

The US airports with most flight routes

Summer has just begun and a lot of people are starting to experience flight delays. I am also interested in seeing which cities in the US are well connected by flights. The OpenFlights project provides free flight information, so I used R to download the data and load it into a SQLite database, since I want to keep it as persistent data. RSQLite then lets R join and query the tables in the database. Finally the flight routes and the airports were visualized with ggplot2 and ggmap.
Rank  IATA code  City               Arriving flight routes
1     ORD        Chicago            494
2     LAX        Los Angeles        438
3     DEN        Denver             401
4     JFK        New York           363
5     ATL        Atlanta            339
6     DFW        Dallas-Fort Worth  281
7     SFO        San Francisco      259
8     IAH        Houston            244
9     MIA        Miami              244
10    EWR        Newark             242
Wikipedia says that Atlanta is the busiest airport in the US by total passenger boardings. However, by the number of incoming flight routes it only ranks 5th, after Chicago, Los Angeles, Denver and New York. Possibly Atlanta is mostly a hub for connecting passengers. For somebody who really loves air travel, Chicago (with ORD) and New York (with both JFK and EWR) are the two most convenient cities to stay in, because they offer the most options.
This post was inspired by a post on the blog Data Science and R.
# Import libraries and set up directory
library(ggmap)
library(RSQLite)
setwd("C:/Google Drive/Codes")

# Read data directly from URLs
airport <- read.csv("http://openflights.svn.sourceforge.net/viewvc/openflights/openflights/data/airports.dat", header = F)
route <- read.csv("http://openflights.svn.sourceforge.net/viewvc/openflights/openflights/data/routes.dat", header = F)

# Remove the airports without IATA codes and rename the variables
airport <- airport[airport$V5!='', c('V3', 'V4', 'V5','V7','V8','V9')]
colnames(airport) <- c("City", "Country", "IATA", "lantitude", "longitude", "altitude")
route <- route[c('V3', 'V5')]
colnames(route) <- c("Departure", "Arrival")

# Store data to SQLite database
conn <- dbConnect("SQLite", dbname = "air.db")
dbSendQuery(conn, "drop table if exists airport;")
dbWriteTable(conn, "airport", airport)
dbSendQuery(conn, "drop table if exists route;")
dbWriteTable(conn, "route", route)
dbDisconnect(conn)

# Manipulate data in SQLite database
conn <- dbConnect("SQLite", dbname = "air.db")
sqlcmd01 <- dbSendQuery(conn, "
select a.type, a.city as iata, a.frequency, b.city, b.country, b.lantitude, b.longitude
from (select 'depart' as type, departure as city, count(departure) as frequency
from route
group by departure
order by frequency desc, type) as a
left join airport as b on a.city = b.iata
order by frequency desc
;")
top <- fetch(sqlcmd01, n = -1)
sqlcmd02 <- dbSendQuery(conn, "
select route.rowid as id, route.departure as point, airport.lantitude as lantitude, airport.longitude as longitude
from route left join airport on route.departure = airport.iata
union
select route.rowid as id, route.arrival as point, airport.lantitude as lantitude, airport.longitude as longitude
from route left join airport on route.arrival = airport.iata
order by id
;")
combine <- fetch(sqlcmd02, n = -1)
dbDisconnect(conn)

# Draw the flight routes and the airports on Google map
ggmap(get_googlemap(center = 'us', zoom = 4, maptype = 'roadmap'), extent = 'device') +
  geom_line(data = combine, aes(x = longitude, y = lantitude, group = id),
            size = 0.1, alpha = 0.05, color = 'red4') +
  geom_point(data = top, aes(x = longitude, y = lantitude, size = frequency),
             colour = "blue", alpha = 0.3) +
  scale_size(range = c(0, 15))

Friday, June 7, 2013

Use Google Trends and SAS to select movies to watch


The newest success story in data science is that Google search predicts box office revenue with 94 percent accuracy. I am a frequent movie theater goer, and it would be great if we could put Google's impressive research result to use.
There are quite a few offerings this summer. Right now I am considering five upcoming movies.
Title                Date
This is the End      Wednesday, June 12
World War Z          Friday, June 21
Man of Steel         Friday, June 14
Monsters University  Friday, June 21
The Internship       Friday, June 7

Google Trends reflects what keywords people are searching for, and it is a reliable and free data source. Let's use SAS to do some scripting work and generate the query URL based on the GET method.
data one;
input @1 title $25.;
cards;
This is the End
World War Z
Man of Steel
Monsters University
The Internship
;;;run;

data two (where = (missing(word) = 0));
   set one nobs = nobs;
   if _n_ ne nobs then title1 = cat(title, "%2C");
   else title1 = title;
   do i = 1 to 10;
      word = scan(title1, i, " ");
      output;
   end;
   keep word;
run;

proc sql noprint;
select word into: string separated by "%20"
from two
;quit;

data three;
length fullstring $500.;
fullstring = cats("http://www.google.com/trends/explore?q=", "&string", '&geo=US&date=today%203-m&cmpt=q');
run;

proc print;
run;
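For reference, the generated link follows the general shape below (shown with a placeholder for the encoded keywords, since the exact string depends on the titles): words within a title are joined by %20, titles are separated by %2C, and the trailing parameters restrict the trend to the US over the past three months.

http://www.google.com/trends/explore?q=<space-and-comma-encoded movie titles>&geo=US&date=today%203-m&cmpt=q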
From SAS, I print the resulting URL. Once I paste the URL into a browser, the graph clearly shows that the box office winners are going to be Man of Steel and World War Z. That makes my choice easier: I will surely not miss the two hottest movies.