Tuesday, February 3, 2015

Solve the Top N questions in SAS/SQL

This is a following post after my previous post about SAS/SQL.
SAS’s SQL procedure has a basic SQL syntax. I found that the most challenging work is to use PROC SQL to solve the TOP N (or TOP N by Group) questions. Comparing with other modern database systems, PROC SQL is lack of -
  • The ranking functions such as RANK() or the SELECT TOP clause such as
    select TOP 3 * 
    from class
    ;

  • The partition by clause such as
    select sex, name, weight
    from (select sex, name, max(weight) over(partition by sex) max_weight
    from class)
    where weight = max_weight
    ;

However, there are always some alternative solutions in SAS. I list a few question from an ascending difficulty below to explore the possibilities.
Prepare the data
First a SASHELP.CLASS dataset is used as a demo (availabe for every SAS copy). It is a small weight and height dataset from a faked class of 19 children. Now I only keep the weight variable as target column.
data class;
set sashelp.class;
keep name sex weight;
run;

proc sort;
by descending weight;
run;
NameSexAgeWeight
PhilipM16150
RonaldM15133
RobertM12128
AlfredM14112.5
JanetF15112.5
MaryF15112
WilliamM15112
CarolF14102.5
HenryM14102.5
JohnM1299.5
BarbaraF1398
JudyF1490
ThomasM1185
JaneF1284.5
AliceF1384
JeffreyM1384
JamesM1283
LouiseF1277
JoyceF1150.5
1. Select highest value
It is straightforward to use the outobs option at the begining to single out the highest weight.
title "Select highest weight overall";
proc sql outobs = 1;
select name, weight
from class
order by weight desc
;quit;
NameWeight
Philip150
2. Select second highest value
How about the second highest weight? The logic is simple — if we remove the highest weight first, then the second highest weight will take the first row.
title "Select second highest weight overall";
proc sql outobs = 1;
select name, weight
from class
where weight not in (select max(weight) from class)
order by weight desc
;quit;
NameWeight
Ronald133
3. Select Nth highest value
Now it comes to the hard part. How about the Nth highest value, say, the fourth highest weight? Now we have to do a self-joining to let the distinct value point to 3. Since there are two children with the weight 112.5, the query returns the two tied names.
title "Select Nth highest weight";
%let n = 4;

proc sql;
select distinct a.name, a.weight
from class as a
where (select count(distinct b.weight)
from class as b
where b.weight > a.weight
)
= &n - 1;
quit;
NameWeight
Alfred112.5
Janet112.5
4. Select highest values by group
There are two groups Male and Female in the class, and the easiest way to find the highest weight for each category is select max for female union select max for male. However, a more scalable solution is to use the group by clause that fits more than two groups.
title "Select highest weights by group";
proc sql;
select sex, name, weight
from class
group by sex
having weight
= max(weight)
;quit;
SexNameWeight
FJanet112.5
MPhilip150
5. Rank all values
The ultimate solution to solve all the question above is to derive a rank column for the target. There are two solutions: the first one use a subquery in the select clause, while the second one utilizes a subquery in the where clause.
The subquery in the first solution is independent to the main query, which uses less codes and is easier to recall in practice. The second one is actually a self-joining that is faster than the first solution.
/* Solution I */
proc sql;
select name, weight, (select count(distinct b.weight)
from class as b where b.weight >
= a.weight) as Rank
from class as a
order by rank
;quit;

/* Solution II */
proc sql;
select a.name, a.weight, count(b.weight) as rank
from class as a, (select distinct weight
from class
) as b
where a.weight <
= b.weight
group by a.name, a.weight
order by a.weight desc
;quit;
NameWeightRank
Philip1501
Ronald1332
Robert1283
Alfred112.54
Janet112.54
Mary1125
William1125
Henry102.56
Carol102.56
John99.57
Barbara988
Judy909
Thomas8510
Jane84.511
Alice8412
Jeffrey8412
James8313
Louise7714
Joyce50.515
6. Select top N values by group
Once with the rank column at hand, many perplexing problems could be easily solved. For example, we can use it to find the top 3 heaviest people for each category of male and female. And it is also scalable to more than two groups.
title "Select Top N weights by group";
proc sql;
select a.sex, a.name, a.weight, (select count(distinct b.weight)
from class as b where b.weight >
= a.weight and a.sex = b.sex ) as rank
from class as a
where calculated rank <
= 3
order by sex, rank
;quit;
SexNameWeightrank
FJanet112.51
FMary1122
FCarol102.53
MPhilip1501
MRonald1332
MRobert1283

Sunday, February 1, 2015

Deploy a MongoDB powered Flask app in 5 minutes

This is a quick tutorial to deploy a web service (a social network) by the LNMP (Linux, Nginx, MongoDB, Python) infrastructure on any IaaS cloud. The repo at Github is at https://github.com/dapangmao/minitwit-mongo-ubuntu.

Stack

The stack is built on the tools in the ecosystem of Python below. 

ToolNameAdvantage
CloudDigitalOceanCheap but fast
Server distroUbuntu 14.10 x64Everything is latest
WSGI proxyGunicornManage workers automatically
Web proxyNginxFast and easy to configure
FrameworkFlaskSingle file approach for MVC
Data storeMongoDBNo scheme needed and scalable
DevOpsFabricAgentless and Pythonic
In addition, a Supervisor running on the server provides a daemon to protect the Gunicorn-Flask process.

The MiniTwit app

The MiniTwit application is an example provided by Flask, which is a prototype of Twitter like multiple-user social network. The original application depends on SQLite. However, the data store could be modified to fit the category of NoSQL such as Google Data Store or MongoDB. A live MintiTwit demo is hosted at http://minitwit-123.appspot.com/public

Deployment

1. Install Fabric and clone the Github repo
The DevOps tool is fabric that is simply based on SSH. The fabfile.py and the staging flask files are stored on Github. We should install fabric and download the fabfile.py on the local machine before the deployment.
sudo pip install fabric 
wget https://raw.githubusercontent.com/dapangmao/minitwit-mongo-ubuntu/master/fabfile.py
fab -l
2. Enter IP from the virtual machine
A new VM from ausually emails IP address and the root password. Then we could modify the head part of the fabfile.py accordingly. There are quite a less expensive cloud providers for prototyping other than the costly Amazon EC2. For example, a minimal instance from DigitalOcean only costs five dollars a month. If SSH key has been uploaded, the password could be ignored.
env.hosts = ['YOUR IP ADDRESS'] #  Enter IP
env.user = 'root'
env.password = 'YOUR PASSWORD' # Enter password
3. Fire up Fabric
Now it is time to formally deploy the application. With the command below, the fabric will first install pip, git, nginx, gunicorn, supervisor and the latest MongodB, and configure them sequentially. In less than 5 minutes, a Flask and MongoDB application will be ready for use. Since DigitalOcean has its own software repository for Ubuntu, and its VMs are on SSD, the deployment is even faster, which is usually finished in one minute.
fab deploy_minitwit