backup: Top 10 tips and tricks about PROC SQL

INTRODUCTION

PROC SQL is the implementation of the SQL syntax in SAS. It first appeared in SAS 6.0, and since then has been very popular for SAS users. SAS ships with a few sample data sets in its HELP library, and SASHELP.CLASS is one of them. This dataset contains 5 variables including name, weight, height, sex and age for 19 simulated teenagers, and in this paper I primarily use it for the demonstration purpose. Here I summarize the 10 interesting tricks and tips using PROC SQL. At the beginning, I first make a copy of SASHELP.CLASS at the WORK library and transform the row number of the data set to a new variable obs.

dataclass;

setsashelp.class;

/* Give an index for each child*/

obs = _n_;

run;

1. Calculate the median of a variable

With the aggregating HAVING clause and some self-join techniques, PROC SQL can easily calculate the median for a variable.

procsql;

selectavg(weight) as Median

from(select e.weight

fromclass e, class d

groupby e.weight

havingsum(case when e.weight = d.weight then 1else 0end)

>= abs(sum(sign(e.weight - d.weight))));

quit;

2. Draw a horizontal histogram

A histogram visualizes the distribution pattern of a variable. PROC SQL can draw a horizontal histogram by showing the frequency bars with a few asterisks for each level of the variable age.

procsql;

selectage, repeat('*',count(*)*4) as Frequency

fromclass

groupby age

orderby age;

quit;

3. Return the running total for a variable

A running total is the summation of a sequence of numbers which is updated each time with the increase of the observations. In the example below, I calculate the running total and save them as a new variable Running_total by the SUM function and a conditional statement, which logically is similar to an example in SAS/IML[1].

procsql;

selectname, weight,

(selectsum(a.weight) from class as

a wherea.obs <= b.obs) as Running_total

fromclass as b;

quit;

4. Report the total number for a variable

PROC SQL is a flexible way to find the total number for any variable by its set operator UNION and the SUM function. In the example, the total number of the variable weight is reported at the bottom of the output table.

procsql;

selectname, weight

fromclass

unionall

select'Total', sum(weight)

fromclass;

quit;

5. Retrieve the metadata for a data set

SAS stores the metadata at its DICTIONARY data sets. PROC SQL can visit the directory, retrieve the column detail, and return the information to the users.

procsql;

selectname, type, varnum

fromsashelp.vcolumn

wherelibname = 'WORK' andmemname = 'CLASS';

quit;

6. Rank a variable

Besides the designated ranking procedure PROC RANK in SAS, PROC SQL can also do some simple ranking as well.

procsql;

selectname, a.weight, (select count(distinctb.weight)

fromclass b

/* Rank by the ascending order for the weight variable*/

whereb.weight <= a.weight) as rank

fromclass a;

quit;

7. Simple random sampling

PROC SQL is widely used in simple random sampling. For example, I randomly choose 8 observations by the OUTOBS option at the PROC statement. The randomization process is realized by the RANUNI function at the ORDER BY statement with a seed 1234.

procsql outobs = 8;

select*

fromclass

orderby ranuni(1234);

quit;

8. Replicate a data set without data

In PROC SQL, it is a fairly straightforward one-line statement to create a new empty data set while keeps all the structure of the original data set.

procsql;

createtable class2 likeclass;

quit;

9. Transpose data

Usually DATA step ARRAY and PROC TRANSPOSE allow SAS users to restructure the data set, while PROC SQL sometimes is an alternative solution. For instance, if we need a wide-to-long operation to list the names of the children by their gender in the CLASS date set, then PROC SQL can fulfill the functionality through the combinations of some queries and subqueries.

procsql;

selectmax(case when sex='F'

thenname else ' ' end) as Female,

max(casewhen sex='M'

thenname else ' ' end) as Male

from(select e.sex,

e.name,

(selectcount(*) from class d

wheree.sex=d.sex and e.obs < d.obs) aslevel

fromclass e)

groupby level;

quit;

10. Count the missing values

Another advantage of PROC SQL is that its NMISS function works for both numeric and character variables [2], which makes PROC SQL an ideal tool for missing value detection.

procsql;

selectcount(*) 'Total', nmiss(weight)

'Number of missing values for weight'

fromclass;

quit;

CONCLUSION

The combination of SAS’s powerful functions and the SQL procedure will benefit SAS users in data management and descriptive statistics.

5 comments:

AnonymousMay 4, 2012 at 5:35 AM
Nice tips! Data step can do #8 (replicate dataset without data) like:

data a;
set sashelp.class;
stop;
run;

But as usual, the SQL code probably reads better. (And I say that as a datastep die-hard : )

--Q.
AnonymousFebruary 5, 2013 at 2:25 PM
true median is definitely annoying.
AnonymousFebruary 20, 2013 at 1:12 PM
Since visuals facilitate quick and easy inferences,
but precise numbers are needed for reliable inferences,
I was dismayed that the horizontal bar chart does not display the count,
only a string of asterisks that is 4 times the Count PLUS 1.

With a little help from Alexandra Riley,
I came up with a PROC SQL horizontal bar chart drawn with this:

proc sql;
select age, '|',
count(*) as Count,
repeat('*',count(*)-1) as HorizontalBar
from sashelp.class
group by age
order by Count descending;
quit;

I always like to Show Them What's Important with ranking.
Hence, the descending.
With a huge number of bars,
and in a situation where lookup by categorical key is more important,
ranking is inappropriate.

With this data set, the counts happen to be small.
In the general case, they could be large,
and the multiplier by four would cause a problem.

With very large counts,
you might have to increase the SAS linesize,
which does have a maximum of 256.

LeRoy Bessler (Le_Roy_Bessler@wi.rr.com)
AnonymousJuly 1, 2014 at 9:13 AM
I noticed the same thing - we tried this on one of our 'smaller' datasets (~2.9 million records), and it took forever.

Excellent solution, but maybe PROC UNIVARIATE will get you there faster on a large dataset.
AnonymousDecember 14, 2014 at 11:40 AM
Just a reminder with SQL, count(*) and count([varname]) are not the same. The first returns a count of the number of records in the dataset/table. The second returns the count of the non missing values for the variable.

Thursday, May 3, 2012

Top 10 tips and tricks about PROC SQL

1. Calculate the median of a variable

5 comments: