Friday, January 28, 2011

Top 10 most powerful functions for PROC SQL


ABSTRACT

PROC SQL is not only one of the many SAS procedures and also a distinctive subsystem with all common features from SQL (Structured Query Language). Equipped with PROC SQL, SAS upgrades to a full-fledging relational database management system. PROC SQL provides alternative ways to manage data other than the traditional DATA Step and SAS procedures. In addition, SAS’s built-in functions are the add-on tools to increase the power of PROC SQL. In this paper, we illustrate ten popular SAS functions, which facilitate the capacity of PROC SQL in data management and descriptive statistics.

INTRODUCTION

Structured Query Language (SQL) is a universal computer language for all relational database management systems. PROC SQL is the implementation of the SQL syntax in SAS. It first appeared in SAS 6.0, and since then has been widely used for SAS users. PROC SQL greatly increases SAS’s flexibility in handling data, especially for multiple-table joining and database access. There are a number of comparisons between the DATA Step and the SQL procedure in SAS [1]. A majority of SAS functions can be directly used in the SQL procedure. And the PROC procedure also enjoys a few unique functions. In this paper, we select the 10 SAS functions, and show their usage in the SQL procedure. For demonstration purpose, we simulate a Social Security Number (SSN) dataset with two entriesfrom different sources. Each entry of the 1000 records should be identical but some values are missing. 
*****(0) Simulate two datasets for demonstration********************;
*****(0.1) Simulate a dataset for two SSN entries*******************;
datassn_data;
   doi = 1 to1000;
      ssn1 = ceil((ranuni(1234)*1E9));
      ssn2 = ssn1;
      ifssn1 le ceil((ranuni(1000)*1E9)) then call missing(ssn1);
      ifssn2 le ceil((rannor(2000)*1E9)) then call missing(ssn2);
      dropi;
      output;
   end;
   formatssn1 ssn2 ssn11.;
run;

We also simulate a patient-visiting dataset with three patient IDs. Every patient receives three different treatments at each visit. The effects of the treatments (1 means effective; 0 means not effective) and the cost for each visit are recorded. Other than the two simulated datasets, two datasets shipped with SAS, SASHELP.CLASS andSASHELP.CARS, are also used in the paper.

*****(0.2) Simulate a dataset for hospital visits ******************;
datahospital_data;
   inputid visit treat1 treat2 treat3 cost;
   formatcost dollar8.2;
   cards;                                  
   1 1 0 0 0 520
   1 2 1 0 0 320
   1 3 0 1 0 650
   2 1 1 0 0 560
   2 2 1 0 0 360
   3 1 1 0 0 500
   3 2 0 0 1 350
;;;
run;

TOP 10 FUNCTIONS FOR THE SQL PROCEDURE IN SAS

1. The MONOTONIC function
The MONOTONIC function is quite similar to the internal variable _N_ in DATA Step. We can use it to select the records according to their row number. For example, we choose the SSNs from the 501thline to the 888th line in the SSN dataset.
****(1) MONOTONIC: specify row numbers******************************;
procsql;  
   select*
   fromssn_data
   wheremonotonic() between 501 and800
;quit;  

2. The COUNT, N and NMISS functions
These counting functions are especially useful in data cleaning. By using them, the detailed missing status is shown in only one output table. For the SSN dataset, we can display the total numbers of the missing and non-missing values for each SSN entry.

****(2) COUNT/N/NMISS: find total and missing values****************;
procsql;
   selectcount(*) as n 'Total number of the observations',
      count(ssn1) asm_ssn1 'Number of the missing values for ssn1',
      nmiss(ssn1) asnm_ssn1 'Number of the missing values for ssn1',
      n(ssn2) asm_ssn1 'Number of the nonmissing values for ssn2'
      nmiss(ssn2) asnm_ssn2 'Number of the non-missing values for ssn2'
   fromssn_data
;quit;

3. The COALESCE function
The COALESCE function does the magic to combine multiple rows into a single one with any non-missing value. In this example, there are two rows of SSNs, and supposedly they should be identical each other. However, some of them are missing due to input errors or other reason. The COALESCE function in the SQL statement below checks the value of the two rows and returns the first non-missing value, which maximizes the SSN information.
****(3) COALESCE: combine values among columns**********************;
procsql;
   selectmonotonic() as obs, coalesce(ssn1, ssn2) asssn format = ssn11.  
   fromssn_data
;quit;

4. The MISSING function
The MISSING function returns a Boolean value for a variable (0 when non-missing; 1 when missing). In the example below, the missing status of the values in the SSN dataset is displayed row by row.
****(4) MISSING: return Boolean for missing value*******************;
procsql ;
   selectmonotonic() as obs,
      (casesum(missing(ssn1), missing(ssn2))
        when0 then 'No missing'
        when1 then 'One missing value'
        else'Both missing values'
       end) as status 'Missing status'
   fromssn_data
;quit;
5. The SPEDIS and SOUNDEX functions
The two functions can fulfill fuzzy matching. For example, if we want to examine the first entry of the SSN dataset to see if there is any possible duplicate, we can use the SPEDIS function in the SQL statement to look up any pair of the records. Here we set the argument to be 25 in order to detect any singlet [2].

****(5)SPEDIS/SOUNDEX: fuzz matching*********************************;
****(5.1)SPEDIS: find spelling mistakes******************************;
procsql;
   selecta.ssn1 as x, monotonic(a.ssn1) asx_obs,
      b.ssn1 asy, monotonic(b.ssn1) as y_obs
   fromssn_data as a, ssn_data asb
   where(x gt y) and (spedis(put( x, z11.), put( y, z11.)) le 25)
;quit;

For human names, w
e can check similarities by the SOUNDEX function to avoid duplicates [3]. The SASHELP.CLASS has 19 names. Phonically, John and Jane look similar according to the SOUNDEX function.

****(5.2)SOUNDEX: find phonic similarity*****************************;
procsql;
   selecta.name as name1, b.name asname2
   fromsashelp.class as a, sashelp.class asb
   wheresoundex(name1) = soundex(name2) and(name1 gt name2)
;quit;

6. The RANUNI function
This function does simple random sampling like PROC SURVEYSELECT. We can specify the OUTOBS option at the beginning to choose the sample size.

****(6)RANUNI: simple random sampling********************************;
procsql outobs=30;
   select*
   fromssn_data
   orderby ranuni(1234)
;quit;

7. The MAX function
The MAX function returns the maximum value and sometimes simplifies column-wise aggregation. For the patient-visiting dataset, if we need to know if each treatment is effective for the patients, it may take some time to code the RETAIN statement and temporary variables at DATA Step, while the MAX function at PROC SQL is quite straightforward.
****(7)MAX: find the maximum value for each column******************;
procsql;
   select id, max(treat1) aseffect1 'Effect after Treatment 1',
      max(treat2) aseffect2 'Effect after Treatment 2',
      max(treat3) aseffect3 'Effect after Treatment 3'
   fromhospital_data
   groupby id
;quit;

8. The IFC and IFN functions
The two functions play a role like the CASE-WHEN-END statements in typical SQL syntax, if the condition is about a binary selection. The IFC function deals with character variables, while the IFN function is for numbers. For the patient-visiting dataset, we can use the two functions together to find the total cost, the discounted cost (a 15% discount is applied if the total cost is greater than $1,000), and whether the first treatment is effective for each patient.

****(8)IFC/IFN: binary selection for either character and number****;
procsql;
   selectid, ifc(max(treat1) = 1, 'Yes', 'No') asoverall_effect
      length = 3'Any effect after treatment 1',
      sum(cost) assum_cost format = dollar8.2 'Total cost',
      ifn(calculated sum_cost ge 1000,
         calculated sum_cost*0.85,
         calculated sum_cost*1) as discounted_cost
      format=dollar8.2'Total cost after discount if any'
   fromhospital_data
   groupby id
;quit;

9. The UNIQUE function
This function is very convenient to show the number of the levels for every categorical variable.

****(9)UNIQUE: find the levels of categorical variables************;
procsql;
   selectcount(unique(make)) as u_make 'Number of the car makers',
      count(unique(origin)) asu_origin 'Number of the car origins',
      count(unique(type)) asu_type 'Number of the car types'
   fromsashelp.cars
;quit;

10. The PUT function
We can apply the PUT function with a user-defined format by PROC FORMAT in the WHERE statement to create filters.  For the SASHELP.CARS dataset, this strategy is used to choose only the high or medium priced cars.

****(10)PUT: create an filter by user-defined format***************;
procformat;
   valuerange
      40000-high='High'
      26000-< 40000='Medium'
      other ='Low';
run;

procsql;
   selectmodel, make, msrp,
      msrp asrange 'Price Range' format = range.
   fromsashelp.cars
   whereput(msrp, range.) in('High', 'Medium')
;quit;

CONCLUSION

The combination of SAS’s powerful functions and the SOL procedure will benefit SAS users in data management and descriptive statistics.

REFERENCES
1.     Christianna S. Williams. ‘PROC SQL for DATA Step Die-hards’. SAS Global Forum Proceeding 2008.
2.     Yefim Gershteyn.'Use of SPEDIS Function in Finding Specific Values'. SAS Users Group International 25.
3.     Amanda Roesch.'Matching Data Using Sounds-Like Operators and SAS® Compare Functions'. Northeast SAS Users Group 2011.