backup: proc sql

Showing posts with label proc sql. Show all posts

Monday, June 4, 2012

Index tuning in SAS for high-volume transactional data

Why use indexes in SAS?
A page is the smallest I/O unit that SAS can read or write, including data set page and index file page. Index files in SAS are sorting and searching structures made by B-trees. “When an index is used to process a request, such as a WHERE expression, SAS does a binary search on the index file and positions the index to the first entry that contains a qualified value. SAS then uses the value’s RID or RIDs to read the observations containing the value. The entire index file is not loaded to memory; one index page is accessed at a time. The most recent index page is kept in memory”. Thus, by scanning the index file pages first, SAS may significantly reduce the number for logical reading and physical reading, and therefore improve query performance by orders of magnitude, say from O(N) to O(log(N)) for large data set.

In this example, I simulated a data set of 10 million transaction records starting from January 1st, 2005 with 7 variables including PRIMARY_KEY, HASH_KEY, FIRST_NAME, LAST_NAME , ORDER_AMOUNT, ORDER_DATE , SHIPPING_ADDRESS. This fake transactional data set is about 600 MB big on disk.


data transaction;
   retain primary_key hash_key first_name 
      last_name order_amount order_date;
   do primary_key = 1 to 1e7;
      first_name = md5(ranuni(1));
      last_name = md5(ranuni(2));
      shipping_address = md5(ranuni(3));
      hash_key = put(md5(primary_key), hex16.);
      order_amount = put(ranuni(4)*1000, dollar8.2);
      order_date = put('01jan2005'd + floor(ranuni(5)*2500), mmddyy9.);
      output;
   end;
run;

Index strategies:
1. Create clustered index for primary key
Clustered index gives data the physical row order on hard disk. Not like other relational database systems, SAS doesn’t have a special statement by either PROC SQL Or Data Step, to specify a clustered index. However, I found the easiest way is to use PROC SORT to sort the data set itself.

PROC SORT is also the prerequisite of the merge join at the DATA Step, while PROC SQL mostly uses the hash join. One thing to note is that the sorting should happen before the creation of other indexes. Otherwise, after sorting the existing index files may be lost.


proc sort data = transaction;
   by primary_key;
run;

2. Create unique index for hash value
To avoid misleading identifiers for later joining or hide sensitive information, it is quite popular nowadays to transform the primary key to the hash value for future references. SAS has a MD5 function which can generate distinguishable 128-bit hash values. To display them, I choose a valid hex16 value format.


proc sql;
   create unique index hash_key on transaction(hash_key);
quit;

3. Create simple indexes
Individual indexes are created for each of the variables such as ORDER_DATE, ORDER_AMOUNT and SHIPPING_ADDRESS.


proc sql;
   create index order_date on transaction(order_date);
   create index order_amount on transaction(order_amount);
   create index shipping_address on transaction(shipping_address);
quit;

4. Create composite indexes
Then another composite index is generated to include the first name and the last name, in response to full name search.


proc sql;
   create index full_name on transaction(first_name, last_name);
quit;

proc contents data = transaction position;
run;

Eventually this clustered (or sorted) data set contains 5 indexes, which are stored in a same-name index file data set that occupies 700 MB disk separately. In a conclusion, although building a few indexes for a large transactional data set is time-consuming and disk-costly, a query in PROC SQL or DATA Step by the WHERE statement is right now much faster (slash processing time up to 80%-90%), which is quite rewarding for many read-heavy jobs.

Friday, May 18, 2012

Use the set operator UNION in PROC SQL

SQL syntax contains a few set operators, such as UNION, EXCEPT and INTERSECT. The UNION operator concatenates the results of multiple SQL queries vertically into a single table for all matching rows, which I found particularly useful in PROC SQL while using SAS to manage data. Here come two examples.

Example 1 – Transpose data
PROC SQL can transform a dataset to any desired structure, without referring to DATA step or PROC TRANSPOSE. For example, SASHELP.CLASS can be transposed from wide to long by the UNION ALL clause, and reversely from long to wide by the MAX function and the GROUP clause.

From wide to long


data wide;
   set sashelp.class;
   drop sex;
run;

proc sql;
   create table long as
   select name, 'Age' as var, age as col1
   from wide
   union all
   select name, 'Weight' as var, weight as col1
   from wide
   union all
   select name, 'Height' as var, height as col1
   from wide
;quit;

From long to wide


proc sql;
   create table wide as
   select name,
      max(case when var = 'Age' then col1 end) as Age,
      max(case when var = 'Weight' then col1 end) as Weight,
      max(case when var = 'Height' then col1 end) as Height
   from long
   group by name;
quit;

Example 2 – Aggregate data into a cube
In SAS, PROC SQL doesn’t support the ROLLUP or CUBE clause. However, we can apply multiple UNION operators to simulate such functionality. For example, we can create a cube table to list all possible summations of the ACTUAL variable by STATE, PRODUCT, YEAR in the SASHELP.PRDSAL2 dataset. Afterward we can easily query this multi-dimensional data structure to look for interesting aggregation information, without running any other aggregating procedure again.


data prdsal2;
   set sashelp.prdsal2;
run;

proc sql;
   create table cube as
   select state,product, year, 'total by state, prodcut and year' as category,
   sum(actual) as actual
   from prdsal2
   group by state, product, year
   union
   select state, product, ., 'total by state and prodcuct', sum(actual)
   from prdsal2
   group by state, product
   union
   select state,'', year, 'total by state and year', sum(actual)
   from prdsal2
   group by state, year
   union
   select '',product, year, 'total by product and year', sum(actual)
   from prdsal2
   group by product, year
   union 
   select '' ,'', year, 'total by year', sum(actual)
   from prdsal2
   group by year
   union 
   select state, '',. , 'total by state', sum(actual)
   from prdsal2   
   group by state
   union 
   select '',  product, ., 'total by product', sum(actual)
   from prdsal2
   union
   select '', '', ., 'grand total', sum(actual)
   from prdsal2
   order by state, product, year
;quit;

Thursday, May 3, 2012

Top 10 tips and tricks about PROC SQL

INTRODUCTION

PROC SQL is the implementation of the SQL syntax in SAS. It first appeared in SAS 6.0, and since then has been very popular for SAS users. SAS ships with a few sample data sets in its HELP library, and SASHELP.CLASS is one of them. This dataset contains 5 variables including name, weight, height, sex and age for 19 simulated teenagers, and in this paper I primarily use it for the demonstration purpose. Here I summarize the 10 interesting tricks and tips using PROC SQL. At the beginning, I first make a copy of SASHELP.CLASS at the WORK library and transform the row number of the data set to a new variable obs.

dataclass;

setsashelp.class;

/* Give an index for each child*/

obs = _n_;

run;

1. Calculate the median of a variable

With the aggregating HAVING clause and some self-join techniques, PROC SQL can easily calculate the median for a variable.

procsql;

selectavg(weight) as Median

from(select e.weight

fromclass e, class d

groupby e.weight

havingsum(case when e.weight = d.weight then 1else 0end)

>= abs(sum(sign(e.weight - d.weight))));

quit;

2. Draw a horizontal histogram

A histogram visualizes the distribution pattern of a variable. PROC SQL can draw a horizontal histogram by showing the frequency bars with a few asterisks for each level of the variable age.

procsql;

selectage, repeat('*',count(*)*4) as Frequency

fromclass

groupby age

orderby age;

quit;

3. Return the running total for a variable

A running total is the summation of a sequence of numbers which is updated each time with the increase of the observations. In the example below, I calculate the running total and save them as a new variable Running_total by the SUM function and a conditional statement, which logically is similar to an example in SAS/IML[1].

procsql;

selectname, weight,

(selectsum(a.weight) from class as

a wherea.obs <= b.obs) as Running_total

fromclass as b;

quit;

4. Report the total number for a variable

PROC SQL is a flexible way to find the total number for any variable by its set operator UNION and the SUM function. In the example, the total number of the variable weight is reported at the bottom of the output table.

procsql;

selectname, weight

fromclass

unionall

select'Total', sum(weight)

fromclass;

quit;

5. Retrieve the metadata for a data set

SAS stores the metadata at its DICTIONARY data sets. PROC SQL can visit the directory, retrieve the column detail, and return the information to the users.

procsql;

selectname, type, varnum

fromsashelp.vcolumn

wherelibname = 'WORK' andmemname = 'CLASS';

quit;

6. Rank a variable

Besides the designated ranking procedure PROC RANK in SAS, PROC SQL can also do some simple ranking as well.

procsql;

selectname, a.weight, (select count(distinctb.weight)

fromclass b

/* Rank by the ascending order for the weight variable*/

whereb.weight <= a.weight) as rank

fromclass a;

quit;

7. Simple random sampling

PROC SQL is widely used in simple random sampling. For example, I randomly choose 8 observations by the OUTOBS option at the PROC statement. The randomization process is realized by the RANUNI function at the ORDER BY statement with a seed 1234.

procsql outobs = 8;

select*

fromclass

orderby ranuni(1234);

quit;

8. Replicate a data set without data

In PROC SQL, it is a fairly straightforward one-line statement to create a new empty data set while keeps all the structure of the original data set.

procsql;

createtable class2 likeclass;

quit;

9. Transpose data

Usually DATA step ARRAY and PROC TRANSPOSE allow SAS users to restructure the data set, while PROC SQL sometimes is an alternative solution. For instance, if we need a wide-to-long operation to list the names of the children by their gender in the CLASS date set, then PROC SQL can fulfill the functionality through the combinations of some queries and subqueries.

procsql;

selectmax(case when sex='F'

thenname else ' ' end) as Female,

max(casewhen sex='M'

thenname else ' ' end) as Male

from(select e.sex,

e.name,

(selectcount(*) from class d

wheree.sex=d.sex and e.obs < d.obs) aslevel

fromclass e)

groupby level;

quit;

10. Count the missing values

Another advantage of PROC SQL is that its NMISS function works for both numeric and character variables [2], which makes PROC SQL an ideal tool for missing value detection.

procsql;

selectcount(*) 'Total', nmiss(weight)

'Number of missing values for weight'

fromclass;

quit;

CONCLUSION

The combination of SAS’s powerful functions and the SQL procedure will benefit SAS users in data management and descriptive statistics.

Friday, February 4, 2011

Self-matching and its applications

Programming is all about data structure and algorithm. For example, value comparison needs to find right data structure and iteration method. To fulfill this purpose, the first thing is to load the variable with a key-value like data structure, followed by key-driven iterations. In SAS, the typical key-value data types are array and hash table. Proc Format powered format can be another option. For data merging or equivalent values searching, those data types are pretty adequate. If exact numerical difference is desired, those SAS data structures may have some obstacles. First of all, array and hash table have to be created in a data step, which means they cannot be saved and reused, unless they are unidirectional translated to another data structure - dataset. Second, data loading is somewhat cumbersome: as for array, data has to be transposed to be assigned to an array; as for hash table, data step inherent iteration has to be carefully avoided and hash object still has limited methods.

Value comparison can utilize either data step's inherent iteration or SQL's Cartesian-product iteration. Explicit loops could be largely bypassed through these means. For instance, to remove unrelated adjacent numbers in a grouped data set, a strategy called ‘look-ahead and look-back’ is suggested by one blog post on SAS Community[1]. Several SAS-L users kindly provided answers for this specific question [2]. Within the four workable solutions, using data step, Arhur Tabachneck and BBSER2009 implemented the lagged or temporary variables derived from the same variable to compare values; Joe Matise and Bolotin Yevgeniy realized the same goal by Proc SQL (Example 1).

The strategy can be further extended to the cases requiring more iterations. In one real-world sample, to search values with fixed difference in a variable, a conditional self-matching SQL command can achieve n-choosing-2 comparisons simply and quickly(Example 2). Finding longest repeated substring is a famous algorithm question in computer science. And the common answer is to construct Suffix tree and look for the deepest node. In another example below, self-matching by Proc SQL can be a short-cut solution without considering many algorithm details (Example 3).

Self-matching is not the most efficient algorithm. Proc SQL is demanding on memory, while intensive data I/O can cause data step freezing. Arhur Tabachneck metioned that the application on large dataset may lead into SQL infinite loop [2], which I think as a stack overflow due to calling too much memory. However, in many scenarios of values comparison, using self-matching technique could dramatically decrease coding complexity and save time.

References: 1. 'Look-Ahead and Look-Back'. http://www.sascommunity.org/wiki/Look-Ahead_and_Look-Back.
2. 'Removing numbers'. http://www.listserv.uga.edu/cgi-bin/wa?A1=ind1101d&L=sas-l#44.

*********(1)EXAMPLE1 --  REMOVE UNRELATED NUMBERS WITHIN GROUPS*****;
******(1.0) INPUT OF THE RAW DATA******;
data one;
 input id:$1. var1 @@;
 cards;
a  4 b  2 b  3 b  4 b  8 b  9 b  10 b  11
b  13  b  15 b  16 c  1 c  2  c  5
;
run;

*******(1.1) ARTHUR TABACHNECK'S PLAN************;
data AT_RESULT (drop=prev_: next_:);
    set one;
    by id;
    set one ( firstobs = 2 keep = var1
            rename = (var1 = next_var1) )
    one ( obs = 1 drop = _all_ );
    prev_var1 = ifn( first.id, (.), lag(var1) );
    next_var1 = ifn(  last.id, (.), next_var1 );
    if abs(var1-prev_var1) eq 1 or
    abs(var1-next_var1) eq 1 then output;
run;

******(1.2) BBSER2009'S PLAN****;
data BBSER2009_RESULT(rename=(temp_id=id temp_var1=var1)) ;
      length pre_id $1;
      retain pre_id; retain pre_var1 .; retain pre_op 0;
      set one;
      if var1-pre_var1=1 then  do;
         if pre_op=0 then do;
    temp_id=pre_id;
    temp_var1=pre_var1;
    output;
    end;
        temp_id=id;
      temp_var1=var1;
    output;
        pre_op=1;
      end;
      else pre_op=0;
      pre_id=id; pre_var1=var1;
   keep temp_id temp_var1;
run;

********(1.3) JOE MATISE'S PLAN*************;
proc sql;
create table JM_RESULT  as
  select * from one h
  where exists (
      select 1 from one v
      where h.id=v.id
      and (h.var1 = v.var1-1
        or h.var1 = v.var1+1));
quit;

*******(1.4) BOLOTIN YEVGENIY'S PLAN*****;
proc sql;
  create table BY_RESULT as
       select distinct a.*
       from    one a
               inner join one b
       on a.id = b.id
       where abs(a.var1 - b.var1)=1;
quit;

*********END OF EXAMPLE (1)**********;

*******(2) EXAMPEL 2 -- FIND ANY VALUES WITH FIXED DIFFERENCES****** ;
*****(2.1) INPUT RAW DATA******;
data timeseries;
    input  date1 date9.  var1;
    cards;
    11jan2008  11
    12jan2008 10
    13jan2008 21
    14jan2008 20
    15jan2008 30
    16jan2008 30
;
run;

*****(2.2) FIND ALL DATES WITH UNTI DIFFERENC EQUALS 10****;
proc sql;
    create table timeseries1 as
    select a.date1 as date1 format=date9., a.var1 as unit1, b.date1 as date2 format=date9., b.var1 as unit2
    from timeseries as a, timeseries as b
    where a.var1-b.var1=10
;quit;

****END OF EXAMPLE (2)**********;

****(3) EXAMPLE3 --  FIND LONGEST DUPLICATED SUBSTRING SEARCH****;
***(3.0) INPUTRAW DATA***;
data test(where=(missing(char)=0));
   input char $1.@@;
    id=_n_;
   cards;
abcaabcabd
;
run;

***(3.1) SINGLE CHARACTER MATCHING****;
proc sql;
    create table test1 as
    select  a.id as id1, b.id as id2, a.char as char1, b.char as char2, 
            abs(id1-id2) as pos
    from test as a, test as b
    where a.char=b.char and id1 lt id2
    order by pos, id1
;quit;

****(3.2) REMOVE ONE-CHARACTER DUPLICATES****;
data test2;
  set test1;
  by pos;
  dif = dif(id1);
  if first.pos then dif=1;
  if dif eq 1;
  if first.pos and last.pos then delete;
run;

****(3.3) CONCATENATING DUPLICATES****;
proc transpose data=test2 out=test3;
    by pos ; 
    var char1;
    run;
    
data test4;
    set test3;
    array allvar[*] col:;
    repeat = cats(of allvar[*]);
run;

****(3.4) SHOW THE LONGEST DUPLICATED SUBSTRING****;
proc sql;
    select max(repeat) 'The longest duplicated substring'
    from test4
;quit;

******END OF EXAMPLE(3)****************;

*****(4) EXAMPLE 4 -- FIND THE PRODUCT FROM ANY TWO NUMBERS IN AN ARRAY***

data have;    
    input var1;
    cards;
    1
    -3 
    3
    9
    12
    23
    34
    5
    8
    9
    -5
    ;
run;

data temp;
    set have;
    idx = _n_;
run;

proc sql;
    create table want as
    select a.var1 * b.var1 as product
    from temp as a, temp as b
    where a.idx ne b.idx
;quit;
**************END OF EXAMPLE 4**********;

Friday, January 28, 2011

Top 10 most powerful functions for PROC SQL

ABSTRACT

PROC SQL is not only one of the many SAS procedures and also a distinctive subsystem with all common features from SQL (Structured Query Language). Equipped with PROC SQL, SAS upgrades to a full-fledging relational database management system. PROC SQL provides alternative ways to manage data other than the traditional DATA Step and SAS procedures. In addition, SAS’s built-in functions are the add-on tools to increase the power of PROC SQL. In this paper, we illustrate ten popular SAS functions, which facilitate the capacity of PROC SQL in data management and descriptive statistics.

INTRODUCTION

Structured Query Language (SQL) is a universal computer language for all relational database management systems. PROC SQL is the implementation of the SQL syntax in SAS. It first appeared in SAS 6.0, and since then has been widely used for SAS users. PROC SQL greatly increases SAS’s flexibility in handling data, especially for multiple-table joining and database access. There are a number of comparisons between the DATA Step and the SQL procedure in SAS [1]. A majority of SAS functions can be directly used in the SQL procedure. And the PROC procedure also enjoys a few unique functions. In this paper, we select the 10 SAS functions, and show their usage in the SQL procedure. For demonstration purpose, we simulate a Social Security Number (SSN) dataset with two entriesfrom different sources. Each entry of the 1000 records should be identical but some values are missing.

*****(0) Simulate two datasets for demonstration********************;

*****(0.1) Simulate a dataset for two SSN entries*******************;

datassn_data;

doi = 1 to1000;

ssn1 = ceil((ranuni(1234)*1E9));

ssn2 = ssn1;

ifssn1 le ceil((ranuni(1000)*1E9)) then call missing(ssn1);

ifssn2 le ceil((rannor(2000)*1E9)) then call missing(ssn2);

dropi;

output;

end;

formatssn1 ssn2 ssn11.;

run;

We also simulate a patient-visiting dataset with three patient IDs. Every patient receives three different treatments at each visit. The effects of the treatments (1 means effective; 0 means not effective) and the cost for each visit are recorded. Other than the two simulated datasets, two datasets shipped with SAS, SASHELP.CLASS andSASHELP.CARS, are also used in the paper.

*****(0.2) Simulate a dataset for hospital visits ******************;

datahospital_data;

inputid visit treat1 treat2 treat3 cost;

formatcost dollar8.2;

cards;

1 1 0 0 0 520

1 2 1 0 0 320

1 3 0 1 0 650

2 1 1 0 0 560

2 2 1 0 0 360

3 1 1 0 0 500

3 2 0 0 1 350

;;;

run;

TOP 10 FUNCTIONS FOR THE SQL PROCEDURE IN SAS

1. The MONOTONIC function

The MONOTONIC function is quite similar to the internal variable _N_ in DATA Step. We can use it to select the records according to their row number. For example, we choose the SSNs from the 501^thline to the 888^th line in the SSN dataset.

****(1) MONOTONIC: specify row numbers******************************;

procsql;

select*

fromssn_data

wheremonotonic() between 501 and800

;quit;

2. The COUNT, N and NMISS functions

These counting functions are especially useful in data cleaning. By using them, the detailed missing status is shown in only one output table. For the SSN dataset, we can display the total numbers of the missing and non-missing values for each SSN entry.

****(2) COUNT/N/NMISS: find total and missing values****************;

procsql;

selectcount(*) as n 'Total number of the observations',

count(ssn1) asm_ssn1 'Number of the missing values for ssn1',

nmiss(ssn1) asnm_ssn1 'Number of the missing values for ssn1',

n(ssn2) asm_ssn1 'Number of the nonmissing values for ssn2',

nmiss(ssn2) asnm_ssn2 'Number of the non-missing values for ssn2'

fromssn_data

;quit;

3. The COALESCE function

The COALESCE function does the magic to combine multiple rows into a single one with any non-missing value. In this example, there are two rows of SSNs, and supposedly they should be identical each other. However, some of them are missing due to input errors or other reason. The COALESCE function in the SQL statement below checks the value of the two rows and returns the first non-missing value, which maximizes the SSN information.

****(3) COALESCE: combine values among columns**********************;

procsql;

selectmonotonic() as obs, coalesce(ssn1, ssn2) asssn format = ssn11.

fromssn_data

;quit;

4. The MISSING function

The MISSING function returns a Boolean value for a variable (0 when non-missing; 1 when missing). In the example below, the missing status of the values in the SSN dataset is displayed row by row.

****(4) MISSING: return Boolean for missing value*******************;

procsql ;

selectmonotonic() as obs,

(casesum(missing(ssn1), missing(ssn2))

when0 then 'No missing'

when1 then 'One missing value'

else'Both missing values'

end) as status 'Missing status'

fromssn_data

;quit;

5. The SPEDIS and SOUNDEX functions

The two functions can fulfill fuzzy matching. For example, if we want to examine the first entry of the SSN dataset to see if there is any possible duplicate, we can use the SPEDIS function in the SQL statement to look up any pair of the records. Here we set the argument to be 25 in order to detect any singlet [2].

****(5)SPEDIS/SOUNDEX: fuzz matching*********************************;

****(5.1)SPEDIS: find spelling mistakes******************************;

procsql;

selecta.ssn1 as x, monotonic(a.ssn1) asx_obs,

b.ssn1 asy, monotonic(b.ssn1) as y_obs

fromssn_data as a, ssn_data asb

where(x gt y) and (spedis(put( x, z11.), put( y, z11.)) le 25)

;quit;

For human names, we can check similarities by the SOUNDEX function to avoid duplicates [3]. The SASHELP.CLASS has 19 names. Phonically, John and Jane look similar according to the SOUNDEX function.

****(5.2)SOUNDEX: find phonic similarity*****************************;

procsql;

selecta.name as name1, b.name asname2

fromsashelp.class as a, sashelp.class asb

wheresoundex(name1) = soundex(name2) and(name1 gt name2)

;quit;

6. The RANUNI function

This function does simple random sampling like PROC SURVEYSELECT. We can specify the OUTOBS option at the beginning to choose the sample size.

****(6)RANUNI: simple random sampling********************************;

procsql outobs=30;

select*

fromssn_data

orderby ranuni(1234)

;quit;

7. The MAX function

The MAX function returns the maximum value and sometimes simplifies column-wise aggregation. For the patient-visiting dataset, if we need to know if each treatment is effective for the patients, it may take some time to code the RETAIN statement and temporary variables at DATA Step, while the MAX function at PROC SQL is quite straightforward.

****(7)MAX: find the maximum value for each column******************;

procsql;

select id, max(treat1) aseffect1 'Effect after Treatment 1',

max(treat2) aseffect2 'Effect after Treatment 2',

max(treat3) aseffect3 'Effect after Treatment 3'

fromhospital_data

groupby id

;quit;

8. The IFC and IFN functions

The two functions play a role like the CASE-WHEN-END statements in typical SQL syntax, if the condition is about a binary selection. The IFC function deals with character variables, while the IFN function is for numbers. For the patient-visiting dataset, we can use the two functions together to find the total cost, the discounted cost (a 15% discount is applied if the total cost is greater than $1,000), and whether the first treatment is effective for each patient.

****(8)IFC/IFN: binary selection for either character and number****;

procsql;

selectid, ifc(max(treat1) = 1, 'Yes', 'No') asoverall_effect

length = 3'Any effect after treatment 1',

sum(cost) assum_cost format = dollar8.2 'Total cost',

ifn(calculated sum_cost ge 1000,

calculated sum_cost*0.85,

calculated sum_cost*1) as discounted_cost

format=dollar8.2'Total cost after discount if any'

fromhospital_data

groupby id

;quit;

9. The UNIQUE function

This function is very convenient to show the number of the levels for every categorical variable.

****(9)UNIQUE: find the levels of categorical variables************;

procsql;

selectcount(unique(make)) as u_make 'Number of the car makers',

count(unique(origin)) asu_origin 'Number of the car origins',

count(unique(type)) asu_type 'Number of the car types'

fromsashelp.cars

;quit;

10. The PUT function

We can apply the PUT function with a user-defined format by PROC FORMAT in the WHERE statement to create filters. For the SASHELP.CARS dataset, this strategy is used to choose only the high or medium priced cars.

****(10)PUT: create an filter by user-defined format***************;

procformat;

valuerange

40000-high='High'

26000-< 40000='Medium'

other ='Low';

run;

procsql;

selectmodel, make, msrp,

msrp asrange 'Price Range' format = range.

fromsashelp.cars

whereput(msrp, range.) in('High', 'Medium')

;quit;

CONCLUSION

The combination of SAS’s powerful functions and the SOL procedure will benefit SAS users in data management and descriptive statistics.

REFERENCES

1. Christianna S. Williams. ‘PROC SQL for DATA Step Die-hards’. SAS Global Forum Proceeding 2008.

http://www2.sas.com/proceedings/forum2008/185-2008.pdf

2. Yefim Gershteyn.'Use of SPEDIS Function in Finding Specific Values'. SAS Users Group International 25.

http://www2.sas.com/proceedings/sugi25/25/cc/25p086.pdf

3. Amanda Roesch.'Matching Data Using Sounds-Like Operators and SAS® Compare Functions'. Northeast SAS Users Group 2011.

http://www.nesug.org/Proceedings/nesug11/ap/ap07.pdf