Thursday, May 9, 2013

QC by proc format

Image and video hosting by TinyPic
Quality Control is a very important part of SAS programmer's job. Sometimes we want to find the unexpected values hidden in the data set. Proc Format would be a good way to go.
Suppose we have a lab test data set "eg" here:
Image and video hosting by TinyPic
We have the range for the variable BASOS, HCT and HGB. If the value goes beyond the range we know, we need to flag it out.
Here is the simple way to do by proc format:
proc format; 
    value basos
        low - 1, 
        6 - high = 99
        other = 1; 
    value hct
        low - 20, 
        60 - high = 99
        other = 1; 
    value hgb
        low - 5, 
        20 - high = 99
        other = 1; 
run; 
We can either use a new variable diag to flag the unexpected value, or use proc freq.
1. The "egDiag":
data egDiag;
    set eg; 
    diag = (max(put(basos, basos.), put(hct, hct.), put(hgb, hgb.)) >= 99); 
run; 
Image and video hosting by TinyPic
2. Proc Freq:
proc freq data = eg; 
    tables basos hct hgb; 
    format basos basos. 
           hct hct.
           hgb hgb.
    ; 
run; 
Image and video hosting by TinyPic

Wednesday, May 1, 2013

Yearcutoff

Image and video hosting by TinyPic
If you are inputing two digits years, you may encounter the option "yearcutoff".
How to control the prefix? Like if you have two digit year number "18", how can you get "1918" rather than "2018"?
First, let's check the option "yearcutoff".
Here is a good explaination:
Before you use the YEARCUTOFF= system option, examine the dates in your data:
  • If the dates in your data fall within a 100-year span, you can use the YEARCUTOFF= system option.
  • If the dates in your data do not fall within a 100-year span, you must either convert the two-digit years to to four-digit years or use a DATA step with conditional logic to assign the proper century prefix.
Once you've determined that the YEARCUTOFF= system option is appropriate for your range of data, you can determine the setting to use. The best setting for YEARCUTOFF= is a year just slightly lower than the lowest year in your data. For example, if you have data in a range from 1921 to 1999, set YEARCUTOFF= to 1920, if that is not already your system default. The result of setting YEARCUTOFF= to 1920 is that
  • SAS interprets all two-digit dates in the range of 20 through 99 as 1920 through 1999.
  • SAS interprets all two-digit dates in the range 00 through 19 as 2000 through 2019.
The following figure shows the span of years when the YEARCUTOFF= option is set to a value of 1920. The 100-year span in this case is from 1920 to 2019.
Span of Years When the YEARCUTOFF= Option Is Set to 1920
[IMAGE]
With YEARCUTOFF= set to 1920, a two-digit year of 10 would be interpreted as 2010, and a two-digit year of 22 would be interpreted as 1922.
Learn by example
Now let's code. 
/* macro variable to store the path */
%let path=C:\Documents and Settings\Administrator\My Documents\Dropbox\Transfer\tech post\yearcutoff; 

options yearcutoff = 1820; 

proc options option = yearcutoff; 
run; 

data year_1820; 
 input month day year; 
 bday = mdy(month, day, year); 
 format bday date9.; 
cards; 
7 11 18
7 11 48
1 1 60 
; 
run; 

options yearcutoff = 1920; 

proc options option = yearcutoff; 
run; 

data year_1920; 
 input month day year; 
 bday = mdy(month, day, year); 
 format bday date9.; 
cards; 
7 11 18
7 11 48
1 1 60 
; 
run; 
Result
Image and video hosting by TinyPic
Image and video hosting by TinyPic

Monday, April 29, 2013

SAS functions

Image and video hosting by TinyPic
What's function? 
In wiki:
In mathematics, a function is a relation between a set of inputs and a set of permissible outputs with the property that each input is related to exactly one output.
What about SAS functions? 
Image and video hosting by TinyPic<
As you can see, SAS 9.2 now has nearly 500 functions available. And SAS 9.2 brings one of the greatest improvement---user-defined functions.
It's no need to say how important to learn the SAS functions if you want to improve your sas programming.
SAS/BASE provides two types of powerful functions (from SAS.com) :
1. Function
"A SAS function performs a computation or system manipulation on arguments, and returns a value that can be used in an assignment statement or elsewhere in expressions.
In Base SAS software, you can use SAS functions in DATA step programming statements, in a WHERE expression, in macro language statements, in PROC REPORT, and in Structured Query Language (SQL).
Some statistical procedures also use SAS functions. In addition, some other SAS software products offer functions that you can use in the DATA step. Refer to the documentation that pertains to the specific SAS software product for additional information about these functions."
Examples:
  • x=max(cash,credit);
  • x=sqrt(1500);
  • NewCity=left(upcase(City));
  • x=min(YearTemperature-July,YearTemperature-Dec);
  • s=repeat('----+',16);
  • x=min((enroll-drop),(enroll-fail));
  • dollars=int(cash);
  • if sum(cash,credit)>1000 then
       put 'Goal reached';
2. Call routine
"A CALL routine alters variable values or performs other system functions. CALL routines are similar to functions, but differ from functions in that you cannot use them in assignment statements or expressions.
All SAS CALL routines are invoked with CALL statements. That is, the name of the routine must appear after the keyword CALL in the CALL statement."
Examples:
  • call prxsubstr(prx,string,position);
  • call prxchange('/old/new',1+k,trim(string),result,length);
  • call set(dsid);
  • call ranbin(Seed_1,n,p,X1);
  • call label(abc{j},lab);
  • call cats(result,'abc',123);
Learn by Example
Suppose you need to import and process a messy txt file "Demog":
Image and video hosting by TinyPic
And do the following:
1. find the last name from the name
2. put (Year) after the Age
3. extract the number of weight from the WT (eg, 155lbs)
4. keep the Quote of each person with the quotation mark
5. order the scores and calculate the mean of the top three scores
6. replace the "XXX" in the "Demog" with "Unknown"
Target
Image and video hosting by TinyPic
Code
/* macro variable to store the path */
%let path=C:\Documents and Settings\Administrator\My Documents\Dropbox\Transfer\tech post\dlmstr; 

data raw; 
 /* Define the sequence of the variables and length */
 length  Name $20. Last_name $10. Gender $7. Age 3. Age_Year $15. 
   Height 3. WT $6. WT_Lbs Position $15. Quote $20.
   score1-score5 3.; 

 /* Options: truncover & dlmstr & dsd */
 infile "&path\demog.txt" truncover dlmstr="@#" dsd;

 input @;  

 /* Perl regular expressions */
 _infile_ = prxchange("s/XXX/Unknown/",-1,_infile_); 

 input          Name  $ 
   Gender  $ 
   Age 
   Height 
   WT   $ 
   Position $ 
 /* use ~ to keep quotation mark; */
   Quote ~ $    
   score1-score5; 

 /* put */
 Age_Year = put(Age, 3.)||" (Years)"; 

 /* compress */
 WT_Lbs = input( compress(WT, , "kd"), 3.); 

 /* propcase */
 Last_name = propcase( scan(Name, -1, " ") ); 

 /* call & sortn */
 call sortn(of score1-score5); 

 /* round & mean */
 top3 = round(mean(of score3-score5), 0.1); 
run; 
Discussion
Lots of things need to take a close look:
  • infile method to import txt flat file & options: truncover, dlmstr, dsd
  • Perl regular expressions : prxchange
  • automatic variable : _infile_
  • compress
  • propcase
  • scan 
  • sortn
  • call routine

Thursday, April 25, 2013

To excel

Image and video hosting by TinyPic
Here are methods for exporting data to excel:
%let gd = your_file_storage_path; 

/* 1. proc export */
proc export data = eg
 dbms = excel
 outfile = "&gd.\eg_export.xls"
 replace; 
 sheet = "page1";
run; 

/* 2. libname */
libname here "&gd."; 
libname here excel 
"&gd.\eg_libname.xls" ver=2002;
data here.eg(dblabel=yes);
 set eg;
run; 
libname here clear;

/* 3. ODS tagset */
ods tagsets.excelxp
 file  = "&gd.\eg_tagset.xls"
 style = minimal
 options (Orientation = "landscape"
 FitToPage = "yes"
 Pages_FitWidth = "1"
 Pages_FitHeight = "100"); 
proc print data = eg NOOBS; 
run; 
ods tagsets.excelxp close; 

/* 4. ODS CSV */
ods csv file = "&gd.\eg_csv.csv";
proc print data = eg;
run; 
ods csv close; 

/* 5. ODS html */
ods html body = "&gd.\eg_html.xls";
proc print data = eg; 
run; 
ods html close; 

/* 6. ODS MSOFFICE2k */
ods tagsets.msoffice2k file = "&gd.\eg_msoffice2k.xls";
proc print data = eg; 
run; 
ods tagsets.msoffice2k close; 

/* 7. DDE */

/* 8. Integrated object model */

Image and video hosting by TinyPic

Tuesday, April 16, 2013

Transforming the data structure

Image and video hosting by TinyPic
Fate rarely calls upon us at a moment of our choosing. 
---Prime, Transformers Revenge of the Fallen
Why do we need "Transforming" the dataset structure?
Lots of reasons. (it's our fate!)
Like after proc freq or means, you need to transform the statistics for various tables.

Basically: 
  • You may want to create multiple observations from a single observation to count frequencies, or to allow for BY variable processing.
  • You may also want to restructure SAS data sets for certain statistical analyses. 
The routine would be proc transpose. But array would be also good to know.

Example: 
Here we have a data set "salary" from another "jobs". The data set is more like "wide", horizontal structure.
Image and video hosting by TinyPic
First, we will "transform" it to "deep", vertical structure like this:
Image and video hosting by TinyPic
Secondly, we will "transform" it back to horizontal structure.
Image and video hosting by TinyPic

Code: 
We will do the following things:
  • 0. input the data
  • 1.1 transform the data to vertical by array
  • 1.2 transform back to horizontal by array
  • 2.1 transform the data to vertical by transpose
  • 2.2 transform the back to horizontal by transpose
dm 'clear output'; 
dm 'clear log'; 

/*==============================================================================================*/
/* Programe    : transformer.sas      */
/* Objective  :  Transform Data Set Structure by array and transpose */
/* Author   : Ray       */
/* SAS Version  :  9.2              */
/*==============================================================================================*/

/*----------------------------------------*/
/* 0 Input data             */
/*----------------------------------------*/
data jobs; 
    infile datalines missover;
    input #1 Info_Company   $20.
  #2  Score_Personal   10.2
  #3  Info_Web          $50.
  #4  Info_HQ    $50.
  #5 Info_Country   $4.
  #6 Info_Employees   10.
  #7 Job_Statistician  $50.
  #8 Salary_Stat   10.2
  #9 Job_Programmer   $50.
  #10 Salary_Pro          10.2
  #11     Job_Manager   $50.
  #12     Salary_Man   10.2; 
datalines; 
ICON
60.00
www.iconplc.com
Dublin, Ireland
IRL
5000
Statistician I

SAS Programmer Analyst I
59553.00
Group Manager
90000.00
Covance
80.00
www.covance.com
Princeton, NJ
USA
11200
Bio Statistician I
59000.00
SAS Programmer I
51872.00
Manager
87029.00
Medpace
81.00
www.medpace.com
Cincinnati, OH
USA
990
Bio Statisitican I
70418.00
SAS Programmer I
40000.00
Clinical Trial Manager
84027.00
Shire Pharmaceuticals
90.00
www.shire.com
Basingstoke, United Kingdom
GBR
5000
Statistician I

SAS Programmer I
61000.00
Manager
96000.00
TechData Service
90.01
www.techdataservice.com
King of Prussia, PA
USA
5000
Statistician

SAS Programmer
86667.00
Manager

; 
run; 

* subset data set to salary; 
data salary; 
    set jobs (keep = Info_Company  Job_Statistician  Job_Programmer  Job_Manager 
              Salary_Stat      Salary_Pro  Salary_Man); 
run; 

* sort the salary; 
proc sort data = salary; 
    by Info_Company; 
run; 

* print out; 
proc print data = salary; 
run; 

/*--------------------------------------------------------------*/
/* 1.1 Transform the structure from horizontal to vertical */
/* In other words:   from wide to deep  */
/*--------------------------------------------------------------*/
* create salary_vertical; 
data salary_vertical; 
    set salary; 
    array job{3} Job_Statistician    Job_Programmer Job_Manager ; 
    array sal{3} Salary_Stat      Salary_Pro   Salary_Man; 
    do i = 1 to 3; 
        Position = job{i}; 
 Salary = sal{i}; 
 output; 
    end; 
    keep Info_Company Position Salary; 
run; 

* change the Salary from numeric to character; 
data salary_ver;
    set salary_vertical; 
    sal = put(Salary, 10.2); 
    drop Salary; 
    rename sal=Salary; 
run; 

proc print data = salary_vertical; 
run; 

/*----------------------------------------------------------*/
/* 1.2 Transform back to horizontal              */
/*----------------------------------------------------------*/
* transform the structure from vertical back to horizontal; 
data salary_horizontal; 
    set salary_ver; 
    by Info_Company; 
    if first.Info_Company then do; 
        i = 0; 
 array all(3, 3) $50. a1-a9; 
 array var{3} $50. Info_Company Position Salary; 
 retain a1-a9; 
    end; 
 i + 1; 
 do j = 1 to 3; 
     all(i, j) = var(j); 
 end; 
    if last.Info_Company then output; 
    rename  a2 = Job_Statistician
  a3 = Salary_Stat
  a5 = Job_Programmer
  a6 = Salary_Pro
  a8 = Job_Manager
  a9 = Salary_Man
    ; 
    keep Info_Company a2 a3 a5 a6 a8 a9; 
run; 

* sort the dataset; 
proc sort data = salary_horizontal; 
    by Info_Company; 
run; 

proc print data = salary_horizontal; 
run; 

/*------------------------------------------------------------------------------*/
/* 2.1 Transform the structure from horizontal to vertical by proc transpose */
/*------------------------------------------------------------------------------*/
proc transpose data = salary 
 out = first
 prefix = Position; 
 by Info_Company; 
 var Job_Statistician Job_Programmer Job_Manager; 
run; 

proc transpose data = salary 
 out = second
 prefix = Salary; 
 by Info_Company; 
 var Salary_Stat Salary_Pro Salary_Man; 
run; 

* merge the first and second; 
* sort before you merge. but it's already been sorted; 
data salary_ver_transpose; 
 merge first second; 
 by Info_Company; 
 drop _NAME_; 
 rename  Position1 = Position
 Salary1 = Salary
 ; 
run; 

/*----------------------------------------------------------*/
/* 2.2 Transform back to horizontal by proc transpose     */
/*----------------------------------------------------------*/
proc transpose data = salary_ver_transpose
 out = third
 prefix = Position; 
 by Info_Company; 
 var Position; 
run; 

proc transpose data = salary_ver_transpose
 out = fourth
 prefix = Salary;
 by Info_Company; 
 var Salary; 
run; 

* sort before you merge. but it's already been sorted; 
data salary_hor_transpose; 
 keep Info_Company Position1 Salary1 Position2 Salary2 Position3 Salary3; 
 format Info_Company Position1 Salary1 Position2 Salary2 Position3 Salary3; 
 merge third fourth; 
 by Info_Company;
 rename  Position1 = Job_Statistician
  Salary1 = Salary_Stat
  Position2 = Job_Programmer
  Salary2 = Salary_Pro
  Position3 = Job_Manager
  Salary3 = Salary_Man
 ; 
run; 


Good job!
Now you've got the power to save the world. Go! Transformers!

Wednesday, April 10, 2013

Timestamp

Image and video hosting by TinyPic
"What's the time?"
Certainly it's very important for everyone to keep track of the time.
For SAS programmer, sometimes it's also important to label the files or certain place (like footnote or title) with "timestamp".
Here comes a simple SAS macro to perform this function:
%now( ) 
%MACRO now( fmt= DATETIME23.3 ) /DES= 'timestamp'; 
 %sysfunc( compress( %SYSFUNC( DATETIME(), &fmt))  )
%MEND;
It's a utility macro. You can put it in the autocall library whenever you need it.

Example: 
Here I put it in the title and file name for the data set "eg".
ods listing close; 
options nonumber nodate; 
ods pdf file = "C:\Bancova\eg_%now(fmt=b8601dt).pdf"; 
title "eg_%now(fmt=b8601dt)"; 
proc print data = eg; 
run; 
ods pdf close; 
ods listing; 
You will see the result:
Image and video hosting by TinyPic
For file names, the ISO 8601 format YYYYMMDDThhmmss maintains readability and also sorts in chronological order.
Image and video hosting by TinyPic

Remember: 
Although it is very short, the application of this base SAS macro requires knowledge of the following:
  • macro basics (know how to write a macro)
  • format (know how to define a date format)
  • %sysfunc (one of the most important tools to the macro laguage since the %let statement)
  • ISO 8601 standard (It's compliant with CDISC, FYI)
  • ODS pdf (This destination is part of the ODS PRINTER family) 

Simple Quiz: 
  • Why do we need compress inside the %now( ) ? 
References:
Does your SAS code know what time it is? Add a timestamp. 

Thursday, March 28, 2013

Index for table lookup operation

Image and video hosting by TinyPic
If you want to find a topic in the book quickly, you may use the index at the back. It is not too hard to understand how a SAS index can help you to directly access the observations that you need in a particular SAS data set.
Here is an example how you can use the index with double-set in SAS.
1. Code
Code in gist
/*----------------------------------------------------------
    Index and Double-set                                  
      
------------------------------------------------------------
    This example is to demonstrate the use of proc 
datasets (index) and double-set.                           
                                            
------------------------------------------------------------
    The first data set TRANSACTION is a lookup 
table.             
    The second data set MASTER is to get multiple 
observations per lookup observation.                
                                                                                                
------------------------------------------------------------
    This program is modifed from Kevin McKinney               
                                                                                                 
----------------------------------------------------------*/

* TRANSACTION as lookup table, with multiple 
records; 
data TRANSACTION;
 input ein year quarter dataitem $2.;
datalines;
170000000000 1990 1 a
170000000000 1990 1 b
170000000001 1991 1 c
170000000001 1992 4 d
170000000000 1990 1 e
170000000071 1992 4 f
170000000449 1992 3 g
170000000528 1990 1 h
170000000798 1991 2 i
170000001003 1992 1 j
;
run;

proc print; 
run; 

* MASTER has no unique Key; 
data MASTER; 
 input ein year quarter who $6.; 
datalines; 
170000000000 1990 1 one
170000000000 1991 1 two
170000000449 1992 3 three
170000000449 1992 3 four
170000001003 1992 1 five
170000001003 1992 1 six
; 
run; 

proc print; 
run; 

* Create index for MASTER; 
proc datasets lib = work; 
 modify MASTER; 
 index create ein; 
run; 
data out;
 set TRANSACTION;
 continue=0;
 continue1=0;
 do while ( continue=0 );
     set MASTER key=ein;
     if continue1=1 then continue=1;
     if _iorc_ = 0 then do;
         output;
         end;
     else do;
        _error_=0;
        ein= 00000000000;
        continue1=1;
     end;
 end;
run;

proc print; 
run; 
2. Result
Image and video hosting by TinyPic
Image and video hosting by TinyPic
Image and video hosting by TinyPic
3. Discussion
What is the purpose of this example?
  • To show you a SAS Tip in table lookup operation with index and double-set
Why do we need to care about table lookup operation?
  • Table lookup operations are often the most time consuming part of many SAS programs
What are the common way to perform table lookup operation?
  • DATA step MERGE statement
  • SQL inner join
  • SQL subquery
  • FORMAT procedure and PUT function
  • SET statements and KEY = option
  • CALL EXECUTE routine
  • SQL INTO clause
What are the main elements for table lookup operation?
  • The  primary file (MASTER) is the file for which you want to obtain auxiliary information.
  • The  lookup file (TRANSACTION) is an auxiliary file that is maintained separately from the primary file and that is referenced for one or more of the observations of the primary file.
  • The key variable (ein)is the variable or variables whose values are the common elements between the primary file and the lookup file. Typically, key values are unique in the lookup file but not necessarily unique in the primary file.
  • The lookup result (out) is the auxiliary information obtained using the key variable or variables as reference into the lookup file.
What's inside the code?
  • proc datasets (options: modify, index)
  • double-set in the data step
Why proc datasets?
  • If you want to extract a small subset (1% to 15% of the total number of observations) from a large SAS data set, it's good to use the index. Or if you predict that you would access a SAS data set very often via an index, then that data set is a good index candidate. In doing so, the amount of computer resources (CPU time, I/O’s, Elapsed time, etc.) expended should be less than if SAS read the entire data set sequentially.
  • proc datasets is one of the three methods to create index (proc sql, data step and proc datasets).  
Why double-set?
  • The dataset in the first set statement(TRANSACTION, lookup) is not a unique list of key values. And the dataset in the second set statement (MASTER) does NOT have a unique set of key values. The KEY (ein) is repeated for observation one, two, and five. So this requires resetting the pointer on each iteration of the data step, and two additional lookups in the MASTER dataset. Once to know when the end of all the records for that EIN is reached. Once more to reset the pointer. The auxiliary variables continue and continue1 will create a supplementary condition within the do while loop. These two auxiliary variables together with double-set would yield the result we want.                               
4. References
From Bancova.com