Peek (Into a file)

Here is a little bash Unix utility to see the top and bottom of a file (as I like to do) in Unix. It is (almost) my first Unix utility. Enjoy!!!

#!/usr/bin/env bash

if (( $# != 1 ))
then
echo ‘This utility shows the head and tail of a file ‘
echo ‘usage: ‘
echo ‘peek <-n number of lines> list of files ‘
echo ”
echo ‘if -n option is not provided the default of 6 is used’
echo ”
echo ‘Developed by Mario Segal – free to use and modify as you wish’

else

num=6

while getopts n:: flag;
do
case $flag in
n) num=$OPTARG
esac
done

shift $((OPTIND-1))

for file in “$@”
do
echo file: “$file”
echo ‘head’
echo ‘—————————-‘
head -n $num $file
echo ‘—————————-‘

echo ‘tail’;
echo ‘—————————-‘
tail -n $num $file
echo ‘—————————-‘
echo ”

done

fi

: <<END_IF_DOCS

=head1

This utility will show both the head and the tail for a file, or a list fo files

=head1 OPTIONS

-n <number of lines to show on head and tail, default if the head/tail default of 6>

=head1 LICENSE AND COPYRIGHT

Developed by Mario Segal – use it at will

=cut

END_IF_DOCS

 

Create SAS Code from R ‘tree’ Objects

I recently was faced with the desire to port some tree models developed in R to SAS so I could score a large database. To me this makes sense as SAS is better with large files (or at least so as not to offend anyone, I am better with large files in SAS).

I started with a web search and identified the following post:

Model decision tree in R, score in Base SAS

which basically talked about the same situation, and created code to write the SAS code. Unfortunately for me they used package party::ctree for the trees, while I use package tree.

Inspired by the great idea, I developed a code that generates the if statements to insert into  the Data Step for ‘tree’ objects developed by package tree. It is likely not the most efficient way to move through the tree, but it works (I have not dealt with binary trees in code since programming 2 in college – in Pascal of all languages – over 20 years ago). I welcome any suggestions.

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    <http://www.gnu.org/licenses/>.

tree2sas <- function (treeobj,predictor_name,filename) {

#Developed by Mario Segal. 

#initialize;
frame <- treeobj$frame
frame$printed <- 0;
frame$close <- 0;

code <- data.frame(line=””,stringsAsFactors =F)
j <- 1
count=0;
end <- row.names(frame[dim(frame)[1],])
mymax <- 0;
while (row.names(frame[i,]) != end) {
i <- mymax+1
#cycle the process until you get to the last leaf on the left from current node;
while (frame[i,]$var != ‘<leaf>’) {
code[j,] = paste(‘if’,frame[i,]$var,frame[i,]$splits[1,’cutleft’],’then do;’)
i<- i+1;
j<-j+1;
}

mymax <- max(i,mymax)
# we are at the farthest left possible, so write the assignment;
code[j,] = paste(predictor_name,’ = ‘,”‘”,frame[i,]$yval,”‘”,”;”,sep=””)
#since I printed the prediction mark as printed;
frame[i,7] <- 1;
j<- j+1

#move up until you find the first not printed. To account for the way back from right;
#check also if you are back at node #1 if not stop
while (frame[i,]$printed==1) {
if (i==1) {break}
i<-i-1
if ( frame[i,]$close==0) {
#if not not closed ,close it, I am closing leafs as I prinmt them so otherwise I duplicate;
code[j,] = ‘end;’
j <- j+1;
frame[i,8] <- 1;
}
}

if (i == 1 & mymax==dim(frame)[1]) {break}
code[j,] = paste(‘if’,frame[i,]$var,frame[i,]$splits[1,’cutright’],’then do;’)
j<- j+1;
frame[i,7] <- 1;
#the next right one here will be max+1;

}

for (k in 1:dim(code)[1]) {
cat(code[k,],sep=”\n”,append=T,file=filename)
}

}

An Alternative Approach to Create Quintiles in SAS

The code below shows how one can use some advanced data step techniques to create quintiles (or any -iles) in SAS. One can of course use PROC RANK but that creates a new dataset that may need to be merged to  the original. In this case one does not need to do any merging.

Note how we load an auxiliary dataset with the breaks for the deciles on the first iteration of the DATA step.

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    <http://www.gnu.org/licenses/>.
    Developed by Mario Segal

proc means data=mydataset  noprint;

var desired_var;
output out=breaks p20(desired_var)=p20 p40(desired_var)=p40 p60(desired_var)=p60 p80(desired_var)=p80;

run;

data mydataset;

if _N_ eq 1 then do;
set breaks (drop=_freq_ _type_);
array breaks{4} _temporary_ ;
breaks{1} = p20; breaks{2} = p40; breaks{3}=p60; breaks{4}=p80;
end;

set mydataset;

select ;
when (desired_var le breaks{1}) quintile=5;
when (desired_var gt breaks{1} and desired_var le breaks{2}) quintile=4;
when (desired_var gt breaks{2} and desired_var le breaks{3}) quintile=3;
when (desired_var gt breaks{3} and desired_var le breaks{4}) quintile=2;
when (desired_var gt breaks{4}) quintile=1;
end;

drop p: ;

run;

A Graphical Approach to Showing the Result of Classification Models

This is one of my favorite charts, it easily allows one to see how many predictions are right, and it allows one to see where the wrong ones are as well. It is the equivalent of a confusion matrix, but sometimes a picture is worth a thousand words. Some sample code is included below.

Rplot04

 

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    <http://www.gnu.org/licenses/>.
    Developed by Mario Segal


#requires ggplot2. Data has to be in a dataset with three columns: actual, predicted and match.
#match is Yes if actual and predicted match, No otherwise.

 

require(ggplot2)

fitchart <- ggplot(chartdata,aes(x=actual,y=predicted,color=actual,shape=match))+geom_jitter(alpha=.6)+theme_bw()
fitchart <- fitchart + ylab(“Predicted Activity”)+xlab(“Actual Activity”)+ggtitle(“Summary of Classification Accuracy”)
fitchart <- fitchart+theme(legend.position=”bottom”)+theme(plot.title = element_text(size=14,color=”blue”, face=”bold”))
fitchart <- fitchart+theme(axis.title.x = element_text(face=”bold”,size=14),axis.title.y = element_text(face=”bold”,size=14))
fitchart <- fitchart+theme(axis.text.x=element_text(angle=0,color=”black”,size=12),axis.text.y=element_text(color=”black”,size=12))
fitchart <- fitchart + theme(legend.text = element_text(colour=”black”, size = 10),legend.title = element_text( face=”bold”))
fitchart <- fitchart + scale_color_discrete(name=”Actual\nActivity”)+scale_shape_manual(values=c(4,20),name=”Correct\nPrediction”)
fitchart

In Memory Approch to Find Closest Location to a point – SAS

This approcah makes maximum use of in-emory processing to search iteratively through a set and find the closest location in a target list. I developed to find the closest ATM to a given one, but it has other applications.

Scroll down for some explanations.

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    <http://www.gnu.org/licenses/>.
    Developed by Mario Segal

#Create a view with the atms we want to find closest locations to;

data atm_view1 / view=atm_view1;

set sample_atms( where=( x1 ne . or y1 ne .));

keep id1 x1 y1 group;

run;

#Create a view with the atms we want to search into, In my example they are a subset of the same set;

 data atm_view2 / view=atm_view2;

set sample_atms( where=( (x1 ne . or y1 ne .) and group ne ‘Offsite’) );

rename id1=id2 x1=x2 y1=y2;

keep id1 x1 y1 ;

run;

data Closest_ATM1;

*define the variables for hash objects, line only executes at  compilation;

if 0 then set atm_view1 atm_view2 ;

*define variables to store best distance and associated ATM ID;

length best_id $ 8 best_distance 8;

*at beginning of execution, load atm_views with dasta into hash objects and also define their iterators;

if _N_ eq 1 then do;

dcl hash list(dataset: “atm_view1”);

list.definekey(‘y1′,’x1’);

list.definedata(all: ‘yes’);

list.definedone();

dcl hiter hi_list(‘list’);

dcl hash lookup(dataset: “atm_view2”);

lookup.definekey(‘y2′,’x2’);

lookup.definedata(all: ‘yes’);

lookup.definedone();

dcl hiter hi_lookup(‘lookup’);

end;

*load the atm_data from memory for which we want to find the nearest atm and  calculate the distance;

rc=hi_list.first();

do while (rc=0);

best_distance = 999999; *set best distance to a large value;

best_id=”; *initialize best match ID;

rc=hi_lookup.first();

rc1=0;

do while (rc1=0);

current = geodist(y1,x1,y2,x2,’DM’);

if current lt best_distance and id1 ne id2  and (x1 ne x2 and y1 ne y2) then do;

*do not set best_distance to distance to self or to a co-located ATM;

  best_distance = current;

 best_id = id2;

end;

rc1=hi_lookup.next();

end;

output;

rc=hi_list.next();

end;

keep id1 group best_id best_distance;

run;

Explanation:

1) I developed 2 views to hold the data, the only thing stored is the code to execute them and I can make changes to the data without impacting the source or making a copy.

2) Then  I load them into a hash object and create an iterator each.

3) The we use the iterators to search through all options for each one and select the best

4) The we output the results.

I hope this works for you.

Generating Nice Looking Tree Diagrams in R

This function generates nice looking tree diagrams (see sample) below from tree objects (generated by package tree).

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    <http://www.gnu.org/licenses/>.
    Developed by Mario Segal

plotTree <- function(x) {

require(ggdendro)
require(odfWeave)
require(tree)
require(ggplot2)
require(plyr)

#get tree data in a form suitable for ggdendro;
tree_data <- dendro_data(x)
#get labels and splits and modify labels to include splits, also add Ns for terminal nodes
#and apply formating for splits
labels <- as.matrix(as.character(tree_data$labels$label),nrow=1)
splits <- as.matrix(x$frame$splits[,1][as.numeric(row.names(tree_data$labels))],nrow=1)
splits <- prettyNum(splits,big.mark=”,”,scientific=F,digits=2,format=”f”)
ns <- as.matrix(x$frame$n[as.numeric(row.names(tree_data$leaf_labels))],nrow=1)
new_labels <- matrixPaste(as.character(tree_data$labels$label),splits,sep=”\n”)
tree_data$labels$label <- new_labels
new_ends <- paste(as.character(tree_data$leaf_labels$label),” (N=”,as.character(ns),”)”,sep=””)
tree_data$leaf_label$labels <- new_ends

range <- max(tree_data$segments$y)-min(tree_data$segments$y)
my_limits = c(min(tree_data$segments$y)-0.25*range,max(tree_data$segments$y)+0.25*range)
my_limits[1] = round_any(my_limits[1],10,f=floor)
my_limits[2] = round_any(my_limits[2],10,f=ceiling)
#create the ggplot chart (parts of code copied from ggdendro documentation);
plot_x <- plot_x <- ggplot(segment(tree_data)) +geom_segment(aes(x=x, y=y, xend=xend, yend=yend),colour=”blue”, alpha=0.5) +theme_dendro()
plot_x <- plot_x +geom_text(data=label(tree_data),aes(x=x, y=y, label=label), vjust=-0.5, size=3)
plot_x <- plot_x +geom_text(data=tree_data$leaf_label,aes(x=x, y=y, label=labels,color=label), vjust=0.5, hjust=1, size=3,angle=90)
plot_x <- plot_x + scale_y_continuous(limits=my_limits)+theme(legend.position=”none”)
#colors for terminal node labels are optional but very nice, they can be custom defined as below;
#or if the next line is ignored they will be ggplot default colors;
plot_x <- plot_x + scale_color_manual(values=c(“#FFB300″,”#007856″,”#C3E76F”,”#86499D”,”#003359″,”#AFAAA3″))
print(plot_x)
return(plot_x)
}

Sample Output using a tree developed with the IRIS dataset

Ex1

SAS – Custom Statistics from Merge Operations

The code below shows how to get custom statistics from merge operations above and beyond what the standard DATA step output provides.

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    <http://www.gnu.org/licenses/>.
    Developed by Mario Segal

data c;
merge a (in=left) b (in=right) end=eof;
retain missleft missright;
by merge_key;
if left then output; *We want all records of left, modify as needed;
*collect the desired statistics;
if left and not right then missright+1;
if right and not left then missleft+1;
if eof then do; *output the statistics;
put ‘There were ‘ missleft ‘Record on left dataset not found on the right dataset’;
put ‘WARNING: There were ‘ missright ‘ Records on right dataset not found on the left dataset’;
end;
drop miss: ; *don’t forget to drop the statistci variables or they will be part of the output dataset;
run;