Musings

yet another note for myself

Dry Run in Shell Scripts

Most shell scripts are like black box; they just do their magic without giving the slightest hint to the user about what just happened. Nevertheless, sometimes it is useful to show the commands that a shell script will execute. In Unix/Linux lingo, this is often called as dry run. Dry run is not available out of the box for custom shell scripts but can be easily implemented as show below.

# dry is a variable that indicates whether we want to do dry run or not
# Currently dry run is enabled. To disable it, set the dry variable to any other string
# Note: define the dry variable above the execute function below so
# that its available in the execute function
dry="y"

# Function: execute
function execute(){
        # Irrespective of whether dry run is enabled or not, we display
        # the command on the screen
	echo "COMMAND: ${@}"

        # if dry run is enabled then simply return
	if [ $dry == "y" ]; then
		return 0
	fi

        # if dry run is disabled, then execute the command
	eval "$@"
}

# Now just append execute to all your commands
# This will cause them to print irrespective of whether we are doing any dry run or not
# Commands are only executed if dry run is disabled
execute "ls ~/"
execute "echo \"hello\""

Enjoy

Dividing data into training and testing dataset in R

During machine learning one often needs to divide the two different data sets, namely training and testing datasets. While you can’t directly use the “sample” command in R, there is a simple workaround for this. Essentially, use the “sample” command to randomly select certain index number and then use the selected index numbers to divide the dataset into training and testing dataset. Below is the sample code for doing this. In the code below I use 20% of the data for testing and rest of the 80% for training.

#lets say we have data variable that is of type data.frame
> class(data)
[1] "data.frame"

>indexes <- sample(1:nrow(data), size=0.2*nrow(data))
>test <- data[indexes]
>train <- data[-indexes]

Extract Top N Records in Each Group in Hadoop/hive

Assume you have a table with three columns: user, category and value. For each user, you want to select top N categories. To achieve this in hive, you can use the following query:

SELECT user, category, value
FROM (
	SELECT user, category, rank(user) as rank, value
	FROM $compTable
	WHERE user is NOT NULL AND AND ctr > 0
	DISTRIBUTE BY user
	SORT BY user, value desc
) a
WHERE rank < 5
ORDER BY user, rank

In the above query, I am using a custom rank function. The overall approach is as follows:

  1. divide the data by user (distribute by user)
  2. Sort each group by user and value (sort by user, value desc)
  3. Within each group, assign rank order to each record. This is achieved by custom rank function. The rank function keeps track of last user key and simply increments the counter. As soon as it sees a new user, it reset counter to zero. Since the data is already sorted by user and is in descending order of value, we know for sure that all records related to a single user will be sent to the same node and they will be grouped together and also sorted by value.
  4. Pick top 5 categories (where rank < 5). Note since our index starts with 0, we only need to categories from 0 to 4.

Below is the custom rank function:

package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;

public final class Rank extends UDF{
    private int  counter;
    private String last_key;
    public int evaluate(final String key){
	  if ( !key.equalsIgnoreCase(this.last_key) ) {
	     this.counter = 0;
	     this.last_key = key;
	  }
	  return this.counter++;
    }
}

Below is the complete sequence of commands required to get the above query working

#Compile Rank class
> javac -classpath $HIVE_HOME/lib/hive-serde-1.7.jar:$HIVE_HOME/lib/hive-exec.jar:$HADOOP_HOME/hadoop-core.jar -d /tmp/jar/ Rank.java

#Create Rank jar
> jar -cf Rank.jar /tmp/jar/Rank.class

#start hive
> hive

#Inside hive: add jar file
hive> add jar Rank.jar

#assign name to the custom function
hive> create temporary function rank as 'com.example.hive.udf.Rank';

#now we are ready to issue our query
hive> SELECT user, category, value
FROM (
	SELECT user, category, rank(user) as rank, value
	FROM $compTable
	WHERE user is NOT NULL AND AND ctr > 0
	DISTRIBUTE BY user
	SORT BY user, value desc
) a
WHERE rank < 5
ORDER BY user, rank;
Posted in Hadoop | Tagged , | Leave a comment

Visualizing Confusion Matrix using heatmap in R

Confusion matrix is one of the many ways to analyze accuracy of a classification model. As show in the table below, a confusion matrix is basically a two dimensional table with two axes. On one axis it has actual or target categories and on the other it contains predicted categories. Diagonal cells
indicate true positives i.e. number of test cases that were correctly predicted by the model. For instance, in the table below the model corrected predicted 2 out of 11 (or 18%) actual A’s as A. Non diagonal elements indicate false positives or true negatives i.e. number of test cases that were incorrectly predicated by the model to belong to a different category.

A B C
A 2 4 5
B 4 5 6
C 6 2 1

While the above confusion matrix is insightful, it only works when you few limited categories. However, while working on a problem I had more than 20 categories and visualizing a series of numbers across the table and making sense of them was an arduous task. So I started loooking for a way to visualize the confusion matrix. After exploring possible visualization techniques, I came with the idea of using heatmap. Luckily it was easy to produce heatmap in R using excellent ggplot library. (If you haven’t played with ggplot, try it right now. Its great !).

Below is the final output. In the figures below, color indicates percentage of test cases in that cell. The diagonal cells are highlighted with darker black border. Additionally, the two variants of the image display actual percentage value as text. if you like, it is possible to replace percentage text with actual frequency.

#generate random data
data <- data.frame(sample(LETTERS[0:20], 100, replace=T),sample(LETTERS[0:20], 100, replace=T))
names(data) <- c("Actual","Predicted")

#compute frequency of actual categories
actual <- as.data.frame(table(data$Actual))
names(actual) <- c("Actual","ActualFreq")

#build confusion matrix
confusion <- as.data.frame(table(data$Actual, data$Predicted))
names(confusion) <- c("Actual","Predicted","Freq")

#calculate percentage of test cases based on actual frequency
confusion <- merge(confusion, actual, by=c("Actual"))
confusion$Percent <- confusion$Freq/confusion$ActualFreq*100

#render plot
# we use three different layers
# first we draw tiles and fill color based on percentage of test cases
tile <- ggplot() +
geom_tile(aes(x=Actual, y=Predicted,fill=Percent),data=confusion, color="black",size=0.1) +
labs(x="Actual",y="Predicted") 

# next we render text values. If you only want to indicate values greater than zero then use data=subset(confusion, Percent > 0)
tile <- tile +
geom_text(aes(x=Actual,y=Predicted, label=sprintf("%.1f", Percent)),data=confusion, size=3, colour="black") +
scale_fill_gradient(low="grey",high="red")

# lastly we draw diagonal tiles. We use alpha = 0 so as not to hide previous layers but use size=0.3 to highlight border
tile <- tile +
geom_tile(aes(x=Actual,y=Predicted),data=subset(confusion, as.character(Actual)==as.character(Predicted)), color="black",size=0.3, fill="black", alpha=0) 

#render
tile

Shell Tips

Array Operations

#Append
var=( ${array[@]-} $(echo "$variable") )

#Get Unique elements from any array
echo ${oldArray[@]}|awk '{for(i=1;i<NF;i++) printf "%s\n",$i}' | sort -u

#Check if element is present in an array

File Operations

#list files greater than 10bytes
find *.csv.gz -type f -size +10

#list files less than 11 bytes
find *.csv.gz -type f -size -11

#add line number
cat -n infile > outfile

#fetch nth line. Example fetch 74th line from file
awk 'NR==74' file
head -n 74 file | tail -1 

#fetch nth line from all the files matching certain criteria
find path_to_folder -n "file_pattern_* | xargs -n 1 awk 'NR==74' 

#add two files column wise using tab as delimiter
paste -d$'\t' file1 file2 > newfile

# remove ctrl-M from end of line. Ctrl-M is dos format to indicate end of line. Use tr command to remove it
cat file | tr -d '\015' > newfile

#change delimiter from tab to pipe
cat file | tr "\t" '|' > newfile

# Remove lines present in file 1 and file 2
cat file1 file2 | sort | uniq -u > newfile

#Read file line by line
while read line
do
  echo $line
done < input_file

#change file extension from gz to txt
for file in folder/*.gz; do mv $file `echo $file | sed 's/\(.*\.\)gz/\1txt/'`; done

Postgres Command Line Utilities

#Delete all rows from all tables. In xargs command -n1 indicates that process each line separately and -I table refers to the table.
psql -U user database -c "select relname, n_live_tup from pg_stat_user_tables order by n_live_tup desc;" | awk -F'|' '{if($2 > 0) print $1}' | xargs -n1 -I table psql -U user database -c "delete from table;"

#display row count for all tables in a database
psql -U user database -c "select tablename from pg_tables where tablename not like 'pg_%' and tablename not like 'sql_%';" | xargs -n1 -I table psql -U user database -c "select 'table' as table, count(*) from table;"

Setting Parameters of a Ruby Class through Command Line Arguments

In this post, I want to address two things:

  1. How to run a ruby class through command line? This is rather simple and just need few lines (see if __FILE__ = $0 in the code below)
  2. How to set configuration parameters of a class either through command line or through another class ? This was much more challenging to figure out then the first problem. I found the solution in OptionParser class.

Assumption: I assume that you are familiar with the OptionParser ruby library. If not, I recommend looking at some of these articles

Solution

Code:

require 'rubygems'
require 'optparse'

class SayHello

       # Define some class variables to
       # store configuration parameters
       attr_reader :optparse, :options, :mandatory, :arguments

      def initialize(arguments)
          @options = {}
          @optparse = OptionParser.new()
          set_mandatory
          set_arguments(arguments)
          set_options
          validate
      end

      #Set mandatory configuration parameters
      def set_mandatory
          @mandatory = [:name]
      end

      #Handle arguments
      def set_arguments(arguments)
         return if arguments.nil?
         if(arguments.kind_of?(String))
            @arguments = arguments.split(/\s{1,}/)
         elsif (arguments.kind_of?(Array))
            @arguments = arguments
         else
            raise Exception, "Expecting either String or an Array"
         end
      end

     #Use OptionParser library and
     # define some parameters
     def set_options
         @optparse.banner = "Usage: ruby #{File.basename(__FILE__)} --[no]-debug"

         #First parameter definition
         @options[:debug] = false
         @optparse.on('--[no-]debug','Print debug statements'){|opt| @options[:debug] = opt }

         #Second Parameter definition
         @options[:name] = nil
         @optparse.on('-n','--name LIST',Array,'Say hello to'){|opt|
            @options[:name] = opt
         }
     end

     #Validate that mandatory parameters are specified
     def validate
         begin
            @optparse.parse!(@arguments)
            missing = @mandatory.select{|p| @options[p].nil? }
            if not missing.empty?
                puts "Missing options: #{missing.join(', ')}"
                puts @optparse
            end
         rescue OptionParser::ParseError,
                OptionParser::InvalidArgument,
                OptionParser::InvalidOption,
                OptionParser::MissingArgument
                puts $!.to_s
                puts @optparse
                exit
         end
         #Print options if debug is on
         @options.inspect if @options[:debug]
     end

     #Run your code
     def execute
         #Do something
         @options[:name].each{|n| puts "hello #{n}" }
     end
end #End of Class Test

#Include code below so that we can run SayHello through command line
if __FILE__ == $0
    cls = SayHello.new(ARGV)
    cls.execute
end

Now, as shown below,  we can run the SayHello class either through command line or call it through another class. Also, the configuration parameters are handled in the same way.

ruby SayHello.rb --debug --name PersonA,PersonB,PersonC

Through Another Class

SayHello.new("--debug --name PersonA,PersonB,PersonC")

Easy-Hadoop, Rapid Hadoop Programming

In today’s digital world, the biggest challenge is how to deal with the large volume of data. Apache Hadoop provides one solution to this problem. It’s an open source software for distributed and scalable computing. Hadoop is usually deployed in a cluster with 100s of nodes.

For quite sometime I have been using Hadoop and it has been quite pleasing experience. However, over time I realized that I have been writing almost similar code for different projects. I was spending more time on writing various pieces of code rather than analyzing data. This led me to think of easy-hadoop, a rapid hadoop programming library. The idea of easy-hadoop is to develop small customizable module. For instance, one of the common tasks is to extract columns from the input dataset and set the key and value pairs. As shown below, one can write a simple ruby code to achieve this:


STDIN.each_line{|line|
#Assuming input data is tab delimited
tokens = line.to_s.strip.split("\t")

#Let's assume we want the key to be combination of the
#3 and 4 columns and use pipe character as a delimiter
key = tokens[3..4].join('|')

#Let’s assume that we want 1,2,5 as values
values = []
[1,2,5].each{|i| values << tokens[i] }
value = values.join(“\t”)

#Write key and value pair
puts “#{key}\t#{value}”
}

While the above code works, its very specific to a particular dataset and a project as it assumes lot of things (such as input delimiter, key delimiter, etc). Instead we can convert each of the assumption we made in the above code into a parameter and make the same code much more useful. Below is the another version of the same code:


@options = {
:input_delimiter=>"\t",
:key_delimiter=>'|',
:value_delimiter=>"\t",
:keys=[3,4],
:values=[1,2,5]
}
STDIN.each_line{|line|
#Tokenize input data
tokens =
line.to_s.strip.split(@options[:input_delimiter])

#Generate key
keys = []
@options[:keys].each{|k|  keys << tokens[k] }
keys = keys.join(@options[:key_delimiter])

#Generate Values
values = []
@options[:values].each{|v| values << tokens[v] }
values = values.join(@options[:value_delimiter])

#Write output
puts “#{keys}\t#{values}”
}

By converting most of the assumptions into parameters, we made our simple function much more useful in many other project. Now we just need to modify @options and the code remains the same. However, its still not yet elegant as one has to still modify the source code each time one of the parameters changes. This can be solved by using setting different parameters through command line arguments (as shown below) and using optparse utility in ruby to parse and set parameter hash map (@options). If you are interested, checkout the ColExtractor.rb in Easy-Hadoop project.

ruby ColExtractor.rb --input_delimiter="\t" --key=3,4 --value=1,2,5 --key_delimiter="|" --value_delimiter="\t"

Evaluating Linear Regression Model in R

R has many different commands to quickly analyze a linear model. Below is a list of such commands and some tips on how to interpret there output.

1. Summary

Example

>model <- lm(Income~(Price+Temp+Consumption)^2,data=ice) >summary(model)

Sample Output


Interpretation

Parameter Formula Explanation
Residual (R) =y-h_\theta(x) In general, residuals should have a normal distribution with mean close to zero and 1Q and 3Q having about the same absolute value.
Degree of Freedom (df ) Number of Training Records – Number of Coefficients
Residual Standard Error (\sigma ) \sqrt{\frac{\sum{R^2}}{df}} Residual standard error should be as small as possible (Note, the whole objective of machine learning was minimize error. Also think of overfitting.)
Standard Error Of Variable (\sigma_i) $latex \frac{\sigma}{\sqrt{n}} Its the standard deviation of the sample divided by square root of the sample size. This is same as estimating population variation from sample variation.
t-value \frac{\theta_i}{\hat{\sigma_i}} t value make sense only if we have P value, which is listed in the next column
P Value From table It gives the probability of achieving a value as large as t so that null hypothesis is true. Here, null hypothesis is that coefficient estimate is zero. Assuming we are using 5% confidence interval for two-tailed t-test, then P value should be less than 0.05 in order for us to believe in the coefficient estimate.
F Statistics ?? ??

2. Plot

Example

>par(mfrow=c(2,2))
>plot(model)

Output

Interpretation

Plot General behavior and Interpretation
Residual Vs Fitted Points should be randomly scattered around the center line (dotted). Any pattern (such as all points on one side) indicates either violation of linearlity or homoscedasticity. The plot also help identify the boundary of the model. For instance, the red line coincides with the center line initially but moves upwards as fitted value increases. This in general indicates that the model is good only lower fitted values.
Scale Location ??
Normal Q-Q Plot This plot helps analyze whether the distribution of residual error is normal or not. In general, the points should be along the diagonal line. If you notice any particular pattern (such as bump), it usually indicate the the residual error is not uniformly distributed. Sometimes, points will be in a line but either parallel to the diagonal or in some other direction. This usually indicate ?
Residual Vs Leverage Plot ??

References

Notes: Linear Regression

Note: This post is more of my running notes than any authoritative guide on the subject. I just find blogging a good way to develop thorough understanding and also easy to access and hence I am putting over here. You are most welcome to read and comment on it. Please let me know if I am wrong or mistaken in my understanding/writing.

1.0 Motivation

Assume you have a sample housing data for a town containing following three fields: (1) area of a house, (2) number of rooms and (3) the price at which it was sold. Based on this data, you want to predict the price of an another home whose area is 1300 sq ft and has 2 rooms. Now, one can look in our sample data and try to find a matching record to get an estimate of the home we are interested it. Another approach is to build some kind of model that can, within reasonable margin of error, predict the price of a house based on area and number of rooms. Here, we focus on this second approach.

2.0 Introduction

As discussed above, the aim is to build a model that for a given area and number of rooms predicts the price with reasonable accuracy. However, model is very generic term.  In this context think of a model as a mathematical equation that we can learn from the data. The price of a house can be any real value between 0 and infinite and is continuous value. In data mining and machine learning domain, when the output is real value the problem is referred is that of regression. In contrast, if the output of the model is discrete, the problem is referred is that of classification. An example of the classification problem is categorizing weather into sunny, raining, etc based on different input variables such as temperature, humidity, precipitation, etc.

So once we have decided that we have a regression problem, the next step is deciding what approach we want to use for building this model. One has many different possibilities and also might have to several of them to get the best model. However, one of the simpler option is to build a linear regression model.  Before we delve into the details of how to build a linear regression model, however lets first try to understand what is linear equation.

Naively, a linear equation can be explained as a function that leads to a straight line. Thus one of the simplest linear equation that we all have learnt in school is that of a line ( y=mx+b), where m represents the slope of the line and b represents the intercept or the point at which the line crosses the y-axis (or where is x=0).  Wikipedia provides a more technical definition of linear equation as an algebraic equation in which the term is either a constant or the product of a constant and (the first power of) of a single variable. In the above equation of a line, slope of the line (m) represents the constant term and the variable is x. First power of a single variable means that variables can’t be higher degree polynomial i.e. x^2 , x^3 , etc. Equations containing higher order of variables are referred as non-linear equations.

Coming back to our topic of linear regression model, however, one has to note that linear regression model are linear in terms of constants (hereafter referred as weights or parameters \theta ) and not necessarily linear in terms of variables. Thus it is possible to have higher degree polynomial functions or other functions of variable (x). These higher degree polynomial functions are referred as basis function (f(x)). Some examples of basis function are polynomial function $((f_i(x)=x^i)$, spline function, gaussian, sigmoid, wavelets, etc. Also note that you are not restricted to having a single variable in your model but can have several of them. Thus in our sample dataset, we have two variables: (1) area and (2) number of rooms. These variables are often referred as input variable. In contrast to these two input variables, price is the target variable that we want to predict based on our trained model. Before we dive into linear regression, there is another term with which we want to get familiar. It is bias (\theta_0). Bias is a constant term. In the above equation of a straight line, b represents the bias (and previously was referred as intercept). Based on our above understanding of weights, basis function and bias, we can know write a generalized linear regression model as shown below:

h_\theta(x) = \theta_0+\theta_1f_1(x_1)+\theta_2f_2(x_2)+\theta_2f_3(x_3)+\cdots

In the above equation, x1, x2, … are different input variables. \theta_1,$\theta_2$,… are the constant terms or weights that we want to learn and $\theta_0$ is the bias term. The above equation can be represented more compactly by representing the bias term as \theta_0f_0(x_0), where f_0(x_0)=1. Following this assumption, now we can rewrite the above general equation as shown below:

h_\theta(x)=\sum_{i=0}^{m}\theta_if_i(x_i)

In the above example, m represents the number of training examples. Great, now we have a general understanding of general linear model, but how do we move from here. There are three major challenges

  1. Selecting appropriate basis function
  2. Computing weights \theta
  3. Evaluating the model

3.0 Building Model

3.1 Deciding Basis function

Note: I need help clearly understand this part of the linear regression problem. Please let me know if you have any resource that I can use to learn more about this topic.

3.2 Computing Weights/Parameters (\theta )

Once we have formalized our hypothesis function (h_\theta(x)), the next step is to compute parameters (\theta_0,\theta_1,\cdots). Obviously, we want such parameters that minimizes the difference between the price of the house as predicted by our trained model and the actual price in our training data over all the training records. That is, our optimization function can be defined as

ObjectiveFunction=\min_{\theta}\sum_{i=1}^{m}(h_\theta(x^{i})-y^i))

Where:

h_\theta(x^i) = predicted house price for the ith record
y^(i) = actual house price for the ith record.

However, there is a problem with the above optimization function. It cancels out positive and negative differences. For instance, if the difference is 2K for the first record and -2K for the second record, then the summation of differences overall all the record will be zero. To overcome this problem, one can either minimize the sume of absolute difference i.e (\min_{\theta}\sum_{i=1}^{m}(|h_\theta(x^{i})-y^i)|)) or take the square of differences and minimizes that (\min_{\theta}\sum_{i=1}^{m}(h_\theta(x^{i})-y^i))^2). However, it is shown that taking absolute difference does not lead to a unique solution and to optimize $\theta$, we need to take square of the difference. This gets us into the topic of least square methods. Thus, the function that we want to optimize is

Optimize=\frac{1}{2}\min_{\theta}\sum_{i=1}^{m}(h_\theta(x^{i})-y^i)^2)

We added \frac{1}{2} to
make future computation easy. Now to solve the above equation, we can gradient descent algorithm. In gradient descent algorithm, we take the partial derivative of the optimization function to update the parameter. That is, at each iteration, we calculate new \theta as follows

\theta_i = \theta_i - \alpha\sum_{j=1}^{m}\frac{\partial }{\partial \theta_i}(\frac{1}{2}(h_\theta(x^j)-y^j)^2)
or
\theta_i = \theta_i - \alpha\sum_{j=1}^{m}((h_\theta(x^j)-y^j)x_i^j)

The above gradient descent approach is also known as batch gradient descent algorithm as we are updating \theta after observing error across all the records. Alternative to batch gradient descent algorithm is stochastic gradient descent or online gradient descent in which we update \theta for each record. For large dataset, stochastic gradient descent algorithm is shown to converge much faster.

Note: Since gradient descent algorithm is used for many other machine learning techniques, I am planning to write a more detailed post on the gradient descent algorithm itself. For now, just search on gradient descent algorithm to find more about it.

3.3 Evaluation

4.0 Misc. Notes

4.1 Locally weighted linear regression

Linear regression as described above is a parametric model as it does not need to keep the data in order to predict new outcomes. Furthermore, it provides us with a single model for the whole dataset. However, sometimes it is impossible to fit a single line through all the data and thereby one has to explore new options. One such option is to use locally weighted linear regression. It is a non-parametric model that is it keeps the training data around in order to predict new outcomes. Each time a new prediction has to made, it performs a regression using only points that are around (also known as local) the given point.

4.2 Bias Term = 0

In some applications, it make sense to set the bias term (\theta_0) to zero. For instance, in the case of house price if the only input variable is area of the house, then setting bias term equal to zero makes sense as the house price will be zero if the area of the house is zero.

Reference:

  1. Andrew Ng, Machine Learning – Lecture 2
  2. Interpreting Regression Coefficients: An interesting article on how to interpret parameter values
  3. Interpreting interactions: This explains how to understand interactions between two or more variables. It is demonstrates the case where your model has more than one variables.
  4. Assumptions in Linear Regression – Great tips on necessary conditions to apply linear regression

Using ruby gems along with hadoop streaming

I just discovered there is a simple way to use ruby gems (or ruby libraries) in your mapper or reducer script even if you don’t have administrative rights. Below is a short and quick explanation of how to do this. One of the parameter in hadoop streaming is “-cacheArchive”. It allows you to specify path of the archive on the master machine and create a symbolic link. You can read more about it over here. In order to use ruby gems, we will need to do four simple steps

Step 1. Zip gem source code:
Download the source code of a gem and zip it. Lets assume you want to use the awesome geokit gem. At the top level of the geokit gem there is one file (geokit.rb) and a folder (geokit). Use the following command on MacOSX to create a zip file

$> zip -r geokit.zip geokit.rb geokit

Note: use -r parameter to recursively include subfolders

Step 2: Upload the zip file on hadoop’s distributed file system

$> hadoop dfs -copyFromLocal geokit.zip lib/