Note: This post is more a set of running notes than an authoritative guide on the subject. I find blogging a good way to develop a thorough understanding, and the notes are easy to get back to later, so I am putting them here. You are most welcome to read and comment. Please let me know if I am wrong or mistaken in my understanding or writing.
1.0 Motivation
Assume you have sample housing data for a town containing the following three fields: (1) area of a house, (2) number of rooms and (3) the price at which it was sold. Based on this data, you want to predict the price of another home whose area is 1300 sq ft and which has 2 rooms. Now, one can look in the sample data and try to find a matching record to get an estimate for the home we are interested in. Another approach is to build some kind of model that can, within a reasonable margin of error, predict the price of a house based on area and number of rooms. Here, we focus on this second approach.
2.0 Introduction
As discussed above, the aim is to build a model that, for a given area and number of rooms, predicts the price with reasonable accuracy. However, model is a very generic term. In this context think of a model as a mathematical equation that we can learn from the data. The price of a house can be any real value between 0 and infinity and is a continuous value. In the data mining and machine learning domain, when the output is a real value, the problem is referred to as regression. In contrast, if the output of the model is discrete, the problem is referred to as classification. An example of a classification problem is categorizing weather into sunny, raining, etc. based on different input variables such as temperature, humidity, precipitation, etc.
So once we have decided that we have a regression problem, the next step is deciding which approach to use for building this model. There are many different possibilities, and one might have to try several of them to get the best model. However, one of the simpler options is to build a linear regression model. Before we delve into the details of how to build a linear regression model, let's first try to understand what a linear equation is.
Naively, a linear equation can be explained as a function whose graph is a straight line. Thus one of the simplest linear equations, which we all learnt in school, is that of a line ($y = mx + b$), where m represents the slope of the line and b represents the intercept, the point at which the line crosses the y-axis (where x = 0). Wikipedia provides a more technical definition of a linear equation as an algebraic equation in which each term is either a constant or the product of a constant and (the first power of) a single variable. In the above equation of a line, the slope (m) is the constant that multiplies the variable x, and b is a constant term on its own. First power of a single variable means that variables can’t appear with a higher degree, i.e. $x^2$, $x^3$, etc. Equations containing higher powers of the variables are referred to as non-linear equations.
Coming back to our topic of linear regression, however, one has to note that a linear regression model is linear in terms of its constants (hereafter referred to as weights or parameters, $\theta$) and not necessarily linear in terms of the variables. Thus it is possible to use higher degree polynomial functions or other functions of the variable (x). These functions of the variable are referred to as basis functions ($f_i(x)$). Some examples of basis functions are the polynomial function ($f_i(x)=x^i$), spline functions, gaussian, sigmoid, wavelets, etc. Also note that you are not restricted to having a single variable in your model but can have several of them. Thus in our sample dataset, we have two variables: (1) area and (2) number of rooms. These variables are often referred to as input variables. In contrast to these two input variables, price is the target variable that we want to predict with our trained model. Before we dive into linear regression, there is another term with which we want to get familiar. It is the bias ($\theta_0$). Bias is a constant term. In the above equation of a straight line, b represents the bias (and was previously referred to as the intercept). Based on our understanding of weights, basis functions and bias, we can now write a generalized linear regression model as shown below:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$
In the above equation, $x_1$, $x_2$, … are the different input variables (or basis functions of them), $\theta_1$, $\theta_2$, … are the constant terms or weights that we want to learn and $\theta_0$ is the bias term. The above equation can be represented more compactly by writing the bias term as $\theta_0 x_0$, where $x_0 = 1$. Following this assumption, we can now rewrite the above general equation as shown below:
$$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j$$
In the above equation, n represents the number of input variables; m is used below for the number of training examples. Great, now we have a general understanding of the general linear model, but how do we move on from here? There are three major challenges:
- Selecting appropriate basis functions
- Computing the weights
- Evaluating the model
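As a minimal sketch of the compact form of the model, the prediction is just a weighted sum once $x_0 = 1$ is prepended to the inputs. The function name `predict` and the weight values below are my own assumptions, picked only for illustration; they are not learned from any data.

```python
def predict(theta, x):
    """Return h_theta(x) = sum_j theta_j * x_j, with x_0 = 1 prepended for the bias."""
    features = [1.0] + list(x)  # x_0 = 1 carries the bias theta_0
    if len(features) != len(theta):
        raise ValueError("theta needs one entry per input variable plus the bias")
    return sum(t * f for t, f in zip(theta, features))


# Hypothetical weights (not learned): bias, price per sq ft, price per room.
theta = [50000.0, 100.0, 10000.0]

# The house from the motivation section: 1300 sq ft and 2 rooms.
price = predict(theta, [1300, 2])
print(price)  # 200000.0
```

With real data the entries of `theta` would come out of the weight computation step, not be typed in by hand.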
3.0 Building Model
3.1 Deciding Basis function
Note: I need help clearly understanding this part of the linear regression problem. Please let me know if you have any resource that I can use to learn more about this topic.
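This is not guidance on how to choose a basis function, only a small sketch of what the polynomial basis $f_i(x)=x^i$ mentioned earlier does to a single input. The helper names and the weights are hypothetical. The point it illustrates is that the model stays linear in the weights even though it is no longer linear in x.

```python
def polynomial_basis(x, degree):
    """Map a single input x to [x**1, x**2, ..., x**degree]."""
    return [x ** i for i in range(1, degree + 1)]


def predict_with_basis(theta, x, degree):
    # theta[0] is the bias; theta[i] multiplies the basis function f_i(x) = x**i.
    features = [1.0] + polynomial_basis(x, degree)
    return sum(t * f for t, f in zip(theta, features))


theta = [1.0, 2.0, 3.0]  # hypothetical weights for a degree-2 model

print(polynomial_basis(2.0, 3))           # [2.0, 4.0, 8.0]
print(predict_with_basis(theta, 2.0, 2))  # 1 + 2*2 + 3*4 = 17.0
```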
3.2 Computing Weights/Parameters ($\theta$)
Once we have formalized our hypothesis function ($h_\theta(x)$), the next step is to compute the parameters ($\theta$). Obviously, we want parameters that minimize the difference between the price of the house as predicted by our trained model and the actual price in our training data, over all the training records. That is, our optimization function can be defined as
$$\min_\theta \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
Where:
- $h_\theta(x^{(i)})$ = predicted house price for the ith record
- $y^{(i)}$ = actual house price for the ith record
- m = number of training records
However, there is a problem with the above optimization function. It cancels out positive and negative differences. For instance, if the difference is 2K for the first record and -2K for the second record, then the summation of differences over all the records will be zero. To overcome this problem, one can either minimize the sum of absolute differences, i.e. ($\sum_{i=1}^{m} |h_\theta(x^{(i)}) - y^{(i)}|$), or take the square of the differences and minimize that ($\sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$). However, minimizing the absolute difference does not always lead to a unique solution, and the absolute value is not differentiable at zero, so to optimize $\theta$ we take the square of the difference. This gets us into the topic of least squares methods. Thus, the function that we want to optimize is
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
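The three candidate objectives can be compared on the 2K / -2K example from the text. This is only a sketch: the function names and the two prices are made up, and `squared_error` includes the conventional factor of one half used in least squares.

```python
def signed_error(predicted, actual):
    """Plain sum of differences; positive and negative errors cancel."""
    return sum(p - a for p, a in zip(predicted, actual))


def absolute_error(predicted, actual):
    """Sum of absolute differences."""
    return sum(abs(p - a) for p, a in zip(predicted, actual))


def squared_error(predicted, actual):
    """Half the sum of squared differences (the least squares objective)."""
    return 0.5 * sum((p - a) ** 2 for p, a in zip(predicted, actual))


# Made-up prices in thousands: predictions are off by +2K and -2K.
actual = [200.0, 300.0]
predicted = [202.0, 298.0]

print(signed_error(predicted, actual))    # 0.0, looks perfect but is not
print(absolute_error(predicted, actual))  # 4.0
print(squared_error(predicted, actual))   # 4.0
```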
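We added the factor $\frac{1}{2}$ in front of the sum of squared differences, $J(\theta)$, to make future computation easy: it cancels the 2 that appears when the square is differentiated. Now to minimize the above function, we can use the gradient descent algorithm. In the gradient descent algorithm, we take the partial derivative of the optimization function with respect to each parameter and use it to update that parameter. That is, at each iteration, we calculate the new $\theta_j$ as follows
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
or
$$\theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
Here $\alpha$ is the learning rate (step size), and $x_j^{(i)}$ is the jth input variable of the ith record.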
The above gradient descent approach is also known as the batch gradient descent algorithm, as we update $\theta$ only after observing the error across all the records. An alternative to batch gradient descent is stochastic gradient descent (also called online gradient descent), in which we update $\theta$ after each record. For large datasets, stochastic gradient descent often gets close to the minimum much faster, since it starts making progress without first scanning the whole dataset.
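The difference between the two variants is easier to see side by side. The sketch below is my own minimal version under stated assumptions: pure Python, toy data I made up that follows y = 1 + 2x exactly, each row already starting with $x_0 = 1$, and a hand-picked learning rate `alpha` (the step size). None of the names come from a library.

```python
import random


def predict(theta, features):
    """h_theta(x) for one record whose features already start with x_0 = 1."""
    return sum(t * f for t, f in zip(theta, features))


def cost(theta, rows, targets):
    """J(theta): half the sum of squared prediction errors."""
    return 0.5 * sum((predict(theta, r) - y) ** 2 for r, y in zip(rows, targets))


def batch_gradient_descent(rows, targets, alpha, iterations):
    theta = [0.0] * len(rows[0])
    for _ in range(iterations):
        # Look at every record before changing theta.
        errors = [predict(theta, r) - y for r, y in zip(rows, targets)]
        gradient = [sum(e * r[j] for e, r in zip(errors, rows))
                    for j in range(len(theta))]
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


def stochastic_gradient_descent(rows, targets, alpha, epochs, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    theta = [0.0] * len(rows[0])
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the records in a random order
        for i in order:
            error = predict(theta, rows[i]) - targets[i]
            # Update theta immediately, using this single record only.
            theta = [t - alpha * error * f for t, f in zip(theta, rows[i])]
    return theta


# Made-up toy data following y = 1 + 2x; each row is [x_0 = 1, x_1].
rows = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
targets = [3.0, 5.0, 7.0, 9.0]

theta_batch = batch_gradient_descent(rows, targets, alpha=0.05, iterations=2000)
theta_sgd = stochastic_gradient_descent(rows, targets, alpha=0.05, epochs=2000)
print(theta_batch, cost(theta_batch, rows, targets))
print(theta_sgd, cost(theta_sgd, rows, targets))
```

Both runs should end up close to a bias of 1 and a weight of 2. The learning rate matters: on this toy data a noticeably larger `alpha` makes the batch updates overshoot and diverge, so treat 0.05 as a value that happens to work here, not a recommendation.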
Note: Since the gradient descent algorithm is used in many other machine learning techniques, I am planning to write a more detailed post on the gradient descent algorithm itself. For now, just search for gradient descent to find out more about it.
3.3 Evaluation
4.0 Misc. Notes
4.1 Locally weighted linear regression
Linear regression as described above is a parametric model, as it does not need to keep the data in order to predict new outcomes. Furthermore, it provides us with a single model for the whole dataset. However, sometimes it is impossible to fit a single line through all the data, and one has to explore other options. One such option is to use locally weighted linear regression. It is a non-parametric model, that is, it keeps the training data around in order to predict new outcomes. Each time a new prediction has to be made, it performs a regression using only the points that are around (also known as local to) the given point.
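A minimal sketch of the idea for a single input variable follows. It rests on assumptions of mine: "local" is made soft by giving each training point a Gaussian weight $\exp(-(x_i - x)^2 / (2\tau^2))$ around the query point, where the bandwidth `tau` is a hypothetical parameter, and the weighted line is fitted in closed form instead of by gradient descent. The data values are made up.

```python
import math


def locally_weighted_predict(xs, ys, x_query, tau):
    """Fit a weighted straight line around x_query and evaluate it there."""
    # Points near x_query get weights close to 1, far points close to 0.
    weights = [math.exp(-((x - x_query) ** 2) / (2.0 * tau ** 2)) for x in xs]

    # Weighted sums needed for the closed-form weighted least squares line.
    sw = sum(weights)
    sx = sum(w * x for w, x in zip(weights, xs))
    sy = sum(w * y for w, y in zip(weights, ys))
    sxx = sum(w * x * x for w, x in zip(weights, xs))
    sxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))

    denom = sw * sxx - sx * sx
    if abs(denom) < 1e-12:
        raise ValueError("not enough distinct nearby points to fit a line")
    theta1 = (sw * sxy - sx * sy) / denom   # local slope
    theta0 = (sy - theta1 * sx) / sw        # local bias
    return theta0 + theta1 * x_query


# Made-up curved data (y = x squared) that no single line fits well.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x * x for x in xs]

local_estimate = locally_weighted_predict(xs, ys, x_query=2.0, tau=0.5)
print(local_estimate)
```

Note that the whole training set (`xs`, `ys`) has to be passed in for every prediction, which is exactly what makes the method non-parametric.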
4.2 Bias Term = 0
In some applications, it makes sense to set the bias term ($\theta_0$) to zero. For instance, in the case of house prices, if the only input variable is the area of the house, then setting the bias term equal to zero makes sense, as the house price will be zero if the area of the house is zero.
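With the bias fixed at zero and a single input variable, the least squares weight has a simple closed form: setting the derivative of $\frac{1}{2}\sum_i (\theta_1 x^{(i)} - y^{(i)})^2$ to zero gives $\theta_1 = \sum_i x^{(i)} y^{(i)} / \sum_i (x^{(i)})^2$. A small sketch, with a hypothetical function name and made-up areas and prices:

```python
def fit_through_origin(xs, ys):
    """Least squares weight for the model y = theta_1 * x (bias fixed at 0)."""
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        raise ValueError("all inputs are zero, the weight is undetermined")
    return sum(x * y for x, y in zip(xs, ys)) / sxx


# Made-up data: area in sq ft and sale price.
areas = [1000.0, 1500.0, 2000.0]
prices = [100000.0, 150000.0, 200000.0]

theta_1 = fit_through_origin(areas, prices)
print(theta_1)         # 100.0, i.e. price per sq ft
print(theta_1 * 0.0)   # 0.0, a house of zero area is priced at zero
```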
Reference:
- Andrew Ng, Machine Learning – Lecture 2
- Interpreting Regression Coefficients: An interesting article on how to interpret parameter values
- Interpreting interactions: This explains how to understand interactions between two or more variables. It demonstrates the case where your model has more than one variable.
- Assumptions in Linear Regression – Great tips on necessary conditions to apply linear regression