Linear regression using Python – demystified

Key focus: Demonstrate the basics of univariate linear regression using Python SciPy functions: train the model and use it for predictions.

Linear regression model

Regression is a framework for fitting models to data. At a fundamental level, a linear regression model assumes a linear relationship between the input variables ($x$) and the output variable ($y$). The input variables are often referred to as independent variables, features or predictors. The output is often referred to as the dependent variable, target, observed variable or response variable.

If the given dataset has only one input variable and one output variable, we have the simplest configuration for a regression model, and the regression is termed univariate regression. Multivariate regression extends the concept to include more than one independent variable and/or dependent variable.

Univariate regression example

Let us start with a fictitious dataset. We construct the dataset ourselves and use it to understand the problem of linear regression, which is a supervised machine learning technique. Consider linear-looking, randomly generated data samples.

import numpy as np
import matplotlib.pyplot as plt #for plotting

np.random.seed(0) #to generate predictable random numbers

m = 100 #number of samples
x = np.random.rand(m,1) #uniformly distributed random numbers
theta_0 = 50 #intercept
theta_1 = 35 #coefficient
noise_sigma = 3

noise = noise_sigma*np.random.randn(m,1) #gaussian random noise

y = theta_0 + theta_1*x + noise #noise added target
 
plt.ion() #interactive plot on
fig,ax = plt.subplots(nrows=1,ncols=1)
plt.plot(x,y,'.',label='training data')
plt.xlabel(r'Feature $x_1$');plt.ylabel(r'Target $y$')
plt.title('Feature vs. Target')
Figure 1: Simulated data for linear regression problem

In this example, the data samples represent the feature $x_1$ and the corresponding targets $y$. Given this dataset, how can we predict the target $y$ as a function of $x_1$? This is a typical regression problem.

Linear regression

Let $(x_1^{(i)}, y^{(i)})$ be the pair that forms one training example (one point on the plot above). Assuming there are $m$ such sample points as training examples, the set $\{(x_1^{(i)}, y^{(i)});\ i = 1, \dots, m\}$ contains all the pairs.

In the univariate linear regression problem, we seek to approximate the target $y$ as a linear function of the input $x$, which implies the equation of a straight line (example in Figure 2) as given by

$$y \approx \theta_0 x_0 + \theta_1 x_1 \qquad (1)$$

where $\theta_0$ is the intercept, $\theta_1$ is the slope of the straight line that is sought, and $x_0$ is always $1$. The approximated target serves as a guideline for prediction. The approximated target is denoted by

$$\hat{y} = \theta_0 x_0 + \theta_1 x_1$$

Using all the $m$ samples from the training set, we wish to find the parameters $(\theta_0, \theta_1)$ that best approximate the relationship between the given target samples $y^{(i)}$ and the straight line function $\hat{y}^{(i)}$.

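If we represent the parameters $\theta$, the input samples $x$ and the target samples $y$ as matrices, then equation (1) can be expressed as a matrix product

$$\mathbf{y} \approx \mathbf{X}\boldsymbol{\theta}, \qquad \mathbf{X} = \begin{bmatrix} 1 & x_1^{(1)} \\ \vdots & \vdots \\ 1 & x_1^{(m)} \end{bmatrix}, \qquad \boldsymbol{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$$

It may seem that the solution for finding $\boldsymbol{\theta}$ is straightforward

$$\boldsymbol{\theta} = \mathbf{X}^{-1}\mathbf{y}$$

However, matrix inversion is not defined for matrices that are not square. The Moore-Penrose pseudo inverse generalizes the concept of matrix inversion to an $m \times n$ matrix. Denoting the Moore-Penrose pseudo inverse of $\mathbf{X}$ as $\mathbf{X}^{+}$, the solution for finding $\boldsymbol{\theta}$ is

$$\boldsymbol{\theta} = \mathbf{X}^{+}\mathbf{y}$$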

For coding in Python, we utilize the scipy.linalg.pinv function to compute the Moore-Penrose pseudo inverse and estimate $\theta$.

xMat = np.c_[ np.ones([len(x),1]), x ] #form x matrix
from scipy.linalg import pinv
theta_estimate = pinv(xMat).dot(y)
print(f'theta_0 estimate: {theta_estimate[0]}')
print(f'theta_1 estimate: {theta_estimate[1]}')

The code results in the following estimates for $\theta$, which are very close to the values used to generate the random data points for this problem.

>> theta_0 estimate: [50.66645323]
>> theta_1 estimate: [34.81080506]
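
As a sanity check, the pseudo-inverse estimate can be compared with two other standard routes to the same least-squares solution: NumPy's `np.linalg.lstsq`, and the normal equations $\theta = (X^T X)^{-1} X^T y$, which apply when the columns of $X$ are linearly independent. The following is a minimal sketch, not part of the original walkthrough; it regenerates the simulated data so that it runs on its own, and the names `theta_pinv`, `theta_lstsq` and `theta_normal` are illustrative.

```python
import numpy as np
from scipy.linalg import pinv

# Regenerate the simulated dataset used in this article
np.random.seed(0)
m = 100
x = np.random.rand(m, 1)
y = 50 + 35 * x + 3 * np.random.randn(m, 1)

X = np.c_[np.ones((m, 1)), x]  # design matrix with rows [1, x_1]

# Route 1: Moore-Penrose pseudo inverse (as in the article)
theta_pinv = pinv(X).dot(y)

# Route 2: NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Route 3: normal equations, valid here because X has full column rank
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(theta_pinv, theta_lstsq))
print(np.allclose(theta_pinv, theta_normal))
```

All three routes should agree to within floating-point tolerance. The pseudo-inverse route has the advantage of remaining defined even when the columns of the input matrix are linearly dependent.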

Now that we know the parameters of our example system, the target predictions for new values of the feature $x_1$ can be computed as follows

x_new = np.array([[-0.2],[0.5],[1.2] ]) #new unseen inputs
x_newmat = np.c_[ np.ones([len(x_new),1]), x_new ] #form xNew matrix
y_predict  = np.dot(x_newmat,theta_estimate)
>>> y_predict #predicted y values for new inputs for x_1
array([[43.70429222],
       [68.07185576],
       [92.43941931]])
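
The prediction step can also be wrapped in a small helper so that the column of ones is added in one place. This is only a sketch: the function name `predict` is hypothetical, and the parameter values are the rounded estimates printed above rather than a fresh fit.

```python
import numpy as np

def predict(x_new, theta):
    """Return predictions theta_0 + theta_1 * x_1 for a column of new inputs.

    x_new : array of shape (k, 1) holding new feature values
    theta : array of shape (2, 1) holding [theta_0, theta_1]
    """
    x_new = np.asarray(x_new, dtype=float)
    x_mat = np.c_[np.ones((len(x_new), 1)), x_new]  # prepend x_0 = 1
    return x_mat.dot(theta)

# Rounded estimates taken from the printed output earlier in the article
theta_estimate = np.array([[50.66645323], [34.81080506]])

y_predict = predict([[-0.2], [0.5], [1.2]], theta_estimate)
print(y_predict)
```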

The approximated target, as a linear function of the feature, is plotted as a straight line.

plt.plot(x_new,y_predict,'-',label='prediction')
theta_0_hat, theta_1_hat = theta_estimate.ravel() #unpack as scalars for string formatting
plt.text(0.7, 55, r'Intercept $\theta_0$ = %0.2f'%theta_0_hat)
plt.text(0.7, 50, r'Coefficient $\theta_1$ = %0.2f'%theta_1_hat)
plt.text(0.5, 45, r'y= $\theta_0+ \theta_1 x_1$ = %0.2f + %0.2f $x_1$'%(theta_0_hat,theta_1_hat))
plt.legend() #plot legend
Figure 2: Linear Regression – training samples and prediction

Introduction to Signal Processing for Machine Learning

Key focus: Fundamentals of signal processing for machine learning. Speaker identification is taken as an example for introducing supervised learning concepts.

Signal Processing

A signal, mathematically a function, is a mechanism for conveying information. Audio, images, electrocardiograph (ECG) signals, radar signals, stock price movements and electrical currents/voltages are some examples.

Signal processing is an engineering discipline that focuses on synthesizing, analyzing and modifying such signals. Some of the applications of signal processing are:

● Converting one signal to another – filtering, decomposition, denoising
● Information extraction and interpretation – computer vision, speech recognition, iris recognition, fingerprint recognition
● Error control and source coding – low-density parity-check (LDPC) codes, turbo coding, linear prediction coding, JPG, PNG
● Detection – SONAR, RADAR
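
To make the first application concrete, here is a minimal denoising sketch: a moving-average filter applied to a noisy sinusoid. The signal, the noise level and the filter length `L` are all assumptions chosen for illustration, and a moving average is just one simple choice of filter.

```python
import numpy as np

rng = np.random.default_rng(0)

n = np.arange(500)
clean = np.sin(2 * np.pi * 0.01 * n)                # slowly varying sinusoid
noisy = clean + 0.5 * rng.standard_normal(n.size)   # add Gaussian noise

L = 15                     # filter length (assumed value)
h = np.ones(L) / L         # moving-average impulse response
denoised = np.convolve(noisy, h, mode='same')       # filter, keep original length

# Mean squared error against the clean signal, before and after filtering
err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
print(err_denoised < err_noisy)
```

Averaging over `L` neighbouring samples reduces the noise power while leaving the slowly varying sinusoid largely intact, so the filtered signal ends up closer to the clean one.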

Machine Learning (ML)

Machine learning is a science that deals with the development of algorithms that learn from data. According to Arthur Samuel (1959) [1], machine learning is a “Field of study that gives computers the ability to learn without being explicitly programmed”. Kevin Murphy, in his seminal book [2], defines machine learning as a collection of algorithms that automatically detect patterns in data and then use the uncovered patterns to predict future data or other outcomes of interest.

Essentially, a machine learning algorithm may learn from data to
● recognize patterns – example: recognizing text patterns in a set of spam emails
● classify data into different categories – example: classifying emails into spam or non-spam
● predict a future outcome – example: predicting whether an incoming email is spam or not

Machine learning algorithms are divided into three main types:
● Supervised learning – a predictive learning approach where the goal is to learn from a labeled set of input-output pairs. The labeled set provides the training examples for further classification or prediction. In machine learning jargon, inputs are called ‘features’ and outputs are called ‘response variables’.
● Unsupervised learning – a less well-defined knowledge discovery process, where the goal is to learn structured patterns in the data by separating them from pure unstructured noise.
● Reinforcement learning – learning by interacting with an environment in order to perform decision-making tasks.

Based on the discussion so far, we can start to recognize how the synergy between the fields of signal processing and machine learning can provide a new perspective to approach many problems.

Speaker identification – an application of ML algorithms in signal processing

Speaker identification (Figure 1) is the identification of a person from the analysis of voice characteristics. In this supervised classification application, a labeled training set of voice samples (from a set of speakers) is used in the learning process.

Figure 1: Speaker recognition using machine learning and signal processing

Voice samples/recordings cannot be used as such in the learning process. Further processing may require sampling, cleaning (removal of noise, invalid samples, etc.) or re-formatting the samples to a suitable format. This step is called ‘data pre-processing’.

Also, we may have to transform the data in a way that is specific to the ML algorithm and to our knowledge of the problem. To train the ML model to recognize the patterns in the voice samples, feature extraction on the voice samples is performed using signal processing. In this case, the features that are used to train the ML model are pitch and Mel-Frequency Cepstrum Coefficients (MFCC) [3] extracted from the voice samples.
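
As an illustration of feature extraction, the sketch below estimates the pitch of a synthetic tone from the peak of its autocorrelation. It is a simplified stand-in and not the method of reference [3]: the sampling rate, the test signal and the 50–400 Hz search range are assumptions, and MFCC extraction, which normally relies on a dedicated audio library, is not shown.

```python
import numpy as np

fs = 8000                              # sampling rate in Hz (assumed)
f0 = 200                               # true pitch of the synthetic tone in Hz
t = np.arange(int(0.1 * fs)) / fs      # 100 ms of samples

# Synthetic "voiced" signal: fundamental plus one weaker harmonic
voice = np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)

# Autocorrelation for non-negative lags 0 .. N-1
acf = np.correlate(voice, voice, mode='full')[len(voice) - 1:]

# Search for the strongest peak within an assumed pitch range of 50-400 Hz
lag_min = fs // 400
lag_max = fs // 50
best_lag = lag_min + np.argmax(acf[lag_min:lag_max + 1])

pitch_estimate = fs / best_lag         # convert lag (samples) to frequency (Hz)
print(pitch_estimate)
```

A periodic signal correlates strongly with a copy of itself shifted by one period, so the lag of the strongest autocorrelation peak gives the period and hence the pitch.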

Generally, the available dataset (set of input voice samples) is split into two sets: one set for training the model and the other set for testing (typically in a 75%-25% ratio). The training set is used to train the ML model and the test set is used to evaluate the effectiveness and performance of the ML algorithm.
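
A 75%-25% split can be sketched with plain NumPy as shown below. The feature matrix and labels here are random placeholders standing in for extracted voice features and speaker identities; the sizes (200 samples, 13 features, 4 speakers) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

num_samples = 200
features = rng.standard_normal((num_samples, 13))   # placeholder feature vectors
labels = rng.integers(0, 4, size=num_samples)       # placeholder speaker ids

# Shuffle the sample indices, then cut at the 75% mark
indices = rng.permutation(num_samples)
n_train = int(0.75 * num_samples)
train_idx, test_idx = indices[:n_train], indices[n_train:]

X_train, y_train = features[train_idx], labels[train_idx]
X_test, y_test = features[test_idx], labels[test_idx]

print(X_train.shape, X_test.shape)
```

Shuffling before the cut avoids a split that simply follows the order in which the samples were recorded.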

The training process should attempt to generalize the underlying relationship between the feature vectors (input to the supervised learning algorithm) and the class labels (supervised learner’s output). Cross-validation is one of the verification techniques for evaluating the generalization ability of the ML model.
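
The idea of k-fold cross-validation can be sketched as follows. The nearest-centroid classifier and the synthetic three-class data are assumptions chosen to keep the example self-contained; the helper names (`nearest_centroid_fit`, `nearest_centroid_predict`, `k_fold_accuracy`) are hypothetical, and any classifier could take their place.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Store one mean vector (centroid) per class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(model, X):
    """Assign each sample to the class of its closest centroid."""
    classes, centroids = model
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dist, axis=1)]

def k_fold_accuracy(X, y, k=5, seed=0):
    """Return the validation accuracy obtained on each of the k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]                                    # held-out fold
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = nearest_centroid_fit(X[train_idx], y[train_idx])
        y_hat = nearest_centroid_predict(model, X[val_idx])
        scores.append(np.mean(y_hat == y[val_idx]))
    return np.array(scores)

# Synthetic data: three well-separated classes in two dimensions
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
y = np.repeat(np.arange(3), 50)
X = centers[y] + rng.standard_normal((150, 2))

scores = k_fold_accuracy(X, y, k=5)
print(scores.mean())
```

Each sample is used for validation exactly once, so the mean of the fold scores gives a less split-dependent estimate of generalization than a single train/test split.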

The training process should also avoid overfitting, which may cause poor generalization and erroneous classification in the execution phase. If the performance of the algorithm needs improvement, we need to go back and make changes to the previous steps. Metrics such as accuracy, recall and the confusion matrix are typically used to evaluate the effectiveness and performance of the ML algorithm.
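
The metrics named above can be computed directly from true and predicted labels. The sketch below uses a small hand-made binary example; the label arrays are made up for illustration, and the helper `confusion_matrix` is a hypothetical function written here, not a library call.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns index the predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up labels for eight test samples (1 = target speaker, 0 = other)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred, num_classes=2)

accuracy = np.trace(cm) / cm.sum()       # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)    # per class: correct / actual members

print(cm)         # [[3 1]
                  #  [1 3]]
print(accuracy)   # 0.75
print(recall)     # [0.75 0.75]
```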

After the ML model is adequately trained to provide satisfactory performance, we move on to the execution phase. In the execution phase, when an unlabeled instance of a voice sample is presented to the trained classifier, it identifies the person to whom the sample belongs.

References

[1] Arthur L. Samuel, “Some Studies in Machine Learning Using the Game of Checkers,” IBM Journal of Research and Development 44:1.2 (1959): 210–229.
[2] Kevin P. Murphy, “Machine Learning – A Probabilistic Perspective”, ISBN 978-0262018029, The MIT Press, Cambridge, MA.
[3] P. M. Chauhan and N. P. Desai, “Mel Frequency Cepstral Coefficients (MFCC) based speaker identification in noisy environment using wiener filter,” 2014 International Conference on Green Computing Communication and Electrical Engineering (ICGCCEE), Coimbatore, 2014, pp. 1-5.
