Ordinary Least Squares : estimate unknown parameters

Key focus: Know how to estimate unknown parameters using Ordinary Least Squares (OLS) method.

As mentioned in the previous post, it is often required to estimate parameters that are unknown to the receiver. For example, if a fading channel is encountered in a communication system, it is desirable to estimate the channel response and cancel out the fading effects during reception.

Assumption and model selection:

For any estimation algorithm, certain assumptions have to be made. One of the main assumption is choosing an appropriate model for estimation. The chosen model should produce minimum statistical deviation and therefore should provide a “good fit”. A metric is often employed to determine the “goodness of fit”. Tests like “likelihood ratio test”, “Chi-Square test”, “Akaike Information Criterion” etc.., are used to measure the goodness of the assumed statistical model and decisions are made on the validity of the model assumption.

To keep things simple, we will consider only polynomial models.

Observed data and Least Squares estimation:

Suppose, an experiment is performed and the following data points are observed in the following table. The variable is the input to the experiment and is the output obtained. The input variable is often called “regressor”, “independent variable”, “manipulated variable”, etc. Similarly, the regression analysis jargon, the output variable or the observed variable is called “observed variable”, “explanatory variable”, “regressand” , “response variable” etc.

The aim of the experiment is to fit the experiment into a model with appropriate parameters. Once we know the model, we no longer need to perform the experiment to determine the output for any given arbitrary input. All we need to do is to use the model and generate the desired output.

Generic outline of an estimation algorithm:

We have a very small number of data points and it is four in our case. To illustrate the concept, we will choose linear model for parameter estimation. Higher order models will certainly give better performance. Remember!!! More the polynomial order, more is the number of parameters to be estimated and therefore the computational complexity will be more. It is necessary to strike a balance between the required performance and the model order.

We assume that the data points follow a linear trend. So, the linear model is chosen for the estimation problem.

Now the estimation problem simplifies to finding the parameters and . We can write the relationship between the observed variable and the input variable as

Next step is to solve for the above mentioned simultaneous equation based on least square error criterion. According to the criterion, the estimated values for and should produce minimum total squared error. Note the underlined words.

Let’s define the term – “error” for the above mentioned system of simultaneous equations. Error (which is a function of the model parameters) for one data point is the difference between the observed data and the data from the estimated model.

The sum of all squared errors is given by

Next step is to solve for and that gives minimum total squared error.

How do we find that? Employ calculus to find that. Separately take the partial derivative of with respect to and and set them to zero.

Solving the above simultaneous equations leads to the following solution

Thus, the model becomes

Solving using Matrices:

A system of simultaneous equations can be solved by Matrix manipulation. In DSP implementation of estimation algorithm, it is often convenient to work in matrix domain (especially when the number of data points becomes larger). Linear algebra is extensively used in implementing estimation algorithms on DSP processors.

The above mentioned set of data points can be represented in matrix notation as

Defining,

The set of simultaneous equations shrinks to

Given the criterion that the solution to the above equation must satisfy the minimum total squared error $latex S(\alpha)$,

The above equation simply denotes that the estimated parameter is the value of for which the error function attains the minimum.

The simultaneous equation mentioned above is a very simple case taken for illustration. For a generic case,

Usually, the above mentioned simultaneous equation may not have a unique solution. Unique solution exists if and only if all the columns of the matrix are linearly independent. Furthermore, the condition that the columns of matrix are linearly independent only means that they are orthogonal to each other. Thus the matrix is an orthogonal matrix.

For an orthogonal matrix, transpose and inverse are equivalent. That is,

Under this orthogonality condition, system of simultaneous equations become,

The solution is obtained by finding,

The aforementioned solution involves the computation of inverse of the matrix . Directly computing matrix inverses is cumbersome and computationally inefficient, especially when it comes to implementation in DSP, let alone the finite word length effects of fixed point processors. If the matrix , however large, has a very low condition number (i.e, well-conditioned) and if it is positive definite , then we can use Cholesky Decomposition to solve the equations.

This leads us to the next topic : “Cholesky Decomposition”

Rate this article: Note: There is a rating embedded within this post, please visit this post to rate it.

Related Posts

[1]An Introduction to Estimation Theory
[2]Bias of an Estimator
[3]Minimum Variance Unbiased Estimators (MVUE)
[4]Maximum Likelihood Estimation
[5]Maximum Likelihood Decoding
[6]Probability and Random Process
[7]Likelihood Function and Maximum Likelihood Estimation (MLE)
[8]Score, Fisher Information and Estimator Sensitivity
[9]Introduction to Cramer Rao Lower Bound (CRLB)
[10]Cramer Rao Lower Bound for Scalar Parameter Estimation
[11]Applying Cramer Rao Lower Bound (CRLB) to find a Minimum Variance Unbiased Estimator (MVUE)
[12]Efficient Estimators and CRLB
[13]Cramer Rao Lower Bound for Phase Estimation
[14]Normalized CRLB - an alternate form of CRLB and its relation to estimator sensitivity
[15]Cramer Rao Lower Bound (CRLB) for Vector Parameter Estimation
[16]The Mean Square Error – Why do we use it for estimation problems
[17]How to estimate unknown parameters using Ordinary Least Squares (OLS)
[18]Essential Preliminary Matrix Algebra for Signal Processing
[19]Why Cholesky Decomposition ? A sample case:
[20]Tests for Positive Definiteness of a Matrix
[21]Solving a Triangular Matrix using Forward & Backward Substitution
[22]Cholesky Factorization - Matlab and Python
[23]LTI system models for random signals – AR, MA and ARMA models
[24]Comparing AR and ARMA model - minimization of squared error
[25]Yule Walker Estimation
[26]AutoCorrelation (Correlogram) and persistence – Time series analysis
[27]Linear Models - Least Squares Estimator (LSE)
[28]Best Linear Unbiased Estimator (BLUE)

Books by the author


Wireless Communication Systems in Matlab
Second Edition(PDF)

Note: There is a rating embedded within this post, please visit this post to rate it.
Checkout Added to cart

Digital Modulations using Python
(PDF ebook)

Note: There is a rating embedded within this post, please visit this post to rate it.
Checkout Added to cart

Digital Modulations using Matlab
(PDF ebook)

Note: There is a rating embedded within this post, please visit this post to rate it.
Checkout Added to cart
Hand-picked Best books on Communication Engineering
Best books on Signal Processing

The Mean Square Error – Why do we use it for estimation problems

Note: There is a rating embedded within this post, please visit this post to rate it.
“Mean Square Error”, abbreviated as MSE, is an ubiquitous term found in texts on estimation theory. Have you ever wondered what this term actually means and why is this getting used in estimation theory very often ?

Any communication system has a transmitter, a channel or medium to communicate and a receiver. Given the channel impulse response and the channel noise, the goal of a receiver is to decipher what was sent from the transmitter. A simple channel is usually characterized by a channel response – \( h \) and an additive noise term – \( n \). In time domain, this can be written as

$$ y = h \circledast x + n $$

Here, is the convolution operation. Equivalently, in frequency domain, the convolution operation is equivalent to multiplication in frequency domain and vice-versa.

$$ Y = HX + N $$

Remember!!! Capitalized letters indicate frequency domain representation and small caps indicate time domain representation. The frequency domain equation looks simple and the only spoiler is the noise term. The receiver receives the information over the channel corrupted by noise and tries to decipher what was sent from the transmitter – \(X\). If the noise is cancelled out in the receiver, \(N =0\), the observed spectrum at the receiver will look like,

$$ Y = HX $$

Now, to know \( X \), the receiver has to know \(H\). Then it can simple divide the observed spectrum \(Y\) with the channel frequency response \(H\) to get \(X\). Unfortunately, things are not that easy. Cancellation of noise from the received samples/spectrum is the hardest part. Complete nullification of noise at the receiver is hardest to achieve and the entire communication system design engineering revolves around reducing this noise to minimum acceptable level to achieve acceptable performance.

Given the noise term, how do we know \(H\) from the observed/received spectrum \(Y\). This is a classical estimation problem.

Usually a known sequence ( pilot sequence in OFDM and training sequence in GSM etc.., ) is transmitted and sent across the channel and from that the channel response \(H\) or the impulse response \(h\) is estimated. This estimated channel response is used to decipher the transmitted spectrum/sequence when receiving the actual data. This type of estimation is useful if and only if the channel response remains constant across the frequency band of interest (Channel is flat across the band of interest – “flat fading”).

Okay !!! To estimate \(H\) in the presence of noise, we need some metric to quantify the accuracy of the estimation.

Line equations:

Consider a generic line equation \(y = mx+c\), where \(m\) is the slope of the line and \(c\) is the intercept. Both \(m\) and \(c\) are constants. Since \(x\) is a first order term (highest degree of \(x\)’s degree is one), this is called a linear equation. If the equation looks like \(y = mx^2+kx+c\), the highest degree of \(x\) is 2 and it becomes a quadratic equation. Similarly, the term \(x^3\) in the polynomial equation gives rise to the name cubic equation, \(x^4\) – quartic equation so on and so forth.

Naming a univariate Polynomial equation

Univariate Polynomial EquationHighest degree of 'x'Name
1Linear
2Quadratic
3Cubic
4Quartic
5Quintic

Coming back to the linear-line equation, \(y = mx +c\), to simplify things, let’s assign \(m=2\) and \(c=0\) and generate values for \(y\) by varying \(x\) from x=0 to 9 in steps of 1.

In the above equation, the input \(x\) gets transformed into output \(y\) by the transformation \(y = mx\). This can be considered analogous to a communication system in frequency domain, where the input \(X\) is transmitted, it gets transformed by a channel \(H\) and gives the output \(Y\).

$$ Y = HX $$

Frequency domain is considered because it has the same structure as the linear equation, whereas, in time domain the output of the channel is the convolution of the channel impulse response \(h\) and the input \(x\).

We can now consider that the channel impulse response in frequency domain \(H\) is equal to the constant \(m\) (flat fading assumption).

To make the channel look closer to a real one, we will add Additive White Noise Gaussian (AWGN) noise to the channel.

$$Y = HX + N $$

To represent this scenario in our line fitting problem, the noise is represented as being generated from a set of uniformly generated random numbers – ‘\(n\)’. We call this – “observed data”.

$$ y_1 = mx + n $$

Note: The term \(n\) in the above is not a constant but a random variable, whereas,the term \( c \) is a constant (This can be considered as a DC bias in the observed data , if present). I have generated the following table for illustration. For convenience and to illustrate the importance of the term MSE, the noise terms in the following table are not drawn from an uniform set of random numbers, instead, they are manually created in a way to make the total error term zero.

The first column is the input \( x \), the second column is the ideal (actual) output \( y \) that follows the equation \( y = mx + c \), with \( c \) set to \( 0 \). The third column is the noise term. The fourth column is the observed samples at the receiver after the ideal samples are corrupted by the noise term. The fourth column represents the equation \( y_1 = mx + n \).

Now, our job is to estimated the constant \( m \) in the presence of noise. Can you think of a possible metric which could aid in this estimation problem ? Given that a known data \( x \) is transmitted, the obvious choice is to measure the average error between the observed sequence of data and the actual data and to use a brute force search for \( m \). Plug-in various values in the place of \(m\) and choose the one that gives the minimum error.

Selecting the “error” as a metric seems to be a great and simple approach. But there exists a basic flaw in this approach. Now, consider the fifth column in the table which measures the error between the observed and actual data. The noise terms in the third column are chosen such that the average-error-measured becomes zero. Even though the average error is zero, it is obvious that the observed data is far from the ideal one. This is a big drawback in the error metric. This is because the positive and the negative errors cancel out. This can happen in the real scenario too, where the errors across all samples of observed data can cancel out each other.

To circumvent this problem, lets square the error terms (sixth column) and average them out. This metric is called – Mean Squared Error. Now, no matter what the sign of error is, the squaring operation always amplifies the errors in the positive direction. The issue of errors cancelling each other is solved by this approach. An estimation approach that attempts to Minimize the Mean Square Error is called a Minimum Mean Square Error (MMSE) estimator.1

I hope that this text might have helped in understanding the logic behind using Mean Square Error as a metric for estimation problems. Comments/suggestions for improvements are welcome. The next post will focus on Ordinary Least Squares (OLS) algorithm (using the mean square error metric) applied to a linear-line fitting problem.

Related Posts

[1]An Introduction to Estimation Theory
[2]Bias of an Estimator
[3]Minimum Variance Unbiased Estimators (MVUE)
[4]Maximum Likelihood Estimation
[5]Maximum Likelihood Decoding
[6]Probability and Random Process
[7]Likelihood Function and Maximum Likelihood Estimation (MLE)
[8]Score, Fisher Information and Estimator Sensitivity
[9]Introduction to Cramer Rao Lower Bound (CRLB)
[10]Cramer Rao Lower Bound for Scalar Parameter Estimation
[11]Applying Cramer Rao Lower Bound (CRLB) to find a Minimum Variance Unbiased Estimator (MVUE)
[12]Efficient Estimators and CRLB
[13]Cramer Rao Lower Bound for Phase Estimation
[14]Normalized CRLB - an alternate form of CRLB and its relation to estimator sensitivity
[15]Cramer Rao Lower Bound (CRLB) for Vector Parameter Estimation
[16]The Mean Square Error – Why do we use it for estimation problems
[17]How to estimate unknown parameters using Ordinary Least Squares (OLS)
[18]Essential Preliminary Matrix Algebra for Signal Processing
[19]Why Cholesky Decomposition ? A sample case:
[20]Tests for Positive Definiteness of a Matrix
[21]Solving a Triangular Matrix using Forward & Backward Substitution
[22]Cholesky Factorization - Matlab and Python
[23]LTI system models for random signals – AR, MA and ARMA models
[24]Comparing AR and ARMA model - minimization of squared error
[25]Yule Walker Estimation
[26]AutoCorrelation (Correlogram) and persistence – Time series analysis
[27]Linear Models - Least Squares Estimator (LSE)
[28]Best Linear Unbiased Estimator (BLUE)