How to estimate unknown parameters using Ordinary Least Squares (OLS)
Assumption and model selection:
For any estimation algorithm, certain assumptions have to be made. One of the main assumptions is the choice of an appropriate model for estimation. The chosen model should produce the minimum statistical deviation and therefore provide a “good fit”. A metric is often employed to quantify this “goodness of fit”. Tests like the likelihood ratio test, the Chi-square test, the Akaike Information Criterion (AIC), etc., are used to measure the goodness of the assumed statistical model, and decisions are made on the validity of the model assumption.
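As an illustration of such a criterion, the sketch below compares polynomial orders with a simple AIC score. The data here are synthetic (a known line plus noise) and NumPy's `polyfit` is used for the fitting; neither comes from the article.

```python
import numpy as np

# Illustrative only: synthetic data generated from a known line plus noise,
# not the observations from the article's experiment.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 25)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, x.size)

n = x.size
rss_by_order = {}
aic_by_order = {}
for order in (1, 2, 3):
    coeffs = np.polyfit(x, y, order)                 # fit polynomial of this order
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)   # residual sum of squares
    k = order + 1                                    # number of estimated parameters
    rss_by_order[order] = rss
    # AIC for Gaussian errors, up to an additive constant
    aic_by_order[order] = n * np.log(rss / n) + 2 * k
    print(f"order {order}: RSS = {rss:.3f}, AIC = {aic_by_order[order]:.2f}")
```

A higher order always lowers the residual sum of squares, but the AIC penalty of 2 per extra parameter discourages orders that do not earn their keep.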
To keep things simple, we will consider only polynomial models.
Observed data and Least Squares estimation:
Suppose an experiment is performed and four data points \((x_i, y_i)\) are observed. The variable “\(x\)” is the input to the experiment and “\(y\)” is the output obtained. In regression-analysis jargon, the input variable is called the “regressor”, “independent variable”, “explanatory variable”, “manipulated variable”, etc. Similarly, the output variable is called the “regressand”, “dependent variable”, “response variable”, “observed variable”, etc.
Generic outline of an estimation algorithm:
We have a very small number of data points (four in our case). To illustrate the concept, we will choose a linear model for parameter estimation. Higher-order models will certainly fit the data more closely, but remember: the higher the polynomial order, the more parameters have to be estimated, and therefore the greater the computational complexity. It is necessary to strike a balance between the required performance and the model order.
We assume that the data points follow a linear trend. So, the linear model is chosen for the estimation problem.
$$ y = \alpha_2 x + \alpha_1 $$
Now the estimation problem reduces to finding the parameters \( \alpha_2 \) and \( \alpha_1 \). We can write the relationship between the observed variable and the input variable for each of the four data points as

$$ y_i = \alpha_2 x_i + \alpha_1, \quad i = 1, 2, 3, 4 $$
The next step is to solve the above simultaneous equations based on the least-squares error criterion. According to this criterion, the estimated values of \( \alpha_1 \) and \( \alpha_2 \) should produce the minimum total squared error.
Let’s define the term “error” for the above system of simultaneous equations. The error (which is a function of the model parameters) for one data point is the difference between the observed data and the data predicted by the estimated model.
$$ E_i (\alpha_1, \alpha_2) = y_i - \left( \alpha_2 x_i + \alpha_1 \right) $$
The sum of all squared errors is given by

$$ S(\alpha_1, \alpha_2) = \sum_{i=1}^{4} E_i^2(\alpha_1, \alpha_2) = \sum_{i=1}^{4} \left( y_i - \alpha_2 x_i - \alpha_1 \right)^2 $$
The next step is to solve for the values of \( \alpha_1 \) and \( \alpha_2 \) that give the minimum total squared error.
How do we find them? Employ calculus: separately take the partial derivatives of \( S \) with respect to \( \alpha_1 \) and \( \alpha_2 \), and set each to zero.
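Solving that pair of equations yields the familiar closed-form expressions for the slope and intercept. A minimal sketch follows; since the article's table of observations is not reproduced here, the four data points below are made up purely for illustration.

```python
import numpy as np

# Hypothetical data points -- the article's observed table is not available,
# so these four (x, y) pairs are invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.3, 5.9, 8.2])
n = len(x)

# Setting dS/d(alpha_2) = 0 and dS/d(alpha_1) = 0 and solving the resulting
# pair of simultaneous equations gives the closed-form least-squares estimates:
alpha_2 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
alpha_1 = (np.sum(y) - alpha_2 * np.sum(x)) / n
print(alpha_2, alpha_1)  # estimated slope and intercept
```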
Solving the above simultaneous equations leads to the following solution:

$$ \hat{\alpha}_2 = 3.577, \quad \hat{\alpha}_1 = -9.923 $$
Thus, the model becomes
$$ y = 3.577 x - 9.923 $$
Solving using Matrices:
A system of simultaneous equations can be solved by matrix manipulation. In DSP implementations of estimation algorithms, it is often convenient to work in the matrix domain (especially when the number of data points becomes large). Linear algebra is used extensively in implementing estimation algorithms on DSP processors.
The above set of data points can be represented in matrix notation as

$$ \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \\ x_4 & 1 \end{bmatrix} \begin{bmatrix} \alpha_2 \\ \alpha_1 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} $$

Defining,

$$ X = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \\ x_4 & 1 \end{bmatrix}, \quad \alpha = \begin{bmatrix} \alpha_2 \\ \alpha_1 \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} $$
The set of simultaneous equations shrinks to
$$ X \alpha = Y $$
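Building this matrix form is a one-liner in practice. The sketch below (with the same hypothetical inputs as before, which are not from the article) stacks the regressor values next to a column of ones so that the two columns multiply \( \alpha_2 \) and \( \alpha_1 \) respectively.

```python
import numpy as np

# Hypothetical inputs; the first column of X multiplies the slope alpha_2,
# the column of ones multiplies the intercept alpha_1.
x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x, np.ones_like(x)])

alpha = np.array([2.0, 1.0])   # example parameter vector [alpha_2, alpha_1]
print(X @ alpha)               # model predictions alpha_2 * x + alpha_1
```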
Given the criterion that the solution to the above equation must minimize the total squared error \( S(\alpha) \),

$$ \hat{\alpha} = \arg \min_{\alpha} S(\alpha) $$

The above equation simply denotes that the estimated parameter \( \hat{\alpha} \) is the value of \( \alpha \) for which the error function \( S(\alpha) \) attains its minimum.
The simultaneous equations mentioned above form a very simple case taken for illustration. In the generic case, \( X \) is an \( N \times M \) matrix of \( N \) observations and \( M \) model parameters, \( \alpha \) is an \( M \times 1 \) parameter vector, and \( Y \) is an \( N \times 1 \) observation vector.
Usually, the above system of simultaneous equations is overdetermined and has no exact solution. A unique least-squares solution exists if and only if all the columns of the matrix \( X \) are linearly independent, i.e., \( X \) has full column rank, which guarantees that \( X^T X \) is invertible. Setting the gradient of \( S(\alpha) \) to zero leads to the normal equations,
$$ (X^T X )\hat{\alpha} = X^{T}Y $$
The solution is obtained by finding,
$$ \hat{\alpha} = (X^T X )^{-1}X^{T}Y $$
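In code, the normal equations are best handled by a linear solver rather than an explicit inverse. A minimal sketch, again using hypothetical data points that are not from the article:

```python
import numpy as np

# Hypothetical data (same illustrative points as before)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.3, 5.9, 8.2])
X = np.column_stack([x, np.ones_like(x)])

# Solve the normal equations (X^T X) alpha_hat = X^T Y without forming the
# explicit inverse -- np.linalg.solve is cheaper and numerically safer.
alpha_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem directly.
alpha_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(alpha_hat, alpha_ref)
```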
The aforementioned solution involves computing the inverse of the matrix \( X^T X \). Directly computing matrix inverses is cumbersome and computationally inefficient, especially in DSP implementations, let alone the finite word-length effects of fixed-point processors. If the matrix \( X^T X \), however large, has a very low condition number (i.e., it is well-conditioned) and is positive definite, then we can use the Cholesky decomposition to solve the equations.
This leads us to the next topic : “Cholesky Decomposition”
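As a preview, here is a minimal sketch of that approach: factor \( X^T X = L L^T \) and replace the inverse with two triangular solves. The data points are hypothetical, as before.

```python
import numpy as np

# Hypothetical data points, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.3, 5.9, 8.2])
X = np.column_stack([x, np.ones_like(x)])

A = X.T @ X   # symmetric positive definite when X has full column rank
b = X.T @ y

# Cholesky factorization A = L L^T, then two triangular solves
L = np.linalg.cholesky(A)
z = np.linalg.solve(L, b)             # forward substitution:  L z = b
alpha_hat = np.linalg.solve(L.T, z)   # back substitution:     L^T alpha = z
print(alpha_hat)
```

The triangular structure of \( L \) makes both solves cheap, which is why this route is preferred on DSP hardware over forming \( (X^T X)^{-1} \) explicitly.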
Author: Mathuranathan (http://www.gaussianwaves.com/)