Linear Models – Least Squares Estimator (LSE)

Key focus: Understand, step by step, the least squares estimator for parameter estimation, with a hands-on example that fits a curve using least squares estimation.

Background:

The various estimation concepts/techniques like Maximum Likelihood Estimation (MLE), Minimum Variance Unbiased Estimation (MVUE), Best Linear Unbiased Estimator (BLUE) – all falling under the umbrella of classical estimation – require assumptions/knowledge on second order statistics (covariance) before the estimation technique can be applied. Linear estimators, discussed here, do not require any statistical model to begin with. They only require a signal model in linear form.

Linear models are ubiquitously used in various fields for studying the relationship between two or more variables. Linear models include regression analysis models, ANalysis Of VAriance (ANOVA) models, variance component models etc. Here, one variable is considered as a dependent (response) variable which can be expressed as a linear combination of one or more independent (explanatory) variables.

Studying the dependence between variables is fundamental to linear models. Applying the concepts to a real application requires the following procedure:

Problem identification
Model selection
Statistical performance analysis
Criticism of the model based on statistical analysis
Conclusions and recommendations

The following text elaborates on linear models when applied to parameter estimation using Ordinary Least Squares (OLS).

Linear Regression Model

A regression model relates a dependent (response) variable y to a set of k independent explanatory variables {x₁, x₂ ,…, x_k} using a function. When the relationship is not exact, an error term e is introduced.

$latex y = f(x_1,x_2,…,x_k) + e \quad\quad (1) &s=1$

If the function f is not a linear function, the above model is referred to as a Non-Linear Regression Model. If f is linear, equation (1) is expressed as a linear combination of the independent variables {x₁, x₂ ,…, x_k} weighted by the unknown parameters θ = {θ₁, θ₂,…, θ_k } that we wish to estimate.

$latex y = x_1 \theta_1 + x_2 \theta_2 + … + x_k \theta_k + e \quad\quad (2) &s=1$

Equation (2) is referred to as the Linear Regression model. When N such observations are made,

$latex y_i = x_{1i} \theta_1 + x_{2i} \theta_2 + … + x_{ki} \theta_k + e_i , \left(i=1,2,…,N \right) \quad (3) &s=1$

where,
y_i – response variable (i-th observation)
x_ji – independent variables – known, collected in the observation matrix X with rank k
θ_j – set of parameters to be estimated
e_i – disturbances/measurement errors – modeled as a noise vector e with PDF N(0, σ² I)

It is convenient to express all the variables in matrix form when N observations are made.

$latex y=\begin{bmatrix} y_1\\ \vdots \\ y_N \end{bmatrix} ,\; X=\begin{bmatrix} x_{11} & x_{21} & … & x_{k1} \\ \vdots &\vdots & \ddots & \vdots \\ x_{1N} & x_{2N} & … & x_{kN} \end{bmatrix} ,\; \theta =\begin{bmatrix} \theta_1\\ \vdots \\ \theta_k \end{bmatrix} ,\; e=\begin{bmatrix} e_1\\ \vdots \\ e_N \end{bmatrix} \quad (4) &s=1$

Denoting equation (3) using (4),

$latex y = X \theta + e \quad\quad (5) &s=1$

Here X is an N ⨉ k matrix, y and e are N ⨉ 1 column vectors, and θ is a k ⨉ 1 column vector.
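
As a minimal sketch of equation (5), the snippet below builds X, θ and e for a small made-up problem and forms y. The sizes N = 6 and k = 2, the parameter values and the noise scale are assumptions chosen only for illustration.

```matlab
% Illustrative sizes and values (assumptions, not taken from the text)
N = 6;                       % number of observations
k = 2;                       % number of parameters
x = (1:N)';                  % one explanatory variable, N x 1
X = [ones(N,1) x];           % N x k observation matrix, first column all ones
theta = [5.3; 1.2];          % k x 1 parameter vector (unknown in practice)
e = 0.1*randn(N,1);          % N x 1 noise vector
y = X*theta + e;             % equation (5): N x 1 observation vector
disp(size(X));               % N rows, k columns
disp(rank(X));               % equals k when the columns are independent
```

In a real problem only X and y are available; θ is what the estimator has to recover.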

Ordinary Least Squares Estimation (OLS)

In OLS, all errors are weighted equally, as opposed to Weighted Least Squares where some errors are considered more significant than others.

If $latex \hat{\theta}$ is a k ⨉ 1 vector of estimates of θ, then the estimated model can be written as

$latex y = X \hat{\theta} + e \quad\quad(6) &s=1$

Thus the error vector e can be computed from the observed data vector y and the estimated $latex \hat{\theta}$ as

$latex e = y-X \hat{\theta} \quad\quad (7) &s=1$

Here, the errors are assumed to follow a multivariate normal distribution with zero mean and covariance σ² I (that is, each error has variance σ²).

To determine the least squares estimator, we write the sum of squares of the residuals (as a function of $latex \hat{\theta}$ ) as

$latex \begin{aligned} S(\hat{\theta})&=\sum e^2_i = e^Te=(y-X\hat{\theta})^T(y-X\hat{\theta})\\ &=y^Ty-y^T X \hat{\theta} -\hat{\theta}^TX^Ty + \hat{\theta}^TX^TX\hat{\theta} \end{aligned} \quad (8) &s=1$
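
One intermediate step may help before differentiating. Each of the two middle terms in (8) is a scalar, and a scalar equals its own transpose, so the two terms are equal and can be merged:

$latex y^T X \hat{\theta} = \left( y^T X \hat{\theta} \right)^T = \hat{\theta}^T X^T y \;\Rightarrow\; S(\hat{\theta}) = y^Ty - 2\hat{\theta}^TX^Ty + \hat{\theta}^TX^TX\hat{\theta} &s=1$

In this form the derivative follows from two standard identities: the derivative of $latex \hat{\theta}^T a$ with respect to $latex \hat{\theta}$ is a, and the derivative of $latex \hat{\theta}^T A \hat{\theta}$ is $latex 2A\hat{\theta}$ when A is symmetric, which holds for $latex A = X^TX$.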

The least squares estimator is obtained by minimizing $latex S(\hat{\theta})$. In order to get the estimate that gives the least square error, differentiate with respect to $latex \hat{\theta}$ and equate to zero.

$latex \begin{aligned} \frac{\partial S}{\partial \hat{\theta}}&= -2X^Ty+2X^TX\hat{\theta} = 0\\ &\Rightarrow \hat{\theta} = \left (X^TX \right )^{-1}X^Ty \end{aligned}\quad (9) &s=1$

Thus, the least squares estimate of θ is given by

$latex \boxed{ \hat{\theta} = \left (X^TX \right )^{-1}X^Ty } &s=1$

where the superscript T denotes the transpose. For complex-valued data it is replaced by the Hermitian transpose (conjugate transpose).
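
The closed-form result can be checked numerically. The sketch below uses made-up data (the x grid, the three parameter values and the noise scale are assumptions) to confirm that the estimate makes the gradient in equation (9) vanish and that it agrees with Matlab's backslash operator.

```matlab
% Made-up data for illustration (assumed values)
x = (0:0.5:4)';                      % 9 known explanatory values
X = [ones(size(x)) x x.^2];          % n x k observation matrix, k = 3
y = X*[1; -2; 0.5] + 0.05*randn(size(x));   % noisy observations

theta_ne = (X'*X)\(X'*y);            % closed-form estimate from equation (9)
theta_bs = X\y;                      % backslash least squares solve

e = y - X*theta_ne;                  % residual vector, equation (7)
grad = -2*X'*y + 2*(X'*X)*theta_ne;  % gradient from equation (9)

disp(norm(grad));                    % close to zero
disp(norm(theta_ne - theta_bs));     % close to zero
disp(norm(X'*e));                    % residual is orthogonal to the columns of X
```

The model here is quadratic in x but still linear in the parameters, so the same estimator applies.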

Summary of computations

Step 1: Choice of variables. Choose the variable to be explained (y) and the explanatory variables { x₁, x₂ ,…, x_k }, where x₁ is often taken to be a constant (optional) that always takes the value 1 – this incorporates a DC component in the model.
Step 2: Collect data. Collect n observations of y for a set of known values of { x₁, x₂ ,…, x_k }. Example: { x₁, x₂ ,…, x_k } is the pilot data in OFDM, using which we would like to estimate the channel impulse response θ, and y is the received vector of samples. Store the observed data y in an n⨉1 vector and the data on the explanatory variables in the n⨉k matrix X.
Step 3: Compute the estimates. Compute the least squares estimates by the formula
$latex \boxed{ \hat{\theta} = \left (X^TX \right )^{-1}X^Ty } &s=1$

The superscript T indicates the transpose, or the Hermitian transpose (conjugate transpose) when the data is complex-valued.
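
The three steps can be sketched for complex-valued data as follows. This is only an illustration under assumed values: the sizes, the random matrix X, the parameter vector and the noise level are all made up, and it is not an OFDM channel estimator.

```matlab
% Step 1: choose sizes (assumed values for illustration)
n = 64;  k = 3;
% Step 2: collect data. X holds known complex samples, theta is the
% complex parameter vector that generated y (made-up values)
X = (randn(n,k) + 1i*randn(n,k))/sqrt(2);
theta = [1+0.5i; -0.3i; 0.2];
w = 0.01*(randn(n,1) + 1i*randn(n,1));
y = X*theta + w;                       % n x 1 observed vector
% Step 3: compute the estimate. In Matlab, ' is the conjugate
% transpose and .' is the plain transpose.
theta_hat = (X'*X)\(X'*y);
disp([theta theta_hat]);               % true vs estimated parameters
```

Using `.'` instead of `'` in Step 3 would give a wrong estimate for complex data, which is why the conjugate transpose matters here.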

Key Points

We do not need a probabilistic assumption but only a deterministic signal model.
Because no statistical model is needed, it has a broader range of applications.
The least squares estimate is unbiased when the errors have zero mean.
The disturbance variance can be estimated from the residuals (k variables to estimate and n observations available) as
$latex \hat{\sigma}^2 = \frac{e^Te}{n-k} &s=1$
To keep the variance low, the number of observations must be greater than the number of variables to estimate.
The observation matrix X should have full column rank (rank k), that is, its columns should be linearly independent, which is usually the case with real data. This makes sure (X^TX) is invertible.
Least Squares Estimator can be used in block processing mode with overlapping segments – similar to Welch’s method of PSD estimation.
Useful in time-frequency analysis.
Adaptive filters are utilized for non-stationary applications.
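
Two of the points above, the rank condition and the variance estimate, can be sketched in a few lines. The data, the parameter values and the true noise standard deviation sigma = 2 are assumptions made for this illustration.

```matlab
% Made-up straight-line data (assumed values)
x = (-5:0.1:5)';                 % n = 101 known values
n = length(x);
X = [ones(n,1) x];               % k = 2 parameters
k = size(X,2);
sigma = 2;                       % true noise standard deviation (assumed)
y = X*[5.3; 1.2] + sigma*randn(n,1);

assert(rank(X) == k);            % full column rank, so X'*X is invertible
theta_hat = X\y;                 % least squares estimate
e = y - X*theta_hat;             % residuals
sigma2_hat = (e'*e)/(n-k);       % estimate of the disturbance variance
disp(sigma2_hat);                % should be near sigma^2 = 4
```

The estimate fluctuates from run to run because the noise is random; it settles closer to sigma^2 as n grows relative to k.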

LSE applied to curve fitting

A Matlab snippet that implements the least squares estimate to fit a curve is given below.

x = -5:.1:5; % set of x-values - known explanatory variables
y = 5.3 + 1.2*x; % straight line without noise
e = randn(size(y)); % zero-mean, unit-variance Gaussian noise
y = y + e; % adding random noise to get the observed variable
% Linear model - y = X*a + e, where a - parameters to be estimated

X = [ones(length(x),1) x']; % first column is all ones since x_1 = 1
y = y'; % column vector for proper dimension during multiplication
a = (X'*X)\(X'*y) % Least Squares Estimator - equivalent code X\y
plot(x, y, 'o'); % observed data
hold on;
plot(x, a(1) + a(2)*x, 'r-'); % fitted line
legend('observed samples', ['y=' num2str(a(1)) '+' num2str(a(2)) 'x'])
title('Least Squares Estimate for Curve Fitting');
xlabel('X values');
ylabel('Y values');
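
As a cross-check, the same straight-line fit can be compared against Matlab's built-in `polyfit`, which returns the coefficients in descending powers (slope first, then intercept). The snippet below is a self-contained sketch that regenerates data of the same form as above.

```matlab
x = -5:0.1:5;                         % known explanatory values
y = 5.3 + 1.2*x + randn(size(x));     % noisy straight line, as above
X = [ones(length(x),1) x'];           % observation matrix
a = X\y';                             % least squares estimate [intercept; slope]
p = polyfit(x, y, 1);                 % returns [slope intercept]
disp([a(1) p(2); a(2) p(1)]);         % the two columns should match
```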

Simulation Results

Figure 1: Least Squares Estimate for Curve Fitting

4 thoughts on “Linear Models – Least Squares Estimator (LSE)”

Nivedita negi

December 8, 2014 at 9:45 pm

can u please tell me how to do same estimation of parameter in linear model using Maximum likelihood? as soon as possible…in MLE u have solved only x=A+wn but I want to know for x = H*s(n)+w
- Mathuranathan
  
  December 8, 2014 at 11:11 pm
  
  For your question on x=H*s(n)+w, I assume your goal is to estimate the channel – ‘H’. This problem is very specific to the application and the nature of the channel (channel model dependent).
  
  To apply MLE for channel estimation, you need to first understand the channel model. Then develop a statistical model that represents the mix of received signal, noise and interference (if any).
  
  An excellent example would be pilot estimation algorithms in OFDM systems. Some of them can be found here.
  http://www.freescale.com/files/dsp/doc/app_note/AN3059.pdf
  - Nivedita negi
    
    March 30, 2015 at 2:14 pm
    
    thank you so much.
Girish

August 12, 2015 at 2:25 pm

Hello Sir

I want to do channel equalization and I am using the zero forcing equalizer.

I am using this code.

enbtx=dlmread('input.txt');
uerx_cap=dlmread('output.txt');
enbtx=enbtx(:,1)+1i*enbtx(:,2);
enbtx_norm=enbtx/max(abs(enbtx));
uerx_cap=uerx_cap(:,1)+1i*uerx_cap(:,2);
uerx_cap_norm=uerx_cap/max(abs(uerx_cap));
x=enbtx_norm; % I/P
y=uerx_cap_norm; %o/p
X=fft(x,);
Y=fft(y,);
H=Y*pinv(X); % channel estimation
H_zf=pinv(H); % making 1/H(z)

As channel is estimated then I take new data which is passed by the same channel

z is the new data taken

Z=fft(z);
Y_eq=H_zf*Y;
y_eq=ifft(Y_eq);

But for the new input output the equalizer is not working
Kindly help me, I am stuck in it.

With warm regards