Generating simulated dataset for regression problems

Key focus: This article shows how to generate a simulated dataset for regression problems using the make_regression function from scikit-learn (Python 3).

Problem statement

Suppose a survey is conducted among the employees of a company, collecting each employee's salary and years of experience. The aim of this data collection is to build a regression model that can predict salary from experience, especially for experience values the model has not seen.

If you are a developer, you often have no access to such survey data. In that case, you can simulate the data and use it to build the regression model.

Generating the dataset

To construct a simulated dataset for this scenario, the sklearn.datasets.make_regression function available in the scikit-learn library can be used. The function generates the samples for a random regression problem.

The make_regression function generates samples for the inputs (features) and the output (target) by applying a random linear regression model. The generated values then have to be scaled to a range appropriate for the given problem.
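To make "random linear regression model" concrete, here is a minimal sketch that checks what the function produces for a single feature. It assumes the default bias of 0.0, in which case the noise-free target should equal the feature multiplied by the returned coefficient, and the noise argument controls the spread of the deviations from that line. The variable names (clean_matches, residual_std) are illustrative only.

```python
import numpy as np
from sklearn import datasets

# Without noise, every target should lie exactly on the line y = coef * x
x, y, coef = datasets.make_regression(n_samples=100, n_features=1,
                                      n_informative=1, noise=0.0,
                                      coef=True, random_state=0)
clean_matches = np.allclose(y, x[:, 0] * coef)
print("noise-free targets lie on the line:", clean_matches)

# With noise=10, the deviations from the line are Gaussian noise
x, y, coef = datasets.make_regression(n_samples=100, n_features=1,
                                      n_informative=1, noise=10,
                                      coef=True, random_state=0)
residuals = y - x[:, 0] * coef
residual_std = residuals.std()
print("std of deviations from the line:", residual_std)  # expected to be near the noise setting
```

With only 100 samples the measured standard deviation will not match the noise setting exactly; it should merely be in the same neighbourhood.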

import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt # for plotting

x, y, coef = datasets.make_regression(n_samples=100,   # number of samples
                                      n_features=1,    # number of features
                                      n_informative=1, # number of useful features
                                      noise=10,        # standard deviation of the Gaussian noise
                                      coef=True,       # also return the true coefficient used to generate the data
                                      random_state=0)  # fixed seed: same data points on each run

# Scale feature x (years of experience) to range 0..20
x = np.interp(x, (x.min(), x.max()), (0, 20))

# Scale target y (salary) to range 20000..150000
y = np.interp(y, (y.min(), y.max()), (20000, 150000))

plt.ion() # interactive plotting on
plt.plot(x, y, '.', label='training data')
plt.xlabel('Years of experience')
plt.ylabel('Salary $')
plt.title('Experience Vs. Salary')
Figure 1: Simulated dataset for linear regression problem
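The scaling step used above is a min-max mapping: the smallest generated value is sent to the lower bound, the largest to the upper bound, and everything in between is placed linearly. A small sketch on hand-picked numbers makes this easier to see. The helper name rescale is hypothetical and not part of NumPy or scikit-learn, and the sketch assumes the input holds at least two distinct values.

```python
import numpy as np

def rescale(values, low, high):
    """Hypothetical helper: map values linearly so that min -> low and max -> high."""
    values = np.asarray(values, dtype=float)
    return np.interp(values, (values.min(), values.max()), (low, high))

raw = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
scaled = rescale(raw, 0, 20)
print(scaled)  # evenly spaced values from 0 to 20: 0, 5, 10, 15, 20
```

Because the mapping is linear, the scaled data still follows a straight-line trend. Note, however, that the coef value returned by make_regression describes the unscaled data, so it should not be expected to match the slope of the rescaled experience and salary values.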

If you want the data as a pandas DataFrame:

import pandas as pd
df = pd.DataFrame(data={'experience':x.flatten(),'salary':y})
df.head(10)
Figure 2: Generated dataset presented as pandas dataframe
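If the dataset is needed in more than one place, the generation, scaling and DataFrame steps can be wrapped in a single function. The following is only a sketch: the function name make_salary_frame and its parameters are hypothetical, and the ranges (0 to 20 years, 20000 to 150000 salary) are simply the ones used in this article.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

def make_salary_frame(n_samples=100, seed=0):
    """Hypothetical helper: simulated experience/salary data as a DataFrame."""
    x, y = datasets.make_regression(n_samples=n_samples, n_features=1,
                                    n_informative=1, noise=10,
                                    random_state=seed)
    # Scale the single feature column and the target to the article's ranges
    experience = np.interp(x[:, 0], (x.min(), x.max()), (0, 20))
    salary = np.interp(y, (y.min(), y.max()), (20000, 150000))
    return pd.DataFrame({'experience': experience, 'salary': salary})

df = make_salary_frame(n_samples=50, seed=1)
print(df.shape)                                # number of rows and columns
print(df.describe().loc[['min', 'max']])       # sanity check of the scaled ranges
```

Checking the minimum and maximum of each column is a quick way to confirm that the scaling was applied to the intended ranges.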

We have successfully generated a simulated dataset for regression problems in Python 3. Let’s move on to build and train a linear regression model using the generated dataset and use it for predictions.
