In this lesson, you will learn how to build and predict with a simple linear regression using Python. Specifically, this lesson will cover:
1. Simple Linear Regression in Python and Data Description
Linear regression is a statistical method that models the relationship between two variables by fitting a linear equation to observed data. In simple terms, it helps us understand how one variable (independent variable) affects another variable (dependent variable).
You are going to learn how to build a simple linear regression in Python. You’ll use data from a non-profit organization to predict the amount of donations based on the number of social media shares a donation campaign receives. Each row in the data represents a different fundraising campaign.
In the example for this tutorial:
- Explanatory variable (X): Number of social media shares
- Response variable (Y): Amount of donations
1a. Creating Pandas DataFrame
You will import an Excel file that is stored in a GitHub repository into a pandas DataFrame using the code below. This code installs a required library, imports pandas, reads the Excel file from a URL, and prints the data.
Let’s break down the steps of the code line by line.
Step 1: Set up Python to work on the web.
import pyodide_http
pyodide_http.patch_all()
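- Think of pyodide_http as a tool that helps your Python code reach the web when Python itself is running inside a web browser (which is how the web-based Jupyter Notebook used here runs your code).
- patch_all() patches Python's networking libraries so that requests for online content, such as the Excel file used in this lesson, work smoothly from the browser.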
Step 2: Getting extra tools.
import micropip
await micropip.install("openpyxl")
- micropip is like an app store for Python tools.
- await micropip.install("openpyxl") is like downloading and installing an app.
- openpyxl is the library that pandas uses to read Excel (.xlsx) files.
Step 3: Import the pandas Python library.
import pandas as pd
- This line imports the pandas library and gives it the alias pd. It is like saying, “Hey, I want to use pandas, but I will call it pd for short.”
Step 4: Define the URL of the Excel file.
- This line sets the variable url to the web address where the Excel file is stored. The file is hosted on GitHub.
Step 5: Read the Excel file.
donations = pd.read_excel(url)
- This line reads the Excel file from the URL and stores the data in a pandas DataFrame named donations.
Step 6: Print the data.
print(donations)
- This line prints the contents of the donations DataFrame.
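The loading steps above can be gathered into one small helper. This is a minimal sketch, assuming pandas and openpyxl are already installed; the function name load_excel is hypothetical and not part of the lesson, and the url argument stands in for the address defined in Step 4. Outside a browser-based notebook the pyodide_http import is simply skipped, since standard Python can make web requests directly.

```python
import pandas as pd


def load_excel(url: str) -> pd.DataFrame:
    """Read an Excel workbook from a URL (or local path) into a DataFrame."""
    try:
        # Only available (and only needed) when Python runs in the browser.
        import pyodide_http
        pyodide_http.patch_all()
    except ImportError:
        # Regular Python installation: no patching required.
        pass
    return pd.read_excel(url)


# Usage (requires network access), with url defined as in Step 4:
# donations = load_excel(url)
# print(donations.head())  # first five rows instead of the whole table
```

Printing donations.head() rather than the whole DataFrame is a common choice when a file has many rows.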
1b. Visualizing the Data
A scatter plot helps you visually assess whether there is a relationship between the two continuous variables you are interested in. You can see whether there is a linear trend or whether the relationship is more complex.
The code below creates a scatter plot to show the relationship between the number of shares and the amount of donations. It labels the axes and adds a title to make the plot easier to understand.
Let’s break down the steps of the code line by line.
Step 1: Import the matplotlib.pyplot Library
import matplotlib.pyplot as plt
- This line imports the matplotlib.pyplot library and gives it the nickname plt. This library is used for creating graphs and plots in Python.
Step 2: Create a Scatter Plot
plt.scatter(donations['Shares'], donations['Donations'])
- This line creates a scatter plot. It takes two columns from the donations DataFrame: Shares and Donations. Each point on the plot represents a pair of values from these columns.
Step 3: Label the X-Axis
plt.xlabel('Number of Shares')
- This line labels the x-axis of the plot as “Number of Shares.”
Step 4: Label the Y-Axis
plt.ylabel('Amount of Donations ($)')
- This line labels the y-axis of the plot as “Amount of Donations ($).”
Step 5: Add a Title to the Plot
plt.title('Shares vs. Donations')
- This line adds the title “Shares vs. Donations” to the plot.
Step 6: Display the Plot
plt.show()
- This line displays the plot on the screen.
Based on the scatter plot, you can conclude there is a strong positive correlation between the explanatory and response variables. Recall that a strong positive correlation between two variables in a scatter plot means that as one variable increases, the other variable also tends to increase.
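A visual impression of correlation can be backed up with a number. The sketch below computes the Pearson correlation coefficient with the pandas Series.corr method. The five rows are hypothetical stand-in values, not the lesson's data; with the real data you would call the same method on donations['Shares'] and donations['Donations'].

```python
import pandas as pd

# Hypothetical stand-in values, NOT the lesson's data.
example = pd.DataFrame({
    "Shares": [10, 20, 30, 40, 50],
    "Donations": [1100, 1150, 1200, 1300, 1350],
})

# Pearson correlation coefficient: values near +1 indicate a strong
# positive linear relationship, values near -1 a strong negative one.
r = example["Shares"].corr(example["Donations"])
print(f"Correlation between shares and donations: {r:.3f}")
```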
1c. Build the Simple Linear Regression Model
Now, you are ready to build a simple linear regression model for the amount of donations the non-profit will receive based on the number of social media shares a campaign receives.
The code below performs a simple linear regression analysis using the statsmodels library. It defines the explanatory variable (number of shares) and the response variable (amount of donations) from a DataFrame, adds a constant to include an intercept, and builds the regression model using the Ordinary Least Squares (OLS) method. Finally, it fits the model and prints a summary, which includes coefficients and the R-squared value, providing insights into the relationship between the number of shares and the amount of donations.
Let’s break down the steps of the code line by line.
Step 1: Import the statsmodels Library
import statsmodels.api as sm
- This line imports the statsmodels library and gives it the nickname sm. This library is used for statistical modeling.
Step 2: Define the Response and Explanatory Variables
x = donations['Shares']
y = donations['Donations']
- In this example, x is defined as the explanatory variable (number of shares), and y is defined as the response variable (amount of donations). These variables are extracted from the DataFrame donations.
Step 3: Add a Constant to the Explanatory Variable
x = sm.add_constant(x)
- Recall the simple linear regression model is given by \(\hat{y} = b_0 + b_1 x\). This step adds a column of ones to x so that the model estimates the intercept term, \(b_0\).
Step 4: Build the Simple Linear Regression Model
model = sm.OLS(y, x).fit()
- This line builds the simple linear regression model using the Ordinary Least Squares (OLS) method. The fit() method is called to estimate the parameters of the model.
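For a single explanatory variable, the least squares estimates have a simple closed form: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept follows from the means. The sketch below implements those formulas in plain Python to illustrate what fit() estimates. The function name ols_fit and the sample values are hypothetical, and statsmodels itself uses a more general matrix-based routine.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical stand-in values, NOT the lesson's data.
example_shares = [10, 20, 30, 40, 50]
example_donations = [1100, 1150, 1200, 1300, 1350]

b0, b1 = ols_fit(example_shares, example_donations)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```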
Step 5: Print the Model Summary
print(model.summary())
- This line prints a summary of the regression model. The summary includes the coefficients, the R-squared value (which indicates how well the model explains the variability of the data), and other statistical information.
The summary of the regression model is provided below.
1d. Interpreting the Results of the Simple Linear Regression Model
Let's explore the simple linear regression model output below, where key areas have been highlighted to guide your interpretation and understanding of the model's results.
1. R-squared
\(R^2\) is a statistic that tells us how well one variable explains or predicts another. It measures the proportion of variation in the response variable (e.g., donation amounts) that can be explained by the explanatory variable (e.g., social media shares). An \(R^2\) value close to 1 means the explanatory variable is doing an excellent job at prediction. For instance, with an \(R^2\) of 94.1%, we can say that 94.1% of the variability in the donation amounts is explained by the number of social media shares a campaign receives. This suggests that social media shares are a strong predictor of donation amounts.
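The "proportion of variation explained" can be computed by hand as \(R^2 = 1 - SSE/SST\), where SSE is the sum of squared differences between observed and predicted values and SST is the sum of squared differences between observed values and their mean. The sketch below applies this to hypothetical stand-in values and a hypothetical fitted line (intercept 1025, slope 6.5); none of these numbers come from the lesson's data.

```python
def r_squared(observed, predicted):
    """Proportion of the variation in `observed` explained by `predicted`."""
    mean_obs = sum(observed) / len(observed)
    # Unexplained variation: observed vs. predicted.
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    # Total variation: observed vs. their mean.
    sst = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - sse / sst


# Hypothetical stand-in values and fitted line, NOT the lesson's data.
example_shares = [10, 20, 30, 40, 50]
example_donations = [1100, 1150, 1200, 1300, 1350]
predicted = [1025 + 6.5 * x for x in example_shares]

r2 = r_squared(example_donations, predicted)
print(f"R-squared = {r2:.3f}")
```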
2. The coef column has two parts:
- const: this is the intercept for the model, \(b_0\) in the regression equation
- Shares: this is the slope value (coefficient) for the model, \(b_1\) in the regression equation
The slope value, \(b_1\), is 6.97.
The simple linear regression model is \(\hat{y} = 1002.73 + 6.97x\).
The slope value, \(b_1\), is used to interpret the simple linear regression model. The slope value represents the change in the response variable for a one-unit change in the explanatory variable.
For example, the \(b_1\) value of 6.97 means that for every additional share, the predicted amount of donations is expected to increase by $6.97.
This slope value indicates a positive relationship between the number of shares and the amount of donations, suggesting that as the number of shares increases, the donations tend to increase as well.
In most analyses, the primary interest lies in understanding how the explanatory variable (x) influences the response variable (y), which is captured by the slope rather than the intercept. However, the intercept, \(b_0\), can sometimes have a meaningful interpretation. For example, in healthcare, the intercept might represent the baseline costs of care without considering any additional treatments or factors. Another example would be predicting air quality index (AQI) based on the number of vehicles on the road. If the intercept is 30, it means that with zero vehicles, the AQI is expected to be 30. This represents the baseline air quality without vehicle emissions.
In this example, \(b_0 = 1002.73\) provides a baseline value. When the number of social media shares (x) is 0, the predicted amount of donations is $1,002.73.
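Once the intercept and slope are known, making a prediction is a single line of arithmetic. The sketch below plugs the rounded coefficients reported above (1002.73 and 6.97) into the regression equation. The function name predict_donations and the share counts passed to it are hypothetical, and predictions for share counts far outside the range of the observed campaigns should be treated with caution.

```python
# Rounded coefficients reported in the lesson's regression output.
INTERCEPT = 1002.73  # b0: predicted donations when shares = 0
SLOPE = 6.97         # b1: predicted change in donations per additional share


def predict_donations(shares: float) -> float:
    """Predicted donation amount (in dollars) for a given number of shares."""
    return INTERCEPT + SLOPE * shares


# Hypothetical share counts chosen for illustration.
for shares in (0, 100, 250):
    print(f"{shares} shares -> ${predict_donations(shares):,.2f}")
```

With a fitted statsmodels model, model.predict gives the same kind of result using the unrounded coefficients.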
You have successfully applied a simple linear regression model to predict the amount of donations based on the number of social media shares a donation campaign receives using Python. The non-profit can use this model to set realistic fundraising targets and develop effective strategies to achieve them, ultimately enhancing their planning and forecasting capabilities.
1e. Hands-On Practice
Now, it is your turn to build and apply a simple linear regression model in Python!
You are a data analyst working on an analytical project. Your task is to build a simple linear regression model that can predict sales revenue based on marketing spend using Python.
The sales_revenue.xlsx file contains 10 monthly observations of historical sales revenue and marketing spend:
https://raw.githubusercontent.com/sophiaAcademics/BDA_Excel/main/StudentData/Tutorials/Unit5/5.1.2/sales_revenue.xlsx
Each row in the data represents the amount spent on marketing and the sales revenue for each month. The two columns are described below:
- Marketing_Spend: The amount of money spent on marketing (in dollars) each month.
- Sales_Revenue: The revenue generated from sales (in dollars) each month.
Perform the following.
- Import the Excel file and create a pandas DataFrame named sales_revenue.
- Use Python to build the simple linear regression model.
- Interpret the \(R^2\) value.
- Interpret the slope of the model.
Solution:
1. The code below will create a pandas DataFrame named sales_revenue and print the data.
2. The code below will build the simple linear model, where Sales_Revenue is the response variable and Marketing_Spend is the explanatory variable.
3. The \(R^2\) value from the regression output is 99.4%. This means 99.4% of the variability in sales revenue can be explained by the amount spent on marketing.
4. The slope for the regression model is 1.26. For every additional dollar spent on marketing, the predicted amount of sales revenue is expected to increase by $1.26.
In this lesson, you mastered the basics of building a simple linear regression model in Python. You learned how to import an Excel file from an external website, create a pandas DataFrame, and visualize the data in Python. You then built and interpreted a simple linear regression model using a real-world example from a non-profit organization to predict donation amounts based on social media shares. Additionally, you gained hands-on practice by working through another example to reinforce your understanding of completing these data analysis tasks in Python.