The Multiple Linear Regression Model

Author: Sophia

what's covered
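In this lesson, you will learn how to build and predict with a multiple linear regression model using Python. Specifically, this lesson will cover:

  1. Introduction to Multiple Linear Regression
    a. Multiple Linear Regression
    b. Build the Multiple Linear Regression Model
    c. Interpret the Multiple Linear Regression Model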

1. Introduction to Multiple Linear Regression

In previous tutorials, you learned how to build a simple linear regression model in Python to predict the amount of donations based on the number of social media shares a donation campaign receives. This was a good starting point for understanding the relationship between two variables.

Now, let’s take it a step further and explore multiple linear regression. While simple linear regression involves one predictor variable, multiple linear regression allows us to include multiple explanatory variables to better understand and predict the response variable.

Imagine you are still working with the non-profit organization, but now you have additional data that might predict donation amounts. For example, besides the number of social media shares, you also have information on the amount spent on marketing and the number of email subscribers. By incorporating these additional explanatory variables, you can build a more comprehensive model that captures the combined effect of all these factors on the donation amounts.

By the end of this tutorial, you will be equipped with the skills to build and interpret multiple linear regression models, enabling you to make more informed decisions based on a broader set of data.

Let’s get started on this exciting journey into multiple linear regression!

1a. Multiple Linear Regression

In previous tutorials, you learned about simple linear regression, which helps you understand the relationship between a response and an explanatory variable. For example, you used the number of social media shares to predict the amount of donations a campaign receives. The equation for simple linear regression looks like this:

\[ \hat{y} = b_0 + b_1 x \]

where:
  • \(\hat{y}\) is the predicted value of the response variable
  • \(x\) is the explanatory variable
  • \(b_0\) is the y-intercept
  • \(b_1\) is the slope
Now, let’s extend this idea to multiple linear regression. Multiple linear regression allows you to use more than one explanatory variable to predict the response variable. For example, besides social media shares, you might also consider the amount spent on marketing and the number of email subscribers. The equation for multiple linear regression looks like this:

\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 \]

In this equation:
  • \(\hat{y}\) is still the predicted value of the response variable (amount of donation).
  • \(x_1\), \(x_2\), and \(x_3\) are the explanatory variables (number of social media shares, amount spent on marketing, and number of email subscribers, respectively).
  • \(b_0\) is the y-intercept.
  • \(b_1\), \(b_2\), and \(b_3\) are the coefficients that show the impact of the explanatory variables on the response variable.
Multiple linear regression helps you understand how several explanatory variables together predict the response variable. By including more explanatory variables, you can better understand how each one relates to the response variable once the others are accounted for.

For example, using multiple linear regression, you can predict the amount of donations more accurately by considering not just social media shares but also how much money was spent on marketing and how many email subscribers the fundraising campaign has.

This method is powerful because it gives you a more complete picture and helps you make better decisions based on multiple pieces of information.

The general multiple regression for p explanatory variables is given by:

\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_p x_p \]

The graph below visually illustrates the multiple linear regression model for the donation prediction example.

  • The x-axis (horizontal) represents the first explanatory variable, \(x_1\), the number of social media shares.
  • The y-axis (vertical) represents the second explanatory variable, \(x_2\), the amount spent on marketing.
  • The z-axis (depth) represents the response variable, \(y\), the amount of donations received for a particular fundraising campaign.
  • Each dot on the graph represents an actual data point from the data set. These dots show the actual values for number of social media shares, marketing spend, and donation amount.
    • The color of each dot represents the number of email subscribers (\(x_3\)). Lighter colors mean fewer subscribers, and darker colors mean more subscribers.
  • The blue surface (hyperplane) represents the multiple regression model. That is, the hyperplane represents predictions made by the multiple linear regression model.
  • This surface helps us see the overall trend and how changes in social media shares and marketing spend might affect donations.

1b. Build the Multiple Linear Regression Model

Now, you are ready to build a multiple linear regression model for the amount of donations the non-profit will receive, based on the number of social media shares a campaign receives, the amount of marketing spent on the fundraising campaign, and the number of email subscribers there are for each campaign.

The data is contained in an Excel file stored on GitHub named:
https://raw.githubusercontent.com/sophiaAcademics/BDA_Excel/main/StudentData/Tutorials/Unit5/5.1.4/donations_multiple_regression.xlsx

The data has four variables, as described below:

  • Donations – response variable (\(y\))
  • Shares – number of social media shares each fundraising campaign receives (\(x_1\))
  • Marketing_Amount – amount spent on marketing for each fundraising campaign (\(x_2\))
  • Num_Email_Subscribers – number of email subscribers for each fundraising campaign (\(x_3\))
Each row in the data represents a different fundraising campaign.

The code below imports pandas, imports an Excel file from a URL, creates a pandas DataFrame, and displays the first five rows of the DataFrame.
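The original code listing did not survive in this copy of the tutorial. The following is a minimal sketch of what the paragraph above describes, assuming pandas is installed along with an Excel engine such as openpyxl. The names `DONATIONS_URL` and `load_campaign_data` are illustrative and are not taken from the original code.

```python
import pandas as pd

# URL of the Excel file given in this tutorial.
DONATIONS_URL = (
    "https://raw.githubusercontent.com/sophiaAcademics/BDA_Excel/main/"
    "StudentData/Tutorials/Unit5/5.1.4/donations_multiple_regression.xlsx"
)


def load_campaign_data(source=DONATIONS_URL):
    """Read an Excel workbook from a URL or local path into a DataFrame."""
    # pandas.read_excel accepts URLs as well as file paths; reading .xlsx
    # files requires an engine such as openpyxl to be installed.
    return pd.read_excel(source)
```

Calling `donations = load_campaign_data()` and then `donations.head()` would download the file and display the first five rows, since `head()` returns five rows by default. The variable name `donations` is likewise a placeholder.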

did you know
The key differences between the code for this tutorial and the code you used previously to import an Excel file and build a regression model in Python are:
  • Data Source: The previous code reads from donations.xlsx, while the current code reads from donations_multiple_regression.xlsx.
  • Type of Regression:
    • Previous Code: Performs a simple linear regression with one explanatory variable (Shares).
    • Current Code: Performs a multiple linear regression with three explanatory variables (Shares, Marketing_Amount, Num_Email_Subscribers).

The summary of the regression model is provided below.

key concept
Although a residual analysis was not performed for this model, it is essential to conduct one whenever you fit a multiple linear regression. Residual analysis checks whether the assumptions of the regression model are met and helps identify potential issues with the model, such as outliers. The importance and methods of residual analysis were covered in a previous tutorial.
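As a reminder of what a residual is in the multiple regression setting, the sketch below computes residuals (actual minus predicted) and flags unusually large ones. It is only one simple check and may differ from the methods in the earlier tutorial. The data are synthetic with one planted outlier, NumPy's least-squares solver stands in for the regression library, and the function name and the threshold of 2 are assumptions made for illustration.

```python
import numpy as np


def residuals_and_flags(X, y, threshold=2.0):
    """Return OLS residuals and a boolean mask of unusually large ones."""
    design = np.column_stack([np.ones(len(y)), X])  # column of ones = intercept
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs  # actual minus predicted
    # Scale by the residual standard deviation, using one degree of
    # freedom per estimated coefficient.
    standardized = residuals / residuals.std(ddof=design.shape[1])
    return residuals, np.abs(standardized) > threshold


# Synthetic data (not the tutorial's data) with an outlier planted at row 5.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(30, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, 30)
y[5] += 10.0

residuals, flagged = residuals_and_flags(X, y)
print("flagged rows:", np.flatnonzero(flagged))
```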

1c. Interpret the Multiple Linear Regression Model

Let's explore the multiple linear regression model output below, where key areas have been highlighted to guide your interpretation and understanding of the model's results.

1. \(R^2\) (Coefficient of Determination)

In a previous tutorial, when you used just the number of social media shares to predict the donation amount, the \(R^2\) value was 94.1%. This means that 94.1% of the variation in donation amounts could be explained by social media shares alone.

Now, after adding marketing spend and the number of email subscribers to the model, the \(R^2\) value increased slightly to 94.2%. This tells you several things:

  • Small Improvement: Adding marketing spend and number of email subscribers improved the model's ability to explain the variation in donations by only 0.1 percentage points. This means these additional variables do not add much extra explanatory power beyond what the number of social media shares already provides.
  • Dominant Predictor: The number of social media shares is a very strong predictor of donation amounts. The small increase in \(R^2\) suggests that most of the predictive power comes from the number of social media shares, and the other explanatory variables (marketing spend and email subscribers) have a much smaller impact.
  • Model Complexity: Even though the model is slightly better with the additional variables, the improvement is minimal. This might suggest that a simpler model with just social media shares could be nearly as effective and easier to interpret.
In the next tutorial, you will learn how to perform statistical inference in a multiple regression model. This means you will determine if each explanatory variable (like marketing spend and email subscribers) is helpful in predicting the response variable (donation amount). This will help you understand not just how much each variable contributes, but also if their contributions are statistically significant.

In summary, while adding more variables can sometimes improve a model, in this case, the number of social media shares is already doing most of the work in predicting donation amounts. You will soon learn how to test if the additional variables are truly useful predictors.
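To make the comparison concrete, \(R^2\) can be computed as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below uses synthetic data (not the tutorial's donation data) and NumPy's least-squares solver in place of the regression library; all names are illustrative. It fits one model with a single predictor and one with three.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])  # column of ones = intercept
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    ss_res = np.sum(residuals ** 2)        # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Synthetic data, made up for illustration only.
rng = np.random.default_rng(42)
n = 50
x1 = rng.uniform(0, 100, n)  # plays the role of the dominant predictor
x2 = rng.uniform(0, 50, n)
x3 = rng.uniform(0, 500, n)
y = 5.0 + 8.0 * x1 + 0.1 * x2 + 0.01 * x3 + rng.normal(0, 10, n)

r2_simple = r_squared(x1.reshape(-1, 1), y)
r2_full = r_squared(np.column_stack([x1, x2, x3]), y)
print(f"one predictor:    {r2_simple:.4f}")
print(f"three predictors: {r2_full:.4f}")
```

For least-squares fits with an intercept, adding explanatory variables cannot lower \(R^2\), which is why a small increase on its own says little about whether the extra variables are useful.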

2. coef column

  • const – this is the intercept for the model, \(b_0\) in the regression equation \(\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3\).
  • Shares – this is the coefficient \(b_1\) in the regression equation \(\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3\). The value of \(b_1\) is 7.85.
  • Marketing_Amount – this is the coefficient \(b_2\) in the regression equation \(\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3\). The value of \(b_2\) is -1.19.
  • Num_Email_Subscribers – this is the coefficient \(b_3\) in the regression equation \(\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3\). The value of \(b_3\) is 0.82.
The multiple linear regression model is \(\hat{y} = -0.0078 + 7.85 x_1 - 1.19 x_2 + 0.82 x_3\).
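Once the equation is written out, a prediction is just arithmetic. A small sketch using the coefficients reported above; the function name and the campaign values are invented for illustration.

```python
# Coefficients reported in the tutorial's regression output.
B0 = -0.0078  # const (intercept)
B1 = 7.85     # Shares
B2 = -1.19    # Marketing_Amount
B3 = 0.82     # Num_Email_Subscribers


def predict_donation(shares, marketing_amount, num_email_subscribers):
    """Predicted donation amount from the fitted equation."""
    return B0 + B1 * shares + B2 * marketing_amount + B3 * num_email_subscribers


# Hypothetical campaign (values invented for illustration):
# 100 shares, $50 of marketing spend, 200 email subscribers.
prediction = predict_donation(100, 50, 200)
print(round(prediction, 2))  # 889.49
```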

Interpreting the Model:

Each coefficient in the model quantifies the relationship between the corresponding explanatory variable and the response variable while accounting for the effects of the other variables in the model. This allows you to understand the individual impact of each explanatory variable on the predicted amount of donation. For example:

  • Coefficient for Shares (\(b_1\)): This coefficient (7.85) indicates the change in the predicted amount of donation for each additional social media share, holding all other variables constant. Specifically, for each additional social media share, the predicted amount of donation increases by $7.85.
  • Coefficient for Marketing_Amount (\(b_2\)): This coefficient (-1.19) shows the change in the predicted amount of donation for each additional unit of money (dollar in this case) spent on marketing, holding all other variables constant. Here, for each additional dollar spent on marketing, the predicted amount of donation decreases by $1.19. This negative coefficient might suggest that beyond a certain point, additional marketing expenditure does not translate into higher donations, or it could indicate inefficiencies in the marketing strategy.
  • Coefficient for Num_Email_Subscribers (\(b_3\)): This coefficient (0.82) represents the change in the predicted amount of donation for each additional email subscriber, holding all other variables constant. Thus, for each additional email subscriber, the predicted amount of donation increases by $0.82.
did you know
When interpreting the coefficients of a multiple linear regression model, it is important to include the phrase "holding all other variables constant" to clarify that the effect of each explanatory variable on the response variable is being isolated. This phrase ensures that the interpretation of each coefficient reflects the unique contribution of that specific explanatory variable, without the influence of the other explanatory variables in the model.

Here's why this is important:

  1. Isolation of Effects: In a multiple regression model, several explanatory variables are included. By holding all other explanatory variables constant, you can isolate the effect of one variable at a time, making it clear how much that particular variable contributes to the response variable.
  2. Accurate Interpretation: This approach provides a more accurate and meaningful interpretation of each coefficient. It tells you how the response variable changes with a one-unit change in the explanatory variable of interest, presuming that the other variables do not change.
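The idea can be checked numerically with the fitted equation from this lesson. In the sketch below, two hypothetical campaigns differ by exactly one social media share while marketing spend and email subscribers stay fixed; the campaign values and the function name are invented for illustration.

```python
def predict_donation(shares, marketing_amount, num_email_subscribers):
    # Fitted equation from this lesson's regression output.
    return (-0.0078 + 7.85 * shares
            - 1.19 * marketing_amount
            + 0.82 * num_email_subscribers)


# Two hypothetical campaigns that differ only by one social media share.
base = predict_donation(100, 50, 200)
one_more_share = predict_donation(101, 50, 200)

# The gap equals the Shares coefficient because nothing else changed.
print(round(one_more_share - base, 2))  # 7.85
```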

Now, it is your turn to build and interpret a multiple linear regression model in Python!

try it
You are going to return to a similar example you have worked with previously, but now you are going to use more explanatory variables to build a multiple regression model.

You are a data analyst working on an analytical project. Your task is to build a multiple linear regression model that can predict sales revenue based on several explanatory variables using Python.

The Excel file is located here:
https://raw.githubusercontent.com/sophiaAcademics/BDA_Excel/main/StudentData/Tutorials/Unit5/5.1.4/sales_revenue_multiple_regression.xlsx

It contains the historical data of sales revenue based on marketing spend, the number of customer reviews left on a website, and the sales team size for 10 observations. Each row in the data represents the sales revenue, amount spent on marketing, number of customer reviews, and team size for each month for the last 10 months. The columns are described below:

  • Sales_Revenue: The revenue generated from sales (in dollars) each month.
  • Marketing_Spend: The amount of money spent on marketing (in dollars) each month.
  • Customer_Reviews: Number of customer reviews left on a website, like Google reviews, each month.
  • Sales_Team_Size: Number of sales team members for each month.
Perform the following:

  1. Import the Excel file and create a pandas DataFrame named sales_revenue_multiple_reg and view the first five rows of the DataFrame.
  2. Use Python to build the multiple linear regression model.
  3. Interpret the \(R^2\) value.
  4. Compare the \(R^2\) value to the \(R^2\) value for the simple linear regression that was found in a previous tutorial to predict sales revenue using marketing spend. What does the increase/decrease in the \(R^2\) value for the multiple linear regression tell you?
  5. Interpret the coefficients for each of the explanatory variables.
Solution:

1. The code below will create a pandas DataFrame named sales_revenue_multiple_reg and display the first five rows of the data.
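The solution code is missing from this copy. One possible sketch, assuming pandas with an Excel engine such as openpyxl is available, is shown here. The helper names are hypothetical, and the column check is an optional extra step that the exercise does not ask for.

```python
import pandas as pd

# URL of the Excel file given in the exercise.
SALES_URL = (
    "https://raw.githubusercontent.com/sophiaAcademics/BDA_Excel/main/"
    "StudentData/Tutorials/Unit5/5.1.4/sales_revenue_multiple_regression.xlsx"
)

# Column names as listed in the exercise description.
EXPECTED_COLUMNS = [
    "Sales_Revenue", "Marketing_Spend", "Customer_Reviews", "Sales_Team_Size",
]


def missing_columns(df):
    """Return the expected columns that are absent from df."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


def load_sales_data(source=SALES_URL):
    """Read the workbook and fail early if an expected column is missing."""
    df = pd.read_excel(source)
    absent = missing_columns(df)
    if absent:
        raise ValueError(f"missing columns: {absent}")
    return df
```

Running `sales_revenue_multiple_reg = load_sales_data()` and then `sales_revenue_multiple_reg.head()` would create the DataFrame and show its first five rows.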



2. The code below will build the multiple linear regression model, where Sales_Revenue is the response variable and Marketing_Spend, Customer_Reviews, and Sales_Team_Size are the explanatory variables.



3. The \(R^2\) value from the regression output is 99.6%. This means that 99.6% of the variability in sales revenue can be explained by the amount spent on marketing, the number of customer reviews posted on the website, and the number of team members on the sales team.

4. When only the amount spent on marketing was used, the \(R^2\) value was 99.4%. This means that marketing spend alone is already a very strong predictor of sales revenue, explaining 99.4% of the variability.

The slight increase to 99.6% when adding customer reviews and sales team size indicates that these additional explanatory variables do contribute to explaining the variability in sales revenue, but their impact is relatively small compared to marketing spend.

5. Interpreting the Model:

Using the regression output, the multiple linear regression model that predicts sales revenue is given by \(\hat{y} = 3831.64 + 1.24 x_1 + 0.33 x_2 + 20.10 x_3\).

  • Coefficient for Marketing_Spend (\(b_1 = 1.24\)): This coefficient indicates that for every additional dollar spent on marketing, the predicted sales revenue increases by $1.24, holding the number of customer reviews and the size of the sales team constant. This suggests that marketing spend has a positive association with sales revenue.
  • Coefficient for Customer_Reviews (\(b_2 = 0.33\)): This coefficient shows that for each additional customer review, the predicted sales revenue increases by $0.33, holding marketing spend and sales team size constant. This means that more customer reviews are associated with higher sales revenue, although the change per review is smaller than the change per marketing dollar.
  • Coefficient for Sales_Team_Size (\(b_3 = 20.10\)): This coefficient represents the change in predicted sales revenue for each additional member of the sales team, holding marketing spend and customer reviews constant. Specifically, for each additional sales team member, the predicted sales revenue increases by $20.10. This indicates that a larger sales team is associated with higher predicted sales revenue.

watch
Check out this video on building a multiple linear regression model that can predict sales revenue.

summary
In this lesson, you learned how to build and interpret a multiple linear regression model using Python. Using a real-world example of predicting donations for a non-profit organization, you discovered how to incorporate multiple explanatory variables (the number of social media shares, the amount spent on marketing for each fundraising campaign, and the number of email subscribers for each campaign) to enhance the prediction of donation amounts. Two tools were used to interpret the multiple linear regression model: \(R^2\) (the coefficient of determination) and the values of the coefficients from the model. \(R^2\) measures how well the explanatory variables together account for the variability in the response variable, while each coefficient quantifies the relationship between one explanatory variable and the response variable, holding the others constant. Both are important in multiple linear regression because together they describe the model's overall performance and the individual contribution of each explanatory variable, enabling more informed decision-making.

Source: THIS TUTORIAL WAS AUTHORED BY SOPHIA LEARNING. PLEASE SEE OUR TERMS OF USE.