
Steps for Data Cleaning

Author: Sophia

what's covered
In this lesson, you will be introduced to several steps that are commonly used to clean data. Specifically, this lesson will cover:

  1a. Identify and Deal with Missing Values
  1b. Managing Outliers
  1c. Remove Duplicates
  1d. Remove Irrelevant Data (Filtering)
  1e. Standardize the Data
  1f. Fix Data Errors

1. Steps for Cleaning Data

Data cleaning is essential for preparing the data for analysis. Below are several steps that can be taken to ensure you obtain clean data.

  1. Identify and deal with missing values.
  2. Determine outliers and how to handle them.
  3. Remove duplicates.
  4. Remove irrelevant data.
  5. Standardize the data.
  6. Fix data errors.
The next several sections will discuss each of these data cleaning steps.

1a. Identify and Deal with Missing Values

One of the most straightforward ways to identify missing data is to scan the data and look for missing values such as blank cells or blank values in textual responses. You can also perform a count match to detect missing values: compare the count of non-missing values in a column to the total number of rows in the data. If there is a mismatch, you have identified a column with missing values.

Once you have identified the missing values, you must determine what to do with them. Your decision will be guided by whether you can determine a reason for the missingness. There are situations in which the missing values are legitimate and contribute meaningfully to the context of the business problem. For example, it may be completely legitimate that unemployed individuals have a missing value for income. Missing data can reflect privacy concerns, such as when a patient refuses to answer sensitive questions related to their mental health. Missing data could also reflect genuine preferences, such as a customer satisfaction survey in which a customer chooses not to answer certain questions. In these situations, you want to leave the missing values as-is.

If you encounter a situation where the missingness carries meaning in the context of the business problem, an effective technique is to create an indicator variable, sometimes referred to as a dummy variable. A missing indicator variable is a binary variable that takes on the value “1” if the observation is missing and “0” if it is not. The figure below shows a missing indicator variable: the column M_Income. The naming convention is the prefix M_ followed by the name of the original column. When Income is not missing, M_Income is 0, and when Income is missing, M_Income is 1.

You can use the missing indicator variable in the analysis or in a downstream process. For example, when calculating an average, consider excluding rows where the indicator value is 1 (missing data). You can also use missing indicator variables in predictive models, where the results can provide additional information about the problem. If a missing indicator variable turns out to be important in a predictive model, it can reveal an insight that might not be obvious otherwise, for example, that customers who did not provide their income also tend to pay their bills late.
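To make this concrete, here is a minimal sketch in Python with pandas. The customer names and income values are hypothetical; the sketch performs a count match to detect the missing values, creates the M_Income indicator, and then excludes the flagged rows from an average.

```python
import pandas as pd
import numpy as np

# Hypothetical data; the Income column contains missing values.
df = pd.DataFrame({
    "Customer": ["A", "B", "C", "D"],
    "Income": [52000, np.nan, 61000, np.nan],
})

# Count match: compare non-missing values in the column to the row count.
print(df["Income"].count(), "of", len(df), "rows have an Income value")

# Missing indicator variable: 1 if Income is missing, 0 otherwise.
df["M_Income"] = df["Income"].isna().astype(int)

# Downstream use: exclude rows flagged as missing when averaging.
avg_income = df.loc[df["M_Income"] == 0, "Income"].mean()
print("Average income (non-missing rows):", avg_income)
```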

Another way of handling missing values is to impute the missing values. Missing value imputation is the process of estimating or filling in missing data values in a data set. There are several imputation techniques. Let’s explore a few of these.

  • Mean/Median Imputation. Impute the missing values with the mean or median of a continuous variable. While simple, this method assumes that the missing values are similar to the mean or median of the observed values.
  • Mode Imputation. For categorical variables, replace the missing categories with the category that occurs most frequently.
  • Model Imputation. Use a statistical model, such as a regression model, to predict the missing values. The variable with missing values becomes the one being modeled, and the other variables in the data set serve as predictors. This method requires a strong relationship between the variable with missing values and the predictors.
  • Domain-Specific Imputation. Use business knowledge or context to impute missing values. For example, impute salary based on job role or industry norms.
The table below provides scenarios and general guidelines for when you should use each of these imputation methods.

| Imputation Method | Scenario | Appropriate Use |
| --- | --- | --- |
| Mean | You have a data set of supply chain costs (total cost associated with delivering goods). Some of the costs are missing. There are no outliers, and the distribution of the costs is symmetric. | Mean imputation is suitable for replacing missing values when no outliers are present and the distribution is approximately symmetric. |
| Median | You have a data set of household incomes with several outlying observations. Some of the income values are missing. When you examine the distribution, you observe a long tail to the right or left. | Median imputation is suitable for replacing missing values when outliers are present. The median is not influenced by outliers. |
| Mode | You have a data set that describes the customer payment method (credit card, debit card, or cash). Several of the payment methods are missing. | Mode imputation is suitable for categorical data; you replace missing values with the most frequent category. This method is straightforward and preserves the mode of the data set. |
| Model Imputation | You have a data set of patient health records with missing values in several variables, and you want to predict the missing values using other available data. | Model imputation, such as using regression or machine learning models, is appropriate when you have multiple variables and can leverage the relationships between them to predict missing values. This method requires more expertise on the part of the data analyst. |
| Domain-Specific | You are analyzing daily sales data from a retail chain to understand seasonal trends and optimize inventory management. The data set includes daily sales figures, product categories, store locations, and promotional activities. However, some sales data is missing due to system outages or data entry errors. | Use historical sales patterns and promotional activity data to impute the missing sales values. This approach provides a more accurate estimate than simple mean or median imputation because it considers the specific characteristics of retail sales data, such as seasonal trends and promotional impacts, ensuring that the imputed values are realistic and consistent with known sales behaviors. |
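As a rough illustration of the first three methods in the table, here is a sketch in Python with pandas. The column names and values are made up to mirror the scenarios above.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "cost": [12.0, 15.0, np.nan, 14.0, 13.0],          # symmetric, no outliers
    "income": [40000, 42000, np.nan, 41000, 950000],   # right-skewed outlier
    "payment": ["credit", "cash", None, "credit", "debit"],
})

# Mean imputation: suitable when the distribution is symmetric with no outliers.
df["cost"] = df["cost"].fillna(df["cost"].mean())

# Median imputation: robust to the outlying income value.
df["income"] = df["income"].fillna(df["income"].median())

# Mode imputation: replace missing categories with the most frequent one.
df["payment"] = df["payment"].fillna(df["payment"].mode().iloc[0])

print(df)
```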

terms to know
Indicator Variable
A binary variable that takes on exactly one of two possible values.
Dummy Variable
Another term used for indicator variable.
Missing Indicator Variable
A binary variable that indicates whether a value is missing for a particular observation.
Impute/Imputation
The process of replacing missing or incomplete data points with substituted values.
Missing Value Imputation
Process of filling in missing values in a data set.

1b. Managing Outliers

In a previous tutorial, you learned how to detect outliers using frequency distributions, histograms, and boxplots. Once you have detected that one or more data values may be potential outliers, you must decide what to do with them. There are several strategies for dealing with outliers, as outlined below.

  • Detect Data Entry Errors. If an outlier results from a typo or measurement error, correct the value if possible. If correction is not feasible, delete the data point. Be cautious not to delete meaningful outliers.
  • Apply a Transformation. Applying a logarithmic or square root transformation to the variable can make the distribution more symmetric so that outliers have less effect on the skewness of the variable’s distribution. For example, the figure below illustrates the effect of an outlier on sales data. The histogram in the left panel shows some potential outliers among the higher values of the sales variable. Applying a logarithmic transformation (right panel) makes the distribution more symmetric. The transformation reduces the impact of outliers by compressing the scale (see the sketch after this list).



  • Remove or Do Not Remove the Outlier. Removing outliers is a subjective decision that requires careful consideration, so it is important to understand the domain and the data collection process. Some outliers are correct observations that represent natural variation in the data, and true outliers may provide important insight into the business problem and the data collection process. If you know an outlier is incorrect, consider removing it from the data; if you have ample data, removing one or two observations will not dramatically affect the results of the analysis. Investigate the cause of an outlier before you decide to remove or keep it.
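The sketch below, using made-up sales figures with one large outlier, shows how a logarithmic transformation compresses the scale and reduces the outlier’s influence.

```python
import numpy as np

# Hypothetical right-skewed sales data with one large outlier.
sales = np.array([120, 135, 150, 160, 175, 190, 210, 2500])

# The logarithm compresses the scale, pulling the outlier
# closer to the rest of the distribution.
log_sales = np.log(sales)

print("Raw range:", sales.max() - sales.min())
print("Log range:", round(log_sales.max() - log_sales.min(), 2))
```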

1c. Remove Duplicates

Duplicate data can appear in data sets for various reasons. The same data can be represented in multiple ways: variations in addresses (123 Main St. vs. 123 Main Street) or product descriptions (Manatee Inflatable Kayak vs. Mantee Inf Kay) can appear multiple times. Duplicates can also be a product of combining data from multiple systems, such as CRMs and ERPs. Standardizing data can help reduce duplicate data.

EXAMPLE

The figure below shows some employee bonus data from a company that sells solar heating panels.



Based on the first two columns, there are duplicates for the employees Elena and Malik. The name Malik appears four times, but the last appearance has a different Employee ID, so it is a duplicate based only on the Name column and not on the unique identifier Employee ID. This last instance of Malik should not be removed because it is not a true duplicate. We must remove three duplicate rows in total, keeping one green-highlighted row and one yellow-highlighted row.

The de-duplicated data is shown below without the highlighting.
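A sketch of this de-duplication in Python with pandas, using hypothetical data modeled on the figure (the Employee IDs and bonus amounts are invented). Because rows are compared on both Employee ID and Name, the last Malik row survives, and exactly three duplicate rows are removed.

```python
import pandas as pd

# Hypothetical bonus data; "Employee ID" is the unique identifier.
df = pd.DataFrame({
    "Employee ID": [101, 101, 102, 103, 103, 103, 104],
    "Name": ["Elena", "Elena", "Priya", "Malik", "Malik", "Malik", "Malik"],
    "Bonus": [500, 500, 750, 600, 600, 600, 600],
})

# Keep the first occurrence of each (Employee ID, Name) pair.
deduped = df.drop_duplicates(subset=["Employee ID", "Name"], keep="first")
print(deduped)  # 4 rows remain; 3 duplicates removed
```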

1d. Remove Irrelevant Data (Filtering)

Filtering data means removing part of the data. Filtering the data as part of the data cleaning process serves several essential purposes as outlined below.

  • Reduces Noise. Filtering helps remove noisy or irrelevant data points that might distort your analysis. Noise can arise from measurement errors, outliers, or inconsistencies in data collection. For example, if you are analyzing sales data, removing outliers ensures that your average sales figures aren’t skewed by a few unusually high or low transactions.
  • Enhances Accuracy. By filtering out irrelevant or incorrect data, you improve the overall accuracy of your analysis. Imagine you are analyzing customer reviews. Removing duplicate reviews or those in a language you do not understand ensures that your sentiment analysis results are more reliable.
  • Efficiency. Smaller data sets are easier to work with and process. Filtering allows you to focus on relevant subsets of your data. For example, if you’re analyzing monthly sales trends, you can filter out data from previous years to concentrate on recent patterns.
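As a minimal sketch of that last point, here is a filter in Python with pandas; the dates and sales figures are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-11-05", "2024-01-12", "2024-02-03", "2024-03-20"]),
    "sales": [200.0, 180.0, 220.0, 195.0],
})

# Keep only the current year's rows to focus on recent patterns.
recent = df[df["date"].dt.year == 2024]
print(recent)
```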
term to know
Filtering
Process of selecting a subset of data based on specified criteria.

1e. Standardize the Data

The last tutorial emphasized the reasons why data needs to be standardized. Data standardization is a step in the data cleaning process: when you standardize the data, you ensure that it adheres to a consistent format. This consistency allows the computer systems that process and move the data to work more efficiently and make fewer mistakes. For example, a column used to indicate the country of origin could contain both the categories “US” and “United States.” Both categories will show up in the data, but they should be analyzed as one category, so you will have to standardize the country names to something consistent, like “US.”
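A minimal sketch of this kind of standardization in Python with pandas, assuming a hypothetical country column; the mapping of variants is illustrative, not exhaustive.

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "United States", "us", "U.S.", "Canada"]})

# Map every known variant to one consistent label; values not in the
# map (e.g., "Canada") are left as they were.
country_map = {"united states": "US", "us": "US", "u.s.": "US"}
normalized = df["country"].str.lower().map(country_map)
df["country"] = normalized.fillna(df["country"])
print(df)
```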

1f. Fix Data Errors

You will need to explore your data and determine if there are any data errors. Data errors can occur due to typos, mislabeling of categories, or incorrect symbols being inserted into the data. There could be data values in a particular column that do not belong there. For example, personal names may be in a column used to designate a country.
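One simple way to surface such errors is to validate a column against a set of known-good values. Here is a sketch in Python with pandas, using a hypothetical country column where a personal name has crept in.

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "Canada", "Maria Lopez", "Mexico"]})

# Flag values that are not in the list of valid countries; a personal
# name in this column is a data error that needs correction.
valid_countries = {"US", "Canada", "Mexico"}
errors = df[~df["country"].isin(valid_countries)]
print(errors)
```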

summary
In this lesson, you saw that data cleaning involves several crucial steps to ensure the accuracy and reliability of your data set. First, identify and address missing values through methods like imputation or missing indicator variables. Next, determine and handle outliers to prevent skewed analysis. Removing duplicates and filtering out irrelevant data helps streamline the data set. Standardizing data formats ensures consistency, and fixing data errors corrects inaccuracies. Through an example in Excel, you were able to remove duplicates, analyze outliers, and filter data effectively; for instance, “Manatee Inflatable Kayak” and “Mantee Inf Kay” are duplicates that need to be identified and consolidated.

Source: This tutorial was authored by Sophia Learning. Please see our Terms of Use.

Terms to Know
Dummy Variable

Another term used for indicator variable.

Filtering

Process of selecting a subset of data based on specified criteria.

Impute/Imputation

The process of replacing missing or incomplete data points with substituted values.

Indicator Variable

A binary variable that takes on exactly one of two possible values.

Missing Indicator Variable

A binary variable that indicates whether a value is missing for a particular observation.

Missing Value Imputation

Process of filling in missing values in a data set.