The Data Lifecycle

Author: Sophia

what's covered

In this lesson, you will be introduced to the eight stages of the data lifecycle. Specifically, this lesson will cover:

1. Generation
2. Collection
3. Processing
4. Storage
5. Management
6. Analysis
7. Visualization
8. Interpretation

before you start

The data lifecycle is the journey data goes through from being generated to the final interpretation of the analysis that provides insight into a business problem. The data lifecycle has eight stages that are displayed in the figure below:

The eight stages of the data lifecycle are, in order: Generation, Collection, Processing, Storage, Management, Analysis, Visualization, Interpretation.

Understanding the data lifecycle is crucial for businesses to manage their data responsibly. It enables businesses to identify opportunities to enhance data management, increase efficiency, and improve analytical outcomes. By following the stages of the data lifecycle, businesses can effectively oversee the data they generate and collect. Let's explore each stage of the data lifecycle in detail.

term to know

Data Lifecycle

The eight stages that a business’s data will go through from generation to informed insights and interpretation.

1. Generation

Before data can be used in any meaningful manner, it must be generated. Data generation is the starting point. This can happen through various external activities such as:

Customer interactions: Data is generated from the day-to-day activities of interacting with customers through phone calls, text messages, online chats, survey responses, and complaints.
Digital interactions: Data is generated by tracking customer behavior on websites and mobile applications. Every online purchase, search query, and product review generates massive datasets that online retailers and marketplaces use to personalize recommendations and predict consumer behavior.

EXAMPLE

Social media websites, discussion boards, and chat rooms also generate data related to business products or services. For example, many car companies like Toyota have online groups with chat rooms that discuss new features or problems of a particular make of Toyota vehicle.

With the rise of the Internet of Things (IoT), data is generated from sensors embedded in products or raw materials. For example, smart appliances have sensors installed in the appliance that generate a wealth of data that provides feedback on how the customer is using the appliance.

Supplier interactions: Data is generated from external vendors. Businesses rely on suppliers (also known as vendors) to provide materials or complete tasks that are used in a company’s finished product. All the interactions with suppliers generate data such as purchase orders and invoices, time to complete requested orders, payments, returns, and any other financial interactions.

IN CONTEXT
Measuring Environmental Impact

Businesses are starting to generate data related to the environment and their impact on the environment. Businesses have several compelling reasons to reduce pollution and address climate change. First, there is regulatory pressure from government agencies. Many states have established environmental regulations and emission standards. Businesses must comply with these rules to avoid fines, legal penalties, and reputational damage. Non-compliance can lead to financial losses and hinder business operations.

Another reason businesses are generating environmental data is the expectations of a variety of business stakeholders. Investors, customers, employees, and communities increasingly expect companies to demonstrate environmental responsibility. Environmental, social, and governance (ESG) scores are gaining popularity in business. An ESG score is a number from 0 to 100 that measures a business’s sustainability practices and societal impact. A higher score indicates better performance. Businesses that prioritize sustainability are more attractive to stakeholders and can maintain positive relationships.

Climate interactions: Data is generated by sensors installed at manufacturing plant locations or on delivery trucks. These sensors measure greenhouse gas emissions, such as carbon dioxide.

IN CONTEXT
Scenario: Data Generation at the EPA

Suppose you are working as a data analyst for the EPA (Environmental Protection Agency). The EPA studies a wide range of economic issues related to climate change. You have been given six months to develop a pollution prevention dashboard. A dashboard is a collection of interactive visualizations. The dashboard will track pollutants for several food manufacturing sectors.

Pollution prevention is a method used to prevent or eliminate pollutants at their source before the need to recycle or dispose of them. The EPA has conducted a survey, and many food manufacturing businesses responded that they would share the environmental data they generate so that the EPA can provide them with a pollution prevention dashboard.

terms to know

Internet of Things (IoT)

A network of physical devices, vehicles, appliances, and other objects embedded with sensors.

Environmental, Social, and Governance (ESG) Score

A number that quantifies how attentive and aware a business is to environmental and societal issues.

Sustainability

Conducting business operations without negatively impacting the environment or the local community.

2. Collection

Not all data that is generated will be used for an analytical project. In this stage, you develop a question that addresses the business problem and can be answered using data analytics. To illustrate the importance of asking good questions of your data, consider a quote commonly attributed to Albert Einstein.

“If I had one hour to solve a problem and my life depended on it, I would use the first 55 minutes determining the proper questions to ask.”

Asking questions about the data assists in defining the objective and goals of the analytical project. You will decide on what key metrics to calculate, what type of data analytical technique to employ, and the type and quantity of visualizations necessary. Asking these questions assists in targeting the data that needs to be collected to complete the analytical task. Once the business problem is defined, the problem must be translated into an analytical task. The purpose of the analytical task is to determine what data will be collected.

IN CONTEXT
Scenario: Data Collection at the EPA

After working with the stakeholders and other members of your project team at the EPA, you have decided that one of the main visualizations in the dashboard will be a bubble map of the United States showing the concentration of pollutants for several food manufacturing sectors. Below is an example of what the bubble map might look like.

The metric used for the bubble map will be the air quality index (AQI). The AQI is a number from 0 to 500. The closer the index is to 0, the lower the level of pollutants in the air. The closer the index is to 500, the more pollutants are present in the air. The components required to calculate AQI are:

Ground-level ozone
Carbon monoxide
Sulfur dioxide
Nitrogen dioxide
Airborne particles, or aerosols

The size of the circle represents the AQI for a manufacturing plant location while the color represents the particular sector. For example, bakery is denoted by blue while dairy is represented by orange.

The food manufacturing plants have on-site sensors at their physical locations that collect these components on a daily, and sometimes hourly, basis. The EPA has reached out to the food manufacturing plants and requested they submit their data to the EPA so the dashboard can be created.

In the top right of the bubble map, there is a drop-down option for the date as shown in the figure below. For any selected day, a plant can see its AQI. This interactivity allows the plants to monitor their AQI daily.

If you were working as a data analyst for the EPA, your task would be to take the data collected from each plant, calculate the AQI metric, and construct the bubble map.
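The AQI calculation in this scenario can be sketched in Python. This is a minimal sketch, assuming the widely used approach in which each pollutant concentration is converted to a sub-index by linear interpolation between breakpoints and the largest sub-index is reported as the AQI. The breakpoint table, the names `sub_index` and `plant_aqi`, and the readings are hypothetical placeholders, not official EPA values.

```python
# Hypothetical breakpoint table: (conc_low, conc_high, index_low, index_high).
# These numbers are placeholders for illustration, not official EPA values.
BREAKPOINTS = {
    "ozone_ppm": [(0.000, 0.050, 0, 50), (0.050, 0.100, 51, 100)],
    "co_ppm": [(0.0, 4.0, 0, 50), (4.0, 9.0, 51, 100)],
}


def sub_index(pollutant, concentration):
    """Linearly interpolate a concentration onto the index scale."""
    for c_lo, c_hi, i_lo, i_hi in BREAKPOINTS[pollutant]:
        if c_lo <= concentration <= c_hi:
            slope = (i_hi - i_lo) / (c_hi - c_lo)
            return round(slope * (concentration - c_lo) + i_lo)
    raise ValueError(f"{pollutant}={concentration} is outside the table")


def plant_aqi(readings):
    """Report the largest pollutant sub-index as the plant's AQI."""
    return max(sub_index(p, c) for p, c in readings.items())


# Hypothetical one-day readings for a single plant.
print(plant_aqi({"ozone_ppm": 0.060, "co_ppm": 2.0}))  # prints 61
```

In a real project, the breakpoints would come from the agency's published tables, and every component in the list above would have its own row.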

3. Processing

Once data has been collected, it must be processed. Data processing refers to preparing the data so that a data analytical technique can be applied. Data rarely, if ever, is collected in a manner that is ready for analysis. Processing is one of the most time-consuming and least glamorous parts of completing an analytical project. A commonly cited rule of thumb is that a data analyst spends about 80% of their time on processing and 20% of their time performing the data analysis.

IN CONTEXT
Scenario: Data Processing at the EPA

You have now received all the data files from the food manufacturers' plants, and the data is a mess! Some plants sent the data in Excel files and some in text files. Some plants put all the components (ground-level ozone, carbon monoxide, etc.) in one file while others sent multiple files with a single component per file.

In addition to the data collected not being properly structured, you notice there are missing values. You will have to decide how you will handle missing values. You can choose to delete them or estimate the values using other metrics like the mean or median values.
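The two options for missing values can be sketched with Python's standard library. The readings and the helper names `drop_missing` and `fill_missing` are hypothetical, and which option is appropriate depends on the data and on how much of it is missing.

```python
from statistics import mean, median

# Hypothetical daily ozone readings (ppm) for one plant; None marks a missing value.
readings = [0.031, None, 0.029, 0.035, None, 0.030]


def drop_missing(values):
    """Option 1: delete the missing observations."""
    return [v for v in values if v is not None]


def fill_missing(values, estimator=median):
    """Option 2: estimate missing values from the observed ones."""
    fill = estimator(drop_missing(values))
    return [fill if v is None else v for v in values]


dropped = drop_missing(readings)              # 4 observed values remain
filled_median = fill_missing(readings)        # gaps filled with the median
filled_mean = fill_missing(readings, mean)    # gaps filled with the mean
```

Deleting shortens the series, while filling keeps its length but introduces estimated values, so either choice should be documented.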

You also notice there is a problem with consistency in the files the plants submitted. Some plants reported the different components (ground-level ozone, carbon monoxide, etc.) on an hourly basis and some reported the components daily. Some of the components are measured using different scales. For example, one bakery reported carbon monoxide using parts per million and a dairy reported carbon monoxide as a percentage. You will have to ensure that the values of the components are all in the same measurement scale before you calculate AQI.
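Putting the values on a common scale might look like the sketch below. It relies on the fact that 1% equals 10,000 parts per million. The function names and sample values are hypothetical, and averaging hourly values into a daily value is an assumption made here for illustration; the right aggregation depends on how the metric is defined.

```python
from statistics import mean

PPM_PER_PERCENT = 10_000  # 1% = 10,000 parts per million


def to_ppm(value, unit):
    """Convert a reading to parts per million."""
    if unit == "ppm":
        return value
    if unit == "percent":
        return value * PPM_PER_PERCENT
    raise ValueError(f"Unsupported unit: {unit}")


def daily_value(hourly_values):
    """Collapse hourly readings into one daily value (assumed: the mean)."""
    return mean(hourly_values)


# Hypothetical carbon monoxide readings from two plants.
bakery_co_ppm = to_ppm(4.2, "ppm")          # already in ppm
dairy_co_ppm = to_ppm(0.0005, "percent")    # about 5 ppm
bakery_daily = daily_value([4.0, 4.4, 4.2])  # hourly readings to one daily value
```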

Data processing also involves checking for outliers in the data. Outliers are data values outside the range of typical values. As with missing values, you must decide how you want to handle outliers. It may be a good idea to omit the observation that contains an outlier. If you make this decision, make sure you properly document what data was omitted and why.
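One common way to flag outliers is the interquartile range (IQR) rule, sketched below. The lesson does not prescribe a specific method, so the 1.5 x IQR fences, the function name `split_outliers`, and the sample values are assumptions for illustration.

```python
from statistics import quantiles


def split_outliers(values, k=1.5):
    """Separate values inside and outside the IQR fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if v < low or v > high]
    return kept, flagged


# Hypothetical daily AQI values for one plant; 120 looks suspicious.
daily_aqi = [30, 32, 31, 29, 33, 30, 120]
kept, flagged = split_outliers(daily_aqi)

# Record what was omitted and why, so it can be footnoted in the report.
omission_log = [f"Omitted {v}: outside 1.5 x IQR fences" for v in flagged]
print(flagged)  # prints [120]
```

Keeping a log like `omission_log` alongside the cleaned data makes it easier to document the omissions later.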

You will have to spend a couple of weeks cleaning and processing the data to obtain a single data file that can be used to create the bubble map.

All these decisions related to processing the data are left to the business data analyst. It is your responsibility to document all changes made to the data when formally presenting the data. Add footnotes to the report indicating what data was deleted, modified, or manipulated during the processing stage.

term to know

Outlier

A data point that significantly differs from all the other data points.

4. Storage

Choosing the appropriate type of storage for the data after it has been collected and processed will help secure the data and allow for the data to be reused for another analysis. The data sets can be stored in the cloud or on a server. Many businesses store their data in a relational database. A relational database is a way of storing data sets so that the data sets can be easily related to one another.

IN CONTEXT
Scenario: Data Storage at the EPA

The image below shows how the data collected from the food manufacturing plants might be stored in a relational database.

It is common for businesses to store their data across multiple related tables as shown in the image above. The tables are related by the column PlantID. All tables have this variable in common. This method of storing data helps reduce the amount of redundant data. Redundant data is data that appears in multiple places and can lead to inconsistencies. Splitting the data into separate tables ensures each piece of information is only stored once.

Imagine that the email address of one of your plant managers changes. If the email address were stored in multiple tables, such as the Plant and Sector tables, it would have to be updated in both tables. This practice could be error-prone and time-consuming. By keeping a particular column of data (such as email) in a single location, the risk of having different data in different places is minimized. Data updates become simpler and less error-prone.
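The idea of related tables with a single home for each piece of information can be sketched with Python's built-in sqlite3 module. Only the PlantID column and the Plant table name come from the scenario; the Reading table, the other column names, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE Plant (
    PlantID      INTEGER PRIMARY KEY,
    PlantName    TEXT NOT NULL,
    ManagerEmail TEXT NOT NULL       -- stored in exactly one place
);
CREATE TABLE Reading (
    ReadingID   INTEGER PRIMARY KEY,
    PlantID     INTEGER NOT NULL REFERENCES Plant(PlantID),
    ReadingDate TEXT NOT NULL,
    AQI         INTEGER NOT NULL
);
""")
conn.execute(
    "INSERT INTO Plant VALUES (1, 'Example Bakery', 'old.manager@example.com')"
)
conn.executemany(
    "INSERT INTO Reading (PlantID, ReadingDate, AQI) VALUES (?, ?, ?)",
    [(1, "2024-01-01", 42), (1, "2024-01-02", 57)],
)

# The manager's email changes: one UPDATE in one table is enough.
conn.execute(
    "UPDATE Plant SET ManagerEmail = ? WHERE PlantID = ?",
    ("new.manager@example.com", 1),
)

# Relating the tables through PlantID picks up the new email for every reading.
rows = conn.execute("""
    SELECT r.ReadingDate, r.AQI, p.ManagerEmail
    FROM Reading AS r
    JOIN Plant AS p ON p.PlantID = r.PlantID
    ORDER BY r.ReadingDate
""").fetchall()
```

Because the email lives only in the Plant table, no reading row needs to change when it is updated.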

terms to know

Relational Database

A container used to store data sets that are related to each other.

Redundant Data

Data that has been duplicated.

5. Management

Data management is the process of managing and accessing data for a business. The goal of data management is to make sure the data is secure and can be easily accessed by the data analyst and others when needed. Typically, there is an individual, known as a database administrator, whose sole job is to manage the business data. The database administrator designs and creates the relational databases and manages which employees have access to the data tables in the databases. The management of which employees have access to the data tables is part of data security.

terms to know

Data Management

Practice of storing, accessing, and securing data for a business.

Database Administrator

An individual who creates, manages, secures, and updates a relational database.

6. Analysis

People often use the terms analysis and analytics interchangeably, but the terms have different meanings. Analysis refers to the study or critique of some process. For example, a business analysis could be a report that describes how a company has performed compared to what stockholders were expecting. Analytics refers to applying quantitative methods like math and statistical techniques to data to glean insights that will assist a business in making better decisions. Visualization can also be an analytical technique. In the context of this course, and in business data analytics in general, the word "analysis" often refers to "analytics."

terms to know

Analysis

Process of breaking down a complex problem into something more easily understood.

Analytics

Applying mathematical and statistical methods to data to gain insight from the data.

7. Visualization

Data visualization is a subset of analytics. Visualizations are analytical tools used to graphically display data. Visualizations are powerful tools because they allow you to communicate data insights to a wide audience. When you are communicating data to a non-technical audience, data visualization should be a first consideration. Visualizations make the data more accessible. Visualizations allow you to represent the data in an appealing format that is easier for non-technical audiences to consume.

IN CONTEXT
Scenario: Visualization at the EPA

The bubble map that you are creating for the food manufacturing sectors allows the plants to quickly detect which food manufacturing sector is contributing the most to a poor AQI. This insight is derived more quickly and easily from a bubble map than from a table listing AQI metrics.

8. Interpretation

Perhaps the most valuable stage of the data lifecycle is interpretation. Interpreting the analytical results assists the business in making decisions based on data. In modern businesses that use data to enhance their operations, the gold standard is an individual who understands how data interacts with other parts of the business. Analytical interpretation includes explaining the impact of the analytical results and making recommendations to high-level managers through the lens of the business context.

IN CONTEXT
Scenario: Interpretation at the EPA

The bubble map, which is one of the visualizations in the pollution prevention dashboard, can significantly impact a manufacturing plant’s decision-making process by providing daily insights related to environmental sustainability. Your job as the data analyst is to assist the plants in interpreting the bubble map. That is, help them pay attention to the visual cues of the map, like the bubble size. You should also provide the plants with context for the bubble map. For example, explain the source of data used to construct the bubble map. In this scenario, the map was constructed using data generated at the plant’s location. However, this data may not tell the entire story. Discuss with the plants how other factors like industrial activity, traffic, weather conditions, and geographical features could also be impacting the AQI. Provide suggestions for how these other factors could be added to the dashboard to further assist with sustainability decision-making.

summary

In this lesson, you explored the eight stages of the data lifecycle—generation, collection, processing, storage, management, analysis, visualization, and interpretation. You learned about how these stages interact and flow by following a data analyst at the Environmental Protection Agency (EPA) as they moved through each stage of the data lifecycle.

Source: THIS TUTORIAL WAS AUTHORED BY SOPHIA LEARNING. PLEASE SEE OUR TERMS OF USE.

Terms to Know

Analysis: Process of breaking down a complex problem into something more easily understood.
Analytics: Applying mathematical and statistical methods to data to gain insight from the data.
Data Lifecycle: The eight stages that a business’s data will go through from generation to informed insights and interpretation.
Data Management: Practice of storing, accessing, and securing data for a business.
Database Administrator: An individual who creates, manages, secures, and updates a relational database.
Environmental, Social, and Governance (ESG) Score: A number that quantifies how attentive and aware a business is to environmental and societal issues.
Internet of Things (IoT): A network of physical devices, vehicles, appliances, and other objects embedded with sensors.
Outlier: A data point that significantly differs from all the other data points.
Redundant Data: Data that has been duplicated.
Relational Database: A container used to store data sets that are related to each other.
Sustainability: Conducting business operations without negatively impacting the environment or the local community.
