Covid-19 Data Analysis | End-To-End Data Engineering Project

Description:
In this project, I undertook a comprehensive data engineering journey focused on COVID-19 data, leveraging AWS services to create a powerful data infrastructure. My goal was to make the COVID-19 data accessible, understandable, and valuable for analysis.

Key Steps:
Data Collection and Storage: I started by downloading COVID-19 datasets from Registry of Open Data on AWS and uploaded them to an Amazon S3 bucket, establishing a Data Lake. Here is the data set url:

https://registry.opendata.aws/aws-covid19-lake/


Data Cataloging: I ran AWS Crawlers on the S3 data to extract schema and information, making the data easily discoverable and queryable.


Data Exploration: Using AWS Athena’s query editor, I conducted ad-hoc SQL queries to explore and understand the data’s insights.


Data Modeling: I constructed an Entity-Relationship (ER) Data Model by gathering valuable data insights obtained from Athena.


Dimensional Modeling: I designed a Dimensional Model with a star schema. It includes the COVID-19 Fact Table, Date Dimension Table, Region Dimension Table, and Hospital Dimension Table for structured data representation.


ETL Processing: Python was utilized to perform Extract-Load-Transform (ETL) operations on the data, generating various dataframes that are the foundation for our Dimensional Model tables.


Data Storage: The transformed data was saved in the Amazon S3 bucket for accessibility.


Data Warehousing: I established a Dimensional Schema within Snowflake, a powerful data warehouse.


Data Integration: The Dimensional Model was loaded into Snowflake, offering a robust and scalable data warehousing solution.


Data Visualization: To analyze data effectively, I connected Power BI to Snowflake, creating insightful data visualizations.

Tools and Their Use:


Amazon S3: Used for data storage, providing scalability and flexibility.
Crawler: Employed to automate schema extraction and data cataloging from S3, making data easily queryable.


Amazon Athena: Enabled ad-hoc SQL querying for data exploration and analysis.


AWS Glue: Utilized for ETL processing to transform and structure data.
Snowflake: Served as the data warehouse to store the Dimensional Model, providing robust data storage and management.


Power BI: Connected to Snowflake for data visualization and generating actionable insights.


This project demonstrates the effective use of cloud services and data engineering techniques to harness the potential of COVID-19 data, from storage to meaningful insights.

Leave a comment

Create a website or blog at WordPress.com

Up ↑

Design a site like this with WordPress.com
Get started