About Me

An Aspiring Data Professional

I am a Master’s student in Information Systems at Iowa State University with hands-on experience in data engineering and cloud computing, including AWS, Azure, and Databricks. I specialize in building scalable data pipelines and end-to-end solutions for data ingestion, processing, and analytics, including ETL processes and CI/CD pipelines. With proficiency in big data technologies such as PySpark, along with strong technical skills in Python, SQL, data warehousing, and cloud infrastructure, I am adept at automating workflows, ensuring data quality, and optimizing data pipelines for faster processing. My passion for turning data into actionable insights drives me to continuously improve processes, and I am eager to apply these skills in a data engineering role to solve real-world challenges.

Education

MS in Information Systems

08/2023 – present

Institution: Iowa State University
Location: Ames, USA

B.Tech in Computer Science and Engineering

07/2019 – 05/2023

Institution: Gandhi Institute of Technology and Management
CGPA: 8.12/10
Location: Hyderabad, India

Projects

End-to-End ELT Pipeline for Netflix Data using Azure

A comprehensive real-time data pipeline for ingesting, transforming, and validating Netflix TV shows and movies data. Utilized Azure Databricks with Delta Lake for scalable, fault-tolerant storage and Apache Spark for distributed processing. Implemented Azure Data Factory (ADF) to automate ETL workflows that move raw data from Azure Data Lake Storage (ADLS) into Databricks. Leveraged Auto Loader for continuous, parameterized real-time ingestion, removing the need for separate batch loads. Applied the Medallion Architecture (Bronze, Silver, Gold) for structured data processing and used Delta Live Tables (DLT) for automated transformations and data quality checks. Employed Unity Catalog for centralized metadata management, improving data governance and security.
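The bronze-layer ingestion with Auto Loader looks roughly like the sketch below; the ADLS container names, paths, and file format are illustrative placeholders rather than the project's actual values.

```python
# Minimal sketch of bronze-layer ingestion with Databricks Auto Loader.
# Storage paths and checkpoint locations below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_path = "abfss://raw@netflixstorage.dfs.core.windows.net/netflix_titles/"        # hypothetical ADLS source
bronze_path = "abfss://bronze@netflixstorage.dfs.core.windows.net/netflix_titles/"  # hypothetical Delta target
checkpoint = "abfss://bronze@netflixstorage.dfs.core.windows.net/_checkpoints/netflix_titles/"

# Auto Loader (cloudFiles) picks up new files incrementally as they land,
# so no separate batch ingestion job is needed.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", checkpoint)
    .option("header", "true")
    .load(raw_path)
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint)
    .trigger(availableNow=True)
    .start(bronze_path))
```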

Skills: Extract, Transform, Load (ETL) · Azure Data Factory · Azure Data Lake · Azure Databricks · Apache Spark · Apache Spark Streaming · Data Engineering · Tableau · Real-time Data · Pipelines

GitHub Repository

End-to-End Data Engineering Pipeline for Paris 2024 Games Data using Azure

Designed and implemented a pipeline using Microsoft Azure technologies to ensure seamless real-time ingestion, transformation, and processing of diverse datasets while maintaining data integrity and minimizing latency. The architecture followed a layered approach optimized for scalability, automation, and robustness. Azure DevOps served as the central hub for version control and CI/CD automation, with each pipeline component organized into repositories for efficient updates and testing. Data ingestion was managed through Azure Data Factory pipelines, which handled the extraction, loading, and pre-processing of formats such as JSON, CSV, and Parquet, with embedded validation steps to detect and handle inconsistencies. Delta Live Tables in Databricks enforced schema validation and error resolution, while real-time incremental updates minimized overhead. Data was transformed with PySpark in Databricks, leveraging distributed processing for large-scale transformations. Structured Streaming provided real-time insights, and automated orchestration through Databricks and ADF handled job scheduling and dependency management, enabling efficient and timely data processing for analytics and reporting.
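A minimal sketch of how a Delta Live Tables definition with expectations can enforce schema and quality rules in this kind of layered pipeline; the table and column names here are hypothetical, not the exact ones used in the project.

```python
# Illustrative Delta Live Tables definition for a silver-layer table.
# Table and column names are placeholders, not the project's actual schema.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Cleaned athlete records for the Paris 2024 pipeline")
@dlt.expect_or_drop("valid_athlete", "athlete_id IS NOT NULL")   # drop rows failing the rule
@dlt.expect("known_country", "country_code IS NOT NULL")         # track violations without dropping
def athletes_silver():
    # Read the bronze table as a stream so new records are processed incrementally.
    return (
        dlt.read_stream("athletes_bronze")
           .withColumn("ingested_at", F.current_timestamp())
           .dropDuplicates(["athlete_id"])
    )
```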

Skills: Data Engineering · Microsoft Azure · Continuous Integration and Continuous Delivery (CI/CD) · PySpark · Azure Databricks · Extract, Transform, Load (ETL)

GitHub Repository

Real-Time Weather Data Processing Pipeline using AWS and Snowflake

A scalable, real-time data pipeline that ingests, processes, and analyzes weather data using AWS services and Snowflake. The pipeline automatically fetches data from an external weather API at scheduled intervals via AWS Lambda. The weather data is stored in Amazon DynamoDB, a NoSQL database that enables efficient querying and storage. By utilizing DynamoDB Streams, I was able to process the stored data in real time as changes occurred, ensuring continuous updates. The processed data is streamed into Amazon S3 for secure, long-term storage. Finally, the data is loaded into Snowflake, a cloud data warehouse, where advanced analytics can be performed to derive valuable insights.
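The scheduled ingestion step could look roughly like the Lambda sketch below; the weather API endpoint, table name, and attribute names are placeholders, not the project's actual configuration.

```python
# Hedged sketch of a scheduled Lambda that pulls a weather reading and writes it
# to DynamoDB. The API URL, table name, and attributes are hypothetical.
import json
import os
import urllib.request
from datetime import datetime, timezone
from decimal import Decimal

import boto3

TABLE_NAME = os.environ.get("WEATHER_TABLE", "weather_readings")  # hypothetical table
API_URL = "https://api.example-weather.com/current?city=Ames"     # hypothetical endpoint

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)

def lambda_handler(event, context):
    # Fetch the latest reading from the external weather API.
    with urllib.request.urlopen(API_URL, timeout=10) as resp:
        payload = json.loads(resp.read())

    # DynamoDB rejects Python floats, so numeric values are stored as Decimal.
    item = {
        "city": payload.get("city", "Ames"),
        "observed_at": datetime.now(timezone.utc).isoformat(),
        "temperature_c": Decimal(str(payload.get("temperature_c", 0))),
        "humidity": Decimal(str(payload.get("humidity", 0))),
    }
    table.put_item(Item=item)
    return {"statusCode": 200, "body": json.dumps({"written": item["observed_at"]})}
```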

Skills: AWS Lambda · Amazon DynamoDB · Amazon S3 · Snowflake · AWS EventBridge · Amazon CloudWatch

GitHub Repository

Generative AI Blog Content Generator using Amazon Web Services (AWS)

Developed an automated solution to generate blog content using Amazon Bedrock and Meta’s Llama 3 70B Instruct model. The end-to-end pipeline triggers content generation via Amazon API Gateway, which invokes a Lambda function that calls Bedrock to produce roughly 200 words of blog content from a predefined prompt. The generated content is saved as a .txt file in Amazon S3 for easy access and scalability. Integrated Amazon CloudWatch to monitor and log the process, ensuring smooth operation and real-time issue detection, and used Postman to test and validate the API calls. This solution streamlines content creation, improving productivity and scalability.
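An approximate sketch of the Bedrock call behind the API endpoint; the S3 bucket name, prompt wording, and request parameters are assumptions, while the model ID corresponds to the Llama 3 70B Instruct model named above.

```python
# Approximate sketch of the Lambda that invokes Llama 3 on Amazon Bedrock and
# stores the result in S3. Bucket name and prompt are hypothetical placeholders.
import json
from datetime import datetime, timezone

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
s3 = boto3.client("s3")

BUCKET = "blog-output-bucket"                   # hypothetical S3 bucket
MODEL_ID = "meta.llama3-70b-instruct-v1:0"      # Llama 3 70B Instruct on Bedrock

def lambda_handler(event, context):
    topic = json.loads(event.get("body", "{}")).get("topic", "data engineering")
    prompt = f"Write a 200-word blog post about {topic}."

    # Llama 3 models on Bedrock take a plain prompt plus generation parameters.
    response = bedrock.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps({"prompt": prompt, "max_gen_len": 512, "temperature": 0.5}),
    )
    text = json.loads(response["body"].read())["generation"]

    # Persist the generated post as a .txt object for downstream access.
    key = f"blogs/{datetime.now(timezone.utc):%Y%m%d%H%M%S}.txt"
    s3.put_object(Bucket=BUCKET, Key=key, Body=text.encode("utf-8"))
    return {"statusCode": 200, "body": json.dumps({"s3_key": key})}
```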

Skills: AWS Lambda · Amazon Bedrock · Amazon S3 · Amazon CloudWatch · Postman API

GitHub Repository