Showing posts from May, 2023

SQL vs Pandas

Understanding when to use which. As a data scientist, I've always found pandas to be an excellent tool for data manipulation. Its seamless integration with NumPy allows for fast mathematical operations, making it a great fit for smaller datasets. However, when I started working with larger datasets that needed cleaning, my first instinct was still to use regex in pandas. My work involves datasets with millions of rows that have to be processed daily. Our data orchestration tool is Airflow, and having a single cleaning job run there for an hour wasn't an option. This is where my senior pointed out that we already use Snowflake, a cloud data warehouse that is queried with SQL and can clean millions of rows in far less time. I moved the regex into Snowflake queries, and the processing was significantly faster than running the same cleanup in pandas. While pandas is an excellent tool for smaller datasets, it's not suitable for every data manipulation task. I realized the import
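To make the comparison concrete, here is a minimal sketch of the same regex cleanup written twice: once in plain Python and once as a query string that pushes the work into the warehouse. The cleanup rule (stripping non-digits from a phone number), the table `raw_contacts`, and the column `phone` are all made up for illustration; they are not from my actual pipeline. `REGEXP_REPLACE` is a real Snowflake SQL function, but how you submit the query (connector, Airflow operator, and so on) is left out here.

```python
import re

# Hypothetical cleanup rule: keep only the digits of a phone number.
PATTERN = r"[^0-9]"


def clean_in_python(values):
    """Apply the regex value by value on the Python side.

    With pandas, the equivalent would be
    df["phone"].str.replace(PATTERN, "", regex=True).
    """
    return [re.sub(PATTERN, "", value) for value in values]


def build_snowflake_query(table, column):
    """Build a query that runs the same cleanup inside the warehouse.

    The table and column names are illustrative assumptions.
    """
    return (
        f"SELECT REGEXP_REPLACE({column}, '{PATTERN}', '') AS {column}_clean "
        f"FROM {table}"
    )


cleaned = clean_in_python(["(555) 010-2030", "555.010.2031"])
query = build_snowflake_query("raw_contacts", "phone")
```

The pattern is identical in both versions; what changes is where it runs. In the first version every row has to be pulled into the Python process before it is cleaned, while in the second only the query text leaves Python and the rows stay in the warehouse.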