SQL vs Pandas

Understanding when to use which

As a data scientist, I've always found pandas to be an excellent tool for data manipulation. Its seamless integration with NumPy allows for fast mathematical operations, making it perfect for working with smaller datasets. So when I started working with larger datasets and had to perform data-cleaning tasks, my initial approach was to use regex in pandas.
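To make that starting point concrete, here is a minimal sketch of regex-based cleaning in pandas. The `phone` column and its values are made up for illustration and are not from my actual data.

```python
import pandas as pd

# Hypothetical sample data; the column name and values are invented for illustration.
df = pd.DataFrame({"phone": [" (555) 123-4567 ", "555.987.6543", "n/a"]})

# Strip every non-digit character with a vectorized regex replacement.
df["phone_clean"] = df["phone"].str.replace(r"\D+", "", regex=True)

# Flag values that are exactly ten digits after cleaning.
df["is_valid"] = df["phone_clean"].str.fullmatch(r"\d{10}")

print(df)
```

This works well on a small frame, but every row is processed in the memory of the machine running the script, which is where larger datasets start to hurt.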

My work involves managing datasets with millions of rows that need to be processed daily. Our data orchestration tool is Airflow, but letting a job run on Airflow for an hour wasn't an option. This is when my senior pointed out that we use Snowflake, a cloud data warehouse queried with SQL, which can clean millions of rows in far less time. I used regex in Snowflake queries to clean the data, and the processing was significantly faster than running it in pandas.

While pandas is an excellent tool for smaller datasets, it's not suitable for all data manipulation tasks. I realized the importance of learning a Relational Database Management System (RDBMS), and I started with PostgreSQL.

So, when should you use SQL over pandas? If your data lives in a cloud-based database, use SQL for data transformation. If you need cleaned or processed data for analysis, use SQL for the cleaning and pandas for EDA and model building.
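The split between the two tools can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the cloud warehouse, and the `raw_orders` table, its columns, and its rows are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the cloud warehouse; table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("  Alice ", "10.50"), ("BOB", " 20 "), ("alice", "4.50"), ("Carol", None)],
)

# Cleaning happens in SQL: trim whitespace, normalize case, cast, drop missing amounts.
query = """
    SELECT LOWER(TRIM(customer)) AS customer,
           CAST(TRIM(amount) AS REAL) AS amount
    FROM raw_orders
    WHERE amount IS NOT NULL
"""
clean = pd.read_sql_query(query, conn)
conn.close()

# Exploration happens in pandas on the already-clean result.
totals = clean.groupby("customer")["amount"].sum()
print(totals)
```

Only the cleaned rows ever reach pandas, so the DataFrame stays small and the heavy lifting stays in the database.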

In conclusion, it's essential to diversify your data science toolkit and know when to use SQL and when to use pandas. In the next blog post, we will delve deeper into SQL and its functionalities. This is my personal opinion, and it may vary from person to person.
