How to compress a CSV file efficiently in just 25 lines of code
Data is perhaps the most important resource of this century. By some estimates, more than 90% of the world's data was generated in the previous 5 years. According to one estimate, Facebook alone generates 4 petabytes of data every single day, Twitter generates 12 terabytes per day, and WhatsApp generates more than 3 petabytes. According to another estimate, we produce 2.5 quintillion bytes of data every day.
A rhyme comes to mind that I would like to quote:
“Data, data everywhere, but not any data is clean”
In this world, data is considered king. All of that data is stored in huge databases in the form of rows and columns. We spend billions of dollars maintaining those huge databases.
Can we save some money by doing data compression?
Of course. Many tech giants spend millions of dollars developing software that helps them reduce the size of their data.
Recently, pointing to advances in artificial intelligence and genetics, some have claimed that by the end of 2050 we may be able to store the whole world's data in DNA.
As data scientists, we are always curious to solve problems and help the industry save some real cash.
In this article, we are going to build a CSV file compression routine in just a few lines of Python code.
Before moving to the coding part, let us go over the prerequisites.
1- pandas
It is a Python library used to load data into a data frame and work with it.
In our case, we are using a CSV file of size 617 MB, and we are going to reduce its size without affecting the quality.
2- Python
You need a basic understanding of Python to follow the code. No deep knowledge is required.
Code
Let us break the code down line by line.
Lines 1–2 import the necessary packages.
pandas: used to load the data and read it.
numpy: used for array and numerical operations.
Line 3 loads the CSV file with the help of the pandas library.
Line 4 formats the timestamp column as ‘%Y-%m-%d’.
Lines 5–6 display the first five rows of our data and the shape of our data.
Line 8 shows the size of the data by using the memory usage function.
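The original listing is not reproduced here, so the following is only a minimal sketch of the loading steps described above. The column names (`timestamp`, `sensor_id`, `reading`) and the tiny in-memory CSV are assumptions standing in for the 617 MB file, and the line numbers in the walkthrough need not line up with this sketch.

```python
import io

import numpy as np  # not needed yet; the compress function uses it later
import pandas as pd

# Stand-in for the real file: a tiny in-memory CSV with hypothetical columns.
# With a file on disk you would pass its path to pd.read_csv instead.
csv_text = """timestamp,sensor_id,reading
2020-01-01 08:15:00,1,0.52
2020-01-01 09:30:00,2,1.75
2020-01-02 10:45:00,3,2.25
"""
df = pd.read_csv(io.StringIO(csv_text))

# Parse the timestamp column, then keep only the date as '%Y-%m-%d' text.
df["timestamp"] = pd.to_datetime(df["timestamp"]).dt.strftime("%Y-%m-%d")

print(df.head())  # first five rows (only three exist in this sample)
print(df.shape)   # (number of rows, number of columns)

# Total memory used by the data frame, converted from bytes to megabytes.
size_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"Memory usage: {size_mb:.6f} MB")
```

Passing `deep=True` makes pandas count the actual size of string values, which gives a more honest figure when a column holds text.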
Now we define our compress function, to which we pass the df and set verbose to True.
Line 2 lists the numeric data types that the function looks for in our columns.
Line 3 estimates how much memory the data frame uses, in bytes, and divides it by 1024**2 to convert the figure to megabytes.
Lines 4–27 loop over all columns. If the column type is numeric, we calculate the minimum and maximum value of that column. If the column type is an integer, we check whether the minimum is greater than np.iinfo(np.int8).min and the maximum is less than np.iinfo(np.int8).max, where np.iinfo gives us the value limits of an integer dtype. If both conditions hold, we cast the entire column to np.int8. Otherwise we repeat the check for the wider types int16, int32, and int64.
If the column type is a float instead, we cast the entire column to a smaller NumPy float type.
After that, the function reports how much the memory usage has decreased.
Then we call our compress function, and it reduces the memory size by 53.1%.
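The steps above could be sketched roughly as follows. This is not the original listing, so its line numbers will not match the walkthrough, and a few details are assumptions: floats are downcast only to `np.float32` (which keeps about 7 significant digits, so some precision can be lost), the data frame is modified in place, and the `demo` data is made up for illustration.

```python
import numpy as np
import pandas as pd


def compress(df, verbose=True):
    """Downcast numeric columns to smaller dtypes. Modifies df in place."""
    start_mb = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtype
        if not isinstance(col_type, np.dtype):
            continue  # skip pandas extension dtypes (categorical, nullable, ...)
        if col_type.kind not in ("i", "f"):
            continue  # skip anything that is not a signed integer or a float
        c_min, c_max = df[col].min(), df[col].max()
        if col_type.kind == "i":
            # Try integer types from narrowest to widest and stop at the first fit.
            for int_type in (np.int8, np.int16, np.int32, np.int64):
                limits = np.iinfo(int_type)
                if c_min > limits.min and c_max < limits.max:
                    df[col] = df[col].astype(int_type)
                    break
        else:
            # Assumption: float32 is precise enough for the data at hand.
            limits = np.finfo(np.float32)
            if c_min > limits.min and c_max < limits.max:
                df[col] = df[col].astype(np.float32)
    end_mb = df.memory_usage().sum() / 1024**2
    if verbose:
        saved = 100 * (start_mb - end_mb) / start_mb
        print(f"Memory usage decreased to {end_mb:.2f} MB ({saved:.1f}% reduction)")
    return df


# Made-up demo data: every column starts as a 64-bit type.
demo = pd.DataFrame({
    "small": np.arange(100, dtype=np.int64),         # 0..99 fits in int8
    "medium": np.arange(100, dtype=np.int64) * 100,  # up to 9900 needs int16
    "ratio": np.linspace(0.0, 1.0, 100),             # float64 by default
})
demo = compress(demo)
print(demo.dtypes)
```

The saving you get depends entirely on the value ranges in your own columns, so the percentage printed for your data will differ from the one quoted above.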
Conclusion
In this blog post, we learned how to compress a CSV file efficiently. In my case, the CSV file contains only numeric values. If your CSV contains both numerical and categorical values, first convert the categorical columns into integer codes by using a label encoder, and then apply this code to reduce the memory size.
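As a rough illustration of that encoding step, the sketch below turns a text column into integer codes. It uses pandas' own `pd.factorize` rather than a separate label encoder class, and the `city` and `sales` columns are made-up example data.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"],  # hypothetical categorical column
    "sales": [10, 20, 15, 30, 25],
})

categorical_cols = ["city"]  # name the columns you want to encode
mappings = {}
for col in categorical_cols:
    # factorize numbers the labels 0, 1, 2, ... in order of first appearance.
    codes, uniques = pd.factorize(df[col])
    df[col] = codes
    mappings[col] = list(uniques)  # keep the labels so codes can be decoded later

print(df)
print(mappings)
```

Once the column holds integer codes, the numeric downcasting described earlier applies to it like any other integer column.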