Python allows you to easily process files and work with their data. Sometimes you may need to read large CSV files in Python. This is a common requirement, since most applications and services let you export data as CSV files. There are various ways to do this. In this article, we will look at the different ways to read a large CSV file in Python.
How to Read Large CSV File in Python
Here are the different ways to read a large CSV file in Python. Let us say you have a large CSV file at /home/ubuntu/data.csv. In most of these approaches, we will read the CSV file in chunks or through iterators, instead of loading the entire file into memory. When we use chunks or iterators, only a part of the file is read at a time, which uses very little memory.
1. Using Pandas
Pandas is a popular Python library that allows you to work with data in a highly optimized and sophisticated manner. One of its features allows you to read files in chunks. In this case, we specify the chunk size and pandas's read_csv function iterates through the file contents, one chunk at a time. Since it only reads a few lines at a time, this approach consumes very little memory.
Here is an example where we read 10,000 lines at a time. You can change the chunk size as per your requirements.
import pandas as pd

filename = '/home/ubuntu/data.csv'
chunksize = 10000
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # process chunk
    print(chunk)
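In practice, you usually want to aggregate results across chunks rather than just print them. Here is a minimal, self-contained sketch of that pattern: it writes a small sample CSV to a temporary file (standing in for /home/ubuntu/data.csv, and with a hypothetical column named value), then accumulates a row count and a column sum chunk by chunk.

import os
import tempfile

import pandas as pd

# Build a small sample CSV so the example runs on its own;
# in the article, this would be /home/ubuntu/data.csv.
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False)
tmp.write('value\n')
for i in range(10):
    tmp.write(f'{i}\n')
tmp.close()

total_rows = 0
total_sum = 0
# Each iteration yields a DataFrame of up to `chunksize` rows.
for chunk in pd.read_csv(tmp.name, chunksize=3):
    total_rows += len(chunk)
    total_sum += chunk['value'].sum()

os.unlink(tmp.name)
print(total_rows, total_sum)  # 10 45

Because only one chunk is in memory at a time, this works even when the full file would not fit in RAM.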
2. Using yield
A function that uses the yield keyword returns a generator rather than the actual data; values are produced only at runtime, when they are needed, which saves a lot of memory. A generator is a one-time iterator that returns values on the fly. It is really useful if you want to read a huge amount of data only once.
filename = '/home/ubuntu/data.csv'

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open(filename) as f:
    for piece in read_in_chunks(f):
        # process data
        print(piece)
In the above code, read_in_chunks reads the file 1024 bytes at a time and uses the yield keyword to return a generator instead of the actual data. Each chunk is produced only when required, so the entire file is never loaded; only one chunk is in memory at a time. We use the open function to open the file and a for loop that runs as long as the generator yields data. In each iteration, we simply print the chunk of data returned by read_in_chunks.
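Note that the generator above reads fixed-size byte chunks, which can split a CSV row in the middle. If you need whole lines per chunk instead, the same lazy pattern can be written with itertools.islice. This is a sketch, not part of the original article; it uses an in-memory io.StringIO with made-up data so it runs on its own.

import io
from itertools import islice

def read_lines_in_chunks(file_object, lines_per_chunk=1000):
    """Yield lists of up to lines_per_chunk lines until the file ends."""
    while True:
        lines = list(islice(file_object, lines_per_chunk))
        if not lines:
            break
        yield lines

# Five sample lines, chunked two at a time -> chunks of sizes 2, 2, 1.
sample = io.StringIO('a,1\nb,2\nc,3\nd,4\ne,5\n')
chunks = list(read_lines_in_chunks(sample, lines_per_chunk=2))
print(len(chunks))  # 3

With a real file, you would pass the object returned by open(filename) instead of the StringIO.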
3. Using iterators
You may also use iterators to easily read and process CSV or other files one chunk at a time. The two-argument form of the built-in iter function calls a function repeatedly until it returns a sentinel value. Here is an example.
filename = '/home/ubuntu/data.csv'

with open(filename) as f:
    def read_chunk():
        return f.read(1024)

    # iter() calls read_chunk repeatedly until it returns '' (end of file)
    for piece in iter(read_chunk, ''):
        print(piece)
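One detail worth knowing: if you open the file in binary mode, the end-of-file sentinel is b'' rather than ''. The helper function can also be replaced by functools.partial. The sketch below is self-contained; it creates a hypothetical 3000-byte temporary file rather than using the article's data.csv.

import functools
import os
import tempfile

# Hypothetical sample file of 3000 bytes, standing in for a real data file.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b'x' * 3000)
tmp.close()

with open(tmp.name, 'rb') as f:
    # In binary mode the sentinel must be b'', not ''.
    pieces = list(iter(functools.partial(f.read, 1024), b''))

os.unlink(tmp.name)
print(len(pieces))  # chunks of 1024, 1024 and 952 bytes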
4. Using Lazy Generator
In fact, since a CSV file is a line-based file, you can simply iterate over the file object to loop through the data, one line at a time. The file object returned by the open function is itself a lazy iterator and does not load the entire file into memory.
filename = '/home/ubuntu/data.csv'

with open(filename) as f:
    for line in f:
        print(line)
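Since the file is a CSV, you will usually want each line split into fields rather than raw text. The standard-library csv module does this and also reads lazily, one row at a time, so it fits the same streaming approach. A minimal sketch with made-up in-memory data (an io.StringIO stands in for the real open file):

import csv
import io

# Hypothetical sample data; with a real file you would pass the object
# from open(filename, newline='') instead of this StringIO.
sample = io.StringIO('name,score\nalice,10\nbob,20\n')
reader = csv.reader(sample)  # yields one parsed row at a time
header = next(reader)
total = sum(int(row[1]) for row in reader)
print(header, total)

Because csv.reader pulls rows from the underlying file object on demand, memory use stays small even for very large files.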
In this article, we have learnt different ways to read a large CSV file. They all work on the same principle of reading the file one chunk at a time. In fact, these techniques can be used on any file, not just CSV files. In our examples, we simply read the data and print it; you may modify them as per your requirements. Among the above methods, it is advisable to use pandas to read and work with large files, because it has been built specifically for large-scale data processing.