Create NumPy array from Text file

1. Intro

NumPy has helpful methods to create an array from text files such as CSV and TSV. In practice our data often lives in files on disk, so these methods can save a lot of development and analysis time.

numpy.loadtxt(fname, dtype=float, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes', max_rows=None)

The numpy.loadtxt() method is an efficient way to load data from text files in which every row has the same number of values.
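As a quick illustration of the defaults, here is a minimal sketch. It assumes an in-memory io.StringIO buffer as a stand-in for a real file (the sample values are hypothetical); with delimiter=None, values are split on whitespace and parsed as floats.

```python
import io

import numpy as np

# Hypothetical in-memory data standing in for a text file on disk
text = io.StringIO("1.0 2.0 3.0\n4.0 5.0 6.0\n")

# With delimiter=None (the default), values are split on whitespace
sample = np.loadtxt(text)

print("Shape: ", sample.shape)
print("Data Type: ", sample.dtype.name)
```

Every row in the buffer has three values, so the result is a regular two-dimensional array.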

2. NumPy array from CSV file

We have a CSV file with Delhi rainfall data in millimeters for every month of the years 2017 and 2018.

CSV file

12.0, 12.0, 14.0, 16.0, 19.0, 12.0, 11.0, 14.0, 17.0, 19.0, 11.0, 11.5
13.0, 11.0, 13.5, 16.7, 15.0, 11.0, 12.0, 11.0, 19.0, 18.0, 13.0, 12.5

We will create a NumPy array from the CSV file using the numpy.loadtxt() method. The method takes a delimiter argument, which lets it handle files with different separators.

#%%
import numpy as np

# Create an array from rain-fall.csv, keeping rainfall data in mm
array_rain_fall = np.loadtxt(fname="rain-fall.csv", delimiter=",")
print("NumPy array: \n", array_rain_fall)
print("Shape: ", array_rain_fall.shape)
print("Data Type: ", array_rain_fall.dtype.name)

OUTPUT:

NumPy array: 
 [[12. 12. 14. 16. 19. 12. 11. 14. 17. 19. 11. 11.5]
 [13. 11. 13.5 16.7 15. 11. 12. 11. 19. 18. 13. 12.5]]
Shape:  (2, 12)
Data Type:  float64

2.1 Error when rows have different column counts

When creating a NumPy array with the numpy.loadtxt() method, make sure every row in the CSV file has the same number of columns; otherwise the call raises an error.

Here we call numpy.loadtxt() on rain-fall-wrong.csv, a file whose rows have different column counts.

#%%
# Check error when different column counts in rows
array_rain_fall_wrong = np.loadtxt(
    fname="rain-fall-wrong.csv", delimiter=","
)

OUTPUT:

ValueError: Wrong number of columns at line 2
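If a file might be malformed, one option is to catch the ValueError rather than let the program stop. The sketch below assumes a small hypothetical ragged sample held in an io.StringIO buffer instead of rain-fall-wrong.csv; the exact wording of the error message can differ between NumPy versions.

```python
import io

import numpy as np

# Hypothetical ragged data: the second row has one value fewer than the first
ragged = io.StringIO("12.0,12.0,14.0\n13.0,11.0\n")

try:
    np.loadtxt(ragged, delimiter=",")
    loaded = True
except ValueError as err:
    # loadtxt refuses rows whose column count differs from the first row
    loaded = False
    print("Could not load data:", err)
```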

2.2 Skipping rows and columns in CSV

We can skip rows and columns while creating a NumPy array from a CSV file. This is useful when the file contains row and column names.

To do so, we pass the skiprows and usecols arguments to the loadtxt() method.

rain-fall-row-col-names.csv file:

Year, Jan, Feb, Mar, Apr, May, Jun, July, Aug, Sep, Oct, Nov, Dec
2017, 12.0, 12.0, 14.0, 16.0, 19.0, 12.0, 11.0, 14.0, 17.0, 19.0, 11.0, 11.5
2018, 13.0, 11.0, 13.5, 16.7, 15.0, 11.0, 12.0, 11.0, 19.0, 18.0, 13.0, 12.5

#%%
# Skip first row and first column
array_rain_fall_named = np.loadtxt(
    fname="rain-fall-row-col-names.csv",
    delimiter=",",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_named)
print("Shape: ", array_rain_fall_named.shape)
print("Data Type: ", array_rain_fall_named.dtype.name)

OUTPUT:

NumPy array: 
 [[12. 12. 14. 16. 19. 12. 11. 14. 17. 19. 11. 11.5]
 [13. 11. 13.5 16.7 15. 11. 12. 11. 19. 18. 13. 12.5]]
Shape:  (2, 12)
Data Type:  float64
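Listing twelve column indices by hand is easy to get wrong. As a sketch of an alternative, the indices can be built with range(). The example assumes an in-memory copy of the rain-fall-row-col-names.csv content (written without spaces after the commas) in place of the file on disk; the name month_columns is introduced here only for illustration.

```python
import io

import numpy as np

# In-memory stand-in for rain-fall-row-col-names.csv
csv_text = io.StringIO(
    "Year,Jan,Feb,Mar,Apr,May,Jun,July,Aug,Sep,Oct,Nov,Dec\n"
    "2017,12.0,12.0,14.0,16.0,19.0,12.0,11.0,14.0,17.0,19.0,11.0,11.5\n"
    "2018,13.0,11.0,13.5,16.7,15.0,11.0,12.0,11.0,19.0,18.0,13.0,12.5\n"
)

# Columns 1 to 12 hold the months; column 0 (Year) is skipped
month_columns = tuple(range(1, 13))

array_months = np.loadtxt(
    csv_text, delimiter=",", skiprows=1, usecols=month_columns
)
print("Shape: ", array_months.shape)
```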

2.3 Create NumPy array with GZipped file

Gzip is helpful in reducing the size of files, especially text files. For a file with the .gz extension, numpy.loadtxt() decompresses it first and then processes it as usual.

This works for delimited text files with any delimiter.

#%%
# Create array from gzipped csv
array_rain_fall_zip = np.loadtxt(
    fname="rain-fall-row-col-names.csv.gz",
    delimiter=",",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_zip)
print("Shape: ", array_rain_fall_zip.shape)
print("Data Type: ", array_rain_fall_zip.dtype.name)

OUTPUT:

NumPy array: 
 [[12. 12. 14. 16. 19. 12. 11. 14. 17. 19. 11. 11.5]
 [13. 11. 13.5 16.7 15. 11. 12. 11. 19. 18. 13. 12.5]]
Shape:  (2, 12)
Data Type:  float64
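To try this without an existing archive, the following sketch writes a small gzipped CSV into a temporary directory and reads it back. The file name sample.csv.gz and its contents are hypothetical, chosen only for the example.

```python
import gzip
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the .gz extension is what triggers decompression
    gz_path = os.path.join(tmp_dir, "sample.csv.gz")

    # Write a tiny gzipped CSV with a header row and a Year column
    with gzip.open(gz_path, "wt") as gz_file:
        gz_file.write("Year,Jan,Feb\n2017,12.0,12.0\n2018,13.0,11.0\n")

    # Read it back exactly as we would read an uncompressed CSV
    array_zipped = np.loadtxt(
        gz_path, delimiter=",", skiprows=1, usecols=(1, 2)
    )

print("Shape: ", array_zipped.shape)
```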

3. Create NumPy array from TSV

TSV (Tab Separated Values) files store plain-text data in tabular form. We create a NumPy array from a TSV file by passing "\t" as the delimiter argument to the numpy.loadtxt() method.

#%%
# Create array from tsv files
array_rain_fall_tab = np.loadtxt(
    fname="rain-fall-row-col-names.tsv",
    delimiter="\t",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_tab)
print("Shape: ", array_rain_fall_tab.shape)
print("Data Type: ", array_rain_fall_tab.dtype.name)

OUTPUT:

NumPy array: 
 [[12. 12. 14. 16. 19. 12. 11. 14. 17. 19. 11. 11.5]
 [13. 11. 13.5 16.7 15. 11. 12. 11. 19. 18. 13. 12.5]]
Shape:  (2, 12)
Data Type:  float64
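The same call can be tried without a file on disk. This is a minimal sketch that assumes a short hypothetical tab-separated sample held in an io.StringIO buffer.

```python
import io

import numpy as np

# Hypothetical tab-separated sample with a header row and a Year column
tsv_text = io.StringIO(
    "Year\tJan\tFeb\n"
    "2017\t12.0\t12.0\n"
    "2018\t13.0\t11.0\n"
)

# delimiter="\t" splits each line on tab characters
array_tab = np.loadtxt(tsv_text, delimiter="\t", skiprows=1, usecols=(1, 2))
print("Shape: ", array_tab.shape)
```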

4. Conclusion

In this tutorial, we learned key techniques for creating a NumPy array from data stored in plain text files such as CSV and TSV. These methods are handy both for data exploration and for developing programs.

Please download source code related to this tutorial here. You can run the Jupyter notebook for this tutorial here.

Mrityunjay
