Skip to main content

Create NumPy array from Text file

1. Intro

NumPy has helpful methods to create an array from text files like CSV and TSV. In real life our data often lives in the file system, hence these methods decrease the development/analysis time dramatically.

numpy.loadtxt(fname, dtype=, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes', max_rows=None) 

The Numpy loadtxt() method is an efficient way to load data from text files where each row has distinct value counts.

2. NumPy array from CSV file

We have a CSV file with Delhi rainfall data in millimeters for every month of years 2017 and 2018.

CSV file

12.0, 12.0, 14.0, 16.0, 19.0, 12.0, 11.0, 14.0, 17.0, 19.0, 11.0, 11.5
13.0, 11.0, 13.5, 16.7, 15.0, 11.0, 12.0, 11.0, 19.0, 18.0, 13.0, 12.5

We will create a NumPy array from a CSV file using numpy.loadtxt() method. This method takes a delimiter character, which makes it very flexible to handle files.

#%%
# Create an array from rain-fall.csv, keeping rainfall data in mm
array_rain_fall = np.loadtxt(fname="rain-fall.csv", delimiter=",")
print("NumPy array: \n", array_rain_fall)
print("Shape: ", array_rain_fall.shape)
print("Data Type: ", array_rain_fall.dtype.name)

OUTPUT

NumPy array: 
 [[12. 12. 14. 16. 19. 12. 11. 14. 17. 19. 11. 11.5]
 [13. 11. 13.5 16.7 15. 11. 12. 11. 19. 18. 13. 12.5]]
Shape:  (2, 12)
Data Type:  float64

2.1 Error when different column counts in rows

While creating NumPy array using numpy.loadtxt() method, make sure CSV rows have distinct column counts, lack of it will result in an error.

We are trying to use numpy.loadtxt() method when there is a difference in column counts in the rain-fall-wrong.csv file.

#%%
# Check error when different column counts in rows
array_rain_fall_wrong = np.loadtxt(
    fname="rain-fall-wrong.csv", delimiter=","
)

OUTPUT:

ValueError: Wrong number of columns at line 2

2.2 Skipping rows and columns in CSV

We can skip rows and columns while creating a NumPy array from CSV. It is useful when CSV contains row and column names.

We have to pass skiprows and usecols argument to loadtxt() method.

rain-fall-row-col-names.csv file:

Year, Jan, Feb, Mar, Apr, May, Jun, July, Aug, Sep, Oct, Nov, Dec
2017, 12.0, 12.0, 14.0, 16.0, 19.0, 12.0, 11.0, 14.0, 17.0, 19.0, 11.0, 11.5
2018, 13.0, 11.0, 13.5, 16.7, 15.0, 11.0, 12.0, 11.0, 19.0, 18.0, 13.0, 12.5
#%%
# Skip first row and first column
array_rain_fall_named = np.loadtxt(
    fname="rain-fall-row-col-names.csv",
    delimiter=",",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_named)
print("Shape: ", array_rain_fall_named.shape)
print("Data Type: ", array_rain_fall_named.dtype.name)

OUTPUT:

NumPy array: 
 [[12. 12. 14. 16. 19. 12. 11. 14. 17. 19. 11. 11.5]
 [13. 11. 13.5 16.7 15. 11. 12. 11. 19. 18. 13. 12.5]]
Shape:  (2, 12)
Data Type:  float64

2.3 Create NumPy array with GZipped file

Gzip is helpful in reducing the size of files, especially text. For .gz extension file, NumPy.loadtxt() automatically unzip first; before processing as usual.

We can use it for text value files with any delimiters.

#%%
# Create array from gzipped csv
array_rain_fall_zip = np.loadtxt(
    fname="rain-fall-row-col-names.csv.gz",
    delimiter=",",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_zip)
print("Shape: ", array_rain_fall_zip.shape)
print("Data Type: ", array_rain_fall_zip.dtype.name)

OUTPUT:

NumPy array: 
 [[12. 12. 14. 16. 19. 12. 11. 14. 17. 19. 11. 11.5]
 [13. 11. 13.5 16.7 15. 11. 12. 11. 19. 18. 13. 12.5]]
Shape:  (2, 12)
Data Type:  float64

3. Create NumPy array from TSV

TSV (Tab Separated Values) files are used to store plain text in the tabular form. We create a NumPy array from TSV by passing \t as value to delimiter argument in numpy.loadtxt() method.

#%%
# Create array from tsv files
array_rain_fall_tab = np.loadtxt(
    fname="rain-fall-row-col-names.tsv",
    delimiter="\t",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_zip)
print("Shape: ", array_rain_fall_zip.shape)
print("Data Type: ", array_rain_fall_zip.dtype.name)

OUTPUT:

NumPy array:
[[12. 12. 14. 16. 19. 12. 11. 14. 17. 19. 11. 11.5]
[13. 11. 13.5 16.7 15. 11. 12. 11. 19. 18. 13. 12.5]]
Shape: (2, 12)
Data Type: float64

4. Conclusion

In this tutorial, we learned about key techniques to create a NumPy array using data stored on plain text files like CSV, TSV, etc. These methods are very handy while doing data exploration as well as developing programs.

Please download source code related to this tutorial here. You can run the Jupyter notebook for this tutorial here.

Comments

Popular posts from this blog

Extend and reuse an existing AirByte destination connector

AirByte is an open-source ELT (Extract, Load, and Transformation) application. It heavily uses containerization for the deployment of its various components. On the local machine, we need docker to run it. AirByte has an impressive list of source and destination connectors available. One of my use case data destinations is the  ClickHouse data warehouse and its destination connector is not yet (2021-12-08) available. As per the documentation, It seems that creating a destination connector is a non-trivial job. It's a great idea to build an open-source ClickHouse destination connector. However, I tried avoiding the temptation to create one because of the required effort. AirByte has a  MySql destination connector available. ClickHouse provides a MySQL connector for access from any MySQL client. We need to configure Clickhouse to give support for the MySQL connector. Accessing ClickHouse from AirByte using its MySQL destination connector looks promising. However, when ...

Understanding Type Checking

A few examples of types in the context of programming language can be integer, float, character, string, array, etc.  When a program executes then data flow between instructions and values of specific types are assigned to a variable after some operation. It's important for the system to verify if the correct types are used as operands in operations. For e.g. In a sum operation, the expectation for operands to be of numeric type. The program's execution should fail in the case there is inconsistency. We can classify programming languages into two categories based as per their ability to cater to type safety: Dynamically Typed Language Statically Typed Language

Setting Clickhouse column data warehouse at Google Cloud Compute Engine VM

I didn't have a Google Cloud account associated with my email, so I signed up for one. It needs a valid Credit Card and mobile number to check if you are human. On successful sign up I get 300$ to spend within 3 months. Creating a free forever Google Cloud Compute Engine VM As per Google Cloud documentation you can have 1 non-preemptible e2-micro VM instance (1GB 2vCPU, 30GB Disk, etc.) per month free forever in some regions with some restrictions. I wanted the following stuff in my VM before I can install Clickhouse on to that: Ubuntu 20.x LTS SSH access from my machine Enabling SSH-based access to Google Compute Engine VM Step 1 Created an ssh private and public key on my mac using the following command ssh-keygen -t rsa -f ~/.ssh/gcloud-ssh-key -C mrityunjay -b 2048 Step 2 Copied the public key from the console using the following command: cat ~/.ssh/gcloud-ssh-key.pub output ssh-rsa <Gibrish :)> mrityunjay Step 3 I went to Google Cloud Console > Co...