Data Types While Reading Csv in Pandas

Pandas read_csv() tricks you should know to speed up your data analysis

Some of the most helpful Pandas tricks to speed up your data analysis

B. Chen

Importing data is the first step in any data science project. Oftentimes, you'll work with data in CSV files and run into problems at the very beginning. In this article, you'll see how to use the Pandas read_csv() function to deal with the following common problems.

  1. Dealing with different character encodings
  2. Dealing with headers
  3. Dealing with columns
  4. Parsing date columns
  5. Setting data types for columns
  6. Finding and locating invalid values
  7. Appending data to an existing CSV file
  8. Loading a huge CSV file with chunksize

Please check out my Github repo for the source code.

1. Dealing with different character encodings

Character encodings are specific sets of rules for mapping from raw binary byte strings to the characters that make up human-readable text [1]. Python has built-in support for a list of standard encodings.

Character encoding mismatches are less common today, as UTF-8 is the standard text encoding in most programming languages, including Python. However, it is definitely still a problem if you are trying to read a file with a different encoding than the one it was originally written with. You are most likely to end up with garbled text or a UnicodeDecodeError when that happens:

(Image source: Kaggle character encoding tutorial)

The Pandas read_csv() function has an argument called encoding that allows you to specify an encoding to use when reading a file.

Let's take a look at an example below:

First, we create a DataFrame with some Chinese characters and save it with encoding='gb2312'.

    df = pd.DataFrame({'name': '一 二 三 四'.split(), 'n': [2, 0, 2, 3]})
    df.to_csv('data/data_1.csv', encoding='gb2312', index=False)

Then, you should get a UnicodeDecodeError when trying to read the file with the default utf8 encoding.

    # Read it with the default encoding='utf8'
    # You should get an error
    pd.read_csv('data/data_1.csv')

In order to read it correctly, you should pass the encoding that the file was written with.

    pd.read_csv('data/data_1.csv', encoding='gb2312')
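If you don't know a file's encoding up front, one pragmatic approach is to try a short list of candidate encodings until one decodes without error. A minimal sketch (the file name, helper name, and candidate list are illustrative, not part of the Pandas API):

```python
import pandas as pd

# Write a small gb2312-encoded file so the sketch is self-contained
pd.DataFrame({'name': ['中', '文'], 'n': [1, 2]}).to_csv(
    'encoding_demo.csv', encoding='gb2312', index=False)

def read_csv_any(path, encodings=('utf8', 'gb2312', 'latin1')):
    """Try each candidate encoding until one decodes cleanly."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except (UnicodeDecodeError, UnicodeError):
            continue
    raise ValueError(f'none of {encodings} could decode {path}')

loaded, used = read_csv_any('encoding_demo.csv')
```

Note that the order of candidates matters: an encoding like latin1 accepts any byte sequence and never fails, so it should come last.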

2. Dealing with headers

Headers refer to the column names. For some datasets, the headers may be completely missing, or you might want to treat a different row as the headers. The read_csv() function has an argument called header that allows you to specify which headers to use.

No headers

If your CSV file does not have headers, then you need to set the argument header to None, and Pandas will generate integer values as headers.

For instance, to import data_2_no_headers.csv:

    pd.read_csv('data/data_2_no_headers.csv', header=None)
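Rather than keeping the auto-generated integer headers, you can also supply your own labels through the names argument. A self-contained sketch (the inlined data and column names are illustrative):

```python
import io
import pandas as pd

# A headerless CSV, inlined so the example is self-contained
raw = io.StringIO("a,10,5\nb,20,12\n")

# header=None: there is no header row in the file;
# names= supplies the column labels to use instead
df = pd.read_csv(raw, header=None, names=['product', 'cost', 'price'])
```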

Consider a different row as headers

Let's take a look at data_2.csv

    x1,      x2,    x3,    x4
    product, cost,  price, profit
    a,       10,    5,     1
    b,       20,    12,    2
    c,       30,    20,    3
    d,       40,    30,    4

It seems like more sensible column names would be product, cost, … profit, but they are not in the first row. The argument header also allows you to specify the row number to use as the column names and the start of the data. In this case, we would like to skip the first row and use the 2nd row as headers:

    pd.read_csv('data/data_2.csv', header=1)

3. Dealing with columns

When your input dataset contains a large number of columns, and you want to load a subset of those columns into a DataFrame, then usecols will be very useful.

Performance-wise, it is better because instead of loading an entire DataFrame into memory and then deleting the spare columns, we can select the columns we need while loading the dataset.

Let's use the same dataset data_2.csv and select the product and cost columns.

    pd.read_csv('data/data_2.csv',
                header=1,
                usecols=['product', 'cost'])

We can also pass the column index to usecols:

    pd.read_csv('data/data_2.csv',
                header=1,
                usecols=[0, 1])
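usecols also accepts a callable, which receives each column name and returns True for the columns to keep. This is handy when you know a naming pattern rather than the exact labels. A self-contained sketch with inlined data:

```python
import io
import pandas as pd

raw = io.StringIO("product,cost,price,profit\na,10,5,1\nb,20,12,2\n")

# Keep every column whose name starts with 'p'
df = pd.read_csv(raw, usecols=lambda col: col.startswith('p'))
```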

4. Parsing date columns

Date columns are represented as objects by default when loading data from a CSV file.

    df = pd.read_csv('data/data_3.csv')
    df.info()
    RangeIndex: 4 entries, 0 to 3
    Data columns (total 5 columns):
     #   Column   Non-Null Count  Dtype
    ---  ------   --------------  -----
     0   date     4 non-null      object
     1   product  4 non-null      object
     2   cost     4 non-null      int64
     3   price    4 non-null      int64
     4   profit   4 non-null      int64
    dtypes: int64(3), object(2)
    memory usage: 288.0+ bytes

To read the date column correctly, we can use the argument parse_dates to specify a list of date columns.

    df = pd.read_csv('data/data_3.csv', parse_dates=['date'])
    df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4 entries, 0 to 3
    Data columns (total 5 columns):
     #   Column   Non-Null Count  Dtype
    ---  ------   --------------  -----
     0   date     4 non-null      datetime64[ns]
     1   product  4 non-null      object
     2   cost     4 non-null      int64
     3   price    4 non-null      int64
     4   profit   4 non-null      int64
    dtypes: datetime64[ns](1), int64(3), object(1)
    memory usage: 288.0+ bytes

Sometimes a date is split into multiple columns, for example, year, month, and day. To combine them into a datetime, we can pass a nested list to parse_dates.

    df = pd.read_csv('data/data_4.csv',
                     parse_dates=[['year', 'month', 'day']])
    df.info()
    RangeIndex: 4 entries, 0 to 3
    Data columns (total 5 columns):
     #   Column          Non-Null Count  Dtype
    ---  ------          --------------  -----
     0   year_month_day  4 non-null      datetime64[ns]
     1   product         4 non-null      object
     2   cost            4 non-null      int64
     3   price           4 non-null      int64
     4   profit          4 non-null      int64
    dtypes: datetime64[ns](1), int64(3), object(1)
    memory usage: 288.0+ bytes

To specify a custom column name instead of the auto-generated year_month_day, we can pass a dictionary instead.

    df = pd.read_csv('data/data_4.csv',
                     parse_dates={'date': ['year', 'month', 'day']})
    df.info()

If your date column is in a different format, then you can customize a date parser and pass it to the argument date_parser:

    from datetime import datetime
    custom_date_parser = lambda x: datetime.strptime(x, "%Y %m %d %H:%M:%S")
    pd.read_csv('data/data_6.csv',
                parse_dates=['date'],
                date_parser=custom_date_parser)
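Be aware that nested lists/dicts for parse_dates and the date_parser argument were deprecated in pandas 2.0. A version-proof alternative is to read the columns as-is and combine them afterwards with pd.to_datetime, which accepts a DataFrame with year/month/day columns. A self-contained sketch with inlined, illustrative data:

```python
import io
import pandas as pd

raw = io.StringIO("year,month,day,product\n2021,1,15,a\n2021,2,20,b\n")
df = pd.read_csv(raw)

# pd.to_datetime understands a frame with year/month/day columns
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
```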

For more about parsing date columns, please check out this article.

5. Setting data types

If you want to set the data type for the DataFrame columns, you can use the argument dtype, for example:

    pd.read_csv('data/data_7.csv',
                dtype={
                    'Name': str,
                    'Grade': int
                })
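To see why this matters, here is a self-contained sketch (the column names and data are illustrative): without dtype, a numeric-looking code column is inferred as int64 and loses its leading zeros; forcing str preserves them:

```python
import io
import pandas as pd

raw = "Name,Grade,Code\nAlice,90,007\nBob,85,042\n"

# Without dtype, Code is inferred as int64 (007 becomes 7)
plain = pd.read_csv(io.StringIO(raw))

# With dtype, Code stays a string and keeps its leading zeros
typed = pd.read_csv(io.StringIO(raw),
                    dtype={'Name': str, 'Grade': int, 'Code': str})
```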

6. Finding and locating invalid values

You might get a TypeError when setting the data type using the argument dtype.

It is always useful to find and locate the invalid values when this error happens. Here is how you can find them:

    df = pd.read_csv('data/data_8.csv')
    is_error = pd.to_numeric(df['Grade'], errors='coerce').isna()
    df[is_error]
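The same pattern as a self-contained sketch, with inlined data containing one deliberately invalid value ('x' is made up for illustration):

```python
import io
import pandas as pd

raw = io.StringIO("Name,Grade\nAlice,90\nBob,x\nCarol,85\n")
df = pd.read_csv(raw)

# Coercing to numeric turns invalid entries into NaN;
# the NaN mask then locates the offending rows
is_error = pd.to_numeric(df['Grade'], errors='coerce').isna()
bad_rows = df[is_error]
```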

7. Appending data to an existing CSV file

You can specify a Python write mode in the Pandas to_csv() function. For appending data to an existing CSV file, we can use mode='a':

    new_record = pd.DataFrame([['New name', pd.to_datetime('today')]],
                              columns=['Name', 'Date'])
    new_record.to_csv('data/existing_data.csv',
                      mode='a',
                      header=False,
                      index=False)
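A runnable sketch of the append pattern, using a throwaway file (the path and columns are illustrative). header=False matters here: without it, the column names would be written again as a data row in the middle of the file:

```python
import os
import pandas as pd

path = 'append_demo.csv'  # throwaway file for the demo

# Create the initial file, then append a second record with mode='a'
pd.DataFrame([['Alice', 90]], columns=['Name', 'Grade']).to_csv(path, index=False)
pd.DataFrame([['Bob', 85]], columns=['Name', 'Grade']).to_csv(
    path, mode='a', header=False, index=False)

df = pd.read_csv(path)
os.remove(path)  # clean up the demo file
```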

8. Loading a huge CSV file with chunksize

By default, the Pandas read_csv() function will load the entire dataset into memory, and this could be a memory and performance issue when importing a huge CSV file.

read_csv() has an argument called chunksize that allows you to retrieve the data in same-sized chunks. This is especially useful when reading a huge dataset as part of your data science project.

Let's take a look at an example below:

First, let's make up a huge dataset with 400,000 rows and save it to big_file.csv

    # Make up a huge dataset
    nums = 100_000
    for name in 'a b c d'.split():
        df = pd.DataFrame({
            'col_1': [1]*nums,
            'col_2': np.random.randint(100, 2000, size=nums)
        })
        df['name'] = name
        df.to_csv('data/big_file.csv',
                  mode='a',
                  index=False,
                  header=name=='a')

Next, let's specify a chunksize of 50,000 when loading data with read_csv()

    dfs = pd.read_csv('data/big_file.csv',
                      chunksize=50_000,
                      dtype={
                          'col_1': int,
                          'col_2': int,
                          'name': str
                      })
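With chunksize set, read_csv returns an iterator (a TextFileReader) rather than a DataFrame; each iteration yields a DataFrame of at most chunksize rows. A self-contained sketch with inlined, illustrative data:

```python
import io
import pandas as pd

# 10 data rows, read back in chunks of 4
raw = io.StringIO("name,col_2\n" + "\n".join(f"a,{i}" for i in range(10)))

chunks = pd.read_csv(raw, chunksize=4)
sizes = [len(chunk) for chunk in chunks]  # row count of each chunk
```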

Let's perform some aggregations on each chunk and then concatenate the results into a single DataFrame.

    res_dfs = []
    for chunk in dfs:
        res = chunk.groupby('name').col_2.agg(['count', 'sum'])
        res_dfs.append(res)
    pd.concat(res_dfs).groupby(level=0).sum()

Let's validate the result against a solution without chunksize.

    pd.read_csv('data/big_file.csv',
                dtype={
                    'col_1': int,
                    'col_2': int,
                    'name': str
                }).groupby('name').col_2.agg(['count', 'sum'])

And you should get the same output.

That's it

Thanks for reading.

Please check out the notebook on my Github for the source code.

Stay tuned if you are interested in the practical aspects of machine learning.

Here are some related articles

  • 4 tricks you should know to parse date columns with Pandas read_csv()
  • 6 Pandas tricks you should know to speed up your data analysis
  • 7 setups you should include at the beginning of a data science project.

References

  • [1] Kaggle Data Cleaning: Character Encoding


Source: https://towardsdatascience.com/all-the-pandas-read-csv-you-should-know-to-speed-up-your-data-analysis-1e16fe1039f3
