Pandas read_csv() tricks you should know to speed up your data analysis
Some of the most helpful Pandas tricks to speed up your data analysis
Importing data is the first step in any data science project. Oftentimes, you'll work with data in CSV files and run into problems at the very beginning. In this article, you'll see how to use the Pandas read_csv()
function to deal with the following common problems.
- Dealing with different character encodings
- Dealing with headers
- Dealing with columns
- Parsing date columns
- Setting data types for columns
- Finding and locating invalid values
- Appending data to an existing CSV file
- Loading a huge CSV file with chunksize
Please check out my Github repo for the source code.
1. Dealing with different character encodings
Character encodings are specific sets of rules for mapping from raw binary byte strings to the characters that make up human-readable text [1]. Python has built-in support for a list of standard encodings.
Character encoding mismatches are less common today, as UTF-8 is the standard text encoding in most programming languages, including Python. However, it is definitely still a problem if you are trying to read a file with a different encoding than the one it was originally written in. You are most likely to end up with garbled text or a UnicodeDecodeError when that happens:
The Pandas read_csv() function has an argument called encoding that allows you to specify an encoding to use when reading a file.
Let's take a look at an example below:
First, we create a DataFrame with some Chinese characters and save it with encoding='gb2312'.
import pandas as pd

df = pd.DataFrame({'name': '一 二 三 四'.split(), 'n': [2, 0, 2, 3]})
df.to_csv('data/data_1.csv', encoding='gb2312', index=False)
Then, you should get a UnicodeDecodeError when trying to read the file with the default utf8 encoding.
# Read it with the default encoding='utf8'
# You should get an error
pd.read_csv('data/data_1.csv')
To read it correctly, you should pass the encoding that the file was written with.
pd.read_csv('data/data_1.csv', encoding='gb2312')
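If you do not know a file's encoding ahead of time, an encoding-detection library can give a good guess. Below is a minimal sketch (not from the original article) using the third-party chardet package, which you would need to install separately:

import chardet

# Read a sample of raw bytes and let chardet guess the encoding
with open('data/data_1.csv', 'rb') as f:
    guess = chardet.detect(f.read(10_000))

print(guess)  # e.g. {'encoding': 'GB2312', 'confidence': 0.99, ...}
pd.read_csv('data/data_1.csv', encoding=guess['encoding'])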
2. Dealing with headers
Headers refer to the column names. For some datasets, the headers may be completely missing, or you might want to use a different row as the headers. The read_csv() function has an argument called header that allows you to specify the headers to use.
No headers
If your CSV file does not have headers, then you need to set the argument header to None, and Pandas will generate integer labels as headers.
For instance, to import data_2_no_headers.csv:
pd.read_csv('data/data_2_no_headers.csv', header=None)
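If you already know what the columns should be called, you can also pass your own labels through the names argument instead of keeping the generated integers. A small sketch, assuming the same four columns as the other examples (the names here are illustrative):

pd.read_csv('data/data_2_no_headers.csv',
            header=None,
            names=['product', 'cost', 'price', 'profit'])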
Using a different row as headers
Let's take a look at data_2.csv:
x1, x2, x3, x4
product, cost, price, profit
a, 10, 5, 1
b, 20, 12, 2
c, 30, 20, 3
d, 40, 30, 4
It seems like more sensible column names would be product, cost, …, profit, but they are not in the first row. The argument header also allows you to specify the row number to use as the column names and the start of the data. In this case, we would like to skip the first row and use the second row as the headers:
pd.read_csv('data/data_2.csv', header=1)
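An equivalent alternative (a minimal sketch, not the article's own example) is to drop the unwanted first row with skiprows and let Pandas treat the following row as the header by default:

# Skip the first row; the next row becomes the header
pd.read_csv('data/data_2.csv', skiprows=1)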
3. Dealing with columns
When your input dataset contains a large number of columns and you want to load only a subset of those columns into a DataFrame, then usecols will be very useful.
Performance-wise, it is better because instead of loading an entire DataFrame into memory and then deleting the spare columns, we can select the columns we need while loading the dataset.
Let's use the same dataset data_2.csv and select the product and cost columns.
pd.read_csv('data/data_2.csv',
header=1,
usecols=['product', 'cost'])
We can also pass the column indices to usecols:
pd.read_csv('data/data_2.csv',
header=1,
usecols=[0, 1])
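usecols also accepts a callable that is evaluated against each column name, keeping the columns for which it returns True. This is handy when you only know a naming pattern rather than the exact names. A minimal sketch (the strip() is just a guard against stray spaces in the header):

# Keep only the columns whose name starts with 'p'
pd.read_csv('data/data_2.csv',
            header=1,
            usecols=lambda col: col.strip().startswith('p'))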
4. Parsing date columns
Date columns are represented as objects by default when loading data from a CSV file.
df = pd.read_csv('data/data_3.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   date     4 non-null      object
 1   product  4 non-null      object
 2   cost     4 non-null      int64
 3   price    4 non-null      int64
 4   profit   4 non-null      int64
dtypes: int64(3), object(2)
memory usage: 288.0+ bytes
To read the date column correctly, we can use the argument parse_dates to specify a list of date columns.
df = pd.read_csv('data/data_3.csv', parse_dates=['date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   date     4 non-null      datetime64[ns]
 1   product  4 non-null      object
 2   cost     4 non-null      int64
 3   price    4 non-null      int64
 4   profit   4 non-null      int64
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 288.0+ bytes
Sometimes a date is split into multiple columns, for example, year, month, and day. To combine them into a datetime, we can pass a nested list to parse_dates.
df = pd.read_csv('data/data_4.csv',
                 parse_dates=[['year', 'month', 'day']])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   year_month_day  4 non-null      datetime64[ns]
 1   product         4 non-null      object
 2   cost            4 non-null      int64
 3   price           4 non-null      int64
 4   profit          4 non-null      int64
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 288.0+ bytes
To specify a custom column name instead of the auto-generated year_month_day, we can pass a dictionary instead.
df = pd.read_csv('data/data_4.csv',
                 parse_dates={'date': ['year', 'month', 'day']})
df.info()
If your date column is in a different format, then you can customize a date parser and pass it to the argument date_parser:
from datetime import datetime

custom_date_parser = lambda x: datetime.strptime(x, "%Y %m %d %H:%M:%S")
pd.read_csv('data/data_6.csv',
            parse_dates=['date'],
            date_parser=custom_date_parser)
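Note that in newer Pandas versions (2.0 and later) date_parser is deprecated in favor of the date_format argument, which takes the format string directly. A sketch of the equivalent call, assuming the same data_6.csv layout:

pd.read_csv('data/data_6.csv',
            parse_dates=['date'],
            date_format="%Y %m %d %H:%M:%S")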
For more about parsing date columns, please check out this article.
5. Setting data types
If you want to set the data types for the DataFrame columns, you can use the argument dtype, for example:
pd.read_csv('data/data_7.csv',
            dtype={
                'Name': str,
                'Grade': int
            })
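dtype also accepts Pandas extension types such as 'category', which can cut memory use substantially for columns with few distinct values. A minimal sketch on the same file (assuming Name repeats often):

df = pd.read_csv('data/data_7.csv',
                 dtype={
                     'Name': 'category',
                     'Grade': int
                 })
df.memory_usage(deep=True)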
vi. Finding and locating invalid values
You might get a TypeError when setting the data type using the argument dtype. It is always useful to find and locate the invalid values when this error happens. Here is how you can find them:
df = pd.read_csv('data/data_8.csv')
is_error = pd.to_numeric(df['Grade'], errors='coerce').isna()
df[is_error]
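Once located, a common follow-up is to coerce the invalid entries to NaN and then drop or fill them before converting the column. This is one possible cleanup, not the article's own solution (the fill value 0 is an arbitrary placeholder):

# Coerce invalid entries to NaN, fill them, then convert the dtype
df['Grade'] = pd.to_numeric(df['Grade'], errors='coerce')
df['Grade'] = df['Grade'].fillna(0).astype(int)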
7. Appending data to an existing CSV file
You can specify a Python write mode in the Pandas to_csv() function. For appending data to an existing CSV file, we can use mode='a':
new_record = pd.DataFrame([['New name', pd.to_datetime('today')]],
                          columns=['Name', 'Date'])
new_record.to_csv('data/existing_data.csv',
                  mode='a',
                  header=None,
                  index=False)
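When appending repeatedly, be careful not to write the header row more than once. One common guard (a sketch, assuming the same file path) is to emit the header only when the file does not exist yet:

import os

path = 'data/existing_data.csv'
new_record.to_csv(path,
                  mode='a',
                  header=not os.path.exists(path),
                  index=False)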
8. Loading a huge CSV file with chunksize
By default, the Pandas read_csv() function will load the entire dataset into memory, and this could be a memory and performance issue when importing a huge CSV file.
read_csv() has an argument called chunksize that allows you to retrieve the data in same-sized chunks. This is especially useful when reading a huge dataset as part of your data science project.
Let's take a look at an example below:
First, let's make a huge dataset with 400,000 rows and save it to big_file.csv.
import numpy as np

# Make up a huge dataset
nums = 100_000
for name in 'a b c d'.split():
    df = pd.DataFrame({
        'col_1': [1]*nums,
        'col_2': np.random.randint(100, 2000, size=nums)
    })
    df['name'] = name
    df.to_csv('data/big_file.csv',
              mode='a',
              index=False,
              header=(name == 'a'))
Next, let's specify a chunksize of 50,000 when loading the data with read_csv():
dfs = pd.read_csv('data/big_file.csv',
chunksize=50_000,
dtype={
'col_1': int,
'col_2': int,
'name': str
})
Let's perform some aggregations on each chunk and then concatenate the results into a single DataFrame.
res_dfs = []
for chunk in dfs:
    res = chunk.groupby('name').col_2.agg(['count', 'sum'])
    res_dfs.append(res)

pd.concat(res_dfs).groupby(level=0).sum()
Let's validate the result against a solution without chunksize:
pd.read_csv('data/big_file.csv',
dtype={
'col_1': int,
'col_2': int,
'name': str
}).groupby('name').col_2.agg(['count', 'sum'])
And you should get the same output.
That's it!
Thanks for reading.
Please check out the notebook on my Github for the source code.
Stay tuned if you are interested in the practical aspects of machine learning.
Here are some related articles:
- 4 tricks you should know to parse date columns with Pandas read_csv()
- 6 Pandas tricks you should know to speed up your data analysis
- 7 setups you should include at the beginning of a data science project.
References
- [1] Kaggle Data Cleaning: Character Encoding