This is the first part of a Pandas manual that I've composed as a quick reference. I've been using it for some time for my own needs, and after some updates and changes I'm sharing it here.
The content is compiled from various sources, primarily the official Pandas documentation, as well as from Daniel Chen's video "Pandas is for Everyone" (PyCon 2019) and Kevin Markham's video "Data Science Best Practices with pandas" (PyCon 2019). I also include some tips and tricks I've learned from my own projects.
Loading Datasets into Pandas
Pandas can work with a variety of data sources. Data sources can be textual, binary or SQL. Popular data formats are:
- CSV (text)
- JSON (text)
- Microsoft Excel (binary)
- SQL (SQL)
- HTML (text)
Besides these, there are several other sources that Pandas can read directly (OpenDocument, Python Pickle, Google BigQuery and more).
For each of these types, Pandas offers read and write functions, prefixed with read_ and to_ respectively. So the popular data formats can be read and written using these functions:
- CSV: read_csv, to_csv
- JSON: read_json, to_json
- Excel: read_excel, to_excel
- SQL: read_sql, to_sql
- HTML: read_html, to_html
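As a quick illustration, here is a minimal sketch of the CSV pair; the DataFrame and the file name demo.csv are hypothetical, used only to show the round trip:
import pandas as pd
# Tiny DataFrame for demonstration purposes only
demo = pd.DataFrame({'product': ['a', 'b'], 'views': [10, 20]})
# Write it out with to_csv and read it back with read_csv
demo.to_csv('demo.csv', index=False)
demo_back = pd.read_csv('demo.csv')
demo_back.shape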
Reading CSV files
Most frequently I deal with tabular data contained in text files, for example, a dataset of weekly product views that I can import with the read_csv function.
product_pd = pd.read_csv(path + 'product_views.csv')
product_pd.shape
Output:
(23743, 8)
The only parameter I pass in this example is the file path (filepath_or_buffer). This parameter can be a file system path or a URL (http, ftp, S3). It can also accept an opened file object, an in-memory stream for text I/O (StringIO), or any other object that implements a read() method and returns text data.
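For example, a minimal sketch that passes an already opened file object instead of a path (assuming the same path variable and product_views.csv file as above):
# An open file handle works just like a file path
with open(path + 'product_views.csv') as f:
    product_from_handle = pd.read_csv(f)
product_from_handle.shape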
Loading data from a string
You can load data from an in-memory stream. If, for some reason, the source argument is not passed in the first position, you can explicitly specify its name, filepath_or_buffer.
from io import StringIO
data = 'col1,col2,col3\na,1,2\nb,3,4\nc,5,6'
read_data = pd.read_csv(sep=',',filepath_or_buffer=StringIO(data))
read_data.shape
Output:
(3, 3)
Loading data from a URL
Pandas gives you a way to load data from a remote location via a URL:
remote = pd.read_csv('https://www.divisionlab.com/solvingmagento/wp-content/uploads/2019/2019/tabular_data.txt', sep='\t')
remote.shape
Output:
(5, 3)
Data Parsing
Below are several aspects of Pandas data IO that influence how the data is read and parsed.
Separator (delimiter)
Parameters: sep or delimiter
The default separator used by read_csv is the comma (,). If the data uses a different separator (e.g., a tab), you need to pass it as the value of the sep parameter. The alternative name for this parameter is delimiter.
Loading tab-separated data without the separator parameter does not work:
tab = pd.read_csv(path + 'tabular_data.txt')
tab.head(2)
Output:
letter first second
0 a\t1\t2
1 b\t3\t4
Loading tab-separated data with the separator specified:
tab_2 = pd.read_csv(path + 'tabular_data.txt', sep='\t')
tab_2.head(2)
Output:
letter first second
0 a 1 2
1 b 3 4
Detecting the separator
When the separator parameter is set to None, Pandas is sometimes able to detect it using a special tool, csv.Sniffer. This feature is available only with the Python parsing engine, so unless you specify that engine, you get a warning message.
Let’s try loading semicolon-separated data:
tab_3 = pd.read_csv(path + 'semicolon_data.txt', sep=None)
tab_3.head(2)
Output:
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.
"""Entry point for launching an IPython kernel.
letter first second
0 a 1 2
1 b 3 4
Detecting the tab-separator and suppressing the warning message by specifying the parsing engine:
tab = pd.read_csv(path + 'tabular_data.txt', sep=None, engine='python')
tab.head(2)
Output:
letter first second
0 a 1 2
1 b 3 4
Multicharacter (regex) separator
When the data is separated with multiple characters, you can pass a regular expression to the sep parameter.
Possible problem: when your data is quoted, the regex separator parsing may not work.
data_mult = 'col1---col2---col3\na---1---2\nb---3---4\nc---5---6'
read_data_mult = pd.read_csv(sep='---',filepath_or_buffer=StringIO(data_mult))
read_data_mult.head(2)
Output:
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
col1 col2 col3
0 a 1 2
1 b 3 4
The regex separators work only with the somewhat slower Python parser:
# Regex separator and Python engine
read_data_mult_2 = pd.read_csv(sep='-{3}',
filepath_or_buffer=StringIO(data_mult),
engine='python')
read_data_mult_2.head(2)
Output:
col1 col2 col3
0 a 1 2
1 b 3 4
Quoted and escaped data
Often data may contain separator characters inside column values, for example, text columns containing commas in a comma-separated file. In this case, column values are quoted so that the parser can ignore separator characters between the quotes.
Most often, the data is quoted with the double-quote character ("). Pandas can recognize this quotation automatically:
data_comma_quot_double = \
'col1,col2,col3\n"cats meow, dogs bark","1,2,3","2,3,4"\nb,3,4\nc,5,6'
read_data_comma_quot_double = pd.read_csv(StringIO(data_comma_quot_double))
read_data_comma_quot_double.head(2)
Output:
                   col1   col2   col3
0  cats meow, dogs bark  1,2,3  2,3,4
1                     b      3      4
If the quoting character is different, you can specify it with the quotechar parameter:
data_comma_quot = \
"col1,col2,col3\n'cats meow, dogs bark','1,2,3','2,4,5'\nb,3,4\nc,5,6"
read_data_comma_quot = pd.read_csv(StringIO(data_comma_quot), quotechar="'")
read_data_comma_quot.head(2)
Output:
                   col1   col2   col3
0  cats meow, dogs bark  1,2,3  2,4,5
1                     b      3      4
Sometimes, column values containing separator characters are not quoted. Instead, the separators are escaped. In this case, you need to use the escapechar parameter:
data_escaped = 'col1,col2,col3\ncats meow\, dogs bark ,1\,2\,3,2\,3\,4\nb,3,4\nc,5,6'
read_data_escaped = pd.read_csv(StringIO(data_escaped), escapechar="\\")
read_data_escaped.head(2)
Output:
                    col1   col2   col3
0  cats meow, dogs bark   1,2,3  2,3,4
1                      b      3      4
If the quote character appears inside a quoted column value and is self-escaped (i.e., given twice: ""), you can tell Pandas to treat it as a single character that is part of the column value:
data_comma_quot_double = \
'col1,col2,col3\n"cat says,""Meow"", dog barks","1,2,3","2,3,4"\nb,3,4\nc,5,6'
read_data_comma_quot_double = pd.read_csv(StringIO(data_comma_quot_double))
read_data_comma_quot_double.head(2)
Output:
                         col1   col2   col3
0  cat says,"Meow", dog barks  1,2,3  2,3,4
1                           b      3      4
Quoted data and regex separator
To illustrate the problem that the regex separator has with quoted data, look at the data string below. The first-row column values contain the character sequence --- that is also used to separate columns. These values are quoted to prevent them from being mixed up with separators. To indicate that you are using quoted data, you can pass the quotechar parameter. Still, the regex separator cannot detect the quoted values properly:
data_mult_quot = \
"col1---col2---col3\n'a---'---'1---'---'2---'\nb---3---4\nc---5---6"
read_data_mult_quot = pd.read_csv(sep='-{3}',
filepath_or_buffer=StringIO(data_mult_quot),
engine='python',
quotechar="'")
read_data_mult_quot.head(2)
Output:
          col1  col2  col3
'a ' '1      '    '2     '
b    3  4  None  None  None
Dialect
Separators, quoting, and escaping can be controlled collectively through a dialect. You can override all default values used to handle separators, quotes, and escape characters by specifying one of the available csv dialects:
- default
- excel
- excel-tab
- unix
When you specify a dialect, you tell Pandas to use its values for the following parameters:
- delimiter
- doublequote
- escapechar
- lineterminator
- quotechar
- quoting
- skipinitialspace
- strict
Below are the values supplied by various dialects.
read_dialect_data = pd.read_csv(path + 'dialect_data.txt', quotechar='"',
sep='\t', index_col=None)
read_dialect_data
Output:
          Parameter            default              excel          excel-tab           unix
0         delimiter                  ,                  ,                 \t              ,
1       doublequote               True               True               True           True
2        escapechar               None               None               None           None
3    lineterminator               \r\n               \r\n               \r\n             \n
4         quotechar                  "                  "                  "              "
5           quoting  csv.QUOTE_MINIMAL  csv.QUOTE_MINIMAL  csv.QUOTE_MINIMAL  csv.QUOTE_ALL
6  skipinitialspace               True              False              False          False
7            strict              False              False              False          False
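To actually apply one of the built-in dialects, you can pass its name to the dialect parameter. A rough sketch, reusing the tab-separated file from earlier (the excel-tab dialect supplies the \t delimiter):
# Reading tab-separated data via the built-in 'excel-tab' dialect
tab_via_dialect = pd.read_csv(path + 'tabular_data.txt', dialect='excel-tab')
tab_via_dialect.head(2)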
Custom dialects
For more flexibility, you can define your own dialect and give its properties whatever values you need to load your data source.
import csv
class myDialect(csv.Dialect):
delimiter = '\t'
doublequote = True
escapechar = None
lineterminator = '\r\n'
quotechar = '"'
quoting = csv.QUOTE_MINIMAL
skipinitialspace = True
strict = False
# Register the new dialect name
csv.register_dialect('my_dialect', myDialect)
# Check if my dialect is available
print(csv.list_dialects())
my_read_dialect_data = pd.read_csv(path + 'dialect_data.txt',
dialect='my_dialect')
# Check my data
my_read_dialect_data.shape
Output:
['excel', 'excel-tab', 'unix', 'my_dialect']
(8, 5)
You can use an existing dialect by simply modifying one or more of its properties.
In the example below, one of the column values contains a quote character. The dialect used by default by read_csv is excel. This dialect treats " as the quoting character and will fail if it finds an unclosed quote.
The following custom dialect is based on excel but is set up to ignore quoting:
my_dialect_2 = csv.excel()
open_quote_data = 'col1,col2,col3\n"a,1,2\nb,3,4\nc,5,6'
my_dialect_2.quoting = csv.QUOTE_NONE
read_open_quote_data = pd.read_csv(StringIO(open_quote_data), dialect=my_dialect_2)
read_open_quote_data.head()
Output:
  col1  col2  col3
0   "a     1     2
1    b     3     4
2    c     5     6
Comments
Some data sources may include comments. Comments can come in two ways:
- Entire lines commented out, or
- A column value or part of it is commented out.
Pandas allows specifying the comment character via the comment parameter.
Without the comment parameter, commented lines and values will be imported:
comment_data = ('col1,col2,col3\n'
'a # comment in the beginning of the line,1,2\n'
'# commented line\n'
'b,3,4 # comment at the end of the line\n'
'c,5,6'
)
read_data_with_comments = pd.read_csv(StringIO(comment_data))
read_data_with_comments.head(3)
Output:
                                       col1  col2                                col3
0  a # comment in the beginning of the line   1.0                                   2
1                          # commented line   NaN                                 NaN
2                                         b   3.0  4 # comment at the end of the line
If you need to hide comments, you must specify the comment character. Only single characters are accepted, e.g. #:
read_data_without_comments = pd.read_csv(StringIO(comment_data), comment='#')
read_data_without_comments.head(3)
Output:
  col1  col2  col3
0    a   NaN   NaN
1    b   3.0   4.0
2    c   5.0   6.0
Note that the comment suppression hides the entire content of the first line past the comment character: the separators are ignored and the values for columns col2 and col3 are not read.
Empty lines
Empty lines are ignored by default. If empty lines need to be imported (e.g., because they actually mean missing data that you will have to impute later), you can set the skip_blank_lines parameter to False:
empty_line_data = ('col1,col2,col3\n'
'a,1,2\n'
'\n'
'b,3,4\n'
)
read_empty_line_data = pd.read_csv(StringIO(empty_line_data),
skip_blank_lines=False)
read_empty_line_data.head()
Output:
  col1  col2  col3
0    a   1.0   2.0
1  NaN   NaN   NaN
2    b   3.0   4.0
Missing values
Pandas lets you control which values are interpreted as missing (NA) values. Some strings are recognized as missing by default. They are:
- -1.#IND
- 1.#QNAN
- 1.#IND
- -1.#QNAN
- #N/A
- #N/A N/A
- N/A
- n/a
- NA
- #NA
- NULL
- null
- NaN
- -NaN
- nan
- -nan
- '' (empty string)
empty_data = ('col1,col2,col3\n'
'null,1,2\n'
'b,3,4\n'
'c,N/A,4\n'
)
read_empty_data = pd.read_csv(StringIO(empty_data))
read_empty_data.head()
Output:
  col1  col2  col3
0  NaN   1.0     2
1    b   3.0     4
2    c   NaN     4
When necessary, you can extend the list with your own custom entries. The custom values can be strings or numbers. For example, suppose you want the letter 'c' and the number 1 to be processed as missing values:
read_custom_empty_data = pd.read_csv(StringIO(empty_data), na_values=['c', 1])
read_custom_empty_data.head()
Output:
  col1  col2  col3
0  NaN   NaN     2
1    b   3.0     4
2  NaN   NaN     4
If you want to ignore the default missing values, you can tell pandas by setting the keep_default_na parameter to False:
read_custom__no_default_empty_data = pd.read_csv(StringIO(empty_data),
na_values=['c', 1],
keep_default_na=False)
read_custom__no_default_empty_data.head()
Output:
   col1 col2  col3
0  null  NaN     2
1     b    3     4
2   NaN  N/A     4
Data Columns
Header (column names)
When Pandas imports data, it assumes that the first row contains column names. If the first row contains something other than column names, you can specify which row to use as the header:
# The first row contains the sheet title
header_data = (
'This is the title of the data sheet\n'
'col1,col2,col3\n'
'a,1,2\n'
'b,3,4\n'
'c,5,6\n'
)
read_header_data = pd.read_csv(StringIO(header_data), header=1)
read_header_data.head()
Output:
  col1  col2  col3
0    a     1     2
1    b     3     4
2    c     5     6
Duplicate column names
If the data contains duplicate column names, Pandas will automatically append a counter to the duplicate.
dup_col_data = (
'col1,col2,col1\n'
'a,1,2\n'
'b,3,4\n'
'c,5,6\n'
)
read_dup_col_data = pd.read_csv(StringIO(dup_col_data))
read_dup_col_data.head()
Output:
  col1  col2  col1.1
0    a     1       2
1    b     3       4
2    c     5       6
Custom column names
You can supply the names parameter with a list of the column names that you want to use instead of those in the data source. If you omit the header parameter or set it to None, the row with the original header will be imported as a row of data.
cust_col_name_data = (
'col1,col2,col1\n'
'a,1,2\n'
'b,3,4\n'
'c,5,6\n'
)
read_cust_col_name_data = pd.read_csv(StringIO(cust_col_name_data),
names=['my_col_1', 'my_col_2',
'my_col_3'])
read_cust_col_name_data.head()
Output:
  my_col_1 my_col_2 my_col_3
0     col1     col2     col1
1        a        1        2
2        b        3        4
3        c        5        6
To avoid importing the original header as data, you need to specify the row containing it in the header parameter.
read_cust_col_name_data = pd.read_csv(StringIO(cust_col_name_data),
names=['my_col_1', 'my_col_2',
'my_col_3'],
header=0)
read_cust_col_name_data.head()
Output:
  my_col_1 my_col_2 my_col_3
0        a        1        2
1        b        3        4
2        c        5        6
Filtering by columns
If you need only a few of the columns in the data source, you can specify them in the usecols parameter:
col_data = ('col1,col2,col3\n'
'a,1,2\n'
'b,3,4\n'
'c,5,6\n'
)
read_col_data = pd.read_csv(StringIO(col_data), usecols=['col1', 'col3'])
read_col_data.head()
Output:
  col1  col3
0    a     2
1    b     4
2    c     6
Index columns
Sometimes, the data source contains fewer columns in the header row than there are columns in the following rows. In this case, Pandas will treat the first column as an index column:
index_col_data = ('col1,col2,col3\n'
'10,a,1,2\n'
'20,b,3,4\n'
'30,c,5,6\n'
)
read_index_col_data = pd.read_csv(StringIO(index_col_data))
read_index_col_data.head()
You can also explicitly specify the location of the index column by using the index_col parameter. This is handy when the number of header columns and value columns is the same. The index will get its name from the respective header column:
index_col_data = ('col1,Category,col2,col3\n'
'10,a,1,2\n'
'20,b,3,4\n'
'30,c,5,6\n'
)
read_index_col_data = pd.read_csv(StringIO(index_col_data), index_col=1)
read_index_col_data.head()
Output:
          col1  col2  col3
Category
a           10     1     2
b           20     3     4
c           30     5     6
Trailing delimiters
Sometimes only some of the rows in the data source have an extra column. This can happen when such rows end in a trailing delimiter. In this case, the index column parser will not work. To avoid the problem, you will have to set index_col to False:
index_col_data = ('col1,col2,col3\n'
'a,1,2,\n'
'b,3,4\n'
'c,5,6\n'
)
read_index_col_data = pd.read_csv(StringIO(index_col_data), index_col=False)
read_index_col_data.head()
MultiIndex
If the data has a multi-index structure and the header row does not include the index columns, Pandas is able to recognize the MultiIndex automatically:
multi_index_col_data = ('col1,col2\n'
'a,10,1,2\n'
'a,20,5,6\n'
'b,10,3,4\n'
'c,10,5,6\n'
)
read_multi_index_col_data = pd.read_csv(StringIO(multi_index_col_data))
read_multi_index_col_data.head()
Output:
      col1  col2
a 10     1     2
  20     5     6
b 10     3     4
c 10     5     6
When the MultiIndex columns are named in the header row, you can pass a list of the index column numbers to the index_col parameter:
multi_index_col_data = ('category,number,col1,col2\n'
'a,10,1,2\n'
'a,20,5,6\n'
'b,10,3,4\n'
'c,10,5,6\n'
)
read_multi_index_col_data = pd.read_csv(StringIO(multi_index_col_data),
index_col=[0, 1])
read_multi_index_col_data.head()
Output:
                 col1  col2
category number
a        10         1     2
         20         5     6
b        10         3     4
c        10         5     6
Data Types
Explicit column types
If you are sure about the data types present in the data source, you can explicitly specify what data type to use. You can do this for the entire data frame or for specific columns.
When you pass object to the dtype parameter, the data is read as strings:
object_col_data = ('col1,col2,col3\n'
'a,1,2\n'
'b,3,4\n'
'c,5,6\n'
)
read_object_col_data = pd.read_csv(StringIO(object_col_data), dtype=object)
print (read_object_col_data.iloc[0][1])
print (type(read_object_col_data.iloc[0][1]))
read_object_col_data.head()
Output:
1
  col1 col2 col3
0    a    1    2
1    b    3    4
2    c    5    6
When you have mixed-type data, you can import some columns as strings and some as integers or floats:
object_col_data = ('col1,col2,col3\n'
'a,1,2\n'
'b,3,4\n'
'c,5,6\n'
)
read_object_col_data = pd.read_csv(StringIO(object_col_data),
dtype={'col1': object, 'col2':'Int64',
'col3':'Float64'})
print (read_object_col_data.iloc[0][0])
print (type(read_object_col_data.iloc[0][0]))
print (read_object_col_data.iloc[0][1])
print (type(read_object_col_data.iloc[0][1]))
print (read_object_col_data.iloc[0][2])
print (type(read_object_col_data.iloc[0][2]))
read_object_col_data.head()
Output:
a
1
2.0
  col1  col2  col3
0    a     1   2.0
1    b     3   4.0
2    c     5   6.0
Coercing a data type
When a numerical column contains strings, you can enforce numerical conversion after the import by using the to_numeric
method. To this method, you can pass the column number and the errors='coerce'
:
mixed_col_data = ('col1,col2\n'
'1,2\n'
'a,4\n'
'5,6\n'
)
read_mixed_col_data = pd.read_csv(StringIO(mixed_col_data))
read_mixed_col_data['col1'] = pd.to_numeric(read_mixed_col_data['col1'],
errors='coerce')
read_mixed_col_data.head()
Output:
   col1  col2
0   1.0     2
1   NaN     4
2   5.0     6
Parsing numbers
Parsing of integers is straightforward and happens automatically:
numbers_data = ('col1,col2,col3\n'
'a,1,2\n'
'b,3,4\n'
'c,5,6\n'
)
read_numbers_data = pd.read_csv(StringIO(numbers_data))
read_numbers_data.dtypes
Output:
col1    object
col2     int64
col3     int64
dtype: object
If the source data has numbers formatted as floats (i.e., with a decimal separator), they will also be parsed as the float data type automatically. It is enough for one value in a column of numbers to be float-formatted for the whole column to be parsed as float:
numbers_data = ('col1,col2,col3\n'
'a,1.0,2\n'
'b,3.0,4.5\n'
'c,5.0,6\n'
)
read_numbers_data = pd.read_csv(StringIO(numbers_data))
read_numbers_data.dtypes
Output:
col1     object
col2    float64
col3    float64
dtype: object
If, during the import, you want to parse a number column that has float-formatted values to ints, Pandas will throw a TypeError. However, casting float columns to int is possible after the import.
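As a minimal sketch of that post-import cast (reusing the numbers_data string defined above), you can convert the float column with astype:
read_numbers_data = pd.read_csv(StringIO(numbers_data))
# col2 is parsed as float64; cast it to int64 after the import
read_numbers_data['col2'] = read_numbers_data['col2'].astype(int)
read_numbers_data.dtypes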
To handle commas as the decimal separator, pass decimal=','. In the case of comma decimals, make sure that the decimal values are quoted in a comma-separated data source:
numbers_data = ('col1,col2,col3\n'
'a,"1,0",2\n'
'b,"3,0","4,5"\n'
'c,"5,0",6\n'
)
read_numbers_data = pd.read_csv(StringIO(numbers_data), decimal=',')
read_numbers_data.dtypes
Output:
col1     object
col2    float64
col3    float64
dtype: object
To parse numbers containing thousands separators, pass the thousands parameter:
numbers_data = ('col1,col2,col3\n'
'a,"1,000.0",2\n'
'b,"3,010.0",4.5\n'
'c,"500,000,000.0",6\n'
)
read_numbers_data = pd.read_csv(StringIO(numbers_data), thousands=',')
read_numbers_data.dtypes
Output:
col1     object
col2    float64
col3    float64
dtype: object
Parsing Boolean values
If a data source contains values like 'True', 'true', 'TRUE', 'False', 'false', 'FALSE', columns containing these values will be automatically parsed as booleans. Note that if there is a single value in a column that cannot be parsed as a boolean, the entire column will be parsed as a string (object):
bool_data = ('col1,col2,col3\n'
'True,TRUE,true\n'
'true,false, 1\n'
'False,FALSE, false\n'
)
read_bool_data = pd.read_csv(StringIO(bool_data))
print(read_bool_data.dtypes)
read_bool_data.head()
Output:
col1      bool
col2      bool
col3    object
dtype: object
    col1   col2   col3
0   True   True   true
1   True  False      1
2  False  False  false
You can add more values that should be parsed as boolean by passing them in lists to the true_values and false_values parameters. Passing the integers 0 and 1 will not work, but these values can be cast to a boolean after the import (see the sketch after the following example).
bool_data = ('col1,col2,col3\n'
'True,yes,0\n'
'true,no, 1\n'
'False,FALSE, false\n'
)
read_bool_data = pd.read_csv(StringIO(bool_data), true_values=['yes'],
false_values=['no'])
print(read_bool_data.dtypes)
read_bool_data.head()
Output:
col1      bool
col2      bool
col3    object
dtype: object
    col1   col2   col3
0   True   True      0
1   True  False      1
2  False  False  false
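And here is a minimal sketch of the cast mentioned above, using a hypothetical flag column that contains only 0 and 1:
int_bool_data = ('col1,flag\n'
                 'a,1\n'
                 'b,0\n'
                 'c,1\n')
read_int_bool_data = pd.read_csv(StringIO(int_bool_data))
# flag is imported as int64; cast it to bool after the import
read_int_bool_data['flag'] = read_int_bool_data['flag'].astype(bool)
print(read_int_bool_data.dtypes)
read_int_bool_data.head()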
Parsing DateTime
By default, datetime values are parsed as strings. By passing True to the parse_dates parameter, it is possible to parse them into Python datetime objects. Below, the resulting DataFrame gets a DatetimeIndex:
date_data = ('col1,col2,col3\n'
'2019-01-01,a,1,2\n'
'2019-01-02,b,2,3\n'
'2019-01-03,c,3,4\n'
)
read_date_data = pd.read_csv(StringIO(date_data), parse_dates=True)
print(read_date_data.dtypes)
print(read_date_data.index)
read_date_data.head()
Output:
col1    object
col2     int64
col3     int64
dtype: object
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03'], dtype='datetime64[ns]', freq=None)
           col1  col2  col3
2019-01-01    a     1     2
2019-01-02    b     2     3
2019-01-03    c     3     4
The parse_dates parameter can also be used to indicate which columns should be parsed into dates.
By passing it a list of column numbers, you tell Pandas to try to parse each of those columns into a datetime object. Note that when encountering a time-only string, the parser assumes the current date. So, the column tea is interpreted as a time on 2019-09-26 (the date this example was run):
meals_data = ('date,lunch,tea,amt\n'
'2019-01-01,12:00:00,17:00:00,1\n'
'2019-01-02,12:10:00,17:05:00,1\n'
'2019-01-02,12:05:00,17:10:00,1\n')
read_meals_data = pd.read_csv(StringIO(meals_data), parse_dates=[0,2])
print(read_meals_data.dtypes)
read_meals_data.head()
Output:
date     datetime64[ns]
lunch            object
tea      datetime64[ns]
amt               int64
dtype: object
        date     lunch                 tea  amt
0 2019-01-01  12:00:00 2019-09-26 17:00:00    1
1 2019-01-02  12:10:00 2019-09-26 17:05:00    1
2 2019-01-02  12:05:00 2019-09-26 17:10:00    1
If you need to combine data from several columns to build datetime values out of them, you can pass a nested list to the parse_dates parameter. As a result, you get the combined date and time in the columns date_lunch and date_tea:
meals_data = ('date,lunch,tea,amt\n'
'2019-01-01,12:00:00,17:00:00,1\n'
'2019-01-02,12:10:00,17:05:00,1\n'
'2019-01-02,12:05:00,17:10:00,1\n')
read_meals_data = pd.read_csv(StringIO(meals_data),
parse_dates=[[0,1], [0, 2]])
print(read_meals_data.dtypes)
read_meals_data.head()
Output:
date_lunch datetime64[ns]
date_tea datetime64[ns]
amt int64
dtype: object
date_lunch date_tea amt
0 2019-01-01 12:00:00 2019-01-01 17:00:00 1
1 2019-01-02 12:10:00 2019-01-02 17:05:00 1
2 2019-01-02 12:05:00 2019-01-02 17:10:00 1
By default, the original columns are discarded after the parsing. You can tell Pandas to keep them by passing True to the keep_date_col parameter:
read_meals_data = pd.read_csv(StringIO(meals_data),
parse_dates=[[0,1], [0, 2]],
keep_date_col=True)
print(read_meals_data.dtypes)
read_meals_data.head()
Output:
date_lunch    datetime64[ns]
date_tea      datetime64[ns]
date                  object
lunch                 object
tea                   object
amt                    int64
dtype: object
           date_lunch            date_tea        date     lunch       tea  amt
0 2019-01-01 12:00:00 2019-01-01 17:00:00  2019-01-01  12:00:00  17:00:00    1
1 2019-01-02 12:10:00 2019-01-02 17:05:00  2019-01-02  12:10:00  17:05:00    1
2 2019-01-02 12:05:00 2019-01-02 17:10:00  2019-01-02  12:05:00  17:10:00    1
Parsing variously formatted date strings. Note that the data in the column date2 could not be parsed. The values there put the day before the month and cannot be parsed into datetime objects without using the dayfirst parameter:
date_data = (
'date1,date2,date3,date4,date5\n'
'2019-01-01,2019-17-01,01-01-19,17-01-19,20190102\n'
'2019-12-12,2019-14-03,01-03-19,18-01-19,20190103'
)
read_date_data = pd.read_csv(StringIO(date_data), parse_dates=[0,1,2,3,4])
print(read_date_data.dtypes)
read_date_data.head()
Output:
date1    datetime64[ns]
date2            object
date3    datetime64[ns]
date4    datetime64[ns]
date5    datetime64[ns]
dtype: object
       date1       date2      date3      date4      date5
0 2019-01-01  2019-17-01 2019-01-01 2019-01-17 2019-01-02
1 2019-12-12  2019-14-03 2019-01-03 2019-01-18 2019-01-03
The same data, now parsed with dayfirst=True. Note that the values in the first four columns are parsed assuming the day comes before the month. However, the last column is still parsed with the assumption that the month precedes the day:
date_data = (
'date1,date2,date3,date4,date5\n'
'2019-01-01,2019-17-01,01-01-19,17-01-19,20190102\n'
'2019-12-12,2019-14-03,01-03-19,18-01-19,20190103'
)
read_date_data = pd.read_csv(StringIO(date_data), parse_dates=[0,1,2,3,4], dayfirst=True)
print(read_date_data.dtypes)
print(read_date_data.head())
print(read_date_data.iloc[1].date1.month)
print(read_date_data.iloc[1].date2.month)
print(read_date_data.iloc[1].date3.month)
print(read_date_data.iloc[1].date4.month)
print(read_date_data.iloc[1].date5.month)
Output:
date1    datetime64[ns]
date2    datetime64[ns]
date3    datetime64[ns]
date4    datetime64[ns]
date5    datetime64[ns]
dtype: object
       date1      date2      date3      date4      date5
0 2019-01-01 2019-01-17 2019-01-01 2019-01-17 2019-01-02
1 2019-12-12 2019-03-14 2019-03-01 2019-01-18 2019-01-03
12
3
3
1
1
Date strings without delimiters are parsed without regard to the dayfirst parameter unless there is a value that can only be parsed with the day-first assumption (e.g., '20191703').
date_data_1 = (
'date1\n'
'20190102\n'
'20190103'
)
date_data_2 = (
'date1\n'
'20190102\n'
'20190103\n'
'20191703'
)
# the dayfirst parameter is ignored if there is no day-first formatted date:
df_1 = pd.read_csv(StringIO(date_data_1), parse_dates=[0])
df_2 = pd.read_csv(StringIO(date_data_1), parse_dates=[0], dayfirst=True)
print('Source data:')
print(date_data_1)
print('\n')
print('Not using dayfirst:')
print(df_1.head())
print('\n')
print('Using dayfirst:')
print(df_2.head())
print('\n')
print('\n')
# has a date formatted with the day first:
df_3 = pd.read_csv(StringIO(date_data_2), parse_dates=[0])
df_4 = pd.read_csv(StringIO(date_data_2), parse_dates=[0], dayfirst=True)
print('Source data:')
print(date_data_2)
print('\n')
print('Parsing fails:')
print(df_3.head())
print('\n')
print('Parsing day first:')
print(df_4.head())
Output:
Source data:
date1
20190102
20190103

Not using dayfirst:
       date1
0 2019-01-02
1 2019-01-03

Using dayfirst:
       date1
0 2019-01-02
1 2019-01-03

Source data:
date1
20190102
20190103
20191703

Parsing fails:
      date1
0  20190102
1  20190103
2  20191703

Parsing day first:
       date1
0 2019-01-02
1 2019-01-03
2 2019-03-17
Error Processing
Sometimes, the data source can contain lines that have more separators than other lines. In this case, columns cannot be recognized and the parser cannot generate a DataFrame.
You can choose to skip bad lines by passing error_bad_lines=False.
bad_lines_data = ('date,lunch,tea,amt\n'
'2019-01-01,12:00:00,17:00:00,1\n'
'2019-01-02,12:10:00,17:05:00,1,\n'
'2019-01-02,12:05:00,17:10:00,1\n')
df_1 = pd.read_csv(StringIO(bad_lines_data), error_bad_lines=False)
print(df_1.head())
Output:
         date     lunch       tea  amt
0  2019-01-01  12:00:00  17:00:00    1
1  2019-01-02  12:05:00  17:10:00    1
b'Skipping line 3: expected 4 fields, saw 5\n'
The parser will also output a warning that says which line was skipped. To suppress this warning, you can also pass warn_bad_lines=False.
df_2 = pd.read_csv(StringIO(bad_lines_data), error_bad_lines=False,
warn_bad_lines=False)
print(df_2.head())
Output:
         date     lunch       tea  amt
0  2019-01-01  12:00:00  17:00:00    1
1  2019-01-02  12:05:00  17:10:00    1