This is the first part of a Pandas manual that I've composed as a quick reference. I've been using it for some time for my own needs, and after some updates and changes I'm sharing it here.
The content is compiled from various sources, primarily the official Pandas documentation, as well as from Daniel Chen's video "Pandas is for Everyone" (PyCon 2019) and Kevin Markham's video "Data Science Best Practices with pandas" (PyCon 2019). I also include some tips and tricks I've learned from my own projects.
Loading Datasets into Pandas
Pandas can work with a variety of data sources. Data sources can be textual, binary or SQL. Popular data formats are:
- CSV (text)
- JSON (text)
- Microsoft Excel (binary)
- SQL (SQL)
- HTML (text)
Besides these, there are several other sources that Pandas can read directly (OpenDocument, Python Pickle, Google BigQuery and more).
For each of these types, Pandas offers read and write functions, prefixed with read_ and to_ respectively. So the popular data formats can be read and written using these functions:
- CSV: read_csv, to_csv
- JSON: read_json, to_json
- Excel: read_excel, to_excel
- SQL: read_sql, to_sql
- HTML: read_html, to_html
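As a quick illustration, here is a minimal sketch of the CSV pair; the DataFrame and the file name demo.csv are hypothetical, used only to show the round trip:
import pandas as pd
# Tiny DataFrame for demonstration purposes only
demo = pd.DataFrame({'product': ['a', 'b'], 'views': [10, 20]})
# Write it out with to_csv and read it back with read_csv
demo.to_csv('demo.csv', index=False)
demo_back = pd.read_csv('demo.csv')
demo_back.shape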
Reading CSV files
Most frequently I deal with tabular data contained in text files, for example, a dataset of weekly product views that I can import with the read_csv function.
product_pd = pd.read_csv(path + 'product_views.csv')
product_pd.shape
Output:
(23743, 8)
The only parameter I pass in this example is the file path (filepath_or_buffer). This parameter can be a file system path or a URL (http, ftp, S3). It can also accept an opened file object, an in-memory stream for text I/O (StringIO), or any other object that implements a read() method and returns text data.
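For example, a minimal sketch that passes an already opened file object instead of a path (assuming the same path variable and product_views.csv file as above):
# An open file handle works just like a file path
with open(path + 'product_views.csv') as f:
    product_from_handle = pd.read_csv(f)
product_from_handle.shape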
Loading data from a string
You can load data from an in-memory stream. If, for some reason, the source argument is not passed in the first position, you can explicitly specify its name, filepath_or_buffer.
from io import StringIO
data = 'col1,col2,col3\na,1,2\nb,3,4\nc,5,6'
read_data = pd.read_csv(sep=',',filepath_or_buffer=StringIO(data))
read_data.shape
Output:
(3, 3)
Loading data from a URL
Pandas gives you a way to load data from a remote location via a URL:
remote = pd.read_csv('https://www.divisionlab.com/solvingmagento/wp-content/uploads/2019/2019/tabular_data.txt', sep='\t')
remote.shape
Output:
(5, 3)
Data Parsing
Below are several aspects of Pandas data IO that influence how the data is read and parsed.
Separator (delimiter)
Parameters: sep or delimiter
The default separator used by read_csv is the comma (,). If the data uses a different separator (e.g., a tab), you need to pass it as the value of the sep parameter. The alternative name for this parameter is delimiter.
Loading tab-separated data without the separator parameter does not work:
tab = pd.read_csv(path + 'tabular_data.txt')
tab.head(2)
Output:
letter first second
0 a\t1\t2
1 b\t3\t4
Loading tab-separated data with the separator specified:
tab_2 = pd.read_csv(path + 'tabular_data.txt', sep='\t')
tab_2.head(2)
Output:
letter first second
0 a 1 2
1 b 3 4
Detecting the separator
When the separator parameter is set to None, Pandas is sometimes able to detect it using a special tool, csv.Sniffer. This feature is available only with the Python parsing engine, so unless you specify that engine, you get a warning message.
Let’s try loading semicolon-separated data:
tab_3 = pd.read_csv(path + 'semicolon_data.txt', sep=None)
tab_3.head(2)
Output:
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.
"""Entry point for launching an IPython kernel.
letter first second
0 a 1 2
1 b 3 4
Detecting the tab-separator and suppressing the warning message by specifying the parsing engine:
tab = pd.read_csv(path + 'tabular_data.txt', sep=None, engine='python')
tab.head(2)
Output:
letter first second
0 a 1 2
1 b 3 4
Multicharacter (regex) separator
When the data is separated with multiple characters, you can pass a regular expression to the sep parameter.
Possible problem: when your data is quoted, the regex separator parsing may not work.
data_mult = 'col1---col2---col3\na---1---2\nb---3---4\nc---5---6'
read_data_mult = pd.read_csv(sep='---',filepath_or_buffer=StringIO(data_mult))
read_data_mult.head(2)
Output:
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
col1 col2 col3
0 a 1 2
1 b 3 4
The regex separators work only with the somewhat slower Python parser:
# Regex separator and Python engine
read_data_mult_2 = pd.read_csv(sep='-{3}',
filepath_or_buffer=StringIO(data_mult),
engine='python')
read_data_mult_2.head(2)
Output:
col1 col2 col3
0 a 1 2
1 b 3 4
Quoted and escaped data
Often data may contain separator characters inside column values, for example, text columns containing commas in a comma-separated file. In this case, column values are quoted so that the parser can ignore separator characters between the quotes.
Most often, the data is quoted with the double-quote character ("). Pandas can recognize this quotation automatically:
data_comma_quot_double = \
'col1,col2,col3\n"cats meow, dogs bark","1,2,3","2,3,4"\nb,3,4\nc,5,6'
read_data_comma_quot_double = pd.read_csv(StringIO(data_comma_quot_double))
read_data_comma_quot_double.head(2)
Output:
                   col1   col2   col3
0  cats meow, dogs bark  1,2,3  2,3,4
1                     b      3      4
If the quoting character is different, you can specify it with the quotechar parameter:
data_comma_quot = \
"col1,col2,col3\n'cats meow, dogs bark','1,2,3','2,4,5'\nb,3,4\nc,5,6"
read_data_comma_quot = pd.read_csv(StringIO(data_comma_quot), quotechar="'")
read_data_comma_quot.head(2)
Output:
                   col1   col2   col3
0  cats meow, dogs bark  1,2,3  2,4,5
1                     b      3      4
Sometimes, column values containing separator characters are not quoted. Instead, the separators are escaped. In this case, you need to use the escapechar parameter:
data_escaped = 'col1,col2,col3\ncats meow\, dogs bark ,1\,2\,3,2\,3\,4\nb,3,4\nc,5,6'
read_data_escaped = pd.read_csv(StringIO(data_escaped), escapechar="\\")
read_data_escaped.head(2)
Output:
                    col1   col2   col3
0  cats meow, dogs bark   1,2,3  2,3,4
1                      b      3      4
If the quote character appears inside a quoted column value and is self-escaped (i.e., given twice: ""), you can tell Pandas to treat it as a single character that is part of the column value:
data_comma_quot_double = \
'col1,col2,col3\n"cat says,""Meow"", dog barks","1,2,3","2,3,4"\nb,3,4\nc,5,6'
read_data_comma_quot_double = pd.read_csv(StringIO(data_comma_quot_double))
read_data_comma_quot_double.head(2)
Output:
                         col1   col2   col3
0  cat says,"Meow", dog barks  1,2,3  2,3,4
1                           b      3      4
Quoted data and regex separator
To illustrate the problem that the regex separator has with quoted data, look at the data string below. The first-row column values contain the character sequence --- that is also used to separate columns. These values are quoted to prevent them from being mixed up with separators. To indicate that you are using quoted data, you can pass the quotechar parameter. Still, the regex separator cannot detect the quoted values properly:
data_mult_quot = \
"col1---col2---col3\n'a---'---'1---'---'2---'\nb---3---4\nc---5---6"
read_data_mult_quot = pd.read_csv(sep='-{3}',
filepath_or_buffer=StringIO(data_mult_quot),
engine='python',
quotechar="'")
read_data_mult_quot.head(2)
Output:
          col1  col2  col3
'a ' '1      '    '2     '
b    3  4  None  None  None
Dialect
Separators, quoting, and escaping can be controlled collectively through a dialect. You can override all default values used to handle separators, quotes, and escape characters by specifying one of the available csv dialects:
- default
- excel
- excel-tab
- unix
When you specify a dialect, you tell Pandas to use its values for the following parameters:
- delimiter
- doublequote
- escapechar
- lineterminator
- quotechar
- quoting
- skipinitialspace
- strict
Below are the values supplied by various dialects.
read_dialect_data = pd.read_csv(path + 'dialect_data.txt', quotechar='"',
sep='\t', index_col=None)
read_dialect_data
Output:
          Parameter            default              excel          excel-tab           unix
0         delimiter                  ,                  ,                 \t              ,
1       doublequote               True               True               True           True
2        escapechar               None               None               None           None
3    lineterminator               \r\n               \r\n               \r\n             \n
4         quotechar                  "                  "                  "              "
5           quoting  csv.QUOTE_MINIMAL  csv.QUOTE_MINIMAL  csv.QUOTE_MINIMAL  csv.QUOTE_ALL
6  skipinitialspace               True              False              False          False
7            strict              False              False              False          False
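To actually apply one of the built-in dialects, you can pass its name to the dialect parameter. A rough sketch, reusing the tab-separated file from earlier (the excel-tab dialect supplies the \t delimiter):
# Reading tab-separated data via the built-in 'excel-tab' dialect
tab_via_dialect = pd.read_csv(path + 'tabular_data.txt', dialect='excel-tab')
tab_via_dialect.head(2)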
Custom dialects
For more flexibility, you can define your own dialect and give its properties whatever values you need to load your data source.
import csv
class myDialect(csv.Dialect):
delimiter = '\t'
doublequote = True
escapechar = None
lineterminator = '\r\n'
quotechar = '"'
quoting = csv.QUOTE_MINIMAL
skipinitialspace = True
strict = False
# Register the new dialect name
csv.register_dialect('my_dialect', myDialect)
# Check if my dialect is available
print(csv.list_dialects())
my_read_dialect_data = pd.read_csv(path + 'dialect_data.txt',
dialect='my_dialect')
# Check my data
my_read_dialect_data.shape
Output:
['excel', 'excel-tab', 'unix', 'my_dialect']
(8, 5)
You can use an existing dialect by simply modifying one or more of its properties.
In the example below, one of the column values contains a quote character. The dialect used by default by read_csv is excel. This dialect treats " as the quoting character and will fail if it finds an unclosed quote.
The following custom dialect is based on excel but is set up to ignore quoting:
my_dialect_2 = csv.excel()
open_quote_data = 'col1,col2,col3\n"a,1,2\nb,3,4\nc,5,6'
my_dialect_2.quoting = csv.QUOTE_NONE
read_open_quote_data = pd.read_csv(StringIO(open_quote_data), dialect=my_dialect_2)
read_open_quote_data.head()
Output:
  col1  col2  col3
0   "a     1     2
1    b     3     4
2    c     5     6
Comments
Some data sources may include comments. Comments can come in two ways:
- Entire lines commented out, or
- A column value or part of it is commented out.
Pandas allows specifying the comment character via the comment parameter.
Without the comment parameter, commented lines and values will be imported:
comment_data = ('col1,col2,col3\n'
'a # comment in the beginning of the line,1,2\n'
'# commented line\n'
'b,3,4 # comment at the end of the line\n'
'c,5,6'
)
read_data_with_comments = pd.read_csv(StringIO(comment_data))
read_data_with_comments.head(3)
Output:
                                       col1  col2                                col3
0  a # comment in the beginning of the line   1.0                                   2
1                          # commented line   NaN                                 NaN
2                                         b   3.0  4 # comment at the end of the line
If you need to hide comments, you must specify the comment character. Only single characters are accepted, e.g. #:
read_data_without_comments = pd.read_csv(StringIO(comment_data), comment='#')
read_data_without_comments.head(3)
Output:
  col1  col2  col3
0    a   NaN   NaN
1    b   3.0   4.0
2    c   5.0   6.0
Note that the comment suppression hides the entire content of the first line past the comment character: the separators are ignored and the values for columns col2 and col3 are not read.
Empty lines
Empty lines are ignored by default. If empty lines need to be imported (e.g., because they actually mean missing data that you will have to impute later), you can set the skip_blank_lines parameter to False:
empty_line_data = ('col1,col2,col3\n'
'a,1,2\n'
'\n'
'b,3,4\n'
)
read_empty_line_data = pd.read_csv(StringIO(empty_line_data),
skip_blank_lines=False)
read_empty_line_data.head()
Output:
  col1  col2  col3
0    a   1.0   2.0
1  NaN   NaN   NaN
2    b   3.0   4.0
Missing values
Pandas lets you control which values are interpreted as missing (NA) values. Some strings are recognized as missing by default. They are:
- -1.#IND
- 1.#QNAN
- 1.#IND
- -1.#QNAN
- #N/A
- #N/A N/A
- N/A
- n/a
- NA
- #NA
- NULL
- null
- NaN
- -NaN
- nan
- -nan
- '' (empty string)
empty_data = ('col1,col2,col3\n'
'null,1,2\n'
'b,3,4\n'
'c,N/A,4\n'
)
read_empty_data = pd.read_csv(StringIO(empty_data))
read_empty_data.head()
Output:
  col1  col2  col3
0  NaN   1.0     2
1    b   3.0     4
2    c   NaN     4
When necessary, you can extend the list with your own custom entries. The custom values can be strings or numbers. For example, suppose you want the letter 'c' and the number 1 to be processed as missing values:
read_custom_empty_data = pd.read_csv(StringIO(empty_data), na_values=['c', 1])
read_custom_empty_data.head()
Output:
  col1  col2  col3
0  NaN   NaN     2
1    b   3.0     4
2  NaN   NaN     4
If you want to ignore the default missing values, you can tell pandas by setting the keep_default_na parameter to False:
read_custom__no_default_empty_data = pd.read_csv(StringIO(empty_data),
na_values=['c', 1],
keep_default_na=False)
read_custom__no_default_empty_data.head()
Output:
   col1 col2  col3
0  null  NaN     2
1     b    3     4
2   NaN  N/A     4
Data Columns
Header (column names)
When Pandas imports data, it assumes that the first row contains column names. If the first row contains something other than column names, you can specify which row to use as the header:
# The first row contains the sheet title
header_data = (
'This is the title of the data sheet\n'
'col1,col2,col3\n'
'a,1,2\n'
'b,3,4\n'
'c,5,6\n'
)
read_header_data = pd.read_csv(StringIO(header_data), header=1)
read_header_data.head()
Output:
  col1  col2  col3
0    a     1     2
1    b     3     4
2    c     5     6
Duplicate column names
If the data contains duplicate column names, Pandas will automatically append a counter to the duplicate.
dup_col_data = (
'col1,col2,col1\n'
'a,1,2\n'
'b,3,4\n'
'c,5,6\n'
)
read_dup_col_data = pd.read_csv(StringIO(dup_col_data))
read_dup_col_data.head()
Output:
  col1  col2  col1.1
0    a     1       2
1    b     3       4
2    c     5       6
Custom column names
You can supply the names parameter with a list of the column names that you want to use instead of those in the data source. If you omit the header parameter or set it to None, the row with the original header will be imported as a row of data.
cust_col_name_data = (
'col1,col2,col1\n'
'a,1,2\n'
'b,3,4\n'
'c,5,6\n'
)
read_cust_col_name_data = pd.read_csv(StringIO(cust_col_name_data),
names=['my_col_1', 'my_col_2',
'my_col_3'])
read_cust_col_name_data.head()
Output:
  my_col_1 my_col_2 my_col_3
0     col1     col2     col1
1        a        1        2
2        b        3        4
3        c        5        6
To avoid importing the original header as data, you need to specify the row containing it in the header parameter.
read_cust_col_name_data = pd.read_csv(StringIO(cust_col_name_data),
names=['my_col_1', 'my_col_2',
'my_col_3'],
header=0)
read_cust_col_name_data.head()
Output:
  my_col_1 my_col_2 my_col_3
0        a        1        2
1        b        3        4
2        c        5        6
Filtering by columns
If you need only a few of the columns in the data source, you can specify them in the usecols parameter:
col_data = ('col1,col2,col3\n'
'a,1,2\n'
'b,3,4\n'
'c,5,6\n'
)
read_col_data = pd.read_csv(StringIO(col_data), usecols=['col1', 'col3'])
read_col_data.head()
Output:
  col1  col3
0    a     2
1    b     4
2    c     6
Index columns
Sometimes, the data source contains fewer columns in the header row than there are columns in the following rows. In this case, Pandas will treat the first column as an index column:
index_col_data = ('col1,col2,col3\n'
'10,a,1,2\n'
'20,b,3,4\n'
'30,c,5,6\n'
)
read_index_col_data = pd.read_csv(StringIO(index_col_data))
read_index_col_data.head()
You can also explicitly specify the location of the index column by using the index_col parameter. This is handy when the number of header columns and value columns is the same. The index will get its name from the respective header column:
index_col_data = ('col1,Category,col2,col3\n'
'10,a,1,2\n'
'20,b,3,4\n'
'30,c,5,6\n'
)
read_index_col_data = pd.read_csv(StringIO(index_col_data), index_col=1)
read_index_col_data.head()
Output:
          col1  col2  col3
Category
a           10     1     2
b           20     3     4
c           30     5     6
Trailing delimiters
Sometimes only some of the rows in the data source have an extra column. This can happen when such rows end in a trailing delimiter. In this case, the index column parser will not work. To avoid the problem, you will have to set index_col to False:
index_col_data = ('col1,col2,col3\n'
'a,1,2,\n'
'b,3,4\n'
'c,5,6\n'
)
read_index_col_data = pd.read_csv(StringIO(index_col_data), index_col=False)
read_index_col_data.head()
MultiIndex
If the data has a multi-index structure and the header row does not include the index columns, Pandas is able to recognize the MultiIndex automatically:
multi_index_col_data = ('col1,col2\n'
'a,10,1,2\n'
'a,20,5,6\n'
'b,10,3,4\n'
'c,10,5,6\n'
)
read_multi_index_col_data = pd.read_csv(StringIO(multi_index_col_data))
read_multi_index_col_data.head()
Output:
      col1  col2
a 10     1     2
  20     5     6
b 10     3     4
c 10     5     6
When the MultiIndex columns are named in the header row, you can pass a list of the index column numbers to the index_col parameter:
multi_index_col_data = ('category,number,col1,col2\n'
'a,10,1,2\n'
'a,20,5,6\n'
'b,10,3,4\n'
'c,10,5,6\n'
)
read_multi_index_col_data = pd.read_csv(StringIO(multi_index_col_data),
index_col=[0, 1])
read_multi_index_col_data.head()
Output:
                 col1  col2
category number
a        10         1     2
         20         5     6
b        10         3     4
c        10         5     6
Data Types
Explicit column types
If you are sure about the data types present in the data source, you can explicitly specify what data type to use. You can do this for the entire data frame or for specific columns.
When you pass object to the dtype parameter, the data is read as strings:
object_col_data = ('col1,col2,col3\n'
'a,1,2\n'
'b,3,4\n'
'c,5,6\n'
)
read_object_col_data = pd.read_csv(StringIO(object_col_data), dtype=object)
print (read_object_col_data.iloc[0][1])
print (type(read_object_col_data.iloc[0][1]))
read_object_col_data.head()
Output:
1
  col1 col2 col3
0    a    1    2
1    b    3    4
2    c    5    6
When you have mixed-type data, you can import some columns as strings and some as integers or floats:
object_col_data = ('col1,col2,col3\n'
'a,1,2\n'
'b,3,4\n'
'c,5,6\n'
)
read_object_col_data = pd.read_csv(StringIO(object_col_data),
dtype={'col1': object, 'col2':'Int64',
'col3':'Float64'})
print (read_object_col_data.iloc[0][0])
print (type(read_object_col_data.iloc[0][0]))
print (read_object_col_data.iloc[0][1])
print (type(read_object_col_data.iloc[0][1]))
print (read_object_col_data.iloc[0][2])
print (type(read_object_col_data.iloc[0][2]))
read_object_col_data.head()
Output:
a
1
2.0
  col1  col2  col3
0    a     1   2.0
1    b     3   4.0
2    c     5   6.0
Coercing a data type
When a numerical column contains strings, you can enforce numerical conversion after the import by using the to_numeric
method. To this method, you can pass the column number and the errors='coerce'
:
mixed_col_data = ('col1,col2\n'
'1,2\n'
'a,4\n'
'5,6\n'
)
read_mixed_col_data = pd.read_csv(StringIO(mixed_col_data))
read_mixed_col_data['col1'] = pd.to_numeric(read_mixed_col_data['col1'],
errors='coerce')
read_mixed_col_data.head()
Output:
   col1  col2
0   1.0     2
1   NaN     4
2   5.0     6
Parsing numbers
Parsing of integers is straightforward and happens automatically:
numbers_data = ('col1,col2,col3\n'
'a,1,2\n'
'b,3,4\n'
'c,5,6\n'
)
read_numbers_data = pd.read_csv(StringIO(numbers_data))
read_numbers_data.dtypes
Output:
col1    object
col2     int64
col3     int64
dtype: object
If the source data has numbers formatted as floats (i.e., with a decimal separator), they will also be parsed as the float data type automatically. It is enough for one value in a column of numbers to be float-formatted for the whole column to be parsed as float:
numbers_data = ('col1,col2,col3\n'
'a,1.0,2\n'
'b,3.0,4.5\n'
'c,5.0,6\n'
)
read_numbers_data = pd.read_csv(StringIO(numbers_data))
read_numbers_data.dtypes
Output:
col1     object
col2    float64
col3    float64
dtype: object
If, during the import, you want to parse a number column that has float-formatted values to ints, Pandas will throw a TypeError. However, casting float columns to int is possible after the import.
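As a minimal sketch of that post-import cast (reusing the numbers_data string defined above), you can convert the float column with astype:
read_numbers_data = pd.read_csv(StringIO(numbers_data))
# col2 is parsed as float64; cast it to int64 after the import
read_numbers_data['col2'] = read_numbers_data['col2'].astype(int)
read_numbers_data.dtypes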
To handle commas as the decimal separator, pass decimal=','. In the case of comma decimals, make sure that the decimal values are quoted in a comma-separated data source:
numbers_data = ('col1,col2,col3\n'
'a,"1,0",2\n'
'b,"3,0","4,5"\n'
'c,"5,0",6\n'
)
read_numbers_data = pd.read_csv(StringIO(numbers_data), decimal=',')
read_numbers_data.dtypes
Output:
col1     object
col2    float64
col3    float64
dtype: object
To parse numbers containing thousands separators, pass the thousands parameter:
numbers_data = ('col1,col2,col3\n'
'a,"1,000.0",2\n'
'b,"3,010.0",4.5\n'
'c,"500,000,000.0",6\n'
)
read_numbers_data = pd.read_csv(StringIO(numbers_data), thousands=',')
read_numbers_data.dtypes
Output:
col1     object
col2    float64
col3    float64
dtype: object
Parsing Boolean values
If a data source contains values like 'True', 'true', 'TRUE', 'False', 'false', 'FALSE', columns containing these values will be automatically parsed as booleans. Note that if there is a single value in a column that cannot be parsed as a boolean, the entire column will be parsed as a string (object):
bool_data = ('col1,col2,col3\n'
'True,TRUE,true\n'
'true,false, 1\n'
'False,FALSE, false\n'
)
read_bool_data = pd.read_csv(StringIO(bool_data))
print(read_bool_data.dtypes)
read_bool_data.head()
Output:
col1      bool
col2      bool
col3    object
dtype: object
    col1   col2   col3
0   True   True   true
1   True  False      1
2  False  False  false
You can add more values that should be parsed as boolean by passing them in lists to the true_values and false_values parameters. Passing the integers 0 and 1 will not work, but these values can be cast to a boolean after the import (see the sketch after the following example).
bool_data = ('col1,col2,col3\n'
'True,yes,0\n'
'true,no, 1\n'
'False,FALSE, false\n'
)
read_bool_data = pd.read_csv(StringIO(bool_data), true_values=['yes'],
false_values=['no'])
print(read_bool_data.dtypes)
read_bool_data.head()
Output:
col1      bool
col2      bool
col3    object
dtype: object
    col1   col2   col3
0   True   True      0
1   True  False      1
2  False  False  false
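And here is a minimal sketch of the cast mentioned above, using a hypothetical flag column that contains only 0 and 1:
int_bool_data = ('col1,flag\n'
                 'a,1\n'
                 'b,0\n'
                 'c,1\n')
read_int_bool_data = pd.read_csv(StringIO(int_bool_data))
# flag is imported as int64; cast it to bool after the import
read_int_bool_data['flag'] = read_int_bool_data['flag'].astype(bool)
print(read_int_bool_data.dtypes)
read_int_bool_data.head()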
Parsing DateTime
By default, datetime values are parsed as strings. By passing True to the parse_dates parameter, it is possible to parse them into Python datetime objects. Below, the resulting DataFrame gets a DatetimeIndex:
date_data = ('col1,col2,col3\n'
'2019-01-01,a,1,2\n'
'2019-01-02,b,2,3\n'
'2019-01-03,c,3,4\n'
)
read_date_data = pd.read_csv(StringIO(date_data), parse_dates=True)
print(read_date_data.dtypes)
print(read_date_data.index)
read_date_data.head()
Output:
col1    object
col2     int64
col3     int64
dtype: object
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03'], dtype='datetime64[ns]', freq=None)
           col1  col2  col3
2019-01-01    a     1     2
2019-01-02    b     2     3
2019-01-03    c     3     4
The parse_dates parameter can also be used to indicate which columns should be parsed into dates.
By passing it a list of column numbers, you tell Pandas to try to parse each of those columns into a datetime object. Note that when encountering a time-only string, the parser assumes the current date. So, the column tea is interpreted as a time on 2019-09-26 (the date this example was run):
meals_data = ('date,lunch,tea,amt\n'
'2019-01-01,12:00:00,17:00:00,1\n'
'2019-01-02,12:10:00,17:05:00,1\n'
'2019-01-02,12:05:00,17:10:00,1\n')
read_meals_data = pd.read_csv(StringIO(meals_data), parse_dates=[0,2])
print(read_meals_data.dtypes)
read_meals_data.head()
Output:
date     datetime64[ns]
lunch            object
tea      datetime64[ns]
amt               int64
dtype: object
        date     lunch                 tea  amt
0 2019-01-01  12:00:00 2019-09-26 17:00:00    1
1 2019-01-02  12:10:00 2019-09-26 17:05:00    1
2 2019-01-02  12:05:00 2019-09-26 17:10:00    1
If you need to combine data from several columns to build datetime values out of them, you can pass a nested list to the parse_dates parameter. As a result, you get the combined date and time in the columns date_lunch and date_tea:
meals_data = ('date,lunch,tea,amt\n'
'2019-01-01,12:00:00,17:00:00,1\n'
'2019-01-02,12:10:00,17:05:00,1\n'
'2019-01-02,12:05:00,17:10:00,1\n')
read_meals_data = pd.read_csv(StringIO(meals_data),
parse_dates=[[0,1], [0, 2]])
print(read_meals_data.dtypes)
read_meals_data.head()
Output:
date_lunch datetime64[ns]
date_tea datetime64[ns]
amt int64
dtype: object
date_lunch date_tea amt
0 2019-01-01 12:00:00 2019-01-01 17:00:00 1
1 2019-01-02 12:10:00 2019-01-02 17:05:00 1
2 2019-01-02 12:05:00 2019-01-02 17:10:00 1
By default, the original columns are discarded after the parsing. You can tell Pandas to keep them by passing True to the keep_date_col parameter:
read_meals_data = pd.read_csv(StringIO(meals_data),
parse_dates=[[0,1], [0, 2]],
keep_date_col=True)
print(read_meals_data.dtypes)
read_meals_data.head()
Output:
date_lunch    datetime64[ns]
date_tea      datetime64[ns]
date                  object
lunch                 object
tea                   object
amt                    int64
dtype: object
           date_lunch            date_tea        date     lunch       tea  amt
0 2019-01-01 12:00:00 2019-01-01 17:00:00  2019-01-01  12:00:00  17:00:00    1
1 2019-01-02 12:10:00 2019-01-02 17:05:00  2019-01-02  12:10:00  17:05:00    1
2 2019-01-02 12:05:00 2019-01-02 17:10:00  2019-01-02  12:05:00  17:10:00    1
Parsing variously formatted date strings. Note that the data in the column date2 could not be parsed. The values there put the day before the month and cannot be parsed into datetime objects without using the dayfirst parameter:
date_data = (
'date1,date2,date3,date4,date5\n'
'2019-01-01,2019-17-01,01-01-19,17-01-19,20190102\n'
'2019-12-12,2019-14-03,01-03-19,18-01-19,20190103'
)
read_date_data = pd.read_csv(StringIO(date_data), parse_dates=[0,1,2,3,4])
print(read_date_data.dtypes)
read_date_data.head()
Output:
date1    datetime64[ns]
date2            object
date3    datetime64[ns]
date4    datetime64[ns]
date5    datetime64[ns]
dtype: object
       date1       date2      date3      date4      date5
0 2019-01-01  2019-17-01 2019-01-01 2019-01-17 2019-01-02
1 2019-12-12  2019-14-03 2019-01-03 2019-01-18 2019-01-03
The same data, now parsed with dayfirst=True. Note that the values in the first four columns are parsed assuming the day comes before the month. However, the last column is still parsed with the assumption that the month precedes the day:
date_data = (
'date1,date2,date3,date4,date5\n'
'2019-01-01,2019-17-01,01-01-19,17-01-19,20190102\n'
'2019-12-12,2019-14-03,01-03-19,18-01-19,20190103'
)
read_date_data = pd.read_csv(StringIO(date_data), parse_dates=[0,1,2,3,4], dayfirst=True)
print(read_date_data.dtypes)
print(read_date_data.head())
print(read_date_data.iloc[1].date1.month)
print(read_date_data.iloc[1].date2.month)
print(read_date_data.iloc[1].date3.month)
print(read_date_data.iloc[1].date4.month)
print(read_date_data.iloc[1].date5.month)
Output:
date1    datetime64[ns]
date2    datetime64[ns]
date3    datetime64[ns]
date4    datetime64[ns]
date5    datetime64[ns]
dtype: object
       date1      date2      date3      date4      date5
0 2019-01-01 2019-01-17 2019-01-01 2019-01-17 2019-01-02
1 2019-12-12 2019-03-14 2019-03-01 2019-01-18 2019-01-03
12
3
3
1
1
Date strings without delimiters are parsed without regard to the dayfirst parameter unless there is a value that can only be parsed with the day-first assumption (e.g., '20191703').
date_data_1 = (
'date1\n'
'20190102\n'
'20190103'
)
date_data_2 = (
'date1\n'
'20190102\n'
'20190103\n'
'20191703'
)
# the dayfirst parameter is ignored if there is no day-first formatted date:
df_1 = pd.read_csv(StringIO(date_data_1), parse_dates=[0])
df_2 = pd.read_csv(StringIO(date_data_1), parse_dates=[0], dayfirst=True)
print('Source data:')
print(date_data_1)
print('\n')
print('Not using dayfirst:')
print(df_1.head())
print('\n')
print('Using dayfirst:')
print(df_2.head())
print('\n')
print('\n')
# has a date formatted with the day first:
df_3 = pd.read_csv(StringIO(date_data_2), parse_dates=[0])
df_4 = pd.read_csv(StringIO(date_data_2), parse_dates=[0], dayfirst=True)
print('Source data:')
print(date_data_2)
print('\n')
print('Parsing fails:')
print(df_3.head())
print('\n')
print('Parsing day first:')
print(df_4.head())
Output:
Source data:
date1
20190102
20190103

Not using dayfirst:
       date1
0 2019-01-02
1 2019-01-03

Using dayfirst:
       date1
0 2019-01-02
1 2019-01-03

Source data:
date1
20190102
20190103
20191703

Parsing fails:
      date1
0  20190102
1  20190103
2  20191703

Parsing day first:
       date1
0 2019-01-02
1 2019-01-03
2 2019-03-17
Error Processing
Sometimes, the data source can contain lines that have more separators than other lines. In this case, columns cannot be recognized and the parser cannot generate a DataFrame.
You can choose to skip bad lines by passing error_bad_lines=False.
bad_lines_data = ('date,lunch,tea,amt\n'
'2019-01-01,12:00:00,17:00:00,1\n'
'2019-01-02,12:10:00,17:05:00,1,\n'
'2019-01-02,12:05:00,17:10:00,1\n')
df_1 = pd.read_csv(StringIO(bad_lines_data), error_bad_lines=False)
print(df_1.head())
Output:
         date     lunch       tea  amt
0  2019-01-01  12:00:00  17:00:00    1
1  2019-01-02  12:05:00  17:10:00    1
b'Skipping line 3: expected 4 fields, saw 5\n'
The parser will also output a warning that says which line was skipped. To suppress this warning, you can also pass warn_bad_lines=False.
df_2 = pd.read_csv(StringIO(bad_lines_data), error_bad_lines=False,
warn_bad_lines=False)
print(df_2.head())
Output:
         date     lunch       tea  amt
0  2019-01-01  12:00:00  17:00:00    1
1  2019-01-02  12:05:00  17:10:00    1