05 November 2015
pandas_note

How to get started with Pandas

In this notebook, I followed the examples in Python for Data Analysis to learn how to use Pandas.

In [1]:
from pandas import Series, DataFrame
import pandas as pd

Data Structure

There are two important data structures in Pandas: Series and DataFrame.

Series

In [2]:
obj = Series([4, 7, -5, 3])

The default index for the Series is integers starting from 0.

In [3]:
obj
Out[3]:
0    4
1    7
2   -5
3    3
dtype: int64
In [4]:
print(obj.values)
print(obj.index)
[ 4  7 -5  3]
Int64Index([0, 1, 2, 3], dtype='int64')

The index can also be set by the user.

In [5]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [6]:
obj2
Out[6]:
d    4
b    7
a   -5
c    3
dtype: int64
In [7]:
print(obj2['a'])
print("---")
obj2['d'] = 6
print(obj2[['c', 'a', 'd']])
-5
---
c    3
a   -5
d    6
dtype: int64

You can apply NumPy-style operations to a Series, such as boolean filtering, scalar arithmetic, and universal functions. The result keeps the index labels attached to the values.

In [8]:
print(obj2)
print("---")
print(obj2[obj2 > 0])
print("---")
print(obj2 * 2)
print("---")
import numpy as np
print(np.exp(obj2))
d    6
b    7
a   -5
c    3
dtype: int64
---
d    6
b    7
c    3
dtype: int64
---
d    12
b    14
a   -10
c     6
dtype: int64
---
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
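
To make the point about the index concrete, here is a minimal sketch that filters a Series and then applies a NumPy function to it. The names doubled and roots are only illustrative, and the code is written with the print() function so it also runs on Python 3.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Element-wise arithmetic returns a new Series with the same labels.
doubled = obj2 * 2

# Filter first, then apply a NumPy ufunc; only surviving labels remain.
roots = np.sqrt(obj2[obj2 > 0])

print(doubled.index.tolist())
print(roots.index.tolist())
```

The label 'a' is dropped by the filter, so it does not appear in roots, while the remaining labels stay paired with their own values.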

A Series can be thought of as an ordered dict that maps index labels to values. In fact, you can create a Series from a Python dict.

In [9]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [10]:
obj3 = Series(sdata)
In [11]:
obj3
Out[11]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
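
The dict analogy goes a little further, as this small sketch suggests: membership tests check the index labels (like dict keys), and to_dict converts the Series back. The variable names has_ohio, has_value and as_dict are my own.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)

# "in" looks at the index labels, the way it looks at dict keys.
has_ohio = 'Ohio' in obj3
has_value = 35000 in obj3   # values are not searched, so this is False

# Convert back to a plain Python dict.
as_dict = obj3.to_dict()

print(has_ohio)
print(has_value)
```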

When you pass both a dict and an index, only the keys listed in the index are kept, in the order you give. In the example below, Utah no longer appears in obj4. Since the dict has no value for California, its value is NaN.

In [12]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states) 
obj4
Out[12]:
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64

pandas provides isnull and notnull, both as Series methods and as top-level functions, to detect missing data.

In [13]:
print(obj4.isnull())
print('---')
print(pd.isnull(obj4))
print('---')
print(obj4.notnull())
print('---')
print(pd.notnull(obj4))
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
---
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
---
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
---
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
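
The boolean masks returned by isnull and notnull are mostly useful for counting or filtering. A minimal sketch, with n_missing, present and same as illustrative names:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# True counts as 1, so summing the mask counts the missing entries.
n_missing = int(obj4.isnull().sum())

# Use the mask to keep only the entries that have a value.
present = obj4[obj4.notnull()]

# dropna() is a shortcut for the filtering step above.
same = obj4.dropna()

print(n_missing)
print(present)
```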

When you add two Series, the values are aligned by index label. A label that is missing from either Series, or whose value is NaN, gives NaN in the result.

In [14]:
print(obj3)
print('---')
print(obj4)
print('---')
print(obj3 + obj4)
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
---
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64
---
California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah             NaN
dtype: float64
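
If NaN is not what you want for labels that exist in only one Series, the add method accepts a fill_value. The sketch below uses a second Series without Utah (no_utah is a made-up name for illustration) to compare the two behaviours.

```python
import pandas as pd

obj3 = pd.Series({'Ohio': 35000, 'Oregon': 16000, 'Texas': 71000, 'Utah': 5000})
no_utah = pd.Series({'Ohio': 35000, 'Oregon': 16000, 'Texas': 71000})

# The plain + operator gives NaN for Utah, which only exists in obj3.
plain = obj3 + no_utah

# add() with fill_value=0 treats the missing Utah entry as 0 instead.
filled = obj3.add(no_utah, fill_value=0)

print(plain)
print(filled)
```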

You can assign a name to a Series object and to its index.

In [15]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4
Out[15]:
state
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
Name: population, dtype: float64

The index of a Series can be replaced in place by assignment.

In [16]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj
Out[16]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
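
One thing to watch when assigning a new index: the number of labels has to match the number of values. A short sketch of what happens otherwise (mismatch_error is just a name I use to hold the exception):

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

try:
    # Three labels for four values: pandas refuses the assignment.
    obj.index = ['Bob', 'Steve', 'Jeff']
except ValueError as err:
    mismatch_error = err
else:
    mismatch_error = None

print(type(mismatch_error).__name__)
```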

DataFrame

In [17]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
In [18]:
print(frame)
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

In the IPython notebook, a DataFrame is displayed as a table.

In [19]:
frame
Out[19]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

If you pass a columns argument, the columns are arranged in exactly the order you specify.

In [20]:
DataFrame(data, columns=['year', 'state', 'pop'])
Out[20]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
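
An existing DataFrame can be put into a chosen column order as well, by selecting a list of column names. A minimal sketch (reordered and from_constructor are illustrative names):

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

# Selecting with a list of names returns the columns in that order.
reordered = frame[['year', 'state', 'pop']]

# Same layout as asking for the order at construction time.
from_constructor = pd.DataFrame(data, columns=['year', 'state', 'pop'])

print(reordered.columns.tolist())
```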

As with Series, if you ask for a column that is not in the data, it is filled with NaN.

In [21]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], 
                   index=['one', 'two', 'three', 'four', 'five'])
frame2
Out[21]:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
In [22]:
frame2.columns
Out[22]:
Index([u'year', u'state', u'pop', u'debt'], dtype='object')

You can retrieve a column in two ways: dict-style indexing or attribute access.

In [23]:
frame2['state']
Out[23]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
In [24]:
frame2.year
Out[24]:
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
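
The two ways are not always interchangeable. Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. The 'pop' column in this data is such a clash, because DataFrame has a pop method. The sketch below uses a shortened, two-row version of the data just for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'year': [2000, 2001],
                      'pop': [1.5, 2.4]})

by_key = frame['pop']    # dict-style indexing always returns the column
by_attr = frame.pop      # this is the DataFrame.pop method, not the column
year_attr = frame.year   # no clash here, so this is the 'year' column

print(type(by_key).__name__)
print(callable(by_attr))
```

For that reason, dict-style indexing is the safer choice in code that has to work with arbitrary column names.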

You can retrieve a row by its index label with ix.

In [25]:
frame2.ix['three']
Out[25]:
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
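
Note that ix was deprecated in later pandas releases and removed in pandas 1.0. A minimal sketch of the same lookup with loc (by label) and iloc (by position), which should work on both old and new versions:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

row_by_label = frame2.loc['three']   # select the row labelled 'three'
row_by_position = frame2.iloc[2]     # select the third row (0-based position)

print(row_by_label)
```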

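A column can be changed by assignment. A scalar is broadcast to every row, an array must match the length of the DataFrame, and a Series is aligned on the index, with NaN filled in for any missing labels.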

In [26]:
frame2['debt'] = 16.5
frame2
Out[26]:
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
In [27]:
frame2['debt'] = np.arange(5.)
frame2
Out[27]:
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
In [28]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
Out[28]:
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
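
The difference between assigning a Series and assigning a plain list can be checked directly. This sketch uses a small three-row frame (not the data above) and the made-up column name 'bad' to trigger the error case.

```python
import pandas as pd

frame2 = pd.DataFrame({'year': [2000, 2001, 2002]},
                      index=['one', 'two', 'three'])

# A Series is matched to rows by label, so order and gaps do not matter.
frame2['debt'] = pd.Series([-1.5, -1.2], index=['three', 'two'])

# A plain list has no labels, so its length must equal the number of rows.
try:
    frame2['bad'] = [1, 2]
except ValueError as err:
    length_error = err
else:
    length_error = None

print(frame2)
```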

Assigning to a column that does not exist creates a new column.

In [29]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2
Out[29]:
       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False

A column can be deleted with del.

In [30]:
del frame2['eastern']
frame2
Out[30]:
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
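
del modifies the DataFrame itself. If you would rather keep the original and get a copy without the column, drop is an alternative. A minimal sketch on a two-row illustrative frame:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})
frame['eastern'] = frame['state'] == 'Ohio'

# drop returns a new DataFrame; frame still has the 'eastern' column.
trimmed = frame.drop('eastern', axis=1)

print(frame.columns.tolist())
print(trimmed.columns.tolist())
```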

Another way to create a DataFrame is from a nested dict. The outer keys become the columns and the inner keys become the row index.

In [31]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [32]:
frame3 = DataFrame(pop)
In [33]:
frame3
Out[33]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
In [34]:
frame3.T
Out[34]:
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
In [35]:
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
Out[35]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
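
With a nested dict, the inner keys of all the dicts are combined to form the row index, unless you pass an explicit index. A small sketch of both cases (years and subset are illustrative names, and the row order of frame3 may differ between pandas versions, so it is sorted here):

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)

# Inner keys from both dicts end up in one row index.
years = sorted(frame3.index)

# An explicit index keeps only the requested rows, adding NaN where needed.
subset = pd.DataFrame(pop, index=[2001, 2002, 2003])

print(years)
print(subset)
```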

The values attribute returns the data in the DataFrame as a two-dimensional ndarray.

In [36]:
frame3.values
Out[36]:
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
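
The dtype of that ndarray depends on the columns. If they are all numeric you get a numeric array, but when the columns have different types, pandas has to pick a dtype that can hold everything. A minimal sketch with two small made-up frames:

```python
import pandas as pd

numeric = pd.DataFrame({'Nevada': [2.4, 2.9], 'Ohio': [1.7, 3.6]})
mixed = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

numeric_dtype = numeric.values.dtype   # all floats, so a float array
mixed_dtype = mixed.values.dtype       # strings and floats together

print(numeric_dtype)
print(mixed_dtype)
```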