How to get started with Pandas

In this notebook, I followed the examples in Python for Data Analysis to learn how to use Pandas.

from pandas import Series, DataFrame
import pandas as pd

Data Structures

There are two important data structures in Pandas: Series and DataFrame.

Series

obj = Series([4, 7, -5, 3])

By default, a Series is indexed by integers starting from 0.

obj

0    4
1    7
2   -5
3    3
dtype: int64

print(obj.values)
print(obj.index)

[ 4  7 -5  3]
Int64Index([0, 1, 2, 3], dtype='int64')

The index can also be set by the user.

obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

obj2

d    4
b    7
a   -5
c    3
dtype: int64

print(obj2['a'])
print("---")
obj2['d'] = 6
print(obj2[['c', 'a', 'd']])

-5
---
c    3
a   -5
d    6
dtype: int64

NumPy-style operations such as boolean filtering, scalar multiplication, and math functions also work on a Series, and the result keeps the index.

import numpy as np

print(obj2)
print("---")
print(obj2[obj2 > 0])
print("---")
print(obj2 * 2)
print("---")
print(np.exp(obj2))

d    6
b    7
a   -5
c    3
dtype: int64
---
d    6
b    7
c    3
dtype: int64
---
d    12
b    14
a   -10
c     6
dtype: int64
---
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

A Series behaves much like a dict that maps index labels to values. In fact, you can create a Series from a Python dict.

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

obj3 = Series(sdata)

obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
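To make the dict analogy a bit more concrete, here is a minimal sketch, assuming the same sample data as above. The names s, has_ohio, has_35000 and round_trip are only illustrative.

```python
import pandas as pd

# Sample data mirroring the dict used above.
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
s = pd.Series(sdata)

# Membership tests look at the index labels, just like dict keys.
has_ohio = 'Ohio' in s
# Values are not searched, so this is False.
has_35000 = 35000 in s

# to_dict() converts the Series back into a plain dict.
round_trip = s.to_dict()

print(has_ohio, has_35000)
print(round_trip == sdata)
```

The point is only that the index plays the role of the dict keys; unlike a dict, a Series also supports the vectorized operations shown earlier.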

If you pass an index together with the dict, only the labels in that index are kept, in the order given. In the example below, Utah no longer appears in obj4 because it is not in states. Since the dict has no value for California, that entry is NaN.

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states) 
obj4

California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64

pandas provides isnull and notnull, both as Series methods and as top-level functions, to detect missing data.

print(obj4.isnull())
print('---')
print(pd.isnull(obj4))
print('---')
print(obj4.notnull())
print('---')
print(pd.notnull(obj4))

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
---
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
---
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
---
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
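The boolean result of notnull can be used as a mask, which is one way to act on missing data once it has been detected. A small sketch, assuming a Series shaped like obj4; the names s, kept, dropped, filled and n_missing are only illustrative.

```python
import numpy as np
import pandas as pd

# A Series with one missing value, shaped like obj4 above.
s = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
              index=['California', 'Ohio', 'Oregon', 'Texas'])

# Keep only the entries that are present, using the boolean mask.
kept = s[s.notnull()]

# dropna() is a shortcut for the same filtering.
dropped = s.dropna()

# fillna() replaces missing entries instead of removing them.
filled = s.fillna(0)

# Summing the boolean mask counts the missing entries.
n_missing = int(s.isnull().sum())

print(kept)
print(filled)
print(n_missing)
```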

When you add two Series, the values are aligned by index label. A label that is missing from either Series gives NaN in the result.

print(obj3)
print('---')
print(obj4)
print('---')
print(obj3 + obj4)

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
---
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64
---
California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah             NaN
dtype: float64
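If NaN for non-overlapping labels is not what you want, the add method accepts a fill_value. The sketch below uses two small made-up Series (a and b are hypothetical, not from the book) to compare the two behaviours.

```python
import pandas as pd

# Two small Series that share only the label 'y'.
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'w'])

# The + operator aligns on labels; labels present on one side only give NaN.
plain = a + b

# add() with fill_value treats a label missing on one side as 0.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```

Note that fill_value only helps when a label is missing from one of the two Series; if a value is missing on both sides, the result is still NaN.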

Both a Series and its index can be given a name.

obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
Name: population, dtype: float64

The index of a Series can be replaced in place by assignment.

obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
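Assigning to index needs exactly one label per value. The sketch below, with illustrative names (s, error_message, renamed), shows what happens when the lengths differ, and uses rename as an alternative that changes selected labels and returns a new Series.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])

# Only three labels for four values: pandas raises ValueError.
try:
    s.index = ['Bob', 'Steve', 'Jeff']
except ValueError as err:
    error_message = str(err)
else:
    error_message = None

# rename() maps old labels to new ones and leaves s itself unchanged.
renamed = s.rename({0: 'Bob'})

print(error_message)
print(renamed)
```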

DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

print(frame)

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

In an IPython notebook, a DataFrame is displayed as a table.

frame

If you specify columns, the DataFrame's columns are arranged in exactly that order.

DataFrame(data, columns=['year', 'state', 'pop'])
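If the DataFrame already exists, the same ordering can be obtained by selecting the columns with a list. A minimal sketch, reusing the data dict from above; frame and reordered are just illustrative names.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

# Selecting with a list of column names returns a new DataFrame
# whose columns follow the order of the list.
reordered = frame[['year', 'state', 'pop']]

print(list(reordered.columns))
```

The default column order for a dict input depends on the pandas version, so passing the order explicitly is the safer choice when it matters.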

Just as with a Series, if you pass a column that is not in the data, it is filled with NaN.

frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], 
                   index=['one', 'two', 'three', 'four', 'five'])
frame2

frame2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')

You can retrieve a column as a Series in two ways: dict-style indexing or attribute access.

frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
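The two ways are not always interchangeable. Attribute access only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute. The 'pop' column used in this notebook is such a collision, because DataFrame already has a pop method. A small sketch with illustrative names:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'year': [2000, 2001],
                      'pop': [1.5, 2.4]})

# Dict-style indexing always returns the column.
by_key = frame['pop']

# Attribute access finds the DataFrame.pop method, not the column.
by_attr = frame.pop
is_method = callable(by_attr)

# For a name without a collision, both forms give the same Series.
same_year = frame.year.equals(frame['year'])

print(type(by_key).__name__, is_method, same_year)
```

For that reason, dict-style indexing is the more robust habit.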

You can get a row by its label with loc. (Older pandas versions also offered ix, which has since been removed.)

frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
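In recent pandas releases, row selection is usually written with loc (label-based) or iloc (position-based). The sketch below assumes a trimmed-down version of frame2; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002],
                      'state': ['Ohio', 'Ohio', 'Ohio'],
                      'pop': [1.5, 1.7, 3.6]},
                     index=['one', 'two', 'three'])

# loc selects a row by its index label.
by_label = frame.loc['three']

# iloc selects a row by its integer position (0-based).
by_position = frame.iloc[2]

print(by_label)
print(by_label.equals(by_position))
```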

A column can be modified by assignment. The assigned value can be a scalar, an array, or a Series.

frame2['debt'] = 16.5
frame2

frame2['debt'] = np.arange(5.)
frame2

val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
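The three assignments above follow different rules: a scalar is repeated for every row, a list or array must match the length of the DataFrame, and a Series is aligned on the index, with NaN for labels it does not cover. A compact sketch, using a made-up three-row frame:

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

# A scalar is broadcast to every row.
frame['debt'] = 16.5
scalar_column = frame['debt'].tolist()

# A Series is aligned on the index; 'one' is not covered, so it becomes NaN.
frame['debt'] = pd.Series([-1.2, -1.7], index=['two', 'three'])
aligned_column = frame['debt'].copy()

# A list with the wrong length is rejected with ValueError.
try:
    frame['debt'] = [1.0, 2.0]
except ValueError:
    length_checked = True
else:
    length_checked = False

print(scalar_column)
print(aligned_column)
print(length_checked)
```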

A new column is created if you assign a value to a column that does not exist yet.

frame2['eastern'] = frame2.state == 'Ohio'
frame2

Delete a column with del.

del frame2['eastern']
frame2
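del modifies the DataFrame in place. When the original should stay untouched, drop(columns=...) returns a new DataFrame instead. A minimal sketch with illustrative names:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})
frame['eastern'] = frame['state'] == 'Ohio'

# drop() returns a copy without the column; frame itself is unchanged.
trimmed = frame.drop(columns=['eastern'])
still_there = 'eastern' in frame.columns

# del removes the column from frame itself.
del frame['eastern']

print(list(trimmed.columns), still_there, list(frame.columns))
```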

Another way to create a DataFrame is from a nested dict. The outer keys become the columns and the inner keys become the row index.

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = DataFrame(pop)

frame3

frame3.T

frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
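With a nested dict, the row index is the union of the inner keys, and a combination that is missing from the dict becomes NaN. The sketch below assumes the same population data; the row order may differ between pandas versions, so it sorts the index explicitly. The names pop_by_state and frame are illustrative.

```python
import pandas as pd

pop_by_state = {'Nevada': {2001: 2.4, 2002: 2.9},
                'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys -> columns, inner keys -> row index.
# sort_index() makes the row order independent of the pandas version.
frame = pd.DataFrame(pop_by_state).sort_index()

# Nevada has no entry for 2000, so that cell is NaN.
missing = pd.isnull(frame.loc[2000, 'Nevada'])

# .T swaps rows and columns.
transposed_shape = frame.T.shape

print(frame)
print(missing, transposed_shape)
```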

The values attribute returns the data in the DataFrame as a two-dimensional ndarray.

frame3.values

array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
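The dtype of that array depends on the columns. When all columns are numeric the array is numeric, but when the columns have different types the array falls back to object so that it can hold everything. A short sketch with two made-up frames (numeric and mixed are hypothetical names):

```python
import pandas as pd

numeric = pd.DataFrame({'Nevada': [2.4, 2.9], 'Ohio': [1.7, 3.6]})
mixed = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# All-float columns give a float64 array.
numeric_dtype = numeric.values.dtype

# Strings and floats together give an object array.
mixed_dtype = mixed.values.dtype

# to_numpy() is the method form and returns the same kind of array.
as_array = numeric.to_numpy()

print(numeric_dtype, mixed_dtype, as_array.shape)
```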

Table output of frame:

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

Table output of frame2 after frame2['debt'] = 16.5:

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5

Table output of frame2 after adding the eastern column:

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False