How to get started with Pandas
In this notebook, I followed the examples in Python for Data Analysis to learn how to use Pandas.
from pandas import Series, DataFrame
import pandas as pd
obj = Series([4, 7, -5, 3])
The default index for a Series is the integers starting from 0.
obj
print(obj.values)
print(obj.index)
The index can also be set by the user.
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
print(obj2['a'])
print("---")
obj2['d'] = 6
print(obj2[['c', 'a', 'd']])
Most operations that work on a NumPy array (boolean filtering, scalar multiplication, math functions) also work on a Series, and the result keeps its index.
print(obj2)
print("---")
print(obj2[obj2 > 0])
print("---")
print(obj2 * 2)
print("---")
import numpy as np
print(np.exp(obj2))
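As a minimal sketch of what "keeps its index" means, the snippet below rebuilds the same data under the illustrative names s, positive and exp_s (these names are not part of the notebook) and checks that the labels survive both filtering and a NumPy function.

```python
import numpy as np
import pandas as pd

# Same values as obj2 after the assignment obj2['d'] = 6
s = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Boolean filtering keeps the labels of the elements that pass the test.
positive = s[s > 0]

# A NumPy ufunc returns a Series with the original index, not a bare array.
exp_s = np.exp(s)

print(list(positive.index))
print(type(exp_s).__name__)
print(exp_s.index.equals(s.index))
```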
A Series behaves much like a dict that maps index labels to values. In fact, you can create a Series directly from a Python dict.
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3
You can also pass the dict together with an explicit index; only the labels in that index are kept. In the example below, Utah no longer appears in obj4 because it is not in the index. Since the dict has no value for California, that value is NaN.
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4
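The following sketch repeats the construction above under the illustrative name s and checks the three effects directly: the dropped label, the missing value, and the change of dtype (NaN is a float, so the integer data is stored as float64).

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

s = pd.Series(sdata, index=states)

# 'in' tests membership in the index, like a dict's keys.
print('Utah' in s)                  # label not requested, so it is dropped
print(pd.isnull(s['California']))   # label requested but absent from the dict
print(s.dtype)                      # integers are upcast to hold NaN
```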
pandas provides isnull and notnull to detect missing data, both as top-level functions and as Series methods.
print(obj4.isnull())
print('---')
print(pd.isnull(obj4))
print('---')
print(obj4.notnull())
print('---')
print(pd.notnull(obj4))
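A common use of these boolean masks is counting or filtering missing entries. The sketch below assumes a small Series named s built like obj4; the names n_missing and present are illustrative only.

```python
import pandas as pd

s = pd.Series({'Ohio': 35000, 'Texas': 71000},
              index=['California', 'Ohio', 'Texas'])

# True counts as 1, so summing the mask counts the missing values.
n_missing = int(s.isnull().sum())

# The notnull mask can be used to keep only the entries that have data.
present = s[s.notnull()]

print(n_missing)
print(list(present.index))
```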
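When you add two Series, the values are lined up by their index labels. A label that is missing from either Series, or whose value is NaN in either one, gives NaN in the result.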
print(obj3)
print('---')
print(obj4)
print('---')
print(obj3 + obj4)
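To make the alignment rule concrete, here is a small sketch with two Series, a and b (illustrative names standing in for obj3 and obj4). The result covers the union of the labels, and only labels with a value on both sides get a number.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
a = pd.Series(sdata)
b = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

total = a + b

# Labels present with a value on both sides are summed.
print(total['Ohio'])

# 'Utah' exists only in a, and 'California' has no value in b,
# so both come out as NaN.
print(total[total.isnull()].index.tolist())
```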
You can assign a name to a Series object and to its index.
obj4.name = 'population'
obj4.index.name = 'state'
obj4
The index of a Series can be replaced in place by assignment.
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj
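One detail worth checking when replacing an index this way is that the new list must have the same length as the data. The sketch below (with the illustrative names s and mismatch_raised) shows a valid assignment followed by one that pandas rejects.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])

# Valid: four labels for four values.
s.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
jeff_value = s['Jeff']

# Invalid: two labels for four values raises ValueError.
try:
    s.index = ['Bob', 'Steve']
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(jeff_value)
print(mismatch_raised)
```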
DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame)
In the IPython notebook, a DataFrame is displayed as a table.
frame
If you pass columns, the columns are arranged in exactly the order you specify.
DataFrame(data, columns=['year', 'state', 'pop'])
Just as with a Series, if you ask for a column that is not in the data, it is filled with NaN.
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five'])
frame2
frame2.columns
You can get a column in two ways: dict-style indexing or attribute access. Both return a Series.
frame2['state']
frame2.year
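You can get a row by its label with loc. (Older pandas versions used ix for this; it was deprecated and then removed in pandas 1.0.)
frame2.loc['three']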
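Since ix is no longer available in current pandas releases, row access is split between loc (by label) and iloc (by integer position). The sketch below rebuilds a frame like frame2 under the illustrative name frame and fetches the same row both ways.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                     index=['one', 'two', 'three', 'four', 'five'])

row_by_label = frame.loc['three']   # select by index label
row_by_pos = frame.iloc[2]          # select by position (third row)

# Either way the row comes back as a Series indexed by column name.
print(row_by_label['state'], row_by_label['year'])
print(row_by_label.equals(row_by_pos))
```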
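Column values can be changed by assignment. The value can be a scalar, an array whose length matches the DataFrame, or a Series. A Series is aligned on the DataFrame's index, and rows without a matching label get NaN.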
frame2['debt'] = 16.5
frame2
frame2['debt'] = np.arange(5.)
frame2
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
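The Series case is the least obvious of the three assignments, so here is a minimal sketch of it in isolation. The one-column frame and the name n_missing are illustrative; the point is that values land on the rows whose labels match, and the remaining rows are left as NaN.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002, 2001, 2002]},
                     index=['one', 'two', 'three', 'four', 'five'])
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

# Assignment aligns val on the frame's index rather than on position.
frame['debt'] = val

n_missing = int(frame['debt'].isnull().sum())
print(frame.loc['two', 'debt'])
print(n_missing)   # rows 'one' and 'three' have no matching label
```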
Assigning to a column that does not exist creates a new column.
frame2['eastern'] = frame2.state == 'Ohio'
frame2
Delete a column with del.
del frame2['eastern']
frame2
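Note that del changes the DataFrame in place. If you would rather keep the original, one alternative (not used in the notebook above) is the drop method, which returns a new DataFrame. The sketch below contrasts the two on a small illustrative frame.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})
frame['eastern'] = frame['state'] == 'Ohio'

# drop returns a copy without the column; frame itself is untouched.
trimmed = frame.drop(columns=['eastern'])
still_there = 'eastern' in frame.columns

# del removes the column from frame itself.
del frame['eastern']

print(list(trimmed.columns))
print(still_there)
print(list(frame.columns))
```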
Another way to create a DataFrame is from a nested dict. The outer keys become the columns and the inner keys become the row index.
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(pop)
frame3
frame3.T
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
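The sketch below checks how the nested dict is laid out, using the illustrative name frame. The inner keys of all the dicts are combined into one row index, so a year that appears under only one state is NaN for the other. The row order is not asserted here because it can differ between pandas versions.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = pd.DataFrame(pop)

print(list(frame.columns))               # outer keys
print(sorted(frame.index))               # union of the inner keys
print(pd.isnull(frame.loc[2000, 'Nevada']))

# Transposing swaps the roles: states become rows, years become columns.
print(frame.T.loc['Ohio', 2000])
```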
The values attribute returns the data in the DataFrame as a two-dimensional ndarray.
frame3.values
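One thing to keep in mind is that an ndarray has a single dtype. As a rough sketch (the frames numeric and mixed are illustrative), an all-float DataFrame gives a float64 array, while a DataFrame that mixes strings and numbers falls back to the generic object dtype.

```python
import pandas as pd

numeric = pd.DataFrame({'Nevada': {2001: 2.4, 2002: 2.9},
                        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}})
mixed = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

arr = numeric.values
print(arr.shape)            # rows x columns
print(arr.dtype)            # all columns are floats
print(mixed.values.dtype)   # strings and floats share one array
```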