reading-notes

10 minutes to pandas

pandas is a Python library used to analyze data.

Object creation

Creating a Series by passing a list of values, letting pandas create a default integer index:

import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

dates = pd.date_range('20130101', periods=6)

dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
```python
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

df
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
```

Viewing data

Here is how to view the top and bottom rows of the frame:

df.head()
df.tail(3)

Selection

Selecting a single column, which yields a Series, equivalent to df.A:

df['A']

Selection by label

Getting a cross section using a label:

df.loc[dates[0]]

Selection by position

Select via the position of the passed integers:

df.iloc[3]

Boolean indexing

Using a single column’s values to select data:

df[df['A'] > 0]
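
The selection styles above are easier to compare on a frame with known values. The following is a minimal sketch; the column values and the four-day index are illustrative assumptions, not data from these notes.

```python
import pandas as pd

# Small deterministic frame (illustrative data, chosen for this sketch)
dates = pd.date_range("20130101", periods=4)
df = pd.DataFrame(
    {"A": [1.0, -2.0, 3.0, -4.0], "B": [10, 20, 30, 40]},
    index=dates,
)

col = df["A"]                     # single column -> Series
row_by_label = df.loc[dates[0]]   # row with label 2013-01-01 -> Series
row_by_pos = df.iloc[3]           # fourth row by integer position -> Series
positive = df[df["A"] > 0]        # rows where column A is positive -> DataFrame
```

Here `loc` looks rows up by index label and `iloc` by integer position, while the boolean mask keeps only the rows for which the condition holds.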

Setting

Setting a new column automatically aligns the data by index:

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
df['F'] = s1
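
To see what index alignment means in practice, here is a small sketch. It assumes a stand-in frame with a single column `A` and uses `F` as the name of the new column; both names are chosen for illustration.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame({"A": range(6)}, index=dates)  # stand-in frame for this sketch

# s1 starts one day later than df, so the two indexes only partly overlap
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

df["F"] = s1  # values are matched by index label, not by position

print(df)
```

The first row of `F` is NaN because `s1` has no entry for 2013-01-01, and the value `s1` holds for 2013-01-07 is dropped because that label is not in the frame's index.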

Missing data

pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations.
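
A short sketch of how missing values behave in reductions, along with the usual ways to detect, drop, or fill them. The three-element Series is an illustrative assumption.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                # NaN is skipped: 1.0 + 3.0
mean = s.mean()                # averaged over the two non-missing values
strict = s.sum(skipna=False)   # opt in to NaN propagation instead

dropped = s.dropna()           # remove missing entries
filled = s.fillna(0.0)         # replace missing entries with 0.0
mask = s.isna()                # boolean mask marking missing entries
```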

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.
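
As a rough illustration of reindexing, the sketch below keeps two labels, drops one, and adds a new label and a new column. The labels and the column name `E` are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"])

# Keep "a" and "c", drop "b", add a new label "d"; also add a new column "E"
df1 = df.reindex(index=["a", "c", "d"], columns=["A", "E"])

print(df1)
```

Labels and columns that did not exist before are filled with NaN, and the original `df` is left unchanged.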

Merge

pandas provides several facilities for combining Series and DataFrame objects, with set logic for the indexes and relational-algebra functionality for join and merge operations.
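
For the join / merge case, a minimal sketch using `pd.merge` might look like this. The frames and the column names `key`, `lval`, and `rval` are illustrative assumptions.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

# SQL-style inner join on the shared "key" column
merged = pd.merge(left, right, on="key")

print(merged)
```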

Concatenate pandas objects with concat().
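
A small sketch of row-wise concatenation, assuming two illustrative frames that share the same column:

```python
import pandas as pd

top = pd.DataFrame({"A": [1, 2]}, index=[0, 1])
bottom = pd.DataFrame({"A": [3, 4]}, index=[2, 3])

# Stack the pieces row-wise (axis=0 is the default)
combined = pd.concat([top, bottom])

print(combined)
```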

Grouping

“Group by” refers to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria

  • Applying a function to each group independently

  • Combining the results into a data structure
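
The three steps above can be sketched in a single expression. The frame and its column names `A` and `C` are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1, 2, 3, 4]}
)

# Split on column A, apply sum to each group, combine into a Series
sums = df.groupby("A")["C"].sum()

print(sums)
```

The result is indexed by the group keys, sorted by default, so `bar` comes before `foo`.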

Plotting

We use the standard convention for referencing the matplotlib API.
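
The convention in question is importing `matplotlib.pyplot` as `plt`. Below is a minimal sketch that assumes matplotlib is installed. It selects the non-interactive `Agg` backend only so that it runs without a display, and the series of random steps is illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Cumulative sum of 100 random steps, indexed by consecutive days
ts = pd.Series(
    np.random.randn(100),
    index=pd.date_range("1/1/2000", periods=100),
).cumsum()

ax = ts.plot()  # pandas draws onto a matplotlib Axes and returns it
ax.set_title("Cumulative sum of random steps")
# In an interactive session, plt.show() would display the figure
```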

Getting data in/out

CSV

Writing to a CSV file:

df.to_csv('foo.csv')

Reading from a CSV file:

pd.read_csv('foo.csv')
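
One detail worth noting in a round trip is that `to_csv` writes the index as the first column, so a plain `read_csv` brings it back as ordinary data. The sketch below assumes a small illustrative frame and writes to a temporary directory instead of the working directory.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"A": [1.5, 2.5], "B": [3, 4]},
    index=pd.date_range("20130101", periods=2),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.csv")
    df.to_csv(path)  # the index is written as the first, unnamed column

    # Plain read: the old index comes back as a regular column
    plain = pd.read_csv(path)

    # Restore the index by pointing at its column and parsing it as dates
    restored = pd.read_csv(path, index_col=0, parse_dates=True)
```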