Introduction to pandas

Working with pandas objects

Loading and saving data in pandas

Dataframe and Series manipulation

Graphics in pandas

Basic statistical modeling

Working with time series

Exercises

The `pandas` library

A library designed to easily manipulate, clean, and analyze data sets of very different natures. Many idioms are taken from `NumPy`; the main difference is that `pandas` can handle heterogeneous tabular data, while `NumPy` only accepts homogeneous arrays.

To import the `pandas` library (we also import `NumPy`, which is used throughout):

import pandas as pd
import numpy as np

We now explain the two most important `pandas` data structures: *series* and *dataframes*.

A pandas series is a 1D array object that contains a sequence of values (homogeneous or not) and an associated array of data labels, its *index*. You can define a series from any array; by default the index is made of the integers from zero to the number of elements minus one.

series_ex_I = pd.Series([4, 5, 6, 7, 9])
print(series_ex_I)
series_ex_II = pd.Series([4, 5, 76.7, True])
print(series_ex_II)

You can access the indexes and the values separately with the `index` and `values` attributes

print(series_ex_II.index)
print(series_ex_II.values)

You can provide the index labels (not necessarily integers) when creating the data structure

series_ex_III = pd.Series([4, 5, 76.7, True], index=["t1", "t2", "t3", "tbool1"])
print(series_ex_III)

You can access an element by its index value, or provide a list of indexes and access a subset of values.

# Accessing a single element
print(series_ex_III['t1'])
# Accessing several elements
print(series_ex_III[["t2", "tbool1"]])

You can also perform NumPy-like operations on series and select elements in accordance with a given criterion

print(series_ex_I[series_ex_I > 5])
print(series_ex_I*4)
print(series_ex_III[series_ex_III > 5])
print(series_ex_III*4)

You can check the existence of a given index label with a syntax similar to the one used in dictionaries

"t2" in series_ex_III

You can also create a series directly from a dictionary

hash_0 = {"Sevilla": 2022, "Huelva": 2044, "Granada": 2033}
series_ex_IV = pd.Series(hash_0)
print(series_ex_IV.index)
print(series_ex_IV.values)

You can also explicitly control the index label ordering

series_ex_IV = pd.Series(hash_0, index=["Huelva", "Granada", "Sevilla", "Ceuta"])
print(series_ex_IV)

As the `Ceuta` index does not exist in the dictionary, it is created with a `NaN` value. You can check for the occurrence of these missing values using the `isnull` function or method

print(pd.isnull(series_ex_IV))
print(series_ex_IV.isnull())

A very useful feature is that when you operate with pandas series they are aligned by index label.

hash_1 = {"Sevilla": 2, "Huelva": 4, "Granada": 3}
series_ex_V = pd.Series(hash_1, index=["Ceuta", "Granada", "Huelva", "Sevilla"])
# Index alignment
print(series_ex_IV + series_ex_V)

You can also name the series and its index to facilitate its identification

# Series and index names
series_ex_IV.name = "R&R Parameter"
series_ex_IV.index.name = "City"
print(series_ex_IV)

You can also alter in-place the index labels of a series

series_ex_III.index = ["Ceuta","Granada", "Huelva", "Sevilla"]

And you can operate with the new series

series_ex_IV + 2*series_ex_V + series_ex_III

The best-known pandas data structure is the *dataframe*, designed to mimic the versatility of the equally named `GNU R` data structure. A dataframe can be considered a set of series sharing the same index; therefore it has an index for the rows and a label for each column. Different columns can hold data of different *dtype*, and even data within a column can be of different types.

We will see that there are several different ways of constructing a dataframe. A usual one is to start with a dict of equal-length lists or NumPy arrays.

data_student = {"Wall, L": [8, 9, 9, 10, 8], "Page, L": [9, 7, 9, 10, 10],
                "Gates, B": [8, 8, 8, 9, 9], "Thompson, K": [10, 10, 9, 9, 9],
                "Rosum, G van": [9, 9, 8, 8, 10]}
dframe_1 = pd.DataFrame(data_student)
dframe_1

As can be checked in the output, the row index is a default one, as in the Series case, and the columns keep the dictionary order. You can sort them at your convenience using the `columns` argument.

dframe_1 = pd.DataFrame(data_student, columns=sorted(data_student.keys()))
dframe_1

In fact, if there are column names without associated data in the dictionary, the column is created with missing values (*NaN*s). And, as with series, you can provide a given index too.

cols = list(data_student.keys())
cols.append("Ritchie, D")
dframe_2 = pd.DataFrame(data_student, columns=sorted(cols),
                        index=["Math_01", "Math_02", "Alg_01", "OpSys", "Num_An"])
dframe_2

You can also set the *name* attribute for the dataframe *columns* and *index* and they will be displayed with the dataframe

dframe_2.index.name = "Subject Grades"
dframe_2.columns.name = "Pro Programmers"
dframe_2

You can retrieve any dataframe row or column. Columns are obtained as a series using the column name or by attribute, though the second option is only valid for column names that are also valid Python variable names.

dframe_2["Wall, L"]
# Not a valid variable name
# print(dframe_2."Wall, L")

Note that the series and index names are conserved. Rows can be retrieved using the *loc* attribute.

dframe_1.loc[1]
dframe_2.loc["OpSys"]

The values that you retrieve are not a copy of the underlying data, but a view of them. Be aware that if you modify them in place this will affect the original dataframe. There is a `copy` method to obtain a copy of the data.
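A minimal sketch of `copy` at work, using a small throwaway dataframe (not one of the examples above). Note that whether plain column access returns a view depends on the pandas version and its copy-on-write setting, so `copy` is the safe way to get independent data:

```python
import pandas as pd

# Illustrative dataframe
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

safe = df["b"].copy()   # an independent copy of the column
safe.loc[0] = -1        # modifying the copy...
print(df.loc[0, "b"])   # ...leaves the original untouched: 4
print(safe.loc[0])      # -1
```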

You can modify an existing column or create a new column with a default value by assignment

dframe_2["Ritchie, D"] = 9
dframe_2["Torvalds, L"] = 8
dframe_2

You can also provide as a value an array with the right number of elements

dframe_2["Torvalds, L"] = np.arange(6, 11)
dframe_2

If instead of an array you provide a series, there is more flexibility, as the indexes in the series will be aligned with the indexes in the dataframe and any missing index will be replaced by a *NaN* missing value.

series_Stallman = pd.Series([9, 9, 9, 9], index=["Math_01", "Math_02", "OpSys", "Num_An"])
dframe_2["Stallman, R"] = series_Stallman
dframe_2

You can add a boolean column using the NumPy syntax

dframe_2["test"] = dframe_2["Stallman, R"] >= 9
dframe_2

You can delete the added column using the `del` keyword

del dframe_2["test"]
dframe_2

You can use a syntax similar to the one used in NumPy arrays to transpose a dataframe, exchanging the roles of indexes and columns.

dframe_2.T

There are many other ways of creating a dataframe, apart from the previous one of formatting your data as a dictionary of lists or NumPy vectors. We will provide some of them here:

From a nested dict of dicts. The advantage in this case is that the keys of the nested dictionaries are the dataframe indexes

# Dict of dicts
d3_dict = {"Planck, M": {"Hometown": "Kiel", "Born": 1858, "Died": 1947},
           "Heisenberg, W": {"Hometown": "Wurzburg", "Born": 1901, "Died": 1976},
           "Fermi, E": {"Hometown": "Roma", "Born": 1901, "Died": 1954},
           "Schroedinger, E": {"Hometown": "Erdberg", "Born": 1887, "Died": 1961}}
dframe_3 = pd.DataFrame(d3_dict)
dframe_3

You can also use a dict of series to build the dataframe.

# Dict of series
d4_dict = {"Page, L": dframe_2["Page, L"][:-1], "Torvalds, L": dframe_2["Torvalds, L"][1:]}
dframe_4 = pd.DataFrame(d4_dict)
dframe_4

Or a two-dimensional *ndarray*

# 2D ndarray
arr_2D = np.ones([4, 4])
dframe_5 = pd.DataFrame(arr_2D, columns=["Col_00", "Col_02", "Col_03", "Col_04"],
                        index=["a", "b", "c", "d"])
dframe_5

The array, tuple, list, or any other structure passed to the dataframe constructor as the index for the data set is transformed into an `Index` object.

index = dframe_5.index
index

Those objects are *immutable*. You can create an index object with the `pd.Index` constructor and pass it to a data structure; these objects can also be shared among data structures.

index_obj = pd.Index(np.arange(0, 5))
print(index_obj)
# Create a series with the index object
series_ex_VI = pd.Series(np.pi*np.arange(0, 5), index=index_obj)
print(series_ex_VI)
# Replace the index in a dataframe with the new index object
dframe_6 = dframe_4.set_index(index_obj)
dframe_6

An index can contain duplicate labels; a selection will then return all occurrences of the repeated label

dframe_7 = pd.DataFrame(data_student, columns=sorted(cols),
                        index=["Math_01", "Math_02", "Alg_01", "OpSys", "Math_02"])
print(dframe_7)
dframe_7.loc["Math_02"]

Reindexing a pandas object creates a *new* object with different index labels. With a series, it rearranges the data according to the new index: any old index value missing from the new index is removed from the series, and `NaN` values are introduced for nonexistent index values.

print(series_ex_I)
new_series = series_ex_I.reindex(np.arange(1, 7))
new_series

When applied to a dataframe, reindex can modify rows, columns, or both. By default, rows are reindexed according to a given sequence in the same way as for series.

print(dframe_5)
dframe_8 = dframe_5.reindex(["a", "c", "b", "f", "e", "d"])
dframe_8

The `reindex` function has several arguments. For example, the `fill_value` argument allows for defining a default value for the data associated with non-existent labels when reindexing

dframe_8 = dframe_5.reindex(["a", "c", "b", "f", "e", "d"], fill_value=0)
dframe_8

There is also an option to fill missing values when reindexing, which is quite useful when working with time series. The option name is `method`, and with the value `ffill` (`bfill`) it performs a forward (backward) filling.

dframe_9 = pd.DataFrame(np.random.randn(4, 4), columns=["Col_00", "Col_02", "Col_03", "Col_04"],
                        index=np.arange(0, 8, 2))
print(dframe_9)
dframe_10 = dframe_9.reindex(np.arange(0, 8), method="ffill")
dframe_10

You can also reindex columns using the `columns` keyword

dframe_11 = dframe_5.reindex(columns=["Col_04", "Col_03", "Col_02", "Col_01", "Col_00"])
dframe_11

Notice the difference between reindexing and renaming columns, either creating a new object or editing the dataframe in place.

print(dframe_11)
print(dframe_11.rename({"Col_04": "Col_a", "Col_02": "Col_b", "Col_00": "Col_d"}, axis=1))
print(dframe_11)
dframe_11.rename({"Col_04": "Col_a", "Col_02": "Col_b", "Col_00": "Col_d"}, axis=1, inplace=True)
print(dframe_11)

You can use the `drop` method, which returns a new object with the indicated entries deleted. This can be applied to series and dataframes.

print(series_ex_VI)
new_series = series_ex_VI.drop([1, 3])
print(new_series)
print(series_ex_VI)
new_dfram = dframe_10.drop(range(2, 6))
print(new_dfram)
new_dfram = dframe_10.drop(["Col_03", "Col_00"], axis=1)
print(new_dfram)

The `inplace=True` option allows for editing the object in place, avoiding the creation of a new object.
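A minimal sketch of in-place dropping, with a small throwaway series:

```python
import pandas as pd

# Illustrative series
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
s.drop("b", inplace=True)   # edits s itself; returns None
print(s.index.tolist())     # ['a', 'c']
```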

You can use a syntax similar to NumPy's with pandas series; in particular, you can use either the index labels or integers.

print(series_ex_V)
print(series_ex_V["Granada"])
print(series_ex_V[1])
print(series_ex_V[["Granada", "Sevilla"]])

You can apply filters to the series

print(series_ex_V[series_ex_V > 2])

You can also use slicing with integers or index labels, but notice the difference: the end point of the range is included in the labels case.

print(series_ex_V["Granada":"Sevilla"])
print(series_ex_V[1:3])

You can assign values using these methods

series_ex_V["Granada":"Sevilla"] = 10
print(series_ex_V)

By default, indexing refers to columns and you can set values

print(dframe_7)
print(dframe_7["Thompson, K"])
print(dframe_7[["Gates, B", "Thompson, K"]])
dframe_7["Ritchie, D"] = 6

If you use slicing, it acts on the row index instead

print(dframe_7[1:3])

You can select data using a Boolean array

print(dframe_7[dframe_7 > 8])
print(dframe_7[dframe_7["Page, L"] > 8])

You can also assign using this syntax

dframe_7[dframe_7 <= 8] = 5

You can also use `loc` and `iloc` for row indexing using axis labels or integers, respectively.

We can select two rows and two columns by label and by integer values as follows

dframe_7.loc[["Math_01", "OpSys"], ["Rosum, G van", "Wall, L"]]
dframe_7.iloc[[0, 3], [3, 5]]

Both ways of selecting elements work with slices

dframe_7.loc["OpSys":, "Rosum, G van":"Wall, L"]
dframe_7.iloc[0:3, 0:4]

It is important to take into account that if an axis index contains integers, data selection will always be label-oriented to avoid possible ambiguities. In these cases it is preferable to use `loc` or `iloc`.
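A small sketch of the ambiguity, with a throwaway series whose index is made of non-consecutive integers:

```python
import pandas as pd

# Integer labels that do not coincide with positions
s = pd.Series(["low", "mid", "high"], index=[2, 0, 1])

print(s.loc[2])    # label-based selection: 'low'
print(s.iloc[2])   # position-based selection: 'high'
```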

Arithmetic between series with different indexes is performed in such a way that the final index set is the union of the indexes involved; missing values are introduced in the elements where the two series don't overlap, and they propagate through arithmetic operations.

An example with two series is

print(series_ex_I)
print(series_ex_II)
series_ex_I + series_ex_II

In the case of dataframes, a double alignment is performed, on both indexes and columns

print(dframe_4)
print(dframe_7)
dframe_4 + dframe_7

If there are neither columns nor rows in common, the result will be made only of `NaN`'s. Using the `add` method you can pass an argument to fill the non-overlapping elements of the dataframe with a given value, though if both elements are missing the result will still be a `NaN`.

dframe_4.add(dframe_7, fill_value = 0.0)

The following methods are available for series and dataframe arithmetic; they also have reversed versions.

`add` [`radd`]: addition
`sub` [`rsub`]: subtraction
`div` [`rdiv`]: division
`floordiv` [`rfloordiv`]: floor division
`mul` [`rmul`]: multiplication
`pow` [`rpow`]: exponentiation
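A minimal sketch of a reversed version: `rdiv` performs the division with the arguments flipped, which is handy when the scalar is the numerator.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0])

print(s.div(2).tolist())    # s / 2 -> [0.5, 1.0, 2.0]
print(s.rdiv(2).tolist())   # 2 / s -> [2.0, 1.0, 0.5]
```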

You can mix series and dataframes in operations, which are performed in a way similar to *broadcasting* in NumPy. By default, the series index is matched against the dataframe's columns, broadcasting down the rows

series_ex_VII = pd.Series(np.random.randn(5), index=["Col_00", "Col_01", "Col_02", "Col_03", "Col_04"])
print(series_ex_VII)
print(dframe_8)
dframe_8 * series_ex_VII

You can also broadcast on the rows, matching the series index against the dataframe index. In this case you need to use the method notation.

dframe_2.rsub(series_Stallman, axis = 'index')

You can also perform arithmetic operations with dataframes

print(dframe_10)
print(dframe_9)
dframe_10/dframe_9

NumPy universal functions (*ufuncs*) can be used in series and dataframes

np.cos(dframe_10)

The `apply` method allows for applying a function to each row or column of a dataframe, giving as a result a dataframe or a series with the corresponding indexes, depending on the function output. By default, the function is applied to each column.

# Series output
g = lambda x: np.mean(x)/np.std(x)
print(dframe_10.apply(g))
# Dataframe output
f = lambda x: (x - np.mean(x))/np.std(x)
dframe_12 = dframe_10.apply(f)
print(dframe_12)
print(dframe_12.mean(axis=0))
dframe_13 = dframe_10.apply(f, axis="columns")
print(dframe_13)
dframe_13.mean(axis=1)

You can apply a function that returns multiple values as a series

def h(x):
    return pd.Series([x.min(), x.max(), x.mean(), x.std(), x.sum()],
                     index=['min', 'max', 'mean', 'std', 'sum'])

print(dframe_10)
dframe_13 = dframe_10.apply(h)
print(dframe_13)

You can also apply element-wise functions, to a Series with `map` and to a DataFrame with `applymap`, as follows

format_floats = lambda x: '%.2f' % x
# Series
print(dframe_10["Col_00"].apply(format_floats))
# Dataframe
dframe_10.applymap(format_floats)

The `sort_index` method returns a new object, lexicographically sorted by row or column, and can be applied to Series,

# Sort series by index
print(series_ex_IV)
series_ex_IV.sort_index()

and DataFrames. In this case you can also sort by column name.

# Sort dataframe by index
print(dframe_7)
print(dframe_7.sort_index())
print(dframe_7.sort_index(axis=1, ascending=False))

If, instead of sorting by index, you need to sort by the Series or DataFrame values, the method is called `sort_values`; notice that `NaN` values are sent to the end of the series. In the case of DataFrames you can use one or more columns as sorting keys; it is then mandatory to include at least one column name in the `by=` argument.

# Sort Series by values
print(series_ex_IV.sort_values())
# Sort DataFrame by values
print(dframe_7.sort_values(by="Wall, L"))
print(dframe_7.sort_values(by=["Wall, L", "Thompson, K"]))

Related to sorting is ordering elements by their rank. This is accomplished with the `rank` method. At first sight it can be surprising to find non-integer positions; this is explained by the fact that the method breaks ties by assigning the mean rank to the group

print(series_ex_IV)
print(series_ex_IV.rank())
print(dframe_7)
print(dframe_7.rank())
print(dframe_7.rank(axis="columns"))

There are other ways of breaking ties: the options `min` or `max` assign the minimum or maximum rank to the whole group; `first` takes into account the order of appearance of the elements; and `dense` is similar to `min`, but with ranks always increasing by one, independently of the number of elements in the group.

print(dframe_7)
print(dframe_7.rank(method="min"))
print(dframe_7.rank(method="dense"))

Pandas provides a set of useful methods to compute descriptive statistical quantities of your Series or DataFrame.

`count`: Number of non-NaN values.
`describe`: Summary statistics for a Series or for each DataFrame column.
`min`, `max`: Minimum and maximum values.
`argmin`, `argmax`: Index locations (integers) at which the minimum or maximum value is obtained.
`idxmin`, `idxmax`: Index labels at which the minimum or maximum value is obtained.
`quantile`: Sample quantile ranging from 0 to 1.
`sum`: Sum of values.
`mean`: Mean of values.
`median`: Arithmetic median (50% quantile) of values.
`mad`: Mean absolute deviation from the mean value.
`prod`: Product of all values.
`var`: Variance of values.
`std`: Standard deviation of values.
`skew`: Skewness of values (third moment).
`kurt`: Excess kurtosis of values (fourth moment minus 3).
`cumsum`: Cumulative sum of values.
`cummin`, `cummax`: Cumulative minimum or maximum of values, respectively.
`cumprod`: Cumulative product of values.
`diff`: First arithmetic difference (useful for time series).
`pct_change`: Percent changes.

Some examples are provided for the normal distribution

# Random Gaussian dataframe
Elements = 1000
arr_2D = np.random.randn(Elements, 6)
cols = []
for index in range(6):
    cols.append("Set_0{0}".format(index))
dframe_14 = pd.DataFrame(arr_2D, columns=cols)
dframe_14

print("Mean\n", dframe_14.mean(), "\n Std.\n", dframe_14.std(),
      "\nQuantile\n", dframe_14.quantile(q=0.68),
      "\nSkewness\n", dframe_14.skew(), "\nKurtosis\n", dframe_14.kurt())

We can also compute more elaborate statistics. As an example, we load COVID-19 data for Spain and analyze the total number of cases in the Andalusian provinces. The first step is to download the data and build the requested dataframe

# Library
from datetime import date, timedelta

# Data reading
data = pd.read_csv('https://cnecovid.isciii.es/covid19/resources/datos_provincias.csv')

# Andalusian provinces
Andalousie = ["AL", "CA", "CO", "GR", "H", "J", "MA", "SE"]

# Create a dataframe with the total number of cases
temp_dict = {"AL": data[data["provincia_iso"]=="AL"]["num_casos"]}
dframe_AND = pd.DataFrame(temp_dict)
dframe_AND = dframe_AND.set_index(data[data["provincia_iso"]=="AL"]["fecha"])
dframe_AND_index = dframe_AND.index
for province in Andalousie[1:]:
    series_province = data[data["provincia_iso"]==province]["num_casos"]
    series_province.index = dframe_AND.index
    dframe_AND[province] = series_province

Now we can compute some statistics. For example the covariance and correlation matrices

print(dframe_AND.cov())
print(dframe_AND.corr())

We can also compute the percent change for a given period

percent_change = dframe_AND.pct_change(periods=7)
percent_change.tail()

We can compute the correlation with another Series or DataFrame column using `corrwith`

Province = "M"
Series_P = data[data["provincia_iso"]==Province]["num_casos"]
Series_P.index = dframe_AND_index
dframe_AND.corrwith(Series_P)

Other methods of interest related to Series and DataFrame description are `unique`, which provides an array of unique values, and `value_counts`, which computes value frequencies

# Sevilla sorted unique values
uniques_SE = dframe_AND["SE"].unique()
uniques_SE.sort()
uniques_SE

# Sevilla value count
values_SE = dframe_AND["SE"].value_counts()
values_SE

The `isin` method performs a vectorized membership check for a given set and is used to filter a Series or a DataFrame column down to a subset of values

mask_SE = dframe_AND["SE"].isin([0, 1, 2])
dframe_AND["SE"][mask_SE]

Related to this method is the `Index.get_indexer` method, which, given an array of possibly repeated values, returns their integer positions in an index of distinct values (with -1 for values not present in the index). In this example we get an array locating the dates when the number of cases is less than or equal to three.

pd.Index(pd.Series([0,1,2,3])).get_indexer(dframe_AND["SE"])

Using `apply` one can perform rather complex data manipulation in a concise way. We can, for example, count the distinct values for each column of our example dataframe

dframe_AND.apply(pd.value_counts).fillna(0)

Pandas offers an impressive set of methods for reading and saving data, making it possible to grapple with many different formats. We start with text-formatted files and include other formats later.

The main functions in Pandas to deal with text-formatted files are

`read_csv`: Load delimited data from a file, URL, or file-like object. The default delimiter is the comma.
`read_table`: Load delimited data from a file, URL, or file-like object. The default delimiter is the tab ('\t').
`read_fwf`: Read data without delimiters in a fixed-width column format.
`read_clipboard`: Version of `read_table` that reads data from the clipboard. Useful for converting tables from web pages.

We have already seen an example of `read_csv` in action, loading data from a URL. The large variety of formats and ways of encoding information into text files can be handled at the cost of having a plethora of options and modifiers to the previous functions. The most common ones are

`path`: String indicating filesystem location, URL, or file-like object.
`sep` or `delimiter`: Character sequence or regular expression that marks field separation in each row.
`header`: Row number whose entries are used as column names. Defaults to the first row, and should be `None` if there is no header row.
`index_col`: Column numbers or names to use as the row index in the result; can be a single name/number or a list of them for a hierarchical index.
`names`: List of column names for the result; combine with header=None.
`skiprows`: Number of rows at the beginning of the file to ignore. It can also be given as a list of row numbers (starting from 0) to skip.
`na_values`: Sequence of values to replace with NA.
`comment`: Character(s) marking comments, used to split comments off the end of lines.
`parse_dates`: Attempt to parse data to datetime; False by default. If True, will attempt to parse all columns. Otherwise, can specify a list of column numbers or names to parse. If an element of the list is a tuple or list, multiple columns will be combined and parsed to date (e.g., if date/time is split across two columns).
`keep_date_col`: If joining columns to parse date, keep the joined columns; False by default.
`converters`: Dict mapping column numbers or names to functions (e.g., {'foo': f} would apply the function f to all values in the 'foo' column).
`dayfirst`: When parsing potentially ambiguous dates, treat as international format (e.g., 7/6/2012 -> June 7, 2012); False by default.
`date_parser`: Function to use to parse dates.
`nrows`: Number of rows to read from the beginning of the file.
`iterator`: Return a *TextParser* object for reading the file piecemeal.
`chunksize`: For iteration, size of file chunks.
`skip_footer`: Number of lines to ignore at the end of the file.
`verbose`: Print various parser output information, like the number of missing values placed in non-numeric columns.
`encoding`: Text encoding for Unicode (e.g., 'utf-8' for UTF-8 encoded text).
`squeeze`: If the parsed data contains only one column, return a Series.
`thousands`: Separator for thousands (e.g., ',' or '.').

As an example we can read into a dataframe one of the monthly temperature files used in previous examples

pd.read_csv("files/TData/T_Araxos_EM.csv")

By default, the first row elements are used to label the columns. We can choose our own labels and also transform the year into the dataframe index

tdata_Araxos = pd.read_csv("files/TData/T_Araxos_EM.csv", index_col=0,
                           names=["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
                           skiprows=1)
tdata_Araxos

Notice that in this case we are also forced to skip the first row.

In case a file is very long, we may need to read it piecewise, using the `nrows` argument, or iterate over small portions of the file.
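A small sketch of both approaches. An in-memory CSV (hypothetical data) stands in for a long file:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a long file
csv_text = "x,y\n" + "\n".join(f"{i},{i * i}" for i in range(10))

# nrows: read only the first rows
head = pd.read_csv(io.StringIO(csv_text), nrows=3)
print(len(head))   # 3

# chunksize: iterate over the file piecewise
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["x"].sum()
print(total)   # 45
```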

The `to_csv` method allows for writing data files as comma-separated files. We can write the dataframe read above into a file. By default the separator is a comma, but you can define another character as separator with the `sep` option

tdata_Araxos.to_csv("temp_Data_Araxos.csv", sep=";")
!cat temp_Data_Araxos.csv

By default, row and column labels are included in the file. This can be disabled using the *index=False* and *header=False* options, respectively. Note that missing values appear as empty strings; they can be denoted by a so-called *sentinel value* using the *na_rep* option.

tdata_Araxos["UndefVals"] = np.nan
tdata_Araxos.to_csv("temp_Data_Araxos.csv", sep=":", na_rep="NULL")
!cat temp_Data_Araxos.csv
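A minimal sketch of disabling the labels, writing a throwaway dataframe to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Write without the row index and without the column header
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
print(buf.getvalue().splitlines())   # ['1,3', '2,4']
```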

You can also save a subset of columns, in an arbitrary order that you define

tdata_Araxos.to_csv("temp_Data_Araxos.csv", sep=":", columns=["Mar", "Apr", "May"])
!cat temp_Data_Araxos.csv

Other functions deal with non-text formats:

`read_excel`: Read tabular data from an Excel XLS or XLSX file. Uses the packages `xlrd` and `openpyxl`.
`read_hdf`: Read HDF5 files written by pandas.
`read_html`: Read all tables found in the given HTML or XML document.
`read_json`: Read data from a JSON (JavaScript Object Notation) string representation. Library: `json`.
`read_msgpack`: Read pandas data encoded using the MessagePack binary format.
`read_pickle`: Read an arbitrary object stored in Python pickle format.
`read_sas`: Read a SAS dataset stored in one of the SAS system's custom storage formats.
`read_sql`: Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame.
`read_stata`: Read a dataset from the Stata file format.
`read_feather`: Read the Feather binary file format.
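As a small sketch, `read_json` can parse hypothetical JSON records straight from an in-memory buffer:

```python
import io
import pandas as pd

# Hypothetical JSON records (illustrative data)
json_text = '[{"city": "Sevilla", "val": 1}, {"city": "Huelva", "val": 2}]'
df = pd.read_json(io.StringIO(json_text))
print(df["city"].tolist())   # ['Sevilla', 'Huelva']
```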

You can define series and dataframes with more than one level of indexing, paving the way to treating multidimensional data in a lower-dimensional form. This is an example for a series

series_hind = pd.Series(np.random.randn(10),
                        index=[[0, 0, 0, 0, 1, 1, 1, 2, 3, 4],
                               ["x", "y", "z", "t", "x", "y", "z", "x", "y", "z"]])
print(series_hind)
print(series_hind.index)

You can access the elements making use of the different index levels

print(series_hind[0])
print(series_hind[1, "z"])

And you can also slice through the indexes, making use of the `loc` attribute and the `slice` function or the NumPy syntax.

print(series_hind.loc[slice(1, 3)])
print(series_hind.loc[1:3])
print(series_hind.loc[:, "z"])

Note that in this case slices are inclusive.

Hierarchical indexing eases the reshaping of data. Making use of the `unstack` method, we can transform the previous series into a dataframe, filling undefined values with NaNs

series_hind.unstack()

The opposite operation is performed, unsurprisingly, by the `stack` method, transforming a dataframe into a multi-index series.
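A small round trip with a throwaway two-level series illustrates both methods:

```python
import pandas as pd

# Illustrative series with a two-level index
s = pd.Series([1, 2, 3, 4],
              index=[["a", "a", "b", "b"], ["x", "y", "x", "y"]])

df = s.unstack()     # multi-index series -> dataframe (columns 'x' and 'y')
back = df.stack()    # dataframe -> multi-index series again
print(list(back.values))   # [1, 2, 3, 4]
```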

The extension of hierarchical indexes to dataframes implies that both rows and columns can have multiple indexes. Let's build an example dataframe with hierarchical indexes in rows and columns.

frame_hind = pd.DataFrame(np.random.randint(0, 100, size=(6, 6)),
                          index=[["i", "i", "ii", "ii", "iii", "iii"],
                                 ["a", "b", "a", "b", "a", "b"]],
                          columns=[["CA", "CA", "CA", "HU", "HU", "HU"],
                                   ["S1", "S2", "S3", "S1", "S2", "S3"]])
frame_hind

You can name the indexes, helping to make your code more understandable

frame_hind.index.names = ["Key a", "Key b"]
frame_hind.columns.names = ["Prov", "Section"]
frame_hind

It is now easy to select a group of columns

frame_hind["HU"]

You can change the order of the levels in rows and columns, and also sort the data according to some criterion. If, for example, you want to exchange the level order in both rows and columns, you can use the `swaplevel` function, taking into account that it returns a new object

frame_hind_new = frame_hind.swaplevel("Prov", "Section", axis=1).swaplevel("Key a", "Key b")
frame_hind_new

The `sort_index` method sorts the data according to a single index level

frame_hind.sort_index(level=1)

You can combine the `swaplevel` and `sort_index` functions

frame_hind.swaplevel("Prov", "Section", axis = 1).sort_index(level=1, axis=1)

Many statistics functions have a `level` option that allows for specifying the level at which the manipulation is intended

print(frame_hind.sum(level="Key b"))
print(frame_hind.sum(axis=1, level="Prov"))
print(frame_hind.mean(axis=1, level="Section"))

You can easily create new DataFrames with hierarchical indexing using the `set_index` and `reset_index` functions. The first takes one or more columns as the new indexes of a new DataFrame, while the second does the inverse operation. Depending on the value of the `drop` argument, the columns will disappear or not.

test_ri = frame_hind_new.reset_index()
print(test_ri)
test_si = test_ri.set_index(["Key b", "Key a"], drop=False)
print(test_si)

Different pandas objects can be combined using the `merge` and `concat` functions.

The `merge` function connects rows in DataFrames based on one or more keys.

frame_1 = pd.DataFrame(np.random.randint(0, 4, size=(5, 4)), columns=["S1", "S2", "S3", "S4"])
frame_2 = pd.DataFrame(np.random.randint(0, 6, size=(6, 4)), columns=["K1", "K2", "K3", "K4"])
print(frame_1)
print(frame_2)
pd.merge(frame_1, frame_2, left_on="S3", right_on="K3")

By default an *inner* merge is performed and the intersection of the key sets is used. This can be changed using the *left*, *right*, or *outer* options

pd.merge(frame_1, frame_2, left_on = "S3", right_on = "K3", how="outer")

You can also use keys from various columns

pd.merge(frame_1, frame_2, left_on = ["S2","S3"], right_on = ["K1","K3"], how="inner")

When column names coincide in the joined dataframes, pandas by default adds a *_x* or *_y* suffix to them. This can be changed using the `suffixes` option of `merge`.
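A minimal sketch of the `suffixes` option, with two throwaway frames sharing the column name `val`:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "val": [10, 20]})
right = pd.DataFrame({"key": [1, 2], "val": [30, 40]})

# Choose explicit suffixes for the clashing 'val' columns
merged = pd.merge(left, right, on="key", suffixes=("_l", "_r"))
print(merged.columns.tolist())   # ['key', 'val_l', 'val_r']
```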

The `concat` function concatenates or stacks objects together along a given axis, in a way similar to the NumPy `concatenate` function. The labeled axes in pandas result in different options when binding two Series or DataFrames: you can choose to stack either the intersection or the union of the indexes, and you can decide either to discard the labels on the axes being bound or to keep the concatenated data structures identifiable after merging.

If we use as arguments Series having no common index, the `concat` function simply stacks them and creates a new Series

pd.concat([series_ex_I,series_ex_III,series_ex_VII])

By default `concat` acts on `axis=0`, creating a new Series; the option `axis=1` will turn the output into a DataFrame with each Series as a column, filling the void values with `NaN` (an `outer` join).

pd.concat([series_ex_I, series_ex_III, series_ex_VII], axis=1)

If there are common indexes and the operation is performed along `axis=0`, there will be repeated index values, while in the `axis=1` case the number of undefined `NaN` values will be reduced.

series_ex_VIII = pd.concat([series_ex_III, series_ex_VII]) + 1
print(pd.concat([series_ex_I, series_ex_III, series_ex_VIII]))
print(pd.concat([series_ex_I, series_ex_III, series_ex_VIII], axis=1))

In the `axis=1` case, you can perform an `inner` join using the argument `join="inner"`.

pd.concat([series_ex_III,series_ex_VIII], axis=1, join = "inner")

To keep track of the initial arguments you can use a hierarchical index in the concatenation axis with the `keys` argument

pd.concat([series_ex_I,series_ex_III,series_ex_VIII],axis =0, keys=["I", "III", "VIII"])

If `axis=1`, the keys will be used as column names.

pd.concat([series_ex_I,series_ex_III,series_ex_VIII],axis =1, keys=["I", "III", "VIII"])

You can also concatenate DataFrames. Notice the difference between specifying `axis=1` and using the default. If you specify a `keys` argument, it is used to define a hierarchical index along the concatenation axis.

print(pd.concat([dframe_1,dframe_4]))
#print(pd.concat([dframe_1,dframe_4], axis=1))
#print(pd.concat([dframe_1,dframe_4], keys=["DF_1", "DF_4"]))
print(pd.concat([dframe_1,dframe_4], axis=1, keys=["DF_1", "DF_4"]))

You can also specify the key values as the keys of a dictionary that replaces the list argument.

print(pd.concat({"DF_1": dframe_1,"DF_4": dframe_4}, axis=1))

In case the information on the index is not relevant you can use the `ignore_index=True` option

print(pd.concat([dframe_1,dframe_5], axis=0, ignore_index=True))
print(pd.concat([dframe_1,dframe_5], axis=1, ignore_index=True))

Apart from using Matplotlib methods directly, pandas has its own built-in methods to visualize data stored in Series and DataFrames. On top of this, we may use the `Seaborn` library, which modifies default `Matplotlib` settings and allows for a fast and convenient way to make complex plots and infer possible statistical relations among data sets. To install `Seaborn` you only need to run, in the environment where you intend to use the library

$ conda install seaborn

An interesting piece of advice: once a plot gets complex enough, sketch it by hand -or your idea of it- before you begin coding. That way you can judge how appropriate the plot is, and clearly mark the objects and relationships that you want it to convey.

Very basic plots can be depicted using the built-in `plot` method

dframe_AND.plot()

The default behavior is to create a line for each column and label it with the column name. The `plot` attribute has a series of methods to create different plot types; the default is `line`.
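As an illustration (the DataFrame here is synthetic and the names are our own), a few of the available methods:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; plots are drawn off-screen

demo = pd.DataFrame(np.random.rand(5, 2), columns=["A", "B"])

ax_line = demo.plot()                         # default: one line per column
ax_bar = demo.plot.bar()                      # bar plot, one bar group per row
ax_scatter = demo.plot.scatter(x="A", y="B")  # scatter of column A vs column B
```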

You can customize the plot with some or all of the following options.

**subplots:** Plot each DataFrame column in a different subplot.
**sharex:** If `subplots=True`, share the same x-axis, linking ticks and limits.
**sharey:** If `subplots=True`, share the same y-axis.
**figsize:** Size of the figure to create, as a (width, height) tuple.
**title:** Plot title as string.
**legend:** Add a subplot legend (`True` by default).
**sort_columns:** Plot columns in alphabetical order; by default the existing column order is used.

We can modify the previous plot using these options

dframe_AND.plot.area(subplots=True,figsize=(9,12),sharex=True)

You can also trivially plot histograms. For example

dframe_AND["H"].plot.hist(bins=60)

Also, one of the built-in pandas options is the representation of Kernel Density Estimate (KDE) plots, a kind of density plot that provides an estimate of the probability distribution behind the data.

dframe_AND["H"].plot.kde()

Instead of `kde` one can use the modifier `density`; both are equivalent.

You can also plot data using the `Seaborn` library. The first step is to select a particular `Seaborn` style, if you want to change the default one. Possible options for the `style` argument are: `darkgrid`, `whitegrid`, `dark`, `white`, and `ticks`.

sns.set(style="whitegrid")

We first build some figures using the datasets provided with Seaborn, which are pandas dataframes

tips = sns.load_dataset("tips")
tips

This is a pandas dataframe with data about the bills and tips left by different parties in a restaurant. The Seaborn command `relplot` allows for easy and direct creation of complex relational plots, like the following:

sns.relplot(
    data=tips,
    x="total_bill", y="tip",
    col="time",
    hue="smoker", style="smoker",
    size="size",
)

In this case the `total_bill` and `tip` columns are plotted as abscissa and ordinate, in different columns (`Lunch` and `Dinner`) defined by the `time` column values; color and marker are set according to the `smoker` column values, and the size of the marker is determined by the party size in column `size`.

A similarly concise syntax is used in the `Seaborn` `lmplot` function, which includes in the scatterplot the results and uncertainty of a linear regression fit to the provided data.

sns.lmplot(data=tips, x="total_bill", y="tip", col="time", hue="smoker")

Another interesting function is `displot`, which combines histograms and kernel density estimates to approximate the probability distribution of the variable under study

sns.displot(data=tips, x="total_bill", col="time", kde=True)

You can also depict in a single step the empirical cumulative distribution function of the data using the `kind="ecdf"` argument

sns.displot(data=tips, kind="ecdf", x="total_bill", col="time", hue="smoker", rug=True)

In the case of categorical data, appropriate representations can be achieved with the `catplot` command

sns.catplot(data=tips, kind="swarm", x="day", y="total_bill", hue="sex")

We will use the `statsmodels` library for the statistical treatment of datasets. You can get plenty of information about this library on the Statsmodels homepage.

The first step is loading the necessary libraries; this requires the previous installation of the `statsmodels` package in the environment using conda.

import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices

Then we use the `tips` DataFrame from the previous section to perform some fits. We will work with three of its columns, and we create a new DataFrame with these data.

vars = ['tip', 'total_bill', 'day']
df = tips[vars]
df

The next step is to create the design matrices for the statistical analysis: the endogenous or dependent matrix, `Y`, and the exogenous or independent matrix, `X`. We first explore the linear relationship between tip amounts and total bill amounts

Y, X = dmatrices("tip ~ total_bill", data = df, return_type="dataframe")

This creates two DataFrames

print(Y.head())
print(X.head())

And we can now describe the model and perform a fit

mod = sm.OLS(Y, X)  # Describe model
res = mod.fit()     # Fit model
print(res.summary())

Using the `predict` method, new predicted values can be computed from the fit results.

X_prime = np.linspace(X.total_bill.min(), X.total_bill.max(), 10)[:, np.newaxis]
X_prime = sm.add_constant(X_prime)

# Calculate the predicted values
Y_prime = res.predict(X_prime)

# Plot the raw data and the predicted line
plt.scatter(X["total_bill"], Y["tip"], alpha=0.3)
plt.plot(X_prime[:, 1], Y_prime, 'r', alpha=0.9)
plt.xlabel("Total bill amount")
plt.ylabel("Tip amount")

In this case a constant term is assumed in the linear relationship (the intercept). To fix it to zero and perform a single parameter fit the syntax is

Y, X = dmatrices("tip ~ total_bill - 1", data = df, return_type="dataframe")
print(Y.head())
print(X.head())
mod = sm.OLS(Y, X)  # Describe model
res = mod.fit()     # Fit model
print(res.summary())

We can also include categorical data in a convenient way in the fit

Y, X = dmatrices("tip ~ total_bill + day", data = df, return_type="dataframe")
print(Y.head())
print(X.head())
mod = sm.OLS(Y, X)  # Describe model
res = mod.fit()     # Fit model
print(res.summary())

And you can predict data in the categorical data case as

X_prime = np.zeros((10,3))
X_prime[:,0:2] = 1.0
X_prime[:,2] = np.linspace(X.total_bill.min(), X.total_bill.max(), 10)
print(X_prime)

# Calculate the predicted values
Y_prime = res.predict(X_prime)
print(Y_prime)

# Plot the raw data and the predicted line
plt.scatter(X["total_bill"], Y["tip"], alpha=0.3)
plt.xlabel("Total bill amount")
plt.ylabel("Tip amount")
plt.plot(X_prime[:, 2], Y_prime, 'r', alpha=0.9)

The most relevant items in the summary are the following:

**Fit parameters:** Can be accessed using `res.params`.
**R-squared:** The coefficient of determination, a statistical measure of how well the regression line approximates the real data points. A perfect fit gives an R-squared of one.
**Adj. R-squared:** The coefficient of determination adjusted for the number of observations and the residual degrees of freedom.
**P > |t|:** The p-value for the null hypothesis that the coefficient is zero. If it is less than the significance level, often 0.05, it indicates a statistically significant relationship between the term and the response.
**[95.0% Conf. Interval]:** The lower and upper values of the 95% confidence interval.

Working with time series is a complex subject and mastering it requires time and dedication. Time series are ubiquitous and can be found in Chemistry, Physics, Ecology, Economics and Finance, Medicine, and Social Sciences. Pandas provides tools to work with fixed frequency data as well as with irregular time series.

Native Python provides a way to deal with time data with the modules `datetime`, `time`, and `calendar`, and the `datetime` data type. For example

from datetime import datetime
now = datetime.now()
print(type(now))
print(now.year, now.month, now.day, now.hour, now.minute)
now

Time is stored down to the microsecond. The difference between two times is represented as a `timedelta` object

delta = now - datetime(1969,10,9)
delta

The `timedelta` function allows one to shift a given time by some amount

from datetime import timedelta
now - timedelta(days=10, hours=3, minutes=20)

You can format `datetime` objects using `str` or -for a given format- `strftime`.

print(str(now))
print(now.strftime("%d/%m/%Y %H.%M"))

The possible format specification codes for `strftime` are

**%Y:** Four-digit year
**%y:** Two-digit year
**%m:** Two-digit month
**%d:** Two-digit day
**%H:** Hour (24-hour clock)
**%I:** Hour (12-hour clock)
**%M:** Two-digit minute
**%S:** Second [00, 61] (seconds 60, 61 account for leap seconds)
**%w:** Weekday as integer [0 (Sunday), 6]
**%U:** Week number of the year [00, 53]; Sunday is considered the first day of the week, and days before the first Sunday of the year are "week 0"
**%W:** Week number of the year [00, 53]; Monday is considered the first day of the week, and days before the first Monday of the year are "week 0"
**%z:** UTC time zone offset as +HHMM or -HHMM; empty if time zone naive
**%F:** Shortcut for %Y-%m-%d (e.g., 2012-4-18)
**%D:** Shortcut for %m/%d/%y (e.g., 04/18/12)
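A few of these codes in action, using an arbitrary example date:

```python
from datetime import datetime

stamp = datetime(2012, 4, 18, 15, 30)

print(stamp.strftime("%Y-%m-%d"))  # 2012-04-18 (what %F abbreviates)
print(stamp.strftime("%m/%d/%y"))  # 04/18/12   (what %D abbreviates)
print(stamp.strftime("%I"))        # 03, the 12-hour clock hour
print(stamp.strftime("%w"))        # 3, a Wednesday (0 is Sunday)
```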

The same formats are used to convert strings to dates with the `strptime` function

datetime.strptime("2020 12 26 22h11m", "%Y %m %d %Hh%Mm")

To avoid the explicit format definition, the `parse` method from the `dateutil.parser` module, able to translate many different date string formats, can be used

from dateutil.parser import parse
parse("2020-12-10")
parse('Jan 01, 1957 11:45 PM')
parse('12/12/2011', dayfirst=True)

Pandas usually works with arrays of time data, either as an axis index or as columns. It provides the `to_datetime` method to parse many different date representations

pd.to_datetime(["2021-11-26", "2021-11-26 00:12","2021-11-26 22:33"])

It can handle gaps or missing elements in the time series, introducing `NaT` values

pd.to_datetime(["2021-11-26", "2021-11-26 00:12","2021-11-26 22:33:10"] + [None])

We will only give some hints on how to work with time series, using the previous example of Covid-19 data in Spain. We first read the data from the public server and create a new column named "time", transforming the date into a timestamp with `to_datetime`

data_CoViD = pd.read_csv('https://cnecovid.isciii.es/covid19/resources/datos_provincias.csv')
data_CoViD['time'] = pd.to_datetime(data_CoViD['fecha'])

We now select the data for the total number of cases in the eight Andalusian provinces, using the time data as index. We start with the province of Almería (`AL`)

# Create a DataFrame with the total number of cases
dframe_AND = pd.DataFrame({"AL": data_CoViD[data_CoViD["provincia_iso"]=="AL"]["num_casos"]})
dframe_AND = dframe_AND.set_index(data_CoViD[data_CoViD["provincia_iso"]=="AL"]["time"])

Andalousie = ["AL", "CA", "CO", "GR", "H", "J", "MA", "SE"]
for province in Andalousie[1:]:
    series_province = data_CoViD[data_CoViD["provincia_iso"]==province]["num_casos"]
    series_province.index = dframe_AND.index
    dframe_AND[province] = series_province

A useful feature is changing the time frequency, downsampling or upsampling the data. In this case we will downsample to weekly and monthly frequencies using the `resample` method, which first groups the data according to a given criterion and then calls an aggregation function.

dframe_AND_Monthly = dframe_AND.resample("M").sum()
dframe_AND_Weekly = dframe_AND.resample("W").sum()

In this case you are applying the same aggregation function to each column, but you can instead specify different functions for different columns using the `agg` method.

dframe_AND_Weekly_2 = dframe_AND.resample("W").agg({"AL":"mean", "CA":"mean", "CO":"max", "GR":"max", "H":"sum", "J":"sum", "MA":"std", "SE":"std"})

We can now depict the data at different frequencies and using different formats

dframe_AND.plot()
dframe_AND.plot.area(subplots=True, figsize=(9,12), sharex=True)
dframe_AND_Monthly.plot.bar(stacked=True)
dframe_AND_Weekly.plot.bar(stacked=True, figsize=(10,7))
dframe_AND_Weekly.plot.bar(subplots=True, figsize=(12,8), sharex=True)
dframe_AND_Monthly.plot.barh(subplots=True, figsize=(7,20), sharex=True)

Read the different files with Cyprus towns temperatures provided in the *TData* folder and build a dataframe combining all the information. The columns should be the year and months, and you can distinguish between data for the different towns by adding an extra column with the town name. Hint: the function `concat` can be very useful in this case.

A different way to combine the Cyprus towns temperature data provided in the *TData* folder is to build a dataframe whose index is a hierarchical one, with the year and month, and with as many columns as towns, labeled with the town names. Hint: the `concat` and `unstack` functions can be helpful in this case.

Compute the correlation matrix between the temperatures in the provided Cyprus cities from the previous exercise dataframe.

We provide as an example data set the file meteodat.csv, with an excerpt of data at 10-minute frequency from an automated meteorological station, spanning two months (Jan and Feb 2014). Data are comma separated and you can read them using `pd.read_csv`. The first column is the date and the second the time. Transform these data into a dataframe with a datetime index; compute dataframes with downsampling to hourly and daily average values for temperature (`Tout`), pressure (`Pressure`), relative humidity (`H out`), wind speed (`Wind Speed`), and dew point temperature (`Dew point`); sum of rainfall (`Rain`); and maximum and minimum values of temperature (`Tmax` and `Tmin`); and show the correlation between these variables at hourly and daily scales.