Pandas is a popular Python library for data analysis and manipulation. It can read and write various file formats, including Excel files. To read data from an Excel file with Pandas, use the pd.read_excel() function, which returns a DataFrame object. A DataFrame is a two-dimensional, tabular data structure with labeled rows and columns.
One of the challenges of reading data from Excel files is that they may contain merged cells, grouped columns, or multiple header rows that do not map directly onto the DataFrame format. To deal with these issues, you can pass parameters to the pd.read_excel() function, such as:
sheet_name: the name or zero-based index of the sheet to read from the Excel file. If not specified, the first sheet is read by default.
header: the zero-based row number, or list of row numbers, that contain the column names. If not specified, the first row is used as the header by default. If the Excel file has multiple header rows, you can pass a list of row numbers to create a DataFrame with MultiIndex columns, which have hierarchical levels of column labels.
usecols: the columns or range of columns to read from the Excel file. A string is interpreted as Excel column letters and ranges: usecols='A,C' will read columns A and C, usecols='A:C' will read columns A to C, and usecols='A:C,F,G:J' will read columns A to C, F, and G to J. A list of integers selects columns by position, and a list of strings selects them by name.
skiprows: the number of rows to skip at the start of the sheet, or a list of zero-based row numbers to skip. This can be useful if the Excel file has some irrelevant rows at the beginning that you want to ignore.
Missing values are handled after reading, not by pd.read_excel() itself: fillna() is a method of the DataFrame, not a parameter of the reader. It is useful if the Excel file has empty cells or merged cells that result in NaN values in the DataFrame. You can pass a scalar value or a dictionary to fillna(), or call ffill() (forward fill) or bfill() (backward fill) to propagate neighboring values. Recent pandas versions deprecate fillna(method=...) in favor of these two methods.
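As a rough sketch of how these options fit together, the helper below passes the reader parameters to pd.read_excel() and applies the fill step afterwards. The function name load_sheet, the column range 'A:K', and the skiprows value are placeholders chosen for illustration, not values taken from a real workbook. The helper is only defined here, not called; the last lines demonstrate the fill step on a small in-memory frame.

```python
import pandas as pd


def load_sheet(path):
    """Read one sheet with a three-row header, then fill gaps in the data.

    The column range and skipped-row count below are placeholders.
    """
    df = pd.read_excel(
        path,
        sheet_name=0,      # first sheet (this is also the default)
        header=[0, 1, 2],  # three header rows -> MultiIndex columns
        usecols="A:K",     # Excel-style column range (placeholder)
        skiprows=0,        # rows to skip before the header (placeholder)
    )
    # Filling is a DataFrame operation, applied after the file is read.
    return df.ffill()


# The same fill step on a small in-memory frame with gaps in one column.
demo = pd.DataFrame({"group": ["a", None, "b", None], "value": [1, 2, 3, 4]})
filled = demo.ffill()
print(filled)
```

Forward fill suits the typical merged-cell layout, where only the first row of a merged block carries the value.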
Here is an example of how to read data from an Excel file with grouped/collapsible columns using Pandas:
# Import pandas library
import pandas as pd
# Read data from Excel file
df = pd.read_excel('grouped_data.xlsx', header=[0, 1, 2])
# Forward-fill missing values left by empty or merged cells
df = df.ffill()
# Print the DataFrame
print(df)
The output of this code is:
test1 test2
s1 s2 s3 s1 s2 s3
c1 c2 c3 c4 c1 c2 c3 c4 c1 c2 c3
0 23 7 78 12 32 12 5 74 13 1 4
1 456 5 41 22 31 1 13 8 13 2 23
As you can see, the DataFrame has a MultiIndex with three levels of column labels, and the missing values are filled by propagating the last valid observation forward.
To access the data in the DataFrame, you can use various methods, such as:
df['test1']: select the columns under the ‘test1’ level
df['test1']['s1']: select the columns under the ‘test1’ and ‘s1’ levels
df['test1']['s1']['c1']: select the column under the ‘test1’, ‘s1’, and ‘c1’ levels
df.loc[0]: select the first row of the DataFrame
df.loc[0,'test1']: select the first row and the columns under the ‘test1’ level
df.loc[0,'test1']['s1']: select the first row and the columns under the ‘test1’ and ‘s1’ levels
df.loc[0,'test1']['s1']['c1']: select the first row and the column under the ‘test1’, ‘s1’, and ‘c1’ levels
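The chained lookups above can also be written as a single lookup with a tuple key. The sketch below builds a small stand-in frame in memory, so it runs without an Excel file. Its column layout and values are hypothetical and only mimic the three-level header.

```python
import pandas as pd

# Hypothetical stand-in for the frame read from Excel: three header levels.
columns = pd.MultiIndex.from_tuples(
    [
        ("test1", "s1", "c1"),
        ("test1", "s1", "c2"),
        ("test1", "s2", "c1"),
        ("test2", "s1", "c1"),
    ]
)
df = pd.DataFrame([[23, 7, 78, 12], [456, 5, 41, 22]], columns=columns)

# One lookup with a full tuple instead of three chained lookups.
col = df[("test1", "s1", "c1")]          # a Series
cell = df.loc[0, ("test1", "s1", "c1")]  # a single value

# A partial key returns every column beneath it.
block = df["test1"]

# xs selects by a label on an inner level, across all top-level groups.
all_s1 = df.xs("s1", axis=1, level=1)

print(col.tolist(), cell)
print(list(block.columns))
print(list(all_s1.columns))
```

A single tuple key avoids creating intermediate objects, and it is the safer form when assigning values, since assignment through chained indexing may act on a copy.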
To perform calculations on the DataFrame, you can use various methods, such as:
df.sum(): calculate the sum of each column
df.sum(axis=1): calculate the sum of each row
df.mean(): calculate the mean of each column
df.mean(axis=1): calculate the mean of each row
df['test1'].sum(): calculate the sum of each column under the ‘test1’ level
df['test1']['s1'].sum(): calculate the sum of each column under the ‘test1’ and ‘s1’ levels
df['test1']['s1']['c1'].sum(): calculate the sum of the column under the ‘test1’, ‘s1’, and ‘c1’ levels
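To illustrate the calculations listed above, here is a minimal sketch on a hypothetical in-memory frame with the same kind of three-level header. The last step, which computes row totals for every top-level group at once, is one possible approach (transpose, group by the first level, transpose back) and is an addition beyond the list above.

```python
import pandas as pd

# Hypothetical stand-in frame with three header levels.
columns = pd.MultiIndex.from_tuples(
    [
        ("test1", "s1", "c1"),
        ("test1", "s1", "c2"),
        ("test1", "s2", "c1"),
        ("test2", "s1", "c1"),
    ]
)
df = pd.DataFrame([[1, 2, 3, 4], [10, 20, 30, 40]], columns=columns)

col_sums = df.sum()                       # one value per column
row_sums = df.sum(axis=1)                 # one value per row
test1_row_sums = df["test1"].sum(axis=1)  # row totals within 'test1' only

# Row totals for each top-level group: transpose so the column levels
# become the row index, group on level 0, then transpose back.
group_totals = df.T.groupby(level=0).sum().T

print(group_totals)
```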
Here is an example of how to calculate the sum of each row and add it as a new column to the DataFrame:
# Import pandas library
import pandas as pd
# Read data from Excel file
df = pd.read_excel('grouped_data.xlsx', header=[0, 1, 2])
# Forward-fill missing values left by empty or merged cells
df = df.ffill()
# Calculate the sum of each row
df['total'] = df.sum(axis=1)
# Print the DataFrame
print(df)
The output of this code is:
test1 test2 total
s1 s2 s3 s1 s2 s3
c1 c2 c3 c4 c1 c2 c3 c4 c1 c2 c3
0 23 7 78 12 32 12 5 74 13 1 4 261
1 456 5 41 22 31 1 13 8 13 2 23 615
As you can see, the DataFrame has a new column called ‘total’ that contains the sum of each row.
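One detail worth knowing: when the columns are a MultiIndex, pandas stores a column added under a plain string key with the remaining levels padded by empty strings. The sketch below shows this on a hypothetical in-memory frame. Keeping a list of the original columns (here called data_cols, a name chosen for illustration) makes it possible to sum only the data columns later, without including the total itself.

```python
import pandas as pd

# Hypothetical stand-in frame with three header levels.
columns = pd.MultiIndex.from_tuples(
    [("test1", "s1", "c1"), ("test1", "s2", "c1"), ("test2", "s1", "c1")]
)
df = pd.DataFrame([[1, 2, 3], [10, 20, 30]], columns=columns)

# Remember the original data columns before adding anything.
data_cols = df.columns.tolist()

df["total"] = df[data_cols].sum(axis=1)

# The short key is padded to the full depth of the column index.
print(df.columns.tolist()[-1])

# The full tuple key retrieves the new column as a Series.
totals = df[("total", "", "")]
print(totals.tolist())
```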