Pandas pd.read_excel(): A Comprehensive Tutorial on Reading Data from Excel Files

Pandas is a popular Python library for data analysis and manipulation. It can read and write various file formats, including Excel files. To read data from an Excel file, use the pd.read_excel() function, which returns a DataFrame object. A DataFrame is a two-dimensional, tabular data structure with labeled rows and columns. Reading .xlsx files also requires an Excel engine such as openpyxl to be installed alongside Pandas.

One of the challenges of reading data from Excel files is that they may contain merged cells, grouping features, or multiple header rows that do not map directly onto the DataFrame format. To deal with these issues, you can set parameters of pd.read_excel() and, where needed, clean up the resulting DataFrame afterwards. The most useful options are:

  • sheet_name: the name or index of the sheet to read from the Excel file. If not specified, the first sheet will be read by default.
  • header: the row number (0-indexed) or list of row numbers that contain the column names. If not specified, the first row (row 0) is used as the header. If the Excel file has multiple header rows, you can pass a list of row numbers, such as header=[0, 1, 2], to create a DataFrame whose columns are a MultiIndex, that is, hierarchical levels of column labels.
  • usecols: the columns or range of columns to read from the Excel file. You can pass a string, a list of strings, or a list of integers to select the columns by name or by position. For example, usecols='A,C' will read columns A and C, usecols='A:C' will read columns A to C, and usecols='A:C,F,G:J' will read columns A to C, F, and G to J.
  • skiprows: the number or list of numbers of rows to skip before reading the data. This can be useful if the Excel file has some irrelevant rows at the beginning that you want to ignore.
  • fillna: not a pd.read_excel() parameter but a DataFrame method, called on the result after reading. It is useful when empty cells or merged data cells show up as NaN values in the DataFrame. You can pass a scalar value or a dictionary to df.fillna(), or call df.ffill() (forward fill) or df.bfill() (backward fill) to propagate neighbouring values; the older df.fillna(method=‘ffill’) spelling is deprecated in recent Pandas versions.
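The options above can be sketched on a small workbook. This is a minimal example rather than a recipe for any particular file: the sheet name "sales", the column names, and the values are all hypothetical, and the workbook is built in memory (which assumes the openpyxl engine is installed) so that no file on disk is needed.

```python
import io

import pandas as pd

# Hypothetical sheet: one title row, then the real header row, then data
raw = pd.DataFrame(
    [
        ["Quarterly report", None, None, None],  # title row to skip
        ["region", "units", "price", "note"],    # real header row
        ["north", 10, 2.5, "ok"],
        ["south", 7, 3.0, "check"],
    ]
)

# Write the workbook into an in-memory buffer instead of a file
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    raw.to_excel(writer, sheet_name="sales", header=False, index=False)
buffer.seek(0)

df = pd.read_excel(
    buffer,
    sheet_name="sales",  # read this sheet rather than the first one by default
    skiprows=1,          # drop the title row
    header=0,            # the first remaining row holds the column names
    usecols="A:C",       # keep Excel columns A to C, dropping "note"
)
print(df)
```

Note that header is counted after skiprows has been applied, so header=0 here refers to the second row of the sheet.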

Here is an example of how to read data from an Excel file with grouped/collapsible columns using Pandas:

Python

# Import pandas library
import pandas as pd

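# Read data from Excel file; rows 0-2 hold the three header levels.
# usecols='A:K' covers the 11 data columns shown in the output below.
df = pd.read_excel('grouped_data.xlsx', header=[0, 1, 2], usecols='A:K')

# Fill empty data cells after reading: fillna/ffill are DataFrame methods,
# not pd.read_excel() parameters
df = df.ffill()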

# Print the DataFrame
print(df)

The output of this code is:

  test1             test2          
     s1    s2    s3    s1    s2    s3
     c1 c2 c3 c4 c1 c2 c3 c4 c1 c2 c3
0    23  7 78 12 32 12  5 74 13  1  4
1   456  5 41 22 31  1 13  8 13  2 23

As you can see, the DataFrame columns form a MultiIndex with three levels of labels, and any empty data cells are filled by propagating the last valid observation forward.
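Forward filling is easiest to see on a data column that came from merged cells. The sketch below does not read a file: it builds a hypothetical frame by hand that mimics what a sheet typically looks like after reading, where only the first row of each merged block carries a value. The column names "group" and "value" are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "group" was a merged cell spanning two rows,
# so only the first row of each block has a value
df = pd.DataFrame(
    {
        "group": ["a", np.nan, "b", np.nan],
        "value": [1, 2, 3, 4],
    }
)

# Propagate the last valid value downward, in the "group" column only
df["group"] = df["group"].ffill()
print(df)
```

Filling one column at a time keeps genuinely missing measurements in other columns from being overwritten by accident.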

To access the data in the DataFrame, you can use various methods, such as:

  • df['test1']: select the columns under the ‘test1’ level
  • df['test1']['s1']: select the columns under the ‘test1’ and ‘s1’ levels
  • df['test1']['s1']['c1']: select the column under the ‘test1’, ‘s1’, and ‘c1’ levels
  • df.loc[0]: select the first row of the DataFrame
  • df.loc[0,'test1']: select the first row and the columns under the ‘test1’ level
  • df.loc[0,'test1']['s1']: select the first row and the columns under the ‘test1’ and ‘s1’ levels
  • df.loc[0,'test1']['s1']['c1']: select the first row and the column under the ‘test1’, ‘s1’, and ‘c1’ levels
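Chained lookups like the ones above work for reading, but a tuple key does the same selection in a single step. The sketch below uses a hand-built frame with the same three header levels; it is a smaller, hypothetical stand-in for the table shown earlier, not the actual file contents.

```python
import pandas as pd

# Hypothetical two-row frame with three levels of column labels
columns = pd.MultiIndex.from_tuples(
    [
        ("test1", "s1", "c1"),
        ("test1", "s1", "c2"),
        ("test1", "s2", "c3"),
        ("test2", "s1", "c1"),
    ]
)
df = pd.DataFrame([[23, 7, 78, 32], [456, 5, 41, 31]], columns=columns)

# A full tuple replaces three chained lookups
col = df[("test1", "s1", "c1")]

# A partial tuple selects every column under test1 -> s1
block = df[("test1", "s1")]

# .loc takes the row label first, then the column tuple
cell = df.loc[0, ("test1", "s1", "c1")]

# xs selects a label on a lower level across all top-level groups
all_s1 = df.xs("s1", axis=1, level=1)

print(col.tolist(), block.columns.tolist(), cell, all_s1.shape)
```

The tuple form is also the safer choice when assigning values, since chained indexing may operate on a temporary copy.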

To perform calculations on the DataFrame, you can use various methods, such as:

  • df.sum(): calculate the sum of each column
  • df.sum(axis=1): calculate the sum of each row
  • df.mean(): calculate the mean of each column
  • df.mean(axis=1): calculate the mean of each row
  • df['test1'].sum(): calculate the sum of the columns under the ‘test1’ level
  • df['test1']['s1'].sum(): calculate the sum of the columns under the ‘test1’ and ‘s1’ levels
  • df['test1']['s1']['c1'].sum(): calculate the sum of the column under the ‘test1’, ‘s1’, and ‘c1’ levels
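As a rough illustration of these calculations, the sketch below sums a small hand-built frame by column, by row within one group, and by row for every top-level group at once. The data is hypothetical, and the transpose-and-group approach is one possible way to aggregate by header level, not the only one.

```python
import pandas as pd

# Hypothetical two-row frame with three levels of column labels
columns = pd.MultiIndex.from_tuples(
    [("test1", "s1", "c1"), ("test1", "s1", "c2"), ("test2", "s1", "c1")]
)
df = pd.DataFrame([[23, 7, 32], [456, 5, 31]], columns=columns)

# Column sums: the result is indexed by the three-level column labels
col_sums = df.sum()

# Row sums restricted to the columns under test1
test1_row_sums = df["test1"].sum(axis=1)

# Row sums for every top-level group: transpose, group by level 0, transpose back
group_row_sums = df.T.groupby(level=0).sum().T

print(col_sums)
print(test1_row_sums)
print(group_row_sums)
```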

Here is an example of how to calculate the sum of each row and add it as a new column to the DataFrame:

Python

# Import pandas library
import pandas as pd

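# Read data from Excel file; rows 0-2 hold the three header levels.
# usecols='A:K' covers the 11 data columns shown in the output below.
df = pd.read_excel('grouped_data.xlsx', header=[0, 1, 2], usecols='A:K')

# Fill empty data cells after reading: fillna/ffill are DataFrame methods,
# not pd.read_excel() parameters
df = df.ffill()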

# Calculate the sum of each row
df['total'] = df.sum(axis=1)

# Print the DataFrame
print(df)

The output of this code is:

  test1             test2           total
     s1    s2    s3    s1    s2    s3     
     c1 c2 c3 c4 c1 c2 c3 c4 c1 c2 c3     
0    23  7 78 12 32 12  5 74 13  1  4   261
1   456  5 41 22 31  1 13  8 13  2 23   615

As you can see, the DataFrame has a new column called ‘total’ that contains the sum of each row.
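One detail worth knowing is how the new column is stored when the columns are a MultiIndex. The sketch below, on a small hypothetical frame, shows the padded key that Pandas creates and one way to keep the total from being counted twice if the calculation is run again.

```python
import pandas as pd

# Hypothetical two-row frame with three levels of column labels
columns = pd.MultiIndex.from_tuples(
    [("test1", "s1", "c1"), ("test1", "s1", "c2"), ("test2", "s1", "c1")]
)
df = pd.DataFrame([[23, 7, 32], [456, 5, 31]], columns=columns)

# Sum only the data columns, so rerunning this never includes an existing total
data_cols = [c for c in df.columns if c[0] != "total"]
df["total"] = df[data_cols].sum(axis=1)

# The single label is padded with empty strings to match the three levels
total_key = df.columns[-1]
print(total_key)
print(df[total_key].tolist())
```

Filtering on the top-level label is a design choice for this sketch; any rule that separates data columns from derived ones would serve the same purpose.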
