Pandas Tutorial
Pandas is a powerful Python library for data manipulation and analysis. It provides the data structures and functions needed to work with structured data. The two most common data structures in Pandas are Series and DataFrame.
Installing Pandas:
To install Pandas, use pip:
pip install pandas
Introduction to Series:
A Pandas Series is a one-dimensional labeled array capable of holding any data type:
import pandas as pd
# Creating a Series from a list
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
# Output:
# 0 1
# 1 2
# 2 3
# 3 4
# 4 5
# dtype: int64
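Because a Series is labeled, the default integer index can be replaced with custom labels. The following is a minimal sketch; the labels "a", "b", and "c" and the variable names are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical labels chosen for illustration
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Look up a single value by its label
value_b = labeled["b"]

# Select several labels at once; the result is another Series
subset = labeled[["a", "c"]]

print(value_b)
print(subset)
```

Label-based lookup keeps working even if the rows are later reordered, which is the main reason to prefer it over positional access.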
Introduction to DataFrame:
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types, similar to a spreadsheet or SQL table:
import pandas as pd
# Creating a DataFrame from a dictionary
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"City": ["New York", "Los Angeles", "Chicago"]
}
df = pd.DataFrame(data)
print(df)
# Output:
# Name Age City
# 0 Alice 25 New York
# 1 Bob 30 Los Angeles
# 2 Charlie 35 Chicago
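To see that each column really does carry its own type, the DataFrame can be inspected with dtypes and shape. This is a small sketch that rebuilds the same data; the variable name ages is an assumption made for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Each column keeps its own data type (text columns vs. integer column)
print(df.dtypes)

# Selecting one column returns a Series
ages = df["Age"]

# shape is (number of rows, number of columns)
print(df.shape)
```

The exact dtype reported for the text columns can differ between Pandas versions, so no output is shown here.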
Reading and Writing Data:
Pandas provides functions to read from and write to various data formats such as CSV, Excel, SQL, and JSON:
import pandas as pd
# Reading data from a CSV file
df = pd.read_csv("data.csv")
# Writing data to a CSV file
df.to_csv("output.csv", index=False)
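The read and write functions also accept file-like objects, which makes it possible to try a round trip without a file on disk. The sketch below assumes a small made-up DataFrame and uses an in-memory io.StringIO buffer in place of "data.csv":

```python
import io

import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Write CSV text into an in-memory buffer instead of a file on disk.
# index=False keeps the row index from being written as an extra column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the same text back
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(round_trip)
```

Without index=False, reading the file back would typically produce an extra unnamed column holding the old index.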
DataFrame Operations:
You can perform various operations on DataFrames, such as filtering, grouping, and aggregating data:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, 30, 35, 40],
"City": ["New York", "Los Angeles", "Chicago", "Houston"],
"Salary": [70000, 80000, 120000, 110000]
}
df = pd.DataFrame(data)
# Filtering data
filtered_df = df[df["Age"] > 30]
# Grouping data: mean salary per city (the result is a Series)
grouped_df = df.groupby("City")["Salary"].mean()
print(filtered_df)
# Output:
# Name Age City Salary
# 2 Charlie 35 Chicago 120000
# 3 David 40 Houston 110000
print(grouped_df)
# Output:
# City
# Chicago 120000.0
# Houston 110000.0
# Los Angeles 80000.0
# New York 70000.0
# Name: Salary, dtype: float64
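Filters can be combined, and a column can be summarized with several aggregations in one call. This is a sketch on the same data; the thresholds and the variable names are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    "Salary": [70000, 80000, 120000, 110000],
})

# Combine two conditions with & (each condition needs its own parentheses)
senior_high_earners = df[(df["Age"] > 30) & (df["Salary"] > 115000)]

# Compute several summary statistics for one column in a single call
salary_summary = df["Salary"].agg(["min", "max", "mean"])

print(senior_high_earners)
print(salary_summary)
```

The parentheses matter because & binds more tightly than the comparison operators in Python.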
Handling Missing Data:
Pandas provides methods to handle missing data by filling or dropping missing values:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, None, 35, 40],
"City": ["New York", "Los Angeles", "Chicago", None],
"Salary": [70000, 80000, None, 110000]
}
df = pd.DataFrame(data)
# Dropping rows with missing values
df_dropped = df.dropna()
# Filling missing values with a default chosen per column
df_filled = df.fillna({"Age": 30, "Salary": 50000, "City": "Unknown"})
print(df_dropped)
# Output:
# Name Age City Salary
# 0 Alice 25.0 New York 70000.0
print(df_filled)
# Output:
# Name Age City Salary
# 0 Alice 25.0 New York 70000.0
# 1 Bob 30.0 Los Angeles 80000.0
# 2 Charlie 35.0 Chicago 50000.0
# 3 David 40.0 Unknown 110000.0
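Before dropping or filling, it often helps to count how many values are missing in each column. The sketch below also fills the missing age with the column mean, which is one possible strategy and not a recommendation from the original text:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, None, 35, 40],
    "Salary": [70000, 80000, None, 110000],
})

# Count missing values in each column
missing_counts = df.isna().sum()

# Fill the missing age with the mean of the known ages (an illustrative choice)
age_filled = df["Age"].fillna(df["Age"].mean())

print(missing_counts)
print(age_filled)
```

Like dropna and fillna in the example above, these calls return new objects and leave df unchanged.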
Applying Functions:
The apply method runs a function on each element of a Series, or on each row or column of a DataFrame:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "David"],
"Salary": [70000, 80000, 120000, 110000]
}
df = pd.DataFrame(data)
# Apply a function to each salary
df["Salary"] = df["Salary"].apply(lambda x: x * 1.1)
print(df)
# Output:
# Name Salary
# 0 Alice 77000.0
# 1 Bob 88000.0
# 2 Charlie 132000.0
# 3 David 121000.0
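For simple arithmetic, the same result can usually be obtained by operating on the whole column at once, and apply with axis=1 can combine several columns per row. The following sketch compares the approaches; the variable names and the label format are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Salary": [70000, 80000, 120000, 110000],
})

# Element-wise apply, as in the example above
raised_apply = df["Salary"].apply(lambda x: x * 1.1)

# Vectorized arithmetic on the whole column
raised_vectorized = df["Salary"] * 1.1

# Row-wise apply (axis=1) can combine several columns into one value
labels = df.apply(lambda row: f"{row['Name']}: {row['Salary']}", axis=1)

print(raised_vectorized)
print(labels)
```

The vectorized form is generally faster on large columns, so apply is best reserved for logic that cannot be expressed with column arithmetic.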
Merging and Joining DataFrames:
The pd.merge function combines two DataFrames on a shared key column. With how="inner", only keys present in both DataFrames are kept:
import pandas as pd
data_1 = {
"Name": ["Alice", "Bob", "Charlie"],
"City": ["New York", "Los Angeles", "Chicago"]
}
data_2 = {
"Name": ["Alice", "Bob", "Eve"],
"Age": [25, 30, 28]
}
df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
# Merging DataFrames
merged_df = pd.merge(df1, df2, on="Name", how="inner")
print(merged_df)
# Output:
# Name City Age
# 0 Alice New York 25
# 1 Bob Los Angeles 30
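Other values of how keep rows that have no match on one side. This sketch reuses the same two DataFrames to show a left merge and an outer merge; the variable names left_df and outer_df are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "City": ["New York", "Los Angeles", "Chicago"],
})
df2 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Eve"],
    "Age": [25, 30, 28],
})

# Left merge: keep every row of df1, with NaN where df2 has no match
left_df = pd.merge(df1, df2, on="Name", how="left")

# Outer merge: keep rows from both sides.
# indicator=True adds a _merge column saying where each row came from.
outer_df = pd.merge(df1, df2, on="Name", how="outer", indicator=True)

print(left_df)
print(outer_df)
```

Here Charlie has no age and Eve has no city, so those cells are filled with NaN in the merged result.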
Pivot Tables:
A pivot table summarizes a DataFrame by grouping rows on one or more keys and aggregating a value column:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
"City": ["New York", "Los Angeles", "Chicago", "Houston", "Chicago"],
"Sales": [200, 300, 250, 400, 350]
}
df = pd.DataFrame(data)
# Creating a pivot table
pivot_table = df.pivot_table(values="Sales", index="City", aggfunc="sum")
print(pivot_table)
# Output:
# Sales
# City
# Chicago 600
# Houston 400
# Los Angeles 300
# New York 200
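A pivot table with a single index and a single aggregation gives the same totals as a groupby, and aggfunc also accepts a list of functions. The sketch below illustrates both on the same data; the names grouped and summary are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Chicago"],
    "Sales": [200, 300, 250, 400, 350],
})

pivot_table = df.pivot_table(values="Sales", index="City", aggfunc="sum")

# The same totals computed with groupby; this returns a Series
grouped = df.groupby("City")["Sales"].sum()

# Several aggregations at once: one output column per function
summary = df.pivot_table(values="Sales", index="City", aggfunc=["sum", "mean"])

print(grouped)
print(summary)
```

With a list of functions, the resulting columns form a two-level index, so a single column is selected with a tuple such as ("mean", "Sales").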
Pandas is an essential tool for data analysis and manipulation. It provides the flexibility and power needed to handle a wide variety of data tasks.
Important Links
Here are some useful links for further reading: