Pandas Tutorial

Pandas is a Python library for data manipulation and analysis. It provides the data structures and functions needed to work with structured (tabular) data. Its two core data structures are Series and DataFrame.

Installing Pandas:

To install Pandas, you can use pip:

pip install pandas
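
A quick way to confirm the installation worked is to import the library and print its version. This is only a sanity check; the version string you see depends on what pip installed.

```python
import pandas as pd

# The installed version is exposed as a plain string
installed_version = pd.__version__
print(installed_version)
```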

Introduction to Series:

A Pandas Series is a one-dimensional labeled array capable of holding any data type:

Python
import pandas as pd

# Creating a Series from a list
data = [1, 2, 3, 4, 5]
series = pd.Series(data)

print(series)
# Output:
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# dtype: int64
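
The "labeled" part of that definition is easier to see with a custom index. The following is a minimal sketch; the labels and values are made up for illustration and are not part of the example above.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
temperatures = pd.Series([21.5, 19.0, 23.2], index=["mon", "tue", "wed"])

# Label-based access with .loc, position-based access with .iloc
by_label = temperatures.loc["tue"]
by_position = temperatures.iloc[0]

# Arithmetic is applied element by element and keeps the labels
in_fahrenheit = temperatures * 9 / 5 + 32

print(by_label)                   # 19.0
print(by_position)                # 21.5
print(list(in_fahrenheit.index))  # ['mon', 'tue', 'wed']
```

Using .loc and .iloc explicitly avoids ambiguity between labels and positions, which matters once the index is no longer the default 0, 1, 2, and so on.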

Introduction to DataFrame:

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types, similar to a spreadsheet or SQL table:

Python
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}

df = pd.DataFrame(data)

print(df)
# Output:
#       Name  Age         City
# 0    Alice   25     New York
# 1      Bob   30  Los Angeles
# 2  Charlie   35      Chicago
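
Once a DataFrame exists, individual columns and cells can be pulled out of it. The sketch below reuses the same data and shows a few common selection patterns; it is one way to do it, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Each column is a Series with its own data type
ages = df["Age"]

# .loc selects by label: row label 1, column "City"
city_of_bob = df.loc[1, "City"]

# .iloc selects by integer position: first row, first column
first_name = df.iloc[0, 0]

print(df.shape)      # (3, 3)
print(city_of_bob)   # Los Angeles
print(first_name)    # Alice
```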

Reading and Writing Data:

Pandas provides functions to read from and write to various data formats such as CSV, Excel, SQL, and JSON:

Python
import pandas as pd

# Reading data from a CSV file (assumes "data.csv" exists in the working directory)
df = pd.read_csv("data.csv")

# Writing data to a CSV file; index=False leaves out the row index column
df.to_csv("output.csv", index=False)
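
To try the round trip without touching the file system, the CSV text can be kept in memory. This is a small sketch using the standard library's io.StringIO; the two-row DataFrame is made up for the example.

```python
import io
import pandas as pd

# Hypothetical data for the round trip
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# With no path argument, to_csv returns the CSV text as a string
csv_text = df.to_csv(index=False)

# read_csv accepts any file-like object, so wrap the text in StringIO
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text.splitlines()[0])    # Name,Age
print(restored["Name"].tolist())   # ['Alice', 'Bob']
print(restored["Age"].tolist())    # [25, 30]
```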

DataFrame Operations:

You can perform various operations on DataFrames, such as filtering, grouping, and aggregating data:

Python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    "Salary": [70000, 80000, 120000, 110000]
}

df = pd.DataFrame(data)

# Filtering data
filtered_df = df[df["Age"] > 30]

# Grouping data
grouped_df = df.groupby("City")["Salary"].mean()

print(filtered_df)
# Output:
#       Name  Age     City  Salary
# 2  Charlie   35  Chicago  120000
# 3    David   40  Houston  110000

print(grouped_df)
# Output:
# City
# Chicago        120000.0
# Houston        110000.0
# Los Angeles     80000.0
# New York        70000.0
# Name: Salary, dtype: float64
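
In the example above every city appears once, so each group holds a single row. The sketch below adds a hypothetical "Team" column with repeated values to show grouping with several rows per group, along with a filter that combines two conditions.

```python
import pandas as pd

# Hypothetical data: "Team" repeats so that each group has more than one row
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Team": ["A", "B", "A", "B"],
    "Age": [25, 30, 35, 40],
    "Salary": [70000, 80000, 120000, 110000],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
senior_high_earners = df[(df["Age"] > 30) & (df["Salary"] > 115000)]

# Named aggregation: one output column per (input column, function) pair
summary = df.groupby("Team").agg(
    mean_salary=("Salary", "mean"),
    max_age=("Age", "max"),
)

print(senior_high_earners["Name"].tolist())  # ['Charlie']
print(summary.loc["A", "mean_salary"])       # 95000.0
print(summary.loc["B", "max_age"])           # 40
```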

Handling Missing Data:

Pandas provides methods to handle missing data by filling or dropping missing values:

Python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, None, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", None],
    "Salary": [70000, 80000, None, 110000]
}

df = pd.DataFrame(data)

# Dropping rows with missing values
df_dropped = df.dropna()

# Filling missing values
df_filled = df.fillna({"Age": 30, "Salary": 50000, "City": "Unknown"})

print(df_dropped)
# Output:
#     Name   Age      City   Salary
# 0  Alice  25.0  New York  70000.0

print(df_filled)
# Output:
#       Name   Age         City    Salary
# 0    Alice  25.0     New York   70000.0
# 1      Bob  30.0  Los Angeles   80000.0
# 2  Charlie  35.0      Chicago   50000.0
# 3    David  40.0      Unknown  110000.0
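
Before dropping or filling, it often helps to see how much is missing and where. The following sketch, using the same data, counts missing values per column, drops rows based on a single column, and fills a numeric column with its mean; which strategy is appropriate depends on the data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, None, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", None],
    "Salary": [70000, 80000, None, 110000],
})

# Count missing values per column
missing_per_column = df.isna().sum()

# Drop a row only when specific columns are missing
has_salary = df.dropna(subset=["Salary"])

# Fill a numeric column with its own mean (missing values are skipped in the mean)
age_filled = df["Age"].fillna(df["Age"].mean())

print(int(missing_per_column["Age"]))   # 1
print(has_salary["Name"].tolist())      # ['Alice', 'Bob', 'David']
print(round(age_filled.iloc[1], 2))     # 33.33
```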

Applying Functions:

The apply method calls a function on each element of a Series (or on each row or column of a DataFrame):

Python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Salary": [70000, 80000, 120000, 110000]
}

df = pd.DataFrame(data)

# Apply a function to each salary
df["Salary"] = df["Salary"].apply(lambda x: x * 1.1)

print(df)
# Output:
#       Name    Salary
# 0    Alice   77000.0
# 1      Bob   88000.0
# 2  Charlie  132000.0
# 3    David  121000.0
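
For simple arithmetic like the raise above, plain vectorized operations give the same result and are usually faster than apply. The sketch below shows that alternative, plus apply with a named function and with axis=1 for row-wise logic. The salary_band function and its threshold are hypothetical, introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Salary": [70000, 80000, 120000, 110000],
})

# Vectorized arithmetic: no apply needed for simple math
df["Raised"] = df["Salary"] * 1.1

def salary_band(salary):
    # Hypothetical threshold, chosen only for this example
    return "high" if salary >= 100000 else "standard"

# apply with a named function, useful when the logic has branches
df["Band"] = df["Salary"].apply(salary_band)

# axis=1 passes each row to the function as a Series
df["Label"] = df.apply(lambda row: f"{row['Name']} ({row['Band']})", axis=1)

print(df["Band"].tolist())   # ['standard', 'standard', 'high', 'high']
print(df["Label"].iloc[0])   # Alice (standard)
```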

Merging and Joining DataFrames:

Pandas can combine DataFrames on shared key columns with merge, much like a SQL join. An inner merge keeps only the keys that appear in both DataFrames:

Python
import pandas as pd

data_1 = {
    "Name": ["Alice", "Bob", "Charlie"],
    "City": ["New York", "Los Angeles", "Chicago"]
}

data_2 = {
    "Name": ["Alice", "Bob", "Eve"],
    "Age": [25, 30, 28]
}

df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)

# Merging DataFrames
merged_df = pd.merge(df1, df2, on="Name", how="inner")

print(merged_df)
# Output:
#     Name         City  Age
# 0  Alice     New York   25
# 1    Bob  Los Angeles   30
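
The how parameter controls which keys survive the merge. As a sketch with the same two DataFrames, a left merge keeps every row of df1, while an outer merge keeps rows from both sides; indicator=True adds a "_merge" column recording where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "City": ["New York", "Los Angeles", "Chicago"],
})
df2 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Eve"],
    "Age": [25, 30, 28],
})

# how="left" keeps every row of df1; unmatched rows get NaN in df2's columns
left_df = pd.merge(df1, df2, on="Name", how="left")

# how="outer" keeps rows from both sides; indicator=True adds a "_merge" column
outer_df = pd.merge(df1, df2, on="Name", how="outer", indicator=True)

# Collect the distinct origin labels as plain strings
origins = sorted(outer_df["_merge"].astype(str).unique())

print(len(left_df))                      # 3
print(int(left_df["Age"].isna().sum()))  # 1
print(len(outer_df))                     # 4
print(origins)                           # ['both', 'left_only', 'right_only']
```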

Pivot Tables:

A pivot table summarizes a DataFrame by grouping rows on one or more keys and aggregating a value column:

Python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Chicago"],
    "Sales": [200, 300, 250, 400, 350]
}

df = pd.DataFrame(data)

# Creating a pivot table
pivot_table = df.pivot_table(values="Sales", index="City", aggfunc="sum")

print(pivot_table)
# Output:
#              Sales
# City
# Chicago        600
# Houston        400
# Los Angeles    300
# New York       200
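
A pivot table can also spread a second key across the columns. The sketch below assumes a hypothetical "Quarter" column that is not in the data above; fill_value replaces combinations that have no rows.

```python
import pandas as pd

# Hypothetical data with a "Quarter" column added for this example
df = pd.DataFrame({
    "City": ["Chicago", "Chicago", "Houston", "New York", "New York"],
    "Quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "Sales": [250, 350, 400, 200, 150],
})

# Rows are cities, columns are quarters; fill_value replaces empty cells
by_quarter = df.pivot_table(
    values="Sales",
    index="City",
    columns="Quarter",
    aggfunc="sum",
    fill_value=0,
)

print(int(by_quarter.loc["Chicago", "Q2"]))   # 350
print(int(by_quarter.loc["Houston", "Q2"]))   # 0
print(list(by_quarter.columns))               # ['Q1', 'Q2']
```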

Pandas is an essential tool for data analysis and manipulation. The operations covered here (loading, filtering, grouping, cleaning, transforming, merging, and pivoting) handle a wide variety of everyday data tasks.