Data analysis involves two key activities: computing summary statistics to numerically describe a dataset, and creating data visualizations to communicate patterns and trends effectively.
Summary statistics are numerical values that describe the key characteristics of a dataset. They help analysts quickly understand the distribution, center, and spread of data.
| Measure | Definition | Example (dataset: 2, 4, 4, 6, 8) |
|---|---|---|
| Mean | Sum of all values divided by the count | (2 + 4 + 4 + 6 + 8) / 5 = 4.8 |
| Median | Middle value when data is sorted | 4 (3rd value) |
| Mode | Most frequently occurring value | 4 (appears twice) |
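The three measures in the table can be checked with a minimal sketch using Python's standard-library `statistics` module on the same dataset (2, 4, 4, 6, 8). The variable names are illustrative only.

```python
import statistics

# The example dataset from the table above
values = [2, 4, 4, 6, 8]

# Mean: sum of all values divided by the count
mean_value = statistics.mean(values)

# Median: middle value of the sorted data (the 3rd of 5 values here)
median_value = statistics.median(values)

# Mode: the most frequently occurring value
mode_value = statistics.mode(values)

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
```

For a dataset with an even number of values, `statistics.median` returns the average of the two middle values.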
Data visualization converts raw data into graphical formats, making it easier to identify patterns, trends, and outliers.
| Scenario | Best Chart |
|---|---|
| Comparing categories | Bar Chart |
| Showing parts of a whole | Pie Chart |
| Tracking trends over time | Line Graph |
| Showing frequency distribution | Histogram |
| Showing relationship between two variables | Scatter Plot |
Python provides powerful libraries to create visualizations and compute summary statistics from pre-existing datasets.
- matplotlib: The foundational plotting library for creating bar charts, line graphs, pie charts, and more.
- pandas: Used for loading and manipulating datasets; provides built-in `.describe()` for summary statistics.
- numpy: Provides mathematical functions for computing mean, median, standard deviation, etc.

```python
import pandas as pd

# Load a dataset
data = pd.read_csv('students.csv')

# Display summary statistics (count, mean, std, min, quartiles, max)
# for every numeric column
print(data.describe())

# Individual statistics for the 'score' column
print("Mean:", data['score'].mean())
print("Median:", data['score'].median())
# mode() returns a Series because there can be ties; [0] takes the first mode
print("Mode:", data['score'].mode()[0])
```
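The same kind of statistics can also be computed with numpy, which the list above mentions but the example does not use. The following is a minimal sketch that reuses the earlier example dataset (2, 4, 4, 6, 8) rather than a CSV file; the variable names are illustrative.

```python
import numpy as np

# The example dataset used in the summary statistics table
values = np.array([2, 4, 4, 6, 8])

mean_value = np.mean(values)
median_value = np.median(values)

# np.std defaults to the population standard deviation (ddof=0)
std_population = np.std(values)

# ddof=1 gives the sample standard deviation
std_sample = np.std(values, ddof=1)

print("Mean:", mean_value)
print("Median:", median_value)
print("Population std:", std_population)
print("Sample std:", std_sample)
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, so the `std` row printed by `.describe()` corresponds to `std_sample` here, not to numpy's default.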
```python
import matplotlib.pyplot as plt

# Bar chart: compare enrollment counts across subjects
categories = ['Math', 'Science', 'English', 'History', 'Art']
students = [45, 60, 55, 30, 40]

plt.bar(categories, students, color='steelblue')
plt.title('Student Enrollment by Subject')
plt.xlabel('Subject')
plt.ylabel('Number of Students')
plt.show()
```
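```python
import matplotlib.pyplot as plt

# Pie chart: show each subject's share of the whole
labels = ['Math', 'Science', 'English', 'History']
sizes = [30, 25, 25, 20]

# autopct prints each slice's percentage with one decimal place
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Subject Distribution')
plt.show()
```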
```python
import matplotlib.pyplot as plt

# Line graph: track temperature over seven consecutive days
days = list(range(1, 8))
temperature = [22, 25, 23, 28, 30, 27, 24]

# marker='o' draws a dot at each daily reading
plt.plot(days, temperature, marker='o', color='tomato')
plt.title('Daily Temperature Over a Week')
plt.xlabel('Day')
plt.ylabel('Temperature (°C)')
plt.show()
```
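The chart selection table also lists histograms and scatter plots, which the examples above do not cover. The sketch below shows one way to draw both with matplotlib. All data values (`scores`, `hours_studied`, `exam_scores`) are made up for illustration and do not come from any real dataset.

```python
import matplotlib.pyplot as plt

# Hypothetical data, invented for illustration only
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 80, 84, 88, 93]
hours_studied = [1, 2, 2, 3, 4, 5, 6, 7]
exam_scores = [50, 55, 58, 62, 68, 74, 79, 85]

# Two plots side by side in one figure
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: split the scores into 5 equal-width bins and count each bin
counts, bin_edges, _ = ax_hist.hist(
    scores, bins=5, color='steelblue', edgecolor='white'
)
ax_hist.set_title('Distribution of Scores')
ax_hist.set_xlabel('Score')
ax_hist.set_ylabel('Number of Students')

# Scatter plot: one point per (hours studied, exam score) pair
ax_scatter.scatter(hours_studied, exam_scores, color='tomato')
ax_scatter.set_title('Study Time vs. Exam Score')
ax_scatter.set_xlabel('Hours Studied')
ax_scatter.set_ylabel('Exam Score')

fig.tight_layout()
plt.show()
```

The `bins` argument controls how finely the histogram groups the data; fewer bins give a smoother but coarser picture of the distribution.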
matplotlib, pandas, and numpy are essential tools for data analysis and visualization.