Chapter 8: Data Science

Understanding Statistics, Visualisation, and Data Access

Updated: October 2025

What is Data Science and Where Does It Fit?

Artificial Intelligence is the goal of creating smart machines. Data Science is the set of tools and methods we use to find patterns in data. Modern AI systems learn from data, so they depend on Data Science. Data is the “fuel,” and Data Science is the “engine” that turns that fuel into intelligent actions.

Data Science is a key part of the AI Project Cycle, a structured plan for building an AI product. It is heavily involved in the “Data Exploration” and “Modelling” phases.

The AI Project Cycle: An Infographic

This diagram shows the steps for building any AI model. Data Science is central to this process.

The AI Project Cycle. Data Science focuses on steps 3 and 4.

Real-World Applications

Data Science is used everywhere. Here are some examples:

Recommendation Engines

Suggesting movies on Netflix or products on Amazon based on your past behavior.

Price Prediction

Airlines and ride-sharing apps changing prices based on demand, time, and location.

Medical Diagnosis

Using computer vision to analyze medical scans (like X-rays) to find signs of disease.

Spam Filtering

Your email service automatically moving junk mail to the spam folder based on patterns in the message, such as suspicious keywords or senders.

(Cross-link: See our guide on Model Evaluation to understand how we measure success.)

Data Collection & Access: The Raw Materials of AI

Data Collection Strategies

To build a model, we must first collect data. This can be done in two ways:

Offline Methods: Manual processes like in-person surveys, interviews, or direct observations.
Online Methods: Gathering data from digital sources, like downloading datasets or scraping websites.


Common Data Sources

Data can come from many places. Here are the most common sources for projects:

Databases: The most common source for businesses. Data is stored in structured tables (like in SQL).
Files: Simple text files, CSVs (Comma-Separated Values), and Excel spreadsheets are very common for projects.
Surveys: A classic offline method. You create a questionnaire and collect answers from a group of people.

Advanced Online Collection Methods

Besides downloading files, there are two primary online methods:

Automated Data from the World: Sensors

A sensor is hardware that connects the physical and digital worlds. It detects a change in its environment (like temperature) and converts it into a digital signal. Sensors provide continuous, real-time data streams.

Example: Your smartphone’s GPS sensor, which constantly reports your location.
Example: A weather station’s thermometer, which reports temperature every minute.

Web Data from Other Systems: APIs

API stands for Application Programming Interface. Think of it as a set of rules that lets one computer program talk to another.

An API is a formal system for communication. Your application (the “client”) sends a structured request to a server (the system with the data). The API is the middle layer that receives the request, fetches the correct data, and returns it to your app in a predictable format, often JSON. Because the client never touches the server’s database directly, the server can control what data is shared and with whom. For web APIs, the HTTP `GET` request is the usual way to retrieve data.
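
The client side of this exchange can be sketched with Python's standard library. The address `https://api.example.com/weather`, the parameter names, and the sample JSON reply are all hypothetical, chosen only for illustration; a real API documents its own address and fields. No request is actually sent here, so the sketch runs offline.

```python
import json
from urllib.parse import urlencode

# Hypothetical API address and query parameters (assumptions for illustration).
BASE_URL = "https://api.example.com/weather"
params = {"city": "Delhi", "units": "metric"}

# A GET request carries its parameters in the URL itself.
request_url = f"{BASE_URL}?{urlencode(params)}"
print(request_url)

# A made-up reply in JSON, the text format many APIs return.
sample_response = '{"city": "Delhi", "temperature_c": 31.5, "humidity": 40}'

# json.loads turns the JSON text into a regular Python dictionary.
data = json.loads(sample_response)
print(data["temperature_c"])
```

To send the request for real, `urllib.request.urlopen(request_url)` or the third-party `requests` library would be typical choices.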

A Note on Web Scraping

Web Scraping is the process of using automated scripts to extract information from websites. While it can be a powerful tool for gathering unique data (e.g., product prices), it must be done ethically and responsibly.

Check `robots.txt`: Always check a website’s `robots.txt` file (e.g., `website.com/robots.txt`) to see which pages you are allowed or disallowed from scraping.
Be Respectful: Do not send too many requests in a short time, as this can overload the server and slow it down for other users.
Privacy: Do not scrape or store personal information about people.
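
The first two rules can be sketched in code with Python's built-in `urllib.robotparser`. The `robots.txt` rules are written inline so the example runs without a network connection, and the bot name, the page URLs, and the 0.1-second pause are assumptions for illustration only.

```python
import time
from urllib.robotparser import RobotFileParser

# Example robots.txt rules, written inline so the sketch runs offline.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Hypothetical pages we would like to scrape.
urls = [
    "https://website.com/products",
    "https://website.com/private/accounts",
]

allowed = []
for url in urls:
    # Only keep pages that the rules permit for our (hypothetical) bot name.
    if parser.can_fetch("student-project-bot", url):
        allowed.append(url)
        time.sleep(0.1)  # pause between requests; real scrapers often wait longer

print(allowed)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download the real `robots.txt` instead of using inline rules.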

Open Data Resources for Your Projects

Governments provide free, high-quality data for public use. These are excellent for student projects. The main resource in India is the Open Government Data (OGD) Platform of India (`data.gov.in`).

Foundational Statistics with Python

Core Statistical Concepts

Statistics help us summarize and understand large amounts of data. The most basic concepts are measures of “central tendency,” which describe the center of the data.

Mean (Average): The sum of all values divided by the number of values.
- Best for: Symmetrical data (like heights).
- Problem: Very sensitive to outliers (extreme values).
Median (Middle): The middle value in a dataset that has been sorted.
- Best for: Skewed data (like salaries, where a few billionaires are outliers).
- Problem: Ignores how large or small the other values are, and it is harder to use in formulas than the Mean.
Mode (Most Frequent): The value that appears most often.
- Best for: Categorical data (like finding the most popular shirt color).
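
A short sketch of all three measures, using Python's built-in `statistics` module. The exam scores and shirt colors are made-up values for illustration. Adding a single extreme value (200) shows how differently the Mean and the Median react to an outlier.

```python
import statistics

# Hypothetical exam scores (an assumption for illustration).
scores = [70, 75, 80, 85, 90]
print(statistics.mean(scores))    # 80
print(statistics.median(scores))  # 80

# Add one extreme value and recompute.
with_outlier = scores + [200]
print(statistics.mean(with_outlier))    # 100
print(statistics.median(with_outlier))  # 82.5

# The mode works on categories as well as numbers.
shirt_colors = ["red", "blue", "blue", "green", "blue", "red"]
print(statistics.mode(shirt_colors))  # blue
```

One outlier moves the Mean by 20 points but the Median by only 2.5, which is why the Median is preferred for skewed data.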

Understanding Correlation

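Correlation is a statistical measure of how strongly two variables move together and in which direction. The correlation coefficient is a value between -1 and +1: the closer it is to either end, the stronger the linear relationship. Correlation alone does not prove that one variable causes the other.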

Positive Correlation (e.g., +0.8): As one variable increases, the other also tends to increase. (e.g., Study Hours and Exam Scores).
Negative Correlation (e.g., -0.7): As one variable increases, the other tends to decrease. (e.g., Video Game Hours and Exam Scores).
Near-Zero Correlation (e.g., 0.1): There is no clear linear relationship between the two variables.
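
The coefficient described here is usually the Pearson correlation coefficient. The sketch below computes it by hand in plain Python; the student data is invented for illustration, and the function name `pearson_r` is our own choice rather than a library function.

```python
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations of each variable.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical data for five students (assumptions for illustration).
study_hours = [1, 2, 3, 4, 5]
game_hours = [5, 4, 3, 2, 1]
exam_scores = [52, 58, 65, 70, 78]

r_study = pearson_r(study_hours, exam_scores)
r_games = pearson_r(game_hours, exam_scores)
print(round(r_study, 2), round(r_games, 2))
```

With this data the first value comes out close to +1 and the second close to -1. In practice, `numpy.corrcoef` or the Pandas `DataFrame.corr` method would normally do this calculation.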


Data Types: The First Step in Analysis

Before you can calculate statistics or plot charts, you must understand your data types. Data generally falls into two main groups:

Main Type	Sub-Type	Description	Example
1. Categorical (Qualitative)	Nominal	Categories with no meaningful order.	‘Color’ (Red, Blue, Green), ‘City’ (Delhi, Mumbai)
1. Categorical (Qualitative)	Ordinal	Categories that have a clear order or rank.	‘Size’ (Small, Medium, Large), ‘Rating’ (1, 2, 3, 4, 5)
2. Numerical (Quantitative)	Discrete	Countable, whole numbers.	‘Number of students’ (30), ‘Age in years’ (15)
2. Numerical (Quantitative)	Continuous	Can take any value within a range (measured).	‘Height’ (170.5 cm), ‘Temperature’ (22.8°C)

Why does this matter? You calculate a ‘Mean’ for Numerical data, but you find the ‘Mode’ for Categorical data. You use a ‘Bar Chart’ for Categorical data but a ‘Histogram’ for Numerical data.

What is Data Cleaning?

Data collection is messy. Real-world data is almost never perfect. Before exploring or modeling, we must perform Data Cleaning, which is part of the Data Exploration phase.

This involves handling common problems like:

Missing Values: A person forgot to answer a survey question, so the cell is empty (shown as `NaN` in Pandas). We might fill it with the mean or median of that column, or remove the row.
Outliers: An extreme value that is likely a mistake (e.g., a person’s age listed as 150 years). We might need to cap it or remove it.
Incorrect Data: Typos or formatting errors (e.g., “Male”, “M”, “male” all referring to the same thing). We must standardize these to one format.
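
The three fixes can be sketched in plain Python on a tiny, made-up survey. The rows, the label mapping, and the age limit of 120 are all assumptions for illustration; `None` stands in for a missing answer.

```python
import statistics

# Hypothetical survey rows (assumptions for illustration). None marks a missing answer.
rows = [
    {"age": 15, "gender": "Male"},
    {"age": None, "gender": "M"},
    {"age": 16, "gender": "female"},
    {"age": 150, "gender": "male"},  # almost certainly a typing mistake
    {"age": 17, "gender": "F"},
]

# Incorrect data: map every spelling to one standard label.
GENDER_LABELS = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}
for row in rows:
    row["gender"] = GENDER_LABELS[row["gender"].strip().lower()]

# Outliers: drop rows whose age is outside a plausible range (the limit is an assumption).
rows = [r for r in rows if r["age"] is None or 0 < r["age"] <= 120]

# Missing values: fill the gap with the mean of the known ages.
known_ages = [r["age"] for r in rows if r["age"] is not None]
mean_age = statistics.mean(known_ages)
for row in rows:
    if row["age"] is None:
        row["age"] = mean_age

print(rows)
```

The outlier is removed before the mean is calculated, so the impossible age of 150 does not distort the value used to fill the gap.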


Data Visualization with Python

Why We Visualize Data

Visualization is part of “Data Exploration”. While stats give you a single number, charts tell a richer story. They help us see trends, relationships, and unusual data points (anomalies).

Common Chart Types and When to Use Them

Choosing the right chart is important. Here are the most common types:

Chart Type	What It Shows	Example Use Case
Bar Chart	Compares values across different categories.	Comparing the number of students in different houses (Gryffindor, Hufflepuff, etc.).
Line Chart	Shows a trend over a continuous period.	Tracking the temperature in your city over the last 30 days.
Pie Chart	Shows parts of a whole (proportions).	Showing the percentage of your day spent on sleeping, eating, and studying. (Use with caution for more than 5 categories).
Histogram	Shows the distribution (frequency) of a single numerical variable.	Seeing how many students scored in the 80-90% range, 90-100% range, etc.
Scatter Plot	Shows the relationship between two numerical variables.	(As shown below) Seeing if there is a link between “Hours Studied” and “Exam Score”.
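
To see what a histogram actually counts, the sketch below groups made-up exam scores into 10-point ranges and prints one text bar per range. The scores and the bin width are assumptions for illustration.

```python
from collections import Counter

# Hypothetical exam scores out of 100 (assumptions for illustration).
scores = [62, 67, 71, 74, 78, 81, 83, 85, 88, 91, 94, 99]

# Assign each score to a 10-point bin: 62 -> 60, 85 -> 80, and so on.
bin_counts = Counter((score // 10) * 10 for score in scores)

# Print one text bar per bin, from the lowest range to the highest.
for start in sorted(bin_counts):
    count = bin_counts[start]
    print(f"{start}-{start + 9}: {'#' * count} ({count})")
```

A plotting library such as Matplotlib would draw the same counts as vertical bars.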

Chart Chooser

Pick the chart that matches your primary goal:

Bar Chart: Use a bar chart to clearly see which category is biggest or smallest.
Line Chart: Use a line chart to see if values are increasing or decreasing over time.
Scatter Plot: Use a scatter plot to check for correlation (a relationship).
Pie Chart: Use a pie chart to show percentages or fractions of a total.

Interactive Chart: Study Hours vs. Exam Score (D3.js)

Below is an interactive scatter plot, a common chart type. It shows the relationship between two numerical variables. This chart was built with D3.js. Hover over the dots to see the exact values.

Mini-Project: Classification with KNN

Classification is a type of “Supervised Learning” where the goal is to predict a category or label.

Example: Is this email `spam` or `not spam`?
Example: Does this medical scan show `disease` or `no disease`?

We will use one of the simplest and most intuitive classification algorithms: K-Nearest Neighbors (KNN).

How KNN Works (The “Neighbor” Analogy)

The KNN algorithm follows the simple idea that “you are like your nearest neighbors.” It assumes that similar data points exist close to each other.

To classify a new, unknown data point, the algorithm does three things:

Choose a value for ‘K’: ‘K’ is the number of “neighbors” to look at (e.g., K=5).
Find the ‘K’ nearest neighbors: The algorithm calculates the distance (commonly the straight-line, or Euclidean, distance) between the new point and every point in the training data. It keeps the ‘K’ closest ones.
Take a “majority vote”: The new point is assigned the label that is most common among its ‘K’ neighbors.
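
The three steps can be sketched from scratch in a few lines of Python. This is a teaching sketch rather than how a library implements KNN: it assumes Euclidean distance, and the function name `knn_predict` and the tiny dataset are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 2: distance from the new point to every training point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data: (hours studied, hours slept) -> result.
train_points = [(1, 4), (2, 5), (2, 3), (7, 8), (8, 7), (9, 8)]
train_labels = ["fail", "fail", "fail", "pass", "pass", "pass"]

# Step 1: choose K (here K=3).
prediction = knn_predict(train_points, train_labels, new_point=(6, 7), k=3)
print(prediction)
```

The new point (6, 7) sits closest to three "pass" examples, so the vote is unanimous.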

How to Choose the ‘K’ Value?

Choosing the right ‘K’ is a trade-off. There is no single “best” answer, and it often requires experimentation.

A small ‘K’ (e.g., K=1): The model is very flexible and pays close attention to the single nearest neighbor. This makes it sensitive to “noise”: one mislabeled point can cause errors for everything near it.
A large ‘K’ (e.g., K=20): The model is very “smooth” and stable, as it takes a large poll. However, if ‘K’ is too large, it might blur the lines between different classes, losing local detail.

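A common practice is to start with a small odd number (like 3, 5, or 7). With two classes, an odd ‘K’ rules out tied votes; with more classes, ties can still happen. Test several ‘K’ values and keep the one with the best accuracy on held-out validation data (or in cross-validation). Save the test data for the final check, because choosing ‘K’ on the test set makes the reported accuracy look better than it really is.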
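
One way to experiment with ‘K’ is sketched below in plain Python. It uses leave-one-out validation (predict each point from all the others), which is a choice made for this small example rather than the only option. The dataset, including one deliberately “noisy” point, and the helper names are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train, new_point, k):
    """Majority vote among the k training points closest to new_point."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], new_point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(data, k):
    """Predict each point from all the others; return the share predicted correctly."""
    correct = 0
    for i, (point, label) in enumerate(data):
        others = data[:i] + data[i + 1:]
        if knn_predict(others, point, k) == label:
            correct += 1
    return correct / len(data)

# Hypothetical labelled points (assumptions for illustration).
data = [
    ((0, 0), "A"), ((2, 0), "A"), ((0, 2), "A"), ((2, 2), "A"), ((4, 2), "A"),
    ((8, 8), "B"), ((8, 10), "B"), ((10, 8), "B"),
    ((3, 2), "B"),  # a "noisy" point that sits inside the A cluster
]

# Try several K values and keep the one with the highest accuracy.
scores = {k: leave_one_out_accuracy(data, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data, K=1 is misled by the noisy point and K=7 is so large that the smaller class gets outvoted, so a middle value scores best. Real datasets will behave differently, which is why the experiment is worth running.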


Pros and Cons of the KNN Algorithm

Pros (Advantages)	Cons (Disadvantages)
1. Simple to Understand: Its logic is easy to explain (voting).	1. Slow for large data: It must compare a new point to every point in the training data.
2. No “Training” Phase: It just stores the data. This is called a “lazy learner”.	2. Needs Feature Scaling: Variables on larger scales (e.g., 1-1000) will outweigh variables on small scales (e.g., 1-10).
3. Good for Complex Data: It can learn non-linear decision boundaries.	3. “Curse of Dimensionality”: It performs poorly when there are too many features (e.g., >100).

A Quick Note on Feature Scaling

In the “Cons” table, we listed that KNN “Needs Feature Scaling”. This is a very important concept for any algorithm that uses distance.

Imagine you have two features: ‘Age’ (e.g., 10 to 70) and ‘Salary’ (e.g., 20,000 to 1,000,000). When KNN calculates the “distance” between two people, the ‘Salary’ number will completely dominate the ‘Age’ number. A difference of 20 in age is tiny compared to a difference of 20,000 in salary.

Feature Scaling fixes this. It is a process that rescales all numerical features to be on a similar, common scale (e.g., all of them between 0 and 1). This ensures that all features contribute fairly to the distance calculation.
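
A minimal sketch of one common scaling method, min-max scaling, written in plain Python. The ages and salaries are made-up values, and the helper name `min_max_scale` is our own. The sketch assumes the values are not all identical, since that would cause a division by zero.

```python
def min_max_scale(values):
    """Rescale numbers so the smallest becomes 0 and the largest becomes 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# Hypothetical people (assumptions for illustration).
ages = [10, 30, 50, 70]
salaries = [20_000, 40_000, 500_000, 1_000_000]

scaled_ages = min_max_scale(ages)
scaled_salaries = min_max_scale(salaries)

# Both features now lie between 0 and 1, so neither dominates the distance.
print(scaled_ages)
print(scaled_salaries)
```

In a scikit-learn project, `MinMaxScaler` from `sklearn.preprocessing` performs the same rescaling.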

Project Walkthrough: Personality Prediction Model

This walkthrough outlines the project. We use the `16P.csv` dataset and the `scikit-learn` Python library.

Step 1: Problem Scoping & Data

Goal: Predict a person’s 16-type personality (e.g., ‘INFP’) based on 60 survey answers.
Data: `16P.csv` dataset. Each row is one person. The 60 answer columns hold values from -3 to +3, and the last column (‘Personality’) is the target we want to predict.


Step 2: Load and Prepare Data (Data Exploration)

We use Pandas to load the CSV. We separate the data into `X` (features, the 60 answers) and `y` (target, the personality type).

import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv('16P.csv')

# Separate features (X) and target (y)
X = df.drop('Personality', axis=1)
y = df['Personality']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 3: Modelling (Train the KNN Classifier)

We create the KNN model. We will pick K=5 for this example.

from sklearn.neighbors import KNeighborsClassifier

# Create a KNN model with K=5
knn = KNeighborsClassifier(n_neighbors=5)

# "Train" the model (KNN just stores the data)
knn.fit(X_train, y_train)

Step 4: Evaluation

Now we use the trained model (`knn`) to make predictions on the `X_test` data (which it has never seen). We then compare its predictions (`y_pred`) to the true answers (`y_test`).

from sklearn.metrics import accuracy_score, classification_report

# Make predictions on the test data
y_pred = knn.predict(X_test)

# Check the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Show the detailed report
print(classification_report(y_test, y_pred))
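
To make the accuracy figure less of a black box, here is the same calculation done by hand: the share of predictions that match the true labels. The two short label lists are invented for illustration and do not come from `16P.csv`.

```python
# Hypothetical true labels and predictions (assumptions for illustration).
y_true = ["INFP", "ESTJ", "INFP", "ENTP", "ESTJ"]
y_guess = ["INFP", "ESTJ", "ENTP", "ENTP", "INFP"]

# Count the positions where the prediction matches the true label.
correct = sum(1 for truth, guess in zip(y_true, y_guess) if truth == guess)

# Accuracy is the number of correct predictions divided by the total.
manual_accuracy = correct / len(y_true)
print(f"Accuracy: {manual_accuracy * 100:.2f}%")  # Accuracy: 60.00%
```

Three of the five predictions match, so the accuracy is 60%.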

(Cross-link: See how this entire process fits into the AI Project Cycle.)

Quick Revision: Q&A and FAQs

Test your knowledge with these questions.

Q1: What is the main difference between Mean and Median?

The Mean is the average of all data points. The Median is the middle data point after sorting them.

Hint: The Mean is sensitive to outliers (extreme values), while the Median is much less affected by them. Use the Median for data like salaries, which can have outliers.

Q2: What is an API, and why is it used for data collection?

An API (Application Programming Interface) is a set of rules that lets two computer programs talk to each other.

It is used for data collection because it’s a secure and stable way to get data from a server without needing to download a full file or scrape a website. It lets you request specific data (e.g., “Give me today’s weather”) and get a clean, predictable response (often in JSON format).

Q3: When should I use a Bar Chart vs. a Scatter Plot?

Use a Bar Chart when you are comparing a numerical value across different categories (e.g., number of students in `House A` vs. `House B`).

Use a Scatter Plot when you are checking the relationship between two different numerical variables (e.g., `Study Hours` vs. `Exam Score`).

Q4: What does “K” in K-Nearest Neighbors (KNN) mean?

K is the number of neighbors the algorithm looks at to make a decision.

If `K=1`, the algorithm only looks at the single closest neighbor. If `K=5`, it finds the 5 closest neighbors and takes a “majority vote” from them to decide on the new point’s class.
