
Data Science Class 10: Stats, Visualisation & KNN Project | CBSE AI
Chapter 8: Data Science
Understanding Statistics, Visualisation, and Data Access
What is Data Science and Where Does It Fit?
Artificial Intelligence is the goal of creating smart machines. Data Science is the set of tools and methods we use to find patterns in data. You cannot have AI without Data Science. Data is the “fuel,” and Data Science is the “engine” that turns that fuel into intelligent actions.
Data Science is a key part of the AI Project Cycle, a structured plan for building an AI product. It is heavily involved in the “Data Exploration” and “Modelling” phases.
The AI Project Cycle: An Infographic
This diagram shows the steps for building any AI model. Data Science is central to this process.
(Figure: The AI Project Cycle. Data Science focuses on steps 3 and 4, Data Exploration and Modelling.)
Real-World Applications
Data Science is used everywhere. Here are some examples:
Recommendation Engines
Suggesting movies on Netflix or products on Amazon based on your past behavior.
Price Prediction
Airlines and ride-sharing apps changing prices based on demand, time, and location.
Medical Diagnosis
Using computer vision to analyze medical scans (like X-rays) to find signs of disease.
Spam Filtering
Your email service automatically moving junk mail to the spam folder based on keywords.
(Cross-link: See our guide on Model Evaluation to understand how we measure success.)
Data Collection & Access: The Raw Materials of AI
Data Collection Strategies
To build a model, we must first collect data. This can be done in two ways:
- Offline Methods: Manual processes like in-person surveys, interviews, or direct observations.
- Online Methods: Gathering data from digital sources, like downloading datasets or scraping websites.
(Figure: Data Collection Methods)
Common Data Sources
Data can come from many places. Here are the most common sources for projects:
- Databases: The most common source for businesses. Data is stored in structured tables (like in SQL).
- Files: Simple text files, CSVs (Comma-Separated Values), and Excel spreadsheets are very common for projects.
- Surveys: A classic offline method. You create a questionnaire and collect answers from a group of people.
Advanced Online Collection Methods
Besides downloading files, there are two primary online methods:
Automated Data from the World: Sensors
A sensor is hardware that connects the physical and digital worlds. It detects a change in its environment (like temperature) and converts it into a digital signal. Sensors provide continuous, real-time data streams.
- Example: Your smartphone’s GPS sensor, which constantly reports your location.
- Example: A weather station’s thermometer, which reports temperature every minute.
Web Data from Other Systems: APIs
API stands for Application Programming Interface. Think of it as a set of rules that lets one computer program talk to another.
This is a formal system for communication. Your application (the “client”) sends a structured request to a server (the system with the data). The API is the middle layer that receives the request, gets the correct data, and returns it to your app in a predictable format. Because the client never touches the server’s database directly, access stays controlled and orderly. To retrieve data from a web API, the client usually sends an HTTP `GET` request.
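The request-and-response flow described above can be sketched in Python. This is a minimal illustration, not a real service: the endpoint `api.example.com/weather`, its `city` and `units` parameters, and the sample response body are all made up for this example, and no network call is made.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.com/weather"


def build_get_url(base_url, params):
    """Attach query parameters to the base URL, as a GET request would."""
    return f"{base_url}?{urlencode(params)}"


def parse_response(body):
    """Convert the JSON text returned by a server into a Python dict."""
    return json.loads(body)


# The client asks for one specific piece of data.
url = build_get_url(BASE_URL, {"city": "Delhi", "units": "metric"})
print(url)

# A made-up response body, standing in for what a server might send back.
sample_body = '{"city": "Delhi", "temperature_c": 31.5}'
data = parse_response(sample_body)
print(data["temperature_c"])
```

With a real API, passing the URL to `urllib.request.urlopen` would send the `GET` request and return the response body to parse.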
A Note on Web Scraping
Web Scraping is the process of using automated scripts to extract information from websites. While it can be a powerful tool for gathering unique data (e.g., product prices), it must be done ethically and responsibly.
- Check `robots.txt`: Always check a website’s `robots.txt` file (e.g., `website.com/robots.txt`) to see which pages automated scripts are allowed to visit and which are disallowed.
- Be Respectful: Do not send too many requests in a short time, as this can overload or even crash the server.
- Privacy: Do not scrape or store personal information about people.
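The `robots.txt` check can be automated with Python’s standard `urllib.robotparser` module. The sketch below parses a made-up `robots.txt` held in a string (the rules and the page paths are assumptions for illustration), so it runs without contacting any website.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, standing in for the file a real site would serve.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Ask whether a generic crawler ("*") may fetch each page.
allowed_public = parser.can_fetch("*", "https://website.com/products")
allowed_private = parser.can_fetch("*", "https://website.com/private/accounts")

print("products page allowed:", allowed_public)
print("private page allowed:", allowed_private)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the real file instead. Adding a pause such as `time.sleep(1)` between page requests is one simple way to follow the “Be Respectful” rule.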
Open Data Resources for Your Projects
Many governments publish free datasets for public use, and these work well for student projects. In India, the main resource is the Open Government Data (OGD) Platform India (`data.gov.in`).
Foundational Statistics with Python
Core Statistical Concepts
Statistics help us summarize and understand large amounts of data. The most basic concepts are measures of “central tendency,” which describe the center of the data.
- Mean (Average): The sum of all values divided by the number of values.
  - Best for: Symmetrical data (like heights).
  - Problem: Very sensitive to outliers (extreme values).
- Median (Middle): The middle value in a dataset that has been sorted. If there is an even number of values, it is the average of the two middle values.
  - Best for: Skewed data (like salaries, where a few billionaires are outliers).
  - Problem: Used less often than the mean in advanced mathematics.
- Mode (Most Frequent): The value that appears most often.
  - Best for: Categorical data (like finding the most popular shirt color).
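The three measures can be tried out with Python’s built-in `statistics` module. The numbers and shirt colors below are made up for illustration; the point of the sketch is to show how a single outlier pulls the mean far more than the median.

```python
from statistics import mean, median, mode

# Made-up exam scores with no extreme values.
scores = [50, 60, 70, 80, 90]
mean_before = mean(scores)
median_before = median(scores)

# Add one extreme value (an outlier) and recompute.
scores_with_outlier = scores + [1000]
mean_after = mean(scores_with_outlier)
median_after = median(scores_with_outlier)

print("Mean:  ", mean_before, "->", mean_after)
print("Median:", median_before, "->", median_after)

# The mode works on categories as well as numbers.
shirt_colors = ["red", "blue", "red", "green", "red", "blue"]
most_popular = mode(shirt_colors)
print("Most popular color:", most_popular)
```

Here the mean jumps from 70 to 225 while the median only moves from 70 to 75, which is why the median is preferred for skewed data.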
Understanding Correlation
Correlation is a statistical measure that describes the relationship between two variables. It tells us if two variables move together. The correlation coefficient is a value between -1 and +1.
- Positive Correlation (e.g., +0.8): As one variable increases, the other also tends to increase. (e.g., Study Hours and Exam Scores).
- Negative Correlation (e.g., -0.7): As one variable increases, the other tends to decrease. (e.g., Video Game Hours and Exam Scores).
- Near-Zero Correlation (e.g., 0.1): There is no clear linear relationship between the two variables.
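To make the coefficient concrete, here is a minimal sketch that computes it by hand. It assumes the Pearson correlation coefficient, the most common choice (the text above does not name a specific one), and the study, gaming, and score figures are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two variables vary together.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(variation_x * variation_y)


# Made-up data for five students.
study_hours = [1, 2, 3, 4, 5]
game_hours = [5, 4, 3, 2, 1]
exam_scores = [52, 58, 65, 70, 80]

r_study = pearson_r(study_hours, exam_scores)
r_games = pearson_r(game_hours, exam_scores)

print(f"Study hours vs. score: {r_study:+.2f}")
print(f"Game hours vs. score:  {r_games:+.2f}")
```

In this toy data the first value is close to +1 (strong positive correlation) and the second is close to -1 (strong negative correlation). A high correlation on its own does not prove that one variable causes the other.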
(Figure: Correlation Examples)
Data Types: The First Step in Analysis
Before you can calculate statistics or plot charts, you must understand your data types. Data generally falls into two main groups:
| Main Type | Sub-Type | Description | Example |
|---|---|---|---|
| 1. Categorical (Qualitative) | Nominal | Categories with no meaningful order. | ‘Color’ (Red, Blue, Green), ‘City’ (Delhi, Mumbai) |
| 1. Categorical (Qualitative) | Ordinal | Categories that have a clear order or rank. | ‘Size’ (Small, Medium, Large), ‘Rating’ (1, 2, 3, 4, 5) |
| 2. Numerical (Quantitative) | Discrete | Countable, whole numbers. | ‘Number of students’ (30), ‘Age in years’ (15) |
| 2. Numerical (Quantitative) | Continuous | Can take any value within a range (measured). | ‘Height’ (170.5 cm), ‘Temperature’ (22.8°C) |
Why does this matter? You calculate a ‘Mean’ for Numerical data, but you find the ‘Mode’ for Categorical data. You use a ‘Bar Chart’ for Categorical data but a ‘Histogram’ for Numerical data.
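As a rough illustration of this rule, the sketch below picks the summary statistic from the data type. The helper name `summarise` and the sample columns are hypothetical; the check simply treats a column as numerical when every value is a number.

```python
from statistics import mean, mode


def summarise(values):
    """Return ('mean', value) for numerical data and ('mode', value) otherwise."""
    is_numerical = all(isinstance(v, (int, float)) for v in values)
    if is_numerical:
        return ("mean", mean(values))
    return ("mode", mode(values))


heights_cm = [150.0, 160.0, 170.0]               # numerical (continuous)
cities = ["Delhi", "Mumbai", "Delhi", "Pune"]     # categorical (nominal)

height_summary = summarise(heights_cm)
city_summary = summarise(cities)

print(height_summary)
print(city_summary)
```

One caveat: ordinal data stored as numbers (such as a 1 to 5 rating) would be treated as numerical by this simple check, so the data type still has to be decided by a person who understands the column.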
What is Data Cleaning?
Data collection is messy. Real-world data is almost never perfect. Before exploring or modeling, we must perform Data Cleaning, which is part of the Data Exploration phase.
This involves handling common problems like:
- Missing Values: A person forgot to answer a survey question, so the cell is empty (often shown as `NaN`). We might need to fill this with the mean, or remove the row.
- Outliers: An extreme value that is likely a mistake (e.g., a person’s age listed as 150 years). We might need to cap it or remove it.
- Incorrect Data: Typos or formatting errors (e.g., “Male”, “M”, “male” all referring to the same thing). We must standardize these to one format.
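The three fixes above can be sketched in plain Python. Everything here is an assumption for illustration: the survey rows are made up, `None` stands in for a missing value, and the 120-year age limit is an arbitrary cut-off.

```python
from statistics import mean

# Made-up survey rows; None marks a missing answer.
rows = [
    {"age": 15, "gender": "Male"},
    {"age": None, "gender": "M"},
    {"age": 150, "gender": "male"},
    {"age": 16, "gender": "Female"},
    {"age": 17, "gender": "F"},
]

GENDER_LABELS = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}
MAX_PLAUSIBLE_AGE = 120  # assumed cut-off for spotting impossible ages

# 1. Incorrect data: map every spelling to one standard label.
for row in rows:
    row["gender"] = GENDER_LABELS[row["gender"].strip().lower()]

# 2. Outliers: remove rows whose age is impossible.
rows = [r for r in rows if r["age"] is None or r["age"] <= MAX_PLAUSIBLE_AGE]

# 3. Missing values: fill the gaps with the mean of the known ages.
known_ages = [r["age"] for r in rows if r["age"] is not None]
fill_value = mean(known_ages)
for row in rows:
    if row["age"] is None:
        row["age"] = fill_value

for row in rows:
    print(row)
```

The outlier is removed before the mean is calculated; otherwise the age of 150 would distort the value used to fill the missing cell.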
Data Visualization with Python
Why We Visualize Data
Visualization is part of “Data Exploration”. While a statistic gives you a single number, a chart tells a richer story. Charts help us see trends, relationships, and unusual data points (anomalies).
Common Chart Types and When to Use Them
Choosing the right chart is important. As a quick guide:
- Use a bar chart to clearly see which category is biggest or smallest.
- Use a line chart to see if values are increasing or decreasing over time.
- Use a scatter plot to check for correlation (a relationship).
- Use a pie chart to show percentages or fractions of a total.
| Chart Type | What It Shows | Example Use Case |
|---|---|---|
| Bar Chart | Compares values across different categories. | Comparing the number of students in different houses (Gryffindor, Hufflepuff, etc.). |
| Line Chart | Shows a trend over a continuous period. | Tracking the temperature in your city over the last 30 days. |
| Pie Chart | Shows parts of a whole (proportions). | Showing the percentage of your day spent on sleeping, eating, and studying. (Use with caution for more than 5 categories.) |
| Histogram | Shows the distribution (frequency) of a single numerical variable. | Seeing how many students scored in the 80-90% range, 90-100% range, etc. |
| Scatter Plot | Shows the relationship between two numerical variables. | Seeing if there is a link between “Hours Studied” and “Exam Score”. |
Interactive Chart: Study Hours vs. Exam Score (D3.js)
A scatter plot is a common chart type that shows the relationship between two numerical variables.
(Figure: Interactive scatter plot of Study Hours vs. Exam Score, built with D3.js.)
Mini-Project: Classification with KNN
Classification is a type of “Supervised Learning” where the goal is to predict a category or label.
We will use one of the simplest and most intuitive classification algorithms: K-Nearest Neighbors (KNN).
How KNN Works (The “Neighbor” Analogy)
The KNN algorithm follows the simple idea that “you are like your nearest neighbors.” It assumes that similar data points exist close to each other.
To classify a new, unknown data point, the algorithm does three things:
1. It measures the distance from the new point to every point in the training data.
2. It picks the K closest points (the “nearest neighbors”).
3. It takes a majority vote among those neighbors and assigns the winning class to the new point.
How to Choose the ‘K’ Value?
Choosing the right ‘K’ is a trade-off. There is no single “best” answer, and it often requires experimentation.
A common practice is to start with an odd number (like 3, 5, or 7) to reduce the chance of ties in voting. You can test several ‘K’ values to see which one gives the best accuracy on your test data.
Pros and Cons of the KNN Algorithm
| Pros (Advantages) | Cons (Disadvantages) |
|---|---|
| 1. Simple to Understand: Its logic is easy to explain (voting). | 1. Slow for large data: It must compare a new point to *every* point in the training data. |
| 2. No “Training” Phase: It just stores the data. This is called a “lazy learner”. | 2. Needs Feature Scaling: Variables on larger scales (e.g., 1-1000) will outweigh variables on small scales (e.g., 1-10). |
| 3. Good for Complex Data: It can learn non-linear decision boundaries. | 3. “Curse of Dimensionality”: It performs poorly when there are too many features (e.g., >100). |
A Quick Note on Feature Scaling
The Cons column above lists “Needs Feature Scaling”. This is a very important concept for any algorithm that uses distance.
Imagine you have two features: ‘Age’ (e.g., 10 to 70) and ‘Salary’ (e.g., 20,000 to 1,000,000). When KNN calculates the “distance” between two people, the ‘Salary’ number will completely dominate the ‘Age’ number. A difference of 20 in age is tiny compared to a difference of 20,000 in salary.
Feature Scaling fixes this. It is a process that rescales all numerical features to be on a similar, common scale (e.g., all of them between 0 and 1). This ensures that all features contribute fairly to the distance calculation.
Project Walkthrough: Personality Prediction Model
This walkthrough outlines the project. We use the `16P.csv` dataset and the `scikit-learn` Python library.
Step 1: Problem Scoping & Data
Goal: Predict a person’s 16-type personality (e.g., ‘INFP’) based on 60 survey answers.
Data: `16P.csv` dataset. Each row is one person. The 60 answer columns hold scores from -3 to +3, and the last column (‘Personality’) is the target we want to predict.
Step 2: Load and Prepare Data (Data Exploration)
We use Pandas to load the CSV. We separate the data into `X` (features, the 60 answers) and `y` (target, the personality type). We then split the data into a training set and a testing set, holding out 20% of the rows for testing.
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv('16P.csv')

# Separate features (X) and target (y)
X = df.drop('Personality', axis=1)
y = df['Personality']

# Split data into training and testing sets (20% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Step 3: Modelling (Train the KNN Classifier)
We create the KNN model. We will pick K=5 for this example.
```python
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN model with K=5
knn = KNeighborsClassifier(n_neighbors=5)

# "Train" the model (KNN just stores the data)
knn.fit(X_train, y_train)
```
Step 4: Evaluation
Now we use the trained model (`knn`) to make predictions on the `X_test` data (which it has never seen). We then compare its predictions (`y_pred`) to the true answers (`y_test`).
```python
from sklearn.metrics import accuracy_score, classification_report

# Make predictions on the test data
y_pred = knn.predict(X_test)

# Check the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Show the detailed report
print(classification_report(y_test, y_pred))
```
(Cross-link: See how this entire process fits into the AI Project Cycle.)
Quick Revision: Q&A and FAQs
Test your knowledge with the questions below.
Q1: What is the main difference between Mean and Median?
The Mean is the average of all data points. The Median is the middle data point after sorting them.
Hint: The Mean is sensitive to outliers (extreme values), while the Median is not. Use the Median for data like salaries, which can have outliers.
Q2: What is an API, and why is it used for data collection?
An API (Application Programming Interface) is a set of rules that lets two computer programs talk to each other.
It is used for data collection because it is a secure and stable way to get data from a server without needing to download a full file or scrape a website. It lets you request specific data (e.g., “Give me today’s weather”) and get a clean, predictable response (often in JSON format).
Q3: When should I use a Bar Chart vs. a Scatter Plot?
Use a Bar Chart when you are comparing a numerical value across different categories (e.g., number of students in `House A` vs. `House B`).
Use a Scatter Plot when you are checking the relationship between two different numerical variables (e.g., `Study Hours` vs. `Exam Score`).
Q4: What does “K” in K-Nearest Neighbors (KNN) mean?
K is the number of neighbors the algorithm looks at to make a decision.
If `K=1`, the algorithm only looks at the single closest neighbor. If `K=5`, it finds the 5 closest neighbors and takes a “majority vote” from them to decide on the new point’s class.
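The neighbor vote described in this answer can be sketched in plain Python. This is an illustrative sketch, not how `scikit-learn` implements KNN internally: it assumes straight-line (Euclidean) distance and min-max scaling to the 0 to 1 range, and the six people, their ‘Age’ and ‘Salary’ values, and the “young”/“senior” labels are made up.

```python
from collections import Counter
from math import dist


def fit_min_max(rows):
    """Return the per-column minimum and maximum of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale_row(row, lows, highs):
    """Rescale one row so each training feature range maps to 0..1."""
    return [
        (value - low) / (high - low) if high > low else 0.0
        for value, low, high in zip(row, lows, highs)
    ]


def knn_predict(train_rows, train_labels, new_row, k):
    """Return the majority label among the k training rows closest to new_row."""
    by_distance = sorted(
        (dist(row, new_row), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up training data: [age, salary] with an age-group label.
train_rows = [
    [20, 30_000], [25, 900_000], [22, 450_000],
    [60, 40_000], [65, 880_000], [62, 520_000],
]
train_labels = ["young", "young", "young", "senior", "senior", "senior"]
new_row = [24, 500_000]

# Without scaling, salary dominates the distance calculation.
raw_prediction = knn_predict(train_rows, train_labels, new_row, k=3)

# With scaling, age and salary contribute on the same 0..1 scale.
lows, highs = fit_min_max(train_rows)
scaled_train = [scale_row(row, lows, highs) for row in train_rows]
scaled_new = scale_row(new_row, lows, highs)
scaled_prediction = knn_predict(scaled_train, train_labels, scaled_new, k=3)

print("Prediction without scaling:", raw_prediction)
print("Prediction with scaling:   ", scaled_prediction)
```

With this toy data, the unscaled vote goes to “senior” because the nearest neighbors are chosen almost entirely by salary, while the scaled vote goes to “young”, matching the 24-year-old’s age group. The scaling limits are taken from the training rows only and then reused for the new point.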




