NLP (Class 10): Text Processing, BoW, TF-IDF & Chatbots
Part 1 of the CBSE Class 10 AI Chapter
Welcome to the study of Natural Language Processing, or NLP. This is a field of Artificial Intelligence focused on giving computers the ability to understand, interpret, and process human language. It’s the technology that powers spam filters, translation apps, and the voice assistants you talk to. This guide explains how it works, from cleaning text to understanding context.
Section 1: What Can NLP Do? Tasks & Chatbots
NLP is used in many applications you see every day. The main goal is to bridge the gap between human communication and computer understanding.
Common NLP Tasks
- Sentiment Analysis: Reading product reviews to see if they are positive or negative.
- Machine Translation: Translating text from one language to another (e.g., Google Translate).
- Spam Filtering: Checking emails to see if they are “spam” or “not spam”.
- Text Summarization: Reading a long article and providing a short summary.
Focus: Chatbots
- Rule-Based Chatbots: Follow a pre-programmed script. If you say “hello”, they are programmed to say “hi”. They cannot handle questions they weren’t programmed for.
- AI-Based Chatbots: Learn from large amounts of human conversation. They can understand context, handle new questions, and sound more natural.
Section 2: The AI Project Cycle for NLP
Building an NLP model follows the same AI Project Cycle as other projects. The main difference is the *type* of data we use: text.
The 5 Stages of an NLP Project
1. Problem Scoping: What text problem are we solving? (e.g., “Filter spam emails,” “Translate reviews”).
2. Data Acquisition: Collecting text data (e.g., thousands of sample emails, both spam and not spam).
3. Data Exploration: Cleaning and normalizing all the text data so the model can read it (see Section 4).
4. Modeling: Using models like BoW or TF-IDF to build a system that can predict outcomes.
5. Evaluation: Testing the model. How accurate is our spam filter? Does it make mistakes?
Section 3: The Challenge of Human Language
Human language is difficult for computers because it is full of **ambiguity**. This means a single sentence can have multiple meanings depending on the context. Computers prefer precise, literal instructions.
Example of Ambiguity:
“I saw a man on a hill with a telescope.”
- Meaning 1: I was on a hill, and I used a telescope to see a man.
- Meaning 2: I saw a man who was on a hill and had a telescope.
- Meaning 3: I saw a man who was standing on a hill that had a telescope on it.
A human can often (but not always) guess the correct meaning using context. An AI must be trained to do the same.
Section 4: Text Normalization (Cleaning Text)
Before an AI can analyze text, the text must be cleaned and simplified. This process is called **Text Normalization** or Preprocessing. It’s like preparing ingredients before cooking. Try it yourself below!
Interactive Lab: The Normalization Pipeline
1. Lowercasing: Convert all text to lowercase so that “Apple” and “apple” are treated as the same word.
2. Remove Punctuation: Strip out symbols like commas, full stops, and question marks.
3. Remove Stopwords: Drop very common words (“the”, “is”, “a”) that add little meaning.
4. Stemming (Simple): Cut words down to their root form (e.g., “running” → “run”, “played” → “play”).
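Here is a minimal Python sketch of this pipeline. It assumes a tiny hand-written stopword list and a very simple suffix-chopping stemmer; real projects usually rely on a library such as NLTK for these steps.

```python
import string

# A tiny stopword list for demonstration; real lists are much longer.
STOPWORDS = {"the", "is", "a", "an", "and", "on", "of", "to"}

def normalize(text):
    # 1. Lowercasing
    text = text.lower()
    # 2. Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Split into words and remove stopwords
    words = [w for w in text.split() if w not in STOPWORDS]
    # 4. Very simple stemming: chop a few common suffixes
    stemmed = []
    for w in words:
        for suffix in ("ing", "ed", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]
                break
        stemmed.append(w)
    return stemmed

print(normalize("The cats were playing on the hill!"))
# ['cat', 'were', 'play', 'hill']
```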
Section 5: Bag of Words (BoW)
Bag of Words is a simple method to represent text data for a computer. It ignores grammar and word order, and simply counts how many times each word appears in the document.
BoW Example
Each column shows the count for one word in the vocabulary (the “bag”):

| Document | the | cat | sat | dog | on |
|---|---|---|---|---|---|
| “The cat sat.” | 1 | 1 | 1 | 0 | 0 |
| “The dog sat on the cat.” | 2 | 1 | 1 | 1 | 1 |
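As a rough sketch, the counts in this table can be produced with a few lines of Python using `collections.Counter` (the documents are lowercased and stripped of punctuation first):

```python
from collections import Counter

documents = ["The cat sat.", "The dog sat on the cat."]

def words(doc):
    # Lowercase and drop the full stop so "The" and "the" match
    return doc.lower().replace(".", "").split()

# The vocabulary is every unique word across all documents
vocab = sorted(set(w for doc in documents for w in words(doc)))

for doc in documents:
    counts = Counter(words(doc))
    row = {word: counts.get(word, 0) for word in vocab}
    print(doc, "->", row)

# The cat sat. -> {'cat': 1, 'dog': 0, 'on': 0, 'sat': 1, 'the': 1}
# The dog sat on the cat. -> {'cat': 1, 'dog': 1, 'on': 1, 'sat': 1, 'the': 2}
```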
Problem with BoW: Common words like “the” get a high score, making them seem important. We need a way to find words that are *uniquely* important to a document.
Section 6: TF-IDF (Finding Important Keywords)
TF-IDF (Term Frequency – Inverse Document Frequency) is a more advanced scoring method. It finds words that are frequent in *one* document but rare across *all other* documents. This is how search engines find the best page for your query.
- TF (Term Frequency): How often a word appears in a single document (e.g., in “The cat sat.”, the TF for “cat” is 1).
- IDF (Inverse Document Frequency): Measures how rare a word is across all documents. Common words like “the” have a very low IDF score. Rare words (like “photosynthesis”) have a high IDF score.
Final Score = TF × IDF, where IDF = log(Total number of documents ÷ Number of documents containing the word)
A high TF-IDF score means the word is a very important keyword for that specific document.
DIY Lab: TF-IDF Keyword Playground
For this demo, **each line of text is treated as a separate document**, and the goal is to find the top TF-IDF keywords for each line (document).
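Here is a minimal sketch of how such a tool might work. It uses the textbook formula IDF = log(N ÷ n); the sample sentences are made-up examples, and real libraries such as scikit-learn apply extra smoothing to the formula.

```python
import math

text = """the cat sat on the mat
the dog chased the cat up the hill
photosynthesis happens in the leaves of a plant"""

# Each line is treated as a separate document.
docs = [line.lower().split() for line in text.splitlines()]
N = len(docs)

def tf_idf(word, doc):
    tf = doc.count(word)                   # term frequency in this document
    n = sum(1 for d in docs if word in d)  # number of documents containing the word
    idf = math.log(N / n)                  # rarer words get a higher IDF
    return tf * idf

for doc in docs:
    scores = {word: tf_idf(word, doc) for word in set(doc)}
    top = sorted(scores, key=scores.get, reverse=True)[:3]
    print(" ".join(doc), "->", top)
```

Notice that “the” appears in every line, so its IDF is log(3 ÷ 3) = 0 and it never becomes a keyword, while rare words like “photosynthesis” score highest.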
DIY Lab: Rule-Based Chatbot
This is a simple “Rule-Based” chatbot. It doesn’t use AI, only a few `if/else` rules. Notice how it can only respond to specific keywords. Try asking it about the “weather” or “your name”.
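A minimal sketch of such a bot in Python might look like this (the keywords and replies are only illustrative):

```python
def rule_based_bot(message):
    """A tiny rule-based chatbot: a few if/else rules and naive keyword matching, no learning."""
    message = message.lower()
    if "hello" in message or "hi" in message:
        return "Hi there! How can I help you?"
    elif "weather" in message:
        return "I can't check live weather, but I hope it's sunny where you are!"
    elif "your name" in message:
        return "I'm a simple rule-based bot built for this chapter."
    elif "bye" in message:
        return "Goodbye! Keep practising NLP."
    else:
        # Anything outside the scripted rules gets the same fallback reply.
        return "Sorry, I don't understand that yet."

print(rule_based_bot("Hi!"))
print(rule_based_bot("What is the weather like?"))
print(rule_based_bot("Explain quantum physics"))  # falls through to the fallback
```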
Connecting Concepts
NLP does not exist on its own. It connects to all other parts of AI. See how the topics in this chapter relate to other areas of your study.
Link to: AI Ethics
The data used to train NLP models can be full of human biases. This is why language models can sometimes produce biased or unfair text. This creates an ethical challenge.
Concept: Evaluation
After building an NLP model (like a spam filter), we must test it. We use metrics like Precision (How many of the emails we *called* spam were *actually* spam?) and Recall (Of all the *actual* spam, how many did we *catch*?).
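As a quick illustration, here is how the two metrics are calculated for a spam filter, using made-up counts:

```python
# Illustrative counts for a spam filter (not real results)
true_positives = 40   # emails we called spam that really were spam
false_positives = 10  # normal emails we wrongly called spam
false_negatives = 5   # spam emails we missed

precision = true_positives / (true_positives + false_positives)  # 40 / 50 = 0.80
recall = true_positives / (true_positives + false_negatives)     # 40 / 45 ≈ 0.89

print(f"Precision: {precision:.2f}")  # Precision: 0.80
print(f"Recall: {recall:.2f}")        # Recall: 0.89
```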