Data Mining Assignment - Automatic News Classification System

0. Project Figures

K-Means Clustering Visualization

Unsupervised clustering analysis showing natural topic separation

Top 10 Model Comparison

Comparison of validation, test, and cross-validation accuracy across top 10 models

Confusion Matrix

Confusion matrix showing prediction accuracy for each category using the best model

Category vs. Source Distribution

Heatmap showing the distribution of articles by category and news source

Data Split Visualization

TF-IDF Features by Category

Content Length & Word Count Analysis

Cross-Validation Results

Key Findings

Best Model: Linear SVM with ~97% test accuracy
Stable Performance: Low validation-test gap (~0.01)
Source Independence: Consistent accuracy across all news sources (93-98%)
Category Performance: Highest accuracy in the thể thao (sports) and kinh doanh (business) categories
Natural Clustering: K-Means analysis reveals strong natural topic separation, confirming data structure aligns with supervised categories

Key Functions

evaluate_on_validation() - Validation set evaluation
evaluate_on_test() - Test set evaluation
perform_cross_validation() - K-fold CV
compare_results() - Comprehensive comparison
plot_confusion_matrix() - Visualization
analyze_source_performance() - Source-wise analysis
perform_kmeans_clustering() - Unsupervised clustering analysis
visualize_clustering_results() - Cluster visualization

1. Background & Business Understanding

Modern online news platforms publish hundreds of new articles every day across various domains (e.g., Sports, Business, Entertainment, Technology). Manual classification by editors is time-consuming, demands sustained human effort, and produces inconsistent labels.

Business Objective

Develop an automated system that reads the full text of an article and assigns it to the correct category with high accuracy.

Business Benefits

Reduce manual workload for editors
Ensure consistent categorization
Improve discoverability and reader experience

2. Data Collection & Understanding

Data Source

Articles are crawled from eight Vietnamese online newspapers (VnExpress, Tuổi Trẻ, Dân Trí, Thanh Niên, Zing News, VietnamNet, 24h, Người Lao Động). The dataset includes the following fields:

Content: full body text of the article (main input)
Category: predefined label (target variable), one of thể thao (sports), kinh doanh (business), giải trí (entertainment), giáo dục (education), khoa học (science), sức khỏe (health)
Source: the news website the article came from

Data Characteristics

Textual, unstructured Vietnamese data
Multiple news sources for robustness
Balanced across 6 major categories

3. Data Preprocessing

Preprocessing is critical in text classification. Poor preprocessing leads to degraded model performance.

Data Cleaning

Remove residual HTML/CSS/JS tags
Remove URLs, emails, phone numbers
Remove special characters and unnecessary punctuation
Normalize case (lowercase conversion)
Remove numbers and extra whitespace
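
The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in data_preprocessing.py: the function name clean_text, the regular expressions, and the NFC normalization step are all assumptions made for this example.

```python
import re
import unicodedata

# Hypothetical patterns; the project's own rules may differ.
TAG_RE = re.compile(r"<[^>]+>")                   # residual HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")     # URLs
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")            # email addresses
PHONE_RE = re.compile(r"\+?\d[\d\s.\-]{7,}\d")    # long digit runs (phone numbers)
NON_WORD_RE = re.compile(r"[^\w\s]")              # punctuation and special characters
DIGIT_RE = re.compile(r"\d+")                     # remaining numbers
SPACE_RE = re.compile(r"\s+")                     # repeated whitespace


def clean_text(raw: str) -> str:
    """Apply the cleaning steps in order and return lowercase text."""
    # Assumption: compose accents first so Vietnamese letters count as word characters.
    text = unicodedata.normalize("NFC", raw)
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = PHONE_RE.sub(" ", text)
    text = text.lower()
    text = NON_WORD_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_text("<b>Gia VANG</b> tang 5% - xem https://example.com, lien he a@b.vn"))
```

Note that the tag pattern only strips the tags themselves; the contents of script or style elements would need separate handling.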

Vietnamese Tokenization

Use underthesea library for Vietnamese word segmentation
Handle compound words correctly by joining their syllables with underscores (e.g., "bóng_đá" for football, "chứng_khoán" for securities)

Stopword Removal

Remove high-frequency words that carry little meaning (e.g., "là" (is), "và" (and), "của" (of), "được" (passive marker))
Custom Vietnamese stopwords list
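
As a rough sketch of how stopword removal might follow word segmentation: the helpers below assume the text has already been segmented into underscore-joined tokens (for example with underthesea's word_tokenize(text, format="text")) and that vietnamese_stopwords.txt holds one entry per line. The function names and the file layout are assumptions for illustration.

```python
from typing import Iterable, Set


def normalize_stopwords(words: Iterable[str]) -> Set[str]:
    """Lowercase entries and join multi-syllable stopwords with underscores
    so they match the segmented token format (an assumption of this sketch)."""
    return {w.strip().lower().replace(" ", "_") for w in words if w.strip()}


def load_stopwords(path: str = "vietnamese_stopwords.txt") -> Set[str]:
    """Read one stopword per line from a UTF-8 text file (assumed layout)."""
    with open(path, encoding="utf-8") as handle:
        return normalize_stopwords(handle)


def remove_stopwords(segmented_text: str, stopwords: Set[str]) -> str:
    """Drop stopword tokens from whitespace-separated, segmented text."""
    return " ".join(tok for tok in segmented_text.split() if tok not in stopwords)


if __name__ == "__main__":
    stop = normalize_stopwords(["là", "và", "của", "được"])
    # In the real pipeline the input would come from the segmentation step.
    print(remove_stopwords("bóng_đá là môn thể_thao của mọi người", stop))
```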

Vectorization (Transformation)

Convert text into numeric vectors using TF‑IDF:

TF: term frequency, how often a term occurs within a single document
IDF: inverse document frequency, which down-weights terms that occur in many documents across the corpus
Configuration: max_features=5000, ngram_range=(1,2), min_df=2, max_df=0.8
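
The configuration above maps directly onto scikit-learn's TfidfVectorizer. The snippet below is a small sketch on a made-up corpus of already segmented documents; only the four parameter values come from this README, everything else is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: segmented, stopword-free documents.
docs = [
    "bóng_đá cầu_thủ bàn_thắng",
    "bóng_đá cầu_thủ huấn_luyện_viên",
    "doanh_nghiệp đầu_tư kinh_tế",
    "doanh_nghiệp đầu_tư chứng_khoán",
    "bóng_đá doanh_nghiệp",
]

vectorizer = TfidfVectorizer(
    max_features=5000,    # keep at most the 5000 most frequent terms
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=2,             # drop terms seen in fewer than 2 documents
    max_df=0.8,           # drop terms seen in more than 80% of documents
)
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x terms

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

In this toy corpus, terms that occur in only one document (such as kinh_tế) are pruned by min_df=2, while the bigram "bóng_đá cầu_thủ" survives because it appears twice.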

Data Balancing

Option to undersample or oversample to balance categories
Ensure fair representation across all categories
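
One simple way to balance categories is random undersampling down to the size of the smallest class. The sketch below shows that variant only; the function name, the row format (a list of dicts with a "category" key), and the fixed seed are assumptions rather than the project's actual implementation.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def undersample(rows: List[Dict[str, str]], label_key: str = "category",
                seed: int = 42) -> List[Dict[str, str]]:
    """Randomly keep the same number of rows per label (the minority count)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    target = min(len(group) for group in groups.values())
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], target))
    rng.shuffle(balanced)  # avoid leaving the rows grouped by label
    return balanced


if __name__ == "__main__":
    rows = ([{"category": "thể thao", "content": f"bài {i}"} for i in range(5)]
            + [{"category": "kinh doanh", "content": f"bài {i}"} for i in range(2)])
    print(Counter(r["category"] for r in undersample(rows)))
```

Oversampling would instead draw with replacement from the smaller classes until they match the largest one.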

4. Modeling

Supervised Learning: Multiple Algorithms Comparison

Linear SVM (Best performer - 97% accuracy)
Logistic Regression
Naive Bayes (baseline)
Random Forest
XGBoost
Gradient Boosting
SGD Classifier
Decision Tree
K-Nearest Neighbors
SVM with RBF Kernel
AdaBoost
Ensemble (Voting Classifier)
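
For orientation, the best-performing setup (TF-IDF features fed into a linear SVM) can be sketched as a scikit-learn Pipeline. The toy documents, the two-label subset, and C=1.0 are assumptions for illustration; the hyperparameters used in train_model.py are not stated here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: segmented documents with two of the six labels.
texts = [
    "bóng_đá cầu_thủ bàn_thắng",
    "cầu_thủ bàn_thắng huấn_luyện_viên",
    "bóng_đá huấn_luyện_viên cầu_thủ",
    "doanh_nghiệp kinh_tế đầu_tư",
    "kinh_tế đầu_tư chứng_khoán",
    "doanh_nghiệp chứng_khoán kinh_tế",
]
labels = ["thể thao"] * 3 + ["kinh doanh"] * 3

classifier = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2),
                              min_df=2, max_df=0.8)),
    ("svm", LinearSVC(C=1.0)),  # C=1.0 is the library default, assumed here
])
classifier.fit(texts, labels)

predicted = classifier.predict(["cầu_thủ bàn_thắng", "đầu_tư chứng_khoán"])
print(list(predicted))
```

Wrapping the vectorizer and the classifier in one pipeline keeps the TF-IDF vocabulary fitted on training data only, which matters when the same object is later used for cross-validation.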

Unsupervised Learning: K-Means Clustering

Applied K-Means with k=6 (number of categories)
Discovered natural topic clusters in data
Extracted top 15 keywords per cluster
Analyzed cluster purity and alignment with true categories
Visualized using PCA + t-SNE dimensionality reduction

5. Model Evaluation

Data Split

70% Training
10% Validation
20% Testing
Stratified split to maintain category distribution
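
A 70/10/20 stratified split can be obtained with two calls to scikit-learn's train_test_split: first hold out 20% for testing, then take 12.5% of the remaining 80% (which is 10% of the total) for validation. The function name and seed below are assumptions, and the demo data is synthetic.

```python
from collections import Counter
from sklearn.model_selection import train_test_split


def split_70_10_20(texts, labels, seed=42):
    """Return (train, validation, test) pairs with class proportions preserved."""
    x_rest, x_test, y_rest, y_test = train_test_split(
        texts, labels, test_size=0.20, stratify=labels, random_state=seed)
    # 0.125 of the remaining 80% equals 10% of the full dataset.
    x_train, x_val, y_train, y_val = train_test_split(
        x_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)


if __name__ == "__main__":
    categories = ["thể thao", "kinh doanh", "giải trí",
                  "giáo dục", "khoa học", "sức khỏe"]
    labels = [c for c in categories for _ in range(20)]   # 120 synthetic rows
    texts = [f"doc {i}" for i in range(len(labels))]
    train, val, test = split_70_10_20(texts, labels)
    print(len(train[0]), len(val[0]), len(test[0]))
    print(Counter(test[1]))
```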

Metrics

Overall Accuracy
Per‑class Precision, Recall, F1‑Score (multi‑class evaluation)
Cross-validation scores (5-fold CV)
Confusion Matrix to inspect inter‑class misclassification patterns
Source-specific accuracy analysis
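
To make the metrics concrete, the sketch below computes overall accuracy and per-class precision, recall, and F1 from the confusion counts using only the standard library. It is an illustration of the definitions, not the project's evaluation code; in practice scikit-learn's reporting utilities would normally be used. The function name and the toy predictions are assumptions.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return (accuracy, {label: {"precision", "recall", "f1"}})."""
    pairs = Counter(zip(y_true, y_pred))       # confusion-matrix cells
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(pairs[(label, label)] for label in labels) / len(y_true)
    return accuracy, report


if __name__ == "__main__":
    y_true = ["thể thao", "thể thao", "thể thao", "kinh doanh", "kinh doanh", "giải trí"]
    y_pred = ["thể thao", "thể thao", "kinh doanh", "kinh doanh", "kinh doanh", "giải trí"]
    accuracy, report = per_class_report(y_true, y_pred)
    print(round(accuracy, 3), report)
```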

Evaluation Results

Linear SVM achieved 97.07% test accuracy
Small validation-test gap (about 0.01), which suggests little overfitting to the validation set
Consistent performance across all news sources (93-98%)
High precision and recall across all categories

6. Unsupervised Analysis

K-Means Clustering Insights

Natural clusters strongly align with supervised categories
High cluster purity (>70% on average)
Clear topic separation visible in t-SNE visualization
Cluster keywords accurately reflect category themes:
- Sports cluster: "bàn_thắng" (goal), "cầu_thủ" (player), "huấn_luyện_viên" (coach)
- Business cluster: "doanh_nghiệp" (enterprise), "kinh_tế" (economy), "đầu_tư" (investment)
- Entertainment cluster: "nghệ_sĩ" (artist), "phim" (film), "ca_sĩ" (singer)
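
Cluster purity, as commonly defined, assigns each cluster its majority true category and reports the fraction of documents that fall in their cluster's majority category. The sketch below computes that overall figure; whether the project averages per-cluster purities instead is not stated, so treat this as one plausible reading. The function name and the toy assignments are assumptions.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Fraction of items whose label matches their cluster's majority label."""
    by_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, true_labels):
        by_cluster[cluster_id][label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]
    labels = ["thể thao", "thể thao", "kinh doanh",
              "kinh doanh", "kinh doanh", "thể thao"]
    print(cluster_purity(clusters, labels))
```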

Key Takeaway: The unsupervised analysis confirms that news articles have naturally separable topic structures, validating the supervised classification approach.

7. Knowledge Presentation

Performance comparison table (12 algorithms)
Confusion matrix visualization
Clustering visualization (unsupervised vs supervised)
Source-wise accuracy comparison
Interactive demo: text input → predicted category result
Comprehensive evaluation reports

8. Project Structure

dma-automatic-news-classification-system/
├── crawler.py                 # Web scraping from 8 news sources
├── data_preprocessing.py      # Text cleaning and tokenization
├── train_model.py             # Model training and evaluation
├── predict.py                 # Prediction interface
├── crawled_data.csv           # Raw scraped data
├── cleaned_data.csv           # Preprocessed data
├── vietnamese_stopwords.txt   # Vietnamese stopwords
└── train_model_assets/        # Generated artifacts
    ├── model.pkl              # Trained models
    ├── tfidf_vectorizer.pkl   # TF-IDF vectorizer
    ├── label_encoder.pkl      # Label encoder
    ├── kmeans_model.pkl       # K-Means model
    ├── cluster_keywords.txt   # Cluster analysis
    └── XX.png                 # All generated visualizations (e.g., 01_...png, 11_...png)

9. Installation & Usage

Requirements

pip install pandas numpy scikit-learn xgboost
pip install underthesea beautifulsoup4 requests
pip install matplotlib seaborn joblib

Step 1: Crawl Data

python crawler.py

Step 2: Preprocess Data

python data_preprocessing.py

Step 3: Train Models

python train_model.py

Step 4: Make Predictions

python predict.py

Why This Project Is Valuable

High practical relevance for real‑world editorial workflows
Covers the full KDD (knowledge discovery in databases) pipeline end-to-end (data collection → preprocessing → modeling → evaluation)
Employs modern ML techniques: ensemble methods, hyperparameter tuning, cross-validation
Comprehensive analysis: both supervised and unsupervised approaches
Production-ready: multi-source data, robust preprocessing, model persistence
Readily accessible Vietnamese news data
Easy to showcase via interactive prediction demo
Excellent performance: 97% accuracy with low overfitting
Interpretable results: confusion matrices, clustering visualizations, source analysis
