bawfng04/dma-automatic-news-classification-system

Data Mining Assignment - Automatic News Classification System

0. Project Visualizations

K-Means Clustering Visualization

Figures: K-Means Clustering Distribution; K-Means Cluster Purity Matrix. Unsupervised clustering analysis showing natural topic separation.

Top 10 Model Comparison

Figures: Model Comparison Bar Chart; Model Performance Heatmap; Radar Chart Top 5 Models. Comparison of validation, test, and cross-validation accuracy across the top 10 models.

Confusion Matrix

Figures: Confusion Matrix; Normalized Confusion Matrix; Per-Class Metrics. Confusion matrix showing prediction accuracy for each category using the best model.

Category vs. Source Distribution

Figures: Raw Data Distribution; Category-Source Heatmap; Cleaned Data Distribution. Heatmap showing the distribution of articles by category and news source.

Data Split Visualization

Figure: Train/Val/Test Split

TF-IDF Features by Category

Figure: Top TF-IDF Features

Content Length & Word Count Analysis

Figures: Content Length Analysis; Word Count Distribution

Cross-Validation Results

Figure: Cross-Validation Results

Key Findings

  • Best Model: Linear SVM with ~97% test accuracy
  • Stable Performance: Low validation-test gap (~0.01)
  • Source Independence: Consistent accuracy across all news sources (93-98%)
  • Category Performance: Highest accuracy in the thể thao (sports) and kinh doanh (business) categories
  • Natural Clustering: K-Means analysis reveals strong natural topic separation, confirming data structure aligns with supervised categories

Key Functions

1. Background & Business Understanding

Modern online news platforms publish hundreds of new articles every day across various domains (e.g., Sports, Business, Entertainment, Technology). Manual classification by editors is time-consuming, demands sustained human effort, and tends to produce inconsistent labels.

Business Objective

Develop an automated system that reads the full text of an article and assigns it to the correct category with high accuracy.

Business Benefits

  • Reduce manual workload for editors
  • Ensure consistent categorization
  • Improve discoverability and reader experience

2. Data Collection & Understanding

Data Source

Articles are crawled from multiple Vietnamese online newspapers (VnExpress, Tuổi Trẻ, Dân Trí, Thanh Niên, Zing News, VietnamNet, 24h, Người Lao Động). Dataset includes:

  • Content: full body text of the article (main input)
  • Category: predefined label (target variable), one of thể thao (sports), kinh doanh (business), giải trí (entertainment), giáo dục (education), khoa học (science), sức khỏe (health)
  • Source: the news website the article came from

Data Characteristics

  • Textual, unstructured Vietnamese data
  • Multiple news sources for robustness
  • Balanced across 6 major categories

3. Data Preprocessing

Preprocessing is critical in text classification. Poor preprocessing leads to degraded model performance.

Data Cleaning

  • Remove residual HTML/CSS/JS tags
  • Remove URLs, emails, phone numbers
  • Remove special characters and unnecessary punctuation
  • Normalize case (lowercase conversion)
  • Remove numbers and extra whitespace
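
The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in data_preprocessing.py: the function name `clean_text` and every regular expression below are assumptions, and the Unicode normalization step is an addition that keeps Vietnamese diacritics in a single composed form.

```python
import html
import re
import unicodedata

# All patterns below are illustrative assumptions, not the project's actual rules.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
PHONE_RE = re.compile(r"(?:\+84|0)\d[\d .-]{7,11}\d")
PUNCT_RE = re.compile(r"[^\w\s]|_")
DIGIT_RE = re.compile(r"\d+")
SPACE_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Apply the cleaning steps in order and return lowercase plain text."""
    text = unicodedata.normalize("NFC", raw)  # compose diacritics consistently
    text = SCRIPT_STYLE_RE.sub(" ", text)     # drop inline JS/CSS with their content
    text = TAG_RE.sub(" ", text)              # drop remaining HTML tags
    text = html.unescape(text)                # turn entities such as &amp; into characters
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = PHONE_RE.sub(" ", text)
    text = text.lower()
    text = PUNCT_RE.sub(" ", text)            # special characters and punctuation
    text = DIGIT_RE.sub(" ", text)            # numbers
    return SPACE_RE.sub(" ", text).strip()    # collapse extra whitespace


print(clean_text("<p>Đội tuyển <b>Việt Nam</b> thắng 2-0!</p> Xem: https://example.com"))
```

URLs and emails are removed before punctuation so that their dots and slashes do not leave stray fragments behind.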

Vietnamese Tokenization

  • Use underthesea library for Vietnamese word segmentation
  • Handle compound words correctly (e.g., "bóng_đá" for football, "chứng_khoán" for securities)
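
A minimal sketch of the segmentation step, assuming underthesea's `word_tokenize` is called with `format="text"` so that multi-syllable words come back joined by underscores. The wrapper `segment_words` and its `tokenizer` parameter are hypothetical names added here so that another segmenter can be swapped in.

```python
from typing import Callable, Optional


def segment_words(text: str, tokenizer: Optional[Callable[[str], str]] = None) -> str:
    """Return text with multi-syllable Vietnamese words joined by underscores."""
    if tokenizer is None:
        # Imported lazily so this module still loads where underthesea is not installed.
        from underthesea import word_tokenize
        return word_tokenize(text, format="text")
    # Hypothetical hook: any callable mapping a string to a segmented string.
    return tokenizer(text)
```

With underthesea installed, `segment_words("Đội tuyển bóng đá Việt Nam")` returns a single string in which compound words such as "bóng đá" are joined into one token.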

Stopword Removal

  • Remove high-frequency but low-semantic words (e.g., "là" [is], "và" [and], "của" [of], "được" [passive marker])
  • Custom Vietnamese stopwords list
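
Stopword filtering on segmented text might look like the sketch below. The helper names are hypothetical, and the file layout (one stopword per line, with spaces inside multi-syllable entries converted to underscores) is an assumption about vietnamese_stopwords.txt rather than something the project documents.

```python
from pathlib import Path
from typing import Iterable, Set


def load_stopwords(path: str) -> Set[str]:
    """Read one stopword per line (assumed layout) and match the segmenter's '_' joining."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower().replace(" ", "_") for line in lines if line.strip()}


def remove_stopwords(segmented: str, stopwords: Iterable[str]) -> str:
    """Drop every whitespace-separated token that appears in the stopword set."""
    stop = set(stopwords)
    return " ".join(tok for tok in segmented.split() if tok not in stop)


print(remove_stopwords("bóng_đá là môn thể_thao của mọi người", {"là", "của"}))
```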

Vectorization (Transformation)

Convert text into numeric vectors using TF‑IDF:

  • TF — term frequency within a single document
  • IDF — inverse frequency across corpus
  • Configuration: max_features=5000, ngram_range=(1,2), min_df=2, max_df=0.8
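
The configuration above maps directly onto scikit-learn's `TfidfVectorizer`. The sketch below assumes that class is what the project uses; `build_vectorizer` and the toy documents are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def build_vectorizer() -> TfidfVectorizer:
    """TF-IDF vectorizer with the settings listed in this section."""
    return TfidfVectorizer(
        max_features=5000,   # keep at most the 5000 most frequent terms
        ngram_range=(1, 2),  # unigrams and bigrams
        min_df=2,            # ignore terms found in fewer than 2 documents
        max_df=0.8,          # ignore terms found in more than 80% of documents
    )


# Toy, already-segmented documents (illustrative only, not project data).
docs = [
    "bóng_đá cầu_thủ bàn_thắng",
    "cầu_thủ huấn_luyện_viên bóng_đá",
    "doanh_nghiệp đầu_tư kinh_tế",
    "kinh_tế doanh_nghiệp chứng_khoán",
    "phim ca_sĩ nghệ_sĩ",
]
vectorizer = build_vectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document
print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

The default token pattern treats the underscore as a word character, so segmented compounds stay intact as single features.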

Data Balancing

  • Optionally undersample or oversample to balance categories
  • Ensure fair representation across all categories
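
As an illustration of the undersampling option, the sketch below trims every category down to the size of the smallest one. The function name `undersample` and the fixed seed are assumptions. Oversampling would instead draw with replacement up to the size of the largest category.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def undersample(
    texts: Sequence[str], labels: Sequence[str], seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Randomly keep the same number of documents from every category."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    target = min(len(indices) for indices in by_label.values())
    rng = random.Random(seed)  # local generator keeps the result reproducible

    keep = []
    for indices in by_label.values():
        keep.extend(rng.sample(indices, target))
    keep.sort()  # preserve the original document order

    return [texts[i] for i in keep], [labels[i] for i in keep]
```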

4. Modeling

Supervised Learning: Multiple Algorithms Comparison

  • Linear SVM (best performer, ~97% test accuracy)
  • Logistic Regression
  • Naive Bayes (baseline)
  • Random Forest
  • XGBoost
  • Gradient Boosting
  • SGD Classifier
  • Decision Tree
  • K-Nearest Neighbors
  • SVM with RBF Kernel
  • AdaBoost
  • Ensemble (Voting Classifier)
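
A comparison loop over several of these algorithms could be sketched as follows. This is not the code in train_model.py: `compare_models` is a hypothetical helper, only three of the twelve candidates are shown, and the vectorizer uses scikit-learn defaults rather than the project's TF-IDF settings so that the example also runs on very small inputs.

```python
from typing import Dict, Sequence

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def compare_models(
    train_texts: Sequence[str],
    train_labels: Sequence[str],
    val_texts: Sequence[str],
    val_labels: Sequence[str],
) -> Dict[str, float]:
    """Fit each candidate on the training split and return its validation accuracy."""
    candidates = {
        "Linear SVM": LinearSVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Naive Bayes": MultinomialNB(),
    }
    scores = {}
    for name, classifier in candidates.items():
        # A fresh vectorizer per pipeline keeps the candidates independent.
        pipeline = make_pipeline(TfidfVectorizer(), classifier)
        pipeline.fit(train_texts, train_labels)
        scores[name] = pipeline.score(val_texts, val_labels)
    return scores
```

Sorting the returned dictionary by value gives a ranking of the candidates on the validation split.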

Unsupervised Learning: K-Means Clustering

  • Applied K-Means with k=6 (number of categories)
  • Discovered natural topic clusters in data
  • Extracted top 15 keywords per cluster
  • Analyzed cluster purity and alignment with true categories
  • Visualized using PCA + t-SNE dimensionality reduction
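
Cluster purity can be computed from the cluster assignments and the true categories alone. The sketch below uses the common definition (the share of documents that belong to their cluster's majority category), which is an assumption about how the project measures it; `cluster_purity` is a hypothetical name.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def cluster_purity(
    cluster_ids: Sequence[Hashable], true_labels: Sequence[str]
) -> Tuple[Dict[Hashable, Tuple[str, float]], float]:
    """Return ({cluster: (majority label, purity)}, overall purity)."""
    if not true_labels or len(cluster_ids) != len(true_labels):
        raise ValueError("inputs must be non-empty and of equal length")

    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)

    per_cluster = {}
    majority_total = 0
    for cluster, labels in members.items():
        label, count = Counter(labels).most_common(1)[0]
        per_cluster[cluster] = (label, count / len(labels))
        majority_total += count

    return per_cluster, majority_total / len(true_labels)
```

The first return value corresponds to one row per cluster in a purity matrix; the second is the size-weighted average over all clusters.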

5. Model Evaluation

Data Split

  • 70% Training
  • 10% Validation
  • 20% Testing
  • Stratified split to maintain category distribution
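
One way to obtain a stratified 70/10/20 split is to call scikit-learn's `train_test_split` twice. This is a sketch under that assumption; `split_70_10_20` and the seed are hypothetical.

```python
from sklearn.model_selection import train_test_split


def split_70_10_20(texts, labels, seed=42):
    """Return (train, val, test) pairs of (texts, labels), stratified by label."""
    # Step 1: hold out 20% of all documents as the test set.
    rest_x, test_x, rest_y, test_y = train_test_split(
        texts, labels, test_size=0.20, stratify=labels, random_state=seed
    )
    # Step 2: 10% of the whole equals 0.10 / 0.80 = 12.5% of what remains.
    train_x, val_x, train_y, val_y = train_test_split(
        rest_x, rest_y, test_size=0.125, stratify=rest_y, random_state=seed
    )
    return (train_x, train_y), (val_x, val_y), (test_x, test_y)
```

Passing `stratify` in both calls keeps the category proportions the same in all three subsets.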

Metrics

  • Overall Accuracy
  • Per‑class Precision, Recall, F1‑Score (multi‑class evaluation)
  • Cross-validation scores (5-fold CV)
  • Confusion Matrix to inspect inter‑class misclassification patterns
  • Source-specific accuracy analysis
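
To make the per-class metrics concrete, the sketch below derives accuracy, precision, recall, and F1 from confusion counts using only the standard library. The function name is hypothetical. In practice scikit-learn's `classification_report` and `confusion_matrix` provide the same quantities.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_class_report(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Tuple[float, Dict[str, Dict[str, float]]]:
    """Return (overall accuracy, {label: precision/recall/f1})."""
    labels = sorted(set(y_true) | set(y_pred))
    confusion = Counter(zip(y_true, y_pred))  # (true, predicted) -> count

    report = {}
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(confusion[(other, label)] for other in labels if other != label)
        fn = sum(confusion[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(confusion[(label, label)] for label in labels) / len(y_true)
    return accuracy, report
```

Off-diagonal entries of `confusion` show which categories are mistaken for each other, which is what the confusion matrix plot visualizes.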

Evaluation Results

  • Linear SVM achieved 97.07% test accuracy
  • Very low overfitting (validation-test gap: 0.01)
  • Consistent performance across all news sources (93-98%)
  • High precision and recall across all categories
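
Per-source accuracy, as reported above, can be obtained by grouping test predictions by the Source column. A minimal sketch with a hypothetical helper name:

```python
from collections import defaultdict
from typing import Dict, Sequence


def accuracy_by_source(
    y_true: Sequence[str], y_pred: Sequence[str], sources: Sequence[str]
) -> Dict[str, float]:
    """Return the fraction of correct predictions for each news source."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, prediction, source in zip(y_true, y_pred, sources):
        totals[source] += 1
        hits[source] += int(truth == prediction)
    return {source: hits[source] / totals[source] for source in totals}
```

A large spread between sources would suggest the model has learned site-specific wording rather than topic vocabulary.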

6. Unsupervised Analysis

K-Means Clustering Insights

  • Natural clusters strongly align with supervised categories
  • High cluster purity (>70% on average)
  • Clear topic separation visible in t-SNE visualization
  • Cluster keywords accurately reflect category themes:
    • Sports cluster: "bàn_thắng" (goal), "cầu_thủ" (player), "huấn_luyện_viên" (coach)
    • Business cluster: "doanh_nghiệp" (enterprise), "kinh_tế" (economy), "đầu_tư" (investment)
    • Entertainment cluster: "nghệ_sĩ" (artist), "phim" (film), "ca_sĩ" (singer)

Key Takeaway: The unsupervised analysis confirms that news articles have naturally separable topic structures, validating the supervised classification approach.

7. Knowledge Presentation

  • Performance comparison table (12 algorithms)
  • Confusion matrix visualization
  • Clustering visualization (unsupervised vs supervised)
  • Source-wise accuracy comparison
  • Interactive demo: text input → predicted category result
  • Comprehensive evaluation reports

8. Project Structure

dma-automatic-news-classification-system/
├── crawler.py                 # Web scraping from 8 news sources
├── data_preprocessing.py      # Text cleaning and tokenization
├── train_model.py             # Model training and evaluation
├── predict.py                 # Prediction interface
├── crawled_data.csv           # Raw scraped data
├── cleaned_data.csv           # Preprocessed data
├── vietnamese_stopwords.txt   # Vietnamese stopwords
└── train_model_assets/        # Generated artifacts
    ├── model.pkl              # Trained models
    ├── tfidf_vectorizer.pkl   # TF-IDF vectorizer
    ├── label_encoder.pkl      # Label encoder
    ├── kmeans_model.pkl       # K-Means model
    ├── cluster_keywords.txt   # Cluster analysis
    └── XX.png                 # All generated visualizations (e.g., 01_...png, 11_...png)

9. Installation & Usage

Requirements

pip install pandas numpy scikit-learn xgboost
pip install underthesea beautifulsoup4 requests
pip install matplotlib seaborn joblib

Step 1: Crawl Data

python crawler.py

Step 2: Preprocess Data

python data_preprocessing.py

Step 3: Train Models

python train_model.py

Step 4: Make Predictions

python predict.py
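
For use from other Python code, the saved artifacts could be loaded and applied roughly as sketched below. This rests on several assumptions that are not confirmed by the project: that each .pkl file was written with joblib, that model.pkl holds a single fitted classifier, and that input text has already gone through the same cleaning, segmentation, and stopword removal as the training data. Both function names are hypothetical.

```python
from pathlib import Path


def load_artifacts(asset_dir: str = "train_model_assets"):
    """Load (vectorizer, model, label_encoder); assumes joblib-written files."""
    import joblib  # third-party; imported lazily so the helper below has no dependencies

    root = Path(asset_dir)
    return (
        joblib.load(root / "tfidf_vectorizer.pkl"),
        joblib.load(root / "model.pkl"),
        joblib.load(root / "label_encoder.pkl"),
    )


def predict_category(preprocessed_text: str, vectorizer, model, label_encoder) -> str:
    """Vectorize one preprocessed article and map the predicted id back to its label."""
    features = vectorizer.transform([preprocessed_text])
    class_ids = model.predict(features)
    return label_encoder.inverse_transform(class_ids)[0]
```

Typical use would be `predict_category(text, *load_artifacts())`, where `text` is an article prepared by the preprocessing pipeline.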

Why This Project Is Valuable

  • High practical relevance for real‑world editorial workflows
  • Covers full KDD pipeline end‑to‑end (data collection → preprocessing → modeling → evaluation)
  • Employs modern ML techniques: ensemble methods, hyperparameter tuning, cross-validation
  • Comprehensive analysis: both supervised and unsupervised approaches
  • Production-ready: multi-source data, robust preprocessing, model persistence
  • Readily accessible Vietnamese news data
  • Easy to showcase via interactive prediction demo
  • Excellent performance: 97% accuracy with low overfitting
  • Interpretable results: confusion matrices, clustering visualizations, source analysis

About

A Machine Learning Approach for Automated Vietnamese News Categorization using Support Vector Machines (SVM) and Ensemble Methods
