Unsupervised clustering analysis showing natural topic separation
Comparison of validation, test, and cross-validation accuracy across top 10 models
Confusion matrix showing prediction accuracy for each category using the best model
Heatmap showing the distribution of articles by category and news source
Key Findings
- Best Model: Linear SVM with ~97% test accuracy
- Stable Performance: Low validation-test gap (~0.01)
- Source Independence: Consistent accuracy across all news sources (93-98%)
- Category Performance: Highest accuracy in the thể thao (sports) and kinh doanh (business) categories
- Natural Clustering: K-Means analysis reveals strong natural topic separation, confirming data structure aligns with supervised categories
Key Functions
- evaluate_on_validation(): Validation set evaluation
- evaluate_on_test(): Test set evaluation
- perform_cross_validation(): K-fold CV
- compare_results(): Comprehensive comparison
- plot_confusion_matrix(): Visualization
- analyze_source_performance(): Source-wise analysis
- perform_kmeans_clustering(): Unsupervised clustering analysis
- visualize_clustering_results(): Cluster visualization
Modern online news platforms publish hundreds of new articles every day across various domains (e.g., Sports, Business, Entertainment, Technology). Manual classification by editors is time-consuming, demands sustained human effort, and tends to produce inconsistent labels.
Business Objective
Develop an automated system that reads the full text of an article and assigns it to the correct category with high accuracy.
Business Benefits
- Reduce manual workload for editors
- Ensure consistent categorization
- Improve discoverability and reader experience
Data Source
Articles are crawled from multiple Vietnamese online newspapers (VnExpress, Tuổi Trẻ, Dân Trí, Thanh Niên, Zing News, VietnamNet, 24h, Người Lao Động). Dataset includes:
- Content: full body text of the article (main input)
- Category: predefined label (target variable), one of thể thao (sports), kinh doanh (business), giải trí (entertainment), giáo dục (education), khoa học (science), sức khỏe (health)
- Source: news website of origin
Data Characteristics
- Textual, unstructured Vietnamese data
- Multiple news sources for robustness
- Balanced across 6 major categories
Preprocessing is critical in text classification. Poor preprocessing leads to degraded model performance.
Data Cleaning
- Remove residual HTML/CSS/JS tags
- Remove URLs, emails, phone numbers
- Remove special characters and unnecessary punctuation
- Normalize case (lowercase conversion)
- Remove numbers and extra whitespace
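The cleaning steps above can be sketched with a few regular expressions. This is a minimal illustration, not the project's actual data_preprocessing.py: the function name clean_text and the exact patterns are assumptions, and the NFC normalization step is an addition that helps when Vietnamese diacritics arrive in decomposed form.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Apply the listed cleaning steps to one article body (hypothetical helper)."""
    # Compose diacritics so each accented letter is a single character;
    # otherwise the punctuation filter below would strip combining marks.
    text = unicodedata.normalize("NFC", raw)
    # Drop whole <script>/<style> blocks, then any remaining HTML tags.
    text = re.sub(r"<(script|style)\b.*?</\1>", " ", text, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    # Drop URLs and email addresses.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+\.\S+", " ", text)
    text = text.lower()
    # Drop digits (this also removes phone numbers), then punctuation.
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("<p>Đội tuyển thắng 3-0!</p> Xem thêm: https://example.com"))
```

The order matters: URLs and emails are removed before punctuation is stripped, since the patterns rely on "://" and "@" still being present.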
Vietnamese Tokenization
- Use the underthesea library for Vietnamese word segmentation
- Handle compound words correctly (e.g., "bóng_đá", "chứng_khoán")
Stopword Removal
- Remove high‑frequency but low‑semantic words (e.g., "là", "và", "của", "được")
- Custom Vietnamese stopwords list
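As a rough sketch of the stopword step, the filter below works on text that has already been word-segmented, with compound words joined by underscores. The inline STOPWORDS set is a tiny stand-in for vietnamese_stopwords.txt (whose format, one word per line, is assumed here), and remove_stopwords is a hypothetical name.

```python
from typing import Iterable

# Stand-in for the contents of vietnamese_stopwords.txt (assumed: one word per line).
STOPWORDS = {"là", "và", "của", "được"}


def remove_stopwords(segmented: str, stopwords: Iterable[str] = STOPWORDS) -> str:
    """Drop stopword tokens from whitespace-separated, already segmented text."""
    stop = set(stopwords)
    return " ".join(tok for tok in segmented.split() if tok not in stop)


# Assumed shape of segmenter output: compounds such as "bóng_đá" are single tokens.
segmented = "bóng_đá là môn thể_thao được yêu_thích của người_hâm_mộ"
print(remove_stopwords(segmented))
```

Filtering after segmentation means a stopword is only removed when it stands alone as a token, not when it is part of a compound such as "được_phép".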
Vectorization (Transformation)
Convert text into numeric vectors using TF‑IDF:
- TF: term frequency within a single document
- IDF: inverse document frequency across the corpus
- Configuration: max_features=5000, ngram_range=(1,2), min_df=2, max_df=0.8
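To make the TF and IDF definitions concrete, here is a dependency-free sketch of the weighting. It assumes the smoothed IDF formula and L2 normalization that scikit-learn's TfidfVectorizer uses by default, and it mirrors ngram_range=(1,2), min_df and max_df from the configuration above; max_features is left out for brevity. The function names are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens):
    """Unigrams plus bigrams, mirroring ngram_range=(1, 2)."""
    return list(tokens) + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tfidf(docs, min_df=2, max_df=0.8):
    """Return (vocabulary, one {term: weight} dict per document)."""
    term_lists = [ngrams(doc.split()) for doc in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for terms in term_lists for t in set(terms))
    # Keep terms seen in at least min_df documents and at most max_df of the corpus.
    vocab = sorted(t for t, c in df.items() if c >= min_df and c / n <= max_df)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for terms in term_lists:
        tf = Counter(t for t in terms if t in idf)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize so document length does not dominate.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vocab, vectors


docs = [
    "cầu_thủ ghi bàn_thắng",
    "cầu_thủ ký hợp_đồng",
    "doanh_nghiệp ký hợp_đồng đầu_tư",
    "doanh_nghiệp tăng đầu_tư",
]
vocab, vectors = tfidf(docs)
print(vocab)
```

With min_df=2, terms that occur in only one of the four toy documents (for example "ghi") are dropped from the vocabulary, which is the same pruning the real configuration applies at corpus scale.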
Data Balancing
- Option to undersample or oversample to balance categories
- Ensure fair representation across all categories
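One way the undersampling option could look is shown below: every category is randomly reduced to the size of the smallest one. This is a sketch under assumptions; the row format (dicts with a "category" key) and the function name undersample are not taken from the project.

```python
import random
from collections import defaultdict


def undersample(rows, label_key="category", seed=42):
    """Randomly drop rows so every category matches the smallest category's size."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


rows = (
    [{"category": "thể thao", "content": f"bài {i}"} for i in range(5)]
    + [{"category": "kinh doanh", "content": f"bài {i}"} for i in range(3)]
    + [{"category": "giải trí", "content": f"bài {i}"} for i in range(2)]
)
print(len(undersample(rows)))
```

Undersampling discards data, so it suits cases where the smallest category is still large; oversampling (duplicating minority rows) is the alternative when it is not.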
Supervised Learning: Multiple Algorithms Comparison
- Linear SVM (Best performer - 97% accuracy)
- Logistic Regression
- Naive Bayes (baseline)
- Random Forest
- XGBoost
- Gradient Boosting
- SGD Classifier
- Decision Tree
- K-Nearest Neighbors
- SVM with RBF Kernel
- AdaBoost
- Ensemble (Voting Classifier)
Unsupervised Learning: K-Means Clustering
- Applied K-Means with k=6 (number of categories)
- Discovered natural topic clusters in data
- Extracted top 15 keywords per cluster
- Analyzed cluster purity and alignment with true categories
- Visualized using PCA + t-SNE dimensionality reduction
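Cluster purity, as used in the analysis above, can be computed by checking how much of each cluster belongs to its most common true category. The sketch below assumes cluster assignments and true labels are available as parallel lists; cluster_purity is a hypothetical helper, not necessarily how the project computes it.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Return ({cluster: (majority_label, purity)}, overall_purity)."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        members[cid].append(label)
    per_cluster = {}
    majority_total = 0
    for cid, labels in members.items():
        label, count = Counter(labels).most_common(1)[0]
        per_cluster[cid] = (label, count / len(labels))
        majority_total += count
    # Overall purity: share of all articles that sit in their cluster's majority category.
    return per_cluster, majority_total / len(true_labels)


cluster_ids = [0, 0, 0, 1, 1, 1, 1]
true_labels = ["sports", "sports", "business", "business", "business", "business", "sports"]
per_cluster, overall = cluster_purity(cluster_ids, true_labels)
print(per_cluster, round(overall, 3))
```

Purity is easy to read but rises with the number of clusters, so it is most meaningful when k is fixed to the number of categories, as it is here with k=6.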
Data Split
- 70% Training
- 10% Validation
- 20% Testing
- Stratified split to maintain category distribution
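The 70/10/20 stratified split can be illustrated without any library by splitting each category separately and concatenating the pieces. This is a sketch only; the project may well use a library routine instead, and stratified_split is a name made up for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.7, 0.1, 0.2), seed=42):
    """Return (train, val, test) index lists that preserve per-category proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * ratios[0])
        n_val = round(len(indices) * ratios[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


labels = ["thể thao"] * 100 + ["kinh doanh"] * 50
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting per category guarantees that a rare category cannot end up missing from the validation or test set by chance.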
Metrics
- Overall Accuracy
- Per‑class Precision, Recall, F1‑Score (multi‑class evaluation)
- Cross-validation scores (5-fold CV)
- Confusion Matrix to inspect inter‑class misclassification patterns
- Source-specific accuracy analysis
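To show how the per-class metrics relate to the confusion matrix, the following sketch derives precision, recall and F1 directly from it. Rows are true categories and columns are predicted categories; the function names and the two-category toy data are assumptions made for the example.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] = number of articles with true label i predicted as label j."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_report(matrix, labels):
    report = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum: predicted as this label
        actual = sum(matrix[i])                    # row sum: truly this label
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


labels = ["sports", "business"]
y_true = ["sports", "sports", "sports", "business", "business"]
y_pred = ["sports", "sports", "business", "business", "business"]
matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix, accuracy(matrix), per_class_report(matrix, labels))
```

Off-diagonal cells are the inter-class confusions: here one sports article was predicted as business, which lowers sports recall and business precision while leaving the other two values at 1.0.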
Evaluation Results
- Linear SVM achieved 97.07% test accuracy
- Small validation-test gap (0.01), suggesting little overfitting to the validation set
- Consistent performance across all news sources (93-98%)
- High precision and recall across all categories
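A per-source breakdown like the one reported above could be computed by grouping test predictions by news source. The sketch below assumes three parallel lists (source, true label, predicted label); accuracy_by_source is a hypothetical name and the sample data is invented purely for illustration.

```python
from collections import defaultdict


def accuracy_by_source(sources, y_true, y_pred):
    """Return {source: accuracy} over the rows belonging to each source."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for src, truth, pred in zip(sources, y_true, y_pred):
        totals[src] += 1
        hits[src] += int(truth == pred)
    return {src: hits[src] / totals[src] for src in totals}


# Invented toy data, not project results.
sources = ["VnExpress", "VnExpress", "Dân Trí", "Dân Trí"]
y_true = ["sports", "business", "sports", "business"]
y_pred = ["sports", "business", "sports", "sports"]
print(accuracy_by_source(sources, y_true, y_pred))
```

A large spread between sources would hint that the model has learned site-specific wording rather than topic, so this check complements the overall accuracy figure.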
K-Means Clustering Insights
- Natural clusters strongly align with supervised categories
- High cluster purity (>70% on average)
- Clear topic separation visible in t-SNE visualization
- Cluster keywords accurately reflect category themes:
  - Sports cluster: "bàn_thắng", "cầu_thủ", "huấn_luyện_viên"
  - Business cluster: "doanh_nghiệp", "kinh_tế", "đầu_tư"
  - Entertainment cluster: "nghệ_sĩ", "phim", "ca_sĩ"
Key Takeaway: The unsupervised analysis confirms that news articles have naturally separable topic structures, validating the supervised classification approach.
- Performance comparison table (12 algorithms)
- Confusion matrix visualization
- Clustering visualization (unsupervised vs supervised)
- Source-wise accuracy comparison
- Interactive demo: text input → predicted category result
- Comprehensive evaluation reports
dma-automatic-news-classification-system/
├── crawler.py                  # Web scraping from 8 news sources
├── data_preprocessing.py       # Text cleaning and tokenization
├── train_model.py              # Model training and evaluation
├── predict.py                  # Prediction interface
├── crawled_data.csv            # Raw scraped data
├── cleaned_data.csv            # Preprocessed data
├── vietnamese_stopwords.txt    # Vietnamese stopwords
└── train_model_assets/         # Generated artifacts
    ├── model.pkl               # Trained models
    ├── tfidf_vectorizer.pkl    # TF-IDF vectorizer
    ├── label_encoder.pkl       # Label encoder
    ├── kmeans_model.pkl        # K-Means model
    ├── cluster_keywords.txt    # Cluster analysis
    └── XX.png                  # All generated visualizations (e.g., 01_...png, 11_...png)
Requirements
pip install pandas numpy scikit-learn xgboost
pip install underthesea beautifulsoup4 requests
pip install matplotlib seaborn joblib
Step 1: Crawl Data
python crawler.py
Step 2: Preprocess Data
python data_preprocessing.py
Step 3: Train Models
python train_model.py
Step 4: Make Predictions
python predict.py
Why This Project Is Valuable
- High practical relevance for real‑world editorial workflows
- Covers the full KDD (Knowledge Discovery in Databases) pipeline end-to-end (data collection → preprocessing → modeling → evaluation)
- Employs modern ML techniques: ensemble methods, hyperparameter tuning, cross-validation
- Comprehensive analysis: both supervised and unsupervised approaches
- Production-ready: multi-source data, robust preprocessing, model persistence
- Readily accessible Vietnamese news data
- Easy to showcase via interactive prediction demo
- Excellent performance: 97% accuracy with low overfitting
- Interpretable results: confusion matrices, clustering visualizations, source analysis




