Machine Learning Engineer building multilingual NLP systems, evaluation frameworks, and efficient AI architectures.
Toronto, Ontario, Canada
LinkedIn • Hugging Face • The Meta Gradient
I'm a Machine Learning Engineer focused on multilingual NLP, evaluation systems, dataset engineering, and efficient machine learning architectures. My work spans open-source research, data pipelines, ML engineering, and technical writing, with an emphasis on building practical, reproducible systems grounded in real-world data.
I currently serve as Cross-Team Coordination Lead for the Multicultural Riddles Benchmark within the Cohere Labs Open Science Community, where I help coordinate multilingual dataset development, evaluation workflows, and benchmark infrastructure across a global contributor community.
Alongside open-source work, I independently research efficient sequence models, representation learning, reasoning systems, and optimization dynamics while publishing implementation-focused articles through The Meta Gradient.
- Cross-Team Coordination Lead for the Multicultural Riddles Benchmark (Cohere Labs Open Science Community)
- Building multilingual NLP datasets and evaluation pipelines
- Researching State Space Models (SSMs), Mamba, and efficient neural architectures
- Designing scalable datasets, schemas, and ML data pipelines
- Writing implementation-focused deep learning articles on The Meta Gradient
Cross-team coordination, multilingual data pipelines, and evaluation workflows for an open-source cultural reasoning benchmark spanning 49 languages, ~49,000 riddles, and 70+ contributors.
- Appointed Cross-Team Coordination Lead, coordinating across data creation, evaluation, and analysis teams.
- Developed internal tooling that converts validator output into actionable guidance for 70+ contributors, improving annotation quality and review efficiency.
- Designed and shipped the pilot dataset (v1) covering 40 communities and 40,000 riddles using a canonical schema for downstream benchmarking.
- Help standardize multilingual datasets for structured releases on GitHub and Hugging Face.
- Collaborate on benchmark design across API-based and open-weight language models.
- Support preparation for an upcoming open-science research publication.
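As a rough illustration of the validator-to-guidance idea in the list above, here is a minimal Python sketch. The error codes, the finding fields (`code`, `contributor`, `row`), and the guidance messages are all hypothetical; they are not taken from the project's internal tooling.

```python
from collections import defaultdict

# Hypothetical mapping from validator error codes to contributor-facing advice.
GUIDANCE = {
    "missing_field": "Fill in the required field before resubmitting.",
    "bad_language_code": "Use an ISO 639-3 code such as 'tam' or 'hin'.",
    "duplicate_id": "Give each riddle a unique identifier.",
}

FALLBACK = "Ask a reviewer to look at this entry."


def summarize(findings):
    """Group validator findings by contributor and attach guidance text."""
    report = defaultdict(list)
    for finding in findings:
        advice = GUIDANCE.get(finding["code"], FALLBACK)
        report[finding["contributor"]].append(f"row {finding['row']}: {advice}")
    return dict(report)


if __name__ == "__main__":
    sample = [
        {"contributor": "alice", "row": 3, "code": "missing_field"},
        {"contributor": "alice", "row": 7, "code": "unknown_code"},
    ]
    for name, notes in summarize(sample).items():
        print(name, notes)
```

Grouping by contributor keeps each person's feedback in one place, and the fallback message means an unrecognized code still produces a next step.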
Tech
Python • Pandas • Hugging Face Datasets • Git • GitHub • Schema Design • ISO 639-2/3 • Multilingual NLP
Repository (Private)
Fully offline multilingual document question-answering system for newcomers and migrants.
- Pipeline Lead on a 7-person team.
- Built an end-to-end RAG pipeline:
- PDF extraction
- BGE-M3 embeddings
- ChromaDB retrieval
- Tiny Aya generation
- mDeBERTa hallucination verification
- Executed over 9,000 automated evaluations across Chinese, Hindi, and Polish.
- Identified cross-lingual embedding quality, not LLM capability, as the primary retrieval bottleneck.
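The retrieve, generate, verify flow listed above can be sketched in plain Python. This is a toy under stated assumptions, not the project's code: a bag-of-words vector stands in for BGE-M3 embeddings, a sorted list stands in for ChromaDB retrieval, the "generator" just echoes the top chunk instead of calling Tiny Aya, and a similarity threshold stands in for the mDeBERTa verification step. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Rank chunks by similarity to the question; stand-in for a vector store query."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def generate(question, context):
    """Placeholder generator: returns the top chunk verbatim instead of calling an LLM."""
    return context[0] if context else ""


def is_supported(answer, context, threshold=0.5):
    """Crude support check; stand-in for an NLI-style hallucination verifier."""
    return any(cosine(embed(answer), embed(c)) >= threshold for c in context)


def answer(question, chunks):
    """Run retrieval, generation, and verification in sequence."""
    context = retrieve(question, chunks)
    draft = generate(question, context)
    return draft if is_supported(draft, context) else "No supported answer found."
```

Keeping each stage behind its own function makes it possible to swap one component (for example, the embedding model) and re-run the same evaluations to see which stage limits quality.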
Tech
Python • PyTorch • BGE-M3 • ChromaDB • Tiny Aya • mDeBERTa • Gradio
Repository
https://github.com/docunative-AI/docunative
Adaption Labs Uncharted Data Challenge
Created a high-quality Tamil agricultural instruction dataset from public-sector resources.
- Built the entire dataset independently from Kisan Call Centre, TNAU Extension Guides, and ICAR contingency plans.
- Curated 187 structured records covering 48 crops across 20 agricultural categories.
- Completed 10 iterative submissions over six weeks.
- Awarded Grade A (9.4/10) with honorary recognition from Sara Hooker.
Tech
Python • Pandas • Dataset Engineering • Metadata Design
Dataset
https://huggingface.co/datasets/vinod-anbalagan/tamil-agri-advisory-qa
Repository
https://github.com/VinodAnbalagan/tamil-agri-dataset-
Implementing neural architectures from first principles to better understand the evolution of sequence modeling.
Current implementations include:
- Vanilla RNN
- LSTM
- GRU
- Transformer
- State Space Models (Mamba)
Research focuses on:
- Gradient flow
- Memory mechanisms
- Computational efficiency
- Scaling behavior
- Architectural trade-offs
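For a sense of what the state space entry above involves, here is a minimal sketch of a discrete-time linear SSM with a diagonal state matrix: `h_t = a * h_{t-1} + b * x_t` and `y_t = sum(c * h_t)`. It uses plain Python lists rather than PyTorch, the function name `ssm_scan` is hypothetical, and the parameters are fixed. Mamba makes these parameters input-dependent, which this sketch does not attempt.

```python
def ssm_scan(a, b, c, inputs):
    """Run a diagonal linear state space recurrence over a scalar input sequence.

    a, b, c are lists of equal length (one entry per state dimension).
    Returns one scalar output per input step.
    """
    h = [0.0] * len(a)  # hidden state starts at zero
    outputs = []
    for x in inputs:
        # State update: each dimension decays by a_i and is driven by b_i * x.
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        # Readout: weighted sum of the state.
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return outputs


if __name__ == "__main__":
    # An impulse input shows how the state decays over time.
    print(ssm_scan(a=[0.5], b=[1.0], c=[1.0], inputs=[1.0, 0.0, 0.0]))
```

The state has a fixed size regardless of sequence length, which is where the memory and efficiency trade-offs against attention come from.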
Tech
PyTorch • NumPy
Studying optimization dynamics and neural network convergence through visualization.
- Compare optimization behavior across architectures.
- Explore why certain models converge more reliably.
- Visualize training dynamics and loss surfaces.
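A small example of the kind of comparison described above: plain gradient descent versus heavy-ball momentum on an ill-conditioned quadratic, recording the loss at every step so the two curves can be plotted. The objective, step size, momentum value, and function names are illustrative assumptions, not settings from the project.

```python
def loss(w, k=10.0):
    """Quadratic bowl that is k times steeper along the second axis."""
    return 0.5 * (w[0] ** 2 + k * w[1] ** 2)


def grad(w, k=10.0):
    """Gradient of the quadratic above."""
    return [w[0], k * w[1]]


def run(steps=200, lr=0.1, beta=0.0, start=(1.0, 1.0)):
    """Gradient descent with optional heavy-ball momentum; beta=0 is plain GD.

    Returns the loss at the start point followed by the loss after each step.
    """
    w = list(start)
    v = [0.0, 0.0]  # velocity term used by momentum
    history = [loss(w)]
    for _ in range(steps):
        g = grad(w)
        v = [beta * v_i - lr * g_i for v_i, g_i in zip(v, g)]
        w = [w_i + v_i for w_i, v_i in zip(w, v)]
        history.append(loss(w))
    return history


if __name__ == "__main__":
    plain = run()
    momentum = run(beta=0.9)
    print("plain GD final loss:", plain[-1])
    print("momentum final loss:", momentum[-1])
```

The two `history` lists can be passed straight to Matplotlib's `plt.plot` to compare the curves on one figure.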
Tech
PyTorch • Matplotlib
- Cross-Team Coordination Lead
- Multilingual benchmark development
- Evaluation pipelines
- Dataset engineering
Publishing datasets and machine learning resources.
Technical articles focused on deep learning, efficient architectures, and implementation from first principles.
- Python
- SQL
- Bash
- PyTorch
- Scikit-learn
- XGBoost
- Model Evaluation
- Feature Engineering
- Transformers
- Retrieval-Augmented Generation (RAG)
- Sentence Embeddings
- Hugging Face
- ChromaDB
- Multilingual NLP
- Pandas
- NumPy
- Dataset Curation
- Data Cleaning
- Schema Design
- Evaluation Pipelines
- FastAPI
- Docker
- AWS
- Git
- GitHub
- Linux
- Matplotlib
M.A.Sc. Electrical Engineering
University of Windsor
B.E. Electronics & Communication Engineering
Anna University
- University of Toronto Machine Learning & Data Science Certificate
- Stanford Machine Learning Specialization
- Google Advanced Data Analytics
I write implementation-focused articles that break down modern machine learning research into practical, reproducible code.
Topics include:
- Efficient neural architectures
- State Space Models
- Deep learning fundamentals
- Optimization
- Representation learning
- Machine learning engineering
The Meta Gradient
https://substack.com/@vinodanbalagan
- Multilingual NLP
- Efficient AI
- Representation Learning
- State Space Models
- Evaluation Frameworks
- Dataset Engineering
- Open-source ML
- AI for Scientific Research
"Learning by building, understanding by implementing, and improving through open collaboration."



