VinodAnbalagan/Readme.MD

Hi, I'm Vinod Anbalagan 👋

Machine Learning Engineer building multilingual NLP systems, evaluation frameworks, and efficient AI architectures.

📍 Toronto, Ontario, Canada

LinkedIn • Hugging Face • The Meta Gradient


About

I'm a Machine Learning Engineer focused on multilingual NLP, evaluation systems, dataset engineering, and efficient machine learning architectures. My work spans open-source research, data pipelines, ML engineering, and technical writing, with an emphasis on building practical, reproducible systems grounded in real-world data.

I currently serve as Cross-Team Coordination Lead for the Multicultural Riddles Benchmark within the Cohere Labs Open Science Community, where I help coordinate multilingual dataset development, evaluation workflows, and benchmark infrastructure across a global contributor community.

Alongside open-source work, I independently research efficient sequence models, representation learning, reasoning systems, and optimization dynamics while publishing implementation-focused articles through The Meta Gradient.


Currently

  • 🌍 Cross-Team Coordination Lead for the Multicultural Riddles Benchmark (Cohere Labs Open Science Community)
  • 🤖 Building multilingual NLP datasets and evaluation pipelines
  • 🧠 Researching State Space Models (SSMs), Mamba, and efficient neural architectures
  • 📊 Designing scalable datasets, schemas, and ML data pipelines
  • ✍️ Writing implementation-focused deep learning articles on The Meta Gradient

Selected Projects

🌍 Multicultural Riddles Benchmark (Cohere Labs Open Science Community)

Cross-team coordination, multilingual data pipelines, and evaluation workflows for an open-source cultural reasoning benchmark spanning 49 languages, ~49,000 riddles, and 70+ contributors.

  • Appointed Cross-Team Coordination Lead, coordinating across data creation, evaluation, and analysis teams.
  • Developed internal tooling that converts validator output into actionable guidance for 70+ contributors, improving annotation quality and review efficiency.
  • Designed and shipped the pilot dataset (v1) covering 40 communities and 40,000 riddles using a canonical schema for downstream benchmarking.
  • Help standardize multilingual datasets for structured releases on GitHub and Hugging Face.
  • Collaborate on benchmark design across API-based and open-weight language models.
  • Support preparation for an upcoming open-science research publication.
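
To illustrate the idea of turning validator output into contributor guidance, here is a minimal sketch. It is not the project's internal tooling: the field names, error codes, and message templates are hypothetical, chosen only to show the shape of such a step.

```python
# Hypothetical required fields; the benchmark's real schema is not shown here.
REQUIRED_FIELDS = {"riddle", "answer", "language"}

# Hypothetical mapping from validator error codes to contributor-facing advice.
GUIDANCE = {
    "missing": "Add the '{field}' field to this record.",
    "empty": "Fill in '{field}'; it is present but blank.",
}


def validate(record: dict) -> list[tuple[str, str]]:
    """Return (error_code, field) pairs for one record."""
    errors = []
    for field in sorted(REQUIRED_FIELDS):
        if field not in record:
            errors.append(("missing", field))
        elif not str(record[field]).strip():
            errors.append(("empty", field))
    return errors


def to_guidance(errors: list[tuple[str, str]]) -> list[str]:
    """Translate raw validator errors into readable instructions."""
    return [GUIDANCE[code].format(field=field) for code, field in errors]


record = {"riddle": "What has keys but opens no locks?", "answer": " "}
messages = to_guidance(validate(record))
```

Keeping validation and message rendering in separate functions means the guidance wording can change without touching the checks themselves.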

Tech

Python • Pandas • Hugging Face Datasets • Git • GitHub • Schema Design • ISO 639-2/3 • Multilingual NLP

Repository (Private)


📄 DocuNative (Cohere Expedition Hackathon)

Fully offline multilingual document question-answering system for newcomers and migrants.

  • Pipeline Lead on a 7-person team.
  • Built an end-to-end RAG pipeline:
    • PDF extraction
    • BGE-M3 embeddings
    • ChromaDB retrieval
    • Tiny Aya generation
    • mDeBERTa hallucination verification
  • Executed over 9,000 automated evaluations across Chinese, Hindi, and Polish.
  • Identified cross-lingual embedding quality, not LLM capability, as the primary bottleneck, with failures originating at the retrieval stage.
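
The retrieval stage of a pipeline like this can be sketched in a few lines. The code below is illustrative only and is not DocuNative's implementation: `embed` is a toy bag-of-words stand-in for BGE-M3, a plain Python list stands in for ChromaDB, the sample chunks are made up, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Made-up document chunks, standing in for text extracted from a PDF.
chunks = [
    "Tenants must give 60 days notice before moving out.",
    "The work permit is valid for two years.",
    "Rent is due on the first day of each month.",
]
top = retrieve("When is rent due?", chunks, k=1)
```

Because the ranking depends entirely on `embed`, a generator downstream only ever sees what the embeddings surface, which is one way to read the bottleneck finding above.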

Tech

Python • PyTorch • BGE-M3 • ChromaDB • Tiny Aya • mDeBERTa • Gradio

Repository

https://github.com/docunative-AI/docunative


🌾 Tamil Agricultural Advisory Dataset

Adaption Labs Uncharted Data Challenge

Created a high-quality Tamil agricultural instruction dataset from public-sector resources.

  • Built the entire dataset independently from Kisan Call Centre, TNAU Extension Guides, and ICAR contingency plans.
  • Curated 187 structured records covering 48 crops across 20 agricultural categories.
  • Completed 10 iterative submissions over six weeks.
  • Awarded Grade A (9.4/10) with honorary recognition from Sara Hooker.

Tech

Python • Pandas • Dataset Engineering • Metadata Design

Dataset

https://huggingface.co/datasets/vinod-anbalagan/tamil-agri-advisory-qa

Repository

https://github.com/VinodAnbalagan/tamil-agri-dataset-


🔬 Rethinking RNN (In Progress)

Implementing neural architectures from first principles to better understand the evolution of sequence modeling.

Current implementations include:

  • Vanilla RNN
  • LSTM
  • GRU
  • Transformer
  • State Space Models (Mamba)

Research focuses on:

  • Gradient flow
  • Memory mechanisms
  • Computational efficiency
  • Scaling behavior
  • Architectural trade-offs
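
As a reference point for the simplest architecture in the list, a vanilla RNN updates its hidden state as h_t = tanh(W_xh x_t + W_hh h_{t-1} + b). The following is a minimal pure-Python sketch with hand-picked toy weights. It is not code from the project, and the names `matvec`, `rnn_step`, and `run_rnn` are illustrative.

```python
import math


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(x, h_prev, w_xh, w_hh, b):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    pre = [
        a + c + d
        for a, c, d in zip(matvec(w_xh, x), matvec(w_hh, h_prev), b)
    ]
    return [math.tanh(p) for p in pre]


def run_rnn(xs, h0, w_xh, w_hh, b):
    """Unroll the recurrence over a sequence and return every hidden state."""
    states, h = [], h0
    for x in xs:
        h = rnn_step(x, h, w_xh, w_hh, b)
        states.append(h)
    return states


# Toy setup: 2-dimensional inputs and a 2-dimensional hidden state.
w_xh = [[0.5, 0.0], [0.0, 0.5]]
w_hh = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
states = run_rnn([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], w_xh, w_hh, b)
```

The same `w_hh` is applied at every step, so backpropagated gradients pass through it (and the tanh derivative) once per time step. That repeated product is where vanishing and exploding gradients come from.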

Tech

PyTorch • NumPy


📈 Loss Landscape Visualization

Studying optimization dynamics and neural network convergence through visualization.

  • Compare optimization behavior across architectures.
  • Explore why certain models converge more reliably.
  • Visualize training dynamics and loss surfaces.
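
One common way to visualize a loss surface is to evaluate the loss on a 2D grid around a chosen set of parameters, along two directions in parameter space. The sketch below does this for a toy quadratic loss in pure Python. The loss function, directions, and names are assumptions for illustration and are not taken from the project.

```python
def loss(params):
    """Toy quadratic loss with its minimum at (1.0, -2.0)."""
    w0, w1 = params
    return (w0 - 1.0) ** 2 + 3.0 * (w1 + 2.0) ** 2


def loss_grid(center, d1, d2, steps=5, span=1.0):
    """Evaluate loss(center + a*d1 + b*d2) for a, b evenly spaced in [-span, span]."""
    coords = [-span + 2.0 * span * i / (steps - 1) for i in range(steps)]
    grid = []
    for a in coords:
        row = []
        for b in coords:
            point = [c + a * u + b * v for c, u, v in zip(center, d1, d2)]
            row.append(loss(point))
        grid.append(row)
    return coords, grid


# Slice the surface around the known minimum along the two coordinate axes.
coords, grid = loss_grid(center=[1.0, -2.0], d1=[1.0, 0.0], d2=[0.0, 1.0])
```

The resulting `grid` is a plain list of rows that can be handed to a contour or surface plot. For a real network, `center` would be the trained weights and `d1`, `d2` two direction vectors of the same shape.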

Tech

PyTorch • Matplotlib


Open Source

Cohere Labs Open Science Community

  • Cross-Team Coordination Lead
  • Multilingual benchmark development
  • Evaluation pipelines
  • Dataset engineering

Hugging Face

Publishing datasets and machine learning resources.

The Meta Gradient

Technical articles focused on deep learning, efficient architectures, and implementation from first principles.


Technical Skills

Languages

  • Python
  • SQL
  • Bash

Machine Learning

  • PyTorch
  • Scikit-learn
  • XGBoost
  • Model Evaluation
  • Feature Engineering

NLP & LLMs

  • Transformers
  • Retrieval-Augmented Generation (RAG)
  • Sentence Embeddings
  • Hugging Face
  • ChromaDB
  • Multilingual NLP

Data Engineering

  • Pandas
  • NumPy
  • Dataset Curation
  • Data Cleaning
  • Schema Design
  • Evaluation Pipelines

Deployment

  • FastAPI
  • Docker
  • AWS
  • Git
  • GitHub
  • Linux

Visualization

  • Matplotlib

Education

M.A.Sc. Electrical Engineering
University of Windsor

B.E. Electronics & Communication Engineering
Anna University

Recent Learning

  • University of Toronto Machine Learning & Data Science Certificate
  • Stanford Machine Learning Specialization
  • Google Advanced Data Analytics

Writing

I write implementation-focused articles that break down modern machine learning research into practical, reproducible code.

Topics include:

  • Efficient neural architectures
  • State Space Models
  • Deep learning fundamentals
  • Optimization
  • Representation learning
  • Machine learning engineering

The Meta Gradient

https://substack.com/@vinodanbalagan


Interests

  • Multilingual NLP
  • Efficient AI
  • Representation Learning
  • State Space Models
  • Evaluation Frameworks
  • Dataset Engineering
  • Open-source ML
  • AI for Scientific Research

"Learning by building, understanding by implementing, and improving through open collaboration."

Pinned

  1. chaos-canvas (Public)

    Learning math through generative art, one iteration at a time.

  2. marimo-research-playground (Public)

    Testing Marimo and widgets

    Python

  3. tamil-agri-dataset- (Public)

    Tamil agriculture advisory QA dataset

    Python

  4. tendril (Public)

    A personal research lab for nature-inspired architectures

    Python

  5. Uzumaki (Public)

    うずまき: An experiment in growing proto-cognition from a substrate, not engineering it from a goal.