Projects

Sparse Mixture-of-Experts for Summarization

Python, PyTorch, Transformers, NLP

Implemented Sparse Mixture-of-Experts (MoE) layers and Grouped Query Attention (GQA) from scratch and benchmarked them against fine-tuned Llama and T5 baselines on the XSum summarization dataset.

  • Sparse MoE layer with top-k expert routing and load-balancing loss
  • Grouped Query Attention implementation sharing KV heads across query groups
  • Fine-tuned Llama and T5 baselines on the XSum abstractive summarization dataset
  • Evaluation using LLM-as-a-judge metrics alongside ROUGE
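
For readers unfamiliar with the two components, the following is a minimal PyTorch sketch of how top-k expert routing with a load-balancing loss and KV-head sharing in GQA are commonly written. It is not the project's actual code. The class name `SparseMoE`, the function `grouped_query_attention`, all dimensions, the Switch-Transformer-style form of the auxiliary loss, and the omission of causal masking are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    """Illustrative top-k routed mixture of feed-forward experts (hypothetical sizes)."""

    def __init__(self, d_model=64, d_ff=128, num_experts=4, top_k=2):
        super().__init__()
        self.num_experts, self.top_k = num_experts, top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (batch, seq, d_model); routing is done per token.
        tokens = x.reshape(-1, x.shape[-1])
        probs = F.softmax(self.router(tokens), dim=-1)       # (T, E)
        top_p, top_i = probs.topk(self.top_k, dim=-1)        # (T, k)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)      # renormalize over chosen experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Tokens that selected expert e, and which of their k slots it occupies.
            token_idx, slot = (top_i == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            weight = top_p[token_idx, slot].unsqueeze(-1)
            out[token_idx] += weight * expert(tokens[token_idx])

        # Assumed auxiliary loss: E * sum_e(fraction routed to e * mean router prob of e).
        dispatch = F.one_hot(top_i, self.num_experts).float().sum(dim=1)  # (T, E)
        frac_tokens = dispatch.mean(dim=0) / self.top_k
        mean_prob = probs.mean(dim=0)
        aux_loss = self.num_experts * (frac_tokens * mean_prob).sum()
        return out.reshape(x.shape), aux_loss


def grouped_query_attention(q, k, v):
    """Non-causal GQA sketch. q: (B, Hq, T, D); k, v: (B, Hkv, T, D), Hq divisible by Hkv."""
    group = q.shape[1] // k.shape[1]
    # Each KV head is shared by `group` consecutive query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v
```

In training, the auxiliary loss would typically be added to the summarization loss with a small coefficient so the router does not collapse onto a few experts; the coefficient is a tuning choice not stated here.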