Sparse Mixture-of-Experts for Summarization

Python, PyTorch, Transformers, NLP

Implemented Sparse Mixture-of-Experts (MoE) layers and Grouped Query Attention (GQA) from scratch and benchmarked them against fine-tuned Llama and T5 baselines on the XSum summarization dataset.

Sparse MoE layer with top-k expert routing and load-balancing loss
Grouped Query Attention implementation in which each group of query heads shares a single KV head
Fine-tuned Llama and T5 baselines on the XSum abstractive summarization dataset
Evaluation using LLM-as-a-judge metrics alongside ROUGE
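
The two from-scratch components listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class names, default dimensions, the Switch-style form of the load-balancing loss, and the causal mask in the attention layer are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    """Top-k routed mixture of feed-forward experts (hypothetical sketch)."""

    def __init__(self, d_model=64, d_hidden=128, num_experts=4, top_k=2):
        super().__init__()
        self.num_experts, self.top_k = num_experts, top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        tokens = x.reshape(-1, x.shape[-1])                # (tokens, d_model)
        probs = F.softmax(self.router(tokens), dim=-1)     # (tokens, experts)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)    # (tokens, k)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)    # renormalize gates

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Tokens that picked expert e, and the top-k slot it occupies.
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() > 0:
                gate = top_p[token_ids, slot].unsqueeze(-1)
                out.index_add_(0, token_ids, gate * expert(tokens[token_ids]))

        # Assumed Switch-style auxiliary loss:
        # num_experts * sum_e(fraction of assignments to e * mean router prob of e)
        frac = F.one_hot(top_idx, self.num_experts).float().sum(1).mean(0) / self.top_k
        aux_loss = self.num_experts * (frac * probs.mean(dim=0)).sum()
        return out.reshape(x.shape), aux_loss


class GroupedQueryAttention(nn.Module):
    """Self-attention where several query heads share one KV head (sketch)."""

    def __init__(self, d_model=64, num_q_heads=8, num_kv_heads=2):
        super().__init__()
        assert d_model % num_q_heads == 0 and num_q_heads % num_kv_heads == 0
        self.hq, self.hkv = num_q_heads, num_kv_heads
        self.head_dim = d_model // num_q_heads
        self.q_proj = nn.Linear(d_model, num_q_heads * self.head_dim, bias=False)
        self.kv_proj = nn.Linear(d_model, 2 * num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.hq, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).view(b, t, 2, self.hkv, self.head_dim).unbind(dim=2)
        # Repeat each KV head so it lines up with its group of query heads.
        rep = self.hq // self.hkv
        k = k.transpose(1, 2).repeat_interleave(rep, dim=1)   # (b, hq, t, head_dim)
        v = v.transpose(1, 2).repeat_interleave(rep, dim=1)

        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        # Causal mask (an assumption: decoder-style attention).
        future = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        attn = scores.masked_fill(future, float("-inf")).softmax(dim=-1) @ v
        return self.o_proj(attn.transpose(1, 2).reshape(b, t, -1))
```

In a training loop, the auxiliary loss returned by the MoE layer would typically be added to the summarization loss with a small weight, so that the router is nudged toward spreading tokens across experts. With 8 query heads and 2 KV heads as in the defaults here, the key/value projection is a quarter the size it would be under standard multi-head attention.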