Sparse Mixture-of-Experts for Summarization
Implemented Sparse Mixture-of-Experts (MoE) layers and Grouped Query Attention (GQA) from scratch and benchmarked them against fine-tuned Llama and T5 baselines on the XSum summarization dataset.
- Sparse MoE layer with top-k expert routing and load-balancing loss
- Grouped Query Attention implementation in which each KV head is shared by a group of query heads
- Fine-tuned Llama and T5 baselines on the XSum abstractive summarization dataset
- Evaluation using LLM-as-a-judge metrics alongside ROUGE
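
The top-k routing and load-balancing loss mentioned in the list can be illustrated with a minimal NumPy sketch. This is not the project's code: the function names (`top_k_route`, `load_balancing_loss`, `moe_forward`), the single-matrix `tanh` experts, and the choice of a Switch-Transformer-style auxiliary loss (number of experts times the sum over experts of routed fraction times mean router probability) are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top_k_route(x, w_gate, k=2):
    """Pick k experts per token; return indices, renormalized gates, router probs."""
    probs = softmax(x @ w_gate, axis=-1)                 # (tokens, experts)
    top_idx = np.argsort(-probs, axis=-1)[:, :k]         # (tokens, k), distinct per row
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    gates = top_p / top_p.sum(axis=-1, keepdims=True)    # gates sum to 1 per token
    return top_idx, gates, probs

def load_balancing_loss(probs, top_idx):
    """Assumed auxiliary loss: E * sum_i f_i * P_i (Switch-Transformer style)."""
    num_experts = probs.shape[1]
    counts = np.bincount(top_idx.ravel(), minlength=num_experts)
    f = counts / top_idx.size        # fraction of routing slots assigned to each expert
    p = probs.mean(axis=0)           # mean router probability for each expert
    return num_experts * float(np.sum(f * p))

def moe_forward(x, w_gate, expert_weights, k=2):
    """Sparse MoE forward pass: each token is processed only by its top-k experts."""
    top_idx, gates, probs = top_k_route(x, w_gate, k)
    out = np.zeros_like(x)
    for e, w in enumerate(expert_weights):
        token_ids, slot = np.nonzero(top_idx == e)       # tokens routed to expert e
        if token_ids.size == 0:
            continue
        # Hypothetical expert: one linear map followed by tanh.
        expert_out = np.tanh(x[token_ids] @ w)
        out[token_ids] += gates[token_ids, slot][:, None] * expert_out
    return out, load_balancing_loss(probs, top_idx)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                              # 6 tokens, model dim 8
w_gate = rng.normal(size=(8, 4))                         # router for 4 experts
experts = [rng.normal(size=(8, 8)) for _ in range(4)]
out, aux_loss = moe_forward(x, w_gate, experts, k=2)
```

In training, `aux_loss` would typically be scaled by a small coefficient and added to the task loss so that the router does not collapse onto a few experts.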
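
The KV-head sharing in Grouped Query Attention can be sketched in a few lines of NumPy. The function name `grouped_query_attention`, the `(heads, seq, dim)` tensor layout, and the optional `causal` flag are assumptions for illustration, and the query/key/value projections are omitted. The project's own implementation may differ.

```python
import numpy as np

def grouped_query_attention(q, k, v, causal=False):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).

    Query head h attends using KV head h // group_size.
    """
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads

    # Expand KV heads so each one is reused by `group_size` consecutive query heads.
    k = np.repeat(k, group_size, axis=0)
    v = np.repeat(v, group_size, axis=0)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)       # (n_q_heads, seq, seq)
    if causal:
        future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)       # block attention to later tokens

    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                                   # (n_q_heads, seq, d)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 5, 8))   # 4 query heads
k = rng.normal(size=(2, 5, 8))   # 2 KV heads, so each serves 2 query heads
v = rng.normal(size=(2, 5, 8))
out = grouped_query_attention(q, k, v)
```

Setting `n_kv_heads` equal to `n_q_heads` recovers standard multi-head attention, and a single KV head gives multi-query attention. The practical benefit is a smaller KV cache at inference time.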
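
For the ROUGE side of the evaluation, the idea behind ROUGE-1 can be shown with a simplified unigram-overlap F1. This is only a sketch: it assumes lowercased whitespace tokenization with no stemming, the helper name `rouge1_f1` is hypothetical, and scores from a standard ROUGE package will generally differ.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())     # per-word minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat", "the cat sat on")
```

Here the candidate has precision 1.0 and recall 0.5, so the F1 is 2/3. Overlap metrics like this reward lexical matching only, which is one reason to pair them with LLM-as-a-judge scoring for abstractive summaries.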