EMNLP2025 Eo: Expert Generalization in MoE in IFT
One-Liner
Cluster the input and activate a separate expert group for each cluster.

Motivation
The heterogeneity of instruction-tuning data is hard for standard MoE: routing operates only at the token level, so it cannot handle sequence-level generalization.

Novelty
An architecture that enables hierarchical expert routing.

Notable Methods
Mixture of Clustered Experts (MoCE): a dual-stage routing mechanism (see the sketch after this section).
- Group the M experts into M/N groups of N experts each.
- k-means cluster the sequence embedding of the input.
- Given the assigned cluster, route tokens only within the assigned expert subgroup.

Results
Outperforms MoE baselines and demonstrates expert-group specialization.
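A minimal sketch of the dual-stage routing idea, not the paper's actual implementation: it assumes k-means centroids fit offline over mean-pooled sequence embeddings, and all names (MoCELayer, n_groups, top_k, etc.) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoCELayer(nn.Module):
    """Hypothetical Mixture-of-Clustered-Experts layer: sequence-level cluster
    assignment first, then token-level routing restricted to that cluster's
    expert subgroup."""
    def __init__(self, d_model: int, n_experts: int, n_groups: int, top_k: int = 2):
        super().__init__()
        assert n_experts % n_groups == 0
        self.group_size = n_experts // n_groups            # N experts per group
        self.top_k = top_k
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)        # token-level gate over all M experts
        # k-means centroids over sequence embeddings (assumed fit offline)
        self.register_buffer("centroids", torch.randn(n_groups, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, seq, d_model)
        # Stage 1: assign each sequence to the nearest k-means centroid.
        seq_emb = x.mean(dim=1)                                     # (batch, d_model)
        cluster = torch.cdist(seq_emb, self.centroids).argmin(-1)   # (batch,)

        # Stage 2: token-level routing, masked to the assigned expert group.
        logits = self.router(x)                                     # (batch, seq, M)
        group_ids = torch.arange(logits.size(-1), device=x.device) // self.group_size
        outside = group_ids.unsqueeze(0) != cluster.unsqueeze(-1)   # (batch, M)
        logits = logits.masked_fill(outside.unsqueeze(1), float("-inf"))

        weights, idx = logits.topk(self.top_k, dim=-1)              # top-k within the group
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = idx[..., k] == e                              # tokens routed to expert e
                if sel.any():
                    out[sel] += weights[..., k][sel].unsqueeze(-1) * expert(x[sel])
        return out
```

Restricting the token router to one subgroup per sequence is what allows each expert group to specialize to an input cluster while per-token routing still operates inside the group.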