Motivation

ByT5 is very expensive: byte-level tokenization means you have to keep a residual stream for every damn token, so encoder sequences get long and attention cost blows up.

MrT5

MrT5 adds a soft attention masking gate at pretraining time that learns which tokens to delete; at inference time the soft mask becomes a hard cut and the deleted tokens are actually dropped. Cool: MrT5 learns language-specific compression rates (different languages compress at different rates).
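A minimal PyTorch sketch to make the train/inference split concrete. This is an illustration under assumptions, not MrT5's actual implementation: `DeletionGate`, `scorer`, and `hard_threshold` are made-up names, and the real model places its gate after a specific early encoder layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeletionGate(nn.Module):
    """Illustrative MrT5-style deletion gate (not the paper's code)."""

    def __init__(self, d_model: int, hard_threshold: float = -5.0):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)   # per-token deletion logit
        self.hard_threshold = hard_threshold  # assumed cutoff for the hard cut

    def forward(self, hidden, attn_scores):
        # hidden: (batch, seq, d_model)
        # attn_scores: (batch, heads, seq_q, seq_k) pre-softmax logits
        # Gate lives in (-inf, 0]: near 0 keeps a token, very negative deletes it.
        gate = F.logsigmoid(self.scorer(hidden)).squeeze(-1)  # (batch, seq)

        if self.training:
            # Soft masking: add the (negative) gate to every query's logits at
            # the corresponding key position, so "deleted" tokens fade toward
            # zero attention weight while gradients still flow through the gate.
            return attn_scores + gate[:, None, None, :], None

        # Hard cut: mark tokens below the threshold for actual removal, so
        # later layers attend over a genuinely shorter sequence.
        keep = gate > self.hard_threshold  # (batch, seq) bool
        return attn_scores, keep


# Usage: soft mask while training, hard drop at inference.
gate = DeletionGate(d_model=512)
h = torch.randn(1, 128, 512)
scores = torch.randn(1, 8, 128, 128)

gate.train()
masked_scores, _ = gate(h, scores)   # soft: same length, down-weighted tokens

gate.eval()
_, keep = gate(h, scores)
h_short = h[0, keep[0]]              # hard: shorter sequence going forward
```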
