SwiGLU: A Popular Activation Function Used by Large Models

Continuing my recent post about SiLU, let's explore another activation function commonly used in LLMs: SwiGLU. Introduced by Noam Shazeer, the second author of the "Attention Is All You Need" paper, SwiGLU has become the default activation function for large-scale models such as Google's PaLM, Meta's LLaMA, and now Tencent's new Hunyuan model.
SwiGLU stands for Swish Gated Linear Unit. It's a variant of the Gated Linear Unit (GLU) that incorporates the Swish activation function into its gating mechanism.
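Concretely, the paper defines SwiGLU(x, W, V, b, c) = Swish_1(xW + b) ⊗ (xV + c), and uses it to build the feed-forward block FFN_SwiGLU(x, W, V, W2) = (Swish_1(xW) ⊗ xV) W2, with biases dropped. Below is a minimal PyTorch sketch of that block; the class name, layer names, and dimensions are illustrative assumptions, not the exact configuration of PaLM, LLaMA, or Hunyuan.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFeedForward(nn.Module):
    """Minimal sketch of an FFN block with SwiGLU gating (assumed layout)."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden_dim, bias=False)   # gate projection (xW)
        self.v = nn.Linear(dim, hidden_dim, bias=False)   # value projection (xV)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection (W2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish_1 (i.e. SiLU) of the gate path, elementwise-multiplied by the
        # value path, then projected back to the model dimension.
        return self.w2(F.silu(self.w(x)) * self.v(x))


# Example usage (dimensions are arbitrary for illustration):
# block = SwiGLUFeedForward(dim=512, hidden_dim=1365)
# y = block(torch.randn(2, 16, 512))
```

Note that because SwiGLU uses two input projections instead of one, the paper shrinks the hidden dimension (roughly by a factor of 2/3) so that the block's parameter count stays comparable to a standard FFN; the 1365 in the usage example above reflects that convention for a 512-dimensional model.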
Why does it work so well? Nobody knows for certain. Shazeer writes in the paper: "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."
SwiGLU has been empirically successful in improving model performance, but its theoretical underpinnings are not yet fully understood.
Link to the paper: https://arxiv.org/abs/2002.05202