Building Custom Models Using FlagGems modules#

In some scenarios, users may want to build their own models from scratch or adapt existing ones to better suit their specific use cases. To support this, FlagGems provides a growing collection of high-performance modules commonly used in large language models (LLMs).

These components are implemented with FlagGems-accelerated operators and can be used in the same way as any standard `torch.nn.Module`. You can seamlessly integrate them into your system to benefit from kernel-level acceleration without writing custom CUDA or Triton code.

Modules can be found in `flag_gems/modules`.

Modules Available#

| Module | Description | Supported Features |
| --- | --- | --- |
| `GemsRMSNorm` | RMS layer normalization | Fused residual add, inplace and outplace |
| `GemsRope` | Standard rotary position embedding | Inplace and outplace |
| `GemsDeepseekYarnRoPE` | RoPE with YaRN extrapolation for DeepSeek-style LLMs | Inplace and outplace |
| `GemsSiluAndMul` | Fused SiLU activation with elementwise multiplication | Outplace only |
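For reference, the computation that `GemsRMSNorm` accelerates is plain RMS normalization: each element is scaled by the reciprocal root-mean-square of its vector, then multiplied by a learned weight. The sketch below is a minimal pure-Python illustration of that semantics; the `eps` default is an illustrative assumption, and the actual module operates on GPU tensors via fused Triton kernels rather than Python lists.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # Reciprocal root-mean-square of the input vector; eps avoids
    # division by zero (the 1e-6 default is illustrative).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    # Scale each element by 1/rms, then apply the learned weight.
    return [v / rms * w for v, w in zip(x, weight)]
```

The "fused residual add" feature in the table corresponds to computing `rms_norm(x + residual)` in a single kernel instead of materializing the intermediate sum.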

We encourage users to use these as drop-in replacements for the equivalent PyTorch layers. More components such as fused attention, MoE layers, and transformer blocks are under development.
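As another reference point, `GemsSiluAndMul` fuses the SwiGLU-style gating pattern `silu(x) * y` that appears in many LLM feed-forward blocks. A minimal pure-Python sketch of that semantics (the accelerated module performs this elementwise on tensors in one kernel):

```python
import math

def silu_and_mul(x, y):
    # SiLU gate, silu(v) = v * sigmoid(v), multiplied elementwise by y.
    return [xi / (1.0 + math.exp(-xi)) * yi for xi, yi in zip(x, y)]
```

Fusing the activation with the multiply avoids writing the intermediate `silu(x)` back to memory, which is where the kernel-level speedup comes from.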