FlagGems Experimental Operators

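This section lists the experimental operators in FlagGems. Every operator listed here reaches an average speedup of at least 0.8x relative to PyTorch's native implementation.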

Performance Overview

  • Total operators: 142
  • Average speedup range: 0.81x to 7.23x
  • Test environment: Hopper GPU
  • Filter: average speedup ≥ 0.8x

Operators Ranked by Performance

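| No. | Operator | Avg. speedup | Category |
|---|---|---|---|
| 1 | _safe_softmax | 7.23x 🏆 | Internal |
| 2 | digamma_ | 2.41x 🏆 | Math |
| 3 | zero | 1.85x | Other |
| 4 | relu | 1.79x | Activation |
| 5 | mse_loss | 1.64x | Loss |
| 6 | masked_select | 1.47x | Other |
| 7 | masked_scatter | 1.44x | Other |
| 8 | eye | 1.43x | Other |
| 9 | t_copy | 1.41x | Shape |
| 10 | trace | 1.40x | Math |
| 11 | i0_ | 1.37x | Math |
| 12 | zeros_like | 1.32x | Other |
| 13 | diag | 1.27x | Other |
| 14 | lift_fresh_copy | 1.24x | Other |
| 15 | alias_copy | 1.23x | Other |
| 16 | pixel_unshuffle | 1.20x 📈 | Vision |
| 17 | triu | 1.18x 📈 | Shape |
| 18 | rrelu_with_noise_backward | 1.17x 📈 | Activation |
| 19 | glu | 1.17x 📈 | Activation |
| 20 | tril | 1.16x 📈 | Shape |
| 21 | silu_ | 1.16x 📈 | Activation |
| 22 | asinh_ | 1.14x 📈 | Math |
| 23 | mv | 1.14x 📈 | Linear Algebra |
| 24 | arcsinh_ | 1.13x 📈 | Math |
| 25 | pixel_shuffle | 1.12x 📈 | Vision |
| 26 | replication_pad3d | 1.11x 📈 | Padding |
| 27 | _upsample_nearest_exact1d | 1.11x 📈 | Vision |
| 28 | i0 | 1.11x 📈 | Math |
| 29 | softplus | 1.10x 📈 | Activation |
| 30 | selu_ | 1.10x 📈 | Activation |
| 31 | upsample_nearest1d | 1.10x 📈 | Vision |
| 32 | special_i1 | 1.09x 📈 | Math |
| 33 | selu | 1.09x 📈 | Activation |
| 34 | amin | 1.09x 📈 | Math |
| 35 | sinh_ | 1.09x 📈 | Math |
| 36 | logit_ | 1.08x 📈 | Math |
| 37 | upsample_nearest3d | 1.07x 📈 | Vision |
| 38 | im2col | 1.06x 📈 | Vision |
| 39 | reflection_pad1d | 1.06x 📈 | Padding |
| 40 | elu | 1.06x 📈 | Activation |
| 41 | arctanh_ | 1.05x 📈 | Math |
| 42 | sigmoid | 1.05x 📈 | Activation |
| 43 | replication_pad1d | 1.04x 📈 | Padding |
| 44 | silu | 1.04x 📈 | Activation |
| 45 | sigmoid_ | 1.04x 📈 | Activation |
| 46 | addcdiv | 1.04x 📈 | Arithmetic |
| 47 | sinc_ | 1.03x 📈 | Math |
| 48 | relu6 | 1.03x 📈 | Activation |
| 49 | hardtanh | 1.03x 📈 | Activation |
| 50 | hardtanh_ | 1.03x 📈 | Activation |
| 51 | hardswish_ | 1.03x 📈 | Activation |
| 52 | reciprocal_ | 1.03x 📈 | Math |
| 53 | sinc | 1.03x 📈 | Math |
| 54 | hardsigmoid | 1.03x 📈 | Activation |
| 55 | logaddexp2 | 1.02x 📈 | Math |
| 56 | logit | 1.02x 📈 | Math |
| 57 | arctanh | 1.02x 📈 | Math |
| 58 | logaddexp | 1.02x 📈 | Math |
| 59 | cosh_ | 1.02x 📈 | Math |
| 60 | special_xlog1py | 1.02x 📈 | Math |
| 61 | celu | 1.02x 📈 | Activation |
| 62 | hardsigmoid_ | 1.02x 📈 | Activation |
| 63 | arcsinh | 1.02x 📈 | Math |
| 64 | sign | 1.02x 📈 | Math |
| 65 | absolute_ | 1.01x 📈 | Math |
| 66 | _adaptive_avg_pool3d | 1.01x 📈 | Vision |
| 67 | special_i0e | 1.01x 📈 | Math |
| 68 | cos_ | 1.01x 📈 | Math |
| 69 | deg2rad_ | 1.01x 📈 | Math |
| 70 | floor_ | 1.01x 📈 | Math |
| 71 | negative | 1.01x 📈 | Math |
| 72 | xlogy | 1.01x 📈 | Math |
| 73 | exp2 | 1.01x 📈 | Math |
| 74 | exp_ | 1.00x 📈 | Math |
| 75 | fix | 1.00x 📈 | Math |
| 76 | xlogy_ | 1.00x 📈 | Math |
| 77 | absolute | 1.00x 📈 | Math |
| 78 | prelu | 1.00x 📈 | Activation |
| 79 | hypot | 1.00x 📈 | Math |
| 80 | rad2deg_ | 1.00x 📈 | Math |
| 81 | smooth_l1_loss | 1.00x 📈 | Loss |
| 82 | deg2rad | 1.00x 📈 | Math |
| 83 | log_ | 1.00x 📈 | Math |
| 84 | sgn_ | 1.00x 📈 | Math |
| 85 | sin_ | 1.00x 📈 | Math |
| 86 | heaviside | 1.00x 📈 | Math |
| 87 | logical_xor_ | 1.00x 📈 | Other |
| 88 | trunc | 1.00x 📈 | Math |
| 89 | heaviside_ | 1.00x 📈 | Math |
| 90 | hardshrink | 1.00x 📈 | Activation |
| 91 | huber_loss | 1.00x 📈 | Loss |
| 92 | threshold_ | 1.00x 📈 | Activation |
| 93 | addcmul_ | 1.00x 📈 | Arithmetic |
| 94 | neg_ | 1.00x 📈 | Math |
| 95 | hypot_ | 1.00x 📈 | Math |
| 96 | leaky_relu | 1.00x 📈 | Activation |
| 97 | fmin | 1.00x 📈 | Math |
| 98 | erfinv | 1.00x 📈 | Math |
| 99 | log1p_ | 1.00x 📈 | Math |
| 100 | frac | 1.00x | Math |
| 101 | _functional_sym_constrain_range_for_size | 1.00x | Internal |
| 102 | expand | 1.00x | Shape |
| 103 | lift | 1.00x | Other |
| 104 | unsqueeze | 1.00x | Shape |
| 105 | _unsafe_view | 1.00x | Internal |
| 106 | softshrink | 1.00x | Activation |
| 107 | log2_ | 1.00x | Math |
| 108 | permute | 1.00x | Shape |
| 109 | leaky_relu_ | 1.00x | Activation |
| 110 | atanh_ | 1.00x | Math |
| 111 | permute_copy | 1.00x | Shape |
| 112 | fft_ifftshift | 1.00x | Other |
| 113 | copy_ | 1.00x | Other |
| 114 | fix_ | 1.00x | Math |
| 115 | slice_scatter | 0.99x | Other |
| 116 | exp2_ | 0.99x | Math |
| 117 | rsqrt_ | 0.99x | Math |
| 118 | threshold | 0.98x | Activation |
| 119 | reciprocal | 0.97x | Math |
| 120 | maximum | 0.97x | Arithmetic |
| 121 | abs | 0.96x | Math |
| 122 | arccosh | 0.96x | Math |
| 123 | multiply | 0.95x | Arithmetic |
| 124 | margin_ranking_loss | 0.95x | Loss |
| 125 | celu_ | 0.92x | Activation |
| 126 | hardswish | 0.91x | Activation |
| 127 | soft_margin_loss | 0.90x | Loss |
| 128 | replication_pad2d | 0.90x | Padding |
| 129 | unsqueeze_copy | 0.89x | Shape |
| 130 | native_dropout_backward | 0.89x | Other |
| 131 | slice_backward | 0.88x | Other |
| 132 | relu_ | 0.86x | Activation |
| 133 | negative_ | 0.86x | Math |
| 134 | abs_ | 0.86x | Math |
| 135 | take | 0.86x | Other |
| 136 | sgn | 0.86x | Math |
| 137 | erf_ | 0.82x | Math |
| 138 | gelu_ | 0.82x | Activation |
| 139 | erfinv_ | 0.82x | Math |
| 140 | _log_softmax_backward_data | 0.82x | Internal |
| 141 | log10_ | 0.81x | Math |
| 142 | rmsnorm | special | Normalization |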

Legend

  • 🏆 Outstanding: speedup ≥ 2.0x
  • Excellent: speedup ≥ 1.5x
  • 📈 Good: speedup ≥ 1.0x
  • Acceptable: speedup ≥ 0.8x
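
The legend thresholds can be read as a simple lookup from an average speedup to a tier. The Python sketch below is illustrative only: the function name `speedup_tier` and the "Below threshold" label are hypothetical and not part of FlagGems, and the function simply applies the four thresholds as listed in the legend.

```python
def speedup_tier(speedup: float) -> str:
    """Map an average speedup to the tier names used in the legend.

    Hypothetical helper for illustration; not a FlagGems API.
    """
    if speedup >= 2.0:
        return "Outstanding"
    if speedup >= 1.5:
        return "Excellent"
    if speedup >= 1.0:
        return "Good"
    if speedup >= 0.8:
        return "Acceptable"
    # Anything lower would not pass the >= 0.8x filter used for this list.
    return "Below threshold"


# A few average speedups taken from the table above.
for name, value in [("_safe_softmax", 7.23), ("relu", 1.79), ("log10_", 0.81)]:
    print(name, speedup_tier(value))
```

Checking thresholds from highest to lowest keeps each branch a single comparison and makes the tiers mutually exclusive.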

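Operator Categories

  • Activation: activation functions (ReLU, GELU, Sigmoid, etc.)
  • Arithmetic: basic arithmetic operations (add, mul, div, etc.)
  • Comparison: comparison operations (eq, ne, gt, lt, etc.)
  • Internal: internal and utility operations
  • Linear Algebra: matrix operations (matmul, mv, etc.)
  • Loss: loss function computation (MSE, Cross-Entropy, etc.)
  • Math: mathematical functions (sin, cos, exp, log, etc.)
  • NLP: natural language processing
  • Other: miscellaneous
  • Padding: padding operations (reflection_pad, replication_pad, etc.)
  • Shape: shape manipulation operations
  • Vision: computer vision operations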

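Notes

  • All operators pass correctness tests
  • Performance data was collected on a Hopper GPU across a variety of input shapes
  • Speedup is computed as PyTorch_time / FlagGems_time
  • Higher values mean better performance; a value above 1.0x means the FlagGems operator is faster than the PyTorch native implementation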
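
As a worked illustration of the speedup formula, the Python sketch below computes a per-shape speedup as PyTorch_time / FlagGems_time and then averages across shapes. It rests on assumptions: the source does not say how the average is taken, so the arithmetic mean of per-shape ratios used here is only one plausible choice, the latencies are made-up numbers, and `average_speedup` is a hypothetical helper rather than a FlagGems function.

```python
from statistics import mean


def average_speedup(pytorch_ms, flaggems_ms):
    """Arithmetic mean of per-shape speedups (PyTorch_time / FlagGems_time).

    Assumption: the averaging method is not stated in the source.
    """
    if len(pytorch_ms) != len(flaggems_ms) or not pytorch_ms:
        raise ValueError("need one timing per input shape for both backends")
    return mean(p / f for p, f in zip(pytorch_ms, flaggems_ms))


# Made-up latencies in milliseconds for three input shapes.
pytorch_ms = [0.20, 0.50, 1.20]
flaggems_ms = [0.10, 0.50, 1.50]

avg = average_speedup(pytorch_ms, flaggems_ms)
meets_filter = avg >= 0.8  # the inclusion filter used for this list
print(f"average speedup: {avg:.2f}x, meets filter: {meets_filter}")
```

In this made-up case FlagGems is faster on the first shape, equal on the second, and slower on the third, which shows how a single average can hide variation across shapes.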