FlagGems Experimental Operators#
This section lists the experimental operators in FlagGems. Compared with PyTorch's native implementations, these operators achieve an average speedup of 0.8x or higher.
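The operators below act as drop-in replacements for their aten counterparts. A minimal usage sketch follows, assuming the `flag_gems.enable()` entry point shown in the FlagGems README (treat the exact API as an assumption and verify it against your installed version); the imports are guarded so the sketch degrades gracefully where FlagGems or a CUDA device is unavailable:

```python
# Hedged sketch: route supported PyTorch ops to FlagGems kernels, then call
# the ops exactly as usual. flag_gems.enable() is the entry point shown in
# the FlagGems README; verify it against your installed version.
try:
    import torch
    import flag_gems

    flag_gems.enable()  # globally patch supported aten ops with Triton kernels

    x = torch.randn(1024, 1024, device="cuda")
    y = torch.relu(x)      # dispatches to the FlagGems relu listed below
    s = torch.sigmoid(x)   # likewise for sigmoid
    demo_ran = True
except (ImportError, RuntimeError):
    # FlagGems/torch not installed, or no CUDA device available
    demo_ran = False
```

With `flag_gems.enable()` active no call sites change, so the speedups in the table should apply transparently wherever the listed ops appear.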
Performance Overview#
- Total operators: 142
- Average speedup range: 0.81x - 7.23x
- Test environment: Hopper GPU
- Filter criterion: average speedup ≥ 0.8x
Operators Ranked by Performance#
| Rank | Operator | Avg. Speedup | Category |
|---|---|---|---|
| 1 | _safe_softmax | 7.23x 🏆 | Internal |
| 2 | digamma_ | 2.41x 🏆 | Math |
| 3 | zero | 1.85x ✅ | Other |
| 4 | relu | 1.79x ✅ | Activation |
| 5 | mse_loss | 1.64x ✅ | Loss |
| 6 | masked_select | 1.47x ✅ | Other |
| 7 | masked_scatter | 1.44x ✅ | Other |
| 8 | eye | 1.43x ✅ | Other |
| 9 | t_copy | 1.41x ✅ | Shape |
| 10 | trace | 1.40x ✅ | Math |
| 11 | i0_ | 1.37x ✅ | Math |
| 12 | zeros_like | 1.32x ✅ | Other |
| 13 | diag | 1.27x ✅ | Other |
| 14 | lift_fresh_copy | 1.24x ✅ | Other |
| 15 | alias_copy | 1.23x ✅ | Other |
| 16 | pixel_unshuffle | 1.20x 📈 | Vision |
| 17 | triu | 1.18x 📈 | Shape |
| 18 | rrelu_with_noise_backward | 1.17x 📈 | Activation |
| 19 | glu | 1.17x 📈 | Activation |
| 20 | tril | 1.16x 📈 | Shape |
| 21 | silu_ | 1.16x 📈 | Activation |
| 22 | asinh_ | 1.14x 📈 | Math |
| 23 | mv | 1.14x 📈 | Linear Algebra |
| 24 | arcsinh_ | 1.13x 📈 | Math |
| 25 | pixel_shuffle | 1.12x 📈 | Vision |
| 26 | replication_pad3d | 1.11x 📈 | Padding |
| 27 | _upsample_nearest_exact1d | 1.11x 📈 | Vision |
| 28 | i0 | 1.11x 📈 | Math |
| 29 | softplus | 1.10x 📈 | Activation |
| 30 | selu_ | 1.10x 📈 | Activation |
| 31 | upsample_nearest1d | 1.10x 📈 | Vision |
| 32 | special_i1 | 1.09x 📈 | Math |
| 33 | selu | 1.09x 📈 | Activation |
| 34 | amin | 1.09x 📈 | Math |
| 35 | sinh_ | 1.09x 📈 | Math |
| 36 | logit_ | 1.08x 📈 | Math |
| 37 | upsample_nearest3d | 1.07x 📈 | Vision |
| 38 | im2col | 1.06x 📈 | Vision |
| 39 | reflection_pad1d | 1.06x 📈 | Padding |
| 40 | elu | 1.06x 📈 | Activation |
| 41 | arctanh_ | 1.05x 📈 | Math |
| 42 | sigmoid | 1.05x 📈 | Activation |
| 43 | replication_pad1d | 1.04x 📈 | Padding |
| 44 | silu | 1.04x 📈 | Activation |
| 45 | sigmoid_ | 1.04x 📈 | Activation |
| 46 | addcdiv | 1.04x 📈 | Arithmetic |
| 47 | sinc_ | 1.03x 📈 | Math |
| 48 | relu6 | 1.03x 📈 | Activation |
| 49 | hardtanh | 1.03x 📈 | Activation |
| 50 | hardtanh_ | 1.03x 📈 | Activation |
| 51 | hardswish_ | 1.03x 📈 | Activation |
| 52 | reciprocal_ | 1.03x 📈 | Math |
| 53 | sinc | 1.03x 📈 | Math |
| 54 | hardsigmoid | 1.03x 📈 | Activation |
| 55 | logaddexp2 | 1.02x 📈 | Math |
| 56 | logit | 1.02x 📈 | Math |
| 57 | arctanh | 1.02x 📈 | Math |
| 58 | logaddexp | 1.02x 📈 | Math |
| 59 | cosh_ | 1.02x 📈 | Math |
| 60 | special_xlog1py | 1.02x 📈 | Math |
| 61 | celu | 1.02x 📈 | Activation |
| 62 | hardsigmoid_ | 1.02x 📈 | Activation |
| 63 | arcsinh | 1.02x 📈 | Math |
| 64 | sign | 1.02x 📈 | Math |
| 65 | absolute_ | 1.01x 📈 | Math |
| 66 | _adaptive_avg_pool3d | 1.01x 📈 | Vision |
| 67 | special_i0e | 1.01x 📈 | Math |
| 68 | cos_ | 1.01x 📈 | Math |
| 69 | deg2rad_ | 1.01x 📈 | Math |
| 70 | floor_ | 1.01x 📈 | Math |
| 71 | negative | 1.01x 📈 | Math |
| 72 | xlogy | 1.01x 📈 | Math |
| 73 | exp2 | 1.01x 📈 | Math |
| 74 | exp_ | 1.00x 📈 | Math |
| 75 | fix | 1.00x 📈 | Math |
| 76 | xlogy_ | 1.00x 📈 | Math |
| 77 | absolute | 1.00x 📈 | Math |
| 78 | prelu | 1.00x 📈 | Activation |
| 79 | hypot | 1.00x 📈 | Math |
| 80 | rad2deg_ | 1.00x 📈 | Math |
| 81 | smooth_l1_loss | 1.00x 📈 | Loss |
| 82 | deg2rad | 1.00x 📈 | Math |
| 83 | log_ | 1.00x 📈 | Math |
| 84 | sgn_ | 1.00x 📈 | Math |
| 85 | sin_ | 1.00x 📈 | Math |
| 86 | heaviside | 1.00x 📈 | Math |
| 87 | logical_xor_ | 1.00x 📈 | Other |
| 88 | trunc | 1.00x 📈 | Math |
| 89 | heaviside_ | 1.00x 📈 | Math |
| 90 | hardshrink | 1.00x 📈 | Activation |
| 91 | huber_loss | 1.00x 📈 | Loss |
| 92 | threshold_ | 1.00x 📈 | Activation |
| 93 | addcmul_ | 1.00x 📈 | Arithmetic |
| 94 | neg_ | 1.00x 📈 | Math |
| 95 | hypot_ | 1.00x 📈 | Math |
| 96 | leaky_relu | 1.00x 📈 | Activation |
| 97 | fmin | 1.00x 📈 | Math |
| 98 | erfinv | 1.00x 📈 | Math |
| 99 | log1p_ | 1.00x 📈 | Math |
| 100 | frac | 1.00x ⚡ | Math |
| 101 | _functional_sym_constrain_range_for_size | 1.00x ⚡ | Internal |
| 102 | expand | 1.00x ⚡ | Shape |
| 103 | lift | 1.00x ⚡ | Other |
| 104 | unsqueeze | 1.00x ⚡ | Shape |
| 105 | _unsafe_view | 1.00x ⚡ | Internal |
| 106 | softshrink | 1.00x ⚡ | Activation |
| 107 | log2_ | 1.00x ⚡ | Math |
| 108 | permute | 1.00x ⚡ | Shape |
| 109 | leaky_relu_ | 1.00x ⚡ | Activation |
| 110 | atanh_ | 1.00x ⚡ | Math |
| 111 | permute_copy | 1.00x ⚡ | Shape |
| 112 | fft_ifftshift | 1.00x ⚡ | Other |
| 113 | copy_ | 1.00x ⚡ | Other |
| 114 | fix_ | 1.00x ⚡ | Math |
| 115 | slice_scatter | 0.99x ⚡ | Other |
| 116 | exp2_ | 0.99x ⚡ | Math |
| 117 | rsqrt_ | 0.99x ⚡ | Math |
| 118 | threshold | 0.98x ⚡ | Activation |
| 119 | reciprocal | 0.97x ⚡ | Math |
| 120 | maximum | 0.97x ⚡ | Arithmetic |
| 121 | abs | 0.96x ⚡ | Math |
| 122 | arccosh | 0.96x ⚡ | Math |
| 123 | multiply | 0.95x ⚡ | Arithmetic |
| 124 | margin_ranking_loss | 0.95x ⚡ | Loss |
| 125 | celu_ | 0.92x ⚡ | Activation |
| 126 | hardswish | 0.91x ⚡ | Activation |
| 127 | soft_margin_loss | 0.90x ⚡ | Loss |
| 128 | replication_pad2d | 0.90x ⚡ | Padding |
| 129 | unsqueeze_copy | 0.89x ⚡ | Shape |
| 130 | native_dropout_backward | 0.89x ⚡ | Other |
| 131 | slice_backward | 0.88x ⚡ | Other |
| 132 | relu_ | 0.86x ⚡ | Activation |
| 133 | negative_ | 0.86x ⚡ | Math |
| 134 | abs_ | 0.86x ⚡ | Math |
| 135 | take | 0.86x ⚡ | Other |
| 136 | sgn | 0.86x ⚡ | Math |
| 137 | erf_ | 0.82x ⚡ | Math |
| 138 | gelu_ | 0.82x ⚡ | Activation |
| 139 | erfinv_ | 0.82x ⚡ | Math |
| 140 | _log_softmax_backward_data | 0.82x ⚡ | Internal |
| 141 | log10_ | 0.81x ⚡ | Math |
| 142 | rmsnorm | special ⚡ | Normalization |
Legend:
- 🏆 Outstanding: speedup ≥ 2.0x
- ✅ Excellent: speedup ≥ 1.2x
- 📈 Good: speedup ≥ 1.0x
- ⚡ Acceptable: speedup ≥ 0.8x
Tiers are assigned from the unrounded speedups, which is why entries displayed as 1.00x appear under both 📈 and ⚡.
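The tier assignment in the legend can be expressed as a small lookup. The sketch below is not FlagGems's official grading code, and the cutoffs are inferred from the table above, so treat them as assumptions:

```python
def speedup_tier(speedup, cutoffs=None):
    """Map an average speedup to a legend emoji.

    `cutoffs` is a descending list of (threshold, emoji) pairs; the defaults
    mirror the legend above, with thresholds inferred from the table.
    """
    if cutoffs is None:
        cutoffs = [(2.0, "🏆"), (1.2, "✅"), (1.0, "📈"), (0.8, "⚡")]
    for threshold, emoji in cutoffs:
        if speedup >= threshold:
            return emoji
    return None  # below the 0.8x inclusion filter

# A few rows from the table:
assert speedup_tier(7.23) == "🏆"   # _safe_softmax
assert speedup_tier(1.85) == "✅"   # zero
assert speedup_tier(1.05) == "📈"   # sigmoid
assert speedup_tier(0.81) == "⚡"   # log10_
```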
Operator Categories#
- Activation: activation functions (ReLU, GELU, Sigmoid, etc.)
- Arithmetic: basic arithmetic operations (add, mul, div, etc.)
- Comparison: comparison operations (eq, ne, gt, lt, etc.)
- Internal: internal and utility operations
- Linear Algebra: matrix operations (matmul, mv, etc.)
- Loss: loss-function computation (MSE, Cross-Entropy, etc.)
- Math: mathematical functions (sin, cos, exp, log, etc.)
- NLP: natural language processing operations
- Normalization: normalization operations (rmsnorm, etc.)
- Other: miscellaneous operations
- Padding: padding operations (reflection_pad, replication_pad, etc.)
- Shape: shape-manipulation operations
- Vision: computer vision operations
Notes#
- All operators pass correctness tests.
- Performance data were collected on Hopper GPUs across a variety of input shapes.
- Speedup is computed as PyTorch_time / FlagGems_time - larger values indicate better FlagGems performance, and values above 1.0 mean FlagGems outperforms the PyTorch baseline.
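The speedup ratio can be reproduced with a generic timing harness. The sketch below uses wall-clock `time.perf_counter` on arbitrary callables; real GPU measurements additionally need device synchronization (e.g. `torch.cuda.synchronize()`) before each timestamp, which this CPU-only illustration omits:

```python
import time

def median_runtime(fn, warmup=3, iters=21):
    """Median wall-clock time of fn() over `iters` runs, after warmup."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]

def speedup(baseline_fn, candidate_fn):
    """PyTorch_time / FlagGems_time analogue: > 1.0 means candidate is faster."""
    return median_runtime(baseline_fn) / median_runtime(candidate_fn)

# Stand-in workloads with a clear cost difference:
ratio = speedup(lambda: sum(range(200_000)), lambda: sum(range(2_000)))
```

Here `ratio` lands well above 1.0, mirroring how a table entry such as 1.79x for relu means the FlagGems kernel took roughly 1/1.79 of the PyTorch baseline's time on average.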