post on 11 May 2021 about 15487words require 52min
CC BY 4.0 (除特别声明或转载文章外)
如果这篇博客帮助到你,可以请我喝一杯咖啡~
自己负责的 ASC20 的赛题的 GPU 部分,不过最后比赛表现不是很理想。随便写点东西吧。
GPU(s) | 1 | 2 | 4 | 8 |
---|---|---|---|---|
main_HamExp | 15.130841s | 13.011804s | 18.062945s | 15.508143s |
main_InvQFT | x | x | x | 33.724309s |
random | 8.241649s | 10.510134s | 13.1650710s | 9.797779s |
GHZ | 1.148370s | 1.057206s | 1.641490s | 1.313547s |
GHZ_QFT_N(N=29) | 0.560380s | 0.545665s | 0.840117s | 0.653872s |
被大清和北航按着锤了,他们 main_InvQFT
算例在四机八卡上分别 9s 和 22s…他们的显存分配比我写的更好,我这边只跑出其中两个算例,他们几乎全部跑完了。和北航的同学交流了一下,显存不够的时候怎么办?对方回答:拉回内存上呀…也去问了一下大清的队长,他们是直接重构了 QuEST 中状态向量的分布。华科的同学说大清把通信改成多对多了,我想了想很有道理,因为我这边确实很多进程之间是没有通信的,把通信流量均分成 AlltoAll 可能确实更高效。
感觉自己优化的时候老是局限于从某些细节扣出零点几秒,还非常缺乏直接从算法层面重构整个软件的能力。
辛 亥 革 命 失 败
我们的决赛平台「星河一号」包含四台浪潮 5280M6 服务器,每个节点上有两张 NVIDIA A100 PCIe。我们发现同一个节点间显卡 P2P 通信速率远低于跨节点的两张 GPU 通信速率,为此使用 NVIDIA 提供的 CUDA P2P Sample 测试的结果支持这一观点。慢的比例远大于 PCIe4.0 和 NVLink 之间的差距,猜测这里显卡的 P2P 走了内存。为此我们调整了机器走线,使得两张显卡均挂在一个 CPU 上,使结果快了一倍,但是仍远低于跨节点速率。
因此,我们运行时按这样分配进程到节点的映射:12213443,使得节点内的两个进程在算法上尽量不要有通信。然而,我们的代码在决赛机器上的八卡性能仍然略低于单卡性能。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
$ nvidia-smi
Tue May 11 12:08:26 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00 Driver Version: 455.32.00 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB On | 00000000:CA:00.0 Off | 0 |
| N/A 30C P0 36W / 250W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-PCIE-40GB On | 00000000:E3:00.0 Off | 0 |
| N/A 31C P0 36W / 250W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ nvidia-smi topo -m
GPU0 GPU1 mlx5_0 CPU Affinity NUMA Affinity
GPU0 X NODE NODE 32-63 1
GPU1 NODE X NODE 32-63 1
mlx5_0 NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
GPU(s) | 1 | 2 | 4 |
---|---|---|---|
random | x | 9.62s | 7.38s |
GHZ | x | 0.98s | 0.79s |
GHZ_QFT_N(N=29) | 0.75s | 0.47s | 0.38s |
实际上我们在提交初赛成绩的时候有所保留(以防被赛会过度宣传),只交了一份去年的代码,只写了多机多卡,没有做其他优化。random 算例和 GHZ 算例分别 8.53s、7.45s。很显然这么做的并不止我们,而且别人保留的更多…
在超算中心 TH-2K 的 V100 节点上运行的,节点上有四张 V100 16GB,用 NVLink 以一种很坑的方式连起来了,不同显卡间 NVLink 的数量并不一样多。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
$ nvidia-smi
Wed May 5 18:58:36 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:8A:00.0 Off | 0 |
| N/A 39C P0 56W / 300W | 10MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:8B:00.0 Off | 0 |
| N/A 36C P0 61W / 300W | 10MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:B3:00.0 Off | 0 |
| N/A 36C P0 62W / 300W | 10MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:B4:00.0 Off | 0 |
| N/A 39C P0 60W / 300W | 10MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity
GPU0 X NV2 NV1 NODE PIX PIX NODE NODE 14-27
GPU1 NV2 X NODE NV2 PIX PIX NODE NODE 14-27
GPU2 NV1 NODE X NV2 NODE NODE PIX PIX 14-27
GPU3 NODE NV2 NV2 X NODE NODE PIX PIX 14-27
mlx5_0 PIX PIX NODE NODE X PIX NODE NODE
mlx5_1 PIX PIX NODE NODE PIX X NODE NODE
mlx5_2 NODE NODE PIX PIX NODE NODE X PIX
mlx5_3 NODE NODE PIX PIX NODE NODE PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
只改动statevec_pauliYKernel
中的两行代码后(见statevec_pauliYKernel_WuK
),random 算例在单张 v100 32G 上就快了 0.8s。
不过很可惜,再后来重构的代码中我把所有的量子门使用两个通用门表示了,这个小彩蛋并没有用上。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
// nvcc pauliYKernel.cu -ptx -arch=compute_70 -code=sm_70
typedef double qreal;
typedef struct
{
char *buffer; // generated QASM string
int bufferSize; // maximum number of chars before overflow
int bufferFill; // number of chars currently in buffer
int isLogging; // whether gates are being added to buffer
} QASMLogger;
/** Represents an array of complex numbers grouped into an array of
* real components and an array of coressponding complex components.
*
* @ingroup type
* @author Ania Brown
*/
typedef struct ComplexArray
{
qreal *real;
qreal *imag;
} ComplexArray;
/// \endcond
/** Represents a system of qubits.
* Qubits are zero-based
*
* @ingroup type
* @author Ania Brown
* @author Tyson Jones (density matrix)
*/
typedef struct Qureg
{
//! Whether this instance is a density-state representation
int isDensityMatrix;
//! The number of qubits represented in either the state-vector or density matrix
int numQubitsRepresented;
//! Number of qubits in the state-vector - this is double the number represented for mixed states
int numQubitsInStateVec;
//! Number of probability amplitudes held in stateVec by this process
//! In the non-MPI version, this is the total number of amplitudes
long long int numAmpsPerChunk;
//! Total number of amplitudes, which are possibly distributed among machines
long long int numAmpsTotal;
//! The position of the chunk of the state vector held by this process in the full state vector
int chunkId;
//! Number of chunks the state vector is broken up into -- the number of MPI processes used
int numChunks;
//! Computational state amplitudes - a subset thereof in the MPI version
ComplexArray stateVec;
//! Temporary storage for a chunk of the state vector received from another process in the MPI version
ComplexArray pairStateVec;
ComplexArray pairDeviceStateVec;
//! Storage for wavefunction amplitudes in the GPU version
ComplexArray deviceStateVec;
//! Storage for reduction of probabilities on GPU
qreal *firstLevelReduction, *secondLevelReduction;
int *lastPairRank;
void *deviceStateVecRealHandle, *deviceStateVecImagHandle, *pairDeviceStateVecRealHandle, *pairDeviceStateVecImagHandle;
//! Storage for generated QASM output
QASMLogger *qasmLog;
} Qureg;
__global__ void statevec_pauliYKernel(Qureg qureg, const int targetQubit, const int conjFac)
{
long long int sizeHalfBlock = 1LL << targetQubit;
long long int sizeBlock = 2LL * sizeHalfBlock;
long long int numTasks = qureg.numAmpsPerChunk >> 1;
long long int thisTask = blockIdx.x * blockDim.x + threadIdx.x;
if (thisTask >= numTasks)
return;
long long int thisBlock = thisTask / sizeHalfBlock;
long long int indexUp = thisBlock * sizeBlock + thisTask % sizeHalfBlock;
long long int indexLo = indexUp + sizeHalfBlock;
qreal stateRealUp, stateImagUp;
qreal *stateVecReal = qureg.deviceStateVec.real;
qreal *stateVecImag = qureg.deviceStateVec.imag;
stateRealUp = stateVecReal[indexUp];
stateImagUp = stateVecImag[indexUp];
// update under +-{{0, -i}, {i, 0}}
stateVecReal[indexUp] = conjFac * stateVecImag[indexLo];
stateVecImag[indexUp] = conjFac * -stateVecReal[indexLo];
stateVecReal[indexLo] = conjFac * -stateImagUp;
stateVecImag[indexLo] = conjFac * stateRealUp;
}
__global__ void statevec_pauliYKernel_WuK(Qureg qureg, const int targetQubit, const int conjFac)
{
long long int sizeHalfBlock = 1LL << targetQubit;
long long int sizeBlock = 2LL * sizeHalfBlock;
long long int numTasks = qureg.numAmpsPerChunk >> 1;
long long int thisTask = blockIdx.x * blockDim.x + threadIdx.x;
if (thisTask >= numTasks)
return;
long long int thisBlock = thisTask / sizeHalfBlock;
long long int indexUp = thisBlock * sizeBlock + thisTask % sizeHalfBlock;
long long int indexLo = indexUp + sizeHalfBlock;
qreal stateRealUp, stateImagUp;
qreal *stateVecReal = qureg.deviceStateVec.real;
qreal *stateVecImag = qureg.deviceStateVec.imag;
stateRealUp = stateVecReal[indexUp];
stateImagUp = stateVecImag[indexUp];
const qreal
stateRealLo = stateVecReal[indexLo],
stateImagLo = stateVecImag[indexLo];
// update under +-{{0, -i}, {i, 0}}
stateVecReal[indexUp] = conjFac * stateImagLo;
stateVecImag[indexUp] = conjFac * -stateRealLo;
stateVecReal[indexLo] = conjFac * -stateImagUp;
stateVecImag[indexLo] = conjFac * stateRealUp;
}
终端运行
1
nvcc pauliYKernel.cu -ptx -arch=compute_70 -code=sm_70
得到对应的 ptx 文件:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-26907403
// Cuda compilation tools, release 10.1, V10.1.243
// Based on LLVM 3.4svn
//
.version 6.4
.target sm_70
.address_size 64
// .globl _Z21statevec_pauliYKernel5Quregii
.visible .entry _Z21statevec_pauliYKernel5Quregii(
.param .align 8 .b8 _Z21statevec_pauliYKernel5Quregii_param_0[168],
.param .u32 _Z21statevec_pauliYKernel5Quregii_param_1,
.param .u32 _Z21statevec_pauliYKernel5Quregii_param_2
)
{
.reg .pred %p<3>;
.reg .b32 %r<11>;
.reg .f64 %fd<12>;
.reg .b64 %rd<29>;
mov.b64 %rd8, _Z21statevec_pauliYKernel5Quregii_param_0;
ld.param.u32 %r2, [_Z21statevec_pauliYKernel5Quregii_param_1];
ld.param.u32 %r1, [_Z21statevec_pauliYKernel5Quregii_param_2];
mov.u64 %rd3, %rd8;
cvt.u64.u32 %rd1, %r2;
mov.u64 %rd9, 1;
shl.b64 %rd2, %rd9, %r2;
ld.param.u64 %rd10, [_Z21statevec_pauliYKernel5Quregii_param_0+16];
shr.s64 %rd11, %rd10, 1;
mov.u32 %r3, %ntid.x;
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, %tid.x;
mad.lo.s32 %r6, %r3, %r4, %r5;
cvt.u64.u32 %rd4, %r6;
setp.ge.s64 %p1, %rd4, %rd11;
@%p1 bra BB0_5;
and.b64 %rd12, %rd2, -4294967296;
setp.eq.s64 %p2, %rd12, 0;
@%p2 bra BB0_3;
rem.s64 %rd28, %rd4, %rd2;
bra.uni BB0_4;
BB0_3:
cvt.u32.u64 %r7, %rd2;
cvt.u32.u64 %r8, %rd4;
rem.u32 %r9, %r8, %r7;
cvt.u64.u32 %rd28, %r9;
BB0_4:
cvt.u32.u64 %r10, %rd1;
shr.u64 %rd13, %rd4, %r10;
mul.lo.s64 %rd14, %rd2, %rd13;
shl.b64 %rd15, %rd14, 1;
add.s64 %rd16, %rd28, %rd15;
add.s64 %rd17, %rd16, %rd2;
ld.param.u64 %rd18, [%rd3+88];
cvta.to.global.u64 %rd19, %rd18;
ld.param.u64 %rd20, [%rd3+96];
cvta.to.global.u64 %rd21, %rd20;
shl.b64 %rd22, %rd16, 3;
add.s64 %rd23, %rd19, %rd22;
ld.global.f64 %fd1, [%rd23];
add.s64 %rd24, %rd21, %rd22;
ld.global.f64 %fd2, [%rd24];
shl.b64 %rd25, %rd17, 3;
add.s64 %rd26, %rd21, %rd25;
ld.global.f64 %fd3, [%rd26];
cvt.rn.f64.s32 %fd4, %r1;
mul.f64 %fd5, %fd4, %fd3;
st.global.f64 [%rd23], %fd5;
add.s64 %rd27, %rd19, %rd25;
ld.global.f64 %fd6, [%rd27];
mul.f64 %fd7, %fd4, %fd6;
neg.f64 %fd8, %fd7;
st.global.f64 [%rd24], %fd8;
mul.f64 %fd9, %fd4, %fd2;
neg.f64 %fd10, %fd9;
st.global.f64 [%rd27], %fd10;
mul.f64 %fd11, %fd4, %fd1;
st.global.f64 [%rd26], %fd11;
BB0_5:
ret;
}
// .globl _Z25statevec_pauliYKernel_WuK5Quregii
.visible .entry _Z25statevec_pauliYKernel_WuK5Quregii(
.param .align 8 .b8 _Z25statevec_pauliYKernel_WuK5Quregii_param_0[168],
.param .u32 _Z25statevec_pauliYKernel_WuK5Quregii_param_1,
.param .u32 _Z25statevec_pauliYKernel_WuK5Quregii_param_2
)
{
.reg .pred %p<3>;
.reg .b32 %r<11>;
.reg .f64 %fd<12>;
.reg .b64 %rd<29>;
mov.b64 %rd8, _Z25statevec_pauliYKernel_WuK5Quregii_param_0;
ld.param.u32 %r2, [_Z25statevec_pauliYKernel_WuK5Quregii_param_1];
ld.param.u32 %r1, [_Z25statevec_pauliYKernel_WuK5Quregii_param_2];
mov.u64 %rd3, %rd8;
cvt.u64.u32 %rd1, %r2;
mov.u64 %rd9, 1;
shl.b64 %rd2, %rd9, %r2;
ld.param.u64 %rd10, [_Z25statevec_pauliYKernel_WuK5Quregii_param_0+16];
shr.s64 %rd11, %rd10, 1;
mov.u32 %r3, %ntid.x;
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, %tid.x;
mad.lo.s32 %r6, %r3, %r4, %r5;
cvt.u64.u32 %rd4, %r6;
setp.ge.s64 %p1, %rd4, %rd11;
@%p1 bra BB1_5;
and.b64 %rd12, %rd2, -4294967296;
setp.eq.s64 %p2, %rd12, 0;
@%p2 bra BB1_3;
rem.s64 %rd28, %rd4, %rd2;
bra.uni BB1_4;
BB1_3:
cvt.u32.u64 %r7, %rd2;
cvt.u32.u64 %r8, %rd4;
rem.u32 %r9, %r8, %r7;
cvt.u64.u32 %rd28, %r9;
BB1_4:
cvt.u32.u64 %r10, %rd1;
shr.u64 %rd13, %rd4, %r10;
mul.lo.s64 %rd14, %rd2, %rd13;
shl.b64 %rd15, %rd14, 1;
add.s64 %rd16, %rd28, %rd15;
add.s64 %rd17, %rd16, %rd2;
ld.param.u64 %rd18, [%rd3+88];
cvta.to.global.u64 %rd19, %rd18;
ld.param.u64 %rd20, [%rd3+96];
cvta.to.global.u64 %rd21, %rd20;
shl.b64 %rd22, %rd16, 3;
add.s64 %rd23, %rd19, %rd22;
ld.global.f64 %fd1, [%rd23];
add.s64 %rd24, %rd21, %rd22;
ld.global.f64 %fd2, [%rd24];
shl.b64 %rd25, %rd17, 3;
add.s64 %rd26, %rd19, %rd25;
ld.global.f64 %fd3, [%rd26];
add.s64 %rd27, %rd21, %rd25;
cvt.rn.f64.s32 %fd4, %r1;
ld.global.f64 %fd5, [%rd27];
mul.f64 %fd6, %fd4, %fd5;
st.global.f64 [%rd23], %fd6;
mul.f64 %fd7, %fd4, %fd3;
neg.f64 %fd8, %fd7;
st.global.f64 [%rd24], %fd8;
mul.f64 %fd9, %fd4, %fd2;
neg.f64 %fd10, %fd9;
st.global.f64 [%rd26], %fd10;
mul.f64 %fd11, %fd4, %fd1;
st.global.f64 [%rd27], %fd11;
BB1_5:
ret;
}
可以发现,这样进行一下 naive 的修改之后,部分访存的代码由
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
add.s64 %rd26, %rd21, %rd25;
ld.global.f64 %fd3, [%rd26];
cvt.rn.f64.s32 %fd4, %r1;
mul.f64 %fd5, %fd4, %fd3;
st.global.f64 [%rd23], %fd5;
add.s64 %rd27, %rd19, %rd25;
ld.global.f64 %fd6, [%rd27];
mul.f64 %fd7, %fd4, %fd6;
neg.f64 %fd8, %fd7;
st.global.f64 [%rd24], %fd8;
mul.f64 %fd9, %fd4, %fd2;
neg.f64 %fd10, %fd9;
st.global.f64 [%rd27], %fd10;
mul.f64 %fd11, %fd4, %fd1;
st.global.f64 [%rd26], %fd11;
变成了
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
add.s64 %rd26, %rd19, %rd25;
ld.global.f64 %fd3, [%rd26];
add.s64 %rd27, %rd21, %rd25;
cvt.rn.f64.s32 %fd4, %r1;
ld.global.f64 %fd5, [%rd27];
mul.f64 %fd6, %fd4, %fd5;
st.global.f64 [%rd23], %fd6;
mul.f64 %fd7, %fd4, %fd3;
neg.f64 %fd8, %fd7;
st.global.f64 [%rd24], %fd8;
mul.f64 %fd9, %fd4, %fd2;
neg.f64 %fd10, %fd9;
st.global.f64 [%rd26], %fd10;
mul.f64 %fd11, %fd4, %fd1;
st.global.f64 [%rd27], %fd11;
可以看到,核函数代码对 global 内存的读写顺序改变了,原先 ld.global 与 st.global 指令间存在一定的交错执行关系,经过调整之后统一先读后写,这使得其更容易在运行的时候合并访存。可以使用 nvprof 这个工具验证优化的有效性。statevec_pauliYKernel 在算例一中有 43 个 Calls,其运行时间的 Avg/Min/Max 分别为 63.251ms/57.993ms/84.206ms ,总运行时间占算例一总时间的 14.91%(2.71978s/18.688107s)。经过修改之后,其运行时间的 Avg/Min/Max 分别为 44.101ms/42.064ms/53.005ms ,总运行时间占算例一总时间的 10.87%(1.89634s/17.922671s)。从结果上来看,只改动了原 CUDA 代码的两行就给 GPU 单卡的代码带来了近 1s 的提速,还是非常有效的。
发现这个问题的过程也很有参考意义:在 nvprof 得到的结果中,这个核函数运行时间的 Avg(63.251ms)远大于相同计算量的 statevec_pauliXKernel(43.516ms),且这个核函数的 Max 远大于 Avg,说明运行时间的波动较大,存在优化空间。在逐行比较 statevec_pauliYKernel 和 statevec_pauliXKernel 后,唯一有区别的几乎就是,原作者为了节省两个变量,直接将访存代码放进了计算的表达式中;调整之后果然取得了优化。最后通过类似的手段又在别的地方挤出 0.4s。这也提示我们一个比较优秀的写 CUDA 代码的习惯:尽量不要在同一个表达式中同时出现读写 global memory 的操作;同时对 global memory 的读写应该尽量分开,各自聚集在一起。
Related posts