Assume a GPU architecture that contains 10 SIMD processors. Each SIMD instruction has a width of 32, and each SIMD processor contains 8 lanes for single-precision arithmetic and load/store instructions, meaning that each nondiverged SIMD instruction can produce 32 results every 4 cycles. Assume a kernel that has divergent branches that cause, on average, 80% of threads to be active. Assume that 70% of all SIMD instructions executed are single-precision arithmetic and 20% are load/store. Because not all memory latencies are covered, assume an average SIMD instruction issue rate of 0.85. Assume that the GPU has a clock speed of 1.5 GHz.
- Compute the throughput, in GFLOP/s, for this kernel on this GPU.
$10\text{Cores}\times 8\times 1.5\text{GHz}\times 80\%\times 70\%\times 0.85=57.12\text{GFLOP/s}$
- Assume that you have the following choices. What is the speedup in throughput for each of these improvements?
- Increasing the number of single-precision lanes to 16.
- Increasing the number of SIMD processors to 15 (assume this change doesn’t affect any other performance metrics and that the code scales to the additional processors).
- Adding a cache that will effectively reduce memory latency by 40%, which will increase instruction issue rate to 0.95.
- $10\text{Cores}\times 16\times 1.5\text{GHz}\times 80\%\times 70\%\times 0.85=114.24\text{GFLOP/s}$, a speedup of $\frac{114.24}{57.12}=2$
- $15\text{Cores}\times 8\times 1.5\text{GHz}\times 80\%\times 70\%\times 0.85=85.68\text{GFLOP/s}$, a speedup of $\frac{85.68}{57.12}=1.5$
- $10\text{Cores}\times 8\times 1.5\text{GHz}\times 80\%\times 70\%\times 0.95=63.84\text{GFLOP/s}$, a speedup of $\frac{63.84}{57.12}\approx 1.12$
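The arithmetic above can be sanity-checked with a short C sketch (the helper function and its name are ours, not part of the exercise):

#include <stdio.h>

/* GFLOP/s = SIMD processors x SP lanes x clock (GHz)
 *           x active-thread fraction x FP-instruction fraction x issue rate */
static double gflops(int procs, int lanes, double ghz,
                     double active, double fp_frac, double issue) {
    return procs * lanes * ghz * active * fp_frac * issue;
}

int main(void) {
    double base = gflops(10, 8, 1.5, 0.80, 0.70, 0.85);
    double a = gflops(10, 16, 1.5, 0.80, 0.70, 0.85);
    double b = gflops(15, 8, 1.5, 0.80, 0.70, 0.85);
    double c = gflops(10, 8, 1.5, 0.80, 0.70, 0.95);
    printf("baseline      : %6.2f GFLOP/s\n", base);                      /*  57.12       */
    printf("16 lanes      : %6.2f GFLOP/s, speedup %.2f\n", a, a / base); /* 114.24, 2.00 */
    printf("15 processors : %6.2f GFLOP/s, speedup %.2f\n", b, b / base); /*  85.68, 1.50 */
    printf("issue 0.95    : %6.2f GFLOP/s, speedup %.2f\n", c, c / base); /*  63.84, 1.12 */
    return 0;
}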
In this exercise, we will examine several loops and analyze their potential for parallelization.
- Does the following loop have a loop-carried dependency?
for (int i = 0; i < 100; ++i) {
    A[i] = B[2 * i + 4];  /* S1 */
    B[4 * i + 5] = A[i];  /* S2 */
}
No. For array A, each iteration touches a different element; for array B, each iteration reads only even-indexed elements, while only odd-indexed elements are written.
Reference answer:
Four dependences could potentially exist: a loop-carried dependence of S1 on itself, of S2 on itself, a dependence of S2 on S1, and a dependence of S1 on S2. The first two cannot exist, because in each statement the array being written is different from the array being read.
For the possible dependence of S2 on S1: S1 computes an element of A and S2 loads it, but both references to A occur in the same iteration, so this is not a loop-carried dependence. If it were the only dependence, the iterations of this loop could execute in parallel, as long as each pair of statements kept its relative order within an iteration.
For the possible dependence of S1 on S2, i.e. between B[2*i+4]
and B[4*i+5]
, apply the greatest-common-divisor (GCD) test: if an array element is stored with index a*i+b
and loaded from the same array with index c*i+d
, a loop-carried dependence can exist only if $\gcd(c,a)$ divides $(d-b)$. Here the store index is $4i+5$ and the load index is $2i+4$, so $a=4,b=5,c=2,d=4$, and $\gcd(2,4)=2$ does not divide $(4-5)=-1$, so no dependence exists.
In summary, this loop has no loop-carried dependences, and its iterations can execute in parallel.
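The GCD test can also be cross-checked by brute force with a few lines of C (a sketch of our own, not part of the reference answer): simply ask whether the element stored as B[4*j+5] in one iteration ever coincides with the element loaded as B[2*i+4] in a different iteration.

#include <stdio.h>
#include <stdbool.h>

int main(void) {
    bool carried = false;
    for (int j = 0; j < 100; ++j)        /* iteration performing the store */
        for (int i = 0; i < 100; ++i)    /* iteration performing the load  */
            if (i != j && 4 * j + 5 == 2 * i + 4)
                carried = true;
    printf("loop-carried dependence on B: %s\n", carried ? "yes" : "no");  /* prints "no" */
    return 0;
}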
- In the following loop, find all the true dependencies, output dependencies, and anti-dependencies. Eliminate the output dependencies and anti-dependencies by renaming.
for (int i = 0; i < 100; ++i) {
    A[i] = A[i] * B[i];  /* S1 */
    B[i] = A[i] + c;     /* S2 */
    A[i] = C[i] * c;     /* S3 */
    C[i] = D[i] * A[i];  /* S4 */
}
- A true dependence between S1 and S2 through A[i] (or t_i after the rewrite below).
- A true dependence between S3 and S4 through A[i].
- An output dependence between S1 and S3 through A[i].
- An anti-dependence between S3 and S4 through C[i].
for (int i = 0; i < 100; ++i) {
int t_i = A[i] * B[i]; /* S1 */
B[i] = t_i + c; /* S2 */
A[i] = C[i] * c; /* S3 */
C[i] = D[i] * A[i]; /* S4 */
}
Reference answer:
- Output dependences (write after write): S1 and S3 through A[i].
- Anti-dependences (write after read): S1 and S3 through A[i]; S1 and S2 through B[i]; S2 and S3 through A[i]; S3 and S4 through C[i].
- True data dependences (read after write): S1 and S2 through A[i]; S3 and S4 through A[i].
Rewritten code, with the output dependences and anti-dependences removed by renaming:
for (int i = 0; i < 100; ++i) {
A[i] = A[i] * B[i]; /* S1 */
B1[i] = A[i] + c; /* S2 */
A1[i] = C[i] * c; /* S3 */
C1[i] = D[i] * A1[i]; /* S4 */
}
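As a quick check that the renaming preserves the original semantics, here is a small self-contained harness (our own code, with made-up input data); note that after the renamed loop the live-out values are held in A1, B1, and C1:

#include <stdio.h>
#include <math.h>

#define N 100

int main(void) {
    double A[N], B[N], C[N], D[N];
    double A0[N], B0[N], C0[N];
    double A1[N], B1[N], C1[N];
    const double c = 3.0;                      /* arbitrary scalar */
    for (int i = 0; i < N; ++i) {              /* identical made-up input data */
        A[i] = A0[i] = i + 1.0;
        B[i] = B0[i] = 2.0 * i;
        C[i] = C0[i] = 0.5 * i;
        D[i] = i - 7.0;
    }
    for (int i = 0; i < N; ++i) {              /* original loop */
        A0[i] = A0[i] * B0[i];                 /* S1 */
        B0[i] = A0[i] + c;                     /* S2 */
        A0[i] = C0[i] * c;                     /* S3 */
        C0[i] = D[i] * A0[i];                  /* S4 */
    }
    for (int i = 0; i < N; ++i) {              /* renamed loop */
        A[i] = A[i] * B[i];                    /* S1 */
        B1[i] = A[i] + c;                      /* S2 */
        A1[i] = C[i] * c;                      /* S3 */
        C1[i] = D[i] * A1[i];                  /* S4 */
    }
    int ok = 1;                                /* live-out values are in A1/B1/C1 */
    for (int i = 0; i < N; ++i)
        ok = ok && fabs(A1[i] - A0[i]) < 1e-9
                && fabs(B1[i] - B0[i]) < 1e-9
                && fabs(C1[i] - C0[i]) < 1e-9;
    printf("renamed loop matches original: %s\n", ok ? "yes" : "no");  /* prints "yes" */
    return 0;
}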
- Consider the following loop. Are there dependences between S1, S2, and S3? Is this loop parallel? If not, show how to make it parallel.
for (int i = 0; i < 100; ++i) {
    A[i] = B[i] + C[i];      /* S1 */
    B[i + 1] = D[i] + E[i];  /* S2 */
    C[i + 1] = D[i] * E[i];  /* S3 */
}
There is a loop-carried dependence: S1 depends on the values written by S2 and S3 in the previous iteration. The loop-carried dependence can be removed by reordering the statements, as shown below.
A[0] = B[0] + C[0];
#pragma omp parallel for
for (int i = 0; i < 99; ++i) {
B[i + 1] = D[i] + E[i];
C[i + 1] = D[i] * E[i];
A[i + 1] = B[i + 1] + C[i + 1];
}
B[100] = D[99] + E[99];
C[100] = D[99] * E[99];
Alternatively, the original right-hand sides can be expanded at the point of use; the drawback is that both the computation and the memory accesses are roughly doubled.
#pragma omp parallel for
for (int i = 0; i < 100; ++i) {
A[i] = i ? D[i - 1] + E[i - 1] + D[i - 1] * E[i - 1] : B[0] + C[0];
B[i + 1] = D[i] + E[i];
C[i + 1] = D[i] * E[i];
}
Reference answer:
Within a single iteration there is no name dependence or true data dependence among S1, S2, and S3.
Across iterations, arrays B and C may carry loop-carried dependences. Applying the GCD test to B[i+1]
(store) and B[i]
(load): $a=1,b=1,c=1,d=0$, and $\gcd(c,a)=1$ divides $(d-b)=-1$, so a loop-carried dependence on B exists (S1 in iteration $i+1$ reads the value stored by S2 in iteration $i$) and the loop is not parallel as written. The same holds for array C.
Rewritten code:
A[0] = B[0] + C[0];
#pragma omp parallel for
for (int i = 0; i < 99; ++i) {
B[i + 1] = D[i] + E[i];
C[i + 1] = D[i] * E[i];
A[i + 1] = B[i + 1] + C[i + 1];
}
B[100] = D[99] + E[99];
C[100] = D[99] * E[99];
List and describe at least four factors that influence the performance of GPU kernels. In other words, which runtime behaviors caused by the kernel code lead to a reduction in resource utilization during kernel execution?
- Threads within the same warp take different branch paths.
- Memory accesses that cannot be coalesced.
- Threads within the same warp access elements in the same shared-memory bank at the same time, causing bank conflicts.
- A kernel uses too much shared memory, lowering warp occupancy.
- A single thread uses too many registers, lowering warp occupancy.
- Poorly chosen block/grid dimensions, which increase scheduling overhead.
Reference answer:
- Branch divergence: when threads follow different control paths, SIMD lanes are masked off.
- Covering memory latency: a sufficient number of active threads is needed to hide memory latency and raise the instruction issue rate.
- Coalesced off-chip memory references: memory accesses should be organized contiguously within a group of SIMD threads.
- Use of on-chip memory: memory references with locality should exploit on-chip memory, and references to on-chip memory within a group of SIMD threads should be organized to avoid bank conflicts.
Assume that we have a function for an application of the form $F(i,p)$, which gives the fraction of time that exactly $i$ processors are usable given that a total of $p$ processors are available. This means that
\[\sum_{i=1}^pF\left(i,p\right)=1\]
Assume that when $i$ processors are in use, the application runs $i$ times faster.
- Rewrite Amdahl’s Law so that it gives the speedup as a function of $p$ for some application.
Let the serial running time of the whole program be 1. Then the speedup achievable with $p$ physical processors is $k(p)=\frac{1}{\sum_{i=1}^p\frac{F\left(i,p\right)}{i}}$.
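A minimal C sketch of this formula (our own illustration, not part of the exercise): the application profile is given as (fraction of time, processors that portion can use) pairs, and with $p$ processors available a portion that could use more than $p$ simply runs $p$-wide. The sample profile is the table from the next sub-question.

#include <stdio.h>

static double speedup(int p) {
    const double frac[]  = {0.20, 0.20, 0.10, 0.05, 0.15, 0.20, 0.10};
    const int    procs[] = {1, 2, 4, 6, 8, 16, 128};
    double denom = 0.0;
    for (int j = 0; j < 7; ++j) {
        int used = procs[j] < p ? procs[j] : p;   /* clamp at p available processors */
        denom += frac[j] / used;
    }
    return 1.0 / denom;                           /* generalized Amdahl's Law */
}

int main(void) {
    printf("p = 8   : %.2f\n", speedup(8));    /* ~2.57 */
    printf("p = 32  : %.2f\n", speedup(32));   /* ~2.72 */
    printf("p = 128 : %.2f\n", speedup(128));  /* effectively infinite, ~2.74 */
    return 0;
}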
- An application A runs on a single processor for a time of $T$ seconds. Different portions of its running time can improve if a larger number of processors is used. The figure below provides the details. How much speedup will A achieve when run on 8 processors?
| Fraction of T  | 20% | 20% | 10% | 5% | 15% | 20% | 10% |
| -------------- | --- | --- | --- | -- | --- | --- | --- |
| Processors (P) | 1   | 2   | 4   | 6  | 8   | 16  | 128 |
With eight processors, the portions that could use more than eight processors in fact run on only eight, so the speedup is
\[\frac{1}{20\%/1+20\%/2+10\%/4+5\%/6+(15\%+20\%+10\%)/8}\approx 2.57\]
- Repeat for 32 processors and an infinite number of processors.
With 32 processors, the 128-processor portion runs on only 32 processors, giving
\[\frac{1}{20\%/1+20\%/2+10\%/4+5\%/6+15\%/8+20\%/16+10\%/32}\approx 2.72\]
With an infinite number of processors, every portion runs at its full width (at most 128 processors), so
\[\frac{1}{20\%/1+20\%/2+10\%/4+5\%/6+15\%/8+20\%/16+10\%/128}\approx 2.74\]
In this exercise, we examine the effect of the interconnection network topology on the CPI of programs running on a 64-processor distributed-memory multiprocessor. The processor clock rate is 2.0 GHz, and the base CPI of an application with all references hitting in the cache is 0.75. Assume that 0.2% of the instructions involve a remote communication reference. The cost of a remote communication reference is $\left(100+10\times h\right)$ ns, where $h$ is the number of communication network hops that a remote reference has to make to the remote processor memory and back. Assume all communication links are bidirectional.
- Calculate the worst-case remote communication cost when the 64 processors are arranged as a ring, as an $8\times 8$ processor grid, or as a hypercube (hint: the longest communication path on a $2^n$-node hypercube has $n$ links).
- For the processors organized as a ring, any two processors are at most 32 hops apart, so the worst-case cost is $\left(100+10\times 32\right)=420$ ns.
- For the processors organized as an $8\times 8$ grid, any two processors are at most 14 hops apart, so the worst-case cost is $\left(100+10\times 14\right)=240$ ns.
- For the processors organized as a hypercube, any two processors are at most 6 hops apart, so the worst-case cost is $\left(100+10\times 6\right)=160$ ns.
Reference answer:
The maximum number of network hops $h$ is twice the length of the longest one-way path, because the reference must travel to the remote memory and back.
- For the processors organized as a ring, at most 32 one-way hops, so the worst-case cost is $\left(100+10\times 32\times 2\right)=740$ ns.
- For the processors organized as an $8\times 8$ grid, at most 14 one-way hops, so the worst-case cost is $\left(100+10\times 14\times 2\right)=380$ ns.
- For the processors organized as a hypercube, at most 6 one-way hops, so the worst-case cost is $\left(100+10\times 6\times 2\right)=220$ ns.
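The three worst-case costs can be reproduced with a few lines of C (a sketch; the diameters are the one-way hop counts derived above):

#include <stdio.h>

/* Worst-case remote-reference cost: (100 + 10*h) ns, where h counts hops
 * out and back, i.e. h = 2 x (network diameter in one-way hops).          */
static double worst_cost_ns(int diameter) {
    return 100.0 + 10.0 * (2 * diameter);
}

int main(void) {
    printf("ring of 64 (diameter 32)          : %.0f ns\n", worst_cost_ns(32)); /* 740 */
    printf("8x8 grid (diameter 7 + 7 = 14)    : %.0f ns\n", worst_cost_ns(14)); /* 380 */
    printf("hypercube, 2^6 nodes (diameter 6) : %.0f ns\n", worst_cost_ns(6));  /* 220 */
    return 0;
}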
- Compare the base CPI of the application with no remote communication to the CPI achieved with each of the three topologies in the last part.
The cycle time is $1/2.0\text{GHz}=5\times10^{-10}$ s, i.e. $0.5\text{ns}$, so a remote reference costs its latency divided by $0.5\text{ns}$ in cycles, and the CPI is the base CPI plus this penalty weighted by the remote-reference rate.
- For the ring, the worst-case CPI is $0.75+0.2\%\times\frac{420}{0.5}=2.43$.
- For the $8\times 8$ grid, the worst-case CPI is $0.75+0.2\%\times\frac{240}{0.5}=1.71$.
- For the hypercube, the worst-case CPI is $0.75+0.2\%\times\frac{160}{0.5}=1.39$.
Reference answer:
The cycle time is $0.5\text{ns}$ as before, but the remote costs must be the round-trip values from the previous part.
- For the ring, the worst-case CPI is $0.75+0.2\%\times\frac{740}{0.5}=3.71$.
- For the $8\times 8$ grid, the worst-case CPI is $0.75+0.2\%\times\frac{380}{0.5}=2.27$.
- For the hypercube, the worst-case CPI is $0.75+0.2\%\times\frac{220}{0.5}=1.63$.
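And the corresponding CPIs, again as a small C sketch using the round-trip costs from the previous part:

#include <stdio.h>

/* CPI = base CPI + remote-reference rate x (remote cost in cycles),
 * with a 0.5 ns cycle time.                                          */
int main(void) {
    const double base_cpi = 0.75, remote_rate = 0.002, cycle_ns = 0.5;
    const char  *topo[]    = {"ring", "8x8 grid", "hypercube"};
    const double cost_ns[] = {740.0, 380.0, 220.0};
    for (int i = 0; i < 3; ++i)
        printf("%-9s : CPI = %.2f\n", topo[i],
               base_cpi + remote_rate * cost_ns[i] / cycle_ns);  /* 3.71, 2.27, 1.63 */
    return 0;
}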