Computer Architecture HW1

In this exercise, assume that we are considering enhancing a quad-core machine by adding encryption hardware to it. When computing encryption operations, it is 20 times faster than the normal mode of execution. We will define percentage of encryption as the percentage of time in the original execution that is spent performing encryption operations. The specialized hardware increases power consumption by 2%.

  • Draw a graph that plots the speedup as a percentage of the computation spent performing encryption. Label the y-axis “Net speedup” and label the x-axis “Percent encryption.”
```json
{
  "mark": "line",
  "data": {
    "sequence": {
      "start": 0,
      "stop": 101,
      "as": "PercentEncryption"
    }
  },
  "transform": [
    {
      "calculate": "100/((100-datum.PercentEncryption)+0.05*datum.PercentEncryption)",
      "as": "NetSpeedup"
    }
  ],
  "encoding": {
    "x": {
      "field": "PercentEncryption",
      "type": "quantitative",
      "title": "Percent encryption"
    },
    "y": {
      "field": "NetSpeedup",
      "type": "quantitative",
      "title": "Net speedup"
    }
  }
}
```
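For reference, the same curve can be reproduced with a short Python sketch (a minimal example, assuming matplotlib is installed):

```python
import matplotlib.pyplot as plt

# Amdahl's law: encryption runs 20x faster, so encrypted time scales by 1/20 = 0.05.
percent = list(range(0, 101))
speedup = [100 / ((100 - p) + 0.05 * p) for p in percent]

plt.plot(percent, speedup)
plt.xlabel("Percent encryption")
plt.ylabel("Net speedup")
plt.show()
```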
  • With what percentage of encryption will adding encryption hardware result in a speedup of 2?

Let the answer be $x\%$. Then $100/\left(\left(100-x\right)+0.05x\right)=2$, which solves to $x=1000/19\approx 52.63$. That is, when about $52.63\%$ of the original execution is spent on encryption, the encryption hardware yields a speedup of 2.
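As a quick check, a minimal Python sketch confirms the algebra:

```python
# Net speedup when x% of the original time is encryption (20x faster => 0.05).
def net_speedup(x):
    return 100 / ((100 - x) + 0.05 * x)

print(net_speedup(1000 / 19))  # 2.0, since 100 - 0.95 * (1000/19) = 50
```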

  • What percentage of time in the new execution will be spent on encryption operations if a speedup of 2 is achieved?

Let the original total time be 100. From the previous part, a speedup of 2 requires the original encryption fraction to be $1000/19$. After hardware acceleration, the encryption takes $1000/19\times 0.05=50/19$, and the total time at a speedup of 2 is $100/2=50$. So encryption accounts for $\left(50/19\right)/50=1/19\approx 5.263\%$ of the new execution time.
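The same figure, computed directly (a small sketch continuing the numbers above):

```python
x = 1000 / 19                           # original encryption time (out of 100)
new_encryption = x * 0.05               # 50/19 after the 20x acceleration
new_total = (100 - x) + new_encryption  # 50, i.e., a speedup of 2
print(new_encryption / new_total)       # 0.0526... = 1/19
```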

  • Suppose you have measured the percentage of encryption to be 50%. The hardware design group estimates it can speed up the encryption hardware even more with significant additional investment. You wonder whether adding a second unit in order to support parallel encryption operations would be more useful. Imagine that in the original program, 90% of the encryption operations could be performed in parallel. What is the speedup of providing two or four encryption units, assuming that the parallelization allowed is limited to the number of encryption units? (Give the answer for two and four units, respectively.)

Let the original total time be 100, so the original encryption time is 50. With $p\ \left(p\ge 1\right)$ encryption units, the accelerated encryption time is $50\times 90\%/p/20+50\times(1-90\%)/20=2.25/p+0.25$, and the overall speedup is $100/\left(\left(100-50\right)+2.25/p+0.25\right)$. For $p=2,4$ the overall speedups are therefore about $1.946$ and $1.968$, respectively.
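A minimal sketch of the same computation:

```python
# 90% of encryption parallelizes across p units; 10% stays serial.
# Every encryption operation still runs 20x faster on the hardware.
def speedup_with_units(p):
    enc_time = 50 * 0.9 / p / 20 + 50 * 0.1 / 20  # 2.25/p + 0.25
    return 100 / ((100 - 50) + enc_time)

print(speedup_with_units(2))  # ~1.946
print(speedup_with_units(4))  # ~1.968
```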

The ways of a set can be viewed as a priority list, ordered from high priority to low priority. Every time the set is touched, the list can be reorganized to change block priorities. With this view, cache management policies can be decomposed into three sub-policies: Insertion, Promotion, and Victim Selection. Insertion defines where newly fetched blocks are placed in the priority list. Promotion defines how a block’s position in the list is changed every time it is touched (a cache hit). Victim Selection defines which entry of the list is evicted to make room for a new block when there is a cache miss.

  • Can you frame the LRU cache policy in terms of the Insertion, Promotion, and Victim Selection sub-policies?
  1. Insertion: a newly fetched block is inserted at the head of the set's priority list;
  2. Promotion: on a cache hit, i.e., when a cached block is accessed, the block is moved to the head of the list;
  3. Victim Selection: when the cache is full, the block at the tail of the list is evicted.

Reference answer:

True LRU keeps the numbers of the pages currently in use on a special stack. When a process accesses a page, that page number is pushed onto the top of the stack and the other page numbers shift toward the bottom; when memory is full, the page number at the bottom of the stack is removed. The top of the stack is thus always the most recently accessed page, and the bottom is the least recently used one. To model LRU with the set in the question, we can therefore define the following sub-policies (a code sketch follows the list):

  1. Insertion: place a newly fetched block at the highest-priority position, i.e., the front of the priority list; at the same time, the block at the end of the list is removed from the set.
  2. Promotion: when a block in the priority list is touched (a cache hit), swap it with the block $N$ positions higher in the list; if the block is already within the top $N$, move it to the front. This raises its priority. (For strict LRU, a touched block moves all the way to the front.)
  3. Victim Selection: as described in (1), when an insertion arrives, the lowest-priority block becomes the victim and is removed from the set.
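A minimal Python sketch of this decomposition (the class and method names are illustrative, not from the text; promotion here is strict move-to-front):

```python
class LRUSet:
    """One cache set as a priority list: index 0 is the highest-priority (MRU) slot."""

    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.blocks = []  # ordered from high priority to low priority

    def access(self, tag):
        if tag in self.blocks:
            # Promotion: on a hit, move the touched block to the head.
            self.blocks.remove(tag)
            self.blocks.insert(0, tag)
            return True   # hit
        # Victim Selection: evict the tail (lowest priority) if the set is full.
        if len(self.blocks) == self.num_ways:
            self.blocks.pop()
        # Insertion: place the newly fetched block at the head.
        self.blocks.insert(0, tag)
        return False      # miss
```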
  • Can you define other Insertion and Promotion policies that may be competitive and worth exploring further?

Consider an LFU (Least Frequently Used) policy: if a block has been used only rarely in the recent past, it is unlikely to be used in the near future. The priority can then represent how many times the block has been hit since entering the cache: a newly inserted block gets priority 0, and each promotion increments the corresponding priority by one; the victim is the block with the lowest priority.
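A minimal sketch of this LFU variant, keyed on a per-block hit counter (names are illustrative):

```python
class LFUSet:
    """One cache set where priority = number of hits since insertion."""

    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.hits = {}  # tag -> hit count (the block's priority)

    def access(self, tag):
        if tag in self.hits:
            self.hits[tag] += 1  # Promotion: increment the priority.
            return True          # hit
        if len(self.hits) == self.num_ways:
            # Victim Selection: evict the block with the fewest hits.
            victim = min(self.hits, key=self.hits.get)
            del self.hits[victim]
        self.hits[tag] = 0       # Insertion: a new block starts at priority 0.
        return False             # miss
```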


Reference answer:

Other potentially competitive policies. Promotion: a touched block need not jump straight to the front; instead, maintain an intensity $M$ of how often each block has been touched within the last $t$ time units. $M$ is defined as follows: each touch adds $k$, with $k>1$; to keep block priorities from only ever growing, every interval $t_1$ the priority of each block not touched during that interval is decreased by 1. Insertion: order the priority list by $M$, and insert a new block at a specific priority lower than the existing $M$ values.
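A minimal sketch of this decayed-intensity policy (the values of k, the decay interval, and the insertion priority are parameters the text leaves open, so the defaults below are assumptions):

```python
class DecayedFrequencySet:
    """Priority M: +k on each touch; -1 per interval for untouched blocks."""

    def __init__(self, num_ways, k=2, insert_priority=1):
        assert k > 1
        self.num_ways = num_ways
        self.k = k
        self.insert_priority = insert_priority
        self.m = {}            # tag -> intensity M
        self.touched = set()   # blocks touched in the current interval

    def access(self, tag):
        if tag in self.m:
            self.m[tag] += self.k  # Promotion: raise intensity by k.
            self.touched.add(tag)
            return True            # hit
        if len(self.m) == self.num_ways:
            victim = min(self.m, key=self.m.get)  # evict the lowest M
            del self.m[victim]
        self.m[tag] = self.insert_priority  # Insertion: below the hot blocks.
        self.touched.add(tag)
        return False               # miss

    def end_interval(self):
        # Called every t1 time units: age every block not touched since.
        for tag in self.m:
            if tag not in self.touched:
                self.m[tag] -= 1
        self.touched.clear()
```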

A cache acts as a filter. For example, for every 1000 instructions of a program, an average of 20 memory accesses may exhibit low enough locality that they cannot be serviced by a 2 MB cache. The 2 MB cache is said to have an MPKI (misses per thousand instructions) of 20, and this will be largely true regardless of the smaller caches that precede the 2 MB cache. Assume the following cache/latency/MPKI values: 32 KB/1/100, 128 KB/2/80, 512 KB/4/50, 2 MB/8/40, 8 MB/16/10. Assume that accessing the off-chip memory system requires 200 cycles on average.

  • For the following cache configurations, calculate the average time (in cycles) spent accessing the cache hierarchy.
    1. 32 KB L1; 8 MB L2; off-chip memory
    2. 32 KB L1; 512 KB L2; 8 MB L3; off-chip memory
    3. 32 KB L1; 128 KB L2; 2 MB L3; 8 MB L4; off-chip memory
  1. $1+\frac{100}{1000}\times\left(16+\frac{10}{1000}\times 200\right)=2.8$
  2. $1+\frac{100}{1000}\times\left(4+\frac{50}{1000}\times\left(16+\frac{10}{1000}\times 200\right) \right)=1.49$
  3. $1+\frac{100}{1000}\times\left(2+\frac{80}{1000}\times\left(8+\frac{40}{1000}\times\left(16+\frac{10}{1000}\times 200\right)\right)\right)=1.26976$

Reference answer:

  1. $1+\frac{100}{1000}\times\left(16+\frac{10}{100}\times 200\right)=4.6$
  2. $1+\frac{100}{1000}\times\left(4+\frac{50}{100}\times\left(16+\frac{10}{50}\times 200\right) \right)=4.2$
  3. $1+\frac{100}{1000}\times\left(2+\frac{80}{100}\times\left(8+\frac{40}{80}\times\left(16+\frac{10}{40}\times 200\right)\right)\right)=4.48$
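The key point is that each level's miss ratio here is conditional: of the 100 accesses per thousand instructions that reach the 8 MB cache, 10 miss, so its miss ratio is $10/100$, not $10/1000$. A minimal Python sketch of the same calculation (assuming one memory access per instruction, as the formulas above implicitly do):

```python
# Each level is (access latency in cycles, MPKI); off-chip memory costs 200.
def avg_access_time(levels, mem_latency=200):
    time, prev_mpki = 0.0, 1000.0  # 1000 accesses per 1000 instructions
    for latency, mpki in levels:
        time += (prev_mpki / 1000.0) * latency  # fraction reaching this level
        prev_mpki = mpki
    return time + (prev_mpki / 1000.0) * mem_latency

print(avg_access_time([(1, 100), (16, 10)]))                    # 4.6
print(avg_access_time([(1, 100), (4, 50), (16, 10)]))           # 4.2
print(avg_access_time([(1, 100), (2, 80), (8, 40), (16, 10)]))  # 4.48
```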
  • What do you observe about the downsides of a cache hierarchy that is too shallow or too deep?
  • If the hierarchy is too shallow, cache misses become frequent and the program runs into the memory wall.
  • If the hierarchy is too deep, access latency keeps growing while the performance gain diminishes, and cost and power consumption rise substantially.

Reference answer:

As the numbers show, with too few cache levels the miss ratio is high, causing many accesses to off-chip memory. With too many levels, the hit times accumulated across the levels also add up. Both cases increase the average memory access time, so the depth of the hierarchy must be kept moderate; moreover, the size chosen for each level also affects the average access time, so the design should be guided by calculations like the ones above.

Consider a 16 MB 16-way L3 cache that is shared by two programs A and B. There is a mechanism in the cache that monitors cache miss rates for each program and allocates 1–15 ways to each program such that the overall number of cache misses is reduced. Assume that program A has an MPKI of 100 when it is assigned 1 MB of the cache. Each additional 1 MB assigned to program A reduces the MPKI by 1. Program B has an MPKI of 50 when it is assigned 1 MB of cache; each additional 1 MB assigned to program B reduces its MPKI by 2. What is the best allocation of ways to programs A and B? Please explain why this allocation is best.

Since no further information about the two programs is given (in practice, the program that is more latency-sensitive should receive more cache), we analyze the miss counts directly. Qualitatively, each additional 1 MB reduces program B's MPKI twice as much as program A's, so B should receive more cache. Quantitatively, let program A get $a\ge 1$ ways and program B get $16-a\ge 1$ ways; the combined MPKI is then $\left(101-a\right)+\left(52-2\times(16-a)\right)=121+a$, which increases with $a$. The quantitative analysis thus supports the same conclusion: allocating 1 way to program A and 15 ways to program B gives the best performance.
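A brute-force check in Python (a small sketch; each way is 1 MB):

```python
# Combined MPKI when program A gets a ways and program B gets 16 - a ways.
def combined_mpki(a):
    mpki_a = 100 - (a - 1)          # MPKI 100 at 1 MB, -1 per extra MB
    mpki_b = 50 - 2 * (16 - a - 1)  # MPKI 50 at 1 MB, -2 per extra MB
    return mpki_a + mpki_b

best = min(range(1, 16), key=combined_mpki)
print(best, combined_mpki(best))  # 1 122: A gets 1 way, B gets 15
```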

Virtual machines (VMs) have the potential for adding many beneficial capabilities to computer systems, such as improved total cost of ownership (TCO) or availability. Could VMs be used to provide the following capabilities? If so, how could they facilitate this?

  • Test applications in production environments using development machines?

Yes. By booting the same VM image or snapshot on both the development machine and the production environment, the development machine runs exactly the same system environment as production, so applications can be tested there.


Reference answer:

Yes. A virtual machine emulates the complete hardware and software stack, so it is almost identical to the real environment.

  • Quick redeployment of applications in case of disaster or failure?

Yes. By launching virtual machines in bulk, applications can be deployed to the production environment automatically, and a failed application can be restarted quickly.


Reference answer:

Yes. Virtual machines are encapsulated: their complete state can be saved to files, so moving and copying a virtual machine is as easy as moving and copying files. Like any digital file, a VM can be migrated from one computer to another and opened on either one, and it works the same way in both places. This state file also covers what the snapshot feature saves; with these files and snapshots, applications can be redeployed quickly after a disaster or failure.

  • Higher performance in I/O-intensive applications?

No. A virtual machine has overhead of its own, so an application running in a VM cannot exceed the performance of the host.


Reference answer:

No. I/O-intensive applications incur high virtualization overhead, which can force I/O to wait and lead to low efficiency.

  • Fault isolation between different applications, resulting in higher availability for services?

Yes. Multiple VMs running on the same host execute independently of one another; a software or system crash in one VM does not affect the services running in the others, which provides fault isolation between different applications.


Reference answer:

Yes. A piece of software called a hypervisor (virtual machine monitor) effectively partitions the physical resources and allocates them to different virtual environments (that is, the tasks that need them). Inside a virtual environment (usually called a guest or virtual machine), users can interact with and run their computations.

  • Performing software maintenance on systems while applications are running without significant interruption?

Yes. When the server side needs maintenance or an update, no physical reboot is required: the affected virtual machines can be shut down gradually and replaced with snapshots taken after the maintenance or update, allowing maintenance without downtime.


Reference answer:

Yes. Because virtual machines are hardware-independent and encapsulated, downtime can be minimized or eliminated, so software maintenance can be performed on systems while applications keep running without significant interruption.