ARMv8 memory model

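In this article I describe my understanding of the ARMv8-A memory model from a programmer's point of view, covering shareability domains, atomicity, ordering, and related topics. Corrections are welcome.

I first quote the official Arm description and then briefly give my own reading of it. Some sections and details are omitted.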

  • Atomicity in the Arm architecture

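    Note that "atomic" here does not mean the atomicity of atomic instructions. It describes whether an access (read or write) to a piece of memory completes as a single indivisible transaction that other transactions cannot interrupt. In other words, a read or write never interleaves with another read or write: a read never sees the partial effect of a write, and a write never updates only part of its target.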

    • Requirements for single-copy atomicity

      For explicit memory effects generated from an Exception level the following rules apply:

      • A read that is generated by a load instruction that loads a single general-purpose register and is aligned to the size of the read in the instruction is single-copy atomic.
      • A write that is generated by a store instruction that stores a single general-purpose register and is aligned to the size of the write in the instruction is single-copy atomic.
      • Reads that are generated by a Load Pair instruction that loads two general-purpose registers and are aligned to the size of the load to each register are treated as two single-copy atomic reads, one for each register being loaded.
      • Writes that are generated by a Store pair instruction that stores two general-purpose registers and are aligned to the size of the store of each register are treated as two single-copy atomic writes, one for each register being stored.
      • Load-Exclusive Pair instructions of two 32-bit quantities and Store-Exclusive Pair instructions of 32-bit quantities are single-copy atomic.
      • When the Store-Exclusive of a Load-Exclusive/Store-Exclusive pair instruction using two 64-bit quantities succeeds, it causes a single-copy atomic update of the entire memory location being updated.

        Note: To atomically load two 64-bit quantities, perform a Load-Exclusive pair/Store-Exclusive pair sequence of reading and writing the same value for which the Store-Exclusive pair succeeds, and use the read values from the Load-Exclusive pair.

      • Where translation table walks generate a read of a translation table entry, this read is single-copy atomic.

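      Note that "copy" here should not be read as the act of copying; I think "one copy" or "one instance" of the data is a better reading. I come back to this under multi-copy atomicity.

      This part lists the requirements for single-copy atomicity, for example that aligned loads and stores are atomic. Load Pair and Store Pair instructions have corresponding rules; see the spec for the details of each item.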
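
The alignment conditions in the quoted rules can be summarized in a small helper. This is only a sketch in C: both function names are hypothetical, and it covers just the rules for single general-purpose register accesses and for Load Pair/Store Pair, not the exclusive or translation table walk cases.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: does a load/store of ONE general-purpose register
 * (1, 2, 4 or 8 bytes) at `addr` meet the alignment condition under which
 * the quoted rules make it single-copy atomic? */
static bool single_reg_access_is_atomic(uintptr_t addr, size_t size)
{
    assert(size == 1 || size == 2 || size == 4 || size == 8);
    return addr % size == 0;
}

/* Hypothetical helper: how many single-copy atomic accesses the quoted
 * rules guarantee for LDP/STP of two registers of `reg_size` bytes each.
 * Returns 2 (one per register) when aligned to reg_size, and 0 when these
 * rules give no guarantee. It never returns 1, because the pair as a whole
 * is not treated as one atomic access by these rules. */
static int pair_access_atomic_units(uintptr_t addr, size_t reg_size)
{
    assert(reg_size == 4 || reg_size == 8);
    return (addr % reg_size == 0) ? 2 : 0;
}
```

For example, an 8-byte load at 0x1004 does not satisfy the condition, while a 4-byte load at the same address does.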

    • Properties of single-copy atomic accesses

      A memory access instruction that is single-copy atomic has the following properties:

      • For a pair of overlapping single-copy atomic store instructions, all of the overlapping writes generated by one of the stores are Coherence-after the corresponding overlapping writes generated by the other store.
      • For a single-copy atomic load instruction L1 that overlaps a single-copy atomic store instruction S2, if one of the overlapping reads generated by L1 Reads-from one of the overlapping writes generated by S2, then none of the overlapping writes generated by S2 are Coherence-after the corresponding overlapping reads generated by L1.

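      A single-copy atomic memory access instruction has the following properties:

      • If two store instructions overlap (see Overlapping accesses below), their overlapping writes take effect one after the other. The two stores never modify the overlapping bytes in an interleaved way: one store's writes to those bytes are all ordered after the other's.
      • If a load L1 overlaps a store S2 and L1 reads even one Location (a byte; the overlap may span several Locations) written by S2, then L1 reads S2's value for every overlapping Location. L1 never sees a partial result of S2: it sees the overlapping bytes either entirely after S2 took effect or entirely before.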
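
To make the second property concrete, the sketch below classifies what a 64-bit load returned relative to one fully overlapping 64-bit store. The enum and function names are hypothetical, and the model assumes only that one store touches the location. Single-copy atomicity allows SAW_OLD and SAW_NEW and rules out SAW_TORN.

```c
#include <assert.h>
#include <stdint.h>

enum observed { SAW_OLD, SAW_NEW, SAW_TORN };

/* Classify the value returned by a 64-bit load that fully overlaps a single
 * 64-bit store replacing old_val with new_val. Any value other than old_val
 * or new_val would have to mix bytes from before and after the store. */
static enum observed classify_load(uint64_t old_val, uint64_t new_val,
                                   uint64_t loaded)
{
    assert(old_val != new_val);
    if (loaded == old_val)
        return SAW_OLD;
    if (loaded == new_val)
        return SAW_NEW;
    return SAW_TORN;
}
```

A load that returned 0x00000000FFFFFFFF while the store changed 0 to all ones would be classified as torn, which is exactly the outcome an aligned single-register access must never produce.
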
    • Multi-copy atomicity

      In a multiprocessing system, writes to a memory location are multi-copy atomic if the following conditions are both true:

      • All writes to the same location are serialized, meaning they are observed in the same order by all observers, although some observers might not observe all of the writes.
      • A read of a location does not return the value of a write until all observers observe that write.

      Note: Writes that are not coherent are not multi-copy atomic.

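      Multi-copy atomicity describes atomicity across multiple PEs (Processing Elements; think of one as a CPU core). My reading is that "multi-copy" refers to several PEs each potentially holding a copy of the same Location (I think of the copies as caches), with Observed-by relations between them. Multi-copy atomicity requires that the writes issued by all observers to the same Location are totally ordered and that every observer sees them in that order. An observer does not have to observe every write, but the writes it does observe always appear in the same order. For example, given four writes A, B, C, D in that order, Observer1 may observe A, C, D, Observer2 may observe B, D, and Observer3 may observe A, C. In addition, a read must not return the value of a write until all observers can observe that write. What "observe" means is described under Observed-by below.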
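
The A, B, C, D example can be checked mechanically: each observer's view must be a subsequence of the single serialized order. The sketch below is a minimal illustration; the function name is hypothetical, and each write is named by one character that occurs only once in the order.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if `seen` lists writes in an order consistent with `order`, that is,
 * `seen` is a subsequence of `order`. Assumes each write appears at most
 * once in `order`. */
static bool consistent_with_order(const char *order, const char *seen)
{
    assert(order != NULL && seen != NULL);
    for (; *seen != '\0'; seen++) {
        const char *hit = strchr(order, *seen);
        if (hit == NULL)
            return false;   /* write missing, or it appears too early */
        order = hit + 1;    /* later writes must come after this one */
    }
    return true;
}
```

With the order "ABCD", the views "ACD", "BD" and "AC" are all accepted, while a view such as "CA" is rejected because it contradicts the serialization.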

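      From a hardware point of view, I think of it this way: in a multi-core system every observer has its own cache, and the caches are kept coherent through an interconnect and a protocol (for example CCI and MOESI). Under multi-copy atomicity, when an observer writes to a Location, the corresponding cache entries in the other observers must be invalidated or updated (which amounts to them observing the write) before any read, including a read by the writer itself, may return the new value. Multi-copy atomicity is therefore a very strong requirement, and it makes writes slow.

      For this reason the Arm memory model does not use multi-copy atomicity; it uses Other-multi-copy atomicity, which is covered later.

      One more remark: what Arm calls multi-copy atomic is not what other architectures mean by the term. Arm's Other-multi-copy atomic is essentially what is commonly called multi-copy atomic elsewhere.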

    • Requirements for multi-copy atomicity

      For Normal memory, writes are not required to be multi-copy atomic.

      For Device memory, writes are not required to be multi-copy atomic.

      The Arm memory model is Other-multi-copy atomic.

      ARMv8 does not require multi-copy atomicity; it requires Other-multi-copy atomicity.

  • Definition of the Arm memory model

    • Basic definitions

      The ARMv8 memory model defines many concepts. I pick out some of them here:

      • Observer

        An Observer refers to a processing element or mechanism in the system, such as a peripheral device, that can generate reads from, or writes to, memory.

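        An Observer is a master that can issue reads and writes. It can be a peripheral, or a component inside the CPU that can issue reads and writes (instruction fetch, load/store, the MMU, and so on).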

      • Common Shareability Domain

        For the purpose of this section, all Observers are assumed to belong to a Common Shareability Domain. All read and write effects access only Normal memory locations in a Common Shareability Domain, and excludes the situations described in Mismatched memory attributes on page B2-205.

        The observers discussed here all belong to the same shareability domain, and all accesses are to Normal memory. Shareability domains are covered later.

      • Location

        A Location is a byte that is associated with an address in the physical address space.

        Note: It is expected that an operating system will present the illusion to the application programmer that is consistent with a location also being considered as a byte that is associated with an address in the virtual address space.

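        A Location is a byte associated with a physical address. The operating system can transparently present the programmer with bytes addressed by virtual addresses instead.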

      • Effects

        The Effects of an instruction can be:

        • Register effects.
        • Memory effects.
        • Barrier effects.
        • Tag effects.
        • Branching effects.

        The effects of an instruction I1 are said to appear in program order before the effects of an instruction I2 if and only if I1 occurs before I2 in the order specified by the program. Each effect generated by an instruction has a unique identifier, which characterizes it amongst the events generated by the same instruction.

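        An Effect is what an instruction does: it can read or write registers, read or write memory, act as a barrier, produce a Tag effect (I have not looked at this one closely), make a branch decision, and so on.

        If I1 comes before I2 in program order, then the effects of I1 also appear in program order before the effects of I2.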

      • Register effect

        The Register effects of an instruction are register reads or register writes of that instruction. For an instruction that accesses registers, a register read effect is generated for each register read by the instruction and a register write effect is generated for each register written by the instruction. An instruction may generate both read and write Register effects.

        A Register effect is a register read or write performed by an instruction. One instruction can generate both read and write effects.

      • Memory effect

        The Memory effects of an instruction are the memory reads or writes generated by that instruction. For an instruction that accesses memory, a memory read effect is generated for each Location read by the instruction and a memory write effect is generated for each Location written by the instruction. An instruction may generate both read and write Memory effects.

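        A Memory effect is a memory read or write performed by an instruction. One instruction can generate both read and write effects, for example an atomic read-modify-write instruction such as LDADD.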
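
As a small illustration of one operation generating both a read and a write Memory effect, here is a C11 atomic fetch-and-add. The counter and the function name are hypothetical. Whether a compiler emits a single LDADD for it depends on the target (the LSE atomics of Armv8.1 and later) and on compiler flags, so treat that mapping as an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* A single-instruction mapping only makes sense if 64-bit atomics are
 * lock-free on the target. */
static_assert(ATOMIC_LLONG_LOCK_FREE == 2, "64-bit atomics must be lock-free");

/* Hypothetical counter used only for this illustration. */
static _Atomic uint64_t counter = 5;

/* One read-modify-write: it reads the old value (a read effect) and writes
 * old + delta (a write effect) to the same Locations. */
static uint64_t add_and_return_old(uint64_t delta)
{
    return atomic_fetch_add_explicit(&counter, delta, memory_order_relaxed);
}
```

The return value is what the read effect observed; the updated counter is the result of the write effect.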

      • Reads-from

        The Reads-from relation couples memory read and write effects to the same Location such that each memory read effect is paired with exactly one memory write effect in the execution of a program. A memory read effect R2 from a Location Reads-from a memory write effect W1 to the same Location if and only if R2 takes its data from W1.

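        Reads-from means that if a read effect R2 takes its data from a write effect W1, then R2 Reads-from W1. For example, a load that returns the value a store instruction wrote to memory Reads-from that store.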

      • Local read successor

        A memory read effect R2 of a Location is the Local read successor of a memory write effect W1 from the same Observer to the same Location if and only if W1 appears in program order before R2 and there is not a memory write effect W3 from the same Observer to the same Location appearing in program order between W1 and R2.

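        For a read and a write by the same observer to the same Location: if W1 comes before R2 in program order and the same observer has no other write to that Location between them, then R2 is the Local read successor of W1.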

      • Local write successor

        A memory write effect W2 of a Location is a Local write successor of a memory read or write effect RW1 from the same Observer to the same Location if and only if RW1 appears in program order before W2.

        For accesses by the same observer to the same Location: if a write W2 comes after a read or write RW1 in program order, then W2 is a Local write successor of RW1.

      • Coherence order

        There is a per-location Coherence order relation that provides a total order over all memory write effects from all coherent Observers to that Location, starting with a notional memory write effect of the initial value. The Coherence order of a Location represents the order in which memory write effects to the Location arrive at memory.

        The Coherence order of a Location is the total order in which the writes from all coherent observers to that Location arrive at memory.

      • Coherence-after

        A memory write effect W2 to a Location is Coherence-after another memory write effect W1 to the same Location if and only if W2 is sequenced after W1 in the Coherence order of the Location.

        Write W2 is Coherence-after write W1 when, for the same Location, W2 arrives after W1.

        A memory write effect W2 to a Location is Coherence-after a memory read effect R1 of the same location if and only if R1 Reads-from a memory write effect W3 to the same Location and W2 is Coherence-after W3.

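        Write W2 is Coherence-after read R1 when, for the same Location, R1 reads the value written by a write W3 that precedes W2. The order is W3 -> R1 -> W2.

        So Coherence-after says that, at a single Location, a write comes after a read or another write.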

      • Observed-by

        A memory read or write effect RW1 from an Observer is Observed-by a memory write effect W2 from a different Observer if and only if W2 is coherence-after RW1.

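        RW1 is Observed-by W2 when W2 is Coherence-after RW1, that is, W2 reaches the Location after RW1. The definition does not require W2 to immediately follow RW1; other writes may fall between them.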

        A memory write effect W1 from an Observer is Observed-by a memory read effect R2 from a different Observer if and only if R2 Reads-from W1.

        W1 is Observed-by R2 when R2 Reads-from W1, that is, R2 reads the value written by W1.

        Note: The Observed-by relation relates only Memory effects generated by different Observers.

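        Observed-by describes the order of operations by different observers on a single Location.

        Do not read "observed" as "read" here. An effect does not have to be loaded by anyone to count as observed; satisfying either of the two rules above is enough.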
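
The relations Reads-from, Coherence-after and Observed-by can be modelled for a single Location with a few lines of C. This is a sketch under my own assumptions, not anything from the spec: the struct layout and names are hypothetical, a write carries its position in the Coherence order, and a read carries the position of the write it Reads-from.

```c
#include <assert.h>
#include <stdbool.h>

enum effect_kind { EFFECT_READ, EFFECT_WRITE };

/* Hypothetical model of a Memory effect on ONE Location. */
struct effect {
    enum effect_kind kind;
    int observer;   /* id of the Observer that generated the effect */
    int co_index;   /* write: position in the Coherence order;
                       read: position of the write it Reads-from */
};

static bool reads_from(struct effect r, struct effect w)
{
    return r.kind == EFFECT_READ && w.kind == EFFECT_WRITE &&
           r.co_index == w.co_index;
}

/* W2 is Coherence-after RW1. If RW1 is a write, W2 must be later in the
 * Coherence order. If RW1 is a read of W3, W2 must be later than W3.
 * In this model both cases reduce to comparing indices. */
static bool coherence_after(struct effect w2, struct effect rw1)
{
    assert(w2.kind == EFFECT_WRITE);
    return w2.co_index > rw1.co_index;
}

/* `first` is Observed-by `second`; only defined across different Observers. */
static bool observed_by(struct effect first, struct effect second)
{
    if (first.observer == second.observer)
        return false;
    if (second.kind == EFFECT_WRITE)
        return coherence_after(second, first);
    return reads_from(second, first);
}
```

In the W3 -> R1 -> W2 example, R1 is Observed-by W2 even though nobody ever loads the value that R1 returned, which is the sense in which "observed" does not mean "read".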

      • Overlapping accesses

        Two Memory effects overlap if and only if they access the same Location. Two instructions overlap if and only if one or more of their generated Memory effects overlap.

        Two Memory effects overlap when they access the same Location (byte). Two instructions overlap when at least one Memory effect of each accesses the same Location.

    • Ordering constraints

      The Arm memory model is described as being Other-multi-copy atomic. The definition of Other-multi-copy atomic is as follows:

      • Other-multi-copy atomic

        In an Other-multi-copy atomic system, it is required that a memory write effect from an Observer, if observed by a different Observer, is then observed by all other Observers that access the Location coherently. It is, however, permitted for an Observer to observe its own writes prior to making them visible to other observers in the system.

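        ARMv8 is Other-multi-copy atomic. This requires that if a write by one observer is observed by any other observer, then all other observers must be able to observe it as well. The observer that issued the write is allowed to observe its own write before the others do.

        In hardware terms, this means a PE can return immediately after executing a write: the instruction can complete and the next one can proceed without waiting for the interconnect (CCI) to update the other caches under MOESI. If any other observer has observed the write (that is, its cache has been synchronized), then the caches of all observers must have been synchronized.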
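
One way to picture Other-multi-copy atomicity is a toy model in which every PE has a private one-entry store buffer in front of a single shared copy of the Location. Everything here (the struct names, the one-entry buffer, NUM_PES) is a hypothetical simplification for illustration, not a description of real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PES 3   /* hypothetical system size */

struct pe {
    bool pending;        /* a write is still in this PE's store buffer */
    uint32_t buffered;   /* value of that pending write */
};

struct system_model {
    uint32_t shared;             /* the Location as all other PEs see it */
    struct pe pes[NUM_PES];
};

/* The write completes locally at once; nobody else can see it yet. */
static void pe_write(struct system_model *s, int pe, uint32_t value)
{
    assert(pe >= 0 && pe < NUM_PES);
    s->pes[pe].pending = true;
    s->pes[pe].buffered = value;
}

/* A PE may observe its own pending write early (store forwarding). */
static uint32_t pe_read(const struct system_model *s, int pe)
{
    assert(pe >= 0 && pe < NUM_PES);
    return s->pes[pe].pending ? s->pes[pe].buffered : s->shared;
}

/* Draining makes the write visible to ALL other PEs in one step. */
static void pe_drain(struct system_model *s, int pe)
{
    assert(pe >= 0 && pe < NUM_PES);
    if (s->pes[pe].pending) {
        s->shared = s->pes[pe].buffered;
        s->pes[pe].pending = false;
    }
}
```

After `pe_write(&s, 0, 7)`, PE 0 reads 7 while PE 1 and PE 2 still read the old value. After `pe_drain(&s, 0)` both of them read 7; there is no state in which one of them sees the write and the other does not.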

    • Memory barriers

      Memory barrier is the general term applied to an instruction, or sequence of instructions, that forces synchronization events by a PE with respect to retiring load/store instructions. The memory barriers defined by the Arm architecture provide a range of functionality, including:

      • Ordering of load/store instructions.
      • Completion of load/store instructions.
      • Context synchronization.

      • Instruction Synchronization Barrier (ISB)

        An ISB instruction ensures that all instructions that come after the ISB instruction in program order are fetched from the cache or memory after the ISB instruction has completed. Using an ISB ensures that the effects of context-changing operations executed before the ISB are visible to the instructions fetched after the ISB instruction. Examples of context-changing operations that require the insertion of an ISB instruction to ensure the effects of the operation are visible to instructions fetched after the ISB instruction are:

        • Completed cache and TLB maintenance instructions.
        • Changes to System registers.

        Any context-changing operations appearing in program order after the ISB instruction take effect only after the ISB has been executed.

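        An ISB guarantees that, once it completes, all later instructions are fetched again from cache or memory. In hardware terms it is like flushing the CPU pipeline and restarting from the instruction after the ISB. ISB ensures that the effects of cache and TLB maintenance and of System register changes actually apply to the instructions after it. After a process switch, cache maintenance, an MMU operation, or a System register change, an ISB is often needed so that the following instructions execute in the context those operations established.

        Many blog posts claim that ISB is a stricter barrier than DSB. That is wrong. ISB and DSB are different concepts with different purposes: DSB does things ISB cannot, and ISB does things DSB cannot.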

      • Data Memory Barrier (DMB)

        The DMB instruction is a memory barrier instruction that ensures the relative order of memory accesses before the barrier with memory accesses after the barrier. The DMB instruction does not ensure the completion of any of the memory accesses for which it ensures relative order.

        The basic principle of a DMB instruction is to introduce order between memory accesses that are specified to be affected by the DMB options supplied as arguments to the DMB instruction. The DMB instruction ensures that all affected memory accesses by the PE executing the DMB instruction that appear in program order before the DMB instruction and those which originate from a different PE, to the extent required by the DMB options, which have been Observed-by the PE before the DMB instruction is executed, are Observed-by each PE, to the extent required by the DMB options, before any affected memory accesses that appear in program order after the DMB instruction are Observed-by that PE.

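        DMB ensures that load, store, and cache maintenance instructions before and after it are not reordered across it. DMB only guarantees order; it does not guarantee that the earlier instructions have completed.

        The DMB instruction takes an option that selects the access types it orders and the domain over which the ordering holds. Take DMB ISHST: ISH means Inner Shareable and ST means only store-to-store order is enforced. This instruction guarantees that every other PE in the Inner Shareable domain observes the stores before the DMB before it observes any store after the DMB. Domains are covered later.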
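
A typical use of DMB ISHST is message passing: publish data, then set a flag. The sketch below is a hedged illustration. All names are hypothetical; on non-AArch64 targets it falls back to C11 fences, which I assume are at least as strong for this purpose; and the reader-side DMB ISHLD is my addition, since the reader also needs its two loads ordered.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Store-store barrier over the Inner Shareable domain. */
static inline void barrier_store_store(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dmb ishst" ::: "memory");
#else
    atomic_thread_fence(memory_order_release);   /* portable stand-in */
#endif
}

/* Orders earlier loads against later loads and stores. */
static inline void barrier_load_load(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dmb ishld" ::: "memory");
#else
    atomic_thread_fence(memory_order_acquire);   /* portable stand-in */
#endif
}

static _Atomic uint32_t data;   /* payload */
static _Atomic uint32_t flag;   /* 1 once the payload is published */

/* Runs on the writer PE: the data store must be observed before the flag. */
static void producer(uint32_t value)
{
    atomic_store_explicit(&data, value, memory_order_relaxed);
    barrier_store_store();
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

/* Runs on a reader PE: returns 1 and fills *out once the flag is visible. */
static int consumer_try(uint32_t *out)
{
    assert(out != NULL);
    if (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        return 0;
    barrier_load_load();
    *out = atomic_load_explicit(&data, memory_order_relaxed);
    return 1;
}
```

Without the barrier in `producer`, another PE could observe `flag == 1` and still read the old `data`, because Normal memory stores to different locations may be reordered.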

      • Data Synchronization Barrier (DSB)

        A DSB instruction is a memory barrier that ensures that memory accesses that occur before the DSB instruction have completed before the completion of the DSB instruction. In doing this, it acts as a stronger barrier than a DMB and all ordering that is created by a DMB with specific options is also generated by a DSB with the same options.

        A DSB instruction executed by a PE, PEe, completes when all of the following apply:

        In addition, no instruction that appears in program order after the DSB instruction can alter any state of the system or perform any part of its functionality until the DSB completes other than:

        • Being fetched from memory and decoded.
        • Reading the general-purpose, SIMD and floating-point, SVE vector or predicate, Special-purpose, or System registers that are directly or indirectly read without causing side-effects.
        • If FEAT_ETS is not implemented, having any virtual addresses of loads and stores translated.

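        DSB is a stronger barrier than DMB, and it accepts the same options with the same effect. It is stronger in two ways:

        • The DSB does not complete until the load, store, and cache maintenance operations before it have completed. In effect it cannot complete until other PEs can observe those operations.
        • No instruction after the DSB in program order that would change system state executes before the DSB completes. As I understand it, such instructions cannot be reordered ahead of the DSB.

      • Shareability and access limitations on the data barrier operations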

        The DMB and DSB instructions take an argument that specifies:

        • The shareability domain over which the instruction must operate. This is one of:

          • Full system.
          • Outer Shareable.
          • Inner Shareable.
          • Non-shareable.

        Full system applies to all the observers in the system and, as such, encompasses the Inner and Outer Shareable domains of the processor.

        • The accesses for which the instruction operates. This is one of:

          • Read and write accesses, both before and after the barrier instruction.
          • Write accesses only, before and after the barrier instruction.
          • Read accesses before the barrier instruction, and read and write accesses after the barrier instruction.

        The option after DMB or DSB determines the scope of the barrier; see table2-1 in the ARMv8-A ARM. There are 3 access types x 4 domains = 12 cases. Shareability domains are discussed further below.
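
The 12 combinations map onto the 4-bit CRm field of the A64 DMB and DSB encodings. The sketch below reflects my understanding of that encoding (domain in CRm[3:2], access type in CRm[1:0]); the enum and function names are hypothetical, and the values should be checked against the A64 instruction descriptions before being relied on.

```c
#include <assert.h>
#include <stdint.h>

/* CRm[3:2]: shareability domain of the barrier. */
enum barrier_domain {
    DOMAIN_OSH = 0,   /* Outer Shareable */
    DOMAIN_NSH = 1,   /* Non-shareable   */
    DOMAIN_ISH = 2,   /* Inner Shareable */
    DOMAIN_SY  = 3    /* Full system     */
};

/* CRm[1:0]: which accesses the barrier applies to. */
enum barrier_access {
    ACCESS_LD  = 1,   /* reads before; reads and writes after (e.g. ISHLD) */
    ACCESS_ST  = 2,   /* writes before and after (e.g. ISHST)              */
    ACCESS_ALL = 3    /* reads and writes before and after (e.g. ISH)      */
};

static uint32_t barrier_crm(enum barrier_domain d, enum barrier_access a)
{
    assert(a >= ACCESS_LD && a <= ACCESS_ALL);
    return ((uint32_t)d << 2) | (uint32_t)a;
}

/* 32-bit A64 encodings with CRm placed in bits [11:8]. */
static uint32_t dmb_opcode(enum barrier_domain d, enum barrier_access a)
{
    return UINT32_C(0xD50330BF) | (barrier_crm(d, a) << 8);
}

static uint32_t dsb_opcode(enum barrier_domain d, enum barrier_access a)
{
    return UINT32_C(0xD503309F) | (barrier_crm(d, a) << 8);
}
```

For example, `dmb_opcode(DOMAIN_ISH, ACCESS_ST)` yields 0xD5033ABF, the encoding of DMB ISHST, and `dsb_opcode(DOMAIN_SY, ACCESS_ALL)` yields 0xD5033F9F, the encoding of DSB SY.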

  • Memory types and attributes

    ARMv8 divides memory into two types, Normal memory and Device memory. Roughly speaking, they correspond to RAM and peripherals.

    • Normal memory

      The Normal memory type attribute applies to most memory in a system. It indicates that the hardware is permitted by the architecture to perform Speculative data read accesses to these locations, regardless of the access permissions for these locations.

      For accesses to Normal memory, a DMB instruction is required to ensure the required ordering.

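      Hardware is allowed to access Normal memory speculatively, which means a read of Normal memory has no side effects. A write to Normal memory that stores the value already present has no side effects either. Many peripheral registers cannot satisfy this property. Unaligned accesses to Normal memory are allowed, and accesses to Normal memory can be merged: for example, writes to two adjacent array elements can be combined into one write for efficiency. Accesses to Normal memory may also be reordered; when a particular order is required, add a DMB or DSB barrier.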

      • Shareable Normal memory

        A Normal memory location has a Shareability attribute that is one of:

        • Inner Shareable, meaning it applies across the Inner Shareable shareability domain.
        • Outer Shareable, meaning it applies across both the Inner Shareable and the Outer Shareable shareability domains.
        • Non-shareable.

        • Shareable, Inner Shareable, and Outer Shareable Normal memory

          The Arm architecture abstracts the system as a series of Inner and Outer Shareability domains.

          Each Inner Shareability domain contains a set of observers that are data coherent for each member of that set for data accesses with the Inner Shareable attribute made by any member of that set.

          Each Outer Shareability domain contains a set of observers that are data coherent for each member of that set for data accesses with the Outer Shareable attribute made by any member of that set.

          The following properties also hold:

          • Each observer is a member of only a single Inner Shareability domain.
          • Each observer is a member of only a single Outer Shareability domain.
          • All observers in an Inner Shareability domain are always members of the same Outer Shareability domain. This means that an Inner Shareability domain is a subset of an Outer Shareability domain, although it is not required to be a proper subset.

          Note:
          Because all data accesses to Non-cacheable locations are data coherent to all observers, Non-cacheable locations are always treated as Outer Shareable.
          The Inner Shareable domain is expected to be the set of PEs controlled by a single hypervisor or operating system.

          The details of the use of the shareability attributes are system-specific.

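        Normal memory has a shareability attribute, which can be Inner Shareable, Outer Shareable, or Non-shareable. It is configured through the attributes in the MMU translation tables and works together with the domains defined by the hardware.

        Put simply, an SoC design divides the system into the following four shareability domains:

        • Non-shareable domain: usually a single core forms a non-shareable domain.
        • Inner Shareable domain: usually all cores are placed in one Inner Shareable domain. There can be more than one, for example when the CPU cores are split into two groups that run different operating systems.
        • Outer Shareable domain: usually cache-coherent peripherals (such as a GPU attached to the CCI) are placed in the Outer Shareable domain together with the CPUs.
        • System domain: the whole system.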

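        PE memory accesses then behave as follows:

        • When a CPU accesses Non-shareable memory, the hardware does not need to broadcast anything to keep caches coherent.
        • When a CPU accesses Inner Shareable memory, the hardware uses the bus and the coherence protocol to keep all observers in the Inner Shareable domain coherent.
        • When a CPU accesses Outer Shareable memory, the hardware uses the bus and the coherence protocol to keep all observers in the Outer Shareable domain coherent.

        The hardware mentioned above is usually the CCI and the snoop logic inside it, which uses MOESI to keep the caches of all observers coherent.

        The DMB and DSB instructions described earlier can also name a shareability domain, which makes the data synchronization apply to that domain.

        Also, PEs running the same operating system are generally placed in the same Inner Shareable domain.

        The system is split into domains mainly for better performance and lower power, because unnecessary coherence operations are avoided.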
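
As an illustration of configuring shareability through the translation tables, the sketch below sets the SH field of a descriptor. It assumes the VMSAv8-64 stage 1 block/page descriptor layout with SH in bits [9:8] (which, as far as I know, holds when 52-bit address extensions do not repurpose those bits); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* SH[1:0] encodings; 0x1 is reserved. */
enum shareability {
    SH_NON_SHAREABLE   = 0x0,
    SH_OUTER_SHAREABLE = 0x2,
    SH_INNER_SHAREABLE = 0x3
};

#define DESC_SH_SHIFT 8
#define DESC_SH_MASK  (UINT64_C(3) << DESC_SH_SHIFT)

/* Return `desc` with its SH field replaced by `sh`. */
static uint64_t desc_set_shareability(uint64_t desc, enum shareability sh)
{
    assert(sh == SH_NON_SHAREABLE || sh == SH_OUTER_SHAREABLE ||
           sh == SH_INNER_SHAREABLE);
    return (desc & ~DESC_SH_MASK) | ((uint64_t)sh << DESC_SH_SHIFT);
}

/* Extract the SH field from a descriptor. */
static enum shareability desc_get_shareability(uint64_t desc)
{
    return (enum shareability)((desc & DESC_SH_MASK) >> DESC_SH_SHIFT);
}
```

An operating system that runs all its cores in one Inner Shareable domain would typically mark ordinary RAM mappings as Inner Shareable in this way.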

      • Cacheability attributes for Normal memory

        In addition to being Outer Shareable, Inner Shareable or Non-shareable, each region of Normal memory is assigned a Cacheability attribute that is one of:

        • Write-Through Cacheable.
        • Write-Back Cacheable.
        • Non-cacheable.

        Also, for Write-Through Cacheable and Write-Back Cacheable Normal memory regions:

        • A region might be assigned cache allocation hints for read and write accesses.
        • It is IMPLEMENTATION DEFINED whether the cache allocation hints can have an additional attribute of Transient or Non-transient.

        Normal memory has a cacheability attribute, cacheable or non-cacheable. For cacheable memory, the cache update policies are:

        • write-back: a store writes the data only to the cache.
        • write-through: a store writes the data to both the cache and memory.

        The cache allocation policies are:

        • read-allocate: a read miss causes a cache line fill.
        • write-allocate: a write miss causes a cache line fill.
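
These policies end up as an 8-bit attribute in the MAIR_ELx register, with one 4-bit field for the outer cache level and one for the inner level. The sketch below follows my understanding of the Non-transient encodings only; the function and enum names are hypothetical, and Transient variants are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum cache_policy { NON_CACHEABLE, WRITE_THROUGH, WRITE_BACK };

/* 4-bit field for one cache level (Non-transient variants only):
 * 0b0100 Non-cacheable, 0b10RW Write-Through, 0b11RW Write-Back,
 * where R = read-allocate and W = write-allocate. */
static uint8_t mair_nibble(enum cache_policy p, bool read_alloc,
                           bool write_alloc)
{
    uint8_t rw = (uint8_t)((read_alloc ? 2 : 0) | (write_alloc ? 1 : 0));
    switch (p) {
    case NON_CACHEABLE: return 0x4;
    case WRITE_THROUGH: return (uint8_t)(0x8 | rw);
    case WRITE_BACK:    return (uint8_t)(0xC | rw);
    }
    assert(!"unknown cache policy");
    return 0;
}

/* Combine outer (bits [7:4]) and inner (bits [3:0]) fields. */
static uint8_t mair_normal_attr(uint8_t outer, uint8_t inner)
{
    assert(outer <= 0xF && inner <= 0xF);
    return (uint8_t)((outer << 4) | inner);
}
```

Write-Back with read-allocate and write-allocate at both levels gives 0xFF, and Non-cacheable at both levels gives 0x44.
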
    • Device memory

      The Device memory type attributes define memory locations where an access to the location can cause side-effects, or where the value returned for a load can vary depending on the number of loads performed. Typically, the Device memory attributes are used for memory-mapped peripherals and similar locations.

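      Accesses to a region defined as Device memory can cause side effects, and repeated loads from the same location can return different values. These properties closely match peripheral registers, so Device memory is normally used as the mapping attribute for peripherals.

      Device memory has the following properties:

      • Speculative accesses are not allowed.
      • Data accesses are coherent for all observers in the system, so the memory can be treated as Outer Shareable.
      • Device memory cannot be cached.
      • See the ARMv8-A ARM for more details…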

      The Armv8 Device memory types are:

      • Device-nGnRnE Device non-Gathering, non-Reordering, No Early Write Acknowledgement. Equivalent to the Strongly-ordered memory type in earlier versions of the architecture.
      • Device-nGnRE Device non-Gathering, non-Reordering, Early Write Acknowledgement. Equivalent to the Device memory type in earlier versions of the architecture.
      • Device-nGRE Device non-Gathering, Reordering, Early Write Acknowledgement. Armv8 adds this memory type to the translation table formats found in earlier versions of the architecture. The use of barriers is required to order accesses to Device-nGRE memory.
      • Device-GRE Device Gathering, Reordering, Early Write Acknowledgement. Armv8 adds this memory type to the translation table formats found in earlier versions of the architecture. Device-GRE memory has the fewest constraints. It behaves similar to Normal memory, with the restriction that Speculative accesses to Device-GRE memory is forbidden.

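      The Device memory types are Device-nGnRnE (equivalent to Strongly-ordered in ARMv7), Device-nGnRE (equivalent to ARMv7 Device memory), Device-nGRE, and Device-GRE.

      The G in nG/G stands for Gathering: whether several writes to consecutive addresses, or several writes to the same address, may be merged into one write.

      The R in nR/R stands for Reordering: whether accesses to this type of address may be reordered the way Normal memory accesses are.

      The E in nE/E stands for Early Write Acknowledgement, which is comparable to a posted write: a buffer on the bus may acknowledge the write, and the write then counts as complete. nE is comparable to a non-posted write: the write counts as complete only when the actual destination acknowledges it. So when a PE executes a DSB, the DSB completes only after the write has reached its endpoint (for example, has actually been written to the peripheral register).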
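
The four Device types and their G/R/E properties can be written down as a small table in code. The enum values below are the MAIR_ELx attribute encodings as I understand them (0x00, 0x04, 0x08, 0x0C); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed MAIR_ELx attribute byte for each Device type: 0b0000dd00. */
enum device_type {
    DEVICE_nGnRnE = 0x00,
    DEVICE_nGnRE  = 0x04,
    DEVICE_nGRE   = 0x08,
    DEVICE_GRE    = 0x0C
};

struct device_attrs {
    bool gathering;    /* G: accesses may be merged                  */
    bool reordering;   /* R: accesses may be reordered               */
    bool early_ack;    /* E: an intermediate buffer may ack a write  */
};

/* Each step up the list relaxes exactly one more constraint. */
static struct device_attrs device_attrs_of(enum device_type t)
{
    assert(((unsigned)t & ~0x0Cu) == 0);
    unsigned level = (unsigned)t >> 2;   /* 0 = nGnRnE ... 3 = GRE */
    struct device_attrs a;
    a.early_ack  = level >= 1;
    a.reordering = level >= 2;
    a.gathering  = level >= 3;
    return a;
}
```

A driver that needs a DSB to wait until a register write has really reached the peripheral would map that register as Device-nGnRnE, the only type for which `early_ack` is false.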