(ALMOST) FENCE-LESS PERSIST ORDERING

A Thesis in
Computer Science and Engineering
by
Sara Mahdizadeh Shahri

© 2020 Sara Mahdizadeh Shahri

Submitted in Partial Fulfillment
of the Requirements
for the Degree of

Master of Science

December 2020
The thesis of Sara Mahdizadeh Shahri was reviewed and approved by the following:

Aasheesh Kolli  
Assistant Professor of Computer Science and Engineering  
Thesis Advisor

Vijaykrishnan Narayanan  
A. Robert Noll Chair of Computer Science and Engineering

Anand Sivasubramaniam  
Distinguished Professor of Computer Science and Engineering

Chitaranjan Das  
Distinguished Professor of Computer Science and Engineering  
Department Head of Computer Science and Engineering
Abstract

The semantics and implementation of a memory persistency model can significantly impact the performance achieved on persistent memory systems. The only commercially available and widely used x86 persistency model causes significant performance losses by requiring redundant, expensive fence operations for commonly used undo logging programming patterns. In this work, we propose light-weight extensions to the x86 persistency model to provide some ordering guarantees without an intervening fence operation. Our extension, Themis, eliminates over 91.7% of the fence operations in undo-logging PM programs and improves average performance by 45.8% while incurring only 1.2% increase in data cache size.
# Table of Contents

List of Figures vi
List of Tables viii
Acknowledgments ix

Chapter 1
Introduction 1

Chapter 2
Background and Motivation 4
  2.1 Crash Consistency 4
  2.2 Memory Persistency 6
    2.2.1 x86 Persistency Model 6
  2.3 The problem 8
  2.4 Key insight 10
  2.5 Beyond Undo Logging 10

Chapter 3
Themis 12
  3.1 Design Goals 12
  3.2 Persistency Model 13
  3.3 High-level Design 15
  3.4 Implementation 16
    3.4.1 Write Combining Buffer 16
    3.4.2 L1 Data Cache 18
    3.4.3 Writeback Buffer 18
  3.5 Discussion 19

Chapter 4
Evaluation 22
  4.1 Methodology 22
  4.2 Performance Comparison 25
  4.3 Performance analysis 26
<table>
<thead>
<tr>
<th>Section</th>
<th>Title</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>4.3.1</td>
<td>Implications of Themis restrictions</td>
<td>26</td>
</tr>
<tr>
<td>4.3.2</td>
<td>Hardware overheads</td>
<td>28</td>
</tr>
<tr>
<td>4.3.3</td>
<td>Sensitivity to transaction sizes</td>
<td>28</td>
</tr>
</tbody>
</table>

Chapter 5

**Related Work**

| 5.1     | Software-based Solutions                   | 29   |
| 5.2     | Hardware-based Solutions                    | 30   |

Chapter 6

**conclusion**

<table>
<thead>
<tr>
<th>Bibliography</th>
<th>32</th>
</tr>
</thead>
</table>
## List of Figures

2.1 Bank balance transfer example: (a) w/o crash consistency, (b) w/ crash consistency using undo logging. ......................................................... 5

2.2 Relative order of undo logs and the actual balance updates for the bank balance transfer: (a) In the ideal case we only enforce order between undo log and its corresponding write data. (b) Using fence as an ordering semantics leads to a more conservative ordering. Black arrows represent necessary dependencies and red arrows the unnecessary dependencies. . . . 8

2.3 Ideal Undo logging performance improvement by removing \texttt{sfence} . . . 9

3.1 High-level system architecture with Themis . . . . . . . . . . . . . . . . . 15

3.2 Major events in WCB: (a) incoming non-temporal store, (b) WCB entry is drained, (c) ACK received from PM controller, (d) an overflow of tail pointer occurs. ................................................................. 17

3.3 Important events in L1 data cache with Themis: (a) incoming temporal store, (b) evicting a dirty cacheline, (c) receiving an overflow notification from core. ................................................................. 18

3.4 Important events in WBB with Themis: (a) Incoming writeback request with tail pointer greater than hAck', (b) Incoming writeback request with tail pointer lower or equal to hAck', (c) Receiving headACK update from WCB. ................................................................. 19

4.1 Number of stores per transaction distribution . . . . . . . . . . . . . . . . . 24

4.2 Speedup for Baseline, Themis, and Ideal. ............................................ 25

4.3 Number of instructions between two consecutive fences that are removed by Themis. ................................................................. 27
4.4 Distribution of number of cycles a writeback waits in WBB because of Themis’ restrictions. ................................................................. 27

4.5 Speedup for different number of stores per transaction (Geo-Mean) . . . 27
List of Tables

4.1 Simulation Configuration. .................................................. 23

4.2 Workload descriptions. Workloads with an * were only run on real hardware in §2. ...................................................... 23
The completion of my dissertation would not have been possible without the support and nurturing of my advisor, Prof. Aasheesh Kolli. I wish to express my sincere appreciation to him for his insightful guidance and encouraging me to believe in myself.

I wish to acknowledge the support of my lab-mates at Penn State, especially Talha who always helped me in all aspects of my research.

I would like to extend my sincere thanks to my family, my husband, Armin; my parents; and my brother, Ali for their unconditional love and sacrifice.
Chapter 1
Introduction

Persistent Memory (PM) technologies blur the difference between main memory technologies (e.g. DRAM) and storage technologies (e.g. SSD) [1–6]. They offer non-volatility, byte-addressability, and performance close to that of DRAM. Owing to these characteristics, PMs are expected to be placed on the memory bus, accessed using processor load and store instructions, and host recoverable data structures (e.g., databases, file systems) [4,7–23].

Memory persistency models [14] are extensions to the memory consistency models that enable programmers to express the order in which different stores update persistent memory. These ordering guarantees are necessary to develop any kind of recoverable storage software (e.g., file systems and databases). The hardware is responsible for enforcing this prescribed order, even in the presence of volatile structures like processor caches. In this work, we specifically focus on applications that use the popular undo logging techniques [10,16,17,24–27] to ensure correct recovery.

The semantics and implementation of a memory persistency model plays a significant role in the overall performance achieved by the applications [7]. The x86 persistency model is the only commercially available and widely used persistency model. In this work, we show that the x86 persistency model results in unnecessary performance losses due to the redundant use of expensive \texttt{sfence} instructions.

The x86 persistency model is unique in that it allows programmers to use two paths to persistence. The commonly used temporal path where a memory location is updated first in the cache and then written back to persistence through successive writebacks in the different levels of processor caches. The temporal path can be exercised using instructions like cacheline writeback (\texttt{clwb}) [28]. The non-temporal path is where a memory location is updated using cache-bypassing store instructions and results in the update going directly from the core to the memory controller. The non-temporal path
can be exercised using instructions like *movnt* [29].

For updates sent along either path, the current x86 persistency model requires using an intervening *sfence* instruction to ensure that the updates reach persistence in order [7, 13, 14, 17, 19, 26, 29–37]. Correctly ordering updates to persistence is the bedrock on which crash consistency can be achieved. For example, when using undo logging, the log updates must persist before any data updates persist to ensure proper recovery [24]. However, *sfence* instructions are expensive as they preclude coalescing, buffering, and reordering of memory accesses and hamper out-of-order execution due to their ordering semantics [7, 27]. So, the frequent use of *sfence* to ensure crash consistency results in significant performance losses on modern x86 processors.

In this work, we show that for certain kinds of updates, current x86 architectures already implicitly achieve the desired ordering and the use of an explicit *sfence* instruction to enforce the order already achieved is superfluous. Specifically, when one update is sent along the non-temporal (or cache bypassing) path and a subsequent update is sent along the temporal path (or through the caches), it is very likely that the non-temporal update persists before the temporal update. This behaviour is caused by the fact that the update along the non-temporal path skips past the many levels of caches and heads directly to the memory controller where as the update along the temporal path is retained at various cache levels for reuse and has to undergo successive writebacks at the various levels of caches to reach persistence.

Furthermore, we observe that in the widely used undo logging approach to crash consistency, a frequently occurring programming pattern is to require a non-temporal update reach persistence before a subsequent temporal update. This pattern arises because of the requirement that within an undo logging transaction, a log update reach persistence before the corresponding data updates. And, logs are usually updated using the non-temporal path to avoid polluting the caches with logs that will never be read during failure-free execution [17, 19, 38–41]. In fact, in different undo logging PM workloads we studied (details in §4), more than 91% of the *sfence* instructions used are specifically to enforce this exact kind of ordering. Additionally, we observed that this specific ordering is also widely employed in the state-of-the-art iDO logging [33]. Hence, we can significantly improve PM application performance if this kind of ordering can be achieved *without* the use of an *sfence*.

In this work, we propose *Themis* ¹, an extension to the x86 memory persistency model and implementation that guarantees a non-temporal update persists before all

¹Themis is an ancient Greek Titaness personifying divine order [42]
subsequent temporal updates issued on the same core, without requiring an intervening `sfence`. Themis leverages the fact that non-temporal updates likely already persist before subsequent temporal updates (as the non-temporal path is much faster than the temporal path). So, in the common case, the required ordering guarantee is achieved at virtually no cost. However, in the uncommon case where a non-temporal update is slowed down (due to any reason), Themis builds mechanisms to detect such cases and enforce the correct ordering by delaying the writeback of subsequent temporal updates.

With Themis and its fence-less ordering guarantee, we observe performance improvements as high as 87% and an average improvement of 45.8% over the baseline x86 persistency model implementation for a suite of PM workloads. Furthermore, Themis incurs only a 1.2% increase in the size of the data caches (to store extra state) and requires no modifications to the cache coherence or memory consistency model implementations. These light-weight changes are in contrast to other persistency model implementations [7,15] and make Themis much more adoptable in commercial multi-core processors. In summary, this work:

- Highlights that x86 processors can provide a fence-less ordering guarantee between a non-temporal update to a PM location and a subsequent temporal update at virtually no cost.

- Shows that such ordering primitives can be especially useful for undo-logging based PM programs and that over 91% of `sfence` instructions in them can be eliminated with this primitive.

- Presents the design and implementation of Themis, a light-weight extension to the x86 persistency model that leverages above insights to improve average performance by 45.8%.
Chapter 2  
Background and Motivation

2.1 Crash Consistency

Crash consistency refers to the guarantee provided by storage software (e.g., file systems and databases) that data is in a consistent and recoverable state even in the presence of system failures [43] like power interruptions, kernel crashes, etc. For example, consider the bank balance transfer transaction in Figure 2.1 (a), where the objective is to transfer $50 from Alice’s account to Bob’s account. This transaction deducts $50 from Alice’s balance and adds the same amount to Bob’s. To ensure that this transaction is correctly executed even in the presence of failures, first Alice’s and Bob’s account balances have to be stored in persistent locations. However, storing the balance information in persistent locations is not enough.

If a crash happens after the $50 were deducted from Alice’s account but before the amount gets added to Bob’s, on recovery, we would notice a net $50 loss in the overall balance of Alice and Bob, i.e., $50 vanishes from the system. In order to avoid such situations, storage systems (like databases and file systems) provide all-or-nothing guarantees, i.e., either both the operations on Alice’s and Bob’s balances happen or neither will. Developers can use these all-or-nothing guarantees to make their application crash consistent. These all-or-nothing guarantees can be provided at different granularity like transactions [10,11,27,44], failure-atomic sections (FASEs) [24,33,45–47], or synchronization-free regions (SFRs) [16,48], each bringing its own sets of trade-offs [36]. Finally, there are several techniques like undo/redo logging [7,10,11,16,17,24–27,32,45,46,49–52], iDO [33], shadow paging [9,44,53], checkpointing [30,54,55], and custom data structures [56–60] that developers can use to achieve these all-or-nothing guarantees and hence crash consistency. In this paper, we are going to limit ourselves to the popular undo logging approaches to crash consistency.
Undo logging [10, 16, 17, 24–27] is a crash consistency technique that provides all-or-nothing guarantees by undo-ing (or rolling back) changes from an aborted transaction. To be able to roll back changes, undo logging systems create an undo log entry prior to every update performed within the transaction. The undo log entry contains the current value of the memory location/variable that is being updated. Once the log entry has been created and persisted, only then is the actual memory location/variable updated. If a transaction succeeds, all the memory locations modified within the transaction are persisted and then a commit message is atomically persisted to the log to invalidate the log entries belonging to the transaction. If a transaction fails, the recovery process uses all valid log entries to roll back partial changes from that transaction.

For example, Figure 2.1 (b) shows how to make the balance transfer transaction crash consistent. At the start of the transaction a new log area is prepared (Line 2) for the transaction. Then before Alice’s and Bob’s balances are updated (Lines 4 and 6 respectively), the old values are inserted into the log (Lines 3 and 5 respectively). Finally, after both the updates are done only then is the log committed, essentially invalidating the log entries. So, if a crash occurs in the middle of the transaction, the recovery process can use the log entries to roll back the changes from the aborted transaction, guaranteeing all-or-nothing semantics. Note that updating the logs with current balances before updating the actual balance is the most critical requirement of undo logging.
2.2 Memory Persistency

Persistent Memory (PM) technologies straddle the divide between main memory technologies (e.g. DRAM) and storage technologies (e.g. SSD) [1–5]. They offer non-volatility, byte-addressability, and performance close to that of DRAM while also providing a load-store interface to storage. PMs are expected to host recoverable data structures (e.g., databases, file systems) [4,7–23]. When used to store recoverable data structures, PM systems are expected to provide guarantees on the order in which stores reach the persistent domain, in other words a memory persistency model [14]. Persistency models provide primitives that programmers can use to communicate the desired order of persists to the hardware. It is the responsibility of the hardware to ensure that the specified order of persists is enforced. In this work, we focus on the widely used x86 memory persistency model.

2.2.1 x86 Persistency Model

Studying and improving the x86 persistency model is particularly important as it is the only persistency model that is commercially available and widely used. The x86 persistency model is also unique in that it provides two paths to persistence: (i) a temporal path via the regular cache hierarchy and (ii) a non-temporal path or a cache-bypassing path directly from the core to the memory controller. The ISA provides mechanisms to order persists sent along either paths. Next, we provide more details about the persistence domain used and how to achieve the desired order of persists in x86.

Persistence domain: x86 persistency model requires Asynchronous DRAM Refresh (ADR) support [28,61,62], essentially making the memory controllers persistent apart from the PM devices. So, an update is considered persistent as soon as it reaches a memory controller.

Temporal $\rightarrow$ temporal ordering: To order two PM updates (say for location A and then B) while both updates are sent along the temporal path (using regular store (st) instructions), the code sequence required is:

\[
\text{st A; clwb A; sfence; st B;}
\]

The CacheLine WriteBack (clwb) [28] instruction writes back the cacheline with location A back to memory controller (and hence the persistence domain) and the subsequent sfence ensures that the writeback operation is complete before the store to B is performed.
at the L1 data cache. This sequence ensures that temporal updates to A and B persist in that order.

**Non-temporal → non-temporal ordering:** To order two PM updates (say for location A and then B) while both updates are sent along the non-temporal path (using non-temporal or cache-bypassing store (nt-st) instructions), the code sequence required is:

\[ \text{nt-st A; sfence; nt-st B; } \]

The nt-st instruction causes the updates to bypass the caches and go directly to the memory controller using the Write-Combining Buffer (WCB) at the processor core. Since the updates never go to the caches, we do not need a separate clwb instruction here. The sfence instruction ensures that the updates to A and B reach the memory controller in order.

**Non-temporal → temporal ordering:** To order two PM updates (say for location A and then B) while the update to A is sent along the non-temporal path while the update to B is sent along the temporal path, the code sequence required is:

\[ \text{nt-st A; sfence; st B; } \]

The combination of nt-st and sfence instructions causes the update to A to reach the memory controller before the update to B even reaches the L1 data cache, thus ensuring that updates to A and B reach persistence in order. For example, in Figure 2.1(b), the log insert for Alice’s balance (Line 3) must persist before the actual update to Alice’s balance (Line 4). And, usually log updates are done using non-temporal or cache bypassing stores as these logs are only necessary for crash consistency and never read in failure-free execution (the most common case). While the actual data updates are done using temporal stores as there is expected to be some reuse. In fact, this ordering requirement is frequently used as we see in undo logging based systems [17,19,38–41,63].

**Temporal → non-temporal ordering:** To order two PM updates (say for location A and then B) while the update to A is sent along the temporal path while the update to B is sent along the non-temporal path, the code sequence required is:

\[ \text{st A; clwb A; sfence; nt-st B; } \]

The combination of clwb and sfence instructions causes the update to A to reach the memory controller before the update to B ensuring that they persist in order. This combination of ordering from temporal to non-temporal stores are also quite common in
logging systems though not as common as the non-temporal to temporal ordering. For example, in Figure 2.1 (b), updates to Alice’s and Bob’s balances (lines 5 and 8) must persist before the log is committed (line 9).

It is typical in crash-consistent software systems running on x86 machines and relying on logging for crash consistency, to use a combination of temporal and non-temporal store instructions to achieve both necessary ordering guarantees while also judiciously using available cache resources for good performance [38,41]. As far as we are aware, the x86 ISA is the only ISA that provides these two paths to achieve persistence and next we are going to detail the problem with this approach and how we plan to address it.

2.3 The problem

While ordering persists is necessary for recovery correctness, enforcing the precise order of persist comes at a steep performance cost. This performance loss is due to two reasons: (i) reduced opportunities to buffer, coalesce, and reorder updates to PM and (ii) pipeline stalls caused by the use of fence instructions to express ordering dependencies. In § 2.2.1, we showed that with the x86 persistency model, using an sfence instruction is necessary for expressing any kind of persist ordering. However, fence instructions in general are not
optimal to express ordering relationships as they enforce ordering at a coarse granularity of *everything before the fence to everything after the fence* [64]. Not all instructions after the sfence are dependent on all instructions before the sfence. Independent instructions after the sfence are unnecessarily delayed. Furthermore, instructions after the sfence are not allowed to retire until all operations before the sfence are retired resulting in performance loss due to pipeline delays.

For example, consider the persist ordering constraints for the balance transfer transaction shown in Figure 2.2. The figure shows the ordering constraints between updates to the log and to the actual balances as part of the transactions. The log operations are shown in different color (gray) from the balances. In the ideal transaction, Figure 2.2(a), there is only a pair-wise dependency between a balance update and its corresponding log update. However, the x86 implementation of the transaction contains some unnecessary dependencies (red arrows in Figure 2.2(b)). These unnecessary dependencies are caused because the sfence instruction used to express ordering between the log and balance updates for Alice (arrow 1 in Figure 2.2(b)) unnecessarily causes a dependency between the log and data updates of Alice and Bob (red arrows in Figure 2.2(b)). Furthermore, the unnecessary dependencies introduced increase with the number of log-data update pairs we have in the transaction.

The central problem with the current x86 persistency model is that the sfence used to express the ordering within a single log-data update pair causes unnecessary persist dependencies and these unnecessary dependencies only grow with larger transactions. If
we had an “ideal” primitive that allows us to express the dependency within a log-data pair without having to use an `sfence` instruction, significant performance gains can be achieved. To estimate these performance gains of this “ideal”, we simply remove the `sfence` instructions used for crash-consistency in a baseline, state-of-the-art undo logging implementation [16] and execute a range of PM workloads on a real x86 processor. While removing the `sfence` may violate crash consistency requirements, it allows us to estimate the performance of our ideal primitive. Figure 2.3 illustrates the potential improvement in performance with our “ideal” ordering primitive. The improvement in performance ranges from 1.09x upto 2.68x, with an average improvement of 1.67x. These performance improvements are a result of eliminating 91.7% of `sfence` instructions on average.

### 2.4 Key insight

Our goal is to enforce the ordering within a log-data pair without using an `sfence` instruction. We observed that the log update is usually done with a non-temporal store instruction while the data update is done with a temporal store instruction. A non-temporal store heads directly to the memory controller and hence persistence. Whereas a temporal store is usually retained at multiple cache levels for reuse and needs to be written back or evicted from multiple caches in the cache hierarchy to reach the PM. So, even if the non-temporal log update and the temporal data update were to be started simultaneously, on a modern x86 processor, the log update will likely persist before the data update, even without any intervening `sfence` instruction. This observation implies that, most of the `sfence` instructions used for enforcing ordering within a log-data pair are unnecessary. However, if the `sfence` is not used, it is possible that there might be instances where the non-temporal update gets stuck in a buffer along its path and the temporal update reaches the memory controller first, violating recovery ordering constraints. In this work, we develop Themis, a light-weight x86 hardware extension that ensures that a non-temporal update persists before all subsequent temporal updates without using an `sfence` instruction.

### 2.5 Beyond Undo Logging

iDO [33] is a state-of-the-art logging mechanism that takes a different approach to achieving crash consistency. Instead of logging every individual write to PM, iDO performs logging at a much coarser granularity of *idempotent program regions*. Before
the start of a new idempotent region (identified by a compiler pass), all the information necessary to re-execute the idempotent region is logged. So, if an idempotent region’s execution is interrupted by a failure, the iDO logs are used to re-execute the region. While iDO incurs fewer logging operations than undo logging (due to a coarser granularity of logging), the amount of data logged in each logging operation is much larger than the data logged per write operation in undo logging. Due to this trade-off, iDO performs better with programs with larger critical sections/transactions, while undo logging performs better in applications with smaller transactions, especially if the transactions are read-heavy.

Themis improvements are not restricted to undo logging implementations. The iDO [33] logging mechanism also uses sfence instructions to enforce non-temporal to temporal store orderings. iDO persists the address of the first instruction in idempotent program region (and other logging metadata) through non-temporal store before executing the region. An sfence instruction is used to enforce the ordering between this non-temporal write and subsequent writes in the idempotent region. This fence instruction can be eliminated with our ideal ordering primitive. To estimate the performance improvements possible with a fence-less ordering primitive, we again remove the sfence instructions required in iDO. We observe a 13.7% increase in average performance for the same set of applications studied earlier in the real system setup.
In this section, we delve into the details of the design, specification, and implementation of Themis’ persistency model, an extension to existing x86 persistency models to reduce the use of fence instructions.

### 3.1 Design Goals

Themis is based on three key design goals:

**Create a non-fence ordering primitive:** As discussed in §2.3, fence instructions, while necessary in the current x86 architectures, are a significant source of performance degradation. We need to develop a mechanism that allows programmers to express an ordering dependency without a fence instruction, especially for the common-case ordering scenarios like non-temporal to temporal store ordering.

**Exploit multiple paths to persistence in x86:** x86 is a unique architecture in that its persistency models allows multiple paths to persistence, the fast non-temporal path and a slow temporal path. We can exploit the differences in speeds of requests sent along the fast vs slow paths to provide implicit ordering guarantees to programmers without relying on a fence instruction. To the best of our knowledge, no prior work has sought to exploit the presence of multiple paths to persistence.

**Light-weight hardware changes:** Since this work targets widely used x86 architectures, it is important that the hardware changes proposed are light-weight to allow easy adoption. Prior proposals on improved persistency models require invasive changes to the cache hierarchy, caches, cache coherence protocols, implementations, etc.

Such invasive changes imply that these proposals face a long road to adoption we’d like to avoid.
Next, we describe Themis’ persistency model and then describe an architectural implementation of the model.

### 3.2 Persistency Model

Formally, we express the Themis persistency model as an ordering relation over memory events *stores*, *non-temporal stores*, *clwbs*, and *fences*. The term *persist* refers to the act of durably writing a store to the persistent domain. We assume persists are performed atomically (with respect to failures) at naturally aligned 8-byte granularity. By “thread”, we refer to execution contexts—cores or hardware threads. We use the following notation:

- \( S_i^a \): A regular temporal store from thread \( i \) to address \( a \)
- \( NS_i^a \): A non-temporal store from thread \( i \) to address \( a \)
- \( C_i^a \): A clwb/clflush(opt) from thread \( i \) to address \( a \)
- \( F_i \): A fence (sfence or mfence) from thread \( i \).

We reason about two ordering relations over memory events, *execution order* and *persist memory order*. Execution order (EO) is a partial order over memory events that governs the order in which different instructions get executed in an x86 processor. For instructions within the same thread, the program order (the order in which instructions are retired) of the instructions decides the execution order. For instructions across threads, the x86 architecture’s memory consistency model (TSO [65]) and its cache coherence implementation determines the order in which the instructions get executed. Persist memory order (PMO) is the order of memory events as governed by the Themis’ memory persistency model [14].

We denote these ordering relations as:

- \( A \leq_e B \): \( A \) occurs no later than \( B \) in EO
- \( A \leq_p B \): \( A \) occurs no later than \( B \) in PMO

An ordering relation between stores in PMO implies the corresponding persist actions are ordered; that is,

\[
A \leq_p B \implies B \text{ may not persist before } A.
\]
We next describe the semantics of Themis’ persistency model by describing how to ensure that two different store instructions (of any kind) in the same or different threads will persist in the desired order.

**Temporal → temporal:** Persisting two temporal stores in order requires an intervening cacheline writeback/flush and sfence/mfence combination in the execution order. Formally:

\[
S_a^i \leq e C_a^j \leq e F_b^j \leq e S_b^i \rightarrow S_a^i \leq_p S_b^i
\]  

(3.1)

**Non-temporal → non-temporal:** Persisting two non-temporal stores in order requires an intervening sfence/mfence in the execution order. Formally:

\[
NS_a^i \leq e F^i \leq e NS_b^j \rightarrow S_a^i \leq_p NS_b^j
\]  

(3.2)

**Non-temporal → temporal:** Persisting a non-temporal store and a temporal store in order requires the two stores to be ordered in the execution order. No intervening fence instruction is necessary, unlike the baseline x86 persistency model. Formally:

\[
NS_a^i \leq e S_b^i \rightarrow NS_a^i \leq_p S_b^i
\]  

(3.3)

Note that here we provide this fence-less ordering guarantee for stores executed on the same thread (an overwhelmingly common case). If this ordering is required when the two stores are on different threads, an intervening fence is necessary, just in the baseline x86 model.

**Temporal → non-temporal:** Persisting a temporal store and a non-temporal store in order requires an intervening cacheline writeback/flush and sfence/mfence combination in the execution order. Formally:

\[
S_a^i \leq e C_a^j \leq e F_b^j \leq e NS_b^k \rightarrow S_a^i \leq_p NS_b^k
\]  

(3.4)

While the above equations formally define Themis’ persistency model, these ordering relationships are very similar to the ones described for the baseline x86’s persistency model in §2.2.1 with one exception. The only exception is that for non-temporal to temporal ordering, no intervening fence instruction is required, we just have to ensure that the two stores are correctly ordered in the execution order. If the two stores are in the same thread, then we have to ensure that the two stores are ordered within the program order. If the two stores are on different threads, then we need to ensure that there are intervening synchronization accesses (like unlock/lock or release/acquire
operations) to order the two stores in the execution order. Next, we describe a high-level design for this persistency model.

## 3.3 High-level Design

Figure 3.1 shows a simplistic multi-core system. Temporal stores go from the core, to the data cache to its writeback buffer (WBB), to the various levels of caches (L2$+LLC) and finally to the corresponding memory controller (based on its physical address). Non-temporal stores go from the core to the write-combining buffer (WCB) where they get reordered, buffered, and coalesced for performance and then directed to the corresponding memory controller.

At a high-level, with Themis, the goal is to ensure that when a non-temporal store is executed followed by a temporal store (i.e., $nt-st_A \leq e st_B$), A persists before B, even when there are no intervening fences (i.e., Eq 3.3 in §3.2). We achieve this by making light-weight modifications to the data cache, its WBB, and the core’s WCB. The steps involved in this process are as follows:

1. When $nt-st_A$ gets executed, it is first placed in the WCB and then eventually drained to the memory controller.
2. When $st_B$ comes along, we first check the WCB to see what outstanding non-temporal stores are yet to be drained by noting down the WCB’s tail pointer.
3. When the cacheline with $B$ is eventually written back either due to a regular replacement or due to a `clwb`-like instruction, we place the cacheline and its associated WCB tail pointer in the data cache’s WBB.
4. Before we let the cacheline $B$ leave the WBB, we check if all outstanding non-temporal stores at the time $st_B$ reached the data cache have been drained from the WCB (by
using the tail pointer stored along with cacheline B). If all prior non-temporal stores (including one to A) have been drained, we let the writeback of cacheline B continue. If not, then we wait for WCB to drain its outstanding non-temporal stores and only then we let cacheline B to be written back to L2.

By not allowing the update to B to even leave the data cache’s WBB before all prior executed non-temporal stores (include the one to A), we ensure that A persists before B (i.e., $\text{nt-st}_A \leq \text{st}_B$). Our approach is conservative. Ideally, we only have to ensure that A persists only before B is written back from the LLC (and hence reaching the persistence domain). However, we enforce that A persists before B is written back from the data cache, i.e., much earlier than necessary. As we will show in the following sections, this design choice considerably simplifies Themis’ architecture with almost no performance penalties.

3.4 Implementation

In this section, we detail how we extend the data cache, WBB, and WCB to implement Themis’ persistency model.

3.4.1 Write Combining Buffer

The WCB is usually implemented as circular buffer of about 16 entries. Each entry typically has the space to buffer a cacheline of data (64B) and the associated physical address. As shown in Figure 3.2, apart from the entries with data and addresses, the WCB consists of three important pointers: (i) tail ($t$): points to the next available entry for an incoming non-temporal store request (ii) head ($h$): points to the oldest outstanding entry that needs to be drained to the memory controller (iii) headACK ($hAck$): points to the oldest entry that has not been drained to the memory controller, or has been drained but has not yet received an acknowledgement from the memory controller saying that the entry has been persisted. A single bit per entry tracks if a drained entry has been ACKed by the memory controller or not and this bit is used to increment headACK.

The aspects of WCB described above already exist in a modern x86 processor and Figure 3.2 shows how they help the WCB operate: (i) Incoming non-temporal store $\triangleright$ gets allocated at the tail or gets coalesced into an active entry. If it gets allocated at the tail, the tail is incremented $\triangleright\circlearrowleft$, always pointing
Figure 3.2. Major events in WCB: (a) incoming non-temporal store, (b) WCB entry is drained, (c) ACK received from PM controller, (d) an overflow of tail pointer occurs.

to the next available entry for incoming requests. If no space is available in the WCB, the stall is back propagated to the core.

(ii) On draining an active entry to the PM controller, the head gets incremented, always pointing to the next entry to be drained. If the memory controller is full, a stall message is back propagated to the WCB.

(iii) On receiving a PM controller ACK signifying that a particular drained entry has reached the PM controller (and hence persisted), the ACK bit in the associated entry is set. If the entry pointed to by headACK has been ACKed, the headACK is incremented to point to the next oldest entry. This headACK increment is also communicated to the WBB, so that the WBB has up-to-date information on which entries in the WCB have been persisted.

(iv) Overflow is the corner-case situation when the tail pointer overflows. We use 6-bit tail, head, and headACK pointers even while using a 16-entry WCB. The least significant 4 bits are used to index into the WCB. The most significant 2 bits are used to track circular buffer “wrap arounds”, starting from 00 and incrementing by 1 every time a wrap around happens. Once the WCB has been through four full uses, we force drain all the outstanding requests and wait for ACKs for all of them from the memory controllers. Once the WCB has been completely drained and ACK, i.e., it is empty, we reset the head, tail, and headACK pointers and notify the core. As we will see next, the core will notify the data cache about the overflow and this information is used to correctly enforce persist orders. In § 4.3.2, we show how we arrived at using 6-bit pointers.
3.4.2 L1 Data Cache

As shown in Figure 3.3, apart from the the usual valid (V), tag, and data (D) fields, we extend each cacheline to store a 6-bit tail pointer value corresponding to the state of the WCB when a particular cacheline was written. Figure 3.3 shows how the data cache functions with Themis:

(i) **Incoming temporal store** request is tagged with the latest WCB tail pointer by the core. And this WCB tail is stored along with tag and data in the cacheline. The goal is to make sure that this cacheline persists only after associated WCB entries have been persisted (as indicated by the tail pointer). We achieve this goal by making sure that this cacheline does not even leave the writeback buffer (WBB) of the data cache before the corresponding WCB entry persists.

(ii) **While evicting a dirty cacheline** either due to a `clwb` request or due to regular cacheline replacement, along with the tag and the data, the associated WCB tail pointer is also sent to the the WBB.

(iii) **On an overflow notification** from the core, we know that all outstanding WCB entries have been drained and persisted, so none of the cachelines have any unresolved dependencies with prior non-temporal stores. Hence, all the WCB tail pointers in the cache are nullified.

3.4.3 Writeback Buffer

As shown in Figure 3.4, we extend each entry to store a 6-bit “Drainable After” (DA) pointer, pointing to the entry in the WCB after which this entry can be drained from the WBB to the L2 cache. The WBB also maintains a copy of the WCB’s head
Figure 3.4. Important events in WBB with Themis: (a) Incoming writeback request with tail pointer greater than \(h\text{Ack}'\), (b) Incoming writeback request with tail pointer lower or equal to \(h\text{Ack}'\), (c) Receiving headACK update from WCB.

pointer (\(h\text{Ack}'\)), allowing it to maintain visibility into which non-temporal stores have been persisted and which ones haven’t. Figure 3.4 shows how writeback buffer functions with Themis:

(i) **On an incoming writeback request** from the data cache, we compare the tail pointer associated with the incoming request with \(h\text{Ack}'\). If the tail pointer is lower or equal to \(h\text{Ack}'\), we can infer that the non-temporal requests that this writeback request is dependent upon have already been persisted and we can mark this writeback request as “drainable” (D) and evict it to L2 at any time. If the tail pointer is higher than \(h\text{Ack}'\), then the tail pointer is stored along with the request in the WBB.

(ii) **On a notification from WCB about headACK increments**, WBB first increments its own copy of headACK (\(h\text{Ack}'\)). Then it traverses active WBB entries and marks all those entries drainable whose tail pointers are smaller than or equal to the new \(h\text{Ack}'\).

This sequence of operations in the WCB, data cache, and the WBB ensure that all temporal stores will only reach the L2 when all prior executed non-temporal store instructions have been persisted.

### 3.5 Discussion

Next, we discuss some interesting corner cases and design decisions that we made with Themis.

**Coherence and consistency.** Prior persistency model implementations require invasive changes to the cache coherence and consistency model implementations [7]. However, Themis does not require *any* changes to the existing coherence and consistency
model implementations and in fact is completely orthogonal to them. All the information necessary to maintain the correct ordering is maintained only locally at one core’s data cache, WBB, and WCB. Cacheline invalidations and evictions that are caused by cache coherence transactions are treated just like any regular replacement policy related invalidations and evictions. By only maintaining local order information, Themis proposes a design that can be easily adopted into existing multi-core architectures.

**Thread migration.** On a thread migration a software thread may be de-scheduled from one core and scheduled at another core at some later point in time. Even during such context switches, we need to ensure that all persist ordering guarantees are maintained. While maintaining all the necessary ordering information locally within the core is beneficial from an implementation point of view, it raises questions during a context switch. When a thread migrates from one core to another it would be helpful to migrate its ordering information along with this but that requires extensive changes to the ISA and thread switching code. We instead take a more conservative approach, before migrating a thread, we completely drain and persist all outstanding non-temporal stores. This way, all subsequent temporal stores will be guaranteed to persist after prior non-temporal stores, irrespective of which core the thread gets migrated to. In our experiments, we observe negligible performance loss due to this conservative thread migration approach as thread migrations are relatively uncommon and long latency operations (so delaying them a little further does not seem to matter).

**Volatile vs persistent memory updates.** Even in PM systems, there is expected to be some significant amount of DRAM used and there are going to be some program variables that do not need to be persistent (e.g., stack variables) and are simply maintained in DRAM in volatile memory locations. Since volatile memory updates do not update persistent state directly, we do not always need to enforce the non-temporal to temporal store ordering constraints for them. So, we allow volatile cachelines (those backed by DRAM) to be evicted out of the data cache without any restrictions for improved performance. However, persist dependencies for stores executed across cores may still arise due to intervening volatile memory accesses [7] and for these cross-core persist dependencies, programmers have to rely on fence-based ordering primitives that they would normally use in existing x86 processors.

**Enforcing non-temporal to temporal store ordering at L1.** Technically, we only have to ensure that a non-temporal store persists before a subsequent temporal store is written back from the LLC (and hence reaches the persistence domain). However, we enforce that the non-temporal store persists even before the temporal store is written
back from the data cache, i.e., much earlier than necessary. This conservative enforcement provides three important benefits:

(i) **reduced storage overheads**: If ordering is enforced at the LLC, then all the cachelines in the LLC (and other intermediate caches) will need to maintain a WCB tail pointer. Since, LLCs are typically about 32x [66,67] larger than L1 data caches, this approach would increase storage overheads by about 32x.

(ii) **reduced design complexity**: With the current Themis architecture, all ordering requirements for a thread are enforced at the private L1 cache of the thread where we can be sure that memory accesses are generated only by the thread currently executing at that core. If ordering enforcement were to be moved to shared caches (like the LLC) then the same cacheline could be modified by different threads on different cores simultaneously and we need to maintain ordering information on a per-core basis for each cacheline, significantly complicating the design.

(iii) **reduced implementation complexity**: Even if the ordering information were to be correctly maintained, then while evicting a shared cacheline from the LLC, we would have to make sure all of its dependencies have been resolved at all the cores, requiring the LLC controller to constantly communicate with all the per-core WCBs in a multi-core system. This communication will be complex to implement, especially in multi-socket machines.

These three important benefits make Themis significantly easier to adopt, however, they come with a performance trade-off. As we will show in § 4.3.1, this performance loss is negligible as most non-temporal stores persist before subsequent temporal stores get evicted from the data cache.

**Comparison to other relaxed persistency models.** In this work, we propose Themis as an extension to the x86 persistency model, the only model that provides two paths to persistence, a temporal and a non-temporal one. ARM has recently introduced non-temporal store instructions [68], so, we expect that in the future Themis will be applicable to ARM processors as well. Prior persistency models such as [7,9,14,18,19] provide only one path to persistence. All of those persistency models cannot facilitate Themis-like reduction in the number of sfence instructions used. While strand persistency models can potentially provide similar reduction in unnecessary ordering enforced as Themis, it comes with a radically different memory persistency model that will necessarily require invasive changes to the cache coherence protocol, memory consistency model, and data cache implementations [69].
Chapter 4  
Evaluation

4.1 Methodology

**Simulation infrastructure.** We evaluate Themis using the gem5 architectural simulator [70] in full-system mode. We model a four-core system with the ARMv8 ISA [68] and the key architectural parameters of the evaluated system are summarized in Table 4.1. Note that we used ARM cores as a starting point for our evaluation due to the stability and maturity of the ARM core implementations in gem5 [71]. Even though we use ARM cores, we implemented the entire x86 memory persistency model (with `clwb`, `sfence`, and non-temporal store instructions). This approach is similar to the one used by Kolli et al. [7] and we firmly believe that our implementations are representative of the behaviors of modern x86 processors. All gem5 and real system experiments are executed on c6420 Xeon servers on CloudLab [72].

**Workloads.** We assess Themis with several PM-centric multi-threaded workloads [7, 19, 36, 74, 75] listed in Table 4.2. These workloads vary from those that modify a single data structure, like CTree [19] to database workloads like TPCC [75]. Each workload executes transactions of various different sizes, i.e., different number of log-data update pairs, we will simply refer to the size of a transactions as the number of stores per transaction. For example, the bank balance transfer transaction from Figure 2.1 has two stores in the transaction for the balance updates, one temporal store each. Figure 4.1 shows the distribution of size of transactions seen in each of the workloads. The smallest transactions have only one store within them (e.g., TATP’s update location transaction [74]) while the larger transactions can have over hundred stores within them (e.g., SPS). These workloads provide a range of average transaction sizes and also a varied distribution in the sizes of individual transactions. Note that “stores within a transaction” refers to only the temporal stores, the non-temporal stores required for logging are not considered as
### Table 4.1. Simulation Configuration.

<table>
<thead>
<tr>
<th>Component</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Core</td>
<td>4-cores, 3GHz OoO 8-wide Dispatch, 8-wide Commit, 192-entry ROB, 32/32-entry Load/Store Queue</td>
</tr>
<tr>
<td>I-Cache</td>
<td>32KB, 4-way, 64B 1ns hit latency, 8 MSHRs</td>
</tr>
<tr>
<td>D-Cache</td>
<td>64KB, 4-way, 64B 2ns hit latency, 8 MSHRs 16-entry WBB</td>
</tr>
<tr>
<td>Write-combining Buffer (WCB)</td>
<td>16 entries 20ns delay to PM controller</td>
</tr>
<tr>
<td>LLC</td>
<td>2MB per core, 16-way, 64B 20ns hit latency, 32 MSHRs</td>
</tr>
<tr>
<td>Memory controller (DRAM, PM)</td>
<td>128/64-entry write/read queue</td>
</tr>
<tr>
<td>DRAM</td>
<td>DDR4, 1200MHz</td>
</tr>
<tr>
<td>PCM</td>
<td>1200MHz, 346ns read latency 500ns write latency [73]</td>
</tr>
</tbody>
</table>

### Table 4.2. Workload descriptions. Workloads with an * were only run on real hardware in §2.

<table>
<thead>
<tr>
<th>Workload</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>CQ</td>
<td>En/Dequeue nodes in concurrent queue</td>
</tr>
<tr>
<td>CTree</td>
<td>Insert/delete nodes in crit-bit tree</td>
</tr>
<tr>
<td>Hashmap</td>
<td>Insert/delete nodes in hashmap</td>
</tr>
<tr>
<td>LL</td>
<td>Insert/delete nodes in linked-list</td>
</tr>
<tr>
<td>PC</td>
<td>Update in hash-table</td>
</tr>
<tr>
<td>Redis*</td>
<td>Object-based KV store workload, lru-test, 1M keys [19]</td>
</tr>
<tr>
<td>RB</td>
<td>Insert/delete nodes in a red-black tree</td>
</tr>
<tr>
<td>SPS</td>
<td>Swap random entries of an array</td>
</tr>
<tr>
<td>TATP</td>
<td>\texttt{update_location} transaction [74]</td>
</tr>
<tr>
<td>TPCC</td>
<td>\texttt{add_new_order} transaction [75]</td>
</tr>
<tr>
<td>Vacation*</td>
<td>RB + LL KV store workload, 4 clients, 2M transactions [19]</td>
</tr>
<tr>
<td>YCSB*</td>
<td>BTree KV store database, 4 clients 8M transactions, 80% writes [19]</td>
</tr>
</tbody>
</table>

part of the original transaction.

The logging technique we use to make these transactions crash consistent is built upon the state-of-the-art “coupled undo logging” approach introduced by Gogte et al. [16].
We use their code as the starting point and improve the performance of the transactions by moving from using a linked-list-based log to an array-based log. This optimization not only reduces the time required to index into a specific portion of the log, it also reduces the number of stores necessary to add new entries into the log. With list we had to perform two operations one to create a new log entry and the second to splice the new entry onto the list. With an array based structure, we simply have to create a new log entry. We also observed that the code made a number of redundant function calls some of which we were able to completely elide and others we accelerated through function inlining. Furthermore, we also make use of non-temporal store instructions to reduce cache pollution from logging updates. These simple software engineering optimization techniques allowed us to improve the performance of transactions by 2.63x on average. We use these optimized transactions as our baseline and improve their performance with Themis.

**Designs evaluated.** In this work, we evaluate three different designs. (i) **Baseline** refers to the scenario where the workloads are run with our optimized logging techniques and the baseline x86 persistency model, so, this design requires the use of `sfence` instructions to ensure non-temporal to temporal store ordering. (ii) **Themis** refers to the scenario where the workloads are run on a processor with our Themis persistency model implementation. This design does not require a `sfence` instruction to ensure non-temporal to temporal store ordering. (iii) **Ideal** is the baseline system with no `sfence` instructions for non-temporal to temporal store ordering. This design is not crash consistent, but represents the upper bound on performance that can be achieved by eliminating `sfence` instructions.
Finally, in this evaluation section, we’d like to answer the following questions:

- How does Themis perform? (§4.2)
- What factors impact workload performance? (§4.3)
- What are the implications of Themis’ restrictions on writebacks on overall performance? (§4.3.1)
- What are the hardware overheads of Themis? (§4.3.2)
- How are differently sized transactions impacted? (§4.3.3)

### 4.2 Performance Comparison

Figure 4.2 contrasts the speedup of Themis and Ideal over Baseline. The key takeaways from this analysis are:

**Themis consistently outperforms Baseline:** Themis outperforms Baseline in all benchmarks as it eliminates the use of expensive `sfence` instructions to enforce non-temporal to temporal store ordering. Themis improves performance by as much as 87% (for PC) and by 45.8% on average across all workloads.

**Themis achieves near-ideal performance:** Themis almost completely bridges the gap between baseline and ideal performance. This level of performance shows that the restrictions imposed by Themis on the writeback operations at the L1 data cache results in barely any performance degradation.
**Best performing workloads:** SPS, PC, and LL are the three best performing workloads with performance improvements ranging from 77.4% to 87%.

**Worst performing workloads:** RB, TATP, and TPCC are the three worst performing workloads with performance improvements ranging from 12.8% to 20.6%.

### 4.3 Performance analysis

To better understand the reason behind why some workloads perform better than others, we analyzed the frequency with which `sfence` instructions appear in the baseline that can be removed by using Themis. The higher the frequency with which removable fence instructions appear in a program (i.e., lower inter fence distance), the larger these fences contribute to overall execution time and removing them yields larger performance gains. It is important to note that the absolute number of fence instructions removed or the fraction of fence instructions removed matters less than the frequency with which these fence instructions appear in the program. Larger the frequency or lower the distance between two consecutive removable fence instructions, the higher is the potential for performance improvement.

Figure 4.3 plots the cumulative distribution function (CDF) of distance between two consecutive removable fence instructions (measured as the number of intervening instructions) for the two best performing (SPS and PC) and worst performing (TATP and TPCC) workloads. As expected, the best performing workloads see that an overwhelming majority of consecutive removable fence instructions have only about 200 instructions between them. Whereas for the lower performing TATP and TPCC workloads the distance between removable fence instructions skews much higher.

#### 4.3.1 Implications of Themis restrictions

Themis achieves the desired persist ordering guarantees by placing restrictions on when a writeback may be drained from WBB in L1 data cache. The larger the wait times, the WBBs get full and the delays are back propagated to the core and show up as pipeline stalls. Note that not all delays result in performance degradation due the latency hiding powers of modern out-of-order processors. However, excessive delays do end up hurting performance. As shown in Figure 4.2, Themis’ near-ideal performance indicates that the restrictions are minimal.

Figure 4.4 shows the effects of these restrictions in more detail. This figure shows
Figure 4.3. Number of instructions between two consecutive fences that are removed by Themis.

Figure 4.4. Distribution of number of cycles a writeback waits in WBB because of Themis’ restrictions.

Figure 4.5. Speedup for different number of stores per transaction (Geo-Mean)

the distribution of a writeback requests categorized by the number of cycles they are delayed. A majority of writebacks see no delays, where as some writebacks see few tens
of cycles of delays. Writebacks in LL and TATP see the longest delays (about 38 cycles), while CQ and TPCC see the shortest delays. These delay numbers highlight that Themis places minimal restrictions on cache writebacks.

4.3.2 Hardware overheads

The main hardware overhead of Themis arises because of the WCB tail pointers that are stored along with each cacheline in the data caches. So, the size of the tail pointer ends up determining the Themis’ storage overhead. As the tail pointer width increases, there will be fewer overflows (as discussed in § 3.4.1) and lesser performance degradation. However, longer tail pointers result in larger storage overheads. We performed a sensitivity analysis with increasing hardware overheads: 0.5KB (4-bit tail pointer), 0.75KB (6-bit), 1KB (8-bit), and 1.25KB (10-bit). And found very little performance variation among all configurations. We use 6-bit tail pointer, incurring 0.75KB (or 1.2%) of hardware overhead per data cache.

4.3.3 Sensitivity to transaction sizes

Finally, we see how the performance of Themis changes with different transaction sizes. Figure 4.5 shows the speedup achieves with Themis over baseline for different transaction sizes. We change the workloads to artificially set the number of stores per transaction in each of the workloads and see how Themis performs with increasing transaction sizes.

As expected, with increasing transaction sizes, the frequency of removable fence instructions (those required to enforce non-temporal to temporal store orderings) increase and we observe an increase in speedup with Themis. We observe this pattern in all workloads, however, different workloads exhibit different rates of performance improvements. Note that we omit TATP and TPCC from this experiment as these changes violate the definitions of those workloads.
Chapter 5  
Related Work

In this section we discuss the prior work on facilitating crash consistency solutions to PM systems.

5.1 Software-based Solutions

Runtimes. A wide range of works have been proposed optimizing file systems for PM in different levels of system stack [9,26,76–82]. Atlas [24] is a compiler and runtime solution for providing undo logging that ensures crash consistency at the granularity of outer-most lock. Gogte et al. [16] proposes an undo logging approach at the granularity of Synchronization Free Regions (SFRs) and addresses drawbacks of providing failure atomicity at the granularity outer-most lock, as Atlas [24] offers. Kolli et al. [27] provides an efficient transaction implementation to reduce the constraints on the order of persistent updates. JUSTDO logging [46] makes sure that the program’s execution resumes immediately where the application is interrupted in the critical section by persisting the architectural state prior to the critical section’s execution in a system. iDO [33] extends JUSTDO logging to finer grained idempotent program regions.


All software approaches rely on the underlying memory persistency model and can not provide the same ordering guarantees without the use of sfence instruction.
5.2 Hardware-based Solutions

Multi-versioning. Kiln [21] employs persistent LLC alongside PM to enable in-place updates with no logging. ATOM [17] relies on hardware structures to provide atomic durability using undo logging. Proteus [20] proposes a logging mechanism to enable persisting the transactions by applying drastic modification to the core to manage the logs and order them with respect to the write data. Gupta et al. [83] leverages persistent memory controllers to accumulate the updates in a transaction until they are committed to PM. ThyNVM [30] and PiCL [84] propose hardware-based checkpointing mechanisms to achieve crash consistency.

Ordering. Prior work [14, 28, 85] propose memory persistency model to determine the necessary orderings in which persistent updates need to reach to PM. BPFS [9] provides atomicity and ordering via epoch barriers, allowing reordering of persistent updates only inside of an epoch. Doshi et al. [15], Kolli et al. [7], Nalli et al. [19] and Shin et al. [86] explore different approaches for implementing the epoch persistency model in hardware. Lu et al. [13] proposes loose-ordering consistency to relax the constraints among persistent writes and employ hardware support to resolve the conflicts of transitions that should be persisted. Dananjaya et al. [87] strengthens the one-sided barrier semantics in ARP [36] and proposes release persistency model (RP) to guarantee that any pair of stores are persisted in the same order as the consistency model enforces the ordering. Gogte et al. [69] propose an implementation for strand persistency model to reduce the ordering overheads.

Unlike these set of approaches, Themis requires no invasive changes to the cache hierarchy.
Chapter 6  conclusion

The x86 memory persistency model offers two non-overlapping paths to persistence. And it only provides ordering guarantees using expensive fence instructions across or along these paths. In this work, we propose Themis, a light-weight x86 persistency model extension to provide some implicit ordering guarantees without using an intervening fence instruction. We improve the average performance of undo-logging based PM workloads by 45.8% while incurring as hardware overhead only 1.2% increase in data cache size.


URL https://doi.org/10.1145/1555815.1555758


URL https://doi.org/10.1145/3392151


in Proceedings of the Ninth European Conference on Computer Systems, EuroSys ’14, Association for Computing Machinery, New York, NY, USA.
URL https://doi.org/10.1145/2592798.2592814

URL https://doi.org/10.1145/2954679.2872381


URL https://doi.org/10.1145/2830772.2830802

URL https://doi.org/10.1145/3133891


URL https://doi.org/10.1109/MICRO.2018.00029

URL https://doi.org/10.1145/3297858.3304015


