memory-barriers.txt 113 KB
Newer Older
1 2 3 4 5
			 ============================
			 LINUX KERNEL MEMORY BARRIERS
			 ============================

By: David Howells <dhowells@redhat.com>
6
    Paul E. McKenney <paulmck@linux.vnet.ibm.com>
7 8
    Will Deacon <will.deacon@arm.com>
    Peter Zijlstra <peterz@infradead.org>
9

10 11 12 13 14 15 16 17 18 19 20 21
==========
DISCLAIMER
==========

This document is not a specification; it is intentionally (for the sake of
brevity) and unintentionally (due to being human) incomplete. This document is
meant as a guide to using the various memory barriers provided by Linux, but
in case of any doubt (and there are many) please ask.

To repeat, this document is not a specification of what Linux expects from
hardware.

22 23 24 25 26 27 28 29
The purpose of this document is twofold:

 (1) to specify the minimum functionality that one can rely on for any
     particular barrier, and

 (2) to provide a guide as to how to use the barriers that are available.

Note that an architecture can provide more than the minimum requirement
30
for any particular barrier, but if the architecture provides less than
31 32 33 34 35 36 37
that, that architecture is incorrect.

Note also that it is possible that a barrier may be a no-op for an
architecture because the way that arch works renders an explicit barrier
unnecessary in that case.


38 39 40
========
CONTENTS
========
41 42 43 44 45 46 47 48 49 50 51 52 53 54

 (*) Abstract memory access model.

     - Device operations.
     - Guarantees.

 (*) What are memory barriers?

     - Varieties of memory barrier.
     - What may not be assumed about memory barriers?
     - Data dependency barriers.
     - Control dependencies.
     - SMP barrier pairing.
     - Examples of memory barrier sequences.
55
     - Read memory barriers vs load speculation.
56
     - Multicopy atomicity.
57 58 59 60

 (*) Explicit kernel barriers.

     - Compiler barrier.
61
     - CPU memory barriers.
62 63 64 65
     - MMIO write barrier.

 (*) Implicit kernel memory barriers.

66
     - Lock acquisition functions.
67
     - Interrupt disabling functions.
68
     - Sleep and wake-up functions.
69 70
     - Miscellaneous functions.

71
 (*) Inter-CPU acquiring barrier effects.
72

73 74
     - Acquires vs memory accesses.
     - Acquires vs I/O accesses.
75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95

 (*) Where are memory barriers needed?

     - Interprocessor interaction.
     - Atomic operations.
     - Accessing devices.
     - Interrupts.

 (*) Kernel I/O barrier effects.

 (*) Assumed minimum execution ordering model.

 (*) The effects of the cpu cache.

     - Cache coherency.
     - Cache coherency vs DMA.
     - Cache coherency vs MMIO.

 (*) The things CPUs get up to.

     - And then there's the Alpha.
96
     - Virtual Machine Guests.
97

98 99 100 101
 (*) Example uses.

     - Circular buffers.

102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
 (*) References.


============================
ABSTRACT MEMORY ACCESS MODEL
============================

Consider the following abstract model of the system:

		            :                :
		            :                :
		            :                :
		+-------+   :   +--------+   :   +-------+
		|       |   :   |        |   :   |       |
		|       |   :   |        |   :   |       |
		| CPU 1 |<----->| Memory |<----->| CPU 2 |
		|       |   :   |        |   :   |       |
		|       |   :   |        |   :   |       |
		+-------+   :   +--------+   :   +-------+
		    ^       :       ^        :       ^
		    |       :       |        :       |
		    |       :       |        :       |
		    |       :       v        :       |
		    |       :   +--------+   :       |
		    |       :   |        |   :       |
		    |       :   |        |   :       |
		    +---------->| Device |<----------+
		            :   |        |   :
		            :   |        |   :
		            :   +--------+   :
		            :                :

Each CPU executes a program that generates memory access operations.  In the
abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
perform the memory operations in any order it likes, provided program causality
appears to be maintained.  Similarly, the compiler may also arrange the
instructions it emits in any order it likes, provided it doesn't affect the
apparent operation of the program.

So in the above diagram, the effects of the memory operations performed by a
CPU are perceived by the rest of the system as the operations cross the
interface between the CPU and rest of the system (the dotted lines).


For example, consider the following sequence of events:

	CPU 1		CPU 2
	===============	===============
	{ A == 1; B == 2 }
151 152
	A = 3;		x = B;
	B = 4;		y = A;
153 154 155 156

The set of accesses as seen by the memory system in the middle can be arranged
in 24 different combinations:

157 158 159 160 161 162 163
	STORE A=3,	STORE B=4,	y=LOAD A->3,	x=LOAD B->4
	STORE A=3,	STORE B=4,	x=LOAD B->4,	y=LOAD A->3
	STORE A=3,	y=LOAD A->3,	STORE B=4,	x=LOAD B->4
	STORE A=3,	y=LOAD A->3,	x=LOAD B->2,	STORE B=4
	STORE A=3,	x=LOAD B->2,	STORE B=4,	y=LOAD A->3
	STORE A=3,	x=LOAD B->2,	y=LOAD A->3,	STORE B=4
	STORE B=4,	STORE A=3,	y=LOAD A->3,	x=LOAD B->4
164 165 166 167 168
	STORE B=4, ...
	...

and can thus result in four different combinations of values:

169 170 171 172
	x == 2, y == 1
	x == 2, y == 3
	x == 4, y == 1
	x == 4, y == 3
173 174 175 176 177 178 179 180 181 182 183


Furthermore, the stores committed by a CPU to the memory system may not be
perceived by the loads made by another CPU in the same order as the stores were
committed.


As a further example, consider this sequence of events:

	CPU 1		CPU 2
	===============	===============
184
	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229
	B = 4;		Q = P;
	P = &B		D = *Q;

There is an obvious data dependency here, as the value loaded into D depends on
the address retrieved from P by CPU 2.  At the end of the sequence, any of the
following results are possible:

	(Q == &A) and (D == 1)
	(Q == &B) and (D == 2)
	(Q == &B) and (D == 4)

Note that CPU 2 will never try and load C into D because the CPU will load P
into Q before issuing the load of *Q.


DEVICE OPERATIONS
-----------------

Some devices present their control interfaces as collections of memory
locations, but the order in which the control registers are accessed is very
important.  For instance, imagine an ethernet card with a set of internal
registers that are accessed through an address port register (A) and a data
port register (D).  To read internal register 5, the following code might then
be used:

	*A = 5;
	x = *D;

but this might show up as either of the following two sequences:

	STORE *A = 5, x = LOAD *D
	x = LOAD *D, STORE *A = 5

the second of which will almost certainly result in a malfunction, since it set
the address _after_ attempting to read the register.


GUARANTEES
----------

There are some minimal guarantees that may be expected of a CPU:

 (*) On any given CPU, dependent memory accesses will be issued in order, with
     respect to itself.  This means that for:

230
	Q = READ_ONCE(P); D = READ_ONCE(*Q);
231 232 233 234 235

     the CPU will issue the following memory operations:

	Q = LOAD P, D = LOAD *Q

236 237 238 239 240 241 242 243
     and always in that order.  However, on DEC Alpha, READ_ONCE() also
     emits a memory-barrier instruction, so that a DEC Alpha CPU will
     instead issue the following memory operations:

	Q = LOAD P, MEMORY_BARRIER, D = LOAD *Q, MEMORY_BARRIER

     Whether on DEC Alpha or not, the READ_ONCE() also prevents compiler
     mischief.
244 245 246 247

 (*) Overlapping loads and stores within a particular CPU will appear to be
     ordered within that CPU.  This means that for:

248
	a = READ_ONCE(*X); WRITE_ONCE(*X, b);
249 250 251 252 253 254 255

     the CPU will only issue the following sequence of memory operations:

	a = LOAD *X, STORE *X = b

     And for:

256
	WRITE_ONCE(*X, c); d = READ_ONCE(*X);
257 258 259 260 261

     the CPU will only issue:

	STORE *X = c, d = LOAD *X

262
     (Loads and stores overlap if they are targeted at overlapping pieces of
263 264 265 266
     memory).

And there are a number of things that _must_ or _must_not_ be assumed:

267 268 269 270
 (*) It _must_not_ be assumed that the compiler will do what you want
     with memory references that are not protected by READ_ONCE() and
     WRITE_ONCE().  Without them, the compiler is within its rights to
     do all sorts of "creative" transformations, which are covered in
271
     the COMPILER BARRIER section.
272

273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299
 (*) It _must_not_ be assumed that independent loads and stores will be issued
     in the order given.  This means that for:

	X = *A; Y = *B; *D = Z;

     we may get any of the following sequences:

	X = LOAD *A,  Y = LOAD *B,  STORE *D = Z
	X = LOAD *A,  STORE *D = Z, Y = LOAD *B
	Y = LOAD *B,  X = LOAD *A,  STORE *D = Z
	Y = LOAD *B,  STORE *D = Z, X = LOAD *A
	STORE *D = Z, X = LOAD *A,  Y = LOAD *B
	STORE *D = Z, Y = LOAD *B,  X = LOAD *A

 (*) It _must_ be assumed that overlapping memory accesses may be merged or
     discarded.  This means that for:

	X = *A; Y = *(A + 4);

     we may get any one of the following sequences:

	X = LOAD *A; Y = LOAD *(A + 4);
	Y = LOAD *(A + 4); X = LOAD *A;
	{X, Y} = LOAD {*A, *(A + 4) };

     And for:

300
	*A = X; *(A + 4) = Y;
301

302
     we may get any of:
303

304 305 306
	STORE *A = X; STORE *(A + 4) = Y;
	STORE *(A + 4) = Y; STORE *A = X;
	STORE {*A, *(A + 4) } = {X, Y};
307

308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351
And there are anti-guarantees:

 (*) These guarantees do not apply to bitfields, because compilers often
     generate code to modify these using non-atomic read-modify-write
     sequences.  Do not attempt to use bitfields to synchronize parallel
     algorithms.

 (*) Even in cases where bitfields are protected by locks, all fields
     in a given bitfield must be protected by one lock.  If two fields
     in a given bitfield are protected by different locks, the compiler's
     non-atomic read-modify-write sequences can cause an update to one
     field to corrupt the value of an adjacent field.

 (*) These guarantees apply only to properly aligned and sized scalar
     variables.  "Properly sized" currently means variables that are
     the same size as "char", "short", "int" and "long".  "Properly
     aligned" means the natural alignment, thus no constraints for
     "char", two-byte alignment for "short", four-byte alignment for
     "int", and either four-byte or eight-byte alignment for "long",
     on 32-bit and 64-bit systems, respectively.  Note that these
     guarantees were introduced into the C11 standard, so beware when
     using older pre-C11 compilers (for example, gcc 4.6).  The portion
     of the standard containing this guarantee is Section 3.14, which
     defines "memory location" as follows:

     	memory location
		either an object of scalar type, or a maximal sequence
		of adjacent bit-fields all having nonzero width

		NOTE 1: Two threads of execution can update and access
		separate memory locations without interfering with
		each other.

		NOTE 2: A bit-field and an adjacent non-bit-field member
		are in separate memory locations. The same applies
		to two bit-fields, if one is declared inside a nested
		structure declaration and the other is not, or if the two
		are separated by a zero-length bit-field declaration,
		or if they are separated by a non-bit-field member
		declaration. It is not safe to concurrently update two
		bit-fields in the same structure if all members declared
		between them are also bit-fields, no matter what the
		sizes of those intervening bit-fields happen to be.

352 353 354 355 356 357 358 359 360 361 362

=========================
WHAT ARE MEMORY BARRIERS?
=========================

As can be seen above, independent memory operations are effectively performed
in random order, but this can be a problem for CPU-CPU interaction and for I/O.
What is required is some way of intervening to instruct the compiler and the
CPU to restrict the order.

Memory barriers are such interventions.  They impose a perceived partial
363 364 365
ordering over the memory operations on either side of the barrier.

Such enforcement is important because the CPUs and other devices in a system
366
can use a variety of tricks to improve performance, including reordering,
367 368 369 370
deferral and combination of memory operations; speculative loads; speculative
branch prediction and various types of caching.  Memory barriers are used to
override or suppress these tricks, allowing the code to sanely control the
interaction of multiple CPUs and/or devices.
371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387


VARIETIES OF MEMORY BARRIER
---------------------------

Memory barriers come in four basic varieties:

 (1) Write (or store) memory barriers.

     A write memory barrier gives a guarantee that all the STORE operations
     specified before the barrier will appear to happen before all the STORE
     operations specified after the barrier with respect to the other
     components of the system.

     A write barrier is a partial ordering on stores only; it is not required
     to have any effect on loads.

388
     A CPU can be viewed as committing a sequence of store operations to the
389 390
     memory system as time progresses.  All stores _before_ a write barrier
     will occur _before_ all the stores after the write barrier.
391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450

     [!] Note that write barriers should normally be paired with read or data
     dependency barriers; see the "SMP barrier pairing" subsection.


 (2) Data dependency barriers.

     A data dependency barrier is a weaker form of read barrier.  In the case
     where two loads are performed such that the second depends on the result
     of the first (eg: the first load retrieves the address to which the second
     load will be directed), a data dependency barrier would be required to
     make sure that the target of the second load is updated before the address
     obtained by the first load is accessed.

     A data dependency barrier is a partial ordering on interdependent loads
     only; it is not required to have any effect on stores, independent loads
     or overlapping loads.

     As mentioned in (1), the other CPUs in the system can be viewed as
     committing sequences of stores to the memory system that the CPU being
     considered can then perceive.  A data dependency barrier issued by the CPU
     under consideration guarantees that for any load preceding it, if that
     load touches one of a sequence of stores from another CPU, then by the
     time the barrier completes, the effects of all the stores prior to that
     touched by the load will be perceptible to any loads issued after the data
     dependency barrier.

     See the "Examples of memory barrier sequences" subsection for diagrams
     showing the ordering constraints.

     [!] Note that the first load really has to have a _data_ dependency and
     not a control dependency.  If the address for the second load is dependent
     on the first load, but the dependency is through a conditional rather than
     actually loading the address itself, then it's a _control_ dependency and
     a full read barrier or better is required.  See the "Control dependencies"
     subsection for more information.

     [!] Note that data dependency barriers should normally be paired with
     write barriers; see the "SMP barrier pairing" subsection.


 (3) Read (or load) memory barriers.

     A read barrier is a data dependency barrier plus a guarantee that all the
     LOAD operations specified before the barrier will appear to happen before
     all the LOAD operations specified after the barrier with respect to the
     other components of the system.

     A read barrier is a partial ordering on loads only; it is not required to
     have any effect on stores.

     Read memory barriers imply data dependency barriers, and so can substitute
     for them.

     [!] Note that read barriers should normally be paired with write barriers;
     see the "SMP barrier pairing" subsection.


 (4) General memory barriers.

451 452 453 454 455 456
     A general memory barrier gives a guarantee that all the LOAD and STORE
     operations specified before the barrier will appear to happen before all
     the LOAD and STORE operations specified after the barrier with respect to
     the other components of the system.

     A general memory barrier is a partial ordering over both loads and stores.
457 458 459 460 461 462 463

     General memory barriers imply both read and write memory barriers, and so
     can substitute for either.


And a couple of implicit varieties:

464
 (5) ACQUIRE operations.
465 466

     This acts as a one-way permeable barrier.  It guarantees that all memory
467 468
     operations after the ACQUIRE operation will appear to happen after the
     ACQUIRE operation with respect to the other components of the system.
469 470 471
     ACQUIRE operations include LOCK operations and both smp_load_acquire()
     and smp_cond_acquire() operations. The later builds the necessary ACQUIRE
     semantics from relying on a control dependency and smp_rmb().
472

473 474
     Memory operations that occur before an ACQUIRE operation may appear to
     happen after it completes.
475

476 477
     An ACQUIRE operation should almost always be paired with a RELEASE
     operation.
478 479


480
 (6) RELEASE operations.
481 482

     This also acts as a one-way permeable barrier.  It guarantees that all
483 484 485 486
     memory operations before the RELEASE operation will appear to happen
     before the RELEASE operation with respect to the other components of the
     system. RELEASE operations include UNLOCK operations and
     smp_store_release() operations.
487

488
     Memory operations that occur after a RELEASE operation may appear to
489 490
     happen before it completes.

491 492 493 494 495 496 497 498 499
     The use of ACQUIRE and RELEASE operations generally precludes the need
     for other sorts of memory barrier (but note the exceptions mentioned in
     the subsection "MMIO write barrier").  In addition, a RELEASE+ACQUIRE
     pair is -not- guaranteed to act as a full memory barrier.  However, after
     an ACQUIRE on a given variable, all memory accesses preceding any prior
     RELEASE on that same variable are guaranteed to be visible.  In other
     words, within a given variable's critical section, all accesses of all
     previous critical sections for that variable are guaranteed to have
     completed.
500

501 502
     This means that ACQUIRE acts as a minimal "acquire" operation and
     RELEASE acts as a minimal "release" operation.
503

504 505 506 507 508
A subset of the atomic operations described in atomic_t.txt have ACQUIRE and
RELEASE variants in addition to fully-ordered and relaxed (no barrier
semantics) definitions.  For compound atomics performing both a load and a
store, ACQUIRE semantics apply only to the load and RELEASE semantics apply
only to the store portion of the operation.
509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535

Memory barriers are only required where there's a possibility of interaction
between two CPUs or between a CPU and a device.  If it can be guaranteed that
there won't be any such interaction in any particular piece of code, then
memory barriers are unnecessary in that piece of code.


Note that these are the _minimum_ guarantees.  Different architectures may give
more substantial guarantees, but they may _not_ be relied upon outside of arch
specific code.


WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
----------------------------------------------

There are certain things that the Linux kernel memory barriers do not guarantee:

 (*) There is no guarantee that any of the memory accesses specified before a
     memory barrier will be _complete_ by the completion of a memory barrier
     instruction; the barrier can be considered to draw a line in that CPU's
     access queue that accesses of the appropriate type may not cross.

 (*) There is no guarantee that issuing a memory barrier on one CPU will have
     any direct effect on another CPU or any other hardware in the system.  The
     indirect effect will be the order in which the second CPU sees the effects
     of the first CPU's accesses occur, but see the next point:

536
 (*) There is no guarantee that a CPU will see the correct order of effects
537 538 539 540 541 542 543 544 545 546 547
     from a second CPU's accesses, even _if_ the second CPU uses a memory
     barrier, unless the first CPU _also_ uses a matching memory barrier (see
     the subsection on "SMP Barrier Pairing").

 (*) There is no guarantee that some intervening piece of off-the-CPU
     hardware[*] will not reorder the memory accesses.  CPU cache coherency
     mechanisms should propagate the indirect effects of a memory barrier
     between CPUs, but might not do so in order.

	[*] For information on bus mastering DMA and coherency please read:

548
	    Documentation/PCI/pci.txt
Paul Bolle's avatar
Paul Bolle committed
549
	    Documentation/DMA-API-HOWTO.txt
550 551 552 553 554 555 556 557 558 559
	    Documentation/DMA-API.txt


DATA DEPENDENCY BARRIERS
------------------------

The usage requirements of data dependency barriers are a little subtle, and
it's not always obvious that they're needed.  To illustrate, consider the
following sequence of events:

560 561
	CPU 1		      CPU 2
	===============	      ===============
562
	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
563 564
	B = 4;
	<write barrier>
565 566
	WRITE_ONCE(P, &B)
			      Q = READ_ONCE(P);
567
			      D = *Q;
568 569 570 571 572 573 574

There's a clear data dependency here, and it would seem that by the end of the
sequence, Q must be either &A or &B, and that:

	(Q == &A) implies (D == 1)
	(Q == &B) implies (D == 4)

575
But!  CPU 2's perception of P may be updated _before_ its perception of B, thus
576 577 578 579 580 581 582 583
leading to the following situation:

	(Q == &B) and (D == 2) ????

Whilst this may seem like a failure of coherency or causality maintenance, it
isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
Alpha).

584 585
To deal with this, a data dependency barrier or better must be inserted
between the address load and the data load:
586

587 588
	CPU 1		      CPU 2
	===============	      ===============
589
	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
590 591
	B = 4;
	<write barrier>
592 593
	WRITE_ONCE(P, &B);
			      Q = READ_ONCE(P);
594 595
			      <data dependency barrier>
			      D = *Q;
596 597 598 599

This enforces the occurrence of one of the two implications, and prevents the
third possibility from arising.

600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617

[!] Note that this extremely counterintuitive situation arises most easily on
machines with split caches, so that, for example, one cache bank processes
even-numbered cache lines and the other bank processes odd-numbered cache
lines.  The pointer P might be stored in an odd-numbered cache line, and the
variable B might be stored in an even-numbered cache line.  Then, if the
even-numbered bank of the reading CPU's cache is extremely busy while the
odd-numbered bank is idle, one can see the new value of the pointer P (&B),
but the old value of the variable B (2).


A data-dependency barrier is not required to order dependent writes
because the CPUs that the Linux kernel supports don't do writes
until they are certain (1) that the write will actually happen, (2)
of the location of the write, and (3) of the value to be written.
But please carefully read the "CONTROL DEPENDENCIES" section and the
Documentation/RCU/rcu_dereference.txt file:  The compiler can and does
break dependencies in a great many highly creative ways.
618 619 620 621 622 623 624 625

	CPU 1		      CPU 2
	===============	      ===============
	{ A == 1, B == 2, C = 3, P == &A, Q == &C }
	B = 4;
	<write barrier>
	WRITE_ONCE(P, &B);
			      Q = READ_ONCE(P);
626
			      WRITE_ONCE(*Q, 5);
627

628 629 630
Therefore, no data-dependency barrier is required to order the read into
Q with the store into *Q.  In other words, this outcome is prohibited,
even without a data-dependency barrier:
631

632
	(Q == &B) && (B == 4)
633 634 635 636

Please note that this pattern should be rare.  After all, the whole point
of dependency ordering is to -prevent- writes to the data structure, along
with the expensive cache misses associated with those writes.  This pattern
637 638
can be used to record rare error conditions and the like, and the CPUs'
naturally occurring ordering prevents such records from being lost.
639 640


641 642 643 644 645
Note well that the ordering provided by a data dependency is local to
the CPU containing it.  See the section on "Multicopy atomicity" for
more information.


646 647 648 649 650
The data dependency barrier is very important to the RCU system,
for example.  See rcu_assign_pointer() and rcu_dereference() in
include/linux/rcupdate.h.  This permits the current target of an RCU'd
pointer to be replaced with a new modified target, without the replacement
target appearing to be incompletely initialised.
651 652 653 654 655 656 657

See also the subsection on "Cache Coherency" for a more thorough example.


CONTROL DEPENDENCIES
--------------------

658 659 660 661
Control dependencies can be a bit tricky because current compilers do
not understand them.  The purpose of this section is to help you prevent
the compiler's ignorance from breaking your code.

662 663 664
A load-load control dependency requires a full read memory barrier, not
simply a data dependency barrier to make it work correctly.  Consider the
following bit of code:
665

666
	q = READ_ONCE(a);
667 668
	if (q) {
		<data dependency barrier>  /* BUG: No data dependency!!! */
669
		p = READ_ONCE(b);
670
	}
671 672

This will not have the desired effect because there is no actual data
673 674 675 676
dependency, but rather a control dependency that the CPU may short-circuit
by attempting to predict the outcome in advance, so that other CPUs see
the load from b as having happened before the load from a.  In such a
case what's actually required is:
677

678
	q = READ_ONCE(a);
679
	if (q) {
680
		<read barrier>
681
		p = READ_ONCE(b);
682
	}
683 684

However, stores are not speculated.  This means that ordering -is- provided
685
for load-store control dependencies, as in the following example:
686

687
	q = READ_ONCE(a);
688
	if (q) {
689
		WRITE_ONCE(b, 1);
690 691
	}

692 693 694 695 696 697
Control dependencies pair normally with other types of barriers.
That said, please note that neither READ_ONCE() nor WRITE_ONCE()
are optional! Without the READ_ONCE(), the compiler might combine the
load from 'a' with other loads from 'a'.  Without the WRITE_ONCE(),
the compiler might combine the store to 'b' with other stores to 'b'.
Either can result in highly counterintuitive effects on ordering.
698 699 700 701 702 703 704

Worse yet, if the compiler is able to prove (say) that the value of
variable 'a' is always non-zero, it would be well within its rights
to optimize the original example by eliminating the "if" statement
as follows:

	q = a;
705
	b = 1;  /* BUG: Compiler and CPU can both reorder!!! */
706

707
So don't leave out the READ_ONCE().
708

709 710
It is tempting to try to enforce ordering on identical stores on both
branches of the "if" statement as follows:
711

712
	q = READ_ONCE(a);
713
	if (q) {
714
		barrier();
715
		WRITE_ONCE(b, 1);
716 717
		do_something();
	} else {
718
		barrier();
719
		WRITE_ONCE(b, 1);
720 721 722
		do_something_else();
	}

723 724
Unfortunately, current compilers will transform this as follows at high
optimization levels:
725

726
	q = READ_ONCE(a);
727
	barrier();
728
	WRITE_ONCE(b, 1);  /* BUG: No ordering vs. load from a!!! */
729
	if (q) {
730
		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
731 732
		do_something();
	} else {
733
		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
734 735 736
		do_something_else();
	}

737 738 739 740 741 742
Now there is no conditional between the load from 'a' and the store to
'b', which means that the CPU is within its rights to reorder them:
The conditional is absolutely required, and must be present in the
assembly code even after all compiler optimizations have been applied.
Therefore, if you need ordering in this example, you need explicit
memory barriers, for example, smp_store_release():
743

744
	q = READ_ONCE(a);
745
	if (q) {
746
		smp_store_release(&b, 1);
747 748
		do_something();
	} else {
749
		smp_store_release(&b, 1);
750 751 752
		do_something_else();
	}

753 754 755
In contrast, without explicit memory barriers, two-legged-if control
ordering is guaranteed only when the stores differ, for example:

756
	q = READ_ONCE(a);
757
	if (q) {
758
		WRITE_ONCE(b, 1);
759 760
		do_something();
	} else {
761
		WRITE_ONCE(b, 2);
762 763 764
		do_something_else();
	}

765 766
The initial READ_ONCE() is still required to prevent the compiler from
proving the value of 'a'.
767 768 769 770 771

In addition, you need to be careful what you do with the local variable 'q',
otherwise the compiler might be able to guess the value and again remove
the needed conditional.  For example:

772
	q = READ_ONCE(a);
773
	if (q % MAX) {
774
		WRITE_ONCE(b, 1);
775 776
		do_something();
	} else {
777
		WRITE_ONCE(b, 2);
778 779 780 781 782 783 784
		do_something_else();
	}

If MAX is defined to be 1, then the compiler knows that (q % MAX) is
equal to zero, in which case the compiler is within its rights to
transform the above code into the following:

785
	q = READ_ONCE(a);
786
	WRITE_ONCE(b, 2);
787 788
	do_something_else();

789 790 791 792 793 794
Given this transformation, the CPU is not required to respect the ordering
between the load from variable 'a' and the store to variable 'b'.  It is
tempting to add a barrier(), but this does not help.  The conditional
is gone, and the barrier won't bring it back.  Therefore, if you are
relying on this ordering, you should make sure that MAX is greater than
one, perhaps as follows:
795

796
	q = READ_ONCE(a);
797 798
	BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
	if (q % MAX) {
799
		WRITE_ONCE(b, 1);
800 801
		do_something();
	} else {
802
		WRITE_ONCE(b, 2);
803 804 805
		do_something_else();
	}

806 807 808 809
Please note once again that the stores to 'b' differ.  If they were
identical, as noted earlier, the compiler could pull this store outside
of the 'if' statement.

810 811 812
You must also be careful not to rely too much on boolean short-circuit
evaluation.  Consider this example:

813
	q = READ_ONCE(a);
814
	if (q || 1 > 0)
815
		WRITE_ONCE(b, 1);
816

817 818 819
Because the first condition cannot fault and the second condition is
always true, the compiler can transform this example as following,
defeating control dependency:
820

821
	q = READ_ONCE(a);
822
	WRITE_ONCE(b, 1);
823 824

This example underscores the need to ensure that the compiler cannot
825
out-guess your code.  More generally, although READ_ONCE() does force
826 827 828
the compiler to actually emit code for a given load, it does not force
the compiler to use the results.

829 830 831 832 833 834
In addition, control dependencies apply only to the then-clause and
else-clause of the if-statement in question.  In particular, it does
not necessarily apply to code following the if-statement:

	q = READ_ONCE(a);
	if (q) {
835
		WRITE_ONCE(b, 1);
836
	} else {
837
		WRITE_ONCE(b, 2);
838
	}
839
	WRITE_ONCE(c, 1);  /* BUG: No ordering against the read from 'a'. */
840 841 842

It is tempting to argue that there in fact is ordering because the
compiler cannot reorder volatile accesses and also cannot reorder
843 844
the writes to 'b' with the condition.  Unfortunately for this line
of reasoning, the compiler might compile the two writes to 'b' as
845 846 847 848 849
conditional-move instructions, as in this fanciful pseudo-assembly
language:

	ld r1,a
	cmp r1,$0
850 851
	cmov,ne r4,$1
	cmov,eq r4,$2
852 853 854 855
	st r4,b
	st $1,c

A weakly ordered CPU would have no dependency of any sort between the load
856
from 'a' and the store to 'c'.  The control dependencies would extend
857 858 859 860 861
only to the pair of cmov instructions and the store depending on them.
In short, control dependencies apply only to the stores in the then-clause
and else-clause of the if-statement in question (including functions
invoked by those two clauses), not to code following that if-statement.

862

863 864 865
Note well that the ordering provided by a control dependency is local
to the CPU containing it.  See the section on "Multicopy atomicity"
for more information.
866 867 868 869 870 871 872 873


In summary:

  (*) Control dependencies can order prior loads against later stores.
      However, they do -not- guarantee any other sort of ordering:
      Not prior loads against later loads, nor prior stores against
      later anything.  If you need these other forms of ordering,
874
      use smp_rmb(), smp_wmb(), or, in the case of prior stores and
875 876
      later loads, smp_mb().

877 878 879 880
  (*) If both legs of the "if" statement begin with identical stores to
      the same variable, then those stores must be ordered, either by
      preceding both of them with smp_mb() or by using smp_store_release()
      to carry out the stores.  Please note that it is -not- sufficient
881 882 883 884
      to use barrier() at beginning of each leg of the "if" statement
      because, as shown by the example above, optimizing compilers can
      destroy the control dependency while respecting the letter of the
      barrier() law.
885

886
  (*) Control dependencies require at least one run-time conditional
887
      between the prior load and the subsequent store, and this
888 889
      conditional must involve the prior load.  If the compiler is able
      to optimize the conditional away, it will have also optimized
890 891
      away the ordering.  Careful use of READ_ONCE() and WRITE_ONCE()
      can help to preserve the needed conditional.
892 893

  (*) Control dependencies require that the compiler avoid reordering the
894 895
      dependency into nonexistence.  Careful use of READ_ONCE() or
      atomic{,64}_read() can help to preserve your control dependency.
896
      Please see the COMPILER BARRIER section for more information.
897

898 899 900 901 902 903
  (*) Control dependencies apply only to the then-clause and else-clause
      of the if-statement containing the control dependency, including
      any functions that these two clauses call.  Control dependencies
      do -not- apply to code following the if-statement containing the
      control dependency.

904 905
  (*) Control dependencies pair normally with other types of barriers.

906 907
  (*) Control dependencies do -not- provide multicopy atomicity.  If you
      need all the CPUs to see a given store at the same time, use smp_mb().
908

909 910 911
  (*) Compilers do not understand control dependencies.  It is therefore
      your job to ensure that they do not break your code.

912 913 914 915 916 917 918

SMP BARRIER PAIRING
-------------------

When dealing with CPU-CPU interactions, certain types of memory barrier should
always be paired.  A lack of appropriate pairing is almost certainly an error.

919
General barriers pair with each other, though they also pair with most
920 921 922 923 924 925 926 927
other types of barriers, albeit without multicopy atomicity.  An acquire
barrier pairs with a release barrier, but both may also pair with other
barriers, including of course general barriers.  A write barrier pairs
with a data dependency barrier, a control dependency, an acquire barrier,
a release barrier, a read barrier, or a general barrier.  Similarly a
read barrier, control dependency, or a data dependency barrier pairs
with a write barrier, an acquire barrier, a release barrier, or a
general barrier:
928

929 930
	CPU 1		      CPU 2
	===============	      ===============
931
	WRITE_ONCE(a, 1);
932
	<write barrier>
933
	WRITE_ONCE(b, 2);     x = READ_ONCE(b);
934
			      <read barrier>
935
			      y = READ_ONCE(a);
936 937 938

Or:

939 940
	CPU 1		      CPU 2
	===============	      ===============================
941 942
	a = 1;
	<write barrier>
943
	WRITE_ONCE(b, &a);    x = READ_ONCE(b);
944 945
			      <data dependency barrier>
			      y = *x;
946

947 948 949 950
Or even:

	CPU 1		      CPU 2
	===============	      ===============================
951
	r1 = READ_ONCE(y);
952
	<general barrier>
953
	WRITE_ONCE(x, 1);     if (r2 = READ_ONCE(x)) {
954
			         <implicit control dependency>
955
			         WRITE_ONCE(y, 1);
956 957 958 959
			      }

	assert(r1 == 0 || r2 == 0);

960 961 962
Basically, the read barrier always has to be there, even though it can be of
the "weaker" type.

963
[!] Note that the stores before the write barrier would normally be expected to
964
match the loads after the read barrier or the data dependency barrier, and vice
965 966
versa:

967 968
	CPU 1                               CPU 2
	===================                 ===================
969 970
	WRITE_ONCE(a, 1);    }----   --->{  v = READ_ONCE(c);
	WRITE_ONCE(b, 2);    }    \ /    {  w = READ_ONCE(d);
971
	<write barrier>            \        <read barrier>
972 973
	WRITE_ONCE(c, 3);    }    / \    {  x = READ_ONCE(a);
	WRITE_ONCE(d, 4);    }----   --->{  y = READ_ONCE(b);
974

975 976 977 978

EXAMPLES OF MEMORY BARRIER SEQUENCES
------------------------------------

979
Firstly, write barriers act as partial orderings on store operations.
980 981 982 983 984 985 986 987 988 989 990 991 992
Consider the following sequence of events:

	CPU 1
	=======================
	STORE A = 1
	STORE B = 2
	STORE C = 3
	<write barrier>
	STORE D = 4
	STORE E = 5

This sequence of events is committed to the memory coherence system in an order
that the rest of the system might perceive as the unordered set of { STORE A,
993
STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E
994 995 996 997 998
}:

	+-------+       :      :
	|       |       +------+
	|       |------>| C=3  |     }     /\
999 1000
	|       |  :    +------+     }-----  \  -----> Events perceptible to
	|       |  :    | A=1  |     }        \/       the rest of the system
1001 1002 1003 1004 1005 1006
	|       |  :    +------+     }
	| CPU 1 |  :    | B=2  |     }
	|       |       +------+     }
	|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
	|       |       +------+     }        requires all stores prior to the
	|       |  :    | E=5  |     }        barrier to be committed before
1007
	|       |  :    +------+     }        further stores may take place
1008 1009 1010 1011
	|       |------>| D=4  |     }
	|       |       +------+
	+-------+       :      :
	                   |
1012 1013
	                   | Sequence in which stores are committed to the
	                   | memory system by CPU 1
1014 1015 1016
	                   V


1017
Secondly, data dependency barriers act as partial orderings on data-dependent
1018 1019 1020 1021
loads.  Consider the following sequence of events:

	CPU 1			CPU 2
	=======================	=======================
1022
		{ B = 7; X = 9; Y = 8; C = &Y }
1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060
	STORE A = 1
	STORE B = 2
	<write barrier>
	STORE C = &B		LOAD X
	STORE D = 4		LOAD C (gets &B)
				LOAD *C (reads B)

Without intervention, CPU 2 may perceive the events on CPU 1 in some
effectively random order, despite the write barrier issued by CPU 1:

	+-------+       :      :                :       :
	|       |       +------+                +-------+  | Sequence of update
	|       |------>| B=2  |-----       --->| Y->8  |  | of perception on
	|       |  :    +------+     \          +-------+  | CPU 2
	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
	|       |       +------+       |        +-------+
	|       |   wwwwwwwwwwwwwwww   |        :       :
	|       |       +------+       |        :       :
	|       |  :    | C=&B |---    |        :       :       +-------+
	|       |  :    +------+   \   |        +-------+       |       |
	|       |------>| D=4  |    ----------->| C->&B |------>|       |
	|       |       +------+       |        +-------+       |       |
	+-------+       :      :       |        :       :       |       |
	                               |        :       :       |       |
	                               |        :       :       | CPU 2 |
	                               |        +-------+       |       |
	    Apparently incorrect --->  |        | B->7  |------>|       |
	    perception of B (!)        |        +-------+       |       |
	                               |        :       :       |       |
	                               |        +-------+       |       |
	    The load of X holds --->    \       | X->9  |------>|       |
	    up the maintenance           \      +-------+       |       |
	    of coherence of B             ----->| B->2  |       +-------+
	                                        +-------+
	                                        :       :


In the above example, CPU 2 perceives that B is 7, despite the load of *C
1061
(which would be B) coming after the LOAD of C.
1062 1063

If, however, a data dependency barrier were to be placed between the load of C
1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077
and the load of *C (ie: B) on CPU 2:

	CPU 1			CPU 2
	=======================	=======================
		{ B = 7; X = 9; Y = 8; C = &Y }
	STORE A = 1
	STORE B = 2
	<write barrier>
	STORE C = &B		LOAD X
	STORE D = 4		LOAD C (gets &B)
				<data dependency barrier>
				LOAD *C (reads B)

then the following will occur:
1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| B=2  |-----       --->| Y->8  |
	|       |  :    +------+     \          +-------+
	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
	|       |       +------+       |        +-------+
	|       |   wwwwwwwwwwwwwwww   |        :       :
	|       |       +------+       |        :       :
	|       |  :    | C=&B |---    |        :       :       +-------+
	|       |  :    +------+   \   |        +-------+       |       |
	|       |------>| D=4  |    ----------->| C->&B |------>|       |
	|       |       +------+       |        +-------+       |       |
	+-------+       :      :       |        :       :       |       |
	                               |        :       :       |       |
	                               |        :       :       | CPU 2 |
	                               |        +-------+       |       |
1095 1096 1097 1098 1099 1100
	                               |        | X->9  |------>|       |
	                               |        +-------+       |       |
	  Makes sure all effects --->   \   ddddddddddddddddd   |       |
	  prior to the store of C        \      +-------+       |       |
	  are perceptible to              ----->| B->2  |------>|       |
	  subsequent loads                      +-------+       |       |
1101 1102 1103 1104 1105 1106 1107 1108
	                                        :       :       +-------+


And thirdly, a read barrier acts as a partial order on loads.  Consider the
following sequence of events:

	CPU 1			CPU 2
	=======================	=======================
1109
		{ A = 0, B = 9 }
1110 1111
	STORE A=1
	<write barrier>
1112
	STORE B=2
1113
				LOAD B
1114
				LOAD A
1115 1116 1117 1118

Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
some effectively random order, despite the write barrier issued by CPU 1:

1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137
	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       | A->0  |------>|       |
	                                |       +-------+       |       |
	                                |       :       :       +-------+
	                                 \      :       :
	                                  \     +-------+
	                                   ---->| A->1  |
	                                        +-------+
	                                        :       :
1138

1139

1140
If, however, a read barrier were to be placed between the load of B and the
1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262
load of A on CPU 2:

	CPU 1			CPU 2
	=======================	=======================
		{ A = 0, B = 9 }
	STORE A=1
	<write barrier>
	STORE B=2
				LOAD B
				<read barrier>
				LOAD A

then the partial ordering imposed by CPU 1 will be perceived correctly by CPU
2:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       :       :       |       |
	                                |       :       :       |       |
	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
	  barrier causes all effects      \     +-------+       |       |
	  prior to the storage of B        ---->| A->1  |------>|       |
	  to be perceptible to CPU 2            +-------+       |       |
	                                        :       :       +-------+


To illustrate this more completely, consider what could happen if the code
contained a load of A either side of the read barrier:

	CPU 1			CPU 2
	=======================	=======================
		{ A = 0, B = 9 }
	STORE A=1
	<write barrier>
	STORE B=2
				LOAD B
				LOAD A [first load of A]
				<read barrier>
				LOAD A [second load of A]

Even though the two loads of A both occur after the load of B, they may both
come up with different values:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       :       :       |       |
	                                |       :       :       |       |
	                                |       +-------+       |       |
	                                |       | A->0  |------>| 1st   |
	                                |       +-------+       |       |
	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
	  barrier causes all effects      \     +-------+       |       |
	  prior to the storage of B        ---->| A->1  |------>| 2nd   |
	  to be perceptible to CPU 2            +-------+       |       |
	                                        :       :       +-------+


But it may be that the update to A from CPU 1 becomes perceptible to CPU 2
before the read barrier completes anyway:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       :       :       |       |
	                                 \      :       :       |       |
	                                  \     +-------+       |       |
	                                   ---->| A->1  |------>| 1st   |
	                                        +-------+       |       |
	                                    rrrrrrrrrrrrrrrrr   |       |
	                                        +-------+       |       |
	                                        | A->1  |------>| 2nd   |
	                                        +-------+       |       |
	                                        :       :       +-------+


The guarantee is that the second load will always come up with A == 1 if the
load of B came up with B == 2.  No such guarantee exists for the first load of
A; that may come up with either A == 0 or A == 1.


READ MEMORY BARRIERS VS LOAD SPECULATION
----------------------------------------

Many CPUs speculate with loads: that is they see that they will need to load an
item from memory, and they find a time where they're not using the bus for any
other loads, and so do the load in advance - even though they haven't actually
got to that point in the instruction execution flow yet.  This permits the
actual load instruction to potentially complete immediately because the CPU
already has the value to hand.

It may turn out that the CPU didn't actually need the value - perhaps because a
branch circumvented the load - in which case it can discard the value or just
cache it for later use.

Consider:

1263
	CPU 1			CPU 2
1264
	=======================	=======================
1265 1266 1267 1268
				LOAD B
				DIVIDE		} Divide instructions generally
				DIVIDE		} take a long time to perform
				LOAD A
1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290

Which might appear as this:

	                                        :       :       +-------+
	                                        +-------+       |       |
	                                    --->| B->2  |------>|       |
	                                        +-------+       | CPU 2 |
	                                        :       :DIVIDE |       |
	                                        +-------+       |       |
	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
	division speculates on the              +-------+   ~   |       |
	LOAD of A                               :       :   ~   |       |
	                                        :       :DIVIDE |       |
	                                        :       :   ~   |       |
	Once the divisions are complete -->     :       :   ~-->|       |
	the CPU can then perform the            :       :       |       |
	LOAD with immediate effect              :       :       +-------+


Placing a read barrier or a data dependency barrier just before the second
load:

1291
	CPU 1			CPU 2
1292
	=======================	=======================
1293 1294 1295
				LOAD B
				DIVIDE
				DIVIDE
1296
				<read barrier>
1297
				LOAD A
1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341

will force any value speculatively obtained to be reconsidered to an extent
dependent on the type of barrier used.  If there was no change made to the
speculated memory location, then the speculated value will just be used:

	                                        :       :       +-------+
	                                        +-------+       |       |
	                                    --->| B->2  |------>|       |
	                                        +-------+       | CPU 2 |
	                                        :       :DIVIDE |       |
	                                        +-------+       |       |
	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
	division speculates on the              +-------+   ~   |       |
	LOAD of A                               :       :   ~   |       |
	                                        :       :DIVIDE |       |
	                                        :       :   ~   |       |
	                                        :       :   ~   |       |
	                                    rrrrrrrrrrrrrrrr~   |       |
	                                        :       :   ~   |       |
	                                        :       :   ~-->|       |
	                                        :       :       |       |
	                                        :       :       +-------+


but if there was an update or an invalidation from another CPU pending, then
the speculation will be cancelled and the value reloaded:

	                                        :       :       +-------+
	                                        +-------+       |       |
	                                    --->| B->2  |------>|       |
	                                        +-------+       | CPU 2 |
	                                        :       :DIVIDE |       |
	                                        +-------+       |       |
	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
	division speculates on the              +-------+   ~   |       |
	LOAD of A                               :       :   ~   |       |
	                                        :       :DIVIDE |       |
	                                        :       :   ~   |       |
	                                        :       :   ~   |       |
	                                    rrrrrrrrrrrrrrrrr   |       |
	                                        +-------+       |       |
	The speculation is discarded --->   --->| A->1  |------>|       |
	and an updated value is                 +-------+       |       |
	retrieved                               :       :       +-------+
1342 1343


1344 1345 1346 1347 1348
MULTICOPY ATOMICITY
--------------------

Multicopy atomicity is a deeply intuitive notion about ordering that is
not always provided by real computer systems, namely that a given store
1349 1350 1351 1352 1353 1354 1355
becomes visible at the same time to all CPUs, or, alternatively, that all
CPUs agree on the order in which all stores become visible.  However,
support of full multicopy atomicity would rule out valuable hardware
optimizations, so a weaker form called ``other multicopy atomicity''
instead guarantees only that a given store becomes visible at the same
time to all -other- CPUs.  The remainder of this document discusses this
weaker form, but for brevity will call it simply ``multicopy atomicity''.
1356

1357
The following example demonstrates multicopy atomicity:
1358 1359 1360 1361

	CPU 1			CPU 2			CPU 3
	=======================	=======================	=======================
		{ X = 0, Y = 0 }
1362 1363 1364
	STORE X=1		r1=LOAD X (reads 1)	LOAD Y (reads 1)
				<general barrier>	<read barrier>
				STORE Y=r1		LOAD X
1365

1366 1367 1368 1369 1370 1371
Suppose that CPU 2's load from X returns 1, which it then stores to Y,
and CPU 3's load from Y returns 1.  This indicates that CPU 1's store
to X precedes CPU 2's load from X and that CPU 2's store to Y precedes
CPU 3's load from Y.  In addition, the memory barriers guarantee that
CPU 2 executes its load before its store, and CPU 3 loads from Y before
it loads from X.  The question is then "Can CPU 3's load from X return 0?"
1372

1373
Because CPU 3's load from X in some sense comes after CPU 2's load, it
1374
is natural to expect that CPU 3's load from X must therefore return 1.
1375