5.3 Locality

As mentioned in section 1.2.2 Private Heaps and Locality, locality is the tendency for memory accesses close together in time to be close together in memory. Locality is a natural property of conventional computer programs. For code, the instruction that follows the current instruction in memory is overwhelmingly more likely to be executed next than some instruction selected at random. For data, if an element of an array is being accessed, another element of the same array is likely to be accessed next.

Once engineers took advantage of locality and introduced caches to computer systems, a virtuous circle was initiated in which high locality became a programming goal in order to make the most of the cache, which in turn increased the attraction of incorporating more and larger caches. The result in modern systems is depicted as a stylised cache chain between a Pentium processor and its memory. The paging and executable files are regarded as constituting 'true' memory and RAM as a cache.

When the processor needs to read a particular memory location, it may find it has been recently read for another purpose, so available in a scratch register with a latency of 1 nS. Otherwise the request is quite likely to be satisfied by level one cache or level two cache. Failing that, the processor has to interrogate the memory modules, taking 100 nS. If the system is short of memory, the target memory page may have been unloaded, in which event a page fault fires. With luck Windows NT will be able to find the page in its file cache, and then the page fault will be soft. If the operating system has to go all the way to the disc drive, the page fault is hard. Even then, the sought variable may be held in the semiconductor buffer of the hard drive, avoiding the indignity of having to read it from magnetic medium with a latency of 10 mS.

Different levels of cache come into play at different scales of time and memory. Though the processor caches are also important, the focus of this section will be the caching of pagefile memory by RAM. The critical question is how likely it is that an access to memory will cause a page fault. Locality reduces the frequency of page faults, for under the predominant least-recently-used replacement algorithms, a recently accessed memory page will persist in RAM.

For an application to be incurring regular page faults, the system must be under some degree of memory stress, with the aggregate working memory of all processes exceeding the quantity of RAM fitted to the machine. We will assume this for the model application, for it will be tuned to maximise throughput, and given that at any moment a proportion of connections will be idle (waiting for network or user response), the number of connections needs to be large to keep the processors working hard.

The NT native heap carves allocations out of (conceptually) one large block of memory, whereas LeapHeap segregates its small allocations on the basis of size. In the early life of either heap, a burst of allocations is likely to be placed contiguously in memory and thus promote locality. When the heap is mature this effect cannot be relied on; consecutive allocations will be scattered through memory according to the prior pattern of suitable free blocks.

The native heap (some NT versions) runs a last in first out policy for free blocks. The most recently freed block of a particular size will be used to satisfy the next allocation request for that size. This is beneficial because the block is likely to be still held in RAM.

LeapHeap's major locality advantage stems from its size segregation. Whenever a substantial, homogeneous list, tree or similar data structure is processed, the earliest-encountered elements may need page faults to bring them into RAM. Each memory page so fetched will contain numerous other cells of the same size, which are likely to be accessed a short time later without incurring further page faults.

As it happens this effect is not at its strongest for the model application, where each thread 'owns' only a small portion of each compartment. However the preferred-bit variation introduced in LeapHeap version 1.3 encourages a further segregation based on thread number. Suppose that only one of the record types that the program defines is of size 32-bytes, and that each thread builds elements of this record type into a long linked list, which it scans periodically. Under LeapHeap, the thread will find 128 such elements on every memory page it accesses, whereas under the Microsoft heap there might only be one element per page.

LeapHeap also wins by keeping the heap bookkeeping information in a concentrated form off-heap. Each kilobyte of cell bitmap controls hundreds of kilobytes of compartment memory. On frequency of access grounds it is bound to be the case that the bitmaps will almost always be in RAM. Having paid the memory cost of the bitmaps, no further page faults are incurred, for LeapHeap allocates and frees cells without accessing them or the memory around them. This saving is not important for allocations, for the application is presumably going to initialise or otherwise use an allocation soon after making it. It is significant for frees, as the memory tied up in timed-out connections can be released without having to bring the stale pages back into RAM.

The same argument holds for concentrated descriptor memory versus big-block area memory: a compact linked list is navigated rather than a heap-embedded list spread across many megabytes of memory (though in this case a freeing thread does have to touch the memory it is freeing).