Memory is central to the operation of modern computer systems. Memory consists of a large array of bytes, each with its own address. The CPU fetches instructions from memory according to the program counter. A typical instruction-execution cycle first fetches an instruction from memory. The instruction is then decoded and may cause operands to be fetched from memory. After the instruction is executed, the result may be stored back in memory.
Main memory and the registers built into the processor itself are the only storage that the CPU can access directly. Hence, for a program to be executed, it must be mapped to absolute addresses and loaded into memory. Once loaded, a process (an active execution of a program) accesses program instructions and reads data from and writes data to memory using these absolute addresses. In the same way, for the CPU to process data from disk, those data must first be transferred to main memory by CPU-generated I/O calls.
While the CPU can access and operate on its registers within one cycle of the CPU clock, the same cannot be said about main memory. Main memory is accessed via a transaction on the memory bus, and completing a memory access may take several CPU clock cycles, during which the CPU has to stall. Hence, a fast memory buffer, called a cache, is added between the CPU and main memory to accommodate the speed differential.
Allocation Strategy
Each running process should have a separate memory space. It’s the responsibility of the operating system to allocate memory for each process, ensuring that processes don’t interfere with each other. Furthermore, the main memory must accommodate both the OS and the various user processes. Hence, efficient memory allocation strategies are necessary.
One of the simplest methods for allocating memory is to assign each process a variable-sized contiguous block of memory (a partition), where each partition contains exactly one process. This is known as contiguous memory allocation.
As processes are introduced to the system, they are put into an input queue. The OS takes into account the memory requirements of each incoming process and the amount of memory space available in determining which processes will be allocated. After allocation, the process is loaded into memory and starts its execution. Once the process is finished, the operating system reclaims the memory block, making it available for other processes. If there is not enough room for an incoming process, the operating system may need to swap out some processes to disk, or alternatively place such processes into a wait queue, to free up memory space. When memory is later released, the operating system checks the wait queue to determine whether the freed memory satisfies the demands of a waiting process.
During allocation, the operating system must look for a sufficiently large contiguous block of memory for the process. There are many algorithms to do this:
- First Fit — Allocate the first block of memory that is large enough for the process. This is a greedy approach: the search stops as soon as a fit is found.
- Best Fit — Allocate the smallest block of memory that is large enough for the process. This involves searching the entire list of free blocks, unless the list is ordered by size.
- Worst Fit — Allocate the largest block of memory that can hold the process. This also involves searching the entire list of free blocks.
Both first-fit and best-fit outperform worst-fit in terms of time and storage utilisation. Neither is clearly better in storage utilisation, but first-fit is generally faster.
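The three policies can be sketched as a search over a list of free-hole sizes. This is a simplified stand-in for a real free list, which would also track block addresses:

```c
/* Minimal sketch of first-, best-, and worst-fit hole selection.
   Each function returns the index of the chosen hole, or -1 if
   no hole is large enough. */
#include <stddef.h>

int first_fit(const size_t *holes, int n, size_t request) {
    for (int i = 0; i < n; i++)
        if (holes[i] >= request)
            return i;                 /* greedy: stop at the first fit */
    return -1;
}

int best_fit(const size_t *holes, int n, size_t request) {
    int best = -1;
    for (int i = 0; i < n; i++)       /* must scan the whole list */
        if (holes[i] >= request &&
            (best == -1 || holes[i] < holes[best]))
            best = i;                 /* smallest hole that still fits */
    return best;
}

int worst_fit(const size_t *holes, int n, size_t request) {
    int worst = -1;
    for (int i = 0; i < n; i++)       /* must scan the whole list */
        if (holes[i] >= request &&
            (worst == -1 || holes[i] > holes[worst]))
            worst = i;                /* largest hole that fits */
    return worst;
}
```

For example, given holes of sizes {100, 500, 200, 300, 600} and a request of 212, first fit picks the 500-byte hole, best fit the 300-byte hole, and worst fit the 600-byte hole.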
Fragmentation
As processes are allocated and de-allocated to and from memory, the free memory blocks break up into smaller blocks. Fragmentation occurs when the memory space is divided into numerous small, non-contiguous blocks, making it difficult to utilise the memory efficiently.
Internal
Internal fragmentation occurs when a process does not fully utilise its assigned memory block. The leftover space within the block is wasted, since it cannot be allocated to any other process. Over time, this creates numerous small fragments of unusable memory scattered throughout the memory space.
External
External fragmentation occurs when there is enough total free memory to satisfy a request, but it is not contiguous: the free memory is fragmented into a large number of small blocks. If all these scattered fragments were combined into a single large block, the system might be able to run several more processes.

Paging
To avoid fragmentation, a more sophisticated memory allocation strategy is generally used. In this strategy, the main memory is divided into fixed-sized blocks called frames. Rather than allocating a contiguous block of memory for each process, the OS allocates multiple frames which can be scattered throughout the main memory space. This strategy is known as Paging.
However, with paging, the concepts of physical and virtual memory come into play. Physical memory refers to the actual main memory installed in the system, while virtual memory is an abstraction used by the OS to manage processes and memory allocation. A process can only access memory through virtual addresses, while the OS maps the virtual memory to the physical memory.
Note
While a process's physical memory might be non-contiguous, scattered across the memory space in the form of multiple frames, its virtual memory appears to the process as its own isolated, contiguous memory block.
The virtual memory is also divided into fixed-size blocks of the same size as the frames in physical memory; here these blocks are called pages. Every page in virtual memory has a page number p, and every frame in physical memory has a frame number f. Every memory address also carries an offset d that identifies a specific location within a page or frame. Generally, p and f reside in the higher bits of the memory address and d resides in the lower bits.

The mapping between virtual pages and physical frames is maintained in a data structure called the Page Table. The page table is an isolated data structure which is maintained for each process separately.
In order to obtain the physical address of a virtual one, the following steps are performed:
- The page number p is extracted from the virtual address.
- The page table is accessed to retrieve the corresponding frame number f.
- Replace the page number p with the frame number f in the virtual address.
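The steps above can be sketched in a few lines. The 4 KiB page size and the flat, single-level page table are illustrative assumptions, not a description of any real MMU:

```c
/* Virtual-to-physical translation with 4 KiB pages (12 offset bits)
   and a flat page table indexed by page number. */
#include <stdint.h>

#define OFFSET_BITS 12
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1)

uint32_t translate(const uint32_t *page_table, uint32_t vaddr) {
    uint32_t p = vaddr >> OFFSET_BITS;  /* page number: high bits */
    uint32_t d = vaddr & OFFSET_MASK;   /* offset: low bits */
    uint32_t f = page_table[p];         /* page-table lookup */
    return (f << OFFSET_BITS) | d;      /* replace p with f, keep d */
}
```

For instance, with a page table mapping page 1 to frame 2, virtual address 0x1034 (page 1, offset 0x34) translates to physical address 0x2034.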
Most systems allow the page table to be quite large (for example, 1 million entries), in which case it is not possible to keep the page table in fast registers. Instead, it is kept in main memory, and a page-table base register (PTBR) points to it. Switching page tables, for example on a context switch, then requires changing only the address in this one register.
The main issue with the above scheme is that every data lookup requires two memory accesses: one to fetch the frame number from the page table and one for the actual data in physical memory. The standard solution to this problem is to use a special, small and fast lookup cache called a translation look-aside buffer (TLB). The TLB is an associative, high-speed memory, where entries are stored in key-value form.
The TLB contains only a few page-table entries. When a logical address is generated by the CPU, the page number is first used to look up the corresponding frame number in the TLB. If the page number is found (a TLB hit), the corresponding frame number is used to access memory. If not (a TLB miss), a page-table lookup is performed. In addition, after a page-table lookup, the page-number-to-frame-number mapping is added to the TLB to speed up successive lookups.

The most commonly used replacement policies (applied when the TLB is full) are least recently used (LRU) and random replacement. There is also an option to mark certain entries as wired down, meaning they cannot be removed from the TLB. Generally, kernel code addresses are wired down in the TLB.
Demand Paging
As mentioned earlier, for a program to execute, it must first be loaded into main memory. However, when dealing with large programs, it’s not always necessary to load the entire program at once; only the portion currently needed has to be resident. This is where demand paging comes into play. With demand paging, only the required pages of a program are loaded into memory, on demand.
As a process executes, some of its pages are loaded into memory, while others remain on disk storage (i.e., backing store). To manage this, an additional column called the valid-invalid bit is included in the page table to indicate the status of each page. If the bit is set to valid (v), it means the page is both legal (i.e. belonging to the process’s logical address space) and currently loaded in memory. If the bit is invalid, the page is either outside the process’s logical address space (illegal) or it is a valid page that currently resides on disk.
When a process tries to access a page whose valid-invalid bit is set to invalid (i), a page fault occurs, causing a trap to the operating system. The operating system then follows these steps to handle the page fault:
- It checks the process’s internal table to determine whether the memory reference is valid.
- If the address is invalid (i.e., not part of the process’s logical address space), the process is terminated.
- If the address is valid but the page is not currently in memory, the page is paged in.
- The operating system locates a free frame in physical memory.
- It instructs the disk to read the required page into the newly allocated frame.
- Once complete, the process’s internal table and page table are updated to reflect the presence of the page.
- The process then resumes execution from the instruction that caused the page fault.
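The steps above can be simulated in user space. The fixed page count and the ever-growing free-frame counter are simplifying assumptions (a real OS maintains a free-frame list and evicts pages when it runs out):

```c
/* Simulated page-fault handling: the page table marks pages
   valid/invalid, and a fault on an invalid-but-legal page
   "pages it in" from a pretend backing store. */
#include <stdbool.h>

#define NUM_PAGES 8

enum outcome { ACCESS_OK, PAGED_IN, SEGFAULT };

struct pte { int frame; bool valid; };

static struct pte page_table[NUM_PAGES];
static int next_free_frame = 0;       /* stand-in for the free-frame list */

enum outcome access_page(int p) {
    if (p < 0 || p >= NUM_PAGES)
        return SEGFAULT;              /* illegal reference: terminate */
    if (page_table[p].valid)
        return ACCESS_OK;             /* page already in memory */
    /* Page fault: grab a free frame, "read" the page from disk,
       mark the entry valid, and restart the instruction. */
    page_table[p].frame = next_free_frame++;
    page_table[p].valid = true;
    return PAGED_IN;
}
```

The first access to a legal page faults and pages it in; repeating the access then succeeds immediately, while an out-of-range page number terminates the process.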
Segmentation
Segmentation is yet another memory allocation strategy, quite different from paging. In this scheme, the virtual address space is a collection of variable-length segments. Each logical address specifies a segment number and an offset within that segment.
Normally during compilation, the compiler automatically constructs segments reflecting the structure of the input program. For example, libraries that are linked at compile time might be assigned separate segments. The loader then takes all these segments and assigns them segment numbers.

In practice, a segment table is used to map the two-dimensional logical address to a one-dimensional physical address. Each entry in the segment table has a segment base and a segment limit. The segment base contains the starting physical address where the segment resides in memory, and the segment limit contains the length of that segment.

Every logical address consists of a segment number (s) and a segment offset (d). The segment number is used as an index into the segment table. The segment offset d must be between 0 and the segment limit; otherwise, a fault known as a segmentation fault is raised. If the segment offset is within the legal range, the offset is added to the segment base to produce the final physical address.
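This check-and-add translation is compact enough to sketch directly. The table layout is illustrative; real hardware also stores protection bits per segment:

```c
/* Segment-table translation: a legal offset is added to the segment
   base; an out-of-range offset is a segmentation fault. */
#include <stdint.h>

struct segment { uint32_t base, limit; };

/* Returns the physical address, or -1 on a segmentation fault. */
int64_t seg_translate(const struct segment *table, int s, uint32_t d) {
    if (d >= table[s].limit)
        return -1;                    /* offset outside the segment */
    return (int64_t)table[s].base + d;
}
```

For example, with segment 0 at base 1000 and limit 400, offset 50 translates to physical address 1050, while any offset of 400 or more faults.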
Memory Layout
The memory layout of a program refers to how the program's data is stored in memory during execution. Understanding this layout helps developers manage memory more efficiently and avoid issues such as segmentation faults and memory leaks. This layout is more than a technical detail; it governs how code executes, how data is stored, and how programs interact with the operating system.

The memory layout is divided into several segments:
- Kernel virtual memory space: Reserved for the kernel and is not accessible to user processes.
- (User) Stack: Holds the stack frames of the process’s main thread and grows downward.
- Memory mapped regions: Memory allocated by `mmap` for shared memory, file or anonymous mappings.
- Heap: Memory allocated by the process for dynamic memory allocation, which grows upward.
- Initialized data segment: Contains global and static variables that are initialized by the program.
- Uninitialized data segment (`.bss`): Contains the program’s uninitialized global and static variables.
- Read-only code segment: Contains the executable code of the program, which is typically read-only.
Note
These segments are merely pages within the process’s virtual address space.
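A quick way to see these segments is to print the address of one object from each. The exact addresses vary per run (due to address-space layout randomization) and per platform; only the rough ordering is typical of, say, Linux on x86-64:

```c
/* Prints one representative address from each user-visible segment.
   Addresses are platform-dependent; this is purely illustrative. */
#include <stdio.h>
#include <stdlib.h>

int initialized = 42;   /* initialized data segment */
int uninitialized;      /* uninitialized data segment (.bss) */

void print_layout(void) {
    int local = 0;                            /* stack */
    int *dynamic = malloc(sizeof *dynamic);   /* heap */

    printf("code  : %p\n", (void *)print_layout);
    printf("data  : %p\n", (void *)&initialized);
    printf(".bss  : %p\n", (void *)&uninitialized);
    printf("heap  : %p\n", (void *)dynamic);
    printf("stack : %p\n", (void *)&local);

    free(dynamic);
}
```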
Stack
Every process has a stack which tracks the local variables and function calls during execution. For visualization, the stack grows downwards as more functions are invoked and local variables are created, and shrinks upwards as functions return and are taken off the stack.
When a function is called, a new stack frame is created on the stack, which contains the function’s local variables, parameters, and return address. When the function returns, its stack frame is popped off the stack, deallocating all variables within that stack frame.
Every thread has its own stack. Since a process can have multiple threads, there may be multiple stacks within a process. The size of the stack is fixed during thread creation and cannot be dynamically resized during runtime. Since stack allocation is determined at compile time, it is the compiler’s responsibility to calculate the size of all variables and generate the corresponding assembly code to allocate them on the stack. The compiler also emits instructions to deallocate the stack frame when the function returns.
In order to keep track of the stack, the CPU uses a special register called the stack pointer. Before a thread starts executing, the stack pointer is initialized to point to the top of the stack. As the stack is pre-allocated at thread creation, allocating a variable on the stack is simply a matter of moving the stack pointer down, which is a very fast operation. Loading a variable from the stack is also fast, as it only requires reading the value at an address relative to the stack pointer.
There is also another special register called the base pointer (or frame pointer), which points to the start of the current stack frame. It’s used to provide a stable reference point for accessing local variables as well as function parameters. The stack deallocation is done by simply setting the stack pointer back to the base pointer, which is fast.
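The downward growth can be observed by comparing the address of a local in a nested call with one in its caller. Stack direction is implementation-defined, so this is an observation, not a guarantee; on mainstream x86-64 systems the nested frame usually sits at the lower address:

```c
/* Compares frame addresses across a call to observe stack direction.
   Implementation-defined behavior, shown purely for illustration. */
#include <stdint.h>

static uintptr_t inner(void) {
    int inner_local = 0;
    return (uintptr_t)&inner_local;   /* address inside inner's frame */
}

/* Returns 1 if the callee's frame sits at a lower address, else 0. */
int stack_grows_down(void) {
    int outer_local = 0;
    uintptr_t callee = inner();
    return callee < (uintptr_t)&outer_local;
}
```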
Heap
The heap is the segment where dynamic memory allocation takes place. Unlike stack memory, heap memory is allocated explicitly by developers, and it is not deallocated until it is explicitly freed, either using `free` or by the garbage collector associated with the programming language.

The current limit of the heap is referred to as the program break, abbreviated as `brk`. Resizing the heap is as simple as telling the kernel to adjust its idea of where the process’s program break is. After the program break is increased, the program may access any address in the newly allocated area, but no physical memory pages are allocated yet.

As the heap is shared across threads, mutexes are used internally to protect the memory-management data structures employed by the allocation functions and avoid corruption in multithreaded applications. In a multithreaded application in which threads simultaneously allocate and free memory, there can be contention for these mutexes. This is one reason heap allocation is less efficient than stack allocation.
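On Linux-like systems the program break can be moved directly with `sbrk`, as sketched below. This is only to illustrate the break; modern allocators mix `brk` and `mmap`, and `sbrk` should not be combined with `malloc`-managed memory in real code:

```c
/* Grows the heap by moving the program break up with sbrk(2). */
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Extends the heap by `size` bytes and returns the start of the new
   region (the old break), or NULL if the kernel refuses. */
void *grow_heap(size_t size) {
    void *old_brk = sbrk((intptr_t)size);   /* move the break up */
    if (old_brk == (void *)-1)
        return NULL;
    return old_brk;
}
```

After the call succeeds, the whole new region is usable, even though physical pages are only faulted in on first touch.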
Memory Mapped Regions
There are additional segments in the virtual memory space other than the heap and the stack, called memory mapped regions. In this segment, the memory space is either mapped directly to a file (or a region of a file) on disk, known as a file mapping, or the pages of the mapping are initialized to 0, denoting that the mapping has no corresponding file; in this case, it’s called an anonymous mapping. A file mapping maps a region of a file directly into the calling process’s virtual memory, allowing its contents to be accessed by operations on the bytes in the corresponding memory region.
A memory mapped region can be private (i.e. copy-on-write) or shared. Private means that a process’s modifications are not visible to others: whenever a process attempts to modify the contents of a page, the kernel first creates a new, separate copy of that page for the process and adjusts the process’s page tables. Conversely, if the memory mapped region is shared, then all processes that share the same memory mapped region can see the changes made by any of them.
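A private anonymous mapping is a one-liner with `mmap`. Swapping `MAP_PRIVATE` for `MAP_SHARED`, or `MAP_ANONYMOUS` for a real file descriptor, yields the other combinations described above:

```c
/* Creates a private anonymous mapping: a zero-filled, page-aligned
   region not backed by any file. */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Returns a zero-initialized region of `len` bytes, or NULL on failure. */
void *alloc_region(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Because the mapping is anonymous, every byte in the returned region reads as zero until the program writes to it.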
Since anonymous memory mappings are not backed by a file and are always zero-initialized, they are ideal for runtimes that implement their own memory allocation strategies, such as Go, rather than relying on the operating system’s default allocators like `malloc` and `free`. This allows greater control over memory management, enabling features like custom allocators or garbage collection tailored to the runtime’s needs.
References
- Nghia Nguyen, Fundamental of Virtual Memory
- Complete virtual memory map with 4-level page tables
- University of Illinois, Virtual Memory
- Abraham Silberschatz, Peter B. Galvin, Greg Gagne, Operating System Concepts