Practical, Transparent Operating System Support for Superpages | USENIX
www.usenix.org
Abstract
Most general-purpose processors provide support for memory pages of large sizes, called superpages. Superpages enable each entry in the translation lookaside buffer (TLB) to map a large physical memory region into a virtual address space. This dramatically increases TLB coverage, reduces TLB misses, and promises performance improvements for many applications. However, supporting superpages poses several challenges to the operating system, in terms of superpage allocation and promotion tradeoffs, fragmentation control, etc. We analyze these issues, and propose the design of an effective superpage management system. We implement it in FreeBSD on the Alpha CPU, and evaluate it on real workloads and benchmarks. We obtain substantial performance benefits, often exceeding 30%; these benefits are sustained even under stressful workload scenarios.
Introduction
- TLB coverage is defined as the amount of memory accessible through these cached mappings
- Superpages can improve performance by over 30% for many applications
- However, naively enlarging pages inflates the application footprint ▶ increased physical memory requirements, higher paging traffic
- Solution: Mixture of page sizes ▶ Results in physical memory fragmentation
- This paper ...
- Reserves a larger contiguous region of physical memory in anticipation of subsequent allocations
- Superpages are created in increasing sizes as the process touches pages in this region
- If the system runs out of contiguous physical memory ▶ preempt portions of unused contiguous regions
- If those regions are exhausted ▶ system restores contiguity by biasing the page replacement scheme to evict contiguous inactive pages
- Contributions are ...
- Extends a previously proposed reservation-based approach to work with multiple, potentially very large superpage sizes
- Investigate the effect of fragmentation on superpages
- Propose a contiguity-aware page replacement algorithm to control fragmentation
- Tackles issues such as superpage demotion and eviction of dirty superpages
The Superpage problem
- TLB coverage has lagged behind main memory growth (see the rough arithmetic below)
- TLB access is in the critical path of every memory access ▶ kept small and fast, with 128 or fewer entries
- Many machines now ship with large, physically addressed on-board caches (larger than TLB coverage) ▶ TLB misses require a memory access to find a translation for data that is already in the cache.
- A solution could be increasing the base page size ▶ Internal fragmentation, higher I/O demands
- Therefore, supporting multiple page sizes could solve this problem ▶ several challenges
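A rough back-of-the-envelope sketch of why superpages matter for TLB coverage, assuming illustrative numbers (a 128-entry TLB, the Alpha's 8KB base pages and 4MB superpages), not the paper's measurements:

```c
/* Illustrative TLB-coverage arithmetic; the numbers are assumptions. */
#include <stdio.h>

int main(void) {
    long entries   = 128;               /* typical data-TLB size of the era */
    long base_page = 8L * 1024;         /* Alpha base page: 8KB */
    long superpage = 4L * 1024 * 1024;  /* largest Alpha superpage: 4MB */

    printf("coverage with base pages: %ld KB\n", entries * base_page / 1024);
    printf("coverage with superpages: %ld MB\n", entries * superpage / (1024 * 1024));
    return 0;
}
```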
- Challenge: Hardware-imposed constraints
- The page size must be among a set of page sizes supported by the processor
- A superpage is required to be contiguous in both the physical and virtual address spaces
- Its starting address in the physical/virtual address space must be a multiple of its size (alignment check sketched below)
- The TLB entry for a superpage provides only a single reference bit, a single dirty bit, and a single set of protection attributes
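A minimal sketch of the alignment constraint; the function name and the use of plain integer addresses are illustrative assumptions, not the paper's code:

```c
/* Check whether a candidate mapping satisfies the hardware-imposed superpage
 * constraints: the mapping must be size-aligned in both the virtual and the
 * physical address space. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool can_map_as_superpage(uintptr_t vaddr, uintptr_t paddr, size_t spsize)
{
    /* spsize is assumed to be a supported size, e.g. 64KB, 512KB or 4MB on Alpha. */
    return (vaddr % spsize == 0) && (paddr % spsize == 0);
}
```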
- Issues and tradeoffs
- Memory objects: contiguous region of virtual address space and contains application-specific data (memory mapped files, code, data, stack, heap segments)
- Allocation
- Relocation-based allocation: copies page contents into a contiguous physical region when creating a superpage for an object whose pages have already been allocated
- Reservation-based allocation: reserves page frames equal in size and alignment to the maximum desired superpage size ▶ actual allocation is done when the corresponding base pages are touched
- Trade-off: Performance gains of using a large superpage vs Retaining the contiguous region for later
- Fragmentation Control
- OS may proactively release contiguous chunks of inactive memory from previous allocations
- OS may also preempt an existing partially used reservation, given that the reservation may never become a superpage
- Trade-off: Impact of various contiguity restoration techniques vs Benefits of using larger superpages
- Promotion
- If base pages meet condition (size, contiguity, alignment, protection) ▶ Promotion
- Promotion involves updating the PTE
- Trade-off: Benefits of early promotion (Reduced TLB misses) vs Increased memory consumption
- Demotion
- Reduce the size of a superpage to base pages or to smaller superpages ▶ Demotion
- Single reference bit ▶ Difficult to know which portions of a superpage are actively used
- Eviction
- Memory pressure ▶ superpages may need to be evicted from physical memory
- Single dirty bit ▶ the whole superpage may have to be flushed out
Related Approaches
Existing superpage solutions for application memory
- Reservation-based schemes ▶ Preserve contiguity
- Relocation-based approaches ▶ Create contiguity
- Hardware-based mechanisms ▶Reduce or Eliminate contiguity requirements
- Reservations
- Superpage-aware allocation decisions at page-fault time
- Talluri and Hill
- Promoted when the number of frames in use reaches a promotion threshold
- HP-UX and IRIX
- System may allocate several contiguous frames, promote them into a superpage
- Superpage size is based on memory availability at allocation time + page size hint
- Problem: not transparent (requires experimentation to determine the optimum superpage size)
- Page Relocation
- Can be implemented entirely and transparently in the hardware-dependent layer of the OS ▶ needs to relocate most of the allocated base pages of a superpage prior to promotion
- Romer et al.
- Competitive algorithm (online cost-benefit analysis) to weigh the benefit of superpages against the promotion overhead
- Requires a software-managed TLB ▶ the miss counters are maintained in the TLB miss handler
- In the absence of memory contention, performs strictly worse than a reservation-based approach
- Relocation Costs
- More TLB misses (relocation is a response to an excessive number of TLB misses)
- TLB misses are more expensive (Software based, complex TLB miss handler)
- Better performance in terms of fragmentation
- Hybrid approach(?)
- Do relocation whenever reservations fail + large number of TLB misses is detected
- Do relocation as background task to do off-line memory compaction (▶ IRIX coalescing daemon)
- Hardware Support
- Hardware support can reduce or eliminate the contiguity requirement for superpages
- Talluri and Hill, Partial-Subblock TLBs
- Allow 'holes' for missing base pages
- Most of the benefits of superpages can be obtained with minimal modifications to the OS
- Yield only moderately larger TLB coverage
- Not clear how to extend the partial-subblock TLBs to multiple superpage sizes
- Fang et al.
- Hardware support that completely eliminates the contiguity requirement of superpages
- Additional level of address translation in the memory controller
Design
- Paper's work
- Generalizes Talluri and Hill's reservation mechanism to multiple superpage sizes.
- To regain contiguity on fragmented physical memory without relocating pages, it biases the page replacement policy to select those pages that contribute the most to contiguity.
- Tackles issues of demotion and eviction
- Does not require special HW support
- Data structures used to control allocation, promotion, and demotion (sketched below)
- Available physical memory is classified into contiguous regions of different sizes ▶managed by buddy allocator
- Multi-list reservation scheme used to track partially used memory reservations and help choose reservations for preemption
- Population Map ▶Keeps track of memory allocations in each memory object
- Handle Fragmentation by performing page replacements in a contiguity-aware manner
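A minimal sketch of the bookkeeping this design implies; the struct layout and field names are my own assumptions, not FreeBSD's actual types:

```c
/* Illustrative reservation bookkeeping: one reservation list per supported page
 * size, each kept sorted with the most recently allocated-into reservation at
 * the head. */
#define NUM_PAGE_SIZES 4              /* Alpha: 8KB, 64KB, 512KB, 4MB */

struct mem_object;                    /* a mapped file, heap, stack, ... */

struct reservation {
    struct mem_object  *object;       /* memory object backing this reservation */
    unsigned long       base_pindex;  /* first base-page index it covers */
    int                 size_index;   /* superpage size it was reserved for */
    int                 populated;    /* base pages already allocated (touched) */
    unsigned long       last_alloc;   /* time of most recent frame allocation */
    struct reservation *next;         /* link in its reservation list */
};

struct reservation *reservation_lists[NUM_PAGE_SIZES];
```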
- Reservation-based allocation
- Determines a preferred superpage size that contains the faulted page (fault path sketched after this list)
- Contiguous page frames allocated from buddy allocator
- Mapping of the base page that caused the page fault is entered into the page table
- Other base pages are reserved ▶ Reservation List
- Preferred superpage size policy
- Looks at the attributes of the memory object to which the faulting page belongs
- If the chosen size turns out too large ▶ can be corrected later by preempting the reservation
- If the chosen size turns out too small ▶ only recoverable through relocation (BAD!)
- Tends to choose the maximum superpage size that can be effectively used in an object
- Fixed memory objects (code, memory-mapped files) ▶Largest, aligned, does not overlap with existing reservations, does not reach beyond the end of object
- Dynamically sized memory objects (stacks, heaps) ▶One page at a time (limited to the current size of the object)
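A hedged sketch of the fault-time allocation path described above; every helper name here is hypothetical:

```c
/* Reservation-based allocation on a page fault: reuse an existing reservation if
 * one covers the faulted page, otherwise reserve a new extent of the preferred
 * size, then map only the faulted base page. */
struct mem_object;
struct reservation;

struct reservation *popmap_lookup_reservation(struct mem_object *obj, unsigned long pindex);
int  preferred_superpage_size(struct mem_object *obj, unsigned long pindex);
struct reservation *buddy_reserve_extent(int size_idx);   /* may trigger preemption */
void popmap_insert_reservation(struct mem_object *obj, unsigned long pindex,
                               struct reservation *resv);
unsigned long frame_for(struct reservation *resv, unsigned long pindex);
void pmap_enter_base_page(struct mem_object *obj, unsigned long pindex, unsigned long frame);
void reservation_record_allocation(struct reservation *resv);  /* moves it to list head */

void handle_page_fault(struct mem_object *obj, unsigned long pindex)
{
    struct reservation *resv = popmap_lookup_reservation(obj, pindex);

    if (resv == NULL) {
        /* Largest aligned size that fits the object and overlaps no other
         * reservation; dynamically sized objects get one page at a time. */
        int size_idx = preferred_superpage_size(obj, pindex);
        resv = buddy_reserve_extent(size_idx);
        popmap_insert_reservation(obj, pindex, resv);
    }

    /* Map only the faulted base page; the rest of the extent stays reserved. */
    pmap_enter_base_page(obj, pindex, frame_for(resv, pindex));
    reservation_record_allocation(resv);
}
```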
- Preempting Reservations
- When no free extent of the desired size is available, the options are:
- Refusing the allocation and thus reserving a smaller extent than desired
- Preempting an existing reservation that has enough unallocated frames (selected by this design)
- When more than one reservation can yield the needed frames ▶ choose the one whose most recent page allocation occurred least recently (sketch below)
- Rationale: useful reservations are often populated quickly, and a reservation with no recent allocations is less likely to become fully populated in the future
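A small sketch of that preemption choice, assuming the list layout from the earlier sketch (most recently allocated-into reservation at the head, so the victim is the tail); names are hypothetical:

```c
/* Pick the preemption victim among reservations that can yield an extent of the
 * needed size: the one whose most recent page allocation is the oldest. */
#include <stddef.h>

#define NUM_PAGE_SIZES 4

struct reservation { struct reservation *next; /* ...other bookkeeping... */ };
extern struct reservation *reservation_lists[NUM_PAGE_SIZES];

struct reservation *choose_victim(int needed_size_idx)
{
    struct reservation *r = reservation_lists[needed_size_idx];
    if (r == NULL)
        return NULL;        /* nothing preemptible at this size */
    while (r->next != NULL)
        r = r->next;        /* tail = least recently allocated into */
    return r;
}
```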
- Fragmentation Control
- Buddy allocator performs coalescing of available memory regions whenever possible
- Under persistent memory pressure ▶Page replacement daemon is modified to perform contiguity-aware page replacement
- Incremental Promotions
- Superpages are created as soon as any superpage-sized and aligned extent within a reservation gets fully populated
- Promotion is performed only for regions that are fully populated (sketched below)
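A hedged sketch of incremental promotion; the helper names and exact loop structure are my assumptions:

```c
/* Starting from the smallest superpage size, promote every aligned extent around
 * the faulted page that has become fully populated; stop at the first size that
 * is not yet full, since larger extents cannot be full either. */
struct mem_object;

#define NUM_PAGE_SIZES 4
unsigned long pages_per_size(int size_idx);   /* base pages per superpage of this size */
int  popmap_fully_populated(struct mem_object *obj, unsigned long start, unsigned long npages);
void pmap_promote(struct mem_object *obj, unsigned long start, int size_idx);

static unsigned long align_down(unsigned long x, unsigned long a) { return x - (x % a); }

void maybe_promote(struct mem_object *obj, unsigned long pindex)
{
    for (int s = 1; s < NUM_PAGE_SIZES; s++) {
        unsigned long npages = pages_per_size(s);
        unsigned long start  = align_down(pindex, npages);
        if (!popmap_fully_populated(obj, start, npages))
            break;
        pmap_promote(obj, start, s);   /* rewrite mappings to use the superpage */
    }
}
```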
- Speculative Demotions
- 1) Demotions occur as a side effect of page replacement ▶ when the page daemon selects a base page for eviction, the superpage containing that base page is demoted.
- Demotions are also incremental (and recursive)
- 2) Demotions can also occur when the protection attributes of part of a superpage are changed (the hardware provides only a single set of attributes per superpage)
- 3) Speculative demotions ▶ In order to determine if the superpage is still being actively used in its entirety
- When the page daemon resets the reference bit of a superpage's base page and there is memory pressure ▶ recursively demotes the superpage with a certain probability p (= 1%) (sketch below)
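A minimal sketch of that probabilistic demotion; the function names are hypothetical and the probability is the 1% quoted above:

```c
/* Speculative demotion: when the daemon clears the reference bit of a base page
 * that belongs to a superpage while memory is tight, demote one level with
 * probability p so reference bits can be re-established at finer granularity. */
#include <stdlib.h>

#define DEMOTION_PROB 0.01             /* p = 1% */

struct superpage;
void demote_one_level(struct superpage *sp);   /* incremental; may recurse later */

void on_reference_bit_reset(struct superpage *sp, int memory_pressure)
{
    if (memory_pressure && (double)rand() / (double)RAND_MAX < DEMOTION_PROB)
        demote_one_level(sp);
}
```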
- Paging out dirty superpages (Eviction)
- Demote clean superpages whenever a process attempts to write into them ▶ repromote later if all the base pages are dirtied
- Inferring dirty base pages using hash digests (seems like an infeasible approach to me...; sketch below)
- Clean memory page is read from disk ▶ cryptographic hash digest recorded
- When page is flushed, re-calculate the hash value and compare them
- Hash overhead too big!
- Solution 1. Hash computation can be postponed ▶Fully-clean or fully-dirty superpages and unpromoted base pages need not be hashed
- Solution 2. Perform hashing entirely from the idle loop ... (overkill...)
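An illustrative sketch of the hash-based dirty inference; the digest function and metadata layout are stand-ins, not the paper's implementation:

```c
/* Remember a digest of each base page while it is known to be clean; at pageout
 * time, only base pages whose digest no longer matches need to be written back. */
#include <stddef.h>
#include <stdint.h>

struct basepage_meta {
    uint64_t clean_digest;   /* recorded when the page was read in from disk */
};

uint64_t page_digest(const void *page, size_t len);   /* e.g. a cryptographic hash */

int needs_writeback(const struct basepage_meta *m, const void *page, size_t page_size)
{
    return page_digest(page, page_size) != m->clean_digest;
}
```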
- Multi-list reservation scheme
- Reservation lists: Reserved page frame extents that are not fully populated
- One reservation list for each page size supported
- A reservation has at least one of its frames allocated ▶ the largest extents it can yield if preempted are one page size smaller than its own size (e.g. on the Alpha, a 512KB reservation yields at most 64KB extents and a 4MB reservation at most 512KB)
- Reservations in each list are kept sorted by the time of their most recent page frame allocations
- Preemption process (sketched below)
- The reservation is broken into smaller extents
- Unpopulated extents are transferred to the buddy allocator
- Partially populated ones are reinserted to the appropriate lists
- Fully populated extents are not reinserted into the reservation lists
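A hedged sketch of those preemption steps; the extent helpers are hypothetical:

```c
/* Preempting a reservation: split it into extents one size smaller, hand empty
 * extents back to the buddy allocator, re-list partially populated ones, and
 * stop tracking fully populated ones. */
struct reservation;
struct extent;

int  child_extent_count(const struct reservation *r);
struct extent *child_extent(struct reservation *r, int i);
int  extent_population(const struct extent *e);   /* allocated base pages */
int  extent_is_full(const struct extent *e);
void buddy_free_extent(struct extent *e);
void resv_list_insert(int size_idx, struct extent *e);

void preempt_reservation(struct reservation *r, int child_size_idx)
{
    for (int i = 0; i < child_extent_count(r); i++) {
        struct extent *e = child_extent(r, i);
        if (extent_population(e) == 0)
            buddy_free_extent(e);                 /* back to the buddy allocator */
        else if (!extent_is_full(e))
            resv_list_insert(child_size_idx, e);  /* still partially reserved */
        /* fully populated extents need no further tracking */
    }
}
```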
- Population map
- Keep track of allocated base pages within each memory object
- Queried on every page fault
- Uses a radix tree (one level per supported page size; node sketched below)
- lazy update of (somepop, fullpop)
- Hash table used to locate population maps, (memory_object, page_index)
- Purpose of population map
- On each page fault, they enable the OS to map the virtual address to a page frame that may already be reserved for this address
- While allocating contiguous regions in physical address space, they enable the OS to detect and avoid overlapping regions
- Assist in making page promotion decisions
- Help identifying unallocated regions during preemption
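A minimal sketch of a population-map node, under the assumption (true on the Alpha) that each superpage size is 8 times the next smaller one; the field names follow the counters mentioned above:

```c
/* One radix-tree level per supported page size; somepop and fullpop are updated
 * lazily as pages are allocated within the subtree. */
#define POPMAP_RADIX 8   /* Alpha: 8KB -> 64KB -> 512KB -> 4MB, a factor of 8 per level */

struct popmap_node {
    int somepop;                              /* children with at least one allocated page */
    int fullpop;                              /* children that are fully populated */
    struct popmap_node *child[POPMAP_RADIX];  /* next smaller page-size level; NULL = empty */
};
```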
Implementation notes
- Contiguity-aware page daemon
- Three lists of pages (active, inactive, and cache)
- Cached: Clean and unmapped
- Available for reservations
- Buddy allocator keeps them coalesced with the free pages, increasing the available contiguity of the system
- Inactive: mapped into the address space of some process, but not referenced for a long time
- Active: Accessed recently, may or may not have their reference bit set
- Under memory pressure
- Clean inactive pages ▶ Cache
- Dirty Inactive pages ▶ paged out
- Deactivates some unreferenced pages from the active list
- The page daemon is activated not only on memory pressure, but also when available contiguity falls low ▶Failure to allocate a contiguous region of the preferred size
- The page daemon traverses the inactive list and moves to the cache only those pages that contribute to restoring contiguity (sketched below)
- More inactive pages result in higher chances of restoring contiguity ▶Clean pages backed by a file are moved to the inactive list as soon as the file is closed by all processes
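A hedged sketch of that contiguity-restoration pass; the traversal helpers are hypothetical:

```c
/* Triggered when an allocation of the preferred size fails: walk the inactive
 * list and move to the cache only clean pages that help coalesce a free extent
 * of the needed size, stopping as soon as such an extent exists. */
struct page;

struct page *inactive_list_first(void);
struct page *inactive_list_next(struct page *p);
int  page_is_clean(const struct page *p);
int  contributes_to_extent(const struct page *p, int size_idx);
void move_to_cache(struct page *p);   /* cached pages coalesce with free memory */
int  buddy_has_extent(int size_idx);

void restore_contiguity(int needed_size_idx)
{
    for (struct page *p = inactive_list_first(); p != NULL; p = inactive_list_next(p)) {
        if (page_is_clean(p) && contributes_to_extent(p, needed_size_idx)) {
            move_to_cache(p);
            if (buddy_has_extent(needed_size_idx))
                return;               /* enough contiguity restored */
        }
    }
}
```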
- Wired page clustering
- Memory pages used by the FreeBSD kernel are wired (marked as non-pageable, so they cannot be evicted)
- These wired pages get scattered ▶Cluster them in pools of contiguous physical memory
- Multiple mappings
- Two processes can map a file into different virtual addresses
- If the addresses differ by, say, one base page, then it is impossible to build superpages for that file in the page tables of both processes ▶ ???? (isn't this just the processes mapping the file at addresses that are not mutually superpage-aligned?)
- Solution: applications do not specify an address when mapping a file ▶ the kernel assigns a virtual address for each process's mapping (choosing addresses that are compatible with superpage allocation; sketch below)
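A minimal sketch of such an address choice; the constant and helper name are illustrative assumptions:

```c
/* With no address hint from the application, pick a start address aligned to the
 * largest superpage size so every process mapping the file can build the same
 * superpages. */
#include <stdint.h>

#define LARGEST_SUPERPAGE (4UL * 1024 * 1024)   /* 4MB on the Alpha */

uintptr_t choose_mmap_address(uintptr_t first_fit_candidate)
{
    /* Round the candidate up to the next largest-superpage boundary. */
    return (first_fit_candidate + LARGEST_SUPERPAGE - 1) & ~(uintptr_t)(LARGEST_SUPERPAGE - 1);
}
```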