I/O Alignment Considerations
Traditional storage devices such as Hard Disk Drives, SSDs, NVMe, M.2, and SAN LUNs present storage as blocks. A block is an addressable unit of storage measured in bytes. The traditional block size used by hard disks is 512 bytes. Newer devices commonly use 4KiB or 8KiB physical block sizes, but may also present logical (emulated) 512-byte blocks.
Persistent Memory devices are accessible via the Virtual Memory System. Therefore, IO should be aligned to the system's page size(s). The Memory Management Unit (MMU) located on the CPU determines what page sizes are possible.
Linux supports several page sizes:
Default page size: commonly 4KiB on most architectures. Linux often refers to these as Page Table Entry (PTE) pages.
Huge pages: require kernel support, with CONFIG_HUGETLB_PAGE and CONFIG_HUGETLBFS configured. Often referred to as Page Middle Directory (PMD) pages, huge pages are commonly 2MiB in size, although modern kernels also support 1GiB page sizes.
More information can be found in Chapter 3 Page Table Management of Mel Gorman’s book Understanding the Linux Virtual Memory Manager.
The page size is a compromise between memory usage and speed.
A larger page size means more waste when a page is partially used.
A smaller page size with a large memory capacity means more kernel memory for the page tables since there’s a large number of page table entries.
A smaller page size could require more time spent in page table traversal, particularly if there’s a high Translation Lookaside Buffer (TLB) miss count.
The capacity difference between DDR and Persistent Memory Modules is considerable. Using smaller pages on a system with terabytes of memory could negatively impact performance for the reasons described above.
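To put rough numbers on the page table cost described above, the following sketch estimates the memory needed for leaf page table entries alone. It assumes 8 bytes per entry (as on x86_64), a single mapping of the whole capacity, and it ignores the upper levels of the page table. The function name page_table_bytes is made up for this illustration.

```shell
#!/bin/sh
# Rough estimate only: leaf page table entries for one full mapping.
# Assumption: 8 bytes per entry (x86_64); upper table levels are ignored.
page_table_bytes() {
    # $1 = memory capacity in bytes, $2 = page size in bytes
    echo $(( $1 / $2 * 8 ))
}

tib=$(( 1024 * 1024 * 1024 * 1024 ))   # 1 TiB
small=$(page_table_bytes "$tib" 4096)      # 4 KiB pages
huge=$(page_table_bytes "$tib" 2097152)    # 2 MiB pages

echo "4 KiB pages: $(( small / 1024 / 1024 )) MiB of entries per TiB"
echo "2 MiB pages: $(( huge / 1024 / 1024 )) MiB of entries per TiB"
```

Under these assumptions, mapping 1 TiB takes about 2048 MiB of entries with 4 KiB pages and about 4 MiB with 2 MiB pages, which is why huge pages matter at persistent memory capacities.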
Verifying Page Size Support
The system's default page size can be found by querying its configuration with the getconf command, using either the PAGE_SIZE or the PAGESIZE variable (for example, getconf PAGE_SIZE).
NOTE: getconf reports the page size in bytes. 4096 bytes == 4 KiB.
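As a minimal sketch of the query described above, the snippet below reads the page size with getconf and converts it to KiB. The variable name page_size is chosen for this illustration.

```shell
#!/bin/sh
# Query the default page size in bytes; PAGESIZE is an equivalent name.
page_size=$(getconf PAGE_SIZE)

# Convert bytes to KiB for readability.
echo "Default page size: $page_size bytes ($(( page_size / 1024 )) KiB)"
```

On a system with 4 KiB pages this reports 4096 bytes.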
To verify that the system currently has huge page support, run grep -i hugepage /proc/meminfo. On a kernel with huge page support, it returns fields such as HugePages_Total, HugePages_Free, and Hugepagesize.
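The check above can be scripted. The sketch below reads the Hugepagesize field, which /proc/meminfo reports in kB, and prints the default huge page size in MiB. It assumes the field is simply absent on kernels built without huge page support.

```shell
#!/bin/sh
# Extract the default huge page size (in kB) from /proc/meminfo.
hugepage_kib=$(awk '/^Hugepagesize:/ { print $2 }' /proc/meminfo)

if [ -n "$hugepage_kib" ]; then
    echo "Default huge page size: $(( hugepage_kib / 1024 )) MiB"
else
    # Assumption: a missing field means no huge page support.
    echo "No Hugepagesize entry found in /proc/meminfo"
fi
```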
Depending on the processor, there are at least two different huge page sizes on the x86_64 architecture: 2MiB and 1GiB. If the CPU supports 2MiB pages, it has the PSE cpuinfo flag (shown as pse); for 1GiB pages, it has the PDPE1GB flag (shown as pdpe1gb). The flags line in /proc/cpuinfo shows whether the two flags are set.
If grep -w pse /proc/cpuinfo returns a non-empty string, 2MiB pages are supported.
If grep -w pdpe1gb /proc/cpuinfo returns a non-empty string, 1GiB pages are supported.
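The flag checks above can be wrapped in a small helper. This is a sketch: the function name has_cpu_flag and its optional file argument are made up for this illustration. It matches whole words only, so the separate pse36 flag is not mistaken for pse.

```shell
#!/bin/sh
# Succeeds if the given CPU flag appears on the first "flags" line.
has_cpu_flag() {
    # $1 = flag name, $2 = cpuinfo-style file (defaults to /proc/cpuinfo)
    grep -m1 '^flags' "${2:-/proc/cpuinfo}" | grep -qw -- "$1"
}

if has_cpu_flag pse; then
    echo "2MiB pages supported (pse)"
else
    echo "pse flag not found"
fi

if has_cpu_flag pdpe1gb; then
    echo "1GiB pages supported (pdpe1gb)"
else
    echo "pdpe1gb flag not found"
fi
```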
Verifying IO Alignment
For a DAX filesystem to be able to use 2 MiB hugepages several things have to happen:
The mmap() mapping has to be at least 2 MiB in size.
The filesystem block allocation has to be at least 2 MiB in size.
The filesystem block allocation has to have the same alignment as our mmap().
The first requirement is trivial to control since the size of the mapping relates to the size of the persistent memory pool file(s). EXT4 and XFS both support requesting a specific filesystem block allocation alignment and size. This feature was introduced in support of RAID, but can be used equally well for DAX filesystems. Finally, controlling the starting alignment is achieved by ensuring the start of the filesystem is 2 MiB aligned.
The procedure to ensure DAX filesystems use PMDs is shown below as an example. It needs to be executed once the dm-linear or dm-stripe has been configured.
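1) Verify the namespace is in fsdax mode, for example by checking the mode field reported by ndctl list.
If the namespace is not in fsdax mode, reconfigure it with ndctl create-namespace, passing --mode=fsdax together with -f -e and the namespace name to reconfigure the existing namespace.
Note: This will destroy all data within the namespace, so back up any existing data before switching modes.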
2) Verify the persistent memory block device starts at a 2 MiB aligned physical address.
This is important because when we ask the filesystem for 2 MiB aligned and sized block allocations it will provide those block allocations relative to the beginning of its block device. If the filesystem is built on top of a namespace whose data starts at a 1 MiB aligned offset, for example, a block allocation that is 2 MiB aligned from the point of view of the filesystem will still be only 1 MiB aligned from DAX’s point of view. This will cause DAX to fall back to 4 KiB page faults.
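Use /proc/iomem to verify the starting address of each namespace. Each namespace appears there as an address range labelled with the namespace name.
For example, with namespace0.0 starting at 0x140000000 (5GiB) and namespace1.0 starting at 0x23fe00000 (just under 9GiB), both namespaces are 2MiB (0x200000) aligned.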
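The alignment check described above is a modulo operation on the start address. The sketch below applies it to every line of /proc/iomem that mentions a namespace. The function name check_namespace_alignment and its optional file argument are made up for this illustration, and the start-end : name line layout is assumed. Run it as root, since unprivileged reads of /proc/iomem show zeroed addresses.

```shell
#!/bin/sh
# Report whether each namespace range starts on a 2 MiB (0x200000) boundary.
check_namespace_alignment() {
    # $1 = iomem-style file (defaults to /proc/iomem)
    grep 'namespace' "${1:-/proc/iomem}" | while read -r range _ name; do
        start="0x${range%%-*}"          # text before the dash, as hex
        if [ $(( $start % 0x200000 )) -eq 0 ]; then
            echo "$name: $start is 2 MiB aligned"
        else
            echo "$name: $start is NOT 2 MiB aligned"
        fi
    done
}

check_namespace_alignment
```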
When creating filesystems using the namespaces, it’s important to maintain the 2MiB alignment (4096 sectors of 512 bytes). Depending upon the partition table type, fdisk aligns partitions at 1MiB (2048 sectors) by default. For a non-device mapped /dev/pmem0, a partition aligned at the 2MiB boundary can be created by starting the partition at sector 4096.
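The sector arithmetic above can be checked with a short helper. This is a sketch: the function name partition_is_2mib_aligned is made up for this illustration, and 512-byte sectors are assumed unless a second argument says otherwise.

```shell
#!/bin/sh
# Succeeds if a partition starting at the given sector is 2 MiB aligned.
partition_is_2mib_aligned() {
    # $1 = start sector, $2 = sector size in bytes (defaults to 512)
    offset_bytes=$(( $1 * ${2:-512} ))
    [ $(( offset_bytes % (2 * 1024 * 1024) )) -eq 0 ]
}

for sector in 2048 4096; do
    if partition_is_2mib_aligned "$sector"; then
        echo "start sector $sector: 2 MiB aligned"
    else
        echo "start sector $sector: not 2 MiB aligned"
    fi
done
```

For an existing partition, the start sector can be read from the start attribute in sysfs, which is expressed in 512-byte units (for example /sys/block/pmem0/pmem0p1/start, where pmem0p1 is a hypothetical partition name).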
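3) Create an XFS or EXT4 filesystem that allocates in 2 MiB aligned units. For XFS, this is expressed through the stripe unit (su) and stripe width (sw) data options of mkfs.xfs. For EXT4, it is expressed through the stride extended option of mkfs.ext4; with 4 KiB filesystem blocks, a stride of 512 blocks equals 2 MiB. See the mkfs.xfs and mkfs.ext4 man pages for more information.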
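As a sketch of how the filesystem options for this step might be put together, the snippet below derives the EXT4 stride from the block size and prints candidate mkfs command lines without running them, since mkfs destroys existing data. The exact invocations are an assumption and should be checked against the man pages. /dev/pmem0 is the device used as an example earlier in this section.

```shell
#!/bin/sh
# Print (do not run) mkfs commands requesting 2 MiB allocation alignment.
dev=/dev/pmem0          # example device from the text above
block_size=4096         # EXT4 filesystem block size in bytes

# stride is counted in filesystem blocks: 2 MiB / 4 KiB = 512
stride=$(( 2 * 1024 * 1024 / block_size ))

echo "mkfs.xfs -f -d su=2m,sw=1 $dev"
echo "mkfs.ext4 -F -b $block_size -E stride=$stride $dev"
```

Printing the commands first makes it possible to review the device name and options before running either one as root.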
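4) [Optional] Watch IO allocations. Without enabling filesystem debug options, it is possible to confirm the filesystem is allocating in 2MiB blocks using FTrace. Enable the dax_pmd_fault_done event by writing 1 to events/fs_dax/dax_pmd_fault_done/enable under /sys/kernel/debug/tracing.
Run a test that faults in filesystem DAX mappings, for example any program that mmap()s a file on the DAX filesystem and touches the mapping.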
Look for dax_pmd_fault_done events in /sys/kernel/debug/tracing/trace to see if the allocations were successful. An event that successfully faulted in a filesystem DAX PMD ends in NOPAGE, which means the fault succeeded and didn’t return a page cache page, as expected for DAX. A 2 MiB fault that failed and fell back to 4 KiB DAX faults instead ends in FALLBACK.
The FALLBACK return code at the end of the line shows that the fault fell back to 4 KiB faults. The rest of the data in the line can help you determine why the fallback happened. For example, a line reporting vm_start 0x10420000 and vm_end 0x10500000 describes an intentionally small mmap(): vm_end (0x10500000) - vm_start (0x10420000) == 0xE0000 (896 KiB), which is less than 2 MiB.
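To summarize a trace rather than read it line by line, the sketch below counts dax_pmd_fault_done events by the return code in the last field, then repeats the mapping-size arithmetic from the text. The function name count_pmd_faults and its optional file argument are made up for this illustration. Reading the trace file normally requires root.

```shell
#!/bin/sh
# Count dax_pmd_fault_done events by their final return code.
count_pmd_faults() {
    # $1 = trace file (defaults to the FTrace buffer)
    awk '/dax_pmd_fault_done/ {
             if ($NF == "NOPAGE") nopage++          # PMD fault succeeded
             else if ($NF == "FALLBACK") fallback++ # fell back to 4 KiB
         }
         END { printf "NOPAGE=%d FALLBACK=%d\n", nopage, fallback }' \
        "${1:-/sys/kernel/debug/tracing/trace}"
}

# Mapping size from the vm_start and vm_end values quoted above.
map_kib=$(( (0x10500000 - 0x10420000) / 1024 ))
echo "mapping size: $map_kib KiB (2 MiB is 2048 KiB)"
```

Calling count_pmd_faults with no argument reads the live trace buffer. A non-zero FALLBACK count suggests that at least one of the three requirements listed at the start of this section was not met.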
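To disable tracing, run echo 0 | sudo tee /sys/kernel/debug/tracing/events/fs_dax/dax_pmd_fault_done/enable. Piping into sudo tee is needed because with sudo echo 0 > file the redirection is performed by the unprivileged shell, not by sudo.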