
[File System] F2FS: A New File System for Flash Storage

telomere37 2022. 11. 11. 16:46

Notice

This post is a summary that closely follows, and quotes directly from, the paper "F2FS: A New File System for Flash Storage," presented at FAST '15.

https://www.usenix.org/conference/fast15/technical-sessions/presentation/lee

 



Paper Review

Abstract

   F2FS is a Linux file system designed to perform well on modern flash storage devices. The file system builds on append-only logging and its key design decisions were made with the characteristics of flash storage in mind.  


Introduction

   Flash storage has become a popular choice for primary storage. However, flash storage has limitations: 1) the erase-before-write requirement, 2) writes must go to erased blocks sequentially, and 3) a limited number of write cycles per block.

   To address these problems, instead of exposing raw NAND flash memory, electronic devices started using multiple flash chips connected through a dedicated controller. The firmware running on the controller (the FTL) hides the NAND flash memory's limitations and provides a generic block device abstraction. This approach resulted in flash storage devices such as eMMC, UFS, and SSDs.

   Perfect as it might seem, this approach breaks down under certain workloads. For example, under frequent random writes an SSD becomes severely fragmented, degrading its sustained performance. This is a serious problem, since further research notes that most SSD accesses are in fact random.

   The detrimental effects of random writes can be reduced by the log-structured file system (LFS) approach and the copy-on-write strategy. However, a conventional LFS does not consider the characteristics of flash storage devices and is inevitably suboptimal in terms of performance and device lifetime.

   The design considerations are as follows:

  • Flash-friendly on-disk layout: F2FS employs three configurable units: segments, sections, and zones. It allocates storage blocks in the unit of segments from a number of individual zones and performs cleaning in the unit of sections. These units are introduced to align with the underlying FTL's operational units and avoid unnecessary data copying.
  • Cost-effective index structure: An LFS writes data and index blocks to newly allocated free space. If a leaf data block is updated, its direct index block must be updated too; once the direct index block is written, its indirect index block must be updated in turn. Such recursive updates result in a chain of writes, creating the well-known "wandering tree" problem. To attack this problem, F2FS proposes a novel index table called the node address table (NAT); see the sketch after this list.
  • Multi-head logging: F2FS devises an effective hot/cold data separation scheme applied at logging time. It runs multiple active log segments concurrently and appends data and metadata to separate log segments based on their anticipated update frequency. Since flash storage devices exploit media parallelism, multiple active segments can run simultaneously without frequent management operations, making the performance degradation due to multiple logging insignificant.
  • Adaptive logging: F2FS builds basically on append-only logging to turn random writes into sequential ones. At high storage utilization, however, it changes the logging strategy to threaded logging to avoid long write latency. In essence, threaded logging writes new data into the free space of a dirty segment without cleaning it in the foreground. This strategy works well on modern flash devices but may not do so on HDDs.
  • fsync acceleration with roll-forward recovery: F2FS optimizes small synchronous writes to reduce the latency of fsync requests by minimizing the required metadata writes and recovering synchronized data with an efficient roll-forward mechanism.
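As a minimal illustration of why the NAT breaks the wandering-tree chain, here is a hypothetical sketch (plain C, not the actual f2fs kernel code): parent index blocks refer to a node by a node ID that never changes, so a copy-on-write update of a direct node rewrites only that node and patches one NAT entry.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy NAT: node ID -> physical block address. Parents reference nodes
 * by node ID, so relocating a node never forces a parent rewrite. */
#define NAT_SIZE 16

static uint32_t nat[NAT_SIZE];
static uint32_t next_free_blk = 100;    /* next free block in the log */

/* Copy-on-write update of a direct node: write it to a new block and
 * patch only its NAT entry; indirect nodes above it stay untouched. */
static void update_direct_node(uint32_t nid)
{
    uint32_t new_addr = next_free_blk++;
    printf("node %u moved: block %u -> %u (no parent update)\n",
           nid, nat[nid], new_addr);
    nat[nid] = new_addr;
}

int main(void)
{
    nat[3] = 42;            /* direct node 3 initially lives at block 42 */
    update_direct_node(3);  /* one node write + one NAT update, no chain */
    printf("NAT lookup(3) = block %u\n", nat[3]);
    return 0;
}
```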

Design and Implementation of F2FS

On-Disk Layout

  • F2FS divides the whole volume into fixed-size segments. The segment is the basic unit of management in F2FS.
  • A section is comprised of consecutive segments.
  • A zone consists of a series of sections.
  • F2FS splits the entire volume into six areas:
    • Superblock (SB) - basic partition information and default parameters of F2FS
    • CheckPoint (CP) - Keeps the file system status, bitmaps for valid NAT/SIT sets, orphan inode lists and summary entries of currently active segments.
    • Segment Information Table (SIT) - Contains per-segment information such as the number of valid blocks and the bitmap for the validity of all blocks in the "Main" area. The SIT information is retrieved to select victim segments and identify valid blocks in them during the cleaning process.
    • Node Address Table (NAT) - Block address table to locate all the "node blocks" stored in the Main area
    • Segment Summary Area (SSA) - Stores summary entries representing the owner information of all blocks in the main area, such as parent inode number and its node/data offsets. The SSA entries identify parent node blocks before migrating valid blocks during cleaning. 
    • Main Area - Filled with 4KB blocks. Each block is either a node or a data block. A node block contains an inode or indices of data blocks, while a data block contains either directory entries or user file data. Note that a section does not store node and data blocks simultaneously.
  • Look-up operation in F2FS (/dir/file) - sketched in code after this list
    1. Obtain the root inode by reading the block whose location is found in the NAT
    2. Search its data blocks for the directory entry named "dir" and obtain its inode number
    3. Translate the retrieved inode number to a physical location through the NAT
    4. Obtain the inode named "dir" by reading the corresponding block
    5. Identify the directory entry named "file"
    6. Obtain the file inode by repeating (3) and (4)
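The steps above can be condensed into a small hypothetical sketch. Every structure here is a simplified stand-in for the real on-disk format, with the NAT reduced to an in-memory array.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct dentry { char name[16]; uint32_t ino; };
struct inode  { uint32_t ino; struct dentry dents[4]; int ndents; };

#define ROOT_INO 1
static struct inode inodes[8];   /* stand-in for on-disk inode blocks */
static uint32_t nat[8];          /* inode number -> index into inodes */

static struct inode *read_inode(uint32_t ino)
{
    return &inodes[nat[ino]];    /* the NAT gives the block location */
}

static uint32_t dir_lookup(const struct inode *dir, const char *name)
{
    for (int i = 0; i < dir->ndents; i++)
        if (strcmp(dir->dents[i].name, name) == 0)
            return dir->dents[i].ino;
    return 0;
}

int main(void)
{
    /* build /dir/file on the toy "disk" */
    nat[1] = 0; nat[2] = 1; nat[3] = 2;
    inodes[0] = (struct inode){ .ino = 1, .dents = {{"dir", 2}},  .ndents = 1 };
    inodes[1] = (struct inode){ .ino = 2, .dents = {{"file", 3}}, .ndents = 1 };
    inodes[2] = (struct inode){ .ino = 3 };

    struct inode *root = read_inode(ROOT_INO);    /* step 1        */
    uint32_t dir_ino   = dir_lookup(root, "dir"); /* step 2        */
    struct inode *dir  = read_inode(dir_ino);     /* steps 3 and 4 */
    uint32_t file_ino  = dir_lookup(dir, "file"); /* steps 5 and 6 */
    printf("/dir/file -> inode %u\n", file_ino);
    return 0;
}
```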

File Structure

  • The original LFS uses an inode map to translate an inode number to an on-disk location.
  • F2FS utilizes the "node" structure that extends the inode map to locate more indexing blocks.
  • Each node block has a unique identification number (node ID)
  • Node blocks are one of three types: inode, direct, and indirect node.
  • Direct node - Contains block addresses of data
  • Indirect node - Contains node IDs locating other node blocks.
  • F2FS handles the wandering tree problem by using pointer-based file indexing
    • F2FS only updates one direct node block and its NAT entry.
  • F2FS supports inline data and inline extended attributes. By default, files smaller than 3,692 bytes use inline data, and 200 bytes are reserved for inline extended attributes.
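To make the node structure concrete, here is a hypothetical miniature of the inode's index. The array sizes are tiny stand-ins, not the real on-disk counts, but the 2/2/1 split of node-ID slots (two direct, two indirect, one double-indirect) follows the paper.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy sizes for illustration only; real blocks hold far more entries. */
#define ADDRS_PER_INODE 4   /* data addresses stored inside the inode */
#define ADDRS_PER_NODE  4   /* data addresses per direct node block   */
#define NIDS_PER_NODE   4   /* node IDs per indirect node block       */

struct inode_blk {
    uint32_t addr[ADDRS_PER_INODE]; /* direct pointers to data blocks */
    uint32_t nid[5];                /* 2 direct, 2 indirect, and 1    */
                                    /* double-indirect node ID        */
};

struct direct_node   { uint32_t addr[ADDRS_PER_NODE]; }; /* -> data blocks */
struct indirect_node { uint32_t nid[NIDS_PER_NODE]; };   /* -> node IDs    */

int main(void)
{
    /* Because parents store node IDs (translated through the NAT), a
     * data block update dirties only its direct node and one NAT entry. */
    struct inode_blk ino = { .nid = {7, 8, 9, 10, 11} };
    printf("first direct node of this inode: node ID %u\n", ino.nid[0]);
    return 0;
}
```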

Directory Structure

  • A 4KB directory entry (dentry) block is composed of a bitmap and two arrays of slots and names in pairs.
    • Bitmap - tells whether each slot is valid or not
    • Slot - carries a hash value, an inode number, the length of the file name, and the file type
  • A directory file constructs multi-level hash tables to manage a large number of dentries efficiently
  • Looking up a file name in a directory (see the sketch below):
    1. Calculate the hash value of the file name
    2. To find a dentry quickly, compare the bitmap, the hash value, and the file name, in that order
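A hypothetical miniature of that lookup order follows; the slot layout mirrors the bullets above, while the hash function and sizes are toy stand-ins (not F2FS's actual hash).

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy dentry block: a validity bitmap plus parallel slot/name arrays. */
#define NR_SLOTS 8

struct dentry_slot { uint32_t hash; uint32_t ino; uint8_t name_len; uint8_t type; };

struct dentry_block {
    uint8_t bitmap;                     /* bit i set -> slot i is valid */
    struct dentry_slot slots[NR_SLOTS];
    char names[NR_SLOTS][16];
};

static uint32_t name_hash(const char *s)    /* toy hash, not F2FS's */
{
    uint32_t h = 5381;
    while (*s) h = h * 33 + (uint8_t)*s++;
    return h;
}

/* Compare the bitmap, then the hash, then the name itself, in order. */
static int find_dentry(const struct dentry_block *blk, const char *name)
{
    uint32_t h = name_hash(name);
    for (int i = 0; i < NR_SLOTS; i++) {
        if (!(blk->bitmap & (1u << i)))       continue; /* invalid slot */
        if (blk->slots[i].hash != h)          continue; /* cheap filter */
        if (strcmp(blk->names[i], name) == 0) return i;
    }
    return -1;
}

int main(void)
{
    struct dentry_block blk = {0};
    blk.bitmap = 1;                         /* only slot 0 is valid */
    blk.slots[0].hash = name_hash("file");
    blk.slots[0].ino  = 7;
    strcpy(blk.names[0], "file");
    printf("slot for \"file\": %d\n", find_dentry(&blk, "file"));
    return 0;
}
```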

Multi-head logging

  • F2FS maintains six major log areas to maximize the effect of hot and cold data separation.
  • Statically defines three levels of temperature - hot, warm and cold (each for node and data blocks)
  • Direct node blocks are considered hotter than indirect node blocks, since indirect node blocks are modified only when a node is added or removed
  • Direct node blocks and data blocks for directories are considered hot
  • Cold data blocks
    • Data blocks moved by cleaning - remained valid for an extended period of time
    • data blocks labeled "cold" by the user
    • Multimedia file data - write once and read-only patterns
  • The number of logs may be configured at mount time (six by default; four or two may be selected instead)
  • F2FS introduces configurable zones to be compatible with an FTL
  • FTL algorithms are largely classified into three groups
    • Block-associative - a log flash block can be used exclusively for a single data flash block
    • Set-associative - for a set of contiguous data flash blocks
    • Fully-associative - for all data flash blocks 
  • Modern SSDs - Fully-associative or set-associative
  • Multi-head logging is a natural match with the recently proposed "multi-streaming" interface.
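The static temperature assignment above fits in a few lines. Here is a hypothetical sketch whose mapping mirrors the bullets (hot/warm/cold for both node and data blocks gives the six logs); the enum names are ours, not kernel identifiers.

```c
#include <stdio.h>

enum temp { HOT, WARM, COLD };

enum blk_kind {
    DIR_DENTRY, FILE_DATA, CLEANED_DATA,              /* data blocks */
    DIR_DIRECT_NODE, FILE_DIRECT_NODE, INDIRECT_NODE  /* node blocks */
};

/* Static classification: each (temperature, block type) pair maps to
 * one of the six active logs. */
static enum temp classify(enum blk_kind k)
{
    switch (k) {
    case DIR_DENTRY:       return HOT;   /* directory entry blocks    */
    case DIR_DIRECT_NODE:  return HOT;   /* direct nodes of dirs      */
    case FILE_DATA:        return WARM;  /* data written by users     */
    case FILE_DIRECT_NODE: return WARM;  /* direct nodes of files     */
    case CLEANED_DATA:     return COLD;  /* moved by cleaning, user-  */
                                         /* labeled, or multimedia    */
    case INDIRECT_NODE:    return COLD;  /* rarely modified           */
    }
    return WARM;
}

int main(void)
{
    printf("indirect node -> temp %d (COLD=2)\n", classify(INDIRECT_NODE));
    printf("dentry block  -> temp %d (HOT=0)\n",  classify(DIR_DENTRY));
    return 0;
}
```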

Cleaning

  • F2FS performs cleaning in two distinct manners: foreground and background.
  • Foreground - Triggered only when there are not enough free sections
  • Background - Triggered by a kernel thread, periodically.
  • Cleaning process takes three steps
    1. Victim selection - greedy or cost-benefit algorithm (see the sketch after this list)
      • Foreground (greedy) - selects the section with the smallest number of valid blocks
      • Background (cost-benefit) - selects a victim based on utilization + age (averaging the age of the segments in the section) -> separates hot and cold data
    2. Valid block identification and migration - validity bitmap per segment in the SIT
      • Retrieves the parent node blocks containing their indices from the SSA information
      • For background cleaning (lazy migration) - does not issue actual I/Os to migrate valid blocks; instead, it loads the blocks into the page cache and marks them as dirty, so that the kernel worker thread flushes them later
    3. Post-cleaning process - the victim section is registered as a candidate to become a new free section (a pre-free section). After a checkpoint is made, the section finally becomes a free section to be reallocated.
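A hypothetical sketch of the two victim-selection policies follows. Greedy picks the section with the fewest valid blocks; for cost-benefit, the classic LFS form (1 - u) * age / (1 + u) is used here purely for illustration, since the paper only says the victim is chosen from utilization plus age (the kernel's exact weighting may differ).

```c
#include <stdio.h>

#define NR_SECTIONS 4

struct section { int valid_blocks; double u; double age; };

/* Foreground: smallest number of valid blocks wins. */
static int pick_greedy(const struct section *s, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (s[i].valid_blocks < s[best].valid_blocks) best = i;
    return best;
}

/* Background: weigh free space against age (illustrative formula). */
static int pick_cost_benefit(const struct section *s, int n)
{
    int best = 0;
    double best_score = -1.0;
    for (int i = 0; i < n; i++) {
        double score = (1.0 - s[i].u) * s[i].age / (1.0 + s[i].u);
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}

int main(void)
{
    struct section secs[NR_SECTIONS] = {
        {100, 0.20, 10}, {300, 0.60, 90}, {50, 0.10, 5}, {400, 0.80, 200},
    };
    printf("foreground (greedy) victim: %d\n", pick_greedy(secs, NR_SECTIONS));
    printf("background (cost-benefit) victim: %d\n",
           pick_cost_benefit(secs, NR_SECTIONS));
    return 0;
}
```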

Adaptive Logging

  • Normal Logging - blocks are written to clean segments yielding strictly sequential writes.
  • Threaded Logging - writes blocks to holes in existing dirty segments. 
  • More than k clean sections (k = 5% of the total number of sections) -> normal logging; otherwise -> threaded logging (see the sketch below)
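The switch itself is a one-line policy; here is a minimal sketch assuming k is computed as 5% of all sections (the function name is ours).

```c
#include <stdbool.h>
#include <stdio.h>

/* With more than k clean sections, keep appending to clean segments;
 * below that, thread writes into holes of dirty segments so no
 * cleaning is needed on the write path. */
static bool use_normal_logging(int clean_sections, int total_sections)
{
    int k = total_sections * 5 / 100;
    return clean_sections > k;
}

int main(void)
{
    printf("%s\n", use_normal_logging(80, 1000) ? "normal (append)"
                                                : "threaded (fill holes)");
    printf("%s\n", use_normal_logging(30, 1000) ? "normal (append)"
                                                : "threaded (fill holes)");
    return 0;
}
```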

Checkpointing and Recovery

  • Checkpointing - provides a consistent recovery point against sudden power failure (performed on sync, umount, etc.)
    1. All dirty node and dentry blocks in the page cache are flushed
    2. Suspends ordinary writing activities including system calls such as create, unlink and mkdir
    3. FS metadata, NAT, SIT and SSA are written to their dedicated areas on the disk
    4. F2FS writes a checkpoint pack consisting of the following information to the CP area
      • Header and footer - Contains a version number which is incremented for every CP
      • NAT and SIT bitmaps - indicate the set of NAT and SIT blocks comprising the current pack
      • NAT and SIT journals - comprise recently modified entries of NAT and SIT to avoid frequent updates
      • Summary blocks of active segments - in-memory SSA blocks that will be flushed to the SSA area in the future
      • Orphan blocks - "Orphan inode" information (If an inode is deleted before it is closed)
  • F2FS manages two sets of NAT and SIT blocks (distinguished by NAT and SIT bitmaps)
  • If a small number of NAT or SIT entries are updated frequently, F2FS would write many 4KB-sized NAT or SIT blocks. To mitigate this overhead, F2FS implements a NAT and SIT journal within the checkpoint pack.
  • Roll Back Recovery
    1. Search valid CP packs (pick latest one if both are valid)
    2. Check whether orphan inode blocks exist
    3. If they exist, truncate all of their data blocks and free the orphan inodes
    4. Start service with the consistent set of NAT and SIT blocks referenced by their bitmaps, after roll-forward recovery.
  • Roll Forward Recovery
    1. To enhance fsync performance, F2FS writes the data blocks and their direct node blocks only, marking those direct node blocks with a special flag
    2. F2FS collects the direct node blocks carrying the special flag, located at N+n (i.e., written after the last checkpoint N)
    3. Using the collected node information, it loads the most recently checkpointed node blocks (N-n) into the page cache
    4. It compares the data indices between N-n and N+n
    5. If they differ, it refreshes the cached node blocks with the new indices stored in N+n and finally marks them as dirty
    6. After roll-forward recovery, F2FS performs checkpointing to store all the in-memory changes (see the sketch below)
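A minimal sketch of the index comparison in steps 4-5, under the simplification that a direct node is just an array of data block addresses (the flag handling and node lists from steps 1-3 are omitted):

```c
#include <stdint.h>
#include <stdio.h>

#define NR_IDX 4

struct direct_node { uint32_t addr[NR_IDX]; int dirty; };

/* For each flagged direct node written after checkpoint N, compare its
 * data indices against the checkpointed copy and refresh any that
 * changed, marking the cached node dirty for the next checkpoint. */
static void roll_forward(struct direct_node *cached,      /* copy at N-n */
                         const struct direct_node *newer) /* copy at N+n */
{
    for (int i = 0; i < NR_IDX; i++) {
        if (cached->addr[i] != newer->addr[i]) {
            cached->addr[i] = newer->addr[i]; /* take the fsync'd index  */
            cached->dirty = 1;                /* flush at next checkpoint */
        }
    }
}

int main(void)
{
    struct direct_node at_ckpt = {{10, 11, 12, 13}, 0};
    struct direct_node fsynced = {{10, 11, 99, 13}, 0};
    roll_forward(&at_ckpt, &fsynced);
    printf("index 2 now %u, dirty=%d\n", at_ckpt.addr[2], at_ckpt.dirty);
    return 0;
}
```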

Evaluation

 


Related Work


Personal Questions

  • What exactly does a zone represent, and why was the volume divided this way?
  • I don't fully understand the meaning of the NAT and SIT journals inside the CP.
  • I'm curious how the NAT, SIT, and SSA entries are actually implemented.
  • I don't fully understand exactly when a CP is written.

 


Personal Notes

  • Why is each part of the on-disk layout needed?
    • CheckPoint (CP) -> data used during recovery to keep the file system consistent at mount time
      • Header and footer - to check whether a CP is valid and which one is the most recent
      • NAT and SIT bitmaps - to identify which NAT and SIT blocks the current CP pack designates
      • NAT and SIT journals
      • Summary blocks of active segments - in-memory SSA blocks that will be flushed to the SSA area in the future
      • Orphan blocks - "orphan inode" information (if an inode is deleted before it is closed)
    • Segment Information Table (SIT)
      • Number of valid blocks per segment - used when selecting a victim
      • Bitmap for the validity of all blocks in the Main area - used to check block validity during cleaning
    • Node Address Table (NAT)
      • Used to translate a node ID into a physical address
    • Segment Summary Area (SSA)
      • Summary entries (owner information of blocks in the Main area) - used to identify valid blocks during GC
  • How are data blocks and node blocks classified as hot, warm, or cold?
    • Node
      • Hot - Direct node blocks for directories
      • Warm - Direct node blocks for regular files
      • Cold - Indirect node blocks
    • Data
      • Hot - Directory entry blocks
      • Warm - Data blocks made by users
      • Cold - Data blocks moved by cleaning, specified by users, multimedia