아무것도 몰라요

Paper Review

[Storage] ZNS: Avoiding the Block Interface Tax for Flash-based SSDs

telomere37 2022. 11. 13. 23:55

Notice

This post quotes and summarizes most of the paper "ZNS: Avoiding the Block Interface Tax for Flash-based SSDs".

https://www.usenix.org/conference/atc21/presentation/bjorling

 



Paper Review

Abstract

   The Zoned Namespace (ZNS) interface represents a new division of functionality between host software and flash-based SSDs. Current flash-based SSDs maintain the decades-old block interface, which comes at substantial expense in terms of capacity over-provisioning, DRAM for page mapping tables, garbage collection overheads, and host software complexity attempting to mitigate the GC. ZNS offers shelter from this ever-rising block interface tax.

   By exposing flash erase block boundaries and write-ordering rules, the ZNS interface requires the host software to address these issues while continuing to manage media reliability within the SSD.


Introduction

  • Block interface: presents storage devices as one-dimensional arrays of fixed-size logical data blocks that may be read, written, and overwritten in any order ▶ introduced to hide hard-drive media characteristics and to simplify host software
  • However, as the industry shifted from HDDs to flash-based SSDs, the performance and operational costs of supporting the block interface keep growing.
  • The GC in the FTL layer results in: Throughput limitations, write-amplification, performance unpredictability, and high tail latency
  • The ZNS groups logical blocks into zones
    • Zones can be read randomly but should be written sequentially
    • Zones should be erased between rewrites
  • The ZNS SSD aligns zones with physical media boundaries ▶ shifts the responsibility of data management to the host ▶ obviates the need for in-device GC + OP
  • Key contributions
    • Present the first evaluation of a production ZNS SSD
    • Review the emerging ZNS standard and its relation to prior SSD interfaces
    • Describe the lessons learned adapting host software layers to utilize ZNS SSDs
    • Describe a set of changes spanning the whole storage stack to enable ZNS support
    • Introduce ZenFS, a storage backend for RocksDB, to showcase the full performance of ZNS devices.

The Zoned Storage Model

  • Originally introduced for Shingled Magnetic Recording(SMR) HDDs.
  • Using SSDs through the conventional block interface incurs the following taxes
    • Due to GC, performance is unpredictable
    • Since it needs OP area and DRAM for mapping information, it has a higher cost per GB of capacity
  • Existing methods to reduce these taxes
    • SSDs with Stream support ▶ Host explicitly passes a stream ID to the SSD with its writes
      • However, does not shed the costs of OP and DRAM (Still needs GC + mapping)
    • Open Channel SSDs ▶ Align contiguous LBA chunks to physical erase block boundaries
      • Eliminates in-device GC + reduces the cost of OP and DRAM
      • Host is responsible for data placement (+ Underlying media reliability such as wear-leveling)
      • However, the host must manage differences across SSD implementations to guarantee durability
  • Characteristics of `zoned storage`
    • Per-zone state machine ▶ Determines whether a given zone is writeable (a minimal sketch of these rules follows this list)
      • EMPTY ▶ Initial state; also entered after a `reset zone` command
      • OPEN ▶ When write occurs, device may impose an open zone limit
      • CLOSED ▶ Entered when the open zone limit is reached and the host opens a new zone; frees on-device resources (write buffers); must be reopened to resume writing
      • FULL ▶ Fully written
    • Write Pointer ▶ Designates the next writeable LBA within a writeable zone, only valid for OPEN and EMPTY zones
      • Updated after successful writes
      • Any write from the host that does not begin at the write pointer, or that targets a FULL zone, will fail
  • Two additional Concepts for `zoned storage` to cope with the flash-based SSDs
    • Writeable Zone capacity
      • Allows a zone to divide its LBAs into writeable and non-writeable
      • Allows a zone to have a writeable capacity smaller than the zone size
      • This constraint allows the zone size to stay aligned with the industry norm of power-of-two zone sizes
    • Active Zone limit
      • Hard limit on zones that can be either OPEN or CLOSED
      • Whereas SMR HDDs allow all zones to be CLOSED, the characteristics of flash-based media require this quantity to be bounded for ZNS SSDs.
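
Below is a minimal C++ sketch of the per-zone rules described above (state machine, write pointer, writeable capacity). The type and field names and the concrete sizes are illustrative assumptions, not taken from the ZNS specification, and CLOSED-state/open-zone-limit bookkeeping is omitted for brevity.

```cpp
#include <cstdint>
#include <cstdio>

// Minimal model of the per-zone rules above. State names follow the zoned
// storage model; the sizes used below are illustrative, not from the spec.
enum class ZoneState { Empty, Open, Closed, Full };

struct Zone {
    uint64_t start_lba;   // first LBA of the zone
    uint64_t size;        // zone size (power-of-two number of LBAs)
    uint64_t capacity;    // writeable capacity, may be smaller than size
    uint64_t write_ptr;   // next writeable LBA
    ZoneState state;

    // A write succeeds only if it starts exactly at the write pointer and
    // stays within the writeable capacity; out-of-order writes are rejected.
    bool write(uint64_t lba, uint64_t nblocks) {
        if (state == ZoneState::Full) return false;
        if (lba != write_ptr) return false;
        if (lba + nblocks > start_lba + capacity) return false;
        state = ZoneState::Open;
        write_ptr += nblocks;
        if (write_ptr == start_lba + capacity) state = ZoneState::Full;
        return true;
    }

    // Reset rewinds the write pointer and returns the zone to EMPTY,
    // corresponding to erasing the backing erase blocks.
    void reset() { state = ZoneState::Empty; write_ptr = start_lba; }
};

int main() {
    Zone z{0, 4096, 3584, 0, ZoneState::Empty};  // capacity < size, as ZNS allows
    std::printf("sequential write ok: %d\n", z.write(0, 8));    // starts at write pointer
    std::printf("random write ok:     %d\n", z.write(100, 8));  // rejected: not at write pointer
    z.reset();                                                  // zone is EMPTY again
    return 0;
}
```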

 

Evolving towards ZNS

  • The ZNS interface increases the responsibilities of host software; however, several techniques can ease this burden
  • ZNS interface enables the SSD to translate sequential zone writes into distinct erase blocks
  • Tradeoffs in designing the SSD's hardware FTL
    • Zone sizing
      • The erase block size and the zone's writeable capacity are directly correlated
      • The erase block (stripe) size is chosen to protect against die-level and other media failures through per-stripe parity
      • Large zones reduce the host's degrees of freedom for data placement
      • Small zones are possible at the cost of host complexity (host-side parity computation + deeper I/O queues)
    • Mapping Table
      • ZNS SSD zone writes are required to be sequential ▶ allows coarse-grained mappings at the erase-block level or in a hybrid fashion (a back-of-the-envelope size comparison follows this list)
    • Device Resources
      • Active zones inside the ZNS SSD require hardware resources (XOR engines, memory, power capacitors) ▶ limits the number of active zones (8 ~ 32)
  • Host side software modification required to support the ZNS interface.
    • Host-side FTL (HFTL)
      • Mediator between ZNS SSD's write semantics and applications performing random write and in-place updates
      • HFTL manages only the translation mapping and associated GC
      • Should also consider the utilization of CPU and DRAM resources
      • Easier to integrate host-side information
    • File Systems
      • If the FS is aware of the underlying zone structure, it can eliminate the overhead that is otherwise associated with both HFTL and FTL data placement. 
    • End-to-End Data Placement
      • The application's data structure could be aligned to the zone-write semantics. 
      • This eliminates all overheads from FS or translation layers ▶ Best write amplification, throughput, latency
      • However, this is daunting, since the application effectively interacts with a raw block device
      • The application should provide tools for users to perform inspection, error checking, and backup/restore operations
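
To make the mapping-table tradeoff above concrete, here is a rough back-of-the-envelope comparison, assuming a 4-byte mapping entry, 4 KiB page-level granularity for a conventional FTL, and a 1 GiB zone for coarse zone-level mapping; all three numbers are illustrative assumptions, not figures from the paper.

```cpp
#include <cstdint>
#include <cstdio>

// Back-of-the-envelope comparison of FTL mapping-table sizes. The 4-byte
// entry, 4 KiB page, and 1 GiB zone are assumptions for illustration only.
int main() {
    const uint64_t capacity    = 1ULL << 40;  // 1 TiB of logical capacity
    const uint64_t page_size   = 4096;        // page-level mapping granularity
    const uint64_t zone_size   = 1ULL << 30;  // coarse, zone-level granularity
    const uint64_t entry_bytes = 4;           // bytes per mapping entry

    const uint64_t page_map = capacity / page_size * entry_bytes;  // per-page FTL map
    const uint64_t zone_map = capacity / zone_size * entry_bytes;  // per-zone map

    std::printf("page-level map: %llu MiB per TiB\n",
                (unsigned long long)(page_map >> 20));   // ~1024 MiB of DRAM
    std::printf("zone-level map: %llu KiB per TiB\n",
                (unsigned long long)(zone_map >> 10));   // ~4 KiB
    return 0;
}
```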

Implementation

  • Modification
    • Linux Kernel to support ZNS SSDs
    • F2FS File system to evaluate the benefits on "higher-level storage stack layer"
    • FIO benchmark to support the newly added ZNS-specific attributes
    • ZenFS to evaluate the benefits on "end-to-end integration for zoned storage"
  • Linux Support
    • Linux kernel's Zoned Block Device (ZBD) ▶ abstraction layer providing a single unified zoned storage API on top of various zoned storage device types
    • Provides an in-kernel API and an ioctl-based user-space API (see the ioctl sketch at the end of this section)
    • Modified NVMe device driver to enumerate and register ZNS SSDs with the ZBD subsystem
    • Modified ZBD subsystem API to expose the per zone capacity attribute and active zone limit
    • Zone Capacity 
      • Kernel maintains an in-memory representation of zones (Zone descriptor data structure)
      • Added a zone capacity attribute and versioning information so that host applications can use them
      • FIO ▶ modified so that writes do not exceed the zone capacity
      • F2FS ▶ Two extra segment types
        • Unusable ▶ the unwriteable part of a zone
        • Partial ▶ a segment whose LBAs span both the writeable and unwriteable LBAs of a zone
    • Limiting Active Zones
      • Due to the nature of flash-based SSDs, there is a strict limit on the number of active zones (OPEN or CLOSED)
      • The limit is detected upon zoned block device enumeration and exposed to the kernel and user-space APIs
      • FIO ▶ no modification (User should respect the active-limit constraint)
      • F2FS ▶ Linked to the number of segments that can be open simultaneously (Max 6)
  • RocksDB Zone Support
    • What part should be modified in the RocksDB to support ZNS?
      • RocksDB supports separate storage backends through its file system wrapper API
      • The wrapper API identifies data units (SST files, WAL) through unique identifiers
      • Such semantics map closely to file system semantics, which is why RocksDB's main storage backend is a file system
      • By relying on a file system, RocksDB avoids managing file extents, buffering, and free space, but it also loses the ability to place data directly into zones
    • ZenFS Architecture
      • The ZenFS storage backend implements a minimal on-disk file system and integrates it using RocksDB's file-wrapper API
      • Journaling and Data
        • ZenFS divides zones into journal zones and data zones
        • Journal zones are used to recover the state of FS (superblock data structure, mapping of WAL, mapping of data files to zones)
        • Data zones store file contents
      • Extents
        • Data files are mapped and written to extents
        • Extent: a variable-sized, block-aligned, contiguous region that is written sequentially to a data zone
        • Each zone can store multiple extents, but extents do not span zones
        • Extent allocation and deallocation events are recorded in an in-memory data structure ▶ written to the journal when a file is closed or fsync is called
        • If all files with extents allocated in a zone have been deleted, the zone can be reset and reused
      • Superblock
        • Initial entry point when initializing and recovering ZenFS
      • Journal 
        • Maintains the superblock and the mapping of the WAL and data files to zones (through extents)
        • Journal state is stored on dedicated journal zones ▶ first two non-offline zones of a device
        • Journal header ▶ sequence number, superblock data structure, snapshot of the current journal state
        • The rest of the zone is used for recording journal updates
        • Recovery
          • Read sequence number (Journal Header) ▶ determine the active zone
          • Read initial superblock and journal state
          • Journal updates are applied to the header's journal snapshot
      • Writeable Capacity in Data Zones
        • Ideal allocation can be done if the file sizes are a multiple of the writeable capacity of a zone 
        • However, file sizes vary depending on the outcome of the compression and compaction processes
        • ZenFS addresses this by allowing a user-configurable limit for finishing data zones ▶ allows the file size to vary within that limit while still achieving file separation by zone
        • Exception: if the file size variation falls outside the specified limit, ZenFS ensures that all available capacity is still utilized through its zone allocation algorithm
      • Data Zone Selection
        • Uses a best-effort allocation algorithm
        • Separates the WAL and SST levels by setting write-lifetime hints
        • When a file is first written, ZenFS compares the file's lifetime hint with the maximum lifetime of the data already stored in each zone ▶ a match is made if the file's lifetime is less; if several zones match, the closest match is chosen; if no zone matches, a new zone is allocated (sketched at the end of this section)
      • Active Zone Limits
        • ZenFS requires a minimum of three active zones (Journal, WAL, Compaction Process)
      • Direct I/O and Buffered Writes
        • SST files: written with direct I/O, bypassing the kernel page cache ▶ they are immutable and written sequentially
        • WAL: ZenFS buffers writes in memory and flushes them to the device
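
As a concrete illustration of the kernel's ioctl-based ZBD user-space API mentioned in the Linux Support notes, the sketch below reports the first few zones of a zoned block device, including the per-zone capacity attribute. It assumes a Linux 5.9+ kernel, where the capacity field and the BLK_ZONE_REP_CAPACITY flag are available; the device path is just an example.

```cpp
#include <linux/blkzoned.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

// Report the first few zones of a zoned block device through the kernel's
// ioctl-based ZBD API. All values are in 512-byte sectors.
int main(int argc, char **argv) {
    const char *dev = (argc > 1) ? argv[1] : "/dev/nvme0n2";  // example path
    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    const unsigned int nr_zones = 8;  // report only the first 8 zones
    size_t len = sizeof(struct blk_zone_report) + nr_zones * sizeof(struct blk_zone);
    struct blk_zone_report *rep = (struct blk_zone_report *)calloc(1, len);
    rep->sector = 0;          // start reporting from the first zone
    rep->nr_zones = nr_zones;

    if (ioctl(fd, BLKREPORTZONE, rep) < 0) { perror("BLKREPORTZONE"); return 1; }

    bool has_capacity = rep->flags & BLK_ZONE_REP_CAPACITY;  // capacity field valid?
    for (unsigned int i = 0; i < rep->nr_zones; i++) {
        const struct blk_zone *z = &rep->zones[i];
        std::printf("zone %2u: start=%llu len=%llu wp=%llu cond=%u capacity=%llu\n",
                    i,
                    (unsigned long long)z->start,
                    (unsigned long long)z->len,
                    (unsigned long long)z->wp,
                    (unsigned)z->cond,
                    has_capacity ? (unsigned long long)z->capacity : 0ULL);
    }
    free(rep);
    close(fd);
    return 0;
}
```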
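
The lifetime-based data zone selection described under ZenFS can be sketched roughly as follows. This is a simplified illustration of the matching rule only; the types, names, and the remaining-capacity check are my assumptions, not ZenFS's actual code.

```cpp
#include <climits>
#include <cstdint>
#include <vector>

// Simplified sketch of lifetime-based data zone selection (not ZenFS code).
// Lifetime hints are ordered integers: smaller means shorter expected lifetime.
struct DataZone {
    int max_lifetime;        // largest lifetime hint of data already in the zone
    uint64_t remaining;      // remaining writeable capacity in bytes
    bool active;             // counts against the device's active zone limit
};

// Pick the zone whose maximum lifetime is the closest value still greater than
// the file's lifetime hint; return nullptr so the caller allocates a new zone
// (within the active zone limit) when no zone matches.
DataZone *pick_zone(std::vector<DataZone> &zones, int file_lifetime, uint64_t file_size) {
    DataZone *best = nullptr;
    int best_gap = INT_MAX;
    for (DataZone &z : zones) {
        if (!z.active || z.remaining < file_size) continue;  // capacity check: my simplification
        if (file_lifetime < z.max_lifetime) {                // match: file dies no later than zone
            int gap = z.max_lifetime - file_lifetime;
            if (gap < best_gap) { best_gap = gap; best = &z; }
        }
    }
    return best;
}

int main() {
    std::vector<DataZone> zones = {{5, 1 << 20, true}, {2, 1 << 20, true}};
    DataZone *z = pick_zone(zones, 1, 4096);  // picks the zone with max_lifetime == 2
    return z == nullptr;
}
```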

Evaluation