Diffstat (limited to 'static/freebsd/man7/zpoolconcepts.7')
| -rw-r--r-- | static/freebsd/man7/zpoolconcepts.7 | 556 |
1 files changed, 556 insertions, 0 deletions
diff --git a/static/freebsd/man7/zpoolconcepts.7 b/static/freebsd/man7/zpoolconcepts.7 new file mode 100644 index 00000000..ebd0b346 --- /dev/null +++ b/static/freebsd/man7/zpoolconcepts.7 @@ -0,0 +1,556 @@ +.\" SPDX-License-Identifier: CDDL-1.0 +.\" +.\" CDDL HEADER START +.\" +.\" The contents of this file are subject to the terms of the +.\" Common Development and Distribution License (the "License"). +.\" You may not use this file except in compliance with the License. +.\" +.\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE +.\" or https://opensource.org/licenses/CDDL-1.0. +.\" See the License for the specific language governing permissions +.\" and limitations under the License. +.\" +.\" When distributing Covered Code, include this CDDL HEADER in each +.\" file and include the License file at usr/src/OPENSOLARIS.LICENSE. +.\" If applicable, add the following below this CDDL HEADER, with the +.\" fields enclosed by brackets "[]" replaced with your own identifying +.\" information: Portions Copyright [yyyy] [name of copyright owner] +.\" +.\" CDDL HEADER END +.\" +.\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved. +.\" Copyright (c) 2012, 2018 by Delphix. All rights reserved. +.\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved. +.\" Copyright (c) 2017 Datto Inc. +.\" Copyright (c) 2018 George Melikov. All Rights Reserved. +.\" Copyright 2017 Nexenta Systems, Inc. +.\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved. +.\" Copyright (c) 2026 Seagate Technology, LLC. +.\" +.Dd August 6, 2025 +.Dt ZPOOLCONCEPTS 7 +.Os +. +.Sh NAME +.Nm zpoolconcepts +.Nd overview of ZFS storage pools +. +.Sh DESCRIPTION +.Ss Virtual Devices (vdevs) +A "virtual device" describes a single device or a collection of devices, +organized according to certain performance and fault characteristics. +The following virtual devices are supported: +.Bl -tag -width "special" +.It Sy disk +A block device, typically located under +.Pa /dev . 
+ZFS can use individual slices or partitions, though the recommended mode of +operation is to use whole disks. +A disk can be specified by a full path, or it can be a shorthand name +.Po the relative portion of the path under +.Pa /dev +.Pc . +A whole disk can be specified by omitting the slice or partition designation. +For example, +.Pa sda +is equivalent to +.Pa /dev/sda . +When given a whole disk, ZFS automatically labels the disk, if necessary. +.It Sy file +A regular file. +The use of files as a backing store is strongly discouraged. +It is designed primarily for experimental purposes, as the fault tolerance of a +file is only as good as the file system on which it resides. +A file must be specified by a full path. +.It Sy mirror +A mirror of two or more devices. +Data is replicated in an identical fashion across all components of a mirror. +A mirror with +.Em N No disks of size Em X No can hold Em X No bytes and can withstand Em N-1 +devices failing, without losing data. +.It Sy raidz , raidz1 , raidz2 , raidz3 +A distributed-parity layout, similar to RAID-5/6, with improved distribution of +parity, and which does not suffer from the RAID-5/6 +.Qq write hole , +.Pq in which data and parity become inconsistent after a power loss . +Data and parity is striped across all disks within a raidz group, though not +necessarily in a consistent stripe width. +.Pp +A raidz group can have single, double, or triple parity, meaning that the +raidz group can sustain one, two, or three failures, respectively, without +losing any data. +The +.Sy raidz1 +vdev type specifies a single-parity raidz group; the +.Sy raidz2 +vdev type specifies a double-parity raidz group; and the +.Sy raidz3 +vdev type specifies a triple-parity raidz group. +The +.Sy raidz +vdev type is an alias for +.Sy raidz1 . 
+.Pp +A raidz group with +.Em N No disks of size Em X No with Em P No parity disks can hold approximately +.Em (N-P)*X No bytes and can withstand Em P No devices failing without losing data . +The minimum number of devices in a raidz group is one more than the number of +parity disks. +The recommended number is between 3 and 9 to help increase performance. +.It Sy draid , draid1 , draid2 , draid3 +A variant of raidz that provides integrated distributed hot spares, allowing +for faster resilvering, while retaining the benefits of raidz. +A dRAID vdev is constructed from multiple internal raidz groups, each with +.Em D No data devices and Em P No parity devices . +These groups are distributed over all of the children in order to fully +utilize the available disk performance. +.Pp +Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with +zeros) to allow fully sequential resilvering. +This fixed stripe width significantly affects both usable capacity and IOPS. +For example, with the default +.Em D=8 No and Em 4 KiB No disk sectors the minimum allocation size is Em 32 KiB . +If using compression, this relatively large allocation size can reduce the +effective compression ratio. +When using ZFS volumes (zvols) and dRAID, the default of the +.Sy volblocksize +property is increased to account for the allocation size. +If a dRAID pool will hold a significant amount of small blocks, it is +recommended to also add a mirrored +.Sy special +vdev to store those blocks. +.Pp +In regards to I/O, performance is similar to raidz since, for any read, all +.Em D No data disks must be accessed . +Delivered random IOPS can be reasonably approximated as +.Sy floor((N-S)/(D+P))*single_drive_IOPS . +.Pp +Like raidz, a dRAID can have single-, double-, or triple-parity. +The +.Sy draid1 , +.Sy draid2 , +and +.Sy draid3 +types can be used to specify the parity level. +The +.Sy draid +vdev type is an alias for +.Sy draid1 . 
+.Pp +A dRAID with +.Em N No disks of size Em X , D No data disks per redundancy group , Em P +.No parity level, and Em S No distributed hot spares can hold approximately +.Em (N-S)*(D/(D+P))*X No bytes and can withstand Em P +devices failing without losing data. +.It Sy draid Ns Oo Ar parity Oc Ns Oo Sy \&: Ns Ar data Ns Sy d Oc Ns Oo Sy \&: Ns Ar children Ns Sy c Oc Ns Oo Sy \&: Ns Ar width Ns Sy w Oc Ns Oo Sy \&: Ns Ar spares Ns Sy s Oc +A non-default dRAID configuration can be specified by appending one or more +of the following optional arguments to the +.Sy draid +keyword: +.Bl -tag -compact -width "children" +.It Ar parity +The parity level (1-3). +.It Ar data +The number of data devices per redundancy group. +In general, a smaller value of +.Em D No will increase IOPS, improve the compression ratio , +and speed up resilvering at the expense of total usable capacity. +Defaults to +.Em 8 , No unless Em N-P-S No is less than Em 8 . +.It Ar children +The expected number of children. +Useful as a cross-check when listing a large number of devices. +An error is returned when the provided number of children differs. +.It Ar width +You can configure several groups of children in the same row, in which case +.Em width No would be a multiple of Em children . +Such configurations allow the creation of failure groups with every i-th device +in each group being from different failure domain (for example an enclosure) +so that if all devices in one domain fail, the +.Em draid No vdev still will be operational with enough redundancy to +rebuild the data. +In case of +.Em draid2 , No two domains can fail at a time, in case of +.Em draid3 No \(em three domains (provided there are no other failures +in any failure group). +For each group, it will be only one, two or three failures. +.It Ar spares +The number of distributed hot spares. +All spares are shared between failure groups. +Defaults to zero. 
+.Pp +Note: to support domain failure, we cannot have more than +.Em parity-1 No failures in any failure group, no matter if the failed +devices are rebuilt to draid hot spares or not \(em the blocks of those +spares can be mapped to the devices from the failed domain, and we cannot +tolerate more than +.Em parity No failures in any failure group . +.El +.It Sy spare +A pseudo-vdev which keeps track of available hot spares for a pool. +For more information, see the +.Sx Hot Spares +section. +.It Sy log +A separate intent log device. +If more than one log device is specified, then writes are load-balanced between +devices. +Log devices can be mirrored. +However, raidz vdev types are not supported for the intent log. +For more information, see the +.Sx Intent Log +section. +.It Sy dedup +A device solely dedicated for deduplication tables. +The redundancy of this device should match the redundancy of the other normal +devices in the pool. +If more than one dedup device is specified, then +allocations are load-balanced between those devices. +.It Sy special +A device dedicated solely for allocating various kinds of internal metadata, +and optionally small file blocks. +The redundancy of this device should match the redundancy of the other normal +devices in the pool. +If more than one special device is specified, then +allocations are load-balanced between those devices. +.Pp +For more information on special allocations, see the +.Sx Special Allocation Class +section. +.It Sy cache +A device used to cache storage pool data. +A cache device cannot be configured as a mirror or raidz group. +For more information, see the +.Sx Cache Devices +section. +.It Sy fdomain No or Sy failure_domain +Denotes the list of failure domain devices for dRAID vdev. +.It Sy fgroup No or Sy failure_group +Denotes the list of failure group devices for dRAID vdev. +.El +.Pp +Virtual devices cannot be nested arbitrarily. 
+A mirror, raidz or draid virtual device can only be created with files or disks. +Mirrors of mirrors or other such combinations are not allowed. +.Pp +A pool can have any number of virtual devices at the top of the configuration +.Po known as +.Qq root vdevs +.Pc . +Data is dynamically distributed across all top-level devices to balance data +among devices. +As new virtual devices are added, ZFS automatically places data on the newly +available devices. +.Pp +Virtual devices are specified one at a time on the command line, +separated by whitespace. +Keywords like +.Sy mirror No and Sy raidz +are used to distinguish where a group ends and another begins. +For example, the following creates a pool with two root vdevs, +each a mirror of two disks: +.Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy mirror Ar sdc sdd +. +.Ss Device Failure and Recovery +ZFS supports a rich set of mechanisms for handling device failure and data +corruption. +All metadata and data is checksummed, and ZFS automatically repairs bad data +from a good copy, when corruption is detected. +.Pp +In order to take advantage of these features, a pool must make use of some form +of redundancy, using either mirrored or raidz groups. +While ZFS supports running in a non-redundant configuration, where each root +vdev is simply a disk or file, this is strongly discouraged. +A single case of bit corruption can render some or all of your data unavailable. +.Pp +A pool's health status is described by one of three states: +.Sy online , degraded , No or Sy faulted . +An online pool has all devices operating normally. +A degraded pool is one in which one or more devices have failed, but the data is +still available due to a redundant configuration. +A faulted pool has corrupted metadata, or one or more faulted devices, and +insufficient replicas to continue functioning. 
+.Pp +The health of the top-level vdev, such as a mirror or raidz device, +is potentially impacted by the state of its associated vdevs +or component devices. +A top-level vdev or component device is in one of the following states: +.Bl -tag -width "DEGRADED" +.It Sy DEGRADED +One or more top-level vdevs is in the degraded state because one or more +component devices are offline. +Sufficient replicas exist to continue functioning. +.Pp +One or more component devices is in the degraded or faulted state, but +sufficient replicas exist to continue functioning. +The underlying conditions are as follows: +.Bl -bullet -compact +.It +The number of checksum errors or slow I/Os exceeds acceptable levels and the +device is degraded as an indication that something may be wrong. +ZFS continues to use the device as necessary. +.It +The number of I/O errors exceeds acceptable levels. +The device could not be marked as faulted because there are insufficient +replicas to continue functioning. +.El +.It Sy FAULTED +One or more top-level vdevs is in the faulted state because one or more +component devices are offline. +Insufficient replicas exist to continue functioning. +.Pp +One or more component devices is in the faulted state, and insufficient +replicas exist to continue functioning. +The underlying conditions are as follows: +.Bl -bullet -compact +.It +The device could be opened, but the contents did not match expected values. +.It +The number of I/O errors exceeds acceptable levels and the device is faulted to +prevent further use of the device. +.El +.It Sy OFFLINE +The device was explicitly taken offline by the +.Nm zpool Cm offline +command. +.It Sy ONLINE +The device is online and functioning. +.It Sy REMOVED +The device was physically removed while the system was running. +Device removal detection is hardware-dependent and may not be supported on all +platforms. +.It Sy UNAVAIL +The device could not be opened. 
+If a pool is imported when a device was unavailable, then the device will be +identified by a unique identifier instead of its path since the path was never +correct in the first place. +.El +.Pp +Checksum errors represent events where a disk returned data that was expected +to be correct, but was not. +In other words, these are instances of silent data corruption. +The checksum errors are reported in +.Nm zpool Cm status +and +.Nm zpool Cm events . +When a block is stored redundantly, a damaged block may be reconstructed +(e.g. from raidz parity or a mirrored copy). +In this case, ZFS reports the checksum error against the disks that contained +damaged data. +If a block is unable to be reconstructed (e.g. due to 3 disks being damaged +in a raidz2 group), it is not possible to determine which disks were silently +corrupted. +In this case, checksum errors are reported for all disks on which the block +is stored. +.Pp +If a device is removed and later re-attached to the system, +ZFS attempts to bring the device online automatically. +Device attachment detection is hardware-dependent +and might not be supported on all platforms. +. +.Ss Hot Spares +ZFS allows devices to be associated with pools as +.Qq hot spares . +These devices are not actively used in the pool. +But, when an active device +fails, it is automatically replaced by a hot spare. +To create a pool with hot spares, specify a +.Sy spare +vdev with any number of devices. +For example, +.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy spare Ar sdc sdd +.Pp +Spares can be shared across multiple pools, and can be added with the +.Nm zpool Cm add +command and removed with the +.Nm zpool Cm remove +command. +Once a spare replacement is initiated, a new +.Sy spare +vdev is created within the configuration that will remain there until the +original device is replaced. +At this point, the hot spare becomes available again, if another device fails. 
+.Pp +If a pool has a shared spare that is currently being used, the pool cannot be +exported, since other pools may use this shared spare, which may lead to +potential data corruption. +.Pp +Shared spares add some risk. +If the pools are imported on different hosts, +and both pools suffer a device failure at the same time, +both could attempt to use the spare at the same time. +This may not be detected, resulting in data corruption. +.Pp +An in-progress spare replacement can be canceled by detaching the hot spare. +If the original faulted device is detached, then the hot spare assumes its +place in the configuration, and is removed from the spare list of all active +pools. +.Pp +The +.Sy draid +vdev type provides distributed hot spares. +These are virtual devices whose blocks are reserved and distributed among +all real devices, which makes resilvering to them much faster because one +device is not a bottleneck anymore. +Fast resilvering is crucial for data durability: it shortens the time the pool +spends with degraded redundancy, and thus reduces the chance of losing more +devices at once than can be tolerated. +dRAID hot spares are named after the draid vdev they're a part of +.Po Sy draid1 Ns - Ns Ar 2 Ns - Ns Ar 3 No specifies spare Ar 3 No of vdev Ar 2 , +.No which is a single parity dRAID Pc +and may only be used by that dRAID vdev. +Otherwise, they behave the same as normal hot spares. +.Pp +Spares cannot replace log devices. +. +.Ss Intent Log +The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous +transactions. +For instance, databases often require their transactions to be on stable storage +devices when returning from a system call. +NFS and other applications can also use +.Xr fsync 2 +to ensure data stability. +By default, the intent log is allocated from blocks within the main pool. +However, it might be possible to get better performance using separate intent +log devices such as NVRAM or a dedicated disk.
+For example: +.Dl # Nm zpool Cm create Ar pool sda sdb Sy log Ar sdc +.Pp +Multiple log devices can also be specified, and they can be mirrored. +See the +.Sx EXAMPLES +section for an example of mirroring multiple log devices. +. +.Ss Cache Devices +Devices can be added to a storage pool as +.Qq cache devices . +These devices provide an additional layer of caching between main memory and +disk. +For read-heavy workloads, where the working set size is much larger than what +can be cached in main memory, using cache devices allows much more of this +working set to be served from low latency media. +Using cache devices provides the greatest performance improvement for random +read-workloads of mostly static content. +.Pp +To create a pool with cache devices, specify a +.Sy cache +vdev with any number of devices. +For example: +.Dl # Nm zpool Cm create Ar pool sda sdb Sy cache Ar sdc sdd +.Pp +Cache devices cannot be mirrored or part of a raidz configuration. +If a read error is encountered on a cache device, that read I/O is reissued to +the original storage pool device, which might be part of a mirrored or raidz +configuration. +.Pp +The content of the cache devices is persistent across reboots and restored +asynchronously when importing the pool in L2ARC (persistent L2ARC). +This can be disabled by setting +.Sy l2arc_rebuild_enabled Ns = Ns Sy 0 . +For cache devices smaller than +.Em 1 GiB , +ZFS does not write the metadata structures +required for rebuilding the L2ARC, to conserve space. +This can be changed with +.Sy l2arc_rebuild_blocks_min_l2size . +The cache device header +.Pq Em 512 B +is updated even if no metadata structures are written. +.Pp +L2ARC operates in one of two modes depending on total cache capacity. +When total L2ARC capacity is less than twice +.Sy arc_c_max , +L2ARC writes buffers to cache as they are evicted from ARC. 
+When total capacity is at least twice +.Sy arc_c_max , +L2ARC uses persistent markers that track scan positions across iterations, +writing cacheable content as write throughput allows. +A depth cap +.Pq Sy l2arc_ext_headroom_pct +limits how far markers advance from the tail, keeping them focused on +buffers soon to be evicted where caching adds the most value. +.Pp +If a cache device is added with +.Nm zpool Cm add , +its label and header will be overwritten and its contents will not be +restored in L2ARC, even if the device was previously part of the pool. +If a cache device is onlined with +.Nm zpool Cm online , +its contents will be restored in L2ARC. +This is useful in case of memory pressure, +where the contents of the cache device are not fully restored in L2ARC. +The user can off- and online the cache device when there is less memory +pressure, to fully restore its contents to L2ARC. +. +.Ss Pool checkpoint +Before starting critical procedures that include destructive actions +.Pq like Nm zfs Cm destroy , +an administrator can checkpoint the pool's state and, in the case of a +mistake or failure, rewind the entire pool back to the checkpoint. +Otherwise, the checkpoint can be discarded when the procedure has completed +successfully. +.Pp +A pool checkpoint can be thought of as a pool-wide snapshot and should be used +with care as it contains every part of the pool's state, from properties to vdev +configuration. +Thus, certain operations are not allowed while a pool has a checkpoint. +Specifically, vdev removal/attach/detach, mirror splitting, and +changing the pool's GUID. +Adding a new vdev is supported, but in the case of a rewind it will have to be +added again. +Finally, users of this feature should keep in mind that scrubs in a pool that +has a checkpoint do not repair checkpointed data. 
+.Pp +To create a checkpoint for a pool: +.Dl # Nm zpool Cm checkpoint Ar pool +.Pp +To later rewind to its checkpointed state, you need to first export it and +then rewind it during import: +.Dl # Nm zpool Cm export Ar pool +.Dl # Nm zpool Cm import Fl -rewind-to-checkpoint Ar pool +.Pp +Note that rewinding to a checkpoint will +.Sy permanently discard it. +Once the pool has been successfully imported with the above rewind command, +you cannot rewind to the same checkpoint. +.Pp +To discard the checkpoint from a pool: +.Dl # Nm zpool Cm checkpoint Fl d Ar pool +.Pp +Dataset reservations (controlled by the +.Sy reservation No and Sy refreservation +properties) may be unenforceable while a checkpoint exists, because the +checkpoint is allowed to consume the dataset's reservation. +Finally, data that is part of the checkpoint but has been freed in the +current state of the pool won't be scanned during a scrub. +. +.Ss Special Allocation Class +Allocations in the special class are dedicated to specific block types. +By default, this includes all metadata, the indirect blocks of user data, +intent log (in absence of separate log device), and deduplication tables. +The class can also be provisioned to accept small file blocks or zvol blocks +on a per dataset granularity. +.Pp +A pool must always have at least one normal +.Pq non- Ns Sy dedup Ns /- Ns Sy special +vdev before +other devices can be assigned to the special class. +If the +.Sy special +class becomes full, then allocations intended for it +will spill back into the normal class. +.Pp +Deduplication tables can be excluded from the special class by unsetting the +.Sy zfs_ddt_data_is_special +ZFS module parameter. +.Pp +Inclusion of small file or zvol blocks in the special class is opt-in. +Each dataset can control the size of small file blocks allowed +in the special class by setting the +.Sy special_small_blocks +property to nonzero. +See +.Xr zfsprops 7 +for more info on this property.
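The capacity and IOPS approximations quoted in the Virtual Devices section of the added page can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the patch or of any ZFS tool: the function names and the sample disk counts, sizes, and IOPS figures are hypothetical, and the results are only the approximations the man page states, ignoring metadata, padding, and allocation overhead.

```python
# Illustrative helpers (hypothetical names) for the formulas in zpoolconcepts.7.
# n = number of disks, x = size of one disk, p = parity level,
# d = data disks per redundancy group, s = distributed hot spares.


def mirror_capacity(n: int, x: float) -> float:
    """A mirror of N disks of size X holds X and survives N-1 failures."""
    if n < 2:
        raise ValueError("a mirror needs two or more devices")
    return x


def raidz_capacity(n: int, x: float, p: int) -> float:
    """raidz: approximately (N-P)*X, surviving P failed devices."""
    if p not in (1, 2, 3):
        raise ValueError("parity must be 1, 2, or 3")
    if n < p + 1:
        raise ValueError("need at least one more device than parity disks")
    return (n - p) * x


def draid_capacity(n: int, x: float, d: int, p: int, s: int) -> float:
    """dRAID: approximately (N-S)*(D/(D+P))*X, surviving P failed devices."""
    return (n - s) * d * x / (d + p)


def draid_random_iops(n: int, d: int, p: int, s: int, drive_iops: int) -> int:
    """dRAID random IOPS: floor((N-S)/(D+P)) * single_drive_IOPS."""
    return ((n - s) // (d + p)) * drive_iops


if __name__ == "__main__":
    # Assumed example values: 4 TB disks, 200 random IOPS per drive.
    print(mirror_capacity(n=2, x=4.0))                 # 2-way mirror
    print(raidz_capacity(n=6, x=4.0, p=2))             # 6-disk raidz2
    print(draid_capacity(n=24, x=4.0, d=8, p=2, s=2))  # draid2:8d:24c:2s
    print(draid_random_iops(n=24, d=8, p=2, s=2, drive_iops=200))
```

With these assumed inputs, the 24-disk draid2 layout gives up more raw capacity than the 6-disk raidz2 group in absolute terms because two disks' worth of space is reserved for distributed spares, which is the trade-off the page describes for faster resilvering.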
