Diffstat (limited to 'static/freebsd/man7/zpoolconcepts.7')
-rw-r--r--  static/freebsd/man7/zpoolconcepts.7  556
1 file changed, 556 insertions, 0 deletions
diff --git a/static/freebsd/man7/zpoolconcepts.7 b/static/freebsd/man7/zpoolconcepts.7
new file mode 100644
index 00000000..ebd0b346
--- /dev/null
+++ b/static/freebsd/man7/zpoolconcepts.7
@@ -0,0 +1,556 @@
+.\" SPDX-License-Identifier: CDDL-1.0
+.\"
+.\" CDDL HEADER START
+.\"
+.\" The contents of this file are subject to the terms of the
+.\" Common Development and Distribution License (the "License").
+.\" You may not use this file except in compliance with the License.
+.\"
+.\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+.\" or https://opensource.org/licenses/CDDL-1.0.
+.\" See the License for the specific language governing permissions
+.\" and limitations under the License.
+.\"
+.\" When distributing Covered Code, include this CDDL HEADER in each
+.\" file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+.\" If applicable, add the following below this CDDL HEADER, with the
+.\" fields enclosed by brackets "[]" replaced with your own identifying
+.\" information: Portions Copyright [yyyy] [name of copyright owner]
+.\"
+.\" CDDL HEADER END
+.\"
+.\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved.
+.\" Copyright (c) 2012, 2018 by Delphix. All rights reserved.
+.\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved.
+.\" Copyright (c) 2017 Datto Inc.
+.\" Copyright (c) 2018 George Melikov. All Rights Reserved.
+.\" Copyright 2017 Nexenta Systems, Inc.
+.\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
+.\" Copyright (c) 2026 Seagate Technology, LLC.
+.\"
+.Dd August 6, 2025
+.Dt ZPOOLCONCEPTS 7
+.Os
+.
+.Sh NAME
+.Nm zpoolconcepts
+.Nd overview of ZFS storage pools
+.
+.Sh DESCRIPTION
+.Ss Virtual Devices (vdevs)
+A "virtual device" describes a single device or a collection of devices,
+organized according to certain performance and fault characteristics.
+The following virtual devices are supported:
+.Bl -tag -width "special"
+.It Sy disk
+A block device, typically located under
+.Pa /dev .
+ZFS can use individual slices or partitions, though the recommended mode of
+operation is to use whole disks.
+A disk can be specified by a full path, or it can be a shorthand name
+.Po the relative portion of the path under
+.Pa /dev
+.Pc .
+A whole disk can be specified by omitting the slice or partition designation.
+For example,
+.Pa sda
+is equivalent to
+.Pa /dev/sda .
+When given a whole disk, ZFS automatically labels the disk, if necessary.
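+For instance, the following two commands are equivalent, and both hand the
+whole disk to ZFS (the pool name is illustrative):
+.Dl # Nm zpool Cm create Ar mypool Ar sda
+.Dl # Nm zpool Cm create Ar mypool Pa /dev/sda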
+.It Sy file
+A regular file.
+The use of files as a backing store is strongly discouraged.
+It is designed primarily for experimental purposes, as the fault tolerance of a
+file is only as good as the file system on which it resides.
+A file must be specified by a full path.
+.It Sy mirror
+A mirror of two or more devices.
+Data is replicated in an identical fashion across all components of a mirror.
+A mirror with
+.Em N No disks of size Em X No can hold Em X No bytes and can withstand Em N-1
+devices failing, without losing data.
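+For example, a three-way mirror of 4 TiB disks (sizes and device names are
+illustrative) holds 4 TiB and can withstand two of its disks failing:
+.Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb sdc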
+.It Sy raidz , raidz1 , raidz2 , raidz3
+A distributed-parity layout, similar to RAID-5/6, with improved distribution of
+parity, and which does not suffer from the RAID-5/6
+.Qq write hole ,
+.Pq in which data and parity become inconsistent after a power loss .
+Data and parity are striped across all disks within a raidz group, though not
+necessarily in a consistent stripe width.
+.Pp
+A raidz group can have single, double, or triple parity, meaning that the
+raidz group can sustain one, two, or three failures, respectively, without
+losing any data.
+The
+.Sy raidz1
+vdev type specifies a single-parity raidz group; the
+.Sy raidz2
+vdev type specifies a double-parity raidz group; and the
+.Sy raidz3
+vdev type specifies a triple-parity raidz group.
+The
+.Sy raidz
+vdev type is an alias for
+.Sy raidz1 .
+.Pp
+A raidz group with
+.Em N No disks of size Em X No with Em P No parity disks can hold approximately
+.Em (N-P)*X No bytes and can withstand Em P No devices failing without losing data .
+The minimum number of devices in a raidz group is one more than the number of
+parity disks.
+The recommended number is between 3 and 9 to help increase performance.
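+For example, a double-parity group of six 4 TiB disks (sizes and device names
+are illustrative) holds approximately (6-2)*4 TiB = 16 TiB and can withstand
+any two disks failing:
+.Dl # Nm zpool Cm create Ar mypool Sy raidz2 Ar sda sdb sdc sdd sde sdf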
+.It Sy draid , draid1 , draid2 , draid3
+A variant of raidz that provides integrated distributed hot spares, allowing
+for faster resilvering, while retaining the benefits of raidz.
+A dRAID vdev is constructed from multiple internal raidz groups, each with
+.Em D No data devices and Em P No parity devices .
+These groups are distributed over all of the children in order to fully
+utilize the available disk performance.
+.Pp
+Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
+zeros) to allow fully sequential resilvering.
+This fixed stripe width significantly affects both usable capacity and IOPS.
+For example, with the default
+.Em D=8 No and Em 4 KiB No disk sectors the minimum allocation size is Em 32 KiB .
+If using compression, this relatively large allocation size can reduce the
+effective compression ratio.
+When using ZFS volumes (zvols) and dRAID, the default of the
+.Sy volblocksize
+property is increased to account for the allocation size.
+If a dRAID pool will hold a significant amount of small blocks, it is
+recommended to also add a mirrored
+.Sy special
+vdev to store those blocks.
+.Pp
+With regard to I/O, performance is similar to raidz since, for any read, all
+.Em D No data disks must be accessed .
+Delivered random IOPS can be reasonably approximated as
+.Sy floor((N-S)/(D+P))*single_drive_IOPS .
+.Pp
+Like raidz, a dRAID can have single-, double-, or triple-parity.
+The
+.Sy draid1 ,
+.Sy draid2 ,
+and
+.Sy draid3
+types can be used to specify the parity level.
+The
+.Sy draid
+vdev type is an alias for
+.Sy draid1 .
+.Pp
+A dRAID with
+.Em N No disks of size Em X , D No data disks per redundancy group , Em P
+.No parity level, and Em S No distributed hot spares can hold approximately
+.Em (N-S)*(D/(D+P))*X No bytes and can withstand Em P
+devices failing without losing data.
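+.Pp
+For instance, a
+.Sy draid2
+vdev with 32 children, 8 data disks per group, and 2 distributed spares
+(an illustrative configuration) holds approximately
+.Em (32-2)*(8/(8+2))*X No = Em 24*X No bytes
+and delivers roughly
+.Em floor((32-2)/(8+2)) No = Em 3
+times the random IOPS of a single drive.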
+.It Sy draid Ns Oo Ar parity Oc Ns Oo Sy \&: Ns Ar data Ns Sy d Oc Ns Oo Sy \&: Ns Ar children Ns Sy c Oc Ns Oo Sy \&: Ns Ar width Ns Sy w Oc Ns Oo Sy \&: Ns Ar spares Ns Sy s Oc
+A non-default dRAID configuration can be specified by appending one or more
+of the following optional arguments to the
+.Sy draid
+keyword:
+.Bl -tag -compact -width "children"
+.It Ar parity
+The parity level (1-3).
+.It Ar data
+The number of data devices per redundancy group.
+In general, a smaller value of
+.Em D No will increase IOPS, improve the compression ratio ,
+and speed up resilvering at the expense of total usable capacity.
+Defaults to
+.Em 8 , No unless Em N-P-S No is less than Em 8 .
+.It Ar children
+The expected number of children.
+Useful as a cross-check when listing a large number of devices.
+An error is returned when the provided number of children differs.
+.It Ar width
+Several groups of children can be configured in the same row, in which case
+.Em width No would be a multiple of Em children .
+Such configurations allow the creation of failure groups, with every i-th
+device in each group coming from a different failure domain (for example, an
+enclosure), so that if all devices in one domain fail, the
+.Em draid No vdev will still be operational, with enough redundancy to
+rebuild the data.
+In the case of
+.Em draid2 , No two domains can fail at a time; in the case of
+.Em draid3 , No three domains (provided there are no other failures in any
+failure group).
+Each failure group then sees at most one, two, or three failures,
+respectively.
+.It Ar spares
+The number of distributed hot spares.
+All spares are shared between failure groups.
+Defaults to zero.
+.Pp
+Note: to retain the ability to tolerate the failure of a whole domain, no
+failure group may accumulate more than
+.Em parity-1 No failures, regardless of whether the failed
+devices have been rebuilt onto dRAID hot spares; the blocks of those
+spares can be mapped to devices from the failed domain, and no more than
+.Em parity No failures can be tolerated in any failure group .
+.El
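+.Pp
+For example, a double-parity dRAID with 4 data disks per group, 11 children,
+and 2 distributed spares could be created as follows (pool and device names
+are illustrative):
+.Dl # Nm zpool Cm create Ar mypool Sy draid2:4d:11c:2s Ar sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk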
+.It Sy spare
+A pseudo-vdev which keeps track of available hot spares for a pool.
+For more information, see the
+.Sx Hot Spares
+section.
+.It Sy log
+A separate intent log device.
+If more than one log device is specified, then writes are load-balanced between
+devices.
+Log devices can be mirrored.
+However, raidz vdev types are not supported for the intent log.
+For more information, see the
+.Sx Intent Log
+section.
+.It Sy dedup
+A device solely dedicated for deduplication tables.
+The redundancy of this device should match the redundancy of the other normal
+devices in the pool.
+If more than one dedup device is specified, then
+allocations are load-balanced between those devices.
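+For example, a mirrored pool with a matching mirrored dedup vdev could be
+created as follows (device names are illustrative):
+.Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy dedup mirror Ar sdc sdd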
+.It Sy special
+A device dedicated solely for allocating various kinds of internal metadata,
+and optionally small file blocks.
+The redundancy of this device should match the redundancy of the other normal
+devices in the pool.
+If more than one special device is specified, then
+allocations are load-balanced between those devices.
+.Pp
+For more information on special allocations, see the
+.Sx Special Allocation Class
+section.
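+For example, a mirrored pool with a matching mirrored special vdev could be
+created as follows (device names are illustrative):
+.Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy special mirror Ar sdc sdd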
+.It Sy cache
+A device used to cache storage pool data.
+A cache device cannot be configured as a mirror or raidz group.
+For more information, see the
+.Sx Cache Devices
+section.
+.It Sy fdomain No or Sy failure_domain
+Denotes the list of failure domain devices for a dRAID vdev.
+.It Sy fgroup No or Sy failure_group
+Denotes the list of failure group devices for a dRAID vdev.
+.El
+.Pp
+Virtual devices cannot be nested arbitrarily.
+A mirror, raidz or draid virtual device can only be created with files or disks.
+Mirrors of mirrors or other such combinations are not allowed.
+.Pp
+A pool can have any number of virtual devices at the top of the configuration
+.Po known as
+.Qq root vdevs
+.Pc .
+Data is dynamically distributed across all top-level devices to balance data
+among devices.
+As new virtual devices are added, ZFS automatically places data on the newly
+available devices.
+.Pp
+Virtual devices are specified one at a time on the command line,
+separated by whitespace.
+Keywords like
+.Sy mirror No and Sy raidz
+are used to distinguish where a group ends and another begins.
+For example, the following creates a pool with two root vdevs,
+each a mirror of two disks:
+.Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy mirror Ar sdc sdd
+.
+.Ss Device Failure and Recovery
+ZFS supports a rich set of mechanisms for handling device failure and data
+corruption.
+All metadata and data is checksummed, and ZFS automatically repairs bad data
+from a good copy, when corruption is detected.
+.Pp
+In order to take advantage of these features, a pool must make use of some form
+of redundancy, using either mirrored or raidz groups.
+While ZFS supports running in a non-redundant configuration, where each root
+vdev is simply a disk or file, this is strongly discouraged.
+A single case of bit corruption can render some or all of your data unavailable.
+.Pp
+A pool's health status is described by one of three states:
+.Sy online , degraded , No or Sy faulted .
+An online pool has all devices operating normally.
+A degraded pool is one in which one or more devices have failed, but the data is
+still available due to a redundant configuration.
+A faulted pool has corrupted metadata, or one or more faulted devices, and
+insufficient replicas to continue functioning.
+.Pp
+The health of the top-level vdev, such as a mirror or raidz device,
+is potentially impacted by the state of its associated vdevs
+or component devices.
+A top-level vdev or component device is in one of the following states:
+.Bl -tag -width "DEGRADED"
+.It Sy DEGRADED
+One or more top-level vdevs is in the degraded state because one or more
+component devices are offline.
+Sufficient replicas exist to continue functioning.
+.Pp
+One or more component devices is in the degraded or faulted state, but
+sufficient replicas exist to continue functioning.
+The underlying conditions are as follows:
+.Bl -bullet -compact
+.It
+The number of checksum errors or slow I/Os exceeds acceptable levels and the
+device is degraded as an indication that something may be wrong.
+ZFS continues to use the device as necessary.
+.It
+The number of I/O errors exceeds acceptable levels.
+The device could not be marked as faulted because there are insufficient
+replicas to continue functioning.
+.El
+.It Sy FAULTED
+One or more top-level vdevs is in the faulted state because one or more
+component devices are offline.
+Insufficient replicas exist to continue functioning.
+.Pp
+One or more component devices is in the faulted state, and insufficient
+replicas exist to continue functioning.
+The underlying conditions are as follows:
+.Bl -bullet -compact
+.It
+The device could be opened, but the contents did not match expected values.
+.It
+The number of I/O errors exceeds acceptable levels and the device is faulted to
+prevent further use of the device.
+.El
+.It Sy OFFLINE
+The device was explicitly taken offline by the
+.Nm zpool Cm offline
+command.
+.It Sy ONLINE
+The device is online and functioning.
+.It Sy REMOVED
+The device was physically removed while the system was running.
+Device removal detection is hardware-dependent and may not be supported on all
+platforms.
+.It Sy UNAVAIL
+The device could not be opened.
+If a pool is imported when a device was unavailable, then the device will be
+identified by a unique identifier instead of its path since the path was never
+correct in the first place.
+.El
+.Pp
+Checksum errors represent events where a disk returned data that was expected
+to be correct, but was not.
+In other words, these are instances of silent data corruption.
+The checksum errors are reported in
+.Nm zpool Cm status
+and
+.Nm zpool Cm events .
+When a block is stored redundantly, a damaged block may be reconstructed
+(e.g. from raidz parity or a mirrored copy).
+In this case, ZFS reports the checksum error against the disks that contained
+damaged data.
+If a block is unable to be reconstructed (e.g. due to 3 disks being damaged
+in a raidz2 group), it is not possible to determine which disks were silently
+corrupted.
+In this case, checksum errors are reported for all disks on which the block
+is stored.
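+.Pp
+For example, per-device checksum error counters, and the files affected by any
+unrecoverable errors, can be inspected with (pool name illustrative):
+.Dl # Nm zpool Cm status Fl v Ar pool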
+.Pp
+If a device is removed and later re-attached to the system,
+ZFS attempts to bring the device online automatically.
+Device attachment detection is hardware-dependent
+and might not be supported on all platforms.
+.
+.Ss Hot Spares
+ZFS allows devices to be associated with pools as
+.Qq hot spares .
+These devices are not actively used in the pool, but when an active device
+fails, it is automatically replaced by a hot spare.
+To create a pool with hot spares, specify a
+.Sy spare
+vdev with any number of devices.
+For example,
+.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy spare Ar sdc sdd
+.Pp
+Spares can be shared across multiple pools, and can be added with the
+.Nm zpool Cm add
+command and removed with the
+.Nm zpool Cm remove
+command.
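+For example (device name illustrative):
+.Dl # Nm zpool Cm add Ar pool Sy spare Ar sde
+.Dl # Nm zpool Cm remove Ar pool Ar sde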
+Once a spare replacement is initiated, a new
+.Sy spare
+vdev is created within the configuration that will remain there until the
+original device is replaced.
+At this point, the hot spare becomes available again, if another device fails.
+.Pp
+If a pool has a shared spare that is currently being used, the pool cannot be
+exported, since other pools may use this shared spare, which may lead to
+potential data corruption.
+.Pp
+Shared spares add some risk.
+If the pools are imported on different hosts,
+and both pools suffer a device failure at the same time,
+both could attempt to use the spare at the same time.
+This may not be detected, resulting in data corruption.
+.Pp
+An in-progress spare replacement can be canceled by detaching the hot spare.
+If the original faulted device is detached, then the hot spare assumes its
+place in the configuration, and is removed from the spare list of all active
+pools.
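+For example, assuming the hot spare
+.Ar sdd No is the device currently standing in for a faulted disk, the
+replacement can be canceled with:
+.Dl # Nm zpool Cm detach Ar pool sdd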
+.Pp
+The
+.Sy draid
+vdev type provides distributed hot spares.
+These are virtual devices whose blocks are reserved and distributed among
+all real devices, which makes resilvering to them much faster because one
+device is not a bottleneck anymore.
+Fast resilvering is crucial for data durability: it shortens the time during
+which the pool runs with degraded redundancy, and thus reduces the chance of
+losing more devices than can be tolerated.
+dRAID hot spares are named after the draid vdev they're a part of
+.Po Sy draid1 Ns - Ns Ar 2 Ns - Ns Ar 3 No specifies spare Ar 3 No of vdev Ar 2 ,
+.No which is a single parity dRAID Pc
+and may only be used by that dRAID vdev.
+Otherwise, they behave the same as normal hot spares.
+.Pp
+Spares cannot replace log devices.
+.
+.Ss Intent Log
+The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
+transactions.
+For instance, databases often require their transactions to be on stable storage
+devices when returning from a system call.
+NFS and other applications can also use
+.Xr fsync 2
+to ensure data stability.
+By default, the intent log is allocated from blocks within the main pool.
+However, it might be possible to get better performance using separate intent
+log devices such as NVRAM or a dedicated disk.
+For example:
+.Dl # Nm zpool Cm create Ar pool sda sdb Sy log Ar sdc
+.Pp
+Multiple log devices can also be specified, and they can be mirrored.
+See the
+.Sx EXAMPLES
+section for an example of mirroring multiple log devices.
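+.Pp
+As a further illustration, a mirrored pair of log devices can be specified at
+pool creation time (device names illustrative):
+.Dl # Nm zpool Cm create Ar pool sda sdb Sy log mirror Ar sdc sdd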
+.
+.Ss Cache Devices
+Devices can be added to a storage pool as
+.Qq cache devices .
+These devices provide an additional layer of caching between main memory and
+disk.
+For read-heavy workloads, where the working set size is much larger than what
+can be cached in main memory, using cache devices allows much more of this
+working set to be served from low latency media.
+Using cache devices provides the greatest performance improvement for random
+read workloads of mostly static content.
+.Pp
+To create a pool with cache devices, specify a
+.Sy cache
+vdev with any number of devices.
+For example:
+.Dl # Nm zpool Cm create Ar pool sda sdb Sy cache Ar sdc sdd
+.Pp
+Cache devices cannot be mirrored or part of a raidz configuration.
+If a read error is encountered on a cache device, that read I/O is reissued to
+the original storage pool device, which might be part of a mirrored or raidz
+configuration.
+.Pp
+The content of the cache devices is persistent across reboots and restored
+asynchronously when importing the pool in L2ARC (persistent L2ARC).
+This can be disabled by setting
+.Sy l2arc_rebuild_enabled Ns = Ns Sy 0 .
+For cache devices smaller than
+.Em 1 GiB ,
+ZFS does not write the metadata structures
+required for rebuilding the L2ARC, to conserve space.
+This can be changed with
+.Sy l2arc_rebuild_blocks_min_l2size .
+The cache device header
+.Pq Em 512 B
+is updated even if no metadata structures are written.
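+.Pp
+On FreeBSD, these module parameters are typically exposed through
+.Xr sysctl 8 ;
+for example, persistent L2ARC rebuilding may be disabled with a command along
+the lines of the following (the exact sysctl name should be verified on the
+running system):
+.Dl # sysctl vfs.zfs.l2arc.rebuild_enabled=0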
+.Pp
+L2ARC operates in one of two modes depending on total cache capacity.
+When total L2ARC capacity is less than twice
+.Sy arc_c_max ,
+L2ARC writes buffers to cache as they are evicted from ARC.
+When total capacity is at least twice
+.Sy arc_c_max ,
+L2ARC uses persistent markers that track scan positions across iterations,
+writing cacheable content as write throughput allows.
+A depth cap
+.Pq Sy l2arc_ext_headroom_pct
+limits how far markers advance from the tail, keeping them focused on
+buffers soon to be evicted where caching adds the most value.
+.Pp
+If a cache device is added with
+.Nm zpool Cm add ,
+its label and header will be overwritten and its contents will not be
+restored in L2ARC, even if the device was previously part of the pool.
+If a cache device is onlined with
+.Nm zpool Cm online ,
+its contents will be restored in L2ARC.
+This is useful in case of memory pressure,
+where the contents of the cache device are not fully restored in L2ARC.
+The user can off- and online the cache device when there is less memory
+pressure, to fully restore its contents to L2ARC.
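+For example, a cache device can be taken offline and later brought back online
+as follows (device name illustrative):
+.Dl # Nm zpool Cm offline Ar pool sdc
+.Dl # Nm zpool Cm online Ar pool sdc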
+.
+.Ss Pool checkpoint
+Before starting critical procedures that include destructive actions
+.Pq like Nm zfs Cm destroy ,
+an administrator can checkpoint the pool's state and, in the case of a
+mistake or failure, rewind the entire pool back to the checkpoint.
+Otherwise, the checkpoint can be discarded when the procedure has completed
+successfully.
+.Pp
+A pool checkpoint can be thought of as a pool-wide snapshot and should be used
+with care as it contains every part of the pool's state, from properties to vdev
+configuration.
+Thus, certain operations are not allowed while a pool has a checkpoint;
+specifically, vdev removal/attach/detach, mirror splitting, and
+changing the pool's GUID are disallowed.
+Adding a new vdev is supported, but in the case of a rewind it will have to be
+added again.
+Finally, users of this feature should keep in mind that scrubs in a pool that
+has a checkpoint do not repair checkpointed data.
+.Pp
+To create a checkpoint for a pool:
+.Dl # Nm zpool Cm checkpoint Ar pool
+.Pp
+To later rewind to its checkpointed state, you need to first export it and
+then rewind it during import:
+.Dl # Nm zpool Cm export Ar pool
+.Dl # Nm zpool Cm import Fl -rewind-to-checkpoint Ar pool
+.Pp
+Note that rewinding to a checkpoint will
+.Sy permanently discard it .
+Once the pool has been successfully imported with the above rewind command,
+you cannot rewind to the same checkpoint.
+.Pp
+To discard the checkpoint from a pool:
+.Dl # Nm zpool Cm checkpoint Fl d Ar pool
+.Pp
+Dataset reservations (controlled by the
+.Sy reservation No and Sy refreservation
+properties) may be unenforceable while a checkpoint exists, because the
+checkpoint is allowed to consume the dataset's reservation.
+Finally, data that is part of the checkpoint but has been freed in the
+current state of the pool won't be scanned during a scrub.
+.
+.Ss Special Allocation Class
+Allocations in the special class are dedicated to specific block types.
+By default, this includes all metadata, the indirect blocks of user data,
+the intent log (in the absence of a separate log device), and deduplication tables.
+The class can also be provisioned to accept small file blocks or zvol blocks
+on a per-dataset basis.
+.Pp
+A pool must always have at least one normal
+.Pq non- Ns Sy dedup Ns /- Ns Sy special
+vdev before
+other devices can be assigned to the special class.
+If the
+.Sy special
+class becomes full, then allocations intended for it
+will spill back into the normal class.
+.Pp
+Deduplication tables can be excluded from the special class by unsetting the
+.Sy zfs_ddt_data_is_special
+ZFS module parameter.
+.Pp
+Inclusion of small file or zvol blocks in the special class is opt-in.
+Each dataset can control the size of small file blocks allowed
+in the special class by setting the
+.Sy special_small_blocks
+property to nonzero.
+See
+.Xr zfsprops 7
+for more info on this property.
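+For example, to opt a dataset's blocks of up to 32 KiB into the special class
+(dataset name illustrative):
+.Dl # Nm zfs Cm set Sy special_small_blocks Ns = Ns Ar 32K mypool/mydataset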