<table class="head">
  <tr>
    <td class="head-ltitle">BUF(9)</td>
    <td class="head-vol">Kernel Developer's Manual</td>
    <td class="head-rtitle">BUF(9)</td>
  </tr>
</table>
<div class="manual-text">
<section class="Sh">
<h1 class="Sh" id="NAME"><a class="permalink" href="#NAME">NAME</a></h1>
<p class="Pp"><code class="Nm">buf</code> &#x2014; <span class="Nd">kernel
    buffer I/O scheme used in the FreeBSD VM system</span></p>
</section>
<section class="Sh">
<h1 class="Sh" id="DESCRIPTION"><a class="permalink" href="#DESCRIPTION">DESCRIPTION</a></h1>
<p class="Pp">The kernel implements a KVM abstraction of the buffer cache which
    allows it to map potentially disparate vm_pages into contiguous KVM for use
    by (mainly file system) devices and device I/O. This abstraction supports
    block sizes from <code class="Dv">DEV_BSIZE</code> (usually 512) to
    several pages or more. It also supports a relatively primitive
    byte-granular valid range and dirty range currently hardcoded for use by
    NFS. The code implementing the VM buffer abstraction is mostly concentrated
    in <span class="Pa">sys/kern/vfs_bio.c</span> in the
    <span class="Ux">FreeBSD</span> source tree.</p>
<p class="Pp" id="page">One of the most important things to remember when
    dealing with buffer pointers (<var class="Vt">struct buf</var>) is that the
    underlying pages are mapped directly from the buffer cache. No data copying
    occurs in the scheme proper, though some file systems such as UFS do have to
    copy a little when dealing with file fragments. The second most important
    thing to remember is that due to the underlying page mapping, the
    <var class="Va">b_data</var> base pointer in a buf is always
    <a class="permalink" href="#page"><i class="Em">page</i></a>-aligned, not
    <a class="permalink" href="#block"><i class="Em" id="block">block</i></a>-aligned.
    When you have a VM buffer representing some <var class="Va">b_offset</var>
    and <var class="Va">b_size</var>, the actual start of the buffer is
    &#x2018;<code class="Li">b_data + (b_offset &amp; PAGE_MASK)</code>&#x2019;
    and not just &#x2018;<code class="Li">b_data</code>&#x2019;. Finally, the VM
    system's core buffer cache supports valid and dirty bits
    (<var class="Va">m-&gt;valid</var>, <var class="Va">m-&gt;dirty</var>) for
    pages in <code class="Dv">DEV_BSIZE</code> chunks. Thus a platform with a
    hardware page size of 4096 bytes has 8 valid and 8 dirty bits per page
    (4096 / 512, assuming a <code class="Dv">DEV_BSIZE</code> of 512). These
    bits are generally set and cleared in groups based on the device block size
    of the device backing the page. A complete page's worth of bits is often
    referred to using the <code class="Dv">VM_PAGE_BITS_ALL</code> bitmask
    (i.e., 0xFF if the hardware page size is 4096).</p>
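<p class="Pp">The two rules above (the page-aligned base pointer and the
    per-chunk valid/dirty bits) can be illustrated with a small userland
    sketch. Everything prefixed <code class="Li">sk_</code> or
    <code class="Li">SK_</code> is a hypothetical stand-in invented for this
    example, and the constants assume a 4096-byte page and a 512-byte
    <code class="Dv">DEV_BSIZE</code>; it is not the kernel's own code.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; the kernel takes these from its headers. */
#define SK_PAGE_SIZE	4096u
#define SK_PAGE_MASK	(SK_PAGE_SIZE - 1)
#define SK_DEV_BSHIFT	9		/* log2 of an assumed 512-byte DEV_BSIZE */
#define SK_PAGE_BITS_ALL 0xFFu		/* 4096 / 512 = 8 bits per page */

/* Hypothetical, pared-down stand-in for struct buf. */
struct sk_buf {
	char	*b_data;	/* page-aligned base of the mapping */
	uint64_t b_offset;	/* byte offset the buffer represents */
	size_t	 b_size;	/* length of the buffer in bytes */
};

/* Actual start of the buffer's data: b_data + (b_offset & PAGE_MASK). */
static char *
sk_buf_start(const struct sk_buf *bp)
{
	return (bp->b_data + (bp->b_offset & SK_PAGE_MASK));
}

/*
 * Mask of per-chunk bits covering bytes [base, base + size) of one page.
 * Each bit stands for one DEV_BSIZE-sized chunk.
 */
static unsigned
sk_page_bits(unsigned base, unsigned size)
{
	unsigned first_bit, last_bit;

	assert(base + size <= SK_PAGE_SIZE);
	if (size == 0)
		return (0);
	first_bit = base >> SK_DEV_BSHIFT;
	last_bit = (base + size - 1) >> SK_DEV_BSHIFT;
	return ((2u << last_bit) - (1u << first_bit));
}
```

<p class="Pp">Under these assumptions, a buffer whose
    <var class="Va">b_offset</var> is 0x1200 starts 0x200 bytes past
    <var class="Va">b_data</var>, and
    <code class="Li">sk_page_bits(0, SK_PAGE_SIZE)</code> equals
    <code class="Li">SK_PAGE_BITS_ALL</code>.</p>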
<p class="Pp">VM buffers also keep track of a byte-granular dirty range and
    valid range. This feature is normally only used by the NFS subsystem. I am
    not sure why it is used at all, actually, since we have
    <code class="Dv">DEV_BSIZE</code> valid/dirty granularity within the VM
    buffer. If a buffer dirty operation creates a &#x201C;hole&#x201D;, the
    dirty range will extend to cover the hole. If a buffer validation operation
    creates a &#x201C;hole&#x201D;, the byte-granular valid range is left alone
    and will not take into account the new extension. Thus the whole
    byte-granular abstraction is considered a bad hack and it would be nice if
    we could get rid of it completely.</p>
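<p class="Pp">As a rough illustration of the dirty-range behaviour described
    above, the following sketch extends a single byte range so that it always
    spans both the old and the newly dirtied bytes, swallowing any hole
    between them. The structure and function names are hypothetical and chosen
    for this example only. The valid-range side, which is left alone, is not
    modelled.</p>

```c
#include <assert.h>

/* Hypothetical byte range [off, end) within a buffer; empty when end == off. */
struct sk_range {
	int off;
	int end;
};

/*
 * Record that bytes [off, end) were dirtied.  If the new bytes do not touch
 * the existing range, the result still spans both, so the gap ("hole")
 * between them is treated as dirty too.
 */
static void
sk_dirty_extend(struct sk_range *r, int off, int end)
{
	assert(off >= 0 && off <= end);
	if (off == end)
		return;			/* nothing was dirtied */
	if (r->end == r->off) {		/* no prior dirty range */
		r->off = off;
		r->end = end;
		return;
	}
	if (off < r->off)
		r->off = off;
	if (end > r->end)
		r->end = end;
}
```

<p class="Pp">For instance, dirtying bytes 100 to 200 and then bytes 1000 to
    1100 leaves a single range of 100 to 1100, even though bytes 200 to 1000
    were never written.</p>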
<p class="Pp">A VM buffer is capable of mapping the underlying VM cache pages
    into KVM in order to allow the kernel to directly manipulate the data
    associated with the (<var class="Va">vnode</var>,
    <var class="Va">b_offset</var>, <var class="Va">b_size</var>). The kernel
    typically unmaps VM buffers the moment they are no longer needed but often
    keeps the <var class="Vt">struct buf</var> structure, and even the
    <var class="Va">bp-&gt;b_pages</var> array, instantiated despite having
    unmapped them from KVM. If a page making up a VM buffer is about to undergo
    I/O, the system typically unmaps it from KVM and replaces the page in the
    <var class="Va">b_pages[]</var> array with a place-marker called bogus_page.
    The place-marker forces any kernel subsystems referencing the associated
    <var class="Vt">struct buf</var> to re-lookup the associated page. I believe
    the place-marker hack is used to allow sophisticated devices such as file
    system devices to remap underlying pages in order to deal with, for example,
    re-mapping a file fragment into a file block.</p>
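<p class="Pp">The place-marker idea can be pictured with a toy model. This is
    only a sketch under invented assumptions: the types, the four-entry page
    array, and the <code class="Li">sk_cache</code> lookup table are all
    hypothetical, and the real kernel's lookup path is considerably more
    involved.</p>

```c
#include <assert.h>
#include <stddef.h>

#define SK_NPAGES 4

/* Hypothetical stand-ins; not the kernel's vm_page or struct buf. */
struct sk_page {
	int id;
};

struct sk_pgbuf {
	struct sk_page *b_pages[SK_NPAGES];
};

/* The single shared place-marker, playing the role of bogus_page. */
static struct sk_page sk_bogus_page = { -1 };

/* Toy "cache": the authoritative page for each index of one buffer. */
static struct sk_page sk_cache[SK_NPAGES] = { {0}, {1}, {2}, {3} };

/* Swap in the place-marker before I/O starts on page i. */
static void
sk_mark_busy(struct sk_pgbuf *bp, size_t i)
{
	assert(i < SK_NPAGES);
	bp->b_pages[i] = &sk_bogus_page;
}

/* Return page i, looking it up again if only the place-marker is present. */
static struct sk_page *
sk_get_page(struct sk_pgbuf *bp, size_t i)
{
	assert(i < SK_NPAGES);
	if (bp->b_pages[i] == &sk_bogus_page)
		bp->b_pages[i] = &sk_cache[i];	/* forced re-lookup */
	return (bp->b_pages[i]);
}
```

<p class="Pp">The point of the sketch is that a consumer never trusts a stale
    array entry: seeing the place-marker is what triggers the fresh
    lookup.</p>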
<p class="Pp">VM buffers are used to track I/O operations within the kernel.
    Unfortunately, the I/O implementation is also somewhat of a hack because the
    kernel wants to clear the dirty bit on the underlying pages the moment it
    queues the I/O to the VFS device, not when the physical I/O is actually
    initiated. This can create confusion within file system devices that use
    delayed-writes because you wind up with pages marked clean that are actually
    still dirty. If not treated carefully, these pages could be thrown away!
    Indeed, a number of serious bugs related to this hack were not fixed until
    the <span class="Ux">FreeBSD 2.2.8</span> / <span class="Ux">FreeBSD
    3.0</span> releases. The kernel uses an instantiated VM buffer (i.e.,
    <var class="Vt">struct buf</var>) to place-mark pages in this special state.
    The buffer is typically flagged <code class="Dv">B_DELWRI</code>. When a
    device no longer needs a buffer it typically flags it as
    <code class="Dv">B_RELBUF</code>. Due to the underlying pages being marked
    clean, the &#x2018;<code class="Li">B_DELWRI|B_RELBUF</code>&#x2019;
    combination must be interpreted to mean that the buffer is still actually
    dirty and must be written to its backing store before it can actually be
    released. In the case where <code class="Dv">B_DELWRI</code> is not set, the
    underlying dirty pages are still properly marked as dirty and the buffer can
    be completely freed without losing that clean/dirty state information. (XXX:
    do we have to check other flags with regard to this situation?)</p>
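<p class="Pp">The flag interpretation described above can be written as two
    small predicates. This is a sketch only: the
    <code class="Li">SK_B_DELWRI</code> and <code class="Li">SK_B_RELBUF</code>
    bit values are placeholders invented for the example rather than the
    kernel's definitions, the function names are hypothetical, and no other
    flags are considered.</p>

```c
#include <stdbool.h>

/* Placeholder bit values for this sketch, not the kernel's definitions. */
#define SK_B_DELWRI	0x1u
#define SK_B_RELBUF	0x2u

/*
 * True when a buffer being released must first be written to its backing
 * store: with B_DELWRI set the underlying pages were already marked clean,
 * so the buffer is the only record that the data is still dirty.
 */
static bool
sk_must_write_before_release(unsigned flags)
{
	return ((flags & SK_B_RELBUF) != 0 && (flags & SK_B_DELWRI) != 0);
}

/*
 * True when the buffer can be freed outright: B_DELWRI is clear, so any
 * dirty state is still recorded on the pages themselves.
 */
static bool
sk_can_free_directly(unsigned flags)
{
	return ((flags & SK_B_RELBUF) != 0 && (flags & SK_B_DELWRI) == 0);
}
```

<p class="Pp">With these definitions,
    <code class="Li">SK_B_DELWRI | SK_B_RELBUF</code> satisfies only the first
    predicate, while <code class="Li">SK_B_RELBUF</code> alone satisfies only
    the second.</p>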
<p class="Pp">The kernel reserves a portion of its KVM space to hold the data
    maps of VM buffers. Even though this is virtual space (since the buffers are
    mapped from the buffer cache), we cannot make it arbitrarily large because
    instantiated VM buffers (<var class="Vt">struct buf</var>s) prevent their
    underlying pages in the buffer cache from being freed. This can complicate
    the life of the paging system.</p>
</section>
<section class="Sh">
<h1 class="Sh" id="HISTORY"><a class="permalink" href="#HISTORY">HISTORY</a></h1>
<p class="Pp">The <code class="Nm">buf</code> manual page was originally written
    by <span class="An">Matthew Dillon</span> and first appeared in
    <span class="Ux">FreeBSD 3.1</span>, December 1998.</p>
</section>
</div>
<table class="foot">
  <tr>
    <td class="foot-date">December 22, 1998</td>
    <td class="foot-os">FreeBSD 15.0</td>
  </tr>
</table>