<table class="head">
<tr>
<td class="head-ltitle">BUF(9)</td>
<td class="head-vol">Kernel Developer's Manual</td>
<td class="head-rtitle">BUF(9)</td>
</tr>
</table>
<div class="manual-text">
<section class="Sh">
<h1 class="Sh" id="NAME"><a class="permalink" href="#NAME">NAME</a></h1>
<p class="Pp"><code class="Nm">buf</code> — <span class="Nd">kernel
buffer I/O scheme used in FreeBSD VM system</span></p>
</section>
<section class="Sh">
<h1 class="Sh" id="DESCRIPTION"><a class="permalink" href="#DESCRIPTION">DESCRIPTION</a></h1>
<p class="Pp">The kernel implements a KVM abstraction of the buffer cache that
allows it to map potentially disparate vm_page structures into contiguous KVM for use
by (mainly file system) devices and device I/O. This abstraction supports
block sizes from <code class="Dv">DEV_BSIZE</code> (usually 512) up to
several pages or more. It also supports a relatively primitive
byte-granular valid range and dirty range currently hardcoded for use by
NFS. The code implementing the VM buffer abstraction is mostly concentrated
in <span class="Pa">sys/kern/vfs_bio.c</span> in the
<span class="Ux">FreeBSD</span> source tree.</p>
<p class="Pp" id="page">One of the most important things to remember when
dealing with buffer pointers (<var class="Vt">struct buf</var>) is that the
underlying pages are mapped directly from the buffer cache. No data copying
occurs in the scheme proper, though some file systems such as UFS do have to
copy a little when dealing with file fragments. The second most important
thing to remember is that due to the underlying page mapping, the
<var class="Va">b_data</var> base pointer in a buf is always
<a class="permalink" href="#page"><i class="Em">page</i></a>-aligned, not
<a class="permalink" href="#block"><i class="Em" id="block">block</i></a>-aligned.
When you have a VM buffer representing some <var class="Va">b_offset</var>
and <var class="Va">b_size</var>, the actual start of the buffer is
‘<code class="Li">b_data + (b_offset & PAGE_MASK)</code>’
and not just ‘<code class="Li">b_data</code>’. Finally, the VM
system's core buffer cache supports valid and dirty bits
(<var class="Va">m->valid</var>, <var class="Va">m->dirty</var>) for
pages in <code class="Dv">DEV_BSIZE</code> chunks. Thus a platform with a
hardware page size of 4096 bytes has 8 valid and 8 dirty bits. These bits
are generally set and cleared in groups based on the device block size of
the device backing the page. A complete page's worth of bits is often referred to
using the <code class="Dv">VM_PAGE_BITS_ALL</code> bitmask (i.e., 0xFF if
the hardware page size is 4096).</p>
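<p class="Pp">As a rough illustration of the arithmetic above, the following
userland C sketch computes the offset of a buffer's first byte within its
page-aligned mapping and the per-page valid/dirty bitmask. It is not kernel
code: every <code class="Li">ex_</code> and <code class="Li">EX_</code> name
is hypothetical, and the constants assume a 4096-byte hardware page and a
512-byte <code class="Dv">DEV_BSIZE</code>.</p>

```c
/* Illustrative constants only (assumed: 4096-byte pages, 512-byte DEV_BSIZE). */
#define EX_PAGE_SIZE 4096UL
#define EX_PAGE_MASK (EX_PAGE_SIZE - 1)
#define EX_DEV_BSIZE 512UL

/* Offset of the buffer's first byte relative to the page-aligned b_data. */
unsigned long
ex_buf_start_offset(unsigned long b_offset)
{
	return (b_offset & EX_PAGE_MASK);
}

/*
 * Valid/dirty bits covering bytes [off, off + len) of one page, rounded
 * outward to whole DEV_BSIZE chunks. The caller must keep the range inside
 * the page, i.e. off + len may not exceed EX_PAGE_SIZE.
 */
unsigned int
ex_page_bits(unsigned long off, unsigned long len)
{
	unsigned int first_bit, last_bit;

	if (len == 0)
		return (0);
	first_bit = (unsigned int)(off / EX_DEV_BSIZE);
	last_bit = (unsigned int)((off + len - 1) / EX_DEV_BSIZE);
	/* Set every bit from first_bit through last_bit inclusive. */
	return ((2u << last_bit) - (1u << first_bit));
}
```

<p class="Pp">Under these assumptions,
<code class="Li">ex_page_bits(0, 4096)</code> yields 0xFF, matching the
full-page mask described above, and a buffer whose
<var class="Va">b_offset</var> is 0x2600 starts 0x600 bytes past
<var class="Va">b_data</var>.</p>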
<p class="Pp">VM buffers also keep track of a byte-granular dirty range and
valid range. This feature is normally only used by the NFS subsystem. I am
not sure why it is used at all, actually, since we have
<code class="Dv">DEV_BSIZE</code> valid/dirty granularity within the VM
buffer. If a buffer dirty operation creates a “hole”, the
dirty range will extend to cover the hole. If a buffer validation operation
creates a “hole”, the byte-granular valid range is left alone
and will not take into account the new extension. Thus the whole
byte-granular abstraction is considered a bad hack and it would be nice if
we could get rid of it completely.</p>
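<p class="Pp">The hole-covering behavior of the dirty range can be sketched in
userland C as follows. This is only a model of the idea, not the kernel's
implementation: <code class="Li">struct ex_range</code> and
<code class="Li">ex_mark_dirty()</code> are hypothetical names, and the
convention that an empty range has its end at or before its start is an
assumption made for the example.</p>

```c
/* Hypothetical byte range [off, end); treated as empty when end is not past off. */
struct ex_range {
	int off;
	int end;
};

/*
 * Record that bytes [off, off + len) were dirtied. A single range cannot
 * describe two disjoint pieces, so it grows to the smallest span containing
 * both the old range and the new one, including any hole between them.
 */
void
ex_mark_dirty(struct ex_range *r, int off, int len)
{
	int end = off + len;

	if (len <= 0)
		return;
	if (r->end <= r->off) {
		/* Range was empty: adopt the new span as-is. */
		r->off = off;
		r->end = end;
		return;
	}
	if (off < r->off)
		r->off = off;
	if (end > r->end)
		r->end = end;
}
```

<p class="Pp">In this model, dirtying bytes 0 through 99 and then bytes 300
through 399 leaves a single range covering 0 through 399, so bytes 100
through 299 are treated as dirty even though nothing wrote to them.</p>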
<p class="Pp">A VM buffer is capable of mapping the underlying VM cache pages
into KVM in order to allow the kernel to directly manipulate the data
associated with the (<var class="Va">vnode</var>,
<var class="Va">b_offset</var>, <var class="Va">b_size</var>). The kernel
typically unmaps VM buffers the moment they are no longer needed but often
keeps the <var class="Vt">struct buf</var> structure, and even the
<var class="Va">bp->b_pages</var> array, instantiated despite having
unmapped them from KVM. If a page making up a VM buffer is about to undergo
I/O, the system typically unmaps it from KVM and replaces the page in the
<var class="Va">b_pages[]</var> array with a place-marker called bogus_page.
The place-marker forces any kernel subsystems referencing the associated
<var class="Vt">struct buf</var> to re-lookup the associated page. I believe
the place-marker hack is used to allow sophisticated devices such as file
system devices to remap underlying pages in order to deal with, for example,
re-mapping a file fragment into a file block.</p>
<p class="Pp">VM buffers are used to track I/O operations within the kernel.
Unfortunately, the I/O implementation is also somewhat of a hack because the
kernel wants to clear the dirty bit on the underlying pages the moment it
queues the I/O to the VFS device, not when the physical I/O is actually
initiated. This can create confusion within file system devices that use
delayed-writes because you wind up with pages marked clean that are actually
still dirty. If not treated carefully, these pages could be thrown away!
Indeed, a number of serious bugs related to this hack were not fixed until
the <span class="Ux">FreeBSD 2.2.8</span> / <span class="Ux">FreeBSD
3.0</span> release. The kernel uses an instantiated VM buffer (i.e.,
<var class="Vt">struct buf</var>) to place-mark pages in this special state.
The buffer is typically flagged <code class="Dv">B_DELWRI</code>. When a
device no longer needs a buffer it typically flags it as
<code class="Dv">B_RELBUF</code>. Due to the underlying pages being marked
clean, the ‘<code class="Li">B_DELWRI|B_RELBUF</code>’
combination must be interpreted to mean that the buffer is still actually
dirty and must be written to its backing store before it can actually be
released. In the case where <code class="Dv">B_DELWRI</code> is not set, the
underlying dirty pages are still properly marked as dirty and the buffer can
be completely freed without losing that clean/dirty state information. (XXX
do we have to check other flags with regard to this situation?)</p>
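<p class="Pp">The flag interpretation described above can be summarized by a
small decision function. The sketch below is a userland illustration under
stated assumptions: the <code class="Li">EX_B_*</code> bit values are
placeholders rather than the kernel's definitions, the enum and function
names are hypothetical, and the case where
<code class="Dv">B_RELBUF</code> is not set is assumed to mean that no
release was requested.</p>

```c
/* Placeholder bit values for illustration; not the kernel's definitions. */
#define EX_B_DELWRI 0x1u
#define EX_B_RELBUF 0x2u

enum ex_release_action {
	EX_KEEP,		/* no release requested (assumed) */
	EX_WRITE_THEN_RELEASE,	/* still dirty: write to backing store first */
	EX_FREE_NOW		/* page dirty bits are accurate: safe to free */
};

/* Decide what releasing a buffer with these flags should involve. */
enum ex_release_action
ex_release_policy(unsigned int b_flags)
{
	if ((b_flags & EX_B_RELBUF) == 0)
		return (EX_KEEP);
	/*
	 * With B_DELWRI set, the underlying pages were already marked clean,
	 * so the buffer itself is the only record that the data is dirty.
	 */
	if ((b_flags & EX_B_DELWRI) != 0)
		return (EX_WRITE_THEN_RELEASE);
	return (EX_FREE_NOW);
}
```

<p class="Pp">The point of the sketch is the middle branch: freeing a
<code class="Li">B_DELWRI|B_RELBUF</code> buffer without writing it first
would discard the only indication that its pages still hold unwritten
data.</p>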
<p class="Pp">The kernel reserves a portion of its KVM space to hold VM buffers'
data maps. Even though this is virtual space (since the buffers are mapped
from the buffer cache), we cannot make it arbitrarily large because
instantiated VM buffers (<var class="Vt">struct buf</var>s) prevent their
underlying pages in the buffer cache from being freed. This can complicate
the life of the paging system.</p>
</section>
<section class="Sh">
<h1 class="Sh" id="HISTORY"><a class="permalink" href="#HISTORY">HISTORY</a></h1>
<p class="Pp">The <code class="Nm">buf</code> manual page was originally written
by <span class="An">Matthew Dillon</span> and first appeared in
<span class="Ux">FreeBSD 3.1</span>, December 1998.</p>
</section>
</div>
<table class="foot">
<tr>
<td class="foot-date">December 22, 1998</td>
<td class="foot-os">FreeBSD 15.0</td>
</tr>
</table>