diff options
Diffstat (limited to 'static/freebsd/man4/netmap.4')
| -rw-r--r-- | static/freebsd/man4/netmap.4 | 1200 |
1 files changed, 1200 insertions, 0 deletions
diff --git a/static/freebsd/man4/netmap.4 b/static/freebsd/man4/netmap.4 new file mode 100644 index 00000000..6be0c866 --- /dev/null +++ b/static/freebsd/man4/netmap.4 @@ -0,0 +1,1200 @@ +.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" This document is derived in part from the enet man page (enet.4) +.\" distributed with 4.3BSD Unix. +.\" +.Dd October 10, 2024 +.Dt NETMAP 4 +.Os +.Sh NAME +.Nm netmap +.Nd a framework for fast packet I/O +.Sh SYNOPSIS +.Cd device netmap +.Sh DESCRIPTION +.Nm +is a framework for extremely fast and efficient packet I/O +for userspace and kernel clients, and for Virtual Machines. +It runs on +.Fx , +Linux and some versions of Windows, and supports a variety of +.Nm netmap ports , +including +.Bl -tag -width XXXX +.It Nm physical NIC ports +to access individual queues of network interfaces; +.It Nm host ports +to inject packets into the host stack; +.It Nm VALE ports +implementing a very fast and modular in-kernel software switch/dataplane; +.It Nm netmap pipes +a shared memory packet transport channel; +.It Nm netmap monitors +a mechanism similar to +.Xr bpf 4 +to capture traffic +.El +.Pp +All these +.Nm netmap ports +are accessed interchangeably with the same API, +and are at least one order of magnitude faster than +standard OS mechanisms +(sockets, bpf, tun/tap interfaces, native switches, pipes). +With suitably fast hardware (NICs, PCIe buses, CPUs), +packet I/O using +.Nm +on supported NICs +reaches 14.88 million packets per second (Mpps) +with much less than one core on 10 Gbit/s NICs; +35-40 Mpps on 40 Gbit/s NICs (limited by the hardware); +about 20 Mpps per core for VALE ports; +and over 100 Mpps for +.Nm netmap pipes . +NICs without native +.Nm +support can still use the API in emulated mode, +which uses unmodified device drivers and is 3-5 times faster than +.Xr bpf 4 +or raw sockets. +.Pp +Userspace clients can dynamically switch NICs into +.Nm +mode and send and receive raw packets through +memory mapped buffers. +Similarly, +.Nm VALE +switch instances and ports, +.Nm netmap pipes +and +.Nm netmap monitors +can be created dynamically, +providing high speed packet I/O between processes, +virtual machines, NICs and the host stack. +.Pp +.Nm +supports both non-blocking I/O through +.Xr ioctl 2 , +synchronization and blocking I/O through a file descriptor +and standard OS mechanisms such as +.Xr select 2 , +.Xr poll 2 , +.Xr kqueue 2 +and +.Xr epoll 7 . +All types of +.Nm netmap ports +and the +.Nm VALE switch +are implemented by a single kernel module, which also emulates the +.Nm +API over standard drivers. +For best performance, +.Nm +requires native support in device drivers. +A list of such devices is at the end of this document. +.Pp +In the rest of this (long) manual page we document +various aspects of the +.Nm +and +.Nm VALE +architecture, features and usage. +.Sh ARCHITECTURE +.Nm +supports raw packet I/O through a +.Em port , +which can be connected to a physical interface +.Em ( NIC ) , +to the host stack, +or to a +.Nm VALE +switch. +Ports use preallocated circular queues of buffers +.Em ( rings ) +residing in an mmapped region. +There is one ring for each transmit/receive queue of a +NIC or virtual port. +An additional ring pair connects to the host stack. +.Pp +After binding a file descriptor to a port, a +.Nm +client can send or receive packets in batches through +the rings, and possibly implement zero-copy forwarding +between ports. +.Pp +All NICs operating in +.Nm +mode use the same memory region, +accessible to all processes who own +.Pa /dev/netmap +file descriptors bound to NICs. +Independent +.Nm VALE +and +.Nm netmap pipe +ports +by default use separate memory regions, +but can be independently configured to share memory. +.Sh ENTERING AND EXITING NETMAP MODE +The following section describes the system calls to create +and control +.Nm netmap +ports (including +.Nm VALE +and +.Nm netmap pipe +ports). +Simpler, higher level functions are described in the +.Sx LIBRARIES +section. +.Pp +Ports and rings are created and controlled through a file descriptor, +created by opening a special device +.Dl fd = open("/dev/netmap"); +and then bound to a specific port with an +.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg); +.Pp +.Nm +has multiple modes of operation controlled by the +.Vt struct nmreq +argument. +.Va arg.nr_name +specifies the netmap port name, as follows: +.Bl -tag -width XXXX +.It Dv OS network interface name (e.g., 'em0', 'eth1', ... ) +the data path of the NIC is disconnected from the host stack, +and the file descriptor is bound to the NIC (one or all queues), +or to the host stack; +.It Dv valeSSS:PPP +the file descriptor is bound to port PPP of VALE switch SSS. +Switch instances and ports are dynamically created if necessary. +.Pp +Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string +cannot exceed IFNAMSIZ characters, and PPP cannot +be the name of any existing OS network interface. +.El +.Pp +On return, +.Va arg +indicates the size of the shared memory region, +and the number, size and location of all the +.Nm +data structures, which can be accessed by mmapping the memory +.Dl char *mem = mmap(0, arg.nr_memsize, fd); +.Pp +Non-blocking I/O is done with special +.Xr ioctl 2 +.Xr select 2 +and +.Xr poll 2 +on the file descriptor permit blocking I/O. +.Pp +While a NIC is in +.Nm +mode, the OS will still believe the interface is up and running. +OS-generated packets for that NIC end up into a +.Nm +ring, and another ring is used to send packets into the OS network stack. +A +.Xr close 2 +on the file descriptor removes the binding, +and returns the NIC to normal mode (reconnecting the data path +to the host stack), or destroys the virtual port. +.Sh DATA STRUCTURES +The data structures in the mmapped memory region are detailed in +.In sys/net/netmap.h , +which is the ultimate reference for the +.Nm +API. +The main structures and fields are indicated below: +.Bl -tag -width XXX +.It Dv struct netmap_if (one per interface ) +.Bd -literal +struct netmap_if { + ... + const uint32_t ni_flags; /* properties */ + ... + const uint32_t ni_tx_rings; /* NIC tx rings */ + const uint32_t ni_rx_rings; /* NIC rx rings */ + uint32_t ni_bufs_head; /* head of extra bufs list */ + ... +}; +.Ed +.Pp +Indicates the number of available rings +.Pa ( struct netmap_rings ) +and their position in the mmapped region. +The number of tx and rx rings +.Pa ( ni_tx_rings , ni_rx_rings ) +normally depends on the hardware. +NICs also have an extra tx/rx ring pair connected to the host stack. +.Em NIOCREGIF +can also request additional unbound buffers in the same memory space, +to be used as temporary storage for packets. +The number of extra +buffers is specified in the +.Va arg.nr_arg3 +field. +On success, the kernel writes back to +.Va arg.nr_arg3 +the number of extra buffers actually allocated (they may be less +than the amount requested if the memory space ran out of buffers). +.Pa ni_bufs_head +contains the index of the first of these extra buffers, +which are connected in a list (the first uint32_t of each +buffer being the index of the next buffer in the list). +A +.Dv 0 +indicates the end of the list. +The application is free to modify +this list and use the buffers (i.e., binding them to the slots of a +netmap ring). +When closing the netmap file descriptor, +the kernel frees the buffers contained in the list pointed by +.Pa ni_bufs_head +, irrespectively of the buffers originally provided by the kernel on +.Em NIOCREGIF . +.It Dv struct netmap_ring (one per ring ) +.Bd -literal +struct netmap_ring { + ... + const uint32_t num_slots; /* slots in each ring */ + const uint32_t nr_buf_size; /* size of each buffer */ + ... + uint32_t head; /* (u) first buf owned by user */ + uint32_t cur; /* (u) wakeup position */ + const uint32_t tail; /* (k) first buf owned by kernel */ + ... + uint32_t flags; + struct timeval ts; /* (k) time of last rxsync() */ + ... + struct netmap_slot slot[0]; /* array of slots */ +} +.Ed +.Pp +Implements transmit and receive rings, with read/write +pointers, metadata and an array of +.Em slots +describing the buffers. +.It Dv struct netmap_slot (one per buffer ) +.Bd -literal +struct netmap_slot { + uint32_t buf_idx; /* buffer index */ + uint16_t len; /* packet length */ + uint16_t flags; /* buf changed, etc. */ + uint64_t ptr; /* address for indirect buffers */ +}; +.Ed +.Pp +Describes a packet buffer, which normally is identified by +an index and resides in the mmapped region. +.It Dv packet buffers +Fixed size (normally 2 KB) packet buffers allocated by the kernel. +.El +.Pp +The offset of the +.Pa struct netmap_if +in the mmapped region is indicated by the +.Pa nr_offset +field in the structure returned by +.Dv NIOCREGIF . +From there, all other objects are reachable through +relative references (offsets or indexes). +Macros and functions in +.In net/netmap_user.h +help converting them into actual pointers: +.Pp +.Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset); +.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index); +.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index); +.Pp +.Dl char *buf = NETMAP_BUF(ring, buffer_index); +.Sh RINGS, BUFFERS AND DATA I/O +.Va Rings +are circular queues of packets with three indexes/pointers +.Va ( head , cur , tail ) ; +one slot is always kept empty. +The ring size +.Va ( num_slots ) +should not be assumed to be a power of two. +.Pp +.Va head +is the first slot available to userspace; +.Pp +.Va cur +is the wakeup point: +select/poll will unblock when +.Va tail +passes +.Va cur ; +.Pp +.Va tail +is the first slot reserved to the kernel. +.Pp +Slot indexes +.Em must +only move forward; +for convenience, the function +.Dl nm_ring_next(ring, index) +returns the next index modulo the ring size. +.Pp +.Va head +and +.Va cur +are only modified by the user program; +.Va tail +is only modified by the kernel. +The kernel only reads/writes the +.Vt struct netmap_ring +slots and buffers +during the execution of a netmap-related system call. +The only exception are slots (and buffers) in the range +.Va tail\ . . . head-1 , +that are explicitly assigned to the kernel. +.Ss TRANSMIT RINGS +On transmit rings, after a +.Nm +system call, slots in the range +.Va head\ . . . tail-1 +are available for transmission. +User code should fill the slots sequentially +and advance +.Va head +and +.Va cur +past slots ready to transmit. +.Va cur +may be moved further ahead if the user code needs +more slots before further transmissions (see +.Sx SCATTER GATHER I/O ) . +.Pp +At the next NIOCTXSYNC/select()/poll(), +slots up to +.Va head-1 +are pushed to the port, and +.Va tail +may advance if further slots have become available. +Below is an example of the evolution of a TX ring: +.Bd -literal + after the syscall, slots between cur and tail are (a)vailable + head=cur tail + | | + v v + TX [.....aaaaaaaaaaa.............] + + user creates new packets to (T)ransmit + head=cur tail + | | + v v + TX [.....TTTTTaaaaaa.............] + + NIOCTXSYNC/poll()/select() sends packets and reports new slots + head=cur tail + | | + v v + TX [..........aaaaaaaaaaa........] +.Ed +.Pp +.Fn select +and +.Fn poll +will block if there is no space in the ring, i.e., +.Dl ring->cur == ring->tail +and return when new slots have become available. +.Pp +High speed applications may want to amortize the cost of system calls +by preparing as many packets as possible before issuing them. +.Pp +A transmit ring with pending transmissions has +.Dl ring->head != ring->tail + 1 (modulo the ring size). +The function +.Va int nm_tx_pending(ring) +implements this test. +.Ss RECEIVE RINGS +On receive rings, after a +.Nm +system call, the slots in the range +.Va head\& . . . tail-1 +contain received packets. +User code should process them and advance +.Va head +and +.Va cur +past slots it wants to return to the kernel. +.Va cur +may be moved further ahead if the user code wants to +wait for more packets +without returning all the previous slots to the kernel. +.Pp +At the next NIOCRXSYNC/select()/poll(), +slots up to +.Va head-1 +are returned to the kernel for further receives, and +.Va tail +may advance to report new incoming packets. +.Pp +Below is an example of the evolution of an RX ring: +.Bd -literal + after the syscall, there are some (h)eld and some (R)eceived slots + head cur tail + | | | + v v v + RX [..hhhhhhRRRRRRRR..........] + + user advances head and cur, releasing some slots and holding others + head cur tail + | | | + v v v + RX [..*****hhhRRRRRR...........] + + NICRXSYNC/poll()/select() recovers slots and reports new packets + head cur tail + | | | + v v v + RX [.......hhhRRRRRRRRRRRR....] +.Ed +.Sh SLOTS AND PACKET BUFFERS +Normally, packets should be stored in the netmap-allocated buffers +assigned to slots when ports are bound to a file descriptor. +One packet is fully contained in a single buffer. +.Pp +The following flags affect slot and buffer processing: +.Bl -tag -width XXX +.It NS_BUF_CHANGED +.Em must +be used when the +.Va buf_idx +in the slot is changed. +This can be used to implement +zero-copy forwarding, see +.Sx ZERO-COPY FORWARDING . +.It NS_REPORT +reports when this buffer has been transmitted. +Normally, +.Nm +notifies transmit completions in batches, hence signals +can be delayed indefinitely. +This flag helps detect +when packets have been sent and a file descriptor can be closed. +.It NS_FORWARD +When a ring is in 'transparent' mode, +packets marked with this flag by the user application are forwarded to the +other endpoint at the next system call, thus restoring (in a selective way) +the connection between a NIC and the host stack. +.It NS_NO_LEARN +tells the forwarding code that the source MAC address for this +packet must not be used in the learning bridge code. +.It NS_INDIRECT +indicates that the packet's payload is in a user-supplied buffer +whose user virtual address is in the 'ptr' field of the slot. +The size can reach 65535 bytes. +.Pp +This is only supported on the transmit ring of +.Nm VALE +ports, and it helps reducing data copies in the interconnection +of virtual machines. +.It NS_MOREFRAG +indicates that the packet continues with subsequent buffers; +the last buffer in a packet must have the flag clear. +.El +.Sh SCATTER GATHER I/O +Packets can span multiple slots if the +.Va NS_MOREFRAG +flag is set in all but the last slot. +The maximum length of a chain is 64 buffers. +This is normally used with +.Nm VALE +ports when connecting virtual machines, as they generate large +TSO segments that are not split unless they reach a physical device. +.Pp +NOTE: The length field always refers to the individual +fragment; there is no place with the total length of a packet. +.Pp +On receive rings the macro +.Va NS_RFRAGS(slot) +indicates the remaining number of slots for this packet, +including the current one. +Slots with a value greater than 1 also have NS_MOREFRAG set. +.Sh IOCTLS +.Nm +uses two ioctls (NIOCTXSYNC, NIOCRXSYNC) +for non-blocking I/O. +They take no argument. +Two more ioctls (NIOCGINFO, NIOCREGIF) are used +to query and configure ports, with the following argument: +.Bd -literal +struct nmreq { + char nr_name[IFNAMSIZ]; /* (i) port name */ + uint32_t nr_version; /* (i) API version */ + uint32_t nr_offset; /* (o) nifp offset in mmap region */ + uint32_t nr_memsize; /* (o) size of the mmap region */ + uint32_t nr_tx_slots; /* (i/o) slots in tx rings */ + uint32_t nr_rx_slots; /* (i/o) slots in rx rings */ + uint16_t nr_tx_rings; /* (i/o) number of tx rings */ + uint16_t nr_rx_rings; /* (i/o) number of rx rings */ + uint16_t nr_ringid; /* (i/o) ring(s) we care about */ + uint16_t nr_cmd; /* (i) special command */ + uint16_t nr_arg1; /* (i/o) extra arguments */ + uint16_t nr_arg2; /* (i/o) extra arguments */ + uint32_t nr_arg3; /* (i/o) extra arguments */ + uint32_t nr_flags /* (i/o) open mode */ + ... +}; +.Ed +.Pp +A file descriptor obtained through +.Pa /dev/netmap +also supports the ioctl supported by network devices, see +.Xr netintro 4 . +.Bl -tag -width XXXX +.It Dv NIOCGINFO +returns EINVAL if the named port does not support netmap. +Otherwise, it returns 0 and (advisory) information +about the port. +Note that all the information below can change before the +interface is actually put in netmap mode. +.Bl -tag -width XX +.It Pa nr_memsize +indicates the size of the +.Nm +memory region. +NICs in +.Nm +mode all share the same memory region, +whereas +.Nm VALE +ports have independent regions for each port. +.It Pa nr_tx_slots , nr_rx_slots +indicate the size of transmit and receive rings. +.It Pa nr_tx_rings , nr_rx_rings +indicate the number of transmit +and receive rings. +Both ring number and sizes may be configured at runtime +using interface-specific functions (e.g., +.Xr ethtool 8 +). +.El +.It Dv NIOCREGIF +binds the port named in +.Va nr_name +to the file descriptor. +For a physical device this also switches it into +.Nm +mode, disconnecting +it from the host stack. +Multiple file descriptors can be bound to the same port, +with proper synchronization left to the user. +.Pp +The recommended way to bind a file descriptor to a port is +to use function +.Va nm_open(..) +(see +.Sx LIBRARIES ) +which parses names to access specific port types and +enable features. +In the following we document the main features. +.Pp +.Dv NIOCREGIF can also bind a file descriptor to one endpoint of a +.Em netmap pipe , +consisting of two netmap ports with a crossover connection. +A netmap pipe share the same memory space of the parent port, +and is meant to enable configuration where a master process acts +as a dispatcher towards slave processes. +.Pp +To enable this function, the +.Pa nr_arg1 +field of the structure can be used as a hint to the kernel to +indicate how many pipes we expect to use, and reserve extra space +in the memory region. +.Pp +On return, it gives the same info as NIOCGINFO, +with +.Pa nr_ringid +and +.Pa nr_flags +indicating the identity of the rings controlled through the file +descriptor. +.Pp +.Va nr_flags +.Va nr_ringid +selects which rings are controlled through this file descriptor. +Possible values of +.Pa nr_flags +are indicated below, together with the naming schemes +that application libraries (such as the +.Nm nm_open +indicated below) can use to indicate the specific set of rings. +In the example below, "netmap:foo" is any valid netmap port name. +.Bl -tag -width XXXXX +.It NR_REG_ALL_NIC "netmap:foo" +(default) all hardware ring pairs +.It NR_REG_SW "netmap:foo^" +the ``host rings'', connecting to the host stack. +.It NR_REG_NIC_SW "netmap:foo*" +all hardware rings and the host rings +.It NR_REG_ONE_NIC "netmap:foo-i" +only the i-th hardware ring pair, where the number is in +.Pa nr_ringid ; +.It NR_REG_PIPE_MASTER "netmap:foo{i" +the master side of the netmap pipe whose identifier (i) is in +.Pa nr_ringid ; +.It NR_REG_PIPE_SLAVE "netmap:foo}i" +the slave side of the netmap pipe whose identifier (i) is in +.Pa nr_ringid . +.Pp +The identifier of a pipe must be thought as part of the pipe name, +and does not need to be sequential. +On return the pipe +will only have a single ring pair with index 0, +irrespective of the value of +.Va i . +.El +.Pp +By default, a +.Xr poll 2 +or +.Xr select 2 +call pushes out any pending packets on the transmit ring, even if +no write events are specified. +The feature can be disabled by or-ing +.Va NETMAP_NO_TX_POLL +to the value written to +.Va nr_ringid . +When this feature is used, +packets are transmitted only on +.Va ioctl(NIOCTXSYNC) +or +.Va select() / +.Va poll() +are called with a write event (POLLOUT/wfdset) or a full ring. +.Pp +When registering a virtual interface that is dynamically created to a +.Nm VALE +switch, we can specify the desired number of rings (1 by default, +and currently up to 16) on it using nr_tx_rings and nr_rx_rings fields. +.It Dv NIOCTXSYNC +tells the hardware of new packets to transmit, and updates the +number of slots available for transmission. +.It Dv NIOCRXSYNC +tells the hardware of consumed packets, and asks for newly available +packets. +.El +.Sh SELECT, POLL, EPOLL, KQUEUE +.Xr select 2 +and +.Xr poll 2 +on a +.Nm +file descriptor process rings as indicated in +.Sx TRANSMIT RINGS +and +.Sx RECEIVE RINGS , +respectively when write (POLLOUT) and read (POLLIN) events are requested. +Both block if no slots are available in the ring +.Va ( ring->cur == ring->tail ) . +Depending on the platform, +.Xr epoll 7 +and +.Xr kqueue 2 +are supported too. +.Pp +Packets in transmit rings are normally pushed out +(and buffers reclaimed) even without +requesting write events. +Passing the +.Dv NETMAP_NO_TX_POLL +flag to +.Em NIOCREGIF +disables this feature. +By default, receive rings are processed only if read +events are requested. +Passing the +.Dv NETMAP_DO_RX_POLL +flag to +.Em NIOCREGIF updates receive rings even without read events. +Note that on +.Xr epoll 7 +and +.Xr kqueue 2 , +.Dv NETMAP_NO_TX_POLL +and +.Dv NETMAP_DO_RX_POLL +only have an effect when some event is posted for the file descriptor. +.Sh LIBRARIES +The +.Nm +API is supposed to be used directly, both because of its simplicity and +for efficient integration with applications. +.Pp +For convenience, the +.In net/netmap_user.h +header provides a few macros and functions to ease creating +a file descriptor and doing I/O with a +.Nm +port. +These are loosely modeled after the +.Xr pcap 3 +API, to ease porting of libpcap-based applications to +.Nm . +To use these extra functions, programs should +.Dl #define NETMAP_WITH_LIBS +before +.Dl #include <net/netmap_user.h> +.Pp +The following functions are available: +.Bl -tag -width XXXXX +.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg ) +similar to +.Xr pcap_open_live 3 , +binds a file descriptor to a port. +.Bl -tag -width XX +.It Va ifname +is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a +.Nm VALE +port. +.It Va req +provides the initial values for the argument to the NIOCREGIF ioctl. +The nm_flags and nm_ringid values are overwritten by parsing +ifname and flags, and other fields can be overridden through +the other two arguments. +.It Va arg +points to a struct nm_desc containing arguments (e.g., from a previously +open file descriptor) that should override the defaults. +The fields are used as described below +.It Va flags +can be set to a combination of the following flags: +.Va NETMAP_NO_TX_POLL , +.Va NETMAP_DO_RX_POLL +(copied into nr_ringid); +.Va NM_OPEN_NO_MMAP +(if arg points to the same memory region, +avoids the mmap and uses the values from it); +.Va NM_OPEN_IFNAME +(ignores ifname and uses the values in arg); +.Va NM_OPEN_ARG1 , +.Va NM_OPEN_ARG2 , +.Va NM_OPEN_ARG3 +(uses the fields from arg); +.Va NM_OPEN_RING_CFG +(uses the ring number and sizes from arg). +.El +.It Va int nm_close(struct nm_desc *d ) +closes the file descriptor, unmaps memory, frees resources. +.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size ) +similar to +.Va pcap_inject() , +pushes a packet to a ring, returns the size +of the packet is successful, or 0 on error; +.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg ) +similar to +.Va pcap_dispatch() , +applies a callback to incoming packets +.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr ) +similar to +.Va pcap_next() , +fetches the next packet +.El +.Sh SUPPORTED DEVICES +.Nm +natively supports the following devices: +.Pp +On +.Fx : +.Xr cxgbe 4 , +.Xr em 4 , +.Xr iflib 4 +.Pq providing Xr igb 4 and Xr em 4 , +.Xr ix 4 , +.Xr ixl 4 , +.Xr re 4 , +.Xr vtnet 4 . +.Pp +On Linux e1000, e1000e, i40e, igb, ixgbe, ixgbevf, r8169, virtio_net, vmxnet3. +.Pp +NICs without native support can still be used in +.Nm +mode through emulation. +Performance is inferior to native netmap +mode but still significantly higher than various raw socket types +(bpf, PF_PACKET, etc.). +Note that for slow devices (such as 1 Gbit/s and slower NICs, +or several 10 Gbit/s NICs whose hardware is unable to sustain line rate), +emulated and native mode will likely have similar or same throughput. +.Pp +When emulation is in use, packet sniffer programs such as tcpdump +could see received packets before they are diverted by netmap. +This behaviour is not intentional, being just an artifact of the implementation +of emulation. +Note that in case the netmap application subsequently moves packets received +from the emulated adapter onto the host RX ring, the sniffer will intercept +those packets again, since the packets are injected to the host stack as they +were received by the network interface. +.Pp +Emulation is also available for devices with native netmap support, +which can be used for testing or performance comparison. +The sysctl variable +.Va dev.netmap.admode +globally controls how netmap mode is implemented. +.Sh SYSCTL VARIABLES AND MODULE PARAMETERS +Some aspects of the operation of +.Nm +and +.Nm VALE +are controlled through sysctl variables on +.Fx +.Em ( dev.netmap.* ) +and module parameters on Linux +.Em ( /sys/module/netmap/parameters/* ) : +.Bl -tag -width indent +.It Va dev.netmap.admode: 0 +Controls the use of native or emulated adapter mode. +.Pp +0 uses the best available option; +.Pp +1 forces native mode and fails if not available; +.Pp +2 forces emulated hence never fails. +.It Va dev.netmap.generic_rings: 1 +Number of rings used for emulated netmap mode +.It Va dev.netmap.generic_ringsize: 1024 +Ring size used for emulated netmap mode +.It Va dev.netmap.generic_mit: 100000 +Controls interrupt moderation for emulated mode +.It Va dev.netmap.fwd: 0 +Forces NS_FORWARD mode +.It Va dev.netmap.txsync_retry: 2 +Number of txsync loops in the +.Nm VALE +flush function +.It Va dev.netmap.no_pendintr: 1 +Forces recovery of transmit buffers on system calls +.It Va dev.netmap.no_timestamp: 0 +Disables the update of the timestamp in the netmap ring +.It Va dev.netmap.verbose: 0 +Verbose kernel messages +.It Va dev.netmap.buf_num: 163840 +.It Va dev.netmap.buf_size: 2048 +.It Va dev.netmap.ring_num: 200 +.It Va dev.netmap.ring_size: 36864 +.It Va dev.netmap.if_num: 100 +.It Va dev.netmap.if_size: 1024 +Sizes and number of objects (netmap_if, netmap_ring, buffers) +for the global memory region. +The only parameter worth modifying is +.Va dev.netmap.buf_num +as it impacts the total amount of memory used by netmap. +.It Va dev.netmap.buf_curr_num: 0 +.It Va dev.netmap.buf_curr_size: 0 +.It Va dev.netmap.ring_curr_num: 0 +.It Va dev.netmap.ring_curr_size: 0 +.It Va dev.netmap.if_curr_num: 0 +.It Va dev.netmap.if_curr_size: 0 +Actual values in use. +.It Va dev.netmap.priv_buf_num: 4098 +.It Va dev.netmap.priv_buf_size: 2048 +.It Va dev.netmap.priv_ring_num: 4 +.It Va dev.netmap.priv_ring_size: 20480 +.It Va dev.netmap.priv_if_num: 2 +.It Va dev.netmap.priv_if_size: 1024 +Sizes and number of objects (netmap_if, netmap_ring, buffers) +for private memory regions. +A separate memory region is used for each +.Nm VALE +port and each pair of +.Nm netmap pipes . +.It Va dev.netmap.bridge_batch: 1024 +Batch size used when moving packets across a +.Nm VALE +switch. +Values above 64 generally guarantee good +performance. +.It Va dev.netmap.max_bridges: 8 +Max number of +.Nm VALE +switches that can be created. This tunable can be specified +at loader time. +.It Va dev.netmap.ptnet_vnet_hdr: 1 +Allow ptnet devices to use virtio-net headers +.It Va dev.netmap.port_numa_affinity: 0 +On +.Xr numa 4 +systems, allocate memory for netmap ports from the local NUMA domain when +possible. +This can improve performance by reducing the number of remote memory accesses. +However, when forwarding packets between ports attached to different NUMA +domains, this will prevent zero-copy forwarding optimizations and thus may hurt +performance. +Note that this setting must be specified as a loader tunable at boot time. +.El +.Sh SYSTEM CALLS +.Nm +uses +.Xr select 2 , +.Xr poll 2 , +.Xr epoll 7 +and +.Xr kqueue 2 +to wake up processes when significant events occur, and +.Xr mmap 2 +to map memory. +.Xr ioctl 2 +is used to configure ports and +.Nm VALE switches . +.Pp +Applications may need to create threads and bind them to +specific cores to improve performance, using standard +OS primitives, see +.Xr pthread 3 . +In particular, +.Xr pthread_setaffinity_np 3 +may be of use. +.Sh EXAMPLES +.Ss TEST PROGRAMS +.Nm +comes with a few programs that can be used for testing or +simple applications. +See the +.Pa examples/ +directory in +.Nm +distributions, or +.Pa tools/tools/netmap/ +directory in +.Fx +distributions. +.Pp +.Xr pkt-gen 8 +is a general purpose traffic source/sink. +.Pp +As an example +.Dl pkt-gen -i ix0 -f tx -l 60 +can generate an infinite stream of minimum size packets, and +.Dl pkt-gen -i ix0 -f rx +is a traffic sink. +Both print traffic statistics, to help monitor +how the system performs. +.Pp +.Xr pkt-gen 8 +has many options can be uses to set packet sizes, addresses, +rates, and use multiple send/receive threads and cores. +.Pp +.Xr bridge 4 +is another test program which interconnects two +.Nm +ports. +It can be used for transparent forwarding between +interfaces, as in +.Dl bridge -i netmap:ix0 -i netmap:ix1 +or even connect the NIC to the host stack using netmap +.Dl bridge -i netmap:ix0 +.Ss USING THE NATIVE API +The following code implements a traffic generator: +.Pp +.Bd -literal -compact +#include <net/netmap_user.h> +\&... +void sender(void) +{ + struct netmap_if *nifp; + struct netmap_ring *ring; + struct nmreq nmr; + struct pollfd fds; + + fd = open("/dev/netmap", O_RDWR); + bzero(&nmr, sizeof(nmr)); + strcpy(nmr.nr_name, "ix0"); + nmr.nm_version = NETMAP_API; + ioctl(fd, NIOCREGIF, &nmr); + p = mmap(0, nmr.nr_memsize, fd); + nifp = NETMAP_IF(p, nmr.nr_offset); + ring = NETMAP_TXRING(nifp, 0); + fds.fd = fd; + fds.events = POLLOUT; + for (;;) { + poll(&fds, 1, -1); + while (!nm_ring_empty(ring)) { + i = ring->cur; + buf = NETMAP_BUF(ring, ring->slot[i].buf_index); + ... prepare packet in buf ... + ring->slot[i].len = ... packet length ... + ring->head = ring->cur = nm_ring_next(ring, i); + } + } +} +.Ed +.Ss HELPER FUNCTIONS +A simple receiver can be implemented using the helper functions: +.Pp +.Bd -literal -compact +#define NETMAP_WITH_LIBS +#include <net/netmap_user.h> +\&... +void receiver(void) +{ + struct nm_desc *d; + struct pollfd fds; + u_char *buf; + struct nm_pkthdr h; + ... + d = nm_open("netmap:ix0", NULL, 0, 0); + fds.fd = NETMAP_FD(d); + fds.events = POLLIN; + for (;;) { + poll(&fds, 1, -1); + while ( (buf = nm_nextpkt(d, &h)) ) + consume_pkt(buf, h.len); + } + nm_close(d); +} +.Ed +.Ss ZERO-COPY FORWARDING +Since physical interfaces share the same memory region, +it is possible to do packet forwarding between ports +swapping buffers. +The buffer from the transmit ring is used +to replenish the receive ring: +.Pp +.Bd -literal -compact + uint32_t tmp; + struct netmap_slot *src, *dst; + ... + src = &src_ring->slot[rxr->cur]; + dst = &dst_ring->slot[txr->cur]; + tmp = dst->buf_idx; + dst->buf_idx = src->buf_idx; + dst->len = src->len; + dst->flags = NS_BUF_CHANGED; + src->buf_idx = tmp; + src->flags = NS_BUF_CHANGED; + rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur); + txr->head = txr->cur = nm_ring_next(txr, txr->cur); + ... +.Ed +.Ss ACCESSING THE HOST STACK +The host stack is for all practical purposes just a regular ring pair, +which you can access with the netmap API (e.g., with +.Dl nm_open("netmap:eth0^", ... ) ; +All packets that the host would send to an interface in +.Nm +mode end up into the RX ring, whereas all packets queued to the +TX ring are send up to the host stack. +.Ss VALE SWITCH +A simple way to test the performance of a +.Nm VALE +switch is to attach a sender and a receiver to it, +e.g., running the following in two different terminals: +.Dl pkt-gen -i vale1:a -f rx # receiver +.Dl pkt-gen -i vale1:b -f tx # sender +The same example can be used to test netmap pipes, by simply +changing port names, e.g., +.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side +.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side +.Pp +The following command attaches an interface and the host stack +to a switch: +.Dl valectl -h vale2:em0 +Other +.Nm +clients attached to the same switch can now communicate +with the network card or the host. +.Sh SEE ALSO +.Xr vale 4 , +.Xr bridge 8 , +.Xr lb 8 , +.Xr nmreplay 8 , +.Xr pkt-gen 8 , +.Xr valectl 8 +.Pp +.Pa http://info.iet.unipi.it/~luigi/netmap/ +.Pp +Luigi Rizzo, Revisiting network I/O APIs: the netmap framework, +Communications of the ACM, 55 (3), pp.45-51, March 2012 +.Pp +Luigi Rizzo, netmap: a novel framework for fast packet I/O, +Usenix ATC'12, June 2012, Boston +.Pp +Luigi Rizzo, Giuseppe Lettieri, +VALE, a switched ethernet for virtual machines, +ACM CoNEXT'12, December 2012, Nice +.Pp +Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione, +Speeding up packet I/O in virtual machines, +ACM/IEEE ANCS'13, October 2013, San Jose +.Sh AUTHORS +.An -nosplit +The +.Nm +framework has been originally designed and implemented at the +Universita` di Pisa in 2011 by +.An Luigi Rizzo , +and further extended with help from +.An Matteo Landi , +.An Gaetano Catalli , +.An Giuseppe Lettieri , +and +.An Vincenzo Maffione . +.Pp +.Nm +and +.Nm VALE +have been funded by the European Commission within FP7 Projects +CHANGE (257422) and OPENLAB (287581). +.Sh CAVEATS +No matter how fast the CPU and OS are, +achieving line rate on 10G and faster interfaces +requires hardware with sufficient performance. +Several NICs are unable to sustain line rate with +small packet sizes. +Insufficient PCIe or memory bandwidth +can also cause reduced performance. +.Pp +Another frequent reason for low performance is the use +of flow control on the link: a slow receiver can limit +the transmit speed. +Be sure to disable flow control when running high +speed experiments. +.Ss SPECIAL NIC FEATURES +.Nm +is orthogonal to some NIC features such as +multiqueue, schedulers, packet filters. +.Pp +Multiple transmit and receive rings are supported natively +and can be configured with ordinary OS tools, +such as +.Xr ethtool 8 +or +device-specific sysctl variables. +The same goes for Receive Packet Steering (RPS) +and filtering of incoming traffic. +.Pp +.Nm +.Em does not use +features such as +.Em checksum offloading , TCP segmentation offloading , +.Em encryption , VLAN encapsulation/decapsulation , +etc. +When using netmap to exchange packets with the host stack, +make sure to disable these features. |
