
The ‘zero-copy’ initiative

A look at the ‘zero-copy’ concept and an x86 Linux implementation for the case of outgoing packets

From Wikipedia, the free encyclopedia:

Zero-copy is an adjective that refers to computer operations in which the CPU does not perform the task of copying data from one area of memory to another. The availability of zero-copy versions of operating system elements such as device drivers, file systems and network protocol stacks greatly increases the performance of many applications, since using a CPU that is capable of complex operations just to make copies of data can be a great waste of resources. Zero-copy also reduces the number of context-switches from User space to Kernel space and vice-versa. Several OS like Linux support zero copying of files through specific APIs like sendfile, sendfile64, etc.

Techniques for creating zero-copy software include the use of DMA-based copying, and memory-mapping through an MMU. These features require specific hardware support and usually involve particular memory alignment requirements. Zero-copy protocols are especially important for high-speed networks, as memory copies would cause a serious workload for the host CPU. Still, such protocols have some initial overhead so that avoiding programmed IO (PIO) there only makes sense for large messages.

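Application source-code

    #include <stdio.h>      // for printf(), perror()
    #include <stdlib.h>     // for exit()
    #include <string.h>     // for strlen()
    #include <fcntl.h>      // for open()
    #include <unistd.h>     // for write()

    char message[] = "This is a test of network-packet transmission \n";

    int main( void )
    {
        // open the device-file for our network interface driver
        int fd = open( "/dev/nic", O_RDWR );
        if ( fd < 0 ) { perror( "/dev/nic" ); exit(1); }

        // hand the message to the driver's write() function
        int msglen = strlen( message );
        int nbytes = write( fd, message, msglen );
        if ( nbytes < 0 ) { perror( "write" ); exit(1); }

        printf( "Transmitted %d bytes \n", nbytes );
    }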

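Transmit operation

[Diagram] In user space, the application program hands its user data-buffer to the runtime library’s write() function. In kernel space, the Linux OS kernel’s file subsystem calls the nic device-driver’s my_write() function, which uses copy_from_user() to fill the packet buffer. The hardware then fetches the packet buffer using DMA.

We want to eliminate this copying-operation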

Our driver’s packet-layout

packet-buffer in kernel-space:
    destn-address | source-address | TYPE/LENGTH | count | -- data -- data -- data --

Format for Legacy Transmit-Descriptor (16 bytes):
    base-address (64-bits) | Packet-length | CSO | cmd | status | CSS | special

Can zero-copy be transparent?

• We would like to implement the zero-copy concept in our ‘nic2.c’ character driver in such a manner that no changes would be required to an ‘application’ program’s code
• We will show how to do this for ‘outgoing’ packets (i.e., by modifying ‘my_write()’), but achieving zero-copy with ‘incoming’ packets would be a lot more complicated!

TX Descriptor’s CMD byte

Command-Byte Format (bit 7 down to bit 0):
    IDE | VLE | 0 | 0 | RS | IC | IFCS | EOP

EOP = End-Of-Packet (1=yes, 0=no)
RS = Report Status (1=yes, 0=no)
VLE = VLAN-tag Enable

Key question: What will the NIC do if we don’t set the EOP-bit in a TX Descriptor?
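The CMD-byte layout above can be written down as bit-masks. This is only a sketch: the macro names and the helper ‘tx_cmd_byte()’ are hypothetical (they are not taken from ‘nic2.c’), and the helper sets just the two bits whose meaning the slide spells out, EOP and RS.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-masks for the legacy TX descriptor's CMD byte, in the order drawn
   on the slide (bit 7 on the left, bit 0 on the right).
   The macro names are illustrative assumptions. */
#define TX_CMD_EOP   0x01u   /* bit 0: End-Of-Packet            */
#define TX_CMD_IFCS  0x02u   /* bit 1: IFCS                     */
#define TX_CMD_IC    0x04u   /* bit 2: IC                       */
#define TX_CMD_RS    0x08u   /* bit 3: Report Status            */
#define TX_CMD_ZERO  0x30u   /* bits 5..4: shown as 0 on slide  */
#define TX_CMD_VLE   0x40u   /* bit 6: VLAN-tag Enable          */
#define TX_CMD_IDE   0x80u   /* bit 7: IDE                      */

/* Hypothetical helper: build a CMD byte from the two flags that the
   slide explains.  All other bits are left clear. */
static uint8_t tx_cmd_byte(int end_of_packet, int report_status)
{
    uint8_t cmd = 0;

    if (end_of_packet)
        cmd |= TX_CMD_EOP;
    if (report_status)
        cmd |= TX_CMD_RS;

    /* the two bits drawn as 0 must stay clear */
    assert((cmd & TX_CMD_ZERO) == 0);
    return cmd;
}
```

With these masks, a descriptor that does not end the packet would carry a CMD byte with EOP clear, and the descriptor that ends it would have EOP set.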

Splitting our packet-layout

packet-buffer in kernel-space:
    HDR bytes: destn-address | source-address | TYPE/LENGTH | count
    LEN bytes: -- data -- data -- data --

Format for Legacy Transmit-Descriptor Pair:
    base-address (64-bits) | Packet-Length (=HDR) | CSO | cmd (EOP=0) | status | CSS | special
    base-address (64-bits) | Packet-Length (=LEN) | CSO | cmd (EOP=1) | status | CSS | special

Splitting our packet-buffer

packet-buffer in kernel-space:
    HDR bytes: destn-address | source-address | TYPE/LENGTH | count

packet-buffer in user-space:
    LEN bytes: -- data -- data -- data --

Format for Legacy Transmit-Descriptor Pair:
    base-address (64-bits) | Packet-Length (=HDR) | CSO | cmd (EOP=0) | status | CSS | special
    base-address (64-bits) | Packet-Length (=LEN) | CSO | cmd (EOP=1) | status | CSS | special

Two physical packet-buffers comprise one logical packet that gets transmitted!
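As a rough sketch of the descriptor pair above, the 16-byte legacy descriptor can be modeled as a C structure and the two entries filled in together. The structure name, the field names other than ‘base_address’, and the helper ‘fill_split_pair()’ are assumptions made for illustration. The header length is passed in as a parameter because the slides do not state the width of the ‘count’ field.

```c
#include <assert.h>
#include <stdint.h>

#define TX_CMD_EOP  0x01u   /* End-Of-Packet bit in the CMD byte */

/* Assumed C model of the 16-byte legacy transmit descriptor,
   with fields in the order listed on the slide. */
struct tx_descriptor {
    uint64_t base_address;  /* physical address of the buffer */
    uint16_t length;        /* number of bytes in this buffer */
    uint8_t  cso;
    uint8_t  cmd;
    uint8_t  status;
    uint8_t  css;
    uint16_t special;
};

static_assert(sizeof(struct tx_descriptor) == 16,
              "legacy TX descriptor must occupy 16 bytes");

/* Hypothetical helper: describe one logical packet with two descriptors.
   The first covers the header buffer (EOP=0), the second covers the
   data buffer (EOP=1). */
static void fill_split_pair(struct tx_descriptor *hdr_desc,
                            struct tx_descriptor *data_desc,
                            uint64_t hdr_phys, uint16_t hdr_len,
                            uint64_t data_phys, uint16_t data_len)
{
    *hdr_desc = (struct tx_descriptor){
        .base_address = hdr_phys,
        .length       = hdr_len,
        .cmd          = 0,            /* EOP=0: more buffers follow */
    };
    *data_desc = (struct tx_descriptor){
        .base_address = data_phys,
        .length       = data_len,
        .cmd          = TX_CMD_EOP,   /* EOP=1: packet ends here */
    };
}
```

Whether the driver also sets other CMD bits (for example RS) on either descriptor is left open here, since the slide shows only the EOP settings.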

Transmitting a ‘split-packet’

The 82573L controller ‘merges’ the contents of these separate buffers into just a single ethernet-packet

[Diagram] User-space: the application-program’s packet-data buffer, fetched by the NIC hardware using DMA. Kernel-space: the device-driver module’s packet-header buffer, also fetched by the NIC hardware using DMA.

The ‘virt_to_phys()’ macro

• Linux provides a convenient macro which kernel-module code can employ to obtain the physical-address for a memory-region from its virtual-address, but it only works for addresses that aren’t in ‘high’ memory
• For ‘normal’ memory-regions, conversion between ‘virtual’ and ‘physical’ addresses amounts to a simple addition/subtraction
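The ‘simple addition/subtraction’ can be illustrated in ordinary user-space C. This sketch assumes the usual 32-bit x86 Linux layout in which kernel virtual addresses start at 0xC0000000, and it assumes the 896-MB limit on the kernel’s persistent mapping that appears in these slides. The function names are hypothetical stand-ins, not the kernel’s own macros.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the kernel's direct mapping begins at this virtual address. */
#define SKETCH_PAGE_OFFSET  0xC0000000u
/* Assumption: only the first 896 MB of RAM are persistently mapped. */
#define SKETCH_LOWMEM_SIZE  (896u << 20)

/* Stand-in for virt_to_phys(): valid only for 'normal' (not 'high') memory. */
static uint32_t sketch_virt_to_phys(uint32_t vaddr)
{
    assert(vaddr >= SKETCH_PAGE_OFFSET);
    assert(vaddr - SKETCH_PAGE_OFFSET < SKETCH_LOWMEM_SIZE);
    return vaddr - SKETCH_PAGE_OFFSET;
}

/* Stand-in for phys_to_virt(): the inverse conversion. */
static uint32_t sketch_phys_to_virt(uint32_t paddr)
{
    assert(paddr < SKETCH_LOWMEM_SIZE);
    return paddr + SKETCH_PAGE_OFFSET;
}
```

A user-space buffer address lies below the assumed starting address, so this arithmetic cannot be applied to it. That is why the driver has to consult the page-tables instead.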

Linux memory-mapping

[Diagram] Legend: persistent mapping; transient mappings. Labels: CPU’s virtual address-space (user space, kernel space); physical RAM (896-MB, HMA).

There is more physical RAM in our classroom’s systems than can be ‘mapped’ into the available address-range for kernel virtual addresses

Two-Level Translation Scheme

[Diagram] CR3 -> PAGE DIRECTORY -> PAGE TABLES -> PAGE FRAMES

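Linear to Physical

[Diagram] The linear address is split into dir-index, table-index and offset. CR3 locates the page directory, dir-index selects a page table, table-index selects a page frame, and offset selects the byte within that page frame in the physical address-space.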

Address-translation

• The CPU examines any virtual address it encounters, subdividing it into three fields

    bits 31-22 (10-bits): index into page-directory
        This field selects one of the 1024 array-entries in the Page-Directory
    bits 21-12 (10-bits): index into page-table
        This field selects one of the 1024 array-entries in that Page-Table
    bits 11-0 (12-bits): offset into page-frame
        This field provides the offset to one of the 4096 bytes in that Page-Frame
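The three-field subdivision described above can be tried out in a small user-space sketch. The structure and function names are hypothetical; only the shift counts and masks come from the field widths on the slide.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of a 32-bit linear address (names are illustrative). */
struct linear_fields {
    uint32_t dindex;   /* bits 31-22: index into page-directory */
    uint32_t pindex;   /* bits 21-12: index into page-table     */
    uint32_t offset;   /* bits 11-0:  offset into page-frame    */
};

/* Take a linear address apart into its three fields. */
static struct linear_fields split_linear(uint32_t addr)
{
    struct linear_fields f;

    f.dindex = (addr >> 22) & 0x3FF;   /* 10 bits */
    f.pindex = (addr >> 12) & 0x3FF;   /* 10 bits */
    f.offset = addr & 0xFFF;           /* 12 bits */
    return f;
}

/* Put the three fields back together (the inverse of split_linear). */
static uint32_t join_linear(struct linear_fields f)
{
    assert(f.dindex < 1024 && f.pindex < 1024 && f.offset < 4096);
    return (f.dindex << 22) | (f.pindex << 12) | f.offset;
}
```

For instance, the address 0x08049ABC splits into directory index 32, table index 73 and offset 0xABC.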

Format of a Page-Table entry

    bits 31-12: PAGE-FRAME BASE ADDRESS
    bits 11-9:  AVAIL
    bit 8:      0
    bit 7:      0
    bit 6:      D
    bit 5:      A
    bit 4:      PCD
    bit 3:      PWT
    bit 2:      U
    bit 1:      W
    bit 0:      P

LEGEND
    P = Present (1=yes, 0=no)
    W = Writable (1=yes, 0=no)
    U = User (1=yes, 0=no)
    A = Accessed (1=yes, 0=no)
    D = Dirty (1=yes, 0=no)
    PWT = Page Write-Through (1=yes, 0=no)
    PCD = Page Cache-Disable (1=yes, 0=no)
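The entry format above translates directly into bit-masks. In this sketch the macro and function names are hypothetical. The check of the Present bit is an addition made here for safety and is not something the slides prescribe.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits of a 32-bit page-table entry, following the slide's legend. */
#define PTE_P      0x001u        /* bit 0: Present             */
#define PTE_W      0x002u        /* bit 1: Writable            */
#define PTE_U      0x004u        /* bit 2: User                */
#define PTE_PWT    0x008u        /* bit 3: Page Write-Through  */
#define PTE_PCD    0x010u        /* bit 4: Page Cache-Disable  */
#define PTE_A      0x020u        /* bit 5: Accessed            */
#define PTE_D      0x040u        /* bit 6: Dirty               */
#define PTE_AVAIL  0xE00u        /* bits 11-9: available       */
#define PTE_FRAME  0xFFFFF000u   /* bits 31-12: frame base     */

/* Page-Frame Number stored in a present entry. */
static uint32_t pte_pfn(uint32_t pte)
{
    assert(pte & PTE_P);   /* frame bits are meaningful only if present */
    return pte >> 12;
}

/* Physical address of the byte at 'offset' within the entry's page-frame. */
static uint32_t pte_to_phys(uint32_t pte, uint32_t offset)
{
    assert(pte & PTE_P);
    assert(offset < 4096);
    return (pte & PTE_FRAME) | offset;
}
```

An entry value of 0x12345067, for example, has P, W, U, A and D set and names page-frame 0x12345.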

Finding the user-buffer’s PFN

• To program the ‘base-address’ field in the second TX-Descriptor, our driver’s ‘write()’ function will need to know which physical Page-Frame the application’s buffer lies in
• Its PFN (Page-Frame Number) can be found from its virtual address by ‘walking-the-cpu-page-tables’, even when Linux puts some page-tables in ‘high’ memory

Performing ‘virt_to_phys()’

    ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
    {
        unsigned int  _cr3, *pgdir, *pgtbl, pfn_pgtbl, pfn_frame;
        unsigned int  dindex, pindex, offset;

        // take apart the virtual-address of the user's 'buf' variable
        dindex = ((unsigned long)buf >> 22) & 0x3FF;   // pgdir-index (10-bits)
        pindex = ((unsigned long)buf >> 12) & 0x3FF;   // pgtbl-index (10-bits)
        offset = ((unsigned long)buf >>  0) & 0xFFF;   // frame-offset (12-bits)

        // then walk the CPU's paging-tables to get buf's physical-address
        asm(" mov %%cr3, %%eax \n mov %%eax, %0 " : "=m"(_cr3) : : "ax" );
        pgdir = (unsigned int *)phys_to_virt( _cr3 & ~0xFFF );
        pfn_pgtbl = (pgdir[ dindex ] >> 12);
        // the page-table may be in 'high' memory, so map it temporarily
        pgtbl = (unsigned int *)kmap( &mem_map[ pfn_pgtbl ] );
        pfn_frame = (pgtbl[ pindex ] >> 12);
        kunmap( &mem_map[ pfn_pgtbl ] );
        txring[ txtail + 1 ].base_address = (pfn_frame << 12) + offset;
        // ...
    }

Can’t cross a ‘page-boundary’

• In order for the NIC to fetch the user’s data using its Bus-Master DMA capability, the buffer needs to reside in a physically contiguous memory-region
• But we can’t be sure Linux will have set up the CPU’s page-tables that way, unless the ‘buf’ is confined to a single page-frame

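Truncate ‘len’ if necessary

    ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
    {
        // ...
        // 'offset' is where 'buf' begins within its page-frame
        if ( offset + len > PAGE_SIZE ) len = PAGE_SIZE - offset;
        // ...
    }

[Diagram] ‘buf’ begins ‘offset’ bytes into a page-frame of PAGE_SIZE bytes and extends for ‘len’ bytes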
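The truncation rule can be checked in user space with a small sketch. The function name ‘clamp_to_page()’ and the constant SKETCH_PAGE_SIZE (4096, matching the 4096-byte page-frames described earlier) are assumptions for illustration. The comparison is written against the room left in the page rather than as ‘offset + len’, which is a choice made here so that a very large ‘len’ cannot wrap around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE  4096u   /* assumed page-frame size in bytes */

/* Return how many of 'len' bytes starting at address 'buf' lie within
   the page-frame that contains 'buf'. */
static size_t clamp_to_page(uintptr_t buf, size_t len)
{
    size_t offset = buf & (SKETCH_PAGE_SIZE - 1);  /* position inside the page */
    size_t room   = SKETCH_PAGE_SIZE - offset;     /* bytes left in this page  */

    assert(room >= 1 && room <= SKETCH_PAGE_SIZE);
    return (len > room) ? room : len;
}
```

A buffer that starts 0x100 bytes before the end of a page and is 0x200 bytes long would be cut down to 0x100 bytes. An application that needs to send the rest would presumably call write() again with the remaining bytes.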

‘zerocopy.c’

• We created this modification of our ‘nic2.c’ device-driver so its ‘my_write()’ function lets an application perform transmissions without performing a memory-to-memory copy-operation (i.e., ‘copy_from_user()’)
• It is not so easy to implement ‘zero-copy’ for receiving packets. Can you say why?

Website article • We’ve posted a link on our CS 686 website to a frequently cited research-article about the various issues that arise when trying to implement the ‘zero-copy’ concept for the case of ‘incoming’ network-packets: The Need for Asynchronous, Zero-Copy Network I/O, by Ulrich Drepper, Red Hat, Inc.