Device Layer and Device Drivers COMS W 6998

Device Layer and Device Drivers COMS W 6998 Spring 2010 Erich Nahum

Device Layer vs. Device Driver
  • Linux tries to abstract away the device specifics using the struct net_device
  • Provides a generic device layer in linux/net/core/dev.c and include/linux/netdevice.h
  • Device drivers are responsible for providing the appropriate virtual functions
      • E.g., dev->netdev_ops->ndo_start_xmit
  • Device layer calls driver layer and vice-versa
  • Execution spans interrupts, syscalls, and softirqs
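The split between generic device layer and driver-supplied virtual functions can be sketched in userspace C. This is only a model: the `model_*` structs and the `toy_*` driver are hypothetical stand-ins that mimic the shape of `struct net_device` and its `netdev_ops` table, not the kernel's real definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: these structs mimic the shape of the kernel's
 * struct net_device and struct net_device_ops, not their real layout. */
struct model_skb { size_t len; };
struct model_net_device;

struct model_net_device_ops {
    int (*ndo_open)(struct model_net_device *dev);
    int (*ndo_stop)(struct model_net_device *dev);
    int (*ndo_start_xmit)(struct model_skb *skb, struct model_net_device *dev);
};

struct model_net_device {
    const char *name;
    const struct model_net_device_ops *netdev_ops;
    int is_up;
    unsigned long tx_packets;
};

/* "Driver" side: adapter-specific implementations (hypothetical toy NIC). */
int toy_open(struct model_net_device *dev) { dev->is_up = 1; return 0; }
int toy_stop(struct model_net_device *dev) { dev->is_up = 0; return 0; }
int toy_start_xmit(struct model_skb *skb, struct model_net_device *dev)
{
    (void)skb;
    dev->tx_packets++;
    return 0;
}

const struct model_net_device_ops toy_ops = {
    .ndo_open = toy_open,
    .ndo_stop = toy_stop,
    .ndo_start_xmit = toy_start_xmit,
};

/* "Device layer" side: generic code that only knows the ops table. */
int model_dev_open(struct model_net_device *dev)
{
    assert(dev->netdev_ops && dev->netdev_ops->ndo_open);
    return dev->netdev_ops->ndo_open(dev);
}

int model_dev_xmit(struct model_skb *skb, struct model_net_device *dev)
{
    if (!dev->is_up)
        return -1;          /* refuse to send on a closed device */
    return dev->netdev_ops->ndo_start_xmit(skb, dev);
}
```

The generic functions never name the toy driver; swapping in another ops table changes the adapter without touching the device layer.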

Device Interfaces
[Figure: layered diagram, top to bottom]
  • Higher protocol instances
  • Network devices (adapter-independent), dev.c: napi_schedule, dev_open, dev_queue_xmit, dev_close
  • Network devices interface (net_device_ops; abstraction from adapter specifics): netdev_ops->ndo_open, netdev_ops->ndo_start_xmit, netdev_ops->ndo_stop
  • Network driver (adapter-specific), pcnet32.c: pcnet32_interrupt, pcnet32_open, pcnet32_start_xmit, pcnet32_stop

Network Process Contexts
  • Hardware interrupt
      • Received packets (upcalls)
  • Process context
      • System calls (downcalls)
  • Softirq context
      • NET_RX_SOFTIRQ for received packets (upcalls)
      • NET_TX_SOFTIRQ for delayed sending packets (downcalls)

Softnet
  • Introduced in kernel 2.4.x
  • Parallelize packet handling on SMP machines
  • Packet transmit/receive is handled via two softirqs:
      • NET_TX_SOFTIRQ feeds packets from network stack to driver.
      • NET_RX_SOFTIRQ feeds packets from driver to network stack.
  • The transmit/receive queues used to be stored in per-cpu softnet_data.
  • Now stored in specific places:
      • Receive side: in device packet rx queues
      • Send side: in device qdiscs

Device Driver HW Interface
[Figure: driver and device; memory-mapped register reads/writes go from driver to device, interrupts go from device to driver]
  • Driver talks to the device:
      • Writing commands to memory-mapped control status registers
      • Setting aside buffers for packet transmission/reception
      • Describing these buffers in descriptor rings
  • Device talks to driver:
      • Generating interrupts (both on send and receive)
      • Placing values in control status registers
      • DMA’ing packets to/from available buffers
      • Updating status in descriptor rings

Packet Descriptor Rings
  • Descriptors contain pointers, status bits
  • Driver allocates packet buffers
[Figure: TX descriptor ring with TXQ Head and TXQ Tail, entries marked SendErr, Sent, Send, Free; RX descriptor ring with RXQ Head and RXQ Tail, entries marked Free, RecvOK, RcvErr, RecvCRC; descriptors point to packet buffers]
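A descriptor ring of this kind can be modeled in a few lines of userspace C. The sketch below covers only the TX side and simulates the device in software; the names (`tx_ring`, `ring_post`, and so on), the ring size, and the reduction of the status bits to a single enum are all assumptions made for illustration, not any real adapter's layout.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4   /* illustrative; real rings are larger */

enum desc_status { DESC_FREE, DESC_SEND, DESC_SENT, DESC_SEND_ERR };

struct tx_desc {
    void *buf;               /* pointer to the packet buffer */
    enum desc_status status; /* status bits, simplified to one enum */
};

struct tx_ring {
    struct tx_desc desc[RING_SIZE];
    unsigned int head;  /* next descriptor the driver fills */
    unsigned int tail;  /* oldest descriptor not yet reclaimed */
};

/* Driver: post a buffer for transmission. Returns 0, or -1 if the ring is full. */
int ring_post(struct tx_ring *r, void *buf)
{
    struct tx_desc *d = &r->desc[r->head % RING_SIZE];
    if (d->status != DESC_FREE)
        return -1;
    d->buf = buf;
    d->status = DESC_SEND;
    r->head++;
    return 0;
}

/* Device (simulated): transmit the oldest pending descriptor, update status. */
int ring_device_send_one(struct tx_ring *r, int ok)
{
    unsigned int i;
    for (i = r->tail; i != r->head; i++) {
        struct tx_desc *d = &r->desc[i % RING_SIZE];
        if (d->status == DESC_SEND) {
            d->status = ok ? DESC_SENT : DESC_SEND_ERR;
            return 1;
        }
    }
    return 0;
}

/* Driver: reclaim completed descriptors; returns how many were freed. */
int ring_reclaim(struct tx_ring *r)
{
    int freed = 0;
    while (r->tail != r->head) {
        struct tx_desc *d = &r->desc[r->tail % RING_SIZE];
        if (d->status == DESC_SEND)
            break;           /* still owned by the device */
        d->buf = NULL;
        d->status = DESC_FREE;
        r->tail++;
        freed++;
    }
    assert(r->head - r->tail <= RING_SIZE);
    return freed;
}
```

The status field is what lets driver and device share the ring without further coordination: each side only touches descriptors whose status says they belong to it.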

NIC IRQ
  • The NIC driver registers an interrupt handler for the IRQ the device uses by calling request_irq().
      • This interrupt handler is the one that will be called when a frame is received
      • The same interrupt handler may be called for other reasons (NIC-dependent)
          • Transmission complete, transmission error
  • Newer drivers (e.g., e1000e) seem to use Message Signaled Interrupts (MSI), which use different interrupt numbers
  • Device drivers can release an IRQ using free_irq().

Packet Reception with NAPI
  • Originally, Linux took one interrupt per received packet
      • This could cause excessive overhead under heavy loads
  • NAPI: “New API”
      • With NAPI, interrupt notifies softnet layer (NET_RX_SOFTIRQ) that packets are available
  • Driver requirements:
      • Ability to turn receive interrupts off and back on again
      • A ring buffer
      • A poll function to pull packets out
  • Most drivers support this now.

Reception: NAPI mode (1)
  • NAPI allows dynamic switching:
      • To polled mode when the interrupt rate is too high
      • To interrupt-driven when load is low
  • In the network interface private structure, add a struct napi_struct
  • At driver initialization, register the NAPI poll operation:
        netif_napi_add(dev, &bp->napi, my_poll, 64);
      • dev is the network interface
      • &bp->napi is the struct napi_struct
      • my_poll is the NAPI poll operation
      • 64 is the weight that represents the importance of the network interface. It is related to the threshold below which the driver will return to interrupt mode.

Reception: NAPI mode (2)
  • In the interrupt handler, when a packet has been received:
        if (napi_schedule_prep(&bp->napi)) {
            /* Disable reception interrupts */
            __napi_schedule(&bp->napi);
        }
  • The kernel will call our poll() operation regularly
  • The poll() operation has the following prototype:
        static int my_poll(struct napi_struct *napi, int budget)
  • It must receive at most budget packets and push them to the network stack using netif_receive_skb().
  • If fewer than budget packets have been received, switch back to interrupt mode using napi_complete(&bp->napi) and reenable interrupts
  • Poll function must return the number of packets received
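The interrupt-to-poll handoff and the budget rule can be exercised outside the kernel with a small model. Everything here is hypothetical (`model_nic`, `model_interrupt`, `model_poll`); the comments note which kernel call each step stands in for, but none of the real NAPI functions are invoked.

```c
#include <assert.h>

/* Userspace model of a NIC's receive side; every name here is hypothetical. */
struct model_nic {
    int rx_pending;      /* packets waiting in the RX ring */
    int rx_irq_enabled;  /* 1 = interrupt mode, 0 = polled mode */
    int scheduled;       /* 1 = poll function is on the poll list */
    int delivered;       /* packets handed to the stack so far */
};

/* Interrupt handler: mask RX interrupts and schedule polling, once. */
void model_interrupt(struct model_nic *nic)
{
    if (!nic->scheduled) {          /* plays the role of napi_schedule_prep() */
        nic->rx_irq_enabled = 0;    /* disable reception interrupts */
        nic->scheduled = 1;         /* plays the role of __napi_schedule() */
    }
}

/* Poll function: receive at most `budget` packets, return the number received. */
int model_poll(struct model_nic *nic, int budget)
{
    int work_done = 0;

    assert(budget > 0);
    while (work_done < budget && nic->rx_pending > 0) {
        nic->rx_pending--;
        nic->delivered++;           /* stands in for netif_receive_skb() */
        work_done++;
    }
    if (work_done < budget) {       /* ring drained: back to interrupt mode */
        nic->scheduled = 0;         /* plays the role of napi_complete() */
        nic->rx_irq_enabled = 1;
    }
    return work_done;
}
```

With 100 packets pending and a budget of 64, the first poll returns 64 and stays in polled mode; the second returns 36, which is below the budget, so the model reenables interrupts.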

Receiving Data Packets (1)
[Figure: call path in "hard" IRQ context: interrupt -> __do_IRQ (irq/handle.c) -> pcnet32_interrupt (pcnet32.c) -> napi_schedule (dev.c)]
  • HW interrupt invokes __do_IRQ
  • __do_IRQ invokes each handler for that IRQ:
        action->handler(irq, action->dev_id);
  • pcnet32_interrupt
      • Acknowledges the interrupt ASAP
      • Checks various registers
      • Calls napi_schedule to wake up NET_RX_SOFTIRQ

Receiving Data Packets (2)
[Figure: call path in soft IRQ context: Scheduler -> do_softirq (softirq.c) -> net_rx_action (dev.c) -> pcnet32_poll (pcnet32.c) -> netif_receive_skb (dev.c) -> ptype_base[ntohs(type)] -> arp_rcv, ip_rcv, ..., ipx_rcv]
  • Immediately after the interrupt, do_softirq is run
      • Recall softirqs are per-cpu
  • For each napi struct in the list (one per dev)
      • Invoke poll function
      • Track amount of work done (packets)
      • If work threshold exceeded, wake up ksoftirqd and break out of loop
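The per-device loop with a work threshold can be sketched as follows. This is a userspace approximation: `model_napi`, `model_poll_dev`, and `model_net_rx_action` are hypothetical names, the poll list is a plain array visited round-robin, and the return value merely signals that work was deferred rather than actually waking a kernel thread.

```c
#include <assert.h>

struct model_napi {
    int pending;   /* packets waiting on this device */
    int weight;    /* per-poll limit for this device */
};

/* Toy poll function: consume up to `budget` packets, return the count. */
int model_poll_dev(struct model_napi *n, int budget)
{
    int done = n->pending < budget ? n->pending : budget;
    n->pending -= done;
    return done;
}

/*
 * Model of the softirq receive loop. Visits devices round-robin, tracks the
 * work done, and stops once the overall budget is used up.
 * Returns 1 if work was deferred (the softirq would run again), else 0.
 */
int model_net_rx_action(struct model_napi *devs, int ndevs, int total_budget,
                        int *processed)
{
    int budget = total_budget;
    int remaining = 1;

    *processed = 0;
    while (remaining) {
        int i;
        remaining = 0;
        for (i = 0; i < ndevs; i++) {
            int work;
            if (devs[i].pending == 0)
                continue;
            if (budget <= 0)
                return 1;    /* threshold exceeded: break out, defer the rest */
            assert(devs[i].weight > 0);
            work = model_poll_dev(&devs[i], devs[i].weight);
            budget -= work;
            *processed += work;
            if (devs[i].pending > 0)
                remaining = 1;   /* device stays on the list for another pass */
        }
    }
    return 0;
}
```

Because the budget is checked before each poll rather than inside it, a pass can overshoot the threshold by up to one device weight; the model accepts that to keep the per-device poll simple.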

Receiving Data Packets (3)
[Figure: call path in soft IRQ context: Scheduler -> do_softirq (softirq.c) -> net_rx_action (dev.c) -> pcnet32_poll (pcnet32.c) -> netif_receive_skb (dev.c) -> ptype_base[ntohs(type)] -> arp_rcv, ip_rcv, ..., ipx_rcv]
  • Driver poll function:
      • May call dev_alloc_skb and copy
          • pcnet32 does, e1000 doesn’t.
      • Calls eth_type_trans to get packet type
          • skb_pull the ethernet header (14 bytes)
          • Data now points to payload data (e.g., IP header)
      • Does call netif_receive_skb
      • Clears tx ring and frees sent skbs
  • netif_receive_skb:
      • Demultiplexes to appropriate receive function based on header type
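Reading the packet type and pulling the 14-byte Ethernet header amounts to a little pointer arithmetic, sketched here in userspace C. `model_skb`, `model_skb_pull`, and `model_eth_type_trans` are hypothetical stand-ins; in particular the model stores the EtherType in host byte order, which is a simplification for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_ETH_HLEN 14   /* 6 dst MAC + 6 src MAC + 2 EtherType */

/* Minimal stand-in for an skb: a window [data, data+len) onto a frame. */
struct model_skb {
    const uint8_t *data;
    size_t len;
    uint16_t protocol;   /* EtherType in host byte order (a model choice) */
};

/* Advance the data pointer past `n` bytes; returns NULL if too short. */
const uint8_t *model_skb_pull(struct model_skb *skb, size_t n)
{
    if (n > skb->len)
        return NULL;
    skb->data += n;
    skb->len -= n;
    return skb->data;
}

/* Read the EtherType, then strip the Ethernet header. Returns 0 on success. */
int model_eth_type_trans(struct model_skb *skb)
{
    assert(skb && skb->data);
    if (skb->len < MODEL_ETH_HLEN)
        return -1;                       /* runt frame */
    /* EtherType is big-endian at byte offset 12 */
    skb->protocol = (uint16_t)((skb->data[12] << 8) | skb->data[13]);
    model_skb_pull(skb, MODEL_ETH_HLEN);
    return 0;
}
```

After the call, `data` points at the first payload byte, so a protocol handler such as an IP receive routine can parse its own header from offset 0.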

Packet Types Hash Table
[Figure: the ptype_base[16] hash table and the ptype_all list, each chaining packet_type entries through their list field]
  • ptype_base: a protocol that receives only packets with the correct packet identifier
      • packet_type: ETH_P_ARP, dev: NULL, func: arp_rcv()
      • packet_type: ETH_P_IP, dev: NULL, func: ip_rcv()
  • ptype_all: a protocol that receives all packets arriving at the interface
      • packet_type: ETH_P_ALL, dev, func ...
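The lookup can be sketched as a 16-bucket table indexed by the low bits of the packet type, with a type comparison inside each bucket. This is a userspace model under assumptions: the `model_*` names are hypothetical, types are kept in host byte order, the `dev` field is omitted, and the `ptype_all` list is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTYPE_HASH_SIZE 16
#define PTYPE_HASH_MASK (PTYPE_HASH_SIZE - 1)

#define MODEL_ETH_P_IP  0x0800
#define MODEL_ETH_P_ARP 0x0806

typedef int (*rcv_func)(const uint8_t *payload, size_t len);

struct model_packet_type {
    uint16_t type;                   /* EtherType, host byte order here */
    rcv_func func;
    struct model_packet_type *next;  /* chain within one bucket */
};

struct model_packet_type *model_ptype_base[PTYPE_HASH_SIZE];

/* Register a handler by pushing it onto its bucket's chain. */
void model_dev_add_pack(struct model_packet_type *pt)
{
    unsigned int hash = pt->type & PTYPE_HASH_MASK;
    assert(pt->func != NULL);
    pt->next = model_ptype_base[hash];
    model_ptype_base[hash] = pt;
}

/* Deliver to every handler whose type matches; returns handlers invoked. */
int model_deliver(uint16_t type, const uint8_t *payload, size_t len)
{
    struct model_packet_type *pt;
    int delivered = 0;
    for (pt = model_ptype_base[type & PTYPE_HASH_MASK]; pt; pt = pt->next) {
        if (pt->type == type) {      /* buckets are shared, so compare */
            pt->func(payload, len);
            delivered++;
        }
    }
    return delivered;
}

/* Toy receive functions that only count their invocations. */
int model_ip_rcv_calls, model_arp_rcv_calls;
int model_ip_rcv(const uint8_t *p, size_t n)  { (void)p; (void)n; return ++model_ip_rcv_calls; }
int model_arp_rcv(const uint8_t *p, size_t n) { (void)p; (void)n; return ++model_arp_rcv_calls; }
```

A type with no registered handler simply finds an empty or non-matching chain and is delivered to nobody.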

Transmission Overview
  • Transmission is surprisingly complex
  • Each net_device has 1 or more tx queues
  • Each queue has a policy associated with it
      • struct Qdisc
  • Policies can be simple
      • e.g., default pfifo, stochastic fairness queuing
  • Policies can be very complex
      • e.g., RED, Hierarchical Token Bucket
  • In this section, we assume PFIFO.

Queuing Ops
  • enqueue()
      • Enqueues a packet
  • dequeue()
      • Returns a pointer to a packet (skb) eligible for sending; NULL means nothing is ready
  • pfifo: 3-band priority fifo
      • Enqueue function is pfifo_fast_enqueue
      • Dequeue function is pfifo_fast_dequeue
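A 3-band priority fifo with this enqueue/dequeue contract can be sketched in userspace C. The names (`model_pfifo`, `model_enqueue`, `model_dequeue`) are hypothetical, the caller picks the band directly instead of deriving it from the packet's priority, and the fixed per-band limit is an assumption chosen to keep the model small.

```c
#include <assert.h>
#include <stddef.h>

#define BANDS 3
#define BAND_LIMIT 4   /* illustrative per-band queue length limit */

struct model_pkt { int id; };

struct model_band {
    struct model_pkt *q[BAND_LIMIT];  /* circular buffer */
    unsigned int head, count;
};

struct model_pfifo {
    struct model_band band[BANDS];   /* band 0 is the highest priority */
};

/* Add to the tail of the chosen band; returns 0, or -1 if the packet is dropped. */
int model_enqueue(struct model_pfifo *qd, struct model_pkt *p, int band)
{
    struct model_band *b;
    assert(band >= 0 && band < BANDS);
    b = &qd->band[band];
    if (b->count == BAND_LIMIT)
        return -1;
    b->q[(b->head + b->count) % BAND_LIMIT] = p;
    b->count++;
    return 0;
}

/* Return the oldest packet of the highest-priority non-empty band, or NULL. */
struct model_pkt *model_dequeue(struct model_pfifo *qd)
{
    int i;
    for (i = 0; i < BANDS; i++) {
        struct model_band *b = &qd->band[i];
        if (b->count > 0) {
            struct model_pkt *p = b->q[b->head];
            b->head = (b->head + 1) % BAND_LIMIT;
            b->count--;
            return p;
        }
    }
    return NULL;
}
```

Order is first-in-first-out within a band and strict priority across bands, so a packet in band 2 waits until bands 0 and 1 are empty.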

Sending a Packet Direct (1)
[Figure: call path in syscall or soft IRQ context: dev_queue_xmit (dev.c) -> dev->qdisc->pfifo_fast_enqueue, then __qdisc_run -> qdisc_restart (sch_generic.c) -> dev->qdisc->pfifo_fast_dequeue -> dev_hard_start_xmit (dev.c) -> pcnet32_start_xmit (pcnet32.c)]
  • dev_queue_xmit
      • Linearizes skb if necessary
      • Checksums if necessary
      • Calls q->enqueue if available
      • If not, calls dev_hard_start_xmit
  • dev->q->enqueue (pfifo)
      • Checks queue length
      • Drops if necessary
      • Adds to tail otherwise

Sending a Packet Direct (2)
[Figure: call path in syscall or soft IRQ context: dev_queue_xmit (dev.c) -> dev->qdisc->pfifo_fast_enqueue, then __qdisc_run -> qdisc_restart (sch_generic.c) -> dev->qdisc->pfifo_fast_dequeue -> dev_hard_start_xmit (dev.c) -> pcnet32_start_xmit (pcnet32.c)]
  • __qdisc_run
      • Calls qdisc_restart until error
      • Enables tx softirq if necessary
  • qdisc_restart
      • Dequeues a packet
      • Finds tx queue
      • Calls dev_hard_start_xmit
  • dev_hard_start_xmit
      • Invokes dev->xmit
      • Frees the skb
  • pcnet32_start_xmit
      • Puts skb in tx descriptor ring
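The run/restart loop can be approximated in userspace C. All names are hypothetical, packets are reduced to counters, and the only reason this model defers to the TX softirq is a full descriptor ring; the kernel's actual conditions for rescheduling are more involved.

```c
#include <assert.h>

struct model_txq {
    int queued;         /* packets waiting in the qdisc */
    int ring_free;      /* free slots in the driver's TX descriptor ring */
    int sent;           /* packets handed to the driver */
    int softirq_raised; /* 1 = transmission deferred to the TX softirq */
};

/* Driver transmit routine: takes one packet if the ring has room. */
int model_start_xmit(struct model_txq *q)
{
    if (q->ring_free == 0)
        return -1;             /* ring full: driver is busy */
    q->ring_free--;
    q->sent++;
    return 0;
}

/* One step: dequeue a packet and hand it to the driver.
 * Returns 1 to keep going, 0 when the queue is empty or the driver is busy. */
int model_qdisc_restart(struct model_txq *q)
{
    if (q->queued == 0)
        return 0;
    q->queued--;                       /* dequeue */
    if (model_start_xmit(q) != 0) {
        q->queued++;                   /* requeue the packet for later */
        return 0;
    }
    return 1;
}

/* Run loop: restart until it stops, then defer leftovers to the softirq. */
void model_qdisc_run(struct model_txq *q)
{
    while (model_qdisc_restart(q))
        ;
    if (q->queued > 0)
        q->softirq_raised = 1;
    assert(q->queued >= 0 && q->ring_free >= 0);
}
```

With five packets queued and three free ring slots, one run sends three, leaves two queued, and marks the softirq as needed; a later run with room in the ring drains the rest.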

Sending a Packet via SoftIRQ
[Figure: call path in soft IRQ context: do_softirq (softirq.c) -> net_tx_action (dev.c) -> __qdisc_run -> qdisc_restart (sch_generic.c) -> dev->qdisc->pfifo_fast_dequeue -> dev_hard_start_xmit (dev.c) -> pcnet32_start_xmit (pcnet32.c)]
  • do_softirq invoked
      • net_tx_action is the action for NET_TX_SOFTIRQ
  • net_tx_action
      • Frees packets posted to completion queue
      • Invokes __qdisc_run on all output qdiscs if possible
      • Sets bit in qdisc to run again if necessary