INSIDE THE LINUX KERNEL
UnixForum Chicago - March 8, 2001
Daniel P. Bovet
University of Rome "Tor Vergata"

WHAT IS A KERNEL? (1/2)
+ it’s a program that runs in Kernel Mode
+ CPUs run either in Kernel Mode or in User Mode
+ when in User Mode, some parts of RAM can’t be addressed, some instructions can’t be executed, and I/O ports can’t be accessed
+ when in Kernel Mode, no restriction is put on the program

WHAT IS A KERNEL? (2/2)
+ besides running in Kernel Mode, kernels have three other peculiarities:
  + large size (millions of machine language instructions)
  + machine dependency (some parts of the kernel must be coded in Assembly language)
  + loading into RAM at boot time in a rather primitive way

ENTERING THE KERNEL PROGRAM (1/2)
+ when the CPU is running in User Mode
[Figure: diagram with the labels "Kernel Mode" and "User Mode"]

ENTERING THE KERNEL PROGRAM (2/2)
+ when the CPU is running in Kernel Mode
[Figure: diagram with the label "User Mode"]

NESTED KERNEL INVOCATIONS
+ some similarity with nested function calls
+ different because events causing kernel invocations are not (usually) related to the running program
[Figure: nested invocations labeled A, B, C]

KERNEL ENTRY POINTS
software interrupt ---> Kernel
I/O device requires attention ---> Kernel
time interval elapsed ---> Kernel
hardware failure ---> Kernel
faulty instruction ---> Kernel

IS AN INSTRUCTION REALLY FAULTY?
+ faulty instructions may occur for two distinct reasons:
  + programming error
  + deferred allocation of some kind of resource
+ the kernel must be able to identify the reason that caused the exception

EXCEPTIONS RELATED TO DEFERRED ALLOCATION
+ two cases of deferred allocation of resources in Linux:
  + page frames (demand paging, Copy On Write)
  + floating point registers
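
For the page frame case, the decision between "programming error" and "deferred allocation" might be sketched in user-space C as follows. This is only an illustrative model: the types, field names and the three-way result are hypothetical and much simpler than the kernel's real page fault handler, and the floating point case is left out.

```c
#include <stddef.h>

/* Hypothetical outcome of examining a page fault exception. */
enum fault_action {
    FAULT_SEGV,          /* programming error: address or access not allowed */
    FAULT_DEMAND_PAGE,   /* legal address, page frame not allocated yet */
    FAULT_COPY_ON_WRITE  /* legal write to a page frame still shared read-only */
};

/* Hypothetical memory region owned by the process: [start, end). */
struct region {
    unsigned long start, end;
    int writable;
};

/* Hypothetical description of the faulting access. */
struct fault {
    unsigned long addr;  /* faulting address */
    int is_write;        /* nonzero if the access was a write */
    int page_present;    /* nonzero if a page frame is already mapped */
};

static enum fault_action classify_fault(const struct region *regions, size_t n,
                                        const struct fault *f)
{
    for (size_t i = 0; i < n; i++) {
        const struct region *r = &regions[i];

        if (f->addr < r->start || f->addr >= r->end)
            continue;                    /* not in this region */
        if (f->is_write && !r->writable)
            return FAULT_SEGV;           /* write to a read-only region */
        if (!f->page_present)
            return FAULT_DEMAND_PAGE;    /* allocation was deferred */
        if (f->is_write)
            return FAULT_COPY_ON_WRITE;  /* page frame must be duplicated */
        return FAULT_SEGV;               /* not expected in this simple model */
    }
    return FAULT_SEGV;                   /* address outside every region */
}
```

Only the first and last outcomes differ for the process: a deferred allocation is resolved silently and the faulty instruction is re-executed, while a programming error is reported to the process.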

WHY IS A KERNEL SO COMPLEX?
+ large program with many entry points
+ must offer disk caching to lower average disk access time
+ must support nested kernel invocations --> must run with interrupts enabled most of the time
+ must be updated quite frequently to support new hardware circuits and devices

HW CONCURRENCY (1/2)
[Figure: I/O device, IRQ, I/O APIC, INT, CPU, INT ACK]
+ the I/O APIC polls the devices and issues interrupts
+ no new interrupt can be issued until the CPU acknowledges the previous one
+ good kernels run with interrupts enabled most of the time

HW CONCURRENCY (2/2)
+ Symmetric MultiProcessor architectures (SMP) include two or more CPUs
+ SMP kernels must be able to execute concurrently on the available CPUs
+ for example, one service routine related to networking runs on one CPU while another routine related to the file system runs concurrently on another CPU

LIMITING KERNEL SIZE
+ try to distribute kernel functions in smaller programs that can be linked separately
+ two approaches: microkernels and modules
+ Linux prefers modules for reasons of efficiency

MICROKERNELS
+ only a few functions, such as process scheduling and interprocess communication, are included in the microkernel
+ other kernel functions, such as memory allocation, file system handling, and device drivers, are implemented as system processes running in User Mode
+ microkernels introduce a lot of interprocess communication

MODULES (1/2)
+ modules are object files containing kernel functions that are linked dynamically to the kernel
+ Linux offers excellent support for implementing and handling modules

MODULES (2/2)
[Figure: object module mmm.o, whose external references to kernel symbols are resolved through the kernel symbol table]
+ thanks to the kernel symbol table, it is possible to defer linking of an object module
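
The role of the kernel symbol table in deferred linking can be illustrated with a small user-space sketch. The structures and function names below are hypothetical, and the addresses used with them are made up; the real module loader works on object file relocations rather than on a simple array of names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical kernel symbol table entry: exported name and its address. */
struct ksym {
    const char *name;
    unsigned long addr;
};

/* Hypothetical external reference inside an object module. */
struct mod_ref {
    const char *name;
    unsigned long addr;  /* 0 until the reference has been resolved */
};

/* Return the address of an exported symbol, or 0 if it is not exported. */
static unsigned long ksym_lookup(const struct ksym *tab, size_t n,
                                 const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0)
            return tab[i].addr;
    return 0;
}

/* Resolve every external reference of a module against the symbol table.
 * Returns the number of references left unresolved; a loader would
 * refuse to link the module unless this is 0. */
static size_t link_module(const struct ksym *tab, size_t n,
                          struct mod_ref *refs, size_t nrefs)
{
    size_t unresolved = 0;

    for (size_t i = 0; i < nrefs; i++) {
        refs[i].addr = ksym_lookup(tab, n, refs[i].name);
        if (refs[i].addr == 0)
            unresolved++;
    }
    return unresolved;
}
```

Because the lookup happens when the module is loaded, the module can be compiled long before it is linked to a running kernel.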

MODULES AND DISTRIBUTIONS
+ modern computer architectures based on PCI buses support autoprobing of installed I/O devices while booting the system
+ recent Linux distributions put all noncritical I/O drivers into modules
+ at boot time, only the I/O modules of identified I/O devices are dynamically linked to the kernel

SUPPORT FOR CLIENT/SERVER APPLICATIONS
+ scenario: many tasks executing concurrently on a common address space (for instance, a web server handling thousands of requests per second)
+ problem: implementing each client request as a new process causes a lot of overhead
+ process creation and elimination are time-consuming kernel functions

THE THREAD SOLUTION
+ introduce a new kernel object called thread
+ each process includes one or more threads
+ all threads associated with a given process share the same address space
+ CPU scheduling is done at the thread level (Windows NT)
+ thread switching is more efficient than process switching

THE CLONE SOLUTION
+ introduce groups of lightweight processes called clones that share a common address space, opened files, signals, etc.
+ CPU scheduling is done at the process level in a standard way
+ clones were introduced by Linux
+ the mpmt_pthread and dexter modules used by the Linux version of Apache 2.0 are both based on clones

LINUX PEARLS
+ we selected in a rather arbitrary way a few pearls related to two distinct kernel design areas:
  + clever design choices
  + efficient coding

CLEVER DESIGN CHOICES
+ isolate the architecture-dependent code
+ rely on the VFS abstraction
+ avoid over-designing

ISOLATE THE ARCHITECTURE-DEPENDENT CODE (1/2)
+ Linux source code includes two architecture-dependent directories: /usr/src/linux/arch and /usr/src/linux/include
  arch:    i386 ... s390
  include: asm-i386 ... asm-s390

ISOLATE THE ARCHITECTURE-DEPENDENT CODE (2/2)
+ the schedule() function invokes the switch_to() Assembly language function to perform process switching
+ the code for switch_to() is stored in the include/asm/system.h file
+ depending on the target system, the asm symbolic link is set to asm-i386, asm-s390, etc.

RELY ON THE VFS ABSTRACTION
+ VFS is an abstraction for representing several kinds of information containers (IC) in a common way
+ standard operations on ICs: open(), close(), seek(), ioctl(), read(), write()
+ VFS associates a logical inode with each opened IC
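
The idea of a common set of operations over different kinds of ICs can be sketched with a table of function pointers. The sketch below is a user-space illustration with hypothetical names (struct ic, struct ic_ops, ic_read) and only a read operation; it is not the kernel's actual VFS interface.

```c
#include <stddef.h>
#include <string.h>

struct ic;  /* an opened information container */

/* Operations table: each kind of IC supplies its own implementation. */
struct ic_ops {
    long (*read)(struct ic *c, char *buf, size_t len);
};

struct ic {
    const struct ic_ops *ops;  /* selected when the IC is opened */
    void *priv;                /* data private to this kind of IC */
    size_t pos;                /* current position inside the IC */
};

/* Generic entry point: callers never need to know the kind of IC. */
static long ic_read(struct ic *c, char *buf, size_t len)
{
    if (c->ops == NULL || c->ops->read == NULL)
        return -1;
    return c->ops->read(c, buf, len);
}

/* First kind of IC: a fixed buffer in memory, read sequentially. */
struct mem_buf {
    const char *data;
    size_t size;
};

static long mem_read(struct ic *c, char *buf, size_t len)
{
    const struct mem_buf *m = c->priv;
    size_t left = m->size - c->pos;

    if (len > left)
        len = left;            /* short read at the end of the buffer */
    memcpy(buf, m->data + c->pos, len);
    c->pos += len;
    return (long)len;
}

/* Second kind of IC: an endless source of zero bytes. */
static long zero_read(struct ic *c, char *buf, size_t len)
{
    (void)c;
    memset(buf, 0, len);
    return (long)len;
}

static const struct ic_ops mem_ops = { mem_read };
static const struct ic_ops zero_ops = { zero_read };
```

A caller fills a struct ic with &mem_ops or &zero_ops and then uses ic_read() for both; supporting a new kind of IC only requires a new operations table.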

EXAMPLES OF ICs
+ files stored in a disk-based filesystem
+ files stored in a network filesystem
+ disk partitions
+ kernel data structures (/proc filesystem)
+ RAM content (/dev/mem)
+ RAM disk (/dev/ram0)
+ serial port (/dev/ttyS0)

AVOID OVER-DESIGNING
+ the Linux scheduler is simple and works for most applications
+ no attempt to transform Linux into a real-time system

A GENERAL-PURPOSE SCHEDULER
+ the scheduler of System V Release 4 provides a set of class-independent routines that implement common services
+ object-oriented approach based on scheduling classes: the scheduler represents an abstract base class, and each scheduling class acts as a subclass

A HEATED DISCUSSION
+ If the Linux development community is not responsive to the end user community, refusing to incorporate necessary functionality on the basis of aesthetics, then that community will abandon Linux in favor of something else. Is that really what you want?
+ Yes - If it turns into a pile of shit they'll abandon it even faster. I'd rather have a decent OS that works and does the right thing for most people than a single OS that tries to do everything and does nothing right (Alan Cox)

EXAMPLES OF EFFICIENT CODING
+ retrieving the process descriptor of the running process
+ handling dynamic timers
+ catching invalid addresses passed as system call parameters

DESCRIPTOR OF THE RUNNING PROCESS (1/3)
+ classic solution: introduce an array current[NCPU] whose components point to the process descriptors of the processes running on the CPUs
+ clever solution: store the process Kernel Mode stack and the process descriptor at contiguous addresses, so that the address of the process descriptor can be derived from the value of the CPU stack pointer register (esp register)

DESCRIPTOR OF THE RUNNING PROCESS (2/3)
+ Kernel Mode stack and process descriptor are stored in 2 contiguous page frames (8 KB)
[Figure: variable-length Kernel Mode stack, esp, fixed-length process descriptor]

DESCRIPTOR OF THE RUNNING PROCESS (3/3)
[Figure: variable-length Kernel Mode stack, esp, fixed-length process descriptor]
value of esp register:                   0x00bdbad4
mask:                                    0xffffe000
starting address of process descriptor:  0x00bda000
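
The masking step can be checked with a few lines of user-space C. This is a sketch under the assumption stated in the slides (stack and descriptor share one 8 KB area aligned on an 8 KB boundary, with the descriptor at its start); the macro and function names are hypothetical. Clearing the 13 low-order bits of esp corresponds to the mask 0xffffe000.

```c
#include <stdint.h>

/* Assumption: Kernel Mode stack and process descriptor share one
 * 8 KB area (2 page frames of 4 KB) aligned on an 8 KB boundary. */
#define THREAD_AREA_SIZE 0x2000u

/* Return the starting address of the 8 KB area that contains the
 * stack pointer value esp. With the layout assumed above, this is
 * also the starting address of the process descriptor. */
static uint32_t descriptor_from_esp(uint32_t esp)
{
    /* ~(0x2000 - 1) == 0xffffe000: clears the 13 low-order bits */
    return esp & ~(THREAD_AREA_SIZE - 1u);
}
```

With esp equal to 0x00bdbad4 the function yields 0x00bda000, and no array lookup or memory access is needed.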

HANDLING DYNAMIC TIMERS (1/3)
+ I/O drivers and user applications may create hundreds of timers
+ find an efficient way to check at each timer interrupt whether at least one timer has expired
+ trivial solution: maintain a list of timers ordered by increasing expiration time and start checking from the first element of the list

HANDLING DYNAMIC TIMERS (2/3)
+ clever solution (timing wheel): use percolation and maintain strict ordering only for the next 256 ticks (in Linux-i386, one tick = 10 ms)
+ use several lists of timers

HANDLING DYNAMIC TIMERS (3/3)
tv1: slots 0 1 2 ... 255; index incremented by 1 once every tick
tv2: slots 0 1 2 ... 63; index incremented by 1 once every 256 ticks
+ when tv1 becomes empty, it is replenished by emptying one slot of tv2, and so forth
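
A two-level version of the timing wheel can be sketched in user-space C. This is a simplified model, not the kernel's timer code: only tv1 and tv2 are present, so timers more than 256 * 64 ticks ahead are rejected, and an expired timer is simply marked instead of having a handler invoked. All names are hypothetical.

```c
#include <stddef.h>

#define TV1_SLOTS 256UL  /* one slot per tick, covering the next 256 ticks */
#define TV2_SLOTS 64UL   /* one slot per interval of 256 ticks */

struct timer {
    unsigned long expires;   /* absolute tick at which the timer expires */
    unsigned long fired_at;  /* tick at which it was found expired */
    int fired;               /* nonzero once the timer has expired */
    struct timer *next;
};

struct wheel {
    unsigned long now;       /* next tick to be processed */
    struct timer *tv1[TV1_SLOTS];
    struct timer *tv2[TV2_SLOTS];
};

/* Insert a timer. Returns 0 on success, -1 if it is too far ahead. */
static int wheel_add(struct wheel *w, struct timer *t)
{
    struct timer **slot;
    unsigned long delta;

    if (t->expires < w->now)
        t->expires = w->now;           /* already due: expire at next tick */
    delta = t->expires - w->now;
    if (delta < TV1_SLOTS)
        slot = &w->tv1[t->expires % TV1_SLOTS];
    else if (delta < TV1_SLOTS * TV2_SLOTS)
        slot = &w->tv2[(t->expires / TV1_SLOTS) % TV2_SLOTS];
    else
        return -1;
    t->next = *slot;
    *slot = t;
    return 0;
}

/* Process one tick. Returns the number of timers that expired. */
static size_t wheel_tick(struct wheel *w)
{
    unsigned long tick = w->now;
    unsigned long i1 = tick % TV1_SLOTS;
    struct timer *t, *next;
    size_t fired = 0;

    if (i1 == 0) {
        /* tv1 has wrapped around: refill it from one slot of tv2 */
        unsigned long i2 = (tick / TV1_SLOTS) % TV2_SLOTS;

        t = w->tv2[i2];
        w->tv2[i2] = NULL;
        for (; t != NULL; t = next) {
            next = t->next;
            (void)wheel_add(w, t);     /* now less than 256 ticks ahead */
        }
    }
    t = w->tv1[i1];                    /* every timer here expires now */
    w->tv1[i1] = NULL;
    w->now = tick + 1;
    for (; t != NULL; t = next) {
        next = t->next;
        t->fired = 1;
        t->fired_at = tick;
        fired++;
    }
    return fired;
}
```

Each tick inspects a single tv1 slot, and the comparatively expensive redistribution from tv2 happens only once every 256 ticks, which is why no fully sorted list is needed.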

CATCHING INVALID ADDRESSES (1/4)
+ many system calls require one or more addresses specified as parameters
+ invalid addresses passed as parameters should not cause a system crash
+ classic solution: perform a preliminary check before servicing the system call
+ clever solution: defer checking until an exception caused by the invalid address occurs in Kernel Mode

CATCHING INVALID ADDRESSES (2/4)
+ deferred checking is more efficient since system calls are issued most of the time with correct parameters
+ if an addressing error occurs in Kernel Mode, the kernel must be able to distinguish whether it is caused by a faulty process or by a kernel bug
+ in the first case, the kernel sends a SIGSEGV signal to the faulty process

CATCHING INVALID ADDRESSES (3/4)
+ clever idea: force the kernel to always use the same group of functions when copying data to or from the process address space
+ if an addressing error occurs while doing that, the CPU will signal the address of the instruction that contained an invalid address operand

CATCHING INVALID ADDRESSES (4/4)
+ the kernel knows from the address of the faulty instruction that it belongs to one of the functions used to access data in the process address space
+ it can then execute some kind of “fixup code”: as a result, the system call returns an error code
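
One way to recognize such instructions is a sorted table that pairs the address of each instruction allowed to fault with the address of its fixup code. The lookup can be sketched in user-space C as below; the structure and function names are hypothetical, and the addresses used with them are made up for illustration.

```c
#include <stddef.h>

/* One entry per instruction that may legitimately fault while
 * accessing the process address space. */
struct ex_entry {
    unsigned long insn;   /* address of the instruction that may fault */
    unsigned long fixup;  /* address of the fixup code to resume at */
};

/* Binary search in a table sorted by increasing insn address.
 * Returns the fixup address, or 0 if the faulty instruction is not
 * in the table, in which case the fault cannot be blamed on a bad
 * system call parameter. */
static unsigned long search_ex_table(const struct ex_entry *tbl, size_t n,
                                     unsigned long fault_ip)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].insn == fault_ip)
            return tbl[mid].fixup;
        if (tbl[mid].insn < fault_ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

When the lookup succeeds, execution resumes at the fixup address and the system call returns an error code; when it fails, the fault came from some other kernel instruction.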