Effect of Context Aware Scheduler on TLB
Satoshi Yamada and Shigeru Kusakabe, Kyushu University
Contents
• Introduction
• Effect of Sibling Threads on TLB
• Context Aware Scheduler (CAS)
• Benchmark Applications and Measurement Environment
• Result
• Related Work
• Conclusion
Contents
• Introduction
  – What is Context?
  – Motivation
  – Task Switch and Cache
  – Approach of our Scheduler
• Effect of Sibling Threads on TLB
• Context Aware Scheduler (CAS)
• Benchmark Applications and Measurement Environment
• Result
• Related Work
• Conclusion
What is Context?
• Definition in this presentation: context = memory address space
• A task switch changes the running task; when it also changes the memory address space, it is a context switch
Motivation
• More chances of using native threads in OSes today
  – Java, Perl, Python, Erlang, and Ruby
  – OpenMP, MPI
• As the number of threads grows, the overhead due to task switches tends to get heavier
  – Agarwal, et al., “Cache performance of operating system and multiprogramming workloads” (1988)
Task Switch and Cache
• Overhead due to a task switch
  – includes the cost of loading the working set of the next process
  – is deeply related to the utilization of caches
  – Mogul, et al., “The effect of context switches on cache performance” (1991)
[Figure: when the working sets of processes A and B together overflow the cache, switching from A to B evicts A's working set]
Approach of our Scheduler
• Three solutions to reduce the overhead due to task switches
  – Agarwal, et al., “Cache performance of operating system and multiprogramming workloads” (1988)
  1. Increase the size of caches
  2. Reuse the data shared among threads
  3. Utilize tagged caches and/or restrain cache flushes
• We utilize sibling threads to achieve 2 and 3
• We mainly discuss 3
Contents
• Introduction
• Effect of Sibling Threads on TLB
  – Working Set and Task Switch
  – TLB Tag and Task Switch
  – Advantage of Sibling Threads
  – Effect of Sibling Threads on Task Switches
• Context Aware Scheduler (CAS)
• Benchmark Applications and Measurement Environment
• Result
• Related Work
• Conclusion
Working Set and Task Switch
• Task switch with small overhead: the working sets of A and B fit in the cache together, so little reloading occurs
• Task switch with large overhead: the working set of B evicts that of A, which must be reloaded later
[Figure: cache contents before and after a switch in each case]
TLB and Task Switch
• Tagged TLB: each entry carries a context tag beside the virtual-to-physical mapping, e.g. (context 0x0123, virtual address 2056, physical address 0x4567), so entries of different contexts can coexist
• Non-tagged TLB: entries hold only the virtual-to-physical mapping, so entries of different address spaces cannot be told apart
✓ Tagged TLB: TLB flush is not necessary (ARM, MIPS, etc.)
✓ Non-tagged TLB: TLB flush is necessary (x86, etc.)
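The difference between the two TLB kinds above can be sketched as a toy model in C. This is only an illustration of the lookup/flush behavior, not a real TLB design; the struct names and the entry count are invented for the sketch.

```c
/* toy model of the slide's two TLB kinds: a tagged TLB keeps a context
   (address-space) tag per entry and needs no flush on a context switch;
   a non-tagged TLB must instead be flushed wholesale */
#include <stdbool.h>
#include <string.h>

#define TLB_ENTRIES 4

struct tlb_entry {
    bool         valid;
    unsigned int context;        /* tag; absent on a non-tagged TLB */
    unsigned int vpage, ppage;
};

struct tlb { struct tlb_entry e[TLB_ENTRIES]; };

/* tagged TLB: hit only if both the virtual page AND the context match,
   so entries of other address spaces are simply ignored, not flushed */
static bool tagged_lookup(const struct tlb *t, unsigned int ctx,
                          unsigned int vpage, unsigned int *ppage) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].context == ctx && t->e[i].vpage == vpage) {
            *ppage = t->e[i].ppage;
            return true;
        }
    return false;
}

/* non-tagged TLB: entries carry no context, so a switch of address
   spaces must invalidate everything (the x86-style TLB flush) */
static void flush_on_context_switch(struct tlb *t) {
    memset(t, 0, sizeof *t);
}
```

Switching between sibling threads keeps the context unchanged, so in this model no entry becomes stale and `flush_on_context_switch` never has to run.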
Advantage of Sibling Threads
• fork() creates a PROCESS: the child gets its own copies of mm_struct (address space), signal_struct, and open files
• clone() creates a THREAD: the child shares mm_struct with its parent; threads sharing an mm_struct are sibling threads
• Advantages on task switches
  – Higher possibility of sharing data among sibling threads
  – A context switch does not happen
  – Restrains TLB flushes on non-tagged TLBs
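The fork()/clone() distinction above can be observed from user space. A minimal sketch using POSIX fork() and pthreads (which Linux implements via clone() with CLONE_VM): a sibling thread's write to a global is visible to the parent, while a forked child's write only touches the child's copy of the address space. The variable and function names are illustrative.

```c
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared;

static void *thread_fn(void *arg) {
    (void)arg;
    shared = 42;               /* a clone()-style sibling writes the SAME memory */
    return NULL;
}

/* returns the parent's view of `shared` after a sibling thread wrote it */
static int after_thread_write(void) {
    pthread_t t;
    shared = 0;
    pthread_create(&t, NULL, thread_fn, NULL);
    pthread_join(t, NULL);
    return shared;             /* one mm_struct: the write is visible */
}

/* returns the parent's view of `shared` after a fork()ed child wrote it */
static int after_fork_write(void) {
    shared = 0;
    pid_t pid = fork();
    if (pid == 0) {
        shared = 99;           /* modifies only the child's COPY of the space */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return shared;             /* fork duplicated the address space */
}
```

The shared mm_struct is exactly what lets the kernel skip the context switch (and, on non-tagged TLBs, the flush) when the next runnable thread is a sibling.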
Effect of Sibling Threads on Task Switches: Measurement
• We use the idea of the lat_ctx program in LMbench
[Figure: processes or sibling threads switch in a ring, each touching its working set]
Effect of Sibling Threads on Task Switches: Results
Ratios (sibling threads / processes) per working-set size:

working set (KB) | L1 cache | L2 cache | TLB misses | Elapsed Time
0    | 0.76 | 1.42 | 0.28 | 0.81
8    | 0.46 | 2.84 | 0.22 | 0.84
16   | 0.73 | 2.17 | 0.20 | 0.82
128  | 0.87 | 1.24 | 0.10 | 0.80
512  | 0.90 | 1.33 | 0.26 | 0.67
1024 | 1.07 | 0.86 | 0.97 | 0.86
1408 | 1.03 | 0.99 | 0.98 | 0.91
1536 | 1.02 | 0.97 | 0.98 | 0.83
Contents
• Introduction
• Effect of Sibling Threads on TLB
• Context Aware Scheduler (CAS)
  – O(1) Scheduler in Linux
  – Context Aware Scheduler (CAS)
• Benchmark Applications and Measurement Environment
• Result
• Related Work
• Conclusion
O(1) Scheduler in Linux
• Structure
  – active queue and expired queue
  – priority bitmap and an array of linked lists of threads
• Behavior
  – search the priority bitmap and choose a thread with the highest priority
• Scheduling overhead
  – independent of the number of threads
[Figure: priority bitmap with per-priority run lists feeding the processor]
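The constant-time pick described above can be sketched in a few lines of C. This is a simplified model, not the kernel's code: the real O(1) scheduler uses 140 priority levels and a multi-word bitmap, while the sketch uses 32 priorities so one find-first-set covers the whole bitmap.

```c
/* sketch of the O(1) pick-next step: a priority bitmap plus one run
   list per priority; find-first-set locates the highest runnable
   priority in constant time, independent of the number of threads */
#include <strings.h>   /* ffs() */

#define NPRIO 32       /* simplified; Linux uses 140 priority levels */

struct runqueue {
    unsigned int bitmap;        /* bit p set => run list p is non-empty */
    int          head[NPRIO];   /* head thread id per priority (-1 = empty) */
};

/* returns the thread id at the highest runnable priority, or -1 if idle */
static int pick_next(const struct runqueue *rq) {
    int p = ffs(rq->bitmap);    /* 1-based index of lowest set bit, 0 if none */
    return p ? rq->head[p - 1] : -1;
}
```

Here a lower bit index stands for a higher priority, matching the Linux convention that smaller priority numbers are more urgent.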
Context Aware Scheduler (CAS) (1/2)
• CAS creates auxiliary runqueues per context, beside the regular O(1) scheduler runqueue
• CAS compares Preg and Paux
  – Preg: the highest priority in the regular O(1) scheduler runqueue
  – Paux: the highest priority in the auxiliary runqueue of the current context
• If Preg − Paux ≤ threshold, CAS chooses the thread at Paux
[Figure: regular O(1) runqueue and per-context auxiliary runqueues]
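The selection rule above fits in one function. A minimal sketch, following the slide's formula literally and assuming its convention that a larger value means a higher priority; the function name is invented for the sketch.

```c
/* sketch of the CAS selection rule: prefer the best thread of the
   CURRENT address space (priority paux) over the globally best thread
   (priority preg) as long as the priority sacrifice stays within the
   threshold, because staying in the same context avoids a TLB flush */
static int cas_picks_aux(int preg, int paux, int threshold) {
    return (preg - paux) <= threshold;
}
```

The threshold is the tuning knob the results slides vary: threshold 0 degenerates to the plain O(1) policy, while a larger threshold trades some priority strictness for fewer context switches.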
Context Aware Scheduler (CAS) (2/2)
• Example with five threads A–E in two address spaces
  – O(1) scheduler order A B C D E: 4 context switches
  – CAS with threshold 2, order A C E B D: 1 context switch
[Figure: regular runqueue vs. auxiliary runqueues and the two resulting schedules]
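The example above can be checked with a toy simulation: given a schedule, count the address-space changes along it. The thread-to-context assignment below is an assumption chosen to reproduce the slide's 4-versus-1 outcome, not data from the paper.

```c
/* toy check of the slide's example: five runnable threads in two
   address spaces; aggregating same-context threads back to back
   (as CAS does) cuts the context-switch count from 4 to 1 */
struct sched_thread { char name; int ctx; };

/* count address-space changes along a given schedule */
static int count_ctx_switches(const struct sched_thread *order, int n) {
    int switches = 0;
    for (int i = 1; i < n; i++)
        if (order[i].ctx != order[i - 1].ctx)
            switches++;
    return switches;
}
```

With A, C, E in context 0 and B, D in context 1, the O(1) order A B C D E alternates contexts on every step, while the CAS order A C E B D crosses contexts only once.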
Contents
• Introduction
• Effect of Sibling Threads on TLB
• Context Aware Scheduler (CAS)
• Benchmark Applications and Measurement Environment
  – Measurement Environment
  – Benchmarks
  – Measurements
  – Scheduler
• Result
• Related Work
• Conclusion
Measurement Environment
• Intel Core 2 Duo 1.86 GHz
Spec of each memory hierarchy level (size / latency):
TLB    | 256 entries / 1 ns
L1 cache | 32 KB / 3 ns
L2 cache | 2 MB / 14 ns
Memory | 1 GB / 149 ns
Benchmarks
Benchmark | Options | # of threads | Static Priority | Working Set (bytes)
Volano Benchmark (Volano) | default | 800 | 25 | 600 K
DaCapo Benchmark suite (DaCapo) | lusearch program, large size | 70 | 15 | 5 M
Chat Benchmark (Chat) | 10 rooms, 20 members, 5000 messages | 800 | 15 | 10 K
SysBench benchmark suite (SysBench) | memory program, block size: 512 KB, total size: 30 GB | 30 | 25 | 512 K
Measurements
• The four applications run together, each spawning many threads (chat0 … chatM, SysBench0 … SysBenchN, Volano0 … VolanoX, DaCapo0 … DaCapoY)
• Process time of Chat = chat0 + chat1 + … + chatM, and likewise for the others
• We measure
  – DTLB and ITLB misses (user/kernel spaces)
  – process time of each application
  – elapsed time of executing the 4 applications
Scheduler
• O(1) scheduler in Linux 2.6.21
• CAS with thresholds 1 and 10 (CAS:1 and CAS:10 in the results)
Contents
• Introduction
• Effect of Sibling Threads on TLB
• Context Aware Scheduler (CAS)
• Benchmark Applications and Measurement Environment
• Result
  – TLB Misses
  – Process Time
  – Elapsed Time
  – Comparison with the Completely Fair Scheduler
• Related Work
• Conclusion
TLB misses (million times)
OS | Data TLB user | Data TLB kernel | Instruction TLB user | Instruction TLB kernel
O(1)   | 98 (1.00) | 360 (1.00) | 105 (1.00) | 29 (1.00)
CAS:1  | 68 (0.69) | 262 (0.73) | 59 (0.57)  | 21 (0.73)
CAS:10 | 56 (0.57) | 222 (0.62) | 43 (0.41)  | 21 (0.73)
Why is a larger threshold better?
• A larger threshold can aggregate more threads of the same context into consecutive slots
• Dynamic priority works against CAS: priority adjustments spread sibling threads across priority levels, so a small threshold cannot keep them together
[Figure: runqueues under a small vs. a large threshold]
Process Time (seconds)
OS | Volano | DaCapo | Chat | SysBench | total
O(1)   | 9.34 (1.00) | 27.41 (1.00) | 99.83 (1.00) | 0.45 (1.00) | 137.03 (1.00)
CAS:1  | 9.28 (0.99) | 27.36 (0.99) | 48.50 (0.47) | 0.44 (0.97) | 85.33 (0.69)
CAS:10 | 8.75 (0.93) | 27.27 (0.99) | 29.29 (0.28) | 0.42 (0.93) | 65.73 (0.57)
Elapsed Time (seconds)
OS | Volano | DaCapo | Chat | SysBench | Total
O(1)   | 125 (1.00) | –         | 100 (1.00) | 137 (1.00) | 170 (1.00)
CAS:1  | 79 (0.63)  | 72 (0.58) | 51 (0.51)  | 87 (0.64)  | 112 (0.65)
CAS:10 | 62 (0.50)  | 26 (0.21) | 30 (0.31)  | 40 (0.30)  | 89 (0.52)
Comparison with the Completely Fair Scheduler (CFS)
• What is CFS?
  – Introduced in Linux 2.6.23
  – Drops the heuristic calculation of dynamic priority
  – Does not consider the address space in scheduling
• Why compare?
  – To investigate whether applying the CAS idea to CFS is valuable
  – Can the CAS idea reduce TLB misses and process time in CFS?
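The CFS policy the slide contrasts with can be sketched minimally: pick the runnable thread with the smallest virtual runtime. This is a simplified model, not the kernel's implementation; real CFS keeps tasks sorted in a red-black tree and picks the leftmost node, while a linear scan suffices to show the policy.

```c
/* minimal sketch of the CFS pick: choose the runnable thread that has
   received the least virtual runtime so far; note that the address
   space plays no role in the decision, which is why the slide asks
   whether the CAS idea could also help CFS */
struct cfs_task { unsigned long long vruntime; int id; };

/* returns the id of the task with the minimum vruntime, or -1 if none */
static int cfs_pick_next(const struct cfs_task *rq, int n) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (best < 0 || rq[i].vruntime < rq[best].vruntime)
            best = i;
    return best < 0 ? -1 : rq[best].id;
}
```

Since only vruntime drives the choice, two sibling threads and two unrelated processes look identical to this policy; a CAS-style variant would additionally weigh whether a candidate shares the current address space.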
TLB misses (million times)
OS | Data TLB user | Data TLB kernel | Instruction TLB user | Instruction TLB kernel
O(1)   | 98 (1.00)  | 360 (1.00) | 105 (1.00) | 29 (1.00)
CAS:1  | 68 (0.69)  | 262 (0.73) | 59 (0.57)  | 21 (0.73)
CAS:10 | 56 (0.57)  | 222 (0.62) | 43 (0.41)  | 21 (0.73)
CFS    | 120 (1.23) | 274 (0.76) | 60 (0.57)  | 60 (0.80)
Process Time and Total Elapsed Time (seconds)
OS | Volano | DaCapo | Chat | SysBench | total process time | total elapsed time
O(1)   | 9.34 (1.00)  | 27.41 (1.00) | 99.83 (1.00) | 0.45 (1.00) | 137.03 (1.00) | 170 (1.00)
CAS:1  | 9.28 (0.99)  | 27.36 (0.99) | 48.50 (0.47) | 0.44 (0.97) | 85.33 (0.62)  | 112 (0.65)
CAS:10 | 8.75 (0.93)  | 27.27 (0.99) | 29.29 (0.28) | 0.42 (0.93) | 65.73 (0.47)  | 89 (0.52)
CFS    | 12.23 (1.32) | 31.57 (1.15) | 28.56 (0.28) | 0.36 (0.80) | 72.72 (0.53)  | 89 (0.52)
Contents
• Introduction
• Effect of Sibling Threads on TLB
• Context Aware Scheduler (CAS)
• Benchmark Applications and Measurement Environment
• Result
• Related Work
• Conclusion
Sujay Parekh, et al., “Thread Sensitive Scheduling for SMT Processors” (2000)
• Parekh's scheduler
  – tries out groups of threads executing in parallel and samples information about
    • IPC
    • TLB misses
    • L2 cache misses, etc.
  – schedules based on the sampled information
• Phases: Sampling Phase → Scheduling Phase
Contents
• Introduction
• Effect of Sibling Threads on TLB
• Context Aware Scheduler (CAS)
• Benchmark Applications and Measurement Environment
• Result
• Related Work
• Conclusion
Conclusion
• Conclusion
  – CAS is effective in reducing TLB misses
  – CAS enhances the throughput of every application
• Future Work
  – Evaluation on other architectures
  – Applying the CAS idea to the CFS scheduler
  – Extension to SMP platforms
Additional Slides
Effect of Sibling Threads on Context Switches (counts)
working set (KB) | L1 (process / thread) | L2 (process / thread) | TLB (process / thread)
0    | 10.6 K / 8.1 K  | 73 / 104          | 43.9 K / 12.2 K
8    | 151 K / 69.8 K  | 37 / 105          | 54.9 K / 12.3 K
16   | 2444 K / 1777 K | 46 / 100          | 62.0 K / 12.4 K
128  | 2.55 M / 2.21 M | 180 / 224         | 144 K / 13.7 K
512  | 10.8 M / 9.81 M | 162 K / 215 K     | 444 K / 117 K
1024 | 43.4 M / 46.5 M | 4102 K / 3536 K   | 883 K / 854 K
1408 | 88.3 M / 91.1 M | 9493 K / 9434 K   | 1.19 M / 1.16 M
1536 | 100 M / 102 M   | 1.10 M / 1.07 M   | 1.29 M / 1.27 M
Result of Cache Misses (thousand times)
OS | L1 Inst Cache | L1 Data Cache | L2 Cache
O(1)   | 4,514 (1.00) | 36,614 (1.00) | 120 (1.00)
CAS:1  | 3,572 (0.79) | 34,972 (0.96) | 121 (1.01)
CAS:10 | 751 (0.17)   | 27,776 (0.76) | 130 (1.09)
CFS    | 971 (0.22)   | 33,923 (0.93) | 159 (1.33)
Result of Cache Misses (thousand times)
OS | L1 Data user | L1 Data kernel | L1 Instruction user | L1 Instruction kernel | L2 user | L2 kernel
O(1)   | 12,561 (1.00) | 20,883 (1.00) | 512 (1.00) | 3,456 (1.00) | 56.40 (1.00) | 63.64 (1.00)
CAS:1  | 12,738 (1.01) | 16,520 (0.79) | 519 (1.01) | 745 (0.22)   | 56.13 (1.00) | 65.60 (1.03)
CAS:10 | 11,601 (0.92) | 14,872 (0.71) | 446 (0.87) | 282 (0.08)   | 54.70 (0.97) | 76.26 (1.20)
CFS    | 14,785 (1.18) | 15,840 (0.76) | 355 (0.69) | 365 (0.11)   | 82.64 (1.47) | 77.16 (1.21)
Memory Consumption of CAS
• Additional memory consumption of CAS
  – About 40 bytes per thread
  – About 150 KB per thread group
  – With 6 thread groups and 1700 threads: 6 × 150 KB + 1700 × 40 B ≈ 970 KB
Effective and Ineffective Cases of CAS
• Effective case
  – Consecutive threads share a certain amount of cached data
• Ineffective case
  – Consecutive threads do not share data
[Figure: overlapping vs. disjoint working sets of A and B]
Pranay Koka, et al., “Opportunities for Cache Friendly Process” (2005)
• Koka's scheduler
  – traces the execution of each thread
  – focuses on the memory space shared between threads
• Phases: Tracing Phase → Scheduling Phase
Extension to SMP
• Aggregate sibling threads onto a limited number of processors
[Figure: CPU 0 and CPU 1]
Extension to SMP
• Execute threads with the same address space in parallel
[Figure: CPU 0 and CPU 1]
TLB Misses and Total Elapsed Time
OS | Data TLB user (million) | Data TLB kernel (million) | Instruction TLB user (million) | Instruction TLB kernel (million) | Total Elapsed Time (s)
O(1)   | 98 (1.00)  | 360 (1.00) | 105 (1.00) | 29 (1.00) | 170 (1.00)
CAS:1  | 68 (0.69)  | 262 (0.73) | 59 (0.57)  | 21 (0.73) | 112 (0.65)
CAS:10 | 56 (0.57)  | 222 (0.62) | 43 (0.41)  | 21 (0.73) | 89 (0.52)
CFS    | 120 (1.23) | 274 (0.76) | 60 (0.57)  | 60 (0.80) | 89 (0.52)
Widely Spread Multithreading
• Multithreading hides the latency of disk I/O and network access
• Threads in many languages, e.g. Java, Perl, and Python, correspond to OS threads
[Figure: Thread A waits on disk while Thread B runs]
• More context switches happen today
• The process scheduler in the OS is more responsible for system performance
Context Aware (CA) Scheduler
• Our CA scheduler aggregates sibling threads
• Linux O(1) scheduler order A B C D E: context switches between processes, 3 times
• CA scheduler, aggregated order: context switches between processes, 1 time
[Figure: the two resulting schedules]
Results of Context Switch (microseconds)
[Figure: context-switch overhead vs. working set for processes A, B, and C; the L2 cache size is 2 MB]
Overhead due to a context switch, measured by lat_ctx in LMbench:
working set (KB) | Process (μs) | Threads (μs) | Threads − Process (μs) | Threads / Process
0    | 1.88   | 1.52   | −0.36  | 0.81
8    | 1.97   | 1.66   | −0.31  | 0.84
16   | 2.43   | 1.99   | −0.44  | 0.82
128  | 2.12   | 1.70   | −0.42  | 0.80
512  | 2.85   | 1.92   | −0.93  | 0.67
1024 | 85.53  | 73.60  | −11.93 | 0.86
1408 | 213.12 | 195.68 | −17.44 | 0.92
1536 | 243.73 | 203.78 | −39.95 | 0.84
Fairness
• The O(1) scheduler keeps fairness by epochs
  – cycles of the active queue and the expired queue
• The CA scheduler also follows epochs
  – guarantees the same level of fairness as the O(1) scheduler
[Figure: priority bitmaps of the active and expired queues]
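The epoch cycle referenced above can be sketched as the classic O(1) trick: when the active queue drains, the active and expired arrays swap by exchanging two pointers. A simplified model, assuming the bitmap summarizes runnability; the struct layout is invented for the sketch, not the kernel's.

```c
/* sketch of the O(1) epoch mechanism: threads that used up their
   timeslice wait in the expired array; once the active array is
   empty, a constant-time pointer swap starts the next epoch, so no
   thread can starve the others of CPU time */
struct prio_array {
    unsigned int bitmap;        /* bit p set => run list p is non-empty */
    /* ... per-priority run lists elided ... */
};

struct epoch_rq {
    struct prio_array arrays[2];
    struct prio_array *active, *expired;
};

static void maybe_start_epoch(struct epoch_rq *rq) {
    if (rq->active->bitmap == 0) {     /* every thread finished its slice */
        struct prio_array *tmp = rq->active;
        rq->active  = rq->expired;     /* expired threads get a fresh epoch */
        rq->expired = tmp;
    }
}
```

Because CAS (and the CA scheduler) only reorders picks within an epoch, the swap above still bounds how long any thread can be deferred, which is the fairness argument the slide makes.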
Influence of Sibling Threads on the Overhead of a Context Switch
Ratio of each event (process / sibling threads):
working set (KB) | L1 | L2 | TLB | Elapsed Time
0    | 1.31 | 0.70 | 3.59  | 1.23
8    | 2.17 | 0.35 | 4.46  | 1.18
16   | 1.38 | 0.46 | 5.00  | 1.22
128  | 1.15 | 0.80 | 10.49 | 1.24
512  | 1.11 | 0.75 | 3.78  | 1.48
1024 | 0.93 | 1.16 | 1.03  | 1.16
1408 | 0.97 | 1.01 | 1.02  | 1.08
1536 | 0.98 | 1.03 | 1.02  | 1.19
Results of TLB Misses (million times)
OS | Data TLB | Instruction TLB
O(1)  | 664 (1.00) | 135 (1.00)
CA:1  | 626 (0.94) | 119 (0.88)
CA:10 | 457 (0.68) | 66 (0.48)
CFS   | 581 (0.87) | 117 (0.86)
• The CA scheduler significantly reduces TLB misses
Effect on Process Time (seconds)
OS | Volano | DaCapo | Chat | SysBench
O(1)  | 9.34 (1.00)  | 27.41 (1.00) | 50.83 (1.00) | 0.45 (1.00)
CA:1  | 9.28 (0.99)  | 27.36 (0.99) | 24.25 (0.47) | 0.44 (0.97)
CA:10 | 8.75 (0.93)  | 27.27 (0.99) | 14.29 (0.28) | 0.42 (0.93)
CFS   | 12.23 (1.32) | 31.57 (1.15) | 14.27 (0.28) | 0.36 (0.80)
• The CA scheduler gives a benefit to the process time of every application
• CA is especially effective in the Chat application
Effect on Elapsed Time (seconds)
OS | Volano | DaCapo | Chat | SysBench | Total
O(1)  | 151 (1.00) | 28.38 (1.00) | 110 (1.00) | 193 (1.00) | 170 (1.00)
CA:1  | 148 (0.98) | 27.35 (0.96) | 97 (0.88)  | 180 (0.93) | 112 (0.65)
CA:10 | 78 (0.51)  | 27.30 (0.96) | 30 (0.27)  | 114 (0.59) | 89 (0.52)
CFS   | 38 (0.25)  | 83.78 (2.95) | 40 (0.36)  | 99 (0.51)  | 89 (0.52)
• The CA scheduler reduces the total elapsed time by 48%
Measuring Tools
• Perfctr to count the TLB misses and the total elapsed time
• GNU's time command to measure the process time
• Counters implemented in each application to measure elapsed time
TLB Flush in a Context Switch
• Example of x86 processors: a switch of memory address spaces triggers a TLB flush, except for the small number of entries with the G (global) flag
• When switching between sibling threads, TLB entries are not flushed