Week 4.b CS 5600 02/09 2022
https://naizhengtan.github.io/22spring/

On the board
------------
1. Last time
2. More about scheduling
3. Scheduling problem today
4. Lessons and conclusions
5. Threads

-----------------------------------------------

Admin
-- Lab0 commit id
   -- the one on Canvas will be the ground truth
-- How's Lab1?
   -- memory problems
   -- take the review session on Friday
-- Again, labs are not evaluations (exams are). They are ways to learn.
   -- read instructions carefully
--------

1. Last time

- scheduling algorithms
  -- six algorithms: FIFO, SJF, RR, Prio, MLFQ, Lottery

- Votes last time: (only candidates with >= 5 votes shown)
  "Best Turnaround Time":   STCF (48)
  "Best Response Time":     RR (34), MLFQ (10)
  "Best Fairness":          RR (33), MLFQ (10), Lottery (10)
  "Most popular algorithm": MLFQ (23), Lottery (14), STCF (5), Prio (5)

  -- different winners and (sort of) diverse answers indicate:
     -- there is no single "best" scheduling algorithm
     -- algorithms are metric-oriented (usually called workload-oriented)
  -- we can game the metrics with not-very-useful algorithms
     -- a "best response time" algo: FIFO + always schedule a newly arrived
        process for 0.0001 sec

2. More about scheduling

A. a note on MLFQ parameters:
   -- the scheduling algorithm for each queue (last time, RR)
   -- when to demote a process to a lower-priority queue
      (last time, a "downgrade" running time)
   -- when to promote a process to a higher-priority queue (last time, none)
   -- others:
      -- the number of queues (last time, 2)
      -- which queue a process enters when it arrives
         (last time, the high-priority queue)

B.
a note about the Lottery algorithm
   -- randomness is important in CS, and lottery scheduling is one example
      -- randomized algorithms in theory
      -- stochastic gradient descent in deep learning
      -- Bitcoin's proof-of-work; Ethereum's proof-of-stake
   -- lots of nice features
      -- can model tickets as currency, so there can be an exchange rate
         between real currencies (money) and lottery tickets
      -- deals with starvation (holding one ticket --> will make progress)
         -- don't have to worry that adding one high-priority job will
            starve all the others
         -- adding/deleting jobs affects all jobs proportionally
            (the total ticket count T gets bigger)
      -- can transfer tickets between processes: highly useful if a client
         is waiting for a server; the client can donate tickets to the
         server so it can run
   -- many other details
      -- ticket inflation for processes that don't use their whole slice
         (like feedback in MLFQ)
         -- use fraction f of the slice; inflate tickets by 1/f until the
            process next gets the CPU (improves fairness)
   -- disadvantages
      -- latency is unpredictable
      -- expected error is somewhat high
   -- in reaction to these disadvantages, Waldspurger and Weihl proposed
      *Stride Scheduling*
      -- basically, a deterministic version of lottery scheduling;
         less randomness --> less expected error
      -- see OSTEP (chap 9) for details

C. Incorporating I/O (relaxing an assumption from before)

   motivating example: 3 jobs
     P1, P2: both CPU-bound, each runs for a week
     P3: I/O-bound, a loop (1 ms of CPU, then 10 ms of disk I/O)

     process   arrival   running
     P1        0         1 week
     P2        0         1 week
     P3        0         30 sec (with 300 sec of I/O)

   By itself, P3 uses ~90% of the disk.
   By itself, P1 or P2 uses 100% of the CPU.

   Question: what happens if we use FIFO? (arrival order: P1, P2, P3)
   -- P1 and P2 will run, keeping the CPU for 2 weeks

   What about RR with a 100 ms time slice?
   -- only get ~5% disk utilization
      CPU:  [P1:100] [P2:100] [P3:1] [P1:100] ...
      disk: [---------idle---------] [P3:10] [--] ...

   What about RR with a 1 ms time slice?
   -- get nearly 90% disk utilization
   -- but lots of preemptions
      CPU:  [P1:1] [P2:1] [P3:1] [P1:1] [P2:1] [P1:1] ...
      disk: [------------------] [P3:10 ...
   What about STCF?
   -- would give us good disk utilization ...
      CPU:  [P3:1] [P1:10] [P3:1] [P1:10] ...
      disk: [----] [P3:10] [----] [P3:10] ...
      (what about P2? starvation)
   -- advantages of STCF
      -- disk utilization (see above)
      -- low overhead (no needless preemptions)
   -- disadvantages of STCF
      -- long-running jobs can get starved
      -- requires predicting the future

D. Predicting the future

   We cannot really predict the future. However, we can estimate the
   future based on the past (a standard approach when designing systems):
   -- exponentially weighted averaging is a good idea
      -- t_n: length of the process's nth CPU burst
      -- \tau_{n+1}: estimate for the (n+1)th burst
      -- choose \alpha, 0 < \alpha <= 1
      -- set \tau_{n+1} = \alpha * t_n + (1 - \alpha) * \tau_n
      -- this is called an exponentially weighted moving average (EWMA)
      -- it reacts to changes, but smoothly

[skipped]
E. What Linux does
   -- the current Linux scheduler (post 2.6.23), called CFS (Completely
      Fair Scheduler), roughly reinvented the ideas of Stride Scheduling
   -- if you're interested in CFS:
      "Inside the Linux 2.6 Completely Fair Scheduler"
      https://developer.ibm.com/tutorials/l-completely-fair-scheduler/
   [maybe talk about "design vs. implementation"]

3. Scheduling problems in a new era

- you may have the illusion that scheduling is "easy"
- a problem in practice: the Reconfigurable Machine Scheduling problem (RMS)
- the scheduling problem in practice:
  -- the NVIDIA A100 GPU has a new feature: Multi-Instance GPU (MIG)
  -- MIG can partition one GPU into 7 small GPUs (call each a 1/7 instance)
  -- small GPUs can merge into larger ones: two 1/7 instances => one 2/7
     instance (with limitations)
  -- different deep neural network (DNN) jobs have non-linear performance
     on different-sized instances
     -- for example, for inceptionv3, two 1/7 instances have higher
        throughput than one 2/7 instance
- Question: given a set of DNN jobs and a set of GPUs, how do we maximize
  throughput?
  (or minimize #GPUs used)
  -- this is an NP-complete problem
  -- see a viable solution in our paper:
     "Serving DNN Models with Multi-Instance GPUs: A Case of the
      Reconfigurable Machine Scheduling Problem"
     https://arxiv.org/pdf/2109.11067.pdf
- point: scheduling has become complicated and important again in the new
  era (e.g., cloud computing and AI)

4. Scheduling lessons and conclusions

-- Scheduling comes up all over the place
   -- m requests share n resources
   -- disk arm: which read/write request to do next?
   -- memory: which process to take a physical page from?
-- This topic was popular in the days of time sharing, when there was a
   shortage of resources all around, but many scheduling problems stop
   being interesting when you can just buy a faster CPU or a faster network.
-- Exception 1: web sites and large-scale networks often cannot be made
   fast enough to handle peak demand (flash crowds, attacks), so scheduling
   becomes important again. For example, we may want to prioritize paying
   customers or counter denial-of-service attacks.
-- Exception 2: real-time systems:
     soft real time: miss a deadline and the CD or MPEG decode will skip
     hard real time: miss a deadline and the plane will crash
   Plus, at some level, every system with a human at the other end is a
   real-time system. If a Web server delays too long, the user gives up.
   So the ultimate effect of the system may in fact depend on scheduling!
-- Cloud computing (huge datacenters) makes scheduling "fancy" again...
   ...with new problems (like RMS)
   ...and a "scale effect": improving CPU efficiency by 1% on your laptop
   is not interesting, whereas saving 1% of CPU time for Google is a BIG
   deal!
-- In principle, scheduling decisions shouldn't affect a program's results
   -- This is good, because it's rare to be able to calculate the best
      schedule
   -- So instead, we build the kernel so that it's correct to do a context
      switch and restore at any time, and then *any* schedule will get the
      right answer for the program
   -- This is a case of a concept that comes up a fair bit in computer
      systems: the policy/mechanism split. Here, the *mechanism* allows
      the OS to switch at any time, while the *policy* determines when to
      switch in order to meet whatever goals the scheduler designer wants.
-- But there are cases when the schedule *can* affect correctness
   -- multimedia: delay too long, and the result looks or sounds wrong
   -- Web server: delay too long, and users give up

5. Threads

[ask how many students have used threads]

Interface to threads:

    tid thread_create(void (*fn)(void *), void *arg);
        Create a new thread that runs fn with arg

    void thread_exit();

    void thread_join(tid thr);
        Wait for the thread with tid 'thr' to exit

    plus a lot of synchronization primitives, which we'll see in the next
    lecture

Assume for now that threads are:
-- an abstraction created by the OS
-- preemptively scheduled

thread vs. process: a frequently asked question
-- threads share memory, but each has its own "execution context"
   (registers and stack)
-- similar to the PCB, there is a TCB (thread control block):
   -- thread identifier: a unique id (tid) assigned to every new thread
   -- stack pointer: points to the thread's stack in the process
   -- program counter: points to the thread's current instruction
   -- state of the thread (running, ready, waiting, start, done)
   -- the thread's register values
   -- pointer to the process control block (PCB) of the process that the
      thread lives in

[Google uses user-level threads for performance. See the talk
"User-level threads....... with threads."
by Paul Turner from Google:
https://www.youtube.com/watch?v=KXuZi9aeGTw ]

[draw abstract picture of threads in a process: own registers, shared memory]

Question: if you were the OS designer, what would you do if a thread
calls fork()?

[Linux: only the thread that called fork() remains in the child process]