Week 4.b CS 5600 02/09 2022
https://naizhengtan.github.io/22spring/

On the board
------------
1. Last time
2. More about scheduling
3. Scheduling problem today
4. Lessons and conclusions
5. Threads

-----------------------------------------------

Admin
-- Lab0 commit id
   -- the one on Canvas will be the ground truth
-- How's Lab1?
   -- memory problems
   -- take the review session on Friday
-- Again, labs are not evaluations (exams are). They are ways to learn.
   -- read instructions carefully
--------

1. Last time

- scheduling algorithms
  -- six algorithms: FIFO, SJF, RR, Prio, MLFQ, Lottery

- Votes last time: (only candidates with >= 5 votes shown)
  "Best Turnaround Time":   STCF (48)
  "Best Response Time":     RR (34), MLFQ (10)
  "Best Fairness":          RR (33), MLFQ (10), Lottery (10)
  "Most popular algorithm": MLFQ (23), Lottery (14), STCF (5), Prio (5)

  -- different winners and (sort of) diverse answers indicate:
     -- there is no single "best" scheduling algorithm
     -- algorithms are metric-oriented (usually called workload-oriented)
  -- we can game the metrics with not-very-useful algorithms
     -- a "best response time" algo: FIFO + always schedule a newly arrived
        process for 0.0001 sec

2. More about scheduling

A. a note on MLFQ parameters:
   -- the scheduling algorithm for each queue (last time, RR)
   -- when to demote a process to a lower-priority queue
      (last time, a "downgrade" running time)
   -- when to promote a process to a higher-priority queue (last time, none)
   -- others:
      -- the number of queues (last time, 2)
      -- which queue a process enters when it arrives
         (last time, the high-priority queue)

B.
a note about the Lottery algorithm
   -- randomness is important in CS, and lottery scheduling is one example
      -- randomized algorithms in theory
      -- stochastic gradient descent in deep learning
      -- Bitcoin's proof-of-work; Ethereum's proof-of-stake
   -- lots of nice features
      -- can model tickets as currency, so there can be an exchange rate
         between real currencies (money) and lottery tickets
      -- deals with starvation (holding one ticket --> will make progress)
         -- don't have to worry that adding one high-priority job will
            starve all the others
         -- adding/deleting jobs affects all jobs proportionally
            (the total ticket count T gets bigger)
      -- can transfer tickets between processes: highly useful if a client
         is waiting for a server; the client can donate tickets to the
         server so it can run
   -- many other details
      -- ticket inflation for processes that don't use their whole slice
         (like feedback in MLFQ)
         -- use fraction f of the slice; inflate tickets by 1/f until the
            process next gets the CPU (improves fairness)
   -- disadvantages
      -- latency is unpredictable
      -- expected error is somewhat high
   -- in reaction to these disadvantages, Waldspurger and Weihl proposed
      *Stride Scheduling*
      -- basically, a deterministic version of lottery scheduling;
         less randomness --> less expected error
      -- see OSTEP (chap 9) for details

C. Incorporating I/O (relaxing an assumption from before)

   motivating example: 3 jobs
     P1, P2: both CPU-bound, each runs for a week
     P3: I/O-bound, a loop (1 ms of CPU, then 10 ms of disk I/O)

     process   arrival   running
     P1        0         1 week
     P2        0         1 week
     P3        0         30 sec (with 300 sec of I/O)

   By itself, P3 uses ~90% of the disk.
   By itself, P1 or P2 uses 100% of the CPU.

   Question: what happens if we use FIFO? (arrival order: P1, P2, P3)
   -- P1 and P2 will run, keeping the CPU for 2 weeks

   What about RR with a 100 ms time slice?
   -- only get ~5% disk utilization
      CPU:  [P1:100] [P2:100] [P3:1] [P1:100] ...
      disk: [---------idle---------] [P3:10] [--] ...

   What about RR with a 1 ms time slice?
   -- get nearly 90% disk utilization
   -- but lots of preemptions
      CPU:  [P1:1] [P2:1] [P3:1] [P1:1] [P2:1] [P1:1] ...
      disk: [------------------] [P3:10 ...
   What about STCF?
   -- would give us good disk utilization ...
      CPU:  [P3:1] [P1:10] [P3:1] [P1:10] ...
      disk: [----] [P3:10] [----] [P3:10] ...
      (what about P2? starvation)
   -- advantages of STCF
      -- disk utilization (see above)
      -- low overhead (no needless preemptions)
   -- disadvantages of STCF
      -- long-running jobs can get starved
      -- requires predicting the future

D. Predicting the future

   We cannot really predict the future. However, we can estimate the
   future based on the past (a standard approach when designing systems):
   -- exponentially weighted averaging is a good idea
      -- t_n: length of the process's nth CPU burst
      -- \tau_{n+1}: estimate for the (n+1)th burst
      -- choose \alpha, 0 < \alpha <= 1
      -- set \tau_{n+1} = \alpha * t_n + (1 - \alpha) * \tau_n
      -- this is called an exponentially weighted moving average (EWMA)
      -- it reacts to changes, but smoothly

[skipped]
E. What Linux does
   -- the current Linux scheduler (post 2.6.23), called CFS (Completely
      Fair Scheduler), roughly reinvented the ideas of Stride Scheduling
   -- if you're interested in CFS:
      "Inside the Linux 2.6 Completely Fair Scheduler"
      https://developer.ibm.com/tutorials/l-completely-fair-scheduler/
   [maybe talk about "design vs. implementation"]

3. Scheduling problems in a new era

- you may have the illusion that scheduling is "easy"
- a problem in practice: the Reconfigurable Machine Scheduling problem (RMS)
- the scheduling problem in practice:
  -- the NVIDIA A100 GPU has a new feature: Multi-Instance GPU (MIG)
  -- MIG can partition one GPU into 7 small GPUs (call each a 1/7 instance)
  -- small GPUs can merge into larger ones: two 1/7 instances => one 2/7
     instance (with limitations)
  -- different deep neural network (DNN) jobs have non-linear performance
     on different-sized instances
     -- for example, for inceptionv3, two 1/7 instances have higher
        throughput than one 2/7 instance
- Question: given a set of DNN jobs and a set of GPUs, how do we maximize
  throughput?
  (or minimize #GPUs used)
  -- this is an NP-complete problem
  -- see a viable solution in our paper:
     "Serving DNN Models with Multi-Instance GPUs: A Case of the
      Reconfigurable Machine Scheduling Problem"
     https://arxiv.org/pdf/2109.11067.pdf
- point: scheduling has become complicated and important again in the new
  era (e.g., cloud computing and AI)

4. Scheduling lessons and conclusions

-- Scheduling comes up all over the place
   -- m requests share n resources
   -- disk arm: which read/write request to do next?
   -- memory: which process to take a physical page from?
-- This topic was popular in the days of time sharing, when there was a
   shortage of resources all around, but many scheduling problems stop
   being interesting when you can just buy a faster CPU or a faster network.
-- Exception 1: web sites and large-scale networks often cannot be made
   fast enough to handle peak demand (flash crowds, attacks), so scheduling
   becomes important again. For example, we may want to prioritize paying
   customers or counter denial-of-service attacks.
-- Exception 2: real-time systems:
     soft real time: miss a deadline and the CD or MPEG decode will skip
     hard real time: miss a deadline and the plane will crash
   Plus, at some level, every system with a human at the other end is a
   real-time system. If a Web server delays too long, the user gives up.
   So the ultimate effect of the system may in fact depend on scheduling!
-- Cloud computing (huge datacenters) makes scheduling "fancy" again...
   ...with new problems (like RMS)
   ...and a "scale effect": improving CPU efficiency by 1% on your laptop
   is not interesting, whereas saving 1% of CPU time for Google is a BIG
   deal!
-- In principle, scheduling decisions shouldn't affect a program's results
   -- This is good, because it's rare to be able to calculate the best
      schedule
   -- So instead, we build the kernel so that it's correct to do a context
      switch and restore at any time, and then *any* schedule will get the
      right answer for the program
   -- This is a case of a concept that comes up a fair bit in computer
      systems: the policy/mechanism split. Here, the *mechanism* allows
      the OS to switch at any time, while the *policy* determines when to
      switch in order to meet whatever goals the scheduler designer wants.
-- But there are cases when the schedule *can* affect correctness
   -- multimedia: delay too long, and the result looks or sounds wrong
   -- Web server: delay too long, and users give up

5. Threads

[ask how many students have used threads]

Interface to threads:

    tid thread_create(void (*fn)(void *), void *arg);
        Create a new thread that runs fn with arg

    void thread_exit();

    void thread_join(tid thr);
        Wait for the thread with tid 'thr' to exit

    plus a lot of synchronization primitives, which we'll see in the next
    lecture

Assume for now that threads are:
-- an abstraction created by the OS
-- preemptively scheduled

thread vs. process: a frequently asked question
-- threads share memory, but each has its own "execution context"
   (registers and stack)
-- similar to the PCB, there is a TCB (thread control block):
   -- thread identifier: a unique id (tid) assigned to every new thread
   -- stack pointer: points to the thread's stack in the process
   -- program counter: points to the thread's current instruction
   -- state of the thread (running, ready, waiting, start, done)
   -- the thread's register values
   -- pointer to the process control block (PCB) of the process that the
      thread lives in

[Google uses user-level threads for performance. See the talk
"User-level threads....... with threads."
by Paul Turner from Google:
https://www.youtube.com/watch?v=KXuZi9aeGTw ]

[draw abstract picture of threads in a process: own registers, shared memory]

Question: if you were the OS designer, what would you do if a thread
calls fork()?

[Linux: only the thread that called fork() remains in the child process]