Week 10.b. CS 5600 11/12 2021

On the board
------------
1. I/O, continued
2. device drivers
3. Synchronous vs. async I/O
4. Disks

-------------------------------------------------------

Admin:

-- power pose

   Body language affects how others see us, but it may also change how
   we see ourselves. Social psychologist Amy Cuddy argues that "power
   posing" -- standing in a posture of confidence, even when we don't
   feel confident -- can boost feelings of confidence, and might have
   an impact on our chances for success.

   -- TED talk:
      https://www.ted.com/talks/amy_cuddy_your_body_language_may_shape_who_you_are?language=en#t-1307

---------

1. I/O, continued

(last time)
    - I/O architecture
    - PMIO/MMIO

** Polling vs. interrupts

Polling: check back periodically

    kernel...
    - ... sent a packet? Periodically ask the card when the buffer is free.
    - ... waiting for a packet? Periodically ask whether there is data.
    - ... did disk I/O? Periodically ask whether the disk is done.

    Disadvantage: wasted CPU cycles

Interrupts:

    Recall interrupts:
    -- three ways to trap to the kernel: syscalls, exceptions, and interrupts
    -- CPU scheduling: a preemptive scheduler relies on the timer interrupt
    -- page faults: how the CPU+OS handle interrupts
       -- (OS configures) registering handlers in the IDT (interrupt
          descriptor table)
       -- (CPU executes) on an interrupt, the CPU transfers control to
          the registered handler

    The device interrupts the CPU when its status changes (for example,
    data is ready, or data is fully written).

    This is what most general-purpose OSes do.

    There is a disadvantage, however. This could come up if you need to
    build a high-performance system. Namely: if the interrupt rate is
    high, then the computer can spend a lot of time handling interrupts
    (interrupts are expensive because they generate a context switch,
    and the interrupt handler runs at high priority).

    --> in the worst case, you can get *receive livelock*, where you
        spend 100% of the time in interrupt handlers but no work gets
        done.

How to design systems given these tradeoffs? Start with interrupts. If
you notice that your system is slowing down because of livelock, then
switch to polling. If polling is chewing up too many cycles, then move
toward adaptively switching between interrupts and polling. (But of
course, never optimize until you actually know what the problem is.)

A classic reference on this subject is the paper "Eliminating Receive
Livelock in an Interrupt-driven Kernel", by Mogul and Ramakrishnan, 1996.
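To make the contrast concrete, here is a minimal sketch in C. Everything
device-specific is made up for illustration: the MMIO addresses, the
register names, and the READY bit are hypothetical, and a real driver
would do much more (locking, queueing, acknowledging the interrupt, etc.).

    #include <stdint.h>

    /* hypothetical memory-mapped registers for some device */
    #define DEV_STATUS   ((volatile uint32_t *)0xfeed0000)
    #define DEV_DATA     ((volatile uint32_t *)0xfeed0004)
    #define STATUS_READY 0x1u

    /* Polling: the CPU repeatedly asks "are you done yet?" */
    uint32_t read_polled(void)
    {
        while ((*DEV_STATUS & STATUS_READY) == 0)
            ;                /* spin: these are the wasted CPU cycles */
        return *DEV_DATA;
    }

    /* Interrupts: the CPU does other work; when the device's status
     * changes, it interrupts, and the CPU runs this handler (which the
     * OS registered, e.g., via the IDT on x86). */
    void dev_interrupt_handler(void)
    {
        uint32_t data = *DEV_DATA;  /* runs at high priority: keep short */
        /* ... hand the data off to the rest of the kernel ... */
        (void)data;
    }

Note the volatile casts: without them, the compiler could "optimize
away" the repeated reads of the status register in the polling loop.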
We have just seen two approaches to synchronizing with hardware:

    polling
    interrupts

Notice that they are mostly about communicating device status. How
about data transfer? (By "mostly", I mean that getting/setting
status/commands doesn't differ cleanly from "data transfer" --
status/commands are bits as well!)

** DMA vs. programmed I/O

Programmed I/O: what we have been seeing in the handout so far: the CPU
writes data directly to the device, and reads data directly from the
device.

DMA: a better way for large and frequent transfers

    The CPU (really, the device driver programmer) places some buffers
    in main memory. Tells the device where the buffers are. Then
    "pokes" the device by writing to a register. Then the device uses
    *DMA* (direct memory access) to read or write the buffers. The CPU
    can poll to see if the DMA completed (or the device can interrupt
    the CPU when done).

    [rough picture: buffer descriptor list --> [ buf ] --> [ buf ] .... ]

    The DMA process is managed by a piece of hardware known as a DMA
    controller (DMAC).

    This makes a lot of sense. Instead of having the CPU constantly
    deal with a small amount of data at a time, the device can simply
    write the contents of its operation straight into memory.

NOTE: OSTEP couples DMA to interrupts, but things don't have to work
like that. You could have all four possibilities in {DMA, programmed
I/O} x {polling, interrupts}. For example, (DMA, polling) would mean
requesting a DMA and then later polling to see if the DMA is complete.

2. Device drivers

The examples (keyboard and screen) on the handout are simple device
drivers.

Device drivers in general solve a software engineering problem ...

    [draw a picture: different devices have different shapes, and
    drivers fit them into the kernel]

    expose a well-defined interface to the kernel, so that the kernel
    can call comparatively simple read/write calls or whatever. For
    example:

        reset, ioctl, output, read, write, handle_interrupt()

    this abstracts away nasty hardware details so that the kernel
    doesn't have to understand them.

    When you write a driver, you are implementing this interface, and
    also calling functions that the kernel itself exposes (see the
    sketch below).
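Here is a minimal sketch of what such an interface could look like: a
table of function pointers that each driver fills in. The struct and
function names are invented for illustration; real kernels (for
instance, Linux's struct file_operations) are similar in spirit but
differ in detail.

    #include <stddef.h>
    #include <sys/types.h>

    struct device;                    /* opaque per-device state */

    /* the interface every driver implements (hypothetical names) */
    struct driver_ops {
        void    (*reset)(struct device *dev);
        int     (*ioctl)(struct device *dev, int request, void *arg);
        ssize_t (*read)(struct device *dev, void *buf, size_t n);
        ssize_t (*write)(struct device *dev, const void *buf, size_t n);
        void    (*handle_interrupt)(struct device *dev);
    };

    /* The kernel calls through the table without touching hardware
     * details; supporting a new device means supplying a new table,
     * not changing kernel code. */
    ssize_t kernel_read(struct device *dev, const struct driver_ops *ops,
                        void *buf, size_t n)
    {
        return ops->read(dev, buf, n);
    }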
But device drivers also *create* software engineering problems.

Fundamental issues:

(1) Each device driver is per-OS and per-device (often you can't reuse
    the "hard parts").

    They are often written by the device manufacturer (the core
    competence of device manufacturers is hardware development, not
    software development). Under conventional kernel architectures,
    bugs in device drivers -- and there are many, many of them -- bring
    down the entire machine. So we have to worry about potentially
    sketchy drivers ...

(2) ... but we also have to worry about potentially sketchy devices.

    a buggy network card can scribble all over memory (solution: use an
    IOMMU; advanced topic)

    plug in your USB stick: it claims to be a keyboard and starts
    issuing commands.

    plug in a USB stick: if it's carrying a virus (aka malware), your
    computer can now be infected.

    [if interested, check out: Angel, S., Wahby, R.S., Howald, M.,
    Leners, J.B., Spilo, M., Sun, Z., Blumberg, A.J. and Walfish, M.
    Defending against malicious peripherals with Cinch.]

3. Synchronous vs. asynchronous I/O

- A question of interface

- NOTE: the kernel never blocks when issuing I/O. We're discussing the
  interface presented to user-level processes.

- Synchronous I/O: system calls block until they're handled.

    for example: the read(...) syscall

- Asynchronous I/O: call it "try_read()" (of course, a fake name).
  I/O doesn't block. For example, if there is nothing to read, the call
  returns immediately but sets a flag indicating that it _would_ have
  blocked. The process discovers that data is ready either by making
  another query or by registering to be notified by a signal.

- Annoyingly, the standard POSIX interface for files is blocking,
  always. You need platform-specific extensions to POSIX to get async
  I/O for files.

- Pros/cons:

    - a blocking interface leads to more readable code, when
      considering the code that invokes that interface

    - but blocking interfaces BLOCK, which means that the code _above_
      the interface cannot suddenly switch to doing something else. If
      we want concurrency, it has to be handled by a layer _underneath_
      the blocking interface.

(Building async I/O is a big topic that we will not cover in this
course.)
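To make the "try_read()" idea concrete, here is a minimal user-level
sketch using the POSIX O_NONBLOCK flag on a pipe. (Consistent with the
note above, this flag does not give nonblocking reads of regular files;
that is where the platform-specific extensions come in.)

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        char buf[64];

        if (pipe(fds) < 0)
            return 1;

        /* mark the read end nonblocking: read() becomes a "try_read" */
        fcntl(fds[0], F_SETFL, O_NONBLOCK);

        ssize_t n = read(fds[0], buf, sizeof buf);  /* nothing written yet */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            printf("would have blocked; go do something else\n");

        return 0;
    }

The process could then retry later, or arrange to be notified that data
is ready (via a signal, or an event interface such as poll()/select()).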
4. Disks

Disks have historically been *the* bottleneck in many systems

    - This becomes less and less true every year:
        - SSDs (solid state drives) now common; will see in a sec
        - PM (persistent memory) or NVRAM (non-volatile RAM) now available

[Reference: "An Introduction to Disk Drive Modeling", by Chris Ruemmler
and John Wilkes. IEEE Computer, Vol. 27, No. 3, March 1994, pp. 17-28.]

A. What is a disk?

[see handout]

    --stack of magnetic platters
    --rotate together on a central spindle @3,600-15,000 RPM
    --arms rotate around a pivot; all move together
    --arms contain disk heads -- one for each recording surface
    --heads read and write data to the platters

[interlude: why are we studying this?

    Disks are still widely in use everywhere, and will be for some
    time. Very cheap. Great medium for backup. Better than SSDs for
    durability (SSDs have a limited number of write cycles, and decay
    over time). Google, Facebook, etc. historically packed their data
    centers full of cheap disks.

    As a second point, it's technical literacy: many file systems were
    designed with the disk in mind (sequential access has significantly
    higher throughput than random access). You have to know how these
    things work as a computer scientist and as a programmer.]

B. Geometry of a disk

[see handout]

    --track: circle on a platter. each platter is divided into
      concentric tracks.
    --sector: chunk of a track
    --cylinder: locus of all tracks of fixed radius on all platters
    --heads are roughly lined up on a cylinder
    --generally only one head is active at a time
    --disk positioning system
        --move the head to a specific track and keep it there
        --a *seek* consists of up to four phases:
            --speedup: accelerate the arm to max speed or the halfway point
            --coast: at max speed (for long seeks)
            --slowdown: decelerate the arm near the destination
            --settle: adjust the head to the actual desired track
        [BTW, this thing can accelerate at up to several hundred g]

    --Question: which have better performance, reads or writes? why?

        [answer: reads. [update 12/12: was "writes", a typo.] Here are
        the reasons:
        --settling takes longer for writes than for reads. why?
            --because if a read strays, the error will be caught, and
              the disk can retry
            --if a write strays, some other track just got clobbered.
              so write settles need to be done precisely]

C. Common #s and performance

    --capacity: high 100s of GB; now TBs common
    --platters: 8
    --number of cylinders: tens of thousands or more
    --sectors per track: ~1000
    --RPM: 10,000
    --transfer rate: 50-150 MB/s
    --Question: guess the mean time between failures
        --answer: ~1 million hours

        (for disks in data centers, it's vastly less; for a provider
        like Google, even if they had very reliable disks, they'd still
        need an automated way to handle failures, because failures
        would be common (imagine 10 million disks: *some* will be on
        the fritz at any given moment). so what they do is buy cheap,
        less reliable disks, which lets them save on hardware costs.
        they get away with it because they *anyway* needed software and
        systems -- replication and other fault-tolerance schemes -- to
        handle failures.)

D. How the driver interfaces to the disk

    --Sectors
        --in the old days (before 1990ish): the OS specified a platter,
          track, and sector (CHS: cylinder-head-sector); but who knows
          where it really is?
        --nowadays, the OS sees the disk as an array of sectors (LBA:
          logical block addressing); normally each sector is 512B.
          [see the handout's bootloader code]
    --the disk interface presents a linear array of **sectors**
        --traditionally 512 bytes (moving to 4KB)
    --Question: how many bits do we need to address a 1TB disk?

        (note: we will simplify here, assuming 1TB = 2^40 B; in
        reality, in the context of storage, 1TB = 10^12 B =
        1,000,000,000,000 bytes, or 1 trillion bytes)

        [answer: 1 sector is 512B = 2^9 bytes; the entire disk has
        1TB/512B = 2^40 / 2^9 = 2^31 sectors; to address each sector,
        we need at least 31 bits.

        In fact: "The current 48-bit LBA scheme was introduced in 2003
        with the ATA-6 standard, raising the addressing limit to
        2^48 x 512 bytes, which is exactly 128 PiB or approximately
        144 PB." (from https://en.wikipedia.org/wiki/Logical_block_addressing)]

E. Disk scheduling: not covering in class. Can read in the text. Some
   notes below:

    --FCFS: process requests in the order they are received
        +: easy to implement
        +: good fairness
        -: cannot exploit request locality
        -: increases average latency, decreasing throughput

    --SPTF/SSTF/SSF/SJF: shortest positioning time first / shortest
      seek time first: pick the request with the shortest seek time
        +: exploits locality of requests
        +: higher throughput
        -: starvation

        improvement: aged SPTF
            --give older requests priority
            --adjust the "effective" seek time with a weighting [no pun
              intended] factor:

                T_eff = T_pos - W*T_wait

    --Elevator scheduling: like SPTF, but the next seek must be in the
      same direction; switch direction only if there are no further
      requests in that direction
        +: exploits locality
        +: bounded waiting
        -: cylinders in the middle get better service
        -: doesn't fully exploit locality

        modification: sweep in only one direction (treating all
        addresses as circular): very commonly used in Unix.

F. Technology and systems trends (covered some, but skipped a lot)

    --unfortunately, while seeks and rotational delay are getting a
      little faster, they have not kept up with the huge growth
      elsewhere in computers.
    --transfer bandwidth has grown about 10x per decade
    --the thing that is growing fast is disk density (bytes stored per
      $); that's because density is less subject to the mechanical
      limitations.
        --to improve density, you need to get the head close to the
          surface
        --[aside: what happens if the head contacts the surface? it's
          called a "head crash": it scrapes off the magnetic material
          ... and, with it, the data.]
    --disk accesses are a huge system bottleneck, and it's getting
      worse. so what to do?
        --the bandwidth increase lets the system (pre-)fetch large
          chunks for about the same cost as a small chunk.
        --so trade latency for bandwidth if you can get lots of related
          stuff at roughly the same time. how to do that?
        --by clustering the related stuff together on the disk: you can
          grab huge chunks of data without incurring a big cost, since
          you already paid for the seek + rotation.
    --the saving grace for big systems is that memory size is
      increasing faster than typical workload size
        --result: more and more of the workload fits in the file cache,
          which in turn means that the profile of traffic to the disk
          has changed: it's now mostly writes and new data.
        --which means logging and journaling become viable (more on
          this over the next few classes)

[Acknowledgments: Mike Walfish, David Mazieres, Mike Dahlin, Brad Karp]