Week 4.a
CS3650
01/29 2024
https://naizhengtan.github.io/24spring/

0. process birth (cont'd)
1. Shell crash course
2. Shell internals, part I
3. File descriptors
4. Shell internals, part II
----

Admin:
  - talk about assignment1
  - Lab2: expect crowded office hours

some notes:
  - Q: is "x == *&x" always true? Yes.
  - exit(0) vs. return 0
     [an example:
         int main() {
             //exit(123);
             return 123;
         }
     ]
    and "$ echo $?"
    Q: run "$ echo $?" twice and the value changes; why?
    A: the second time it prints the exit status of the first "echo",
       which finished successfully (hence "0").

0. Process birth

  - fork() is what we use for creating a process:
    it duplicates the parent process and creates a child process.

  Q: there is only one difference between parent and child.
     What is that?
    [Answer: the return value of fork();
     0 means child; the child pid (>0) means parent]

  - wait() syscall
    * a family of syscalls, like wait() and waitpid()
    * does two things:
      (a) wait (does not run the next instruction)
      (b) when the child finishes (meaning called exit()),
          the parent collects the child's exit status

  - normal case:
     parent: fork() then wait()
     child: run and exit()
    * "abnormal" case: orphan and zombie processes

  [demo: use htop to show what fork() looks like]

  - Q: Can we create a process with a single "create_process" syscall?
    --that's what Windows does
    --why does fork() work "better"?
    // why is fork() less attractive today?

1. Shell crash course

  - how do fork/wait work in practice?
    -- think of how you run a hello program

  - a program that creates processes

  - the human's interface to the computer

  - GUIs (graphical user interfaces) are another kind of shell.

  - mechanically introduce (to some of you, review) the basics of shell
    -- next section will discuss their motivations, how they
       are implemented, and why they are powerful

  - shell reads user inputs (cmds and arguments) and runs the cmds
    for example, "$ ls" and "$ ls -a"
    -- "ls" is a program, like your helloworld
    -- "-a" is an argument; you will see argument handling in your lab

  - output redirection
    -- "$ ls" prints to screen; what if I want the output to a file?

    -- Question: how would people do this in Windows?

    -- "$ ls > files.txt"

  - backgrounding
    -- there are long-running jobs, for example, web server
    -- run them by backgrounding
      $ web-server &
      $

  - pipe
    -- feed one program's output to another's input

  Q: what does this line mean: (a demo)
    -- "$ cat students.txt | shuf -n 1"
    -- equivalent to 
       "$ cat students.txt > /tmp/tmpfile
        $ shuf -n 1 /tmp/tmpfile
        $ rm /tmp/tmpfile"
       (of course, technically, we don't need "cat" and "/tmp/tmpfile")

  - Shell builtin cmds vs. program
    -- "echo/pwd/which" vs. "ls"
    -- use "which" to tell
       program:  "$ which ls"    => "/bin/ls"
       built-in: "$ which which" => "which: shell built-in command"
    -- why builtin cmds?
      -- has to be builtin (impossible to implement otherwise): "cd"
      -- for efficiency: "echo"


2. Shell internals, part I

    a. How does the shell start programs?

    --example:
        $ ls

    [see panel 1 on handout; go line-by-line]

    --calls fork(), which creates a copy of the shell. now there are
    two copies of the shell running

    --then calls exec(), which loads the new program's instructions
    into memory and begins executing them.
        --(exec invokes the loader)

    while (1) {
        write(1, "$ ", 2);
        readcommand(command, args); // parse input
        if ((pid = fork()) == 0) // child?
            execve(command, args, 0);
        else if (pid > 0) // parent?
            wait(0); //wait for child
        else
            perror("failed to fork");
    }

    [why 0 in wait(0)? 0 means NULL: the shell passes no status
    pointer, so it ignores the child's exit status]

    --waits for the end of a process
        --with wait() or waitpid() system calls

    --QUESTION: why is fork separate from exec?
      why not combine them?

       [in fact, we have "posix_spawn()", which roughly combines fork/exec in one call]

      * We will come back to this.
      this => "the power of the fork/exec separation"

    b. Redirection and pipe, motivation

      Q: in Windows, how can you concatenate two txt files? 10 files? 1000 files?

      What does this do?

      $ cat file1 file2 > new_file

      or say we wanted to extract all of your GitHub ids...how would
      you do that without pipelines?

      fetch repos from html page https://github.com/NEU-CS5600-23spring (call it URL)
      then
      $ curl $URL | grep -o 'lab1-[a-zA-Z0-9-]*' | uniq > repo.txt

      How are these things implemented? Remember, the programmer of
      cat is long gone, and cat's output is winding
      up somewhere that the original program never specified.

3. File descriptors

    --"int fd = open(const char* path, int flags)"

    --every process can usually expect to begin life with three file
    descriptors already open:
    0: represents the input to the process (e.g., tied to terminal)
    1: represents the output
    2: represents the error output

    these are sometimes known as stdin, stdout, stderr

    --NOTE: Unix hides from processes the difference between a device
    and a file. this is a very powerful hiding (or abstraction), as we
    will see soon

    [draw kernel file descriptor table]


4. Shell internals, part II

  - redirection

    Back to 
         $ cat abcd efgh > /tmp/foo

  Q: How is that implemented?

  Answer: after fork() but before exec(), shell does:

        close(1)
        open("/tmp/foo", O_TRUNC | O_CREAT | O_WRONLY, 0666)

        which automatically assigns fd 1 to point to /tmp/foo, because
        open() returns the lowest-numbered unused fd, and close(1)
        has just freed 1

        [draw picture of fds, fd 0 /dev/tty, fd 1 now /tmp/foo]

        --now, when "cat" runs, it still has in its code: write(1,...),
        but "1" now means something else.

    [read handout part 2]

    What about 

        $ sh < script > tmp1

        where script contains 
        echo abc
        echo def

        [draw picture]


  * The power of the fork/exec separation

    [an innovation from the original Unix. possibly lucky design
    choice at the time. but turns out to work really well.  allows
    the child to manipulate environment and file descriptors
    *before* exec, so that the *new* program may in fact encounter a
    different environment]

      --recall how we handle redirection

      --To generalize redirections and pipelines, there are lots of
      things the parent shell might want to manipulate in the child
      process: file descriptors, environment, resource limits.

      --yet fork() requires no arguments!

      --compare with CreateProcess() on Windows (a Win32 API call):

       BOOL CreateProcess(
         name,
         commandline,
         security_attr,
         thr_security_attr,
         inheritance?,
         other flags,
         new_env,
         curr_dir_name,
         ...)

       [http://msdn.microsoft.com/en-us/library/ms682425(v=VS.85).aspx]
       [see also Windows syscalls: https://github.com/j00ru/windows-syscalls]

       there's also CreateProcessAsUser, CreateProcessWithLogonW,
       CreateProcessWithTokenW, ...

    * The issue is that any conceivable manipulation of the
    environment of the new process has to be passed through 
    arguments, instead of via arbitrary code.

    in other words:

      whoever calls CreateProcess() (or one of its variants) needs
      to fully configure the new process up front, before it starts
      running.

      with fork(), whoever calls fork() **is still running** (now also
      as the child), so it can arrange to do whatever it wants, without
      having to work through a rigid interface like the above. this
      allows arbitrary "setup" of the process before exec().

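    A sketch of such "setup" code, run by the child between fork()
    and exec(). Everything specific here is an assumption made for
    illustration: the helper name run_with_setup, the variable
    DEMO_VAR, and the choice of setup steps (working directory,
    environment, stdout).

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the child changes its own cwd, environment,
 * and stdout with ordinary code, then execs "sh". `out_path` should
 * be absolute, since the child calls chdir() before opening it. */
int run_with_setup(const char *out_path) {
    fflush(stdout);
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* arbitrary setup; none of it affects the parent */
        if (chdir("/") < 0)
            _exit(126);
        if (setenv("DEMO_VAR", "set-by-child", 1) < 0)
            _exit(126);
        close(1);
        if (open(out_path, O_TRUNC | O_CREAT | O_WRONLY, 0666) != 1)
            _exit(126);
        /* the new program starts life in the environment built above */
        execlp("sh", "sh", "-c", "pwd; echo \"$DEMO_VAR\"", (char *)NULL);
        _exit(127);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

int main(void) {
    int status = run_with_setup("/tmp/setup_demo.txt");
    printf("sh exited with status %d; see /tmp/setup_demo.txt\n", status);
    return 0;
}
```

    Each setup step is an ordinary call made by the child; none of
    them needs a dedicated parameter in fork() or exec().
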
   [start from here next time]