Week 11.b CS7670 11/16 2022
https://naizhengtan.github.io/22fall/

1. FUSE basics
2. stackfs implementation
3. FUSE performance

---

1. FUSE basics

* a fundamental trade-off in systems: low-level vs. high-level
  Q: per the 2nd paragraph, which type does lfs fall into?

* FUSE workflow:
  [draw Fig1]

  some examples: [see handout]
   a) open("/tmp/a", flag)
   b) read(fd, buf, 4096)

* FUSE implementation

  Q: what is an "interrupt"? what is a "forget"?
  -- interrupt: issued by the kernel to cancel an in-flight request
     (e.g., the user process was interrupted by a signal)
  -- forget: issued by the kernel when it evicts an inode, so the
     user-space daemon can remove the inode from its cache

* API levels

  Check out: https://www.fsl.cs.stonybrook.edu/docs/fuse/fuse-article-appendices.html

  Q: should lfs use the high-level APIs or the low-level ones?
  -- high-level skips the implementation of the path-to-inode mapping
  -- low-level has "lookup", which translates path to inode
     [see sketch 1 at the end of these notes]

* five queues
  [draw Fig2]
  -- interrupts
  -- forgets
  -- pending
  -- processing
  -- background (what requests will go here?)

  Q: what's the state transition graph of a read op?
     read op -> [background] -> [pending] -> [processing] -> done

  Q: what policy would you choose for the four queues? (page 61)
     WHY four, instead of five?

  Q: when will tasks in the background queue move to pending?
  A: at most 12 async requests at a time (max_background, page 62)

  Q: what will happen if FUSE meets congestion? (page 62)
  DISCUSSION: dropping or not dropping?
              tput and latency graph

* FUSE optimizations
  -- splicing: important for lfs [see handout]
     Linux syscall:
       ssize_t splice(int fd_in, off64_t *off_in,
                      int fd_out, off64_t *off_out,
                      size_t len, unsigned int flags);
     [see sketch 2 at the end of these notes]
  -- multi-threading for the user-space daemon
  -- write-back cache and larger max_write
     [see sketch 3 at the end of these notes]

2. stackfs

  [see FUSE op implementation here:
   https://github.com/sbu-fsl/fuse-stackfs/blob/master/StackFS_LowLevel/StackFS_LowLevel.c]

* inode:
  -- path to the underlying file
  -- inode number
  -- reference counter

* the inode number is the inode's address in memory

* inodes are stored in a hash table
  [see sketch 4 at the end of these notes]

  Q: can you imagine how stackfs works? for a file create?
  [read handout]
  "insert(lo_data, lo_inode)"

3. FUSE performance

  Q: how to understand the statistics? read the 2nd paragraph of S3.2
  -- row: request type
  -- col: time
  -- cell: #requests that happened in the past 2^{N+1}-2^{N+2} ns
  [draw on board]

* hardware: HDD and SSD
  Q: what do you expect the FUSE overhead to be on each? larger? or smaller?

* optimizations:
  (1) write-back cache + batching multiple written pages
  (2) multi-threading
  (3) splice (avoiding memory copies)

* read observations 1--4 and 5--8 in S5.1

  Q: ob2, how come there is an improvement?
  A: readahead of 128KB

  Q: ob3, why did perf become worse for files-rd-1th?
     (1) this is a read
     (2) this is single-threaded
     (3) it reads a page at a time
     WHY does this add overhead?

  Q: ob4, why is create expensive for stackfs?
  A: allocating the inode in the hash table

  Q: ob5, why do seq reads trigger overheads with concurrency (32 threads)
     on HDD but not SSD?
  A: limited by the single-threaded user daemon (for the base config),
     which cannot saturate the bandwidth

  Q: why doesn't the multi-threading optimization work well either?
  A: limited by the 12 background requests

* performance summary

  Let's focus on data, not metadata, and ignore CPU overheads.
  Q: can you summarize the FUSE perf?

* queuing effect:
  -- long latency
  -- tput-wise: can we saturate the bandwidth?
     if we can: do things concurrently
     if we cannot: wait for unfinished requests and cannot start new ones

* look at the last column; bad ones:
  -- rnd-rd-1th-1f (4KB, 32KB)
  -- rnd-rd-32th-1f (4KB)
  -- rnd-wr-1th-1f (4KB)
  -- rnd-wr-32th-1f (4KB)
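---

Code sketches (referenced above)

Sketch 1: a minimal low-level lookup handler. This is a hedged sketch
assuming libfuse 3 (FUSE_USE_VERSION 31); the lfs_lookup name and the tiny
in-memory table are made up for illustration. The point: with the low-level
API the daemon itself must translate (parent inode, name) into an inode and
answer with fuse_reply_entry(); the high-level API would do this for you and
hand your callbacks full paths instead.

    #define FUSE_USE_VERSION 31
    #include <fuse_lowlevel.h>
    #include <errno.h>
    #include <string.h>
    #include <sys/stat.h>

    /* toy in-memory inode table: one regular file "a" under the root */
    struct node {
        fuse_ino_t  ino;
        const char *name;
    };
    static struct node table[] = {
        { .ino = 2, .name = "a" },
    };

    static struct node *find_child(fuse_ino_t parent, const char *name)
    {
        if (parent != FUSE_ROOT_ID)        /* only the root has children here */
            return NULL;
        for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
            if (strcmp(table[i].name, name) == 0)
                return &table[i];
        return NULL;
    }

    /* lookup: the path-to-inode step the high-level API would hide from us */
    static void lfs_lookup(fuse_req_t req, fuse_ino_t parent, const char *name)
    {
        struct node *n = find_child(parent, name);
        if (!n) {
            fuse_reply_err(req, ENOENT);
            return;
        }

        struct fuse_entry_param e;
        memset(&e, 0, sizeof(e));
        e.ino           = n->ino;          /* the inode number we picked */
        e.attr.st_ino   = n->ino;
        e.attr.st_mode  = S_IFREG | 0644;
        e.attr.st_nlink = 1;
        e.attr_timeout  = 1.0;             /* let the kernel cache the attrs */
        e.entry_timeout = 1.0;             /* ...and the dentry, for 1 second */
        fuse_reply_entry(req, &e);
    }

    /* wired in via: struct fuse_lowlevel_ops ops = { .lookup = lfs_lookup, ... }; */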
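Sketch 2: the splice(2) pattern behind FUSE's zero-copy optimization. This is
a standalone demo, not libfuse internals, and the /tmp paths are made up.
Data moves source fd -> pipe -> destination fd entirely inside the kernel,
without being copied into the process's buffers; this is the same pattern the
daemon can apply to the request payload it reads from /dev/fuse.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int in  = open("/tmp/in",  O_RDONLY);                 /* hypothetical source */
        int out = open("/tmp/out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        int pipefd[2];
        if (in < 0 || out < 0 || pipe(pipefd) < 0) {
            perror("setup");
            return 1;
        }

        for (;;) {
            /* source file -> pipe: no copy into user space */
            ssize_t n = splice(in, NULL, pipefd[1], NULL, 64 * 1024, SPLICE_F_MOVE);
            if (n <= 0)                                       /* 0 = EOF, <0 = error */
                break;

            /* pipe -> destination file, forwarding exactly the bytes just spliced */
            while (n > 0) {
                ssize_t m = splice(pipefd[0], NULL, out, NULL, n, SPLICE_F_MOVE);
                if (m <= 0) { perror("splice"); return 1; }
                n -= m;
            }
        }

        close(in); close(out); close(pipefd[0]); close(pipefd[1]);
        return 0;
    }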
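Sketch 3: asking for the optimizations from the daemon's init callback. A
hedged sketch assuming libfuse 3; the numeric values are illustrative only
(the kernel's default max_background is 12, which is the limit discussed in
the queue section above).

    #define FUSE_USE_VERSION 31
    #include <fuse_lowlevel.h>

    static void lfs_init(void *userdata, struct fuse_conn_info *conn)
    {
        (void)userdata;

        /* write-back cache: let the kernel batch dirty pages before sending WRITEs */
        if (conn->capable & FUSE_CAP_WRITEBACK_CACHE)
            conn->want |= FUSE_CAP_WRITEBACK_CACHE;

        /* splicing: move request payloads through pipes instead of copying */
        if (conn->capable & FUSE_CAP_SPLICE_READ)
            conn->want |= FUSE_CAP_SPLICE_READ;
        if (conn->capable & FUSE_CAP_SPLICE_WRITE)
            conn->want |= FUSE_CAP_SPLICE_WRITE;

        /* larger writes: up to 128KB per WRITE request instead of one page */
        conn->max_write = 128 * 1024;

        /* background-queue knobs behind the "12 async tasks" discussion
           (illustrative values, not the kernel defaults) */
        conn->max_background       = 64;
        conn->congestion_threshold = 48;
    }

    /* registered via: struct fuse_lowlevel_ops ops = { .init = lfs_init, ... }; */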
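Sketch 4: stackfs-style inode bookkeeping. lo_inode, lo_data, and insert()
are names that appear in StackFS_LowLevel.c, but the fields and the hash
function here are simplified stand-ins, not the repo's actual code. It shows
the three pieces listed in section 2 (path to the underlying file, underlying
inode number, reference counter), the hash table they live in, and the trick
of using the struct's memory address as the FUSE inode number.

    #define FUSE_USE_VERSION 31
    #include <fuse_lowlevel.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>

    struct lo_inode {
        char    *path;           /* path of the underlying file in the lower FS */
        ino_t    ino;            /* underlying inode number (hash key)          */
        dev_t    dev;            /* underlying device, to disambiguate inodes   */
        uint64_t nlookup;        /* reference counter, decremented by FORGET    */
        struct lo_inode *next;   /* hash-chain link                             */
    };

    #define NBUCKETS 1024
    struct lo_data {
        struct lo_inode *buckets[NBUCKETS];   /* the inode hash table */
    };

    /* the trick: the FUSE inode number is just the struct's in-memory address */
    static fuse_ino_t lo_ino(struct lo_inode *inode)
    {
        return (fuse_ino_t)(uintptr_t)inode;
    }
    static struct lo_inode *lo_inode_ptr(fuse_ino_t ino)
    {
        return (struct lo_inode *)(uintptr_t)ino;   /* used by every other op */
    }

    /* simplified insert(lo_data, lo_inode): hash by underlying inode number */
    static void insert(struct lo_data *lo, struct lo_inode *inode)
    {
        size_t b = inode->ino % NBUCKETS;
        inode->next = lo->buckets[b];
        lo->buckets[b] = inode;
    }

    /* on create/lookup: stat the lower file, allocate and register an lo_inode */
    static struct lo_inode *register_path(struct lo_data *lo, const char *path)
    {
        struct stat st;
        if (stat(path, &st) < 0)
            return NULL;

        struct lo_inode *inode = calloc(1, sizeof(*inode));
        inode->path    = strdup(path);
        inode->ino     = st.st_ino;
        inode->dev     = st.st_dev;
        inode->nlookup = 1;
        insert(lo, inode);
        return inode;       /* reply to the kernel with e.ino = lo_ino(inode) */
    }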