THIS NOTE WAS WRITTEN IN MARKDOWN
Purpose of an OS:
Isolate programs from each other and from the hardware; virtualize the hardware
Interact with OS via syscalls
First argument of an I/O syscall is a file descriptor (fd)
UNIX I/O is 8-bit bytes
When CPU receives a syscall:
save registers, jump into the kernel, dispatch and execute the handler, restore registers, jump back
The shell is an ordinary user program, not part of the kernel
fork() lets us create a new process
exec() replaces current process with an executable file
The operating system lets different programs share the hardware
Process: instructions, data, stack
fork() returns 0 in the child process and the child's pid in the parent
wait() returns an exiting child's pid and copies out its exit status
exec() loads an executable file and runs it
a file descriptor refers to a file
the gap between fork() and exec() gives the shell a chance to redirect the child's I/O
the file descriptor is a good abstraction
A pipe is a small kernel buffer exposed to processes as a pair of file descriptors.
Pipe: auto-clean, pass long data, allow parallel execution
Underlying file is called inode
cd must be a shell built-in: if it ran as a forked child, the chdir would affect only the child, not the shell
ALL XV6 PROGRAMS RUN AS ROOT!!!!!
To keep team size and project cost down, we need a good programming environment; UNIX was built to provide one.
PROBLEM: a large number of programs and large-scale jobs
Software is always asked to do new things; to avoid rewriting it every year, divide the code into modules.
Layers: kernel, shell, util
A show of pipes
Easy and packed pattern-matching algorithms
A file is a simple sequence of bytes; files are contained in directories
Concept of I/O redirection
C: ignore the hardware when you can, but keep ways to interact with it directly
An example program: input a boolean equation, output circuit traces
The computer should be built to be easy to use
What would a kernel do?
Everyone thought OS provides what they already know
Many tensions occurred in OS design
Function call vs syscall: a function call has no privilege over the real CPU and memory; a syscall does
High-level language is designed to be portable
By convention we read from fd 0 (stdin) and write to fd 1 (stdout)
The kernel holds a per-process table indexed by fd
The ecall instruction transfers control from user space to kernel space
Parent and child share the same open files (the fd table is copied at fork)
exec() loads the program in the file and discards the current memory image
exec() preserves the file descriptors that already exist
wait() waits for a child process to exit; if the caller has no children it returns -1
installing the riscv64-unknown-elf-gcc:
sudo apt install gcc-riscv64-unknown-elf
memory layout:
text: code, read-only data
data: global C vars
stack: local variables
heap: sbrk, malloc, free
.c -compile-> .o -link-> executable file
strlen() uses array access while strcmp() uses pointer access
kalloc() keeps a large linked list of free pages of memory
LRU buffer cache is implemented as a doubly linked list
static: limited to the file where the variable is declared
This class focuses on RAM and I/O
address space: present memory as a giant array of bytes
memory allocation: decide where in that array to store things
The stack is much smaller than the heap
alloc: carve space from a stack-like buffer by bumping a pointer
afree: pop the space by moving the pointer back down
the behavior is undefined for arithmetic or comparisons with pointers that do not point to members of the same array
size_t is the unsigned integer type returned by sizeof
modifying a string constant is undefined behavior
No OS: lack of isolation
fork() -> abstracts a CPU core
exec()/sbrk() -> abstract RAM
assumption: user code actively tries to break isolation, while the kernel is trusted
user mode and supervisor mode
pgtbl: maps virtual->physical
ecall: change to supervisor mode, get to a known point of the kernel code
Singularity provides a way of process isolation without hardware support
the kernel is one big program that holds all of the syscalls
CVE lists 2997 Linux kernel bugs found in 2024 (kernel bugs are much more common than we might think)
build the kernel: make
gcc compiles each .c into a .o; the linker produces kernel/kernel, with a disassembly in kernel/kernel.asm
run the OS: make qemu
qemu simulates the CPU and jumps to the first instruction of the OS
kernel/entry.S -> start.c -> main()
Isolation: sensitive hardware can only be reached through system calls
monolithic kernel: the whole OS resides in the kernel
microkernel: minimize the code that runs in supervisor mode
in xv6, the unit of isolation is the process
trampoline: contains the code from user space to kernel space
trapframe: preserves the state (saved registers) of the process
sret instruction is used to return to user space from kernel space
process: an addr space as a virtual RAM plus a thread as a virtual CPU
Boot a RISC-V Computer:
power on -> run bootloader -> load the kernel at memory address 0x80000000 (RAM starts here; lower addresses are I/O devices) in M mode -> set up a stack at _entry for running C code -> enter C at start -> program the clock chip to generate timer interrupts -> call userinit to create the first process, which makes the first syscall -> complete exec() -> return to user space and start the shell
No strong isolation will cause memory overwrite, which will lead to a bug hard to debug
OS should be defensive
There is a flag to check whether the CPU is at user mode or kernel mode
BIOS is always trustworthy
fork() the syscall has no ordinary function definition; the user-side stub just traps into the kernel
when calling fork: the stub puts SYS_fork (a number) in a7 and executes ecall -> syscall() finds the number in a7 -> creates a new process
How do we get control back from a buggy user application? set a timer
think qemu as a real circuit board
qemu is an infinite for loop: fetch, decode, execute
use gdb: b _entry; c; si; layout src;
use n to goto next line of C code
(Before the labs...)
args for exec() in a0 & a1, syscall number in a7
SYS_exec -> sys_exec
return value in p->trapframe->a0
fetchstr -> copyinstr -> walkaddr -> walk -> pagetable
installing riscv64-unknown-elf-gdb:
download the tar file from the TUNA mirror site, unpack it, and enter the unpacked directory
mkdir build
cd build
../configure --prefix=/usr/local --target=riscv64-unknown-elf
make -j 12
sudo make install
target remote localhost:26000
b syscall
c
layout src
backtrace
p /x *p
p /x $sstatus
what if a user program writes to a random part of memory?
what way could we separate and isolate memory?
page tables
a level of indirection: CPU -(VM)> MMU -(PM)> MEM
satp, MMU, kernel
reduce the size of page tables, not building a direct-map
page=4KB, maximum 52 bits, nowadays 27bits
page table entry: 64 bits, 54 used (a 44-bit PPN + 10 flag bits); the low 12 bits of the PA come straight from the VA (the page offset)
3-level page table to reduce its size
a tree descended 9 bits at a time
page-fault forces transfer to kernel
PHYSTOP = 0x88000000
the design and arrangement of user address space: easier for compiler to compile
a contiguous range of VM need not be backed by contiguous PM
kvmmake() makes kernel's page table
kvmmap() adds PTEs to a page table
    VA:  | EXT (25 bits) | index (27 bits) | offset (12 bits) |
                    |
                    v  (page-table walk, rooted at satp)
    PTE: | PPN (44 bits) | flags (10 bits) |
                    |
                    v
    PA:  | PPN (44 bits) | offset (12 bits) |
TLB: avoids the cost of re-loading PTEs on every access
write the root page table's PA into the satp register
1 page table per process
KERNBASE=0x80000000
after fork(), the child uses the same virtual addresses as the parent (the memory itself is copied)
trampoline and kernel stack aren't directly-mapped
pagetable_t: a pointer to root page table page
walk() finds the PTE for a given VA
instruction sfence.vma flushes CPU's TLB
allocator: keeps a free list of struct run
kfree: fills the freed page with 1s (junk) to catch dangling references
on a violation the CPU refuses the access and raises a page fault; the kernel kills the process and prints a message
such protection makes exploit attacks harder
sbrk() -> growproc() -> uvmalloc()/uvmdealloc() -> kalloc()/uvmunmap()
exec() -> namei() -> quick check -> proc_pagetable() -> loadseg() -> walkaddr()
xv6's kernel lacks a malloc()-like allocator for small pieces of memory
What do we want from isolation?
The VA_MAX could be larger than PA_MAX
where does page table allocates? kalloc()
each page is 4096 contiguous bytes of PA, but consecutive pages need not be physically contiguous
the size of PM is determined by designers
satp points to the root (top-level) page-table page
translation must not itself rely on translation, so page tables store PAs
satp also stores a PA
why walk() ? initialization, sysinfo() needs walk()
flexibility stays in dealing with page faults
where will the data store? depends on a "multiplexer"
mostly-identity mapping is used in kernel mapping
stack overflow: page fault instead of covering other program's memory
consolidate the mapping? good, but xv6 didn't do this action
b main
c
b kvminit
c
layout src
b kvminithart
c
where is satp stored? each process's proc structure records its page table, loaded into satp on a switch
why 3 levels rather than a single big table? whole unused subtrees can be left unallocated
etext is the end of the kernel's text; KERNBASE=0x80000000, so etext-KERNBASE = size of the kernel text
step: one line of code at a time, jumps into the function
next: jump over functions
stepi, nexti: do same thing for asm code
continue: run until next breakpoint
finish: runs until the function ends
break: set a breakpoint
watch: stop whenever value changes
x: prints the raw memory
print: prints the expression
info registers: print value of registers
info frame: prints current stack frame
set: change value of a variable
overwrite return address will cause an infinite loop
int in RISC-V is always 32-bit wide, while long is 32-bit wide in RV32 and 64-bit wide in RV64
in RV64, int is sign-extended
structs: the first 8 pointer-words of arguments go in registers
a value twice the size of a pointer word uses an even-odd register pair; anything larger is passed by reference
since published, x86 adds 3 instructions per month
.section .text
.global sum_to
the matching C declaration lives in \<defs.h\>
focus: switch the layout window in gdb
tmux:
C-b c: new window
C-b p & C-b n: switch between terminals
C-b %: vertical split
C-b ": horizontal split
C-b o: jump between split panes
apropos will display gdb built-in manual
caller: not preserved, callee: preserved
s0-s11 is for? maybe to give compiler flexibility
stack frame generated by calls
sp->bottom of the stack
fp->top of the current frame
i frame: show the message of stack frame
backtrace
focusing on user->kernel transition
NEVER execute user code in supervisor mode
C on RISC-V puts function arguments in a0, a1, a2, &c
check pagetables: C-a c, info mem
trampoline: the start of kernel's trap handling code
at the top of the address space: avoids punching a hole in the user address range
observe PC is an indirect way to check whether we're in supervisor mode
ecall:
change mode, save PC in SEPC, jump to STVEC, disable further interrupts
ecall does as little as possible
even supervisor mode is constrained to use pgtbl
save the user registers:
using trapframe or the sscratch register
why not a crash? we just switched page tables while executing
the trampoline is mapped at the same addr in pgtbls
back to usertrap(), return $ and space
usertrapret()
prepare for next transition
use sret instruction
complex from isolation
what if design for faster?
ecall & exception will cause a trap
stvec: trap handler
sepc: PC saver
scause: the cause of the trap
sscratch: scratch register, avoids overwriting a user register
usertrap: determine the reason of trap
usertrapret -> userret -> switch satp to user space -> load trapframe addr to a0 -> executes sret
argint, argaddr, argfd retrieve the n'th syscall argument as an integer, address, or fd
kernelvec: save regs, jump to kerneltrap, yield
don't let user code interfere with the user/kernel transition
supervisor: R/W control regs, satp, stvec, sepc, sscratch, use PTE/PTE_U
write() -> ecall -(user->kernel)-> uservec -> usertrap() -> syscall() <-> sys_write() -> usertrapret() -> userret()
the user code could only access memory with PTE_U flag
(checking by C-a c, info mem)
csrrw: swaps a0 with the sscratch register
no crash, plus the value of pc, shows we are in supervisor mode
the OS cares a lot about performance, so ecall itself never forces any stack operations
struct trapframe stays in proc.h, using 32 slots to save registers
when running shell, sscratch holds the pointer to trapframe
save? beware overwrite
why do the next instructions still execute? we're still in the trampoline, where user VA = kernel VA
what if switch to another process?
we could only switch pgtbl on trampoline
sret: re-enables interrupts, sets the PC from sepc, switches to user mode
what happens in UIE?
VM-related projects: COW, mappings, performance
ideas: isolation, indirection
panic? no: update the page table when the page fault happens
VA causes page fault: stval reg
violation causes page fault: scause reg
instruction/mode can also cause page fault
if the user program asks for more memory, sbrk() could be expensive
zero-filled page? large part of memory filled with zero
COW: copy-on-write (copy only when a write happens)
don't let fork() eagerly copy every page from the parent
but share addr space between parent and child, use RSW in PTEs
page fault: make copy, map, read/write(hard in real life!)
load the pages for the file on demand
keep meta information in VMA
memory-mapping: use load & store to access files
page-in & page-out follow an LRU strategy, especially when the file is larger than physical memory
KPTI is used to handle meltdown
Modern OSes allocate memory lazily, mapping a page only when the user program touches it
COW: make all pages read-only, On page fault, copy page and mark R/W, use extra PTE bits
xv6's response:
kill the faulting process; kernel panic
naively let child and parent share the same PM? one's writes clobber the other!!!
load, store, instruction page faults
COW fork is faster, especially when fork is immediately followed by exec
lazy allocation
demand paging; page to disk
<a>https://lwn.net/Articles/849638/</a>
Project Zero issue; CVE-2020-29374
for performance, Android doesn't immediately follow fork() with exec()
Xu: breaking COW via GUP; Torvalds: do_wp_pages()
the new approach breaks RDMA self-tests
a userfaultfd() was reported
something about the new rules
the way of using RDMA for soft-dirty or userfaultfd() is broken now
page faults turn the page map from a static mapping into a dynamic one
information needed: the faulting VA page, found in the stval register,
and the type of page fault
when page fault happens:
allocate 1 page, set the page all 0, map the page, restart instruction
what if the process uses up the PM? just kill the process
lazy allocation: the memory not allocated will not be mapped
freeing memory that was never actually allocated would fault; sbrk() only raised p->sz
negative number? shrinking addr space, but be careful!!!
zero-fill on demand
kalloc() a zero page, change the mapping
copy, update, restart
the cost of copying the whole address space in fork() is huge!!!
COW fork: use share instead of copy
set all shared pages read-only; on a page fault, copy the page, map the copy writable, restart the instruction via userret()
except for the trampoline (shared by everyone, never freed), a physical page may now belong to two or more processes
but when to free the page?
demand paging
exec(): loads the text and data segments, eagerly allocating pages
read, map, restart
If out of memory: evict, use the just-free-page, restart
evict the clean pages not dirty ones!!!
mmap(va, len, prot, flags, fd, off)
munmap(va, len): write back dirty blocks
pgtbl is hard to debug, why?
how to speed up syscall? Linux has vDSO
vDSO enables virtual syscalls
511: trampoline
510: trapframe
509: USYSCALL
protection bits? URV, WRV, XRV
scan all pages could be expensive!!
which syscall could be sped up?
no side effects; returns a constant, or a value that only changes inside the kernel
options: getpid() & uptime()
provide a bitmask for pages accessed
detect page accesses without access bits? use page fault!
TOCTOU attack: argument is modified after the kernel reads it
chatgpt: reduce the time gap between check and use
move such syscalls from kernel space into the process's address space, avoiding the cost of a context switch
vDSO? a library, or a memory region
vDSO is a fully formed ELF image
file-backed pages
change of kernel-tracking infrastructure
CPU & devices: complicated & parallel
most code in modern OSes are device drivers
UART, RS232 port
UART & FIFO: not using a busy loop, but using interrupts
UART interrupts if: rx FIFO goes from empty to not-empty, or tx FIFO goes from full to not-full
device interrupts: device -> PLIC -> trap -> usertrap()/kerneltrap() -> devintr()
interrupt just means the state of device is changed
the bottom-half of interrupt doesn't run in the context of top-half
registers: sie(supervisor interrupt enabled), PLIC claim: get next IRQ, sstatus
kernelvec is like trampoline for kernel
executing syscall in kernel, save in some proc's stack
multiple devices? deliver to different CPU or stay pending
disable interrupts? clear SIE, using intr_off(), remember the pending interrupts, deliver when kernel re-enables interrupts
many places have parallelism; operations may not be atomic!
using producer-consumer buffer
trampoline cannot tolerate a second interrupt to trampoline!
polling strategy: Top-half loops until device says it is ready
use DMA is more efficient
dispatch happens in devintr()
top-half: runs in a kernel thread; bottom-half: executes at interrupt time
UART appears as a series of memory-mapped registers
use consoleinit() to init UART hardware
transmit complete
init() will let the shell read the console
read() -> consoleread() -> sleep()
devintr() -(UART)-> uartintr() -> consoleintr()
each time the UART finishes sending a byte, it generates an interrupt.
in write(), uartintr() -> uartstart()
allowing processes to execute concurrently with device I/O
in RISC-V, timer interrupt happens in M mode
start.c executes before main to program the CLINT hardware, set up a scratch area, and set mtvec to timervec
The kernel could be made somewhat simpler if device and timer interrupts only
occurred while executing user code.
save, process, resume
asynchronous, concurrency, program devices
How the '$' appears, and what happens after ls ?
PLIC: routes 53 interrupt lines from devices
a core claims and holds an interrupt
Driver manages device
interrupt handler: top, bottom, a queue in it with read/write interface
the queue is used to decouple top & bottom
programming device: memory-mapped I/O; ld/st
the kernel and the device should make a protocol
'$': the driver puts the character into the UART; the UART generates an interrupt when the char has been sent
'ls': the keyboard is connected to the receive line; each character generates an interrupt
RISC-V support:
SIE: 1 bit for external interrupt
SSTATUS: bit enable/disable
SIP: interrupt pending
SCAUSE: the cause of interrupt
STVEC: holds the address the CPU jumps to on a trap
plicinit(): take interrupt from the devices
more initialization in main.c, main() finally calls scheduler()
init.c: creates the console device and opens it
the shell writes its prompt to fd 2
a pointer to producer and a pointer to consumer
Interrupt:
if the SIE bit is set: hardware clears SIE (blocking further interrupts)
sepc<-pc, save the current mode, enter supervisor mode, pc<-stvec
interrupts & concurrency:
devices & CPU run in parallel, interrupt stops current program
top & bottom deliver may run in parallel
producer & consumer: read after write, using a queue
once, interrupts were fast; now they are slow (relative to the CPU)
solution: polling
dynamic switch
kernel must deal with parallel syscalls
a race between two cores calling kfree() can lose a page
if multiple cores call acquire() on the same lock, only one returns; the others wait until the lock is released
automatic locking? no: we need explicit control over which regions of code are atomic
deadlocking problem
locks are often not private business of modules
lock & parallelism may require a whole re-write for the project!
use a big lock first; a big lock is always correct (if not fast)
test and re-set the lock atomically, pushing the problem down into hardware
spinlock: if locked==1, loop again; if locked==0, set it to 1 and return
if you use locks, you don't need to understand the memory ordering rules, you need them if you want to write exotic "lock-free" code
in RISC-V:
a5 = 1
s1 = &lk->locked
amoswap.w.aq a5, a5, (s1)
concurrency: multiple instruction streams are interleaved
wait() frees child's memory
race: the location is accessed concurrently
critical section: code covered in acquire() and release()
invariants: properties of data structures maintained across operations
traditional approach:
    for(;;) {
      if(lock == 0) {
        lock = 1;       // BROKEN: the test and the set are not atomic
        break;
      }
    }
two cores may both see lock == 0 at the same time and both take the lock!
amoswap r, a
swaps the value in address a and register r
kalloc: a single free list protected by single big lock
deadlock: one program acquires A then B; another acquires B then A
avoiding deadlock requires a global lock-acquisition order
file system contains a huge lock-chain
re-entrant (recursive) lock: allows the holder to acquire the same lock again
if a spinlock is used by an interrupt handler, a CPU must never hold that lock with interrupts enabled
nesting level of locks: acquire() calls push_off(); release() calls pop_off();
__sync_synchronize() tells the compiler not to re-order the code, in fact, it's a memory barrier
sleep-locks: don't spin on a core when the lock may be held a long time
locks are expensive when many CPUs contend for the same lock at the same time
the p->lock acquired in yield() is released over in the scheduler thread
the inode's lock acts as a shared lock
the spinlock implementation itself cannot rely on a lock
scheduler, different process call fork() at the same time
apps want to use multiple cores, so kernel must handle multiple syscalls
the performance of single core has reached a limit
why locks? avoid race conditions
many locks -> more parallelism
access a shared data, one is write -> using lock
lock-free programming
could locking be automatic? it can give wrong results, eg: rename("d1/x", "d2/y");
lock avoids wrong updates, make operations atomic
lock helps maintain an invariant
deadlock:
acquire(lock);
...
acquire(lock);
order the locks globally
internals of m2 must be seen by m1
rebuilding code with locks: start with coarse-grained locks, measure which locks are contended
lock isolation, redesign
UART in printf: a buffer with a read pointer and a write pointer
implementation of amoswap depends on memory layout
same CPU: let operation atomic
memory fence: no load/store instruction may be reordered across it
use a race detector
is fence unnecessary as there are atomic memory instructions?
the compiler knows what couldn't be moved
old OS kernels didn't acquire locks, assuming they ran on a single core
concurrency: threads inside the kernel, processes
each thread, taken alone, executes in an ordinary way
need locks when interact
in xv6, only one user-level thread
event-driven programming with state machines is another way to multitask
executing: using resources; not executing: save and release
each process has its own kernel thread
p->state: running, runnable, sleeping
user->kernel->scheduler->kernel->user
scheduler thread: one per CPU
idle scheduler: no running thread
swtch() returns to scheduler()
swtch(): saves current registers in xx(a0), restores registers in xx(a1)
yield() acquires lock, scheduler() releases it
the lock is released by a different thread
p->lock makes steps atomic
scheduling policy?
yield() can be called by kerneltrap()
force switching: sleep & wakeup; long period without sleeping
context: struct context -> struct proc
swtch: save callee registers, return to instruction pointed by ra register
continue for loop, find a thread, run it
hold the lock, or different CPU may run same thread
procedures that intentionally transfer control to each other via thread switch are sometimes referred to as coroutines
scheduler: find a process, run, until it is terminated
maintain invariants: p->lock acquires in a thread and releases in another thread
in multi-core, we can't hold global variables for process, create struct cpu for each core
tp: stores the core's hartid
xv6 disables interrupts while the current struct cpu is in use
so the return value of myproc() is safe
conceal actions from one thread to another: sleep & wakeup
sequence coordination mechanism
busy-wait on a spinlock? most of the time it gives the right result, but it's expensive
a naive sleep and wakeup causes the lost-wakeup problem
solution: the caller must pass the condition lock to sleep()
sleep: mark the current process SLEEPING and give up the core
same channel: see spurious wakeups
pipewrite & piperead are complex applications of sleep & wakeup
nwrite == nread + PIPESIZE
wait: wait_lock; exit: reparent
kill: set p->killed, wakeup
p->parent is protected by wait_lock instead of p->lock
round robin: run each process in turn
linux's sleep adds a wait queue
full use of multi-core
thread: one single exec
care the PC
interleaving of multiple threads
interleave threads
scheduling: suspend one thread, execute another thread
compute-bound?
pre-emptive scheduling: even if the code never gives up the CPU, the kernel interrupts it with the timer
by contrast, voluntary scheduling
p1 -> tf1 -> kernel stack-> swtch()-> ctx1 -> ctx0 -> swtch() -> scheduler() -> tf2 -> p2
where are contexts stored? in the process structure, p->context
process vs threads: a process's thread may be running in user space, in the kernel, or suspended with its state in the trapframe and context
yield(): change the thread into runnable
no point to save PC as we know we're in swtch
ra will be the point to return
why does swtch() save only half the registers? it is called as a C function, so the caller already saved the caller-saved registers; only the callee-saved ones remain
we want the start process also atomic
what we really care is ra
xv6 kernel code uses only integer registers, not floating-point registers, so swtch() skips them
we trust kernel code not to spin forever with interrupts off
why must the trampoline and swtch be written in assembly? C cannot directly manipulate ra and sp
almost each thread in Linux is a complete process
allocproc() sets the new context for the process
the first call in forkret is about initializing file system and crash recovery
why hold p->lock across threads?
holding other locks while yielding the CPU is forbidden!
threads often wait for conditions
use coordination primitives
uartwrite(): writing thread should give up CPU
sleep() cannot simply wait for the event, or we get the lost-wakeup problem
what's wrong if uartwrite() releases the lock before calling broken_sleep()?
uartwrite() could go to sleep EVEN THOUGH UART TX WAS DONE, and nothing would ever wake it
use lock to prevent wakeup() from running during the entire window
sleep() will require a lock to protect its condition, both sleep() and wakeup() will hold the condition lock
caller acquired condition lock before calling wakeup()
acquire p->lock before releasing condition lock
wakeup() holds both locks, sleep() just need to hold either
wakeup() can't proceed until after swtch() completes so wakeup() is guaranteed to see p->state==SLEEPING and p->chan==chan
all uses of sleep are wrapped in a loop, so they re-check
sleep() doesn't need to know the condition
a thread cannot free all of its own resources
voluntarily quit with exit()
kill() just sets p->killed flag
kill() will change SLEEPING into RUNNABLE
user space: will die next time it makes a system call or takes a timer interrupt
kernel space: target will never execute another user instruction but may spend a while yet in the kernel
(DONE)
coordination
acquire(p->lock);
p->state = RUNNABLE;
swtch();
release(p->lock);
check whether the thread is runnable
calling swtch() requires p->lock but forbid any other locks
two threads acquire same locks
freeze as the process2 will spin forever
coordination: sleep() & wakeup()
busywait loop
sleep channel: lets wakeup() find which sleeping threads to wake
why sleep needs the second argument?
lost-wakeup problem
broken_sleep: simply marks the process SLEEPING and switches away
wakeup: for each p in procs: setback the process to runnable
    int done;

    uartwrite(buf):
      for each c in buf:
        while not done:
          sleep(&tx_chan);   // BROKEN: gap between the check and the sleep
        send c;
        done = 0;

    intr():
      done = 1;
      wakeup(&tx_chan);
add lock & unlock; why not work?
on some other core, UART interrupt is working
nothing will wake up broken_sleep(), because wakeup() has already happened
tx_done is how the interrupt routine communicates with the writer in uartwrite()
each keystroke causes an interrupt that briefly wakes the sleeping process, which then goes back to sleep
wakeup(): run through process table, lock every process, if SLEEPING & same channel, wakeup and release lock
after we release the condition lock, wakeup() cannot examine this process until it acquires the process lock
reader acquire the lock to prevent writer from sleeping
the last thing sleep() does before returning is re-acquire the condition lock
exit() will close files if necessary
wait() scans for a process whose parent is the current process and whose state is ZOMBIE
parent will call freeproc() to release resources
concurrent exit of parent and child?
why doesn't kill() immediately destroy the target process wherever it is?
(in the real world that's not allowed, but xv6 has no permissions, so you can kill anything)
in Linux, the target process must belong to your user id, or the kill is not allowed
we must never expose broken invariants of the file system
files, bit arrays, human-readable names
fs APIs
fd: stays linked to the file even if the file changes
a file can have multiple links
a file's metadata must be stored somewhere other than the directory entry
inode: i-number (identifies the file), link count, count of open FDs
inode deallocation deferred until last link and FD are gone
data stays on disk without power
historically, disks were read/write usually in 512-byte units, called sectors
in HDDs, sub-sector operations are expensive (read-modify-write)
flash must be erased before written
xv6 uses 2-sector blocks
xv6 treats disk as an array of sectors
metadata: everything on disk other than file content
on-disk inode: type, nlink, size, addrs[12+1]
content is array of dirents
dirent: inum, 14-byte filename
view file system as an on-disk data structure
facing concurrency challenges
look at block cache
bread->bget, watch the bcache
bcache.lock: description of what's in the cache
b->lock: protects just the one buffer
bcache replacement: bget, brelse
namex() ilock()
getting a reference separately from locking
0: syscalls
1: names+fd
2: inodes
3: inode cache & buffer cache
4: Log
5: virtio & disk driver
FD: stdin(0), stdout(1), stderr(2)
mkfs: generate layout for new FS, the layout will stay static for the lifetime of fs
indirect block: inode is fixed size, but file could be large
tree-structure: dir tree layer; inode layer; block layer
file system must have crash recovery
block 0: boot sector
block 1: superblock, holds metadata
blocks starting at 2 hold the log
inode, bitmap, data
bcache: ensures only one kernel thread at a time uses a block's copy; caches popular blocks
bcache is a doubly linked-list, main->binit
bread->bget->sleeplock, scan the bcache
at most one cached buffer per disk sector
brelse to release the buffer
block allocator: maintains a free bitmap
balloc: find a free block, update the bitmap, return the block
bfree: clear the block
on-disk inode: struct dinode
itable, struct inode to copy dinode into memory
iget, iput
many in-memory pointers may refer to the same inode (iget() returns shared references)
hold ilock and release by iunlock
Code that modifies an in-memory inode writes it to disk with iupdate
iput() doesn’t truncate a file immediately when the link count for the file drops to zero
direct/indirect block
bmap: returns the nth data block of inode ip, allocating one if it doesn't exist
itrunc: frees the blocks and sets the inode's size to 0
directory: has type T_DIR, dirlookup
namex -> skipelem
namex locks each directory in the path separately
avoid racing by using sleeplocks
an open file is represented by a struct file
kept in the ftable; filealloc, filedup, fileclose, fileread/filewrite
the inode has no concept of offset; the offset is kept in the struct file
sys_link, sys_unlink
sys_link creates a new name for the inode
using create, it is easy to implement sys_open, sys_mkdir, and sys_mknod
user-friendly, concurrency, persistence
the abstraction itself is useful
crash safety
disk layout
performance problems, as disk are always slow
pathname is human-readable, no explicit offset in write() syscall
file system structure: inode
when both the link count and fd count reach 0 the file can be freed; each open file maintains an offset
sync/async read/write option? yes, same way as console driver
max file size in xv6: (256+12)*1024
make files bigger: add indirect portions
why 256 addresses in the indirect block? a block is 1024 bytes and each block address is 4 bytes
how to compute the block index? divide the offset by 1024
to read byte 8000: get its block number from the inode, then read that block
how to find a file's inode from the root? scan the root directory's blocks, then each subdirectory's blocks
create a file:
33, write into the free inode
33, fill in the inode
46, write the first block
32, size changed
33, return back the inode
write "hi" into x:
45, bitmap refresh
595: the block allocated for x
33: updating the size of inode again
bget: get the bcache lock, scan the buffer
xv6 follows a rule: any modification of the bcache needs the bcache lock; any modification of block 33 needs its sleeplock
drawback of spinlocks: interrupts must stay off while one is held
a sleeplock yields the CPU while waiting
bcache: a copy of block in mem, sleeplock, LRU, 2-level-blocking
what if fs wrote 46 first, then 32 and 33, and crashed in between? the dirent would point to a still-free inode
correctness and performance often conflict
crash recovery is a recurring problem
logging solution: goal: fast recovery, atomic syscalls
log, commit, install
write-ahead log: if "done" in log, replay all writes in log, if no "done", ignore log
multiple data structures, lay on existing storage system, high performance
buffer cache, in-memory log block # array, FS tree on disk, log header and blocks on disk
write buffers, wait for disk to complete write, keep data in buffer cache
begin_op()->bmap()->bread()->log_write()->iupdate()->end_op()
when crash happens, kernel calls recover_from_log(), it's OK to replay the log more than once
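The recovery rule above can be modeled in a few lines, with dicts standing in for disk and log blocks (illustrative only, not xv6's code); note that replaying twice gives the same result:

```python
def recover_from_log(disk, log):
    """Replay logged writes only if the header says the transaction committed."""
    if not log.get("done"):
        return  # no commit record: ignore the partial log
    for blockno, data in log["writes"]:
        disk[blockno] = data  # install each logged write at its home location

# block 33 = inode block, 45 = bitmap block, as in the trace above
disk = {33: "old-inode", 45: "old-bitmap"}
log = {"done": True, "writes": [(33, "new-inode"), (45, "new-bitmap")]}
recover_from_log(disk, log)
recover_from_log(disk, log)  # idempotent: replaying again changes nothing
```

Idempotence is what makes it safe to crash during recovery itself.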
challenge: prevent write-back from cache
a syscall's writes must fit in the log
allowing concurrent syscalls
xv6's solution: wait
avoid writing the same block multiple times: write absorption
order 33, 33, 46, 32 with a crash: an inode is left allocated but unreferenced (wasted)
order 46, 32, 33, 33 with a crash: a dirent points to an inode that may be reallocated, causing a disaster
we hope for: internal fs invariants hold, all but the last few operations are durable, no reordering of data writes
xv6 assumes disk is fail-stop
pin dirty blocks in the buffer cache until installed, then unpin them
Logging makes file system transactions atomic
write-ahead logging is key
log writes, write a special commit, copy the writes, erase the log
design: logged blocks
write the header block with a transaction commit
sequence of write must be atomic
group commit: commit several transactions together
absorption: achieve better performance
end_op()->commit()->write_log()->write_head()->install_trans()
a crash can leave the on-disk state incorrect
fs operations are all multi-step
write 33, crash before 45: the inode references a block still marked free in the bitmap
if block is shared between files, there will be a security problem
solution: logging, atomic fs calls, fast recovery
crash between log write? see commit records
every fs-related syscall: begin_op() -> operations -> end_op()
ialloc() -> bwrite() -> log_write()
bwrite() is never used on its own; it is always paired with log_write()
we don't expect a process to survive a crash
log, create, install
ext3: addresses the 2x write cost of logging
eviction would break atomicity and violate the write-ahead rule
solution: don't evict blocks in the log
fs operation must fit in log
the log holds at most 30 blocks, so one operation may write at most 30 blocks
write fs->many transactions
concurrent fs calls
all concurrent ops must be fit, limit concurrent calls
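One way to picture that limit: reserve a worst-case budget per operation and make callers wait when the log can't fit them. A sketch assuming xv6-like constants (`LOGSIZE`, a per-op `MAXOPBLOCKS` budget, both invented here); real code sleeps instead of returning False:

```python
LOGSIZE = 30
MAXOPBLOCKS = 10  # assumed worst-case blocks one operation may write

class Log:
    def __init__(self):
        self.reserved = 0     # blocks reserved by in-flight operations
        self.outstanding = 0  # number of in-flight operations

    def begin_op(self):
        # Admit the op only if its worst-case writes still fit in the log.
        if self.reserved + MAXOPBLOCKS > LOGSIZE:
            return False  # real code would sleep until space frees up
        self.reserved += MAXOPBLOCKS
        self.outstanding += 1
        return True

    def end_op(self):
        self.outstanding -= 1
        self.reserved -= MAXOPBLOCKS
```

With these numbers at most three operations run concurrently; a fourth must wait.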
crashes may lead to problem that won't show immediately
make syscalls atomic(all-or-nothing)
ext3: less waiting, reduced writes, ordered data mode
batch update blocks, commit the current open transaction
performance: multiple syscalls, multiple transactions
the crash may forget last few ops
log is circular: SB T4 T5 T2 T3
before re-use, must write log superblock with offset/seq of resulting now-oldest transaction in log
logging could help with "fail-stop"
reboot, look in the log, find the end of the log, replay all logged writes
another crash? just repeat the logging steps
commit must write the on-disk log from a stable snapshot of the cached blocks; otherwise a crash could replay some but not all of a transaction's writes
efficient using COW fork
in ext3, metadata and blocks are both written in log
write content after metadata? no, crash may leave inode pointing to blocks from other file!
correctness challenges
not enough free log space? free the oldest transactions first
use checksums
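A hypothetical torn-commit detector along those lines: the commit record carries a checksum over the logged blocks, so recovery can reject a half-written transaction (a sketch, not any real journal's on-disk format):

```python
import zlib

def make_commit(blocks):
    """Build a commit record whose checksum covers all logged blocks."""
    payload = b"".join(blocks)
    return {"blocks": blocks, "checksum": zlib.crc32(payload)}

def valid_commit(txn):
    # Recovery recomputes the checksum; a torn write won't match.
    return zlib.crc32(b"".join(txn["blocks"])) == txn["checksum"]
```

With a checksum the commit record can even be written in the same pass as the data, since a partial transaction is detectable after the fact.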
preservation, predictability, atomicity
ext2fs, however, is not atomic
synchronous metadata update has performance problems
solution: deferred, ordered writes, but this may lead to cyclic dependencies
soft updates: share the same problem, as recovery still needs to scan the whole disk
commit: changes take no effect until the commit is final
journaling file system: the old and new versions of incomplete updates are both preserved, by writing the new versions to a separate location on disk until the update has been committed
transactions are atomic: either undo or redo
contains all changed metadata
reading the existing file system
before committing, make sure all logged blocks are written to the disk
fs has no transaction abort, transactions are short-lived
create a new transaction so often
fewer I/O operations are needed in the long run, but fs timing becomes less predictable
inodes, bitmaps: record new contents and maintain atomicity
journal metadata block: contains the entire metadata block, not just the changed bytes
keep track of metadata buffers
keep old blocks unsynced until committing, to survive a crash
complete commit: close the transaction, flush transactions to disk, wait for outstanding operations, wait for outstanding transactions, update journal handle block, mark as pinned
without syncing, data may be lost if a crash happens while completing a commit
ext3 = ext2 + journal
let operations in begin_op() and end_op() atomic
write-ahead rule: every modification must be written to the log before it is installed in place
free rule: log space is freed only once the logged blocks have been installed (synced) at their home locations
ext3 didn't modify the bottom layer of ext2, but just added journalling system in it
cache, action info layer(sequence numbers, block numbers; handles), fs tree
ext3 log format:
log super block: off, seq, desc, commit
there's a bunch of closed older transactions
async, batching, concurrency
fsync(fd): force all buffered writes for the file descriptor to disk
batching: one open transaction
write absorption, disk scheduling
concurrency: syscalls in parallel, older transactions
transaction states: one open, then committing, writing, freed
a new transaction cannot open until the previous one has closed
sys_unlink()
h = start()
get(h, block#)
stop()
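A toy version of that handle API (the `start`/`get`/`stop` names follow the note; the bookkeeping is invented for illustration):

```python
class Journal:
    def __init__(self):
        self.open_txn = set()  # block numbers dirtied by the open transaction
        self.handles = 0       # syscalls currently inside the transaction

    def start(self):
        self.handles += 1
        return self  # the handle ties this syscall's writes to the open txn

    def get(self, blockno):
        # Declare intent to modify blockno; it joins the open transaction.
        self.open_txn.add(blockno)

    def stop(self):
        self.handles -= 1  # commit may proceed once handles drops to 0
```

Usage mirrors the three lines above: `h = j.start()`, then `h.get(block)`, then `h.stop()`.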
steps in commit:
1. block new syscalls
2. wait for outstanding syscalls to finish
3. open a new transaction
4. write a descriptor block with the block numbers
5. write the blocks to the log
6. wait for those writes to finish
7. write the commit record
8. wait for the commit point
9. write the transaction's blocks to their home locations
10. re-use the log
lots of records to show the transactions
if there's a crash?
assuming the disk survives, after a power cycle and reboot there is a bunch of old completed transactions in the log
every descriptor block starts with a magic number
after SB, start with a magic number, it must be a descriptor block
why not let T2 get syscall just after T1 finishes syscall?
create()->unlink(), crash?
the inode won't be deleted, y still exists in file system
no possibility to see the refresh in the future transaction
copy into cache, copy into disk
logging turns multi-step disk updates into atomic transactions
the correctness of logging depends on the write-ahead rule
why not put commit message right in the descriptor block?
no problem is eliminated
OS kernel use VM in creative ways, even benefit from VM
trap, prot, unprot, dirty, map
mmap(): map memory into user space
mprotect(): change the permission of mapping
munmap(), sigaction()
on a fault: the kernel asks the VM system what to do, generates a signal, runs the user handler, returns to the user program, and continues or restarts the faulting instruction
example: materialize a huge sqrt table lazily, one page at a time
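The sqrt-table example can be mimicked in user space: treat a missing page as a "fault" and fill just that page on demand. A pure simulation (the `PAGE` size and class are invented; the real trick uses mprotect and a fault handler):

```python
import math

PAGE = 512  # entries per simulated page; arbitrary

class LazySqrtTable:
    """Demand paging in miniature: a 'fault' materializes one page of values."""
    def __init__(self, n):
        self.n = n
        self.pages = {}   # page number -> list of sqrt values
        self.faults = 0   # how many pages we actually computed

    def __getitem__(self, i):
        p = i // PAGE
        if p not in self.pages:  # "page fault": fill only this page
            self.faults += 1
            base = p * PAGE
            self.pages[p] = [math.sqrt(x) for x in range(base, base + PAGE)]
        return self.pages[p][i % PAGE]

t = LazySqrtTable(1 << 20)  # a million-entry table, none of it computed yet
```

Only the pages actually touched ever get computed, which is the whole point of the trick.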
real-time GC Algorithm
real-time GC: allocation must stay cheap even with a large heap
cost: checks on every pointer fetch; difficult to run the collector and the program at the same time
adding instructions could solve user-level VM problems
add user-level handler to handle fault with more friendly messages
protN: don't decrease the accessibility of one page at a time; decrease a large batch at once, for a small cost per page
in the same pagetable to avoid expensive context switching
fetch and store are common in memory operations
real-time garbage-collecting algorithm
divides the memory into 2 spaces: from-space & to-space
trace out reachable objects
forwarding pointer: left in the from-space object, pointing at its to-space copy
the mutator sees only to-space pointers; the new and scanned areas contain only to-space pointers
synchronize the mutator and the collector
multiply-mapped pages: let the collector scan a page while the mutator can still access it
SVM: a large coherent VM accessed by multiple cores/machines
shared memory only exists virtually
page fault could be used to make checkpointing real-time
use access-protection page faults instead of saving all data to disk at once
protect dirtied-pages only
young records: may die sooner than old records and likely points to old records
VM could detect assignments to old records
persistent store: a dynamic allocation heap persists from one-invocation to the next
commit or abort
as fast as fetches and stores in main memory
modified: not altered until commit
compiled programs can perform fetches quickly
extended address
handle page fault, append another page, transform it into short pointer
page compression algorithm
detect stack overflow: mark the page above the stack invalid
heap overflow: can be detected by an explicit compare and conditional branch
at the end of the heap, cause a page fault, call the garbage collector
slow fault will cause problems
clear what is needed and what is not
TLB consistency, easily patch and fix TLB faults
batch TLB shootdowns to reduce interruptions across cores
optimal page size is driven by DRAM trends
algorithms must be asynchronous friendly
OS kernel uses VM in creative way
VM could have same mechanism as the kernel
what primitives of VM does user program need
trap
prot1-decrease accessibility(R+W->R->not at all)
unprot-increase accessibility
mprotect() still works on page accuracy not addr accuracy
AS: pagetable+VMA
contiguous range of memory, some permission, backed by same object
user-level traps
does this trap break isolation? the upcall runs the whole program in the same context and page table
challenge: the table may be big
use VM primitives: allocate a huge range, let page faults fill it in
GC: copying, forwarding, move objects from FROM space to TO space
Baker's algorithm: at the flip, just copy the roots
new: forward a few more objects
check whether the pointer being dereferenced points into from-space
fault handler, scan a page of objects, forward them
no pointer to check anymore
when to-space fills up, flip the spaces so it becomes the new from-space; if it fills up again, flip again
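The two-space copying idea with forwarding pointers can be sketched as follows (list indices stand in for addresses; this is a stop-the-world sketch, not Baker's incremental version):

```python
def copy_collect(from_space, roots):
    """Copy reachable objects from from_space into a fresh to_space."""
    to_space, forward = [], {}  # forward: from-index -> to-index

    def forward_obj(i):
        if i is None:
            return None
        if i not in forward:  # first visit: copy, record forwarding entry
            obj = from_space[i]
            to_space.append({"val": obj["val"], "fields": list(obj["fields"])})
            forward[i] = len(to_space) - 1
        return forward[i]

    new_roots = [forward_obj(r) for r in roots]
    scan = 0
    while scan < len(to_space):  # Cheney-style scan: forward copied fields
        obj = to_space[scan]
        obj["fields"] = [forward_obj(f) for f in obj["fields"]]
        scan += 1
    return to_space, new_roots
```

Unreachable objects are simply never copied; Baker's trick is to do the scan loop incrementally, driven by page faults.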
tricky use of concurrency
mappings translate the app's view of memory to the GC's view
VM is still a developing system
a VMA covers a contiguous range with no holes
What should a kernel do?
traditional: a "monolithic" kernel abstraction(UNIX, Linux, xv6)
abstraction leads to security
a monolithic kernel tends to be big, complex, and buggy, and designing it forces many intertwined decisions
lots of kernel code runs just to execute a pipe
microkernel: make the kernel as simple as possible
move most functionality out of the kernel, into user space
most bugs are in drivers; moving them to user space keeps the kernel smaller and easier to maintain
L4: less syscalls, less code
missing almost everything, but everything is in the user space
pager: mmap, cow, alloc
design to improve bad performance brought by microkernels
send() & recv(): async & buffered
how to build a microkernel? for embedded, easy; for a server, maybe run UNIX on top of the kernel is good
on it, Linux is just a program!
but Linux switches its threads through L4's threads
fork() -> turn into IPC -> alloc pages -> get page faults
L4 server acts as the Linux's pager
getpid() is turned into 2 syscalls in L4(call, sendrecv)
whole OS benchmark: AIM
microcontrollers are used to run specific software
researchers had largely stopped microkernel research because of its poor performance
too-low and too-high level ideas
L4 built-on threads & addr spaces
inter-process communication
grant, map, unmap
IO are treated as part of addr space
compilers and libraries hides difference of each architecture
Linux's processes and resources management system is architecture independent
device drivers are always hardware dependent
top halves & bottom halves of Linux interrupt handlers
the microkernel's tasks are used to implement Linux processes
problems surrounding libc
emulate a tagged TLB on the Pentium processor
the L4Linux uses physical copyin and copyout
inter-thread using the same addr space
support TLB for smaller spaces
creating one server address space per Linux process
keeping the server space and server threads synchronized led to mistakes
reconstruct the TLB, since Linux's minimum user space is too large
using benchmarks compare L4Linux to MkLinux
co-location decreases L4Linux's security; is it worth it?
measure the cost of getpid() syscall
indeed IPC is the main additional cost in L4Linux
use lmbench to measure the cost of sync RPC
use matrix multiplication to test VM performance
use a protected transfer to reduce the cost of enforcing security; however, performance still decreased
virtual machine: simulation of a computer
accuracy can approach 100%
software simulation: like qemu, but can be slow
run the instructions on the real CPU, but entirely in user mode
VM trap handler
trap-and-emulate approach
emulate privileged instructions against per-guest virtual state
VMM must ensure that the guest only accesses its own memory, and must remap guest-physical addresses
turn page faults into emulated operations
each CPU is either in non-root (guest) mode or root (host) mode
the extended page table (EPT) maps guest-physical to host-physical addresses
2 layers of address translation
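The two-level translation can be pictured with two lookup tables over page numbers (the mappings are made up; real hardware walks multi-level tables per 4 KB page):

```python
# guest-virtual page -> guest-physical page (guest's own page table)
guest_pt = {0: 7, 1: 3}
# guest-physical page -> host-physical page (VMM-controlled, EPT-like)
ept = {7: 42, 3: 9}

def translate(gva_page):
    """Walk both levels: guest page table first, then the EPT."""
    gpa = guest_pt[gva_page]  # level 1: under the guest's control
    hpa = ept[gpa]            # level 2: under the host's control
    return hpa
```

The guest can change `guest_pt` freely; the host stays safe because every access still funnels through `ept`.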
interrupts force the CPU to exit the guest; they are passed to the host
Dune: use VMX to run a Linux process
a loadable kernel module
additional functionality like r/w
bottom line: faster for some Linux syscalls
provide a process abstraction rather than a machine abstraction
moving the kernel into a lower layer may lead to higher security
poor interaction between VM and the host OS
Dune is a loadable kernel module
VTX: VMX root & VMX non-root mode
VMLAUNCH & VMRESUME to enter the guest
PCID permits a single user program to switch between multi-pagetables efficiently
include the libdune
translate through ioctl on /dev/dune
assume CPU is free of defects
hypercalls are common in VMM
use a narrower hardware interface to limit the differences
Dune could be considered a type2 hypervisor as it runs on top of an existing OS kernel
host-virtual to host-physical memory translation
query the kernel for mmap, manually update EPT
mirror the configuration set to the kernel
expose raw access to TSC
use VMCALL to apply syscalls
share the low-level VT-x operation code with KVM, but not the high-level code
VPID: enable a unique TLB tag for each Dune process
libDune is entirely untrusted by the kernel; it includes a pagetable manager, an ELF loader, a simple page allocator, and routines that help programs manage exceptions and syscalls
use dune_signals instead of builtin system signals
simpler implementations, higher performance and security
run sandbox in separate memory space
sandbox: null policy, prevent undesirable parties
wedge: a sthread abstraction for safe fork, fast sthread recycling
modified Boehm GC
build synthetic benchmarks to test performance
higher performance in ptrace and trap
leverage modern virtualization hardware
what if write an OS in HLL?
fewer faults, but lower performance and less direct access to the machine's low levels
trade-offs are tricky to measure; must compare against a production-quality C kernel
kernel threads -> goroutines
syscall: SYSENTER, SYSEXIT
solve the heap-exhaustion problem
Go is easy to analyze heap bounds
HLL truly simplified code
concurrency in Go is easy
compare apps through biscuit & Linux
Go's GC has few large pauses but many small delays
many programs could tolerate the pause rate of GC
Go is slower but competitive
Go: type-safe with GC
Singularity didn't fix the security issues in HLL
Rust kernels designed without using Rust as an implementation language
no consensus in GC
risk of deadlock when dealing with heap exhaustion
apply the too-small-to-fail rule: instead of exhaustion, the user application sees delay
certain kinds of bugs are less likely happen in HLL
biscuit is a monolithic UNIX kernel
shim layer: provide functions of underlying kernel
provide user with POSIX interfaces
use Go's channels for synchronization
in performance-critical code, using read-lock-free lookups
biscuit does not support scheduling priority
biscuit's approach to kernel exhaustion: purge caches and soft state -> wait until reserve -> kill by kernel-watcher thread
maximum amount of simultaneously live data
maxlive: a computed bound on the heap a single syscall can use
handle poll & write loops in deep reservations
apply to long-running kernel threads
killer thread is woken when syscall's reservation fails
Go's automatic GC neither reduces fragmentation nor increases performance
unsafe Go is used in memory-related code
count and explain the ways HLL benefits the kernel
code-execution bugs are mostly avoided in Biscuit thanks to HLL features
try to keep the C and Go code paths the same, for a fair comparison
use pingpong and page faults to measure Biscuit's performance
Linux uses RCU to delay such frees of safe deleting
tcpdump -t -xx -n
ethernet addr
today, ethernet LAN is a switch with cables to each host
use the dest addr to decide where to send
arp: translate IP addr to ethernet addr
nest high layer packet in low layer packet
packet mbuf allocator
queues of buffers: avoid forcing discard, keep output NIC busy, allow independent control flow
illustrate tradeoffs in kernel network structure
could a disk cause a livelock?
don't spend time on subsequent packets until done with this one
modern Linux: NAPI
livelock is made worse by doing even more processing before discard
timeout + resend
receive livelock: the system spends all its time dealing with interrupts
polling is expensive, especially when I/O is rare
an interrupt-driven design gives better latency under light load
network implementation must deal with high event rates
non-flow control protocols
receive livelock: the system is not deadlocked, but has no progress on current event
reasonable latency and jitter
scheduling system must fairly allocate CPU resources among packet reception
queues provide some insulation against packet losses due to transient overloads
batch interrupts
receive livelock: a state in which the system makes no useful progress, because the necessary resources are all consumed by processing receiver interrupts
latency is increased by the time to receive the entire burst
transmission starvation: packets may be awaiting transmission while the transmitting interface sits idle
a cycle that could repeat ad infinitum
use a round-robin schedule to provide fairness polling
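The round-robin fairness idea can be sketched over in-memory queues (the quantum and "processing" are stand-ins; a real driver polls NIC rings):

```python
from collections import deque

def poll_round_robin(queues, quantum=2):
    """Visit each input queue in turn, taking at most `quantum` packets per
    visit, so no single busy source can starve the others."""
    done = []
    while any(queues):
        for q in queues:
            for _ in range(min(quantum, len(q))):
                done.append(q.popleft())  # "process" one packet to completion
    return done

q1 = deque(["a1", "a2", "a3", "a4"])  # a busy source
q2 = deque(["b1"])                    # a quiet source
```

The quiet source's packet is served after at most one quantum of the busy source, instead of waiting behind its whole backlog.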
a purely interrupt-driven system can livelock; a purely polling system adds unnecessary latency
fast-path designs
use the router-under-test to check performance
interrupt handler does almost no work at all
unavoidable queue as no guarantee to let the output interface run as fast as input interface
the screend program is typically run as the only application on a system
simply measure the resources used by CPU for receiving packets
What if malicious user code execute in kernel?
the CPU may speculatively execute instructions
on a mispredict: flush the results, revert the state, restart execution
the CPU retires an instruction only when all previous instructions have retired and its result can no longer change
L1 miss: TLB look up, L2 look up with phy addr
in real life, microarchitecture is not entirely invisible
sense whether the instruction is cached: flush + reload
use clflush to evict a line from the cache, use rdtsc to read the cycle counter
deduce bits of kernel data from which cache lines become hot
the attack doesn't always succeed
fix the vulnerability: don't map the kernel in the user pagetable, requiring a pagetable switch on every kernel entry/exit
hardware fix: only return data from speculative loads that pass the permission check
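The flush+reload decision itself is just a timing threshold. A simulation of the probe step (the latencies and threshold are invented; a real attack measures reload times with rdtsc after clflush):

```python
THRESHOLD = 100  # cycles; assumed boundary between cache hit and miss

def probe(latencies):
    """Return the indices whose reload was fast, i.e. the lines the victim
    (or the transient instruction) touched and thereby cached."""
    return [i for i, t in enumerate(latencies) if t < THRESHOLD]

# 256 probe lines, one per possible byte value; pretend the transient
# load touched the line for byte 0x41, so only it reloads quickly
lat = [300] * 256
lat[0x41] = 60
```

Whichever index probes "fast" reveals the secret byte: the cache state is the covert channel.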
Meltdown: side effects of out-of-order processing
exploit side-channel information on modern processors
in out-of-order, CPU calculates for further information
influence the cache, affects the side-channel
kernel ASLR randomizes the offsets where driver is loaded on the boot
use flush + reload to detect whether the instruction is cached so that the microarchitecture is no more hidden
Spectre attacks: unlike Meltdown, they don't rely on privilege escalation
a transient instruction leaves a side effect
exception suppression
build a covert channel to receive information from microarchitecture and deduce this secret
covert channel is not limited to microarchitectures
steps: content of loadable mem location is loaded in reg, transient instruction, flush + reload
bias is often lost in out-of-order execution
KAISER: a strong isolation which won't map any kernel memory into user space
can we have a lock that allows parallel reads?
one writer, but no other reader or writer; lots of reader, but no writer
atomic CAS's execute one at a time -- not in parallel
even read_lock's writes to the lock word are expensive
readers cannot simply go without locking; they use schemes like RCU instead
in RCU: the writer doesn't modify the data in place, but prepares a new copy; a reader never sees a partially modified string, even without holding a lock
before: no change is visible; after: all changes are visible
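The publish-by-pointer-swap pattern can be sketched as follows (Python's reference assignment stands in for an atomic store with a barrier; this shows only the reader/writer protocol, not grace periods or reclamation):

```python
class Config:
    """An immutable snapshot; writers never modify one in place."""
    def __init__(self, data):
        self.data = data

current = Config("old value")  # the published pointer

def reader():
    snap = current   # one reference read: sees old or new, never a mix
    return snap.data

def writer(new_data):
    global current
    updated = Config(new_data)  # build the complete new copy off to the side
    current = updated           # publish; C code needs a barrier before this
```

Readers pay essentially nothing; the old `Config` must only be freed after all readers that might hold it are done, which is where grace periods come in.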
use memory barriers to enforce the ordering of updates
potential disaster: use after free
let writer wait for GC
implement the grace-period delay
RCU makes read-side overhead nearly zero
reference counting, redesign for no sharing
RCU's success: high performance in concurrent reading and updating
support for concurrent reading and updating, low computation and storage overhead, deterministic completion time
VFS needs concurrent reading and updating
RCU allows thread to wait for the completion of pre-existing RCU critical sections, but coordinate activities using other mechanisms
synchronize_rcu: waits until every CPU has performed a context switch
the writer just needs to wait for readers, and readers must somehow communicate that their critical sections are complete
RCU's simplest usage is to wait for a transaction to complete
using RCU in NMI system
execute in RCU, apply the reference counting
an RCU critical section works well if the object is relatively short-lived
RCU could use to remove need for type-safe memory
RCU has higher deadlock immunity than read/write lock
RCU is used increasingly in Linux kernel source code
what if dynamically update the kernel?
observability, performance, security
detect DDoS attack
eBPF uses bin-constraint approach
a C interpreter reads cBPF, emulates the little language, and runs in the same address space as the kernel
does this asm-like language break isolation?
for binaries: look through the whole program and disallow out-of-range offsets
speed: slow interpreter -> JIT
ways of tracking pointers
is eBPF even the right approach to extensibility?
original: designed for capture and filter packets
the classic BPF VM's design lags behind modern processors
mapping eBPF to native instructions
in 2014, eBPF is directly accessed by user space
attached to a designated code path
debugging the kernel
lock the kernel, verify the simulation, secure mode
define the type of eBPF programs
eBPF map: lets data pass within the kernel, between kernel and user, and within user space