THIS NOTE WAS WRITTEN IN MARKDOWN
Purpose of an OS:
Isolate programs from each other and from the hardware; virtualize the hardware
Interact with OS via syscalls
First argument of an I/O syscall is a file descriptor (fd)
UNIX I/O is 8-bit bytes
When CPU receives a syscall:
save registers, jump into the kernel, dispatch and execute the handler, restore registers, jump back
The shell is an ordinary user program, not part of the kernel
fork() lets us create a new process
exec() replaces current process with an executable file
The operating system lets different programs share the hardware
Process: instructions, data, stack
fork() returns 0 in the child process and the child's pid in the parent
wait() returns an exiting child's pid and copies out its exit status
exec() loads an executable file and runs it
a file descriptor refers to a file
the gap between fork() and exec() gives the shell a chance to redirect the child's I/O
the file descriptor is a good abstraction
A pipe is a small kernel buffer exposed to processes as a pair of file descriptors.
Pipe: auto-clean, pass long data, allow parallel execution
Underlying file is called inode
cd must be a shell built-in: if it ran as a forked child, the chdir would affect only the child, not the shell
ALL XV6 PROGRAMS RUN AS ROOT!!!!!
To keep team size and project cost down, we need a good programming environment; UNIX was built to provide one.
PROBLEM: a large number of programs and large-scale jobs
Software is always asked to do new things; to avoid rewriting it every year, divide the code into modules.
Layers: kernel, shell, util
A show of pipes
Easy and packed pattern-matching algorithms
A file is a simple sequence of bytes; files are contained in directories
Concept of I/O redirection
C: ignore the hardware when you can, but keep ways to interact with it directly
An example program: input a boolean equation, output circuit traces
The computer should be built to be easy to use
What would a kernel do?
Everyone thought OS provides what they already know
Many tensions occurred in OS design
Function call vs syscall: a function call has no privilege over the real CPU and memory; a syscall does
High-level language is designed to be portable
By convention we read from fd 0 (stdin) and write to fd 1 (stdout)
The kernel holds a per-process table indexed by fd
The ecall instruction transfers control from user space to kernel space
Parent and child share the same open files (the fd table is copied at fork)
exec() loads the program in the file and discards the current memory image
exec() preserves the file descriptors that already exist
wait() waits for a child process to exit; if the caller has no children it returns -1
installing the riscv64-unknown-elf-gcc:
sudo apt install gcc-riscv64-unknown-elf
memory layout:
text: code, read-only data
data: global C vars
stack: local variables
heap: sbrk, malloc, free
.c -compile-> .o -link-> executable file
strlen() uses array access while strcmp() uses pointer access
kalloc() keeps a large linked list of free pages of memory
LRU buffer cache is implemented as a doubly linked list
static: limited to the file where the variable is declared
This class focuses on RAM and I/O
address space: present memory as a giant array of bytes
memory allocation: decide where in that array to store things
The stack is much smaller than the heap
alloc: carve space from a stack-like buffer by bumping a pointer
afree: pop the space by moving the pointer back down
the behavior is undefined for arithmetic or comparisons with pointers that do not point to members of the same array
size_t is the unsigned integer type returned by sizeof
modifying a string constant is undefined behavior
No OS: lack of isolation
fork() -> abstracts a CPU core
exec()/sbrk() -> abstract RAM
assumption: user code actively tries to break isolation, while the kernel is trusted
user mode and supervisor mode
pgtbl: maps virtual->physical
ecall: change to supervisor mode, get to a known point of the kernel code
Singularity provides a way of process isolation without hardware support
the kernel is one big program that holds all of the syscalls
CVE lists 2997 Linux kernel bugs found in 2024 (kernel bugs are much more common than we might think)
build the kernel: make
gcc compiles each .c into a .o; the linker produces kernel/kernel, with a disassembly in kernel/kernel.asm
run the OS: make qemu
qemu simulates the CPU and jumps to the first instruction of the OS
kernel/entry.S -> start.c -> main()
Isolation: sensitive hardware can only be reached through system calls
monolithic kernel: the whole OS resides in the kernel
microkernel: minimize the code that runs in supervisor mode
in xv6, the unit of isolation is the process
trampoline: contains the code from user space to kernel space
trapframe: preserves the state (saved registers) of the process
sret instruction is used to return to user space from kernel space
process: an addr space as a virtual RAM plus a thread as a virtual CPU
Boot a RISC-V Computer:
power on -> run bootloader -> load the kernel at memory address 0x80000000 (RAM starts here; lower addresses are I/O devices) in M mode -> set up a stack at _entry for running C code -> enter C at start -> program the clock chip to generate timer interrupts -> call userinit to create the first process, which makes the first syscall -> complete exec() -> return to user space and start the shell
No strong isolation will cause memory overwrite, which will lead to a bug hard to debug
OS should be defensive
There is a flag to check whether the CPU is at user mode or kernel mode
BIOS is always trustworthy
fork() the syscall has no ordinary function definition; the user-side stub just traps into the kernel
when calling fork: the stub puts SYS_fork (a number) in a7 and executes ecall -> syscall() finds the number in a7 -> creates a new process
How do we get control back from a buggy user application? set a timer
think qemu as a real circuit board
qemu is an infinite for loop: fetch, decode, execute
use gdb: b _entry; c; si; layout src;
use n to goto next line of C code
(Before the labs...)
args for exec() in a0 & a1, syscall number in a7
SYS_exec -> sys_exec
return value in p->trapframe->a0
fetchstr -> copyinstr -> walkaddr -> walk -> pagetable
installing riscv64-unknown-elf-gdb:
download the tar file from the TUNA mirror site, unpack it, and enter the unpacked directory
mkdir build
cd build
../configure --prefix=/usr/local --target=riscv64-unknown-elf
make -j 12
sudo make install
target remote localhost:26000
b syscall
c
layout src
backtrace
p /x *p
p /x $sstatus
what if a user program writes to a random part of memory?
what way could we separate and isolate memory?
page tables
a level of indirection: CPU -(VM)> MMU -(PM)> MEM
satp, MMU, kernel
reduce the size of page tables, not building a direct-map
page=4KB, maximum 52 bits, nowadays 27bits
page table entry: 64 bits, 54 used (a 44-bit PPN + 10 flag bits); the low 12 bits of the PA come straight from the VA (the page offset)
3-level page table to reduce its size
a tree descended 9 bits at a time
page-fault forces transfer to kernel
PHYSTOP = 0x88000000
the design and arrangement of user address space: easier for compiler to compile
a contiguous range of VM need not be backed by contiguous PM
kvmmake() makes kernel's page table
kvmmap() adds PTEs to a page table
    VA:  | EXT (25 bits) | index (27 bits) | offset (12 bits) |
                    |
                    v  (page-table walk, rooted at satp)
    PTE: | PPN (44 bits) | flags (10 bits) |
                    |
                    v
    PA:  | PPN (44 bits) | offset (12 bits) |
TLB: avoids the cost of re-loading PTEs on every access
write the root page table's PA into the satp register
1 page table per process
KERNBASE=0x80000000
after fork(), the child uses the same virtual addresses as the parent (the memory itself is copied)
trampoline and kernel stack aren't directly-mapped
pagetable_t: a pointer to root page table page
walk() finds the PTE for a given VA
instruction sfence.vma flushes CPU's TLB
allocator: keeps a free list of struct run
kfree: fills the freed page with 1s (junk) to catch dangling references
on a violation the CPU refuses the access and raises a page fault; the kernel kills the process and prints a message
such protection makes exploit attacks harder
sbrk() -> growproc() -> uvmalloc()/uvmdealloc() -> kalloc()/uvmunmap()
exec() -> namei() -> quick check -> proc_pagetable() -> loadseg() -> walkaddr()
xv6's kernel lacks a malloc()-like allocator for small pieces of memory
What do we want from isolation?
The VA_MAX could be larger than PA_MAX
where does page table allocates? kalloc()
each page is 4096 contiguous bytes of PA, but consecutive pages need not be physically contiguous
the size of PM is determined by designers
satp points to the root (top-level) page-table page
translation must not itself rely on translation, so page tables store PAs
satp also stores a PA
why walk() ? initialization, sysinfo() needs walk()
flexibility stays in dealing with page faults
where will the data store? depends on a "multiplexer"
mostly-identity mapping is used in kernel mapping
stack overflow: page fault instead of covering other program's memory
consolidate the mapping? good, but xv6 didn't do this action
b main
c
b kvminit
c
layout src
b kvminithart
c
where is satp stored? each process's proc structure records its page table, loaded into satp on a switch
why 3 levels rather than a single big table? whole unused subtrees can be left unallocated
etext is the end of the kernel's text; KERNBASE=0x80000000, so etext-KERNBASE = size of the kernel text
step: one line of code at a time, jumps into the function
next: jump over functions
stepi, nexti: do same thing for asm code
continue: run until next breakpoint
finish: runs until the function ends
break: set a breakpoint
watch: stop whenever value changes
x: prints the raw memory
print: prints the expression
info registers: print value of registers
info frame: prints current stack frame
set: change value of a variable
overwrite return address will cause an infinite loop
int in RISC-V is always 32-bit wide, while long is 32-bit wide in RV32 and 64-bit wide in RV64
in RV64, int is sign-extended
structs: the first 8 pointer-words of arguments go in registers
a value twice the size of a pointer word uses an even-odd register pair; anything larger is passed by reference
since published, x86 adds 3 instructions per month
.section .text
.global sum_to
the matching C declaration lives in \<defs.h\>
focus: switch the layout window in gdb
tmux:
C-b c: new window
C-b p & C-b n: switch between terminals
C-b %: vertical split
C-b ": horizontal split
C-b o: jump between split panes
apropos will display gdb built-in manual
caller: not preserved, callee: preserved
s0-s11 is for? maybe to give compiler flexibility
stack frame generated by calls
sp->bottom of the stack
fp->top of the current frame
i frame: show the message of stack frame
backtrace
focusing on user->kernel transition
NEVER execute user code in supervisor mode
C on RISC-V puts function arguments in a0, a1, a2, &c
check pagetables: C-a c, info mem
trampoline: the start of kernel's trap handling code
at the top of the address space: avoids punching a hole in the user address range
observe PC is an indirect way to check whether we're in supervisor mode
ecall:
change mode, save PC in SEPC, jump to STVEC, disable further interrupts
ecall does as little as possible
even supervisor mode is constrained to use pgtbl
save the user registers:
using trapframe or the sscratch register
why not a crash? we just switched page tables while executing
the trampoline is mapped at the same addr in pgtbls
back to usertrap(), return $ and space
usertrapret()
prepare for next transition
use sret instruction
complex from isolation
what if design for faster?
ecall & exception will cause a trap
stvec: trap handler
sepc: PC saver
scause: the cause of the trap
sscratch: scratch register, avoids overwriting a user register
usertrap: determine the reason of trap
usertrapret -> userret -> switch satp to user space -> load trapframe addr to a0 -> executes sret
argint, argaddr, argfd retrieve the n'th syscall argument as an integer, address, or fd
kernelvec: save regs, jump to kerneltrap, yield
don't let user code interfere with the user/kernel transition
supervisor: R/W control regs, satp, stvec, sepc, sscratch, use PTE/PTE_U
write() -> ecall -(user->kernel)-> uservec -> usertrap() -> syscall() <-> sys_write() -> usertrapret() -> userret()
the user code could only access memory with PTE_U flag
(checking by C-a c, info mem)
csrrw: swaps a0 with the sscratch register
no crash, plus the value of pc, shows we are in supervisor mode
the OS cares a lot about performance, so ecall itself never forces any stack operations
struct trapframe stays in proc.h, using 32 slots to save registers
when running shell, sscratch holds the pointer to trapframe
save? beware overwrite
why do the next instructions still execute? we're still in the trampoline, where user VA = kernel VA
what if switch to another process?
we could only switch pgtbl on trampoline
sret: re-enables interrupts, sets the PC from sepc, switches to user mode
what happens in UIE?
VM-related projects: COW, mappings, performance
ideas: isolation, indirection
panic? no: update the page table when the page fault happens
VA causes page fault: stval reg
violation causes page fault: scause reg
instruction/mode can also cause page fault
if the user program asks for more memory, sbrk() could be expensive
zero-filled page? large part of memory filled with zero
COW: copy-on-write (copy only when a write happens)
don't let fork() eagerly copy every page from the parent
but share addr space between parent and child, use RSW in PTEs
page fault: make copy, map, read/write(hard in real life!)
load the pages for the file on demand
keep meta information in VMA
memory-mapping: use load & store to access files
page-in & page-out follow an LRU strategy, especially when the file is larger than physical memory
KPTI is used to handle meltdown
Modern OSes allocate memory lazily, mapping a page only when the user program touches it
COW: make all pages read-only, On page fault, copy page and mark R/W, use extra PTE bits
xv6's response:
kill the faulting process; kernel panic
naively let child and parent share the same PM? one's writes clobber the other!!!
load, store, instruction page faults
COW fork is faster, especially when fork is immediately followed by exec
lazy allocation
demand paging; page to disk
<a>https://lwn.net/Articles/849638/</a>
Project Zero issue; CVE-2020-29374
for performance, Android doesn't immediately follow fork() with exec()
Xu: breaking COW via GUP; Torvalds: do_wp_pages()
the new approach breaks RDMA self-tests
a userfaultfd() was reported
something about the new rules
the way of using RDMA for soft-dirty or userfaultfd() is broken now
page faults turn the page map from a static mapping into a dynamic one
information needed: the faulting VA page, found in the stval register,
and the type of page fault
when page fault happens:
allocate 1 page, set the page all 0, map the page, restart instruction
what if the process uses up the PM? just kill the process
lazy allocation: the memory not allocated will not be mapped
freeing memory that was never actually allocated would fault; sbrk() only raised p->sz
negative number? shrinking addr space, but be careful!!!
zero-fill on demand
kalloc() a zero page, change the mapping
copy, update, restart
the cost of copying the whole address space in fork() is huge!!!
COW fork: use share instead of copy
set all shared pages read-only; on a page fault, copy the page, map the copy writable, restart the instruction via userret()
except for the trampoline (shared by everyone, never freed), a physical page may now belong to two or more processes
but when to free the page?
demand paging
exec(): loads the text and data segments, eagerly allocating pages
read, map, restart
If out of memory: evict, use the just-free-page, restart
evict the clean pages not dirty ones!!!
mmap(va, len, prot, flags, fd, off)
munmap(va, len): write back dirty blocks
pgtbl is hard to debug, why?
how to speed up syscall? Linux has vDSO
vDSO enables virtual syscalls
511: trampoline
510: trapframe
509: USYSCALL
protection bits? URV, WRV, XRV
scan all pages could be expensive!!
which syscall could be sped up?
no side effects; returns a constant, or a value that only changes inside the kernel
options: getpid() & uptime()
provide a bitmask for pages accessed
detect page accesses without access bits? use page fault!
TOCTOU attack: argument is modified after the kernel reads it
chatgpt: reduce the time gap between check and use
move such syscalls from kernel space into the process's address space, avoiding the cost of a context switch
vDSO? a library, or a memory region
vDSO is a fully formed ELF image
file-backed pages
change of kernel-tracking infrastructure
CPU & devices: complicated & parallel
most code in modern OSes are device drivers
UART, RS232 port
UART & FIFO: not using a busy loop, but using interrupts
UART interrupts if: rx FIFO goes from empty to not-empty, or tx FIFO goes from full to not-full
device interrupts: device -> PLIC -> trap -> usertrap()/kerneltrap() -> devintr()
interrupt just means the state of device is changed
the bottom-half of interrupt doesn't run in the context of top-half
registers: sie(supervisor interrupt enabled), PLIC claim: get next IRQ, sstatus
kernelvec is like trampoline for kernel
executing syscall in kernel, save in some proc's stack
multiple devices? deliver to different CPU or stay pending
disable interrupts? clear SIE, using intr_off(), remember the pending interrupts, deliver when kernel re-enables interrupts
many places have parallelism; operations may not be atomic!
using producer-consumer buffer
trampoline cannot tolerate a second interrupt to trampoline!
polling strategy: Top-half loops until device says it is ready
use DMA is more efficient
dispatch happens in devintr()
top-half: runs in a kernel thread; bottom-half: executes at interrupt time
UART appears as a series of memory-mapped registers
use consoleinit() to init UART hardware
transmit complete
init() will let the shell read the console
read() -> consoleread() -> sleep()
devintr() -(UART)-> uartintr() -> consoleintr()
each time the UART finishes sending a byte, it generates an interrupt.
in write(), uartintr() -> uartstart()
allowing processes to execute concurrently with device I/O
in RISC-V, timer interrupt happens in M mode
start.c executes before main to program the CLINT hardware, set up a scratch area, and set mtvec to timervec
The kernel could be made somewhat simpler if device and timer interrupts only
occurred while executing user code.
save, process, resume
asynchronous, concurrency, program devices
How the '$' appears, and what happens after ls ?
PLIC: routes 53 interrupt lines from devices
a core claims and holds an interrupt
Driver manages device
interrupt handler: top, bottom, a queue in it with read/write interface
the queue is used to decouple top & bottom
programming device: memory-mapped I/O; ld/st
the kernel and the device should make a protocol
'$': the driver puts the character into the UART; the UART generates an interrupt when the char has been sent
'ls': the keyboard is connected to the receive line; each character generates an interrupt
RISC-V support:
SIE: 1 bit for external interrupt
SSTATUS: bit enable/disable
SIP: interrupt pending
SCAUSE: the cause of interrupt
STVEC: holds the address the CPU jumps to on a trap
plicinit(): take interrupt from the devices
more initialization in main.c, main() finally calls scheduler()
init.c: creates the console device and opens it
the shell writes its prompt to fd 2
a pointer to producer and a pointer to consumer
Interrupt:
if the SIE bit is set: hardware clears SIE (blocking further interrupts)
sepc<-pc, save the current mode, enter supervisor mode, pc<-stvec
interrupts & concurrency:
devices & CPU run in parallel, interrupt stops current program
top & bottom deliver may run in parallel
producer & consumer: read after write, using a queue
once, interrupts were fast; now they are slow (relative to the CPU)
solution: polling
dynamic switch
kernel must deal with parallel syscalls
a race between two cores calling kfree() can lose a page
if multiple cores call acquire() on the same lock, only one returns; the others wait until the lock is released
automatic locking? no: we need explicit control over which regions of code are atomic
deadlocking problem
locks are often not private business of modules
lock & parallelism may require a whole re-write for the project!
use a big lock first; a big lock is always correct (if not fast)
test and re-set the lock atomically, pushing the problem down into hardware
spinlock: if locked==1, loop again; if locked==0, set it to 1 and return
if you use locks, you don't need to understand the memory ordering rules, you need them if you want to write exotic "lock-free" code
in RISC-V:
a5 = 1
s1 = &lk->locked
amoswap.w.aq a5, a5, (s1)
concurrency: multiple instruction streams are interleaved
wait() frees child's memory
race: the location is accessed concurrently
critical section: code covered in acquire() and release()
invariants: properties of data structures maintained across operations
traditional approach:
    for(;;) {
      if(lock == 0) {
        lock = 1;       // BROKEN: the test and the set are not atomic
        break;
      }
    }
two cores may both see lock == 0 at the same time and both take the lock!
amoswap r, a
swaps the value in address a and register r
kalloc: a single free list protected by single big lock
deadlock: one program acquires A then B; another acquires B then A
avoiding deadlock requires a global lock-acquisition order
file system contains a huge lock-chain
re-entrant (recursive) lock: allows the holder to acquire the same lock again
if a spinlock is used by an interrupt handler, a CPU must never hold that lock with interrupts enabled
nesting level of locks: acquire() calls push_off(); release() calls pop_off();
__sync_synchronize() tells the compiler not to re-order the code, in fact, it's a memory barrier
sleep-locks: don't spin on a core when the lock may be held a long time
locks are expensive when many CPUs contend for the same lock at the same time
the p->lock acquired in yield() is released over in the scheduler thread
the inode's lock acts as a shared lock
the spinlock implementation itself cannot rely on a lock
scheduler, different process call fork() at the same time
apps want to use multiple cores, so kernel must handle multiple syscalls
the performance of single core has reached a limit
why locks? avoid race conditions
many locks -> more parallelism
access a shared data, one is write -> using lock
lock-free programming
could locking be automatic? it can give wrong results, eg: rename("d1/x", "d2/y");
lock avoids wrong updates, make operations atomic
lock helps maintain an invariant
deadlock:
acquire(lock);
...
acquire(lock);
order the locks globally
internals of m2 must be seen by m1
rebuilding code with locks: start with coarse-grained locks, measure which locks are contended
lock isolation, redesign
UART in printf: a buffer with a read pointer and a write pointer
implementation of amoswap depends on memory layout
same CPU: let operation atomic
memory fence: no load/store instruction may be reordered across it
use a race detector
is fence unnecessary as there are atomic memory instructions?
the compiler knows what couldn't be moved
old OS kernels didn't acquire locks, assuming they ran on a single core
concurrency: threads inside the kernel, processes
each thread, taken alone, executes in an ordinary way
need locks when interact
in xv6, only one user-level thread
event-driven programming with state machines is another way to multitask
executing: using resources; not executing: save and release
each process has its own kernel thread
p->state: running, runnable, sleeping
user->kernel->scheduler->kernel->user
scheduler thread: one per CPU
idle scheduler: no running thread
swtch() returns to scheduler()
swtch(): saves current registers in xx(a0), restores registers in xx(a1)
yield() acquires lock, scheduler() releases it
the lock is released by a different thread
p->lock makes steps atomic
scheduling policy?
yield() can be called by kerneltrap()
force switching: sleep & wakeup; long period without sleeping
context: struct context -> struct proc
swtch: save callee registers, return to instruction pointed by ra register
continue for loop, find a thread, run it
hold the lock, or different CPU may run same thread
procedures that intentionally transfer control to each other via thread switch are sometimes referred to as coroutines
scheduler: find a process, run, until it is terminated
maintain invariants: p->lock acquires in a thread and releases in another thread
in multi-core, we can't hold global variables for process, create struct cpu for each core
tp: stores the core's hartid
xv6 disables interrupts while the current struct cpu is in use
so the return value of myproc() is safe
conceal actions from one thread to another: sleep & wakeup
sequence coordination mechanism
busy-wait on a spinlock? most of the time it gives the right result, but it's expensive
a naive sleep and wakeup causes the lost-wakeup problem
solution: the caller must pass the condition lock to sleep()
sleep: mark the current process SLEEPING and give up the core
same channel: see spurious wakeups
pipewrite & piperead are complex applications of sleep & wakeup
nwrite == nread + PIPESIZE
wait: wait_lock; exit: reparent
kill: set p->killed, wakeup
p->parent is protected by wait_lock instead of p->lock
round robin: run each process in turn
linux's sleep adds a wait queue
full use of multi-core
thread: one single exec
care the PC
interleaving of multiple threads
interleave threads
scheduling: suspend one thread, execute another thread
compute-bound?
pre-emptive scheduling: even if the code never gives up the CPU, the kernel interrupts it with the timer
by contrast, voluntary scheduling
p1 -> tf1 -> kernel stack-> swtch()-> ctx1 -> ctx0 -> swtch() -> scheduler() -> tf2 -> p2
where are contexts stored? in the process structure, p->context
process vs threads: a process's thread may be running in user space, in the kernel, or suspended with its state in the trapframe and context
yield(): change the thread into runnable
no point to save PC as we know we're in swtch
ra will be the point to return
why does swtch() save only half the registers? it is called as a C function, so the caller already saved the caller-saved registers; only the callee-saved ones remain
we want the start process also atomic
what we really care is ra
xv6 kernel code uses only integer registers, not floating-point registers, so swtch() skips them
we trust kernel code not to spin forever with interrupts off
why must the trampoline and swtch be written in assembly? C cannot directly manipulate ra and sp
almost each thread in Linux is a complete process
allocproc() sets the new context for the process
the first call in forkret is about initializing file system and crash recovery
why hold p->lock across threads?
holding other locks while yielding the CPU is forbidden!
threads often wait for conditions
use coordination primitives
uartwrite(): writing thread should give up CPU
sleep() cannot simply wait for the event, or we get the lost-wakeup problem
what's wrong if uartwrite() releases the lock before calling broken_sleep()?
uartwrite() could go to sleep EVEN THOUGH UART TX WAS DONE, and nothing would ever wake it
use lock to prevent wakeup() from running during the entire window
sleep() will require a lock to protect its condition, both sleep() and wakeup() will hold the condition lock
caller acquired condition lock before calling wakeup()
acquire p->lock before releasing condition lock
wakeup() holds both locks, sleep() just need to hold either
wakeup() can't proceed until after swtch() completes so wakeup() is guaranteed to see p->state==SLEEPING and p->chan==chan
all uses of sleep are wrapped in a loop, so they re-check
sleep() doesn't need to know the condition
a thread cannot free all of its own resources
voluntarily quit with exit()
kill() just sets p->killed flag
kill() will change SLEEPING into RUNNABLE
user space: will die next time it makes a system call or takes a timer interrupt
kernel space: target will never execute another user instruction but may spend a while yet in the kernel
(DONE)
coordination
acquire(p->lock);
p->state = RUNNABLE;
swtch();
release(p->lock);
check whether the thread is runnable
calling swtch() requires p->lock but forbid any other locks
two threads acquire same locks
freeze as the process2 will spin forever
coordination: sleep() & wakeup()
busywait loop
sleep channel: lets wakeup() find which sleeping threads to wake
why sleep needs the second argument?
lost-wakeup problem
broken_sleep: simply marks the process SLEEPING and switches away
wakeup: for each p in procs: setback the process to runnable
    int done;

    uartwrite(buf):
      for each c in buf:
        while not done:
          sleep(&tx_chan);   // BROKEN: gap between the check and the sleep
        send c;
        done = 0;

    intr():
      done = 1;
      wakeup(&tx_chan);
add lock & unlock; why not work?
on some other core, UART interrupt is working
nothing will wake up broken_sleep(), because wakeup() has already happened
tx_done is how the interrupt routine communicates with the writer in uartwrite()
each keystroke causes an interrupt that briefly wakes the sleeping process, which then goes back to sleep
wakeup(): run through process table, lock every process, if SLEEPING & same channel, wakeup and release lock
after we release the condition lock, wakeup() cannot examine this process until it acquires the process lock
reader acquire the lock to prevent writer from sleeping
the last thing sleep() does before returning is re-acquire the condition lock
exit() will close files if necessary
wait() scans for a process whose parent is the current process and whose state is ZOMBIE
parent will call freeproc() to release resources
concurrent exit of parent and child?
why doesn't kill() immediately destroy the target process wherever it is?
(in the real world that's not allowed, but xv6 has no permissions, so you can kill anything)
in Linux, the target process must belong to your user id, or the kill is not allowed
we must never expose broken invariants of the file system
files, bit arrays, human-readable names
fs APIs
fd: stays linked to the file even if the file changes
a file can have multiple links
a file's metadata must be stored somewhere other than the directory entry
inode: i-number (identifies the file), link count, count of open FDs
inode deallocation deferred until last link and FD are gone
data stays on disk without power
historically, disks were read/write usually in 512-byte units, called sectors
in HDDs, sub-sector operations are expensive (read-modify-write)
flash must be erased before written
xv6 uses 2-sector blocks
xv6 treats disk as an array of sectors
metadata: everything on disk other than file content
on-disk inode: type, nlink, size, addrs[12+1]
content is array of dirents
dirent: inum, 14-byte filename
view file system as an on-disk data structure
facing concurrency challenges
look at block cache
bread->bget, watch the bcache
bcache.lock: description of what's in the cache
b->lock: protects just the one buffer
bcache replacement: bget, brelse
namex() ilock()
getting a reference separately from locking
0: syscalls
1: names+fd
2: inodes
3: inode cache & buffer cache
4: Log
5: virtio & disk driver
FD: stdin(0), stdout(1), stderr(2)
mkfs: generate layout for new FS, the layout will stay static for the lifetime of fs
indirect block: inode is fixed size, but file could be large
tree-structure: dir tree layer; inode layer; block layer
file system must have crash recovery
block 0: boot sector
block 1: superblock, holds metadata
blocks starting at 2 hold the log
inode, bitmap, data
bcache: ensures only one kernel thread at a time uses a block's copy; caches popular blocks
bcache is a doubly linked-list, main->binit
bread->bget->sleeplock, scan the bcache
at most one cached buffer per disk sector
brelse to release the buffer
block allocator: maintains a free bitmap
balloc: find a free block, update the bitmap, return the block
bfree: clear the block
on-disk inode: struct dinode
itable, struct inode to copy dinode into memory
iget, iput
many in-memory pointers may refer to the same inode (iget() returns shared references)
hold ilock and release by iunlock
Code that modifies an in-memory inode writes it to disk with iupdate
iput() doesn’t truncate a file immediately when the link count for the file drops to zero
direct/indirect block
bmap: returns the nth data block of inode ip, allocating one if it doesn't exist
itrunc: frees the blocks and sets the inode's size to 0
directory: has type T_DIR, dirlookup
namex -> skipelem
namex locks each directory in the path separately
avoid racing by using sleeplocks
an open file is represented by a struct file
kept in the ftable; filealloc, filedup, fileclose, fileread/filewrite
the inode has no concept of offset; the offset is kept in the struct file
sys_link, sys_unlink
sys_link creates a new name for the inode
using create, it is easy to implement sys_open, sys_mkdir, and sys_mknod
user-friendly, concurrency, persistence
the abstraction itself is useful
crash safety
disk layout
performance problems, as disk are always slow
pathname is human-readable, no explicit offset in write() syscall
file system structure: inode
when both the link count and fd count reach 0 the file can be freed; each open file maintains an offset
sync/async read/write option? yes, same way as console driver
max file size in xv6: (256+12)*1024
make files bigger: add indirect portions
why 256 addresses in the indirect block? a block is 1024 bytes and each block address is 4 bytes
how to compute the block index? divide the offset by 1024
to read byte 8000: get its block number from the inode, then read that block
how to find a file's inode from the root? scan the root directory's blocks, then each subdirectory's blocks
create a file:
33, write into the free inode
33, fill in the inode
46, write the first block
32, size changed
33, return back the inode
write "hi" into x:
45, bitmap refresh
595: the block allocated for x
33: updating the size of inode again
bget: get the bcache lock, scan the buffer
xv6 follows a rule: any modification of the bcache needs the bcache lock; any modification of block 33 needs its sleeplock
drawback of spinlocks: interrupts must stay off while one is held
a sleeplock yields the CPU while waiting
bcache: a copy of block in mem, sleeplock, LRU, 2-level-blocking
what if fs wrote 46 first, then 32 and 33, and crashed in between? the dirent would point to a still-free inode
correctness and performance often conflict
crash recovery is a recurring problem
logging solution: goal: fast recovery, atomic syscalls
log, commit, install
write-ahead log: if "done" in log, replay all writes in log, if no "done", ignore log
multiple data structures, lay on existing storage system, high performance
buffer cache, in-memory log block # array, FS tree on disk, log header and blocks on disk
write buffers, wait for disk to complete write, keep data in buffer cache
begin_op()->bmap()->bread()->log_write()->iupdate()->end_op()
when crash happens, kernel calls recover_from_log(), it's OK to replay the log more than once
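The recovery rule above can be modeled in a few lines, with dicts standing in for disk and log blocks (illustrative only, not xv6's code); note that replaying twice gives the same result:

```python
def recover_from_log(disk, log):
    """Replay logged writes only if the header says the transaction committed."""
    if not log.get("done"):
        return  # no commit record: ignore the partial log
    for blockno, data in log["writes"]:
        disk[blockno] = data  # install each logged write at its home location

# block 33 = inode block, 45 = bitmap block, as in the trace above
disk = {33: "old-inode", 45: "old-bitmap"}
log = {"done": True, "writes": [(33, "new-inode"), (45, "new-bitmap")]}
recover_from_log(disk, log)
recover_from_log(disk, log)  # idempotent: replaying again changes nothing
```

Idempotence is what makes it safe to crash during recovery itself.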
challenge: prevent write-back from cache
a syscall's writes must fit in the log
allowing concurrent syscalls
xv6's solution: wait
avoid writing the same block multiple times: write absorption
order 33, 33, 46, 32 with a crash: an inode is left allocated but unreferenced (wasted)
order 46, 32, 33, 33 with a crash: a dirent points to an inode that may be reallocated, causing a disaster
we hope for: internal fs invariants hold, all but the last few operations are durable, no reordering of data writes
xv6 assumes disk is fail-stop
pin dirty blocks in the buffer cache until installed, then unpin them
Logging makes file system transactions atomic
write-ahead logging is key
log writes, write a special commit, copy the writes, erase the log
design: logged blocks
write the header block with a transaction commit
sequence of write must be atomic
group commit: commit several transactions together
absorption: achieve better performance
end_op()->commit()->write_log()->write_head()->install_trans()
a crash can leave the on-disk state incorrect
fs operations are all multi-step
write 33, crash before 45: the inode references a block still marked free in the bitmap
if block is shared between files, there will be a security problem
solution: logging, atomic fs calls, fast recovery
crash between log write? see commit records
every fs-related syscall: begin_op() -> operations -> end_op()
ialloc() -> bwrite() -> log_write()
bwrite() is never used on its own; it is always paired with log_write()
we don't expect a process to survive a crash
log, create, install
ext3: addresses the 2x write cost of logging
eviction would break atomicity and violate the write-ahead rule
solution: don't evict blocks in the log
fs operation must fit in log
the log holds at most 30 blocks, so one operation may write at most 30 blocks
write fs->many transactions
concurrent fs calls
all concurrent ops must be fit, limit concurrent calls
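One way to picture that limit: reserve a worst-case budget per operation and make callers wait when the log can't fit them. A sketch assuming xv6-like constants (`LOGSIZE`, a per-op `MAXOPBLOCKS` budget, both invented here); real code sleeps instead of returning False:

```python
LOGSIZE = 30
MAXOPBLOCKS = 10  # assumed worst-case blocks one operation may write

class Log:
    def __init__(self):
        self.reserved = 0     # blocks reserved by in-flight operations
        self.outstanding = 0  # number of in-flight operations

    def begin_op(self):
        # Admit the op only if its worst-case writes still fit in the log.
        if self.reserved + MAXOPBLOCKS > LOGSIZE:
            return False  # real code would sleep until space frees up
        self.reserved += MAXOPBLOCKS
        self.outstanding += 1
        return True

    def end_op(self):
        self.outstanding -= 1
        self.reserved -= MAXOPBLOCKS
```

With these numbers at most three operations run concurrently; a fourth must wait.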
crashes may lead to problem that won't show immediately
make syscalls atomic(all-or-nothing)
ext3: less waiting, reduced writes, ordered data mode
batch update blocks, commit the current open transaction
performance: multiple syscalls, multiple transactions
the crash may forget last few ops
log is circular: SB T4 T5 T2 T3
before re-use, must write log superblock with offset/seq of resulting now-oldest transaction in log
logging could help with "fail-stop"
reboot, look in the log, find the end of the log, replay all logged writes
another crash? just repeat the logging steps
commit must write the on-disk log from a stable snapshot of the cached blocks; otherwise a crash could replay some but not all of a transaction's writes
efficient using COW fork
in ext3, metadata and blocks are both written in log
write content after metadata? no, crash may leave inode pointing to blocks from other file!
correctness challenges
not enough free log space? free the oldest transactions first
use checksums
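A hypothetical torn-commit detector along those lines: the commit record carries a checksum over the logged blocks, so recovery can reject a half-written transaction (a sketch, not any real journal's on-disk format):

```python
import zlib

def make_commit(blocks):
    """Build a commit record whose checksum covers all logged blocks."""
    payload = b"".join(blocks)
    return {"blocks": blocks, "checksum": zlib.crc32(payload)}

def valid_commit(txn):
    # Recovery recomputes the checksum; a torn write won't match.
    return zlib.crc32(b"".join(txn["blocks"])) == txn["checksum"]
```

With a checksum the commit record can even be written in the same pass as the data, since a partial transaction is detectable after the fact.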
preservation, predictability, atomicity
ext2fs, however, is not atomic
synchronous metadata update has performance problems
solution: deferred, ordered writes, but this may lead to cyclic dependencies
soft updates: share the same problem, as recovery still needs to scan the whole disk
commit: changes take no effect until the commit is final
journaling file system: the old and new versions of incomplete updates are both preserved, by writing the new versions to a separate location on disk until the update has been committed
transactions are atomic: either undo or redo
contains all changed metadata
reading the existing file system
before committing, make sure all logged blocks are written to the disk
fs has no transaction abort, transactions are short-lived
create a new transaction so often
fewer I/O operations are needed in the long run, but fs timing becomes less predictable
inodes, bitmaps: record new contents and maintain atomicity
journal metadata block: contains the entire metadata block, not just the changed bytes
keep track of metadata buffers
keep old blocks unsynced until committing, to survive a crash
complete commit: close the transaction, flush transactions to disk, wait for outstanding operations, wait for outstanding transactions, update journal handle block, mark as pinned
without syncing, data may be lost if a crash happens while completing a commit
ext3 = ext2 + journal
let operations in begin_op() and end_op() atomic
write-ahead rule: every modification must be written to the log before it is installed in place
free rule: log space is freed only once the logged blocks have been installed (synced) at their home locations
ext3 didn't modify the bottom layer of ext2, but just added journalling system in it
cache, action info layer(sequence numbers, block numbers; handles), fs tree
ext3 log format:
log super block: off, seq, desc, commit
there's a bunch of closed older transactions
async, batching, concurrency
fsync(fd): force all buffered writes for the file descriptor to disk
batching: one open transaction
write absorption, disk scheduling
concurrency: syscalls in parallel, older transactions
transaction states: one open, then committing, writing, freed
a new transaction cannot open until the previous one has closed
sys_unlink()
h = start()
get(h, block#)
stop()
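A toy version of that handle API (the `start`/`get`/`stop` names follow the note; the bookkeeping is invented for illustration):

```python
class Journal:
    def __init__(self):
        self.open_txn = set()  # block numbers dirtied by the open transaction
        self.handles = 0       # syscalls currently inside the transaction

    def start(self):
        self.handles += 1
        return self  # the handle ties this syscall's writes to the open txn

    def get(self, blockno):
        # Declare intent to modify blockno; it joins the open transaction.
        self.open_txn.add(blockno)

    def stop(self):
        self.handles -= 1  # commit may proceed once handles drops to 0
```

Usage mirrors the three lines above: `h = j.start()`, then `h.get(block)`, then `h.stop()`.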
steps in commit:
1. block new syscalls
2. wait for outstanding syscalls to finish
3. open a new transaction
4. write a descriptor block with the block numbers
5. write the blocks to the log
6. wait for those writes to finish
7. write the commit record
8. wait for the commit point
9. write the transaction's blocks to their home locations
10. re-use the log
lots of records to show the transactions
if there's a crash?
assuming the disk survives, after a power cycle and reboot there is a bunch of old completed transactions in the log
every descriptor block starts with a magic number
after SB, start with a magic number, it must be a descriptor block
why not let T2 get syscall just after T1 finishes syscall?
create()->unlink(), crash?
the inode won't be deleted, y still exists in file system
no possibility to see the refresh in the future transaction
copy into cache, copy into disk
logging turns multi-step disk updates into atomic transactions
the correctness of logging depends on the write-ahead rule
why not put commit message right in the descriptor block?
no problem is eliminated
OS kernel use VM in creative ways, even benefit from VM
trap, prot, unprot, dirty, map
mmap(): map memory into user space
mprotect(): change the permission of mapping
munmap(), sigaction()
on a fault: the kernel asks the VM system what to do, generates a signal, runs the user handler, returns to the user program, and continues or restarts the faulting instruction
example: materialize a huge sqrt table lazily, one page at a time
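The sqrt-table example can be mimicked in user space: treat a missing page as a "fault" and fill just that page on demand. A pure simulation (the `PAGE` size and class are invented; the real trick uses mprotect and a fault handler):

```python
import math

PAGE = 512  # entries per simulated page; arbitrary

class LazySqrtTable:
    """Demand paging in miniature: a 'fault' materializes one page of values."""
    def __init__(self, n):
        self.n = n
        self.pages = {}   # page number -> list of sqrt values
        self.faults = 0   # how many pages we actually computed

    def __getitem__(self, i):
        p = i // PAGE
        if p not in self.pages:  # "page fault": fill only this page
            self.faults += 1
            base = p * PAGE
            self.pages[p] = [math.sqrt(x) for x in range(base, base + PAGE)]
        return self.pages[p][i % PAGE]

t = LazySqrtTable(1 << 20)  # a million-entry table, none of it computed yet
```

Only the pages actually touched ever get computed, which is the whole point of the trick.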
real-time GC Algorithm
real-time GC: allocation must stay cheap even with a large heap
cost: checks on every pointer fetch; difficult to run the collector and the program at the same time
adding instructions could solve user-level VM problems
add user-level handler to handle fault with more friendly messages
protN: don't decrease the accessibility of one page at a time; decrease a large batch at once, for a small cost per page
in the same pagetable to avoid expensive context switching
fetch and store are common in memory operations
real-time garbage-collecting algorithm
divides the memory into 2 spaces: from-space & to-space
trace out reachable objects
forwarding pointer: left in the from-space object, pointing at its to-space copy
the mutator sees only to-space pointers; the new and scanned areas contain only to-space pointers
synchronize the mutator and the collector
multiply-mapped pages: let the collector scan a page while the mutator can still access it
SVM: a large coherent VM accessed by multiple cores/machines
shared memory only exists virtually
page fault could be used to make checkpointing real-time
use access-protection page faults instead of saving all data to disk at once
protect dirtied-pages only
young records: may die sooner than old records and likely points to old records
VM could detect assignments to old records
persistent store: a dynamic allocation heap persists from one-invocation to the next
commit or abort
as fast as fetches and stores in main memory
modified: not altered until commit
compiled programs can perform fetches quickly
extended address
handle page fault, append another page, transform it into short pointer
page compression algorithm
detect stack overflow: mark the page above the stack invalid
heap overflow: can be detected by an explicit compare and conditional branch
at the end of the heap, cause a page fault, call the garbage collector
slow fault will cause problems
clear what is needed and what is not
TLB consistency, easily patch and fix TLB faults
batch TLB shootdowns to reduce interruptions across cores
optimal page size is driven by DRAM trends
algorithms must be asynchronous friendly
OS kernel uses VM in creative way
VM could have same mechanism as the kernel
what primitives of VM does user program need
trap
prot1-decrease accessibility(R+W->R->not at all)
unprot-increase accessibility
mprotect() still works on page accuracy not addr accuracy
AS: pagetable+VMA
contiguous range of memory, some permission, backed by same object
user-level traps
does this trap break isolation? the upcall runs the whole program in the same context and page table
challenge: the table may be big
use VM primitives: allocate a huge range, let page faults fill it in
GC: copying, forwarding, move objects from FROM space to TO space
Baker's algorithm: at the flip, just copy the roots
new: forward a few more objects
check whether the pointer being dereferenced points into from-space
fault handler, scan a page of objects, forward them
no pointer to check anymore
when to-space fills up, flip the spaces so it becomes the new from-space; if it fills up again, flip again
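The two-space copying idea with forwarding pointers can be sketched as follows (list indices stand in for addresses; this is a stop-the-world sketch, not Baker's incremental version):

```python
def copy_collect(from_space, roots):
    """Copy reachable objects from from_space into a fresh to_space."""
    to_space, forward = [], {}  # forward: from-index -> to-index

    def forward_obj(i):
        if i is None:
            return None
        if i not in forward:  # first visit: copy, record forwarding entry
            obj = from_space[i]
            to_space.append({"val": obj["val"], "fields": list(obj["fields"])})
            forward[i] = len(to_space) - 1
        return forward[i]

    new_roots = [forward_obj(r) for r in roots]
    scan = 0
    while scan < len(to_space):  # Cheney-style scan: forward copied fields
        obj = to_space[scan]
        obj["fields"] = [forward_obj(f) for f in obj["fields"]]
        scan += 1
    return to_space, new_roots
```

Unreachable objects are simply never copied; Baker's trick is to do the scan loop incrementally, driven by page faults.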
tricky use of concurrency
mappings translate the app's view of memory to the GC's view
VM is still a developing system
a VMA covers a contiguous range with no holes
What should a kernel do?
traditional: a "monolithic" kernel abstraction(UNIX, Linux, xv6)
abstraction leads to security
a monolithic kernel tends to be big, complex, and buggy, and designing it forces many intertwined decisions
lots of kernel code runs just to execute a pipe
microkernel: make the kernel as simple as possible
move most functionality out of the kernel, into user space
most bugs are in drivers; moving them to user space keeps the kernel smaller and easier to maintain
L4: less syscalls, less code
missing almost everything, but everything is in the user space
pager: mmap, cow, alloc
design to improve bad performance brought by microkernels
send() & recv(): async & buffered
how to build a microkernel? for embedded, easy; for a server, maybe run UNIX on top of the kernel is good
on it, Linux is just a program!
but Linux switches its threads through L4's threads
fork() -> turn into IPC -> alloc pages -> get page faults
L4 server acts as the Linux's pager
getpid() is turned into 2 syscalls in L4(call, sendrecv)
whole OS benchmark: AIM
microcontrollers are used to run specific software
researchers had largely stopped microkernel research because of its poor performance
too-low and too-high level ideas
L4 built-on threads & addr spaces
inter-process communication
grant, map, unmap
IO are treated as part of addr space
compilers and libraries hides difference of each architecture
Linux's processes and resources management system is architecture independent
device drivers are always hardware dependent
top halves & bottom halves of Linux interrupt handlers
the microkernel's tasks are used to implement Linux processes
problems surrounding libc
emulate a tagged TLB on the Pentium processor
the L4Linux uses physical copyin and copyout
inter-thread using the same addr space
support TLB for smaller spaces
creating one server address space per Linux process
keeping the server space and server threads synchronized led to mistakes
reconstruct the TLB, since Linux's minimum user space is too large
using benchmarks compare L4Linux to MkLinux
co-location decreases L4Linux's security; is it worth it?
measure the cost of getpid() syscall
indeed IPC is the main additional cost in L4Linux
use lmbench to measure the cost of sync RPC
use matrix multiplication to test VM performance
use a protected transfer to reduce the cost of enforcing security; however, performance still decreased
virtual machine: simulation of a computer
accuracy can approach 100%
software simulation: like qemu, but can be slow
run the instructions on the real CPU, but entirely in user mode
VM trap handler
trap-and-emulate approach
emulate privileged instructions against per-guest virtual state
VMM must ensure that the guest only accesses its own memory, and must remap guest-physical addresses
turn page faults into emulated operations
each CPU is either in non-root (guest) mode or root (host) mode
the extended page table (EPT) maps guest-physical to host-physical addresses
2 layers of address translation
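The two-level translation can be pictured with two lookup tables over page numbers (the mappings are made up; real hardware walks multi-level tables per 4 KB page):

```python
# guest-virtual page -> guest-physical page (guest's own page table)
guest_pt = {0: 7, 1: 3}
# guest-physical page -> host-physical page (VMM-controlled, EPT-like)
ept = {7: 42, 3: 9}

def translate(gva_page):
    """Walk both levels: guest page table first, then the EPT."""
    gpa = guest_pt[gva_page]  # level 1: under the guest's control
    hpa = ept[gpa]            # level 2: under the host's control
    return hpa
```

The guest can change `guest_pt` freely; the host stays safe because every access still funnels through `ept`.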
interrupts force the CPU to exit the guest; they are passed to the host
Dune: use VMX to run a Linux process
a loadable kernel module
additional functionality like r/w
bottom line: faster for some Linux syscalls
provide a process abstraction rather than a machine abstraction
moving the kernel into a lower layer may lead to higher security
poor interaction between VM and the host OS
Dune is a loadable kernel module
VTX: VMX root & VMX non-root mode
VMLAUNCH & VMRESUME to enter the guest
PCID permits a single user program to switch between multi-pagetables efficiently
include the libdune
translate through ioctl on /dev/dune
assume CPU is free of defects
hypercalls are common in VMM
use a narrower hardware interface to limit the differences
Dune could be considered a type2 hypervisor as it runs on top of an existing OS kernel
host-virtual to host-physical memory translation
query the kernel for mmap, manually update EPT
mirror the configuration set to the kernel
expose raw access to TSC
use VMCALL to apply syscalls
share the low-level VT-x operation code with KVM, but not the high-level code
VPID: enable a unique TLB tag for each Dune process
libDune is entirely untrusted by the kernel; it includes a pagetable manager, an ELF loader, a simple page allocator, and routines that help programs manage exceptions and syscalls
use dune_signals instead of builtin system signals
simpler implementations, higher performance and security
run sandbox in separate memory space
sandbox: null policy, prevent undesirable parties
wedge: a sthread abstraction for safe fork, fast sthread recycling
modified Boehm GC
build synthetic benchmarks to test performance
higher performance in ptrace and trap
leverage modern virtualization hardware
what if write an OS in HLL?
fewer faults, but lower performance and less direct access to the machine's low levels
trade-offs are tricky to measure; must compare against a production-quality C kernel
kernel threads -> goroutines
syscall: SYSENTER, SYSEXIT
solve the heap-exhaustion problem
Go is easy to analyze heap bounds
HLL truly simplified code
concurrency in Go is easy
compare apps through biscuit & Linux
Go's GC has few large pauses but many small delays
many programs could tolerate the pause rate of GC
Go is slower but competitive
Go: type-safe with GC
Singularity didn't fix the security issues in HLL
Rust kernels designed without using Rust as an implementation language
no consensus in GC
risk of deadlock when dealing with heap exhaustion
apply the too-small-to-fail rule: instead of exhaustion, the user application sees delay
certain kinds of bugs are less likely happen in HLL
biscuit is a monolithic UNIX kernel
shim layer: provide functions of underlying kernel
provide user with POSIX interfaces
use Go's channels for synchronization
in performance-critical code, using read-lock-free lookups
biscuit does not support scheduling priority
biscuit's approach to kernel exhaustion: purge caches and soft state -> wait until reserve -> kill by kernel-watcher thread
maximum amount of simultaneously live data
maxlive: a computed bound on the heap a single syscall can use
handle poll & write loops in deep reservations
apply to long-running kernel threads
killer thread is woken when syscall's reservation fails
Go's automatic GC neither reduces fragmentation nor increases performance
unsafe Go is used in memory-related code
count and explain the ways HLL benefits the kernel
code-execution bugs are mostly avoided in Biscuit thanks to HLL features
try to keep the C and Go code paths the same, for a fair comparison
use pingpong and page faults to measure Biscuit's performance
Linux uses RCU to delay such frees of safe deleting
tcpdump -t -xx -n
ethernet addr
today, ethernet LAN is a switch with cables to each host
use the dest addr to decide where to send
arp: translate IP addr to ethernet addr
nest high layer packet in low layer packet
packet mbuf allocator
queues of buffers: avoid forcing discard, keep output NIC busy, allow independent control flow
illustrate tradeoffs in kernel network structure
could a disk cause a livelock?
don't spend time on subsequent packets until done with this one
modern Linux: NAPI
livelock is made worse by doing even more processing before discard
timeout + resend
receive livelock: the system spends all its time dealing with interrupts
polling is expensive, especially when I/O is rare
an interrupt-driven design gives better latency under light load
network implementation must deal with high event rates
non-flow control protocols
receive livelock: the system is not deadlocked, but has no progress on current event
reasonable latency and jitter
scheduling system must fairly allocate CPU resources among packet reception
queues provide some insulation against packet losses due to transient overloads
batch interrupts
receive livelock: a state in which the system makes no useful progress, because the necessary resources are all consumed by processing receiver interrupts
latency is increased by the time to receive the entire burst
transmission starvation: packets may be awaiting transmission while the transmitting interface sits idle
a cycle that could repeat ad infinitum
use a round-robin schedule to provide fairness polling
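The round-robin fairness idea can be sketched over in-memory queues (the quantum and "processing" are stand-ins; a real driver polls NIC rings):

```python
from collections import deque

def poll_round_robin(queues, quantum=2):
    """Visit each input queue in turn, taking at most `quantum` packets per
    visit, so no single busy source can starve the others."""
    done = []
    while any(queues):
        for q in queues:
            for _ in range(min(quantum, len(q))):
                done.append(q.popleft())  # "process" one packet to completion
    return done

q1 = deque(["a1", "a2", "a3", "a4"])  # a busy source
q2 = deque(["b1"])                    # a quiet source
```

The quiet source's packet is served after at most one quantum of the busy source, instead of waiting behind its whole backlog.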
a purely interrupt-driven system can livelock; a purely polling system adds unnecessary latency
fast-path designs
use the router-under-test to check performance
interrupt handler does almost no work at all
unavoidable queue as no guarantee to let the output interface run as fast as input interface
the screend program is typically run as the only application on a system
simply measure the resources used by CPU for receiving packets
What if malicious user code execute in kernel?
the CPU may speculatively execute instructions
on a mispredict: flush the results, revert the state, restart execution
the CPU retires an instruction only when all previous instructions have retired and its result can no longer change
L1 miss: TLB look up, L2 look up with phy addr
in real life, microarchitecture is not entirely invisible
sense whether the instruction is cached: flush + reload
use clflush to evict a line from the cache, use rdtsc to read the cycle counter
deduce bits of kernel data from which cache lines become hot
the attack doesn't always succeed
fix the vulnerability: don't map the kernel in the user pagetable, requiring a pagetable switch on every kernel entry/exit
hardware fix: only return data from speculative loads that pass the permission check
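The flush+reload decision itself is just a timing threshold. A simulation of the probe step (the latencies and threshold are invented; a real attack measures reload times with rdtsc after clflush):

```python
THRESHOLD = 100  # cycles; assumed boundary between cache hit and miss

def probe(latencies):
    """Return the indices whose reload was fast, i.e. the lines the victim
    (or the transient instruction) touched and thereby cached."""
    return [i for i, t in enumerate(latencies) if t < THRESHOLD]

# 256 probe lines, one per possible byte value; pretend the transient
# load touched the line for byte 0x41, so only it reloads quickly
lat = [300] * 256
lat[0x41] = 60
```

Whichever index probes "fast" reveals the secret byte: the cache state is the covert channel.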
Meltdown: side effects of out-of-order processing
exploit side-channel information on modern processors
in out-of-order, CPU calculates for further information
influence the cache, affects the side-channel
kernel ASLR randomizes the offsets where driver is loaded on the boot
use flush + reload to detect whether the instruction is cached so that the microarchitecture is no more hidden
Spectre attacks: unlike Meltdown, they don't rely on privilege escalation
a transient instruction leaves a side effect
exception suppression
build a covert channel to receive information from microarchitecture and deduce this secret
covert channel is not limited to microarchitectures
steps: content of loadable mem location is loaded in reg, transient instruction, flush + reload
bias is often lost in out-of-order execution
KAISER: a strong isolation which won't map any kernel memory into user space
can we have a lock that allows parallel reads?
one writer, but no other reader or writer; lots of reader, but no writer
atomic CAS's execute one at a time -- not in parallel
even read_lock's writes to the lock word are expensive
readers cannot simply go without locking; they use schemes like RCU instead
in RCU: the writer doesn't modify the data in place, but prepares a new copy; a reader never sees a partially modified string, even without holding a lock
before: no change is visible; after: all changes are visible
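The publish-by-pointer-swap pattern can be sketched as follows (Python's reference assignment stands in for an atomic store with a barrier; this shows only the reader/writer protocol, not grace periods or reclamation):

```python
class Config:
    """An immutable snapshot; writers never modify one in place."""
    def __init__(self, data):
        self.data = data

current = Config("old value")  # the published pointer

def reader():
    snap = current   # one reference read: sees old or new, never a mix
    return snap.data

def writer(new_data):
    global current
    updated = Config(new_data)  # build the complete new copy off to the side
    current = updated           # publish; C code needs a barrier before this
```

Readers pay essentially nothing; the old `Config` must only be freed after all readers that might hold it are done, which is where grace periods come in.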
use memory barriers to enforce the ordering of updates
potential disaster: use after free
let writer wait for GC
implement the grace-period delay
RCU makes read-side overhead nearly zero
reference counting, redesign for no sharing
RCU's success: high performance in concurrent reading and updating
support for concurrent reading and updating, low computation and storage overhead, deterministic completion time
VFS needs concurrent reading and updating
RCU allows thread to wait for the completion of pre-existing RCU critical sections, but coordinate activities using other mechanisms
synchronize_rcu: waits until every CPU has performed a context switch
the writer just needs to wait for readers, and readers must somehow communicate that their critical sections are complete
RCU's simplest usage is to wait for a transaction to complete
using RCU in NMI system
execute in RCU, apply the reference counting
an RCU critical section works well if the object is relatively short-lived
RCU could use to remove need for type-safe memory
RCU has higher deadlock immunity than read/write lock
RCU is used increasingly in Linux kernel source code
what if dynamically update the kernel?
observability, performance, security
detect DDoS attack
eBPF uses bin-constraint approach
a C interpreter reads cBPF, emulates the little language, and runs in the same address space as the kernel
does this asm-like language break isolation?
for binaries: look through the whole program and disallow out-of-range offsets
speed: slow interpreter -> JIT
ways of tracking pointers
is eBPF even the right approach to extensibility?
original: designed for capture and filter packets
the classic BPF VM's design lags behind modern processors
mapping eBPF to native instructions
in 2014, eBPF is directly accessed by user space
attached to a designated code path
debugging the kernel
lock the kernel, verify the simulation, secure mode
define the type of eBPF programs
eBPF map: lets data pass within the kernel, between kernel and user, and within user space