File Systems, Kernels, and High-Performance Networking
Fast File System (FFS)
Writes
Writes: Data is written to blocks chosen within cylinder groups. FFS tries to place related blocks near each other and to rotationally optimize access. Metadata updates (inodes, directories) are also written in place within the same group.
Reads
Reads: Very fast because data is grouped by cylinder, minimizing seeks. Sequential reads benefit from rotational optimization and large block sizes (4–8 KB). The first read goes through the inode, then the inode's data block pointers.
Appends
Appends: Allocates the next block near the previous one. Uses rotational delay tables to place the next block just in time for the disk head. Large blocks plus fragments allow efficient small-file appends.
Crash Recovery
Crash Recovery: Uses fsck to walk the entire file system after a crash — slow. Must verify consistency of directories, inodes, free lists, and bitmaps. There is no logging, so the state must be reconstructed by scanning all structures.
Mechanisms
- Cylinder groups for locality (placement policy sketched after this list).
- Large blocks and small fragments for mixed workloads.
- Rotational optimization using hardware parameters.
- Bitmaps for free block tracking.
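A rough sketch of the cylinder-group placement policy, using hypothetical structures rather than actual BSD code: new directories go to a group with many free inodes, and a file's data blocks stay in the same group as its inode.

```c
/* Hypothetical sketch of FFS-style allocation policy (not real BSD code). */
struct cg { int free_inodes; int free_blocks; };

/* Place a new directory in a cylinder group with many free inodes. */
int pick_group_for_dir(struct cg *groups, int ngroups) {
    int best = 0;
    for (int g = 1; g < ngroups; g++)
        if (groups[g].free_inodes > groups[best].free_inodes)
            best = g;
    return best;
}

/* Place a file's data block in the same group as its inode when possible. */
int pick_group_for_block(struct cg *groups, int ngroups, int inode_group) {
    if (groups[inode_group].free_blocks > 0)
        return inode_group;              /* keep data near its inode */
    return (inode_group + 1) % ngroups;  /* spill into a neighboring group */
}
```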
Differences from Older UFS
How does it differ from the older UNIX file system? The old file system used essentially random block placement and small 512-byte blocks, which gave poor performance. FFS introduced structured placement, geometry-aware allocation, and much larger blocks, improving throughput by 2–10×.
Benefits
- Dramatically lower seek cost and huge performance improvement.
- Better use of disk bandwidth.
- Efficient small-file storage via fragments.
- Stable long-term performance (no random free lists).
- Larger maximum file size due to bigger blocks.
Challenges
- Requires accurate hardware parameters for rotational optimization.
- Locality policies make allocation decisions complex.
- Fragmentation management is tricky.
- fsck recovery time grows with disk size.
Drawbacks
Metadata can still be scattered; updates require seeks. The write path is not purely sequential and can still be seek-heavy. fsck can be slow on large disks. Hard to scale to write-heavy workloads with many small random writes.
Log-Structured File System (LFS)
Writes
Writes: All updated data blocks, inodes, and the inode map are appended sequentially to the log. No overwrites — each update creates a new version. Changes are batched in memory into segments for efficient large writes.
Reads
Reads: Must locate the latest inode via the inode map, then follow inode → logical block pointers → physical block in the log. The first read can be slower because of the lookup; cached reads are fast. Read locality is typically worse than FFS because data can be spread across the log.
Appends
Appends: Extremely fast: new data goes to the end of the log. Entire files effectively "move" as new versions are written. There is no need to find a free location; everything is appended.
Crash Recovery
Crash Recovery: Very fast: scan the tail of the log and use segment summary blocks to rebuild the inode map. No full-disk scan is required (unlike fsck), and the log itself encodes write order for natural consistency.
Mechanisms
- Sequential log for all writes.
- Inode map to track the newest inode versions.
- Segment summary blocks to identify contents.
- A cleaner to reclaim space from dead blocks.
- Hot/cold policies and cost-benefit ratios for cleaning decisions.
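The cost-benefit cleaning policy from the LFS paper can be written down directly; here is a minimal sketch, where u is the fraction of a segment that is still live and age is the age of its youngest live data:

```c
/* Cost-benefit cleaning score from the LFS paper:
 * benefit/cost = (free space generated * age of data) / cost
 *              = ((1 - u) * age) / (1 + u)
 * Reading the segment costs one segment of I/O and writing back the live
 * data costs u, hence the (1 + u) in the denominator. Segments with the
 * highest score are cleaned first. */
double clean_score(double u, double age) {
    return ((1.0 - u) * age) / (1.0 + u);
}
```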
Differences from FFS
How does it differ from FFS? FFS optimizes disk access by minimizing seeks; LFS avoids seeks on the write path by converting all updates into sequential appends. FFS produces mixed write patterns; LFS turns everything into large sequential writes. LFS introduces garbage collection (cleaning), which FFS does not need.
Benefits
- Write bandwidth near hardware limits — very high.
- Great for workloads with many small writes (databases, mail servers).
- Crash recovery is extremely fast.
- Batched updates improve metadata consistency.
- Sequential writes reduce disk wear.
Challenges
- Cleaning is expensive — must identify live vs. dead data and compact segments.
- Read performance depends on inode-map lookups and may be slower than FFS.
- Maintaining efficient cleaning policies (hot/cold separation) is non-trivial.
- Hard to tune thresholds for when to clean.
Drawbacks
Cleaning overhead can dominate performance in worst-case workloads. Spatial locality for reads can be poor, leading to scattered data. The inode map and metadata updates add complexity. For mostly-read workloads, FFS may outperform LFS.
Soft Updates
Ordering constraints: Never point to a structure before it has been initialized (no dangling pointers); never reuse a block until all old pointers to it have been removed; never reset the last pointer to a live resource until the new pointer has been written (first rule sketched at the end of this section).
Overall approach: Write changes to the file buffer cache and track dependencies between blocks; write blocks in a safe order.
Challenge: Circular dependencies between blocks. Solution: track dependencies at fine granularity (individual pointers and directory entries rather than whole blocks) across inodes and directory entries.
Undo/redo: Temporary undo operations are used when necessary.
Overhead: Memory to store dependency information and CPU for extra work on undo/redo.
Optimizations: Write many blocks at once and skip writes that are undone.
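As a concrete example of the first ordering rule, creating a file must make the new inode durable before the directory entry that points to it. A minimal pseudo-C sketch, with hypothetical helpers (write_block, init_inode, add_dir_entry), not real kernel code:

```c
/* Hypothetical sketch of the soft-updates ordering rule for file creation:
 * the new inode must reach disk before any directory entry points to it,
 * otherwise a crash could leave a dangling directory entry. */
struct block { char data[4096]; };

void write_block(struct block *b);                      /* persist one block (assumed) */
void init_inode(struct block *inode_block);             /* assumed helper */
void add_dir_entry(struct block *dir_block, int inum);  /* assumed helper */

void create_file_ordered(struct block *inode_block, struct block *dir_block, int inum) {
    init_inode(inode_block);         /* build the new inode in memory             */
    write_block(inode_block);        /* 1. persist the inode block first          */
    add_dir_entry(dir_block, inum);  /* then link it into the directory in memory */
    write_block(dir_block);          /* 2. persist the directory block second     */
}
```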
SplitFS and Persistent Memory
Persistent Memory: Attached to the memory bus. Properties: larger capacity than DRAM but smaller than disk, 2–4× slower than DRAM, expensive, non-volatile (persistent), limited write endurance, and byte-addressable.
Challenges: CPU caches are volatile, so ordinary cached stores may not have reached persistent memory at a crash; durability requires explicit cache-line write-backs/flushes, write-through or non-temporal stores, and fences to preserve persistence semantics.
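A minimal sketch of making a single store durable on x86 persistent memory, assuming the flush-plus-fence path (non-temporal or write-through stores are the alternative mentioned above); persist_store is a made-up helper name:

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch: make a 64-bit store durable on persistent memory (x86).
 * Ordinary cached stores may sit in the volatile CPU cache, so the cache
 * line must be written back explicitly and ordered with a fence. */
static void persist_store(uint64_t *pm_addr, uint64_t value) {
    *pm_addr = value;                 /* store lands in the (volatile) cache  */
    _mm_clflushopt((void *)pm_addr);  /* write the cache line back to PM      */
    _mm_sfence();                     /* order the flush before later stores  */
}
```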
SplitFS (hybrid): Handle data in user space and metadata in kernel space. The relink operation is a metadata-only atomic operation that moves persistent memory (PM) blocks from a staging file into the real file without copying data. This is useful for appends and atomic overwrites. It is faster and avoids extra writes that would wear out persistent memory.
Exokernel (Exo)
Goal: Provide tiny, secure kernel primitives that let applications or untrusted, user-level library OSes (libOSes) implement full OS abstractions themselves.
Why Exo? Expose hardware to user level, push OS functionality to libOS, and allow safe revocation of resources.
Does: Decide ownership, protect usage, revoke resources safely.
Doesn't: Implement VM and FS policy, scheduling, or TCP/IP — these belong to libOSes.
Key concepts
- Secure binding: Applications bind to hardware resources once (authorization) and use them subsequently with minimal overhead; this is central to exokernel performance.
- Visible revocation: When the kernel needs a resource back, it asks the libOS to give it up; the libOS selects what to relinquish. If the libOS does not cooperate, a protocol is used to abort or forcibly reclaim.
- Abort protocol: If the libOS refuses to return a resource, the kernel forcibly breaks the binding, marks it as repossessed, and sends an exception to the libOS so it can fix page tables and data structures (it does not kill the application unless recovery fails).
- User-level resource management: The untrusted libOS manages hardware resources and exposes OS functionality to applications.
Memory Management and IPC in Exo
When a libOS wants to map a physical page, Exo checks if the capability is valid. If valid, a TLB entry is installed so later access does not need kernel involvement. IPC and protection control transfers are fast: copying registers, switching TLB tags, and jumping can be 5–40× faster than traditional kernel-mediated transfers.
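A rough sketch of the secure-binding idea for page mapping, using made-up structures (the real Aegis interface differs): the authorization check happens once at bind time, and later accesses hit the installed TLB entry with no kernel involvement.

```c
/* Hypothetical sketch of an exokernel-style secure binding for a page mapping.
 * The expensive authorization check happens once; subsequent accesses use the
 * installed TLB entry without entering the kernel. */
struct capability { unsigned owner_id; unsigned phys_page; unsigned rights; };

int tlb_install(unsigned vaddr, unsigned phys_page, unsigned rights);  /* assumed */

int exo_map_page(unsigned libos_id, struct capability cap, unsigned vaddr) {
    if (cap.owner_id != libos_id)           /* does this libOS own the page?  */
        return -1;                          /* reject: invalid capability     */
    /* secure binding established: install the translation in hardware */
    return tlb_install(vaddr, cap.phys_page, cap.rights);
}
```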
L4 vs Exokernel Comparison
Philosophy: L4 minimizes the kernel but still provides a small set of abstractions; its kernel is responsible for threads, address spaces, IPC, and interrupts. Exo provides secure multiplexing without imposing abstractions; its kernel only protects resources, tracks ownership, and handles revocation.
Abstractions provided: L4 offers threads, IPC, and VM primitives; Exo instead exposes raw resources (page tables, disk blocks, packets) to libOSes.
Who implements services? In L4, user-level servers implement OS services and policies; in Exo, library OSes do. In both cases the code is untrusted and can be application-specific.
Level of hardware exposure: L4 provides moderate exposure (still abstracted); Exo provides full exposure (page tables, disk blocks, packets).
Memory management: L4 uses a pager model (user-level pager). Exo can allocate raw physical pages to libOSes.
Protection: L4 relies on kernel-enforced isolation and IPC capabilities; Exo relies on secure bindings and revocation with an abort protocol.
Network handling: L4 delivers packets to user-level servers via IPC (see the interrupt flow below); Exo can be very fast, but extra checks and domain crossings for packet handling can slow network traps.
Performance: L4: very fast IPC and optimized interrupt handling with low-overhead system calls. Exo: almost no kernel overhead; libOSes can build exactly what they need, offering extreme flexibility. Performance depends heavily on the quality of the libOS implementation, and Exo may be slower when secure-binding lookups, packet filters, or many domain crossings are required.
Typical interrupt flow:
- L4: interrupt → IPC → server.
- Exo: interrupt → check filters → verify binding → handoff → libOS → maybe fallback to kernel.
Remote Procedure Call (RPC)
Definition: Make a remote function look like a local function call, while aiming for identical semantics and high efficiency.
Goals: Identical semantics to local procedure calls, efficiency (minimize CPU time and number of messages), security, high performance for small requests, and generality.
Core idea and flow
- Client calls a function normally; a user stub catches the call.
- The stub packs arguments into a message and the RPC runtime sends the message over the network.
- On the server, the RPC runtime receives the message, the server stub unpacks arguments, calls the real function, and sends results back.
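A minimal sketch of what a generated client stub might look like for a remote add(a, b), with a made-up runtime call rpc_send_and_wait standing in for the RPC package:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical RPC runtime call: send a request and block for the reply. */
int rpc_send_and_wait(int server, const void *req, size_t req_len,
                      void *reply, size_t reply_len);

/* Client stub for a remote "int add(int a, int b)": it looks like a local
 * call, but packs the arguments into a message and sends it to the server. */
int32_t add(int server, int32_t a, int32_t b) {
    unsigned char msg[8];
    int32_t result = 0;
    memcpy(msg, &a, 4);                 /* marshal argument a                 */
    memcpy(msg + 4, &b, 4);             /* marshal argument b                 */
    rpc_send_and_wait(server, msg, sizeof msg, &result, sizeof result);
    return result;                      /* unmarshalled result, as if local   */
}
```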
Binding: How the client finds a server. Example steps: the server exports an interface; a directory service responds to the client's import with the server's network address; the client and server can then communicate. Type identifies the kind of service; instance identifies a particular server providing it. Unique identifiers help detect server restarts. Exporting scales well, and importing is cheap because it creates no state at the server.
Transport protocol: A custom transport is used because RPC exchanges are short and latency-sensitive, so heavy TCP-like connection setup and teardown would dominate the cost. RPC instead uses implicit acknowledgments (a result packet acknowledges the call), lightweight per-call connection state, and small tables for duplicate suppression.
Handling failures
If a call returns, the server executed the function exactly once. If a call fails, it may have run once or not at all — RPC cannot always tell. Mechanisms include retransmissions, timeouts, probe packets, and duplicate suppression.
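A sketch of duplicate suppression, assuming a simple per-caller table of the last call sequence number seen (the real design keeps similar small per-connection state): a retransmitted request carrying an old ID is not executed again.

```c
/* Hypothetical sketch of server-side duplicate suppression:
 * remember each caller's latest call ID; a retransmission carrying an old ID
 * is dropped (or answered from a cached result) instead of being re-executed. */
#define MAX_CALLERS 128

static unsigned last_call_id[MAX_CALLERS];   /* 0 = no call seen yet */

int should_execute(unsigned caller, unsigned call_id) {
    if (call_id <= last_call_id[caller])
        return 0;                 /* duplicate or stale: do not run again */
    last_call_id[caller] = call_id;
    return 1;                     /* new call: execute the procedure      */
}
```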
Runtime optimizations
- Process pools to avoid creating a process per call.
- Transport optimizations: piggyback ACKs and retransmit only when needed.
- Minimize server state and avoid teardown messages.
Receive Livelock and Packet Processing
Modern systems perform a lot of packet processing (network monitoring, routing, firewalls, VPNs, file servers, DNS). Extremely high request rates and unregulated incoming traffic can overwhelm the OS.
Traditional network stack: NIC → interrupt handler → network device code → IP/TCP processing (network thread) → socket buffers → application. NICs typically trigger an interrupt per packet; the interrupt handler runs at high priority and the network thread runs at lower priority. A queue absorbs bursts as long as load is manageable.
Receive livelock
Definition: When input load becomes too high, the system spends all its time handling interrupts and never finishes processing and delivering packets to applications; throughput falls toward zero even though the CPU is fully busy.
Why interrupts cause the problem: Interrupt-driven processing reacts to every packet arrival; at high rates the constant interrupts and context switches starve the lower-priority network thread and applications. Queues fill, so packets are dropped only after work has already been spent on them, and throughput collapses.
Polling vs Interrupts
Polling: Pros: control over when and how many packets to process, fairness, effective under high load. Cons: potentially high latency under low load, possible packet drops, and tricky parameter tuning (poll frequency, batch size).
Interrupts: Pros: immediate response to arrivals. Cons: can trigger livelock and high overhead from frequent handlers.
Solutions (hybrid approaches)
- Combine interrupts and polling: use interrupts to signal the start of work and switch to polling under load to prevent interrupt storms (see the sketch after this list).
- Reserve CPU time for applications to prevent the network stack from starving user programs.
- Drop packets early at the NIC or the earliest point when overwhelmed; dropping early is cheaper than wasting CPU on doomed packets.
- Remove extra queue layers to reduce copying and CPU waste.
- Use round-robin scheduling among network interfaces or tasks to prevent one device from hogging the CPU.
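A minimal sketch of the hybrid interrupt-plus-polling scheme from the first item above (NAPI-style; driver helpers such as nic_disable_irq are hypothetical): the interrupt only schedules polling, the poll loop processes a bounded batch, and interrupts are re-enabled only once the queue drains.

```c
/* Hypothetical sketch of interrupt-plus-polling (NAPI-style) receive handling.
 * The interrupt handler only disables further interrupts and schedules the
 * poll routine; the poll routine processes a bounded batch so applications
 * still get CPU time, and re-enables interrupts only when the queue is empty. */
#define BUDGET 64

void nic_disable_irq(void);     /* assumed driver helpers */
void nic_enable_irq(void);
int  nic_rx_pending(void);
void process_one_packet(void);
void schedule_poll(void);

void rx_interrupt(void) {
    nic_disable_irq();          /* stop the interrupt storm                  */
    schedule_poll();            /* defer work to the (lower-priority) poller */
}

void rx_poll(void) {
    int done = 0;
    while (done < BUDGET && nic_rx_pending()) {
        process_one_packet();
        done++;
    }
    if (!nic_rx_pending())
        nic_enable_irq();       /* queue drained: go back to interrupt mode  */
    else
        schedule_poll();        /* still busy: keep polling, stay throttled  */
}
```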
IX and Snap: Userspace Networking
IX dataplane: Handles sending and receiving packets, while the control plane handles management. Modern NIC hardware offers virtual queues and memory isolation, which makes this split practical.
Challenges: Protection/isolation between apps and network stacks, microsecond tail latency with large fanout, high packet rates with small packets, and resource efficiency for cores and memory.
IX approach: Provide a library OS (libIX) bound to each application and running in userspace. Applications call libIX instead of POSIX. Processing runs to completion with adaptive batching. Sync-free processing partitions resources across cores; RSS (receive-side scaling) ensures packets for a flow go to the same core to avoid shared state and locking, enabling very low latency.
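A rough sketch of a per-core run-to-completion loop with batching, using made-up helper names rather than the actual IX code:

```c
/* Hypothetical sketch of a per-core run-to-completion loop in an IX-like
 * dataplane: pull a batch of packets from this core's RX queue, run each one
 * through protocol processing and the application handler without handing it
 * off to another thread, then flush any generated responses as a batch. */
#define MAX_BATCH 64

struct pkt;                                        /* opaque packet descriptor   */
int  rx_queue_poll(struct pkt **batch, int max);   /* assumed per-core RX queue  */
void protocol_and_app_handler(struct pkt *p);      /* assumed, runs to completion */
void tx_queue_flush(void);                         /* assumed batched send        */

void dataplane_loop(void) {
    struct pkt *batch[MAX_BATCH];
    for (;;) {
        int n = rx_queue_poll(batch, MAX_BATCH);   /* batch size adapts to load   */
        for (int i = 0; i < n; i++)
            protocol_and_app_handler(batch[i]);    /* no cross-core handoff       */
        tx_queue_flush();                          /* amortize TX over the batch  */
    }
}
```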
Zero copy: NIC DMA writes into application-accessible memory to avoid multiple copies, lowering latency and increasing throughput.
Evaluation: Higher throughput than Linux, lower latency, and better tail latency under high fanout due to avoiding interrupts and extra copies.
Snap
Motivation: Datacenter networking needs changed rapidly; kernel stacks were slow to evolve. Snap moves networking out of the kernel into a userspace service that owns NIC resources and provides networking via shared-memory queues and RPC.
Architecture: Control plane via RPC services for setup, config, policy; dataplane engines process packets with fast, zero-copy I/O and NIC rings. Engines are single-threaded, stateful computation units.
CPU scheduling: Options include dedicating a core for lowest latency, spreading engines (per-engine threads woken by interrupts) for best tail latency, or compacting engines into fewer cores for best CPU efficiency.
Transparent upgrades: Start a new instance; the old instance writes state (brownout), the new one reads state and re-creates engine structures. Old instance pauses (blackout), state transfers, new instance attaches NIC filters and resumes; the old instance is killed. Applications generally do not lose connections; packet loss is treated as congestion.
Sprite OS (distributed UNIX research)
Goal: Make a network of many UNIX workstations feel like one coherent computer, providing transparent file sharing, load sharing, and process management while preserving UNIX semantics.
Context: Workstations were common and connected by LANs; memory sizes increased, making caching valuable; multiprocessors began appearing and administrators wanted simpler system management.
Problems with older systems: Inconsistent file sharing (weak cache consistency in NFS), no transparent process migration, static mount tables requiring manual setup, and lack of multiprocessor support such as fine-grained locking and shared memory.
Features
- Global namespace with prefix tables: A unified file system across machines. Each client maintains a dynamic prefix table mapping path prefixes to servers so files are located automatically without manual mounts (lookup sketched after this list).
- Consistent file caching: Sprite caches files on clients and servers using version numbers to ensure sequential consistency and disables caching during concurrent writes to avoid stale reads.
- Process migration: Processes can move to idle machines while keeping identity; system calls requiring home resources are forwarded transparently, enabling automatic load sharing.
- Shared fork and multiprocessor support: Parent and child can share large parts of the address space to help multiprocessor performance and support lightweight parallelism akin to threads.
- Integrated VM and file cache management: Sprite coordinates VM and file caching to avoid double caching, sharing the same physical memory pool for efficiency.
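A rough sketch of the prefix-table lookup from the first feature above, with a hypothetical table layout: the client picks the server whose prefix is the longest match for the path.

```c
#include <string.h>

/* Hypothetical sketch of a Sprite-style prefix table: map the longest
 * matching path prefix to the server that stores that part of the namespace. */
struct prefix_entry { const char *prefix; int server_id; };

static struct prefix_entry table[] = {
    { "/",           0 },   /* root server                      */
    { "/users",      1 },   /* user home directories            */
    { "/users/proj", 2 },   /* a subtree on yet another server  */
};

int lookup_server(const char *path) {
    int best = -1;
    size_t best_len = 0;
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        size_t len = strlen(table[i].prefix);
        if (strncmp(path, table[i].prefix, len) == 0 && len >= best_len) {
            best = table[i].server_id;   /* keep the longest matching prefix */
            best_len = len;
        }
    }
    return best;   /* an unresolved prefix would trigger a broadcast in Sprite */
}
```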
LegoOS and Disaggregated Datacenters
Traditional servers are monolithic: CPU, DRAM, and storage are bundled in one box and communicate over very fast cache-coherent interconnects. This causes resource waste because each server provides a fixed bundle of resources.
LegoOS idea: Use hardware disaggregation: place CPU, memory, storage, and GPUs in separate machines connected via the datacenter network. The OS composes these machines like LEGO blocks to provide flexible resource allocation, smaller failure domains, better utilization, and hardware specialization (e.g., fast memory pools or GPU pools).
Performance challenge: Local interconnects inside a server are ~500–600 Gbps with ~50 ns latency, while datacenter networks are ~40–100 Gbps with microsecond latencies. The OS must hide extra latency and present remote resources as if local.
Design: A split-kernel architecture with pComps (processing), mComps (memory), and sComps (storage). A pComp runs application threads, maintains the virtual address space, and has a small fast local memory area called the excache. On a page fault, the pComp fetches the page from an mComp. The mComp manages physical pages, page table entries, and TLB translations; the sComp stores disk blocks and provides a distributed file buffer cache.
Two-level memory management: Frequently accessed pages are kept in the pComp's excache to avoid network delays; larger memory pools on mComps are accessed remotely when needed. Applications with small working sets may see slightly worse performance due to network overhead, but applications with large working sets can benefit from effectively larger memory pools.
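A rough sketch of the two-level access path on a pComp, with hypothetical helper names (excache_lookup, rdma_fetch_page): the local excache is checked first, and the network is crossed only on a miss.

```c
/* Hypothetical sketch of a pComp handling a memory access in a LegoOS-style
 * split kernel: hit in the small local excache if possible, otherwise fetch
 * the page from a remote memory component (mComp) over the network. */
struct page;                                                  /* opaque page      */

struct page *excache_lookup(unsigned long vaddr);             /* assumed          */
struct page *excache_insert(unsigned long vaddr);             /* assumed, may evict */
void rdma_fetch_page(int mcomp, unsigned long vaddr, struct page *dst);  /* assumed */

struct page *access_page(int mcomp, unsigned long vaddr) {
    struct page *p = excache_lookup(vaddr);
    if (p)
        return p;                      /* fast path: local hit, near-DRAM cost    */
    p = excache_insert(vaddr);         /* reserve a local slot (may evict a page) */
    rdma_fetch_page(mcomp, vaddr, p);  /* slow path: cross the datacenter network */
    return p;
}
```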
Evaluation: LegoOS maintains good performance, allows independent scaling of CPU, memory, and storage, and improves reliability via fine-grained failure isolation.
VAX/VMS Memory and Paging
Goal: Provide predictable, efficient, and stable performance across a wide hardware range, from very small RAM (250 KB–8 MB) to larger minicomputers. Avoid thrashing, reduce disk I/O, and remain efficient under mixed workloads (real-time, interactive timesharing, batch).
Address space and page tables
VAX divides virtual address space into system space (high half, shared by all processes) and process space (low half), which is further divided into P0 (program code/data/heap) and P1 (stack). Page tables are simple linear arrays stored in system space and can themselves be paged. The hardware uses base and length registers to reference P0/P1 tables while the system page table is always mapped. The TLB is split into system and per-process entries; only per-process entries are flushed on context switches.
Paging model
VAX uses process-local replacement rather than global LRU. Each process has a resident-set limit. When a fault occurs and the resident set is full, eviction happens only from that process's set using a simple FIFO replacement policy. This isolates processes and prevents one memory-intensive process from evicting another's pages.
Free and modified lists
Evicted pages are not immediately written to disk. Clean pages go onto a free list and dirty pages onto a modified list. If a process faults on a page still cached in either list, the fault can be satisfied without disk I/O. The modified list delays writes so that multiple dirty pages can be batched into one write and so that pages reused soon need not be written at all.
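A minimal sketch of the reclaim path, with hypothetical helpers: a fault first checks the free and modified lists, which gives the simple FIFO policy an implicit second chance.

```c
/* Hypothetical sketch of VMS-style page reclaim: pages evicted FIFO from a
 * process's resident set land on the free list (clean) or modified list
 * (dirty); a later fault on such a page is satisfied from memory, giving the
 * simple FIFO policy an implicit second chance. */
enum where { ON_FREE_LIST, ON_MODIFIED_LIST, ON_DISK };

enum where locate_page(unsigned long vpn);             /* assumed lookup    */
void reattach_to_resident_set(unsigned long vpn);      /* assumed           */
void read_page_from_disk(unsigned long vpn);           /* assumed slow path */

void handle_fault(unsigned long vpn) {
    switch (locate_page(vpn)) {
    case ON_FREE_LIST:
    case ON_MODIFIED_LIST:
        reattach_to_resident_set(vpn);   /* soft fault: no disk I/O needed        */
        break;
    case ON_DISK:
        read_page_from_disk(vpn);        /* hard fault: cluster reads to amortize */
        reattach_to_resident_set(vpn);
        break;
    }
}
```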
Other: Clustering is used because pages are small (512 bytes) and disks are slow moving-head devices. Swapping writes out entire resident sets of low-priority or idle processes to a dedicated swap file to maintain stability under extreme memory pressure. Demand-zero and copy-on-reference techniques are used. System-space paging is also supported.
Mach Microkernel
Summary: Mach is a microkernel with good modularity but historically suffered from poor performance due to slow IPC and many abstractions in the kernel.
Mach vs L4: Mach moves OS services to user level but keeps many high-level abstractions inside the kernel. L4 pushes abstractions out and optimizes for performance with a much smaller kernel.
Kernel size: Mach had a relatively large microkernel compared to minimal kernels.
Abstractions: Mach provides ports, messages, tasks, threads, and VM primitives.
Performance: Mach was slow due to large IPC overhead, though later implementations optimized IPC significantly. The system call path could be expensive with many kernel crossings and user-level servers requiring RPCs back to the kernel.
VM/paging: Mach offered flexibility but kernel mediation could be slow. User-level pagers provide flexibility at the cost of more complexity.
Interrupt handling: Delivery could be slow with heavy kernel mediation, though optimizations exist.
IPC mechanism: Initially heavy and slow, though implementations improved with ultra-fast IPC techniques.
Flexibility: Mach allowed custom pagers and servers but came with higher complexity and performance penalties compared to simpler kernels like L4.
Monitors: Hoare, Mesa, and Java
Hoare monitors: Every monitor procedure implicitly acquires the monitor lock on entry; condition variables are explicit. Waiting follows Hoare semantics: when a thread executes wait, it suspends, and when another thread signals, control is transferred immediately to the waiting thread. This guarantees that after wait returns, the condition is true, allowing the programmer to use if (!cond) wait. The signaller is blocked so the awakened thread runs inside the monitor without interference. Hoare's design is mathematically elegant but impractical for many systems. The paper does not fully specify handling of aborts, exceptions, or nested monitor invocations.
Mesa monitors: Adapt monitors for real-world implementation. Monitor entry and exit are explicit (enter/exit), though the compiler may help insert lock operations. Condition variables remain explicit, but Mesa uses weaker semantics: when a thread signals a condition, the signaller continues executing and the awakened thread is placed on the runnable queue. Thus, when the waiter re-enters the monitor the condition may no longer hold, and the pattern while (!cond) wait is required. Mesa introduces runtime mechanisms such as unwind handlers for aborts and exceptions; monitors are non-reentrant, so nested monitor calls must be carefully structured to avoid deadlock. Mesa is more practical but more complex than Hoare's model.
Java monitors: The monitor abstraction is implicit: every object can be a monitor. Entering a synchronized method or block acquires the object's lock. Java uses Mesa-style semantics for waiting and signaling; each object has implicit condition variables and threads call wait(), notify(), or notifyAll(). Because awakened threads run only after the notifying thread releases the lock, Java also requires the while (!cond) wait pattern. Java supports multiple granularities of synchronization, automatically releases locks when exceptions occur, and supports reentrancy (a thread can acquire the same lock multiple times). Java blends Mesa-style practicality with higher-level implicit monitors and exception-safe lock handling.
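A minimal C/pthreads sketch of the Mesa-style pattern that both Mesa and Java require (a while loop, not an if, around the wait):

```c
#include <pthread.h>

/* Mesa-style monitor usage sketch: because the signaller keeps running and
 * the awakened thread merely becomes runnable, the condition may be false
 * again by the time the waiter reacquires the lock, so it must re-check in a
 * loop. Under Hoare semantics an "if" would suffice. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;
static int items = 0;

void consume(void) {
    pthread_mutex_lock(&lock);               /* enter the monitor             */
    while (items == 0)                       /* while, not if: Mesa semantics */
        pthread_cond_wait(&nonempty, &lock);
    items--;                                 /* condition guaranteed by loop  */
    pthread_mutex_unlock(&lock);             /* exit the monitor              */
}

void produce(void) {
    pthread_mutex_lock(&lock);
    items++;
    pthread_cond_signal(&nonempty);          /* signaller continues running   */
    pthread_mutex_unlock(&lock);
}
```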