Linux Process Management: Deep Dive into Kernel Internals
What is a Process?
A process is essentially a program in execution.
Lightweight Processes in Linux
Lightweight Processes (LWPs) in Linux are processes that can share resources, such as the address space and open files, with other processes; they exist to offer better support for multithreaded applications.
Multithreaded Applications in Linux
A multithreaded application is designed to perform multiple tasks concurrently within a single process. In Linux, a straightforward way to implement multithreaded applications is to associate a lightweight process with each thread.
This approach allows threads to access the same set of application data structures by simply sharing the same memory address space, the same set of open files, and so on. Simultaneously, each thread can be scheduled independently by the kernel, meaning one thread may sleep while another remains runnable.
Examples of POSIX-compliant pthread libraries that utilize Linux's lightweight processes include:
- LinuxThreads
- Native POSIX Thread Library (NPTL)
- IBM's Next Generation POSIX Threading Package (NGPT)
Linux Thread Groups Explained
In Linux, a thread group is fundamentally a set of lightweight processes that implement a multithreaded application. These processes act as a cohesive unit with regard to certain system calls, such as `getpid()`, `kill()`, and `_exit()`.
Process Descriptors and Their Role
A process descriptor is a crucial data structure used by the kernel to manage processes. It stores all information related to a single process, including:
- The process's priority
- Whether it is running on a CPU or blocked on an event
- The address space assigned to it
- Which files it is allowed to access
- And other vital process-related data
Linux Process Descriptor: struct task_struct
The Linux process descriptor is represented by `struct task_struct`. (A schematic diagram would typically illustrate its various fields and their relationships, but cannot be provided in this text format.)
Seven Linux Process States
The seven possible process states in the Linux operating system are:
- TASK_RUNNING
- TASK_INTERRUPTIBLE
- TASK_UNINTERRUPTIBLE
- TASK_STOPPED
- TASK_TRACED
- EXIT_ZOMBIE
- EXIT_DEAD
Detailed Linux Process States
TASK_RUNNING
The process is either currently executing on a CPU or waiting in the runqueue to be executed.
TASK_INTERRUPTIBLE
The process is suspended (sleeping) until some condition becomes true. Examples of conditions that might wake up the process (returning its state to `TASK_RUNNING`) include a hardware interrupt, the release of a system resource the process is waiting for, or the delivery of a signal.
TASK_UNINTERRUPTIBLE
Similar to `TASK_INTERRUPTIBLE`, but with the key difference that delivering a signal to a sleeping process in this state leaves its state unchanged: it will not be woken up by signals.
TASK_STOPPED
Process execution has been halted. The process enters this state after receiving a `SIGSTOP`, `SIGTSTP`, `SIGTTIN`, or `SIGTTOU` signal.
TASK_TRACED
Process execution has been stopped by a debugger. When a process is being monitored by another (e.g., when a debugger executes a `ptrace()` system call to monitor a test program), each signal may put the process into the `TASK_TRACED` state.
EXIT_ZOMBIE
Process execution is terminated, but the parent process has not yet issued a `wait4()` or `waitpid()` system call to retrieve information about the dead process. Before a `wait()`-like call is issued, the kernel cannot discard the data contained in the dead process descriptor because the parent might still need it. Other `wait()`-like library functions, such as `wait3()` and `wait()`, are implemented in Linux via the `wait4()` and `waitpid()` system calls.
EXIT_DEAD
This is the final state: the process is being removed by the system because the parent process has just issued a `wait4()` or `waitpid()` system call for it. Changing its state from `EXIT_ZOMBIE` to `EXIT_DEAD` avoids race conditions that could occur if other threads of execution were to execute `wait()`-like calls on the same process.
Purpose of TASK_UNINTERRUPTIBLE State
The `TASK_UNINTERRUPTIBLE` state is used for processes that are suspended (sleeping) and cannot be woken up by signals. This is crucial for operations where a process must not be interrupted, such as waiting for I/O on a critical device, to prevent data corruption or system instability.
Setting a Process's State in the Kernel
The kernel uses two macros to set the state of a process:
- `set_task_state`: sets the state of a specified process.
- `set_current_state`: sets the state of the process currently executing on the CPU. These macros also prevent the assignment from being reordered with respect to other instructions.
Process ID (PID): Definition and Necessity
PID stands for Process ID. It is a unique numerical identifier assigned to each process, essential for the kernel and other processes to identify and manage individual processes within the system.
PID Limits: 32-bit vs. 64-bit Architectures
The upper limit on PID values varies by architecture:
- In 32-bit architectures, the upper limit on PID values is 32,767.
- In 64-bit architectures, the upper limit on PID values is 4,194,303.
The `pidmap_array` Bitmap in Linux
The `pidmap_array` bitmap in Linux denotes which PIDs are currently assigned and which are free. Its size is 32,768 bits, so it fits in a single 4 KB page frame.
Thread Group Leader and PID Assignment
By default, the first lightweight process created in a thread group becomes the thread group leader. The identifier shared by all threads in the group is the PID of this thread group leader, which is stored in the `tgid` field of the process descriptors.
Process Descriptors in Dynamic Memory
Since the kernel must be able to handle many processes concurrently, process descriptors are stored in dynamic memory rather than in a memory area permanently assigned to the kernel. This allows for flexible allocation and deallocation as processes are created and terminated.
Per-Process Memory Area Data Structures
The kernel assigns two different data structures to every process within a single per-process memory area. The length of this memory area is usually 8,192 bytes (two page frames). The data structures stored in this area are:
- A small data structure linked to the process descriptor, namely the `thread_info` structure.
- The Kernel Mode process stack.
(A diagram would typically illustrate how these two structures are laid out within the 8KB memory area, but cannot be provided here.)
`thread_info` and Kernel Mode Stack Purpose
The `thread_info` Structure
In the 80x86 architecture, the kernel can be configured at compilation time so that the memory area including the stack and `thread_info` structure spans a single page frame (4,096 bytes). The `thread_info` structure holds essential information about the thread, such as flags, the task pointer, and the base address of the kernel stack.
Kernel Mode Process Stack
A process in Kernel Mode accesses a stack contained in the kernel data segment, which is distinct from the stack used by the process in User Mode. Because kernel control paths make relatively little use of the stack, only a few thousand bytes of kernel stack are typically required. Therefore, 8 KB is ample space for both the stack and the `thread_info` structure.
Efficiency of `thread_info` and Kernel Stack Association
A key efficiency benefit of the close association between the `thread_info` structure and the Kernel Mode stack is rapid access. Because the `thread_info` structure is relatively small (e.g., 52 bytes long), the kernel stack can expand up to 8,140 bytes within the 8 KB memory area.
The kernel can easily obtain the address of the `thread_info` structure of the process currently running on a CPU from the value of the `esp` register. If the `thread_union` structure is 8 KB (2^13 bytes) long, the kernel masks out the 13 least significant bits of `esp` to obtain the base address of the `thread_info` structure. If the `thread_union` structure is 4 KB long, the kernel masks out the 12 least significant bits of `esp`.
Purpose of the `current` Macro in Linux
Most often, the kernel needs the address of the process descriptor rather than the address of the `thread_info` structure. To get the process descriptor pointer of the process currently running on a CPU, the kernel makes use of the `current` macro, which is essentially equivalent to `current_thread_info()->task`. The `current` macro frequently appears in kernel code as a prefix to fields of the process descriptor. For example, `current->pid` returns the process ID of the process currently running on the CPU.
Linux Process List Implementation
In Linux, the process list is implemented as a doubly linked list that links together all existing process descriptors. The head of this process list is the `init_task` `task_struct` descriptor, which represents process 0, also known as the swapper process.
- The `tasks->prev` field of `init_task` points to the `tasks` field of the process descriptor inserted last in the list.
- The `SET_LINKS` and `REMOVE_LINKS` macros are used to insert and remove a process descriptor in the process list, respectively.
Process List Data Structures (Diagram Not Shown)
Special data structures are used to implement the process list, typically a doubly linked list where each node is embedded within a `task_struct`. The first node points to process 0 (`init_task`), and the second node would point to process 1 (the `init` process).
Linux Runqueue for Scheduler Speedup
Earlier Linux versions placed all runnable processes in a single list called the runqueue, which had to be scanned to pick the next process. To select the best runnable process in constant (or near-constant) time, modern Linux kernels use more sophisticated per-CPU runqueue implementations, such as the O(1) scheduler's priority arrays or the Completely Fair Scheduler's red-black tree, to efficiently manage and select processes for execution.
Parent and Sibling Process Relationships (Figure Not Shown)
A figure would illustrate the hierarchical relationship between processes. For example, if Process P0 successively created P1, P2, and P3, and P3, in turn, created P4, the diagram would show P0 as the parent of P1, P2, and P3 (siblings to each other), and P3 as the parent of P4.
PID Hash Tables and Chained Lists
To speed up the search for process descriptors by PID, four hash tables have been introduced in Linux. These are required because the process descriptor includes fields that represent different types of PIDs, and each type of PID requires its own hash table.
Linux uses chaining to handle colliding PIDs; each table entry is the head of a doubly linked list of colliding process descriptors. Hashing with chaining is preferable to a linear transformation from PIDs to table indexes because, at any given instance, the number of processes in the system is usually far below 32,768 (the maximum number of allowed PIDs).
PID Hash Table and Chained Lists Diagram
(A block diagram would typically describe the PID hash table structure, showing how hash values map to table entries, and how chained lists extend from these entries to handle collisions, but cannot be provided here.)
Four PID Hash Tables in Linux
The four PID hash tables used in Linux are required to efficiently look up process descriptors based on different types of PIDs associated with a process (e.g., process ID, thread group ID, session ID, process group ID). Each table optimizes lookups for its specific PID type.
Four Hash Tables Implementation Diagram
(A block diagram would describe the four hash tables, illustrating how they are implemented with chained lists and how processes are grouped within each chain, but cannot be provided here.)
Purpose of Wait Queues in Linux
The process state alone does not provide enough information to retrieve a process quickly when it's waiting for a specific event. Therefore, additional lists of processes called wait queues are introduced. A wait queue represents a set of sleeping processes that are woken up by the kernel when a particular condition becomes true.
Runqueue vs. Wait Queues in Linux
- The runqueue lists group all processes in the `TASK_RUNNING` state (i.e., ready to be executed).
- Wait queues, on the other hand, group processes in other states, specifically those that are sleeping and waiting for a particular event to occur. The various sleeping states call for different types of treatment, with Linux opting for specific wait queue mechanisms.
Key Uses of Wait Queues in the Kernel
The three important uses of wait queues in the kernel are:
- Interrupt handling
- Process synchronization
- Timing mechanisms
Wait Queue Implementation in Linux
Wait queues in Linux are implemented using two primary data structures:
1. `struct __wait_queue_head`

```c
struct __wait_queue_head {
    spinlock_t lock;
    struct list_head task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;
```

- `lock`: A `spinlock_t` used for synchronization. Because wait queues are modified by interrupt handlers as well as by major kernel functions, the doubly linked lists must be protected from concurrent accesses, which could induce unpredictable results.
- `task_list`: The head of the list of waiting processes.
2. `struct __wait_queue`

```c
struct __wait_queue {
    unsigned int flags;
    struct task_struct *task;
    wait_queue_func_t func;
    struct list_head task_list;
};
typedef struct __wait_queue wait_queue_t;
```

- `flags`: Indicates the kind of sleeping process (exclusive or non-exclusive).
- `task`: A pointer to the `task_struct` (process descriptor) of the sleeping process.
- `func`: Specifies how the processes sleeping in the wait queue should be woken up.
- `task_list`: Contains the pointers that link this element to the list of processes waiting for the same event.
The "Thundering Herd" Problem and Solution
The "thundering herd" problem occurs when multiple processes are woken up simultaneously, only to race for a resource that can be accessed by only one of them. The result is that the remaining processes must once more be put back to sleep, leading to inefficiency.
This problem is tackled by classifying sleeping processes into two kinds: exclusive and non-exclusive, allowing the kernel to selectively wake up processes.
Types of Sleeping Processes in Wait Queues
There are two kinds of sleeping processes in a wait queue, and this classification is required to mitigate the "thundering herd" problem:
- Exclusive processes (denoted by the value 1 in the `flags` field of the corresponding wait queue element) are selectively woken up by the kernel: only one or a limited number of these processes is woken up when the event occurs.
- Non-exclusive processes (denoted by the value 0 in the `flags` field) are always woken up by the kernel when the event occurs.
Linux Process Resource Limits
Each process has an associated set of resource limits, which specify the amount of system resources it can use. These limits are crucial for preventing a single user or process from overwhelming the system's CPU, disk space, memory, and other resources.
Various resource limits can be specified, including (but not limited to):
- CPU time
- File size
- Data segment size
- Stack size
- Core file size
- Resident set size
- Number of open files
- Number of processes
- Locked-in-memory address space
Unix/Linux Resource Limit Commands
The purpose of the following commands in Unix/Linux is:
- `ulimit -Ha`: Displays all hard resource limits for the current user. Hard limits are maximum values that cannot be increased by an unprivileged user.
- `ulimit -Sa`: Displays all soft resource limits for the current user. Soft limits are the current effective limits, which can be increased up to the hard limit by an unprivileged user.
- `getconf -a`: Displays all configurable system variables and their current values.
- `getconf PAGE_SIZE`: Displays the size of a memory page in bytes for the current system.
Process, Context, and Task Switching in Linux
To control the execution of processes, the kernel must be able to suspend the execution of the process currently running on the CPU and resume the execution of some other process that was previously suspended. This activity is known by various names: process switch, task switch, or context switch.
Hardware Context in Linux
The hardware context is the set of data that must be loaded into the CPU registers before a process can resume its execution. In Linux, a portion of the hardware context of a process is stored in the process descriptor, while the remaining part is saved in the Kernel Mode stack.
Key Concepts: TSS and Thread Field
Task State Segment (TSS)
The 80x86 architecture includes a specific segment type called the Task State Segment (TSS), designed to store hardware contexts. Older Linux versions took advantage of this hardware support and performed a process switch through a far `jmp` instruction to the selector of the Task State Segment Descriptor of the next process. Linux 2.6 and later perform the switch in software instead, although each CPU still keeps a TSS (for example, to locate the Kernel Mode stack on a privilege-level change).
Thread Field
Each process descriptor includes a field called `thread`, of type `thread_struct`. In this structure, the kernel saves the hardware context whenever the process is being switched out (i.e., its execution is suspended).
Kernel Process Switch with schedule()
The kernel performs a process switch, typically orchestrated by the `schedule()` function, through two main steps:
- Switching the Page Global Directory: This installs a new address space for the incoming process.
- Switching the Kernel Mode stack and the hardware context: This provides all the information needed by the kernel to execute the new process, including the CPU registers.
Address Space Duplication Solutions in Unix
Modern Unix kernels have introduced three different mechanisms to solve the problem of duplicating address space efficiently when creating new processes:
- Copy-On-Write (COW) technique: This allows both the parent and the child process to initially read the same physical pages. Whenever either process attempts to write to a physical page, the kernel copies its contents into a new physical page, which is then assigned to the writing process. The associated system call for process creation is `fork()`.
- Lightweight processes (LWP): These allow both the parent and the child to share many per-process kernel data structures, such as the paging tables (and therefore the entire User Mode address space), the open file tables, and the signal dispositions. The associated system call is `clone()`.
- `vfork()` system call: This creates a process that shares the memory address space of its parent. To prevent the parent from overwriting data needed by the child, the parent's execution is blocked until the child exits or executes a new program (e.g., via `execve()`).
Process Creation System Calls
The three system calls used to create a process are `clone()`, `fork()`, and `vfork()`. These system calls are required to provide different levels of resource sharing and execution semantics, catering to various application needs, from traditional process creation to lightweight multithreading.
`clone()` System Call for Lightweight Processes
Lightweight processes are created in Linux using a function named `clone()`, which takes the following parameters:
- `fn`: Specifies a function to be executed by the new process; when this function returns, the child terminates. The function returns an integer, which represents the exit code for the child process.
- `arg`: Points to data passed to the `fn()` function.
- `flags`: Miscellaneous information. The low byte specifies the signal number to be sent to the parent process when the child terminates; the `SIGCHLD` signal is generally selected.
- `child_stack`: Specifies the User Mode stack pointer to be assigned to the `esp` register of the child process. The invoking process (the parent) should always allocate a new stack for the child.
- `tls`: Specifies the address of a data structure that defines a Thread Local Storage segment for the new lightweight process. Meaningful only if the `CLONE_SETTLS` flag is set.
- `ptid`: Specifies the address of a User Mode variable of the parent process that will hold the PID of the new lightweight process. Meaningful only if the `CLONE_PARENT_SETTID` flag is set.
- `ctid`: Specifies the address of a User Mode variable of the new lightweight process that will hold the PID of that process. Meaningful only if the `CLONE_CHILD_SETTID` flag is set.
Process 0 and Process 1 in Linux
Process 0 (Swapper/Idle Process)
The ancestor of all processes, called process 0, the idle process, or, for historical reasons, the swapper process, is a kernel thread created from scratch during the initialization phase of Linux. It is the first process to run and is responsible for system initialization and managing idle CPU time.
Process 1 (Init Process)
The kernel thread created by process 0 executes the `init()` function, which in turn completes the initialization of the kernel. Then, `init()` invokes the `execve()` system call to load the executable program `init` (or `systemd` on modern systems). Process 1 becomes the ancestor of all other user-space processes and is responsible for reaping orphaned processes.
Linux Process Management Functions
`do_fork()` Function
The `do_fork()` function makes use of an auxiliary function called `copy_process()` to set up the process descriptor and any other kernel data structures required for the child's execution.
`copy_process()` Function
The `copy_process()` function is a core routine used for creating new processes or threads. For instance, the swapper process running on CPU 0 initializes the kernel data structures, then enables the other CPUs and creates the additional swapper processes by means of the `copy_process()` function, passing to it the value 0 as the new PID.
Kernel Threads: Creation and Characteristics
Kernel threads are processes that run only in Kernel Mode, unlike regular processes which run alternatively in Kernel Mode and User Mode. They are used for tasks that need to run in the kernel's context, such as managing devices, handling interrupts, or performing background operations.
Kernel threads are created in Linux using functions like `kernel_thread()`. This function receives as parameters the address of the kernel function to be executed (`fn`), the argument to be passed to that function (`arg`), and a set of clone flags (`flags`). Essentially, it invokes `do_fork()` with flags that ensure the new process runs entirely in kernel space.
Purpose of Key Linux Functions
- `kernel_thread()`: Creates a new kernel thread.
- `copy_process()`: A fundamental routine for duplicating process information, used by `fork()`, `vfork()`, and `clone()` to set up new process descriptors and associated kernel data structures.
- `_exit()`: Terminates a single process, regardless of any other processes in the thread group of the victim.
- `exit_group()`: Terminates a full thread group, i.e., an entire multithreaded application.
Process Creation and Termination in Linux
In Linux, processes are primarily created via the `clone()` system call and its relatives `fork()` and `vfork()`, all of which funnel into `do_fork()` in the kernel. Processes are terminated by the `_exit()` system call (or `exit_group()` for entire thread groups).
Orphan Processes and Memory Leak Prevention
An orphan process is a child process whose parent process has terminated before the child. To prevent memory leaks and ensure proper system cleanup, orphan processes in Linux become children of the `init` process (process 1).
Handling Parent Termination Before Children
If a parent process terminates before its children, the system could be flooded with zombie processes whose process descriptors would remain in RAM indefinitely, leaking memory. This problem is solved by forcing all orphan processes to become children of the `init` process. In this way, the `init` process destroys these zombie children while checking for the termination of one of its legitimate children through a `wait()`-like system call.