Linux Process Management: Deep Dive into Kernel Internals
What is a Process?
A process is essentially a program in execution.
Lightweight Processes in Linux
Lightweight Processes (LWPs) in Linux are processes that can share resources, such as the address space and open files, with other processes; they exist to offer better support for multithreaded applications.
Multithreaded Applications in Linux
A multithreaded application is designed to perform multiple tasks concurrently within a single process. In Linux, a straightforward way to implement multithreaded applications is to associate a lightweight process with each thread.
This approach allows threads to access the same set of application data structures by simply sharing the same memory address space, the same set of open files, and so on. Simultaneously, each thread can be scheduled independently by the kernel, meaning one thread may sleep while another remains runnable.
Examples of POSIX-compliant pthread libraries that utilize Linux's lightweight processes include:
- LinuxThreads
- Native POSIX Thread Library (NPTL)
- IBM's Next Generation POSIX Threading Package (NGPT)
Linux Thread Groups Explained
In Linux, a thread group is fundamentally a set of lightweight processes that implement a multithreaded application. These processes act as a cohesive unit with regard to certain system calls, such as `getpid()`, `kill()`, and `_exit()`.
Process Descriptors and Their Role
A process descriptor is a crucial data structure used by the kernel to manage processes. It stores all information related to a single process, including:
- The process's priority
- Whether it is running on a CPU or blocked on an event
- The address space assigned to it
- Which files it is allowed to access
- And other vital process-related data
Linux Process Descriptor: struct task_struct
The Linux process descriptor is represented by `struct task_struct`. (A schematic diagram would typically illustrate its various fields and their relationships, but cannot be provided in this text format.)
Seven Linux Process States
The seven possible process states in the Linux operating system are:
- TASK_RUNNING
- TASK_INTERRUPTIBLE
- TASK_UNINTERRUPTIBLE
- TASK_STOPPED
- TASK_TRACED
- EXIT_ZOMBIE
- EXIT_DEAD
Detailed Linux Process States
TASK_RUNNING
The process is either currently executing on a CPU or waiting in the runqueue to be executed.
TASK_INTERRUPTIBLE
The process is suspended (sleeping) until some condition becomes true. Examples of conditions that might wake up the process (returning its state to `TASK_RUNNING`) include a hardware interrupt, the release of a system resource the process is waiting for, or the delivery of a signal.
TASK_UNINTERRUPTIBLE
Similar to `TASK_INTERRUPTIBLE`, but with the key difference that delivering a signal to a sleeping process in this state leaves its state unchanged: it will not be woken up by signals.
TASK_STOPPED
Process execution has been halted. The process enters this state after receiving a `SIGSTOP`, `SIGTSTP`, `SIGTTIN`, or `SIGTTOU` signal.
TASK_TRACED
Process execution has been stopped by a debugger. When a process is being monitored by another (e.g., when a debugger executes a `ptrace()` system call to monitor a test program), each signal may put the process into the `TASK_TRACED` state.
EXIT_ZOMBIE
Process execution is terminated, but the parent process has not yet issued a `wait4()` or `waitpid()` system call to retrieve information about the dead process. Before a `wait()`-like call is issued, the kernel cannot discard the data contained in the dead process descriptor because the parent might still need it. Other `wait()`-like library functions, such as `wait3()` and `wait()`, are implemented in Linux via the `wait4()` and `waitpid()` system calls.
EXIT_DEAD
This is the final state: the process is being removed by the system because the parent process has just issued a `wait4()` or `waitpid()` system call for it. Changing its state from `EXIT_ZOMBIE` to `EXIT_DEAD` avoids race conditions that could occur if other threads of execution were to execute `wait()`-like calls on the same process.
Purpose of TASK_UNINTERRUPTIBLE State
The `TASK_UNINTERRUPTIBLE` state is used for processes that are suspended (sleeping) and cannot be woken up by signals. This is crucial for operations where a process must not be interrupted, such as waiting for I/O on a critical device, to prevent data corruption or system instability.
Setting a Process's State in the Kernel
The kernel uses two macros to set the state of a process:
- `set_task_state`: sets the state of a specified process.
- `set_current_state`: sets the state of the process currently executing on the CPU. These macros also prevent the assignment from being reordered with respect to other instructions.
Process ID (PID): Definition and Necessity
PID stands for Process ID. It is a unique numerical identifier assigned to each process, essential for the kernel and other processes to identify and manage individual processes within the system.
PID Limits: 32-bit vs. 64-bit Architectures
The upper limit on PID values varies by architecture:
- In 32-bit architectures, the upper limit on PID values is 32,767.
- In 64-bit architectures, the upper limit on PID values is 4,194,303.
The `pidmap_array` Bitmap in Linux
The `pidmap_array` bitmap in Linux denotes which PIDs are currently assigned and which are free. Its size is 32,768 bits, so it fits in a single 4 KB page frame.
Thread Group Leader and PID Assignment
By default, the first lightweight process created in a thread group becomes the thread group leader. The identifier shared by all threads in the group is the PID of this thread group leader, which is stored in the `tgid` field of the process descriptors.
Process Descriptors in Dynamic Memory
Since the kernel must be able to handle many processes concurrently, process descriptors are stored in dynamic memory rather than in a memory area permanently assigned to the kernel. This allows for flexible allocation and deallocation as processes are created and terminated.
Per-Process Memory Area Data Structures
The kernel assigns two different data structures to every process within a single per-process memory area. The length of this memory area is usually 8,192 bytes (two page frames). The data structures stored in this area are:
- A small data structure linked to the process descriptor, namely the `thread_info` structure.
- The Kernel Mode process stack.
(A diagram would typically illustrate how these two structures are laid out within the 8KB memory area, but cannot be provided here.)
`thread_info` and Kernel Mode Stack Purpose
The `thread_info` Structure
In the 80x86 architecture, the kernel can be configured at compilation time so that the memory area including the stack and `thread_info` structure spans a single page frame (4,096 bytes). The `thread_info` structure holds essential information about the thread, such as flags, the task pointer, and the base address of the kernel stack.
Kernel Mode Process Stack
A process in Kernel Mode accesses a stack contained in the kernel data segment, which is distinct from the stack used by the process in User Mode. Because kernel control paths make relatively little use of the stack, only a few thousand bytes of kernel stack are typically required. Therefore, 8 KB is ample space for both the stack and the `thread_info` structure.
Efficiency of `thread_info` and Kernel Stack Association
A key efficiency benefit of the close association between the `thread_info` structure and the Kernel Mode stack is rapid access. Because the `thread_info` structure is relatively small (e.g., 52 bytes long), the kernel stack can expand up to 8,140 bytes within the 8 KB memory area.
The kernel can easily obtain the address of the `thread_info` structure of the process currently running on a CPU from the value of the `esp` register. If the `thread_union` structure is 8 KB (2^13 bytes) long, the kernel masks out the 13 least significant bits of `esp` to obtain the base address of the `thread_info` structure. If the `thread_union` structure is 4 KB long, the kernel masks out the 12 least significant bits of `esp`.
Purpose of the `current` Macro in Linux
Most often, the kernel needs the address of the process descriptor rather than the address of the `thread_info` structure. To get the process descriptor pointer of the process currently running on a CPU, the kernel makes use of the `current` macro, which is essentially equivalent to `current_thread_info()->task`. The `current` macro frequently appears in kernel code as a prefix to fields of the process descriptor. For example, `current->pid` returns the process ID of the process currently running on the CPU.
Linux Process List Implementation
In Linux, the process list is implemented as a doubly linked list that links together all existing process descriptors. The head of this process list is the `init_task` `task_struct` descriptor, which represents process 0, also known as the swapper process.
- The `tasks->prev` field of `init_task` points to the `tasks` field of the process descriptor inserted last in the list.
- The `SET_LINKS` and `REMOVE_LINKS` macros are used to insert and remove a process descriptor in the process list, respectively.
Process List Data Structures (Diagram Not Shown)
Special data structures are used to implement the process list, typically a doubly linked list where each node is embedded within a `task_struct`. The first node points to process 0 (`init_task`), and the second node would point to process 1 (the `init` process).
Linux Runqueue for Scheduler Speedup
Earlier Linux versions placed all runnable processes in a single list called the runqueue, which had to be scanned to pick the next process. To select the best runnable process in constant (or near-constant) time, modern Linux kernels use more sophisticated per-CPU runqueue implementations, such as the O(1) scheduler's priority arrays or the Completely Fair Scheduler's red-black tree, to efficiently manage and select processes for execution.
Parent and Sibling Process Relationships (Figure Not Shown)
A figure would illustrate the hierarchical relationship between processes. For example, if Process P0 successively created P1, P2, and P3, and P3, in turn, created P4, the diagram would show P0 as the parent of P1, P2, and P3 (siblings to each other), and P3 as the parent of P4.
PID Hash Tables and Chained Lists
To speed up the search for process descriptors by PID, four hash tables have been introduced in Linux. These are required because the process descriptor includes fields that represent different types of PIDs, and each type of PID requires its own hash table.
Linux uses chaining to handle colliding PIDs; each table entry is the head of a doubly linked list of colliding process descriptors. Hashing with chaining is preferable to a linear transformation from PIDs to table indexes because, at any given instance, the number of processes in the system is usually far below 32,768 (the maximum number of allowed PIDs).
PID Hash Table and Chained Lists Diagram
(A block diagram would typically describe the PID hash table structure, showing how hash values map to table entries, and how chained lists extend from these entries to handle collisions, but cannot be provided here.)
Four PID Hash Tables in Linux
The four PID hash tables used in Linux are required to efficiently look up process descriptors based on different types of PIDs associated with a process (e.g., process ID, thread group ID, session ID, process group ID). Each table optimizes lookups for its specific PID type.
Four Hash Tables Implementation Diagram
(A block diagram would describe the four hash tables, illustrating how they are implemented with chained lists and how processes are grouped within each chain, but cannot be provided here.)
Purpose of Wait Queues in Linux
The process state alone does not provide enough information to retrieve a process quickly when it's waiting for a specific event. Therefore, additional lists of processes called wait queues are introduced. A wait queue represents a set of sleeping processes that are woken up by the kernel when a particular condition becomes true.
Runqueue vs. Wait Queues in Linux
- The runqueue lists group all processes in the `TASK_RUNNING` state (i.e., ready to be executed).
- Wait queues, on the other hand, group processes in other states, specifically those that are sleeping and waiting for a particular event to occur. The various sleeping states call for different types of treatment, with Linux opting for specific wait queue mechanisms.
Key Uses of Wait Queues in the Kernel
The three important uses of wait queues in the kernel are:
- Interrupt handling
- Process synchronization
- Timing mechanisms
Wait Queue Implementation in Linux
Wait queues in Linux are implemented using two primary data structures:
1. `struct __wait_queue_head`

```c
struct __wait_queue_head {
    spinlock_t lock;
    struct list_head task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;
```

- `lock`: A `spinlock_t` used for synchronization. Because wait queues are modified by interrupt handlers as well as by major kernel functions, the doubly linked lists must be protected from concurrent accesses, which could induce unpredictable results.
- `task_list`: The head of the list of waiting processes.
2. `struct __wait_queue`

```c
struct __wait_queue {
    unsigned int flags;
    struct task_struct *task;
    wait_queue_func_t func;
    struct list_head task_list;
};
typedef struct __wait_queue wait_queue_t;
```

- `flags`: Indicates the kind of sleeping process (exclusive or non-exclusive).
- `task`: A pointer to the `task_struct` (process descriptor) of the sleeping process.
- `func`: Specifies how the processes sleeping in the wait queue should be woken up.
- `task_list`: Contains the pointers that link this element to the list of processes waiting for the same event.
The "Thundering Herd" Problem and Solution
The "thundering herd" problem occurs when multiple processes are woken up simultaneously, only to race for a resource that can be accessed by only one of them. The result is that the remaining processes must once more be put back to sleep, leading to inefficiency.
This problem is tackled by classifying sleeping processes into two kinds: exclusive and non-exclusive, allowing the kernel to selectively wake up processes.
Types of Sleeping Processes in Wait Queues
There are two kinds of sleeping processes in a wait queue, and this classification is required to mitigate the "thundering herd" problem:
- Exclusive processes (denoted by the value 1 in the `flags` field of the corresponding wait queue element) are selectively woken up by the kernel: only one or a limited number of these processes is woken up when the event occurs.
- Non-exclusive processes (denoted by the value 0 in the `flags` field) are always woken up by the kernel when the event occurs.
Linux Process Resource Limits
Each process has an associated set of resource limits, which specify the amount of system resources it can use. These limits are crucial for preventing a single user or process from overwhelming the system's CPU, disk space, memory, and other resources.
Various resource limits can be specified, including (but not limited to):
- CPU time
- File size
- Data segment size
- Stack size
- Core file size
- Resident set size
- Number of open files
- Number of processes
- Locked-in-memory address space
Unix/Linux Resource Limit Commands
The purpose of the following commands in Unix/Linux is:
- `ulimit -Ha`: Displays all hard resource limits for the current user. Hard limits are maximum values that cannot be increased by an unprivileged user.
- `ulimit -Sa`: Displays all soft resource limits for the current user. Soft limits are the current effective limits, which can be increased up to the hard limit by an unprivileged user.
- `getconf -a`: Displays all configurable system variables and their current values.
- `getconf PAGE_SIZE`: Displays the size of a memory page in bytes for the current system.
Process, Context, and Task Switching in Linux
To control the execution of processes, the kernel must be able to suspend the execution of the process currently running on the CPU and resume the execution of some other process that was previously suspended. This activity is known by various names: process switch, task switch, or context switch.
Hardware Context in Linux
The hardware context is the set of data that must be loaded into the CPU registers before a process can resume its execution. In Linux, a portion of the hardware context of a process is stored in the process descriptor, while the remaining part is saved in the Kernel Mode stack.
Key Concepts: TSS and Thread Field
Task State Segment (TSS)
The 80x86 architecture includes a specific segment type called the Task State Segment (TSS), designed to store hardware contexts. Older Linux versions took advantage of this hardware support and performed a process switch through a far `jmp` instruction to the selector of the Task State Segment Descriptor of the next process. Linux 2.6 and later perform the switch in software instead, although each CPU still keeps a TSS (for example, to locate the Kernel Mode stack on a privilege-level change).
Thread Field
Each process descriptor includes a field called `thread`, of type `thread_struct`. In this structure, the kernel saves the hardware context whenever the process is being switched out (i.e., its execution is suspended).
Kernel Process Switch with schedule()
The kernel performs a process switch, typically orchestrated by the `schedule()` function, through two main steps:
- Switching the Page Global Directory: This installs a new address space for the incoming process.
- Switching the Kernel Mode stack and the hardware context: This provides all the information needed by the kernel to execute the new process, including the CPU registers.
Address Space Duplication Solutions in Unix
Modern Unix kernels have introduced three different mechanisms to solve the problem of duplicating address space efficiently when creating new processes:
- Copy-On-Write (COW) technique: This allows both the parent and the child process to initially read the same physical pages. Whenever either process attempts to write to a physical page, the kernel copies its contents into a new physical page, which is then assigned to the writing process. The associated system call for process creation is `fork()`.
- Lightweight processes (LWP): These allow both the parent and the child to share many per-process kernel data structures, such as the paging tables (and therefore the entire User Mode address space), the open file tables, and the signal dispositions. The associated system call is `clone()`.
- `vfork()` system call: This creates a process that shares the memory address space of its parent. To prevent the parent from overwriting data needed by the child, the parent's execution is blocked until the child exits or executes a new program (e.g., via `execve()`).
Process Creation System Calls
The three system calls used to create a process are `clone()`, `fork()`, and `vfork()`. These system calls are required to provide different levels of resource sharing and execution semantics, catering to various application needs, from traditional process creation to lightweight multithreading.
`clone()` System Call for Lightweight Processes
Lightweight processes are created in Linux using a function named `clone()`, which takes the following parameters:
- `fn`: Specifies a function to be executed by the new process; when this function returns, the child terminates. The function returns an integer, which represents the exit code for the child process.
- `arg`: Points to data passed to the `fn()` function.
- `flags`: Miscellaneous information. The low byte specifies the signal number to be sent to the parent process when the child terminates; the `SIGCHLD` signal is generally selected.
- `child_stack`: Specifies the User Mode stack pointer to be assigned to the `esp` register of the child process. The invoking process (the parent) should always allocate a new stack for the child.
- `tls`: Specifies the address of a data structure that defines a Thread Local Storage segment for the new lightweight process. Meaningful only if the `CLONE_SETTLS` flag is set.
- `ptid`: Specifies the address of a User Mode variable of the parent process that will hold the PID of the new lightweight process. Meaningful only if the `CLONE_PARENT_SETTID` flag is set.
- `ctid`: Specifies the address of a User Mode variable of the new lightweight process that will hold the PID of that process. Meaningful only if the `CLONE_CHILD_SETTID` flag is set.
Process 0 and Process 1 in Linux
Process 0 (Swapper/Idle Process)
The ancestor of all processes, called process 0, the idle process, or, for historical reasons, the swapper process, is a kernel thread created from scratch during the initialization phase of Linux. It is the first process to run and is responsible for system initialization and managing idle CPU time.
Process 1 (Init Process)
The kernel thread created by process 0 executes the `init()` function, which in turn completes the initialization of the kernel. Then, `init()` invokes the `execve()` system call to load the executable program `init` (or `systemd` on modern systems). Process 1 becomes the ancestor of all other user-space processes and is responsible for reaping orphaned processes.
Linux Process Management Functions
`do_fork()` Function
The `do_fork()` function makes use of an auxiliary function called `copy_process()` to set up the process descriptor and any other kernel data structures required for the child's execution.
`copy_process()` Function
The `copy_process()` function is a core routine used for creating new processes or threads. For instance, the swapper process running on CPU 0 initializes the kernel data structures, then enables the other CPUs and creates the additional swapper processes by means of the `copy_process()` function, passing to it the value 0 as the new PID.
Kernel Threads: Creation and Characteristics
Kernel threads are processes that run only in Kernel Mode, unlike regular processes which run alternatively in Kernel Mode and User Mode. They are used for tasks that need to run in the kernel's context, such as managing devices, handling interrupts, or performing background operations.
Kernel threads are created in Linux using functions like `kernel_thread()`. This function receives as parameters the address of the kernel function to be executed (`fn`), the argument to be passed to that function (`arg`), and a set of clone flags (`flags`). Essentially, it invokes `do_fork()` with flags that ensure the new process runs entirely in kernel space.
Purpose of Key Linux Functions
- `kernel_thread()`: Creates a new kernel thread.
- `copy_process()`: A fundamental routine for duplicating process information, used by `fork()`, `vfork()`, and `clone()` to set up new process descriptors and associated kernel data structures.
- `_exit()`: Terminates a single process, regardless of any other processes in the thread group of the victim.
- `exit_group()`: Terminates a full thread group, i.e., an entire multithreaded application.
Process Creation and Termination in Linux
In Linux, processes are primarily created via the `clone()` system call and its relatives `fork()` and `vfork()`, all of which funnel into `do_fork()` in the kernel. Processes are terminated by the `_exit()` system call (or `exit_group()` for entire thread groups).
Orphan Processes and Memory Leak Prevention
An orphan process is a child process whose parent process has terminated before the child. To prevent memory leaks and ensure proper system cleanup, orphan processes in Linux become children of the `init` process (process 1).
Handling Parent Termination Before Children
If a parent process terminates before its children, the system could be flooded with zombie processes whose process descriptors would remain in RAM indefinitely, leaking memory. This problem is solved by forcing all orphan processes to become children of the `init` process. In this way, the `init` process destroys these zombie children while checking for the termination of one of its legitimate children through a `wait()`-like system call.