How does Linux work?

I recently prepared for a position, which prompted me to read up on and consolidate my knowledge of Operating Systems and Linux. That sent me down a weird and wonderful rabbit hole, and I ended up learning so much more than I thought I knew!

This post intends to be a summary of how an Operating System (focussing on Linux) works, and what happens behind the scenes.

So, let's get started!

Disclaimer: I, myself, am constantly learning more about this area, and whilst I've tried to be as accurate and concise as possible, there may be mistakes. If you spot something that looks wrong, I would love to hear from you in the comments so we can all learn together!

What are UNIX and Linux?

UNIX and Linux are both families of Operating Systems (OSes) -- the underlying software behind many popular Operating Systems. An example of a UNIX OS is Solaris; examples of Linux-based distributions are Ubuntu and Arch Linux.

The biggest difference between the two is that Linux is open-source and free to use – you can even contribute to it if you want! UNIX, on the other hand, is proprietary software which requires a license to use.

There's also POSIX: the Portable Operating System Interface, which defines a set of standards so different OSes can be compatible. It covers things like: what should program exit codes be? What default environment variables are there? How do filenames work? What is the underlying C API? A lot of Linux distributions are mostly POSIX-compliant (although they may not be officially certified!).

What are the "Kernel" and "Operating System"?

The kernel is a part of the Operating System; it sits at the lowest software level of the OS and controls access to the hardware, system resources, files, processes, system calls, and more! Without an OS, your computer can't really do anything.

A nice way to look at the difference between the Kernel and the general Operating System is that the OS as a whole sits between a user and the software; the kernel sits between the software and the hardware.

So, how does my computer start?

BIOS

Your computer has a set of fixed instructions in a specific physical memory location in ROM (Read-Only Memory), which usually form the BIOS (Basic Input/Output System). This is firmware (low-level software that's permanent on your system) that initializes the hardware on boot.

The BIOS usually performs a POST (Power-On Self-Test) to detect and set up any connected hardware (e.g. memory, video cards, CPUs, etc.). If there's an error here, your computer will normally display some text (if it can), or emit a series of audible beeps, with each different number of beeps indicating a specific problem.

Once the hardware is confirmed to be working, the BIOS starts the boot process, which involves finding the boot device (e.g., hard drive). Boot devices usually store a bootloader (or a pointer to it) in the Master Boot Record (MBR) or in a specific partition on the drive (EFI). The first stage of the bootloader is a tiny piece of software, less than a kilobyte in size, and it is responsible for loading the OS into RAM (memory). An example of a bootloader is GRUB.

Initializing the Kernel

Once the bootloader has been loaded, it needs to be executed! This can get quite complicated, and I definitely don't know every single thing that happens here, but here is an overview of what happens next:

  1. The kernel is decompressed.
  2. A few key registers and tables are initialized (e.g. the Interrupt Descriptor Table and Global Descriptor Table) – these are needed later on when using the system.
  3. Various system calls are made to spawn initial processes such as the task scheduler (these are all explained a bit later on!).
  4. The init process executes, which is responsible for mounting all file systems in read/write mode, starting daemons (like sshd for SSH connections, httpd for HTTP connections, etc.), and calling the getty program ("get TTY") which prompts you to log in. systemd is a common init process used in Linux distributions.

At this point, your computer is up and running! Now what?

System Calls & CPU Execution Modes

First, we should note that many of the low-level details about the hardware are abstracted away and hidden from user applications; this means user programs must issue requests to the kernel in the form of system calls (syscalls), which are executed on their behalf by the kernel.

So, there are different CPU execution privilege modes (sometimes called rings): User (ring 3) and Kernel (ring 0) mode. The rings in between were intended for device drivers (software that enables interaction with hardware peripherals), although Linux only really uses rings 0 and 3.

User mode is an unprivileged mode for user programs – programs can run and execute code, but they can't directly manipulate physical memory, use input/output devices, or switch modes themselves. As a result, when any of these resources are needed, the program issues a system call, which generates a software interrupt. This prompts the switch to Kernel mode, where the kernel checks permissions, performs the necessary actions, and returns the relevant data.

Kernel mode is therefore a privileged mode; there is unrestricted access to memory and devices. Any errors encountered here are critical and trigger a kernel panic (analogous to a Windows Blue Screen of Death).

Why have separate modes? Having a separate kernel mode ensures that programs can't interfere with each other; it is the single source of truth for the entire system, and it is more secure, as the kernel handles permission checks to resources!
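
To make this a little more concrete, here's a minimal C sketch (assuming a Linux system with glibc) that asks the kernel for the current process ID twice: once through the usual library wrapper, and once through the generic syscall() entry point. Both end up trapping into kernel mode.

```c
#include <stdio.h>
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* getpid(), syscall() */

int main(void)
{
    /* The friendly libc wrapper: under the hood it issues the getpid
     * system call, which switches the CPU into kernel mode. */
    pid_t via_wrapper = getpid();

    /* The same request made "by hand" through the generic syscall()
     * entry point, naming the system call number explicitly. */
    long via_raw = syscall(SYS_getpid);

    printf("getpid() -> %d, syscall(SYS_getpid) -> %ld\n",
           (int)via_wrapper, via_raw);
    return 0;
}
```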

The Filesystem

In Linux, you don't mount hard drives or partitions; instead, you mount the file systems on those partitions.

The Virtual File System (VFS) abstracts a standard interface (think: an 'API') for file systems, so all file systems appear identical to the rest of the kernel and applications. Data is split into Blocks (typically 4KB), which are further grouped into Block Groups. The VFS caches blocks when they are accessed by placing them into the Buffer Cache.

Inodes (index nodes) are structures that store metadata for every file and directory, providing easy access to anyone who needs information on files. Each has a number (index) that uniquely identifies it within a filesystem, and this is used in conjunction with the filesystem ID to ensure it is unique across the entire machine.

Inodes are stored in a table so they are accessed by their index number, and they point to the disk blocks storing the contents of the file they represent.

The use of inodes actually means there is a limit to the number of files/directories you can store on a system! Mail servers can run into the problem where they store lots of tiny files (emails) which don't take up much disk space but still run out of inodes! Inode numbers are usually 32-bit unsigned integers, meaning ~4.2 billion inodes maximum. Practically, a system might have far fewer available inodes, as the default ratio tends to be 1 inode per x bytes of storage capacity.
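
As a rough illustration (a sketch, assuming a POSIX system; querying "/" is just an example), statvfs() lets you see a mounted filesystem's block size and how many inodes it has left:

```c
#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs fs;

    /* Ask about the filesystem that "/" lives on. */
    if (statvfs("/", &fs) != 0) {
        perror("statvfs");
        return 1;
    }

    printf("block size   : %llu bytes\n", (unsigned long long)fs.f_bsize);
    printf("total inodes : %llu\n", (unsigned long long)fs.f_files);
    printf("free inodes  : %llu\n", (unsigned long long)fs.f_ffree);
    return 0;
}
```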

Inodes store the file mode (permissions), type (file/directory), user, group, size, link count, block count, creation/access/modification times, and an inode checksum.

Inodes don't store filenames themselves – why? Filenames are stored in directory structures (or 'directory entries', or 'dentries'). These are tables that map filenames to their corresponding inodes; the first two entries are always . and .. which probably seem familiar. An advantage of this system is that moving a file within the same filesystem only means moving the (name, inode) pair – so it's extremely cheap!
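
To see some of this inode metadata for yourself, here's a small sketch using the POSIX stat() call (the filename is just a placeholder). Notice that the returned structure contains an inode number, mode, owner, size, and timestamps – but no filename:

```c
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

int main(void)
{
    struct stat st;

    /* "example.txt" is a placeholder path: point it at any real file. */
    if (stat("example.txt", &st) != 0) {
        perror("stat");
        return 1;
    }

    printf("inode number : %llu\n", (unsigned long long)st.st_ino);
    printf("mode (octal) : %o\n", st.st_mode & 0777);   /* permission bits */
    printf("owner uid    : %u\n", (unsigned)st.st_uid);
    printf("size (bytes) : %lld\n", (long long)st.st_size);
    printf("hard links   : %lu\n", (unsigned long)st.st_nlink);
    printf("modified     : %s", ctime(&st.st_mtime));
    return 0;
}
```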

File permissions

Commands like chmod and chown allow you to alter file permissions and ownership, but how do they work? Each file/directory has three user permission groups: owner, group, and all users (i.e., all other users). For each of these, there are a further three permission types: read, write, execute. For directories, these mean slightly different things: listing the contents of the directory, changing the contents of the directory (creating/deleting/renaming files), and moving into the directory, respectively.

So what do the permissions look like? _ rwx rwx rwx 1 owner:group is the general format! (Remember, this is all stored in the inode!) The first group of 3 is the owner permissions, the second group is the group permissions, and the final group is the all users permissions. The final string shows which user and group own the file.

The very first character is the 'special file type flag': _ means no special permissions, d means it is a directory, l means it is a symbolic link, s is the setuid/setgid permission for executables (meaning they are executed with the owner's/group's permissions), and t is the sticky bit permission (meaning only a file's owner, the directory's owner, or root can rename or delete files within that directory).

If you've ever used chmod you may have used a number to set permissions – this is the numeric method where a 4 represents read, 2 represents write, 1 represents execute. For example, 740 means 7 for the owner, 4 for the group, 0 for all others, i.e., _ rwx r__ ___!
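
The same numeric permissions can be set programmatically – here's a tiny sketch using the chmod() syscall with an octal literal (the filename is a placeholder):

```c
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    /* 0740 = rwx for the owner, r-- for the group, --- for others.
     * "example.txt" is a placeholder: point it at a real file. */
    if (chmod("example.txt", 0740) != 0) {
        perror("chmod");
        return 1;
    }
    puts("permissions set to rwxr-----");
    return 0;
}
```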

Memory

Moving onto memory management!

Linux is a multiprocessing OS; each process is a separate task with its own rights and responsibilities, and each runs in its own virtual address space so processes can't affect each other, only interacting with one another through the kernel.

To make this work, virtual and physical memory are split into pages: small, contiguous chunks of memory which are mapped to each other via the page table. When a program accesses a virtual address, the page table determines the actual physical address to use – if no mapping is found, you get a page fault. Pages aren't always loaded up front; with demand paging, memory is loaded lazily, as it is needed.
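
Here's a small sketch (assuming Linux) that prints the page size and then maps a region of anonymous memory with mmap(); physical pages are only attached when the region is first touched, which is demand paging in action:

```c
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);   /* typically 4096 bytes */
    printf("page size: %ld bytes\n", page_size);

    /* Reserve 100 pages of anonymous virtual memory. No physical pages
     * are allocated yet: the kernel just records the mapping. */
    size_t len = 100 * (size_t)page_size;
    char *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Touching a page triggers a (minor) page fault; the kernel then
     * backs that virtual page with a physical one on demand. */
    region[0] = 'x';
    region[50 * page_size] = 'y';

    munmap(region, len);
    return 0;
}
```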

A swap file is a file on disk that is used when a virtual page is needed but no physical page is available. In this case, an existing page is written out to the swap file, to be re-loaded later if needed! Thrashing occurs when pages are constantly being swapped in and out, which means the OS spends all its time paging and can't actually do anything meaningful!

Processes

Processes are computer programs in action; they include program instructions, data, CPU registers, the Program Counter, and call stacks. There is a limit to the number of processes that can exist at one time. Processes are stored in a task array in the form of a task_struct data structure (think: a linked list of dictionaries). This stores lots of information, like how virtual memory is mapped onto the system's physical memory, CPU time consumed, (effective) user/group IDs, etc.

Every process (except for the initial init process) has a parent – the task_struct keeps a pointer to parent and child processes (a doubly linked list).

Processes are not created from scratch; they are cloned from an existing one via system calls. Usually, the fork syscall is used, which clones the calling process (including the code, the data, and the call stack), and then exec is used to overwrite ('overlay') the cloned process and its data with the supplied program!
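
Here's a minimal fork-then-exec sketch in C (the program being run, ls -l, is just an example):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();          /* clone the calling process */

    if (pid < 0) {
        perror("fork");
        return 1;
    }

    if (pid == 0) {
        /* Child: overlay the cloned image with a new program. */
        execlp("ls", "ls", "-l", (char *)NULL);
        perror("execlp");        /* only reached if exec fails */
        _exit(1);
    }

    /* Parent: wait for the child, otherwise it would linger as a zombie. */
    int status;
    waitpid(pid, &status, 0);
    printf("child %d exited with status %d\n", (int)pid, WEXITSTATUS(status));
    return 0;
}
```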

It's not just one process that uses all the memory or CPU though; with multiprocessing, the system gives the CPU to processes that need it most. The scheduler chooses which process is most appropriate to run, by selecting them out of the run queue. It often uses a priority-based scheduling algorithm, but there are different types (e.g., round-robin, or first-in-first-out). The scheduler runs after processes are put onto the wait queue (e.g., whilst they are waiting for a system resource), or when a syscall is ending and the CPU is switching back to user mode.

The niceness of a process is a user-defined priority (ranging from -20 to 19 – highest to lowest) that can be given to processes using the nice -n NICENESS_VALUE command, e.g. nice -n -15 ./my-important-program.
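
A process can also re-nice itself from inside the code via the nice() call – a sketch assuming Linux/glibc (note that a negative increment, i.e. raising priority, usually requires root):

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Make this process "nicer" by 10, i.e. lower its scheduling priority.
     * nice() returns the new niceness, or -1 on error – but -1 is also a
     * valid niceness value, so errno must be checked too. */
    errno = 0;
    int new_niceness = nice(10);
    if (new_niceness == -1 && errno != 0) {
        perror("nice");
        return 1;
    }

    printf("now running with niceness %d\n", new_niceness);
    return 0;
}
```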

Processes have different states: running (the current process in the system, or ready to run); waiting (waiting for an event/resource); stopped (stopped due to a signal); zombie (processes which have finished but whose parent hasn't yet collected their exit status, so their task_struct entry lingers).

A thread is a single execution sequence within a process; a process can contain many threads. Threads share the memory of the process they belong to, which means inter-thread communication is cheaper than inter-process communication.
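
Here's a small pthreads sketch (compile with -pthread) showing two threads updating the same variable in their process's shared memory – the variable and counts are just illustrative:

```c
#include <pthread.h>
#include <stdio.h>

/* Shared by every thread in this process: no copying and no kernel
 * round-trip is needed for the threads to communicate. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* shared memory still needs locking */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("final counter: %ld\n", counter);   /* 200000 */
    return 0;
}
```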

Inter-Process Communication (IPC)

IPC allows processes to communicate with each other.

Signals are the classic example of this, e.g. SIGHUP, SIGINT, SIGKILL, etc. These are asynchronous events sent to processes and can be generated by shells, keyboards, or even errors. Processes can choose how to deal with (or ignore) most of these, except for two: SIGSTOP (halts a process until SIGCONT resumes it) and SIGKILL (terminates a process entirely). Signals can only be sent by the kernel, by superusers, or by processes with the same UID/GID as the target process!
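
As a sketch of a process choosing how to handle a signal (using POSIX sigaction()), here's a program that catches SIGINT – what the shell sends on Ctrl-C; trying the same with SIGKILL or SIGSTOP would simply fail:

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigint = 0;

/* Called asynchronously when SIGINT (e.g. Ctrl-C) arrives. */
static void handle_sigint(int signum)
{
    (void)signum;
    got_sigint = 1;
}

int main(void)
{
    struct sigaction sa = {0};
    sa.sa_handler = handle_sigint;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);    /* catching SIGKILL/SIGSTOP isn't allowed */

    printf("press Ctrl-C to send me SIGINT (pid %d)\n", (int)getpid());
    while (!got_sigint)
        pause();                     /* sleep until a signal arrives */

    puts("caught SIGINT, exiting cleanly");
    return 0;
}
```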

Pipes are another method of IPC, allowing redirection between commands – they are one-way byte streams connecting the standard output (stdout) of one process to the standard input (stdin) of another. They are implemented as two file data structures pointing at the same temporary VFS inode, and they are an abstraction in the sense that neither process needs to be aware it is talking to a pipe!
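
Here's a minimal pipe sketch: the parent writes into the pipe's write end and a forked child reads from the read end (the message text is made up):

```c
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];                     /* fds[0] = read end, fds[1] = write end */
    if (pipe(fds) != 0) {
        perror("pipe");
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: read whatever the parent wrote into the pipe. */
        close(fds[1]);              /* not writing */
        char buf[64] = {0};
        ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
        if (n > 0)
            printf("child received: %s\n", buf);
        close(fds[0]);
        return 0;
    }

    /* Parent: send a message down the one-way byte stream. */
    close(fds[0]);                  /* not reading */
    const char *msg = "hello from the parent";
    write(fds[1], msg, strlen(msg));
    close(fds[1]);
    waitpid(pid, NULL, 0);
    return 0;
}
```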

Sockets are another method of IPC; these are pseudo-files that represent a network (or local) connection, and can be read from/written to using the read/write/send/recv syscalls.
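
And a matching local-socket sketch using socketpair() to give a parent and child a two-way channel (a real network socket would instead use socket(), bind(), connect(), etc.):

```c
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int sv[2];                      /* a connected pair of local sockets */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        perror("socketpair");
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: use one end; read the request, send a reply. */
        close(sv[0]);
        char buf[64] = {0};
        recv(sv[1], buf, sizeof(buf) - 1, 0);
        printf("child got: %s\n", buf);
        send(sv[1], "pong", 4, 0);
        close(sv[1]);
        return 0;
    }

    /* Parent: use the other end; sockets are read/written like files. */
    close(sv[1]);
    send(sv[0], "ping", 4, 0);
    char reply[64] = {0};
    recv(sv[0], reply, sizeof(reply) - 1, 0);
    printf("parent got: %s\n", reply);
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return 0;
}
```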

Wrapping up

That's a lot of information! I do hope this post helped you understand a tiny bit more about Linux (and Operating Systems in general), and how it works!

If you're interested in learning more, I found The Linux Kernel by David A Rusling extremely useful, which goes into a lot more detail about the above topics, and much more.

If you have any feedback, spotted any errors (or just want to chat!), please feel free to leave a comment below, or get in touch with me in any other way!