Prex Home > Document Index > Kernel Internals

Prex Kernel Internals

Version 1.4, 2005/12/31

Kernel Overview
Design Policy
Thread
Task
Scheduler
Memory Management
IPC
Exception Handling
Interrupt Framework
Timer
Device I/O Service
Mutex
Debug

Kernel Overview

Kernel Structure

The following figure illustrates the Prex kernel structure.

Kernel Components
Figure 1. Prex kernel Structure

A kernel object belongs in one of the following groups.

kern: kernel core components
mem: memory managers
ipc: inter process communication (*)
sync: synchronize objects
arch: architecture dependent components

*) Since all messages in Prex are transferred among threads, the name of "IPC" is not appropriate. However, "IPC" is still used as a general term of the message transfer via the kernel, in Prex.

Naming Convention

The name of "group/object" in figure 1 is mapped to "directory/file" in the Prex source tree. For example, the thread related functions are located in "kern/thread.c", and the functions for semaphore are placed in "sync/sem.c".

In addition, there is a standard naming convention about kernel routines. The method named bar for the object named foo should be named "foo_bar". For example, the routine to create a new thread is named "thread_create", and locking mutex will be "mutex_lock". This rule is not applied to the name of the local function.

Design Policy

The Prex kernel focuses the following points to be designed.

Portability
Scalability
Reliability
Interoperability
Maintainability

Portability

The Prex kernel is divided into two different layers - a common kernel layer and an architecture dependent layer. Any routine in the common kernel layer must not access to the H/W by itself. Instead, it must request it to the architecture dependent layer. The interface to the architecture dependent layer is strictly defined by the Prex kernel. This interface is designed carefully to support various different architecture with minimum code change. So, it is easy to port the Prex kernel to different architecture.

The following functions must be provided by the architecture dependent layer.

CPU: initializes processor registers before kernel boot
Context: abstracts processor and hardware context
MMU: abstracts memory management unit (*)
Trap: abstracts processor trap
Interrupt: abstracts the interrupt control unit
Clock: abstracts clock timer unit
Misc.: abstracts system reset, idle power state

*) In case of no-MMU system, MMU related routines will be defined as no-operation routine. So, the kernel common layer can not assume MMU is always available.

Scalability

In order to obtain higher scalability, the kernel does not limit the maximum number of the kernel objects to create. So, the resource for all kernel objects are allocated dynamically after system boot. This can keep the memory prerequisite smaller than the static resource allocation. This means that the kernel can create task, thread, object, device, event, mutex, timer as many as usable memory remains.

The kernel supports both of MMU and No-MMU systems. So, most of the kernel components and VM sub-system are designed carefully to work without MMU.

Reliability

When the remaining memory is exhausted, what should OS do? If the system can stop with panic() there, the error checks of many portions in the kernel are not necessary. But obviously, this is not allowed with the reliable system. Even if the memory is exhausted, a kernel must continue processing. So, all kernel code is checking the error status returned by the memory allocation routine.

In addition, the kernel must not crush anytime even if any invalid parameter is passed via kernel API. Basically, the Prex kernel code is written with "garbage in, error out" principle. The Prex kernel never stops even if any malicious program is loaded.

Interoperability

Although the Prex kernel was written from scratch, its applications will be brought from the other operating systems like BSD. So, the system call interface is designed with consideration to support general OS API like POSIX or APIs for generic RTOS.

The error code for the Prex system call is defined as the same name with POSIX. For example, EINVAL for "Invalid argument", or ENOMEM for "Out of memory". So, peoples do not have to study new error codes if they already have skills about POSIX programming. This is important point to write applications and to read the kernel code because study of a new error scheme will cause pain for developers. In addition, it simplify the POSIX emulation library because it does not have to remap the error code.

Maintainability

All kernel codes are kept clean and simple for the maintenance. All codes are well-commented and consistent. It is easy to add or remove new system call into the kernel. The kernel has the debugging facility like the function trace or the dump of the kernel objects.

Thread

Thread Control Block

The thread control block includes data for owner task, scheduler, timer, IPC, exception, mutex, and context. The following thread structure is most important definition in the kernel codes.

struct thread {
        int             magic;          /* magic number */
        task_t          task;           /* pointer to owner task */
        struct list     task_link;      /* link for threads in same task */
        struct queue    link;           /* linkage on scheduling queue */
        int             state;          /* thread state */
        int             policy;         /* scheduling policy */
        int             prio;           /* current priority */
        int             baseprio;       /* base priority */
        int             timeleft;       /* remaining ticks to run */
        u_int           time;           /* total running time */
        int             resched;        /* true if rescheduling is needed */
        int             locks;          /* schedule lock counter */
        int             suscnt;         /* suspend counter */
        struct event    *slpevt;        /* sleep event */
        int             slpret;         /* sleep result code */
        struct timer    timeout;        /* thread timer */
        struct timer    *periodic;      /* pointer to periodic timer */
        uint32_t        excbits;        /* bitmap of pending exceptions */
        struct queue    ipc_link;       /* linkage on IPC queue */
        void            *msgaddr;       /* kernel address of IPC message */
        size_t          msgsize;        /* size of IPC message */
        thread_t        sender;         /* thread that sends IPC message */
        thread_t        receiver;       /* thread that receives IPC message */
        object_t        sendobj;        /* IPC object sending to */
        object_t        recvobj;        /* IPC object receiving from */
        struct list     mutexes;        /* mutexes locked by this thread */
        struct mutex    *wait_mutex;    /* mutex pointer currently waiting */
        void            *kstack;        /* base address of kernel stack */
        struct context  ctx;            /* machine specific context */
};

Thread Creation

New thread can be created by thread_create(). The initial states of newly created thread are as follow:

Table 1. Initial thread state

Data type	Initial state
Owner Task	Inherit from parent thread
Thread state	Suspended
Suspend count	Task suspend count + 1
Scheduling policy	Round Robin
Scheduling Priority	Default (= 200)
Time quantum	Default (= 50 msec)
Processor registers	Default value

Since new thread is initially set to the suspended state, thread_resume() must be called to start it.

Creating a thread and loading its register state are isolated in different routines. These two routines are used by fork(), exec(), and pthread_create() in the POSIX emulation library.

Table 2. Usage of thread_create()/thread_load()

Library routine	thread_create()	thread_load()
fork()	O	X
exec()	X	O
pthread_create()	O	O

Thread Termination

The kernel will usually release all resources owned by the terminated thread. But, there are some complicated process to release the resources. The priority adjustment may be required if the thread inherits its priority.

If the thread is terminated with mutex locking, all threads waiting for that mutex does sleep forever. So, the mutex held by the terminated thread must be unlocked, or change its mutex owner if some thread is waiting for.

In general, there is a known issue about the thread termination. If the termination target is current thread, the kernel can not release the context of the current thread because the thread switching always requires the current context. There are the following 3 solutions for this.

Create "clean up thread" to terminate thread
Add condition check in thread switching code
Defer termination in next termination request

The Prex kernel is using #3.

Thread Suspension

Each thread can be set to the suspended state by using thread_suspend() interface. Although a thread can be suspended any number of times, it does not start to run unless it is resumed by the same number of suspend.

Kernel Thread

A kernel thread is always executed in kernel mode, and it does not have user mode context. The scheduling policy is set to SCHED_FIFO by default.

Currently, the following kernel threads are running in kernel mode.

Interrupt Service Threads
Timer Thread
Idle Thread
DPC Thread

Idle Thread

An idle thread runs when no other thread is active. It has the role of cutting down the power consumption of a system. An idle thread has FIFO scheduling policy, and it does not have time quantum. The lowest scheduling priority (=255) is reserved for an idle thread.

Task

Task Creation

The task can be created by using task_create(). New child task will have the same memory image with the parent task. Especially text region and read-only region are physically shared among them. The parent task receives the new task ID of child task from task_create(), but child task will receive 0 as task ID.

The initial task states are as follow:

Table 3. Initial task state

Data type	Inherit from parent task?
Object List	No
Threads	No
Memory Map	Yes
Suspend Count	No
Exception Handler	Yes

If the parent task is specified as NULL for task_create(), all child state are initialized to default. This is used in exec() emulation.

Task Suspension

When the task is set to suspend state, the thread suspend count of all threads in the task is also incremented. A thread can start to run only when both of the thread suspend count and the task suspend count becomes 0.

Kernel Task

The kernel task is a special task that has only an idle thread and interrupt threads. It does not have any user mode memory.

Scheduler

Thread Priority

The Prex scheduler is based on the algorithm known as priority based multi level queue. Each thread is assigned the priority between 0 and 255. The lower number means higher priority like BSD unix. It maintains 256 level run queues mapped to each priority. The lowest priority (=255) is used only for an idle thread.

A thread has two different types of priority:

Base priority: This is a static priority which can be changed only by user mode program.
Current Priority: An actual scheduling priority. A kernel may adjust this priority dynamically if it's needed.

Although the base priority and the current priority are same value in almost conditions, kernel will sometimes change the current priority to avoid "priority inversion".

Thread State

Each thread has one of the following states.

Memory Structure
Figure 2. Thread States

RUN :Running or ready to run
SLEEP :Sleep for some event
SUSPEND :Suspend count is not 0
EXIT :Terminated

The thread is always preemptive even in kernel mode. There are following 4 events to switch thread:

Table 4. Events to switch thread

Event	Condition	Run queue position
Block	Thread sleep or suspend	Move to the tail of runq
Preemption	Higher priority thread becomes runnable	Keep the head of runq
Quantum Expiration	The thread consumes its time quantum	Move to the tail of runq
Yield	The thread releases CPU by itself	Move to the tail of runq

Scheduling Policy

There are following three types of scheduling policy.

SCHED_FIFO: First-in First-out
SCHED_RR: Round Robin (SCHED_FIFO + timeslice)
SCHED_OTHER: Not supported

In early Prex development phase, SCHED_OTHER was implemented as a traditional BSD scheduler. Since this scheduler changes the thread priority dynamically, it is unpredictable and does not fit the real-time system. Recently, SCHED_OTHER policy was dropped from Prex to focus on real-time platform.

Scheduling Parameter

An application program can change the following scheduling parameters via kernel API.

Thread Priority
Scheduling Policy
Time Quantum (only for SCHED_RR)

Scheduling Lock

The thread scheduling can be disabled by locking the scheduler. This is used to synchronize the thread execution to protect the access to the global resources. Since any interrupt handler can work while scheduling lock state, it does not affect to the interrupt latency.

Named Event

The thread can sleep/wakeup for the specific event. The event works as the queue of the sleeping threads. Since each event has its name, it is easy to know which event the debugee is waiting for.

Memory Management

Physical Page Allocator

The physical page allocator provides the service for page allocation/deallocation/reservation. It works on the bottom layer for other memory managers.

Figure 3. Prex Memory Structure

The key point is that Prex kernel does not page out to any external disk devices. This is an important design point to get real-time performance and system simplicity.

Kernel Memory Allocator

The kernel memory allocator is optimized for the small memory foot print system.

To allocate kernel memory, it is necessary to divide one page into two or more blocks. There are following 3 linked lists to manage used/free blocks for kernel memory.

All pages allocated for the kernel memory are linked.
All blocks divided in the same page are linked.
All free blocks of the same size are linked.

Currently, it can not handle the memory size exceeding one page. Instead, a driver can use page_alloc() to allocate large memory.
When the kernel code illegally writes data into non-allocated memory, the system will crash easily. The kmem modules are called from not only kernel code but from various drivers. In order to check the memory over run, each free block has a tag with magic ID.

The kernel maintains the array of the block headers for the free blocks. The index of an array is decided by the size of each block. All block has the size of the multiple of 16.

free_blks[0] = list for 16 byte block
free_blks[1] = list for 32 byte block
free_blks[2] = list for 48 byte block
     .
     .
free_blks[255] = list for 4096 byte block

In generic design, only one list is used to search the free block for a first fit algorithm. However, the Prex kernel memory allocator is using multiple lists corresponding to each block size. A search is started from the list of the requested size. So, it is not necessary to search smaller block's list wastefully.

In most of the "buddy" based memory allocators, their algorithm are using 2^n bytes as block size. But, this logic will throw away much memory in case the block size is not fit. So, this is not suitable for the embedded systems that Prex aims to.

Virtual Memory Manager

A task owns its private virtual address space. All threads in a same task share one memory space. When new task is made, the address map of the parent task will be automatically copied. In this time, the read-only space is not copied and is shared with old map.

The kernel provide the following functions for VM:

Allocate/deallocate memory region
Change memory attribute (read/write/exec)
Map another task's memory to current task

The VM allocator is using the traditional list-based algorithm.

The kernel task is a special task which has the virtual memory mapping for kernel. All other user mode tasks will have the same kernel memory image mapped from the kernel task. So, kernel threads can work with the all user mode task context without switching memory map.

Figure 4. Kernel Memory Mapping

Since the Prex kernel does not do page out to an external storage, it is guaranteed that the allocated memory is always continuing and existing. Thereby, a kernel and drivers can be constructed very simply.

Note: "Copy-on-write" feature was supported with the Prex kernel before. But, it was dropped to increase the real-time performance.

IPC

The message passing model of Prex is very simple compared with other modern microkernels. The Prex message is sent to the "object" from thread to thread. The "object" in Prex is similar concept that is called as "port" in other microkernel.

Object

An object represents service, state, or policies etc. For object manipulation, kernel provide 3 functions: object_create(), object_delete(), object_lookup(). Prex task will create an object to publish its interface to other tasks. For example, server tasks will create objects like "proc", "fs", "exec" to allow clients to access their service. And then, client tasks will send a request message to these objects

An actual object is stored in kernel space, and it is protected from user mode code. Each object data is managed with the hash table by using its name string. Usually, an object has a unique name within a system. To send a message to the specific object, it must obtain the target object ID by looking up by the name.

An object can be created without its name. These objects can be used as private objects for threads in same task.

Message

Each IPC message must have the pre-defined message header in it. The kernel will store the sender task's ID into the message header. This mechanism ensures the receiver task can get the exact task ID of the sender task. Therefore, receiver task can check the sender task's capability for various secure services.

It is necessary to recognize the pre-defined message format between sender and receiver.

Messages are sent to the specific object using msg_send(). The transmission of a message is always synchronous. This means that the thread which sent the message is blocked until it receives a response from another threads. msg_receive() performs reception of a message. msg_receive() is also blocked when no message is reached to the target object. The receiver thread must answer the message using msg_reply() after it finishes processing.

The receiver thread can not receive another message until it replies to the sender. In short, a thread can receive only one message at once. Once the thread receives message, it can send another message to different object. This mechanism allows threads to redirect the sender's request to another thread.

A thread can receive a message from the specific object which is created by itself or thread in same task. If the message has not arrived, it blocks until any message comes in. The following figure shows the IPC transmit sequence of Prex.

Figure 5. IPC Transmit Sequence

Message Transfer

The message is copied to task to task directly without kernel buffering. The memory region of sent message is automatically mapped to the receiver's memory within kernel. This mechanism allows to reduce the number of copy time while message translation. Since there is no page out of memory in Prex, we can copy the message data via physical memory at anytime.

Figure 6. IPC message transfer

Exception Handling

A user mode task can specify its own exception handler with exception_setup(). There are two different types of exception.

H/W exception: This type of exception is caused by H/W trap & fault. The exception will be sent to the thread which caused the trap. If no exception handler is specified by the task, it will be terminated by kernel.
S/W exception: The user mode task can send S/W exception to another task by exception_raise(). The exception will be sent to the thread that is sleeping with exception_wait(). If no thread is waiting for the exception, the exception is sent to the first thread in the target task.

Kernel supports 32 types of exception. The following pre-defined exceptions are raised by kernel itself.

Table 5. Kernel exceptions

Exception	Type	Reason
SIGILL	H/W	Illegal instruction
SIGTRAP	H/W	Break point
SIGFPE	H/W	Math error
SIGSEGV	H/W	Invalid memory access
SIGALRM	S/W	Alarm event

POSIX emulation library will setup its own exception handler to convert the Prex exceptions into UNIX signals. It will maintain its own signal mask. And, it transfer control to the actual POSIX signal handler that is defined by the user mode process.

Interrupt Framework

Prex defines two different types of interrupt service to optimize the response time of real-time operation.

Interrupt Service Routine (ISR)

ISR is started by an actual hardware interrupt. The associated interrupt is disabled in ICU and CPU interrupt is enabled while it runs. If ISR determines that its device generates the interrupt, ISR must program the device to stop the interrupt. Then, ISR should do minimum I/O operation and return control as quickly as possible. ISR will run within the context of current running thread at interrupt time. So, only few kernel services are available within ISR. IRQ_ASSERT() macro can be used to detect the invalid function call from ISR.

Interrupt Service Thread (IST)

IST is automatically activated if ISR returns INT_CONTINUE to kernel. It will be called when the system enters safer condition than ISR. Any interrupt driven I/O operation should be done in IST not ISR. Since ISR for same IRQ may be run during IST, the shared data, resources, and device registers must be synchronized by using irq_lock(). IST does not have to be reentrant, since it is not interrupted by same IST itself.

Interrupt Nesting & Priority

Each ISR has its logical priority level, with 0 being the highest priority. While one ISR is running, all lower priority interrupts are masked off. This interrupt nesting mechanism avoids delaying of high priority interrupt events.

IST is executed as a normal thread dispatched by the scheduler. So, the interrupt thread which has higher priority is executed first. The driver writer can specify the thread priority of IST when IST is attached to the specific interrupt line. The important point is that even a user mode task can be performed prior to an interrupt thread.

The following figure is the sample of the Prex interrupt processing.

Figure 7. Prex Interrupt Processing

Interrupt Locking

irq_lock() & irq_unlock() are used to disable all interrupts in order to synchronize the access to the kernel or H/W resource. Since irq_lock() increments a lock counter, irq_unlock() will automatically restore to the original interrupt state when locking count becomes 0. So the caller does not have to save the previous interrupt state.

Interrupt Stack

If each ISR uses the kernel stack of the current running thread, the stack area may be over-flow when continuous interrupts are occurred at one same thread. So the kernel stack will be switched to the dedicated stack while ISR is running.

Timer

Kernel Timers

The kernel timer provides the following feature.

Sleep timer: Put caller thread to sleep in the specified time.
Call back timer: Call the routine after specified time passes.
Periodic timer: Call the routine at the specified interval.

Timer Jitter

The periodic timer is designed to minimize the deviation between desired and actual expiration.

Device I/O Service

The Prex device driver module is separated from the kernel, and this module is linked with the kernel at the boot time. The kernel provides only simple and minimum services to help the communication between applications and drivers.

Device Object

Since the Prex kernel does not have the file system in it, the kernel provides a device object service for I/O interface. The device object is created by the device driver to communicate to the application. Usually, the driver creates a device object for an existing physical device. But, it can be used to handle logical or virtual devices.

Driver Interface

The interface between kernel and drivers are defined clearly as "Driver Kernel Interface". The kernel provides the following services for device drivers.

Device object service
Kernel memory allocation
Physical page allocation
Interrupt handling service
Scheduler service
Timer service
Debug service

Application Interface

The kernel device I/O interface are provided to access the specific device object which is handled by a driver. The Prex kernel provides the following 5 functions for applications.

Open a device
Close a device
Read from a device
Write to a device
Device I/O control

Mutex

Priority Inheritance

The thread priority is automatically changed at one of the following conditions.

When the current thread fails to lock the mutex and the mutex owner has lower priority than current thread, the priority of mutex owner is boosted to the current priority. If this mutex owner is waiting for another mutex, such related mutexes are also processed.
When the current thread unlocks the mutex and its priority has already been boosted, kernel recomputes the current priority. In this case, the priority is set to the highest priority among the threads waiting for the mutexes locked by the current thread.
When the priority is changed by the user request, the related(inherited) thread's priority is also changed.

There are following limitations about priority inheritance with Prex mutex.

If the priority is changed by the user request, the priority recomputation is done only when the new priority is higher than old priority. The inherited priority is reset to base priority when the mutex is unlocked.
Even if thread is killed with mutex waiting, the related priority is not adjusted.

Debug

There are following debugging support functions:

printf(): Display the debug message in kernel.
panic(): Dump processor registers and stop system.
ASSERT(): If expression is false (zero), stop system and display information.
trace_on(): If the kernel trace is enabled, all entry/exit of functions are logged.