Linux - Threads, Concurrency

Threads are called light-weight processes because a context switch between threads is quick. Threads share the same address space, so the address space does not need to be switched when switching between them. This also makes communication across threads fast, as long as concurrency issues are handled.

Kernel Threads

Similar to user space threads, but they exist in kernel space with full access to kernel data structures and run in privileged mode.

Shown with [] around the name in ps aux output.

Kthreads can be created using the kthread_create() or kthread_run() APIs.
struct task_struct *t = kthread_create(&func, (void *)data, "thread");
    Creates a new thread and leaves it in sleeping state.
    The coder needs to explicitly call wake_up_process(t);

struct task_struct *t = kthread_run(&func, (void *)data, "thread");
    Creates the new thread and runs it in the same call.

Thread Function: int func(void *data);

kthread_bind(struct task_struct *k, unsigned int cpu):

The idea is (sketched below):

  • kthread_create() creates the thread in sleeping state
  • kthread_bind() binds it to a CPU
  • wake it up using wake_up_process()

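A minimal sketch of this sequence (assumes a thread function my_thread_fn and a module-init caller; names and CPU number are illustrative):

#include <linux/kthread.h>
#include <linux/err.h>

static struct task_struct *worker;

static int my_thread_fn(void *data)
{
    while (!kthread_should_stop()) {
        /* do periodic work here */
        schedule_timeout_interruptible(HZ);   /* sleep ~1 second between iterations */
    }
    return 0;
}

static int start_worker(void)
{
    worker = kthread_create(my_thread_fn, NULL, "my_worker");
    if (IS_ERR(worker))
        return PTR_ERR(worker);
    kthread_bind(worker, 0);        /* bind to CPU 0 while the thread is still sleeping */
    wake_up_process(worker);        /* now let it run */
    return 0;
}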

There are a couple of issues to handle. The questions are:
How to make a thread exit without causing system instability?
Do kernel threads accept signals?

Exit Thread:
In user space, one thread can send a cancellation request to another using pthread_cancel(). The kernel-space counterpart is kthread_stop().

The kernel thread can exit in two cases:
1. The thread calls do_exit(), OR
2. Someone calls int kthread_stop(struct task_struct *t); kthread_stop() sets a flag in task_struct to indicate that this thread has been instructed to stop. It is a blocking call and does not return until the thread in question actually stops. This requires support at an appropriate point in the thread function as well, to catch this information and exit, as shown below:
    while (!kthread_should_stop()) { /* do work */ }
    /* cleanup - free memory etc. */   do_exit(0);

What if the thread does not take care to clean up and exit when kthread_should_stop() becomes true after kthread_stop()?
This can even stall the system: rmmod may be waiting for kthread_stop() to finish, which in turn is waiting for the thread to take action and exit.

Will Control+C work in the above case, when the system is waiting for the thread to exit?
Unfortunately no. By default kernel threads do not accept signals unless the coder has instructed them to. A thread can be programmed to accept a signal by adding the call
    int allow_signal(int signum);
It also requires the thread function to poll for the signal at some point in its execution flow, using
    int signal_pending(struct task_struct *t);
The thread can use it as:
    allow_signal(SIGKILL);
    while (!kthread_should_stop()) { /* Do processing */   if (signal_pending(current))  break; }
    do_exit(0);

Stacks: Kernel thread (source: Kernel Documentation)
1) One stack of size THREAD_SIZE (2*PAGE_SIZE) per active thread. The stack contains useful data as long as the thread is alive or a zombie. While the thread is in user space, only the thread_info field has meaningful data; the rest of the thread stack is empty.
2) Interrupt Stack: used for HW interrupts and softirqs. The kernel switches from the task stack to the interrupt stack for the first external HW interrupt.
3) DoubleFault Stack: used when handling an exception causes another exception. This second stack gives the kernel a chance to recover, but it still prints an oops message.
4) NMI Stack: for non-maskable interrupts, as they can occur even while the kernel is doing context switches.
5) Debug Stack: for the HW debug interrupt (1) and the SW debug interrupt (3).


Bit operations and atomic operations help synchronize without locking.

Atomic Operations

These are indivisible and uninterruptible, typically compiled down to a single instruction (or an interrupt-protected sequence on architectures without one). See atomic.h.

#define ATOMIC_INIT(i) { (i) }

#define atomic_read(v) ((v)->counter) //guaranteed useful range of an atomic_t is only 24 bits

#define atomic_set(v, i) (((v)->counter) = (i))

static inline int atomic_add/sub_return(int i, atomic_t *v)
        unsigned long flags;    int temp;
        local_irq_save(flags);  temp = v->counter;   temp += i; /* or temp -= i */   v->counter = temp;
        local_irq_restore(flags);
        return temp;

static inline void atomic_add/sub(int i, atomic_t *v)    atomic_add/sub_return(i, v);

static inline int atomic_add_negative(int i, atomic_t *v)  return atomic_add_return(i, v) < 0;

static inline void atomic_inc/dec(atomic_t *v)  atomic_add/sub_return(1, v);

static inline void atomic_clear_mask(unsigned long mask, unsigned long *addr)
        unsigned long flags;
        mask = ~mask;            local_irq_save(flags);       *addr &= mask;           local_irq_restore(flags);

#define atomic_dec_return(v) atomic_sub_return(1, (v))
#define atomic_inc_return(v) atomic_add_return(1, (v))
#define atomic_sub_and_test(i, v) (atomic_sub_return((i), (v)) == 0)
#define atomic_dec_and_test(v) (atomic_sub_return(1, (v)) == 0)
#define atomic_inc_and_test(v) (atomic_add_return(1, (v)) == 0)

#define atomic_inc_not_zero(v) atomic_add_unless((v), 1, 0)
#define atomic_add_unless(v, a, u) \
({ \
int c, old; \
c = atomic_read(v); \
while (c != (u) && (old = atomic_cmpxchg((v), c, c + (a))) != c) \
c = old; \
c != (u); \
})
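A small usage sketch, assuming a hypothetical driver object protected by a reference count (names are not from the original text):

static atomic_t refcnt = ATOMIC_INIT(1);   /* object starts with one reference */

static void obj_get(void)
{
    atomic_inc(&refcnt);                   /* take a reference, no lock needed */
}

static void obj_put(void)
{
    if (atomic_dec_and_test(&refcnt))      /* true only for the caller dropping the last reference */
        pr_info("last reference dropped, free the object here\n");
}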

Bit Operations
static inline void set/clear/change_bit(int nr, volatile unsigned long *addr)
unsigned long mask = BIT_MASK(nr);
unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
unsigned long flags;

_atomic_spin_lock_irqsave(p, flags);
*p  |= mask;    //*p &= ~mask;    //*p ^= mask;
_atomic_spin_unlock_irqrestore(p, flags);

set_bit()/clear_bit() are atomic and may not be reordered on x86, but can be reordered on other architectures.
If clear_bit() is used for locking purposes, call smp_mb__before_clear_bit()/smp_mb__after_clear_bit() to ensure the change is visible on other CPUs.

The routines below return the old value before modifying it.
static inline int test_and_set/clear/change_bit(int nr, volatile unsigned long *addr)
unsigned long mask = BIT_MASK(nr);
unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
unsigned long old;
unsigned long flags;

_atomic_spin_lock_irqsave(p, flags);
old = *p;
*p = old | mask;    //old & ~mask;     //old ^ mask;
_atomic_spin_unlock_irqrestore(p, flags);

return (old & mask) != 0;

test_and_set/clear/change_bit()
These are atomic and cannot be reordered on x86; they may be reordered on architectures other than x86. They also imply a memory barrier.

#define BIT(nr) (1UL << (nr))
#define BIT_MASK(nr) (1UL << ((nr) % BITS_PER_LONG))  //BITS_PER_LONG=32/64
#define BIT_WORD(nr) ((nr) / BITS_PER_LONG)
#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
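A minimal sketch of using the bit API for driver status flags (flag numbers and names are made up; includes and error handling omitted):

#define DEV_FLAG_BUSY   0
#define DEV_FLAG_ERROR  1

static unsigned long dev_flags;             /* one word holds up to BITS_PER_LONG flags */

static int dev_start_io(void)
{
    /* atomically claim the device; non-zero return means it was already busy */
    if (test_and_set_bit(DEV_FLAG_BUSY, &dev_flags))
        return -EBUSY;
    /* ... start the transfer ... */
    return 0;
}

static void dev_io_done(void)
{
    clear_bit(DEV_FLAG_BUSY, &dev_flags);   /* atomic, no spinlock required */
}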



#ifdef CONFIG_SMP
#define smp_mb() mb()    //mb()=asm volatile ("": : :"memory")
#define smp_rmb() rmb()   //rmb()  = mb()
#define smp_wmb() wmb() //wmb()=asm volatile ("": : :"memory")
#else
#define smp_mb() barrier()
#define smp_rmb() barrier()
#define smp_wmb() barrier()
#endif

#define set_mb(var, value)  do { var = value;  mb(); } while (0)
#define set_wmb(var, value) do { var = value; wmb(); } while (0)

#define read_barrier_depends() do {} while (0)
#define smp_read_barrier_depends() do {} while (0)

# define barrier() __memory_barrier() //compiler.h. compiler-intel.h
#define barrier() __asm__ __volatile__("": : :"memory") //compiler-gcc.h
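A sketch of the classic producer/consumer use of smp_wmb()/smp_rmb() with a flag and a payload (illustrative only; a real driver would pair this with proper accessors or locking):

static int payload;
static int ready;

/* CPU 0: producer */
void produce(void)
{
    payload = 42;        /* write the data first */
    smp_wmb();           /* ensure the data is visible before the flag */
    ready = 1;
}

/* CPU 1: consumer */
void consume(void)
{
    if (ready) {
        smp_rmb();       /* ensure the data is not read earlier than the flag */
        pr_info("payload = %d\n", payload);
    }
}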

Concurrency

There are favourite interview questions like the ones below (the answers are also pet responses, and at some point it all gets messy).
Difference between semaphore and mutex?
A semaphore is like the number of changing rooms in a mall: there can be at most that many users at a time.
A semaphore can also be used to take a bulk of resources, say 10 buffers at a time; the resource count then goes down by 10.
    struct semaphore s;   DEFINE_SEMAPHORE(s);  sema_init(&s, value);
    void down/up(&s);   int down_interruptible/trylock(&s);
Use it to protect longer critical sections, considering the overhead of the semaphore implementation.

A mutex is like the lock on a single ATM booth: only one user at a time.
    struct mutex m;  DEFINE_MUTEX(m);  OR mutex_init(&m);
    void mutex_lock/unlock(&m);   int mutex_lock_interruptible/trylock(&m);
Code can also call mutex_is_locked() to atomically check whether the mutex is locked. [return atomic_read(&lock->count) != 1;]

    interruptible: put this process in interruptible wait; it can be woken by the mutex becoming available or by a signal.
    uninterruptible: only mutex availability can wake the process up.
A mutex starts out behaving like a spinlock, i.e. the requesting thread first busy-waits on the CPU in the hope that most mutexes are held only briefly; after some time the lock request falls back to the sleeping (blocking) path.
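A small sketch of mutex usage in process context (names are illustrative):

static DEFINE_MUTEX(cfg_lock);
static int cfg_value;

int set_cfg(int v)
{
    if (mutex_lock_interruptible(&cfg_lock))   /* may sleep; a signal aborts the wait */
        return -ERESTARTSYS;
    cfg_value = v;                             /* critical section */
    mutex_unlock(&cfg_lock);                   /* only the task that locked may unlock */
    return 0;
}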

What is a binary semaphore?
A binary semaphore has value 1, meaning at most one user can be inside, as there is a single changing room.

Then why do we have mutex?
Mutexes have the concept of ownership: only the owner can unlock the mutex.

For a semaphore, the coder needs to ensure down/up come in pairs. For a mutex, lock and unlock must come in pairs.

What do you mean by ownership?  Well.. um... you see.. try to see this.. 
Unless you know really, these will be the answers.

Can Semaphore/Mutex be used in Interrupt Context?
Yes, but only up() and mutex_unlock(). The other half (down() and mutex_lock()) cannot be used, as they are blocking calls that can put the calling process to sleep.

Why do they cause the process to sleep / why are they blocking?
If the resource is unavailable, the calling task is placed on the lock's wait list and taken off the run queue until the holder releases it; interrupt context has no process to put to sleep, so blocking there is not allowed.


Spinlock: can be used in interrupt context, as it does not sleep. It is designed such that while kernel code holds a spinlock, preemption is disabled on the local CPU.
    spinlock_t s;   spin_lock_init(&s);   spin_lock(&s);   spin_unlock(&s);

If you need to share a spinlock between a thread and an interrupt handler, use the variant spin_lock_irqsave(), which also disables interrupts on the local CPU, as sketched below.
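A sketch of sharing data between process context and an interrupt handler with spin_lock_irqsave() (names are illustrative):

static DEFINE_SPINLOCK(fifo_lock);
static int fifo_count;

/* process context: interrupts must be disabled locally while the lock is held */
void consumer(void)
{
    unsigned long flags;

    spin_lock_irqsave(&fifo_lock, flags);
    if (fifo_count > 0)
        fifo_count--;
    spin_unlock_irqrestore(&fifo_lock, flags);
}

/* interrupt handler: plain spin_lock is enough, local interrupts are already off here */
irqreturn_t fifo_isr(int irq, void *dev)
{
    spin_lock(&fifo_lock);
    fifo_count++;
    spin_unlock(&fifo_lock);
    return IRQ_HANDLED;
}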

RW Spinlock: for data with many reads and few writes, such as searching a linked list without modification.
    rwlock_t rwl = __RW_LOCK_UNLOCKED(rwl);
    read_lock_irqsave(&rwl, flags);   Critical Section for Readers;  read_unlock_irqrestore(&rwl, flags);
    write_lock_irqsave(&rwl, flags);   CS for Writers;  write_unlock_irqrestore(&rwl, flags);
Caution: prefer the RCU (Read Copy Update) mechanism over RW spinlocks.

RW-Semaphore:
Here access is governed by the operation (read or write) the thread wants to perform. It is a semaphore optimized for multiple readers and a single writer; readers are blocked while a writer is operating. A usage sketch follows.
   struct rw_semaphore s;  void init_rwsem(&s);
   void down_read/up_read(&s);
   void down_write/up_write(&s);  //Separate read/write operations
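A reader/writer usage sketch (table and names are illustrative):

static DECLARE_RWSEM(cfg_rwsem);
static int cfg_table[16];

int read_entry(int idx)
{
    int v;

    down_read(&cfg_rwsem);      /* many readers may hold this concurrently */
    v = cfg_table[idx];
    up_read(&cfg_rwsem);
    return v;
}

void write_entry(int idx, int v)
{
    down_write(&cfg_rwsem);     /* exclusive; readers block while this is held */
    cfg_table[idx] = v;
    up_write(&cfg_rwsem);
}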

RTMutex: simple mutex + priority inheritance protocol.
struct rt_mutex
  { spinlock_t wait_lock;      struct plist_head wait_list;      struct task_struct *owner; };
Tasks on the wait_list are sorted by priority (not done in a normal mutex).
When the wait_list is updated, the kernel can raise/lower the owner's priority using rt_mutex_setprio(), which updates task_struct->prio (dynamic priority) while keeping task_struct->normal_prio the same.

The waiter structure is allocated on the kernel stack of the blocked task:
struct rt_mutex_waiter { //structure for tasks blocked on a rt_mutex
struct plist_node list_entry;   struct plist_node pi_list_entry;
struct task_struct *task;         struct rt_mutex *lock; };

A low priority owner of RTMutex1 inherits priority of higher priority waiter until RTmutex1 is released.
If this low priority owner (which inherited priority) blocks on another RTmutex2, it propagates priority to owner of RTMutex2.
Waiters are placed in RTmutex waitlist in priority order

rt_mutex->owner holds the task_struct of the owner, with two low bits used as flags:

 owner        bit1  bit0   meaning
 NULL          0     0     mutex is free (fast acquire possible)
 NULL          0     1     invalid state
 NULL          1     0     transitional state*
 NULL          1     1     invalid state
 taskpointer   0     0     mutex is held (fast release possible)
 taskpointer   0     1     task is pending owner
 taskpointer   1     0     mutex is held and has waiters
 taskpointer   1     1     task is pending owner and mutex has waiters

bit0: owner is pending; bit1: rtmutex has waiters.

Pending-ownership is assigned to the first (highest priority) waiter of the mutex, when the mutex is released. The thread is woken up and once it starts executing it can acquire the mutex. Until the mutex is taken by it (bit 0 is cleared) a competing higher priority thread can "steal" the mutex which puts the woken up thread back on the waiters list.

RTMutex has zero locking overhead for an uncontended mutex and zero unlocking overhead for a mutex with no waiters. This is called the fastpath optimization of RTMutex.

Wait Mechanism

In HW, there are simple wait requirements, to let the system stabilize or to let a previous operation finish, e.g. Flash R/W over SPI. Similarly, reading data from disk may require some wait before the data is available.
There are a number of ways to implement a wait mechanism in Linux.

A process can be in the following states:
TASK_RUNNING: only a process in this state can be scheduled; it is on the run queue or running.
TASK_STOPPED: stopped, e.g. by a stop signal or job control.

The waiting states are:
TASK_INTERRUPTIBLE: the process is waiting for some event; it can also be woken up by a signal.
TASK_UNINTERRUPTIBLE: the process is waiting for an event and cannot be woken up by signals.

TASK_ZOMBIE: terminated, waiting to be cleaned up.
TASK_DEAD: terminated, and the entry is removed from the process table because the parent has collected its status.

TASK_TRACED: the process is being debugged; it enters this state whenever it is stopped by the debugger.

1. Wait with schedule(): set_current_state(TASK_INTERRUPTIBLE);   schedule();
schedule() voluntarily yields the processor. If the task state was left as TASK_RUNNING it stays on the run queue and can be picked again; after set_current_state(TASK_INTERRUPTIBLE) it is taken off the run queue until woken.
Who will put this process back to TASK_RUNNING?
Someone should call wake_up_process(ptrToTaskStruct); see the sketch below.
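A sketch of this pattern (waiter and waker are hypothetical functions; a real driver would also re-check a condition to avoid missed wakeups):

static struct task_struct *waiter_task;

void waiter(void)
{
    waiter_task = current;
    set_current_state(TASK_INTERRUPTIBLE);   /* mark as sleeping before yielding */
    schedule();                              /* off the CPU until someone wakes us */
    /* back in TASK_RUNNING here */
}

void waker(void)
{
    if (waiter_task)
        wake_up_process(waiter_task);        /* puts the task back on the run queue */
}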

2. Waiting on an Event: generally a process waits on an event (waiting for some time, waiting to get a resource, waiting for data, etc.).

3. WaitQueue: a list of processes waiting for an event to occur.
    wait_queue_head_t w;    DECLARE_WAIT_QUEUE_HEAD(w);  OR init_waitqueue_head(&w);
    wait_event(w, condition);
    wait_event_interruptible(w, condition);  //e.g. condition is (flag == 'y')
    wait_event_interruptible_timeout(w, condition, timeout);

    wake_up(&w);   wake_up_interruptible(&w);

Kernel DS/APIs (Internal Implementation of these primitives)

Semaphore
When P2 attempts to take an already taken semaphore:
    P2 is put to sleep
    P2 is added to the wait_list of that semaphore
    Signals are blocked while P2 waits to enter the CS (for TASK_UNINTERRUPTIBLE)
Else:
    P1 reserves the semaphore and proceeds to the critical section.

struct semaphore 
  { spinlock_t lock;   unsigned int count;    struct list_head wait_list; };
     struct semaphore_waiter {    struct list_head list;     struct task_struct *task;     int up;         };

down_interruptible/killable/timeout()
    spin_lock_irqsave,
    if (sem->count > 0) sem->count--; else __down_interruptible/killable/timeout();
    spin_unlock_irqrestore
down_trylock
    spin_lock_irqsave
    int cnt = sem->count - 1;  if (cnt >= 0) sem->count = cnt; //Beauty: first assume it is taken to get the count.
    spin_unlock_irqrestore
Helper functions
__down_interruptible/killable/timeout
    __down_common(sem, TASK_INTERRUPTIBLE/TASK_KILLABLE/TASK_UNINTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT/jiffies);

static inline int __sched __down_common(struct semaphore *sem, long state, long timeout)
    add 'current' task to sem->wait_list as a semaphore_waiter, waiter.up = 0
    for (;;)
        if (interrupted or timed out) delete from list and return -EINTR or -ETIME
        set task state
        spin_unlock_irq
        wait for the given time (schedule_timeout)
        spin_lock_irq
        if (waiter.up == 1) return 0

static noinline void __sched __up(struct semaphore *sem)
    waiter = list_first_entry(&sem->wait_list, struct semaphore_waiter, list);
    list_del(&waiter->list);   waiter->up = 1;   wake_up_process(waiter->task);

Mutex:

struct mutex { //1: unlocked, 0: locked, negative: locked, possible waiters
    atomic_t count;    spinlock_t wait_lock;   struct list_head wait_list; };
struct mutex_waiter { //control structure for tasks blocked on a mutex
    struct list_head list;
    struct task_struct *task; }; //mutex_waiter resides on the blocked task's kernel stack

Spinlock:
The protected code must not sleep. The current holder (and its critical section code) cannot acquire the same lock more than once.

If the lock is not held elsewhere in the kernel:
    the lock is reserved for the current CPU; other CPUs cannot enter.
If the lock is already taken elsewhere:
    the current CPU loops endlessly, checking for release.

        typedef struct { } raw_spinlock_t;
        typedef struct {raw_spinlock_t raw_lock;} spinlock_t

spin_lock_irqsave()
local_irq_save(flags);
preempt_disable();

spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);


RW-Semaphore:
struct rw_semaphore {
    __s32 activity;  //0: no active reader/writer; +n: n active readers; -1: one writer active
    spinlock_t wait_lock;     struct list_head wait_list; };

RCU: Read Copy Update

When data is to be changed, create a copy and update it; then switch the pointer to the new copy and increase the sequence number. The old copy is freed once the last pre-existing reader finishes. A usage sketch follows the list below.

  • The shared resource should be mostly read-only with rare writes, for good performance
  • The kernel cannot sleep within an RCU protected (read-side) region
  • The protected resource must be accessed via a pointer, and the pointer must not be dereferenced directly (use the RCU accessors)

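A minimal reader/updater sketch with the classic RCU APIs (the structure and fields are made up for illustration):

struct cfg {
    int a, b;
};

static struct cfg __rcu *global_cfg;

/* reader: cheap, lock-free, must not sleep inside the read-side section */
int read_a(void)
{
    struct cfg *p;
    int val = 0;

    rcu_read_lock();
    p = rcu_dereference(global_cfg);          /* always go through the accessor */
    if (p)
        val = p->a;
    rcu_read_unlock();
    return val;
}

/* updater: copy, modify the copy, publish, free the old copy after a grace period */
void update_a(int a)
{
    struct cfg *newp = kmalloc(sizeof(*newp), GFP_KERNEL);
    struct cfg *oldp;

    if (!newp)
        return;
    newp->a = a;
    newp->b = 0;
    oldp = rcu_dereference_protected(global_cfg, 1);  /* updater-side access */
    rcu_assign_pointer(global_cfg, newp);             /* publish the new copy */
    synchronize_rcu();                                /* wait for pre-existing readers */
    kfree(oldp);
}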


Notes from Kernel Code (2.6)
Mutex: 
  • only one task can hold the mutex at a time
  • only the owner can unlock the mutex
  • multiple unlocks are not permitted
  • recursive locking is not permitted
  • a mutex object must be initialized via the API
  • a mutex object must not be initialized via memset or copying
  • task may not exit with mutex held
  • memory areas where held locks reside must not be freed
  • held mutexes must not be reinitialized
  • mutexes may not be used in hardware or software interrupt contexts such as tasklets and timers
Semaphore:

  • Use of this function (down) is deprecated, please use down_interruptible() or down_killable() instead
  • up() may be called from any context and even by tasks which have never called down()
Spinlock

  • Safe only when the lock itself is *also* used for locking on every CPU: all code touching the shared variable must take the same lock so they agree with each other.

RW-Spinlock

  • Are being phased out, instead use RCU (Read Copy Update)


Some Good Macros in Kernel Code
//Attach __sched to any function that should be ignored in wchan output
#define __sched  __attribute__((__section__(".sched.text")))

#define __must_check  __attribute__((warn_unused_result))

#define uninitialized_var(x) x = x

#define __must_be_array(a) BUILD_BUG_ON_ZERO(__same_type((a), &(a)[0]))
Concept: &a[0] decays to a pointer. For a true array, typeof(a) is an array type and differs from that pointer type, so the check passes; for a pointer the types match and the build fails.
    __same_type(a, b): __builtin_types_compatible_p(typeof(a), typeof(b))

#define CHAR_TO_IDX(c)  ( (int)c  - (int)'a')
#define CHAR_TO_ASCII(c)  ( (int)c  - (int)'0')

#define BUILD_BUG_ON_NOT_POWER_OF_2(n) \
    BUILD_BUG_ON((n) == 0 || (((n) & ((n) - 1)) != 0))

Create Macro to initialize struct members:
struct sample { spinlock_t lock;  unsigned int cnt;  struct list_head wait_list; };
#define MYSTRUCT_INIT(name, n) \
{  .lock = __SPIN_LOCK_UNLOCKED( (name).lock), \
    .cnt=n, \
    .wait_list = LIST_HEAD_INIT( (name).wait_list) \
}

Sample Program:

kthread_demo.ko

MODULE_LICENSE("GPL");

static struct task_struct *t;

static int thread_func(void *dummy)
{   allow_signal(SIGKILL);
while (!kthread_should_stop()) { 
printk(KERN_INFO "Thread Running\n");   sleep(5); 
if(signal_pending(current))    break;
}
printk(KERN_INFO "Stopping Thread\n");
do_exit(0);
return 0;
}

static void __exit cleanup_dummy_thread(void)
{     printk(KERN_INFO "Cleaning Up\n");
       if (t) {  kthread_stop(t);  printk(KERN_INFO "Thread stopped\n");     }
}

static int __init init_demo_thread(void)
{
printk(KERN_INFO "Creating Thread\n");
t = kthread_run(thread_func, NULL, "thread");
if (IS_ERR(t)) {  printk(KERN_ERR "Thread Create FAILED\n");   return PTR_ERR(t);  }
printk(KERN_INFO "Thread Created!!\n");
return 0;
}
module_init(init_demo_thread);
module_exit(cleanup_dummy_thread);

Binary Semaphore / Mutex
The example below uses a binary semaphore. If it is converted to use a mutex instead (all the commented code), it generates:  DEBUG_LOCKS_WARN_ON(lock->owner != current)
This means a process is trying to unlock a mutex it does not own.

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/uaccess.h>
#include <linux/semaphore.h>   /* #include <linux/mutex.h> */

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Binary Semaphore Demonstration");

#define FIRST_MINOR 0
#define COUNT 1

static dev_t dev;
static struct cdev c_dev;
static struct class *cl;

static char c = 'X';
static struct semaphore sem_var;  /*DEFINE_MUTEX(mutex_var); */ 

static int my_open(struct inode *i, struct file *f)  {   return 0;  }
static int my_close(struct inode *i, struct file *f) {   return 0;  }

static ssize_t my_read(struct file *f, char __user *buf, size_t len, loff_t *off)
{
    if (down_interruptible(&sem_var))   /*if (mutex_lock_interruptible(&mutex_var))*/
    {   printk("Unable to acquire Semaphore\n");   return -1;       }
    return 0;
}
static ssize_t my_write(struct file *f, const char __user *buf, size_t len, loff_t *off)
{
    up(&sem_var);    /*mutex_unlock(&mutex_var);*/
    if (copy_from_user(&c, buf + len - 1, 1))
    {   return -EFAULT;     }
    return len;
}

static struct file_operations driver_fops =
{
 .owner = THIS_MODULE,
 .open = my_open,
 .release = my_close,
 .read = my_read,
 .write = my_write
};

static int __init sem_mutex_init(void)
{
    int ret;       struct device *dev_ret;

    if ((ret = alloc_chrdev_region(&dev, FIRST_MINOR, COUNT, "sem_mutex")) < 0)
    {   return ret;       }

    printk("Major Number: %d\n", MAJOR(dev));
    cdev_init(&c_dev, &driver_fops);

    if ((ret = cdev_add(&c_dev, dev, COUNT)) < 0)
    {    unregister_chrdev_region(dev, COUNT);       return ret;      }

    if (IS_ERR(cl = class_create(THIS_MODULE, "char")))
    {   cdev_del(&c_dev);   
   unregister_chrdev_region(dev, COUNT); 
   return PTR_ERR(cl);
    }

    if (IS_ERR(dev_ret = device_create(cl, NULL, dev, NULL, "sem_mutex%d", FIRST_MINOR)))
    {
        class_destroy(cl);
        cdev_del(&c_dev);
        unregister_chrdev_region(dev, COUNT);
        return PTR_ERR(dev_ret);
    }

    sema_init(&sem_var, 1);   /* binary semaphore: starts available, like an unlocked mutex */
    return 0;
}

static void __exit sem_mutex_exit(void)
{
    device_destroy(cl, dev);
    class_destroy(cl);
    cdev_del(&c_dev);
    unregister_chrdev_region(dev, COUNT);
}

module_init(sem_mutex_init);
module_exit(sem_mutex_exit);

Execution:
cat /dev/sem_mutex0 - this acquires the semaphore (the mutex, in the mutex version)
cat /dev/sem_mutex0 - this blocks, since the semaphore is already taken
echo 1 > /dev/sem_mutex0 - the write does up()/unlock and releases the blocked reader

RW-Semaphore

WaitQueue

static DECLARE_WAIT_QUEUE_HEAD(wq);   /* declarations assumed by this snippet */
static char flag = 'n';

ssize_t op_read(struct file *filp, char *buffer, size_t cnt, loff_t *p_offset)
{
    printk(KERN_INFO "In read\n");
    printk(KERN_INFO "Going Out\n");
    wait_event_interruptible(wq, flag == 'y');
    flag = 'n';
    printk(KERN_INFO "Woken Up\n");
    return 0;
}

ssize_t op_write(struct file *filp, const char *buffer, size_t cnt, loff_t *p_offset)
{
    printk(KERN_INFO "In write\n");
    if (copy_from_user(&flag, buffer, 1))    return -EFAULT;
    printk(KERN_INFO "flag %c", flag);
    wake_up_interruptible(&wq);
    return cnt;
}

Kernel supports debugging for mutexes: CONFIG_DEBUG_MUTEXES
    - Uses symbolic names of mutexes, whenever they are printed
      in debug output.
    - Point-of-acquire tracking, symbolic lookup of function names,
      list of all locks held in the system, printout of them.
    - Owner tracking.
    - Detects self-recursing locks and prints out all relevant info.
    - Detects multi-task circular deadlocks and prints out all affected
      locks and tasks (and only those tasks).

Some more theory: What is portability, scalability and modularity?
Portability: should run on different architectures.
Scalability: can run on a supercomputer as well as a tiny device (needs ~4MB RAM).
Modularity: ability to include things at runtime (loadable modules).

All code outside the arch directory should be portable. For portability, the kernel has macros/APIs for:
Endianness: cpu_to_[b/l]e32, [b/l]e32_to_cpu
I/O memory access
Memory barriers for ordering guarantees
DMA APIs to flush/invalidate caches when needed
Avoid floating point entirely [soft-float emulation in user space is OK performance-wise]

User Space Device Drivers: OK when the kernel gives a mechanism for a user application to access the HW, when you want to avoid existing kernel subsystems such as networking, or for a simple system with a single application.

Interrupt Handling in Linux
An interrupt is an event that changes the sequence of instructions executed by the CPU.
1. Synchronous interrupts (exceptions): produced by the CPU while processing instructions.
2. Asynchronous interrupts (interrupts): produced by peripherals to gain the CPU's attention.

Exceptions are caused by programming errors, such as divide by 0, page fault, overflow:
1. CPU-detected exceptions: Faults (correctable), Traps (reported after the trapping instruction), Aborts (serious errors).
2. Programmed exceptions: requested by the programmer, handled like traps.

A trap is an exception in a user process, caused by e.g. divide-by-0 or invalid memory access. It is also the usual way to invoke a kernel routine (syscall), since the kernel runs with higher privilege than user code. In Linux, SW interrupts are handled as traps.

Interrupts are:
- Maskable: issued by I/O devices; can be masked or unmasked. Only unmasked interrupts are processed.
- Non-maskable: critical malfunctions; always processed by the CPU.

Normally, the CPU checks whether there is an IRQ pending at the (A)PIC after each instruction.

PIC: Programmable Interrupt Controller.

When an interrupt occurs, the CPU checks whether interrupts are masked.
    If masked, nothing happens until they are unmasked.
    When unmasked and there are pending interrupts, the CPU picks one.
        The CPU masks interrupts, saves registers and branches to the address where the interrupt handler code resides.
            The handler talks to the peripheral that triggered the interrupt, transfers data, calls the scheduler for a timer interrupt, etc.
        On return, the CPU executes a special return-from-interrupt instruction which restores registers and unmasks interrupts.

ISR
Its role is to acknowledge the interrupt to the device, read/write data, clear the interrupt bit, and wake up processes sleeping on the device/event.
An ISR is not executed in process context: it cannot transfer data to/from user space and cannot sleep (i.e. it must not call wait_event, semaphore down, or the scheduler). That is why we have bottom halves.

Bottom Half Mechanisms
Tasklet: different tasklets can run in parallel on multiple CPUs, but the same tasklet runs on only one CPU at a time.
Tasklets never run in process context; they run only on the CPU that scheduled them, giving better cache behaviour and serialization. A tasklet runs only after the top half has finished, so re-entrance need not be a concern.
WorkQ: runs in process context, so it can sleep. Each workqueue has its own thread per CPU, so sleeping does not block other tasks. This bottom half can run on a different CPU than the one that scheduled it, so races etc. must be handled.
Kernel Thread: create your own thread at startup, then sleep; wake it up when there is some work.

Tasklets are part of the more generic softirq mechanism (TASKLET_SOFTIRQ).
SoftIRQs are consumed:
1. After an interrupt is serviced, the kernel checks for pending softirqs and executes them in priority order.
2. The kernel thread ksoftirqd[cpu], scheduled like any other process, consumes them.
The second way helps prevent other tasks from being starved by a tasklet storm (tasklets resubmitting themselves); do_softirq() is allowed to restart processing only MAX_SOFTIRQ_RESTART (10) times.

Issues:
1. Tasklets run in SW interrupt mode, so they cannot sleep.
2. Because they run in SW interrupt mode, they have higher priority than other tasks, so poorly coded tasklets can cause poor latencies for other tasks.
Tasklets are more or less deprecated outside the network code; a workqueue sketch follows.
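Since tasklets are discouraged, here is a work queue sketch of the same bottom-half idea (names are illustrative):

#include <linux/workqueue.h>
#include <linux/interrupt.h>

static void my_bh(struct work_struct *w)
{
    /* runs in process context: may sleep, take mutexes, allocate with GFP_KERNEL */
    pr_info("deferred work running\n");
}

static DECLARE_WORK(my_work, my_bh);

static irqreturn_t my_dev_isr(int irq, void *dev)
{
    /* top half: acknowledge the device quickly, defer the heavy part */
    schedule_work(&my_work);            /* queue onto the shared kernel worker threads */
    return IRQ_HANDLED;
}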


Threaded Interrupt Handler:
Use a kernel thread to handle the interrupt.
request_threaded_irq(unsigned int irq, irq_handler_t handler, irq_handler_t thread_fn, unsigned long flags, const char *name, void *dev);
The top half (handler) is called in hard interrupt context and checks whether the interrupt comes from its device:
  if not, return IRQ_NONE;
  else return IRQ_WAKE_THREAD if processing by the handler thread is required,
  else return IRQ_HANDLED if nothing more needs to be done.
The IRQ_WAKE_THREAD return value is what enables threaded interrupt handlers. This method is used to replace tasklets and workqueues in device drivers, as sketched below.
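A sketch of a threaded handler pair (device_irq_is_ours() is a hypothetical helper; flags and names are illustrative):

static irqreturn_t quick_check(int irq, void *dev)
{
    /* hard IRQ context: only check whether our device raised the line */
    if (!device_irq_is_ours(dev))       /* hypothetical device-specific check */
        return IRQ_NONE;
    return IRQ_WAKE_THREAD;             /* defer the real work to the handler thread */
}

static irqreturn_t slow_work(int irq, void *dev)
{
    /* runs in a dedicated kernel thread: may sleep, talk to a slow bus, etc. */
    return IRQ_HANDLED;
}

int setup_irq_example(int irq, void *dev)
{
    return request_threaded_irq(irq, quick_check, slow_work,
                                IRQF_ONESHOT, "mydev", dev);
}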

Interrupt handling in user space
Why? It can give better security, keeps the kernel driver clean, and avoids licensing constraints.
E.g. printer drivers have traditionally been implemented outside the Linux kernel, using port access.

Device Driver
  • A layer between the application and the actual device, providing mechanism, not policy.
E.g. the XServer knows the HW and provides an interface (mechanism), while the session manager implements policy irrespective of the HW (policy).
  • Supports sync/async operations
  • Can be opened multiple times
  • Must be reentrant, avoid race conditions, handle concurrency
Char DD: accessed as a stream of data; provides open(), close(), read(), write(). Accessed via file system nodes, such as /dev/tty0.
Block DD: transfers data in blocks.
Network DD: exchanges network data; knows about packets but not connections; NOT mapped to a file system node.
USB device

The kernel does not know printf, scanf. These are C library functions.

Polling vs Interrupt
Poll: consumes CPU cycles; suitable for ultra-high-speed I/O where data keeps coming, so the polled cycles are rarely wasted.
Interrupt: asynchronous, need-based; suitable for slow I/O where data is generated rarely.

Blocking I/O
Used when the driver cannot satisfy a request immediately (e.g. read when data is not available, write when the device is not ready to accept data).
The calling process should not have to bother about such issues; the driver should block the calling process and put it to sleep until the request can be serviced.
E.g. the user calls read(fd, buffer, size) on an fd opened using the open() system call:
device_read() { check buffer size, copy_to_user() or sleep to block the I/O }

Sleep is a special state in which the process is removed from the scheduler's run queue and will not run until some future event happens.
Rules: avoid sleeping in atomic context (holding a lock, interrupts disabled, ...); do not assume anything about the system state on wakeup (previously available resources may be gone); ensure some other context can wake the process up while it sleeps.

Select/Poll System Call
  • Helps user app to wait for data to arrive on one/more fd
  • Call f_ops->poll for all fds
  • Each fops->poll would return if data received
  • If no fd has data, select/poll has to wait for data on those fds
  • It should know about waitQ that should be used to signal new data

poll(&pollfd, 1, timeout); //userspace: pollfd contains info about the fds to wait for
//kernel driver:
device_poll() { ... poll_wait(filp, &p->irq_wq, w); }  //poll_wait is a kernel API
irq_isr() { wake_up(&p->irq_wq); }
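A slightly fuller sketch of the driver side (data_ready and the wait queue are illustrative):

static DECLARE_WAIT_QUEUE_HEAD(irq_wq);
static int data_ready;

static unsigned int device_poll(struct file *filp, poll_table *wait)
{
    unsigned int mask = 0;

    poll_wait(filp, &irq_wq, wait);     /* register the wait queue; does not block here */
    if (data_ready)
        mask |= POLLIN | POLLRDNORM;    /* readable without blocking */
    return mask;
}

static irqreturn_t device_isr(int irq, void *dev)
{
    data_ready = 1;
    wake_up_interruptible(&irq_wq);     /* makes sleeping select/poll callers re-evaluate */
    return IRQ_HANDLED;
}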


The best resource:
Great Work By:  Pradeep's Blog
