Version 1.0, Sep 29 2009

This document describes the FS threads port to Linux and a solution for the memory management issues.

1. Problem Statement

The file system threads are basic cooperatively scheduled threads implemented on top of EEE state machines. The EEE provides the following very basic API:

    thread_create()
    thread_exit()
    thread_yield()

Wake-ups are done by sending events to the thread state machines with esm_deliverEvent().

The file system code expects to be able to efficiently create and schedule thousands of threads. Both thread creation and context switches are expected to be extremely light-weight. The EEE scheduler is strict FIFO and as such is as efficient as a scheduler can get. The Linux implementation is expected to obey the same constraints.

The memory manager for the file system is expected to provide never-block, never-fail properties. This is achieved by employing a request gate: new requests are not scheduled unless it is proven that there is enough memory to complete both the currently running requests and the new one. There is no static allocation of the data structures; however, unless fragmentation goes through the roof, memory consumption is bounded as long as the number of allocated data structures is bounded. The use of the slab allocator helps to keep fragmentation in check, so running out of memory has not been a problem so far.

The file system code has long relied on the no-block property: there are a lot of places in the current code that allocate memory under spin locks. It seems that finding and modifying all of these places would be too dangerous, so it is desirable that the memory manager under the Linux port provide the same properties.

There are a few places outside the file system code that rely on the no-fail property of the memory manager. The restricted environment allowed this code to work: the out-of-memory condition simply never happened. With the Linux port there is a possibility of all kinds of random code being executed on the box, so this code should be shielded from temporary out-of-memory conditions.

2. Design Overview

2.1. FS Threads

The file system threads will be implemented as co-routines on top of the Linux kernel threads, rather than having a one-to-one correspondence between Linux threads and FS threads. This will keep the scheduler and thread creation/deletion overhead to a minimum and will prevent the scheduling problems where a herd of FS threads, the FP, NCPU, and ACPU polling threads, and user threads would all be fighting for the CPU.

Several FP polling threads will be created; the number will depend on the number of physical cores in the system. The Cougar hardware will have 5 FP polling threads (5 FP + 1 ACPU + 1 network interrupts + 1 NCPU = 8 cores), or, if the NCPU thread does not need a full core, 6 FP polling threads. The Pikes Peak hardware will have 1 or 2 FP polling threads (1 VxWorks + 1 network + 1 ACPU + 1 FP = 4 cores).

Each polling thread will run the same polling loop as the current software, executing the currently runnable threads as just one of the routines in the polling loop. All the code that the threads will be allowed to execute, including the memory allocation calls, will be strictly non-blocking. The polling thread will switch to the thread context using roughly the same small piece of MIPS assembler code as today; equivalent assembler code will have to be written for the x86-64 port. A sketch of the co-routine switching mechanism follows.
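To make the co-routine mechanism concrete, below is a minimal user-space sketch that uses the POSIX ucontext API as a stand-in for the hand-written assembler switch. All names (fs_thread, fs_thread_yield, fs_thread_body) are hypothetical; the real port would save and restore only the callee-saved registers and the stack pointer, as the MIPS code does today.

    /*
     * User-space sketch of the co-routine switch from section 2.1.
     * swapcontext() stands in for the small assembler switch routine.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    #define FS_STACK_SIZE (64 * 1024)

    struct fs_thread {
        ucontext_t ctx;                  /* saved register state          */
        char       stack[FS_STACK_SIZE];
    };

    static ucontext_t        poller_ctx; /* context of the polling thread */
    static struct fs_thread *current;    /* FS thread currently running   */

    /* thread_yield(): hand the CPU back to the polling loop. */
    static void fs_thread_yield(void)
    {
        swapcontext(&current->ctx, &poller_ctx);
    }

    /* Body of a hypothetical FS thread: do a little work, then yield. */
    static void fs_thread_body(void)
    {
        for (int i = 0; i < 3; i++) {
            printf("fs thread: step %d\n", i);
            fs_thread_yield();
        }
    }

    int main(void)
    {
        struct fs_thread *t = malloc(sizeof(*t));

        /* thread_create(): set up a stack and an entry point. */
        getcontext(&t->ctx);
        t->ctx.uc_stack.ss_sp   = t->stack;
        t->ctx.uc_stack.ss_size = sizeof(t->stack);
        t->ctx.uc_link          = &poller_ctx;  /* thread_exit() target */
        makecontext(&t->ctx, fs_thread_body, 0);

        /* The polling loop: run the runnable thread as one routine. */
        for (int round = 0; round < 4; round++) {
            printf("poller: round %d\n", round);
            current = t;
            swapcontext(&poller_ctx, &t->ctx);
        }
        free(t);
        return 0;
    }

The polling loop dispatches a runnable thread by swapping into its context, and the thread hands the CPU back by swapping to the poller, which is exactly the cooperative discipline the EEE threads obey today.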
2.2. Non-blocking Non-failing Memory Manager

The non-blocking non-failing property of the file system memory manager will be achieved using roughly the same techniques as are used now, with some enhancements. Rather than checking for available memory before request creation, the polling threads will check for available memory before executing the next batch of active threads/polling functions/ACPU requests. The memory manager will be instructed to keep a significant chunk of memory, on the order of 100+ MB, free at all times. If the free memory falls below the threshold, the execution of the FP and ACPU code will be temporarily suspended until the kernel comes up with free memory. If no free memory appears within a predefined amount of time (~4 seconds), the box will be rebooted. This allows the FS and ACPU memory requests to be satisfied without blocking or failing. A sketch of the gate appears at the end of this document.

2.3. Yielding Polling Threads

Despite being called polling threads out of habit, the FP and ACPU threads will be required to yield the CPU under the Linux architecture to allow the user code to run. Instead of polling continuously like the current code does, the polling threads will go to sleep if no work can be done. All the routines that produce work for the polling threads will be modified to wake the polling thread when the corresponding input queue goes from empty to not-empty; examples of such routines are sending an event to a thread, putting a packet on the ACPU input queue, and detecting a qlogic interrupt. New routines will be added to process timer and qlogic interrupts. A sketch of the wake-up rule also appears at the end of this document.

The polling threads will also yield the CPU without going to sleep after running for longer than a predefined time quantum, to allow effective sharing of the CPU with the user code. The user code will be preempted after using its quantum, as normally happens under the Linux scheduler.

2.4. CPU_PRIVATE Data

The CPU_PRIVATE data will be per-thread rather than per-core. It will be implemented as thread-specific data associated with the polling threads. The syntax is expected to change: we will no longer be able to use the direct variable reference hack, and references to CPU_PRIVATE variables will have to use function-call syntax (see the sketch at the end of this document).

2.5. Debuggability

2.5.1. KGDB and Core Dumps

The Linux core dump and interactive debugging code will have to be made aware of the file system threads. The current EEE gdb thread support and the existing gdb modifications for the file system threads can be consulted for ideas on how this is done.

2.5.2. Volume Exception Dumps

The volume exception supporting code, namely the user space utility that rewrites the exception dump into core dump format, will need to be updated to work with the current versions of GCC and GDB. I believe the GDB format has changed slightly between revisions, but I may be mistaken here; at a minimum we will need to make sure that the code still works.

2.5.3. Thread Stacks Dump

The thread stacks dump feature will be supported just as it is now. The FP threads will have the same feature, allowing the threads to be suspended by setting a global flag and the thread stacks to be dumped to the log once the threads are suspended. The only difference is that the suspend flags will be per-thread instead of per-core.
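To make the gate described in section 2.2 concrete, here is a minimal compilable sketch. All of the names (mm_free_bytes, mm_wait_for_memory, fp_run_batch, box_reboot) are hypothetical stand-ins for the real memory manager and EEE entry points; only the 100+ MB threshold and the ~4 second reboot deadline come from the text above.

    /*
     * Sketch of the free-memory gate from section 2.2.  The stubs below
     * simulate the kernel gradually freeing pages; in the real system
     * mm_free_bytes() would query the memory manager and
     * mm_wait_for_memory() would sleep until the kernel reclaims pages.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define MM_FREE_THRESHOLD (100ULL << 20)  /* keep 100+ MB free      */
    #define MM_REBOOT_TIMEOUT_SEC 4           /* ~4 s before giving up  */

    static uint64_t fake_free = 64ULL << 20;  /* simulated free memory  */

    static uint64_t mm_free_bytes(void)  { return fake_free; }
    static void mm_wait_for_memory(void) { fake_free += 32ULL << 20; }
    static void fp_run_batch(void)
    {
        printf("running one batch of FS threads\n");
    }
    static void box_reboot(const char *why)
    {
        fprintf(stderr, "REBOOT: %s\n", why);
        exit(1);
    }

    /* The gate each polling thread runs before the next batch. */
    static void fp_poll_once(void)
    {
        time_t start = time(NULL);

        /* Suspend FP/ACPU work until the kernel comes up with memory. */
        while (mm_free_bytes() < MM_FREE_THRESHOLD) {
            if (time(NULL) - start >= MM_REBOOT_TIMEOUT_SEC)
                box_reboot("free pool not replenished in time");
            mm_wait_for_memory();
        }

        /* Enough memory is proven to be available, so every allocation
         * made by the batch below can neither block nor fail. */
        fp_run_batch();
    }

    int main(void)
    {
        fp_poll_once();
        return 0;
    }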
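Similarly, the empty-to-not-empty wake-up rule from section 2.3 can be sketched with a pthread condition variable. The queue is reduced to a depth counter, and the function names are hypothetical.

    /*
     * Sketch of the wake-up rule from section 2.3: producers wake the
     * polling thread only when the input queue goes empty -> not-empty.
     */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_wake = PTHREAD_COND_INITIALIZER;
    static int q_depth;             /* stand-in for a real input queue */

    /* Producer side: called from routines such as esm_deliverEvent(),
     * the ACPU enqueue path, or the qlogic interrupt handler. */
    void input_queue_put(void)
    {
        pthread_mutex_lock(&q_lock);
        bool was_empty = (q_depth == 0);
        q_depth++;
        pthread_mutex_unlock(&q_lock);

        /* Wake the poller only on the empty -> not-empty transition. */
        if (was_empty)
            pthread_cond_signal(&q_wake);
    }

    /* Poller side: go to sleep only when no work can be done. */
    void poller_wait_for_work(void)
    {
        pthread_mutex_lock(&q_lock);
        while (q_depth == 0)
            pthread_cond_wait(&q_wake, &q_lock);
        q_depth--;                  /* claim one unit of work */
        pthread_mutex_unlock(&q_lock);
    }

Signaling only on the empty-to-not-empty transition keeps the producers cheap in the common case where the poller is already awake and draining the queue.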
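Finally, the function-call syntax anticipated in section 2.4 could look roughly like the following thread-specific-data sketch. The CPU_PRIVATE_GET macro and the cpu_private structure are invented here purely for illustration.

    /*
     * Sketch of per-thread CPU_PRIVATE data from section 2.4, built on
     * pthread thread-specific data.  CPU_PRIVATE_GET() is a hypothetical
     * replacement for the old direct variable reference hack.
     */
    #include <pthread.h>
    #include <stdlib.h>

    struct cpu_private {            /* one instance per polling thread */
        int  foo;
        long stats[8];
    };

    static pthread_key_t  cpu_private_key;
    static pthread_once_t cpu_private_once = PTHREAD_ONCE_INIT;

    static void cpu_private_make_key(void)
    {
        pthread_key_create(&cpu_private_key, free);
    }

    /* Called once by each polling thread at start-up. */
    void cpu_private_init(void)
    {
        pthread_once(&cpu_private_once, cpu_private_make_key);
        pthread_setspecific(cpu_private_key,
                            calloc(1, sizeof(struct cpu_private)));
    }

    /* Function-call syntax replacing the direct variable reference. */
    #define CPU_PRIVATE_GET(field) \
        (&((struct cpu_private *)pthread_getspecific(cpu_private_key))->field)

    void example_use(void)
    {
        *CPU_PRIVATE_GET(foo) += 1;  /* was: a direct CPU_PRIVATE reference */
    }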