Version 1.0, Sep 29 2009

This document describes the FS threads port to Linux and a solution for the memory management issues.

1. Problem Statement

The file system threads are basic cooperatively scheduled threads implemented on top of EEE state machines. The EEE provides the following very basic API:

    thread_create()
    thread_exit()
    thread_yield()

Wake-ups are done by sending events to the thread state machines with esm_deliverEvent().

The file system code expects to be able to efficiently create and schedule thousands of threads. Both thread creation and context switches are expected to be extremely light-weight. The EEE scheduler is strict FIFO and as such is as efficient as a scheduler can get. The Linux implementation is expected to obey the same constraints.

The memory manager for the file system is expected to provide never-block, never-fail properties. This is achieved by employing a request gate: new requests are not scheduled unless it is proven that there is enough memory to complete both the currently running requests and the new one. There is no static allocation of the data structures; however, unless fragmentation goes through the roof, memory consumption is bounded as long as the number of allocated data structures is bounded. The use of the slab allocator helps to keep fragmentation in check, so running out of memory has not been a problem so far.

The file system code has long relied on the no-block property: there are a lot of places in the current code that allocate memory under spin locks. It seems that finding and modifying all of these places would be too dangerous, so it is desirable that the memory manager under the Linux port provide the same properties.

There are a few places outside the file system code that rely on the no-fail property of the memory manager. The restricted environment allowed this code to work: the out-of-memory condition simply never happened. With the Linux port there is a possibility of all kinds of random code being executed on the box, so this code should be shielded from temporary out-of-memory conditions.

2. Design Overview

2.1. FS Threads

The file system threads will be implemented as co-routines on top of the Linux kernel threads, rather than having a one-to-one correspondence between Linux threads and FS threads. This will keep the scheduler and thread creation/deletion overhead to a minimum and will prevent the scheduling problems where a herd of FS threads, the FP, NCPU, and ACPU polling threads, and user threads would all be fighting for the CPU.

Several FP polling threads will be created; the number will depend on the number of physical cores in the system. The Cougar hardware will have 5 FP polling threads (5 FP + 1 ACPU + 1 network interrupts + 1 NCPU = 8 cores), or, if the NCPU thread does not need a full core, 6 FP polling threads. The Pikes Peak hardware will have 1 or 2 FP polling threads (1 VxWorks + 1 network + 1 ACPU + 1 FP = 4 cores).

Each polling thread will run the same polling loop as the current software, executing the currently runnable threads as just one of the routines in the polling loop. All the code that the threads will be allowed to execute, including the memory allocation calls, will be strictly non-blocking. The polling thread will switch to the thread context using roughly the same small piece of MIPS assembler code as today; equivalent assembler code will have to be written for the x86-64 port. A sketch of the co-routine switching mechanism follows.
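To make the co-routine mechanism concrete, below is a minimal user-space sketch that uses the POSIX ucontext API as a stand-in for the hand-written assembler switch. All names (fs_thread, fs_thread_yield, fs_thread_body) are hypothetical; the real port would save and restore only the callee-saved registers and the stack pointer, as the MIPS code does today.

    /*
     * User-space sketch of the co-routine switch from section 2.1.
     * swapcontext() stands in for the small assembler switch routine.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    #define FS_STACK_SIZE (64 * 1024)

    struct fs_thread {
        ucontext_t ctx;                  /* saved register state          */
        char       stack[FS_STACK_SIZE];
    };

    static ucontext_t        poller_ctx; /* context of the polling thread */
    static struct fs_thread *current;    /* FS thread currently running   */

    /* thread_yield(): hand the CPU back to the polling loop. */
    static void fs_thread_yield(void)
    {
        swapcontext(&current->ctx, &poller_ctx);
    }

    /* Body of a hypothetical FS thread: do a little work, then yield. */
    static void fs_thread_body(void)
    {
        for (int i = 0; i < 3; i++) {
            printf("fs thread: step %d\n", i);
            fs_thread_yield();
        }
    }

    int main(void)
    {
        struct fs_thread *t = malloc(sizeof(*t));

        /* thread_create(): set up a stack and an entry point. */
        getcontext(&t->ctx);
        t->ctx.uc_stack.ss_sp   = t->stack;
        t->ctx.uc_stack.ss_size = sizeof(t->stack);
        t->ctx.uc_link          = &poller_ctx;  /* thread_exit() target */
        makecontext(&t->ctx, fs_thread_body, 0);

        /* The polling loop: run the runnable thread as one routine. */
        for (int round = 0; round < 4; round++) {
            printf("poller: round %d\n", round);
            current = t;
            swapcontext(&poller_ctx, &t->ctx);
        }
        free(t);
        return 0;
    }

The polling loop dispatches a runnable thread by swapping into its context, and the thread hands the CPU back by swapping to the poller, which is exactly the cooperative discipline the EEE threads obey today.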
2.2. Non-blocking Non-failing Memory Manager

The non-blocking non-failing property of the file system memory manager will be achieved using roughly the same techniques as are used now, with some enhancements. Rather than checking for available memory before request creation, the polling threads will check for available memory before executing the next batch of active threads/polling functions/ACPU requests. The memory manager will be instructed to keep a significant chunk of memory, on the order of 100+ MB, free at all times. If the free memory falls below the threshold, the execution of the FP and ACPU code will be temporarily suspended until the kernel comes up with free memory. If no free memory appears within a predefined amount of time (~4 seconds), the box will be rebooted. This allows the FS and ACPU memory requests to be satisfied without blocking or failing. A sketch of the gate appears at the end of this document.

2.3. Yielding Polling Threads

Despite being called polling threads out of habit, the FP and ACPU threads will be required to yield the CPU under the Linux architecture to allow the user code to run. Instead of polling continuously like the current code does, the polling threads will go to sleep if no work can be done. All the routines that produce work for the polling threads will be modified to wake the polling thread when the corresponding input queue goes from empty to not-empty; examples of such routines are sending an event to a thread, putting a packet on the ACPU input queue, and detecting a qlogic interrupt. New routines will be added to process timer and qlogic interrupts. A sketch of the wake-up rule also appears at the end of this document.

The polling threads will also yield the CPU without going to sleep after running for longer than a predefined time quantum, to allow effective sharing of the CPU with the user code. The user code will be preempted after using its quantum, as normally happens under the Linux scheduler.

2.4. CPU_PRIVATE Data

The CPU_PRIVATE data will be per-thread rather than per-core. It will be implemented as thread-specific data associated with the polling threads. The syntax is expected to change: we will no longer be able to use the direct variable reference hack, and references to CPU_PRIVATE variables will have to use function-call syntax (see the sketch at the end of this document).

2.5. Debuggability

2.5.1. KGDB and Core Dumps

The Linux core dump and interactive debugging code will have to be made aware of the file system threads. The current EEE gdb thread support and the existing gdb modifications for the file system threads can be consulted for ideas on how this is done.

2.5.2. Volume Exception Dumps

The volume exception supporting code, namely the user space utility that rewrites the exception dump into core dump format, will need to be updated to work with the current versions of GCC and GDB. I believe the GDB format has changed slightly between revisions, but I may be mistaken here; at a minimum we will need to make sure that the code still works.

2.5.3. Thread Stacks Dump

The thread stacks dump feature will be supported just as it is now. The FP threads will have the same feature, allowing the threads to be suspended by setting a global flag and the thread stacks to be dumped to the log once the threads are suspended. The only difference is that the suspend flags will be per-thread instead of per-core.
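To make the gate described in section 2.2 concrete, here is a minimal compilable sketch. All of the names (mm_free_bytes, mm_wait_for_memory, fp_run_batch, box_reboot) are hypothetical stand-ins for the real memory manager and EEE entry points; only the 100+ MB threshold and the ~4 second reboot deadline come from the text above.

    /*
     * Sketch of the free-memory gate from section 2.2.  The stubs below
     * simulate the kernel gradually freeing pages; in the real system
     * mm_free_bytes() would query the memory manager and
     * mm_wait_for_memory() would sleep until the kernel reclaims pages.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define MM_FREE_THRESHOLD (100ULL << 20)  /* keep 100+ MB free      */
    #define MM_REBOOT_TIMEOUT_SEC 4           /* ~4 s before giving up  */

    static uint64_t fake_free = 64ULL << 20;  /* simulated free memory  */

    static uint64_t mm_free_bytes(void)  { return fake_free; }
    static void mm_wait_for_memory(void) { fake_free += 32ULL << 20; }
    static void fp_run_batch(void)
    {
        printf("running one batch of FS threads\n");
    }
    static void box_reboot(const char *why)
    {
        fprintf(stderr, "REBOOT: %s\n", why);
        exit(1);
    }

    /* The gate each polling thread runs before the next batch. */
    static void fp_poll_once(void)
    {
        time_t start = time(NULL);

        /* Suspend FP/ACPU work until the kernel comes up with memory. */
        while (mm_free_bytes() < MM_FREE_THRESHOLD) {
            if (time(NULL) - start >= MM_REBOOT_TIMEOUT_SEC)
                box_reboot("free pool not replenished in time");
            mm_wait_for_memory();
        }

        /* Enough memory is proven to be available, so every allocation
         * made by the batch below can neither block nor fail. */
        fp_run_batch();
    }

    int main(void)
    {
        fp_poll_once();
        return 0;
    }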
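Similarly, the empty-to-not-empty wake-up rule from section 2.3 can be sketched with a pthread condition variable. The queue is reduced to a depth counter, and the function names are hypothetical.

    /*
     * Sketch of the wake-up rule from section 2.3: producers wake the
     * polling thread only when the input queue goes empty -> not-empty.
     */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_wake = PTHREAD_COND_INITIALIZER;
    static int q_depth;             /* stand-in for a real input queue */

    /* Producer side: called from routines such as esm_deliverEvent(),
     * the ACPU enqueue path, or the qlogic interrupt handler. */
    void input_queue_put(void)
    {
        pthread_mutex_lock(&q_lock);
        bool was_empty = (q_depth == 0);
        q_depth++;
        pthread_mutex_unlock(&q_lock);

        /* Wake the poller only on the empty -> not-empty transition. */
        if (was_empty)
            pthread_cond_signal(&q_wake);
    }

    /* Poller side: go to sleep only when no work can be done. */
    void poller_wait_for_work(void)
    {
        pthread_mutex_lock(&q_lock);
        while (q_depth == 0)
            pthread_cond_wait(&q_wake, &q_lock);
        q_depth--;                  /* claim one unit of work */
        pthread_mutex_unlock(&q_lock);
    }

Signaling only on the empty-to-not-empty transition keeps the producers cheap in the common case where the poller is already awake and draining the queue.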
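Finally, the function-call syntax anticipated in section 2.4 could look roughly like the following thread-specific-data sketch. The CPU_PRIVATE_GET macro and the cpu_private structure are invented here purely for illustration.

    /*
     * Sketch of per-thread CPU_PRIVATE data from section 2.4, built on
     * pthread thread-specific data.  CPU_PRIVATE_GET() is a hypothetical
     * replacement for the old direct variable reference hack.
     */
    #include <pthread.h>
    #include <stdlib.h>

    struct cpu_private {            /* one instance per polling thread */
        int  foo;
        long stats[8];
    };

    static pthread_key_t  cpu_private_key;
    static pthread_once_t cpu_private_once = PTHREAD_ONCE_INIT;

    static void cpu_private_make_key(void)
    {
        pthread_key_create(&cpu_private_key, free);
    }

    /* Called once by each polling thread at start-up. */
    void cpu_private_init(void)
    {
        pthread_once(&cpu_private_once, cpu_private_make_key);
        pthread_setspecific(cpu_private_key,
                            calloc(1, sizeof(struct cpu_private)));
    }

    /* Function-call syntax replacing the direct variable reference. */
    #define CPU_PRIVATE_GET(field) \
        (&((struct cpu_private *)pthread_getspecific(cpu_private_key))->field)

    void example_use(void)
    {
        *CPU_PRIVATE_GET(foo) += 1;  /* was: a direct CPU_PRIVATE reference */
    }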