Sunday, August 3, 2014

OPW, Linux: Hunting Bugs with OOPSes

Since writing the performance-related post seems to be taking me the full two weeks (it'll be reviewed by my very helpful OPW mentor, as it might otherwise be too much of a mess), here I am writing about something I've been dealing with almost daily during the past three months: kernel OOPSes. Let's see what a kernel OOPS is, how to read it, how to produce one intentionally, why on Earth you would want to do that, and why being able to squeeze every bit of useful information out of one is essential to debugging kernel code.

Please do understand that everything written on this blog is just my personal opinion, and a reasonably mild one at that, meaning that, if you prove me wrong, I won't argue but thank you instead. So, if you have any criticism and the time to condense it into a comment, please do.


Debugging the kernel adds some extra fun compared to debugging a userland application. Although kernel-level debuggers exist, executing kernel code inside a debugging environment might not be particularly comfortable, nor even useful: kernel bugs might be (and often are) tied to specific timings which, when changed, hide the bug from the programmer's sight. Also, kernel code is complex and performance-critical, which doesn't get along well with placing breakpoints or stepping through the code. The approach usually followed when debugging kernel code is tracing its execution, restricting the trace to events which are thought to be related to the failure. The main issue with this approach is that you must already have an idea of the kernel subsystem where the failure happens, so that you can restrict tracing to that particular subsystem: tracing adds overhead, however small, to the kernel's execution. This also implies that tracing can't always be enabled, especially when testing kernel code meant for production. Finally, tracing is certainly a powerful tool, but it is relatively easy to use only when the bug is already known to exist; spotting an anomaly in a kernel trace when you don't even know whether there is one is probably more than the average human can handle.
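
As a concrete (and purely illustrative, not from my own workflow for this bug) example of restricting a trace to one subsystem: with ftrace's per-subsystem trace events, and assuming debugfs is mounted at /sys/kernel/debug, tracing only the block layer might look like this (as root):

# echo 1 > /sys/kernel/debug/tracing/events/block/enable
# cat /sys/kernel/debug/tracing/trace
# echo 0 > /sys/kernel/debug/tracing/events/block/enable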

So, how do we debug kernel code? Luckily, the Linux kernel has been designed to automatically provide debugging aid in case of a failure. That aid comes in the form of kernel OOPSes. A kernel OOPS is an error message produced by the kernel when it detects that it is deviating from its intended behavior. If the deviation is disruptive and/or irrecoverable, the OOPS might be closely followed by a kernel panic, which completely halts the kernel's execution. An example of a disruptive deviation is one that would cause data corruption; a deviation is deemed irrecoverable when the kernel simply doesn't know how to handle it.
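
To get a feel for what produces an OOPS, here is a minimal, purely illustrative sketch (mine, with made-up names, not code from any real driver): a throwaway module whose init function dereferences a bogus pointer, which is the classic way to provoke an OOPS on purpose. Needless to say, don't load it on a machine you care about.

#include <linux/init.h>
#include <linux/module.h>

static int __init oops_demo_init(void)
{
        /* A store through a bogus, unmapped address: the kernel can't
         * handle the resulting page fault and prints an OOPS for the
         * task that is loading the module. */
        *(volatile int *)0x10 = 42;

        return 0;
}

static void __exit oops_demo_exit(void)
{
}

module_init(oops_demo_init);
module_exit(oops_demo_exit);
MODULE_LICENSE("GPL");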

What do we need from a kernel error message?
As users, we are accustomed to cryptic error messages providing nothing more than a kind notification that "something bad has happened". As developers, what pieces of information would we want to have, instead?
  • What kind of failure the software has encountered.
  • When it happened, relative to the start of the software's execution.
  • What the software was doing when the failure occurred, in as much detail as possible.
  • What part of its state was involved in the failure.
Luckily enough, a kernel OOPS gives us exactly these pieces of information. Let's see an example of such an error message (and yes, that's directly from something I messed up while developing the OPW project I'm working on - shame on me). I got it from a serial console configured on my machine and hooked up to a host-USB-capable Android phone through a serial-to-USB adapter.

[    6.308218] ------------[ cut here ]------------
[    6.308225] kernel BUG at block/blk-core.c:2254!
[    6.308231] invalid opcode: 0000 [#1] PREEMPT SMP 
[    6.308240] Modules linked in: ext4 crc16 mbcache jbd2 xen_kbdfront xen_netfront xen_fbfront syscopyarea sysfillrect sysimgblt fb_sys_fops
[    6.308268] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.15.0-rc5-ARCH-xen #96
[    6.308278] task: ffffffff81a16500 ti: ffffffff81a00000 task.ti: ffffffff81a00000
[    6.308286] RIP: e030:[<ffffffff81269780>]  [<ffffffff81269780>] blk_dequeue_request+0x80/0xa0
[    6.308307] RSP: e02b:ffff88001f803d28  EFLAGS: 00010046
[    6.308314] RAX: ffff88001ba14bc0 RBX: ffff88001ba14bc0 RCX: ffff88001d280000
[    6.308321] RDX: 0000000000000175 RSI: 0000000000000000 RDI: ffff88001ba14bc0
[    6.308328] RBP: ffff88001f803d38 R08: 0000000000000000 R09: 0000000000000b6b
[    6.308336] R10: ffff88001f818d70 R11: 0720072007200720 R12: 0000000000000000
[    6.308343] R13: ffff88001d280000 R14: ffff88001ba14bc0 R15: ffffffff817e3fda
[    6.308356] FS:  00007f79846c7700(0000) GS:ffff88001f800000(0000) knlGS:0000000000000000
[    6.308365] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[    6.308371] CR2: 00007f79846c9000 CR3: 0000000001a0f000 CR4: 0000000000002660
[    6.308382] Stack:
[    6.308386]  ffffffff812697b2 ffff88001e0f6a00 ffff88001f803d88 ffffffff813af734
[    6.308396]  0000000000000155 ffff88001d6690e0 0000000000000000 ffff88001d280000
[    6.308408]  0000000000000155 ffff88001d6690e0 0000000000000000 ffff88001d668000
[    6.308419] Call Trace:
[    6.308425]  <IRQ> 
[    6.308429]  [<ffffffff812697b2>] ? blk_start_request+0x12/0x60
[    6.308448]  [<ffffffff813af734>] do_blkif_request+0x74/0x200
[    6.308456]  [<ffffffff81265ec3>] __blk_run_queue+0x33/0x40
[    6.308463]  [<ffffffff81266079>] blk_start_queue+0x29/0x50
[    6.308471]  [<ffffffff813b0425>] kick_pending_request_queues+0x35/0x50
[    6.308479]  [<ffffffff813b0eb4>] blkif_interrupt+0xa74/0xab0
[    6.308491]  [<ffffffff810c7d5e>] handle_irq_event_percpu+0x3e/0x1f0
[    6.308501]  [<ffffffff810c7f4d>] handle_irq_event+0x3d/0x60
[    6.308509]  [<ffffffff810cad46>] handle_edge_irq+0x66/0x130
[    6.308517]  [<ffffffff810c7438>] generic_handle_irq+0x28/0x40
[    6.308527]  [<ffffffff813472e1>] evtchn_fifo_handle_events+0x151/0x160
[    6.308539]  [<ffffffff81343cb0>] __xen_evtchn_do_upcall+0x50/0xb0
[    6.308547]  [<ffffffff81345e63>] xen_evtchn_do_upcall+0x33/0x50
[    6.308558]  [<ffffffff8151e8fe>] xen_do_hypervisor_callback+0x1e/0x30
[    6.308564]  <EOI> 
[    6.308568]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[    6.308582]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[    6.308594]  [<ffffffff8100b070>] ? xen_safe_halt+0x10/0x20
[    6.308603]  [<ffffffff8101e54f>] ? default_idle+0x1f/0x100
[    6.308611]  [<ffffffff8101ef1f>] ? arch_cpu_idle+0xf/0x20
[    6.308621]  [<ffffffff810b1fc8>] ? cpu_startup_entry+0x2b8/0x4f0
[    6.308633]  [<ffffffff815042a4>] ? rest_init+0x84/0x90
[    6.308645]  [<ffffffff81af3ea6>] ? start_kernel+0x448/0x469
[    6.308654]  [<ffffffff81af34d7>] ? x86_64_start_reservations+0x2a/0x2c
[    6.308663]  [<ffffffff81af6076>] ? xen_start_kernel+0x577/0x583
[    6.308669] Code: 84 81 3c 05 00 00 01 65 ff 04 25 e0 c7 00 00 e8 07 3d db ff 48 89 83 e8 00 00 00 65 ff 0c 25 e0 c7 00 00 74 16 5b 5d f3 c3 66 90 <0f> 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 e8 6b 4c db ff 
[    6.308749] RIP  [<ffffffff81269780>] blk_dequeue_request+0x80/0xa0
[    6.308758]  RSP <ffff88001f803d28>
[    6.308765] ---[ end trace 500f7dcca1bd6a9e ]---

First of all, look at the left column: that's the time elapsed since the kernel started executing. Let's move on and have a look at the first section, which tells us what happened: it seems there was a kernel BUG at block/blk-core.c:2254! That means the programmer explicitly defined some state of the kernel as a failing state and instructed the kernel to panic when entering it. We'll see how to do that later. The following lines tell us which modules were loaded in the kernel when it crashed, which CPU the crashing thread of execution was running on, and which task was involved in the crash (the swapper, which probably tried to do I/O through the buggy driver).
The next part is the dump of all of the machine's registers, which also pins down the exact point where the code crashed. This information comes from the instruction pointer register (EIP or RIP, depending on whether the machine is 32- or 64-bit), which holds the address of the instruction being executed, and from the map of the kernel's symbols, which makes it possible to locate and print the human-readable identifier of the function that was running and the offset inside it.
The last section gives us an idea of what the kernel was doing when the failure happened: the OOPS contains a full dump of the kernel-space stack, along with a complete call trace. The accuracy of such a trace might be dubious, and the kernel signals this by printing a question mark next to the tentative elements of the call trace. Note that the value of the instruction pointer, along with the identifier of the function being executed, is also reported at the end of the trace: it is of such importance that it must be there even if the OOPS gets cut short for some reason.

Now, the question is: how do we actually get from this load of information to a concrete pointer to the code that failed? Yeah well, we have it noted up there: it was block/blk-core.c at line 2254. But let's assume we didn't know that, since some kinds of failures might not be so kind as to print out such relevant information about the failing code.

First of all, it's very important that our kernel image is compiled with debugging symbols enabled (CONFIG_DEBUG_INFO=y, which you can find in menuconfig under "Kernel hacking > Compile-time checks and compiler options > Compile the kernel with debug info"). After that, we just need two very basic tools: the objdump utility and the kernel's map of symbols; we can find the former in any Linux distribution's official repositories, and the latter among the outputs of the kernel make process, as the System.map file.
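
Just as a sanity check on the build tree (a habit of mine, not a step from the procedure itself), you can verify that the option actually ended up in the configuration:

$ grep '^CONFIG_DEBUG_INFO=' .config
CONFIG_DEBUG_INFO=y
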
Let's try to get from the above-reported OOPS some more info about what happened. The first step is getting the start address of the blk_dequeue_request() function from the System.map file.

$ grep blk_dequeue_request System.map
ffffffff81269870 T blk_dequeue_request

Seems like we have it. Let's now add the offset reported in the OOPS to that start address. The offset is the hexadecimal number after the plus sign. If you're wondering about it, the second number, after the slash, is the size of the function: if the offset were greater than the size, it would mean the OOPS is corrupted. It's kind of an integrity check.
But let's stick with the offset.

ffffffff81269870h + 80h = ffffffff812698f0h
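
If you don't feel like summing hexadecimal numbers by hand, the shell can do it for you; a little trick I use (assuming a 64-bit bash), not something this procedure depends on:

$ printf '%x\n' $((0xffffffff81269870 + 0x80))
ffffffff812698f0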

Now we have the address of the exact instruction the kernel was executing when it crashed. We just need to have a look at the kernel code with the objdump utility, which can disassemble a Linux ELF executable. With the --source option, it also interleaves the assembly with the original source code, therefore giving immediate access to the original listing. Disassembling the whole kernel would produce a very long output and also take a few minutes, so we give the disassembler an address to start analyzing the code from. A good starting point is the start address of the function itself.

objdump --disassemble --source --start-address=0xffffffff81269870 vmlinux

Note that "0x" at the beginning of the start address. Trivial enough, but can drive you mad when you don't know it's needed and you're already grumbling on an annoying bug. Also, it's usually very much comfortable if you pipe the output to some text viewer, such as less, or redirect it to a file so that you can go though it in your own time. Anyway, the output should look like the following.

vmlinux:     file format elf64-x86-64

Disassembly of section .text:

ffffffff81269870 <blk_dequeue_request>:
        return rq;
}
EXPORT_SYMBOL(blk_peek_request);

void blk_dequeue_request(struct request *rq)
{
ffffffff81269870:       e8 8b ae 2a 00          callq  ffffffff81514700 <__entry_text_start>
 * list_empty - tests whether a list is empty
 * @head: the list to test.
 */
static inline int list_empty(const struct list_head *head)
{
        return head->next == head;
ffffffff81269875:       48 8b 07                mov    (%rdi),%rax
...

That looks very promising, doesn't it? Let's jump to the address of the offending instruction (ffffffff812698f0) and then go back one line to find out the C instruction involved in the panic.

void blk_dequeue_request(struct request *rq)
{
        struct request_queue *q = rq->q;

        BUG_ON(list_empty(&rq->queuelist));
ffffffff812698f0:       0f 0b                   ud2    
ffffffff812698f2:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
        BUG_ON(ELV_ON_HASH(rq));
ffffffff812698f8:       0f 0b                   ud2    
ffffffff812698fa:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
        preempt_enable();
ffffffff81269900:       e8 fb 4a db ff          callq  ffffffff8101e400 <___preempt_schedule>
         */
        if (blk_account_rq(rq)) {
                q->in_flight[rq_is_sync(rq)]++;
                set_io_start_time_ns(rq);
        }
}

So, there it is: BUG_ON(list_empty(&rq->queuelist))! The name of the C macro itself seems to refer to a bug or something like that...
In fact, we hit an invariant expression (or assertion) that explicitly defines a state of the kernel's execution as a bug. Go straight to the next section if you want to learn more about assertions and why they are of capital importance when writing kernel code. Here I just want to point out something that my advisor showed me a few years ago: you actually don't need to go through the whole symbol-map grepping and ELF disassembly routine. You just need the gdb debugger (and, of course, a kernel compiled with debugging info).

$ gdb vmlinux
GNU gdb (GDB) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
<snip>
Reading symbols from vmlinux...done.
(gdb) l *(blk_dequeue_request+0x80)
0xffffffff812698f0 is in blk_dequeue_request (block/blk-core.c:2254).
2249
2250 void blk_dequeue_request(struct request *rq)
2251 {
2252 struct request_queue *q = rq->q;
2253
2254 BUG_ON(list_empty(&rq->queuelist));
2255 BUG_ON(ELV_ON_HASH(rq));
2256
2257 list_del_init(&rq->queuelist);
2258

The l command (list in its extended form) asks the debugger to print the source code at a given location, here a function name plus an offset. And gdb does all the boring work, leaving us with nothing but the debugging fun.
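
For completeness, and again as my own addition rather than part of the original recipe: binutils' addr2line performs the same address-to-line translation directly, as long as vmlinux carries debug info (the path prefix in its output depends on where the kernel was built):

$ addr2line -e vmlinux ffffffff812698f0
block/blk-core.c:2254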


A side note about invariant expressions
So, we were talking about disruptive and irrecoverable anomalies. One important difference between these two kinds of panic-provoking situations is that the first must be explicitly defined by the programmer with run-time assertions, while the second is detected by the kernel itself.
An assertion in the kernel works exactly like a userspace assertion, with the difference that its effect might be the code completely halting its execution, or just printing out an OOPS. The OOPS provoked by an assertion is the kind of OOPS that very helpfully prints out the name of the source file and the line where the buggy state was encountered. The BUG_ON() macro lets the programmer explicitly define a critical condition that, when it becomes true, represents a bug that can't simply be worked around. A milder version of that macro is WARN_ON(), which just prints an OOPS to the kernel's log without halting execution. When using WARN_ON()s it might happen that the OOPSes completely flood the kernel logs, if the anomalous state persists or if the expression used for the assertion is not as specific as it could be. In those cases it may be better to use the WARN_ON_ONCE() macro, which prints the OOPS only the first time the buggy state is encountered; if you instead find it useful to keep getting OOPSes while the error condition persists, you can use WARN_ON_RATELIMIT(), which limits the rate at which OOPSes are printed, using a ratelimit state defined with the DEFINE_RATELIMIT_STATE() macro.
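
Here is a minimal sketch of how these macros might be used together; it's an illustration of mine with made-up names (my_request, my_dequeue, MY_MAX_RETRIES), not code from the kernel or from my project:

#include <linux/bug.h>
#include <linux/jiffies.h>
#include <linux/list.h>
#include <linux/ratelimit.h>

#define MY_MAX_RETRIES 5

struct my_request {
        struct list_head queuelist;
        int retries;
        int in_flight;
};

/* allow at most a burst of 3 warnings every 5 seconds */
static DEFINE_RATELIMIT_STATE(my_warn_rs, 5 * HZ, 3);

static void my_dequeue(struct my_request *rq)
{
        /* Fatal: a request that is not on any queue here means corrupted
         * state, so OOPS (and possibly panic) right away. */
        BUG_ON(list_empty(&rq->queuelist));

        /* Suspicious but survivable: print an OOPS-style warning only the
         * first time this ever happens. */
        WARN_ON_ONCE(rq->retries > MY_MAX_RETRIES);

        /* Possibly persistent anomaly: keep warning, but rate-limited. */
        WARN_ON_RATELIMIT(rq->in_flight < 0, &my_warn_rs);

        list_del_init(&rq->queuelist);
}

The rule of thumb, as said above, is: BUG_ON() for states you cannot safely continue from, and the WARN_ON() family for everything you just want to hear about.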

Using assertions in kernel code (as in userspace applications) is instrumental in spotting anomalies as soon as they show up. When testing code of such complexity, the sooner it panics, the better, as the true cause of the anomaly is very likely to be hidden, shortly afterwards, by the effects of the anomaly itself. Using assertions lets us get the most out of the triggered error messages, as every bit of information gathered just after the anomaly happened will be relevant to debugging and of actual help; those same error messages would instead end up being confusing once tons of side effects have piled up on top of the actual problem.


Sources
Bug hunting in the Linux kernel
Manual of the gdb debugger
R. Love, "Linux Kernel Development" (3rd edition), Chapter 18, "Debugging"
Remote Serial Console Howto
Manual of the netconsole module
