Tuesday, August 26, 2014

OPW: The Xen Project Developer Summit

So, about halfway through my OPW internship, I was informed that my wonderful mentor, Konrad Wilk, and Xen Project Community Manager Lars Kurth thought to allow me to attend the Xen Project Developer Summit that was to be held in Chicago on the 18th and 19th of August. I actually went there, had fun and learned a lot: now it's time to write a blog post about it!

The Xen Project Developer Summit

The keynote session featured talks from Xen community and development coordinators, which aimed to provide an insight of the main improvements and of the new features introduced in the Xen Project. Lars Kurth described the challenges of coordinating such a large and growing project and the efforts made towards better cooperation. George Dunlap, release coordinator for Xen 4.3 and 4.4, and Konrad Rzeszutek Wilk, that took over the task for the upcoming 4.5, detailed the main improvements made to the Xen hypervisor during the last development cycles. They also explained and commented on the rationale behind the feature selection work performed by release coordinators and some of the toughest decisions that are required by their duty.

Linux x86 co-maintainer David Vrabel provided a status update on the newest features added to Linux domain support and set up a wishlist for future features to be developed. During the keynote session I had the chance to meet a former Xen Project intern, Elena Ufimtseva, who worked on the implementation of vNUMA in Xen during the 2013 OPW round, mentored by Citrix Senior Engineer Dario Faggioli. She also presented her work in 2013 at the Xen Developer Summit. I could gather details and her opinion about her experience with OPW and in finding a job after the end of the internship, and I could discuss with her some of the challenges experienced by an intern and the best ways to exploit those difficult aspects by turning them into stronger development skills.

The following talks provided in-depth details on the main research aspects of Xen development. Some of them covered the performance of Xen with respect to networking, such as Jun Nakajima's talk on the main bottlenecks found while experimenting with Xen as a Network Virtualization Functions platform component and the solutions that were implemented by Intel. Other talks focused on storage, as Felipe Franciosi's insight on memory grant technologies available in Xen that can contribute to optimising aggregate workloads of several GB/s per guest (he actually allowed me to take part in the BoF session that followed his talk, therefore giving me the chance to hear further opinions and learn even more on the storage performance achieved by Xen guests). Still on the same trend, Filipe Manco presented NEC Europe's work towards tracking down performance limitations and bottlenecks that increase startup latencies of Xen guests, when they are run in bursts of thousands; he also proposed a prototype reimplementation of some Xen components to prove his points. Anil Madhavapeddy showed the benefits of the new Irminsule distributed transactional filesystem, that allows to handle storage accesses in a version control system fashion, letting unikernels running in isolated stubdomains, such as MirageOS, use a common and consistent API. More talks covered security aspects of virtualization, as Mihai Dontu's presentation, that proposed a zero-footprint implementation of memory introspection for Xen domUs that can allow a supervisor domain to perform run-time detection of malware on Xen-based guests; James Bielman described Galois' implementation of Mandatory Access Control for the Xenstore, showing how it can be managed by a centralized security server as it does not benefit from the XSM security policy. James Fehlig's talk, instead, covered the important topic of virtualization management tools, providing an overview of libvirt, a status update on the libxenlight driver and a roadmap proposal. Moving on to the topic of architecture and hardware support, Daniel Kiper approached the subject of EFI, outlining how Xen efficiently uses its infrastructure and what can be improved in the support provided by the hypervisor. Wei Liu instead described the status of vNUMA support in Xen, giving an in-depth report of its implementation and of its importance with relevant statistics.

The main session opened with a detailed overview of the Verizon Cloud architecture provided by Don Slutz, which described what features are used and the optimization it provides to both Xen and QEMU. It also featured a report on the Linux kernel delta that SUSE supports for Xen and a proposal on how to address it, delivered by Luis Rodriguez. Following another trend were some Xen-on-ARM-related talks, as the presentation by Stefano Stabellini, that provided an insight on the current state of the project and how it performs on the newest ARMv8 64-bit platforms, and the one by Julien Grall, which detailed the process of porting an OS as a Xen-on-ARM guest. Jonathan Daugherty also described, in his talk, his experience in porting FreeRTOS to Xen on a Cortex A15-based platform. More talks were performance-related, as Zoltan Kiss' presentation on network improvements made in XenServer and Feng Wu's on Intel's work on introducing interrupt posting with its virtualization technology. John Else explained his work about efficient inter-domain communication of performance data, his findings about the XenStore being the bottleneck in the current technique and proposed a lock-free, efficient solution to the issue. Talks also included the relevant topic of testing for a software ecosystem as complex as the Xen one: Ian Jackson presented Xen's automatic testing facility, osstest, outlining its last development steps and the wider set of configurations it now supports. Some of the talks were related, instead, to safety aspects of using Xen in an environment with real-time constraints. Nathan Studer and Robert VanVossen presented DornerWorks' efforts on certifying Xen for automotive, medical and avionics, the challenges behind the task, a proposed roadmap to overcome the most tricky aspects and the current state of the project. Sisu Xi described the Washington University's work on RT-Xen with the aim of combining real-time and virtualization. Willing to give, instead, a more detailed insight on unikernels, Adam Wick outlined their features and described the general rules that establish whether a unikernel is the right choice for a software infrastructure component. Glauber Costa introduced the topic of LibraryOSs, highlighting their benefits in terms of performance, lightness and scalability, describing which applications they support and how can prove to be useful to the Xen community. Philip Tricca explained the drawbacks of the static configuration used to isolate system components in OpenXT, a collection of hardened Linux-on-Xen virtual machines providing a user platform for client devices, and a new toolkit to enhance the platform's flexibility.

During the main session I met some of my fellow OPW interns. I had the chance to talk to the brilliant Mindy Preston, who worked on MirageOS's network stack fixing bugs and implementing missing RFCs, about her experience and exchange opinions about ARM-based boards. I had the chance to take part in the final OPW/GSoC-related panel with her; it also featured the very professional GSoC intern Jyotsna Prakash, who worked on cloud API support for MirageOS by implementing cloud API bindings for OCaml, along with some of our mentors and Lars Kurth as a host. The panel gave us interns the chance to provide feedback to our mentors and to the program's organization and to express our opinion about what we learned from it. It also covered very important aspects of participating in a large open-source project within a heterogeneous and just as big community: George Dunlap thoroughly explained the lights and shadows of Linus Torvalds' approach to commenting bad code, while Konrad Wilk delivered a thoughtful insight about how cultural differences can influence the interaction between developers during software review.

What Did I Learn

Being able to attend the conference was a highly educational experience. It allowed me to get a better idea of how the community is organized, to get involved even more and hear about the experience of other attendees. I also could benefit from my mentor's advice on how to interact with other developers. Having to speak in front of an audience also has always been one of the aspects of working on a project that I feared the most; the chance to take part in the panel and my mentor's very useful advice make a huge addition to my experience and will allow me to fully exploit the opportunity to share my findings and my enthusiasm with others on future occasions.

As a final note, I'd like to thank my very patient mentor, Konrad Wilk, for allowing me to take part in OPW (even if I applied to him as a candidate on the very last day before the deadline) and for his invaluable guidance during the program; I'd like to thank also the GNOME Foundation and Xen Community Manager Lars Kurth for granting me the opportunity to attend the conference, and Elena Ufimtseva for giving me the benefit of her own experience. Last, but not least, I'd like to thank my always so helpful advisor, Paolo Valente, and Citrix Senior Engineer Dario Faggioli for introducing me to the internship program.


This report has been re-posted by The Xen Project on their blog!
Xen Developer Summit - Chicago, 2014 - Schedule
Slides used for many of the talks (Xen Project's official SlideShare)
[ Here there will be a link to videos as soon as they're available ]

Friday, August 8, 2014

OPW, Linux: Profiling a Xen Guest

Here is my take at describing how have I been using some of the most popular tools when it comes to profiling. My OPW mentor suggested me to make use of these tools while looking into locking issues I had to deal with in the context of my summer project. I'll also try to include some hints that my mentor provided and should allow to gather more accurate data.
If you have any criticism or suggestion and the time to write about it, please do.

OPW, Linux: Profiling a Xen Guest

The first tool I had fun with is perf, a tool which abstracts away hardware differences of the CPUs supported by Linux and allows to use performance counters for profiling purposes. Performance counters are CPU registers used to count events that are relevant to profiling: instructions executed, cache events, branch prediction performance; they are instrumental to profiling applications as they allow to trace their execution and identify hot spots in the executed code.

The perf tool works on top of those hardware tools and uses them as building blocks to keep also per-task, per-CPU and per-workload counters, also enabling for source code instrumentation, letting the developer gather accurate annotated traces. Tracing is performed with the kprobes (with ftrace) and uprobes frameworks, which both allow for dynamic tracing, therefore highly reducing the overhead unavoidably added by tracepoints. If you are interested in dynamic tracing you can either read the manual or a previous blog post on ftrace.

From a more practical point of view, the fact that perf relies on ftrace also means that it must be compiled on top of your currently-executing kernel; in fact, it lives in a directory of the kernel source code tree, and can be build by exploiting the kernel make subsystem by issuing:

cd tools/perf

If you want to install the tool, just enter a make install after that. Otherwise, you'll need to use the perf executable which has been created in the tools/perf directory of your development kernel tree.

Then, we're done! Let's find hot spots in the code I'm writing. Let's record a run of the IOmeter emulator built on top of the fio flexible I/O tester and performed from a paravirtualized Xen guest on a physical block device emulated with the null_blk driver. The test script basically starts fio with the IOmeter jobfile. Note that the jobfile specifies libaio as the I/O engine to use for the test; just keep that in mind.

$ cat iometer_test.sh
fio iometer-file-access-server

To make things simple, let's trace all available perf events.

perf record ./iometer_test.sh

Once the command has been executed, we can ask perf to list the recorded tracepoints, ordering them by placing first the ones with the higher overhead. Since the output is very long, I'm making it available in my Dropbox. I'll just paste here the first lines.

# To display the perf.data header info, please use --header/--header-only options.
# Samples: 271K of event 'cpu-clock'
# Event count (approx.): 67840250000
# Overhead    Command       Shared Object                               Symbol
# ........  .........  ..................  ...................................
    24.04%        fio  [kernel.kallsyms]   [k] xen_hypercall_xen_version      
     4.82%        fio  libaio.so.1.0.1     [.] io_submit                      
     3.98%        fio  libaio.so.1.0.1     [.] 0x0000000000000614             
     3.68%        fio  fio                 [.] get_io_u                       
     3.40%        fio  [kernel.kallsyms]   [k] __blockdev_direct_IO           
     2.78%        fio  fio                 [.] axmap_isset                    
     1.93%        fio  [kernel.kallsyms]   [k] lookup_ioctx                   
     1.66%        fio  [kernel.kallsyms]   [k] kmem_cache_alloc               
     1.64%        fio  [kernel.kallsyms]   [k] do_io_submit                   
     1.18%        fio  fio                 [.] _start                         
     1.16%        fio  [kernel.kallsyms]   [k] __blk_mq_run_hw_queue          
     1.08%        fio  [kernel.kallsyms]   [k] blk_throtl_bio                 
     0.91%        fio  [kernel.kallsyms]   [k] blk_mq_make_request            
     0.90%        fio  fio                 [.] fio_gettime                    
     0.90%        fio  fio                 [.] td_io_queue                    
     0.76%        fio  [kernel.kallsyms]   [k] generic_make_request_checks    
     0.72%        fio  [kernel.kallsyms]   [k] pvclock_clocksource_read       
     0.72%        fio  [kernel.kallsyms]   [k] __rcu_read_lock                
     0.71%        fio  [kernel.kallsyms]   [k] aio_read_events                
     0.69%        fio  fio                 [.] add_slat_sample                
     0.67%        fio  fio                 [.] io_u_queued_complete           
     0.66%        fio  [kernel.kallsyms]   [k] generic_file_aio_read          
     0.63%        fio  [kernel.kallsyms]   [k] __fget                         
     0.60%        fio  [kernel.kallsyms]   [k] bio_alloc_bioset               
     0.60%        fio  fio                 [.] add_clat_sample                
     0.59%        fio  [kernel.kallsyms]   [k] blkdev_aio_read                
     0.56%        fio  [kernel.kallsyms]   [k] percpu_ida_alloc               
     0.55%        fio  [kernel.kallsyms]   [k] copy_user_generic_string       
     0.54%        fio  [kernel.kallsyms]   [k] sys_io_getevents               
     0.54%        fio  [kernel.kallsyms]   [k] blkif_queue_rq

First of all, let's look at the fourth column, the one with one character enclosed in square brackets. That columns tells us whether the listed symbol is a userspace symbol (we have a dot in square brackets) or a kernelspace symbol (we see a k character in square brackets).
When it comes to userspace symbols, we can disambiguate whether we're looking at library functions or at user application symbols by glancing at the third column, which shows either the name of the library or the application. The second column, which might mislead us, shows instead the context of execution of the symbol. In this case, most of the execution happened in the context of the fio process, which is the one performing I/O during our tests.
The last column, the one on the right, shows us the name of the symbol, if perf managed to retrieve it. The first column, the one on the left, shows instead the percentage of time in which the list symbol was executing, given that 100% is the amount of time that covered the whole execution of a test.

So, perf did not record just events in the kernel, but also in the library used by fio to perform I/O, which is the libaio one we've seen before. Interesting fact, some of the library calls have a relatively high overhead: for example, the io_submit() function of libaio has been executing for almost the 5% of the test. This doesn't surprise us very much, as we kind of expected to find some hot libaio symbols by just reading Jens Axboe's paper which shows and explains profiling performed on such a test.

The most relevant hotspot, however, is in the kernel, and seems to be the xen_version hypercall, which has been executing for almost a fourth of the test. This seemed to me some kind of mistake until Konrad Wilk told me that the hypercall was probably being executed while grabbing a lock; in fact, the xen_version hypercall is used as a hypervisor callback: it is called to see if an event is pending, and if there is an interrupt is executed on return. That same hypercall, for example, is called whenever the functions spin_lock_irqsave() and spin_unlock_irqrestore() are called. This seems to identify the issue as a locking issue, so...

Debugging contention issues
... to get deeper into the investigation of such an issue, I chose to have a look at lockstat, the lock debugging framework provided by the Linux kernel. Essentially, lockstat provides statistics about the usage of locks, therefore being of great help to idenfity lock contention issues that can heavily impact performance.

First step is enabling it in the kernel's configuration. This can be done by setting CONFIG_LOCK_STAT=y or by browsing the menuconfig to the "Kernel hacking > Lock debugging (spinlocks, mutexes, etc...) > Lock usage statistics" option. As soon as the kernel has been re-compiled and re-installed with the framework enabled, we can enable lock statistics collection with:

$ echo 1 | sudo tee /proc/sys/kernel/lock_stat

From this moment, lock statistics are available in the special procfs file /proc/lock_stat, which we can glance at by issuing:

$ less /proc/lock_stat

An example of the output of that command collected on the frontend guest during the execution of the fio IOmeter test can be found, again, on my Dropbox. It's very very long and would not be much comfortable to read in here, so I'm just commenting it in this blog post.
First, we can see that the output is split in different sections, bordered with dots. Each of the sections contains information about a lock, whose name is the first thing listed there. For example, the first section is about the (&rinfo->io_lock)->rlock. Basically, it's the lock protecting access to the shared data structure used to exchange block I/O requests and responses between the guest domain and the driver domain following Xen's PV protocol. After that, we find a whole set of numbers, whose meaning is explained in the header above. We might be interested in extrapolating the number of contentions (the contentions field), the average time spent in waiting for the lock to be released (waittime-avg), the number of times that the lock has been acquired (acquisitions) and the average time the lock has been held (holdtime-avg). For example, we see that the io_lock has suffered 47500 contentions, has been acquired 1348488 times, and a thread willing to acquire it has been waiting for 25.4 microseconds.
After that batch of information, we find a listing of call traces; this indicated which threads of execution brought to acquiring that lock. Always with reference to our io_lock, we see that the two contenting traces concern the service of an interrupt, probably for a request completion, when the driver queues a response in the shared data structure (blkif_interrupt() is the key here) and the request insertion path, where the driver is queueing a request in the shared data structure (blk_queue_bio(), which is called back as soon as a completion kicks the driver for pending request insertions).

So, this is it. Seems like we've found a possible suspect. Let's make sure there aren't other relevant contention points by just glancing at the stats by ordering them by number of contentions. We can do that very easily if we know that the output is already ordered by number of contentions. Let's just then grep the headers of the sections.

$ grep : /proc/lock_stat

We can see from the output produced by this command that the second-most contented lock (rq->lock) has suffered "just" 1579 contentions during the test, which is an order of magnitude lower than the io_lock.

Tips for more accurate statistics
Just as a side note, I want to list here some tips that my mentor, Konrad Wilk, has given me on collecting more accurate data from profiling tools while dealing with Xen guests, in the hope that they will be useful to others as they have been to me.
Boot the PV guest with vPMU enabled. vPMU is a hardware support to performance monitoring, based on the performance monitoring unit, specifically prepared to help collecting performance data of virtualized operating systems. Exploiting vPMU allows to offload a huge overhead from the hypervisor, therefore allowing for more accurate data. Using it is fairly straight-forward, and involves having the vpmu=1 option in the guest's configuration file.
Use a driver domain. This removes having the dom0 as a bottleneck when using more than one guest, as the backend is located in the driver domain. This also allows to keep roles separated, and to have a specific-purpose domain just for the handling of block I/O.
Use a HVM domain as driver domain. This removes the overhead of PV MMU-related hypercalls and should allow for cleaner data.

Brendan Gregg, "Perf examples"

Sunday, August 3, 2014

OPW, Linux: Hunting Bugs with OOPSes

Since writing the performance-related post seems to be taking me the full two weeks (it'll be reviewed by my very helpful OPW mentor, as it might otherwise be too much of a mess), here I am writing about something I've been dealing with almost daily during the past three months: kernel OOPSes. Let's see what a kernel OOPS is, how to read it, how to produce it intentionally, why on Earth would you want to do that, and why being able to squeeze out of it any bit of useful information is essential to debugging kernel code.

Please do understand that everything that's written on this blog is just my personal opinion, and even a reasonably mild one, meaning that, if you prove me wrong, I'll not argue but instead thank you. So, if you have any criticism and the time to condense it in a comment, please do.

Linux, OPW: Hunting Bugs with OOPSes

Debugging the kernel adds some extra fun with respect to debugging a userland application. Although kernel-level debuggers exist, executing kernel code inside a debugging environment might not be particularly comfortable nor useful: in fact, kernel bugs might (and often are) tied to specific timings that, when changed, hide the bug from the programmer's sight. Also, kernel code is very complex and fast, which might not get along well with placing breakpoints or stepping in the code. The approach which is usually followed when debugging kernel code is tracing the execution of such code, restricting the trace to events which are thought to be related to the failure. The main issue with this approach is that you must already have an idea of the kernel subsystem where the failure happens, so that you can restrict tracing to that particular subsystem: in fact, tracing adds an overhead, however small it is, to the kernel's execution. This also implies that tracing can't be always enabled, especially when testing kernel code for production. Finally, tracing is certainly a powerful tool, but, well, it is relatively easy to use when the bug is already known to exist; finding an anomaly in a kernel trace when you don't know if there is one could be probably more complex than the average human could handle.

So, how do we debug kernel code? Luckily, the Linux kernel has been designed to automatically provide debugging aid in case of a failure. Such an aid is represented by kernel OOPSes. A kernel OOPS is an error message produced by the kernel when it understands it is deviating from its intended behavior. If the deviation is disruptive and/or irrecoverable, the OOPS might be closely followed by a kernel panic, which completely halts the kernel's execution. An example of disruptive deviation is one that would cause data corruption. A deviation which is deemed to be irrecoverable is one that the kernel doesn't know how to handle.

What do we need from a kernel error message?
As users, we are accustomed to cryptic error messages providing nothing more than a kind notification that "something bad has happened". As developers, what pieces of information would we want to have, instead?
  • What kind of failure has the software encountered.
  • When did that happen, with respect to the software execution's start time.
  • What was it doing when it encountered it, as detailed as it is possible.
  • What was its state was involved in the software's failure.
Luckily enough, a kernel OOPS gives us exactly these pieces of information. Let's see an example of such an error message (and yes, that's directly from something I messed up while developing the OPW project I'm working on - shame on me). I got it from a serial console configured on my machine and hooked up to a host-USB-capable Androidphone with a serial-to-USB adapter.

[    6.308218] ------------[ cut here ]------------
[    6.308225] kernel BUG at block/blk-core.c:2254!
[    6.308231] invalid opcode: 0000 [#1] PREEMPT SMP 
[    6.308240] Modules linked in: ext4 crc16 mbcache jbd2 xen_kbdfront xen_netfront xen_fbfront syscopyarea sysfillrect sysimgblt fb_sys_fops
[    6.308268] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.15.0-rc5-ARCH-xen #96
[    6.308278] task: ffffffff81a16500 ti: ffffffff81a00000 task.ti: ffffffff81a00000
[    6.308286] RIP: e030:[<ffffffff81269780>]  [<ffffffff81269780>] blk_dequeue_request+0x80/0xa0
[    6.308307] RSP: e02b:ffff88001f803d28  EFLAGS: 00010046
[    6.308314] RAX: ffff88001ba14bc0 RBX: ffff88001ba14bc0 RCX: ffff88001d280000
[    6.308321] RDX: 0000000000000175 RSI: 0000000000000000 RDI: ffff88001ba14bc0
[    6.308328] RBP: ffff88001f803d38 R08: 0000000000000000 R09: 0000000000000b6b
[    6.308336] R10: ffff88001f818d70 R11: 0720072007200720 R12: 0000000000000000
[    6.308343] R13: ffff88001d280000 R14: ffff88001ba14bc0 R15: ffffffff817e3fda
[    6.308356] FS:  00007f79846c7700(0000) GS:ffff88001f800000(0000) knlGS:0000000000000000
[    6.308365] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[    6.308371] CR2: 00007f79846c9000 CR3: 0000000001a0f000 CR4: 0000000000002660
[    6.308382] Stack:
[    6.308386]  ffffffff812697b2 ffff88001e0f6a00 ffff88001f803d88 ffffffff813af734
[    6.308396]  0000000000000155 ffff88001d6690e0 0000000000000000 ffff88001d280000
[    6.308408]  0000000000000155 ffff88001d6690e0 0000000000000000 ffff88001d668000
[    6.308419] Call Trace:
[    6.308425]  <IRQ> 
[    6.308429]  [<ffffffff812697b2>] ? blk_start_request+0x12/0x60
[    6.308448]  [<ffffffff813af734>] do_blkif_request+0x74/0x200
[    6.308456]  [<ffffffff81265ec3>] __blk_run_queue+0x33/0x40
[    6.308463]  [<ffffffff81266079>] blk_start_queue+0x29/0x50
[    6.308471]  [<ffffffff813b0425>] kick_pending_request_queues+0x35/0x50
[    6.308479]  [<ffffffff813b0eb4>] blkif_interrupt+0xa74/0xab0
[    6.308491]  [<ffffffff810c7d5e>] handle_irq_event_percpu+0x3e/0x1f0
[    6.308501]  [<ffffffff810c7f4d>] handle_irq_event+0x3d/0x60
[    6.308509]  [<ffffffff810cad46>] handle_edge_irq+0x66/0x130
[    6.308517]  [<ffffffff810c7438>] generic_handle_irq+0x28/0x40
[    6.308527]  [<ffffffff813472e1>] evtchn_fifo_handle_events+0x151/0x160
[    6.308539]  [<ffffffff81343cb0>] __xen_evtchn_do_upcall+0x50/0xb0
[    6.308547]  [<ffffffff81345e63>] xen_evtchn_do_upcall+0x33/0x50
[    6.308558]  [<ffffffff8151e8fe>] xen_do_hypervisor_callback+0x1e/0x30
[    6.308564]  <EOI> 
[    6.308568]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[    6.308582]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[    6.308594]  [<ffffffff8100b070>] ? xen_safe_halt+0x10/0x20
[    6.308603]  [<ffffffff8101e54f>] ? default_idle+0x1f/0x100
[    6.308611]  [<ffffffff8101ef1f>] ? arch_cpu_idle+0xf/0x20
[    6.308621]  [<ffffffff810b1fc8>] ? cpu_startup_entry+0x2b8/0x4f0
[    6.308633]  [<ffffffff815042a4>] ? rest_init+0x84/0x90
[    6.308645]  [<ffffffff81af3ea6>] ? start_kernel+0x448/0x469
[    6.308654]  [<ffffffff81af34d7>] ? x86_64_start_reservations+0x2a/0x2c
[    6.308663]  [<ffffffff81af6076>] ? xen_start_kernel+0x577/0x583
[    6.308669] Code: 84 81 3c 05 00 00 01 65 ff 04 25 e0 c7 00 00 e8 07 3d db ff 48 89 83 e8 00 00 00 65 ff 0c 25 e0 c7 00 00 74 16 5b 5d f3 c3 66 90 <0f> 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 e8 6b 4c db ff 
[    6.308749] RIP  [<ffffffff81269780>] blk_dequeue_request+0x80/0xa0
[    6.308758]  RSP <ffff88001f803d28>
[    6.308765] ---[ end trace 500f7dcca1bd6a9e ]---

First of all, look at the left column: that's the time since the kernel's execution started. But, let's move on and have a look at the first section, that tells us what happened. It seems there was a kernel BUG at block/blk-core.c:2254! That means that the programmer had explicitly defined some state of the kernel as a failing state and has instructed the kernel to panic when entering that state. We'll see how to do that later. The following lines give us an insight as to what modules were loaded in the kernel when it crashed, and what CPU the crashing thread of execution was running on, what task was involved in it crashing (the swapper probably tried to do I/O through the buggy driver).
The following part is the dump of all of the machine's registers, which also gives an indication of the exact point where the code led to a crash. This information is obtained by the Instruction Pointer register (EIP/RIP depending on whether the machine is 32 or 64 bit-capable), which keeps the address of the currently-executed instruction, and from the map of the kernel's symbols, which allows to locate and print out the human-readable identifier of the function that was being executed and the offset inside it.
The last section gives us an idea of what the kernel was doing when the failure happened; the OOPS contains a full dump of the kernelspace stack, along with a complete call trace. The accuracy of such a trace might be dubious, and the kernel notifies it by printing a question mark symbol next to the tentative elements of the call trace. Note that the value of the instruction pointer, along with the currently-executed function identifier, is reported also at the end of the trace, as it is of great importance and must be reported even if the OOPS got cut out for some reason.

Now, the question is: how do we actually get from this load of information to a concrete pointer to the code that failed? Yeah well, we have it noted up there, it was block/blk-core.c at line 2254. But let's assume we didn't know about that, since some kind of failures might not be so kind to print out such a relevant information about the failing code.

First of all, it's very important that we have our kernel image compiled with debugging symbols enabled (CONFIG_DEBUG_INFO=y, which you can find in the menuconfig as "Kernel hacking > Compile-time checks and compiler options > Compile the kernel with debug info"). After that, we just need two very basic tools: the objdump utility and the kernel map of symbols; we can find the former in any Linux distribution's official repositories, and the latter as output of the kernel make process, as the System.map file.
Let's try to get from the above-reported OOPS some more info about what happened. The first step is getting the start address of the blk_dequeue_request() function from the System.map file.

$ grep blk_dequeue_request System.map
ffffffff81269870 T blk_dequeue_request

Seems like we have it. Let's now add the offset reported by the OOPS to that start address. The offset is the hexadecimal number reported after the plus sign. If you're wandering about it, that second number after the slash is the size of the function, so if the offset was greater than the size, well, it would mean that the OOPS is corrupted. It's kind of an integrity check.
But let's stick with the offset.

ffffffff81269870h + 80h = ffffffff812698f0h

Now we have the address of the exact instruction executing which the kernel crahsed. We just need to have a look at the kernel code with the objdump utility. objdump allows to disassemble a Linux ELF executable. With the --source option, it also interleaves assembly code with original source code, therefore giving immediate access to the original listing. Disassembling the whole kernel would give a very long output and also take a few minutes, so we're giving to the disassembler an address to start analyzing the code from. A good start address is the start address of the function itself.

objdump --disassemble --source --start-address=0xffffffff81269870 vmlinux

Note that "0x" at the beginning of the start address. Trivial enough, but can drive you mad when you don't know it's needed and you're already grumbling on an annoying bug. Also, it's usually very much comfortable if you pipe the output to some text viewer, such as less, or redirect it to a file so that you can go though it in your own time. Anyway, the output should look like the following.

vmlinux:     file format elf64-x86-64

Disassembly of section .text:

ffffffff81269870 <blk_dequeue_request>:
        return rq;

void blk_dequeue_request(struct request *rq)
ffffffff81269870:       e8 8b ae 2a 00          callq  ffffffff81514700 <__entry_text_start>
 * list_empty - tests whether a list is empty
 * @head: the list to test.
static inline int list_empty(const struct list_head *head)
        return head->next == head;
ffffffff81269875:       48 8b 07                mov    (%rdi),%rax

That looks very promising, doesn't it? Let's cut out just the address of the instruction (ffffffff812698f0and let's then go one line back to find out the C instruction involved in the panic.

void blk_dequeue_request(struct request *rq)
        struct request_queue *q = rq->q;

ffffffff812698f0:       0f 0b                   ud2    
ffffffff812698f2:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
ffffffff812698f8:       0f 0b                   ud2    
ffffffff812698fa:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
ffffffff81269900:       e8 fb 4a db ff          callq  ffffffff8101e400 <___preempt_schedule>
        if (blk_account_rq(rq)) {

So, there it is: BUG_ON(list_empty(&rq->queuelist))! In fact the name of the C instruction itself seems to refer to a bug or something like that...
In fact, we hit an invariant expression (or assertion) that explicitly defined a state of the kernel's execution as bug. Go straight to the next paragraph if you want to learn more about them and about why they are of capital importance when writing kernel code. Here I just want to point out something that my advisor showed me a few years ago. You actually don't need to go through the whole kernel map grepping and ELF disassembly thing. You just need the gdb debugger (and of course a kernel compiled with debugging info).

$ gdb vmlinux
GNU gdb (GDB) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
Reading symbols from vmlinux...done.
(gdb) l *(blk_dequeue_request+0x80)
0xffffffff812698f0 is in blk_dequeue_request (block/blk-core.c:2254).
2250 void blk_dequeue_request(struct request *rq)
2251 {
2252 struct request_queue *q = rq->q;
2254 BUG_ON(list_empty(&rq->queuelist));
2255 BUG_ON(ELV_ON_HASH(rq));
2257 list_del_init(&rq->queuelist);

The l command (which is list in its extended form) asks the debugger to dump the source code of a specific function at a specific offset. And gdb does all the boring work and leaves us with nothing but debugging fun.

A side note about invariant expressions
So, we were talking about disruptive and irrecoverable anomalies. One important difference that exists between these two types of panic-provoking situations is that the first must be explicitly defined by the programmer [him|her]self with run-time assertions.
An assertion in the kernel is the exact same of an userspace assertion, with the difference that its result might be the code completely halting its execution or just printing out an OOPS. The OOPS provoked by an assertion is the kind of OOPS that very helpfully prints out the name of the source file and the line of it where the buggy state was encountered. The BUG_ON() macro allows the programmer to explicitly define a critical condition that, when it is verified, represents a bug that can't be just got over with. A milder version of that macro is the WARN_ON() macro, that just prints an OOPS in the kernel's log without halting its execution. When using WARN_ON()s it might happen that the OOPSes completely flood kernel logs, if the anomalous state persists or if the expression used for the assertion is not as specific as it could be. In these cases, it could be better to make use of the WARN_ON_ONCE() macro, which prints the OOPS only when first encountering the buggy state; if you instead find very useful having the OOPS printed out if the error condition persists you could use WARN_ON_RATELIMIT(), which limits the rate at which OOPSes are printed, using a ratelimit state defined with the macro DEFINE_RATELIMIT_STATE().

Using assertions in kernel code (as it happens also for userspace applications) is instrumental to be able to spot out anomalies as soon as they show up. When testing code of such complexity, the sooner it panics, the better it is, as the anomaly's true cause is very likely to be hidden in the near future by effects of the anomaly itself. Using assertions allows us to get the most out of triggered error messages, as every bit of information taken just after the anomaly happened will be relevant to debugging and will be of actual help; that same error messages will instead end up being confusing when tons of side-effects have layered on top of the actual problem.

Bug hunting in the Linux kernel
Manual of the gdb debugger
R. Love, "Linux Kernel Development" (3rd edition), Chapter 18, "Debugging"
Remote Serial Console Howto
Manual of the netconsole module

Wednesday, July 23, 2014

OPW, Linux: The block I/O layer, part 4 - The multi-queue interface

So, a couple of weeks ago I performed some first tests on the prototype driver using a simulated device created with the null_blk device driver. Such tests, and the following profiling, highlighted some locking issues due to contention on an internal lock kept by the frontend driver; more in detail, such a lock is instrumental to handling a ringbuffer which is used to exchange requests and responses between the frontend and the backend parts of the block I/O driver: the lock protects the ring against the insertion of new requests and the extraction of responses. My very patient OPW mentor has therefore suggested splitting the ring into two separate halves, one used for requests and another used for responses. During the last week and a half I have been working on that; while struggling with making the interface consistent with the needs of older versions of the driver, I have been writing this last blog article about the multi-queue block layer API, which concludes the series; next week I'll bore you to death with something more related to performance. Also, please note that this post is the reword of some documentation I produced during the first weeks of internship, and it has had the benefit of my mentor's reading, so it's probably much more accurate than usual.

The block I/O layer, part 4 - The multi-queue interface

The request interface was designed for devices that could handle hundreds of I/O operations per second; in a recent paper, block layer maintainer Jens Axboe noted that it suffers from design issues when used for devices that can handle hundreds of thousands of IOPS (see my first blog post or, much better, Jens Axboe's paper). One of the capital issues, lock contention, has very relevant effects event if only having multiple cores concurrently inserting block I/O requests and continuously spinning on the single queue_lock; such a situation is exacerbated when the lock is conteded also by a high-end storage device issuing interrupts and whose driver is still spinning on that same lock. The multi-queue API (also referred to as blk-mq) addresses such an issue by exploiting the ability of block I/O controllers to handle multiple requests in parallel, and thus dramatically reducing lock contention; in fact, in its most common configuration, it allows for block I/O requests to be inserted without the need to lock the whole request_queue. Let's see how it manages to do it.

The blk-mq API implements a two-levels block layer design which makes use of two separate sets of request queues: software staging queues, allocated per-CPU, and hardware dispatch queues, whose number typically match the number of actual hardware queues supported by the block device. The number of software staging queues can be higher than the number of hardware dispatch queues: in this case, two or more software queues will be part of the same hardware context, and a dispatch performed with that hardware context will pull in requests from all the associated software queues. The number of software staging queues can be, instead, less than the number of hardware queues: in this case, sequential mapping is performed. In the third and most simple case where the number of software queues equals the number of hardware queues, a direct 1:1 mapping is performed.

Figure 1: Outline of the multi-queue block layer.

Main data structures
The first relevant data structure used by the multi-queue block layer API is the blk_mq_reg structure, containing all informations of importance during the registration of a new block device to the block layer. This data structure contains the pointer to a blk_mq_ops data structure, used to keep track of the specific routines to be used by the multi-queue block layer to interact with the device's driver. A blk_mq_reg structure also keeps the number of hardware queues to be initialized, the dept of such queues and other information useful during the initialization of data structures related to the particular driver in the block layer. Another data structure of importance is the blk_mq_hw_ctx structure, which represents the hardware context to which a request_queue is associated. Its corresponding structure for the software staging queue is the blk_mq_ctx structure, which is allocated per-CPU. The function performing the mapping between these contexts is specified in the map_queue field of the driver's blk_mq_ops data structure, while the mapping built by this function is kept as the mq_map of the request_queue data structure associated to the block device.
Don't worry: a drawing, such as Figure 2, makes it clearer. Kind of.

Figure 2: Data structures used in the multi-queue block layer.

Queue initialization
When a new device driver using the multi-queue API is loaded, it creates and initializes a new blk_mq_ops structure and sets to its address the related pointer of a new blk_mq_reg. More in detail, the required operations are queue_fn, which must be set to a function in charge of handling the command (e.g. by passing it to the low-level driver), and map_queue, which performs the mapping between hardware and software contexts. Other operations are not strictly required, but can be specified in order to perform specific operations on allocation of contexts or on completion of an I/O request. As of necessary data, the driver must initialize the number of submission queues it supports, along with their size; other data are required, e.g., to determine the size of the command supported by the driver and specific flags that must be exposed to the block layer.

When a new device is initialized, its driver prepares a new data structure whose type may vary according to the device driver handling the device; such a driver-specific data structure, however, is very likely to contain a pointer to the device's gendisk struct and to the request_queue related to the device. As soon as the driver has these data structures ready, it invokes the blk_mq_init_queue() function, which initializes the hardware and software contexts and performs the mapping between them. The initialization routine also sets an alternate make_request function, subsituting to the conventional request submission path (which would include blk_make_request()) the multi-queue submission path (which includes, instead, the function blk_mq_make_request()); as usual, the alternate make_request function is set with the blk_queue_make_request() helper.

Request submission
Device initialization substituted the conventional block I/O submission function with the multi-queue-ready request-submission function, blk_mq_make_request(), letting the multi-queue structures be used transparently from the perspective of the upper layers. The make_request function used by the multi-queue block layer includes the possibility to benefit from per-process plugging, but only for drivers supporting a single hardware queue or for async requests. In case the request is sync and the driver actively uses the multi-queue interface, no plugging is performed. The make_request function also performs request merging, searching for a candidate first inside the task's plug list, if plugging is allowed, and finally in the software queue mapped to the current CPU; the submission path does not involve any I/O scheduling-related callback. Finally, make_request sends immediately to the corresponding hardware queue any sync request, while it delays this transition in case of async or flush requests, to allow for subsequent merging and more efficient dispatching.

Request dispatch
In case that an I/O request is synchronous (and therefore no plugging is allowed for it from the multi-queue block layer) its dispatch to the device driver is performed in the context of the same request; if the request is instead async or flush, and task plugging is present, its dispatch can be performed: a) in the context of the submission of another I/O request to a software queue associated to the same hardware queue; b) when the delayed work scheduled during request submission is executed.
The main run-of-queue function of the multi-queue block layer is the blk_mq_run_hw_queue(), which basically relies on another driver-specific routine, pointed by the queue_rq field of its blk_mq_ops structure. This function delays any run of queue for an async request, while it dispatches a sync request immediately to the driver. The inner function __blk_mq_run_hw_queue(), called by blk_mq_run_queue() in case the request is sync, first joins any software queue associated to the currently-in-service hardware queue; then it joins the resulting list with any entry already on the dispatch list. After collecting all to-be-served entries, the function processes them, starting each request and passing it on to the driver, with its queue_rq function. The function finally handles possible errors, by requeue or deletion of the associated requests.

Figure 3: Functions performing request transition between different data structures.

Bjørling, Matias, et al. "Linux block IO: Introducing multi-queue SSD access on multi-core systems." Proceedings of the 6th International Systems and Storage Conference. ACM, 2013. - http://kernel.dk/systor13-final18.pdf
Johnathan Corbet, "The multi-queue block layer" - http://lwn.net/Articles/552904/
The Linux kernel, blk-mq: new multi-queue block IO queueing mechanism

Friday, July 11, 2014

OPW, Linux: The block I/O layer, part 3 - The make request interface

During the past weeks, I have been learning about profiling an operating system running in a virtual machine, since I have been needing to examine the driver I am working on to locate bottlenecks and work out lock contention issues. My OPW mentor has suggested that I get to grips with some popular profiling tools, such as perf and lockstat. During my bachelor thesis I already had the chance to become familiar with perf to some extent, but I am learning a lot more about collecting accurate data about the performance of a virtualized OS. For example, I read that Xen exploits Intel's Performance Monitoring Unit, which provides architectural support for collecting performance-related data. 

During the tests performed prior to profiling, I also had the chance to make use of the null_blk block device driver to compare the performance of the CFQ and NOOP I/O schedulers with a random workload composed of greedy random readers and writers, having no completion latency. Such a workload emulates Intel's IOmeter on a too-fast-to-be-real device. The throughput achieved by the CFQ I/O scheduler is half of the the one achieved by NOOP, or even lower, depending on the number of processes issuing I/O.

The NOOP scheduler, however, still does merges and sorts requests; none of this seems really necessary with such a workload where I/O operations are issued in a random fashion (so there seem to be not many merges in any case) and there is no seek penalty that would justify sorting. So, there's already something in the Linux kernel's block layer that should perform slightly better than the request API with the NOOP scheduler: the make request interface.

The block I/O layer, part 3 - The make request interface

The make request interface (or bio-based interface) essentially shorts out all processing of block I/O units following the creation of a bio structure. It therefore allows the kernel to directly submit a bio to the storage device's driver. Such an interface is useful to any block device driver needing to perform pre-processing of requests before submitting them to the actual underlying device (such, e.g., stacked drivers implementing RAID). Even if its purpose was not initially that, the bio-based API is also useful to any block device driver that sees the block layer's processing of I/O requests as an overhead; think, for example, to drivers of devices or controller that feature a highly complex internal request processing logic or don't need requests to be processed. The drawbacks for such an interface are evident: a driver making use of it would lose any pre-processing normally performed by the block layer.

Figure 1: Block layer layout when using the make request interface

Let's see how a driver uses such an interface, again from the code of the very simple null_blk driver. Even when in bio-based mode, the null_blk driver still needs to allocate a request_queue structure. The key, however, is defining, after that, an alternate make request function with respect to the default one. The null_blk driver does this in its null_add_dev() function, invoked for each simulated device that it requires to create, on module initialization.

nullb->q = blk_alloc_queue_node(GFP_KERNEL, home_node);
blk_queue_make_request(nullb->q, null_queue_bio);

Let's turn our attention to the bulk of the null_queue_bio() function itself. It is very simple and does not even need to allocate new request structures; however it needs to get a command structure to handle completions afterwards. It just handles the block operation's command with no additional operations.

static void null_queue_bio(struct request_queue *q, struct bio *bio)
        struct nullb *nullb = q->queuedata;
        struct nullb_queue *nq = nullb_to_queue(nullb);
        struct nullb_cmd *cmd;

        cmd = alloc_cmd(nq, 1); 
        cmd->bio = bio;


In this very simple case, completions are handled by just ending the I/O command with no error notification, as if it had been executed by a device controller. We can see how the null_blk driver does in its end_cmd() function, which is invoked directly in the context of the previously-seen null_handle_cmd() function: it invokes the block layer's bio_endio() function by passing to it the completed bio and the error code as its second parameter.

case NULL_Q_BIO:
        bio_endio(cmd->bio, 0);

K. Wilk, "Xen Profiling: oprofile and perf" - http://wiki.xen.org/wiki/Xen_Profiling:_oprofile_and_perf
J. Corbet, "The multiqueue block layer" - http://lwn.net/Articles/552904/