Wednesday, July 23, 2014

OPW, Linux: The block I/O layer, part 4 - The multi-queue interface

So, a couple of weeks ago I performed some first tests on the prototype driver, using a simulated device created with the null_blk device driver. Those tests, and the profiling that followed, highlighted some locking issues due to contention on an internal lock kept by the frontend driver; in more detail, the lock protects a ring buffer used to exchange requests and responses between the frontend and the backend parts of the block I/O driver, guarding it against concurrent insertion of new requests and extraction of responses. My very patient OPW mentor has therefore suggested splitting the ring into two separate halves, one used for requests and the other for responses. During the last week and a half I have been working on that; while struggling with keeping the interface consistent with the needs of older versions of the driver, I have been writing this last blog article about the multi-queue block layer API, which concludes the series; next week I'll bore you to death with something more related to performance. Also, please note that this post is a reworking of some documentation I produced during the first weeks of the internship, and it has had the benefit of my mentor's review, so it's probably much more accurate than usual.


The block I/O layer, part 4 - The multi-queue interface

The request interface was designed for devices able to handle hundreds of I/O operations per second; in a recent paper, block layer maintainer Jens Axboe noted that it suffers from design issues when used with devices that can handle hundreds of thousands of IOPS (see my first blog post or, much better, Jens Axboe's paper). One of the capital issues, lock contention, has very relevant effects even when it is only multiple cores concurrently inserting block I/O requests and continuously spinning on the single queue_lock; the situation gets worse when the lock is also contended by a high-end storage device issuing interrupts and whose driver is spinning on that same lock. The multi-queue API (also referred to as blk-mq) addresses this issue by exploiting the ability of block I/O controllers to handle multiple requests in parallel, thus dramatically reducing lock contention; in fact, in its most common configuration, it allows block I/O requests to be inserted without locking the whole request_queue. Let's see how it manages to do that.

The blk-mq API implements a two-level block layer design which makes use of two separate sets of request queues: software staging queues, allocated per-CPU, and hardware dispatch queues, whose number typically matches the number of hardware queues actually supported by the block device. The number of software staging queues can be higher than the number of hardware dispatch queues: in this case, two or more software queues will be part of the same hardware context, and a dispatch performed with that hardware context will pull requests from all the associated software queues. The number of software staging queues can instead be lower than the number of hardware queues: in this case, a sequential mapping is performed. In the third and simplest case, where the number of software queues equals the number of hardware queues, a direct 1:1 mapping is performed.

Figure 1: Outline of the multi-queue block layer.
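As a rough illustration of the mapping policies just described (this is illustrative code of mine, not the kernel's actual implementation), the association between a CPU's software staging queue and a hardware dispatch queue can be pictured as a simple function of the CPU number:

/*
 * Illustrative sketch only, not the kernel's code: map the software
 * staging queue of a given CPU to one of nr_hw_queues hardware queues.
 */
static unsigned int naive_map_queue(unsigned int cpu, unsigned int nr_cpus,
                                    unsigned int nr_hw_queues)
{
        if (nr_hw_queues >= nr_cpus)
                return cpu;             /* direct (or sequential) mapping */
        return cpu % nr_hw_queues;      /* several software queues share a hardware queue */
}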

Main data structures
The first relevant data structure used by the multi-queue block layer API is the blk_mq_reg structure, containing all the information of importance during the registration of a new block device to the block layer. This data structure contains a pointer to a blk_mq_ops data structure, used to keep track of the specific routines that the multi-queue block layer must use to interact with the device's driver. A blk_mq_reg structure also keeps the number of hardware queues to be initialized, the depth of such queues and other information useful when initializing the block-layer data structures related to the particular driver. Another data structure of importance is the blk_mq_hw_ctx structure, which represents the hardware context that a request_queue is associated to. Its counterpart for the software staging queues is the blk_mq_ctx structure, which is allocated per-CPU. The function performing the mapping between these contexts is specified in the map_queue field of the driver's blk_mq_ops data structure, while the mapping it builds is kept in the mq_map field of the request_queue data structure associated with the block device.
Don't worry: a drawing, such as Figure 2, makes it clearer. Kind of.

Figure 2: Data structures used in the multi-queue block layer.
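For readers who prefer code to drawings, here is a simplified sketch of the two registration-side structures as they look in the kernel versions I am working with (around 3.13); several fields are omitted and names may differ slightly between versions:

/* Simplified sketch (circa Linux 3.13); several fields omitted. */
struct blk_mq_reg {
        struct blk_mq_ops       *ops;           /* driver callbacks (queue_rq, map_queue, ...) */
        unsigned int            nr_hw_queues;   /* hardware dispatch queues to allocate */
        unsigned int            queue_depth;    /* depth of each hardware queue */
        unsigned int            cmd_size;       /* extra per-request space for the driver */
        int                     numa_node;
        unsigned int            flags;          /* BLK_MQ_F_* flags */
};

struct blk_mq_ops {
        queue_rq_fn             *queue_rq;      /* hand a request over to the low-level driver */
        map_queue_fn            *map_queue;     /* software context -> hardware context */
        /* plus optional hooks, e.g. per-context setup and teardown (init_hctx, exit_hctx) */
};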

Queue initialization
When a new device driver using the multi-queue API is loaded, it creates and initializes a new blk_mq_ops structure and stores its address in the relevant pointer of a new blk_mq_reg structure. The required operations are queue_rq, which must point to a function in charge of handling the command (e.g. by passing it to the low-level driver), and map_queue, which performs the mapping between hardware and software contexts. Other operations are not strictly required, but can be specified to perform driver-specific work on allocation of contexts or on completion of an I/O request. As for the necessary data, the driver must set the number of hardware submission queues it supports, along with their depth; other fields are needed, e.g., to determine the size of the per-request command used by the driver and the specific flags to be exposed to the block layer.
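As an example, the multi-queue mode of the null_blk driver sets up its operations and registration data roughly as follows (simplified and slightly rearranged from the 3.13 sources; some optional hooks are omitted and several fields are actually filled in from module parameters):

static struct blk_mq_ops null_mq_ops = {
        .queue_rq       = null_queue_rq,        /* "executes" the command */
        .map_queue      = blk_mq_map_queue,     /* default mapping helper */
};

static struct blk_mq_reg null_mq_reg = {
        .ops            = &null_mq_ops,
        .queue_depth    = 64,
        .cmd_size       = sizeof(struct nullb_cmd),
        .flags          = BLK_MQ_F_SHOULD_MERGE,
};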

When a new device is initialized, its driver prepares a new data structure whose type may vary according to the driver handling the device; such a driver-specific data structure, however, is very likely to contain a pointer to the device's gendisk structure and to the request_queue related to the device. As soon as the driver has these data structures ready, it invokes the blk_mq_init_queue() function, which initializes the hardware and software contexts and performs the mapping between them. The initialization routine also sets an alternate make_request function, substituting the multi-queue submission path (which includes the function blk_mq_make_request()) for the conventional request submission path (which would go through blk_queue_bio()); as usual, the alternate make_request function is registered with the blk_queue_make_request() helper.
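In null_blk's multi-queue mode, for instance, this step boils down to something like the following (again simplified; the number and depth of the hardware queues come from module parameters):

/* Simplified from null_blk's null_add_dev(), multi-queue mode. */
null_mq_reg.nr_hw_queues = submit_queues;       /* module parameter */
null_mq_reg.queue_depth = hw_queue_depth;       /* module parameter */
nullb->q = blk_mq_init_queue(&null_mq_reg, nullb);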

Request submission
Device initialization replaced the conventional block I/O submission function with the multi-queue-ready request-submission function, blk_mq_make_request(), letting the multi-queue structures be used transparently from the perspective of the upper layers. The make_request function used by the multi-queue block layer retains the possibility to benefit from per-process plugging, but only for drivers supporting a single hardware queue or for async requests; if the request is sync and the driver actively uses the multi-queue interface, no plugging is performed. The make_request function also performs request merging, searching for a candidate first inside the task's plug list, if plugging is allowed, and then in the software queue mapped to the current CPU; the submission path does not involve any I/O scheduling-related callback. Finally, make_request immediately sends any sync request to the corresponding hardware queue, while it delays this transition in case of async or flush requests, to allow for subsequent merging and more efficient dispatching.
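A heavily simplified outline of the submission logic, paraphrased in pseudocode-like C (the helper names below are mine, not the kernel's; only the overall flow reflects the actual blk_mq_make_request()):

/*
 * Paraphrased outline of blk_mq_make_request(): helper names are
 * invented for clarity, only the overall flow reflects the kernel.
 */
static void mq_make_request_outline(struct request_queue *q, struct bio *bio)
{
        struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);     /* software queue of this CPU */
        struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q, ctx->cpu);
        bool is_sync = rw_is_sync(bio->bi_rw);
        struct request *rq;

        /* Try to merge the bio: first into the task's plug list (if any),
         * then into the software queue; no elevator callback is involved. */
        if (attempt_merge(q, ctx, bio))
                return;

        rq = allocate_and_fill_request(hctx, ctx, bio);

        if (plugging_allowed(hctx, is_sync))
                add_to_plug_list(current->plug, rq);    /* dispatched later */
        else
                insert_into_software_queue(ctx, rq);    /* under ctx->lock */

        if (is_sync)
                blk_mq_run_hw_queue(hctx, false);       /* dispatch right away */
}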

Request dispatch
If an I/O request is synchronous (and therefore no plugging is allowed for it by the multi-queue block layer), its dispatch to the device driver is performed in the context of its own submission; if the request is instead async or flush, and task plugging is in use, its dispatch can be performed: a) in the context of the submission of another I/O request to a software queue associated with the same hardware queue; b) when the delayed work scheduled during request submission is executed.
The main run-of-queue function of the multi-queue block layer is blk_mq_run_hw_queue(), which basically relies on a driver-specific routine, pointed to by the queue_rq field of the blk_mq_ops structure. This function delays any run of the queue for an async request, while it dispatches a sync request immediately to the driver. The inner function __blk_mq_run_hw_queue(), called by blk_mq_run_hw_queue() in case the request is sync, first collects the entries of all the software queues associated with the currently-in-service hardware queue; then it joins the resulting list with any entry already sitting on the hardware queue's dispatch list. After collecting all the to-be-served entries, the function processes them, starting each request and passing it on to the driver with its queue_rq function. The function finally handles possible errors, by requeueing or deleting the associated requests.
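Again in paraphrased form (invented helper names, only the flow reflects the kernel's __blk_mq_run_hw_queue()), the dispatch logic looks roughly like this:

/*
 * Paraphrased outline of __blk_mq_run_hw_queue(): helper names are
 * invented for clarity, only the overall flow reflects the kernel.
 */
static void run_hw_queue_outline(struct blk_mq_hw_ctx *hctx)
{
        LIST_HEAD(rq_list);
        struct request *rq;

        /* Collect pending requests from every software queue mapped to
         * this hardware context, then add what was already left over on
         * the hardware context's dispatch list. */
        collect_from_software_queues(hctx, &rq_list);
        splice_dispatch_list(hctx, &rq_list);

        while (!list_empty(&rq_list)) {
                int ret;

                rq = list_first_entry(&rq_list, struct request, queuelist);
                list_del_init(&rq->queuelist);

                ret = hctx->queue->mq_ops->queue_rq(hctx, rq);  /* hand to the driver */
                if (ret == BLK_MQ_RQ_QUEUE_BUSY) {
                        /* Device busy: keep the request and stop dispatching. */
                        list_add(&rq->queuelist, &rq_list);
                        break;
                }
                if (ret == BLK_MQ_RQ_QUEUE_ERROR)
                        end_request_with_error(rq);
        }

        /* Whatever could not be dispatched goes back on the hardware
         * context's dispatch list, to be retried on a later run. */
        if (!list_empty(&rq_list))
                put_back_on_dispatch_list(hctx, &rq_list);
}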

Figure 3: Functions performing request transition between different data structures.

Sources
Bjørling, Matias, et al. "Linux block IO: Introducing multi-queue SSD access on multi-core systems." Proceedings of the 6th International Systems and Storage Conference. ACM, 2013. - http://kernel.dk/systor13-final18.pdf
Jonathan Corbet, "The multiqueue block layer" - http://lwn.net/Articles/552904/
The Linux kernel source, commit "blk-mq: new multi-queue block IO queueing mechanism"

Friday, July 11, 2014

OPW, Linux: The block I/O layer, part 3 - The make request interface

During the past weeks, I have been learning about profiling an operating system running in a virtual machine, since I need to examine the driver I am working on to locate bottlenecks and work out lock contention issues. My OPW mentor has suggested that I get to grips with some popular profiling tools, such as perf and lockstat. During my bachelor's thesis I already had the chance to become familiar with perf to some extent, but I am learning a lot more about collecting accurate data on the performance of a virtualized OS. For example, I read that Xen exploits Intel's Performance Monitoring Unit, which provides architectural support for collecting performance-related data.

During the tests performed prior to profiling, I also had the chance to use the null_blk block device driver to compare the performance of the CFQ and NOOP I/O schedulers with a random workload made of greedy random readers and writers, with no completion latency. Such a workload emulates Intel's IOmeter on a too-fast-to-be-real device. The throughput achieved by the CFQ I/O scheduler is half the one achieved by NOOP, or even lower, depending on the number of processes issuing I/O.

The NOOP scheduler, however, still merges and sorts requests; neither operation seems really necessary with such a workload, where I/O is issued in a random fashion (so not many merges would happen in any case) and there is no seek penalty that would justify sorting. So, there's already something in the Linux kernel's block layer that should perform slightly better than the request API with the NOOP scheduler: the make request interface.


The block I/O layer, part 3 - The make request interface

The make request interface (or bio-based interface) essentially short-circuits all processing of block I/O units past the creation of a bio structure: it allows the kernel to submit a bio directly to the storage device's driver. Such an interface is useful to any block device driver needing to perform pre-processing of requests before submitting them to the actual underlying device (e.g., stacked drivers implementing software RAID). Even if that was not its initial purpose, the bio-based API is also useful to any block device driver that sees the block layer's processing of I/O requests as overhead; think, for example, of drivers for devices or controllers that feature a highly complex internal request-processing logic, or that don't need requests to be processed at all. The drawback of such an interface is evident: a driver making use of it loses all the pre-processing normally performed by the block layer.

Figure 1: Block layer layout when using the make request interface

Let's see how a driver uses such an interface, again from the code of the very simple null_blk driver. Even when in bio-based mode, the null_blk driver still needs to allocate a request_queue structure. The key step, however, is registering, right after that, an alternate make request function in place of the default one. The null_blk driver does this in its null_add_dev() function, invoked on module initialization for each simulated device it has to create.

/* Allocate a bare request_queue, without the request-based machinery... */
nullb->q = blk_alloc_queue_node(GFP_KERNEL, home_node);
/* ...and register the driver's own make_request function. */
blk_queue_make_request(nullb->q, null_queue_bio);

Let's turn our attention to the bulk of the null_queue_bio() function itself. It is very simple and does not even need to allocate new request structures; it only needs a command structure, used to handle the completion afterwards, and it processes the block I/O unit without any additional operation.

static void null_queue_bio(struct request_queue *q, struct bio *bio)
{
        struct nullb *nullb = q->queuedata;
        struct nullb_queue *nq = nullb_to_queue(nullb);
        struct nullb_cmd *cmd;

        /* Grab a command structure to track the bio until its completion. */
        cmd = alloc_cmd(nq, 1);
        cmd->bio = bio;

        /* "Execute" the command: the null driver just completes it, as shown below. */
        null_handle_cmd(cmd);
}

In this very simple case, completions are handled by just ending the I/O command with no error notification, as if it had been executed by a real device controller. We can see how the null_blk driver does this in its end_cmd() function, which is invoked directly in the context of the previously-seen null_handle_cmd() function: it calls the block layer's bio_endio() function, passing it the completed bio and, as the second parameter, the error code (zero in this case).

case NULL_Q_BIO:
        bio_endio(cmd->bio, 0);
        break;


Sources
K. Wilk, "Xen Profiling: oprofile and perf" - http://wiki.xen.org/wiki/Xen_Profiling:_oprofile_and_perf
Jonathan Corbet, "The multiqueue block layer" - http://lwn.net/Articles/552904/