慶應義塾大学
2018年度春学期

システム・ソフトウェア
Operating Systems

2018年度春学期　月曜日４時限、金曜日３時限
科目コード: 60730
開講場所：SFC
授業形態：講義
担当: Rodney Van Meter
E-mail: rdv@sfc.keio.ac.jp

第3回 4月16日プロセススケジューリング
Lectures 3 and 4, April 16 and 20: Process Scheduling

Picture of the Day

Materials for First Scheduling Lecture

Chapters 6 & 7 of OSTEP and accompanying slides.
These notes.

Today, we will talk primarily about principles and some of the low-level mechanisms, then on Thursday we will talk about the specifics of FreeBSD and Linux and MLFQ.

But first, a little card game...

Outline

Picture of the Day
Spending Moore's Dividend
Key Ideas in Scheduling
CPU scheduling
Multiprocessor scheduling
Realtime: deadline scheduling
Bonus: scheduling in multithreaded architectures
Current scheduling research

Spending Moore's Dividend

Jim Larus, a key researcher at Microsoft Research, has an article in the May 2009 issue of Communications of the ACM titled, "Spending Moore's Dividend".

CACM version
Microsoft Research technical report (2008)
Workshop slides @Utah (n.b.: the file name says .ppt, but it's really a PDF) (2008)

I recommend the CACM version, it's more up to date and probably better-written.

Scheduling

Scheduling is the task of deciding which job, process, thread or other task to do at a given point in time. We can divide our discussion of this into two parts: when the scheduling algorithm runs, and the algorithm itself. In computer systems, changing from one running process to another is called context switching.

Goals of Scheduling

Throughput
Fairness
Responsiveness
Effective utilization of all resources

One measure of whether we are doing a good job of managing our resources is throughput: how many of the jobs we have been assigned have we completed in a specified amount of time?

Fairness has an actual mathematical definition, once you have decided what you are attempting to measure. This definition is from Raj Jain:

If the computer is used interactively, we all want it to be responsive: we hate the spinning beach ball or hourglass, but we care much less how long a bigger computation is taking to complete.

Batch Scheduling

Charging
First Come, First Served (FCFS)
Short Job First (SJF)
Priority
CPU-bound v. I/O-bound Jobs

First, to get the basic idea, let's look at how you might organize execution of a large set of jobs, then we will come back to when we should make these decisions.

Scheduling for large batch machine servers, such as those that process databases, concentrates on throughput, measured in jobs per hour. Charging in these systems is generally done in dollars per CPU hour, so it is important to keep the CPU as busy as possible in order to make as many dollars as possible.

The simplest approach of all is first come, first served (FCFS). In FCFS, jobs are simply executed in the order in which they arrive. This approach has the advantage of being fair; all jobs get the processing they need in a relatively predictable time.

Better still, in some ways, is Shortest Job First (SJF). SJF is provably optimal for minimizing the wait time, among a fixed set of jobs. However, in order to maintain fairness, one has to be careful about continuing to allow new jobs to join the processing queue ahead of older, longer jobs. Moreover, actually determining which jobs will be short is often a manual process, and error-prone, at that. When I was a VMS systems administrator, we achieved an equivalent effect by having a high-priority batch queue and a low-priority batch queue. The high-priority one was used only rarely, when someone suddenly needed a particular job done quickly, and usually for shorter jobs than the low-priority batch queue.

If CPU is the only interesting resource, FCFS does well. But in reality, computers are complex machines with multiple resources that we would like to keep busy, and different jobs have different characteristics. What if one job would like to do a lot of disk I/O, and another is only using the CPU? We call these I/O-bound and CPU-bound jobs, respectively. FCFS would have the disk busy for the first one, then the CPU busy for the second one. Is there a way we can keep both busy at the same time, and improve overall throughput?

The efficiency of the next instruction executed is dependent on the current state of the machine. What chunks of main memory are already stored in the cache? What is the disk head position? (We will study disk scheduling more when we get to file systems.)

Basic Priority Scheduling

Sometimes, we divide jobs by priority.

When Do We Run the Scheduler?

Multiprogramming

Cooperative multitasking
Preemptive multitasking (most common)

Time quantum or time slice

In the discussion of batch scheduling, we were talking about job scheduling: deciding which large computation is important enough to run next, but then not really worrying about it until the job ends. But most jobs do some I/O, and leaving the CPU idle while the I/O completes is wasteful. Instead, we can use the CPU for another process while the I/O completes. Such a system is multiprogrammed. In addition to involuntarily giving up the CPU to complete some I/O, most systems support voluntarily giving up the CPU. In the first version of MacOS, such cooperative multitasking was the only form; now it and almost all other major OSes use preemptive multitasking, in which the operating system can take the CPU away from the application. Obviously, cooperative multitasking makes solving problems such as deadlock easier.

You're already familiar with multitasking operating systems; no self-respecting OS today allows one program to use all of the resources until it completes, then picks the next one. Instead, they all use a quantum of time or a time slice; when the process that is currently running uses up a certain amount of time, its quantum is said to expire, and the CPU scheduler is invoked. The CPU scheduler may choose to keep running the same process, or may choose another process to run. This basic approach achieves two major goals: it allows us to balance I/O-bound and CPU-bound jobs, and it allows the computer to be responsive, and give the appearance that it is paying attention to your job.

Materials for Second Scheduling Lecture

Chapters 8 & 10 of OSTEP and accompanying slides.
Sec. 4.4 of the FreeBSD book.
These notes.
Links to actual kernel sources for FreeBSD and Linux, scattered throughout these notes.

CPU Scheduling

Round-robin scheduling
I/O Priority Boost
Fairness: by User or by Process?

This basic concept of a multiprogrammed system was developed for mainframe hardware with multiple terminals attached to the same computer; fifty people or more might be using the same machine. As we discussed in the first lecture, the concept was pioneered by the Compatible Time Sharing System (CTSS), created at MIT by Fernando Corbató and his collaborators and students.

In this environment, it makes sense to give some priority to interactive jobs, so that human time is not wasted. Batch jobs still run, but at a lower priority than interactive ones. But how do you pick among multiple interactive jobs? The simplest approach is round-robin scheduling, in which the jobs are simply executed for their quantum, and when the quantum expires, the next one in the list is taken and the current one is sent to the back of the list. It is important to select an appropriate quantum.

In round-robin scheduling, if we have five compute-bound tasks, they will execute in the order

ABCDEABCDEABCDE

We have already seen the basic idea of priority scheduling a couple of times. Usually, priority scheduling and round-robin scheduling are combined, and the priority scheduling is strict. If any process of a higher priority is ready to run, no lower-priority process gets the CPU. If batch jobs are given lower priority than those run from interactive terminals, this has the disadvantage of making it attractive for users to run their compute-bound jobs in a terminal window, rather than submitting them to a batch queue.

To guarantee that batch jobs make at least some progress, it is also possible to divide the CPU up so that, say, 80 percent of the CPU goes to high-priority jobs and 20 percent goes to low-priority jobs. In practice, this is rarely necessary.

I/O Priority Boost

One important technique in scheduling is to give a priority boost to tasks that are I/O bound, and possibly decrease the priority of CPU-bound tasks. This approach increases responsiveness and helps to keep all of the system resources busy, improving throughput, but implementing it correctly is tricky.

Fairness: by User or by Process?

If, in the above set of tasks, A through D belong to one user, and 1 belongs to another, which is the right approach?

A1B1C1D1A1B1C1D1 or
ABCD1ABCD1?

Multi-Level Feedback Queue

Okay, now you have some of the key ideas: what it means to schedule a resource; what kind of goals we might set and how to measure whether or not we are meeting those goals; key approaches to executing complete jobs in first-come, first served, shortest job first, and basic priority scheduling; and finally, the idea of time slices and round-robin scheduling.

So today, we are going to look at how these come together in a complete system. The most important way that OSes actually manage the large set of heterogeneous tasks is a multi-level feedback queue, one of the oldest ideas in operating systems. (See Ch. 8 and the slides.)

Multiprocessor Scheduling

A little basic queueing theory: a single queue for multiple servers (CPUs) is better than separate queues for each server (CPU). Okay, then why does Linux put each process on a particular CPU and leave it there? Two reasons: simplifying locking and improving the performance of the kernel itself, and improving the behavior of the CPU's memory cache.

This field was heavily researched in the 1980s, and due to the rapid increase in multicore systems, will no doubt be important in commodity operating systems for the next several years, especially the interaction of thread scheduling and CPU scheduling.

Thread Scheduling

Threads can be scheduled in the same way.

In a prior lecture, we discussed the implementation of threads at user level, and at kernel level. If the threads are user level, they are often more efficient, but usually have to be cooperatively scheduled, and the kernel can't help put them on separate CPUs. For kernel-implemented threads, the OS can, potentially, share them out to separate CPUs. In practice, if each CPU has its own cache, this is difficult to do correctly, and the performance penalty is large.

Bonus: Instruction and Thread Scheduling in Multithreaded Architectures

Multiple issue microprocessors
Multithreaded architectures

There are a number of fascinating things happening in computer architecture that affect scheduling. Modern CPUs are multiple issue; more than one instruction is executed in each clock cycle. The most extreme form of this is the TRIPS architecture from the University of Texas at Austin, where the goal is to issue one thousand instructions in every clock cycle!

At the other end, one important experiment is in multithreaded architectures, in which the CPU has enough hardware to support more than one thread, under limited circumstances. The most extreme form of this was the Tera Computer, which had hardware support for 128 threads and always switched threads on every clock cycle. This approach allowed the machine to hide the latency to memory, and work without a cache. It also meant that the overall throughput for the system was poor unless a large number of processes or threads were ready to execute all of the time.

The creator of Tera was Burton Smith, who was an influential architect, won a ridiculous number of awards, and was a Fellow of the American Academy of Arts and Sciences. At the end of his career, he was a Fellow at Microsoft, and was working on hardware technologies to support development of a quantum computer. Sadly, he died on April 2, 2018.

Linux and FreeBSD

There are soooo many ideas in scheduling that have been implemented, especially in Linux over the last twenty-five years, or FreeBSD and its ancestors for even longer...let's discuss them!

FreeBSD
- Separation of low-level scheduler and high-level scheduler
- Low-level scheduler only checks run queues for this particular processor
- High-level scheduling consists of several mechanisms: those that run when a thread unblocks (becomes runnable), and the long-term load balancer, called push migration
- Processor affinity is maintained using a hierarchical tree structure to represent preferred CPUs: first, same motherboard; then, same package (CPU chip); then same core
- Section 4.4 of the FreeBSD book
- The FreeBSD kernel is browsable online. See kern/sched_ule.c
Linux
- O(1) Scheduler
- Completely Fair Scheduler uses a particular notion of fairness

In the Linux kernel, the entity we are scheduling is called a task, which roughly corresponds to a thread. All of the tasks in a process are called a group, and they share their total execution time. Scheduling is implemented using red-black trees, remember them? Documentation by the original authors is here, and an early design note; there are also articles on Wikipedia and by IBM (also available in Japanese!!!) that go back to 2009, but as far as I can tell they remain relevant; I'm a little less sure about their complete accuracy.

Let's have a look at the source code! (Linux/kernel/sched.c existed until 2.4.37, after which it became an entire directory, Linux/kernel/sched/.)

The main function for scheduling is __schedule() in core.c. Just above that in the file is pick_next_task(). The data structure we care most about is cfs_rq.

Similarly, the FreeBSD kernel is browsable online. In kern/sched_ule.c you will find the modern scheduler documented in Section 4.4 of the FreeBSD book. You can find sources for a more complete distribution here.

Miscellaneous

Realtime: Deadline Scheduling

Realtime scheduling is done by establishing deadlines for certain tasks.

Airplanes falling from the sky, death and destruction all around. We don't want that, do we? Then don't play Tetris on your flight avionics hardware...

We should have come to this earlier, but it didn't fit into the flow above. One important class of scheduling algorithms is deadline scheduling algorithms for realtime systems.

Unix Batch Systems

nice, batch, at, and cron are Unix tools for managing priorities and submitting and controlling batch jobs, including those to be executed at later times. All of them are lousy compared to the equivalent tools for VMS and mainframes, and I don't know why. OpenPBS is a batch system that can be installed; I've never used it.

Current Scheduling Research

Kono Lab built tools to allow background jobs to more effectively assess their impact on the system.
Microsoft's Apollo system schedules billions of tasks per day (an average of over 20,000 tasks per second, and peak rate several times that!) on a cluster of 20,000 nodes.

Apollo makes distributed decisions, and allows those decisions to be corrected. Jobs are specified using a custom language that describes the dependencies between portions of the computation. Some parts of the system are implemented using Paxos, a well-known algorithm for Byzantine agreement. It also allows both regular tasks and opportunistic tasks. For large jobs, it can detect "stragglers", processes that are moving too slowly and will stop others from completing their work. A great deal of this builds on a carefully constructed dependency graph.

Homework

A scheduling-related programming assignment will be handed out on Friday.

Next Lecture

Next lecture:

Lecture 5, April 23: Address Spaces, Memory API, and Free Space Management

Readings for next time and followup for this time:

Chapters 13, 14, 16, 17 of OSTEP

Follow-up:

A brief article on the Linux 2.6.0 kernel scheduler.
Since 2.6.23, Linux uses an implementation of the Completely Fair Scheduler (CFS), which is an implementation f weighted fair queueing. There is a short article about it on Wikipedia, in 日本語 as well as English. However, there is really no detail about the algorithm itself. The IBM page is much better. What's more, it's available in Japanese! Check the link that mentions multiprocessing.
Tullsen et al., Simultaneous multithreading
A decade of wasted cores has a great, detailed description of how things can go wrong.
Yoshihisa Abe, Hiroshi Yamada and Kenji Kono (Keio University), Enforcing Appropriate Process Execution for Exploiting Idle Resources from Outside Operating Systems, Proc. EuroSys'08. Authors are from the Kono Group at the Keio Yagami campus.

システム・ソフトウェア Operating Systems

第3回 4月16日 プロセススケジューリング Lectures 3 and 4, April 16 and 20: Process Scheduling