Keio University
Spring Semester 2008

System Software / Operating Systems

Spring Semester 2008, Tuesday, 2nd period
Course code: 60730
Location: SFC
Format: Lecture
Instructor: Rodney Van Meter
E-mail: rdv@sfc.keio.ac.jp

Lecture 10, June 24: File System Implementations

This lecture needs some serious work. If I'm going to talk about performance, I have to include a little about disk architecture to start with. I also need to run through the number of I/Os required to actually read a file from disk.

Outline

Goals of a File System Implementation
Metadata
The inode
Kernel implementation
- The file operations and inode switches
- The File Buffer Cache and Virtual Memory
- Ordering Operations for Correctness
Disk Layout
Performance Issues
Other Types of File Systems

Goals of an FS Implementation

Store and retrieve user data reliably
Use storage devices (especially disk) efficiently
Meet some subset of the "ilities"
- In databases and file systems, Reliability, Availability, Scalability, and Recoverability are common wishes.

Metadata

Metadata is data about data.
In file systems, metadata comes in two flavors:
- Metadata about files (permissions, time stamps, etc.)
- Metadata about storage (where the file is stored)
The inode, developed for Unix file systems, holds some of both kinds of metadata about a file.

The file system uses several forms of metadata, or data about the data, to manage files. We have seen two of the most important types of metadata already: names, and security information (permissions or access control lists). But there are other types, which may or may not be supported by a particular file system type:

File layout information
Timestamps: created, modified, accessed, backed up
File type
File size (last byte)
File storage (which may be either more or less than the file size)
Block size
Realtime info
User-definable metadata

Some of these attributes are preserved across backup and restore, or transfer to another system via ftp. Others are not, and sometimes problems occur.
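To make the list above concrete, here is a minimal sketch that asks the kernel for some of a file's metadata using the POSIX stat() call. The function name print_metadata is made up for this example. Note that the size and the storage actually allocated are reported separately.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Print some of the metadata the kernel keeps for `path`.
 * Returns 0 on success, -1 if stat() fails. */
int print_metadata(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;

    printf("file size (bytes):       %lld\n", (long long)st.st_size);
    /* st_blocks is usually counted in 512-byte units; it can be more
     * or less than the file size would suggest. */
    printf("storage (st_blocks):     %lld\n", (long long)st.st_blocks);
    printf("preferred I/O block:     %ld\n", (long)st.st_blksize);
    printf("permission bits (octal): %o\n", (unsigned)(st.st_mode & 07777));
    printf("regular file?            %s\n", S_ISREG(st.st_mode) ? "yes" : "no");
    printf("last modified (epoch):   %lld\n", (long long)st.st_mtime);
    printf("last accessed (epoch):   %lld\n", (long long)st.st_atime);
    return 0;
}
```

Calling print_metadata("/etc/passwd") on a Unix system should print one line per field. Which timestamps are maintained, and how precisely, depends on the file system type.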

File forks were originally developed for the Macintosh, to hold file icons. NTFS has a similar feature called file streams. These forks really blur the boundary between system metadata and user data, and typically are not preserved when files move between systems.

Metadata is such an important concept that the IEEE has conferences on the topic.

The inode

The Berkeley Fast File System (FFS) is perhaps the most influential file system in the history of operating systems. It is not used much anymore in its original form, but some of the principles developed have become so ubiquitous that their vocabulary is standard. The foremost of these concepts is the inode. (Technically, the inode was developed as part of the original Unix file system, but FFS is the system that cemented the inode's place in systems history.)

In FFS, naming information is kept in the directory structure, but the other critical metadata we just discussed is held in the file's inode. (We'll come back to where the inode is stored a little later in the lecture, but it's not in the directory file.)

The concept of the inode has become so important that it is effectively shorthand for the separation of naming and storage metadata. One reason is that the inode structure appears directly in the kernel. Once Unix started to support different types of file systems, all of them were required to use the inode structure.
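One way to see the separation of naming from the rest of the metadata is to give one file two names. The sketch below uses the POSIX link() and stat() calls. The helper names same_inode and link_demo are made up for this example, and it assumes both paths are on a file system that supports hard links.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if both names refer to the same inode on the same device,
 * 0 if they do not, and -1 if either stat() call fails. */
int same_inode(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/* Create `orig`, add a second directory entry `alias` for it,
 * and report whether the two names share one inode. */
int link_demo(const char *orig, const char *alias)
{
    int fd = open(orig, O_CREAT | O_WRONLY | O_TRUNC, 0644);

    if (fd < 0)
        return -1;
    close(fd);
    unlink(alias);               /* ignore failure: alias may not exist */
    if (link(orig, alias) != 0)  /* new name, no new inode */
        return -1;
    return same_inode(orig, alias);
}
```

After link_demo() succeeds, both names report the same inode number, and the link count in that inode (st_nlink) is 2. The permissions, timestamps, and block pointers exist only once.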

Below is the Linux kernel's struct inode, as it appeared in a 2.6-series kernel.

struct inode {
        struct hlist_node       i_hash;
        struct list_head        i_list;
        struct list_head        i_sb_list;
        struct list_head        i_dentry;
        unsigned long           i_ino;
        atomic_t                i_count;
        umode_t                 i_mode;
        unsigned int            i_nlink;
        uid_t                   i_uid;
        gid_t                   i_gid;
        dev_t                   i_rdev;
        loff_t                  i_size;
        struct timespec         i_atime;
        struct timespec         i_mtime;
        struct timespec         i_ctime;
        unsigned int            i_blkbits;
        unsigned long           i_version;
        blkcnt_t                i_blocks;
        unsigned short          i_bytes;
        spinlock_t              i_lock; /* i_blocks, i_bytes, maybe i_size */
        struct mutex            i_mutex;
        struct rw_semaphore     i_alloc_sem;
        struct inode_operations *i_op;
        const struct file_operations    *i_fop; /* former ->i_op->default_file_ops */
        struct super_block      *i_sb;
        struct file_lock        *i_flock;
        struct address_space    *i_mapping;
        struct address_space    i_data;
#ifdef CONFIG_QUOTA
        struct dquot            *i_dquot[MAXQUOTAS];
#endif
        struct list_head        i_devices;
        union {
                struct pipe_inode_info  *i_pipe;
                struct block_device     *i_bdev;
                struct cdev             *i_cdev;
        };
        int                     i_cindex;

        __u32                   i_generation;

#ifdef CONFIG_DNOTIFY
        unsigned long           i_dnotify_mask; /* Directory notify events */
        struct dnotify_struct   *i_dnotify; /* for directory notifications */
#endif

#ifdef CONFIG_INOTIFY
        struct list_head        inotify_watches; /* watches on this inode */
        struct mutex            inotify_mutex;  /* protects the watches list */
#endif

        unsigned long           i_state;
        unsigned long           dirtied_when;   /* jiffies of first dirtying */

        unsigned int            i_flags;

        atomic_t                i_writecount;
#ifdef CONFIG_SECURITY
        void                    *i_security;
#endif
        void                    *i_private; /* fs or device private pointer */
#ifdef __NEED_I_SIZE_ORDERED
        seqcount_t              i_size_seqcount;
#endif
};

Kernel Implementation

Object-oriented implementations use an inode operations switch and a file operations switch.
Most OSes combine much of the operation of the file system buffer cache with the virtual memory system.
Ordering of I/O operations affects the correctness of the system.

The inode operations switch

The inode operations that are available manipulate the on-disk file attributes, including the namespace (directory structure).

struct inode_operations {
        int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
        struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
        int (*link) (struct dentry *,struct inode *,struct dentry *);
        int (*unlink) (struct inode *,struct dentry *);
        int (*symlink) (struct inode *,struct dentry *,const char *);
        int (*mkdir) (struct inode *,struct dentry *,int);
        int (*rmdir) (struct inode *,struct dentry *);
        int (*mknod) (struct inode *,struct dentry *,int,dev_t);
        int (*rename) (struct inode *, struct dentry *,
                        struct inode *, struct dentry *);
        int (*readlink) (struct dentry *, char __user *,int);
        void * (*follow_link) (struct dentry *, struct nameidata *);
        void (*put_link) (struct dentry *, struct nameidata *, void *);
        void (*truncate) (struct inode *);
        int (*permission) (struct inode *, int, struct nameidata *);
        int (*setattr) (struct dentry *, struct iattr *);
        int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
        int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
        ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        int (*removexattr) (struct dentry *, const char *);
        void (*truncate_range)(struct inode *, loff_t, loff_t);
};

The file operations switch

The file operations switch inside the kernel is used to actually implement the data read/write calls a program makes on an open file.

struct file_operations {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
        ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
        int (*readdir) (struct file *, void *, filldir_t);
        unsigned int (*poll) (struct file *, struct poll_table_struct *);
        int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
        long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
        long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *, fl_owner_t id);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, struct dentry *, int datasync);
        int (*aio_fsync) (struct kiocb *, int datasync);
        int (*fasync) (int, struct file *, int);
        int (*lock) (struct file *, int, struct file_lock *);
        ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
        ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
        int (*check_flags)(int);
        int (*dir_notify)(struct file *filp, unsigned long arg);
        int (*flock) (struct file *, int, struct file_lock *);
        ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
        ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
};
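The idea behind these switches can be shown in user space with a much smaller table of function pointers. The sketch below is not kernel code: the names toy_file, toy_file_ops, mem_ops, zero_ops, and toy_read are all invented for illustration, and the only operation is read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Hypothetical, much-reduced analogue of struct file_operations. */
struct toy_file_ops {
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_ops *ops; /* chosen when the file is opened */
    const char *data;               /* backing bytes for the in-memory type */
    size_t pos;                     /* current offset */
};

/* "File system" 1: reads from an in-memory string. */
static long mem_read(struct toy_file *f, char *buf, size_t len)
{
    size_t left = strlen(f->data) - f->pos;

    if (len > left)
        len = left;
    memcpy(buf, f->data + f->pos, len);
    f->pos += len;
    return (long)len;
}

/* "File system" 2: every read returns zero-filled bytes. */
static long zero_read(struct toy_file *f, char *buf, size_t len)
{
    (void)f;
    memset(buf, 0, len);
    return (long)len;
}

const struct toy_file_ops mem_ops  = { .read = mem_read };
const struct toy_file_ops zero_ops = { .read = zero_read };

/* Generic layer: dispatches without knowing the concrete type. */
long toy_read(struct toy_file *f, char *buf, size_t len)
{
    assert(f != NULL && f->ops != NULL && f->ops->read != NULL);
    return f->ops->read(f, buf, len);
}
```

The caller of toy_read() never names mem_read or zero_read directly. In the same spirit, the kernel's generic read path calls through the file's operations table, and each file system type fills in its own functions.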

The File Buffer Cache and Virtual Memory

Most operating systems today combine the operation of the file system buffer cache with the virtual memory system.

Actually, the integration isn't complete, but the same data structures and algorithms are involved. In a Linux system, when memory runs short, a single page frame replacement algorithm (PFRA) is run.

It's also important to note that file pages that are only accessed using read() and write() are not normally mapped into the process's address space, though files that are accessed using mmap() are.
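The difference between the two access styles shows up in the code a program writes. This is a minimal sketch using the POSIX read() and mmap() calls; the function names first_byte_read and first_byte_mmap are made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Fetch the first byte with read(): the kernel copies the data from
 * its cache into our variable. Returns -1 on error or empty file. */
int first_byte_read(const char *path)
{
    unsigned char c;
    ssize_t n;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, &c, 1);
    close(fd);
    return n == 1 ? c : -1;
}

/* Fetch the first byte with mmap(): the file's page is mapped into
 * our address space and we touch it like ordinary memory. */
int first_byte_mmap(const char *path)
{
    struct stat st;
    unsigned char *p;
    int c, fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping remains valid after close */
    if (p == MAP_FAILED)
        return -1;
    c = p[0];
    munmap(p, (size_t)st.st_size);
    return c;
}
```

Both functions return the same byte. With read() there is an explicit copy into the program's buffer, while with mmap() the access is an ordinary memory reference that may cause a page fault.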

Ordering I/O Operations for Correctness

If the I/O operations for a file system transaction are misordered, data corruption can result, or even a security hole that results in leaking data from one user's files into another user's files. In a Berkeley FFS file system, the order of a data write that lengthens the file must be:

Allocate a block by marking it in the bit map.
Write the data to the block.
Update the inode to point to the block.
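The three steps above can be sketched against a toy in-memory "disk". Everything here (toy_fs, toy_append, the sizes, the one-byte-per-block map) is invented for illustration and is not FFS code. The log field only records the order in which the writes were issued.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS  16
#define BLOCK_SZ 8
#define NDIRECT  4

/* Toy in-memory "disk" holding one file; every name is hypothetical. */
struct toy_fs {
    unsigned char bitmap[NBLOCKS];           /* 1 = block in use */
    char          blocks[NBLOCKS][BLOCK_SZ]; /* data blocks */
    int           inode_ptrs[NDIRECT];       /* the file's block pointers */
    int           inode_nblocks;             /* pointers currently valid */
    char          log[64];                   /* order of issued writes */
};

void toy_fs_init(struct toy_fs *fs)
{
    memset(fs, 0, sizeof *fs);
}

/* Append one block to the file. Returns the block number used,
 * or -1 if no block or no pointer slot is free. */
int toy_append(struct toy_fs *fs, const char data[BLOCK_SZ])
{
    int b;

    assert(fs != NULL && data != NULL);
    if (fs->inode_nblocks >= NDIRECT)
        return -1;
    for (b = 0; b < NBLOCKS && fs->bitmap[b]; b++)
        ;
    if (b == NBLOCKS)
        return -1;

    /* 1. Mark the block allocated, so nobody else can claim it. */
    fs->bitmap[b] = 1;
    strcat(fs->log, "B");

    /* 2. Write the data, replacing whatever the block held before. */
    memcpy(fs->blocks[b], data, BLOCK_SZ);
    strcat(fs->log, "D");

    /* 3. Only now make the block reachable from the inode. */
    fs->inode_ptrs[fs->inode_nblocks++] = b;
    strcat(fs->log, "I");

    return b;
}
```

If step 3 came before step 2 and the system crashed in between, the file would point at a block still holding its previous contents, which may have belonged to another user. A real file system also has to make sure each write has reached the disk before issuing the next one, which this sketch does not model.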

Disk Layout

The disk must hold various types of information:

information about the overall file system structure itself (e.g., partitions)
directory information (or other metadata for naming users' data)
information about which disk blocks are in use
the data blocks for the files

Disks are often divided into partitions. Each partition may be used in a different fashion. Some partitions are used for swap space, while others may hold different file systems and even different operating systems. On PC hardware, sector 0 of the disk is known as the Master Boot Record (MBR). It contains both a small program that starts the booting of the computer and the partition table.

The basic information about the file system is held in a structure known as the superblock. The superblock says what type of file system it is, where to find the root directory, where to find the information about free disk blocks, where to find the inodes, etc.

The above information is relatively common across different types of file systems (of which there are many). The principles of file systems are also fairly common, but the rest of this section deals primarily with the Berkeley FFS, which was heavily imitated for Linux's ext2 file system.

The inodes are held in several areas of the disk specifically set aside to hold them. The number of inodes, and consequently the maximum number of files in the system, is fixed when the file system is created. The intent of putting inodes in several places on disk, rather than all at the beginning of the disk, is for the inode, the corresponding directory entry, and the file data blocks to all be near each other, reducing seek time and improving performance.

Blocks are allocated using a bitmap which represents the blocks that are in use and those that are not, much as we showed for memory management.
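A minimal sketch of bitmap allocation, with one bit per disk block. The function names bitmap_alloc and bitmap_free are made up, and the simple first-fit scan is an assumption for illustration. FFS itself tries to pick a block near the file's other blocks rather than the first free one.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Find the first clear bit among the first `nbits` bits, set it, and
 * return its index (the block number). Returns -1 if all are in use. */
long bitmap_alloc(unsigned char *map, size_t nbits)
{
    size_t i;

    assert(map != NULL);
    for (i = 0; i < nbits; i++) {
        unsigned char mask = (unsigned char)(1u << (i % CHAR_BIT));

        if (!(map[i / CHAR_BIT] & mask)) {
            map[i / CHAR_BIT] |= mask;   /* mark block i in use */
            return (long)i;
        }
    }
    return -1;
}

/* Mark block `i` free again. */
void bitmap_free(unsigned char *map, size_t i)
{
    assert(map != NULL);
    map[i / CHAR_BIT] &= (unsigned char)~(1u << (i % CHAR_BIT));
}
```

One bit per block keeps the map small: a 100GB disk with 4KB blocks needs about 25 million bits, or roughly 3MB of bitmap.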

An inode includes a pointer to the first ten or so blocks of the file. If the file is short, then, the operating system can retrieve all of the layout information fairly quickly. When the file is larger than ten blocks, an extra pointer that points to an indirect block is used.
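The lookup from "block number within the file" to "block number on disk" can be sketched as follows. The struct toy_inode, the constant NDIRECT = 10, and the convention that 0 means "no block" are assumptions for this example, and only a single indirect block is modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NDIRECT 10  /* assumed: "ten or so" direct pointers */

struct toy_inode {
    uint32_t direct[NDIRECT]; /* disk blocks of the first NDIRECT file blocks */
    uint32_t indirect;        /* disk block that holds further pointers */
};

/* Translate file block number `fbn` to a disk block number.
 * `indirect_block` is the already-read contents of the indirect block,
 * an array of `nptrs` pointers (NULL if it has not been read).
 * Returns 0, meaning "no block" in this sketch, when out of range. */
uint32_t toy_bmap(const struct toy_inode *ino, const uint32_t *indirect_block,
                  uint32_t nptrs, uint32_t fbn)
{
    assert(ino != NULL);
    if (fbn < NDIRECT)
        return ino->direct[fbn];
    fbn -= NDIRECT;
    if (indirect_block != NULL && fbn < nptrs)
        return indirect_block[fbn];
    return 0;
}

/* Disk reads needed to find the pointer, beyond reading the inode. */
int toy_bmap_extra_reads(uint32_t fbn)
{
    return fbn < NDIRECT ? 0 : 1;
}
```

For the first ten blocks the inode alone is enough. Beyond that, the indirect block has to be read from disk before the data block can be located, which costs one more I/O unless the indirect block is already cached.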

Performance Issues

Performance Measures
Efficient Disk Layout
Prefetching
Write Caching
Data Copies in the File System

Performance Measures

When purchasing a computer system, the performance of the I/O may be specified in terms of either bandwidth or transaction performance, measured in megabytes/second for large transfers or operations/second, respectively. In most cases, those are averages, but some systems support real-time I/O, as well, providing specific guarantees for each process.

An additional concern in some cases can be the CPU overhead of I/O operations.

Efficient Disk Layout

We have entirely glossed over the issue of how to achieve good performance. Reading data randomly gives very poor performance, while reading it sequentially does very well.

Operating systems must worry about rotational latency, head switch time, and the zoned structure of the disk drive.

Obviously, this is less of an issue with flash devices, but it does play some part.
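A rough model makes the gap between sequential and random reads concrete. The two functions below are a simplification invented for this example: a sequential read pays only transfer time, while each random operation pays a fixed positioning cost plus its own transfer time. Real disks are messier.

```c
#include <assert.h>

/* Estimated seconds to read `nbytes` in one sequential transfer. */
double seq_read_secs(double nbytes, double bytes_per_sec)
{
    assert(bytes_per_sec > 0);
    return nbytes / bytes_per_sec;
}

/* Estimated seconds to read data as `nops` separate random operations
 * of `bytes_per_op` bytes, each paying `secs_per_op` for positioning
 * (seek plus rotational latency) before its transfer. */
double random_read_secs(double nops, double bytes_per_op,
                        double secs_per_op, double bytes_per_sec)
{
    assert(bytes_per_sec > 0);
    return nops * (secs_per_op + bytes_per_op / bytes_per_sec);
}
```

Take the disk figures used in the homework at the end of these notes (40 megabytes/second, 10 milliseconds per random operation) and 1MB of data in 4KB blocks. Then seq_read_secs(1048576, 40e6) is about 0.026 seconds, and random_read_secs(256, 4096, 0.010, 40e6) is about 2.6 seconds, roughly 100 times slower.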

Prefetching

Operating systems generally try to recognize sequential I/O, and bring in more than one block at once, hoping that the application will use the extra data.
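One simple way to recognize sequential access is to remember which block would come next and to grow a prefetch window while the guesses keep being right. This is a sketch of that idea only, not the algorithm any particular kernel uses. The names ra_state, ra_init, and ra_on_read, and the cap RA_MAX, are made up.

```c
#include <assert.h>
#include <stddef.h>

#define RA_MAX 32  /* assumed cap on the prefetch window, in blocks */

struct ra_state {
    long next_expected; /* block we expect if access is sequential */
    int  window;        /* blocks to prefetch after the current one */
};

void ra_init(struct ra_state *ra)
{
    assert(ra != NULL);
    ra->next_expected = -1;
    ra->window = 0;
}

/* Called on each read of `block`. Returns how many of the following
 * blocks should be prefetched along with it. */
int ra_on_read(struct ra_state *ra, long block)
{
    assert(ra != NULL);
    if (block == ra->next_expected) {
        /* Sequential hit: start at 1, then double up to the cap. */
        ra->window = (ra->window == 0) ? 1 : ra->window * 2;
        if (ra->window > RA_MAX)
            ra->window = RA_MAX;
    } else {
        /* Looks random: prefetching would waste I/O and memory. */
        ra->window = 0;
    }
    ra->next_expected = block + 1;
    return ra->window;
}
```

Reading blocks 10, 11, 12, 13 in order yields windows of 0, 1, 2, 4. A jump to block 100 resets the window to 0.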

Write Caching

All modern operating systems do write caching, where data written to a file is not actually committed to disk when the application's write() completes. In Linux, a special kernel process called pdflush is charged with writing the cached data out to disk later.
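An application that cannot tolerate this delay can ask for its data to be pushed out explicitly. The sketch below uses the POSIX write() and fsync() calls; the function name write_durably is made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write `msg` to `path` and wait until the kernel reports that the
 * data has been handed to the storage device.
 * Returns 0 on success, -1 on failure. */
int write_durably(const char *path, const char *msg)
{
    size_t len;
    int fd;

    assert(path != NULL && msg != NULL);
    len = strlen(msg);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() normally returns as soon as the data is in the cache. */
    if (write(fd, msg, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* fsync() blocks until the cached data for this file is flushed. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Without the fsync() call, the function would usually return sooner, but a crash shortly afterwards could lose the data. Whether the disk's own internal cache has also been flushed depends on the system and the device.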

Data Copies in the File System

Last week we talked about file system APIs. At one point, we talked about the alignment of application file read/write buffers. In modern C/Unix APIs, the buffer can be any place in memory, but in older systems, buffers always had to be aligned to the size of the system memory page.

In Unix systems, it is also true that disk I/Os are done in multiples of a page size, and the I/O is also done to page boundaries. So how are the API and the I/O system reconciled? Through the file system buffer cache. The buffer cache serves two important purposes: the first is alignment, and the second

Other Types of File Systems

Most of the disk layout information we talked about applies only to FFS and ext2, but there are many other on-disk structures:

Log-structured file systems (historically important, but not regularly used).
SGI's XFS, which uses b-trees for everything, including managing the disk layout.
NTFS.
The original Microsoft FAT file systems, which used linked lists to maintain the blocks in a file rather than indirect blocks.
Journaled file systems, which use a special file or area of the disk to get write performance and data integrity equivalent to LFS, with the read performance of a regular FS.

There are dozens of others, with many interesting characteristics. We could easily do an entire term on file systems alone.
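As one contrast with indirect blocks, the linked-list layout used by FAT can be sketched in a few lines. This is a simplification: the table here is a plain array, FAT_EOC is an assumed end-of-chain marker, and the function name fat_chain is made up. Real FAT variants differ in entry width and reserved values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAT_EOC 0xFFFFu  /* assumed end-of-chain marker for this sketch */

/* Follow the chain that starts at cluster `first`. Stores up to `max`
 * cluster numbers in `out` and returns the chain length, or -1 if the
 * chain leaves the table or is longer than the table (a loop). */
long fat_chain(const uint16_t *fat, size_t nclusters, uint16_t first,
               uint16_t *out, size_t max)
{
    size_t n = 0;
    uint16_t c = first;

    assert(fat != NULL && out != NULL);
    while (c != FAT_EOC) {
        if (c >= nclusters || n >= nclusters)
            return -1;
        if (n < max)
            out[n] = c;
        n++;
        c = fat[c];   /* each entry names the file's next cluster */
    }
    return (long)n;
}
```

With fat[2] = 5, fat[5] = 3, and fat[3] = FAT_EOC, a file starting at cluster 2 occupies clusters 2, 5, 3. Finding the k-th block of a file means walking k links, whereas an inode with direct and indirect pointers reaches any block in a small, fixed number of steps.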

Homework, Etc.

Homework

This is the last homework!!! After this week, your homework responsibility is your term project.

Imagine that the bitmap showing which disk blocks are free has become unreliable, so you decide to rebuild it by walking through all of the in-use inodes. Assume on-disk inodes are 128 bytes, and the file system was initialized to hold a maximum of one million files. Your disk has a transfer rate of 40 megabytes/second, and can execute a random operation in 10 milliseconds. It holds 100GB.
1. If all of the inodes are stored in one contiguous chunk of disk, how long will it take to read them all?
2. If 100,000 of the files each use a single indirect block that is randomly placed on the disk, how long will it take to work through all of them?
3. Using the same type of disk, how long would it take to read every 4KB block on the disk in random order?
Do you back up the data on your computer? If so, how?
1. Execute a data backup. This backup may be of any type, using any tool, and may be just your user files or may be the entire system. You may back up to CD/DVD, external disk, tape, or over the network to a server.
2. Report how long it took to perform your backup, and how much data was transferred.
Report the status of your term project.

Next Lecture

Next lecture:

Lecture 11, July 1: Input/Output Systems

Readings for next week and followup for this week:

Tanenbaum, 5.1-5.4

Follow-up:

McKusick's paper on the original Unix Fast File System
Warning: a lot of information on the web about how Linux's page cache works seems to be out of date. Gregory Smith's page seems to both be up to date, and have good additional references.
Understanding the Linux Kernel can be browsed at Google Books, or, better yet, read in its entirety online here
The OpenVMS Files-11 On-Disk Structure Level 2 (F-11 ODS2)
An article on Linux internals
HP Open Sources AdvFS (Wikipedia page)
