Keio University
Spring Semester 2009

System Software / Operating Systems

Spring Semester 2009, Tuesday, 2nd Period
Course code: 60730
Location: SFC
Format: Lecture
Instructor: Rodney Van Meter
E-mail: rdv@sfc.keio.ac.jp

Lecture 11, June 30: File System Implementations

Outline

Goals of a File System Implementation
Metadata
The inode
Kernel implementation
- The vnode and inode
- The File Buffer Cache and Virtual Memory
- Ordering Operations for Correctness
Disk Layout
Performance Issues
Other Types of File Systems

Goals of an FS Implementation

Store and retrieve user data reliably
Use storage devices (especially disk) efficiently
Meet some subset of the "ilities"
- In databases and file systems, Reliability, Availability, Scalability, and Recoverability are common wishes. (Compare the ACID properties of databases: atomicity, consistency, isolation, durability.)

Metadata

Metadata is data about data.
In file systems, metadata comes in two flavors:
- Metadata about files (permissions, time stamps, etc.)
- Metadata about storage (where the file is stored)
The inode, developed for Unix file systems, holds some of both kinds of metadata about a file.

The file system uses several forms of metadata, or data about the data, to manage files. We have seen two of the most important types of metadata already: names, and security information (permissions or access control lists). But there are other types, which may or may not be supported by a particular file system type:

File layout information
Timestamps: created, modified, accessed, backed up
File type
File size (last byte)
File storage (which may be either more or less than the file size)
Block size
Realtime info
User-definable metadata

Some of these attributes are preserved across backup and restore, or transfer to another system via ftp. Others are not, and sometimes problems occur.
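
Several of the attributes listed above can be read back from user space with the POSIX stat() call. The following is a minimal sketch, not part of the original lecture; read_metadata() and print_metadata() are hypothetical helper names, and the interpretation of st_blocks as 512-byte units is the usual Linux convention rather than a guarantee.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: fill *out with the metadata for `path`.
 * Returns 0 on success and -1 on failure, exactly as stat() does. */
int read_metadata(const char *path, struct stat *out)
{
    assert(path != NULL && out != NULL);
    return stat(path, out);
}

/* Print the fields that correspond to items in the list above. */
void print_metadata(const char *path)
{
    struct stat st;

    if (read_metadata(path, &st) != 0) {
        perror(path);
        return;
    }
    /* File size: offset of the last byte plus one. */
    printf("size (bytes):         %lld\n", (long long)st.st_size);
    /* File storage: may be more or less than the size (assumed 512-byte units). */
    printf("storage (blocks):     %lld\n", (long long)st.st_blocks);
    /* Block size the file system prefers for I/O. */
    printf("preferred block size: %ld\n", (long)st.st_blksize);
    /* File type and permissions are both packed into st_mode. */
    printf("regular file:         %d\n", S_ISREG(st.st_mode) ? 1 : 0);
    printf("permission bits:      %o\n", (unsigned)(st.st_mode & 07777));
    /* Timestamps, in seconds since the epoch. */
    printf("modified:             %lld\n", (long long)st.st_mtime);
    printf("accessed:             %lld\n", (long long)st.st_atime);
}
```

Calling print_metadata() on a sparse file is one way to see storage that is smaller than the file size.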

File forks were originally developed for the Macintosh, where the resource fork held items such as file icons. NTFS has a similar feature called file streams (alternate data streams). These forks really blur the boundary between system metadata and user data, and typically are not preserved when files move between systems.

Metadata is such an important concept that the IEEE has conferences on the topic.

The inode

The Berkeley Fast File System (FFS) is perhaps the most influential file system in the history of operating systems. It is not used much anymore in its original form, but some of the principles developed have become so ubiquitous that their vocabulary is standard. The foremost of these concepts is the inode. (Technically, the inode was developed as part of the original Unix file system, but FFS is the system that cemented the inode's place in systems history.)

In FFS, naming information is kept in the directory structure, but the other critical metadata we just discussed is held in the file's inode. (We'll come back to where the inode is stored a little later in the lecture, but it's not in the directory file.)

The concept of the inode has become so important that it is effectively shorthand for the separation of naming and storage metadata. One reason is that the inode structure appears directly in the kernel. Once Unix started to support different types of file systems, all of them were required to use the inode structure.

Below is the Linux kernel inode struct.

struct inode {
        struct hlist_node       i_hash;
        struct list_head        i_list;
        struct list_head        i_sb_list;
        struct list_head        i_dentry;
        unsigned long           i_ino;
        atomic_t                i_count;
        umode_t                 i_mode;
        unsigned int            i_nlink;
        uid_t                   i_uid;
        gid_t                   i_gid;
        dev_t                   i_rdev;
        loff_t                  i_size;
        struct timespec         i_atime;
        struct timespec         i_mtime;
        struct timespec         i_ctime;
        unsigned int            i_blkbits;
        unsigned long           i_version;
        blkcnt_t                i_blocks;
        unsigned short          i_bytes;
        spinlock_t              i_lock; /* i_blocks, i_bytes, maybe i_size */
        struct mutex            i_mutex;
        struct rw_semaphore     i_alloc_sem;
        struct inode_operations *i_op;
        const struct file_operations    *i_fop; /* former ->i_op->default_file_ops */
        struct super_block      *i_sb;
        struct file_lock        *i_flock;
        struct address_space    *i_mapping;
        struct address_space    i_data;
#ifdef CONFIG_QUOTA
        struct dquot            *i_dquot[MAXQUOTAS];
#endif
        struct list_head        i_devices;
        union {
                struct pipe_inode_info  *i_pipe;
                struct block_device     *i_bdev;
                struct cdev             *i_cdev;
        };
        int                     i_cindex;

        __u32                   i_generation;

#ifdef CONFIG_DNOTIFY
        unsigned long           i_dnotify_mask; /* Directory notify events */
        struct dnotify_struct   *i_dnotify; /* for directory notifications */
#endif

#ifdef CONFIG_INOTIFY
        struct list_head        inotify_watches; /* watches on this inode */
        struct mutex            inotify_mutex;  /* protects the watches list */
#endif

        unsigned long           i_state;
        unsigned long           dirtied_when;   /* jiffies of first dirtying */

        unsigned int            i_flags;

        atomic_t                i_writecount;
#ifdef CONFIG_SECURITY
        void                    *i_security;
#endif
        void                    *i_private; /* fs or device private pointer */
#ifdef __NEED_I_SIZE_ORDERED
        seqcount_t              i_size_seqcount;
#endif
};

Kernel Implementation

Object-oriented implementations use an inode operations switch and a file operations switch.
Most OSes combine much of the operation of the file system buffer cache with the virtual memory system.
Ordering of I/O operations affects the correctness of the system.

The inode operations switch

The inode operations that are available manipulate the on-disk file attributes, including the namespace (directory structure).

struct inode_operations {
        int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
        struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
        int (*link) (struct dentry *,struct inode *,struct dentry *);
        int (*unlink) (struct inode *,struct dentry *);
        int (*symlink) (struct inode *,struct dentry *,const char *);
        int (*mkdir) (struct inode *,struct dentry *,int);
        int (*rmdir) (struct inode *,struct dentry *);
        int (*mknod) (struct inode *,struct dentry *,int,dev_t);
        int (*rename) (struct inode *, struct dentry *,
                        struct inode *, struct dentry *);
        int (*readlink) (struct dentry *, char __user *,int);
        void * (*follow_link) (struct dentry *, struct nameidata *);
        void (*put_link) (struct dentry *, struct nameidata *, void *);
        void (*truncate) (struct inode *);
        int (*permission) (struct inode *, int, struct nameidata *);
        int (*setattr) (struct dentry *, struct iattr *);
        int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
        int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
        ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        int (*removexattr) (struct dentry *, const char *);
        void (*truncate_range)(struct inode *, loff_t, loff_t);
};

The file operations switch

The file operations switch inside the kernel is used to actually implement the data read/write calls a program makes on an open file.

When this object-oriented method was originally developed in SunOS/Solaris, it was called the vnode or vfs approach, for its benefit in virtualizing the file system and allowing many different FS implementations.

struct file_operations {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
        ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
        int (*readdir) (struct file *, void *, filldir_t);
        unsigned int (*poll) (struct file *, struct poll_table_struct *);
        int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
        long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
        long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *, fl_owner_t id);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, struct dentry *, int datasync);
        int (*aio_fsync) (struct kiocb *, int datasync);
        int (*fasync) (int, struct file *, int);
        int (*lock) (struct file *, int, struct file_lock *);
        ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
        ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
        int (*check_flags)(int);
        int (*dir_notify)(struct file *filp, unsigned long arg);
        int (*flock) (struct file *, int, struct file_lock *);
        ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
        ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
};
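
To make the idea of an operations switch concrete, here is a much-reduced user-space sketch. It is not kernel code: toy_file, toy_file_ops, plain_ops, zero_ops, and toy_read() are all hypothetical names invented for illustration. The point is only that the generic layer calls through a table of function pointers and never needs to know which file system implementation is behind it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* A hypothetical, one-entry analogue of struct file_operations. */
struct toy_file_ops {
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_ops *ops; /* chosen by the file system at open time */
    const char *data;               /* toy backing store */
    size_t size;
    size_t pos;
};

/* "File system" 1: returns the stored bytes and advances the position. */
static long plain_read(struct toy_file *f, char *buf, size_t len)
{
    size_t left = f->size - f->pos;

    if (len > left)
        len = left;
    memcpy(buf, f->data + f->pos, len);
    f->pos += len;
    return (long)len;
}

/* "File system" 2: every read returns zeros, whatever the position. */
static long zero_read(struct toy_file *f, char *buf, size_t len)
{
    (void)f;
    memset(buf, 0, len);
    return (long)len;
}

const struct toy_file_ops plain_ops = { .read = plain_read };
const struct toy_file_ops zero_ops  = { .read = zero_read };

/* Generic layer: dispatch through the switch. */
long toy_read(struct toy_file *f, char *buf, size_t len)
{
    assert(f != NULL && f->ops != NULL && f->ops->read != NULL);
    return f->ops->read(f, buf, len);
}
```

Adding a third implementation means supplying another table; toy_read() itself does not change.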

The File Buffer Cache and Virtual Memory

Most operating systems today combine the operation of the file system buffer cache with the virtual memory system.

Actually, the integration isn't complete, but the same data structures and algorithms are involved. In a Linux system, when memory runs short, a single page frame replacement algorithm (PFRA) is run.

It's also important to note that file pages that are only accessed using read() and write() are not normally mapped into the process address space, though files that are accessed using mmap() are.
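
The difference between the two access paths can be sketched with standard POSIX calls. This is an illustrative example only; load_with_read() and sum_with_mmap() are hypothetical helper names. With read(), the kernel copies bytes from its cache into the caller's buffer; with mmap(), the file's pages themselves appear in the process address space.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy up to `len` bytes from the start of the file into `buf` using read().
 * Returns the number of bytes read, or -1 on error. */
long load_with_read(const char *path, char *buf, size_t len)
{
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    long n = (long)read(fd, buf, len);
    close(fd);
    return n;
}

/* Sum all bytes of the file by mapping it and touching the pages directly.
 * Returns the sum, or -1 on error or for an empty file. */
long sum_with_mmap(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    unsigned char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                            MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i]; /* a page fault brings each page in on first touch */

    munmap(p, (size_t)st.st_size);
    return sum;
}
```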

Ordering I/O Operations for Correctness

If the I/O operations for a file system transaction are misordered, data corruption can result, or even a security hole that results in leaking data from one user's files into another user's files. In a Berkeley FFS file system, the order of a data write that lengthens the file must be:

Allocate a block by marking it in the bit map.
Write the data to the block.
Update the inode to point to the block.
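
The reason for this order can be checked with a toy simulation. The sketch below is not FFS code; toy_disk, apply_until_crash(), and leaks_stale_data() are hypothetical names. It applies only the first few steps of a sequence, as if the machine crashed before the rest reached the disk, and then asks whether the inode points at a block that still holds someone else's old data.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NBLOCKS 4
#define BLKSZ   8

/* Toy "on-disk" state. */
struct toy_disk {
    bool in_use[NBLOCKS];       /* allocation bitmap */
    char block[NBLOCKS][BLKSZ]; /* data blocks (NUL-terminated strings) */
    int  inode_ptr;             /* block the inode points to, -1 if none */
};

enum step { MARK_BITMAP, WRITE_DATA, UPDATE_INODE };

/* Apply the first `done` steps of `order`; the remaining steps are lost
 * in the simulated crash. */
void apply_until_crash(struct toy_disk *d, const enum step order[3], int done,
                       int blk, const char *data)
{
    assert(done >= 0 && done <= 3 && blk >= 0 && blk < NBLOCKS);
    for (int i = 0; i < done; i++) {
        switch (order[i]) {
        case MARK_BITMAP:
            d->in_use[blk] = true;
            break;
        case WRITE_DATA:
            strncpy(d->block[blk], data, BLKSZ - 1);
            d->block[blk][BLKSZ - 1] = '\0';
            break;
        case UPDATE_INODE:
            d->inode_ptr = blk;
            break;
        }
    }
}

/* True if the file exposes a block that was never filled with its own
 * data, i.e. the block's previous contents are visible through the file. */
bool leaks_stale_data(const struct toy_disk *d, const char *data)
{
    return d->inode_ptr >= 0 &&
           strncmp(d->block[d->inode_ptr], data, BLKSZ - 1) != 0;
}
```

With the order given above, a crash at any point leaves at worst a block that is marked in use but not referenced, which a consistency checker can reclaim. If the inode update is moved ahead of the data write, a crash between the two exposes the block's previous contents.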

Disk Layout

The disk must hold various types of information:

information about overall file system structure itself (e.g., partitions)
directory information (or other metadata for naming user's data)
information about which disk blocks are in use
the data blocks for the files

Disks are often divided into partitions. Each partition may be used in a different fashion. Some partitions are used for swap space, while others may hold different file systems and even different operating systems. On PC hardware, sector 0 of the disk is known as the Master Boot Record (MBR); it contains a small program that starts the boot process, along with the partition table.

The basic information about the file system is held in a structure known as the superblock. The superblock says what type of file system it is, where to find the root directory, where to find the information about free disk blocks, where to find the inodes, etc.

The above information is relatively common across different types of file systems (of which there are many). The principles of file systems are also fairly common, but the rest of this section deals primarily with the Berkeley FFS, which was heavily imitated for Linux's ext2 file system.

The inodes are held in several areas of the disk specifically set aside to hold them. The number of inodes, and consequently the maximum number of files in the system, is fixed when the file system is created. The intent of putting inodes in several places on disk, rather than all at the beginning of the disk, is for the inode, the corresponding directory entry, and the file data blocks to all be near each other, reducing seek time and improving performance.

Blocks are allocated using a bitmap which represents the blocks that are in use and those that are not, much as we showed for memory management.
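
A bitmap allocator of this kind can be sketched in a few lines. This is an illustration under assumed parameters, not FFS's actual allocator: NBLOCKS, block_in_use(), alloc_block(), and free_block() are hypothetical names, and the search simply takes the first free block at or after a caller-supplied hint, which is one simple way to keep a file's blocks near each other.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NBLOCKS 64 /* hypothetical tiny disk */

/* One bit per block: 1 = in use, 0 = free. */
unsigned char bitmap[NBLOCKS / CHAR_BIT];

int block_in_use(size_t b)
{
    assert(b < NBLOCKS);
    return (bitmap[b / CHAR_BIT] >> (b % CHAR_BIT)) & 1;
}

/* Allocate the first free block at or after `hint`, wrapping around.
 * Returns the block number, or -1 if the disk is full. */
long alloc_block(size_t hint)
{
    assert(hint < NBLOCKS);
    for (size_t i = 0; i < NBLOCKS; i++) {
        size_t b = (hint + i) % NBLOCKS;
        if (!block_in_use(b)) {
            bitmap[b / CHAR_BIT] |= (unsigned char)(1u << (b % CHAR_BIT));
            return (long)b;
        }
    }
    return -1;
}

/* Mark a block free again. */
void free_block(size_t b)
{
    assert(b < NBLOCKS);
    bitmap[b / CHAR_BIT] &= (unsigned char)~(1u << (b % CHAR_BIT));
}
```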

An inode includes a pointer to the first ten or so blocks of the file. If the file is short, then, the operating system can retrieve all of the layout information fairly quickly. When the file is larger than ten blocks, an extra pointer that points to an indirect block is used.
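
The lookup this implies can be sketched as follows. The numbers are assumptions chosen for illustration: 12 direct pointers (the text above says only "ten or so"), 4096-byte blocks, and 4-byte block pointers, so one indirect block holds 1024 pointers. The names block_loc and locate() are hypothetical. Real file systems in this family also add double and triple indirect blocks, which the sketch reports only as BEYOND.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE   4096u
#define NDIRECT      12u
#define PTRS_PER_BLK (BLOCK_SIZE / sizeof(uint32_t))

enum level { DIRECT, SINGLE_INDIRECT, BEYOND };

struct block_loc {
    enum level level; /* which pointer structure to consult */
    uint32_t   index; /* slot within that structure */
};

/* Map a byte offset in the file to the pointer slot that locates its block. */
struct block_loc locate(uint64_t offset)
{
    assert(PTRS_PER_BLK == 1024); /* sanity check on the assumed sizes */

    uint64_t lbn = offset / BLOCK_SIZE; /* logical block number in the file */
    struct block_loc loc;

    if (lbn < NDIRECT) {
        /* Found directly in the inode: no extra disk read needed. */
        loc.level = DIRECT;
        loc.index = (uint32_t)lbn;
    } else if (lbn < NDIRECT + PTRS_PER_BLK) {
        /* One extra read: fetch the indirect block, then follow the slot. */
        loc.level = SINGLE_INDIRECT;
        loc.index = (uint32_t)(lbn - NDIRECT);
    } else {
        loc.level = BEYOND;
        loc.index = 0;
    }
    return loc;
}
```

With these assumed sizes, the direct pointers cover 12 x 4096 = 49,152 bytes (48 KiB), and one indirect block adds 1024 x 4096 bytes (4 MiB).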

Performance Issues

Performance Measures
Efficient Disk Layout
Prefetching
Write Caching
Data Copies in the File System

Performance Measures

When purchasing a computer system, the performance of the I/O may be specified in terms of either bandwidth or transaction performance, measured in megabytes/second for large transfers or operations/second, respectively. In most cases, those are averages, but some systems support real-time I/O, as well, providing specific guarantees for each process.

An additional concern in some cases can be the CPU overhead of I/O operations.

Efficient Disk Layout

We have entirely glossed over the issue of how to achieve good performance. Reading data randomly gives very poor performance, while reading it sequentially does very well.
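
A back-of-the-envelope model shows how large the gap is. The function below is a rough sketch with a hypothetical name, random_throughput(); it assumes every random request pays one average seek plus half a rotation before any data moves, and ignores caching, queueing, and head switches.

```c
#include <assert.h>

/* Effective throughput, in bytes per second, when each request of
 * `req_bytes` pays an average seek and half a rotation before transfer. */
double random_throughput(double seek_ms, double rpm,
                         double xfer_bytes_per_s, double req_bytes)
{
    assert(rpm > 0 && xfer_bytes_per_s > 0 && req_bytes > 0);

    double half_rotation_s = 0.5 * 60.0 / rpm; /* average rotational latency */
    double per_request_s = seek_ms / 1000.0
                         + half_rotation_s
                         + req_bytes / xfer_bytes_per_s;
    return req_bytes / per_request_s;
}
```

With assumed figures of an 8 ms average seek, 7200 RPM, a 100 MB/s media rate, and 4 KiB requests, this gives roughly 0.3 MB/s for random reads, against 100 MB/s for a long sequential read on the same hypothetical drive.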

Operating systems must worry about rotational latency, head switch time, and the zoned structure of the disk drive.

Layout is less of an issue with flash devices, which have no seek or rotational delay, but it does still play some part.

Prefetching

Operating systems generally try to recognize sequential I/O, and bring in more than one block at once, hoping that the application will use the extra data.
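
One simple way to recognize sequential I/O is to remember which block would continue the current run and to grow a prefetch window each time the guess is right. The sketch below is a toy heuristic, not the algorithm any particular kernel uses; ra_state, ra_init(), ra_on_read(), and MAX_WINDOW are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WINDOW 32 /* hypothetical cap on prefetch size, in blocks */

struct ra_state {
    long next_expected; /* block that would continue the current run */
    int  window;        /* number of blocks to prefetch */
};

void ra_init(struct ra_state *s)
{
    s->next_expected = -1;
    s->window = 0;
}

/* Called on each read of block `blk`. Returns how many blocks after `blk`
 * the caller should prefetch. */
int ra_on_read(struct ra_state *s, long blk)
{
    assert(s != NULL && blk >= 0);

    if (blk == s->next_expected) {
        /* Sequential access confirmed: double the window, up to the cap. */
        s->window = s->window ? s->window * 2 : 1;
        if (s->window > MAX_WINDOW)
            s->window = MAX_WINDOW;
    } else {
        /* Random access: prefetched data would likely be wasted. */
        s->window = 0;
    }
    s->next_expected = blk + 1;
    return s->window;
}
```

Starting small and doubling limits the cost of a wrong guess while still letting long sequential reads reach large transfers quickly.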

Write Caching

All modern operating systems do write caching, where data written to a file is not actually committed to disk when the application's write() completes. In Linux, a special kernel process called pdflush is charged with writing the cached (dirty) data back to disk later.
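
An application that cannot tolerate losing cached writes can ask for them to be forced out with the POSIX fsync() call. The following is a minimal sketch; write_durably() is a hypothetical helper name, and for brevity it treats a short write as an error instead of retrying.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write `text` to `path` and wait until the kernel reports that it has
 * been pushed to the storage device. Returns 0 on success, -1 on error. */
int write_durably(const char *path, const char *text)
{
    assert(path != NULL && text != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    size_t len = strlen(text);
    /* write() normally returns as soon as the data is in the cache. */
    ssize_t n = write(fd, text, len);
    /* fsync() blocks until the cached data for this file is written out. */
    int ok = (n == (ssize_t)len) && fsync(fd) == 0;

    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```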

Data Copies in the File System

Last week we talked about file system APIs. At one point, we talked about the alignment of application file read/write buffers. In modern C/Unix APIs, the buffer can be any place in memory, but in older systems, buffers always had to be aligned to the size of the system memory page.

In Unix systems, it is also true that disk I/Os are done in multiples of a page size, and the I/O is also done to page boundaries. So how are the API and the I/O system reconciled? Through the file system buffer cache. The buffer cache serves two important purposes: the first is alignment, and the second

Other Types of File Systems

Most of the disk layout information we talked about applies only to FFS and ext2, but there are many other on-disk structures:

Log-structured file systems (historically important, but not regularly used).
SGI's XFS, which uses B-trees for everything, including managing the disk layout.
NTFS.
The original Microsoft FAT file systems, which used linked lists to maintain the blocks in a file rather than indirect blocks.
Journaled file systems, which use a special file or area of the disk to get write performance and data integrity equivalent to LFS, with the read performance of a regular FS.
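
The linked-list layout used by FAT, mentioned in the list above, can be sketched as a table in which each entry names the next cluster of the same file. This is a simplified illustration, not the real on-disk format: fat_nth_cluster() and the FAT_EOF marker value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FAT_EOF (-1) /* hypothetical end-of-chain marker */

/* Follow the chain starting at cluster `first` to find the cluster that
 * holds the n-th block of the file (0-based). Returns FAT_EOF if the file
 * is shorter than that. */
int fat_nth_cluster(const int fat[], size_t nclusters, int first, size_t n)
{
    int c = first;

    while (c != FAT_EOF && n > 0) {
        assert(c >= 0 && (size_t)c < nclusters);
        c = fat[c]; /* each table entry names the file's next cluster */
        n--;
    }
    return c;
}
```

Reaching block n costs n table lookups, which is why random access into a large file is slower with this layout than with an inode's direct and indirect pointers.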

There are dozens of others, with many interesting characteristics. We could easily do an entire term on file systems alone.

Homework, Etc.

Homework

Your only homework this week is to report on the progress of your term project.

Your project is not complete until I have received a written report on it, in PDF format (Japanese is OK). Your report should probably be 4-6 pages, depending on your project, writing format, and type and number of graphs, etc.

Next Lecture

Next lecture:

Lecture 12, July 8: Hypervisors

Readings for next week:

Barham et al., "Xen and the Art of Virtualization", SOSP 2003.
TBD.

Readings for next week and followup for this week:

Tanenbaum, 5.1-5.4

Follow-up:

McKusick's paper on the original Unix Fast File System
The original vnode paper, by Steve Kleiman
Warning: a lot of information on the web about how Linux's page cache works seems to be out of date. Gregory Smith's page seems to both be up to date, and have good additional references, such as the Red Hat whitepaper on the virtual memory system.
Understanding the Linux Kernel can be browsed at Google Books, or, better yet, read in its entirety online
The OpenVMS Files-11 On-Disk Structure Level 2 (F-11 ODS2)
An article on Linux internals
HP Open Sources AdvFS (Wikipedia page)
