Keio University
Spring Semester, 2007

System Software / Operating Systems

Spring Semester 2007, Tuesdays, 2nd period
Course code: 60730
Location: SFC
Format: Lecture
Instructor: Rodney Van Meter
E-mail: rdv@sfc.keio.ac.jp

Lecture 10, June 12: File System Implementations

Outline

Metadata

The file system uses several forms of metadata, or data about the data, to manage files. We have seen two of the most important types of metadata already: names, and security information. But there are other types, which may or may not be supported:

Some of these attributes are preserved across backup and restore, or transfer to another system via ftp. Others are not, and sometimes problems occur.

Way back at the beginning, I mentioned file forks but didn't discuss them. Forks were originally developed for the Macintosh, where the resource fork holds items such as file icons. NTFS has a similar feature called alternate data streams. Forks blur the boundary between system metadata and user data, and typically are not preserved when files move between systems.

Metadata is such an important concept that the IEEE has conferences on the topic.

The inode

The Berkeley Fast File System (FFS) is perhaps the most influential file system in the history of operating systems. It is not used much anymore in its original form, but some of the principles developed have become so ubiquitous that their vocabulary is standard. The foremost of these concepts is the inode.

In FFS, naming information is kept in the directory structure, but the other critical metadata we just discussed is held in the file's inode. (We'll come back to where the inode is stored a little later in the lecture, but it's not in the directory file.)

The concept of the inode has become so important that it is effectively shorthand for the separation of naming and storage metadata. One reason is that the inode structure appears directly in the kernel. Once Unix started to support different types of file systems, all of them were required to use the inode structure.

Below is the Linux kernel inode struct, as it appears in a 2.6-series kernel.

struct inode {
        struct hlist_node       i_hash;
        struct list_head        i_list;
        struct list_head        i_sb_list;
        struct list_head        i_dentry;
        unsigned long           i_ino;
        atomic_t                i_count;
        umode_t                 i_mode;
        unsigned int            i_nlink;
        uid_t                   i_uid;
        gid_t                   i_gid;
        dev_t                   i_rdev;
        loff_t                  i_size;
        struct timespec         i_atime;
        struct timespec         i_mtime;
        struct timespec         i_ctime;
        unsigned int            i_blkbits;
        unsigned long           i_version;
        blkcnt_t                i_blocks;
        unsigned short          i_bytes;
        spinlock_t              i_lock; /* i_blocks, i_bytes, maybe i_size */
        struct mutex            i_mutex;
        struct rw_semaphore     i_alloc_sem;
        struct inode_operations *i_op;
        const struct file_operations    *i_fop; /* former ->i_op->default_file_ops */
        struct super_block      *i_sb;
        struct file_lock        *i_flock;
        struct address_space    *i_mapping;
        struct address_space    i_data;
#ifdef CONFIG_QUOTA
        struct dquot            *i_dquot[MAXQUOTAS];
#endif
        struct list_head        i_devices;
        union {
                struct pipe_inode_info  *i_pipe;
                struct block_device     *i_bdev;
                struct cdev             *i_cdev;
        };
        int                     i_cindex;

        __u32                   i_generation;

#ifdef CONFIG_DNOTIFY
        unsigned long           i_dnotify_mask; /* Directory notify events */
        struct dnotify_struct   *i_dnotify; /* for directory notifications */
#endif

#ifdef CONFIG_INOTIFY
        struct list_head        inotify_watches; /* watches on this inode */
        struct mutex            inotify_mutex;  /* protects the watches list */
#endif

        unsigned long           i_state;
        unsigned long           dirtied_when;   /* jiffies of first dirtying */

        unsigned int            i_flags;

        atomic_t                i_writecount;
#ifdef CONFIG_SECURITY
        void                    *i_security;
#endif
        void                    *i_private; /* fs or device private pointer */
#ifdef __NEED_I_SIZE_ORDERED
        seqcount_t              i_size_seqcount;
#endif
};
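
Several of these fields are visible from user space. As a minimal sketch, the POSIX stat() call returns a struct stat whose members correspond roughly to i_ino, i_mode, i_nlink, i_uid, i_gid, i_size and i_mtime above. The function name print_inode_metadata is made up for this example. Note that nothing in the result contains the file's name, which lives in the directory rather than in the inode.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Print the inode-resident metadata for `path`.
 * Returns 0 on success, -1 if stat() fails.
 * (print_inode_metadata is a hypothetical helper, not a system API.)
 */
int print_inode_metadata(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0) {
        perror(path);
        return -1;
    }
    /* Casts keep the printf formats portable across platforms. */
    printf("inode number : %lu\n", (unsigned long)st.st_ino);
    printf("mode (octal) : %lo\n", (unsigned long)st.st_mode);
    printf("link count   : %lu\n", (unsigned long)st.st_nlink);
    printf("owner uid/gid: %lu/%lu\n",
           (unsigned long)st.st_uid, (unsigned long)st.st_gid);
    printf("size (bytes) : %lld\n", (long long)st.st_size);
    printf("last modified: %lld\n", (long long)st.st_mtime);
    return 0;
}
```

Calling print_inode_metadata(".") on two hard links to the same file should report the same inode number for both, since the two names share one inode.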

Disk Layout

The disk must hold various types of information:

Disks are often divided into partitions, and each partition may be used differently: some serve as swap space, while others hold different file systems or even different operating systems. On PC hardware, sector 0 of the disk is known as the Master Boot Record (MBR). It contains both a small program that starts the booting of the computer and the partition table.

The basic information about the file system is held in a structure known as the superblock. The superblock says what type of file system it is, where to find the root directory, where to find the information about free disk blocks, where to find the inodes, etc.

The above information is relatively common across different types of file systems (of which there are many). The principles of file systems are also fairly common, but the rest of this section deals primarily with the Berkeley FFS, which was heavily imitated for Linux's ext2 file system.

The inodes are held in several areas of the disk specifically set aside to hold them. The number of inodes, and consequently the maximum number of files in the system, is fixed when the file system is created. The intent of putting inodes in several places on disk, rather than all at the beginning of the disk, is for the inode, the corresponding directory entry, and the file data blocks to all be near each other, reducing seek time and improving performance.
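
Because the inode areas are fixed in size and number, finding an inode on disk is simple arithmetic. The following is a sketch only, assuming an ext2-style convention in which inodes are numbered from 1 and every inode area (group) holds the same number of inodes. The struct and function names are invented for illustration, and the group size and inode size used in the example are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Where an inode lives on disk (hypothetical struct for this sketch). */
struct inode_location {
    uint32_t group;        /* which inode area holds the inode */
    uint32_t index;        /* slot within that area's inode table */
    uint64_t byte_offset;  /* offset from the start of that table */
};

/*
 * Map an inode number to its group and position, assuming inode
 * numbers start at 1 and all groups are the same size.
 */
struct inode_location locate_inode(uint32_t ino,
                                   uint32_t inodes_per_group,
                                   uint32_t inode_size)
{
    struct inode_location loc;

    assert(ino >= 1 && inodes_per_group > 0);
    loc.group = (ino - 1) / inodes_per_group;
    loc.index = (ino - 1) % inodes_per_group;
    loc.byte_offset = (uint64_t)loc.index * inode_size;
    return loc;
}
```

For example, with an assumed 8192 inodes per group and 128-byte inodes, inode 8195 is the third slot of the second group, 256 bytes into that group's table.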

Blocks are allocated using a bitmap which represents the blocks that are in use and those that are not, much as we showed for memory management.
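
A minimal sketch of bitmap allocation follows, with one bit per block (1 means in use). The 64-block disk, the first-fit search, and all function names are assumptions made for the example. A real allocator would also try to pick a free block near the file's other blocks.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NBLOCKS 64  /* hypothetical, tiny disk for illustration */

/* One bit per disk block; all blocks start out free (zero). */
static unsigned char block_bitmap[NBLOCKS / CHAR_BIT];

/* Returns 1 if block b is marked in use, 0 if it is free. */
int block_in_use(size_t b)
{
    assert(b < NBLOCKS);
    return (block_bitmap[b / CHAR_BIT] >> (b % CHAR_BIT)) & 1;
}

/* First-fit allocation: mark and return the lowest free block, or -1. */
long alloc_block(void)
{
    for (size_t b = 0; b < NBLOCKS; b++) {
        if (!block_in_use(b)) {
            block_bitmap[b / CHAR_BIT] |= (unsigned char)(1u << (b % CHAR_BIT));
            return (long)b;
        }
    }
    return -1;  /* disk full */
}

/* Clear the bit so the block can be allocated again. */
void free_block(size_t b)
{
    assert(b < NBLOCKS);
    block_bitmap[b / CHAR_BIT] &= (unsigned char)~(1u << (b % CHAR_BIT));
}
```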

An inode includes pointers to the first ten or so blocks of the file. If the file is short, the operating system can therefore retrieve all of the layout information quickly. When the file is larger than ten blocks, an extra pointer in the inode refers to an indirect block, a disk block that holds further block pointers rather than file data.
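
The lookup from a block offset within a file to a disk block can be sketched as below. This is an illustration under stated assumptions, not the FFS code: ten direct pointers, 4 KB blocks, 32-bit block numbers, block number 0 meaning "no block", and an indirect block whose contents the caller has already read from disk. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NDIRECT        10     /* "ten or so" direct pointers; real systems vary */
#define BLOCK_SIZE     4096u  /* assumed block size */
#define PTRS_PER_BLOCK (BLOCK_SIZE / sizeof(uint32_t))  /* 1024 pointers */

/* Only the block-pointer part of an inode (hypothetical layout). */
struct toy_inode {
    uint32_t direct[NDIRECT];   /* disk blocks for file blocks 0..9 */
    uint32_t single_indirect;   /* disk block holding more pointers, 0 if none */
};

/*
 * Return the disk block that holds file block `fbn`, or 0 if the file
 * block is out of reach. `indirect_block` is the already-read contents
 * of the block named by ino->single_indirect (may be NULL if unused).
 */
uint32_t map_block(const struct toy_inode *ino,
                   const uint32_t *indirect_block,
                   uint32_t fbn)
{
    assert(ino != NULL);
    if (fbn < NDIRECT)
        return ino->direct[fbn];          /* no extra disk read needed */

    fbn -= NDIRECT;
    if (fbn < PTRS_PER_BLOCK && ino->single_indirect != 0 &&
        indirect_block != NULL)
        return indirect_block[fbn];       /* costs one extra disk read */

    return 0;
}
```

With these assumed parameters, one inode reaches 10 + 1024 = 1034 blocks, or 4,235,264 bytes. Larger files need further levels of indirection, which this sketch omits.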

Kernel Implementation

The file operations switch

struct inode_operations {
        int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
        struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
        int (*link) (struct dentry *,struct inode *,struct dentry *);
        int (*unlink) (struct inode *,struct dentry *);
        int (*symlink) (struct inode *,struct dentry *,const char *);
        int (*mkdir) (struct inode *,struct dentry *,int);
        int (*rmdir) (struct inode *,struct dentry *);
        int (*mknod) (struct inode *,struct dentry *,int,dev_t);
        int (*rename) (struct inode *, struct dentry *,
                        struct inode *, struct dentry *);
        int (*readlink) (struct dentry *, char __user *,int);
        void * (*follow_link) (struct dentry *, struct nameidata *);
        void (*put_link) (struct dentry *, struct nameidata *, void *);
        void (*truncate) (struct inode *);
        int (*permission) (struct inode *, int, struct nameidata *);
        int (*setattr) (struct dentry *, struct iattr *);
        int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
        int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
        ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        int (*removexattr) (struct dentry *, const char *);
        void (*truncate_range)(struct inode *, loff_t, loff_t);
};
struct file_operations {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
        ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
        int (*readdir) (struct file *, void *, filldir_t);
        unsigned int (*poll) (struct file *, struct poll_table_struct *);
        int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
        long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
        long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *, fl_owner_t id);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, struct dentry *, int datasync);
        int (*aio_fsync) (struct kiocb *, int datasync);
        int (*fasync) (int, struct file *, int);
        int (*lock) (struct file *, int, struct file_lock *);
        ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
        ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long,
                        unsigned long, unsigned long);
        int (*check_flags)(int);
        int (*dir_notify)(struct file *filp, unsigned long arg);
        int (*flock) (struct file *, int, struct file_lock *);
        ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t,
                        unsigned int);
        ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t,
                        unsigned int);
};
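
Both structures are tables of function pointers. Each file system fills in the entries it supports, and generic kernel code calls through the table without knowing which file system is underneath. The user-space sketch below imitates that pattern with a single read entry. Every name in it (toy_file, toy_read, the "hello" and "zero" file types) is invented for illustration and is not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* A one-entry operations switch, in the style of struct file_operations. */
struct toy_file_operations {
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_operations *f_op;  /* which implementation to use */
    size_t pos;                              /* current file position */
};

/* Implementation 1: a file whose contents are a fixed string. */
static long hello_read(struct toy_file *f, char *buf, size_t len)
{
    static const char msg[] = "hello\n";
    size_t msg_len = sizeof(msg) - 1;
    size_t left = f->pos < msg_len ? msg_len - f->pos : 0;

    if (len > left)
        len = left;
    memcpy(buf, msg + f->pos, len);
    f->pos += len;
    return (long)len;
}

/* Implementation 2: a file that returns zeros forever. */
static long zero_read(struct toy_file *f, char *buf, size_t len)
{
    memset(buf, 0, len);
    f->pos += len;
    return (long)len;
}

const struct toy_file_operations hello_fops = { .read = hello_read };
const struct toy_file_operations zero_fops  = { .read = zero_read };

/* Generic layer: dispatch through the switch, or fail if unsupported. */
long toy_read(struct toy_file *f, char *buf, size_t len)
{
    assert(f != NULL);
    if (f->f_op == NULL || f->f_op->read == NULL)
        return -1;
    return f->f_op->read(f, buf, len);
}
```

The caller of toy_read never names hello_read or zero_read directly. Leaving an entry NULL is how an implementation signals that it does not support an operation.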

The File Buffer Cache and Virtual Memory

Most operating systems today combine the operation of the file system buffer cache with the virtual memory system.

Ordering I/O Operations for Correctness

If the I/O operations for a file system transaction are misordered, data corruption can result, or even a security hole that results in leaking data from one user's files into another user's files. In a Berkeley FFS file system, the order of a data write that lengthens the file must be:

  1. Allocate a block by marking it in the bit map.
  2. Write the data to the block.
  3. Update the inode to point to the block.
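
One way to see why this order matters is to imagine a crash between any two steps. If the inode were updated before the data was written, a crash could leave the file pointing at whatever the block held before, possibly another user's old data. With the order above, a crash leaves at worst a block that is marked in use but belongs to no file. The toy checker below encodes that reasoning. It is a sketch of the argument, not file system code, and its names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum write_step { MARK_BITMAP, WRITE_DATA, UPDATE_INODE };

/*
 * Returns 1 if the three steps, applied to disk in the given order,
 * are safe no matter where a crash interrupts them:
 *  - data is only written to a block already marked allocated, and
 *  - the inode only points at a block that is allocated and holds
 *    the new data.
 * Returns 0 otherwise.
 */
int order_is_crash_safe(const enum write_step order[3])
{
    int allocated = 0, data_written = 0;

    assert(order != NULL);
    for (int i = 0; i < 3; i++) {
        switch (order[i]) {
        case MARK_BITMAP:
            allocated = 1;
            break;
        case WRITE_DATA:
            if (!allocated)
                return 0;  /* block could be handed to another file */
            data_written = 1;
            break;
        case UPDATE_INODE:
            if (!allocated || !data_written)
                return 0;  /* file would expose stale contents */
            break;
        }
    }
    return 1;
}
```

Of the orderings that use each step once, only bitmap, data, inode passes this check.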

Performance Issues

Prefetching

Operating systems generally try to recognize sequential I/O, and bring in more than one block at once, hoping that the application will use the extra data.
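
A simple sequential detector might look like the sketch below. The policy shown (double the prefetch window on each sequential read, reset it on a random one, cap it at 32 blocks) is an assumption chosen for illustration, not the behavior of any particular kernel, and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WINDOW 32  /* assumed cap on blocks prefetched at once */

/* Per-open-file prefetch state. Initialize to { -1, 0 }. */
struct readahead_state {
    long next_expected;  /* block that would continue a sequential run */
    unsigned window;     /* extra blocks to fetch on the current read */
};

/*
 * Called on each read of `block`. Returns how many blocks beyond
 * `block` the system should bring in speculatively.
 */
unsigned on_read(struct readahead_state *ra, long block)
{
    assert(ra != NULL);
    if (block == ra->next_expected) {
        /* Sequential: grow the window, up to the cap. */
        ra->window = ra->window ? ra->window * 2 : 1;
        if (ra->window > MAX_WINDOW)
            ra->window = MAX_WINDOW;
    } else {
        /* Random access: prefetching would waste disk bandwidth. */
        ra->window = 0;
    }
    ra->next_expected = block + 1;
    return ra->window;
}
```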

Data Copies in the File System

Last week we talked about file system APIs. At one point, we talked about the alignment of application file read/write buffers. In modern C/Unix APIs, the buffer can be any place in memory, but in older systems, buffers always had to be aligned to the size of the system memory page.

In Unix systems, it is also true that disk I/Os are done in multiples of a page size, and the I/O is also done to page boundaries. So how are the API and the I/O system reconciled? Through the file system buffer cache. The buffer cache serves two important purposes: the first is alignment, and the second
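
The alignment role can be sketched as follows. Disk I/O fills whole, page-aligned cache pages, and a read at an arbitrary offset into an arbitrary buffer is satisfied by copying out of those pages. The four-page cache, the 4 KB page size, and the function name are assumptions for the example, and the pages are assumed to have been read from disk already.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u  /* assumed page size */
#define TOY_NPAGES    4      /* hypothetical, tiny cache covering one file */

/* Stand-in for the buffer cache: page-aligned data already read from disk. */
static unsigned char page_cache[TOY_NPAGES][TOY_PAGE_SIZE];

/*
 * Copy up to `len` bytes starting at file offset `offset` into `buf`.
 * Neither `offset`, `len`, nor `buf` needs any particular alignment.
 * Returns the number of bytes copied (short at end of the cached file).
 */
size_t cached_read(size_t offset, void *buf, size_t len)
{
    unsigned char *out = buf;
    size_t done = 0;

    assert(buf != NULL);
    while (done < len && offset < (size_t)TOY_NPAGES * TOY_PAGE_SIZE) {
        size_t page = offset / TOY_PAGE_SIZE;     /* which cached page */
        size_t in_page = offset % TOY_PAGE_SIZE;  /* where inside it */
        size_t chunk = TOY_PAGE_SIZE - in_page;   /* bytes left in page */

        if (chunk > len - done)
            chunk = len - done;
        memcpy(out + done, &page_cache[page][in_page], chunk);
        done += chunk;
        offset += chunk;
    }
    return done;
}
```

The memcpy is the data copy this section's title refers to: every byte moves once from the cache page to the application's buffer, in addition to the transfer from disk into the cache.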

Devices

We have entirely glossed over the issue of how to achieve good performance. Reading data randomly gives very poor performance, while reading it sequentially does very well.

Operating systems must worry about rotational latency, head switch time, and the zoned structure of the disk drive.

Other Types of File Systems

Most of the disk layout information we talked about applies only to FFS and ext2, but there are many other on-disk structures:

There are dozens of others, with many interesting characteristics. We could easily do an entire term on file systems alone.

Homework

This is the last homework!!! After this week, your homework responsibility is your term project.
  1. Imagine that the bitmap showing which disk blocks are free has become unreliable, so you decide to rebuild it by walking through all of the in-use inodes. Assume on-disk inodes are 128 bytes, and the file system was initialized to hold a maximum of one million files. Your disk has a transfer rate of 40 megabytes/second, and can execute a random operation in 10 milliseconds. It holds 100 GB.
    1. If all of the inodes are stored in one contiguous chunk of disk, how long will it take to read them all?
    2. If 100,000 of the files each use a single indirect block that is randomly placed on the disk, how long will it take to work through all of them?
    3. Using the same type of disk, how long would it take to read every 4KB block on the disk in random order?
  2. Do you back up the data on your computer? If so, how?
    1. Execute a data backup. This backup may be of any type, using any tool, and may be just your user files or may be the entire system. You may back up to CD/DVD, external disk, tape, or over the network to a server.
    2. Report how long it took to perform your backup, and how much data was transferred.
  3. Report the status of your term project.

Next Lecture

Next lecture:

Lecture 11, June 19: Input/Output Systems

Readings for next week and followup for this week:

Follow-up:

Additional Information