File forks were originally developed for the Macintosh, to hold file icons. NTFS has a similar feature called file streams. These forks really blur the boundary between system metadata and user data, and typically are not preserved when files move between systems.
Metadata is such an important concept that the IEEE has conferences on the topic.
In FFS, naming information is kept in the directory structure, but the other critical metadata we just discussed is held in the file's inode. (We'll come back to where the inode is stored a little later in the lecture, but it's not in the directory file.)
The concept of the inode has become so important that it is effectively shorthand for the separation of naming and storage metadata. One reason is that the inode structure appears directly in the kernel. Once Unix started to support different types of file systems, all of them were required to use the inode structure.
Below is the Linux kernel inode struct.
struct inode { struct hlist_node i_hash; struct list_head i_list; struct list_head i_sb_list; struct list_head i_dentry; unsigned long i_ino; atomic_t i_count; umode_t i_mode; unsigned int i_nlink; uid_t i_uid; gid_t i_gid; dev_t i_rdev; loff_t i_size; struct timespec i_atime; struct timespec i_mtime; struct timespec i_ctime; unsigned int i_blkbits; unsigned long i_version; blkcnt_t i_blocks; unsigned short i_bytes; spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ struct mutex i_mutex; struct rw_semaphore i_alloc_sem; struct inode_operations *i_op; const struct file_operations *i_fop; /* former ->i_op->default_file_ops */ struct super_block *i_sb; struct file_lock *i_flock; struct address_space *i_mapping; struct address_space i_data; #ifdef CONFIG_QUOTA struct dquot *i_dquot[MAXQUOTAS]; #endif struct list_head i_devices; union { struct pipe_inode_info *i_pipe; struct block_device *i_bdev; struct cdev *i_cdev; }; int i_cindex; __u32 i_generation; #ifdef CONFIG_DNOTIFY unsigned long i_dnotify_mask; /* Directory notify events */ struct dnotify_struct *i_dnotify; /* for directory notifications */ #endif #ifdef CONFIG_INOTIFY struct list_head inotify_watches; /* watches on this inode */ struct mutex inotify_mutex; /* protects the watches list */ #endif unsigned long i_state; unsigned long dirtied_when; /* jiffies of first dirtying */ unsigned int i_flags; atomic_t i_writecount; #ifdef CONFIG_SECURITY void *i_security; #endif void *i_private; /* fs or device private pointer */ #ifdef __NEED_I_SIZE_ORDERED seqcount_t i_size_seqcount; #endif };
The inode operations that are available manipulate the on-disk file attributes, including the namespace (directory structure).
struct inode_operations { int (*create) (struct inode *,struct dentry *,int, struct nameidata *); struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *); int (*link) (struct dentry *,struct inode *,struct dentry *); int (*unlink) (struct inode *,struct dentry *); int (*symlink) (struct inode *,struct dentry *,const char *); int (*mkdir) (struct inode *,struct dentry *,int); int (*rmdir) (struct inode *,struct dentry *); int (*mknod) (struct inode *,struct dentry *,int,dev_t); int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *); int (*readlink) (struct dentry *, char __user *,int); void * (*follow_link) (struct dentry *, struct nameidata *); void (*put_link) (struct dentry *, struct nameidata *, void *); void (*truncate) (struct inode *); int (*permission) (struct inode *, int, struct nameidata *); int (*setattr) (struct dentry *, struct iattr *); int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); };
The file operations switch inside the kernel is used to actually implement the data read/write calls a program makes on an open file.
When this object-oriented method was originally developed in SunOS/Solaris, it was called the vnode or vfs approach, for its benefit in virtualizing the file system and allowing many different FS implementations.
struct file_operations { struct module *owner; loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id); int (*release) (struct inode *, struct file *); int (*fsync) (struct file *, struct dentry *, int datasync); int (*aio_fsync) (struct kiocb *, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *); ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned lon g, unsigned long); int (*check_flags)(int); int (*dir_notify)(struct file *filp, unsigned long arg); int (*flock) (struct file *, int, struct file_lock *); ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned i nt); };
Most operating systems today combine the operation of the file system buffer cache with the virtual memory system.
Actually, the integration isn't complete, but the same data structures and algorithms are involved. In a Linux system, when memory runs short, a single page frame replacement algorithm (PFRA) is run.
It's also important to note that file pages that are only accessed using read() and write() are not normally mapped into the process address space, though files that are accessed using mmap() are.
Another issue of some importance to operation combines layout and kernel implementation issues.
The basic information about the file is held in a structure known as the superblock. The superblock says what type of file system it is, where to find the root directory, where to find the information about free disk blocks, where to find the inodes, etc.
The above information is relatively common across different types of file systems (of which there are many). The principles of file systems are also fairly common, but the rest of this section deals primarily with the Berkeley FFS, which was heavily imitated for Linux's ext2 file system.
The inodes are held in several areas of the disk specifically set aside to hold them. The number of inodes, and consequently the maximum number of files in the system, is fixed when the file system is created. The intent of putting inodes in several places on disk, rather than all at the beginning of the disk, is for the inode, the corresponding directory entry, and the file data blocks to all be near each other, reducing seek time and improving performance.
Blocks are allocated using a bitmap which represents the blocks that are in use and those that are not, much as we showed for memory management.
An inode includes a pointer to the first ten or so blocks of the file. If the file is short, then, the operating system can retrieve all of the layout information fairly quickly. When the file is larger than ten blocks, an extra pointer that points to an indirect block is used.
When purchasing a computer system, the performance of the I/O may be specified in terms of either bandwidth or transaction performance, measured in megabytes/second for large transfers or operations/second, respectively. In most cases, those are averages, but some systems support real-time I/O, as well, providing specific guarantees for each process.
An additional concern in some cases can be the CPU overhead of I/O operations.
We have entirely glossed over the issue of how to achieve good performance. Reading data randomly gives very poor performance, while reading it sequentially does very well.
Operating systems must worry about rotational latency, head switch time, and the zoned structure of the disk drive.
Obviously, this is less of an issue with flash devices, but it does play some part.
All modern operating systems do write caching, where a data written to a file is not actually committed to disk when the application write() completes. In Linux, a special kernel process called pdflush is charged with this job.
In Unix systems, it is also true that disk I/Os are done in multiples of a page size, and the I/O is also done to page boundaries. So how are the API and the I/O system reconciled? Through the file system buffer cache. The buffer cache serves two important purposes: the first is alignment, and the second
Your project is not complete until I have received a written report on it, in PDF format (日本語はOK). Your report should probably be 4-6 pages, depending on your project, writing format, and type and number of graphs, etc.
Next lecture:
June 25: 第12回 Lecture 12: I/O Systems
Readings for next week and followup for this week: