This lecture needs some serious work. If I'm going to talk about performance, have to include a little about disk architecture to start with. Need to run through number of I/Os to actually read a file from disk.
File forks were originally developed for the Macintosh, to hold file icons. NTFS has a similar feature called file streams. These forks really blur the boundary between system metadata and user data, and typically are not preserved when files move between systems.
Metadata is such an important concept that the IEEE has conferences on the topic.
In FFS, naming
The concept of the inode has become so important that it is effectively shorthand for the separation of naming and storage metadata. One reason is that the inode structure appears directly in the kernel. Once Unix started to support different types of file systems, all of them were required to use the inode structure.
Below is the Linux kernel inode struct.
struct inode { struct hlist_node i_hash; struct list_head i_list; struct list_head i_sb_list; struct list_head i_dentry; unsigned long i_ino; atomic_t i_count; umode_t i_mode; unsigned int i_nlink; uid_t i_uid; gid_t i_gid; dev_t i_rdev; loff_t i_size; struct timespec i_atime; struct timespec i_mtime; struct timespec i_ctime; unsigned int i_blkbits; unsigned long i_version; blkcnt_t i_blocks; unsigned short i_bytes; spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ struct mutex i_mutex; struct rw_semaphore i_alloc_sem; struct inode_operations *i_op; const struct file_operations *i_fop; /* former ->i_op->default_file_ops */ struct super_block *i_sb; struct file_lock *i_flock; struct address_space *i_mapping; struct address_space i_data; #ifdef CONFIG_QUOTA struct dquot *i_dquot[MAXQUOTAS]; #endif struct list_head i_devices; union { struct pipe_inode_info *i_pipe; struct block_device *i_bdev; struct cdev *i_cdev; }; int i_cindex; __u32 i_generation; #ifdef CONFIG_DNOTIFY unsigned long i_dnotify_mask; /* Directory notify events */ struct dnotify_struct *i_dnotify; /* for directory notifications */ #endif #ifdef CONFIG_INOTIFY struct list_head inotify_watches; /* watches on this inode */ struct mutex inotify_mutex; /* protects the watches list */ #endif unsigned long i_state; unsigned long dirtied_when; /* jiffies of first dirtying */ unsigned int i_flags; atomic_t i_writecount; #ifdef CONFIG_SECURITY void *i_security; #endif void *i_private; /* fs or device private pointer */ #ifdef __NEED_I_SIZE_ORDERED seqcount_t i_size_seqcount; #endif };
The inode operations that are available manipulate the on-disk file attributes, including the namespace (directory structure).
struct inode_operations { int (*create) (struct inode *,struct dentry *,int, struct nameidata *); struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *); int (*link) (struct dentry *,struct inode *,struct dentry *); int (*unlink) (struct inode *,struct dentry *); int (*symlink) (struct inode *,struct dentry *,const char *); int (*mkdir) (struct inode *,struct dentry *,int); int (*rmdir) (struct inode *,struct dentry *); int (*mknod) (struct inode *,struct dentry *,int,dev_t); int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *); int (*readlink) (struct dentry *, char __user *,int); void * (*follow_link) (struct dentry *, struct nameidata *); void (*put_link) (struct dentry *, struct nameidata *, void *); void (*truncate) (struct inode *); int (*permission許可 ) (struct inode *, int, struct nameidata *); int (*setattr) (struct dentry *, struct iattr *); int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); };
The file operations switch inside the kernel is used to actually implement the data read/write calls a program makes on an open file.
struct file_operations { struct module *owner; loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id); int (*release) (struct inode *, struct file *); int (*fsync) (struct file *, struct dentry *, int datasync); int (*aio_fsync) (struct kiocb *, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *); ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned lon g, unsigned long); int (*check_flags)(int); int (*dir_notify)(struct file *filp, unsigned long arg); int (*flock) (struct file *, int, struct file_lock *); ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned i nt); };
Most
Actually, the integration isn't complete, but the same data structures and algorithms are involved. In a Linux system, when memory runs short, a single page frame replacement algorithm (PFRA) is run.
It's also important to note that file pages that are only accessed
using read() and write() are not normally mapped into the
The basic
The above
The inodes are held in several areas of the disk specifically set aside to hold them. The number of inodes, and consequently the maximum number of files in the system, is fixed when the file system is created. The intent of putting inodes in several places on disk, rather than all at the beginning of the disk, is for the inode, the corresponding directory entry, and the file data blocks to all be near each other, reducing seek time and improving performance.
Blocks are allocated using a bitmap which represents the blocks
that are in use and those that are not, much as we showed for memory
An inode includes a pointer to the first ten or so blocks of the
file. If the file is short, then, the
When purchasing a computer system, the performance of the I/O may be
specified in terms of either bandwidth or transaction
performance, measured in megabytes/second for large transfers
or operations/second, respectively. In most cases, those are
averages, but some systems support real-time I/O, as well,
providing specific guarantees for each
An additional concern in some cases can be the CPU overhead of I/O operations.
We have entirely glossed over the issue of how to achieve good performance. Reading data randomly gives very poor performance, while reading it sequentially does very well.
Obviously, this is less of an issue with flash devices, but it does play some part.
All modern
In Unix systems, it is also true that disk I/Os are done in multiples of a page size, and the I/O is also done to page boundaries. So how are the API and the I/O system reconciled? Through the file system buffer cache. The buffer cache serves two important purposes: the first is alignment, and the second
第11回 6月19日 入出力
Lecture 11, July 1: Input/Output Systems
Readings for next week and followup for this week: