Design
Objective
Create a user space control application and filesystem layer on top of NFS that provides disconnected operation, two-way transparent file synchronization, and simple automatic conflict resolution. See QuestionsAndAnswers for a quick rundown of more of the esoteric functions of TsumuFS.
Goals
Non-goals
Background
There are many cases where offline access to your home directory is advantageous. Some of these use cases are:
See the UserProfiles page for more information about these. In each use case, there are several issues that crop up:
Related Projects
Overview
TsumuFS is designed to eliminate the following problems:
To accomplish these goals, TsumuFS is implemented as a combination of a user-space control program and a disconnected caching filesystem. The user-space daemon will provide a method for interacting with the non-POSIX portions of the filesystem (cache synchronization control, disconnected NFS notification, etc.). The filesystem portion will be implemented as a set of threads:
Notifications will be handled through a file or device on the filesystem (either in /dev as a device, or in /sys as a file that can be read from and used to communicate with). Since the control application communicates easily over devices/files, it may also be used as a daemon to enforce hard policies upon the filesystem. In this use case, TsumuFS could conceivably be used as a read-only backend for a cluster of web servers.
During disconnected operation, the main thread works entirely from cache, making all read and write calls there first, while all changes are written to a synchronization log for playback later. When the NFS server is available, however, read and stat calls are done on the fileserver first and the cache is updated immediately by the main thread. As in disconnected mode, all changes to files are written to the local cache first, and then replayed back to the NFS server by the synchronization thread.
TsumuFS switches to the disconnected state when the synchronization thread makes a filesystem call and the call signals an error. At this point, the entry currently being processed is placed back into the log.
During disconnected operation, the mount thread checks every n minutes for the NFS server's availability and attempts to remount the backend. Once a successful mount has been performed, this thread fires the connected event again and goes back to sleep until the disconnected event fires.
In connected mode, each time a file is accessed via any system call, TsumuFS will automatically cache a copy of it to the local touch cache, which is separate from the persistent cache. The touch cache is unique in that it has a fixed size that it may not grow past. If the touch cache is full, TsumuFS will remove the least recently accessed files until it has enough space to accommodate the newly added file.
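The touch cache eviction rule can be illustrated with a minimal Python sketch. The class name TouchCache and its methods are hypothetical, sizes are tracked in memory only, and no real files are copied; it only models the "evict the least recently accessed file until the new one fits" policy, including the case of a file larger than the whole cache.

```python
from collections import OrderedDict


class TouchCache:
    """Hypothetical model of a fixed-size cache with oldest-access eviction."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._files = OrderedDict()  # path -> size, oldest access first

    def touch(self, path: str, size: int) -> bool:
        """Record an access to path. Returns False if the file cannot be cached."""
        # Drop any stale entry so the file moves to the "newest" end.
        if path in self._files:
            self.used_bytes -= self._files.pop(path)
        if size > self.max_bytes:
            # Larger than the whole touch cache: do not cache it.
            # The caller would notify the user via the control application.
            return False
        # Evict the least recently accessed files until the new one fits.
        while self.used_bytes + size > self.max_bytes:
            _, evicted_size = self._files.popitem(last=False)
            self.used_bytes -= evicted_size
        self._files[path] = size
        self.used_bytes += size
        return True

    def paths(self):
        """Cached paths, ordered from oldest to newest access."""
        return list(self._files)
```

An OrderedDict keeps the access order and the size accounting in one structure; a real implementation would presumably rely on file access times on disk instead.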
If the file is larger than the touch cache, TsumuFS will not cache it, and will notify the user via the control application.
If the sync thread detects a conflict, it will first attempt an automatic merge in the local cache based upon the UNIX patch algorithm. Only upon a successful merge will it integrate the data back to the fileserver. If the automatic merge fails, the sync thread will rename the local cached copy to ${FILE}.yours, copy the remote version to ${FILE}.theirs, and signal the user via the control application that a conflict has occurred.
Synchronization log entries are only removed upon a successful sync, and as such the log doubles as a journal to ensure proper reads and writes to the underlying NFS filesystem.
All three threads can be visualized using these three flowcharts:
Detailed Design
Data Locations
Quick Reference
$MOUNT_DEST is the mount source and destination with slashes replaced by dashes and separated by a semicolon. Eg: fileserver1:/home/axel and /home/axel become fileserver1:-home-axel;-home-axel.
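As a rough illustration of the key derivation, here is a small Python sketch. The function names mount_key and mount_data_dir are hypothetical, and the default base directory is assumed to be /var/lib/tsumufs.

```python
import posixpath


def mount_key(source: str, dest: str) -> str:
    """Replace slashes with dashes in both parts and join them with ';'."""
    return source.replace("/", "-") + ";" + dest.replace("/", "-")


def mount_data_dir(source: str, dest: str,
                   basedir: str = "/var/lib/tsumufs") -> str:
    """Directory holding all data for one mount (basedir is an assumed default)."""
    return posixpath.join(basedir, mount_key(source, dest))


print(mount_key("fileserver1:/home/axel", "/home/axel"))
# fileserver1:-home-axel;-home-axel
```

Note that this mapping is not injective: /home/axel and /home-axel both become -home-axel, so two different mounts could in principle collide on the same key.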
General File/Directory Information
By default, TsumuFS will store all of its working files in /var/lib/tsumufs to follow the FHS. An additional mount option, tsumufs_basedir, will allow users to override this setting by specifying a fully qualified path.
Since TsumuFS is a filesystem and is expected to behave as such, it must provide a way to mount multiple times and still map each cache repository to each mount point uniquely, even after an unmount. As the only differentiating information available at mount time is the source and destination mount points, these will be combined to form the unique key used to differentiate data relating to each mount point in the tsumufs_basedir.
To generate this unique key, the mount source and destination are altered by replacing all slashes with dashes (eg: /home/axel becomes -home-axel), separating them by a semicolon, and appending this to tsumufs_basedir. Thus, all of the data relating to the NFS export fileserver1:/home/axel mounted at /home/axel will be stored in /var/lib/tsumufs/fileserver1:-home-axel;-home-axel/ by default. If this directory does not exist initially, it will be created as root:root with an 0700 mode, and a simple cachespec file will be copied into place as well.
,,There has to be a better way of calculating this unique key and still keeping it human-readable. --jtg,,
Cache Specification File Details
Cached files are stored in the cache directory by copying each verbatim from the NFS mount, prefixing the TsumuFS mount data directory to the path of the file in the NFS mount. Eg: /home/axel/test.txt is cached to /var/lib/tsumufs/fileserver:-home-axel;-home-axel/test.txt, as the root of the NFS mount is /home/axel. Files will be copied into the cache according to the policy laid out in the cachespec files for each TsumuFS mount point.
Each cachespec file will be stored as plaintext in /var/lib/tsumufs/fileserver:-home-axel;-home-axel/cachespec using a simple layout similar to the perforce client description format:
#
# This is an example tsumufs mount description file for an example
# mount of axel's home directory. In this case, the home
# directory is being mounted from mtvhome1.mtv:/home/axel to
# /home/axel and the cache point is in
# /var/lib/tsumufs/mtvhome1.mtv:-home-axel;-home-axel/cache
#
# A comment extends from the '#' character
# to the end of the line.
#
# Each line is of the format:
#
# [+-]<path>
#
# + means to proactively cache the path on the same line.
# - means to never cache the files in the path specified.
#
# Paths are recursive if they end in an ellipsis (...), or if the
# last element in the path is a directory. If the last element is
# a file, only that file will be proactively included or excluded
# in the cache.
#
# Paths are also relative to the root of the exported NFS
# filesystem. Thus "/Documents" refers to
# "/home/axel/Documents". Relative pathnames
# will be treated as errors, so lines like
# "+Documents/..." are invalid.
#
# CacheDescription lines are applied in order from top to bottom,
# with the ones below overriding the ones above.
#
# The maximum size of the touch cache. If this limit is exceeded
# during a sync, tsumufs will remove the files that have the oldest
# access times first. If, for any reason there is a single file
# that exceeds this size, it will not be placed in the cache and
# the user will be signaled via a dbus notification. Note that
# this size limit does not include the persistent cache!
TouchCacheMaxSize: 40GB
# Description of what to cache.
CacheDescription:
+/Documents/... # Proactively cache the documents directory and
# everything in it explicitly...
-/Documents/password-list.txt
# ...except for the password-list.txt file. This
# should never be placed in the cache.
+/bin # Same idea as above, except since bin is a
# directory, it implicitly means everything
# inside it as well.
-/.mozilla/firefox/.../Cache
# Prevent all firefox caches for all profiles
# from being stored in the local cache since
# they generate a ton of disk churn and can
# cause lots of conflicts.
-/.gnupg/secring.gpg
# Prevent only the private keys from being stored
# in the disk cache forever.
The user may edit this file at any time, but must signal TsumuFS that the file has changed by sending a DBUS notification via the control application. Once changes are complete and the user has notified TsumuFS, it will rebuild the cache based upon the new settings. If files that were previously included in the cache are now excluded, TsumuFS must immediately remove the specified files from the cache. Likewise, if any files are listed as being proactively cached, TsumuFS must begin copying the specified files to cache. Note that TsumuFS must erase files from the cache before it may begin copying files to the cache, so that if any security issues arise from having these files on disk, they can be removed expediently.
Threading Model
Quick Reference
At least three threads are expected to be running at all times. These threads are:
Main TsumuFS thread: Responsible for servicing all filesystem calls such as read, write and stat.
Synchronization thread: Responsible for reading from the synchronization log, propagating changes back to the NFS server, merging changes into files automatically, and detecting merge conflicts.
NFS mount thread: Verifies that the NFS mount has in fact been successfully mounted and is not flapping. Note that this thread is not responsible for detecting when the NFS mount has gone away -- only when it has returned.
Theory of Operation
Upon the initial mount of a TsumuFS filesystem (eg: `mount -t tsumufs mtvhome1.mtv:/home/axel /home/axel`), the following occurs:
At this point, mount.tsumufs spins off the NFS mount and sync threads. The sync thread does essentially the following:
SyncQueue = SyncQueue.new()
SyncQueue.load()
SyncQueue.validate()
while MountedEvent.isSet() do
  while not NFSConnectedEvent.isSet() do
    # Don't do any syncing until we have a valid NFS connection.
    NFSConnectedEvent.wait()
  end
  SyncQueueLock.acquire()
  item = SyncQueue.get_oldest_item() # excludes conflicted changes
  SyncQueueLock.release()
  # Verify that what the synclog contains is actually what is on the
  # file server.
  try
    for change in item.pre_change_contents do
      if NFSMount.getFileRegion(item.filename, change.start, change.end) != change.data
        throw ConflictException(item)
      end
    end
  catch IOErrorException e
    # The item stays in the queue; retry once the NFS mount returns.
    NFSConnectedEvent.unset()
    continue
  catch ConflictException e
    # Do something here to attempt to merge the data anyway for text
    # files, if possible. Failing that, mark the item as a conflict in
    # the synclog, and notify the user.
    if e.item.fileType != e.item.TEXT
      item.markConflict()
      notifyUser()
      # Never copy a conflicted item on top of the server's version.
      continue
    end
  end
  # If we don't have any conflicts, we can proceed here -- the
  # original hasn't changed since we synced it to the cache last. Just
  # copy over the whole cache file on top of it.
  try
    NFSMount.lockFile(item.filename)
    item.copyToNFS()
    NFSMount.unlockFile(item.filename)
    SyncQueueLock.acquire()
    SyncQueue.remove_topmost_item()
    SyncQueue.flush_to_disk()
    SyncQueueLock.release()
  catch IOErrorException e
    NFSConnectedEvent.unset()
  end
end
# This chunk is reached upon request to have TsumuFS unmounted.
SyncQueueLock.acquire()
SyncQueue.flush_to_disk()
SyncQueueLock.release()
At the same time, the nfs mount thread starts up and does the following:
while MountedEvent.isSet() do
  while not NFSConnectedEvent.isSet() do
    if ping_nfs_server_ok() then
      if nfs_check_ok() then
        if mount_nfs(to = "/var/lib/tsumufs/-mount-point/nfs") then
          NFSConnectedEvent.set()
        end
      end
    end
    sleep for 5 minutes
  end
  while NFSConnectedEvent.isSet() do
    pass
  end
end
Finally, mount.tsumufs enters the main event loop to service requests from the filesystem. During normal connected operation, the main TsumuFS thread will perform all read-related calls on the NFS mount, and proactively add and update entries in the cache. All write-related calls will be done to local disk and recorded to the synclog for propagation back to the NFS mount in another thread.
Theory of Operation: The Synclog and Syncing to NFS
During a write-related call, the contents of the file from just before the write are stored in the synchronization log, and the new data is written directly to disk in the cache. This is an important distinction, as the synclog only contains old data, not the newest changes.
Synchronization back to the file server is done by reading the oldest entries at the top of the synclog, verifying that the original data is still the same on the backend file server, performing the actual change to the file server, and finally popping the entry off of the synclog. In the event that any one of the read or write calls to the file server fails, TsumuFS will stop the synchronization thread and work entirely from the local cache (disconnected mode) until the file server has become available again.
Ultimately the synclog will contain a snapshot of what the file was just before the time of the write -- only the ultimate result of the changes is recorded, instead of every single write. For instance, if multiple changes to the same region of a file are made, such as in the below figure, the cache will only contain the newest of the changes, while the synclog will only contain the original data that was stored in the changed region before the changes were made.
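The replay step (verify the original data, apply the change, then drop the entry) might look roughly like the following Python sketch. Region, has_conflict and replay are hypothetical names, and the file server copy is modeled as an in-memory bytearray rather than a real NFS file.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    start: int   # byte offset of the first changed byte
    data: bytes  # what the file held there before the local write

    @property
    def end(self) -> int:
        return self.start + len(self.data)


def has_conflict(regions: List[Region],
                 read_remote: Callable[[int, int], bytes]) -> bool:
    """True if any logged region no longer matches the server's copy."""
    return any(read_remote(r.start, r.end) != r.data for r in regions)


def replay(regions: List[Region], cached: bytes, remote: bytearray) -> bool:
    """Copy the cached file over the remote one unless a conflict is found."""
    def read_remote(start: int, end: int) -> bytes:
        return bytes(remote[start:end])

    if has_conflict(regions, read_remote):
        return False       # leave the entry in the synclog and flag a conflict
    remote[:] = cached     # original data untouched: safe to overwrite
    return True            # the caller may now pop the entry off the synclog
```

Because only the pre-change bytes are compared, edits made on the server outside the logged regions would not be detected by this sketch.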
Theory of Operation: Syncing to Cache
Synclog Data Structure
Caveats
Simpler solutions such as rsync or unison work, but they either work in painful ways or are far too simple to provide the same disconnected availability that TsumuFS can. In rsync's case, the syncing methodology is left entirely up to the user -- rsync was designed originally for differential copying of files that change infrequently, such as mirroring an FTP site. As such, it only handles one-way syncing and either clobbers change conflicts with the original source, or ignores them altogether and leaves conflict detection and resolution entirely up to the user running it. Unfortunately, this design leaves too much up to the user and provides next to no actual disconnected operation. The user is left to either duplicate their home directory in its entirety on their local machine (far from optimal), or write complicated lists of directories and files to include and exclude in the rsync.
*Figure 1*: Flow of Changes to Synclog and Cache
Note that if two changes overlap they are merged together in the synclog into a contiguous region. The cache, however, allows newer changes to override previous ones, just like how any other filesystem functions. In this way, the synclog is essentially piecing together the original file change by change. This is specifically designed to help detect file conflicts on the backend NFS connection, as the NFS server does not provide any facility for notification of filesystem changes.
Note that in practice, however, lines are not the addressing scheme used for the regions stored in the synclog -- byte offsets into the file are used instead. Line numbers in the figures are used for illustrative purposes only.
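The merging of overlapping regions can be sketched in Python as below. SyncLogEntry and record are hypothetical names, and the rule that already-logged bytes win over newly reported ones is an assumption that follows from the synclog holding only the original data: a later write to the same bytes would otherwise log cache contents, not the original.

```python
class SyncLogEntry:
    """Hypothetical per-file synclog entry holding original byte regions."""

    def __init__(self):
        self.regions = []  # sorted, non-overlapping (start_offset, bytes)

    def record(self, start: int, old_data: bytes) -> None:
        """Log what the file held at [start, start+len) just before a write."""
        if not old_data:
            return
        end = start + len(old_data)
        merged = bytearray(old_data)
        kept = []
        for r_start, r_data in self.regions:
            r_end = r_start + len(r_data)
            if r_end <= start or r_start >= end:
                kept.append((r_start, r_data))  # no overlap: keep unchanged
                continue
            # Grow the merged region so it covers the older one as well.
            if r_start < start:
                merged[0:0] = bytes(start - r_start)
                start = r_start
            if r_end > end:
                merged.extend(bytes(r_end - end))
                end = r_end
            # Bytes logged earlier are the true originals, so they win.
            offset = r_start - start
            merged[offset:offset + len(r_data)] = r_data
        kept.append((start, bytes(merged)))
        self.regions = sorted(kept)
```

For example, with an original file of 0123456789, logging a write over bytes 2 to 4 and then a second write over bytes 4 to 6 leaves a single region starting at offset 2 that holds 23456, the original contents of the combined span.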
*Figure 2*: Flow of Multiple Overlapping Changes
In this way, conflict detection becomes a simple matter of verifying whether the underlying original data in the file has changed on the NFS backend. If it has, then the change generates a conflict which will require user intervention before the file can be successfully synced back. File metadata such as timestamps and permissions can be handled the same way, albeit with much less human interaction.
An additional command line client and desktop notification applet will be available to manage which portions of the filesystem the user would like to proactively fetch and keep in the local disk cache, which parts should never be cached even if fetched, and which files to remove entirely from the cache, and to manage the state. By default, the filesystem will place all accessed files into a cache, but will not proactively put anything into the cache.