1999-06-24

Concurrent Versioning File System (CVFS)

[This is a preliminary description of the idea based on notes I made on 1999-06-24. If anyone knows of other work along these lines, or wants to participate in the development of this, let me know.]

CVFS is a redirecting Linux file system that doesn't manage physical storage itself. Instead, it is an indirection layer that reinterprets the (specially organized) contents of another directory as a repository that is hierarchical both in structure (directories and files) and in lineage/history/derivation (branches and versions).

It may be possible once CVFS is created to create an emulation layer that appears to users as traditional CVS, although I think the CVFS interface itself would be superior.

Since CVFS allows users to have limbo files at any leaf node, it is possible that the user might start a branch and then take another tack, leaving the limbo files to become stale. Facilities for the detection of stale limbo files and the creation of a report or (semi-)automatic reaping/purging would be very useful.
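
The stale-limbo report could start as an ordinary user-level scan, since the repository is a regular directory tree. The following is only a sketch: it assumes limbo files live under directories named LIMBO (as in the example repository layout further down), and the function name and the age threshold are made up for illustration.

```python
import os
import time


def stale_limbo_files(repo_root, max_age_days, now=None):
    """Return limbo files under repo_root not modified for max_age_days.

    Assumption: any file below a directory named LIMBO is a limbo file.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        # Only look at paths that pass through a LIMBO directory.
        rel_parts = os.path.relpath(dirpath, repo_root).split(os.sep)
        if "LIMBO" not in rel_parts:
            continue
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```

A report could simply print the returned paths; whether to purge them automatically is a policy decision left open here.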

Analogies with file system structure-oriented commands:

Structural and historical analogies.

    Structural                          Historical
    Command  Description                Command  Description
    cd       Change Directory           cv       Change Version
             (default = home)                    (default = leaf or next fork)
    ls       List (Files)               lv       List Versions
    pwd      Print Working Directory    pwv      Print Working Version
    mkdir    Make Directory             mkver    Make Version (branch)
    rmdir    Remove Directory           rmver    Remove Version (prune branch)

CVFS handles the standard file operations specially. For example, touch, cp, rm, and others are all intercepted to modify the user's limbo area rather than operate directly on the versioned files. This is transparent to the user. rm still appears to remove the file, in that ls will not show it. But, the changes don't make it to the repository until they are committed. Once a file is in the repository, it cannot be removed without an explicit pruning operation.
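
To make the interception idea concrete, here is a toy in-memory model of one user's view of one leaf version. It is not a file system, just a sketch of the intended semantics; the class and method names are invented for illustration.

```python
class LimboView:
    """One user's view: committed versions plus that user's limbo changes."""

    _REMOVED = object()  # marks a pending removal in limbo

    def __init__(self, committed):
        # versions[-1] is the leaf; earlier entries are never modified.
        self.versions = [dict(committed)]
        self.limbo = {}  # name -> pending contents, or _REMOVED

    def ls(self):
        names = set(self.versions[-1]) | set(self.limbo)
        return sorted(n for n in names if self.limbo.get(n) is not self._REMOVED)

    def read(self, name):
        if name not in self.ls():
            raise FileNotFoundError(name)
        return self.limbo.get(name, self.versions[-1].get(name))

    def write(self, name, data):
        # touch, cp, editors, etc. all land in limbo, never in the repository.
        self.limbo[name] = data

    def rm(self, name):
        if name not in self.ls():
            raise FileNotFoundError(name)
        if name in self.versions[-1]:
            self.limbo[name] = self._REMOVED  # hide it; repository untouched
        else:
            del self.limbo[name]              # it only ever existed in limbo

    def commit(self):
        # A commit creates a new leaf; the previous version stays in place.
        new = dict(self.versions[-1])
        for name, data in self.limbo.items():
            if data is self._REMOVED:
                new.pop(name, None)
            else:
                new[name] = data
        self.versions.append(new)
        self.limbo.clear()
```

In this model an rm followed by ls no longer shows the file, yet versions[-1] still holds it until commit, and even after the commit the older version still contains it.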

CVFS only allows the modification of directories or files at leaf versions. So, a branch must be created first before attempting to make any changes at an interior node.

In software development, derived files (object files and others) are usually not managed by the version management system. Instead, each user has a sandbox which contains these files, and they are not committed to the repository. In CVFS these are limbo files, and as other changes are made to the parent directory and committed (such as the addition of another source code file), the limbo files are promoted to the new version, since it is not permitted to have limbo files at any version other than a leaf. This retains the usual semantics of a sandbox. In fact, CVFS limbo files are precisely equivalent to the ability for each user to have his own sandbox for each branch of interest. CVFS permits only one sandbox per user per branch, but in practice this is probably acceptable: a second sandbox on the same branch in CVS or another version control system would either anticipate branching (handled differently in CVFS) or be just plain redundant (dangerous).

A project repository can be mounted in a single well-known location that all users access by the same name. So, instead of having sandboxes in home directories or elsewhere, everyone uses the same place. CVFS takes care that each user's view of the structure follows their current working version, and shows the limbo files associated with their user id. This again simplifies the user's view to the point where it works almost exactly like any other shared directory, except that no user can see another user's limbo files.

The mount command for CVFS gives the repository location instead of a device location. The repository can be supported by (almost) any other type of underlying file system, and this is transparent to CVFS, because it accesses its base storage like a user program.

[[ TODO: Talk more about branching and merging and versioning and labeling. ]]

lv -lR lists along the version hierarchy, showing presence of limbo files.

lv shows files, noting which are not in the repository, changed vs. the repository, or up to date with the repository.

The gv (Get Version) command is used to update the user's view to the repository's contents. It is only relevant for limbo files, since other files are always seen as their current contents. This is finer-grained than CVS, where an explicit cvs upd must be run to pull copies of the latest modifications into the user's sandbox. With CVFS, the latest (committed) modifications to any file are immediately visible to the user, unless the user has a limbo file, in which case the files must be merged explicitly.

[[ TODO: This may not always be a good thing. Another way to look at it is that if one person is working at the leaf and another person commits work, the committer has created a new leaf and turned the first user's current version into an interior node. In this case, perhaps it should be considered an automatic (probably temporary) branch, so that the rule about limbo files only at leaves can be maintained. Then, we would want a quick and easy way for the merge to happen. For example, when the branch is automatic, the next Get Version (gv) command automatically does a merge with the leaf version of the parent line, and automatically prunes the temporary branch. Again, this is probably (at least mostly) hidden from the user, although the current version may reflect the change to a temporary branch... ]]

[[ TODO: gv is already used for ghostscript, and brings up the X windows viewing interface. Pick better names/mnemonics for Get Version and Put Version. ]]

Structural changes (adding a file or directory, moving or renaming, etc.) cause the parent directory to be out of date, and so require a commit on the parent directory to put them into effect.

cp is intercepted and done as a copy-on-write such that at first, we just remember it as a link to a particular version of the source. The source can change many times, but the copied (pseudo-linked) file still points at the version it came from, still not taking up extra room. When it is changed, then the changes are logged relative to the original.
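
A rough sketch of that copy-on-write behavior follows. The copy records only which version of the source it came from, and once written to, a line-based delta against that origin. The delta format, the class name, and the dictionary standing in for the repository are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def make_delta(old_lines, new_lines):
    """Record only the changed regions as (start, end, replacement_lines)."""
    ops = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False).get_opcodes()
    return [(i1, i2, new_lines[j1:j2])
            for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_delta(old_lines, delta):
    """Rebuild the modified file from the origin plus the logged changes."""
    out, pos = [], 0
    for start, end, replacement in delta:
        out.extend(old_lines[pos:start])  # unchanged run before the change
        out.extend(replacement)
        pos = end
    out.extend(old_lines[pos:])
    return out


class CowCopy:
    """A copied file: a pseudo-link to (name, version) plus an optional delta."""

    def __init__(self, name, version):
        self.origin = (name, version)
        self.delta = []  # empty until the copy is first modified

    def read(self, repo):
        return apply_delta(repo[self.origin], self.delta)

    def write(self, repo, new_lines):
        self.delta = make_delta(repo[self.origin], new_lines)
```

Because the copy points at a specific version, later commits to the source do not affect what the copy reads.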

By having the repository be a specially structured regular filesystem entity, traditional backup and restore techniques can be applied to it.

Below is an example project structure:

    /cvfs/project/
        README
        client/
            Makefile
            client.c
            main.c
        protocol.h 
        server/
            Makefile
            main.c
            server.c
        test/
            Makefile

Below is how it could appear in the repository:

    /var/cvfs/repository/
        project/
            1.1/
                README/
                    1.1
                    LIMBO/                # Non-committed user modifications to file
                        gregor
                        scott
                client/
                    1.1/
                        Makefile/
                            1.1
                        client.c/
                            1.1
                        main.c/
                            1.1
                    LIMBO/                # Non-committed user modifications to directory
                        client
                        client.o
                        main.o
                protocol.h/
                    1.1
                server/
                    1.1/
                        Makefile/
                            1.1
                        main.c/
                            1.1
                        server.c/
                            1.1
                    LIMBO/
                        gregor/
                            main.o
                            server
                            server.o
                test/
                    1.1/
                        Makefile/
                            1.1
            LIMBO/
                gregor/
                    INSTALL
                    doc/
                        design.txt
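
The mapping from a mounted path to its place in this layout can be sketched as below. The two roots are taken from the examples above; the versions dictionary is a hypothetical stand-in for the user's current working version of each node, and the function names are invented.

```python
from pathlib import PurePosixPath

MOUNT_ROOT = PurePosixPath("/cvfs")                 # where users see the project
REPO_ROOT = PurePosixPath("/var/cvfs/repository")   # where it is really stored


def repo_path(user_path, versions):
    """Each path component becomes <name>/<version> in the repository."""
    rel = PurePosixPath(user_path).relative_to(MOUNT_ROOT)
    seen, out = MOUNT_ROOT, REPO_ROOT
    for part in rel.parts:
        seen = seen / part
        out = out / part / versions[str(seen)]
    return out


def limbo_path(user_path, versions, user):
    """A node's limbo area sits beside its versions: <name>/LIMBO/<user>."""
    node_dir = repo_path(user_path, versions).parent  # drop the version part
    return node_dir / "LIMBO" / user


versions = {
    "/cvfs/project": "1.1",
    "/cvfs/project/client": "1.1",
    "/cvfs/project/client/main.c": "1.1",
    "/cvfs/project/server": "1.1",
    "/cvfs/project/README": "1.1",
}
print(repo_path("/cvfs/project/client/main.c", versions))
print(limbo_path("/cvfs/project/server", versions, "gregor") / "main.o")
```

With every node at version 1.1, the two paths printed correspond to entries in the tree above.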

[[ TODO: Do we need to have IDs for the files and directories, so that if one is removed and a node is added with the same name later, they will be distinct? How to handle tags and comments? ]]

An enhanced ls could know it is looking at a CVFS area and show flags for update, etc. along the lines of CVS' status command.

chown and chmod and friends cause limbification.

Read-only access doesn't cause the creation of ghosts in limbo.

CVS has its own diff command. It would be nice if CVFS allowed syntax like: diff foo.c,1.7 foo.c,1.6 when there is a file foo.c in the current directory having versions 1.7 and 1.6. The idea is that foo.c would show up in ls' output, and the versions would not. But, if a request is made to access a file with the comma in it, then CVFS will allow read access to the particular version after the comma (can this be done?). Another angle would be: diff foo.c,= foo.c,-1, which would compare the current version (not the limbo file, if any, since it can be accessed with an unqualified name) with the one prior. To compare the limbo file with the most recent version: diff foo.c foo.c,=.

Consider ls foo.c, (with a trailing comma and nothing after it). Perhaps this could allow us to view the list of versions.

[[ TODO: Should we use a different character than the comma? How about the caret? Whatever we choose, CVFS should refuse to create files having that character in their name. ]]

    ls .,1.6.2
    more .,1.6.2/foo.c
    more foo.c,1.6.2
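
As a first cut at the name syntax, a lookup routine might split a requested name as follows. This is only a sketch of the cases mentioned above (unqualified, explicit version, =, a negative offset, and a bare trailing comma); the function name and the labels it returns are made up.

```python
import re


def parse_qualified(name, sep=","):
    """Split 'base<sep>qualifier' into (base, kind, value)."""
    base, found, qual = name.rpartition(sep)
    if not found:
        return name, "working", None        # limbo file if any, else current
    if qual == "":
        return base, "list", None           # 'foo.c,' : list the versions
    if qual == "=":
        return base, "current", None        # newest committed version
    if re.fullmatch(r"-\d+", qual):
        return base, "relative", int(qual)  # 'foo.c,-1' : one before current
    if re.fullmatch(r"\d+(\.\d+)*", qual):
        return base, "explicit", qual       # 'foo.c,1.7'
    raise ValueError(f"unrecognized version qualifier: {qual!r}")


for example in ("foo.c", "foo.c,1.7", "foo.c,=", "foo.c,-1", ".,1.6.2", "foo.c,"):
    print(example, "->", parse_qualified(example))
```

Passing a different sep (for example the caret) would be enough to try out another separator character.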

Can we do something appropriate for the $Id: $ facility of CVS and RCS? Can we also show presence of limbo file and user?

We will probably want to cache the results of diffs applied to files, with some expiry policy. Factor in: size, cost to produce, and access history. If a cached result has only been accessed recently to construct a further result, then discard the earlier result and cache the derived one.

[[ TODO: File vs. dir vs. project versions? Numbers vs. labels? Binary vs. text files -- better not screw with them. Binary diff. If diffs are too big, just store a copy of new (we can always re-diff if someone wants to know the diffs). ]]

Dimensional File System: Structure dimension (traditional), Historical dimension (versions).

Given the version root and a leaf, we have defined a time line, and so we can view things as-of a particular time.
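
The as-of lookup along such a time line is a simple search. In this sketch the time line is assumed to be a list of (commit time, version) pairs ordered from the root to the leaf; the representation and the function name are illustrative only.

```python
from bisect import bisect_right


def version_as_of(timeline, when):
    """Return the newest version committed at or before 'when', else None."""
    times = [commit_time for commit_time, _version in timeline]
    index = bisect_right(times, when)
    if index == 0:
        return None  # nothing had been committed yet at that time
    return timeline[index - 1][1]


timeline = [(100, "1.1"), (200, "1.2"), (300, "1.3")]  # root to leaf
print(version_as_of(timeline, 250))
```

Here the times are plain numbers; real commit timestamps would work the same way as long as they are comparable and sorted.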

When you do an rm -rf on a directory that is in CVFS and not in limbo, you have created a new version of the parent directory, and must do a Put Version on it to commit it. Put Version should warn for some actions. We should have the ability to resurrect it later if we need to:

    cd foo                # my directory
    rm -r bar             # kill bar
    pv .                  # I mean it
    cp -r .,1.7.3/bar .   # resurrect it (remember: copy-on-write)
    pv .                  # I mean it

    gv -f .               # remove all limbo files and revert to versioned files

    rv -r ,.3             # remove versions past subversion 3 of this (prune the branch)
    rv -r ,               # remove this version

More on how to revert a file: gv normally merges changes made by others, but leaves the file in limbo if it was in limbo before. gv -f removes the limbo file(s) so that only the versioned file is visible; this is unrecoverable. An rm of a file or directory outdates its parent directory, and a gv of that directory will restore it. Looking at ,= lets you still see it.
