Archiving and version control

Marc Groenewegen (marcg@dinkum.nl)

September 12, 2006

Document Information
Organisation	Hogeschool voor de Kunsten Utrecht (HKU)
Version	0.3
Status	proposal

Abstract:
This document provides an introduction to the principles of archiving and version control of software and documentation.

1 Introduction

Archiving is a way to store your knowledge and be able to retrieve it. This requires an archiving system that helps you store all relevant information in such a way that you can retrieve it from the system much later.

1.1 About information structure

The structure of the information is of vital importance, not so much for storing information as for retrieving it. Putting your files into a large garbage bin isn't hard, but when you're looking for the needle in the haystack you wish you had some structure to hang on to. In general we have to define the structure for our archive ourselves and agree upon certain conventions we will all use. The bottom line, of course, is that the users put information into the archive at all. In other words: for an archiving system to work, commitment is required from all its users to use it properly.

1.2 About conventions

An important convention is that users put their information into the archive according to the structure that has been chosen for it, and supply information that clarifies why a new version was made and what has changed w.r.t. the previous version. Another convention specifies certain properties of the files, like a generic layout, special hooks used by the archiving system and meaningful file names.

1.3 Reasons for archiving

Knowledge preservation
Knowledge retrieval
Structuring what we have
Work with several people on the same project

Another purpose of an archiving system is version control. This is the process of giving version numbers or names to specific releases of your product. With a well maintained archiving system it is possible to retrieve all files belonging to any release that was made, e.g. for maintenance purposes.

1.4 Reasons for version control

Access to previous versions
See differences between versions
See the changes you or others made and why
Work with several people on a common project
Make releases

2 How does archiving work?

In general, a version control system can be seen as a large pool of files, called a repository, with a management system that takes care of the file handling.

However, it is much more than just a file system: it stores the files in such a way that any previous version that was ever stored can be retrieved. It does this efficiently, not by storing every version of every file in full but, for example, by storing only the differences between two adjacent versions. On top of that, the version control system makes it possible to compare any two versions of a file, including your own working copy, to see what the differences are or to track how a file developed over time.
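As a rough illustration of storing only differences, the sketch below keeps the first version of a file in full and every later version as a delta against its predecessor. It relies on Python's difflib module. The class name DeltaStore and its methods are hypothetical and are not meant to describe how CVS or any other real system stores its data.

```python
import difflib


class DeltaStore:
    """Keeps version 1 in full and each later version as a delta (illustrative only)."""

    def __init__(self, first_version):
        self.base = first_version.splitlines()
        self.deltas = []  # one delta per later version, oldest first

    def _rebuild(self, delta_count):
        """Reconstruct a version by applying the first `delta_count` deltas to the base."""
        lines = list(self.base)
        for delta in self.deltas[:delta_count]:
            rebuilt = []
            for tag, start, end, new_lines in delta:
                if tag == "equal":
                    rebuilt.extend(lines[start:end])  # unchanged lines come from the previous version
                else:
                    rebuilt.extend(new_lines)  # inserted or replacing lines (empty for a deletion)
            lines = rebuilt
        return lines

    def check_in(self, text):
        """Store a new version as a delta and return its version number."""
        old = self._rebuild(len(self.deltas))
        new = text.splitlines()
        matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
        delta = [
            (tag, i1, i2, [] if tag == "equal" else new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        ]
        self.deltas.append(delta)
        return len(self.deltas) + 1

    def get(self, version):
        """Return the full text of any stored version (1 is the first)."""
        return "\n".join(self._rebuild(version - 1))

    def compare(self, old_version, new_version):
        """Return a unified diff between two stored versions."""
        return list(difflib.unified_diff(
            self.get(old_version).splitlines(),
            self.get(new_version).splitlines(),
            fromfile=f"version {old_version}",
            tofile=f"version {new_version}",
            lineterm="",
        ))
```

Only the changed lines are kept for each new version, yet every earlier version can still be rebuilt and compared with any other.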

To illustrate the use of a central repository, the following picture shows the working directories of three people working on the same project.

In the repository are 3 files: a, b and c, of which a number of versions are stored. Newer versions get more quotes to indicate that these are different instances of the same file.

In each working directory we see some files that are also stored in the central repository of the archiving system. Besides that, we also see files in working directories that are not in the repository, and files in the repository that don't show up in any of the working directories. The files with a square drawn around them are (based on) the most recent versions of a specific file in the archive. A more detailed discussion follows.

Let's start with Peter's files. He's got three files that are also in the repository: a', b and c". The files a' and b are (based on) older versions of files a and b. This means that files a and b were updated in the archive after Peter got them from the archive or submitted them himself. The file c" is equal to the most recent version of file c that is stored in the archive. There are also two files d and e in his working directory that are not in the archive at all. No problem, these might be temporary files, files that he wishes to add to the archive at a later moment or Peter's own utilities.

Gerard has three files that are all in sync with the archive. He's probably just done an update on his working directory. There's also a file f which is not (yet) in the archive.

Marc is working with old stuff. Files a and b are based upon really old versions and file c isn't even in his working directory yet. Looks like it's getting time for an update. Or maybe he's working on a version of the software that doesn't need file c, but even then it's time to update files a and b. The files g and h are not (yet) in the archive.

Any of the three developers can get files from the archive, make changes and update the file in the archive. Thus it can happen that you get file a from the archive while it has version a' and shortly after that someone else updates file a, which then becomes version a" and at the same moment your file is no longer in sync with file a in the archive. For the same reason you must always bring your files up to date just before checking them in to the archive. This means that all changes in the archive are first merged into your working copy, an action that a good archiving system does automatically for you. After this merge you have a working copy that is based on the most recent version in the archive and on top of that contains your changes. This merging process is mostly harmless but there are situations that cause conflicts. Read more about merging in the chapter about conventions.
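The rule that a working copy must be up to date before check-in can be sketched as follows. This is a minimal model under simplifying assumptions: revisions are plain counters, and the names Repository and StaleWorkingCopy are hypothetical. A real system would offer to merge instead of simply refusing.

```python
class StaleWorkingCopy(Exception):
    """Raised when a check-in is based on a version that is no longer the newest."""


class Repository:
    """Toy repository that remembers every version of every file."""

    def __init__(self):
        self.history = {}  # file name -> list of contents, oldest first

    def head(self, name):
        """Number of the newest version of a file (0 if the file is not in the archive)."""
        return len(self.history.get(name, []))

    def check_out(self, name):
        """Return (version number, content) of the newest version."""
        versions = self.history[name]
        return len(versions), versions[-1]

    def check_in(self, name, content, based_on):
        """Accept the new content only if it is based on the newest version."""
        if based_on != self.head(name):
            raise StaleWorkingCopy(
                f"{name}: working copy is based on version {based_on}, "
                f"archive is at version {self.head(name)}; update first"
            )
        self.history.setdefault(name, []).append(content)
        return self.head(name)
```

If two developers check out the same version and one of them checks in first, the other's check-in is refused until the working copy has been updated.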

3 Versions, revisions and releases

Each version of a file has a unique revision number. Revision numbers look like '1.1' or '1.2' and are given to your files automatically by the revision control system when you check in your files.

A release is a collection of software, tools, hardware and documents that belong together at a certain moment in time. A release is made by putting a label on every file in the archive that should be in the release. At a later time it is then possible to retrieve the entire release collection from the archive by specifying the release label.
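The relation between revision numbers and release labels can be sketched as below. The function names and the label 'release_1_0' are hypothetical. The numbering assumes the simple case in which each check-in increments the last component of the revision number.

```python
def next_revision(revision):
    """Return the revision number that follows, e.g. '1.1' -> '1.2'."""
    parts = revision.split(".")
    parts[-1] = str(int(parts[-1]) + 1)
    return ".".join(parts)


def tag_release(labels, label, current_revisions):
    """Put `label` on the current revision of every file that belongs to the release."""
    if label in labels:
        raise ValueError(f"label {label!r} is already in use")
    labels[label] = dict(current_revisions)  # copy, so later check-ins don't alter the release


def files_in_release(labels, label):
    """Return {file name: revision} for everything that carries the label."""
    return labels[label]
```

Because the label records one revision per file, the release can still be retrieved after the files have moved on to newer revisions.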

4 Archive structure

To set up our archive, we have to agree upon a structure. A starting point can be the following list of directories for every subsystem:

doc - subsystem documentation
src - subsystem source code
include - subsystem header files
config - subsystem configuration scripts
bin - subsystem binaries
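The directory list above could be set up with a small script such as the following sketch. The function name create_subsystem_skeleton is hypothetical, as is the assumption that each subsystem gets its own directory under the archive root.

```python
from pathlib import Path

# Directory names and their purpose, taken from the list above.
SUBSYSTEM_DIRS = {
    "doc": "subsystem documentation",
    "src": "subsystem source code",
    "include": "subsystem header files",
    "config": "subsystem configuration scripts",
    "bin": "subsystem binaries",
}


def create_subsystem_skeleton(archive_root, subsystem):
    """Create the standard directories for one subsystem and return their paths."""
    created = []
    for name in SUBSYSTEM_DIRS:
        directory = Path(archive_root) / subsystem / name
        directory.mkdir(parents=True, exist_ok=True)  # safe to run more than once
        created.append(directory)
    return created
```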

5 Conventions

This chapter proposes some conventions and policies that may help you to successfully use archiving systems.

5.1 File names

File names should adhere to the following:

Only letters, digits, periods and underscores are allowed. File names should not contain spaces.
Descriptive names bring clarity. As an example: the name 'mediate_architecture.xml' says far more than 'm_121.xml'
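The first rule lends itself to an automatic check. The sketch below is one possible reading of it, assuming that 'letters' means the unaccented letters a to z in either case. The function name is hypothetical. Whether a name is descriptive cannot be checked this way and remains a matter of judgement.

```python
import re

# Letters, digits, periods and underscores only; at least one character.
_ALLOWED_NAME = re.compile(r"[A-Za-z0-9_.]+")


def is_valid_file_name(name):
    """Return True if the name uses only the allowed characters."""
    return _ALLOWED_NAME.fullmatch(name) is not None
```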

5.2 File headers

File headers provide information about the file, such as the name of the author, the purpose of the file, description of modifications, modification dates etc.

Especially for program code some of this information is very useful. Therefore all program code files should have a file header that follows one generic format, apart from differences due to the way comments are handled in the various programming languages.

The following applies to CVS as a revision control system. For all files stored in the revision control system an ID, version number and log information are kept. Using special keywords, e.g. $Revision: 1.7 $, this information can be made explicit in the actual files. The revision number and log information are incorporated in file headers for all program files. The revision number may also be very useful in documentation and can be used in e.g. XML or HTML files that are under version control.

An example of a file header for C or C++ files is shown here:


/********************************************************************
* 	(c) Copyright 2002, Hogeschool voor de Kunsten Utrecht
*			Hilversum, the Netherlands
*********************************************************************
*
* File name	: archiving.xml
* System name	: mediate
* 
* Version	: $Revision: 1.7 $
*
*
* Description	: A data preservation fairytale
*
*
* Author	: Marc Groenewegen
* E-mail	: marcg@dinkum.nl
*
*
********************************************************************/

/************
   $Log: archiving.xml,v $
   Revision 1.7  2007/10/09 14:16:57  marcg
   subversion instead of cvs

   Revision 1.6  2005/09/14 20:30:43  marcg
   from draft to proposal

   Revision 1.5  2003/05/05 15:21:49  marcg
   Set current date

   Revision 1.4  2002/04/24 15:56:23  marcg
   Changed titlepage layout according to new format
   Added LaTeX control commands for paragraphs
   and some minor improvements

   Revision 1.3  2002/03/27 21:24:48  marcg
   Lots of additions

   Revision 1.2  2002/03/26 10:29:28  marcg
   Numerous additions

*************/
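To give an idea of what happens to a keyword such as $Revision$ in the header above, the sketch below imitates the substitution for that one keyword. It is a simplified imitation and not CVS's own implementation. The function name expand_revision is hypothetical.

```python
import re

# Matches both the bare keyword "$Revision$" and an already expanded
# form such as "$Revision: 1.7 $".
_REVISION_KEYWORD = re.compile(r"\$Revision(?::[^$]*)?\$")


def expand_revision(text, revision):
    """Replace every Revision keyword in the text with its expanded form."""
    return _REVISION_KEYWORD.sub(lambda _match: f"$Revision: {revision} $", text)
```

Since the expanded form is recognised as well, the same file can be processed again at the next check-in and will then show the new revision number.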

5.3 Binary files

Binary files can be stored in a version control system but this is generally not a good idea. In the case of CVS, storing binary files is possible but viewing differences between versions must be done by an external program by e.g. checking out two versions into two different places and applying a special purpose diff'ing utility. There is another reason not to submit binary files: resolving merge conflicts is hardly possible. More about this is written in my CVS introduction document.

Examples of binary files

Executable (program) files (.exe .com)
Word documents (.doc)
Excel sheets (.xls)
Images (.jpg .gif)
Audio files (.au .wav .aiff)

Examples of text files

Plain text files (.txt)
HTML (.html)
XML (.xml)
Program code (.c .cpp .h .java .php)
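A first rough classification can be made from the file extension alone, using the two lists above. This is a sketch with a hypothetical function name. The extension is only a hint, and files with an extension not listed here are reported as unknown rather than guessed.

```python
from pathlib import PurePath

# Extensions taken from the two lists above.
BINARY_EXTENSIONS = {".exe", ".com", ".doc", ".xls", ".jpg", ".gif", ".au", ".wav", ".aiff"}
TEXT_EXTENSIONS = {".txt", ".html", ".xml", ".c", ".cpp", ".h", ".java", ".php"}


def classify(file_name):
    """Return 'binary', 'text' or 'unknown', judged by the extension only."""
    suffix = PurePath(file_name).suffix.lower()
    if suffix in BINARY_EXTENSIONS:
        return "binary"
    if suffix in TEXT_EXTENSIONS:
        return "text"
    return "unknown"
```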

5.4 Executables, releases

Executables derived from source files in the archive are only stored in the archive when they belong to a release. The reason for this is that executables normally can be generated from a set of source files and because these source files are already available in the archive it is not necessary to store the executables.

This reasoning can be extended to all files that can be generated from files already in the archive.

An exception to this rule is made for executables being part of a release. In theory it is possible to reconstruct entire releases from previous versions of source files but in practice this is not always the case. For that reason, for every major release, all generated files are also packed in a release file and stored in the archive for future use in e.g. debugging.

5.5 Check-in policy and merging

You must always bring your files up to date just before checking them in to the archive. This means that all changes in the archive are first merged into your working copy, an action that a good archiving system does automatically for you. After this merge you have a working copy that is based on the most recent version in the archive and on top of that contains your changes. This merging process is mostly harmless but there are situations that cause conflicts. Most of these conflicts are solved by plain human communication.

What causes merge conflicts? The most obvious example is when you and one of your colleagues are working on the same file. You both have your own working copies, so you're free to modify what you want, as long as you merge the latest version in the version control system into your own file just before check-in. Now suppose you remove a specific section and put the file back into the version control system. Your colleague also wants to put his working copy into the system, but he last updated his file before you committed your version to CVS. His version will then still contain the section you just removed, and if he has changed that section in the meantime, the system cannot decide by itself whether the section should stay or go.
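The situation can be made concrete with a simplified three-way merge. The sketch below works on whole sections instead of lines, which real merge tools do not do. A file is modelled as a dictionary from section name to text, and the function name merge_sections is hypothetical. The three inputs are the common ancestor (base), the version in the archive (theirs) and the working copy (mine).

```python
def merge_sections(base, mine, theirs):
    """Three-way merge of {section name: text} dictionaries.

    Returns (merged, conflicts). A section that is missing from a
    dictionary counts as deleted on that side.
    """
    merged = {}
    conflicts = []
    # Preserve the order in which section names first appear.
    names = list(dict.fromkeys([*base, *mine, *theirs]))
    for name in names:
        b, m, t = base.get(name), mine.get(name), theirs.get(name)
        if m == t:
            result = m  # both sides agree, possibly on a deletion
        elif m == b:
            result = t  # only the archive side changed this section
        elif t == b:
            result = m  # only the working copy changed this section
        else:
            conflicts.append(name)  # both changed it differently: a human must decide
            continue
        if result is not None:
            merged[name] = result
    return merged, conflicts
```

If only one side touched a section, the change is taken over silently. If one side removed the section while the other edited it, the section is reported as a conflict, which is the case that has to be settled by talking to each other.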

5.5.1 Check-in rule 1: never submit junk

This section mainly applies to software design but can be extended to other areas of work. A version control system is not a backup mechanism for your daily work and non-functional source code. The archive typically contains groups of files that together make working software programs. As each group develops, the associated programs develop. A golden rule in using a version control system is that it must at any time be possible to construct a working copy of your program from the latest source code in the archive. This implies that one should never put non-functional or uncompilable files into the archive. Remember you're probably working in a team of people who also use your files, so if you introduce a bug or submit partly finished, uncompilable files into the archive, your colleagues will not be able to test their own files because they will be unable to build a working program.

Making sure you don't put rubbish into the archive is easy. When you've finished editing your working copies, update all files that are needed to build the program so they are in sync with the archive. Rebuild your program from your working files. If the build process is successful and your program doesn't blow up in your face when you start it, you're ready to submit your changed working copies into the archive.
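The procedure above is a fixed sequence in which every step must succeed before the next one is tried. A minimal sketch, with hypothetical names: each step is passed in as a function that returns True on success, so that the same skeleton can wrap whatever update, build and start-up commands a project actually uses.

```python
def safe_check_in(update, build, smoke_test, commit):
    """Run update, build and smoke test in order; commit only if all succeed."""
    steps = (("update", update), ("build", build), ("smoke test", smoke_test))
    for name, step in steps:
        if not step():
            return f"stopped: {name} failed"  # nothing is submitted to the archive
    commit()
    return "committed"
```

In practice the four functions might call the version control client and the build tool through the subprocess module. Which commands those are depends on the project and is not prescribed here.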

5.5.2 Check-in rule 2: take small steps

Work incrementally in small steps and at every step submit your changes to the archive. Small steps are expressed in days rather than weeks of implementation work and can even be as small as half an hour's work. This way of working has several advantages:

For every change, you'll make a separate log entry describing the why and what of the change, so it is immediately clear to what code changes your log remarks correspond
Small changes are more or less restricted to a specific version of a file, which makes difference tracking easier
When other developers have made changes to files under version control you'll find out early
There's hardly any need for making backups because you'll use the version control database for that purpose

5.6 Checkout policy: (un)reserved check-outs

If the version control system offers both reserved and unreserved checkout, unreserved checkout is preferred. Reserved checkout means that the file gets locked upon checkout, thus preventing other users from working on the same file. The lock is released after check-in. In most cases, this mechanism is unnecessary and causes more trouble than it prevents. The most prominent disadvantage is that users tend to forget to unlock files, thus slowing down development.
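The locking behaviour of a reserved checkout can be sketched in a few lines. The class LockTable and its method names are hypothetical and only model the principle: one holder per file until that holder checks in.

```python
class LockTable:
    """Tracks which user holds the reserved checkout of which file."""

    def __init__(self):
        self.locks = {}  # file name -> user holding the lock

    def reserved_check_out(self, name, user):
        """Lock the file for `user`; return False if someone else holds it."""
        holder = self.locks.setdefault(name, user)
        return holder == user

    def check_in(self, name, user):
        """Release the lock; only the holder can do this."""
        if self.locks.get(name) != user:
            return False
        del self.locks[name]
        return True
```

The sketch also shows the weak spot: as long as the holder does not check in, everybody else is refused.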

One case for which reserved checkouts are useful concerns binary files. If there is no merge tool for binary data, which is often the case, then conflicts due to file updates by multiple users are often hard to solve. In my opinion however, reserved checkouts are mainly a replacement for bad group communication so I propose not to use them unless you hate your colleagues and are looking for a way to really annoy them.

6 Archiving systems

6.1 CVS

CVS is used throughout the open-source community and is suitable for 'serious work'. It can use a remote repository through its own 'pserver' protocol, a dedicated protocol that makes it possible to connect to the CVS server from other computers over TCP/IP. You use CVS as if the repository were on your own computer, but in fact all CVS commands go through the CVS server. This requires a connection to a dedicated port on the server, so if the server is behind a firewall, special provisions may have to be made.
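A pserver repository is usually addressed with a location string of the form ':pserver:user@host:/path', optionally with a port number in front of the path. The sketch below splits such a string into its parts. It covers only this simple form, the function name is hypothetical, and the host in the usage example is a made-up placeholder. The default port 2401 is the one conventionally used by pserver.

```python
import re

PSERVER_DEFAULT_PORT = 2401  # conventional pserver port

_PSERVER_ROOT = re.compile(
    r":pserver:(?P<user>[^@:]+)@(?P<host>[^:/]+):(?P<port>\d+)?(?P<path>/.*)"
)


def parse_pserver_root(root):
    """Split ':pserver:user@host:[port]/path' into a dictionary."""
    match = _PSERVER_ROOT.fullmatch(root)
    if match is None:
        raise ValueError(f"not a pserver location: {root!r}")
    port = match.group("port")
    return {
        "user": match.group("user"),
        "host": match.group("host"),
        "port": int(port) if port else PSERVER_DEFAULT_PORT,
        "path": match.group("path"),
    }
```

For example, parse_pserver_root(':pserver:anonymous@cvs.example.org:/cvsroot') yields the user 'anonymous', the host 'cvs.example.org', port 2401 and the path '/cvsroot'. The port is the one that has to be reachable through a firewall.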

6.1.1 CVS clients

tkCVS, graphical client for UNIX and Windows
MacCVS Pro, graphical client for Mac OS
CVSweb, only for viewing
Chora, only for viewing

Because the graphical front-end tkCVS is just a graphical shell around CVS, it also works perfectly with the remote access mechanism (pserver). tkCVS requires Tcl/Tk version 8.1 or newer. As an alternative to graphical clients, it is possible to use CVS as a command-line tool, for instance in a telnet session.

6.2 Subversion

See http://subversion.tigris.org for more info. Subversion is said to be a "version control system that is a compelling replacement for CVS in the open source community". The server runs on various UNIX platforms; clients are available for numerous UNIX platforms and Windows.

7 CVS related links

The following links direct you to more information, documentation and downloads.

CVS	http://ximbiot.com/cvs
CVS documentation	http://ximbiot.com/cvs/manual
Tcl/Tk	tcl.sourceforge.net
TkCVS for UNIX/Linux and Windows	twobarleycorns.net
MacCVS Pro	www.maccvs.org
