OddThinking

A blog for odd things and odd thoughts.

Version Number Challenge

Here’s a little challenge for the software developers. It arose as I considered the meta-data in a RAID configuration, but it has wider applicability, so I will generalise a little. It is probably a solved problem in Computer Science, but I’ve not heard of it before, so I will happily reinvent this wheel. I do not have a good solution yet, but it seemed an interesting puzzle.

Simple Starter Problem.

Suppose you are developing some software. The requirement is that, when the software starts up, it is given two different versions of an file, and it must use the latest version.

Simple Starter Solution

That’s easy! You simply store a generation number with the object and increase it every time the object is modified. A relatively small number amount of bytes can handle a large number of changes.

According to my calculations, a 64-bit generation number easily cover 10 version changes every second for the entire history of the Universe.

Oohh! I’ve just had a product idea! A font and template for Excel just for these sorts of calculations. It makes the spreadsheet look like the cells have actually been written on the back of an envelope!

Slightly Harder Version of the Problem

However, what if the files might have been branched in the meantime? So a copy of the file might have been edited while the main file was also been edited.

There are three possible scenarios:

  • Neither file was touched, in which case they are both identical.
  • One file is derived directly from the other file, in which case the later file should be used.
  • Both files were edited, in which case a more sophisticated merge needs to take place.

For example: When a PDA is reconnected to a PC, and the files are resynchronised, the system needs to work out which files should be kept, which should be overwritten and which should be flagged as being in conflict.

Overly-Simplistic Solution

If storage space was no object, this would be easy to solve. You could store a complete version history of the object in the version number – perhaps using a code for each device.

A version number might look like this PC.PDA.PDA.PDA, which means the file was created on the PC, and edited three times on the PDA. When the file is compared against PC.PDA.PDA.PDA.PC.PC, the system can see that the latter is more up to date. Meanwhile, PC.PDA.PDA.PC.PC.PC would be highlighted as a conflicting version.

You could make this system even more robust with timestamps and other distinguishing marks.

Hard Problem

When the system’s architect sees your plans above, a new design constraint is quickly added. The architect limits you to a bounded storage budget.

On the other hand, the architect will accept a probabilitic solution – one that will probably detect which one is the latest. If it helps, it may be less accurate (i.e. lower probability of a correct answer) after a massive number of off-line changes. In comparison, when there has only been one or two changes off-line, when it should be highly accurate.

There’s room to negotiate how much space you need in return for how accurate you can get, but let’s say, in the first draft, 32 bytes of storage have been put aside for this.

A Poor Solution

Here’s one solution:

You treat the 32 bytes as a pool of 256 flags, all initialised to 0 as the file is created.

Each time the file is modified, a random flag is set to 1.

When comparing two files, if one file contains all the set flags of the other file, and more, then it is later.

This solution is very limited – it only handles 256 changes. The desired behaviour starts to collapse completely as the number of changes approaches 256.

Can you do better?

So my question is, what would be a good algorithm to use here, to handle more than 256 changes, while still retaining the right stochastic profile?


Comments

  1. Umm, maybe I don’t understand the problem. What’s wrong with using the last-modified date? Or does the solution have to use a version/generation number, or derivative?

  2. Divide the 32 bytes into two (or more) 16 byte integers. One stores the number of times the file has been edited on the PC, one the number of times it has been edited on the PDA.

    When you sync, if both numbers are the same, no changes have been made; if one differs then a change has been made on that device (and the higher value tells you the more recent version); if both differ you have a conflict.

  3. Alastair: last-modified doesn’t have enough information. Consider…

    scenario A:

    time 1: file synced
    time 2: modified on PDA
    time 9: trying to resync

    PDA version has latest modified date, and includes all edits.

    scenario B:

    time 1: file synced
    time 2: modified on PC
    time 3: modified on PDA
    time 9: trying to resync

    PDA version has latest modified date, but does not include all edits; a merge is required.

    At the resync, modified date alone does not distinguish between these scenarios.

  4. I’m afraid I also don’t get it. This is all about the merge tool, right? If it needs the base version, then you either need to store the base version or the actual changes that were made. If it doesn’t need a base version, what difference does it make which file is newer? It can do the merge either way… IMHO synch works with timestamps. (ie: synch time, latest time of current file, which makes 3 timestamps total)

    I’ve just verified this with my palm, and it seems to use timestamps (it does this with timestamps per calendar item, instead of per file, but if you mess with the same calendar item in two different ways, it will dupe the task)

  5. Chris: but if we store the date of the last sync, we can compare this with the dates of the two files, and see if either or both of them has been modified since that date.

    I gather the idea is to detect the need for a merge, not necessarily provide all the information needed to merge the two files.

  6. I’m grumpy.

    I had visions of cool, stochastic algorithms, using the power of Mandelbrot sets (or some other sexed-up mathematics) to shoot constantly-bifurcating horribly-intertwined paths through the 32-byte phase-space, in a manner exhibited remarkable emergent behaviour – (i) the phase-space would be covered completely and evenly and (ii) the recent history of the bifurcations could be determined purely by the current position in phase-space.

    Instead, I am offered two solutions that completely solve the problem – one is a single line of code and the other only uses only 12-bytes and a wristwatch.

    Humph! Dull. Good engineering, but dull. 🙂

    Let me go back to the original inspiration to the problem, and see if I can’t couch it differently, to disrupt these solutions.

    The original inspiration was mirrored (RAID) disk drives.

    If a RAID disk-controller has two drives, they can be kept in sync. If the power cable falls out of one drive, for a while, the computer can keep going. When the power cable is restored, the disk-controller needs to detect the version on the fixed drive is a true ancestor of the current drive, and therefore can be overwritten with the latest. (Note: This isn’t a merge, with human intervention. It is just an overwrite.)

    On the other hand, consider if the second drive had been unplugged and plugged into another machine, which had written more data onto it, before being moved back to the original machine. When it got plugged back in again, the RAID controller should get alarmed. There is novel data on this drive that may get overwritten. A manual merge is required (in either direction as Sunny points out.)

    So how can we detect between these two circumstances?

    Am I allowed to claim that RAID-controllers don’t have access to accurate time-of-day clocks and therefore can’t use timestamps?

    That would sabotrage Alastair’s idea, but it still leaves me with Chris’s idea of splitting the 32 bytes into two halves. I want to generalise the problem to an larger number of devices but then my mirrored disk example fails to justify the requirement, and I am left unable to force a sexy solution.

    Hence I’m grumpy.

    I will just have to resolve myself with having to put up with simple, elegant, workable solutions.

  7. Sexy maths is almost always bad. Have I ever told you about Quaternians?

Leave a comment

You must be logged in to post a comment.