Digital Data Preservation

About

Preserving data short term is relatively simple from an end-user point of view. Make a file, copy it somewhere, trust your computers and storage are not going to delete or corrupt it. Things get considerably more complex when we want to store data "long term".

"Long term" is also ambiguous. For some people and industries, "long term" means seven years (for example, financial legislation in the country I live in). But if we consider what we want from preservation, "long term" can mean decades, centuries, or more. Consider data that you may have saved to 3.5" floppy diskette in the 80s. Or really any data format we now consider obsolete in less than a single human lifetime. Can we extract data from these today? And if we can, how can we verify the data is complete and free from any corruption? Will future generations be able to do the same?

We consider "data" to be anything of interest, represented digitally. ROM dumps, image scans, digital photographs, audio files, etc. How we obtain this data is out of scope for this document, as is functionally using the data once we've veriified its integrity. This document will focus purely on preserving "data at rest" from a technical stand point.

Background

I'm a systems administrator by trade, specialising in open source, information security and data integrity. Over several decades my career has focused on gaining practical expertise in multiple technology domains around storing, transferring, managing and protecting data that is important or valuable to companies, organisations and governments. I've worked for and within the industries of finance, superannuation and banking, architecture and engineering, Hollywood VFX and post production, high performance computing and several others, all with their own strict rules around data preservation (and sometimes destruction). I've been responsible for data three orders of magnitude larger than the industry norm of the day (terabytes when the standard was gigabytes, petabytes when the standard was terabytes).

Data preservation, especially that of the games industry in Australia, is a hobby and passion of mine, and I apply the expertise I've learned in my career to that hobby.

Challenges

The biggest technical challenges to data preservation are threefold:

  • Accessibility - can the stored data be read on modern machines and systems?
  • Integrity - can we verify the data read is complete, and has not suffered degradation, damage, corruption or other forms of "bit rot"?
  • Repair - if the data is damaged, can we repair it, and verify that it's repaired?

This page aims to provide some "tips and tricks" to help combat these three challenges.

Open Source and Free Software is key

To begin with, all the tools on this page will be free software ("free" as in "liberty"; zero-cost software that is not open source is not considered). The tools we use to ensure the preservation of software must also be easily preserved. Proprietary software cannot be maintained easily by a community, and relies on the motivations of its owners. Should they cease developing the tools, then the tools themselves are no longer preserved, making our preservation efforts doubly difficult.

Free Software and Open Source tools are easily ported to new computers and operating systems, which is itself a key point in preservation. Ensuring that these tools run the same way across different generations of hardware is essential.

Constantly migrate data, keep multiple copies

Obsolete hardware is a huge challenge. 3.5" floppy diskettes, DAT tapes, 50-pin SCSI drives - all of these were premium storage options once, and all are either technically or financially difficult to access today. Likewise, both standard PC SATA storage and removable devices like SD cards will one day be obsolete, and difficult to retrieve data from.

Migrating data to new storage is key, as is keeping it on different types of storage. It is understood that no media is permanent. CD/DVD/Blu-ray and other optical media degrade, as the writable layers rot or peel away from the plastic over time. Spinning hard disks suffer mechanical failures in their components, as well as logical failures on the magnetic platters. Flash storage suffers cell write wear, and eventually fails.

ALL MEDIA FAILS, EVENTUALLY. And any good IT person will tell you that if you have only one copy of your data, you may as well have zero copies. The Digital Preservation Coalition updates its "Bit List" of digitally endangered species each year, an ever-increasing list of information objectively identified as being at risk of permanent loss.

Migrating data to new platforms and different media is essential. Diversifying the types of storage chosen (by technology and vendor) improves the chances of reading that data in the future.

Choosing good tools to copy data is essential. My preferred copy tool is rsync. With the --checksum flag, rsync compares files by checksum rather than just size and timestamp, and with the --archive flag it preserves metadata such as date and time stamps. Repeating an rsync --checksum --archive command on existing data will not overwrite files that are unchanged, and provides a further integrity check between two copies on different machines. Rsync works over multiple protocols, and is a sane choice for sending data over long distances, including over-the-wire compression and encryption depending on the network protocols used. Rsync has been ported to many different platforms, including modern desktops, tablets and phones.
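
As a rough sketch of the workflow described above (the hostname and paths here are placeholders, not a recommendation of any particular layout):

  # Copy an archive tree to a second machine, preserving metadata and
  # comparing content by checksum rather than size/timestamp alone.
  rsync --archive --checksum --verbose /srv/archive/ backup-host:/srv/archive/

  # Re-running the same command later acts as an integrity check between the
  # two copies; --dry-run with --itemize-changes reports any differences
  # without changing anything.
  rsync --archive --checksum --itemize-changes --dry-run /srv/archive/ backup-host:/srv/archive/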

Some file systems include good quality, efficient send/receive replication mechanisms. ZFS and BtrFS both employ these methods, which will be covered more in "File Systems".

The section "Verify Data Integrity" includes links to further tools that provide reliable copy options.

If cloud services are used, consider hosting your data on at least two competing services, to avoid loss should one of them fail (whether through technical or business failure).

If local RAID systems are used, consider keeping data on two different types of system, operating system or file system, from different vendors.

Even if cloud or complex RAID systems are used, keep offline backups. Neither cloud nor RAID is a substitute for backups, and backups should be kept on different media, preferably in geographically dispersed locations, to prevent destruction by fire, flood, theft or other physical means.

Verifying data integrity

Digital data is "perfect" in that it is ultimately represented by zeroes or ones. Unlike analogue data, this finite representation can be mathematically "fingerprinted" or checksummed to verify it has not been altered. Several checksum tools exist, but like all software tools, these are updated as flaws are found.

Currently several good methods exist for data checksumming. Two preferred methods today are:

  • SHA - Secure Hash Algorithm. SHA offers a variety of block, bit and word sizes, and generally speaking later versions with longer bit lengths offer better protection against "collisions" (i.e.: where two different pieces of data share the same fingerprint). For the purposes of pragmatic preservation, SHA3-256 at time of writing provides a nice balance of performance and trustworthiness.

  • xxHash. xxHash is considered "non-cryptographic", but in basic testing has proven to be not only reliable but extremely fast. For preservation efforts, constant checksum verification is necessary, but it can be hard on CPUs (which means it can take a long time, and can also cost a lot of money in power, heat and, for cloud services, compute time). xxHash's speed can greatly relieve the strain on devices or systems low on resources.

Creating checksums can be done through a variety of tools that support these algorithms, the simplest being command-line tools that write checksums out to a plain-text manifest file. Where possible, checksums should be taken as early in the acquisition process as possible, and routinely verified through scheduled/scripted means. The xxHash link above provides a list of tools that support xxHash, including verification and copy tools.
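
As a minimal sketch (the file and directory names here are purely illustrative, and the exact flags vary between sha3sum/xxhsum implementations, so treat this as an assumption to verify against your installed tools):

  # Build a SHA3-256 manifest for everything under the archive directory.
  find /srv/archive -type f -print0 | xargs -0 sha3sum -a 256 > archive.sha3

  # Later, re-read every file and compare it against the stored manifest.
  sha3sum -a 256 --check archive.sha3

  # The same idea with xxHash, which is much faster but non-cryptographic.
  find /srv/archive -type f -print0 | xargs -0 xxhsum > archive.xxh
  xxhsum --check archive.xxh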

Some file types contain internal checksums, which can help the process. Often these are older-style checksums such as CRC, and while they are useful to check against what was stored previously, re-calculating checksums with newer/better algorithms and storing these externally should be done to mitigate problems with the old algorithms.

Third party tools are available that can automate the checksumming and verification processes. As always, I recommend open source tools first and foremost to ensure the longevity of the tool itself (any tool that ceases working when a license or vendor expires is not long-term reliable). With that in mind, AVP Fixity is available as a GUI tool for Windows and Mac users, and is considered a de facto standard in the data preservation world. Source code is available on GitHub.

Repair damaged data

The simplest way to repair damaged data is to have multiple copies, verified by checksum, and when errors are detected, make a new copy. The disadvantage in this case is the time required to do these steps, and the data storage space requirements.

Alternative methods exist that can help detect and repair damaged data. These shouldn't be considered a replacement for any of the above, but can be used in conjunction with them.

Erasure coding techniques such as Reed-Solomon error correction store extra data alongside the original, from which damaged or missing data can be mathematically recreated. The amount of data that can be repaired is typically expressed as a percentage of how much space we're willing to sacrifice to the repair data. This technique is similar to how RAID5 and RAID6 systems can survive a failed disk and rebuild data once a replacement drive is inserted, with the missing data recalculated from parity information.

For practical use, the Parchive (joining the words "Parity" and "Archive") tools are the easiest way to achieve this in user space at a file level. The par2 command line tools can generate a given volume of erasure code data (generally chosen as a percentage of the original file size, with a higher percentage offering better protection against more data loss, at the expense of more disk space). Storing PAR2 files with the original data, in addition to the other techniques in this document, further reduces the risk of data corruption by offering a method to repair the data. This can sometimes expedite the repair process, especially if backups and secondary copies take time and resources to fetch.
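
A minimal sketch with the par2 command line tools (the file names and the 10% redundancy figure are illustrative only):

  # Create parity volumes with roughly 10% redundancy alongside the original.
  par2 create -r10 diskimage.par2 diskimage.img

  # At any later point, verify the file against its parity data.
  par2 verify diskimage.par2

  # If verification reports damage, attempt a repair from the parity blocks.
  par2 repair diskimage.par2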

File Systems

A file system is the method by which a disk or drive is formatted to logically hold data, often representing it back to the user as files. File systems, like all software, have matured over time. Several modern file systems now exist that offer a high standard of data storage.

Computer hard disks are growing in size. Single hard disks were once measured in kilobytes (thousands of bytes) and are now measured in terabytes (trillions of bytes, or a million million bytes), and like everything in computing they continue to grow exponentially. Hard disks have what is called a URE ("Unrecoverable Read Error") rate. This is the rate at which a data read (or write) can "silently" be incorrect, such that the computer as a whole cannot detect the error.

For domestic "home user" hard disks, this is roughly 1:10^14 bytes (1 byte in 12TB), and for commercial "enterprise quality" disks, 1:10^15 bytes (1 byte in 125TB). These are averages, which means that error rates can be worse than this. Considering that, at time of writing, 10TB drives exist, this means the likelihood of corrupt data existing on large volumes is growing. This silent corruption is commonly referred to as "bit rot".

For a file system to accurately detect this bit rot, it must have direct access to the physical disk (i.e.: not through a RAID controller), and it must be able to checksum the data at all levels (in RAM before being written to disk, and again once on disk, block by block). Several file systems exist that offer this feature set, and should be considered mandatory for long term data storage on computers and servers. This list includes (but is not limited to):

  • ZFS - stable in multiple configurations, including "RAIDZ", which offers RAID5/6 and beyond style disk configurations
  • BtrFS - stable only for RAID1 and RAID10 configurations (RAID5/6 configurations are not yet production stable)
  • Microsoft ReFS - by default this ships without the features enabled, but when set up in a "Resilient configuration" with "Integrity Streams" enabled, offers the features required.
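
To illustrate what this looks like in practice on ZFS (the pool and device names below are placeholders), a pool can be periodically "scrubbed": every block is read back, verified against its checksum, and repaired from a redundant copy where one exists:

  # Create a mirrored pool across two disks (device names are examples only).
  zpool create archive mirror /dev/sda /dev/sdb

  # Walk every block in the pool, verify checksums, and repair from the
  # mirror copy if corruption is found.
  zpool scrub archive

  # Report the results, including any checksum errors detected or repaired.
  zpool status -v archive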

Sadly Apple's APFS does not provide block-level checksumming of user data, which puts Apple users at a disadvantage. However, for non-boot drives, ZFS is available via the OpenZFS project for multiple operating systems, including Linux, BSD, Solaris and Apple macOS.

Older file systems such as FAT and HFS+ are noted for their lack of reliability, especially on modern-sized large volumes. For long term data storage, these should be avoided.

Several options exist for people looking to build purpose-specific multi-disk archival systems. FreeNAS allows you to install FreeBSD and the ZFS file system onto consumer hardware, presented through a simple-to-manage web interface. It supports various network share export options including SMB (Microsoft Windows and Apple macOS style file sharing), NFS (UNIX/Linux style file sharing), FTP/SFTP/SCP and other network transfer protocols. All the other features of ZFS, including block level checksums, built-in lossless compression, various RAID levels and internal duplicate copies, file system snapshots and remote send/receive (rapid, automated, simplified, bandwidth-efficient transferring of data and file modifications between geographically separate file stores), are available too.
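
To give a flavour of the snapshot and send/receive workflow mentioned above (the dataset, snapshot and host names here are placeholders):

  # Take a read-only, point-in-time snapshot of the archive dataset.
  zfs snapshot archive/data@2020-01-01

  # Replicate that snapshot to a pool on another machine over SSH.
  zfs send archive/data@2020-01-01 | ssh offsite-host zfs receive backup/data

  # Subsequent snapshots can be sent incrementally, transferring only the changes.
  zfs send -i archive/data@2020-01-01 archive/data@2020-02-01 | ssh offsite-host zfs receive backup/data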

Compression

Repeating the advice earlier in this document, it's recommended to choose compression standards that are both open source, and long-term reliable.

Older proprietary compression techniques such as LHArc and RAR, while both popular in their time, are beginning to become problematic to decompress today. At some point, they may even require emulation techniques of their own to run the binary decompressors. With that in mind, there are also several modern compressors today that are very tempting to use, but whose long term viability has to be questioned. For example, lrzip provides excellent compression on large files such as ISO images and full disk dumps [2020 edit - zstd, mentioned elsewhere in this document, also now supports long range windows for compression, as well as better multi-threading and performance, which makes lrzip less tempting]. However, the code base changed heavily between the 0.5x and 0.6x releases, making backwards compatibility impossible. Until the format settles, the viability of the tool for long term data archiving should be questioned.

zlib is arguably one of the longest lived and best supported compression libraries around, and is the basis for both gzip and zip compressors. Its downside today is that better compressors exist (lzma/xz/7-Zip, and the more recent zstd), but again these should be evaluated for their long term viability and compatibility between version updates before being chosen. The other downside to standard gzip compressors is that they are single threaded. Luckily the pigz project exists, which offers near-linear speed-up of zlib compression on multi-core CPUs, while maintaining compatibility with existing decompression tools.
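
For instance (a rough sketch; the archive name is illustrative, and this assumes pigz and zstd are installed):

  # Compress a disk image with pigz, using all available CPU cores.
  # The output is ordinary gzip data, readable by any standard gzip tool.
  pigz -k -9 diskimage.img          # produces diskimage.img.gz
  gunzip -t diskimage.img.gz        # any gzip tool can verify/decompress it

  # zstd is a newer alternative worth evaluating; --long enables the large
  # match window useful for disk images, and -T0 uses all CPU cores.
  zstd --long -T0 -k diskimage.img  # produces diskimage.img.zst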

Cautions

There are some things to be cautious of when storing data long term. These typically fall under the temptation to save space at the expense of data integrity. Some of these include:

  • Lossy compression. JPG, MP3, H.264 and similar media compression formats offer substantial space savings, at the expense of losing fidelity and accuracy of the original data. Lossless compression methods are preferred instead, even if they don't give the same space savings.
  • Deduplication. Blocks of data that share identical checksums can often be consolidated into a single piece of data on a hard disk, saving space. This can, in a way, be considered a form of compression. However, it means that if that single block is corrupted or destroyed, several files will suffer the damage, and no duplicate copy remains from which to restore or repair the original.
  • RAID is not backup. Consider RAID a way to lessen the impact of losing a single disk in a working system. Multiple drive failures, accidental deletion, or loss of an entire computer through fire/flood/theft are never mitigated by RAID.

It's also worth noting that parity data, erasure codes and similar items often compress poorly due to the way their formats are stored. The temptation to skip them to save disk space is high, but doing so presents a much higher risk to long term reliability.