8. Data Authenticity - Provenance.

Provenance of a digital archive has two parts: the first is the recitation of the chain of custody of the media, and the second is the fidelity of the data transcription into working copies for preservation, circulation and presentation. Provenance becomes complex when the original media has been lost or intentionally replaced, and we must iterate on media custody and data fidelity for each copy event. Unlike Euclid’s Elements or the Bible, this archive is not analog and it has not yet needed to be translated by human scribes into Arabic or Greek. Unlike those examples, this information was born digital, copied only twice by 20th century electro-mechanical means, and remains digital. Also noteworthy: over the past 40 years, the 50 gigabytes of the SailDart have transitioned from a room sized off-line big-data set of 3000 reels of twelve inch tape, weighing 2.2 pounds each, down to chip size, so that all of the data can now fit on-line inside one CPU main memory address space. The quality of the two off-line copy events of 1990 and 1998 will be depicted in this chapter. Future digital copy events will be cheap, fast, frequent and bitwise exact.

After the provenance paragraphs, this chapter concludes with a description of the motives and practices which prejudice which parts of the archive are visible today, and who can see them in the 21st century under the existing copyright, practical triage and social politeness constraints. Past the year 2100, it is my wish that this small quantity of data shall be free and open to all. My mentors and teachers John McCarthy, Les Earnest and Don Knuth have requested this; my associates Diffie, Frost, Petit and Gorin have not complained too much; there has been very little push back from third parties, and equally little encouragement or push forward. All has been quiet on the SailDart web sites except for the relentless crawling of the many search engine robots, as well as Robots-dot-Text non-compliant download attempts. A few times each year, I receive a relevant query from a human concerning what can be learned from the SailDart archive.

Custody

For the 229 reels of DART tape, the provenance story told here details the path from the 1970s SAIL-WAITS file system, through the lab relocation and a tape media conversion, to 1998, when the final tapes were read into 229 compressed tar balls (tgz) on a Unix file system, each with its MD5 hash value.

The 229 tar balls expand into exactly 41620 - 26 = 41594 DART records, which in turn “undart” into 886476 unique data blobs. Each data blob has its MD5 hash value, and the MD5 hash values were serial numbered sn/000001 to sn/886476. A traditional archivist might wish to call this serial numbering the SailDart accession numbers. Data blobs often have an obvious MIME/type, such as human authored text (written with the text editor named “E”), computer generated text, digital images (usually black and white at six bits per pixel), vector graphics, executable PDP-10 machine code, audio data (often as twelve bit samples), accounting system database records and DART program backup database records. Along with the data blobs, the undart processing generates SAIL file system metadata such as the filename, extension, project, programmer, protection bits, size, an xor checksum and four date-time stamps. Recapping: there were twenty five years of SAIL 36-bit computer operations, from 1966 to 1991, within which there were eighteen years of low density DART tape recordings, from 1972 to 1990, on reels serial numbered 1 to 2984.
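The serial numbering of unique blobs can be illustrated with a minimal Python sketch. It hashes each blob with MD5, keeps only the first occurrence of each hash value, and hands out serial numbers in the sn/000001 form. The function name, the in-memory list of blobs and the first-seen ordering are assumptions made for illustration; the ordering rule actually used by the undart processing is not stated here.

```python
import hashlib

def assign_serial_numbers(blobs):
    """Map the MD5 hex digest of each unique blob to a serial like 'sn/000001'.

    Hypothetical helper: first-seen ordering is an assumption,
    not the documented SailDart rule.
    """
    serial_by_hash = {}
    for blob in blobs:
        digest = hashlib.md5(blob).hexdigest()
        if digest not in serial_by_hash:
            # Six digits, zero padded, is wide enough for sn/886476.
            serial_by_hash[digest] = "sn/%06d" % (len(serial_by_hash) + 1)
    return serial_by_hash

# Toy byte strings standing in for undarted file contents.
sample = [b"hello", b"world", b"hello"]
table = assign_serial_numbers(sample)
```

In this toy run the duplicate blob does not receive a second serial number, which is the sense in which the blobs are "unique".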

From 1988 to 1990. First baton pass: 7 to 9 track tape at Margaret Jacks. Marty Frost copied the almost three thousand reels of low density seven track tape onto the 229 reels of higher density nine track tape, serial numbered 3000 to 3228.

March 1998. Second baton pass: Tape to Disk at Gates. Bruce Baumgart (with the help of Marty Frost, Les Earnest, John Nagle and Tom Costello) copied the 229 reels of the DART 3000 series (via external 9 GB disks) into various systems and media at IBM Almaden and at the Baumgart residence.

April 2011. Third baton pass: original DART media is moved from Gates to Green. On 26 April 2011 we (Baumgart, Frost, Earnest, and Hartwig) moved the 229 reels of DART tape from the Computer Science Department at Gates Hall to the special collections at Green Library on the Stanford campus.

Bruce BAUMGART: Statement of the custody of the physical reels of DART tape.

Based on DART tape header dates, I assume that the low density tapes reel#1 to reel#1583 were written in the computer room at SAIL in the D.C. Power Building at 1600 Arastradero Road, Palo Alto, CA, and that those tapes were moved to Margaret Jacks Hall (MJH) in November 1979.

Tape reel#1584 to reel#2984 were written in MJH. The tape conversion software was developed and tested in early 1988, but not vigorously used until May 1990. Only the first three high density tapes were written in 1988; the remaining 226 reels were written in 1990. Apparently there was no tape conversion work done in 1989. The 229 high density tapes were moved from MJH to Gates in December 1995 or January 1996. We read the 229 high density reels of tape using Sun Microsystems equipment onto 9 GByte SCSI disks (Maxtor) that I happened to own at the time. The 9 GByte disks went by sneaker net (that is, hand carried by automobile) to the IBM Almaden Research Center, where I was working as a Research Associate. The tar files on the 9 GByte disks were transferred to various systems I had access to at the time (AIX and Redhat Linux), as well as to DLT tape and the ADSM backup system. I still have a 1998 set of gold colored CDs with the 229 tar files.

Reading one reel of tape took 15 minutes and would leave a noticeable quantity of iron oxide dust on the tape read heads and in the tape path, so we cleaned the tape drive with alcohol swabs frequently. I trust that the next readers of these tapes will have exquisite technology that avoids inflicting as much tape damage as we inflicted. We fetched and returned the tapes from a storage room adjacent to the locked server room in the basement of Gates.

While I was at IBM, the media included DLT-IV tape cartridges. Only a single DLT cartridge was needed to hold the archive at DLT model 7000 density. The SailDart data fits on some forty (40) ordinary compact disks (CDs). Such sets of CDs are slower and less convenient to read and write in bulk, but CD readers were ubiquitous in those years and the media was cheaper than DLT tape, so as a long term archival strategy, writing to CDs was briefly considered a viable approach.

What soon proved more viable was a chain of many cheap disk drives, from SCSI to IDE to SATA. The SailDart preservation copy of the DART tapes now fits on USB thumb drives as well as SD memory chips. Writing a copy to the IBM ADSM (later the product was re-branded Tivoli something) proved to me the lack of endurance of large data sets in the corporate research environment. In the late 1990s at Almaden, the bandwidth and backup time windows were such that only with great patience could 50 gigabytes be written into ADSM, and without senior management priority such large quantities could never be read out. My large presence inside the robotic tape machine was well known and resented. I was unfortunately asked to perform a similar large backup stunt again for some of my peers in the Web Fountain group at IBM. I finally ended up building a skunk works cluster of cheap commodity disks outside the ADSM service.

Full copies to special people: At my own expense, I built three Redhat Linux PC systems with a full copy of the SailDart, and gave them away to Marty Frost, John McCarthy and Les Earnest. Usually I avoid cute host-names, but in this case those Redhat systems were named after American Civil War generals: Grant, Lee and Sherman. So when a Les Earnest email to me says U.S.Grant lost his whatever or failed to do something, then you will know what that refers to. CD distribution of individual programmer areas to the authoring individuals occurred from late 1998 to 2000.

Fidelity

The bytes found on each high density tape in the 1998 reading, using the GNU/Linux ’dd’ utility, were aggregated into 229 compressed tar balls and MD5 hashed. The hash numbers assure that the present 229 tar balls are the same as the 1998 ones. In 2015, the GNU/Linux tar dependency was removed and the raw DART byte string was written into a single file.
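The check that the present tar balls match the 1998 ones can be sketched in Python as below. This is only an illustration under stated assumptions: the manifest is a hypothetical dictionary from file path to the recorded MD5 hex digest, and the function names are invented here, not taken from the SailDart tooling.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(manifest):
    """Return the paths whose current MD5 differs from the recorded one.

    `manifest` is a hypothetical dict {path: recorded_md5_hex}.
    """
    return [path for path, recorded in manifest.items()
            if md5_of_file(path) != recorded]
```

An empty list from verify means every listed file still carries its recorded hash value. Reading in chunks keeps memory use small even for tar balls of several hundred megabytes.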

Prejudice

The files now visible on www.saildart.org are files which were visible during the SAIL years, 1972 to 1990, plus files from disk areas of people who have granted permission to display their files. My first work on converting files to modern formats concerned my own files, GEOMED and my PhD thesis work, resulting in good presentation of the PLT and VID files. My recent interest has been the operating system PDP-10 assembly code software, which I have narrowed down to just what is found for 1974. This is a tactic to get some results out in a finite amount of time with little or no help. Meeting with Les Earnest from time to time, we have further decided that all DART index filenames and dates can be made public.

Infidelity

Running the 1974 SAIL operating system as an exact emulation would seem very slow and ugly to us 21st century people, and it would also crash a lot. The SailDart code re-enactment has taken considerable artistic liberty to remove the slow and ugly, as well as to mitigate system crash defects.

Exercises

  1. Keep an eye on Green. Visit Stanford’s Green Library to verify that the SailDart material still exists.
  2. Someday re-read the magnetic tapes. Consider the trade-off between re-reading the tapes sooner with existing (or even worse, museum grade) technology, or later with more advanced technology but more decayed magnetic tapes. There is no need to chisel bones out of the La Brea Tar Pit when you have X-ray tomography. Vinyl records are now read with a laser, not a mechanical needle. Consider reading all the old non-DART tapes that can be found around Stanford University and elsewhere. Martin Frost and I only grabbed the 229 tapes we knew were the final permanent ones; the rooms and attic storage areas where these tapes dwelled in the 1990s had many other reels of tape.
  3. Keep an eye on SailDart. This document is a pointer to a long message; verify that you have access to the file (byte vector)

Story about Tape Preservation in 2116.

Date: Monday 3 February 2116. The 229 reels of magnetic tape are moved from the Stanford Green Library back into the hills, this time up Sand Hill Road (not down Page Mill Road as in 1979), to the newly opened Stanford Linear Archival Conservatory, a special building for a collection of linear media from the 20th century, including celluloid films and magnetic tapes, stored in stacks of lamina of carbon fiber reinforced foam. The fragile films (8mm, Super8, 35 millimeter, 70 millimeter) and magnetic tapes (audio tapes, video tapes and numerous forms of early computer magnetic media) are unreeled once only onto a plank of the pellucid, exquisitely thin, rigid carbon fiber foam. The planks stack in the two mile long archival vault built on the site of a mid twentieth century physics project. The SailDart tapes rest shiny side down, iron oxide side up, so that the electro-magnetic scanners view the top side looking down and the optical sensor views the bottom shiny side looking up. Opto-Chem is secondary for digital magnetic tape, but it finds numerous greasy human fingerprints, mostly near the ends of reels but on occasion in the middle of a tape, along with obvious mechanical damage to the media. The fingerprint images can be dated to either the 1990 write phase or the 1998 read event.