7. Access, Privacy & Search.

There are two implementations for working with the SAILDART Archive. The first is a private GNU/Linux file system; the second is the public SAILDART web site, which is built from the first. The SAILDART use of database software is substantial, but it remains auxiliary to the file systems. I do miss Jim Gray (lost at sea, 2007), but I have not yet converted to his database-first world view.

Public Access

Access by Canonical URL

The canonical and permanent SAILDART file URL is simply the old SAIL PDP-10 file name, extension, project and programmer with the old punctuation marks, optionally postfixed with a decimal version number. The curly braces in the template below only mark a placeholder; there are no curly braces around version numbers in actual URLs:


  FILNAM[PRJ,PRG]
  or
  FILNAM.EXT[PRJ,PRG]
  or
  FILNAM.EXT[PRJ,PRG]{version}
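
The name forms above can be taken apart mechanically. The following is a minimal Python sketch, not the SAILDART implementation: the `SailName` type, the `parse_sail_name` function and the regular expression are hypothetical, and the sketch assumes that the version number is appended directly after the closing bracket and that names and codes never contain a period, comma or square bracket.

```python
import re
from typing import NamedTuple, Optional


class SailName(NamedTuple):
    """Parts of a SAIL file name (hypothetical helper type)."""
    filnam: str
    ext: str                 # empty string when there is no extension
    prj: str
    prg: str
    version: Optional[int]   # None when no version number is postfixed


# Assumption: each part is a run of characters other than . , [ ]
_SAIL_NAME = re.compile(
    r"^(?P<filnam>[^.\[\],]+)"
    r"(?:\.(?P<ext>[^.\[\],]*))?"
    r"\[(?P<prj>[^\[\],]+),(?P<prg>[^\[\],]+)\]"
    r"(?P<version>\d+)?$"
)


def parse_sail_name(text: str) -> SailName:
    """Split FILNAM.EXT[PRJ,PRG] plus optional decimal version into parts."""
    match = _SAIL_NAME.match(text)
    if match is None:
        raise ValueError(f"not a SAIL file name: {text!r}")
    version = match.group("version")
    return SailName(
        filnam=match.group("filnam"),
        ext=match.group("ext") or "",
        prj=match.group("prj"),
        prg=match.group("prg"),
        version=int(version) if version else None,
    )
```

For example, `parse_sail_name("BUCK75.FNT[XGP,SYS]")` yields the parts `BUCK75`, `FNT`, `XGP` and `SYS` with no version.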

  

To get a bitwise exact copy of a file, append "_octal" to the URL. For example:


  wget -q 'http://www.saildart.org/BUCK75.FNT[XGP,SYS]_octal'
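
The same request can be made from a script. This is a minimal Python sketch under stated assumptions: `octal_url` and `fetch_exact_copy` are hypothetical helper names, the host is copied from the wget example, and nothing is assumed about the format of the returned bytes beyond what the text says.

```python
from urllib.request import urlopen

BASE_URL = "http://www.saildart.org/"  # host taken from the wget example


def octal_url(sail_name: str) -> str:
    """Append "_octal" to the canonical URL of a SAIL file name."""
    return BASE_URL + sail_name + "_octal"


def fetch_exact_copy(sail_name: str, timeout: float = 30.0) -> bytes:
    """Download the response body unchanged (requires network access)."""
    with urlopen(octal_url(sail_name), timeout=timeout) as response:
        return response.read()
```

Calling `octal_url("BUCK75.FNT[XGP,SYS]")` reproduces the URL used in the wget example.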
  

serial numbering the data blob hash codes.

 Access from programmer Home Page

For each PRG code (well, actually PRG+1 owner codes, which are equal to PRG codes for most everyone except when a code was reused for a different person) a SailDart home page exists at a URL of the form


  http://www.saildart.org/BGB
  or
  http://www.saildart.org/[1,BGB]
  

 Access by Date

I once had the SailDart files accessible by URLs of the form www.saildart.org://{isodate}/FILNAM.EXT[PRJ,PRG], for accessing a revision without having to know its revision number, since a server-side mechanism could select the revision that existed on the given date. This is not a unique canonical URL; rather, it provides a large set of URLs, one for each day in the span of the file revision’s existence. I could be encouraged to re-implement this form of access, and have appended it as a low-priority exercise.
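
The server-side selection described above can be sketched as a binary search over revision dates. This is a hedged Python illustration, not the original mechanism: the function `revision_on` and the data layout are hypothetical, and it assumes that each revision exists from its first date until the next revision of the same file appears.

```python
from bisect import bisect_right
from datetime import date
from typing import Optional, Sequence, Tuple

# Hypothetical layout: (first date the revision existed, version number),
# sorted by date in ascending order.
Revision = Tuple[date, int]


def revision_on(revisions: Sequence[Revision], day: date) -> Optional[int]:
    """Return the version number existing on `day`, or None if none did yet."""
    first_dates = [first for first, _ in revisions]
    # Index of the first revision that starts strictly after `day`.
    position = bisect_right(first_dates, day)
    if position == 0:
        return None  # the file did not exist yet on that day
    return revisions[position - 1][1]
```

With made-up dates, a file whose versions 1 and 2 first appear on 1974-01-10 and 1975-06-02 resolves to version 1 for any day in between, which is why many dated URLs map to the same revision.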

 Access by Serial Number of content blob

 Access by pathname

Copyright and Ownership

The copyright status of the almost one million items inside the SailDart archive varies and may be looked up per item. Most SailDart items were never published; others are in the public domain. SailDart is an archival collection with human curators. Compliance with the original ARPA, NSF and other contracts supporting academic research at Stanford University continues on a best-effort basis. Compliance with the Stanford University policy for archiving research data continues.

 Privacy, Courtesy and Ownership

John McCarthy punted on the privacy issue. He said (paraphrasing): 1. Do not be in a hurry to contact Stanford officials; 2. Get advice from Les Earnest and Marty Frost; and, memorably, he repeated the cliché: 3. It is easier to ask for forgiveness than it is to ask for permission.

Stanford University has had continuous possession of the DART permanent tapes. The 229 reels of DART tape are now safely housed in the Digital Collection at the Green Library, on my initiative, with assistance from Earnest, Frost and Hartwig.

Question: Who guards the guardians?

Answer: The guardians must guard each other. Les Earnest has observed that at any computer project, there is an inner circle of system programmers who have access to everything. It is peer pressure from others that preserves privacy.

Stanford Research Policy Handbook

The URL https://doresearch.stanford.edu/policies/research-policy-handbook/conduct-research/retention-and-access-research-data links to a page concerning Stanford University policy on the retention of and access to research data. I am aware of this policy now, and I was aware of the issues and ambiguities of an unsorted bulk data collection in 1998 when working with John McCarthy and Ted Selker on long term digital preservation for data mining at the IBM Almaden Research Center. From the Stanford policy, I wish to quote four sentences verbatim:

  1. When individuals involved in research projects at Stanford leave the University, they may take copies of research data for projects on which they have worked.
  2. Original data, however, must be retained at Stanford by the Principal Investigator.
  3. Research data must be archived for a minimum of three years after the final project close-out, with original data retained wherever possible.
  4. Beyond the period of retention specified here, the destruction of the research record is at the discretion of the PI and his or her department or laboratory.

I claim a wide interpretation for sentence #1, starting with my PhD thesis work, on which I indeed hold a 1974 copyright and which arguably is my intellectual property and not that of Stanford University. I am in compliance with policy sentence #2 since the original media are still at Stanford. John McCarthy seemed aware of the Stanford University policy ideas in sentences #3 and #4, and he took it that some folks might assume the three-year retention period to be a maximum, after which old data should be destroyed in order to avoid difficulties and to cut off the possibility of belated reviews or whistle-blowing. John McCarthy was of the opinion that A.I. should be like Astronomy, where research records are kept forever.

External Search Engines

The SailDart collection that has been on the web for the past decade is too large, too fragmented and too redundant for the search engines to make much sense of it. The search engines downgrade sites that are as large and as illegible as SailDart has been. However, a search for the keyword SailDart followed by a couple of your own special keywords will turn up SailDart stuff. For example, the search “SailDart ZORK” returns a set of SAIL files referring to Don Woods’s game Adventure.

Internal search mechanisms

The digital curators (such as myself) who have a copy of the SailDart in a file system can navigate the million files using find and grep. I have built and used a full word index (a concordance) from time to time, but I do not have one built at the moment. Frequency histograms of N-glyphs and N-grams are a routine way of finding stuff; however, I do not have a SailDart search tool kit to hand off. Semantic networks of the documents using the same vocabulary (especially names) might be useful.
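
A concordance and a frequency histogram of the kind mentioned above can be sketched in a few lines of Python. This is an illustration only, not the tool kit I have used: the function names are hypothetical, a word is assumed to be a run of ASCII letters and digits, words are folded to upper case, and N-glyphs are read here as runs of N consecutive characters.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Assumption: a "word" is a run of ASCII letters and digits.
WORD = re.compile(r"[A-Za-z0-9]+")


def build_concordance(docs: Dict[str, str]) -> Dict[str, List[Tuple[str, int]]]:
    """Map each upper-cased word to the (file name, line number) pairs using it."""
    index: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for name in sorted(docs):
        for line_no, line in enumerate(docs[name].splitlines(), start=1):
            for word in WORD.findall(line):
                index[word.upper()].append((name, line_no))
    return dict(index)


def glyph_histogram(text: str, n: int) -> Counter:
    """Count every run of n consecutive characters in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

In practice the `docs` mapping would be filled by walking the file system, and the histogram of one file can be compared against histograms of files of a known kind.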

Exercises

  1. Build (rebuild) a new suite of concordance tables by Words, N-grams, Names, phrases, sentences.
  2. Finish writing taxonomy predicates, such as is-lisp and is-assembly using either a parser or frequency histograms or both.
  3. Rebuild the search by date mechanism into the SailDart web presentation.
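
As a starting point for exercise 2, a taxonomy predicate based on a frequency histogram might look like the following Python sketch. Everything here is an assumption made for illustration: the names `paren_fraction` and `is_lisp`, the choice of parentheses as the telltale glyphs, and the threshold value, which would have to be tuned against files of known type.

```python
from collections import Counter


def paren_fraction(text: str) -> float:
    """Fraction of the non-whitespace characters that are parentheses."""
    counts = Counter(text)
    non_space = sum(n for ch, n in counts.items() if not ch.isspace())
    if non_space == 0:
        return 0.0
    return (counts["("] + counts[")"]) / non_space


def is_lisp(text: str, threshold: float = 0.08) -> bool:
    """Guess that a text is Lisp when parentheses are unusually frequent.

    The threshold is an arbitrary illustrative value, not a measured one.
    """
    return paren_fraction(text) >= threshold
```

A parser-based predicate would be stricter; the histogram version is cheap enough to run over every file first.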