11. Data Sets – the dot • EXT codes.

A corpus is a set of files associated with a dot filename extension code. The plural of corpus is corpora. If you are like Les Earnest and think the word corpora is too Harvard snotty, you may say corpuses, which sounds to me like a mass murder. Sets of files, corpora, found within SailDart are described. I shall postpone general remarks until after exhibiting quite a few specific dot • EXT labeled corpora.

• DMP

Dump files (dot DMP) are binary executable PDP-10 machine code. Metaphorically, the computer was like a gravel truck in which the programs were first LOADED, driven around for a while, and then DUMPED. At DEC, but not Stanford, the metaphor morphed to dot SAV for saved. There are DMP-to-SAV and SAV-to-DMP converters to bridge between the SAIL and DEC convention. At SAILDART the DMP file is the Standard.

As old SAIL people may recall, the first word of a dump file loads into memory address 000074 of user space and the start address of the program is taken from the right half of 000140. Many of the dot DMP files have a symbol table for the debugging tools, DDT and RAID. I have used these symbol tables to disassemble the PDP-10 code and to join the DMP files back upstream (against the current, with a headwind and poor visibility) to their source files. In all there are 32730 DMP files, of which 7428 have symbol tables. The largest and most interesting DMP files are built from many small source files, control files and separately maintained library packages. Reading the source code and the disassembly listings of 1970s software is feasible, because large programs were a lot smaller back then.
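The two addresses above are enough to find where a dumped program begins. Here is a minimal sketch in Python, assuming the DMP file has already been unpacked into a list of 36-bit words held as integers, one per PDP-10 word in file order. The function names are mine, for illustration only, and are not taken from any SAILDART tool.

```python
# Sketch only: assumes the DMP image is already unpacked into a list of
# 36-bit words (Python ints), one entry per PDP-10 word, in file order.

LOAD_BASE = 0o74        # the first file word lands at user address 000074
START_WORD = 0o140      # the right half of this address holds the start address
HALF_MASK = 0o777777    # mask for one 18-bit half word


def word_at(words, address):
    """Return the 36-bit word that the dump places at a user-space address."""
    index = address - LOAD_BASE
    if index < 0 or index >= len(words):
        raise IndexError(f"address {address:o} is outside the dump image")
    return words[index]


def halves(word):
    """Split a 36-bit word into (left half, right half)."""
    return (word >> 18) & HALF_MASK, word & HALF_MASK


def start_address(words):
    """Start address of the program: the right half of location 000140."""
    return halves(word_at(words, START_WORD))[1]
```

Since the image begins at 000074, location 000140 is simply the word at offset 000044 (octal) from the start of the file.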

• FAI and • {nothing} and • MAC

The assembly machine code files number in the tens of thousands with three extensions: .FAI, .MAC and .{nothing}. The dot nothing files may be anything, but they were often FAIL. I have written a set of “Izzy” detectors, for example isAssembly, isFail, isMacro, isLISP, isSAIL, isPASCAL and so on.
Yet another hacker story: The LISP programmers would have written the isFAIL detector and named it FAIL-P for FAIL-Predicate. Search “Gosperism Soup” to find the Hacker Dictionary story: at the Chinese restaurant Bill Gosper, a famous MIT Lisp hacker, asked "split-P soup ?" meaning does anyone want to share a soup order.

• LSP and • LAP

The LISP source code files have the extension dot LSP. The greater LISP family of associated languages and systems includes PLANNER, Micro Planner, Metalisp, mathematics (REDUCE and MATLAB), theorem provers and program verification systems.

• PUB and • TEX and • DOC

The pregnancy and birthing of digital typography, digital printing and desktop publishing, which occurred at Stanford in parallel with Xerox PARC, CMU and MIT (slightly earlier at Information International Inc, a bit later at Adobe, Imagen, HP and overseas), is documented inside the SailDart.

• MSG

Back in the 20th century, blogs were called bulletin boards and the messages of a discussion group were appended to dot MSG files. Ordinary email also resides in dot MSG files. This 2014 SAIL archive shall attempt to keep personal and personnel messages in the dark for another 86 years. All the files of the SailDart may be published at the stroke of midnight PST going into New Year's Day, Friday 1 January 2100. Nevertheless, Earnest and I (Baumgart) wish to make available the SAIL bulletin boards that were published on the “ARPA/Internet” in the 1970s and 1980s. Within each SAIL message file, each message is prefixed with the partial differential symbol, ∂. The ∂ prefix is also often found in non .MSG files.
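Because every message carries the ∂ prefix, a message file can be cut into individual messages mechanically. The following is a minimal sketch, assuming the file text has already been converted to Unicode and that each ∂ prefix sits at the beginning of a line. Both are assumptions of this example, and the function name is hypothetical.

```python
PARTIAL = "\u2202"  # the ∂ character that prefixes each message


def split_messages(text):
    """Split message-file text into messages at each line that starts with ∂."""
    messages = []
    current = None
    for line in text.splitlines(keepends=True):
        if line.startswith(PARTIAL):
            # A new message begins; close the previous one, if any.
            if current is not None:
                messages.append("".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
        # Lines before the first ∂ are ignored in this sketch.
    if current is not None:
        messages.append("".join(current))
    return messages
```

The same function can be pointed at non .MSG files to see whether they hold embedded messages; an empty result means no line-initial ∂ was found.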

• SAI

Source code, written in the ALGOL like language named SAIL, comprises the .SAI corpus.

• DRW

The Stanford digital electronic design CAD software was named SUDS, for Stanford University Drawing System: a suite of electronic design drawing programs with the cryptic names D, PC, RPC, L, TD, LR and TRD.

• XGP

• PLT

• DAT

The dot DAT extension was used for generic binary data. Several long lived system programs wrote to dot DAT files with one or another convention for aggregating data by day, month and year; or at irregular intervals into dot OLD or dot ARC. Every 15 seconds for over 18 years, the program named ACCT, alias JOBNAM *SPY*, appended compute usage meter readings to its dot DAT file for each day. ACCT made additional appropriate log entries for crash / reboot cycles and for when a user did a login or a logout. Mirabile dictu, we can now view who the ACCT program saw as logged in at SAIL for almost every hour in the eighteen year period.

• FNT fonts

There are 4034 files with the extension FNT in the SailDart. One gaudy example is my Bocklin knock off named BUCK75. On Linux, you may fetch an octal dump of this font with a command like

   wget -q http://www.saildart.org/BUCK75.FNT[XGP,SYS]_octal

The early FNT format for *.FNT[XGP,SYS] is

   COMMENT STANFORD FONT FILE FORMAT.
   ---------------------------------
   WORDS 0-177:    XWD CHARACTER_WIDTH,CHARACTER_ADDRESS
   WORDS 200-237:  CHARACTER_SET_NUMBER
                   HEIGHT
                   MAX_WIDTH (IN BITS)
                   BASE LINE (BITS FROM TOP OF CHARACTER)
   WORDS 240-377:  ASCIZ/FONT DESCRIPTION/
   REMAINDER OF FILE:
       EACH CHARACTER:
           CHARACTER_CODE,,WORD_COUNT+2
           ROWS_FROM_TOP,,DATA_ROW_COUNT
           BLOCK WORD_COUNT
   --------------------------------------------------------------------

For details concerning the early XGP see HM[H,DOC] section 18, aka SAIL Operating Note 56, titled Facility Manual, by Ted Panofsky. The latter day version of Ted’s manual is at FACIL.TED[H,DOC]. See also http://www.saildart.org/XGPSER[J17,SYS]

Regarding BDF, see http://en.wikipedia.org/wiki/Glyph_Bitmap_Distribution_Format

Given a BDF version of the font, commands like the following install it for X11:

   bdftopcf BUCK75.BDF > buck75.pcf
   cd /home/font
   mkfontdir .
   # Add to X11 font path
   # xset fp+ /home/font
   # View font path
   # xset -q
   # tell X server to rescan the fonts
   xset fp rehash
   # view font name
   xfd -fn buck75
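The font file layout quoted above can be read with a few lines of code. Below is a minimal sketch in Python, assuming the font file is already unpacked into a list of 36-bit words held as integers. It further assumes that the four values listed for words 200-237 sit in consecutive words 200, 201, 202 and 203 in the order given, and that an all-zero table entry marks an undefined character. Those two points are my guesses and not stated by the format comment; the function names are hypothetical. The ASCIZ decoding follows the usual PDP-10 packing of five 7-bit characters per word.

```python
HALF_MASK = 0o777777  # mask for one 18-bit half word


def asciz(words):
    """Decode PDP-10 ASCIZ text: five 7-bit characters per word, NUL ends it."""
    chars = []
    for word in words:
        for shift in (29, 22, 15, 8, 1):
            code = (word >> shift) & 0o177
            if code == 0:
                return "".join(chars)
            chars.append(chr(code))
    return "".join(chars)


def parse_font_header(words):
    """Read the fixed header (words 0-377 octal) of an early FNT file."""
    if len(words) < 0o400:
        raise ValueError("font file is shorter than its 400 (octal) word header")
    table = {}
    for code in range(0o200):
        width = (words[code] >> 18) & HALF_MASK   # left half of the XWD
        address = words[code] & HALF_MASK         # right half of the XWD
        if width or address:  # assumption: all-zero entry = undefined character
            table[code] = (width, address)
    return {
        "characters": table,
        # Assumption: the four header values occupy consecutive words.
        "character_set_number": words[0o200],
        "height": words[0o201],
        "max_width": words[0o202],
        "base_line": words[0o203],
        "description": asciz(words[0o240:0o400]),
    }
```

The per-character blocks in the remainder of the file are not decoded here; the CHARACTER_ADDRESS values in the table are what one would follow to reach them.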

• TXT • LOG • LST • OLD • TMP

The already mentioned {nothing} file extension is the most numerous generic extension.

• DMD

Of the top dozen extension codes DMD was the first I did not recognize on sight. It is in the top rank because of LCS, Leland Smith. I leave it as an exercise to find out what it stood for and what the few people LYN, GHB, MMM, UW, DGP, BRP, PW, RD, SEK, MFB and MRC were using this extension for. Likely something to do with Music.

• REL

The REL extension is for relocatable files, which were the intermediate assembler or compiler step prior to the loader. But again LCS and MUS have first and second place in terms of their file count with the REL extension, while SYS and 3 are third and fourth in having REL files. My sixty BGB REL files in DART all seem to be part of GEOMED and might exist on DART because they were at times parts of shared libraries.

• F4 • FOR • PAS • ADA • C • H

I was surprised to see how many Fortran, Pascal and ’C’ programming language files exist in the SailDart. There are 7360 files with the .F4 extension and 1129 files with the .FOR extension, including the famous Adventure Game. There are 9074 Pascal .PAS files, 1972 ADA files, and 1356 .C files with 488 .H files. PL1 accounts for 159 files. The two files with the CPP extension are text having to do with a Child Phonology Project. It would have been possible for C++ files to exist on SAIL: C++ was developed in the 1980s and had exploded in popularity before the final DART reel.

• B3D • CAM • CRE

Well, this is my memoir, and I am pleased that such a large set of geometric modeling related files has been preserved by DART.

SUDS Computer Aided Design Corpus

Digital Images Corpus

The digital image formats •PIC •PIX •PIK •VID •DAT

Audio Sound Corpus

Extension EXT code Theory

The dot extension postfix to file names was a user option and not a mandatory MIME code for either the file system or the operating system. Over time various software packages developed that enforced EXT naming conventions.
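Since the extension is optional, any tool that sorts SailDart files into corpora has to cope with names that lack one. Here is a minimal sketch in Python, assuming file names are written in the style used in this chapter, such as BUCK75.FNT[XGP,SYS] or YUMYUM[P,DOC]. The function names are hypothetical, and calling the two bracketed codes "project" and "programmer" is my labeling for this example.

```python
import re
from collections import Counter

# NAME, an optional .EXT, then a bracketed pair of directory codes.
SAIL_NAME = re.compile(
    r"^([^.\[\]]+)(?:\.([^.\[\]]*))?\[([^,\[\]]+),([^,\[\]]+)\]$"
)


def parse_name(text):
    """Split 'NAME.EXT[PRJ,PRG]' into (name, ext, project, programmer)."""
    match = SAIL_NAME.match(text)
    if match is None:
        raise ValueError(f"not a recognizable SAIL file name: {text!r}")
    name, ext, project, programmer = match.groups()
    return name, ext or "", project, programmer  # a missing extension becomes ""


def count_extensions(names):
    """Tally file names by extension; names with none count as {nothing}."""
    return Counter(parse_name(n)[1] or "{nothing}" for n in names)
```

For example, count_extensions applied to the names RESO.LES[UP,DOC], YUMYUM[P,DOC] and BUCK75.FNT[XGP,SYS] tallies one file each under LES, {nothing} and FNT.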

Large, Medium and Small collections

In handling SailDart, a further set of sets is named by T-Shirt size: Large, Medium and Small. My typical mount point for a full set of SAIL objects is /Large, which is for Curator access only. The /Medium is a comprehensive collection but with privacy filtering, copyright restrictions, redundancy removal, damaged data redaction and some relevancy redaction. For example, the DART tables are not included in /Medium, because which tape held which file name is part of the envelope, not the message. The obsessive future scholar can go read the L-corpus. The /Small size S-corpus again has samples of everything but after extensive editorial selection; in particular the S-corpus attempts to have the latest or the best or even a typical version of each document. Ephemeral files as well as seven hundred ephemeral people are redacted. The ephemeral people are students, guest users, no name user codes, as well as users who left nothing but a trivial ’hello world’ practice exercise or a few boilerplate files copied from elsewhere. An important aspect of the Medium and Small collections is that they have been manually curated, best efforts, to protect personal privacy and to avoid copyright issues, and so can be widely mirrored and distributed. If in doubt, leave it out.

Document Sampler

Let me recommend that you read, or that you at least know about, the following particular SailDart documents:

RESO.LES[UP,DOC] This is the SU-AI entry for the ARPAnet Resource Handbook as of 30 September 1977. It is a concise description of SAIL, the software and the documentation. Miraculously restored in HTML, all the links pointing to old SAIL filenames are clickable.

WORKS.MSG[UP,DOC] A blog from 1981 to 1983 discussing work stations, which was then a niche market bigger and better than home personal computers. Indeed for us privileged few, the SUN work stations were our personal home computers.

YUMYUM[P,DOC] The San Francisco Bay area electronic restaurant guide with patron reviews. YUMYUM was the YELP for the decade 1973 to 1984. My hardcopy version of this is marked copyright reserved.

SYSTEM.MTG[A,REG] Minutes from the monthly SYSTEM meeting from May 1974 to October 1979. From this 185 page document it is easy to glean dates when major hardware, software or personnel changes were made. The KL10 arrived 1976-03-31. Amusing to quote, both the Librascope and Ralph Gorin were decommissioned on 1976-11-1:

   PEOPLE
   Wizard: Jeff Rubin is taking over as Chief System Wizard as Ralph Gorin goes looking for LOTS more trouble.
   EQUIPMENT
   Librascope: Rest in peace. Decided that the maintenance effort required is no longer worth the performance gain. We will either give it away or scrap it.

Exercises

  1. Pick an EXT data type from this chapter and write the definitive guide to its content. Or add a few more paragraphs with illustrated examples to the above sections.
  2. Write the T-Shirt sub sections for exactly what is in Large, Medium and Small at the end of 2015 and 2016. Coordinate the descriptions with the Exegesis chapter-9.
  3. Append more sampler document descriptions. Arrange into categories.