Streams and the Soul of the Machine

Introduction

For the last year I have been teaching passionate beginners about programming at DevBootCamp. In this time I have come to realize that one of my primary tasks as a teacher is to process the patterns and idioms of the computer and of programming languages (as I have experienced them) and rarefy them into metaphors that my students can grasp experientially and/or emotionally. Having found an emotional or experiential connection to the rarefied metaphor, they are able to condense it back into the universe of text-on-screen where I show the praxis of the metaphor.

The primary advantage to this approach, as I see it, is that even if the praxis of “what to type” or “what is the computer doing” is unclear, having a series of metaphors wherewith to communicate or reason about the praxis greatly facilitates understanding.

Given my own philosophical bent, one question I have been pursuing in discussion with my students is this: “What is the metaphor that describes data’s nature?”

Data’s Sine Qua Non

It all started rather simply. I was a bit chagrined to see my students reaching for incorrect tools (e.g. Sublime) when attempting to get information from their server logs or from large files. Realizing it was my duty to make sure that machine navigation was as well covered as SOLID programming principles, I assigned work on researching the Unix primitive utilities: cat, head, tail, sed, et al.

These utilities’ functions are generally described as the following:

| Name | Description |
| --- | --- |
| `cat` | Display the contents of a file on the screen |
| `head` | Display the first 10 lines of a file on screen |
| `tail` | Display the last 10 lines of a file on the screen |
| `less` | Display a "page" of screen data from a file |

While this synopsis certainly works for those **learning** to use a Unix system, it fails _philosophically_ as one starts to learn more of the features of some of these commands.

A Deeper Look at cat

As a preliminary invocation let us consider: cat /etc/passwd. This does, as per our table above, the work of “displaying the contents of a file,” in this case /etc/passwd. The word “displaying” simplifies the important mechanics of this command, though. Something else is happening, something more important, and something subtler, and something infinitely more wonderful which is brought out in this invocation:

cat file > file2

This makes a copy of file to file2 by means of the “redirection” (>) operator. This duplication works both for binary files (things you run, e.g. Google Chrome) and for text files (human-readable text, e.g. letter_to_grandma.txt). If cat merely displays things, how does this second invocation wind up making copies?

Perhaps we’ve over-simplified something important, something of which we ought to take better notice. As the saying often attributed to Einstein goes, our goal should be to “Make things as simple as possible, but not simpler.” The “conventional” explanation of cat seems to have lost a critical detail.
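The copying claim above is easy to check for yourself. What follows is a minimal sketch, not a definitive test: the directory comes from `mktemp -d` (assumed to be available on your system) and the names `file`, `file2`, `blob`, and `blob2` are placeholders invented for the illustration.

```shell
# Hypothetical file names inside a throwaway directory.
workdir=$(mktemp -d)
printf 'Dear Grandma,\n' > "$workdir/file"      # a small text file
printf '\000\001\002\377' > "$workdir/blob"     # a few raw, non-text bytes

cat "$workdir/file" > "$workdir/file2"          # the copy from the text above
cat "$workdir/blob" > "$workdir/blob2"          # the same trick on binary data

# cmp -s exits 0 only when two files are byte-for-byte identical.
text_copy=different
binary_copy=different
cmp -s "$workdir/file" "$workdir/file2" && text_copy=identical
cmp -s "$workdir/blob" "$workdir/blob2" && binary_copy=identical
echo "text: $text_copy, binary: $binary_copy"
```

cat never asks what kind of bytes it is handling, which is a first hint that "display" is too narrow a word for what it does.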

In Search of cat’s Origins: Linguistics

cat is the short, easily typed version of catenate, deriving from Latin’s catenatus, whose verb has the principal parts cateno, catenare, catenavi, catenatum: to chain (catena itself is the noun, “a chain”). To make a quick jaunt to English via Old French: catenate means to “enchain” or “yoke.”

What are we yoking when we use cat? We’re enchaining a sequence of bytes. Yet that’s not the end of the story. As we saw above, catting a file results in screen display, not some simple diagnostic à la: catenated 48 bytes. To cat is to catenare, but it is also to direct that chained collection somewhere, and that is whence the breakthrough metaphor bursts forth: the essence of data in Unix, and perhaps all systems, is a stream in which enchained signals (i.e. bits) flow.

Defining Create, Read, Update, and Destroy in Terms of cat

From this superior metaphor we can define all the operations on a hard disk as a function of a flow.

| Operation | Stream Interpretation | Invocation |
| --- | --- | --- |
| Create a file | Associate a name to a byte address of a size equivalent to the size of an enchained entity | `cat > file`; _enter data_; CTRL+d |
| Delete a file | Flow from the void sufficient null-bytes to fill the container which formerly held bytes. Remove the name associated to the first byte address | `cat /dev/null > filename && rm filename` |
| Read a file | Flow from the byte associated with the file name the number of bytes inside the file to the display device. | `cat file` |
| Edit a file | Flow a temporary buffer of bytes onto a disk. Move the previously used file name to point to the new collection. Flow the void into the previous enchainment of bytes. | Accomplished by editor, etc. |
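The create, read, and delete rows can be run end to end. This is a hedged sketch under a few assumptions: `note.txt` is a made-up file name, the directory comes from `mktemp -d`, and a here-document stands in for typing the data and pressing CTRL+d.

```shell
workdir=$(mktemp -d)
note="$workdir/note.txt"    # hypothetical file name

# Create: flow bytes from standard input into a newly named file.
cat > "$note" <<'EOF'
hello, stream
EOF

# Read: flow the file's bytes to standard output (captured in a variable here).
contents=$(cat "$note")

# Delete: pour /dev/null into the file, which leaves it empty, then remove the name.
cat /dev/null > "$note"
size_after_void=$(wc -c < "$note" | tr -d '[:space:]')
rm "$note"

echo "read back: $contents; size after the void: $size_after_void"
```

Every step is the same motion, a flow from one place to another; only the endpoints change.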

Visualizing into the Stream: Enjoying the Metaphor

From the flow metaphor, a new metaphor emerges as to what a file is. A file is nothing more than a stream trapped in a puddle, frozen like the one that my dog scratches at when we enter Prospect Park. To cat a puddle, er, file is to:

  • Reliquefy it
  • Duplicate the contents or enchain them
  • Return the original puddle back to its container, frozen
  • Direct the enchained copy somewhere. By default the stream follows a channel to your screen, a stream called “standard output,” or STDOUT for short; hence the simplification that cat is a tool for “displaying things on screen.”

Our improved analysis explains what’s happening with my “copying” invocation as well:

cat letter_to_grandma > ~/grandma_letter_archive/2014-12-25-letter.txt

  • Reliquefy letter_to_grandma
  • Return the original to its puddle
  • Take the duplicate stream and, instead of directing it to STDOUT, direct it to (thanks to a visually helpful operator called redirect or >) a new puddle called ~/grandma_letter_archive/2014-12-25-letter.txt

If we have indeed found a better metaphor, we would expect it to hold up when a stream is directed in. Thankfully there is a means to direct data into a program, the inverse of >: the < operator.

Consider a program called numberer.rb which puts line numbers before each line of a text file and which can be invoked with:

ruby numberer.rb < /etc/passwd

numberer.rb’s code looks like:

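  # STDIN is the stream that the shell connects to the file named after <
  STDIN.each_with_index do |line, count|
    # each line keeps its trailing newline, so puts does not add another
    puts "[#{count}] " + line
  end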

Unix natively affords every program access to its input on a specially named stream of enchained bytes: STDIN.
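The same idea can be written in the shell itself, which shows that nothing about standard input is special to Ruby. This is only a sketch: `number_lines` is a hypothetical stand-in for numberer.rb, and `input.txt` is a made-up two-line file used in place of /etc/passwd.

```shell
# number_lines: hypothetical shell stand-in for numberer.rb.
# It reads whatever arrives on standard input, one line at a time.
number_lines() {
  count=0
  while IFS= read -r line; do
    printf '[%s] %s\n' "$count" "$line"
    count=$((count + 1))
  done
}

workdir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$workdir/input.txt"   # made-up sample data

# The < operator points the function's standard input at the file.
numbered=$(number_lines < "$workdir/input.txt")
echo "$numbered"
```

The function never learns the file's name. It only sees a stream, which is the point of the metaphor.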

As a final test of the metaphor’s soundness, we would expect that an identity transformation be possible. It is.

cat < letter_to_grandma > clone_of_letter

Advanced Stream Tricks

Taking cat as our primordial utility, other uses of it are natural consequences. Give me the first n lines of the stream is the head utility. Give me the last n lines, tail. Bundle the stream in chunks of screenfuls, less. Also, since output flows on STDOUT and input on STDIN, it is sensible that the output of one command could flow into another, like water through a pipe.

This is a sensible model and, lo, Unix provides a | (pipe) character which is used for linking flows, e.g.:

ls | grep harry | wc -l > files_like_harry_count.txt

  • List all the files
  • Of that output, find lines that match a regular expression with ‘harry’
  • Of those lines, count how many lines there are (word count, by line)
  • Redirect that number into a file
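The four steps above can be tried against a scratch directory. Treat this as a sketch: the file names are invented, and the count is written to a separate `out` directory on purpose. The shell typically creates the redirect target before `ls` runs, so an output file whose own name contains "harry" and sits in the listed directory may end up being counted too.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/files" "$workdir/out"

# Hypothetical file names: two mention harry, one does not.
touch "$workdir/files/harry_notes.txt" \
      "$workdir/files/letter_to_harry.txt" \
      "$workdir/files/sally.txt"

# list -> filter -> count -> redirect into a file outside the listed directory
ls "$workdir/files" | grep harry | wc -l > "$workdir/out/files_like_harry_count.txt"

# Some wc implementations pad the number with spaces, so strip whitespace.
harry_count=$(tr -d '[:space:]' < "$workdir/out/files_like_harry_count.txt")
echo "$harry_count"
```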

Stream editors like sed are provided so that data flowing in streams can be altered before reaching the next command e.g.:

cat /etc/passwd | sed 's/dog/byron/'
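To watch the alteration happen without depending on what your own /etc/passwd contains, here is a small sketch in which a made-up two-line stream stands in for the file.

```shell
# printf supplies a hypothetical stream; sed rewrites it as it flows past.
altered=$(printf 'dog chases dog\ncat naps\n' | sed 's/dog/byron/')
echo "$altered"
```

Without a trailing `g` flag, `s/dog/byron/` replaces only the first match on each line, so the second "dog" on the first line flows through untouched.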

It’s a beautiful, powerful, and flexible concept. In fact, it’s more than a mere concept, it’s an abstraction: “a technique for managing complexity of computer systems…by establishing a level of complexity on which a person interacts with the system, suppressing the more complex details below the current level.” (Wikipedia)

Harnessing the Stream Abstraction

Since binary data and plain text can both be harnessed by cat, any device which can produce a stream can be harnessed by the operating system, and by the programs written for it, for reading or writing.

cat /dev/random > file_of_noise

“Unix, let your random device give me noise that I can store somewhere”

cat /dev/null > file_of_noise

“Unix, pour nullness, the void itself, into the puddle called file_of_noise but leave the puddle and its name to be filled with new bytes anon.” It’s rather like Hindu / Vedic philosophy. All that is has the void within, no?
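A word of caution before trying these two at home: a random device has no end, so a bare cat from it keeps flowing until you interrupt it. The sketch below bounds the flow instead. It assumes that /dev/urandom exists and that your `head` accepts `-c` (both are common on Linux and macOS, though neither is guaranteed everywhere), and it keeps file_of_noise inside a temporary directory.

```shell
workdir=$(mktemp -d)
noise="$workdir/file_of_noise"

# Dip a 64-byte cup into the endless stream rather than diverting the whole river.
head -c 64 /dev/urandom > "$noise"
noise_size=$(wc -c < "$noise" | tr -d '[:space:]')

# Pour the void in: the name survives, the bytes do not.
cat /dev/null > "$noise"
void_size=$(wc -c < "$noise" | tr -d '[:space:]')

echo "$noise_size bytes of noise, then $void_size"
```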

When cat was invented, a data source could have been a box of punch-cards. The computer read them via the hopper and loaded the stream into memory. A program would read in a stream of data points and redirect its output to a light-board, a screen, or even a teletype printer.

As the technology improved, magnetic tape replaced and / or augmented punch-cards. Whatever the internal magic of the tape drive was, as long as it could produce a stream and flow it, the operating system’s fundamental assumptions did not need to change. The abstraction created a more modular and decoupled architecture.

As technology further improved, new devices would come to communicate with the machine. The machine would come to listen to a constant stream for events of “click” or “drag.” It would learn to flow enchained bytes to technologies unthinkable at its time of birth: bytes would flow onto optical discs like writable DVDs, and unimaginably long chains of bytes would be read from discs and render amazing photorealistic images onto screens. Computers would even learn to flow data in small chunks to network cards which would fracture the streams into small sections of chained bytes that would be re-flowed into one and turned into input on a remote host (networking). In the future, it’s perhaps possible that some genetic data will be read from discs and redirected to genetic printers where plants, medicine, and perhaps even clothing will be grown versus fabricated.

Legacy of the Stream

The stream metaphor has shown a robustness that most designs would envy. Consider that the original implementation dates from 1969. Imagine, in that year, taking bets on which things would last or fail by 2015: space travel, the USSR, Cuba policy, etc. But one bet made that year has proven so durable that it’s unlikely to be replaced any time soon: the stream.