Streams and the Soul of the Machine
Introduction
For the last year I have been teaching passionate beginners about programming at DevBootCamp. In this time I have come to realize that one of my primary tasks as a teacher is to process the patterns and idioms of the computer and of programming languages (as I have experienced them) and rarefy them into metaphors that my students can grasp experientially and/or emotionally. Having found an emotional or experiential connection to the rarefied metaphor, they are able to condense it back into the universe of text-on-screen where I show the praxis of the metaphor.
The primary advantage to this approach, as I see it, is that even if the praxis of “what to type” or “what is the computer doing” is unclear, having a series of metaphors wherewith to communicate or reason about the praxis greatly facilitates understanding.
Given my own philosophical bent, one question I have been pursuing in discussion with my students is this: “What is the metaphor that describes data’s nature?”
Data’s Sine Qua Non
It all started rather simply. I was a bit chagrined to see my students
reaching for the wrong tools (e.g. a text editor such as Sublime Text) when
attempting to get information from their server logs or from large files.
Realizing it was my duty to make sure that machine navigation was as well
covered as SOLID programming principles, I assigned work on researching the
primitive Unix utilities: `cat`, `head`, `tail`, `sed`, et al.
These utilities’ functions are generally described as the following:
| Name | Description |
| --- | --- |
| `cat` | Display the contents of a file on the screen |
| `head` | Display the first 10 lines of a file on the screen |
| `tail` | Display the last 10 lines of a file on the screen |
| `less` | Display a "page" of screen data from a file |
While this synopsis certainly works for those **learning** to use a Unix system, it fails _philosophically_ as one starts to learn more of the features of some of these commands.
A Deeper Look at cat
As a preliminary invocation let us consider: `cat /etc/passwd`. This does, as
per our table above, the work of “displaying the contents of a file,” in this
case `/etc/passwd`. The word “displaying” simplifies the important mechanics
of this command, though. Something else is happening, something more
important, subtler, and infinitely more wonderful, which is brought out in
this invocation:
cat file > file2
This makes a copy of `file` at `file2` by means of the “redirection” (`>`)
operator. This duplication works for binary files (things you run, e.g.
Google Chrome) as well as for text files (human-readable text, e.g.
`letter_to_grandma.txt`). If `cat` merely displays things, how does this
second invocation wind up making copies?
Perhaps we’ve over-simplified something important of which we ought to take
better notice. As the maxim commonly attributed to Einstein has it, our goal
should be to “Make things as simple as possible, but not simpler.” The
“conventional” explanation of `cat` seems to have lost a critical detail.
In Search of `cat`’s Origins: Linguistics
`cat` is the short and easily typed version of catenate, deriving from
Latin’s catenatus, the participle of the verb cateno, catenare, catenavi,
catenatum: to chain. To make a quick jaunt to English via Old French:
catenate means to “enchain” or “yoke.”
What are we yoking when we use `cat`? We’re enchaining a sequence of
bytes. Yet that’s not the end of the story. As we saw above, `cat`ting a
file results in screen display, not some simple diagnostic à la:
`catenated 48 bytes`. To `cat` is to catenare, but it’s also to direct that
chained collection somewhere, and that is whence the breakthrough metaphor
bursts forth: the essence of data in Unix, and perhaps all systems, is a
stream in which enchained signals (i.e. bits) flow.
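To make the “enchain, then direct” reading concrete, here is a minimal sketch in Ruby. It is not how the real `cat` is implemented; `catenate` is a hypothetical helper of my own naming, and it leans on Ruby’s `IO.copy_stream` to move raw bytes from each named file into whatever output stream it is handed.

```ruby
require "stringio"

# Hypothetical helper (not from any library): enchain the bytes of each
# named file, in order, and flow them into `out`.
# Returns the total number of bytes that flowed.
def catenate(paths, out = $stdout)
  paths.sum do |path|
    # IO.copy_stream moves raw bytes, so text and binary files flow alike.
    IO.copy_stream(path, out)
  end
end
```

Called as `catenate(["a", "b"])` the bytes flow to the screen; handed a `StringIO` or an open file instead, the very same bytes flow there, which is the whole point of the metaphor.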
Defining Create, Read, Update, and Destroy in Terms of cat
From this superior metaphor we can define all the operations on a hard disk as a function of a flow.
| Operation | Stream Interpretation | Invocation |
| --- | --- | --- |
| Create a file | Associate a name to a byte address of a size equivalent to the size of an enchained entity | `cat > file`, _enter data_, CTRL+d |
| Delete a file | Flow the void into the container which formerly held bytes, emptying it. Remove the name associated to the first byte address | `cat /dev/null > filename && rm filename` |
| Read a file | Flow from the byte associated with the file name the number of bytes inside the file to the display device | `cat file` |
| Edit a file | Flow a temporary buffer of bytes onto a disk. Move the previously used file name to point to the new collection. Flow the void into the previous enchainment of bytes | Accomplished by an editor, etc. |
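The first three rows of the table can be sketched as flows in Ruby. These helpers are hypothetical and their names are mine; they only mirror the table and make no claim about what a real file system does underneath.

```ruby
require "stringio"

# Hypothetical helpers mirroring the table rows; the names are illustrative.

# Create: flow a source stream into a newly named container.
# Returns the number of bytes that flowed.
def create_file(path, source)
  File.open(path, "wb") { |file| IO.copy_stream(source, file) }
end

# Read: flow the file's bytes to an output stream (the screen by default).
def read_file(path, out = $stdout)
  IO.copy_stream(path, out)
end

# Delete: empty the container, then remove its name.
def delete_file(path)
  File.truncate(path, 0) # same effect as `cat /dev/null > path`
  File.delete(path)      # same effect as `rm path`
end
```

For example, `create_file("note", StringIO.new("hello\n"))` stands in for typing into `cat > note` and pressing CTRL+d.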
Visualizing into the Stream: Enjoying the Metaphor
From the flow metaphor, a new metaphor emerges as to what a file is. A file
is nothing more than a stream trapped in a puddle, frozen like the one that my
dog scratches at when we enter Prospect Park. To `cat` a puddle, er, file is
to:
- Reliquefy it
- Duplicate the contents or enchain them
- Return the original puddle back to its container, frozen
- Direct the enchained copy somewhere. By default the stream follows a channel to your screen, a stream called “standard output” or `STDOUT` for short, hence the simplification that `cat` is a tool for “displaying things on screen.”
Our improved analysis explains what’s happening with my “copying” invocation as well:
cat letter_to_grandma > ~/grandma_letter_archive/2014-12-25-letter.txt
- Reliquefy `letter_to_grandma`
- Return the original to its puddle
- Take the duplicate stream and, instead of directing it to `STDOUT`, direct it (thanks to a visually helpful operator called redirect or `>`) to a new puddle called `~/grandma_letter_archive/2014-12-25-letter.txt`
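The steps above can be approximated from Ruby. With `>`, the shell attaches the new puddle to the program’s standard output before the program starts, so `cat` itself never learns that its stream was diverted. The sketch below does the same through the `out:` redirection option of `Kernel#system`; `cat_into` is a hypothetical name, and the sketch assumes a `cat` binary is on the `PATH`.

```ruby
# Hypothetical helper: run `cat source` with its standard output attached
# to `target` instead of the screen, much as the shell does for `>`.
# Returns true when `cat` exits successfully.
def cat_into(source, target)
  # `out: [target, "w"]` opens (or truncates) the target and connects it
  # to the child's STDOUT before `cat` begins to run.
  system("cat", source, out: [target, "w"])
end
```

The original file is only read from, so it stays frozen in its puddle while the duplicate lands in the new one.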
If we have indeed found a better metaphor we would expect that it would hold up
with a stream being directed in. Thankfully there is a means to direct data
into a program, the inversion of `>`: `<`.
Consider a program called `numberer.rb` which puts line numbers before each
line of a text file and which can be invoked with:
numberer.rb < /etc/passwd
`numberer.rb`’s code looks like:
STDIN.each_with_index do |line, count|
  puts "[#{count}] " + line
end
Unix natively affords programs access to their input on a specially named
stream of enchained bytes: `STDIN`.
As a final test of soundness of metaphor, we would expect that an identity transformation be possible. It is.
cat < letter_to_grandma > clone_of_letter
Advanced Stream Tricks
Taking `cat` as our primordial utility, other uses of it are natural
consequences. “Give me the first n lines of the stream” is the `head`
utility. “Give me the last n lines” is `tail`. “Bundle the stream in chunks
of screenfuls” is `less`. Also, since output flows on `STDOUT` and input on
`STDIN`, it is sensible that the output of one command could flow into
another, like water through a pipe.
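Seen this way, `head` and `tail` are small variations on reading a stream. Here is a sketch in Ruby under that reading; both helpers are hypothetical, accept any IO-like object, and are not how the real utilities are written.

```ruby
require "stringio"

# Hypothetical helpers: both accept any IO-like stream of lines.

# head: let the first n lines flow past, then stop reading.
def head(stream, n = 10)
  stream.each_line.first(n)
end

# tail: the end of a stream is unknown until it runs dry, so keep only
# the most recent n lines while the rest flows by.
def tail(stream, n = 10)
  recent = []
  stream.each_line do |line|
    recent << line
    recent.shift if recent.size > n
  end
  recent
end
```

Note the asymmetry: `head` can stop early, while `tail` must watch the whole stream go by before it knows which lines were last.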
This is a sensible model and, lo, Unix provides a `|` (pipe) character which is
used for linking flows, e.g.:
ls | grep harry | wc -l > files_like_harry_count.txt
- List all the files
- Of that output, find lines that match a regular expression with ‘harry’
- Of those lines, count how many lines there are (word count, by line)
- Redirect that number into a file
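The same four steps can be imitated inside Ruby by chaining lazy enumerators, where each stage consumes a stream of lines and hands a stream onward. This is only an analogy to a real pipeline (no separate processes are involved), and every name below is hypothetical.

```ruby
# Hypothetical stages; each takes a stream of lines and passes one along.

# Stage 1: list the visible entries of a directory, like plain `ls`.
def list_files(dir)
  Dir.children(dir).reject { |name| name.start_with?(".") }.sort.lazy
end

# Stage 2: keep only the lines matching a pattern, like `grep`.
def matching(lines, pattern)
  lines.select { |line| line.match?(pattern) }
end

# Stages 3 and 4: count what arrives (`wc -l`), then flow the number
# into a file (`>`). Returns the count.
def count_files_like(dir, pattern, out_path)
  count = matching(list_files(dir), pattern).count
  File.write(out_path, "#{count}\n")
  count
end
```

A call such as `count_files_like(".", /harry/, "files_like_harry_count.txt")` corresponds to the shell line above.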
Stream editors like `sed` are provided so that data flowing in streams can be
altered before reaching the next command, e.g.:
cat /etc/passwd | sed 's/dog/byron/'
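A stream editor of this kind is easy to sketch in Ruby: read a line, alter it, pass it along. The `substitute` helper below is hypothetical, and it uses Ruby’s regular expressions rather than `sed`’s own syntax. Like `s/dog/byron/` without the `g` flag, it replaces only the first match on each line.

```ruby
require "stringio"

# Hypothetical filter: read lines from `input`, replace the first match of
# `pattern` on each line, and write the result to `output`.
def substitute(input, output, pattern, replacement)
  input.each_line do |line|
    output.write(line.sub(pattern, replacement))
  end
end
```

Calling `substitute($stdin, $stdout, /dog/, "byron")` in a script gives a filter that can sit in the middle of a pipe.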
It’s a beautiful, powerful, and flexible concept. In fact, it’s more than a mere concept; it’s an abstraction: “a technique for managing complexity of computer systems…by establishing a level of complexity on which a person interacts with the system, suppressing the more complex details below the current level” (Wikipedia).
Harnessing the Stream Abstraction
Since binary data and plain text can both be harnessed by `cat`, any device
which can produce a stream can be harnessed by the operating system and the
programs written for it for reading or writing.
cat /dev/random > file_of_noise
“Unix, let your random device give me noise that I can store somewhere.” (The random device never runs dry, so this flow continues until you interrupt it with CTRL+c.)
cat /dev/null > file_of_noise
“Unix, pour nullness, the void itself, into the puddle called `file_of_noise`
but leave the puddle and its name to be filled with new bytes anon.” It’s
rather like Hindu / Vedic philosophy. All that is has the void within, no?
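Because devices present themselves as streams, a program can read them with the same calls it uses for ordinary files. In the sketch below, `take_bytes` is a hypothetical helper that takes a bounded sip so that an endless device cannot flood us. I use `/dev/urandom` in the usage note as an assumption on my part, since it does not block on any system I know of, and both device paths assume a Unix-like machine.

```ruby
# Hypothetical helper: take at most `limit` bytes from any readable path,
# whether it names an ordinary file or a device.
def take_bytes(path, limit)
  # read(limit) returns nil when the stream is already dry, as with the
  # void of /dev/null, so fall back to an empty binary string.
  File.open(path, "rb") { |device| device.read(limit) || "".b }
end
```

Here `take_bytes("/dev/urandom", 16)` yields sixteen bytes of noise, while `take_bytes("/dev/null", 16)` yields nothing at all: the void flows zero bytes.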
When `cat` was invented a data source could have been a box of punch-cards.
The computer would read them via the hopper and load the stream into memory.
It would read in a stream of data points unto the program and redirect the
output to a light-board, a screen, or even a teletype printer.
As the technology improved, magnetic tape replaced and / or augmented punch-cards. Whatever the internal magic of the tape drive was, as long as it could produce a stream and flow it, the operating system’s fundamental assumptions did not need to change. The abstraction created a more modular and decoupled architecture.
As technology further improved, new devices would come to communicate with the machine. The machine would come to listen to a constant stream for events of “click” or “drag.” It would learn to flow enchained bytes to technologies unthinkable at its time of birth: bytes would flow onto optical discs like writable DVDs, and unimaginably long chains of bytes would be read from discs and render amazing photorealistic images onto screens. Computers would even learn to flow data in small chunks to network cards which would fracture the streams into small sections of chained bytes that would be re-flowed into one and turned into input on a remote host (networking). In the future, it’s perhaps possible that some genetic data will be read from discs and redirected to genetic printers where plants, medicine, and perhaps even clothing will be grown rather than fabricated.
Legacy of the Stream
The stream metaphor has shown a robustness enviable for most designs. Consider that the original implementation dates from 1969. Imagine taking bets, in that year, on which things would last or fail by the year 2015: space travel, the USSR, Cuba policy, etc. But one bet made that year has proven so durable that it’s unlikely to be replaced any time soon: the stream.