
In which I begin work in the Semantic Web


One of the blessings of living in Austin – and it’s important to remember them at this time of year, when you can feel your eyeballs melting as you step into the mid-afternoon sun – is its legacy of work in machine learning and AI. Here we have a very active interest group, Semantic Web Austin, run by Juan Sequeda, who has, over the last year or so, brought some very visible Semantic Web researchers to town to teach hands-on tutorials.

If the concept of the “Semantic Web” is foreign to you, let me try to capture its essence succinctly. Presently one can conceive of the Web as a web of documents: presentation and data are represented together as web pages. My web document points to Ryan’s document and Lauren’s document. Now imagine a resume on the Web. This resume is a series of facts (and gross exaggerations ;) ). These facts have nothing, per se, to do with the document construct called a resume that you learn about from books – the thing with a name at the top, horizontal rules under section headings, etc., that, purportedly, employers like to read. Non-human examiners of my resume web page care only about the facts, not the prettiness of the artifact. Thus, the Semantic Web is one in which meaningful data is presented (as a resume) for humans, but also presented (as the essential facts of the resume) for machines, such that the relationships between the various data can be utilized by semantically-aware web applications.

Both Tom Heath and Peter Mika gave great presentations, full of ideas and hands-on activities, to the Semantic Web Austin group. From Tom I learned the basics of RDF, the language for enumerating data-facts to machines, and how to build a basic RDF document. Peter showed us RDFa and illustrated that HTML and RDF data can be written into the same document. That was a “whoa” moment for me.

Because I hadn’t had a chance to integrate these lessons from the SemWeb Austin sessions, my understanding was a bit shaky. The only way, I decided, to actually figure this out was to find a project that would give me the opportunity to work with these ideas.

About this time my yearly review concluded and I was about to update my resume, an activity I exhort you to do after reviews. Yet resume-writing had always irritated me: you write a document, then try to port it to various formats, and Mithras help you if you need to “skew” these documents toward particular employers quickly.

Thus I decided I needed to write my resume in some sort of meta-language so that I could publish to both LaTeX and HTML and “skew” to particular employers quickly. This was the goal of my project, m4resume.

The output is the Steven Harms XHTML+RDFa resume. If you’re interested in how I learned RDFa well enough to be able to embed it into XHTML, and are curious how I was able to decompose that into a series of M4 macros, you may want to read on in this exceedingly technical post. Oh, by the way, this very post also has an RDF / Semantic Web payload: check it out.

Phase 1: Get familiar with the RDF specifications

There is really no way around it: you need to get familiar and comfortable with the RDF and RDFa specifications. I wound up needing to consult them so often that I created local copies for offline access. If you:

git clone git://github.com/sgharms/m4resume.git

you will be given the default ‘master’ branch. If you want to view the branch that generates my resume, execute:

git checkout -b demo origin/sgharms_example

you’ll find my reference documentation in m4resume/reference in your freshly created “demo” branch. You’ll want to read:

  1. RDF Primer.webarchive
  2. notes from rdf primer.txt

This should give you familiarity with the basic terms, and hopefully my notes (in the “notes from rdf primer.txt” file) will give you a few salient summary points.

Build validating RDF documents

At this point, I spent a lot of time playing with the W3C’s RDF Validator. If I’ve learned anything about writing things that produce other things, it’s that it is very helpful to first produce, by hand, the thing you want, so that you can test whether your producing thing actually produces something identical.

As such, I wrote out my resume by hand, in RDF. I slowly built it up block by block in RDF/XML and fed it through the validator.

I took advantage of a number of ontologies:

  • cv="http://purl.org/captsolo/resume-rdf/0.2/cv#"
  • dc="http://dublincore.org/2008/01/14/dcelements.rdf#"
  • dc1="http://purl.org/dc/terms/"
  • doap="http://usefulinc.com/ns/doap#"
  • foaf="http://xmlns.com/foaf/0.1/"
  • geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
  • rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  • rdfs="http://www.w3.org/2000/01/rdf-schema#"
  • xhv="http://www.w3.org/1999/xhtml/vocab#"
  • xml="http://www.w3.org/XML/1998/namespace"

I basically paged through them, found attributes and relationships I wanted to express in the RDF resume, and integrated them. The following two documents, again in the $GITROOT/reference directory, provided the information I needed (a small illustrative block follows the list):

  • doap
  • ResumeRDF Ontology Specification.webarchive
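
To give a flavor of what those hand-built blocks looked like, here is a minimal, illustrative sketch – not an excerpt from my actual resume, and the cv: class and property names are from the ResumeRDF spec as I recall them, so check them against the documents above. A fragment like this is the sort of thing I would feed to the validator at each step:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:cv="http://purl.org/captsolo/resume-rdf/0.2/cv#">
  <!-- A CV resource, linked to the person it describes -->
  <cv:CV rdf:about="#cv">
    <cv:aboutPerson>
      <foaf:Person rdf:about="#me">
        <foaf:name>Steven Harms</foaf:name>
      </foaf:Person>
    </cv:aboutPerson>
  </cv:CV>
</rdf:RDF>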

Protip: Don’t underestimate the value of using the graph output from the validator. If you mean for something to be connected, this tool will quickly show you if it’s not. If you notice arrows pointing back to themselves (as I did), you can be assured you’re not doing what you want.

Getting this far was a major milestone. I can’t say how thankful I am for git in helping me branch quickly, integrate working bits, and roll back any mistakes. Given the frequency of iteration in development, I’d say that a good source control tool is a must, as is a good editor (nothing new there).

Embed RDF into XHTML, thus making RDFa

Now that you have working RDF, your battle is half-won: you need to integrate it into XHTML. This is defined in the XHTML+RDFa DTD. Here are the sources I used to make an RDFa-ized version of the RDF resume (again, in $GITROOT/reference):

  • RDFa - Wikipedia, the free encyclopedia.webarchive
  • RDFa Primer.webarchive
  • RDFa Use Cases: Scenarios for Embedding RDF in HTML.webarchive
  • RDFa for HTML Authors.webarchive
  • RDFa in XHTML: Syntax and Processing.webarchive
  • Tip Use rdf about and rdf ID effectively in RDF XML.webarchive

My “notes…txt” file also contains the notes I extracted, chiefly from the first two links.

The process here is a bit more complex. First you take the XHTML+RDFa document and pass it through the W3C’s RDFa Distiller. Thereafter, you may take that distilled RDF/XML data and put it into the aforementioned RDF Validator (which also makes the pretty graph!). Through many (many!) iterations of this process, I eventually produced a valid XHTML+RDFa resume.
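
To make that concrete, here is a minimal sketch of what the embedding looks like – a hand-rolled illustration rather than an excerpt from my actual resume. The visible text is what humans read; the about, typeof, and property attributes carry the RDF statements that the distiller extracts:

<div xmlns:foaf="http://xmlns.com/foaf/0.1/"
     about="#me" typeof="foaf:Person">
  <!-- Humans see a heading; the distiller sees the triples
       <#me> rdf:type foaf:Person and <#me> foaf:name "Steven Harms" -->
  <h1 property="foaf:name">Steven Harms</h1>
</div>

Run a fragment like that through the distiller and you get back plain RDF/XML that you can feed to the validator, just as before.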

Now, if you are sane, you stop here and enjoy life with your semantically marked-up resume. I, however, am not sane, and decided, you’ll recall, to generate a beautifully LaTeX-formatted resume as well as this XHTML+RDFa resume. You have the benefit of being able to crib from my m4resume project.

M4Resume

Why M4?

First question: why M4? It was written in the late 70’s, has arcane syntax, and, in the words of one SWiK IRC-er: “does anyone use M4 for any serious programming anymore?” Here was my thinking…

First, I come from a Sendmail admin background, so knowing M4 (sorta) is not optional for me; the syntax wasn’t that baffling to get into.

Second, M4 has the ability to be entirely self-contained: no libraries, no external dependencies, no gems (despite the very generous leg-up Dave Coupland tried to give my stubborn self – he underestimated my block-headedness).

Third, it’s philosophically sexy. I have a philosophy degree, and holders of such degrees are not the most pragmatically-minded people. There’s something very attractive about this example of m4 code:

divert(-1)`'dnl
------------------------------------------------------------------------------
The above makes sure definitions don't get put on the output stream;
we're going to define the macros below.
------------------------------------------------------------------------------

define(`foo',`FOO')
define(`bar', `BAR')
define(`FOOBAR', `M4 all the way down')

------------------------------------------------------------------------------
Next, we'll get back on the main output stream
------------------------------------------------------------------------------

divert`'dnl
"M4's philosophical coolness BEGIN"

dnl Put a closing "tag" in a buffer, handy if M4 needs to generate markup
divert(2)dnl
"The End!"
divert`'dnl
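dnl indir expands its argument first: foo`'bar becomes FOOBAR (the `' keeps
dnl the two macro names from fusing into one token), so indir invokes the
dnl FOOBAR macro defined above and emits "M4 all the way down"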
indir(foo`'bar)
dnl You don't need the following, the temporary 2 buffer is 
dnl automatically dumped

undivert(2)

Because of its simple stream-based replacement, M4 lets you embed a conditional definition like the following in your sources:

define(`__RDFA_CANDIDATE_NAME', `ifdef(`do_rdfa', `more metadata', `less metadata')')

Lastly, because of its recursion-friendly design, M4 feels a lot more like programming a text-stream-editing LISP. Unlike in imperative paradigms, you don’t have to know how many iterations you’ll need; you just let expansions happen until they don’t, and drop the result out to STDOUT. That was attractive to me. M4 is tail-recursion capable, so all the iteration you need is there, and there are enough decision structures to allow rich application logic. How often are you going to be tweaking your resume?

That said, if I ever rewrite this with a specific eye towards RDFa, I would think about using Ruby objects effectively mapping to RDF/XML blocks. Live and learn.

Exploration