What I'm Working On: XHTML Transformer for HyperScope

This week, I'm working on an XHTML transformer for the HyperScope Project.

One of our big goals is to be able to use HyperScope's advanced addressing and studying abilities when working with the W3C specifications. For example, imagine being able to directly point into arbitrary parts of the XML Specification, or use HyperScope's advanced studying tools in order to gain greater access and understanding when working with a new specification.

Before we can get there, though, we have to create a transformer that lives on the network and can automatically proxy in the W3C specs and convert them into the HyperScope format, which is OPML. This week, I've begun working on an XHTML transformer that would do this: it would change any XHTML that's out there into OPML.

The tool I've selected for this is XSLT; specifically, I'm using several Unix tools chained together. I love Unix and Bash. Here's the command I'm using so far as I build this up; Unix just makes this kind of stuff so easy:

curl --silent http://www.w3.org/TR/2006/REC-xml-20060816/ |
tidy -indent -bare -numeric -utf8 --doctype omit --add-xml-decl yes --fix-bad-comments yes --escape-cdata yes -asxhtml - 2>/dev/null |
xsltproc xhtml.xsl -

Basically, this chains together three Unix command-line tools: curl, tidy, and xsltproc. Curl is a tool which can talk HTTP on the network and return the results; we tell curl to go and grab the XML specification, which lives at http://www.w3.org/TR/2006/REC-xml-20060816/.

Then, we run tidy, which can take malformed HTML and turn it into nice XHTML. We give tidy a long list of options that basically fix up arbitrary HTML and make it play nice in an XML universe. I've created a footnote with the options documented, since it might help others.

The output of tidy is then chained into xsltproc, which applies the xhtml.xsl stylesheet against standard input (that little dash at the very end is a way to pass the standard input into xsltproc, which expects a filename rather than working with standard input like other Unix tools). The output is HyperScope OPML.
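For anyone who would rather drive the same three-stage chain from a script than from Bash, here is a minimal sketch in Python. It assumes curl, tidy, and xsltproc are installed and that xhtml.xsl sits in the current directory; the function names build_pipeline and run_pipeline are made up for illustration and are not part of HyperScope. Exit codes are not checked here, which a real version would want to do.

```python
import subprocess


def build_pipeline(url, stylesheet="xhtml.xsl"):
    """Return the argv lists for curl, tidy, and xsltproc, in pipeline order."""
    curl = ["curl", "--silent", url]
    tidy = ["tidy", "-indent", "-bare", "-numeric", "-utf8",
            "--doctype", "omit", "--add-xml-decl", "yes",
            "--fix-bad-comments", "yes", "--escape-cdata", "yes",
            "-asxhtml", "-"]
    # The trailing "-" tells xsltproc to read the document from standard input.
    xslt = ["xsltproc", stylesheet, "-"]
    return [curl, tidy, xslt]


def run_pipeline(url, stylesheet="xhtml.xsl"):
    """Connect each stage's stdout to the next stage's stdin; return the final bytes."""
    procs = []
    prev_stdout = None
    for argv in build_pipeline(url, stylesheet):
        # Mirror the shell command: only tidy's warnings are thrown away.
        stderr = subprocess.DEVNULL if argv[0] == "tidy" else None
        proc = subprocess.Popen(argv, stdin=prev_stdout,
                                stdout=subprocess.PIPE, stderr=stderr)
        if prev_stdout is not None:
            # Close our copy so the upstream stage sees a closed pipe
            # if the downstream stage exits early.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)
    output = procs[-1].communicate()[0]
    for proc in procs[:-1]:
        proc.wait()
    return output
```

Calling run_pipeline("http://www.w3.org/TR/2006/REC-xml-20060816/") should hand back whatever the stylesheet emits, as bytes. Because each stage is an argument list rather than one shell string, nothing in the URL is ever interpreted by a shell.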

I've gotten relatively far with the XSLT so far. I based my work on the excellent XOXO transformer that Les Orchard created. I can transform all the H* elements, such as H1, H2, etc., into their OPML outline. I've got a lot more work to do, though. I also need to turn the command-line stuff above into something that can live on the network safely; I'll probably use PHP for this, invoking the commands through safe PHP wrappers.
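To make "safe" a little more concrete, here is a rough sketch of the kind of check a network-facing wrapper might run before a user-supplied URL ever reaches curl. It is written in Python rather than the PHP mentioned above, purely for illustration; the ALLOWED_HOSTS set and the is_safe_spec_url name are assumptions, not anything HyperScope actually ships.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: only fetch documents from hosts we trust.
ALLOWED_HOSTS = {"www.w3.org"}


def is_safe_spec_url(url):
    """Return True only for plain http(s) URLs on an allowlisted host."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        # Malformed input (for example a broken IPv6 literal) is rejected.
        return False
    if parts.scheme not in ("http", "https"):
        # Blocks file://, ftp://, and strings that look like curl options.
        return False
    if parts.username or parts.password:
        # Blocks tricks such as http://www.w3.org@evil.example/
        return False
    return host in ALLOWED_HOSTS
```

A wrapper would call this first and refuse the request when it returns False. Validation like this is only one layer; handing the URL to curl as a single argument, without going through a shell, matters just as much.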

The XSLT is actually really fun and challenging. XHTML is a really flat format, so it's very hard to infer structure from it, while OPML very overtly tells you its structure. I'm using a great web page that goes over some of the solutions, which might be helpful for you if you ever run into a similar problem.
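The core of the flat-to-nested problem can be sketched outside XSLT. The Python below is not the stylesheet described in this post; it is a small stand-in that shows one common approach, a stack of open headings, on a list of (level, text) pairs. The headings_to_opml name, the input shape, and the bare-bones OPML 1.0 skeleton are all assumptions for illustration, and HyperScope's own OPML conventions are not reproduced.

```python
import xml.etree.ElementTree as ET


def headings_to_opml(headings):
    """Nest a flat list of (level, text) headings into OPML outline elements.

    `headings` is in document order and levels start at 1, e.g. (1, "Intro")
    for an H1 and (2, "Scope") for an H2.
    """
    opml = ET.Element("opml", version="1.0")
    ET.SubElement(opml, "head")
    body = ET.SubElement(opml, "body")

    # Stack of (level, element) for the headings that are still "open".
    # The body sits at level 0, so it is never popped.
    stack = [(0, body)]
    for level, text in headings:
        # Close every open heading at the same or a deeper level;
        # whatever is left on top is this heading's parent.
        while stack[-1][0] >= level:
            stack.pop()
        node = ET.SubElement(stack[-1][1], "outline", text=text)
        stack.append((level, node))
    return opml
```

Feeding it something like [(1, "Introduction"), (2, "Origin and Goals"), (2, "Terminology"), (1, "Documents")] puts the two H2 entries inside the first H1 outline, and ET.tostring(opml, encoding="unicode") serializes the result. A skipped level, such as an H3 directly under an H1, simply nests under the nearest shallower heading.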

By the way, one nice side effect of HyperScope will be that Dave Winer will get a bunch of transformers that will turn lots of different formats into OPML.

Tidy footnote:

indent the results (-indent); turn HTML entity names into their numeric values (-numeric); strip out smart quotes, em dashes, and similar word-processor characters, and write non-breaking spaces as plain spaces (-bare; the related -clean option is the one that turns old-school HTML formatting tags into their CSS style equivalents); treat everything in a Unicode way so that internationalization works (-utf8); don't write out a funky SGML-ish doctype, like HTML has (--doctype omit); do write out an XML declaration at the top (--add-xml-decl yes); fix messed-up comments (--fix-bad-comments yes); convert CDATA sections into ordinary escaped text, so that a spurious < sign, for example, won't break things (--escape-cdata yes); and write everything out as XHTML (-asxhtml). I also tell tidy to send its warning and error messages to /dev/null (i.e., don't display them).