Straw Man Proposal for Purple Include Spec (Version 0.1)

This is a simple straw-man proposal giving a spec for Purple Includes. It's mainly meant to help folks like Kevin Burton who are asking if they can create server-side support for Purple Includes in their software, such as Spinn3r, Kevin's mojo-licious blog spider. Note that the existing JavaScript Purple Include doesn't fully support this spec yet. Once we hash out this straw-man proposal I'll update the JavaScript to match.

Let's call this version 0.1 of the spec. A Purple Include is a way to include pieces of other, remote documents into your own web page just like you can include images from all over the world using the IMG tag. The idea behind Purple Includes is based on Ted Nelson and Douglas Engelbart's work.

Purple Includes can be added to HTML's standard Q and BLOCKQUOTE tags. A Q is an inline quote (i.e. you can use it inside of other text), while a BLOCKQUOTE is block-level, like a DIV. To use a Purple Include, set the 'cite' attribute to the remote resource (or the piece of a remote resource) you want to include, and set the 'embed' attribute to 'true'. Two examples:


<q cite="http://codinginparadise.org/paperairplane#quote(What if community and editing...research and coding efforts can steer towards.)" embed="true"></q>

<blockquote cite="http://www.eekim.com/blog/2007/06/21/networkedtoolsemail#nidMDU" embed="true"></blockquote>


The 'embed' attribute is necessary so that we can differentiate normal, non-Purple Include quotes and blockquotes from ones that want the Purple Include magic.

If 'embed' is present and true, then the user agent should take the URI given in 'cite', follow it, and either grab the whole thing or a fragment and inline it into the quote or blockquote, replacing that element's older contents. If 'embed' is false or not present, the user agent should do nothing and simply display what is already inside the quote or blockquote.
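For a server-side user agent, the first step is finding which elements have opted in. Here is a minimal sketch in Python using only the standard library. The names IncludeFinder and find_includes are hypothetical, and matching 'true' case-insensitively is my assumption, since the proposal only says the value should be 'true'.

```python
from html.parser import HTMLParser


class IncludeFinder(HTMLParser):
    """Collects (tag, cite) pairs for Q/BLOCKQUOTE elements that opt in."""

    def __init__(self):
        super().__init__()
        self.includes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag not in ("q", "blockquote"):
            return
        attr_map = dict(attrs)
        # Assumption: 'true' is compared case-insensitively.
        embed = (attr_map.get("embed") or "").strip().lower()
        cite = attr_map.get("cite")
        if embed == "true" and cite:
            self.includes.append((tag, cite))


def find_includes(html_text):
    """Return the elements that ask for a Purple Include, in document order."""
    finder = IncludeFinder()
    finder.feed(html_text)
    finder.close()
    return finder.includes
```

Elements without 'embed', or with 'embed' set to anything other than 'true', are skipped, so their existing contents are left alone.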

For the above BLOCKQUOTE example, once the Purple Include has happened here is what the markup would look like after the remote content was inlined:


<blockquote cite="http://www.eekim.com/blog/2007/06/21/networkedtoolsemail#nidMDU"
embed="true">
<p>
<a name="nidMDU" id="nidMDU"></a>

What is a collaborative tool? It's a tool that facilitates
collaboration. Certainly, a shared authoring tool like a Wiki has
affordances that facilitate collaboration. But a plain old text
editor is just as legitimately a collaboration tool, because it can
also be used to facilitate collaboration (for example, when used on a
<a href="http://www.eekim.com/cgi-bin/wiki.pl?SharedDisplay" class="wikiword">SharedDisplay</a>).

<a class="nid" title="MDU" href="http://www.eekim.com/blog/2007/06/21/networkedtoolsemail#nidMDU">(MDU)</a>
</p>
</blockquote>


And here is what that Purple Include actually looks like using the JavaScript Purple Include (under the covers using the older syntax since I don't support this spec fully yet):



If the user agent is visible to the user (i.e. a browser), it should display a spinning ball inside the element while grabbing the remote resource. Feel free to use this one, one of the Ajax spinner balls around the net that are free to use for any purpose:

Again, if the user agent is visible to the user, it should set a 'class' value on the quote or blockquote so that document writers can add nifty CSS to style the Purple Include based on whether it succeeded or not. If the Purple Include succeeds, the quote or blockquote should be given the classes "included" and "include_ok". If it fails, it should be given the class "include_error".

If there was an error, an error message should be inlined into the element's value so that users can see what happened and perhaps address the error.

Being able to grab portions of remote documents is the real magic of Purple Include. Existing server-side templating systems grab entire documents, which have limited utility in the real world when it comes to discourse and annotation. Let's go over the kinds of URIs that can be inside the 'cite' attribute.

First things first: only HTML, XHTML, and XML are supported for remote resources right now, i.e. the following MIME types:
  • text/html
  • application/xhtml+xml
  • text/xml
  • application/xml
Any other MIME type must cause an error: the user agent writes the error into the element's value so that the user can see what happened and, if this is a user-facing user agent, changes the 'class' value as described above.
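The MIME check and the success/failure classes can be sketched together. This is only an illustration: IncludeError, check_content_type, and render_outcome are hypothetical names, and ignoring parameters such as charset on the Content-Type header is my assumption.

```python
SUPPORTED_MIME_TYPES = {
    "text/html",
    "application/xhtml+xml",
    "text/xml",
    "application/xml",
}


class IncludeError(Exception):
    """Raised when a remote resource cannot be included."""


def check_content_type(header_value):
    """Return the bare MIME type, or raise IncludeError if unsupported."""
    # Assumption: parameters such as '; charset=UTF-8' are ignored.
    mime = header_value.split(";", 1)[0].strip().lower()
    if mime not in SUPPORTED_MIME_TYPES:
        raise IncludeError("Unsupported MIME type: " + (mime or "(none)"))
    return mime


def render_outcome(fetch):
    """Call fetch() and return (css_classes, content_for_the_element)."""
    try:
        content = fetch()
    except IncludeError as exc:
        # On failure the error message becomes the element's content.
        return ["include_error"], str(exc)
    return ["included", "include_ok"], content
```

Here fetch stands in for whatever function retrieves and processes the remote resource; it is expected to call check_content_type along the way.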

Before applying any of the addressing schemes below, the user agent should transform HTML into XHTML by using a tidy program, whether tidy, JTidy, or another language-specific tidy library. [Note: would it help to provide some of the options I give to JTidy?]

Now we are ready to look at the different kinds of URIs that can be inside the 'cite' attribute and what a user agent should do with them:

If you just give a full URI with no anchor (ex: http://codinginparadise.org/paperairplane), the user agent should grab the entire remote resource. If the remote resource is HTML/XHTML, it should return just the BODY tag and its children (in XPath notation this would be /html/body). If the resource is XML, it should return the root of the XML document plus its children.
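As a rough sketch of the no-anchor case, assuming the resource has already been tidied into well-formed XHTML or XML, something like the following would do. The function names are hypothetical, and stripping the namespace prefix is my assumption about what tidy output looks like.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Tag name without any '{namespace}' prefix that tidy output may carry."""
    return element.tag.rsplit("}", 1)[-1]


def whole_resource(markup):
    """Return BODY for (X)HTML documents, or the root element for other XML."""
    root = ET.fromstring(markup)
    if local_name(root).lower() == "html":
        for child in root:
            if local_name(child).lower() == "body":
                return child
    # Plain XML (or HTML with no BODY): return the root and its children.
    return root
```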

If you have a full URI with a simple anchor name (ex: http://www.eekim.com/blog/2007/06/21/networkedtoolsemail#nidMDU), then we have some special behavior since these could be Purple Numbers.

Purple Numbers are supported by some publishing systems, and cause a unique anchor to be placed onto each paragraph of the page so that you can grab just specific parts. This makes it easy to bookmark and point at specific parts of a given document. In addition, even if this isn't a Purple Number and just an anchor name, generally anchors have no children -- the intent of someone making an anchor is to make its parent have a unique address. Here's an example:

<h1><a name="first_section"></a>The First Section</h1>

If Purple Include were 'dumb' and simply grabbed the anchor and returned its children, we would get nothing useful back here, since the anchor itself is empty. Instead, we have some special logic:
  1. Find an anchor tag with the given name or ID. If found, grab its immediate parent and return that.
  2. Find a paragraph with the given name or ID. If found, return that paragraph.
Here's what that logic looks like in XPath if you end up using XPath on your server-side to do this:
//a[@name=anchorName or @id=anchorName]/..
|
//p[@name=anchorName or @id=anchorName]
[Note: is the special logic here worth the complexity? Should we just drop it? It's nice for end-users because they can quickly grab specific anchors, however]
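If you are not using XPath, the same lookup can be sketched directly over a parsed tree. The version below is an assumption-laden illustration in Python: it expects tidied, well-formed XHTML, find_anchor_target is a hypothetical name, and it treats the two rules as a priority order (anchors first, then paragraphs), which is how I read the numbered list; the XPath union above would instead return every match.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Tag name without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def find_anchor_target(root, anchor_name):
    """Return the element a simple '#name' fragment should include, or None."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Rule 1: an A tag with the given name or id; return its immediate parent.
    for element in root.iter():
        if local_name(element) == "a" and anchor_name in (
            element.get("name"),
            element.get("id"),
        ):
            return parent_of.get(element)

    # Rule 2: a P tag with the given name or id; return the paragraph itself.
    for element in root.iter():
        if local_name(element) == "p" and anchor_name in (
            element.get("name"),
            element.get("id"),
        ):
            return element

    return None
```

With the H1 example above, asking for "first_section" returns the H1 rather than the empty anchor.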

Now we get to the fun part: addressing schemes. After the anchor you can give an addressing scheme, such as:

http://codinginparadise.org/paperairplane#quote(What if community and editing...research and coding efforts can steer towards.)

Address schemes always have the form:

#address(input to address type)

These address schemes are meant to grab portions of the resource. Right now there are only two schemes, a quote() scheme and an xpath() scheme. The only scheme that must be supported right now is the quote() scheme, since the xpath() scheme has shown itself to be of limited usability.
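Putting the three kinds of 'cite' values together, a user agent first has to decide which case it is looking at. A minimal sketch, assuming the fragment is not URL encoded; parse_cite and the labels "whole" and "anchor" are names I made up for illustration:

```python
import re

# scheme(input), e.g. quote(Start...End) or xpath(//p[1])
SCHEME_RE = re.compile(r"^([A-Za-z]+)\((.*)\)$", re.DOTALL)


def parse_cite(cite):
    """Split a cite value into (base_uri, kind, argument).

    kind is "whole" (no fragment), "anchor" (simple anchor name),
    or the lowercased scheme name such as "quote" or "xpath".
    """
    base, separator, fragment = cite.partition("#")
    if not separator or not fragment:
        return base, "whole", None
    match = SCHEME_RE.match(fragment)
    if match:
        return base, match.group(1).lower(), match.group(2)
    return base, "anchor", fragment
```

A scheme name other than quote or xpath would then be reported as an error by the caller, since those are the only two schemes defined right now.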

The quote() scheme always has the following form:

#quote(Start of quote...End of quote)

where I give the start of the quote, followed by three dots, followed by the end of the quote.

For example, if I had the following markup:
<p>What
if community and editing were a central and transparent part of the
web and browsers?</p>
<p>What
if the web was extremely integrated for usability, with instant
messaging, site creation, the web server, and more all integrated
into one whole?</p>
<p>What would this web look like if despite being
integrated it was massively decentralized on a peer-to-peer network,
able to exist and run without businesses or governments?</p>
and I wanted to grab some of the first question and some of the last question (everything from "were a central and transparent part" through "massively decentralized"):
<p>What
if community and editing were a central and transparent part of the
web and browsers?</p>
<p>What
if the web was extremely integrated for usability, with instant
messaging, site creation, the web server, and more all integrated
into one whole?</p>
<p>What would this web look like if despite being
integrated it was massively decentralized on a peer-to-peer network,
able to exist and run without businesses or governments?</p>
I might do the following:

#quote(were a central and transparent part...massively decentralized)

This would cause the following fragment to be returned:
<p>were a central and transparent part of the
web and browsers?</p>
<p>What
if the web was extremely integrated for usability, with instant
messaging, site creation, the web server, and more all integrated
into one whole?</p>
<p>What would this web look like if despite being
integrated it was massively decentralized</p>
Notice that the HTML is correct; we don't just grab the substring and return incorrect HTML. A 'dumb' result would look like this:
were a central and transparent part of the
web and browsers?</p>
<p>What
if the web was extremely integrated for usability, with instant
messaging, site creation, the web server, and more all integrated
into one whole?</p>
<p>What would this web look like if despite being
integrated it was massively decentralized
This would be useless, since we would 99% of the time get bad markup and no one would want to use this system.

Now, doing this right is hard, which is why I suggest you cheat: let someone else do the hard work. For example, the server-side portion of the JavaScript Purple Include uses the Xerces DOM Range support. The DOM Range spec (and Xerces implementation) lets you specify a range that might cut across various elements of an HTML document. You simply set the beginning of the range, then the end of the range. This is what I do on the server-side, and I suggest you do the same since the DOM Range stuff will do all the hard work of correctly closing start and end tags that cut across ranges. It transforms what would probably be several weeks of work into several hours of work (which is how long it took me to do the quote() scheme myself). Once you have the range, you can simply ask it to give you its contents using cloneContents(), which will have everything correctly set up in the markup.

Here's another tip that will make things easier to implement. On the server-side I also use the DOM Traversal functionality of Xerces to grab all the text and CDATA nodes, and then just iterate over all of these to find the start and end strings. The DOM Traversal stuff is another nifty spec that lets you grab just some type of nodes.

One tricky thing you will need to keep in mind is that the start and end strings might fall across different text nodes so you should match strings that fall across node boundaries, and also remember to turn off hidden whitespace so you match correctly. Man I wish I could just give you the code (which can be viewed here in the QuoteAddress class), but unfortunately due to its heritage it is under a GPL license [Note: Eugene, can we just relicense this all as BSD code?]. Viral licenses are a pain in the butt. Studying the algorithm should be ok [Note: is that correct? Studying can't also be viral].
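To make the node-boundary problem concrete, here is a small Python sketch of one way to do the search. It is not the QuoteAddress algorithm, just an illustration under my own assumptions: whitespace runs are collapsed to a single space before matching (my reading of "turn off hidden whitespace"), the end string is searched for only after the start string, and the names text_nodes, build_index, and locate are hypothetical.

```python
from xml.dom import minidom


def text_nodes(node):
    """Yield text and CDATA nodes in document (pre-order) order."""
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            yield child
        else:
            yield from text_nodes(child)


def build_index(root):
    """Return (normalized_text, positions); positions[i] is the
    (node, offset) that character i of normalized_text came from."""
    chars, positions = [], []
    for node in text_nodes(root):
        for offset, ch in enumerate(node.data):
            if ch.isspace():
                if chars and chars[-1] == " ":
                    continue  # collapse a run of whitespace
                ch = " "
            chars.append(ch)
            positions.append((node, offset))
    return "".join(chars), positions


def locate(root, start_text, end_text):
    """Find the first start string and the first end string after it.

    Returns ((start_node, start_offset), (end_node, last_char_offset)),
    or None when either string is missing. The end offset points at the
    last matched character, so a DOM Range end would be that offset + 1.
    """
    haystack, positions = build_index(root)
    start = " ".join(start_text.split())
    end = " ".join(end_text.split())
    if not start or not end:
        return None
    begin = haystack.find(start)
    if begin == -1:
        return None
    finish = haystack.find(end, begin + len(start))
    if finish == -1:
        return None
    return positions[begin], positions[finish + len(end) - 1]
```

Because the search runs over one concatenated string, a start or end string that straddles several text nodes (say, because of a B or A tag in the middle) still matches, and the positions list maps the match back to real nodes and offsets.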

If you manage to do all this successfully (and without bugs) using SAX, good luck. If it is Java, send me the code when you do so I can replace what I have.

Some final notes about the quote() scheme: if you want to use a parenthesis in the quote scheme, just backslash it:

#quote(This is some text\(...and here is some text that ends with a parenthesis\))
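Splitting the quote() input into its two halves and undoing the backslash escapes might look like the sketch below. parse_quote_argument is a hypothetical name, and splitting on the first run of three dots is my assumption, since the proposal does not say what happens when the quoted text itself contains "...".

```python
import re


def parse_quote_argument(argument):
    """Split 'start...end' and unescape backslashed parentheses."""
    # Assumption: the first '...' is the separator.
    start, separator, end = argument.partition("...")
    if not separator or not start or not end:
        raise ValueError("quote() input must look like: start...end")

    def unescape(text):
        # Turn \( and \) back into plain parentheses.
        return re.sub(r"\\([()])", r"\1", text)

    return unescape(start), unescape(end)
```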

Also, you should scan from the top of the document to the bottom in the same way that a human would read the document (i.e. traverse the document using pre-order traversal). This means that if someone gives a start or end string that has multiple matches, the one that will be found is the one that occurs first in the document.

I mentioned that there is an xpath() scheme. You don't need to support this, since it turned out not to be useful for the majority of users (the quote scheme is much more usable), but it can be fun to have for more obscure and advanced usage. If you want to get some extra credit and implement it, the one thing to keep in mind is that it must support XPath version 2 and not just XPath version 1. What this means in practice is that your user agent must use an XPath version 2 parser that can handle things like the following:

http://codinginparadise.org/paperairplane#xpath(for $i in (4 to 10) return //p[$i])

The reason for this is that XPath version 2 has some extra functionality that makes being able to have an xpath() scheme actually useful, like the 'for' loop above, while XPath version 1 is just too limited to be useful for this use case.

After doing the above, you should filter what you return to the client to prevent XSS (cross-site scripting) attacks based on the returned content. You should:
  • Strip out SCRIPT blocks
  • Strip out javascript: URLs
  • Strip out eval() values in inline CSS
Everything else should be left in the returned values. [Note: should we get into more details on stripping out SCRIPT blocks and javascript: urls since there is a little trickery here?]. One of the chief reasons Firefox would never include, um, inclusions was because of XSS attacks, but since the nifty algorithm above helps to prevent them... maybe this stuff will show up in teh (yes, teh) browser.
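Here is a rough Python sketch of the three filtering rules, assuming the fragment is already well-formed XHTML. It only covers what the list above names and should not be mistaken for a complete XSS filter. The function names are hypothetical, and checking every attribute (not just href and src) for javascript: URLs is my own choice.

```python
import re
import xml.etree.ElementTree as ET


def local_name(element):
    """Tag name without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def is_javascript_url(value):
    # Drop whitespace and control characters, which can hide the scheme.
    compact = re.sub(r"[\x00-\x20]+", "", value).lower()
    return compact.startswith("javascript:")


def remove_keeping_tail(parent, child):
    """Remove child but keep the text that followed it in the document."""
    tail = child.tail or ""
    index = list(parent).index(child)
    if index > 0:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    else:
        parent.text = (parent.text or "") + tail
    parent.remove(child)


def sanitize(root):
    """Strip SCRIPT blocks, javascript: URLs, and eval() in inline CSS."""
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child).lower() == "script":
                remove_keeping_tail(parent, child)
    for element in root.iter():
        for name, value in list(element.attrib.items()):
            if is_javascript_url(value):
                del element.attrib[name]
            elif local_name(element) != name and name.lower() == "style" and "eval(" in value.lower():
                del element.attrib[name]
    return root
```

Everything else, including the text that followed a removed SCRIPT block, is left in place.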

Since the current web does not allow sites to easily work cross-domain, the JavaScript Purple Include has a server-side that 'proxies' all this stuff. The JavaScript Purple Include defaults to my web site, codinginparadise.org (I know I'm going to regret that some day... or maybe have a happy weblogs.com payoff). You can change this with a META tag; if your user agent does something similar, you should have the same META tag:

<meta name="purple.include.addressService" content="http://brad.com:8000/purple_include/"></meta>
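A server-side user agent that wants to honor the same setting could read it like this. This is only a sketch: find_address_service is a hypothetical name, and returning None when the tag is absent (rather than guessing at a default service URL) is my choice.

```python
from html.parser import HTMLParser


class AddressServiceFinder(HTMLParser):
    """Remembers the first purple.include.addressService META value."""

    def __init__(self):
        super().__init__()
        self.address_service = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.address_service is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == "purple.include.addressService":
            self.address_service = attr_map.get("content")


def find_address_service(html_text):
    """Return the configured address service URL, or None if not set."""
    finder = AddressServiceFinder()
    finder.feed(html_text)
    finder.close()
    return finder.address_service
```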

One final note for server-side folks; remember that when you are working with a Q or BLOCKQUOTE tag that you should automatically add quotes to Q elements. In fact, here is what the HTML spec says about these using a Purple Include:




Whew, there you go; you kids have fun. :)

Comments

Ian Bicking said…
Some comments:

Q and BLOCKQUOTE fit the quotation use case, but don't fit other use cases well. For instance, a common kind of transclusion currently done is to include statistics in a sidebar. It could potentially be used for server-side page composition. Using these semantically weighty elements seems restrictive.

Potentially the notion of letting any element have a src attribute, as in XHTML, would be equivalent to what you are thinking? I always see that referred to in terms of images, but there's no reason why you couldn't include HTML. That this overlaps with the XHTML specification is a bit annoying, and might make it hard to reconcile.

For XHTML, are the cite/embed attributes allowed? If not, a namespace will have to be used.

This doesn't talk about relative links. Ideally those links should be resolved. Doing this in Javascript sucks, probably another server-side task.

Generally #fragment URLs that refer to an id refer to a specific element suitable to be included. But when the fragment refers to a named anchor your technique is likely necessary. Though the difference between [a name=foo][h1]...[/h1][/a] and [h1][a name=foo]...[/a][/h1] will be hard to explain. That said, I think it's a useful feature, and fits non-quotation use cases better than #quote(). #xpath() serves a similar use case when there are formal relationships between pages (typical users may only be able to get these URLs by copying them, but that's a reasonable and already common use case for permalinks).

I like using the A tag for this, as it (mostly) fits with HTML's model already, and degrades quite gracefully. But it's not a perfect degradation; it would be better if you could include a quote, include a citation, and then have the citation load the live quote from the destination. But that feels really hard to compose as an author. Ugh. Redundancy makes the quote more reliable; I'm not sure how well this can work when you don't have services that are specifically intended to be used this way. What about when the other server goes down? Are you letting people change history too much? For quotations I'm unlikely to want a live quote.

Anyway, that was an aside. Using an A fits well with WYSIWYG editors; there just has to be an addition to mark the A as something suitable to be included. In my own code I've used rel="include". In theory that's wrong, as it's not a registered relation.

For #quote(), I believe there's already work on extending fragments in several ways. This seems compatible with that, but some attention to that would be useful, and seeing if there is useful overlap with existing work.

For #quote() there's more than just well-formedness, but also validity. I don't know if that should be addressed at all. But the case is if, say, you get a TD element, and transclude it somewhere that's invalid. Usually browsers handle this somewhat gracefully at least.

XSS concerns are greater than what you list. I'm not sure there's any single canonical list of things you have to do to ensure safety. There's weird ironies here too, e.g., EMBED is possible to handle but technically deprecated. OBJECT is infeasible to clean well or allow any opt-ins (like embedded youtube videos), but in theory is supposed to replace EMBED. Now everyone uses EMBED because of cleaners.

Failure cases for fragments would probably be good. Should you just give an error, or try to return an only partially valid response?

Should the subrequests use the same cookies, etc., as the original request? When they are on the same domain? Probably not; subrequests should probably be generic and not tied to the user. But I'm not sure; there are some useful cases to making subrequests with cookies and whatnot (when it's possible).

You might want to refer to the recent proposed extensions to Cache-Control (http://www.mnot.net/blog/2007/12/12/stale) as they are useful for this case.

I'm -1 on converting to XHTML. It doesn't feel useful, and could be anti-useful.

One thing we've done is try to merge the HEAD tags when transcluding. I'm not sure this is a good idea. But it's hard, because some content is really dependent on external stylesheets or scripts. Of course, scripts have XSS concerns, and that only makes any sense with trusted links. Stylesheets unfortunately frequently have conflicts. Maybe inline STYLE tags are the only reasonable way to address this. It would also be possible to make all the styles inline as part of the server transformation. This may or may not be desired by the person doing the transclusion; certainly both cases are reasonable. Possibly even class names could be munged to avoid unintentional overlap; though again this may or may not be desirable; there's no general rule. CSS classes kind of suck.

Anyway, that's my thoughts so far.
Brad Neuberg said…
@ian: Hi Ian, good to see your comments. I'll give a longer reply soon, but I just wanted to touch on the approach I'm trying to take.

In general, hypertext technologies, including the ones on the web, have not had a very successful history. The tech field is littered with hypertext technologies that have not been adopted, including Ted Nelson's work, much of Engelbart's, XLink, XPointer, and more. I am trying to take a much more pragmatic, piecemeal approach to evolving this stuff, very similar to the way that blog technologies evolved. That means using the quote() scheme rather than the unwieldy XPointer scheme; not trying to get everything perfect when it comes to validity of bringing in remote documents; not trying to be so general that the use-cases for this stuff become hard to figure out; and not getting lost by perfectly aligning with the rest of the standards world around XHTML and registering 'rel' tags. The approach I've been trying to take is similar to Alex Russell's, which he blogs about here (http://alex.dojotoolkit.org/?p=642), as well as Ian Hixie's with the HTML 5 work (http://www.whatwg.org/specs/web-apps/current-work/).

With that said, you provide many good ideas that I'll respond to at length in a bit. I have to run out the door and go grab some lunch right now :) Thanks for your feedback and the cool stuff you have also been doing around transclusions. I look forward to collaborating.

Best,
Brad
Ian Bicking said…
One of the inspirations for me for using A is microformats, which as a general principle use HTML as-is without adding new tags or attributes, and using links for links. (In fact, with that in mind, why META and not [link rel="purple.include.server" href="..."]? -- OpenID also uses LINK I think)
Anonymous said…
A couple of notes.

A robot might not actually have a DOM parser.

This is one of the reasons I don't like microformats for robots because they could become too fragile. :-/

Also, does quote() need to be URL encoded?

Good stuff!

Kevin
Brad Neuberg said…
@kevin: Hey Kevin, good to hear from you.

You don't have to use DOM for a robot; I just suggested that because it makes implementation much easier. If you use SAX I'm very interested to hear how you would approach the problem. Is there SAX Range support?

Also, for the 'cite' attribute, the value might not be URL encoded since users will be creating their HTML by hand and might not know they need to URL encode things (plus it's much easier to read this stuff in the source when it's not URL encoded). So expect that this attribute might or might not be URL encoded, and that there might be real spaces (I believe that URLs in HTML all share this issue, such as in the A or LINK tags).

Best,
Brad