2003-10-31

UTF-8+

Tim Bray mines a rich seam in ongoing. He's best known for his XML and Web expertise, of course, but I've also found his advocacy and exposition of Unicode, UTF, etc. most instructive. (I've been hacking together a multi-lingual website recently.)

Of these, I suspect that his recent essay motivating and proposing The UTF-8+names Unicode Encoding Form will prove by far the most significant in the long run. It's one of those ideas that just has to be more-or-less right, if only because (with the smug omniscience of hindsight) it's so blatantly obvious!

The devil's in the details, of course. But before picking nits in the current Internet-Draft, let's poke around at the boundaries of the concept.

Wouldn't it be nice if …?

A number of thoughts/questions spring to mind:

Numeric character references

Although designed with XML in mind, the UTF-8+names encoding has no dependence whatsoever on any aspect of XML and may be freely used in other textual contexts.

Numeric character references would add nothing for XML usage, of course, but I can't help wondering whether including them (e.g.: &#191;, &#xBF;) alongside named replacements might make the encoding form more useful in these other textual contexts.

UTF-8+?

Especially if numeric character references were included as well, might plain UTF-8+ be a better name than UTF-8+names?

(It's also more extensible for other developments, as well as being less Anglo-centric.)

[I'd also thought of UTF-8+replacements, but on balance UTF-8+ seems better.]

Escaping &

[UTF-8+names I-D; §5] describes the &&; replacement, while Tim's essay mentioned that &; had also been suggested, and that the jury was still out.

However, [UTF-8+names I-D; §6,7] include the XHTML 1.0 and MathML 2.0 entities by reference, so we also have &amp;. (If numeric character references were included, there'd be &#38; and &#x26; as well!)

I can see that &&; / &; are 'simpler' (both to use and also to define in the I-D/RFC) than &amp;, but – given the mindshare &amp; inherits from XML/(X)HTML – are they so much better as to justify the duplication?

(Related: &&; may also interact with an ambiguity.)

Why just UTF-8+?

Tim also asks: “Another open question is, why is this based on UTF-8? It could be based on UTF-16 or ISO-8859-1 or even US-ASCII.”

Would it not be possible to frame the specification in generic terms, along the lines of RFC3023's +xml convention for MIME/media types?

In other words, one could specify that for any base encoding E there is a derived encoding E+ that is the same except that, if E contains the characters '&' and ';', the corresponding E+ allows for replacements; more specifically:
1. E+ has decimal numeric character references if E contains† the characters '#' and '0'–'9' (and hexadecimal ones if E also contains 'x'/'X' and 'a'/'A'–'f'/'F');
2. E+ has replacements for those named entities from XHTML 1.0 and MathML 2.0 for which E contains† the relevant characters.

† – and/or gets by nested replacement.

As well as UTF-8+, you thereby get UTF-16+, ISO-8859-1+, us-ascii+, windows-1252+, etc.

[It is not totally clear to me whether it's even necessary that the base encoding has '&' and ';' (or any of the other relevant characters) at their ISO/IEC 10646/Unicode positions for this to work (though that might make the definition less fraught); what about ebcdic+?]
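
To make the generic E+ idea a little more concrete, here is a minimal Python sketch (mine, not anything taken from the Internet-Draft). It decodes the bytes with the base encoding E and then resolves replacements in a single pass. The names `decode_plus`, `resolve`, `NAMED` and `REPLACEMENT` are invented for illustration, the four-entry table merely stands in for the real XHTML 1.0 and MathML 2.0 sets, and I've assumed that an undefined replacement stands for itself.

```python
import re

# Hypothetical, deliberately tiny replacement table; a real one would carry
# the full XHTML 1.0 and MathML 2.0 entity sets.
NAMED = {"amp": "&", "nbsp": "\u00a0", "times": "\u00d7", "minus": "\u2212"}

# '&', then a decimal reference, a hexadecimal reference, or a name that
# contains no '&', ';' or '#', then the terminating ';'.
REPLACEMENT = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[^&;#]+);")


def resolve(match):
    """Return the replacement value for one matched '&...;' sequence."""
    body = match.group(1)
    if body[:2] in ("#x", "#X"):
        return chr(int(body[2:], 16))   # hexadecimal numeric reference
    if body.startswith("#"):
        return chr(int(body[1:]))       # decimal numeric reference
    # Assumption: an undefined name leaves the whole sequence untouched.
    return NAMED.get(body, match.group(0))


def decode_plus(data: bytes, base: str) -> str:
    """Decode `data` as E+ where E is the Python codec named `base`."""
    # re.sub works left to right and never rescans its own output,
    # so there is no nested replacement in this sketch.
    return REPLACEMENT.sub(resolve, data.decode(base))
```

Because the replacement step runs on already-decoded text, the same function serves for `"utf-8"`, `"utf-16"`, `"iso-8859-1"`, `"ascii"` and even an EBCDIC codec such as `"cp037"`; only the `base` argument changes.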

A yet-more generic scheme?

Internationalization and future-proofing (not to mention pedantry ☺) raise the similar question of why (X)HTML and MathML names should be singled out for special recognition. (Indeed, Tim noted: “The jury's still out on whether HTML and MathML are the right sets to adopt.”)

Since we can only guess at how people might choose to use E+ encodings, perhaps it would be best to defer the decision to them? For one way in which this might be achieved, consider replacing clause 2 above with something like:
- E+X+Y+Z has replacements for those named entities from replacement sets X, Y and Z for which E contains‡ the relevant characters.
  
  [To allow users control over any conflicts that might arise, decoding should be deterministic – matching replacements from X first, then from Y, then Z.]

‡ – and/or gets by nested replacement (including from replacement sets earlier in the matching sequence).

Users could then define replacement sets containing the entities not only from XHTML 1.0 and MathML 2.0, but also any others of interest – for example, those for Braille or musical notation, from esoteric scripts such as Ogham or Runic, or the DocBook XML Character Entities – using as few or as many of them as appropriate to the particular application at hand.

[If desired, though it hardly seems worth it, one could treat the plain E+ as a shorthand for E+xhtml+mathml (say), thereby using the (X)HTML and MathML replacement sets mentioned in the current Internet-Draft as the defaults to be used when none are explicitly specified.]
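
A rough Python sketch of how an E+X+Y+Z label might be unpacked and consulted follows. Everything in it is an assumption for illustration: the function names `parse_label` and `lookup`, the registry, and the two made-up replacement sets `x` and `y` (which deliberately disagree about the name `star`). The only behaviour it is meant to show is the deterministic "X first, then Y, then Z" matching order, plus the optional plain-E+ shorthand.

```python
# Hypothetical replacement sets that conflict on the name "star".
REGISTRY = {
    "x": {"star": "\u2605"},                    # black star
    "y": {"star": "\u2606", "note": "\u266a"},  # white star, eighth note
}

# Assumed defaults for a bare "E+" label, as in the shorthand floated above.
DEFAULT_SETS = ("xhtml", "mathml")


def parse_label(label):
    """Split 'E+X+Y' into ('E', ['X', 'Y']); a bare 'E+' gets the defaults."""
    base, plus, rest = label.partition("+")
    if not plus:
        return base, []                  # plain E: no replacements at all
    names = rest.split("+") if rest else list(DEFAULT_SETS)
    return base, names


def lookup(name, set_names, registry=REGISTRY):
    """Return the value of `name` from the first listed set that defines it."""
    for set_name in set_names:
        table = registry[set_name]
        if name in table:
            return table[name]
    return None                          # undefined in every listed set
```

With these tables, `lookup("star", ["x", "y"])` and `lookup("star", ["y", "x"])` give different characters, which is exactly the kind of control over conflicts that the ordering is supposed to hand to the user.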

Nit picking

Regardless of whether or not any of the above ideas merit further consideration, there are a few (fixable) problems with the current Internet-Draft, of which the first two seem substantive:

Inconsistency

The defining text and the example in [UTF-8+names I-D; §4] are inconsistent!

Using the definitions in [UTF-8+names I-D; §3, ¶2], the first sentence specifies that an undefined replacement (e.g.: "&U2;") is replaced by its replacement name (i.e.: "U2"), but the example clearly states that it represents "&U2;".

For the example to hold as stated, the final word of the first sentence would have to be struck – viz.: … the replacement value of an undefined replacement is identical to the replacement itself [rather than the replacement name].

Ambiguity

The specification of a replacement name is surprisingly imprecise. In particular, it doesn't say what should happen if the replacement name contains '&' and/or ';' (except that, by defining &&;, the former at least is implicitly allowed – on the other hand, see below).
This manifests itself in at least two ways:
1. Nested replacements?
  
  For example, (assuming that all of the possible replacements are undefined and therefore represent themselves) what does “abc&pq&r;st;xyz” represent?
  - an encoding error?
  - abc&pq&r;st;xyz itself:
    - because &pq&r; is itself, leaving &pq&r;st;, which is itself, …? [first & to first ;]
    - because &pq&r;st; is itself? [first & to 'matching' ;]
    - because &r; is itself, leaving &pq&r;st;, …? [innermost &…; pair]
    - because &r;st; is itself, leaving &pq&r;st;, …?
    - for some other reason?
  - abcpqrstxyz?
  - something else?
  While that horror story suggests that nesting should not be allowed, there are cases where it might prove useful. For a silly example, suppose that the S key on your keyboard has broken(!); you could still get a non-breaking space by a circumlocution such as &nb&#115;p; (where &#115; is the numeric character reference for 's'). More realistic situations where nesting would be valuable could easily occur if replacement sets involving internationalized names were supported.
2. Overly-greedy matching
  
  It is also necessary to specify that a&times;b&minus;c really is a×b−c (matching &times; and &minus;) rather than a&times;b&minus;c (failing to match &times;b&minus;)!
The simplest solution must surely be to ban nesting entirely, by stipulating that a replacement is terminated by the first ';' to follow the opening '&', and that any intervening '&' is an encoding error. Well-formed nesting should also be easy to accommodate; after all, '&' and ';' are just as distinct as '(' and ')'.

[It should be noted that both of these formulations would exclude the &&; escape sequence; one could either allow it as a special case, or use &amp; instead.]
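
The "ban nesting entirely" rule is simple enough to sketch in a few lines of Python. This is only my reading of it, with invented names (`tokenize`, `EncodingError`, `allow_double_amp`): a replacement runs from '&' to the first following ';', an intervening '&' is an error, and &&; is optionally let through as a special case. Rejecting a lone '&' that has no later ';' is an extra assumption of the sketch, not something stated above.

```python
class EncodingError(ValueError):
    """Raised when the input breaks the no-nesting rule."""


def tokenize(text, allow_double_amp=True):
    """Split text into ('text', s) and ('replacement', name) tokens."""
    tokens = []
    i = 0
    while i < len(text):
        amp = text.find("&", i)
        if amp < 0:                          # no more replacements
            tokens.append(("text", text[i:]))
            break
        if amp > i:                          # literal run before the '&'
            tokens.append(("text", text[i:amp]))
        if allow_double_amp and text.startswith("&&;", amp):
            tokens.append(("replacement", "&"))   # special-cased escape
            i = amp + 3
            continue
        semi = text.find(";", amp + 1)       # first ';' after the '&'
        if semi < 0:
            raise EncodingError("'&' with no terminating ';'")
        name = text[amp + 1:semi]
        if "&" in name:
            raise EncodingError("'&' inside replacement name: " + name)
        tokens.append(("replacement", name))
        i = semi + 1
    return tokens
```

Under this rule the greedy-matching worry disappears (`a&times;b&minus;c` can only split into `times` and `minus`), and the nested horror `abc&pq&r;st;xyz` is rejected outright rather than interpreted.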

Inconvenience

I can see just one thing that E+ encodings make harder: getting certain escaped markup into an XML document. ☹
This is a particular issue for the XML-predefined entities &lt;, &gt;, &amp;, &apos; and &quot;, all of which are also defined in both XHTML and MathML.
Let's try for &amp;. You can't just enter &amp;, of course; UTF-8+names maps that to & straightaway (almost certainly confusing the XML processor in the process). &&;amp; or &amp;amp; don't work, either: UTF-8+names will replace the initial &&;/&amp;, leaving &amp;, but XML will then reduce that to &. I think that &&;amp;amp; / &amp;amp;amp; might do the trick. Similar shenanigans for the other XML-predefined entities &lt;, &gt;, &apos; and &quot;.

(And if you think that looks hairy, try describing it … ☺)
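
Since describing it is so hairy, the two decoding layers can instead be simulated. The following Python sketch is an assumption-laden toy: `decode_names` is my own one-entry stand-in for a UTF-8+names decoder (single pass, output never rescanned, undefined names left alone), and Python's standard `xml.etree.ElementTree` plays the part of the XML processor via the helper `xml_text`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical one-entry replacement set; enough to show the layering.
NAMED = {"amp": "&"}


def decode_names(text):
    """First layer: resolve '&&;' and named replacements in one pass."""
    def resolve(match):
        if match.group(0) == "&&;":
            return "&"
        # Assumption: an undefined name leaves the sequence untouched.
        return NAMED.get(match.group(1), match.group(0))
    return re.sub(r"&&;|&([^&;]*);", resolve, text)


def xml_text(source):
    """Second layer: parse `source` as element content, return its text."""
    return ET.fromstring("<t>" + source + "</t>").text


def both_layers(raw):
    """What the application finally sees for the raw encoded characters."""
    return xml_text(decode_names(raw))
```

Feeding in raw `&&;amp;amp;` or `&amp;amp;amp;` leaves the application with the five characters `&amp;`; raw `&amp;amp;` ends up as a bare `&`; and raw `&amp;` makes the XML parser raise `ET.ParseError`, because the first layer has already turned it into an unescaped ampersand.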

Clerical errors?
- The titles of sections 6 and 7 are “The HTML Replacement Set” and “The MathML Representation Set” respectively; I'm guessing that the latter's “Representation” should probably be “Replacement”.
- Reference [1] gives http://www.w3.org/TR/1998/REC-xml as the URI for the XML 1.0 specification, but this link does not resolve …
- Reference [2] has a couple of typos: Standords, Organizotion; these are also present in the title attribute of the citing “SGML” link in section 1.
