2003-11-01
Welcome!
I'm pleased to announce that I've finally converted a few of my ideas
and aspirations for eXtremities
into some semblance of reality.
What's here now isn't quite how I envisioned things, but that may well
be for the better. To be honest, I'm not sure I ever did know exactly
what I did have in mind, and I don't think it really matters: the
journey is more important than the final destination.
Moreover, I've learnt (several times, so maybe not) that what you
think you want doesn't always fit the bill, so it's important to be
aware of what feels right and works for you, and what
doesn't.
That's why I'm evolving things myself rather than using one of the
many highly-regarded tools that are out there: not only is it more
fun, satisfying (when things work) and instructive (when they don't),
but until I figure out what I'm looking for I can't know which
system(s) are worth trying.
As the Chinese proverb says:
if I read, I forget;
if I see, I remember;
if I do, I understand.
What now?
I've got a bunch of notes in various stages of preparation, which I
hope to post here in the near future.
I also plan to evolve both infrastructure and content in various
directions – some of which are still unknown even to me.
Comments on what I've got thus far, what you'd like to see, and/or what you'd suggest I look at would therefore be most welcome.
All the best —
David Bruce
2003-10-31
UTF-8+
Wouldn't it be nice if …?
A number of thoughts/questions spring to mind:
-
Numeric character references
Although designed with XML in mind, the UTF-8+names encoding has no dependence whatsoever on any aspect of XML and may be freely used in other textual contexts. They would not add anything for XML usage, of course, but I can't help wondering whether including numeric character references (e.g.: &#63;, &#191;) as well as named replacements might not make the encoding form more useful for these other textual contexts.
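To make the suggestion concrete, here is a minimal sketch (my own illustration, not anything from the I-D) of how a decoder might handle such numeric character references; the function name and regular expression are invented for the example:

```python
import re

# Sketch only: expand decimal (&#63;) and hexadecimal (&#xBF;)
# numeric character references; anything else passes through.
NCR = re.compile(r'&#([xX][0-9a-fA-F]+|[0-9]+);')

def decode_ncrs(text):
    def expand(match):
        body = match.group(1)
        if body[0] in 'xX':
            return chr(int(body[1:], 16))   # hexadecimal form
        return chr(int(body))               # decimal form
    return NCR.sub(expand, text)

print(decode_ncrs('&#63;'))   # ?
print(decode_ncrs('&#191;'))  # ¿
```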
-
UTF-8+?
Especially with numeric character references as well, maybe just UTF-8+ might be a better name than UTF-8+names? (It's also more extensible for other developments, as well as being less Anglo-centric.) [I'd also thought of UTF-8+replacements, but on balance UTF-8+ seems better.]
-
Escaping &
[UTF-8+names I-D; §5] describes the &&; replacement, while Tim's essay mentioned that &; had also been suggested, and that the jury was still out. However, [UTF-8+names I-D; §6,7] include the XHTML 1.0 and MathML 2.0 entities by reference, so we also have &amp;. (If numeric character references were included, there'd be &#38; and &#x26; as well!) I can see that &&; / &; are 'simpler' (both to use and also to define in the I-D/RFC) than &amp;, but – given the mindshare &amp; inherits from XML/(X)HTML – are they that much better as to justify the duplication? (Related: &&; may also interact with the ambiguity discussed below.)
-
Why just UTF-8+?
Tim also asks: Another open question is, why is this based on UTF-8? It could be based on UTF-16 or ISO-8859-1 or even US-ASCII. Would it not be possible to frame the specification in generic terms, along the lines of RFC 3023's +xml convention for MIME/media types? In other words, one could specify that for any base encoding E there is a derived encoding E+ that is the same except that if E contains the characters '&' and ';', the corresponding E+ allows for replacements; more specifically:
-
(a) E+ has decimal numeric character references if E contains the characters '#' and '0'–'9' (and hexadecimal if E also contains 'x'/'X' and 'a'/'A'–'f'/'F');
-
(b) E+ has replacements for those named entities from XHTML 1.0 and MathML 2.0 for which E contains the relevant characters.
As well as UTF-8+, you thereby get UTF-16+, ISO-8859-1+, us-ascii+, windows-1252+, etc. [It is not totally clear to me whether it's even necessary that the base encoding has '&' and ';' (or any of the other relevant characters) at their ISO/IEC 10646/Unicode positions for this to work (though that might make the definition less fraught); what about ebcdic+?]
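As a rough illustration of the two clauses above – purely my own sketch, using Python's codec machinery as a stand-in for "E contains the character" – one might probe a base encoding like so:

```python
# Sketch only: decide which replacement machinery a derived E+
# could offer, treating "E contains the character" as "the Python
# codec for E can encode it".
def encodable(ch, encoding):
    try:
        ch.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        return False

def eplus_features(encoding):
    replacements = all(encodable(c, encoding) for c in '&;')
    decimal = replacements and all(encodable(c, encoding) for c in '#0123456789')
    hexadecimal = decimal and all(encodable(c, encoding) for c in 'xXabcdefABCDEF')
    return {'replacements': replacements,
            'decimal_ncrs': decimal,
            'hex_ncrs': hexadecimal}

print(eplus_features('us-ascii'))  # all True
print(eplus_features('cp500'))     # EBCDIC has the characters too,
                                   # albeit at different code positions
```

Note that cp500 (an EBCDIC codec) passes the "contains the characters" test even though '&' and ';' sit at non-ASCII code positions there – which is exactly the ebcdic+ question left open above.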
-
A yet-more generic scheme?
Internationalization and future-proofing (not to mention pedantry ☺) beg the similar question of why (X)HTML and MathML names should be singled out for special recognition. (Indeed, Tim noted: The jury's still out on whether HTML and MathML are the right sets to adopt.) Since we can only guess at how people might choose to use E+ encodings, perhaps it would be best to defer the decision to them? For one way in which this might be achieved, consider replacing clause (b) above with something like:
-
E+X+Y+Z has replacements for those named entities from replacement sets X, Y and Z for which E contains the relevant characters. [To allow users control over any conflicts that might arise, decoding should be deterministic – matching replacements from X first, then from Y, then Z.]
Users could then define replacement sets containing the entities not only from XHTML 1.0 and MathML 2.0, but also any others of interest – for example, those for Braille or musical notation, from esoteric scripts such as Ogham or Runic, or the DocBook XML Character Entities – using as few or as many of them as appropriate to the particular application at hand. [If desired, though it hardly seems worth it, one could treat the plain E+ as a shorthand for E+xhtml+mathml (say), thereby using the (X)HTML and MathML replacement sets mentioned in the current Internet-Draft as the defaults to be used when none are explicitly specified.]
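A toy sketch of how the deterministic X-then-Y-then-Z lookup might behave (the set contents here are invented stand-ins, not the real XHTML/MathML tables):

```python
import re

# Sketch only: named replacements are looked up in set X first,
# then Y, then Z, so any conflict is resolved deterministically.
def make_decoder(*replacement_sets):
    pattern = re.compile(r'&([^&;]+);')
    def decode(text):
        def lookup(match):
            name = match.group(1)
            for rset in replacement_sets:   # X, then Y, then Z
                if name in rset:
                    return rset[name]
            return match.group(0)           # undefined: represents itself
        return pattern.sub(lookup, text)
    return decode

xhtml = {'amp': '&', 'nbsp': '\u00a0'}      # tiny stand-in for XHTML 1.0
music = {'amp': '<conflicting amp>'}        # deliberately conflicting set
decode = make_decoder(xhtml, music)         # xhtml listed first, so it wins
print(decode('&amp;'))       # &
print(decode('&crotchet;'))  # &crotchet;  (undefined, so itself)
```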
Nit picking
Regardless of whether or not any of the above ideas merit further consideration,
there are a few (fixable) problems with the current Internet-Draft,
of which the first two seem substantive:
-
Inconsistency
The defining text and the example in [UTF-8+names I-D; §4] are inconsistent! Using the definitions in [UTF-8+names I-D; §3, ¶2], the first sentence specifies that an undefined replacement (e.g.: "&U2;") is replaced by its replacement name (i.e.: "U2"), but the example clearly states that it represents "&U2;". For the example to hold as stated, the final word of the first sentence would have to be struck – viz.: … the replacement value of an undefined replacement is identical to the replacement itself [rather than the replacement name].
-
Ambiguity
The specification of a replacement name is surprisingly imprecise. In particular, it doesn't say what should happen if the replacement name contains '&' and/or ';' (except that, by defining &&;, the former at least is implicitly allowed – on the other hand, see below). This manifests itself in at least two ways:
-
Nested replacements?
For example (assuming that all of the possible replacements are undefined and therefore represent themselves), what does “abc&pq&r;st;xyz” represent?
- an encoding error?
- abc&pq&r;st;xyz itself:
  - because &pq&r; is itself, leaving &pq&r;st;, which is itself, …? [first & to first ;]
  - because &pq&r;st; is itself? [first & to 'matching' ;]
  - because &r; is itself, leaving &pq&r;st;, …? [innermost &…; pair]
  - because &r;st; is itself, leaving &pq&r;st;, …?
  - for some other reason?
- abcpqrstxyz?
- something else?
While that horror story suggests that nesting should not be allowed, there are cases where it might prove useful. For a silly example, suppose that the S key on your keyboard has broken(!); you could still get a non-breaking space by the circumlocution of &nb&#115;p;. More realistic situations where nesting would be valuable could easily occur if replacement sets involving internationalized names were supported.
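For what it's worth, the innermost-&…;-pair reading above can be prototyped in a few lines. This is only my own sketch of one candidate behaviour; the replacement table is a toy stand-in, and &#115; relies on the numeric-reference extension floated earlier:

```python
import re

# Sketch only: resolve the innermost &...; pair first, then rescan.
INNER = re.compile(r'&([^&;]*);')

def decode_nested(text, names):
    pos = 0
    while True:
        match = INNER.search(text, pos)
        if match is None:
            return text
        name = match.group(1)
        if name.startswith('#'):
            value = chr(int(name[1:]))     # decimal numeric reference
        elif name in names:
            value = names[name]
        else:                              # undefined: represents itself
            pos = match.end()
            continue
        text = text[:match.start()] + value + text[match.end():]
        pos = 0                            # substitution may enable a new match

# the broken-S-key circumlocution: &#115; -> 's', then &nbsp; -> NBSP
print(repr(decode_nested('&nb&#115;p;', {'nbsp': '\u00a0'})))
# the horror-story string, with everything undefined, comes out as itself
print(decode_nested('abc&pq&r;st;xyz', {}))
```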
-
Overly-greedy matching
It is also necessary to specify that a&times;b&minus;c really is a×b−c (matching &times; and &minus;) rather than a&times;b&minus;c (failing to match &times;b&minus;)!
The simplest solution must surely be to ban nesting entirely, by stipulating that a replacement is terminated by the first ';' to follow the opening '&', and that any intervening '&' is an encoding error. Well-formed nesting should also be easy to accommodate; after all, '&' and ';' are just as distinct as '(' and ')'. [It should be noted that both of these formulations would exclude the &&; escape sequence; one could either allow it as a special case, or use &amp; instead.]
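The first-';' rule is simple enough to sketch (again, my own illustration rather than anything from the I-D; the names and exception type are invented):

```python
# Sketch only: a replacement runs from '&' to the first following
# ';', and an intervening '&' is an encoding error.
class EncodingError(ValueError):
    pass

def decode_flat(text, replacements):
    out, i = [], 0
    while i < len(text):
        if text[i] != '&':
            out.append(text[i])
            i += 1
            continue
        end = text.find(';', i + 1)
        if end == -1:
            raise EncodingError('unterminated replacement')
        name = text[i + 1:end]
        if '&' in name:                      # nesting banned outright
            raise EncodingError('& inside replacement: ' + name)
        out.append(replacements.get(name, text[i:end + 1]))
        i = end + 1
    return ''.join(out)

# the greedy-matching worry is settled correctly:
print(decode_flat('a&times;b&minus;c', {'times': '\u00d7', 'minus': '\u2212'}))
```

As remarked above, this formulation rejects &&; outright (the would-be name contains '&'), so one would have to special-case it or rely on &amp; instead.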
-
Inconvenience
I can see just one thing that E+ encodings make harder: getting certain escaped markup into an XML document. ☹ This is a particular issue for the XML-predefined entities &lt;, &gt;, &amp;, &apos; and &quot;, all of which are also defined in both XHTML and MathML. Let's try for &amp;. You can't just enter &amp;, of course; UTF-8+names maps that to & straightaway (almost certainly confusing the XML processor in the process). &&;amp; or &amp;amp; don't work, either: UTF-8+names will replace the initial &&; / &amp;, leaving &amp;, but XML will then reduce that to &. I think that &&;amp;amp; / &amp;amp;amp; might do the trick. Similar shenanigans are needed for the other XML-predefined entities &lt;, &gt;, &apos; and &quot;. (And if you think that looks hairy, try describing it … ☺)
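Describing it in code is perhaps easier ☺ – a toy model of the two decoding stages, in which the UTF-8+names layer knows only the &&; and &amp; replacements and html.unescape stands in for the XML parser:

```python
import html

# Sketch only: UTF-8+names decoding runs first (a single
# left-to-right pass), then the XML processor expands its
# own entities.
def utf8names_decode(text):
    out, i = [], 0
    while i < len(text):
        if text.startswith('&&;', i):
            out.append('&')
            i += 3
        elif text.startswith('&amp;', i):
            out.append('&')
            i += 5
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)

def xml_decode(text):
    return html.unescape(text)   # stand-in for the XML parser

# aiming to end up with the five characters '&amp;':
for attempt in ['&amp;', '&&;amp;', '&&;amp;amp;', '&amp;amp;amp;']:
    step1 = utf8names_decode(attempt)
    print('%-16s -> %-12s -> %s' % (attempt, step1, xml_decode(step1)))
```

Only the last two attempts survive both stages as &amp;; the first leaves a bare '&' for the XML parser to choke on, and the second collapses all the way down to '&'.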
-
Clerical errors?
- The titles of sections 6 and 7 are “The HTML Replacement Set” and “The MathML Representation Set” respectively; I'm guessing that the latter's “Representation” should probably be “Replacement”.
- Reference [1] gives http://www.w3.org/TR/1998/REC-xml as the URI for the XML 1.0 specification, but this link does not resolve …
- Reference [2] has a couple of typos: Standords, Organizotion; these are also present in the title attribute of the citing “SGML” link in section 1.