String character range
m.b.taylor at bristol.ac.uk
Fri Aug 1 03:02:32 PDT 2008
While writing the hub tests, I have come across a problem with the
definition of the SAMP string data type. Section 3.3 of the SAMP
doc defines a string as:
"a scalar value consisting of a sequence of characters;
each character may be in the range 0x01-0x7f"
Section 2.2 of the XML specification meanwhile
(http://www.w3.org/TR/2006/REC-xml-20060816/#charsets) has the following
BNF production for characters allowed in an XML document:
    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
           | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
           /* any Unicode character, excluding the
              surrogate blocks, FFFE, and FFFF. */
(I do not understand the comment here - as far as I can see Unicode
does include the other control characters in the range #x0-#x1f.)
What this means is that there are legal SAMP strings (ones containing
any character in the ranges 0x01-0x08, 0x0B, 0x0C, 0x0E-0x1F) which
cannot be transmitted as an XML-RPC <string> element. This means
that either the definition of a SAMP string, or the prescription for
transmitting SAMP strings in XML-RPC messages in the Standard Profile,
must be modified to avoid inconsistency.
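To make the mismatch concrete, here is a minimal Python sketch that
lists the code points permitted by the SAMP definition quoted above
but rejected by the XML Char production. The function names are
made up for illustration; they are not part of SAMP or of any library.

```python
def is_xml_char(cp):
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_samp_char(cp):
    """True if cp is in the range currently allowed by SAMP sec 3.3."""
    return 0x01 <= cp <= 0x7F


def samp_only_code_points():
    """Code points legal in a SAMP string but not in an XML document."""
    return [cp for cp in range(0x80)
            if is_samp_char(cp) and not is_xml_char(cp)]


if __name__ == "__main__":
    print(" ".join("0x%02X" % cp for cp in samp_only_code_points()))
```

The result is exactly the set 0x01-0x08, 0x0B, 0x0C, 0x0E-0x1F
mentioned above, 28 code points in all.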
I think the possibilities are as follows:
1. Encode all SAMP strings as <base64> elements when transmitting
over XML-RPC.
2. Allow SAMP strings to be transmitted as either <string> or
<base64> elements when transmitting over XML-RPC (the latter
case being required only if the string contains un-XML
characters).
3. Define some escaping convention for un-XML characters, e.g.
\u001f for character 31.
4. Change the SAMP string definition so that only XML-friendly
characters are allowed.
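For reference, option (3) might look something like the sketch below.
The \uXXXX form follows the example given in the list; doubling the
backslash so that the escaping can be reversed is my own assumption,
and the function names are hypothetical. Nothing here is taken from
the SAMP document.

```python
import re


def _is_xml_char(cp):
    """True if code point cp matches the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def escape_un_xml(text):
    """Replace un-XML characters by \\uXXXX and double any backslash."""
    out = []
    for ch in text:
        if ch == "\\":
            out.append("\\\\")            # assumed: escape the escape char
        elif _is_xml_char(ord(ch)):
            out.append(ch)
        else:
            out.append("\\u%04x" % ord(ch))
    return "".join(out)


# Matches either a doubled backslash or a \uXXXX escape.
_ESCAPE = re.compile(r"\\(\\|u[0-9a-fA-F]{4})")


def unescape_un_xml(text):
    """Inverse of escape_un_xml."""
    def repl(match):
        token = match.group(1)
        return "\\" if token == "\\" else chr(int(token[1:], 16))
    return _ESCAPE.sub(repl, text)
```

Every client would need both halves of this, which is the extra
complication that option (3) implies.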
Both (1) and (2) would entail significant extra complication
(base64 decoding required) for Standard Profile clients, and (2) would
additionally make debugging harder (it's nice that you can see what's
in a SAMP/XML-RPC message just by looking). (3) would make life a bit
more complicated than now for clients, but not that much. The existing
legal range 0x01-0x7f for SAMP string characters was in any case just
intended to be a range of characters which would be sufficient for
'normal' strings, while excluding non-printable ones (i.e. ones which
would likely cause problems for some transport types), and it looks
like I decided on a range that was too wide for that purpose.
So I suggest that we do (4). I think we do need at least one line-break
character, though the need for both 0xA and 0x0D may be moot, as is the
need for 0x09 (tab). So I suggest that we change the definition of
a SAMP string in sec 3.3 to one of:
4a. "a scalar value consisting of a sequence of characters;
each character may be in the range 0x20-0x7f or one of
the special characters 0x09 (tab), 0x0A (line feed) or
0x0d (carriage return)"
4b. "a scalar value consisting of a sequence of characters;
each character may be in the range 0x20-0x7f or the
line break character 0x0a"
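The two candidate definitions can be written as simple checks. This
is only a sketch: the names are hypothetical, and treating the empty
string as legal is my assumption, since neither wording says so
explicitly.

```python
import re

# 4a: 0x20-0x7f plus tab, line feed and carriage return.
SAMP_STRING_4A = re.compile(r"[\x20-\x7f\x09\x0a\x0d]*")

# 4b: 0x20-0x7f plus line feed only.
SAMP_STRING_4B = re.compile(r"[\x20-\x7f\x0a]*")


def is_valid_4a(text):
    """True if every character of text is allowed by definition 4a."""
    return SAMP_STRING_4A.fullmatch(text) is not None


def is_valid_4b(text):
    """True if every character of text is allowed by definition 4b."""
    return SAMP_STRING_4B.fullmatch(text) is not None
```

A string with Windows-style line endings such as "line1\r\nline2"
passes is_valid_4a but fails is_valid_4b.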
(4b) might be more rigorous since it obviates the possibility of
confusion when transforming between OSs (Windows and *nix), but
since SAMP usage will probably mostly be intra-OS this might cause
more trouble than it's worth - also, I bet that Windows-based
implementations would routinely violate this in any case
(see Goldfarb's First Law of Text Processing) so probably 4a is
the better choice.
Mark Taylor Astronomical Programmer Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/