String character range
dtody at nrao.edu
Fri Aug 1 13:53:06 PDT 2008
Hey Mark -
I agree with your sentiment that string data which we want to
manipulate in any language or environment should be simple; if
necessary, a separate datatype could be declared for representing
e.g. general Unicode-encoded text.
What about UTF-8 though? It is backwards compatible with ASCII
but allows any Unicode character to be represented using multi-byte
sequences; text containing only 7-bit ASCII characters is
byte-for-byte the same as ASCII. This is much like your escape
sequence proposal, but is a widely used standard. All XML processors
are required to support UTF-8 (almost any XML document one sees is
UTF-8 encoded), so there should be no problems there.
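As a rough illustration of the compatibility point, here is a small
Python sketch. The sample text is made up for the example, and the
variable names are only illustrative; nothing here is part of SAMP.

```python
# Plain 7-bit text: the ASCII and UTF-8 encodings are the same bytes.
ascii_text = "SAMP string"  # hypothetical sample value
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A non-ASCII character (n with tilde, code point U+00F1) becomes a
# two-byte sequence in UTF-8 and decodes back to the same character.
n_tilde = "\u00f1"
n_tilde_utf8 = n_tilde.encode("utf-8")
round_trip = n_tilde_utf8.decode("utf-8") == n_tilde

print(same_bytes)            # True
print(n_tilde_utf8.hex())    # c3b1
print(round_trip)            # True
```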
I suspect that if some old ASCII-oriented code got a UTF-8 encoded
string containing multi-byte Unicode characters it would print these
oddly, but it would probably still work: things like the null test
for end of string still work normally, since every byte of a UTF-8
multi-byte sequence has its high bit set and so is never zero. There
would be no problem for the usual case of simple ASCII text.
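To make the "prints oddly but still works" guess concrete, here is a
hedged Python sketch. The helper c_strlen is a hypothetical stand-in
for a C-style end-of-string scan, the name used is an invented
example, and decoding as Latin-1 is just one assumption about what a
legacy 8-bit program might do with the bytes.

```python
def c_strlen(buf: bytes) -> int:
    """Count bytes up to the first NUL, as a C-style scan would."""
    return buf.index(0)

name = "Mu\u00f1oz"                      # invented example name
wire = name.encode("utf-8") + b"\x00"    # NUL-terminated UTF-8 bytes

# The scan still finds the terminator: 4 ASCII bytes + 2 bytes for U+00F1.
length = c_strlen(wire)

# A legacy program treating each byte as one Latin-1 character would
# show two odd characters in place of the single intended one.
legacy_view = wire[:length].decode("latin-1")

# Every byte of the multi-byte sequence has the high bit set.
all_high = all(b >= 0x80 for b in "\u00f1".encode("utf-8"))

print(length)        # 6
print(legacy_view)   # MuÃ±oz
print(all_high)      # True
```

Note that the byte count (6) differs from the character count (5), so
old code that assumes one byte per character would get lengths and
column widths wrong even though it finds the end of the string.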
On Fri, 1 Aug 2008, Mark Taylor wrote:
> On Fri, 1 Aug 2008, Carlos Rodrigo Blanco wrote:
> > Hi
> > I'm sorry that I don't know much about unicode encoding and I feel quite
> > ashamed of showing this ignorance, but I wonder what happens with latin
> > characters and so.
> > If I have to write, for instance, some author name in a xml document that
> > includes some latin character (like ñ), is that allowed?
> Writing it in an XML document - no problem. XML, and Unicode on which
> it is based, is very capable at representing almost any character
> from almost any language you can think of (and a lot more).
> As far as SAMP goes: that character looks to me like code point 0xf1, from the
> Latin-1 Supplement code block. So you could not send it using either the
> existing definition for a SAMP string or the proposal (4) that I am
> suggesting. If we used a variant of my suggestion (3):
> 3. Define some escaping convention for un-XML characters, e.g. \u001f
> for character 31.
> with the intention that this escaping mechanism could be used for
> any 8-bit character it would be possible to transmit this kind of non-7-bit
> Latin character. However, characters with the 8th bit set might cause
> problems for certain other transports and language environments. I must admit
> apart from RFC-822 mail-type contexts I can't think of what these might be,
> but I'd be inclined to steer clear of non-7-bit characters just in case.
> However, if others (e.g. with less Anglo-Saxon prejudices) think that it's an
> important requirement to permit transmission of characters like this within
> SAMP we could take that on board. We could even in principle say that this
> escaping mechanism could be used to specify any Unicode character - but I
> think that would definitely be a bad idea as it would effectively restrict use
> of the protocol to languages with Unicode support, which excludes quite a lot.
> Mark Taylor Astronomical Programmer Physics, Bristol University, UK
> m.b.taylor at bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/