<Alan L>Each String object should specify its encoding scheme. UTF-8 should be the default, but all commonly-encountered encodings should be supported, and all should be usable at once (in different String instances). When a Character is reified from a String, it should use the Unicode code point values (the full 32-bit value). Ideally, the encoding of a String should be a function of an associated Strategy object, and not be based on having different subclasses of String.</Alan L>
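(For concreteness, here is a rough sketch of how such a Strategy arrangement might look. The class and selector names are hypothetical, not existing Squeak code:

    Object subclass: #EncodedString
        instanceVariableNames: 'bytes encoding'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Collections-Strings'

    EncodedString >> at: index
        "Reify the character at index as a full Unicode code point,
        delegating the decoding to this String's encoding strategy."
        ^ Character value: (encoding codePointAt: index in: bytes)

    EncodedString >> at: index put: aCharacter
        "Encode the Character's Unicode code point into this String's
        native encoding, again via the strategy."
        encoding at: index in: bytes putCodePoint: aCharacter asInteger.
        ^ aCharacter
)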
Is this better than using UTF-32 throughout the image for all Strings? One reason would be that for some characters in domestic encodings, the round-trip conversion is not exactly guaranteed, so you can avoid that problem this way. But other than that, encodings only matter when the system is interfacing with the outside world. So the internal representation can be uniform, I think.
Would you write all the comparison methods for every combination of different encodings?
-- Yoshiki
Well, perhaps UTF-32 would be a better default, now that I think about it, given the performance cost of accessing characters at an arbitrary index. But using 32-bit-wide or 16-bit-wide strings internally as the only option would waste memory in many situations, especially for the "Latin-1" languages.
Having String instances that use specified encodings enables one to avoid doing conversions unless and until they are needed. It also makes it easy to deal with the data as it will actually exist when persisted, or when transported over the network. And it makes it easier to handle the host platform's native character encodings (there may be more than one), or the character encodings used by external libraries or applications that either offer callpoints to, or consume callpoints from, a Squeak process. It also documents the encoding used by each String.
If all Strings use UTF-32, and are only converted to other encodings by the VM, how does one write Smalltalk code to convert text from one character encoding to another? I'd rather not make character encodings yet another bit of magic that only the VM can do.
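(With per-String encodings, re-encoding can be plain Smalltalk: reify each character to its Unicode code point, then encode that code point into the target encoding. A sketch only, using the same hypothetical names as above, and assuming a bytes:encoding: instance-creation method:

    EncodedString >> convertTo: anEncodingStrategy
        "Answer a copy of the receiver re-encoded with anEncodingStrategy.
        Each character is reified to its Unicode code point, then that
        code point is encoded into the target encoding."
        | out |
        out := WriteStream on: (ByteArray new: self size).
        1 to: self size do: [:i |
            anEncodingStrategy encodeCodePoint: (self at: i) asInteger on: out].
        ^ EncodedString bytes: out contents encoding: anEncodingStrategy
)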
It is already the case that accessing individual characters of a String results in the reification of a Character object. So, leveraging what is already the case, conversion between a String's internal encoding and the canonical (Unicode) encoding should occur when a Character object is reified from an encoded character in a String (or in a Stream). Character objects that are "put:" into a String would be converted from the Unicode code point to the encoding native to that String. Using Character reification to/from Unicode as the unification mechanism provides the illusion that all Strings use the same code points for their characters, even though they in fact do not.
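(This also answers the earlier question about comparison methods: one implementation suffices for any pair of encodings, rather than one per combination, because at: presents Unicode code points on both sides. A sketch, under the same hypothetical names:

    EncodedString >> = aString
        "Compare character by character. Because at: reifies Unicode code
        points regardless of either String's internal encoding, no
        per-encoding-pair comparison methods are needed."
        self size = aString size ifFalse: [^ false].
        1 to: self size do: [:i |
            (self at: i) = (aString at: i) ifFalse: [^ false]].
        ^ true
)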
Of course, for some encodings (such as UTF-8) there would probably be a performance penalty for accessing characters at an arbitrary index ("aString at: n"). But there may be good ways to mitigate that using clever implementation tricks (caveat: I haven't actually tried it; one possibility is sketched below). With my proposal, however, one is free to use UTF-16 for all Strings, or UTF-32 for all Strings, or ASCII for all Strings, based on one's space and performance constraints, and on the character repertoire one's user base needs. And the conversion to UTF-16 or UTF-32 (or whatever) can be done when the String is read from an external Stream (using the VW stream decorator approach, for example).
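(The sketch below, again with hypothetical names and untested as the caveat says, scans the UTF-8 bytes from the front. A per-String cache of the last (character index, byte offset) pair would make the common sequential-access case O(1); that refinement, and all validation of the byte sequence, is omitted here:

    Utf8Encoding >> codePointAt: index in: bytes
        "Scan from the front to the index-th character, then decode it."
        | byteIndex |
        byteIndex := 1.
        index - 1 timesRepeat: [
            byteIndex := byteIndex + (self sequenceLengthAt: byteIndex in: bytes)].
        ^ self decodeSequenceAt: byteIndex in: bytes

    Utf8Encoding >> sequenceLengthAt: byteIndex in: bytes
        "Answer the byte length of the UTF-8 sequence starting at
        byteIndex, derived from its lead byte."
        | lead |
        lead := bytes at: byteIndex.
        lead < 16r80 ifTrue: [^ 1].
        lead < 16rE0 ifTrue: [^ 2].
        lead < 16rF0 ifTrue: [^ 3].
        ^ 4

    Utf8Encoding >> decodeSequenceAt: byteIndex in: bytes
        "Decode one UTF-8 sequence into a Unicode code point: mask the
        payload bits of the lead byte, then fold in six bits from each
        continuation byte."
        | count codePoint |
        count := self sequenceLengthAt: byteIndex in: bytes.
        codePoint := (bytes at: byteIndex)
            bitAnd: (#(16r7F 16r1F 16r0F 16r07) at: count).
        2 to: count do: [:i |
            codePoint := (codePoint bitShift: 6)
                bitOr: ((bytes at: byteIndex + i - 1) bitAnd: 16r3F)].
        ^ codePoint
)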
The ASCII encoding would be good for the multitude of legacy applications that are English-only. ISO 8859-1 would be best for post-1980s/pre-UTF legacy applications that have to deal with non-English languages, or with either HTML or pre-Vista Windows. UTF-x would be best for most other situations.
--Alan