On Thu, Jun 07, 2007 at 08:16:21PM -0700, Alan Lovejoy wrote:
> It is already the case that accessing individual characters from a String
> results in the reification of a Character object. So, leveraging what is
> already the case, conversion between the internal encoding and the
> canonical (Unicode) encoding should occur when a Character object is
> reified from an encoded character in a String (or in a Stream). Character
> objects that are "put:" into a String would be converted from the Unicode
> code point to the encoding native to that String. Using Character
> reification to/from Unicode as the unification mechanism provides the
> illusion that all Strings use the same code points for their characters,
> even though they in fact do not.
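To make the idea concrete, here is a small sketch of my own (in Python, not Squeak; the class and method names are invented for illustration): a string stores bytes in some native encoding, but every element access converts at the "Character" boundary, so callers only ever see Unicode code points no matter how the bytes are stored.

```python
# Illustrative sketch only -- not Squeak's actual String/Character classes.
# Storage stays in a native encoding (Latin-1 here); conversion to and from
# Unicode happens exactly when a character is reified or stored.

class EncodedString:
    def __init__(self, text, encoding="latin-1"):
        self.encoding = encoding                      # native encoding of the bytes
        self.data = bytearray(text.encode(encoding))  # internal representation

    def at(self, index):
        # Reifying a "Character": decode the stored byte to a Unicode code point.
        return ord(self.data[index:index + 1].decode(self.encoding))

    def at_put(self, index, code_point):
        # The 'put:' direction: convert the code point back to the native encoding.
        self.data[index:index + 1] = chr(code_point).encode(self.encoding)

s = EncodedString("café")
assert s.at(3) == 0xE9        # U+00E9, even though storage is a single Latin-1 byte
s.at_put(0, ord("C"))
assert s.at(0) == ord("C")
```

The illusion holds as long as every access goes through the conversion boundary; two strings with different native encodings compare equal character-by-character if their code points match.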
Someone already mentioned the way Plan 9 did this, and provided a link, which I read; it sounded pretty logical. What follows is my assessment of what I read.
The key realization behind Plan 9's approach is that random access into a string is the exception, rather than the rule. Stream access is much more common, and much more in need of optimization. This seems logical to me. UTF-8 is a stream-oriented encoding of Unicode that the Plan 9 team invented to solve exactly this optimization problem. UTF-8 is self-synchronizing and byte-oriented, which allows a reader to be nearly stateless while still consuming much less memory than UTF-32. The Plan 9 paper also observed that, contrary to what some expect, very few programs do better with UTF-32, because very few programs really need to process a string non-linearly. Regular expressions and sorting are the two main exceptions.
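The self-synchronizing property is easy to demonstrate (a sketch of my own, not from the Plan 9 paper): in UTF-8, continuation bytes always match the bit pattern 10xxxxxx, so a reader dropped at an arbitrary byte offset can find the next character boundary with no decoder state at all.

```python
def resync(data: bytes, offset: int) -> int:
    """Return the index of the next character boundary at or after offset.

    UTF-8 continuation bytes look like 0b10xxxxxx; every other byte
    (plain ASCII or a multi-byte lead byte) starts a character.
    Skipping continuation bytes is all the "state" a reader needs.
    """
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset

utf8 = "héllo".encode("utf-8")   # b'h\xc3\xa9llo' -- the 'é' is two bytes
assert resync(utf8, 2) == 3      # offset 2 lands mid-character; boundary is at 3
assert resync(utf8, 1) == 1      # offset 1 is already a lead byte
```

This is what makes a UTF-8 stream reader robust: after a seek, a dropped byte, or a buffer split, it recovers at the very next character instead of misdecoding everything that follows.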
UTF-8 also allows the transition to be made slightly more smoothly, since many ASCII programs will already work with UTF-8.
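The ASCII-compatibility point can be checked directly (again a trivial sketch of my own): every ASCII string encodes to the same bytes in ASCII and in UTF-8, so byte-oriented programs that only care about ASCII delimiters keep working on UTF-8 data unchanged.

```python
# ASCII text is byte-for-byte identical under UTF-8.
text = "GET /index.html HTTP/1.0"
assert text.encode("ascii") == text.encode("utf-8")

# Even in mixed text, ASCII delimiter bytes keep their values, so naive
# byte-level splitting on b'/' still finds the right boundaries.
path = "/café/menu".encode("utf-8")
assert path.split(b"/")[2].decode("utf-8") == "menu"
```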
This is a synopsis of what I read. I am not as familiar with this issue as you are.
squeak-dev@lists.squeakfoundation.org