On Thu, Jun 07, 2007 at 08:16:21PM -0700, Alan Lovejoy wrote:
> It is already the case that accessing individual characters from a String
> results in the reification of a Character object. So, leveraging what is
> already the case, conversion between the internal encoding and the
> canonical (Unicode) encoding should occur when a Character object is
> reified from an encoded character in a String (or in a Stream). Character
> objects that are "put:" into a String would be converted from the Unicode
> code point to the encoding native to that String. Using Character
> reification to/from Unicode as the unification mechanism provides the
> illusion that all Strings use the same code points for their characters,
> even though they in fact do not.
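To make the idea concrete, here is a small sketch of my own (in Python, not Squeak; the class and method names are invented for illustration): a string stores bytes in some native encoding, but every element access converts at the "Character" boundary, so callers only ever see Unicode code points no matter how the bytes are stored.

```python
# Illustrative sketch only -- not Squeak's actual String/Character classes.
# Storage stays in a native encoding (Latin-1 here); conversion to and from
# Unicode happens exactly when a character is reified or stored.

class EncodedString:
    def __init__(self, text, encoding="latin-1"):
        self.encoding = encoding                      # native encoding of the bytes
        self.data = bytearray(text.encode(encoding))  # internal representation

    def at(self, index):
        # Reifying a "Character": decode the stored byte to a Unicode code point.
        return ord(self.data[index:index + 1].decode(self.encoding))

    def at_put(self, index, code_point):
        # The 'put:' direction: convert the code point back to the native encoding.
        self.data[index:index + 1] = chr(code_point).encode(self.encoding)

s = EncodedString("café")
assert s.at(3) == 0xE9        # U+00E9, even though storage is a single Latin-1 byte
s.at_put(0, ord("C"))
assert s.at(0) == ord("C")
```

The illusion holds as long as every access goes through the conversion boundary; two strings with different native encodings compare equal character-by-character if their code points match.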
Someone already mentioned the way Plan 9 did this, and provided a link, which I read; it sounded pretty logical. What follows is my assessment of what I read.
The key realization behind Plan 9's approach is that random access into a string is the exception, rather than the rule. Stream access is much more common, and much more in need of optimization. This seems logical to me. UTF-8 is a stream-oriented encoding of Unicode that the Plan 9 team invented to solve exactly this optimization problem. UTF-8 is self-synchronizing and byte-oriented, which allows a reader to be nearly stateless while still consuming much less memory than UTF-32. The Plan 9 paper also observed that, contrary to what some expect, very few programs do better with UTF-32, because very few programs really need to process a string non-linearly. Regular expressions and sorting are the two main exceptions.
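The self-synchronizing property is easy to demonstrate (a sketch of my own, not from the Plan 9 paper): in UTF-8, continuation bytes always match the bit pattern 10xxxxxx, so a reader dropped at an arbitrary byte offset can find the next character boundary with no decoder state at all.

```python
def resync(data: bytes, offset: int) -> int:
    """Return the index of the next character boundary at or after offset.

    UTF-8 continuation bytes look like 0b10xxxxxx; every other byte
    (plain ASCII or a multi-byte lead byte) starts a character.
    Skipping continuation bytes is all the "state" a reader needs.
    """
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset

utf8 = "héllo".encode("utf-8")   # b'h\xc3\xa9llo' -- the 'é' is two bytes
assert resync(utf8, 2) == 3      # offset 2 lands mid-character; boundary is at 3
assert resync(utf8, 1) == 1      # offset 1 is already a lead byte
```

This is what makes a UTF-8 stream reader robust: after a seek, a dropped byte, or a buffer split, it recovers at the very next character instead of misdecoding everything that follows.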
UTF-8 also allows the transition to be made slightly more smoothly, since many ASCII programs will already work with UTF-8.
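The ASCII-compatibility point can be checked directly (again a trivial sketch of my own): every ASCII string encodes to the same bytes in ASCII and in UTF-8, so byte-oriented programs that only care about ASCII delimiters keep working on UTF-8 data unchanged.

```python
# ASCII text is byte-for-byte identical under UTF-8.
text = "GET /index.html HTTP/1.0"
assert text.encode("ascii") == text.encode("utf-8")

# Even in mixed text, ASCII delimiter bytes keep their values, so naive
# byte-level splitting on b'/' still finds the right boundaries.
path = "/café/menu".encode("utf-8")
assert path.split(b"/")[2].decode("utf-8") == "menu"
```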
This is a synopsis of what I read. I am not as familiar with this issue as you are.
squeak-dev@lists.squeakfoundation.org