As for the issue of lower case e acute: it is compatibility codepoint 00e9 hex, and therefore encodable in UTF-8 as compatibility codepoint e9 hex. It can also be written as the composed character #(0065 00b4) (all in hex), and, when endianness markers are included, as both #(feff 0065 00b4) and #(fffe 0065 00b4). As I understand it, it should also be legitimate to encode it in UTF-8 as a composed character #(65 b4) (all hex), etc.
On 7 December 2015 at 08:27, Sven Van Caekenberghe sven@stfx.eu wrote:
I am sorry but one of your basic assumptions is completely wrong:
'Les élèves Français' encodeWith: #iso88591.
=> #[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]
'Les élèves Français' utf8Encoded.
=> #[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167 97 105 115]
ISO-8859-1 (~Latin-1) is NOT AT ALL identical to UTF-8 in its upper, non-ASCII part !!
Or, shorter: $é is encoded in ISO-8859-1 as #[233], but as #[195 169] in UTF-8.
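Sven's byte-level point can be checked directly. A quick sketch in Python (standing in for the Smalltalk snippets above; the encoding names are Python's):

```python
# Encode the same text under ISO-8859-1 (Latin-1) and UTF-8 and compare.
text = "Les élèves Français"

latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

# é is a single byte, E9 hex (233), in ISO-8859-1...
assert "é".encode("iso-8859-1") == b"\xe9"
# ...but two bytes, C3 A9 hex (195 169), in UTF-8.
assert "é".encode("utf-8") == b"\xc3\xa9"

# The two encodings agree byte-for-byte only on the ASCII range.
print(list(latin1_bytes))
print(list(utf8_bytes))
```

Running this prints exactly the two byte arrays quoted above: 19 bytes for ISO-8859-1, 22 for UTF-8, differing at é, è and ç.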
So more than half the points you make, or the facts that you state, are thus plain wrong.
The only thing that is correct is that the code points are equal, but that is not the same as the encoding !
From this I am inclined to conclude that you do not fundamentally understand how UTF-8 works, which does not strike me as a good basis for designing something called a UTF8String.
Sorry.
PS: Note also that Cuis' choice to use ISO-8859-1 only is pretty limiting in a Unicode world.
On 07 Dec 2015, at 04:21, EuanM euanmee@gmail.com wrote:
This is a long email. A *lot* of it is encapsulated in the Venn diagram at both: http://smalltalk.uk.to/unicode-utf8.html and my Smalltalk in Small Steps blog at: http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and...
My current thinking, and understanding.
- a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
b) UTF-8 can encode all of those characters in 1 byte, but can prefer some of them to be encoded as sequences of multiple bytes. And can encode additional characters as sequences of multiple bytes.
Smalltalk has long had multiple String classes.
Any UTF16 Unicode codepoint which has a codepoint of 00nn hex
is encoded as a UTF-8 codepoint of nn hex.
- All valid ISO-8859-1 characters have a character code between 20
hex and 7E hex, or between A0 hex and FF hex. https://en.wikipedia.org/wiki/ISO/IEC_8859-1
- All valid ASCII characters have a character code between 00 hex and 7E hex.
https://en.wikipedia.org/wiki/ASCII
- a) All character codes which are defined within ISO-8859-1 and also defined within ASCII (i.e. character codes 20 hex to 7E hex) are defined identically in both.
b) All printable ASCII characters are defined identically in both ASCII and ISO-8859-1
- All character codes defined in ASCII (00 hex to 7E hex) are
defined identically in Unicode UTF-8.
- All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex
- FF hex ) are defined identically in UTF-8.
=> some Unicode codepoints map to both ASCII and ISO-8859-1. all ASCII maps 1:1 to Unicode UTF-8 all ISO-8859-1 maps 1:1 to Unicode UTF-8
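The 1:1 mapping claimed here holds at the level of codepoints: every ISO-8859-1 byte value names the Unicode codepoint with the same number (though, as noted elsewhere in this thread, the UTF-8 *bytes* differ above 7F hex). A quick check in Python:

```python
# Decoding any single ISO-8859-1 byte yields the Unicode codepoint with
# the same numeric value: the codepoints coincide 1:1.
for value in range(0x100):
    char = bytes([value]).decode("iso-8859-1")
    assert ord(char) == value

# Byte-for-byte identity with UTF-8, however, holds only below 80 hex.
assert "A".encode("utf-8") == b"\x41"      # ASCII: same single byte
assert "é".encode("utf-8") == b"\xc3\xa9"  # E9 hex needs two UTF-8 bytes
```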
- All ByteString elements which are either a valid ISO-8859-1 character or a valid ASCII character are *also* a valid UTF-8 character.
- ISO-8859-1 characters representing a character with a diacritic, or a two-character ligature, have no ASCII equivalent. In Unicode and UTF-8, those character codes which represent compound glyphs are called "compatibility codepoints".
- The preferred Unicode representation of the characters which have compatibility codepoints is a short sequence of codepoints representing the characters which are combined together to form the glyph of the convenience codepoint, encoded as a sequence of bytes representing the component characters.
- Some concrete examples:
A (aka Upper Case A) - in ASCII, in ISO-8859-1
ASCII: A - 41 hex
ISO-8859-1: A - 41 hex
UTF-8: A - 41 hex

BEL (a bell sound, often invoked by a Ctrl-g keyboard chord) - in ASCII, not in ISO-8859-1
ASCII: BEL - 07 hex
ISO-8859-1: 07 hex is not a valid character code
UTF-8: BEL - 07 hex

£ (GBP currency symbol) - in ISO-8859-1, not in ASCII
ASCII: A3 hex is not a valid ASCII code
ISO-8859-1: £ - A3 hex
UTF-8: £ - A3 hex

Upper Case C cedilla - in ISO-8859-1, not in ASCII; in UTF-8 as a compatibility codepoint *and* a composed set of codepoints
ASCII: C7 hex is not a valid ASCII character code
ISO-8859-1: Upper Case C cedilla - C7 hex
UTF-8: Upper Case C cedilla (compatibility codepoint) - C7 hex
Unicode preferred: Upper Case C cedilla (composed set of codepoints) - Upper Case C (0043 hex) followed by cedilla (00B8 hex)
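The C cedilla example can be checked with Python's unicodedata module. One caveat worth flagging: in the Unicode character database, the canonical decomposition of Ç pairs Upper Case C (0043 hex) with the *combining* cedilla 0327 hex, not the spacing cedilla 00B8 hex:

```python
import unicodedata

precomposed = "\u00c7"  # Ç as a single codepoint (C7 hex in ISO-8859-1)
decomposed = unicodedata.normalize("NFD", precomposed)

assert decomposed == "C\u0327"  # C followed by the combining cedilla
assert unicodedata.normalize("NFC", decomposed) == precomposed  # round-trips
assert precomposed.encode("iso-8859-1") == b"\xc7"  # one byte in Latin-1
assert precomposed.encode("utf-8") == b"\xc3\x87"   # two bytes in UTF-8
```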
- For any valid ASCII string *and* for any valid ISO-8859-1 string,
aByteString is completely adequate for editing and display.
- When sorting any valid ASCII string *or* any valid ISO-8859-1
string, upper and lower case versions of the same character will be treated differently.
- When sorting any valid ISO-8859-1 string containing letter+diacritic combination glyphs or ligature combination glyphs, the glyphs in combination will be treated differently to a "plain" glyph of the character, i.e. "C" and "C cedilla" will be treated very differently, and "ß" and "fs" will be treated very differently.
- Different nations have different rules about where diacritic-ed characters and ligature pairs should be placed when in alphabetical order.
- Some nations even have multiple standards - e.g. surnames
beginning either "M superscript-c" or "M superscript-a superscript-c" are treated as beginning equivalently in UK phone directories, but not in other situations.
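The codepoint-order problem described above is easy to demonstrate; a small Python sketch (the example words are made up):

```python
# Sorting by raw codepoint treats "Ç" (C7 hex) very differently from
# "C" (43 hex): C7 hex is greater than 'z' (7A hex), so "Çedille"
# sorts after every word beginning with an unaccented letter.
words = ["Çedille", "Coca", "Cz"]
assert sorted(words) == ["Coca", "Cz", "Çedille"]
```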
Some practical upshots
- Cuis and its ISO-8859-1 encoding is *exactly* the same as UTF-8,
for any single character it considers valid, or any ByteString it has made up of characters it considers valid.
- Any ByteString is valid UTF-8 in any of Squeak, Pharo, Cuis or any
other Smalltalk with a single byte ByteString following ASCII or ISO-8859-1.
- Any Smalltalk (or derivative language) using ByteString can immediately consider its ByteString as valid UTF-8, as long as it also considers the ByteString as valid ASCII and/or ISO-8859-1.
- All of those can be successfully exported to any system using UTF-8
(e.g. HTML).
- To successfully *accept* all UTF-8 we must be able to do either:
a) accept UTF-8 strings with composed characters b) convert UTF-8 strings with composed characters into UTF-8 strings that use *only* compatibility codepoints.
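Option b), converting composed sequences into single precomposed codepoints, is what Unicode calls NFC normalisation. A sketch in Python, whose standard library ships this as unicodedata.normalize:

```python
import unicodedata

# A composed UTF-8 sequence: 'e' followed by the combining acute
# (0301 hex), which UTF-8 encodes as the bytes 65 CC 81.
composed_utf8 = b"e\xcc\x81"
text = composed_utf8.decode("utf-8")
assert len(text) == 2  # two codepoints before conversion

# Convert to the single-codepoint (precomposed) form.
precomposed = unicodedata.normalize("NFC", text)
assert precomposed == "\u00e9"                     # é as one codepoint
assert precomposed.encode("utf-8") == b"\xc3\xa9"  # its UTF-8 bytes
```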
Class + protocol proposals
a Utf8CompatibilityString class.
asByteString - ensures only compatibility codepoints are used, and that it does not encode characters above 00FF hex.
asIso8859String - ensures only compatibility codepoints are used, and that the characters are each valid ISO-8859-1.
asAsciiString - ensures only characters 00 hex - 7F hex are used.
asUtf8ComposedIso8859String - ensures all compatibility codepoints are expanded into small OrderedCollections of codepoints
a Utf8ComposedIso8859String class - will provide sortable and comparable UTF8 strings of all ASCII and ISO 8859-1 strings.
Then a Utf8SortableCollection class - a collection of Utf8ComposedIso8859String words and phrases.
Custom sortBlocks will define the applicable sort order.
We can create a collection... a Dictionary, thinking about it, of named, prefabricated sortBlocks.
This will work for all UTF8 strings of ISO-8859-1 and ASCII strings.
If anyone has better names for the classes, please let me know.
If anyone else wants to help
- build these,
- create SUnit tests for these
- write documentation for these
Please let me know.
n.b. I have had absolutely no experience of Ropes.
My own background with this stuff: in the early 90's, I was a Project Manager implementing office automation systems across a global company, with offices in the Americas, Western, Eastern and Central Europe (including nations with Slavic and Cyrillic users), Japan and China. The mission-critical application was word-processing.
Our offices were spread around the globe, and we needed those offices to successfully exchange documents with their sister offices, and with the customers in each region the offices were in.
Unicode was then new, and our platform supplier was the NeXT Corporation, who had been a founder member of the Unicode Consortium in 1990.
So far: I've read the latest version of the Unicode Standard (v8.0). This is freely downloadable. I've purchased a paper copy of an earlier release. New releases typically consist of additional codespaces (i.e. alphabets). So old copies are useful, as well as cheap. (Paper copies of version 4.0 are available second-hand for < $10 / €10.)
The typical change with each release is the addition of further codespaces (i.e. alphabets, more or less), so you don't lose a lot. (I'll be going through my v4.0 just to make sure.)
Cheers, Euan
On 5 December 2015 at 13:08, stepharo stepharo@free.fr wrote:
Hi EuanM
On 4/12/15 at 12:42, EuanM wrote:
I'm currently groping my way to seeing how feature-complete our Unicode support is. I am doing this to establish what still needs to be done to provide full Unicode support.
This is great. Thanks for pushing this. I wrote and collected some roadmaps (analyses of different topics) on the Pharo GitHub project; feel free to add this one there.
This seems to me to be an area where it would be best to write it once, and then have the same codebase incorporated into the Smalltalks that most share a common ancestry.
I am keen to get: equality-testing for strings; sortability for strings which have ligatures and diacritic characters; and correct round-tripping of data.
Go! My suggestion is: start small, make steady progress, write tests, commit often :)
Stef
What is the French phonebook ordering? This is the first time I have heard about it.
Call to action:
If you have comments on these proposals - such as "but we already have that facility" or "the reason we do not have these facilities is because they are dog-slow" - please let me know them.
If you would like to help out, please let me know.
If you have Unicode experience and expertise, and would like to be, or would be willing to be, in the 'council of experts' for this project, please let me know.
If you have comments or ideas on anything mentioned in this email, please let me know.
In the first instance, the initiative's website will be: http://smalltalk.uk.to/unicode.html
I have created a SqueakSource.com project called UnicodeSupport
I want to avoid re-inventing any facilities which already exist. Except where they prevent us reaching the goals of:
- sortable UTF8 strings
- sortable UTF16 strings
- equivalence testing of 2 UTF8 strings
- equivalence testing of 2 UTF16 strings
- round-tripping UTF8 strings through Smalltalk
- roundtripping UTF16 strings through Smalltalk.
As I understand it, we have limited Unicode support atm.
Current state of play
ByteString gets converted to WideString when need is automagically detected.
Is there anything else that currently exists?
Definition of Terms
A quick definition of terms before I go any further:
Standard terms from the Unicode standard
a compatibility character : an additional encoding of a *normal* character, for compatibility and round-trip conversion purposes. For instance, a 1-byte encoding of a Latin character with a diacritic.
Made-up terms
a convenience codepoint : a single codepoint which represents an item that is also encoded as a string of codepoints.
(I tend to use the terms compatibility character and compatibility codepoint interchangeably. The standard only refers to them as compatibility characters. However, the standard is determined to emphasise that characters are abstract and that codepoints are concrete. So I think it is often more useful and productive to think of compatibility or convenience codepoints.)
a composed character : a character made up of several codepoints
Unicode encoding explained
A convenience codepoint can therefore be thought of as a code point used for a character which also has a composed form.
The way Unicode works is that sometimes you can encode a character in one byte, sometimes not. Sometimes you can encode it in two bytes, sometimes not.
You can therefore have a long stream of ASCII which is single-byte Unicode. If there is an occasional Cyrillic or Greek character in the stream, it would be represented either by a compatibility character or by a multi-byte combination.
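The variable widths can be seen directly, for instance in Python (note that Cyrillic and Greek letters are always multi-byte in UTF-8; they have no single-byte form):

```python
# UTF-8 widths: 1 byte for ASCII, more for everything else.
samples = {"A": 1, "é": 2, "Ж": 2, "€": 3}
for char, width in samples.items():
    assert len(char.encode("utf-8")) == width
```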
Using compatibility characters can prevent proper sorting and equivalence testing.
Using "pure" Unicode, i.e. "normal encodings", can cause compatibility and round-tripping problems, although avoiding them can *also* cause compatibility issues and round-tripping problems.
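The equivalence-testing problem looks like this in practice (a Python sketch):

```python
import unicodedata

precomposed = "\u00e9"  # é as one codepoint
composed = "e\u0301"    # e followed by the combining acute

# Codepoint-by-codepoint comparison says the two strings differ...
assert precomposed != composed
# ...even though they are canonically equivalent; normalising both to
# the same form (here NFC) makes the comparison succeed.
assert (unicodedata.normalize("NFC", composed)
        == unicodedata.normalize("NFC", precomposed))
```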
Currently my thinking is:
a Utf8String class: an OrderedCollection, with 1-byte characters as the modal element, but short arrays of wider characters where necessary.
a Utf16String class: an OrderedCollection, with 2-byte characters as the modal element, but short arrays of wider characters, beginning with a 2-byte endianness indicator.
Utf8Strings sometimes need to be sortable, and sometimes need to be compatible.
So my thinking is that Utf8String will contain convenience codepoints, for round-tripping. And where there are multiple convenience codepoints for a character, that it standardises on one.
And that there is a Utf8SortableString which uses *only* normal characters.
We then need methods to convert between the two.
aUtf8String asUtf8SortableString
and
aUtf8SortableString asUtf8String
Sort orders are culture- and context-dependent - Sweden and Germany have different sort orders for the same diacritic-ed characters. Some countries have one order in general usage, and another for specific usages, such as phone directories (e.g. UK and France).
Similarly for Utf16 : Utf16String and Utf16SortableString and conversion methods
A list of sorted words would be a SortedCollection, and there could be pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder, seOrder, ukOrder, etc
along the lines of aListOfWords := SortedCollection sortBlock: deOrder
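In Python terms, the named sortBlocks would be key functions kept in a dictionary; a sketch (the names mirror the hypothetical frPhoneBookOrder / deOrder above, but these keys are crude stand-ins, not real collation rules):

```python
import unicodedata

def strip_marks(s):
    # Decompose, then drop the combining marks, leaving base letters.
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if not unicodedata.combining(c))

# A dictionary of named, prefabricated orderings (sketches only).
sort_keys = {
    "codepointOrder": lambda s: s,
    "accentInsensitiveOrder": lambda s: strip_marks(s).lower(),
}

words = ["Zoo", "École", "Apple"]
assert (sorted(words, key=sort_keys["accentInsensitiveOrder"])
        == ["Apple", "École", "Zoo"])
```

Real per-culture ordering would plug proper collation rules into the same dictionary-of-keys shape.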
If a word is either a Utf8SortableString, or a well-formed Utf8String, then we can perform equivalence testing on them trivially.
To make sure a Utf8String is well formed, we would need to have a way of cleaning up any convenience codepoints which were valid, but which were for a character which has multiple equally-valid alternative convenience codepoints, and for which the string currently had the "wrong" convenience codepoint. (I.e. for any character with valid alternative convenience codepoints, we would choose one to be in the well-formed Utf8String, and we would need a method for cleaning the alternative convenience codepoints out of the string, and replacing them with the chosen approved convenience codepoint.)
aUtf8String cleanUtf8String
With WideString, a lot of the issues disappear - except round-tripping. (Although I'm sure I have seen something recently about 4-byte strings that also have an additional bit, which would make some Unicode characters 5 bytes long.)
(I'm starting to zone out now - if I've overlooked anything - obvious, subtle, or somewhere in between, please let me know)
Cheers, Euan
squeak-dev@lists.squeakfoundation.org