As for the issue of lower case e acute: it is compatibility codepoint 00e9 hex, and therefore encodable in UTF-8 as compatibility codepoint e9 hex. It can also be written as the composed character #(0065 00b4) (all in hex), and, when endianness markers are included, as both #(feff 0065 00b4) and #(fffe 0065 00b4). As I understand it, it should also be legitimate to encode it in UTF-8 as a composed character #(65 b4) (all hex), etc.
On 7 December 2015 at 08:27, Sven Van Caekenberghe sven@stfx.eu wrote:
I am sorry but one of your basic assumptions is completely wrong:
'Les élèves Français' encodeWith: #iso88591.
=> #[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]
'Les élèves Français' utf8Encoded.
=> #[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167 97 105 115]
ISO-8859-1 (~Latin-1) is NOT AT ALL identical to UTF-8 in its upper, non-ASCII part !!
Or, shorter: $é is encoded in ISO-8859-1 as #[233], but as #[195 169] in UTF-8.
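Sven's byte-level point can be checked directly. A quick sketch in Python (standing in for the Smalltalk snippets above; the encoding names are Python's):

```python
# Encode the same text under ISO-8859-1 (Latin-1) and UTF-8 and compare.
text = "Les élèves Français"

latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

# é is a single byte, E9 hex (233), in ISO-8859-1...
assert "é".encode("iso-8859-1") == b"\xe9"
# ...but two bytes, C3 A9 hex (195 169), in UTF-8.
assert "é".encode("utf-8") == b"\xc3\xa9"

# The two encodings agree byte-for-byte only on the ASCII range.
print(list(latin1_bytes))
print(list(utf8_bytes))
```

Running this prints exactly the two byte arrays quoted above: 19 bytes for ISO-8859-1, 22 for UTF-8, differing at é, è and ç.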
So more than half the points you make, or the facts that you state, are thus plain wrong.
The only thing that is correct is that the code points are equal, but that is not the same as the encoding !
From this I am inclined to conclude that you do not fundamentally understand how UTF-8 works, which does not strike me as a good basis for designing something called a UTF8String.
Sorry.
PS: Note also that Cuis' choice to use ISO-8859-1 only is pretty limiting in a Unicode world.
On 07 Dec 2015, at 04:21, EuanM euanmee@gmail.com wrote:
This is a long email. A *lot* of it is encapsulated in the Venn diagram at both: http://smalltalk.uk.to/unicode-utf8.html and my Smalltalk in Small Steps blog at: http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and...
My current thinking, and understanding.
- a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
b) UTF-8 can encode all of those characters in 1 byte, but can prefer some of them to be encoded as sequences of multiple bytes. And can encode additional characters as sequences of multiple bytes.
Smalltalk has long had multiple String classes.
Any UTF16 Unicode codepoint which has a codepoint of 00nn hex
is encoded as a UTF-8 codepoint of nn hex.
- All valid ISO-8859-1 characters have a character code between 20
hex and 7E hex, or between A0 hex and FF hex. https://en.wikipedia.org/wiki/ISO/IEC_8859-1
- All valid ASCII characters have a character code between 00 hex and 7E hex.
https://en.wikipedia.org/wiki/ASCII
- a) All character codes which are defined within ISO-8859-1 and also defined within ASCII (i.e. character codes 20 hex to 7E hex) are defined identically in both.
b) All printable ASCII characters are defined identically in both ASCII and ISO-8859-1
- All character codes defined in ASCII (00 hex to 7E hex) are
defined identically in Unicode UTF-8.
- All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex
- FF hex ) are defined identically in UTF-8.
=> some Unicode codepoints map to both ASCII and ISO-8859-1. all ASCII maps 1:1 to Unicode UTF-8 all ISO-8859-1 maps 1:1 to Unicode UTF-8
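The 1:1 mapping claimed here holds at the level of codepoints: every ISO-8859-1 byte value names the Unicode codepoint with the same number (though, as noted elsewhere in this thread, the UTF-8 *bytes* differ above 7F hex). A quick check in Python:

```python
# Decoding any single ISO-8859-1 byte yields the Unicode codepoint with
# the same numeric value: the codepoints coincide 1:1.
for value in range(0x100):
    char = bytes([value]).decode("iso-8859-1")
    assert ord(char) == value

# Byte-for-byte identity with UTF-8, however, holds only below 80 hex.
assert "A".encode("utf-8") == b"\x41"      # ASCII: same single byte
assert "é".encode("utf-8") == b"\xc3\xa9"  # E9 hex needs two UTF-8 bytes
```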
- All ByteString elements which are either a valid ISO-8859-1 character or a valid ASCII character are *also* a valid UTF-8 character.
- ISO-8859-1 characters representing a character with a diacritic, or a two-character ligature, have no ASCII equivalent. In Unicode and UTF-8, those character codes which represent compound glyphs are called "compatibility codepoints".
- The preferred Unicode representation of the characters which have compatibility codepoints is a short sequence of codepoints representing the characters which are combined together to form the glyph of the convenience codepoint, encoded as a sequence of bytes representing the component characters.
- Some concrete examples:
A (aka Upper Case A) - in ASCII, in ISO-8859-1
ASCII: A - 41 hex
ISO-8859-1: A - 41 hex
UTF-8: A - 41 hex

BEL (a bell sound, often invoked by a Ctrl-g keyboard chord) - in ASCII, not in ISO-8859-1
ASCII: BEL - 07 hex
ISO-8859-1: 07 hex is not a valid character code
UTF-8: BEL - 07 hex

£ (GBP currency symbol) - in ISO-8859-1, not in ASCII
ASCII: A3 hex is not a valid ASCII code
ISO-8859-1: £ - A3 hex
UTF-8: £ - A3 hex

Upper Case C cedilla - in ISO-8859-1, not in ASCII; in UTF-8 as a compatibility codepoint *and* a composed set of codepoints
ASCII: C7 hex is not a valid ASCII character code
ISO-8859-1: Upper Case C cedilla - C7 hex
UTF-8: Upper Case C cedilla (compatibility codepoint) - C7 hex
Unicode preferred: Upper Case C cedilla (composed set of codepoints) - Upper Case C (0043 hex) followed by cedilla (00B8 hex)
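The C cedilla example can be checked with Python's unicodedata module. One caveat worth flagging: in the Unicode character database, the canonical decomposition of Ç pairs Upper Case C (0043 hex) with the *combining* cedilla 0327 hex, not the spacing cedilla 00B8 hex:

```python
import unicodedata

precomposed = "\u00c7"  # Ç as a single codepoint (C7 hex in ISO-8859-1)
decomposed = unicodedata.normalize("NFD", precomposed)

assert decomposed == "C\u0327"  # C followed by the combining cedilla
assert unicodedata.normalize("NFC", decomposed) == precomposed  # round-trips
assert precomposed.encode("iso-8859-1") == b"\xc7"  # one byte in Latin-1
assert precomposed.encode("utf-8") == b"\xc3\x87"   # two bytes in UTF-8
```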
- For any valid ASCII string *and* for any valid ISO-8859-1 string,
aByteString is completely adequate for editing and display.
- When sorting any valid ASCII string *or* any valid ISO-8859-1
string, upper and lower case versions of the same character will be treated differently.
- When sorting any valid ISO-8859-1 string containing letter+diacritic combination glyphs or ligature combination glyphs, the glyphs in combination will be treated differently to a "plain" glyph of the character, i.e. "C" and "C cedilla" will be treated very differently, and "ß" and "fs" will be treated very differently.
- Different nations have different rules about where diacritic-ed characters and ligature pairs should be placed when in alphabetical order.
- Some nations even have multiple standards - e.g. surnames
beginning either "M superscript-c" or "M superscript-a superscript-c" are treated as beginning equivalently in UK phone directories, but not in other situations.
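The codepoint-order problem described above is easy to demonstrate; a small Python sketch (the example words are made up):

```python
# Sorting by raw codepoint treats "Ç" (C7 hex) very differently from
# "C" (43 hex): C7 hex is greater than 'z' (7A hex), so "Çedille"
# sorts after every word beginning with an unaccented letter.
words = ["Çedille", "Coca", "Cz"]
assert sorted(words) == ["Coca", "Cz", "Çedille"]
```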
Some practical upshots
- Cuis and its ISO-8859-1 encoding is *exactly* the same as UTF-8,
for any single character it considers valid, or any ByteString it has made up of characters it considers valid.
- Any ByteString is valid UTF-8 in any of Squeak, Pharo, Cuis or any
other Smalltalk with a single byte ByteString following ASCII or ISO-8859-1.
- Any Smalltalk (or derivative language) using ByteString can immediately consider its ByteString as valid UTF-8, as long as it also considers the ByteString as valid ASCII and/or ISO-8859-1.
- All of those can be successfully exported to any system using UTF-8
(e.g. HTML).
- To successfully *accept* all UTF-8 we must be able to do either:
a) accept UTF-8 strings with composed characters b) convert UTF-8 strings with composed characters into UTF-8 strings that use *only* compatibility codepoints.
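Option b), converting composed sequences into single precomposed codepoints, is what Unicode calls NFC normalisation. A sketch in Python, whose standard library ships this as unicodedata.normalize:

```python
import unicodedata

# A composed UTF-8 sequence: 'e' followed by the combining acute
# (0301 hex), which UTF-8 encodes as the bytes 65 CC 81.
composed_utf8 = b"e\xcc\x81"
text = composed_utf8.decode("utf-8")
assert len(text) == 2  # two codepoints before conversion

# Convert to the single-codepoint (precomposed) form.
precomposed = unicodedata.normalize("NFC", text)
assert precomposed == "\u00e9"                     # é as one codepoint
assert precomposed.encode("utf-8") == b"\xc3\xa9"  # its UTF-8 bytes
```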
Class + protocol proposals
a Utf8CompatibilityString class.
asByteString - ensures only compatibility codepoints are used, and that it does not encode characters above 00FF hex.
asIso8859String - ensures only compatibility codepoints are used, and that the characters are each valid ISO-8859-1.
asAsciiString - ensures only characters 00 hex - 7F hex are used.
asUtf8ComposedIso8859String - ensures all compatibility codepoints are expanded into small OrderedCollections of codepoints
a Utf8ComposedIso8859String class - will provide sortable and comparable UTF8 strings of all ASCII and ISO 8859-1 strings.
Then a Utf8SortableCollection class - a collection of Utf8ComposedIso8859String words and phrases.
Custom sortBlocks will define the applicable sort order.
We can create a collection... a Dictionary, thinking about it, of named, prefabricated sortBlocks.
This will work for all UTF8 strings of ISO-8859-1 and ASCII strings.
If anyone has better names for the classes, please let me know.
If anyone else wants to help
- build these,
- create SUnit tests for these
- write documentation for these
Please let me know.
n.b. I have had absolutely no experience of Ropes.
My own background with this stuff: in the early 90's, I was a Project Manager implementing office automation systems across a global company, with offices in the Americas, Western, Eastern and Central Europe (including nations with Slavic and Cyrillic users), Japan and China. The mission-critical application was word-processing.
Our offices were spread around the globe, and we needed those offices to successfully exchange documents with their sister offices, and with the customers in each region the offices were in.
Unicode was then new, and our platform supplier was the NeXT Corporation, who had been a founder member of the Unicode Consortium in 1990.
So far: I've read the latest version of the Unicode Standard (v8.0). This is freely downloadable. I've purchased a paper copy of an earlier release. New releases typically consist of additional codespaces (i.e. alphabets). So old copies are useful, as well as cheap. (Paper copies of version 4.0 are available second-hand for < $10 / €10.)
The typical change with each release is the addition of further codespaces (i.e. alphabets, more or less), so you don't lose a lot. (I'll be going through my v4.0 just to make sure.)
Cheers, Euan
On 5 December 2015 at 13:08, stepharo stepharo@free.fr wrote:
Hi EuanM
On 4/12/15 at 12:42, EuanM wrote:
I'm currently groping my way to seeing how feature-complete our Unicode support is. I am doing this to establish what still needs to be done to provide full Unicode support.
This is great. Thanks for pushing this. I wrote and collected some roadmaps (analyses of different topics) on the Pharo GitHub project; feel free to add this one there.
This seems to me to be an area where it would be best to write it once, and then have the same codebase incorporated into the Smalltalks that most share a common ancestry.
I am keen to get: equality-testing for strings; sortability for strings which have ligatures and diacritic characters; and correct round-tripping of data.
Go! My suggestion is: start small, make steady progress, write tests, commit often :)
Stef
What is the French phonebook ordering? This is the first time I have heard about it.
Call to action:
If you have comments on these proposals - such as "but we already have that facility" or "the reason we do not have these facilities is because they are dog-slow" - please let me know them.
If you would like to help out, please let me know.
If you have Unicode experience and expertise, and would like to be, or would be willing to be, in the 'council of experts' for this project, please let me know.
If you have comments or ideas on anything mentioned in this email, please let me know.
In the first instance, the initiative's website will be: http://smalltalk.uk.to/unicode.html
I have created a SqueakSource.com project called UnicodeSupport
I want to avoid re-inventing any facilities which already exist. Except where they prevent us reaching the goals of:
- sortable UTF8 strings
- sortable UTF16 strings
- equivalence testing of 2 UTF8 strings
- equivalence testing of 2 UTF16 strings
- round-tripping UTF8 strings through Smalltalk
- roundtripping UTF16 strings through Smalltalk.
As I understand it, we have limited Unicode support atm.
Current state of play
ByteString gets converted to WideString when need is automagically detected.
Is there anything else that currently exists?
Definition of Terms
A quick definition of terms before I go any further:
Standard terms from the Unicode standard
a compatibility character : an additional encoding of a *normal* character, for compatibility and round-trip conversion purposes. For instance, a 1-byte encoding of a Latin character with a diacritic.
Made-up terms
a convenience codepoint : a single codepoint which represents an item that is also encoded as a string of codepoints.
(I tend to use the terms compatibility character and compatibility codepoint interchangeably. The standard only refers to them as compatibility characters. However, the standard is determined to emphasise that characters are abstract and that codepoints are concrete. So I think it is often more useful and productive to think of compatibility or convenience codepoints.)
a composed character : a character made up of several codepoints
Unicode encoding explained
A convenience codepoint can therefore be thought of as a code point used for a character which also has a composed form.
The way Unicode works is that sometimes you can encode a character in one byte, sometimes not. Sometimes you can encode it in two bytes, sometimes not.
You can therefore have a long stream of ASCII which is single-byte Unicode. If there is an occasional Cyrillic or Greek character in the stream, it would be represented either by a compatibility character or by a multi-byte combination.
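The variable widths can be seen directly, for instance in Python (note that Cyrillic and Greek letters are always multi-byte in UTF-8; they have no single-byte form):

```python
# UTF-8 widths: 1 byte for ASCII, more for everything else.
samples = {"A": 1, "é": 2, "Ж": 2, "€": 3}
for char, width in samples.items():
    assert len(char.encode("utf-8")) == width
```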
Using compatibility characters can prevent proper sorting and equivalence testing.
Using "pure" Unicode, i.e. "normal encodings", can cause compatibility and round-tripping problems, although avoiding them can *also* cause compatibility issues and round-tripping problems.
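The equivalence-testing problem looks like this in practice (a Python sketch):

```python
import unicodedata

precomposed = "\u00e9"  # é as one codepoint
composed = "e\u0301"    # e followed by the combining acute

# Codepoint-by-codepoint comparison says the two strings differ...
assert precomposed != composed
# ...even though they are canonically equivalent; normalising both to
# the same form (here NFC) makes the comparison succeed.
assert (unicodedata.normalize("NFC", composed)
        == unicodedata.normalize("NFC", precomposed))
```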
Currently my thinking is:
a Utf8String class: an OrderedCollection, with 1-byte characters as the modal element, but short arrays of wider characters where necessary.
a Utf16String class: an OrderedCollection, with 2-byte characters as the modal element, but short arrays of wider characters, beginning with a 2-byte endianness indicator.
Utf8Strings sometimes need to be sortable, and sometimes need to be compatible.
So my thinking is that Utf8String will contain convenience codepoints, for round-tripping. And where there are multiple convenience codepoints for a character, that it standardises on one.
And that there is a Utf8SortableString which uses *only* normal characters.
We then need methods to convert between the two.
aUtf8String asUtf8SortableString
and
aUtf8SortableString asUtf8String
Sort orders are culture- and context-dependent - Sweden and Germany have different sort orders for the same diacritic-ed characters. Some countries have one order in general usage, and another for specific usages, such as phone directories (e.g. UK and France).
Similarly for Utf16 : Utf16String and Utf16SortableString and conversion methods
A list of sorted words would be a SortedCollection, and there could be pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder, seOrder, ukOrder, etc
along the lines of aListOfWords := SortedCollection sortBlock: deOrder
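In Python terms, the named sortBlocks would be key functions kept in a dictionary; a sketch (the names mirror the hypothetical frPhoneBookOrder / deOrder above, but these keys are crude stand-ins, not real collation rules):

```python
import unicodedata

def strip_marks(s):
    # Decompose, then drop the combining marks, leaving base letters.
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if not unicodedata.combining(c))

# A dictionary of named, prefabricated orderings (sketches only).
sort_keys = {
    "codepointOrder": lambda s: s,
    "accentInsensitiveOrder": lambda s: strip_marks(s).lower(),
}

words = ["Zoo", "École", "Apple"]
assert (sorted(words, key=sort_keys["accentInsensitiveOrder"])
        == ["Apple", "École", "Zoo"])
```

Real per-culture ordering would plug proper collation rules into the same dictionary-of-keys shape.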
If a word is either a Utf8SortableString, or a well-formed Utf8String, then we can perform equivalence testing on them trivially.
To make sure a Utf8String is well formed, we would need to have a way of cleaning up any convenience codepoints which were valid, but which were for a character which has multiple equally-valid alternative convenience codepoints, and for which the string currently had the "wrong" convenience codepoint. (I.e. for any character with valid alternative convenience codepoints, we would choose one to be in the well-formed Utf8String, and we would need a method for cleaning the alternative convenience codepoints out of the string, and replacing them with the chosen approved convenience codepoint.)
aUtf8String cleanUtf8String
With WideString, a lot of the issues disappear - except round-tripping. (Although I'm sure I have seen something recently about 4-byte strings that also have an additional bit, which would make some Unicode characters 5 bytes long.)
(I'm starting to zone out now - if I've overlooked anything - obvious, subtle, or somewhere in between, please let me know)
Cheers, Euan
squeak-dev@lists.squeakfoundation.org