Re: Squeak to/from UTF-8 conversions

26 Jun 2007

However, if you strip the language tag, you will run into very minor 
bugs with the A macron and a macron, because their encodings have been 
hijacked as CrossedX and EndOfRun in the CharacterScanner family (a clever 
trick when there were only 256 Characters). I searched for how these damned 
characters could ever work in Squeak and Sophie, and found that the black 
magic was this language tag.
Andreas, maybe you could have a look at how RTF text is converted in 
Sophie; it seems to deal with the language tag correctly, at least with 
extended Latin characters.
Nicolas
Bert Freudenberg wrote:
...
On Jun 26, 2007, at 9:19 , Andreas Raab wrote:
...
Hi -
I was working on a little improvement in UTF-8 conversion speed (so 
far it's about 150x faster for latin-1 text ;-) and for measuring the 
improvements was running a test that said:
strings := String allSubInstances.
1 to: strings size do:[:i|
    original := strings at: i.
    utf8 := original squeakToUtf8.
    copy := utf8 utf8ToSqueak.
    original = copy ifFalse:[self error: 'Encoding problem'].
].
When I ran this test it failed on each and every WideString instance. 
Digging into it, it seems that all of the WideStrings in Squeak have a 
language tag that is being supplied implicitly by the current 
LanguageEnvironment.
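One way to confirm that only the tag differs is to repeat the round trip 
but compare code points with the tag masked out. The following is a 
minimal workspace sketch, not a definitive test: it assumes that 
Character>>charCode answers the code point without the leading char 
(language tag), and the variable name "same" is introduced here purely 
for illustration.

```smalltalk
"Workspace sketch: same round trip as above, but compare code points
 only, ignoring the language tag carried by each Character."
| strings original copy same |
strings := String allSubInstances.
1 to: strings size do: [:i |
    original := strings at: i.
    copy := original squeakToUtf8 utf8ToSqueak.
    same := original size = copy size.
    same ifTrue: [
        1 to: original size do: [:j |
            "Assumption: charCode strips the leading char (language tag)"
            (original at: j) charCode = (copy at: j) charCode
                ifFalse: [same := false]]].
    same ifFalse: [self error: 'Code points differ, not just the tag']].
```

If this variant passes where the plain = comparison fails, the mismatch 
comes from the language tag alone and not from the UTF-8 conversion of 
the code points themselves.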
Questions:

1) From what it looks like right now there is no way to preserve that 
language tag through a UTF-8 conversion. Is this indeed the case or am 
I missing something?
2) Given that my language environment is being set to Latin-1, how 
should clients treat UTF-8 to provide the "proper" language tag? For 
example, I expected that a client be able to read and write UTF-8 text 
without implicitly providing that language tag. If that's the case, 
then how does one store these in common text files? (I could see how 
to do this for formatted text but not for "plain text files" without 
further attribution)
3) More generally asking, isn't the language tag here more of a 
"decorator" along the lines of text attributes? This would certainly 
model more closely the effect that I'm seeing here (some attributes 
are dropped by the squeak -> utf8 -> squeak conversion) *except* that 
I didn't expect any lossy conversion for strings (contrary to Text 
where dropping text attributes is obviously lossy).
Nice catch. We had the discussion before, and this to me is another hint 
that we really should strip the language tag from Strings and move it to 
Text attributes. For rendering bare strings the default language could 
be taken from the current environment. The problem is, IIUC, that 
currently a lot of bare strings are passed around so it was simpler to 
just tag the language onto the string itself.

Bert -