The Trunk: Collections-topa.806.mcz


commits＠source.squeak.org

12 Sep 2018

3:28 p.m.

Tobias Pape uploaded a new version of Collections to project The Trunk: http://source.squeak.org/trunk/Collections-topa.806.mcz

==================== Summary ====================

Name: Collections-topa.806 Author: topa Time: 12 September 2018, 3:28:40.687052 pm UUID: 46b95db5-a773-4113-92f0-5ee905404b49 Ancestors: Collections-cmm.805

Fix separators to include U+00A0 (no break space)

Thanks Ron!

=============== Diff against Collections-cmm.805 ===============

Item was changed:
----- Method: Character class>>separators (in category 'instance creation') -----
separators
+   "Answer a collection of space-like separator characters.
+   Note that we do not consider spaces in >8bit code points yet.
+   "
-   "Answer a collection of the standard ASCII separator characters."
+   ^ #(9 "tab"
-   ^ #(32 "space"
-           13 "cr"
-           9 "tab"
            10 "line feed"
+           12 "form feed"
+           13 "cr"
+           32 "space"
+           160 "non-breaking space, see Unicode Z general category")
+           collect: [:v | Character value: v] as: String
+ " To be considered:
+ 16r1680 OGHAM SPACE MARK
+ 16r2000 EN QUAD
+ 16r2001 EM QUAD
+ 16r2002 EN SPACE
+ 16r2003 EM SPACE
+ 16r2004 THREE-PER-EM SPACE
+ 16r2005 FOUR-PER-EM SPACE
+ 16r2006 SIX-PER-EM SPACE
+ 16r2007 FIGURE SPACE
+ 16r2008 PUNCTUATION SPACE
+ 16r2009 THIN SPACE
+ 16r200A HAIR SPACE
+ 16r2028 LINE SEPARATOR
+ 16r2029 PARAGRAPH SEPARATOR
+ 16r202F NARROW NO-BREAK SPACE
+ 16r205F MEDIUM MATHEMATICAL SPACE
+ 16r3000 IDEOGRAPHIC SPACE
+ "!
-           12 "form feed")
-           collect: [:v | Character value: v] as: String!

Item was changed:
+ (PackageInfo named: 'Collections') postscript: 'CharacterSet cleanUp: false.'!
- (PackageInfo named: 'Collections') postscript: 'Character initializeClassificationTable'!


Levente Uzonyi

13 Sep

4:35 p.m.

You're opening a can of worms with this. There are several other separator/white space characters missing from that list. Also, this change makes the various #*separator* implementations (e.g. #isSeparator) inconsistent, so I strongly disagree with this change.

Levente
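
The inconsistency raised here can be inspected directly in a Workspace. The following is a minimal sketch, assuming an image with Collections-topa.806 loaded; it relies only on Character class>>separators, CharacterSet class>>separators and Character>>isSeparator, all of which are named in this thread. The temporaries nbsp and mismatches are illustrative names, and no output is shown because it depends on the state of the image.

```smalltalk
"Workspace sketch: compare the definitions of 'separator' for code points 0..255."
| nbsp mismatches |
nbsp := Character value: 160.	"U+00A0, no-break space"

"Ask each definition about U+00A0."
Transcript
	show: 'Character separators: ', (Character separators includes: nbsp) printString; cr;
	show: 'CharacterSet separators: ', (CharacterSet separators includes: nbsp) printString; cr;
	show: 'isSeparator: ', nbsp isSeparator printString; cr.

"Collect every code point on which the list and the predicate disagree."
mismatches := (0 to: 255) select: [:v | | c |
	c := Character value: v.
	(Character separators includes: c) ~= c isSeparator].
Transcript show: 'Disagreements: ', mismatches printString; cr.
```

If the list and the predicate agree, mismatches is empty; any code point that shows up there is one that parsers using #isSeparator would treat differently from code using the separator list.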


Tobias Pape

5:11 p.m.

On 13.09.2018, at 16:35, Levente Uzonyi leves@caesar.elte.hu wrote:

> You're opening a can of worms with this. There are several other separator/white space characters missing from that list.

Yeah, that's listed below in a comment. I am hesitant to add the others because of WideString, so I just put them in a comment.

> Also, this change makes the various #*separator* implementations (e.g. #isSeparator) inconsistent, so I strongly disagree with this change.

Hmm. But isSeparator is Wrong, then… because nbsp _is_ a separator, right? See the discussion with Ron. On a related note, is a very fast #isSeparator important? Otherwise I'd just propose

    ^ #( 9 10 12 13 32 160 ) includes: self asInteger

for now…

All other *separator* messages fall back to either Character>>#isSeparator or #separators from CharacterSet, which in turn is based on Character class>>#separators.


Levente Uzonyi

7:13 p.m.

On Thu, 13 Sep 2018, Tobias Pape wrote:

>> On 13.09.2018, at 16:35, Levente Uzonyi leves@caesar.elte.hu wrote:
>>
>> You're opening a can of worms with this. There are several other separator/white space characters missing from that list.
>
> Yeah, that's listed below in a comment. I am hesitant to add the others because of WideString, so I just put them in a comment.

That list is still incomplete (e.g. zero width space), and you still have to deal with the can of worms - aka answering "What is a separator?".

>> Also, this change makes the various #*separator* implementations (e.g. #isSeparator) inconsistent, so I strongly disagree with this change.
>
> Hmm. But isSeparator is Wrong, then… because nbsp _is_ a separator, right? See the discussion with Ron. On a related note, is a very fast #isSeparator important?

Yes, it is. It's used extensively by various parsers. For example, see the senders of #isSeparator and #skipSeparators. Also, consider how the change of behavior affects those methods (along with other users, e.g. those methods which use the character sets).

> Otherwise I'd just propose
>
>     ^ #( 9 10 12 13 32 160 ) includes: self asInteger
>
> for now…

According to my measurements, that would be 10-15x slower than the current implementation. I optimized it for a reason, not just for fun.

> All other *separator* messages fall back to either Character>>#isSeparator or #separators from CharacterSet, which in turn is based on Character class>>#separators.

That's true, but those are inconsistent now.

Levente
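
The 10-15x figure is Levente's own measurement and is not reproduced here. A rough way to compare the two variants yourself is sketched below, assuming BlockClosure>>bench is available in the image and answers a printable summary string. Because the proposal is written as a method body on Character, the sketch substitutes c asInteger for self asInteger so that it runs in a Workspace; chars, current and proposed are illustrative names.

```smalltalk
"Workspace sketch: time the existing predicate against the proposed list lookup."
| chars current proposed |

"All 8-bit characters, so both variants see the same input."
chars := (0 to: 255) collect: [:v | Character value: v].

"Existing implementation."
current := [chars do: [:c | c isSeparator]] bench.

"Proposed replacement, adapted from the method body quoted above."
proposed := [chars do: [:c | #(9 10 12 13 32 160) includes: c asInteger]] bench.

Transcript
	show: 'isSeparator: ', current; cr;
	show: 'includes:    ', proposed; cr.
```

Both blocks also pay for the same loop over chars, so the ratio printed this way understates the difference between the two predicates themselves; it is a sanity check, not a substitute for profiling the parsers that send #isSeparator.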


Chris Muller

9 p.m.

I think Levente raises very good points; Squeak should present a consistent implementation of what a separator is. I've always considered hard space, hard page break, etc. to be "Word Processor" characters, since they have "functionality" and are not merely "separators".

I think we should give more time for proper consideration, discussion and full implementation (with consistent behaviors everywhere), and testing, too. IMO, this type of change is low-level enough that it should not be a last-minute change put in merely minutes before the 5.2 release; we should discuss it for the next release instead.

Best, Chris


Frank Shearar

9:29 p.m.

On Thu, 13 Sep 2018 at 12:00, Chris Muller asqueaker@gmail.com wrote:

> I think Levente raises very good points, Squeak should present a consistent implementation of what a separator is.

That sounds like a category error. A _character set_ knows what a separator is. Unicode, ASCII, etc.

The question should, IMO at least, be "what character set should Squeak use" and, again IMO, that should be Unicode and, in particular, the UTF-8 encoding. (http://utf8everywhere.org/)

> I've always considered hard space and hard page break, etc. as "Word Processor" characters, since they have "functionality", not merely "separators".
>
> I think we should give more time for proper consideration, discussion and full implementation (with consistent behaviors everywhere), and testing, too. IMO, this type of change is low-level enough that it should not be a last-minute change put in merely minutes before the 5.2 release but we should discuss it for the next release.

+1 to this. Even if everyone decided that UTF-8 is the perfect encoding to use, and we should proceed with alacrity towards using it, now is not the time to start. My impression was that 5.2 was in a feature-freeze, bugfix-only phase.

frank


Levente Uzonyi

14 Sep

12:38 a.m.

On Thu, 13 Sep 2018, Frank Shearar wrote:

>> On Thu, 13 Sep 2018 at 12:00, Chris Muller asqueaker@gmail.com wrote:
>>
>> I think Levente raises very good points, Squeak should present a consistent implementation of what a separator is.
>
> That sounds like a category error. A _character set_ knows what a separator is. Unicode, ASCII, etc.
>
> The question should, IMO at least, be "what character set should Squeak use" and, again IMO, that should be Unicode and, in particular, the UTF-8 encoding. (http://utf8everywhere.org/)

My impression is that UTF-8 is slightly better in some respects and slightly worse in others than the current UTF-32 (+leading char extension) representation. So, I don't find it very tempting to make a huge change for something "different".

Levente
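
To get a feel for that trade-off, the sketch below builds a three-character string and encodes it as UTF-8. It is a minimal illustration, assuming String>>squeakToUtf8 is available in the image; s and utf8 are illustrative names. By the definition of UTF-8, $a takes one byte, U+00A0 two bytes and U+3000 three bytes, whereas a WideString stores every character in a 32-bit word.

```smalltalk
"Workspace sketch: the same text as an internal String and as UTF-8 bytes."
| s utf8 |

"Storing U+3000 forces the wide representation."
s := String with: $a with: (Character value: 160) with: (Character value: 16r3000).

"Encode to UTF-8; the result holds one element per encoded byte."
utf8 := s squeakToUtf8.

Transcript
	show: s class name, ': ', s size printString, ' characters'; cr;
	show: 'UTF-8: ', utf8 size printString, ' bytes'; cr.
```

The encoded form is more compact for mostly-ASCII text, but indexing the nth character is no longer a constant-time operation, which is one way to read "slightly better and slightly worse".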


tim Rowledge

12:52 a.m.

> The question should, IMO at least, be "what character set should Squeak use" and, again IMO, that should be Unicode and, in particular, the UTF-8 encoding. (http://utf8everywhere.org/)

We should probably have a proper UTF8String class so that at least we know that it is encoded and needs conversion to a 'real' String. During the NuScratch work I toiled mightily with string stuff and really ought to have done it then. The current widestring/bytestring stuff works quite well for most internal cases, though the cost of converting an entire string any time a big char is inserted could get annoying.

If one were making a word processor for large amounts of text, rather than a text editor with some prettiness tweaks for code editing etc., it might pay to have a form of text that allows for mixed byte & wide sub-parts. Perhaps it is even possible to use text attributes in yet another twisted and sneaky way? As we discovered in the Sophie Project, handling formatted texts is decidedly non-trivial. Especially when the customer can't even define a paragraph for you....

tim
--
tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim
Strange OpCodes: PSM: Print and SMear
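
The whole-string conversion mentioned above can be observed in a Workspace. This is a minimal sketch, assuming the usual Squeak behavior in which storing a non-octet character into a ByteString converts the receiver to a WideString; the string length of 100000 is an arbitrary choice and s and elapsed are illustrative names. No timing is quoted because it depends on the VM and the machine.

```smalltalk
"Workspace sketch: one wide character converts the entire string."
| s elapsed |

"A large string of plain ASCII, stored one byte per character."
s := String new: 100000 withAll: $a.
Transcript show: 'before: ', s class name; cr.

"Storing U+3000 (IDEOGRAPHIC SPACE) cannot fit in a byte."
elapsed := Time millisecondsToRun: [s at: 1 put: (Character value: 16r3000)].

Transcript
	show: 'after: ', s class name; cr;
	show: 'store took ', elapsed printString, ' ms'; cr.
```

Every character of the original is copied into the wider representation for the sake of a single store, which is the cost a mixed byte and wide text structure would aim to avoid.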


squeak-dev@lists.squeakfoundation.org
