Christoph Thiede uploaded a new version of Multilingual to project The Trunk: http://source.squeak.org/trunk/Multilingual-ct.271.mcz
==================== Summary ====================
Name: Multilingual-ct.271
Author: ct
Time: 4 April 2022, 7:57:21.50746 pm
UUID: de94b8ca-494e-d149-b2e0-e7e6d714d25b
Ancestors: Multilingual-mt.270
Merges UnicodeData.cs:
This changeset repairs the fetching & parsing of Unicode category data and adds the new interface #generalCategoryTagOf: as well as a protocol for converting between Unicode categories and tags.

Usage:
    Unicode reinitializeData.
    Unicode generalCategoryLabelOf: 16r1F388. "'Symbol, Other'"

Remaining limitations:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
Revision from 3.cs: Minor clean-up in #parseUnicodeDataFrom:, fix default category values.
Note that this change will not yet upgrade your Unicode database, which will only happen when building a new image. However, you can run "Unicode reinitializeData" to benefit from the new data right now. The only reason why I did not put this into the postscript is literally to avoid any trouble with proxies or firewalls. :-)
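As a rough workspace sketch of that manual refresh: the snippet below only uses selectors that appear in this commit, and it assumes the image has network access, because the Unicode data is downloaded again. The temporary variable name is arbitrary.

```smalltalk
| tag |
"Drop the cached tables and rebuild them (downloads and parses the Unicode data)."
Unicode reinitializeData.

"Ask for the two-letter general category tag of U+1F388 via the new interface."
tag := Unicode generalCategoryTagOf: 16r1F388.

"Convert the tag into its human-readable label."
Unicode generalCategoryLabelForTag: tag.
```

If the refresh worked, the last expression should presumably answer the same label that the commit message reports for this code point via #generalCategoryLabelOf:.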
Thanks to Levente (ul) and Marcel (mt) for their help! For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Diff against Multilingual-mt.270 ===============
Item was changed:
  EUCCNTextConverter subclass: #CNGBTextConverter
      instanceVariableNames: ''
      classVariableNames: ''
      poolDictionaries: ''
      category: 'Multilingual-TextConversion'!
+
+ !CNGBTextConverter commentStamp: '<historical>' prior: 0!
+ Text converter for Simplified Chinese variation of EUC. (Even though the name doesn't look so, it is what it is.)!
Item was added:
+ ----- Method: Unicode class>>allCategoryTags (in category 'character classification') -----
+ allCategoryTags
+
+     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)!
Item was changed:
  ----- Method: Unicode class>>blocks320Comment (in category 'comments') -----
  blocks320Comment
+     "http://www.unicode.org/Public/3.2-Update/Blocks-3.2.0.txt"
  "# Blocks-3.2.0.txt
  # Correlated with Unicode 3.2
  # Start Code..End Code; Block Name
  0000..007F; Basic Latin
  0080..00FF; Latin-1 Supplement
  0100..017F; Latin Extended-A
  0180..024F; Latin Extended-B
  0250..02AF; IPA Extensions
  02B0..02FF; Spacing Modifier Letters
  0300..036F; Combining Diacritical Marks
  0370..03FF; Greek and Coptic
  0400..04FF; Cyrillic
  0500..052F; Cyrillic Supplementary
  0530..058F; Armenian
  0590..05FF; Hebrew
  0600..06FF; Arabic
  0700..074F; Syriac
  0780..07BF; Thaana
  0900..097F; Devanagari
  0980..09FF; Bengali
  0A00..0A7F; Gurmukhi
  0A80..0AFF; Gujarati
  0B00..0B7F; Oriya
  0B80..0BFF; Tamil
  0C00..0C7F; Telugu
  0C80..0CFF; Kannada
  0D00..0D7F; Malayalam
  0D80..0DFF; Sinhala
  0E00..0E7F; Thai
  0E80..0EFF; Lao
  0F00..0FFF; Tibetan
  1000..109F; Myanmar
  10A0..10FF; Georgian
  1100..11FF; Hangul Jamo
  1200..137F; Ethiopic
  13A0..13FF; Cherokee
  1400..167F; Unified Canadian Aboriginal Syllabics
  1680..169F; Ogham
  16A0..16FF; Runic
  1700..171F; Tagalog
  1720..173F; Hanunoo
  1740..175F; Buhid
  1760..177F; Tagbanwa
  1780..17FF; Khmer
  1800..18AF; Mongolian
  1E00..1EFF; Latin Extended Additional
  1F00..1FFF; Greek Extended
  2000..206F; General Punctuation
  2070..209F; Superscripts and Subscripts
  20A0..20CF; Currency Symbols
  20D0..20FF; Combining Diacritical Marks for Symbols
  2100..214F; Letterlike Symbols
  2150..218F; Number Forms
  2190..21FF; Arrows
  2200..22FF; Mathematical Operators
  2300..23FF; Miscellaneous Technical
  2400..243F; Control Pictures
  2440..245F; Optical Character Recognition
  2460..24FF; Enclosed Alphanumerics
  2500..257F; Box Drawing
  2580..259F; Block Elements
  25A0..25FF; Geometric Shapes
  2600..26FF; Miscellaneous Symbols
  2700..27BF; Dingbats
  27C0..27EF; Miscellaneous Mathematical Symbols-A
  27F0..27FF; Supplemental Arrows-A
  2800..28FF; Braille Patterns
  2900..297F; Supplemental Arrows-B
  2980..29FF; Miscellaneous Mathematical Symbols-B
  2A00..2AFF; Supplemental Mathematical Operators
  2E80..2EFF; CJK Radicals Supplement
  2F00..2FDF; Kangxi Radicals
  2FF0..2FFF; Ideographic Description Characters
  3000..303F; CJK Symbols and Punctuation
  3040..309F; Hiragana
  30A0..30FF; Katakana
  3100..312F; Bopomofo
  3130..318F; Hangul Compatibility Jamo
  3190..319F; Kanbun
  31A0..31BF; Bopomofo Extended
  31F0..31FF; Katakana Phonetic Extensions
  3200..32FF; Enclosed CJK Letters and Months
  3300..33FF; CJK Compatibility
  3400..4DBF; CJK Unified Ideographs Extension A
  4E00..9FFF; CJK Unified Ideographs
  A000..A48F; Yi Syllables
  A490..A4CF; Yi Radicals
  AC00..D7AF; Hangul Syllables
  D800..DB7F; High Surrogates
  DB80..DBFF; High Private Use Surrogates
  DC00..DFFF; Low Surrogates
  E000..F8FF; Private Use Area
  F900..FAFF; CJK Compatibility Ideographs
  FB00..FB4F; Alphabetic Presentation Forms
  FB50..FDFF; Arabic Presentation Forms-A
  FE00..FE0F; Variation Selectors
  FE20..FE2F; Combining Half Marks
  FE30..FE4F; CJK Compatibility Forms
  FE50..FE6F; Small Form Variants
  FE70..FEFF; Arabic Presentation Forms-B
  FF00..FFEF; Halfwidth and Fullwidth Forms
  FFF0..FFFF; Specials
  10300..1032F; Old Italic
  10330..1034F; Gothic
  10400..1044F; Deseret
  1D000..1D0FF; Byzantine Musical Symbols
  1D100..1D1FF; Musical Symbols
  1D400..1D7FF; Mathematical Alphanumeric Symbols
  20000..2A6DF; CJK Unified Ideographs Extension B
  2F800..2FA1F; CJK Compatibility Ideographs Supplement
  E0000..E007F; Tags
  F0000..FFFFF; Supplementary Private Use Area-A
  100000..10FFFF; Supplementary Private Use Area-B
  "!
Item was added:
+ ----- Method: Unicode class>>generalCategoryIndexFromTag: (in category 'character classification') -----
+ generalCategoryIndexFromTag: tag
+
+     ^ (self allCategoryTags indexOf: tag) - 1!
Item was added:
+ ----- Method: Unicode class>>generalCategoryLabelForTag: (in category 'character classification') -----
+ generalCategoryLabelForTag: tag
+
+     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1!
Item was added:
+ ----- Method: Unicode class>>generalCategoryTagOf: (in category 'character classification') -----
+ generalCategoryTagOf: aCharacterCode
+
+     ^ (self generalCategoryOf: aCharacterCode)
+         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
+         ifNil: [#Cn]!
Item was changed:
  ----- Method: Unicode class>>initialize (in category 'class initialization') -----
  initialize
      " Unicode initialize "
      self initializeTagConstants.
+
+     self flag: #deduplicate. "Currently, we are downloading and parsing #unicodeData twice."
+     Compositions isEmptyOrNil ifTrue: [self initializeCompositionMappings].
+     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].!
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].!
Item was added:
+ ----- Method: Unicode class>>initializeUnicodeData (in category 'unicode data') -----
+ initializeUnicodeData
+     "self initializeUnicodeData"
+
+     self parseUnicodeDataFrom: self unicodeData readStream.!
Item was changed:
  ----- Method: Unicode class>>parseUnicodeDataFrom: (in category 'unicode data') -----
  parseUnicodeDataFrom: stream
+     "self initializeUnicodeData."
- "
-     self halt.
-     self parseUnicodeDataFile
- "

+     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |

      toNumber := [:quad | ('16r', quad) asNumber].

+     GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: Cn.
-     GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.

+     GeneralCategory atAll: (16r3400 to: 16r4DB5) +1 put: Lo.
+     GeneralCategory atAll: (16r4E00 to: 16r9FA5) + 1 put: Lo.
+     GeneralCategory atAll: (16rAC00 to: 16rD7FF) + 1 put: Lo.
-     16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
-     16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
-     16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].

      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
+         tag := line copyFrom: fieldStart to: fieldEnd - 1.
+         generalCategory := self generalCategoryIndexFromTag: tag.
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
          GeneralCategory at: point+1 put: generalCategory.
+         generalCategory = Nd ifTrue: [
-         generalCategory = 'Nd' ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
+     DecimalProperty zapDefaultOnlyEntries.!
-     DecimalProperty zapDefaultOnlyEntries.
- !
Item was added:
+ ----- Method: Unicode class>>reinitializeData (in category 'class initialization') -----
+ reinitializeData
+
+     Compositions := GeneralCategory := nil.
+     self initialize.!
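
For readers skimming the diff, the tag/index conversion added above can be tried in a workspace without reloading any data. This is only an illustrative sketch based on the methods as they appear in this diff; #Xx is a made-up tag that is not part of #allCategoryTags.

```smalltalk
"Tags map to zero-based indices in the order given by #allCategoryTags."
Unicode generalCategoryIndexFromTag: #Cn.    "0: first entry"
Unicode generalCategoryIndexFromTag: #So.    "26: 27th entry"

"#indexOf: answers 0 for a missing element, so an unknown tag yields -1."
Unicode generalCategoryIndexFromTag: #Xx.    "-1"

"Going back from an index to a tag mirrors what #generalCategoryTagOf: does."
Unicode allCategoryTags at: 26 + 1.    "#So"
```

Note that #generalCategoryLabelForTag: adds 1 to this index again, so passing an unknown tag would look up position 0 of the label collection, which presumably signals an error rather than falling back to a default.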