Hi all! :-)
After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes, younger characters such as 😺❤🤓 are not categorized correctly, etc.). Luckily, there is already some logic to fetch the latest data from www.unicode.org (http://www.unicode.org). I'm currently reworking this logic because it's not completely automated yet and has some slips, but in the meantime, I have one general question for you:
At the moment, we have 30 class variables, one for each Unicode general category. These class variables map, in alphabetical order, to the integers from 0 to 29. Is this tedious structure really necessary? For various purposes, I would like to be able to get the category name of a specific code point from client code. The current design makes this impossible without writing additional mappings.
Tl;dr: I would like to propose to drop these class variables and use Symbols instead. They are comparable like integers, and as they are flyweights, this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with later added category names (though I don't know whether this will ever happen).
Examples:

    Unicode generalTagOf: $a asUnicode. "#Ll"

    Unicode class >> isLetterCode: charCode
        ^ (self generalTagOf: charCode) first = $L

    Unicode class >> isAlphaNumericCode: charCode
        | tag |
        ^ (tag := self generalTagOf: charCode) first = $L or: [tag = #Nd]
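To make the integer-to-Symbol idea concrete, here is a minimal workspace sketch. It assumes the 30 general categories are numbered 0 to 29 in alphabetical order, as described above; the variable names tags, tagOf and isLetter are hypothetical and not part of the Unicode class.

```smalltalk
"Sketch only: the table ordering is an assumption taken from the description above."
| tags tagOf isLetter |
tags := #(#Cc #Cf #Cn #Co #Cs #Ll #Lm #Lo #Lt #Lu
	#Mc #Me #Mn #Nd #Nl #No #Pc #Pd #Pe #Pf
	#Pi #Po #Ps #Sc #Sk #Sm #So #Zl #Zp #Zs).
"Legacy category numbers start at 0, Smalltalk arrays at 1."
tagOf := [:categoryNumber | tags at: categoryNumber + 1].
"A tag names a letter category if it starts with L."
isLetter := [:categoryNumber | (tagOf value: categoryNumber) first = $L].
tagOf value: 5.      "#Ll in this table"
isLetter value: 5.   "true"
isLetter value: 13.  "false, 13 is #Nd in this table"
```

With such a table, a number-returning selector and a Symbol-returning selector could coexist during a deprecation period.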
What do you think about this proposal? Please let me know and I will go ahead! :D
Best, Christoph
I’m not in any position to provide authority for anything, but I’m interested in learning more about what you’re doing. I’d like to know more about Unicode in Squeak, so if you post more on the topic, perhaps some examples, you can be sure I’ll be reading them.
Chris
Hi Chris, I don't know much about Unicode in Squeak at the moment either, but I will try to document as many insights as possible when I commit related stuff :)
However, at the moment this project is blocked for me as I need a second opinion before continuing with my proposed design changes ...
Best,
Christoph
________________________________ From: Squeak-dev squeak-dev-bounces@lists.squeakfoundation.org on behalf of Chris Cunnington brasspen@gmail.com Sent: Sunday, 29 March 2020 18:05:31 To: The general-purpose Squeak developers list Subject: Re: [squeak-dev] Unicode
Hi all,
I would still be interested in resuming this project and supporting the latest Unicode code points in Squeak. Is there really no one who could find a few minutes to review my proposed design change? Or, to put it another way: if I upload these changes to the inbox, will anyone merge them? :-)
Best, Christoph
-- Sent from: http://forum.world.st/Squeak-Dev-f45488.html
Hi Christoph,
On Sat, 5 Sep 2020, Christoph Thiede wrote:
> Hi all,
> I would still be interested in resuming this project and supporting the latest Unicode codepoints in Squeak. Is there really no one who could find some minutes to review my proposed design change?
Your words suggest that it has already been published, but I can't find it anywhere.
> Or putting it another way: If I will upload these changes into the inbox, will anyone merge it? :-)
I will review it and I'm sure others will do as well. Though I can't promise to merge it without having a look. :)
Levente
Hi Christoph, Hi All,
On Mar 17, 2020, at 3:51 PM, Thiede, Christoph Christoph.Thiede@student.hpi.uni-potsdam.de wrote:
> Hi all! :-)
> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as 😺❤🤓 are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values). In the 32-bit variant Characters are 30-bit unsigned integers. In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant? It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
Q2, how many bits should the 64-bit variant VM support for immediate Characters?
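As a rough illustration of Q1, the following workspace sketch shows one way an initialization loop could skip entries it cannot represent instead of failing. The 30-bit limit comes from the description above; maxCharValue, the sample entries and the deliberately out-of-range third entry are hypothetical.

```smalltalk
"Sketch only: skip table entries the image cannot represent instead of raising an error."
| maxCharValue entries loaded skipped |
maxCharValue := (1 bitShift: 30) - 1.   "largest 30-bit unsigned value"
entries := {16r61 -> #Ll. 16r1F63A -> #So. (1 bitShift: 31) -> #Cn}.
loaded := OrderedCollection new.
skipped := OrderedCollection new.
entries do: [:each |
	each key <= maxCharValue
		ifTrue: [loaded add: each]
		ifFalse: [skipped add: each key]].
loaded size.    "2"
skipped size.   "1"
```

Collecting the skipped keys rather than discarding them silently would let the initializer report what it left out.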
Then something to consider is that it is conceptually possible to support something like WideCharacter, which would represent code points outside of the immediate Character range on the 32-bit variant, analogous to LargePositiveInteger beyond SmallInteger maxVal. This can be made to work seamlessly, just as it does currently with integers, and with Floats where SmallFloat64 is only used on 64-bits.
It has implications in a few parts of the system:
- failure code for WideString (VeryWideString?) at:[put:] primitives that would have to manage overflow into/access from WideCharacter instances
- ImageSegment and other (un)pickling systems that need to convert to/from a bit-specific “wire” protocol/representation
- 32-bit <=> 64-bit image conversion
All this is easily doable (because we have models of doing it for Float and Integer general instances). But we need good specifications so we can implement the right thing from the get-go.
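The integer model referred to here can be observed directly in a workspace. This only demonstrates the existing SmallInteger / LargePositiveInteger behaviour; WideCharacter itself remains hypothetical.

```smalltalk
"Existing behaviour that a WideCharacter design could mirror."
SmallInteger maxVal class.             "SmallInteger"
(SmallInteger maxVal + 1) class.       "LargePositiveInteger"
(SmallInteger maxVal + 1 - 1) class.   "SmallInteger, results are normalized back"
```

Client code never has to choose between the two classes, which is the seamlessness the analogy is about.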
Best, Eliot _,,,^..^,,,_ (phone)
On 06.09.2020, at 19:15, Eliot Miranda eliot.miranda@gmail.com wrote:
> Q2, how many bits should the 64-bit variant VM support for immediate Characters?
Unicode has a max value of 0x10FFFF. That makes 21 bits. So no worries there.
We should just not forget the leading-char stuff (Yoshiki, Andreas,...)
Best regards -Tobias
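The 21-bit figure can be checked in a workspace with plain integer arithmetic. The 30-bit limit is the one Eliot gave for immediate Characters; maxCodePoint is just a local name.

```smalltalk
| maxCodePoint |
maxCodePoint := 16r10FFFF.
maxCodePoint highBit.                      "21, the bits needed for the largest code point"
maxCodePoint < (1 bitShift: 21).           "true"
maxCodePoint <= ((1 bitShift: 30) - 1).    "true, it fits a 30-bit immediate Character"
```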
On Sun, 6 Sep 2020, Tobias Pape wrote:
> Unicode has a max value of 0x10FFFF. That makes 21 bit. So no worries there.
> We should just not forget the leading-char stuff (Yoshiki, Andreas,...)
AFAIU the leading char only makes sense when you have multiple CJK(V?) languages in use at the same time. In other cases Unicode (leadingChar = 0) is perfectly fine. IIRC there are 22 bits available for the codePoint and 8 for the leadingChar, so we're still good: all unicode characters fit.
Levente
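Taking the recalled layout at face value (22 bits for the code point, 8 bits for the leadingChar), the packing can be sketched with plain integers. The layout is an assumption based on the recollection above, and the variable names are hypothetical; no Character API is used.

```smalltalk
"Sketch only: assumes 22 low bits for the code point and 8 bits above them for the leadingChar."
| codePoint leadingChar packed |
codePoint := 16r10FFFF.   "largest Unicode code point"
leadingChar := 255.       "largest 8-bit leadingChar"
packed := (leadingChar bitShift: 22) bitOr: codePoint.
packed bitAnd: 16r3FFFFF.   "16r10FFFF, the code point again"
packed bitShift: -22.       "255, the leadingChar again"
packed highBit.             "30, so even the extreme case fits in 30 bits"
```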
On 06.09.2020, at 20:40, Levente Uzonyi leves@caesar.elte.hu wrote:
> AFAIU the leading char only makes sense when you have multiple CJK(V?) languages in use at the same time. In other cases Unicode (leadingChar = 0) is perfectly fine. IIRC there are 22 bits available for the codePoint and 8 for the leadingChar, so we're still good: all unicode characters fit.
\o/ hooray!
Hi all,
> Your words suggest that it has already been published, but I can't find it anywhere.
Then I must have expressed myself poorly. I have not published any code changes yet, but in my original post from March, you can find a short description of the design changes I'd like to implement. Essentially, I would like to replace the separate class variables for every known character category with Symbols, for greater flexibility.
Eliot, are there any remaining questions regarding the VM size? Character size should be sufficient as discussed below, and of course, I can test any changes in a 32-bit image, too. :-)
WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect: #asCharacter)
Best, Christoph
________________________________ From: Squeak-dev squeak-dev-bounces@lists.squeakfoundation.org on behalf of Tobias Pape Das.Linux@gmx.de Sent: Sunday, 6 September 2020 21:00:14 To: The general-purpose Squeak developers list Subject: Re: [squeak-dev] Unicode
Hi Christoph,
On Tue, 8 Sep 2020, Thiede, Christoph wrote:
> Hi all,
>> Your words suggest that it has already been published, but I can't find it anywhere.
> Then I must have expressed myself wrong. I did not yet publish any code changes, but in my original post from March, you can find a short description of the design changes I'd like to implement. Essentially, I would like to replace the separate class variables for every known character class in favor of greater flexibility.
How would your changes affect GeneralCategory? Would it still be a SparseLargeTable with ByteArray as arrayClass? If you just replace those integers with symbols, the table will be at least 4 or 8 times larger in 32-bit or 64-bit images, respectively.
Levente.
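One possible way to read this concern is that the stored values could stay bytes while only the lookup result becomes a Symbol. The sketch below illustrates that reading; it is not a design anyone in the thread has proposed. A plain ByteArray stands in for the sparse table, and tags, codes and tagAt are hypothetical names.

```smalltalk
"Sketch only: bytes in storage, Symbols at the interface."
| tags codes tagAt |
tags := #(#Cc #Cf #Cn #Co #Cs #Ll #Lm #Lo #Lt #Lu
	#Mc #Me #Mn #Nd #Nl #No #Pc #Pd #Pe #Pf
	#Pi #Po #Ps #Sc #Sk #Sm #So #Zl #Zp #Zs).
codes := ByteArray new: 256.   "one byte per code point, initially all 0"
codes at: $a asInteger + 1 put: (tags indexOf: #Ll) - 1.
codes at: $7 asInteger + 1 put: (tags indexOf: #Nd) - 1.
"Translate the stored byte into its Symbol only when a client asks."
tagAt := [:codePoint | tags at: (codes at: codePoint + 1) + 1].
tagAt value: $a asInteger.   "#Ll"
tagAt value: $7 asInteger.   "#Nd"
```

The per-entry cost stays at one byte, and the only extra memory is the 30-element tag array.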
Eliot, are there any remaining questions regarding the VM size? Character size should be sufficient as discussed below, and of course, I can test any changes in a 32-bit image, too. :-)
WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect: #asCharacter)
Best, Christoph
Von: Squeak-dev squeak-dev-bounces@lists.squeakfoundation.org im Auftrag von Tobias Pape Das.Linux@gmx.de Gesendet: Sonntag, 6. September 2020 21:00:14 An: The general-purpose Squeak developers list Betreff: Re: [squeak-dev] Unicode
On 06.09.2020, at 20:40, Levente Uzonyi leves@caesar.elte.hu wrote:
On Sun, 6 Sep 2020, Tobias Pape wrote:
On 06.09.2020, at 19:15, Eliot Miranda eliot.miranda@gmail.com wrote: Hi Christoph, Hi All,
On Mar 17, 2020, at 3:51 PM, Thiede, Christoph Christoph.Thiede@student.hpi.uni-potsdam.de wrote: Hi all! :-) After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as 😺❤🤓
are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values). In the 32-bit variant Characters
are 30-bit unsigned integers. In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant? It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that
initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
Q2, how many bits should the 64-bit variant VM support for immediate Characters?
Unicode has a max value of 0x10FFFF. That makes 21 bit. So no worries there.
We should just not forget the leading-char stuff (Yoshiki, Andreas,...)
AFAIU the leading char only makes sense when you have multiple CJK(V?) languages in use at the same time. In other cases Unicode (leadingChar = 0) is perfectly fine. IIRC there are 22 bits available for the codePoint and 8 for the leadingChar, so we're still good: all unicode characters fit.
\o/ hooray!
Levente
Best regards -Tobias
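The bit counts in this exchange can be checked with a few lines of arithmetic. The sketch below is Python rather than Squeak code and is only an illustration: the 22-bit/8-bit split is taken from Levente's recollection above ("IIRC"), and the helper name `pack` is made up for this example, not a Squeak API.

```python
MAX_CODE_POINT = 0x10FFFF      # largest Unicode code point

# Bits needed to represent the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()

CODE_POINT_BITS = 22           # assumption: value recalled in the thread
LEADING_CHAR_BITS = 8          # assumption: value recalled in the thread


def pack(leading_char, code_point):
    """Hypothetical packing of a leadingChar and a code point into one integer."""
    assert 0 <= leading_char < (1 << LEADING_CHAR_BITS)
    assert 0 <= code_point < (1 << CODE_POINT_BITS)
    return (leading_char << CODE_POINT_BITS) | code_point


print(bits_needed)                                    # 21
print(CODE_POINT_BITS + LEADING_CHAR_BITS)            # 30
print(pack(255, MAX_CODE_POINT).bit_length() <= 30)   # True
```

Under these assumptions, even the largest leadingChar combined with the largest code point stays within the 30 bits of an immediate Character.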
Then something to consider is that it is conceptually possible to support something like WideCharacter, which would represent code points outside of the immediate Character range on the 32-bit variant, analogous to
LargePositiveInteger beyond SmallInteger maxVal. This can be made to work seamlessly, just as it does currently with integers, and with Floats where SmallFloat64 is only used on 64-bits.
It has implications in a few parts of the system:
- failure code for WideString (VeryWideString?) at:[put:] primitives that would have to manage overflow into/access from WideCharacter instances
- ImageSegment and other (un)pickling systems that need to convert to/from a bit-specific “wire” protocol/representation
- 32-bit <=> 64-bit image conversion
All this is easily doable (because we have models of doing it for Float and Integer general instances). But we need good specifications so we can implement the right thing from the get-go.
At the moment, we have 30 class variables each for one Unicode category number. These class vars map in alphabetical order to the integers from 0 to: 29. Is this tedious structure really necessary? For different
purposes, I would like to get the category name of a specific code point from a client. The current design makes this impossible without writing additional mappings.
Tl;dr: I would like to propose to drop these class variables and use Symbols instead. They are comparable like integers, and as they are flyweights, this should not be a performance issue either. Of course,
#generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with later added category names (though I don't know whether this will ever happen).
Examples:

    Unicode generalTagOf: $a asUnicode. "#Ll"

    Unicode class >> isLetterCode: charCode
        ^ (self generalTagOf: charCode) first = $L

    Unicode class >> isAlphaNumericCode: charCode
        | tag |
        ^ (tag := self generalTagOf: charCode) first = $L
            or: [tag = #Nd]

What do you think about this proposal? Please let me know and I will go ahead! :D

Best,
Christoph
Best, Eliot _,,,^..^,,,_ (phone)
Hi Levente,
basically, I would only like to get rid of the class variables for every single Unicode category because they offer poor explorability (it's hard to work with the numeric output of #generalCategoryOf:) and poor extensibility (you need to recompile the class definition for adding UTF-16 support). If you are concerned about increasing the size of the SparseLargeTable, I think we could also just add one or two extra dictionaries that map every category symbol to a number and vice versa. What do you think?
Best,
Christoph
From: Squeak-dev squeak-dev-bounces@lists.squeakfoundation.org on behalf of Levente Uzonyi leves@caesar.elte.hu
Sent: Tuesday, 8 September 2020 21:43:56
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
Hi Christoph,
On Tue, 8 Sep 2020, Thiede, Christoph wrote:
Hi all,
Your words suggest that it has already been published, but I can't find it anywhere.
Then I must have expressed myself badly. I have not yet published any code changes, but in my original post from March, you can find a short description of the design changes I'd like to implement. Essentially, I would like to drop the separate class variables for every known character class in favor of something more flexible.
How would your changes affect GeneralCategory? Would it still be a SparseLargeTable with ByteArray as arrayClass? If you just replace those integers with symbols, the size of the table will be at least 4 or 8 times larger in 32 or 64 bit images, respectively.
Levente.
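Levente's "4 or 8 times" estimate follows from the per-entry width: a ByteArray slot holds one byte, while a slot holding a reference to a symbol is one machine word. The following Python sketch is only an illustration of that arithmetic, not Squeak code. It assumes 4-byte and 8-byte object pointers and uses the table size 16rE0080 that appears in the changeset later in this thread. A SparseLargeTable only allocates chunks that differ from the default, so the absolute numbers are upper bounds; the ratio is the point.

```python
TABLE_SIZE = 0xE0080   # entries reserved for GeneralCategory in the changeset

# Assumed width of one table slot, in bytes.
BYTES_PER_ENTRY = {
    "ByteArray (integer category)": 1,
    "Array of symbols, 32-bit image": 4,
    "Array of symbols, 64-bit image": 8,
}


def dense_size_in_bytes(bytes_per_entry, entries=TABLE_SIZE):
    """Size of a fully populated (non-sparse) table with the given slot width."""
    return bytes_per_entry * entries


for label, width in BYTES_PER_ENTRY.items():
    print(f"{label}: {dense_size_in_bytes(width)} bytes")
```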
Eliot, are there any remaining questions regarding the VM size? Character size should be sufficient as discussed below, and of course, I can test any changes in a 32-bit image, too. :-)
WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect: #asCharacter)
Best, Christoph
From: Squeak-dev squeak-dev-bounces@lists.squeakfoundation.org on behalf of Tobias Pape Das.Linux@gmx.de
Sent: Sunday, 6 September 2020 21:00:14
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
Hi Christoph,
On Wed, 9 Sep 2020, Thiede, Christoph wrote:
Hi Levente,
basically, I would only like to get rid of the class variables for every single Unicode category because they offer poor explorability (it's hard to work with the numeric output of #generalCategoryOf:) and poor extensibility (you need to recompile the class definition for adding UTF-16 support). If you are concerned about increasing the size of the SparseLargeTable, I think we could also just add one or two extra dictionaries that map every category symbol to a number and vice versa. What do you think?
You mean an array to map the integers to symbols, right? :) Anyway, I don't think it's worth using symbols internally. For example, #isLetterCode: is 8-10% slower with the extra array lookup and checking the category symbol's first letter than the current method of integer comparisons.
Do you expect these constants to appear outside the Unicode class? If yes, then using symbols for those cases is probably a good solution. But for internal use, the integers are better.
Levente
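The split Levente describes, integers internally and symbols only at the boundary, can be sketched as follows. This is Python for illustration only, not Squeak code: the tag order is copied from the #allCategoryTags method in the changeset later in this thread, and the function names are hypothetical.

```python
# Tag order as listed in #allCategoryTags; the position is the integer category.
ALL_CATEGORY_TAGS = (
    "Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No "
    "Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs"
).split()

LL = ALL_CATEGORY_TAGS.index("Ll")   # first letter category
LU = ALL_CATEGORY_TAGS.index("Lu")   # last letter category


def is_letter_index(category):
    """Internal check: two integer comparisons, no lookup."""
    return LL <= category <= LU


def tag_for_index(category):
    """Boundary conversion: integer category to tag, 'Cn' for unknown values."""
    if 0 <= category < len(ALL_CATEGORY_TAGS):
        return ALL_CATEGORY_TAGS[category]
    return "Cn"


def is_letter_tag(category):
    """Same check via the tag: one extra lookup plus a first-letter test."""
    return tag_for_index(category).startswith("L")


print(tag_for_index(13))                       # Nd
print(is_letter_index(5), is_letter_tag(5))    # True True
```

Both checks agree for every category; the second simply does more work per call, which is the overhead Levente measured for #isLetterCode:.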
Hi all, Hi Marcel, Hi Levente,
finally, here is a changeset that takes the first step toward updating our in-image Unicode database. After filing it in, please run:

    Unicode initializeUnicodeData.

In addition to the preamble (please read it first, below), I still have a number of questions:

* Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
* How much effort shall we put into deduplicating the logic and the data in this class? This includes both the two similar parsing methods and the redundant specification of the Unicode character tags.
Best, Christoph
=============== Summary ===============
Change Set: UnicodeData
Date: 24 February 2022
Author: Christoph Thiede
This changeset repairs the fetching & parsing of unicode category data. Usage:
    Unicode initializeUnicodeData.
    Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
+ allCategoryTags
+
+     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)

Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
+ generalCategoryIndexFromTag: tag
+
+     ^ (self allCategoryTags indexOf: tag) - 1

Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
+ generalCategoryLabelForTag: tag
+
+     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1

Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
+ generalCategoryTagOf: aCharacterCode
+
+     ^ (self generalCategoryOf: aCharacterCode)
+         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
+         ifNil: [#Cn]

Unicode class>>initialize {class initialization} · ct 2/24/2022 21:43 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].

Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
+ initializeUnicodeData
+     "self initializeUnicodeData"
+
+     self parseUnicodeDataFrom: self unicodeData readStream.
Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
-     "
-     self halt.
-     self parseUnicodeDataFile
-     "
+     "self initializeUnicodeData."

-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
+     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |

      toNumber := [:quad | ('16r', quad) asNumber].

      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.

      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].

      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
+         tag := line copyFrom: fieldStart to: fieldEnd - 1.
+         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
+         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.
UnicodeTest
+ TestCase subclass: #UnicodeTest
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+
+ UnicodeTest class
+     instanceVariableNames: ''
+
+ ""

UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
+ resources
+
+     ^ super resources copyWith: UnicodeTestResource

UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryLabel
+
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
+     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
+
+     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
+     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
+
+     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).

UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
+ testGeneralCategoryLabelForTag
+
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).

UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryTag
+
+     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
+     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
+
+     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
+     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
+
+     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).

UnicodeTestResource
+ TestResource subclass: #UnicodeTestResource
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+
+ UnicodeTestResource class
+     instanceVariableNames: ''
+
+ ""

UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
+ setUp
+
+     super setUp.
+
+     "Test the functionality of this update logic"
+     Unicode initializeCompositionMappings.
+     Unicode initializeUnicodeData.
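For readers unfamiliar with the input format: #parseUnicodeDataFrom: in the diff above walks each line of UnicodeData.txt field by field, using the semicolon as separator. It takes the first field as the hexadecimal code point, the third as the general category tag, and, for Nd entries only, the seventh as the decimal digit value. The Python sketch below mirrors just that field selection. It is an illustration, not part of the changeset, and the function name `parse_line` is made up.

```python
def parse_line(line):
    """Extract (code point, category tag, decimal value or None) from one
    semicolon-separated UnicodeData.txt line."""
    fields = line.split(";")
    code_point = int(fields[0], 16)   # field 1: code point in hex
    tag = fields[2]                   # field 3: general category tag
    # Field 7 (decimal digit value) is only read for decimal numbers.
    decimal = int(fields[6]) if tag == "Nd" else None
    return code_point, tag, decimal


samples = [
    "0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;;0041;;0041",
    "1F388;BALLOON;So;0;ON;;;;;N;;;;;",
]
for sample in samples:
    print(parse_line(sample))
```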
--- Sent from Squeak Inbox Talk
["UnicodeData.2.cs"]
Hi Christoph --
Thanks for doing this!
Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
I think it is okay. We could make it explicit in the ReleaseBuilder or in some external CI script but having it part of the update stream is okay. The updates are fetched from the outside anyway, right? ;-)
Best; Marcel Am 24.02.2022 22:16:42 schrieb christoph.thiede@student.hpi.uni-potsdam.de christoph.thiede@student.hpi.uni-potsdam.de: Hi all, Hi Marcel, Hi Levente,
finally, here is a changeset that takes the first step for updating or in-image Unicode database. After filing it in, please run:
Unicode initializeUnicodeData.
In addition to the preamble (please read first below), I have still a number of questions:
* Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses. * How much effort shall we put in deduplicating the logic and the data in this class? This includes both the two similar parsing methods and the redundant specification of the Unicode character tags.
Best, Christoph
=============== Summary ===============
Change Set: UnicodeData Date: 24 February 2022 Author: Christoph Thiede
This changeset repairs the fetching & parsing of unicode category data. Usage:
Unicode initializeUnicodeData. Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests.
Still present limitations include: - Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom: - Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
+ allCategoryTags
+
+     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)

Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
+ generalCategoryIndexFromTag: tag
+
+     ^ (self allCategoryTags indexOf: tag) - 1

Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
+ generalCategoryLabelForTag: tag
+
+     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1

Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
+ generalCategoryTagOf: aCharacterCode
+
+     ^ (self generalCategoryOf: aCharacterCode)
+         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
+         ifNil: [#Cn]

Unicode class>>initialize {class initialization} · ct 2/24/2022 21:43 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].

Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
+ initializeUnicodeData
+     "self initializeUnicodeData"
+
+     self parseUnicodeDataFrom: self unicodeData readStream.
Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
-     "
-     self halt.
-     self parseUnicodeDataFile
-     "
+     "self initializeUnicodeData."

-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
+     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |

      toNumber := [:quad | ('16r', quad) asNumber].

      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.

      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].

      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
+         tag := line copyFrom: fieldStart to: fieldEnd - 1.
+         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
+         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.
UnicodeTest
+ TestCase subclass: #UnicodeTest
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+
+ UnicodeTest class
+     instanceVariableNames: ''
+
+ ""

UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
+ resources
+
+     ^ super resources copyWith: UnicodeTestResource

UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryLabel
+
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
+     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
+
+     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
+     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
+
+     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).

UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
+ testGeneralCategoryLabelForTag
+
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).

UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryTag
+
+     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
+     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
+
+     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
+     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
+
+     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).

UnicodeTestResource
+ TestResource subclass: #UnicodeTestResource
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+
+ UnicodeTestResource class
+     instanceVariableNames: ''
+
+ ""

UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
+ setUp
+
+     super setUp.
+
+     "Test the functionality of this update logic"
+     Unicode initializeCompositionMappings.
+     Unicode initializeUnicodeData.
--- Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
On 2020-09-11T00:49:24+02:00, leves@caesar.elte.hu wrote:
Hi Christoph,
On Wed, 9 Sep 2020, Thiede, Christoph wrote:
Hi Levente,
basically, I only would like to get rid of the class variables for every single Unicode category because they provide low explorability (it's hard to work with the numeric output of #generalCategoryOf:) and extensibility (you need to recompile the class definition for adding UTF-16 support). If you are critical of increasing the size of the SparseLargeTable, I think we would also just make one or two extra dictionaries to map every category symbol to a number and vice versa. What do you think?
You mean an array to map the integers to symbols, right? :) Anyway, I don't think it's worth using symbols internally. For example, #isLetterCode: is 8-10% slower with the extra array lookup and checking the category symbol's first letter than the current method of integer comparisons.
Do you expect these constants to appear outside the Unicode class? If yes, then using symbols for those cases is probably a good solution. But for internal use, the integers are better.
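The trade-off described here can be sketched in a workspace. This is only an illustration under assumptions: the tag order is copied from #allCategoryTags as posted in the changeset, the two blocks are hypothetical stand-ins, and the actual implementation of #isLetterCode: is not shown in this thread.

```smalltalk
"Workspace sketch. Tag order copied from #allCategoryTags in the posted changeset;
 the blocks below are hypothetical and not the real #isLetterCode:."
| tags isLetterByIndex isLetterByTag |
tags := #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs).

"Integer variant: with this order, the letter categories Ll..Lu occupy indexes 5..9."
isLetterByIndex := [:index | index between: 5 and: 9].

"Symbol variant: one extra array lookup plus a check of the tag's first character."
isLetterByTag := [:index | (tags at: index + 1) first = $L].

"Both variants should agree for every category index."
(0 to: tags size - 1) allSatisfy: [:index |
	(isLetterByIndex value: index) = (isLetterByTag value: index)].
```

The symbol variant is easier to read and to use from outside the class, while the integer variant saves the lookup, which is in line with keeping integers internally and offering symbols at the interface.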
Levente
Best,
Christoph
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Levente Uzonyi <leves at caesar.elte.hu>
Sent: Tuesday, 8 September 2020 21:43:56
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode

Hi Christoph,
On Tue, 8 Sep 2020, Thiede, Christoph wrote:
Hi all,
Your words suggest that it has already been published, but I can't find it anywhere.
Then I must have expressed myself wrong. I did not yet publish any code changes, but in my original post from March, you can find a short description of the design changes I'd like to implement. Essentially, I would like to replace the separate class variables for every known character class in favor of greater flexibility.
How would your changes affect GeneralCategory? Would it still be a SparseLargeTable with ByteArray as arrayClass? If you just replace those integers with symbols, the size of the table will be at least 4 or 8 times larger in 32 or 64 bit images, respectively.
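As a rough back-of-the-envelope check of the factor 4 or 8, here is a workspace sketch. It assumes one byte per slot for a ByteArray-backed table and one pointer (4 or 8 bytes) per slot for an Array-backed one, and it deliberately ignores that a SparseLargeTable only allocates chunks that hold non-default values, so the numbers are upper bounds.

```smalltalk
"Upper-bound estimate only; sparsity and object headers are ignored."
| entries |
entries := 16rE0080. "table size used in #parseUnicodeDataFrom:"
entries.     "bytes with 1-byte slots (ByteArray)"
entries * 4. "bytes with pointer slots in a 32-bit image"
entries * 8. "bytes with pointer slots in a 64-bit image"
```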
Levente.
Eliot, are there any remaining questions regarding the VM size? Character size should be sufficient as discussed below, and of course, I can test any changes in a 32-bit image, too. :-)
WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect: #asCharacter)
Best, Christoph
_
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Tobias Pape <Das.Linux at gmx.de>
Sent: Sunday, 6 September 2020 21:00:14
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
On 06.09.2020, at 20:40, Levente Uzonyi <leves at caesar.elte.hu> wrote:
On Sun, 6 Sep 2020, Tobias Pape wrote:
On 06.09.2020, at 19:15, Eliot Miranda <eliot.miranda at gmail.com> wrote:
Hi Christoph, Hi All,
> On Mar 17, 2020, at 3:51 PM, Thiede, Christoph <Christoph.Thiede at student.hpi.uni-potsdam.de> wrote:
> Hi all! :-)
> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as 😺❤🤓 are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values). In the 32-bit variant Characters are 30-bit unsigned integers. In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant? It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
Q2, how many bits should the 64-bit variant VM support for immediate Characters?
Unicode has a max value of 0x10FFFF. That makes 21 bit. So no worries there.
We should just not forget the leading-char stuff (Yoshiki, Andreas,...)
AFAIU the leading char only makes sense when you have multiple CJK(V?) languages in use at the same time. In other cases Unicode (leadingChar = 0) is perfectly fine. IIRC there are 22 bits available for the codePoint and 8 for the leadingChar, so we're still good: all unicode characters fit.
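The bit arithmetic above can be checked in a workspace. This sketch only restates the numbers given in the thread (22 bits for the code point, 8 for the leadingChar, 30-bit immediate Characters); it does not inspect the actual Character encoding in the VM.

```smalltalk
"Workspace sketch of the bit budget discussed above."
| maxCodePoint |
maxCodePoint := 16r10FFFF.         "largest Unicode code point"
maxCodePoint highBit.              "number of bits needed for it"
maxCodePoint < (1 bitShift: 22).   "does it fit into a 22-bit code point field?"
22 + 8.                            "code point bits plus leadingChar bits"
```

Since 16r10FFFF needs 21 bits, it fits into the 22-bit field, and 22 + 8 gives the 30 bits available for immediate Characters.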
\o/ hooray!
Levente
Best regards -Tobias
Then something to consider is that it is conceptually possible to support something like WideCharacter, which would represent code points outside of the immediate Character range on the 32-bit variant, analogous to LargePositiveInteger beyond SmallInteger maxVal. This can be made to work seamlessly, just as it does currently with integers, and with Floats where SmallFloat64 is only used on 64-bits.
It has implications in a few parts of the system:
- failure code for WideString (VeryWideString?) at:[put:] primitives that would have to manage overflow into/access from WideCharacter instances
- ImageSegment and other (un)pickling systems that need to convert to/from a bit-specific "wire" protocol/representation
- 32-bit <=> 64-bit image conversion
All this is easily doable (because we have models of doing it for Float and Integer general instances). But we need good specifications so we can implement the right thing from the get-go.
> At the moment, we have 30 class variables each for one Unicode category number. These class vars map in alphabetical order to the integers from 0 to: 29. Is this tedious structure really necessary? For different purposes, I would like to get the category name of a specific code point from a client. The current design makes this impossible without writing additional mappings.
> Tl;dr: I would like to propose to drop these class variables and use Symbols instead. They are comparable like integers, and as they are flyweights, this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with later added category names (though I don't know whether this will ever happen).
> Examples:
> Unicode generalTagOf: $a asUnicode. "#Ll"
> Unicode class >> isLetterCode: charCode
>     ^ (self generalTagOf: charCode) first = $L
> Unicode class >> isAlphaNumericCode: charCode
>     | tag|
>     ^ (tag := self generalCategoryOf: charCode) first = $L
>         or: [tag = #Nd]
> How do you think about this proposal? Please let me know and I will go ahead! :D
> Best,
> Christoph

Best, Eliot
_,,,^..^,,,_ (phone)
["UnicodeData.2.cs"]
Hi Marcel, thanks for the review! Below is an updated changeset. If you have no further objections, I would like to merge it within the next few days. :-)
=============== Summary ===============
Change Set: UnicodeData
Date: 24 February 2022
Author: Christoph Thiede
This changeset repairs the fetching & parsing of unicode category data. Usage:
    Unicode reinitializeData.
    Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests. Furthermore, the Unicode data are automatically reinitialized as part of the ReleaseBuilder.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
ReleaseBuilder class>>prepareSourceCode {preparing} · ct 2/28/2022 15:54 (changed)
  prepareSourceCode
      "Update packages. Remove foreign packages. Recompile."

      CurrentReadOnlySourceFiles cacheDuring: [self
          updateCorePackages;
          unloadForeignPackages;
          checkForDirtyPackages;
          loadWellKnownPackages;
          checkForUndeclaredSymbols;
          checkForNilCategories;
-         recompileAll]
+         recompileAll;
+         updateDatabases]

ReleaseBuilder class>>updateDatabases {scripts - support} · ct 2/28/2022 16:06
+ updateDatabases
+
+     Unicode reinitializeData.

Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
+ allCategoryTags
+
+     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)

Unicode class>>blocks320Comment {comments} · ct 2/28/2022 15:50 (changed)
  blocks320Comment
+     "http://www.unicode.org/Public/3.2-Update/Blocks-3.2.0.txt"
"# Blocks-3.2.0.txt # Correlated with Unicode 3.2 # Start Code..End Code; Block Name 0000..007F; Basic Latin 0080..00FF; Latin-1 Supplement 0100..017F; Latin Extended-A 0180..024F; Latin Extended-B 0250..02AF; IPA Extensions 02B0..02FF; Spacing Modifier Letters 0300..036F; Combining Diacritical Marks 0370..03FF; Greek and Coptic 0400..04FF; Cyrillic 0500..052F; Cyrillic Supplementary 0530..058F; Armenian 0590..05FF; Hebrew 0600..06FF; Arabic 0700..074F; Syriac 0780..07BF; Thaana 0900..097F; Devanagari 0980..09FF; Bengali 0A00..0A7F; Gurmukhi 0A80..0AFF; Gujarati 0B00..0B7F; Oriya 0B80..0BFF; Tamil 0C00..0C7F; Telugu 0C80..0CFF; Kannada 0D00..0D7F; Malayalam 0D80..0DFF; Sinhala 0E00..0E7F; Thai 0E80..0EFF; Lao 0F00..0FFF; Tibetan 1000..109F; Myanmar 10A0..10FF; Georgian 1100..11FF; Hangul Jamo 1200..137F; Ethiopic 13A0..13FF; Cherokee 1400..167F; Unified Canadian Aboriginal Syllabics 1680..169F; Ogham 16A0..16FF; Runic 1700..171F; Tagalog 1720..173F; Hanunoo 1740..175F; Buhid 1760..177F; Tagbanwa 1780..17FF; Khmer 1800..18AF; Mongolian 1E00..1EFF; Latin Extended Additional 1F00..1FFF; Greek Extended 2000..206F; General Punctuation 2070..209F; Superscripts and Subscripts 20A0..20CF; Currency Symbols 20D0..20FF; Combining Diacritical Marks for Symbols 2100..214F; Letterlike Symbols 2150..218F; Number Forms 2190..21FF; Arrows 2200..22FF; Mathematical Operators 2300..23FF; Miscellaneous Technical 2400..243F; Control Pictures 2440..245F; Optical Character Recognition 2460..24FF; Enclosed Alphanumerics 2500..257F; Box Drawing 2580..259F; Block Elements 25A0..25FF; Geometric Shapes 2600..26FF; Miscellaneous Symbols 2700..27BF; Dingbats 27C0..27EF; Miscellaneous Mathematical Symbols-A 27F0..27FF; Supplemental Arrows-A 2800..28FF; Braille Patterns 2900..297F; Supplemental Arrows-B 2980..29FF; Miscellaneous Mathematical Symbols-B 2A00..2AFF; Supplemental Mathematical Operators 2E80..2EFF; CJK Radicals Supplement 2F00..2FDF; Kangxi Radicals 2FF0..2FFF; Ideographic Description 
Characters 3000..303F; CJK Symbols and Punctuation 3040..309F; Hiragana 30A0..30FF; Katakana 3100..312F; Bopomofo 3130..318F; Hangul Compatibility Jamo 3190..319F; Kanbun 31A0..31BF; Bopomofo Extended 31F0..31FF; Katakana Phonetic Extensions 3200..32FF; Enclosed CJK Letters and Months 3300..33FF; CJK Compatibility 3400..4DBF; CJK Unified Ideographs Extension A 4E00..9FFF; CJK Unified Ideographs A000..A48F; Yi Syllables A490..A4CF; Yi Radicals AC00..D7AF; Hangul Syllables D800..DB7F; High Surrogates DB80..DBFF; High Private Use Surrogates DC00..DFFF; Low Surrogates E000..F8FF; Private Use Area F900..FAFF; CJK Compatibility Ideographs FB00..FB4F; Alphabetic Presentation Forms FB50..FDFF; Arabic Presentation Forms-A FE00..FE0F; Variation Selectors FE20..FE2F; Combining Half Marks FE30..FE4F; CJK Compatibility Forms FE50..FE6F; Small Form Variants FE70..FEFF; Arabic Presentation Forms-B FF00..FFEF; Halfwidth and Fullwidth Forms FFF0..FFFF; Specials 10300..1032F; Old Italic 10330..1034F; Gothic 10400..1044F; Deseret 1D000..1D0FF; Byzantine Musical Symbols 1D100..1D1FF; Musical Symbols 1D400..1D7FF; Mathematical Alphanumeric Symbols 20000..2A6DF; CJK Unified Ideographs Extension B 2F800..2FA1F; CJK Compatibility Ideographs Supplement E0000..E007F; Tags F0000..FFFFF; Supplementary Private Use Area-A 100000..10FFFF; Supplementary Private Use Area-B
"
Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
+ generalCategoryIndexFromTag: tag
+
+     ^ (self allCategoryTags indexOf: tag) - 1

Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
+ generalCategoryLabelForTag: tag
+
+     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1

Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
+ generalCategoryTagOf: aCharacterCode
+
+     ^ (self generalCategoryOf: aCharacterCode)
+         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
+         ifNil: [#Cn]

Unicode class>>initialize {class initialization} · ct 2/28/2022 15:52 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+
+     self flag: #deduplicate. "Currently, we are downloading and parsing #unicodeData twice."
+     Compositions isEmptyOrNil ifTrue: [self initializeCompositionMappings].
+     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].

Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
+ initializeUnicodeData
+     "self initializeUnicodeData"
+
+     self parseUnicodeDataFrom: self unicodeData readStream.
Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
-     "
-     self halt.
-     self parseUnicodeDataFile
-     "
+     "self initializeUnicodeData."

-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
+     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |

      toNumber := [:quad | ('16r', quad) asNumber].

      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.

      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].

      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
+         tag := line copyFrom: fieldStart to: fieldEnd - 1.
+         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
+         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.
Unicode class>>reinitializeData {class initialization} · ct 2/28/2022 16:05
+ reinitializeData
+
+     Compositions := GeneralCategory := nil.
+     self initialize.

UnicodeTest
+ TestCase subclass: #UnicodeTest
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+
+ UnicodeTest class
+     instanceVariableNames: ''
+
+ ""

UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
+ resources
+
+     ^ super resources copyWith: UnicodeTestResource

UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryLabel
+
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
+     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
+
+     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
+     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
+
+     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).

UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
+ testGeneralCategoryLabelForTag
+
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).

UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryTag
+
+     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
+     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
+
+     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
+     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
+
+     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).

UnicodeTestResource
+ TestResource subclass: #UnicodeTestResource
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+
+ UnicodeTestResource class
+     instanceVariableNames: ''
+
+ ""

UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
+ setUp
+
+     super setUp.
+
+     "Test the functionality of this update logic"
+     Unicode initializeCompositionMappings.
+     Unicode initializeUnicodeData.
--- Sent from Squeak Inbox Talk
On 2022-02-25T11:36:53+01:00, marcel.taeumel@hpi.de wrote:
Hi Christoph --
Thanks for doing this!
> Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
I think it is okay. We could make it explicit in the ReleaseBuilder or in some external CI script but having it part of the update stream is okay. The updates are fetched from the outside anyway, right? ;-)
Best,
Marcel

On 24.02.2022 22:16:42, christoph.thiede at student.hpi.uni-potsdam.de <christoph.thiede at student.hpi.uni-potsdam.de> wrote:

Hi all, Hi Marcel, Hi Levente,
On 2020-09-11T00:49:24+02:00, leves at caesar.elte.hu wrote:
Hi Christoph,
On Wed, 9 Sep 2020, Thiede, Christoph wrote:
Hi Levente,
basically, I only would like to get rid of the class variables for every single Unicode category because they provide low explorability (it's hard to work with the numeric output of #generalCategoryOf:) and extensibility (you need to recompile the class definition for adding UTF-16 support). If you are critical of increasing the size of the SparseLargeTable, I think we would also just make one or two extra dictionaries to map every category symbol to a number and vice versa. What do you think?
You mean an array to map the integers to symbols, right? :) Anyway, I don't think it's worth using symbols internally. For example, #isLetterCode: is 8-10% slower with the extra array lookup and checking the category symbol's first letter than the current method of integer comparisons.
Do you expect these constants to appear outside the Unicode class? If yes, then using symbols for those cases is probably a good solution. But for internal use, the integers are better.
Levente
Best,
Christoph
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Levente Uzonyi <leves at caesar.elte.hu>
Sent: Tuesday, 8 September 2020 21:43:56
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode

Hi Christoph,
On Tue, 8 Sep 2020, Thiede, Christoph wrote:
Hi all,
Your words suggest that it has already been published, but I can't find it anywhere.
Then I must have expressed myself wrong. I did not yet publish any code changes, but in my original post from March, you can find a short description of the design changes I'd like to implement. Essentially, I would like to replace the separate class variables for every known character class in favor of greater flexibility.
How would your changes affect GeneralCategory? Would it still be a SparseLargeTable with ByteArray as arrayClass? If you just replace those integers with symbols, the size of the table will be at least 4 or 8 times larger in 32 or 64 bit images, respectively.
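To put rough numbers on this concern, here is a back-of-the-envelope sketch. It assumes one byte per slot for ByteArray chunks and one pointer (4 or 8 bytes) per slot for Array chunks, and it uses the table size 16rE0080 that appears in #parseUnicodeDataFrom:. Chunks that hold only the default value can be dropped (see #zapDefaultOnlyEntries), so these are upper bounds and not measurements.

```smalltalk
| slots |
slots := 16rE0080.    "917632 table slots"
slots * 1.            "917632 bytes at most with ByteArray chunks"
slots * 4.            "3670528 bytes at most with Array chunks in a 32-bit image"
slots * 8.            "7341056 bytes at most with Array chunks in a 64-bit image"
```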
Levente.
Eliot, are there any remaining questions regarding the VM size? Character size should be sufficient as discussed below, and of course, I can test any changes in a 32-bit image, too. :-)
WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect: #asCharacter)
Best, Christoph
_
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Tobias Pape <Das.Linux at gmx.de>
Sent: Sunday, 6 September 2020 21:00:14
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
On 06.09.2020, at 20:40, Levente Uzonyi <leves at caesar.elte.hu> wrote:
On Sun, 6 Sep 2020, Tobias Pape wrote:
> On 06.09.2020, at 19:15, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> Hi Christoph, Hi All,
>> On Mar 17, 2020, at 3:51 PM, Thiede, Christoph <Christoph.Thiede at student.hpi.uni-potsdam.de> wrote:
>> Hi all! :-)
>> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as 😺❤🤓 are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
> And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values). In the 32-bit variant Characters are 30-bit unsigned integers. In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
> Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant? It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
> Q2, how many bits should the 64-bit variant VM support for immediate Characters?
Unicode has a max value of 0x10FFFF. That makes 21 bits. So no worries there.
We should just not forget the leading-char stuff (Yoshiki, Andreas,...)
AFAIU the leading char only makes sense when you have multiple CJK(V?) languages in use at the same time. In other cases Unicode (leadingChar = 0) is perfectly fine. IIRC there are 22 bits available for the codePoint and 8 for the leadingChar, so we're still good: all unicode characters fit.
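The bit budget discussed here can be checked in a workspace. A minimal sketch, assuming the 22 + 8 bit split recalled above (which is explicitly marked as "IIRC") and the 30-bit immediate Characters Eliot describes; the temporary names are illustrative only.

```smalltalk
| maxCodePoint codePointBits leadingCharBits |
maxCodePoint := 16r10FFFF.    "largest Unicode code point"
codePointBits := 22.          "assumed, as recalled in the mail above"
leadingCharBits := 8.         "assumed, as recalled in the mail above"
maxCodePoint highBit.                          "21 bits are needed for the code point itself"
maxCodePoint < (1 bitShift: codePointBits).    "true, every code point fits"
codePointBits + leadingCharBits.               "30, the width of an immediate Character"
```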
\o/ hooray!
Levente
Best regards -Tobias
> Then something to consider is that it is conceptually possible to support something like WideCharacter, which would represent code points outside of the immediate Character range on the 32-bit variant, analogous to LargePositiveInteger beyond SmallInteger maxVal. This can be made to work seamlessly, just as it does currently with integers, and with Floats where SmallFloat64 is only used on 64-bits.
> It has implications in a few parts of the system:
> - failure code for WideString (VeryWideString?) at:[put:] primitives that would have to manage overflow into/access from WideCharacter instances
> - ImageSegment and other (un)pickling systems that need to convert to/from a bit-specific "wire" protocol/representation
> - 32-bit <=> 64-bit image conversion
> All this is easily doable (because we have models of doing it for Float and Integer general instances). But we need good specifications so we can implement the right thing from the get-go.
>> At the moment, we have 30 class variables each for one Unicode category number. These class vars map in alphabetical order to the integers from 0 to: 29. Is this tedious structure really necessary? For different purposes, I would like to get the category name of a specific code point from a client. The current design makes this impossible without writing additional mappings.
>> Tl;dr: I would like to propose to drop these class variables and use Symbols instead. They are comparable like integers, and as they are flyweights, this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with later added category names (though I don't know whether this will ever happen).
>> Examples:
>> Unicode generalTagOf: $a asUnicode. "#Ll"
>> Unicode class >> isLetterCode: charCode
>>     ^ (self generalTagOf: charCode) first = $L
>> Unicode class >> isAlphaNumericCode: charCode
>>     | tag|
>>     ^ (tag := self generalCategoryOf: charCode) first = $L
>>         or: [tag = #Nd]
>> How do you think about this proposal? Please let me know and I will go ahead! :D
>> Best,
>> Christoph
> Best, Eliot
> _,,,^..^,,,_ (phone)
["UnicodeData.2.cs"]
Merged via Multilingual-ct.271, Multilingual-ct.272, MultilingualTests-ct.41, and ReleaseBuilder-ct.231.
Please run the following in your image to install the new Unicode data (and to uncover any regressions I may have missed :D):
Unicode reinitializeData.
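After the reinitialization, a quick workspace check along the following lines may help to spot regressions. It is only a sketch that reuses selectors and expected values from the tests in the changeset quoted below; it assumes the merged packages are loaded and that unicode.org is reachable from the image.

```smalltalk
Unicode reinitializeData.
Unicode generalCategoryTagOf: $a asUnicode.    "#Ll expected, per UnicodeTest>>testGeneralCategoryTag"
Unicode generalCategoryTagOf: 16r1F388.        "#So expected"
Unicode generalCategoryLabelOf: 16r1F388.      "'Symbol, Other' expected, per UnicodeTest>>testGeneralCategoryLabel"
```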
Best, Christoph
--- Sent from Squeak Inbox Talk
On 2022-02-28T16:09:06+01:00, christoph.thiede@student.hpi.uni-potsdam.de wrote:
Hi Marcel, thanks for the review! Below is an updated changeset. If you have no further objections, I would like to merge it within the next few days. :-)
=============== Summary ===============
Change Set:    UnicodeData
Date:          24 February 2022
Author:        Christoph Thiede
This changeset repairs the fetching & parsing of unicode category data. Usage:
    Unicode reinitializeData.
    Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests. Furthermore, the Unicode data are automatically reinitialized as part of the ReleaseBuilder.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
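One conceivable way to reduce the duplication named in the first limitation is a shared step that splits a data line into its semicolon-separated fields, which both parsers could then index into. The following is a workspace sketch only and not part of the changeset: the sample line is written in the UnicodeData.txt format (field 1 = code point, field 3 = general category, field 7 = decimal digit value, matching what #parseUnicodeDataFrom: reads), and all temporary names are made up.

```smalltalk
| line fields start stop |
line := '0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;'.
fields := OrderedCollection new.
start := 1.
"Walk the semicolons by hand so that empty fields are kept."
[(stop := line indexOf: $; startingAt: start) > 0] whileTrue: [
	fields add: (line copyFrom: start to: stop - 1).
	start := stop + 1].
fields add: (line copyFrom: start to: line size).
('16r', (fields at: 1)) asNumber.    "53, the code point"
(fields at: 3) asSymbol.             "#Nd, the general category tag"
(fields at: 7) asNumber.             "5, the decimal digit value"
```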
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
ReleaseBuilder class>>prepareSourceCode {preparing} · ct 2/28/2022 15:54 (changed)
  prepareSourceCode
      "Update packages. Remove foreign packages. Recompile."
      CurrentReadOnlySourceFiles cacheDuring:
          [self
              updateCorePackages;
              unloadForeignPackages;
              checkForDirtyPackages;
              loadWellKnownPackages;
              checkForUndeclaredSymbols;
              checkForNilCategories;
-             recompileAll]
+             recompileAll;
+             updateDatabases]
ReleaseBuilder class>>updateDatabases {scripts - support} · ct 2/28/2022 16:06
+ updateDatabases
+     Unicode reinitializeData.
Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
+ allCategoryTags
+     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)
Unicode class>>blocks320Comment {comments} · ct 2/28/2022 15:50 (changed)
  blocks320Comment
  "# Blocks-3.2.0.txt
  # Correlated with Unicode 3.2
  # Start Code..End Code; Block Name
  0000..007F; Basic Latin
  0080..00FF; Latin-1 Supplement
  0100..017F; Latin Extended-A
  0180..024F; Latin Extended-B
  0250..02AF; IPA Extensions
  02B0..02FF; Spacing Modifier Letters
  0300..036F; Combining Diacritical Marks
  0370..03FF; Greek and Coptic
  0400..04FF; Cyrillic
  0500..052F; Cyrillic Supplementary
  0530..058F; Armenian
  0590..05FF; Hebrew
  0600..06FF; Arabic
  0700..074F; Syriac
  0780..07BF; Thaana
  0900..097F; Devanagari
  0980..09FF; Bengali
  0A00..0A7F; Gurmukhi
  0A80..0AFF; Gujarati
  0B00..0B7F; Oriya
  0B80..0BFF; Tamil
  0C00..0C7F; Telugu
  0C80..0CFF; Kannada
  0D00..0D7F; Malayalam
  0D80..0DFF; Sinhala
  0E00..0E7F; Thai
  0E80..0EFF; Lao
  0F00..0FFF; Tibetan
  1000..109F; Myanmar
  10A0..10FF; Georgian
  1100..11FF; Hangul Jamo
  1200..137F; Ethiopic
  13A0..13FF; Cherokee
  1400..167F; Unified Canadian Aboriginal Syllabics
  1680..169F; Ogham
  16A0..16FF; Runic
  1700..171F; Tagalog
  1720..173F; Hanunoo
  1740..175F; Buhid
  1760..177F; Tagbanwa
  1780..17FF; Khmer
  1800..18AF; Mongolian
  1E00..1EFF; Latin Extended Additional
  1F00..1FFF; Greek Extended
  2000..206F; General Punctuation
  2070..209F; Superscripts and Subscripts
  20A0..20CF; Currency Symbols
  20D0..20FF; Combining Diacritical Marks for Symbols
  2100..214F; Letterlike Symbols
  2150..218F; Number Forms
  2190..21FF; Arrows
  2200..22FF; Mathematical Operators
  2300..23FF; Miscellaneous Technical
  2400..243F; Control Pictures
  2440..245F; Optical Character Recognition
  2460..24FF; Enclosed Alphanumerics
  2500..257F; Box Drawing
  2580..259F; Block Elements
  25A0..25FF; Geometric Shapes
  2600..26FF; Miscellaneous Symbols
  2700..27BF; Dingbats
  27C0..27EF; Miscellaneous Mathematical Symbols-A
  27F0..27FF; Supplemental Arrows-A
  2800..28FF; Braille Patterns
  2900..297F; Supplemental Arrows-B
  2980..29FF; Miscellaneous Mathematical Symbols-B
  2A00..2AFF; Supplemental Mathematical Operators
  2E80..2EFF; CJK Radicals Supplement
  2F00..2FDF; Kangxi Radicals
  2FF0..2FFF; Ideographic Description Characters
  3000..303F; CJK Symbols and Punctuation
  3040..309F; Hiragana
  30A0..30FF; Katakana
  3100..312F; Bopomofo
  3130..318F; Hangul Compatibility Jamo
  3190..319F; Kanbun
  31A0..31BF; Bopomofo Extended
  31F0..31FF; Katakana Phonetic Extensions
  3200..32FF; Enclosed CJK Letters and Months
  3300..33FF; CJK Compatibility
  3400..4DBF; CJK Unified Ideographs Extension A
  4E00..9FFF; CJK Unified Ideographs
  A000..A48F; Yi Syllables
  A490..A4CF; Yi Radicals
  AC00..D7AF; Hangul Syllables
  D800..DB7F; High Surrogates
  DB80..DBFF; High Private Use Surrogates
  DC00..DFFF; Low Surrogates
  E000..F8FF; Private Use Area
  F900..FAFF; CJK Compatibility Ideographs
  FB00..FB4F; Alphabetic Presentation Forms
  FB50..FDFF; Arabic Presentation Forms-A
  FE00..FE0F; Variation Selectors
  FE20..FE2F; Combining Half Marks
  FE30..FE4F; CJK Compatibility Forms
  FE50..FE6F; Small Form Variants
  FE70..FEFF; Arabic Presentation Forms-B
  FF00..FFEF; Halfwidth and Fullwidth Forms
  FFF0..FFFF; Specials
  10300..1032F; Old Italic
  10330..1034F; Gothic
  10400..1044F; Deseret
  1D000..1D0FF; Byzantine Musical Symbols
  1D100..1D1FF; Musical Symbols
  1D400..1D7FF; Mathematical Alphanumeric Symbols
  20000..2A6DF; CJK Unified Ideographs Extension B
  2F800..2FA1F; CJK Compatibility Ideographs Supplement
  E0000..E007F; Tags
  F0000..FFFFF; Supplementary Private Use Area-A
  100000..10FFFF; Supplementary Private Use Area-B
  "
Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
+ generalCategoryIndexFromTag: tag
+     ^ (self allCategoryTags indexOf: tag) - 1
Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
+ generalCategoryLabelForTag: tag
+     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1
Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
+ generalCategoryTagOf: aCharacterCode
+     ^ (self generalCategoryOf: aCharacterCode)
+         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
+         ifNil: [#Cn]
Unicode class>>initialize {class initialization} · ct 2/28/2022 15:52 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+
+     self flag: #deduplicate. "Currently, we are downloading and parsing #unicodeData twice."
+     Compositions isEmptyOrNil ifTrue: [self initializeCompositionMappings].
+     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].
Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
+ initializeUnicodeData
+     "self initializeUnicodeData"
+     self parseUnicodeDataFrom: self unicodeData readStream.
Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
+     "self initializeUnicodeData."
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
+     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
      toNumber := [:quad | ('16r', quad) asNumber].
      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
+         tag := line copyFrom: fieldStart to: fieldEnd - 1.
+         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
+         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.
Unicode class>>reinitializeData {class initialization} · ct 2/28/2022 16:05
+ reinitializeData
+     Compositions := GeneralCategory := nil.
+     self initialize.
UnicodeTest
+ TestCase subclass: #UnicodeTest
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ UnicodeTest class
+     instanceVariableNames: ''
+ ""
UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
+ resources
+     ^ super resources copyWith: UnicodeTestResource
UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryLabel
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
+     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
+
+     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
+     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
+
+     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).
UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
+ testGeneralCategoryLabelForTag
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).
UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryTag
+     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
+     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
+
+     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
+     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
+
+     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).
UnicodeTestResource
+ TestResource subclass: #UnicodeTestResource
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ UnicodeTestResource class
+     instanceVariableNames: ''
+ ""
UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
+ setUp
+     super setUp.
+
+     "Test the functionality of this update logic"
+     Unicode initializeCompositionMappings.
+     Unicode initializeUnicodeData.
Sent from Squeak Inbox Talk
On 2022-02-25T11:36:53+01:00, marcel.taeumel at hpi.de wrote:
Hi Christoph --
Thanks for doing this!
> Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
I think it is okay. We could make it explicit in the ReleaseBuilder or in some external CI script but having it part of the update stream is okay. The updates are fetched from the outside anyway, right? ;-)
Best,
Marcel
On 24.02.2022 22:16:42, christoph.thiede at student.hpi.uni-potsdam.de <christoph.thiede at student.hpi.uni-potsdam.de> wrote:
Hi all, Hi Marcel, Hi Levente,
finally, here is a changeset that takes the first step for updating our in-image Unicode database. After filing it in, please run:
    Unicode initializeUnicodeData.
In addition to the preamble (please read first below), I still have a number of questions:
- Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
- How much effort shall we put in deduplicating the logic and the data in this class? This includes both the two similar parsing methods and the redundant specification of the Unicode character tags.
Best, Christoph
=============== Summary ===============
Change Set:    UnicodeData
Date:          24 February 2022
Author:        Christoph Thiede
This changeset repairs the fetching & parsing of unicode category data. Usage:
    Unicode initializeUnicodeData.
    Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
+ allCategoryTags
+     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)
Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
+ generalCategoryIndexFromTag: tag
+     ^ (self allCategoryTags indexOf: tag) - 1
Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
+ generalCategoryLabelForTag: tag
+     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1
Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
+ generalCategoryTagOf: aCharacterCode
+     ^ (self generalCategoryOf: aCharacterCode)
+         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
+         ifNil: [#Cn]
Unicode class>>initialize {class initialization} · ct 2/24/2022 21:43 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].
Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
+ initializeUnicodeData
+     "self initializeUnicodeData"
+     self parseUnicodeDataFrom: self unicodeData readStream.
Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
+     "self initializeUnicodeData."
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
+     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
      toNumber := [:quad | ('16r', quad) asNumber].
      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
+         tag := line copyFrom: fieldStart to: fieldEnd - 1.
+         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
+         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.
UnicodeTest
+ TestCase subclass: #UnicodeTest
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ UnicodeTest class
+     instanceVariableNames: ''
+ ""
UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
+ resources
+     ^ super resources copyWith: UnicodeTestResource
UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryLabel
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
+     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
+
+     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
+     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
+
+     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).
UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
+ testGeneralCategoryLabelForTag
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).
UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryTag
+     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
+     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
+
+     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
+     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
+
+     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).
UnicodeTestResource
+ TestResource subclass: #UnicodeTestResource
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ UnicodeTestResource class
+     instanceVariableNames: ''
+ ""
UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
+ setUp
+     super setUp.
+
+     "Test the functionality of this update logic"
+     Unicode initializeCompositionMappings.
+     Unicode initializeUnicodeData.
Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
On 2020-09-11T00:49:24+02:00, leves at caesar.elte.hu wrote:
Hi Christoph,
On Wed, 9 Sep 2020, Thiede, Christoph wrote:
Hi Levente,
basically, I only?would like to get rid of the class variables for every single Unicode category because they provide low explorability (it's hard to work with the numeric output of #generalCategoryOf:) and extensibility (you need to recompile the class definition for adding UTF-16 support). If you are critical of?increasing the?size of the SparseLargeTable, I think we?would also just make one or two?extra dictionaries to map every category symbol to a number and vice versa. What do you think?
You mean an array to map the integers to symbols, right? :) Anyway, I don't think it's worth using symbols internally. For example, #isLetterCode: is 8-10% slower with the extra array lookup and checking the category symbol's first letter than the current method of integer comparisons.
Do you expect these constants to appear outside the Unicode class? If yes, then using symbols for those cases is probably a good solution. But for internal use, the integers are better.
Levente
Best,
Christoph
Von: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> im Auftrag von Levente Uzonyi <leves at caesar.elte.hu> Gesendet: Dienstag, 8. September 2020 21:43:56 An: The general-purpose Squeak developers list Betreff: Re: [squeak-dev] Unicode ? Hi Christoph,
On Tue, 8 Sep 2020, Thiede, Christoph wrote:
Hi all,
?Your words suggest that it has already been published, but I can't find it?anywhere.
Then I must have expressed myself wrong. I did not yet publish any code?changes, but in my original post from March,?you can find a short description of the design?changes I'd like to implement. Essentially, I would like to replace the separate class variables for every known character class in favor of greater flexibility.
How would your changes affect GeneralCategory? Would it still be a SpareLargeTable with ByteArray as arrayClass? If you just replace those integers with symbols, the size of the table will be at least 4 or 8 times larger in 32 or 64 bit images, respectively.
Levente.
Eliot, are there any remaining questions regarding the VM size? Character size should be sufficient as discussed below, and of course, I can test any changes in a 32-bit image, too. :-)
WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect: #asCharacter)
Best, Christoph
_
Von: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> im Auftrag von Tobias Pape <Das.Linux at gmx.de> Gesendet: Sonntag, 6. September 2020 21:00:14 An: The general-purpose Squeak developers list Betreff: Re: [squeak-dev] Unicode ?
On 06.09.2020, at 20:40, Levente Uzonyi <leves at caesar.elte.hu> wrote:
On Sun, 6 Sep 2020, Tobias Pape wrote:
> >> On 06.09.2020, at 19:15, Eliot Miranda <eliot.miranda at gmail.com> wrote: >> Hi Christoph, Hi All, >>> On Mar 17, 2020, at 3:51 PM, Thiede, Christoph <Christoph.Thiede at student.hpi.uni-potsdam.de> wrote: >>> Hi all! :-) >>> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as ???
are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
>> And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values).? In the 32-bit variant
Characters
are 30-bit unsigned integers.? In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
>> Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant?? It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure
that
initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
>> Q2, how many bits should the 64-bit variant VM support for immediate Characters?
>
> Unicode has a max value of 0x10FFFF. That makes 21 bit. So no worries there.
>
> We should just not forget the leading-char stuff (Yoshiki, Andreas,...)

AFAIU the leading char only makes sense when you have multiple CJK(V?) languages in use at the same time. In other cases Unicode (leadingChar = 0) is perfectly fine. IIRC there are 22 bits available for the codePoint and 8 for the leadingChar, so we're still good: all unicode characters fit.

\o/ hooray!

Levente

>
> Best regards
>     -Tobias
>
>> Then something to consider is that it is conceptually possible to support something like WideCharacter, which would represent code points outside of the immediate Character range on the 32-bit variant, analogous to LargePositiveInteger beyond SmallInteger maxVal. This can be made to work seamlessly, just as it does currently with integers, and with Floats where SmallFloat64 is only used on 64-bits.
>> It has implications in a few parts of the system:
>> - failure code for WideString (VeryWideString?) at:[put:] primitives that would have to manage overflow into/access from WideCharacter instances
>> - ImageSegment and other (un)pickling systems that need to convert to/from a bit-specific "wire" protocol/representation
>> - 32-bit <=> 64-bit image conversion
>> All this is easily doable (because we have models of doing it for Float and Integer general instances). But we need good specifications so we can implement the right thing from the get-go.
>>> At the moment, we have 30 class variables each for one Unicode category number. These class vars map in alphabetical order to the integers from 0 to: 29. Is this tedious structure really necessary? For different purposes, I would like to get the category name of a specific code point from a client. The current design makes this impossible without writing additional mappings.
>>> Tl;dr: I would like to propose to drop these class variables and use Symbols instead. They are comparable like integers, and as they are flyweights, this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with later added category names (though I don't know whether this will ever happen).
>>> Examples:
>>> Unicode generalTagOf: $a asUnicode. "#Ll"
>>> Unicode class >> isLetterCode: charCode
>>>     ^ (self generalTagOf: charCode) first = $L
>>> Unicode class >> isAlphaNumericCode: charCode
>>>     | tag |
>>>     ^ (tag := self generalCategoryOf: charCode) first = $L
>>>         or: [tag = #Nd]
>>> How do you think about this proposal? Please let me know and I will go ahead! :D
>>> Best,
>>> Christoph
>> Best, Eliot
>> _,,,^..^,,,_ (phone)
["UnicodeData.2.cs"]
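The bit counts in the exchange above can be checked in a workspace. This is only a sketch with plain integers: the 22/8 split between code point and leading char is taken from Levente's "IIRC" and is not verified against the VM or the Character class, and the block name pack is made up for illustration.

```smalltalk
"Workspace sketch, plain integers only. The 22/8 layout is an assumption
taken from the discussion above, not read from the Character class."
| maxCodePoint codePointBits pack packed codePoint leadingChar |
maxCodePoint := 16r10FFFF.
maxCodePoint highBit. "21, so every Unicode code point fits into 21 bits"

codePointBits := 22.
pack := [:leading :code | (leading bitShift: codePointBits) bitOr: code].

"Largest 8-bit leading char combined with the largest code point."
packed := pack value: 255 value: maxCodePoint.
packed highBit. "30, which still fits a 30-bit immediate Character"

"Take the two parts apart again."
codePoint := packed bitAnd: (1 bitShift: codePointBits) - 1.
leadingChar := packed bitShift: codePointBits negated.
```

With leadingChar = 0 the packed value is just the code point, which matches the remark that plain Unicode is fine in that case.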
On Apr 4, 2022, at 11:20 AM, Christoph.Thiede@student.hpi.uni-potsdam.de wrote:
Merged via Multilingual-ct.271, Multilingual-ct.272, MultilingualTests-ct.41, and ReleaseBuilder-ct.231.
Please run the following in your image to install the new Unicode data (and to uncover any regressions I may have missed :D):
Unicode reinitializeData.
If this is essential then it *must* be added as a post load script to one (or more) of the relevant packages. Asking “did you run Unicode reinitializeData?” when someone reports a strange bug isn’t acceptable.
Best, Christoph
Sent from Squeak Inbox Talk
On 2022-02-28T16:09:06+01:00, christoph.thiede@student.hpi.uni-potsdam.de wrote:
Hi Marcel, thanks for the review! Below is an updated changeset. If you have no further objections, I would like to merge it within the next few days. :-)
=============== Summary ===============
Change Set:    UnicodeData
Date:          24 February 2022
Author:        Christoph Thiede

This changeset repairs the fetching & parsing of unicode category data. Usage:

    Unicode reinitializeData.
    Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests. Furthermore, the Unicode data are automatically reinitialized as part of the ReleaseBuilder.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
ReleaseBuilder class>>prepareSourceCode {preparing} · ct 2/28/2022 15:54 (changed)
  prepareSourceCode
      "Update packages. Remove foreign packages. Recompile."

      CurrentReadOnlySourceFiles cacheDuring:
          [self
              updateCorePackages;
              unloadForeignPackages;
              checkForDirtyPackages;
              loadWellKnownPackages;
              checkForUndeclaredSymbols;
              checkForNilCategories;
-             recompileAll]
+             recompileAll;
+             updateDatabases]

ReleaseBuilder class>>updateDatabases {scripts - support} · ct 2/28/2022 16:06
+ updateDatabases
+     Unicode reinitializeData.

Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
+ allCategoryTags
+     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)
Unicode class>>blocks320Comment {comments} · ct 2/28/2022 15:50 (changed)
  blocks320Comment
"# Blocks-3.2.0.txt # Correlated with Unicode 3.2 # Start Code..End Code; Block Name 0000..007F; Basic Latin 0080..00FF; Latin-1 Supplement 0100..017F; Latin Extended-A 0180..024F; Latin Extended-B 0250..02AF; IPA Extensions 02B0..02FF; Spacing Modifier Letters 0300..036F; Combining Diacritical Marks 0370..03FF; Greek and Coptic 0400..04FF; Cyrillic 0500..052F; Cyrillic Supplementary 0530..058F; Armenian 0590..05FF; Hebrew 0600..06FF; Arabic 0700..074F; Syriac 0780..07BF; Thaana 0900..097F; Devanagari 0980..09FF; Bengali 0A00..0A7F; Gurmukhi 0A80..0AFF; Gujarati 0B00..0B7F; Oriya 0B80..0BFF; Tamil 0C00..0C7F; Telugu 0C80..0CFF; Kannada 0D00..0D7F; Malayalam 0D80..0DFF; Sinhala 0E00..0E7F; Thai 0E80..0EFF; Lao 0F00..0FFF; Tibetan 1000..109F; Myanmar 10A0..10FF; Georgian 1100..11FF; Hangul Jamo 1200..137F; Ethiopic 13A0..13FF; Cherokee 1400..167F; Unified Canadian Aboriginal Syllabics 1680..169F; Ogham 16A0..16FF; Runic 1700..171F; Tagalog 1720..173F; Hanunoo 1740..175F; Buhid 1760..177F; Tagbanwa 1780..17FF; Khmer 1800..18AF; Mongolian 1E00..1EFF; Latin Extended Additional 1F00..1FFF; Greek Extended 2000..206F; General Punctuation 2070..209F; Superscripts and Subscripts 20A0..20CF; Currency Symbols 20D0..20FF; Combining Diacritical Marks for Symbols 2100..214F; Letterlike Symbols 2150..218F; Number Forms 2190..21FF; Arrows 2200..22FF; Mathematical Operators 2300..23FF; Miscellaneous Technical 2400..243F; Control Pictures 2440..245F; Optical Character Recognition 2460..24FF; Enclosed Alphanumerics 2500..257F; Box Drawing 2580..259F; Block Elements 25A0..25FF; Geometric Shapes 2600..26FF; Miscellaneous Symbols 2700..27BF; Dingbats 27C0..27EF; Miscellaneous Mathematical Symbols-A 27F0..27FF; Supplemental Arrows-A 2800..28FF; Braille Patterns 2900..297F; Supplemental Arrows-B 2980..29FF; Miscellaneous Mathematical Symbols-B 2A00..2AFF; Supplemental Mathematical Operators 2E80..2EFF; CJK Radicals Supplement 2F00..2FDF; Kangxi Radicals 2FF0..2FFF; Ideographic Description 
Characters 3000..303F; CJK Symbols and Punctuation 3040..309F; Hiragana 30A0..30FF; Katakana 3100..312F; Bopomofo 3130..318F; Hangul Compatibility Jamo 3190..319F; Kanbun 31A0..31BF; Bopomofo Extended 31F0..31FF; Katakana Phonetic Extensions 3200..32FF; Enclosed CJK Letters and Months 3300..33FF; CJK Compatibility 3400..4DBF; CJK Unified Ideographs Extension A 4E00..9FFF; CJK Unified Ideographs A000..A48F; Yi Syllables A490..A4CF; Yi Radicals AC00..D7AF; Hangul Syllables D800..DB7F; High Surrogates DB80..DBFF; High Private Use Surrogates DC00..DFFF; Low Surrogates E000..F8FF; Private Use Area F900..FAFF; CJK Compatibility Ideographs FB00..FB4F; Alphabetic Presentation Forms FB50..FDFF; Arabic Presentation Forms-A FE00..FE0F; Variation Selectors FE20..FE2F; Combining Half Marks FE30..FE4F; CJK Compatibility Forms FE50..FE6F; Small Form Variants FE70..FEFF; Arabic Presentation Forms-B FF00..FFEF; Halfwidth and Fullwidth Forms FFF0..FFFF; Specials 10300..1032F; Old Italic 10330..1034F; Gothic 10400..1044F; Deseret 1D000..1D0FF; Byzantine Musical Symbols 1D100..1D1FF; Musical Symbols 1D400..1D7FF; Mathematical Alphanumeric Symbols 20000..2A6DF; CJK Unified Ideographs Extension B 2F800..2FA1F; CJK Compatibility Ideographs Supplement E0000..E007F; Tags F0000..FFFFF; Supplementary Private Use Area-A 100000..10FFFF; Supplementary Private Use Area-B
"
Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
+ generalCategoryIndexFromTag: tag
+     ^ (self allCategoryTags indexOf: tag) - 1

Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
+ generalCategoryLabelForTag: tag
+     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1

Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
+ generalCategoryTagOf: aCharacterCode
+     ^ (self generalCategoryOf: aCharacterCode)
+         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
+         ifNil: [#Cn]

Unicode class>>initialize {class initialization} · ct 2/28/2022 15:52 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+
+     self flag: #deduplicate. "Currently, we are downloading and parsing #unicodeData twice."
+     Compositions isEmptyOrNil ifTrue: [self initializeCompositionMappings].
+     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].

Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
+ initializeUnicodeData
+     "self initializeUnicodeData"
+     self parseUnicodeDataFrom: self unicodeData readStream.

Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
+     "self initializeUnicodeData."
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
+     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |

      toNumber := [:quad | ('16r', quad) asNumber].

      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.

      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].

      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
+         tag := line copyFrom: fieldStart to: fieldEnd - 1.
+         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
+         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.

Unicode class>>reinitializeData {class initialization} · ct 2/28/2022 16:05
+ reinitializeData
+     Compositions := GeneralCategory := nil.
+     self initialize.
UnicodeTest
+ TestCase subclass: #UnicodeTest
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ UnicodeTest class
+     instanceVariableNames: ''
+ ""

UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
+ resources
+     ^ super resources copyWith: UnicodeTestResource

UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryLabel
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
+     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
+
+     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
+     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
+
+     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).

UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
+ testGeneralCategoryLabelForTag
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).

UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryTag
+     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
+     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
+
+     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
+     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
+
+     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).

UnicodeTestResource
+ TestResource subclass: #UnicodeTestResource
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ UnicodeTestResource class
+     instanceVariableNames: ''
+ ""

UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
+ setUp
+     super setUp.
+
+     "Test the functionality of this update logic"
+     Unicode initializeCompositionMappings.
+     Unicode initializeUnicodeData.
Sent from Squeak Inbox Talk
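To make the tag/index conversion from the changeset above easier to follow, here is a small workspace sketch that mirrors it with plain blocks instead of the class-side methods, so it does not depend on the Unicode data being loaded. The tag list is copied from #allCategoryTags; the block names (indexFromTag, tagFromIndex, isLetterTag, isAlphaNumericTag) are made up for illustration, and the two predicates only follow the spirit of the original proposal.

```smalltalk
"Workspace sketch; the blocks stand in for the class-side methods."
| tags indexFromTag tagFromIndex isLetterTag isAlphaNumericTag |
tags := #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs).

"Tag -> 0-based index, as in #generalCategoryIndexFromTag:.
An unknown tag yields -1 because #indexOf: answers 0."
indexFromTag := [:tag | (tags indexOf: tag) - 1].

"0-based index -> tag, falling back to #Cn, as in #generalCategoryTagOf:."
tagFromIndex := [:index | tags at: index + 1 ifAbsent: [#Cn]].

"Tag-based predicates (hypothetical, not part of the changeset)."
isLetterTag := [:tag | tag first = $L].
isAlphaNumericTag := [:tag | tag first = $L or: [tag = #Nd]].

indexFromTag value: #Ll. "5"
indexFromTag value: #Nd. "13"
tagFromIndex value: 26. "#So"
tagFromIndex value: 99. "#Cn"
isLetterTag value: #Lu. "true"
isAlphaNumericTag value: #Zs. "false"
```

Keeping the integers as the stored representation and converting to Symbols only at the interface is the compromise discussed with Levente further down the thread.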
On 2022-02-25T11:36:53+01:00, marcel.taeumel at hpi.de wrote:
Hi Christoph --
Thanks for doing this!
> Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
I think it is okay. We could make it explicit in the ReleaseBuilder or in some external CI script but having it part of the update stream is okay. The updates are fetched from the outside anyway, right? ;-)
Best,
Marcel

On 24.02.2022 22:16:42, christoph.thiede at student.hpi.uni-potsdam.de <christoph.thiede at student.hpi.uni-potsdam.de> wrote:

Hi all, Hi Marcel, Hi Levente,

finally, here is a changeset that takes the first step toward updating our in-image Unicode database. After filing it in, please run:

    Unicode initializeUnicodeData.

In addition to the preamble (please read it first, below), I still have a number of questions:
- Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
- How much effort shall we put in deduplicating the logic and the data in this class? This includes both the two similar parsing methods and the redundant specification of the Unicode character tags.
Best, Christoph
=============== Summary ===============
Change Set:    UnicodeData
Date:          24 February 2022
Author:        Christoph Thiede

This changeset repairs the fetching & parsing of unicode category data. Usage:

    Unicode initializeUnicodeData.
    Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
+ allCategoryTags
+     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)

Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
+ generalCategoryIndexFromTag: tag
+     ^ (self allCategoryTags indexOf: tag) - 1

Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
+ generalCategoryLabelForTag: tag
+     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1

Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
+ generalCategoryTagOf: aCharacterCode
+     ^ (self generalCategoryOf: aCharacterCode)
+         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
+         ifNil: [#Cn]

Unicode class>>initialize {class initialization} · ct 2/24/2022 21:43 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].

Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
+ initializeUnicodeData
+     "self initializeUnicodeData"
+     self parseUnicodeDataFrom: self unicodeData readStream.

Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
+     "self initializeUnicodeData."
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
+     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |

      toNumber := [:quad | ('16r', quad) asNumber].

      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.

      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].

      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
+         tag := line copyFrom: fieldStart to: fieldEnd - 1.
+         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
+         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.
UnicodeTest
+ TestCase subclass: #UnicodeTest
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ UnicodeTest class
+     instanceVariableNames: ''
+ ""

UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
+ resources
+     ^ super resources copyWith: UnicodeTestResource

UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryLabel
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
+     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
+
+     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
+     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
+
+     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).

UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
+ testGeneralCategoryLabelForTag
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).

UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryTag
+     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
+     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
+
+     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
+     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
+
+     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).

UnicodeTestResource
+ TestResource subclass: #UnicodeTestResource
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ UnicodeTestResource class
+     instanceVariableNames: ''
+ ""

UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
+ setUp
+     super setUp.
+
+     "Test the functionality of this update logic"
+     Unicode initializeCompositionMappings.
+     Unicode initializeUnicodeData.
Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
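For readers who have not looked at UnicodeData.txt: #parseUnicodeDataFrom: above walks each semicolon-separated line by hand, taking field 1 as the hexadecimal code point, field 3 as the general category tag, and field 7 as the decimal digit value. The following workspace sketch shows the same field positions on a single line. It splits with a stream instead of the index arithmetic used in the changeset, so it is an illustration of the file layout rather than the merged implementation.

```smalltalk
"Workspace sketch: pick the fields of one UnicodeData.txt line."
| line stream fields point tag digit |
line := '0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;'.

"#upTo: keeps empty fields, which matters because field 6 is empty here."
stream := line readStream.
fields := OrderedCollection new.
[stream atEnd] whileFalse: [fields add: (stream upTo: $;)].

point := ('16r', fields first) asNumber. "53, i.e. 16r35"
tag := (fields at: 3) asSymbol. "#Nd"
digit := tag = #Nd
    ifTrue: [(fields at: 7) asNumber]
    ifFalse: [-1]. "5"
```

A tokenizer that collapses empty fields would shift the decimal digit value to a different position, which is presumably one reason the original method counts separators explicitly.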
On 2020-09-11T00:49:24+02:00, leves at caesar.elte.hu wrote:
Hi Christoph,
On Wed, 9 Sep 2020, Thiede, Christoph wrote:
Hi Levente,
Basically, I would only like to get rid of the class variables for every single Unicode category because they provide low explorability (it's hard to work with the numeric output of #generalCategoryOf:) and low extensibility (you need to recompile the class definition for adding UTF-16 support). If you are critical of increasing the size of the SparseLargeTable, I think we could also just add one or two extra dictionaries to map every category symbol to a number and vice versa. What do you think?
You mean an array to map the integers to symbols, right? :) Anyway, I don't think it's worth using symbols internally. For example, #isLetterCode: is 8-10% slower with the extra array lookup and checking the category symbol's first letter than the current method of integer comparisons.
Do you expect these constants to appear outside the Unicode class? If yes, then using symbols for those cases is probably a good solution. But for internal use, the integers are better.
Levente
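The two styles Levente compares can be put side by side in a workspace. This is only a sketch under assumptions: it presumes the letter categories occupy the 0-based indices 5 to 9 (Ll to Lu in the alphabetical tag order) and does not reproduce the actual #isLetterCode: source; the block names are hypothetical. The #bench results depend on the machine and VM, so no figures are given here.

```smalltalk
"Workspace sketch: integer range check versus symbol lookup."
| tags code viaInteger viaSymbol |
tags := #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs).
code := 5. "0-based index of #Ll"

"Integer comparison, roughly what the class-variable design allows."
viaInteger := [code between: 5 and: 9].

"Extra array lookup plus a check of the tag's first letter."
viaSymbol := [(tags at: code + 1) first = $L].

viaInteger value = viaSymbol value. "true, both agree for this code"

"Timing output varies per machine and VM."
Transcript showln: viaInteger bench; showln: viaSymbol bench.
```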
Best,
Christoph
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Levente Uzonyi <leves at caesar.elte.hu>
Sent: Tuesday, 8 September 2020 21:43:56
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode

Hi Christoph,
On Tue, 8 Sep 2020, Thiede, Christoph wrote:
Hi all,
> Your words suggest that it has already been published, but I can't find it anywhere.

Then I must have expressed myself poorly. I have not yet published any code changes, but in my original post from March, you can find a short description of the design changes I'd like to implement. Essentially, I would like to drop the separate class variables for every known character class in favor of greater flexibility.

How would your changes affect GeneralCategory? Would it still be a SparseLargeTable with ByteArray as arrayClass? If you just replace those integers with symbols, the size of the table will be at least 4 or 8 times larger in 32 or 64 bit images, respectively.
Levente.
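The factor of 4 or 8 follows from the slot size: a ByteArray stores one byte per entry, while an Array of Symbols stores one object pointer per entry. A rough workspace calculation, assuming the table size 16rE0080 used by #parseUnicodeDataFrom: and ignoring that SparseLargeTable drops chunks holding only the default value, so these are upper bounds:

```smalltalk
"Workspace sketch: upper bounds for the GeneralCategory table."
| entries |
entries := 16rE0080. "917632 slots"
entries * 1. "917632 bytes with one byte per entry (ByteArray)"
entries * 4. "3670528 bytes with 4-byte pointers (32-bit image)"
entries * 8. "7341056 bytes with 8-byte pointers (64-bit image)"
```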
Eliot, are there any remaining questions regarding the VM size? Character size should be sufficient as discussed below, and of course, I can test any changes in a 32-bit image, too. :-)
WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect: #asCharacter)
Best, Christoph
_
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Tobias Pape <Das.Linux at gmx.de>
Sent: Sunday, 6 September 2020 21:00:14
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode

> On 06.09.2020, at 20:40, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>
> On Sun, 6 Sep 2020, Tobias Pape wrote:
>
>>
>>> On 06.09.2020, at 19:15, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>>> Hi Christoph, Hi All,
>>>> On Mar 17, 2020, at 3:51 PM, Thiede, Christoph <Christoph.Thiede at student.hpi.uni-potsdam.de> wrote:
>>>> Hi all! :-)
>>>> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as 😺❤🤓 are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
>>> And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values). In the 32-bit variant Characters are 30-bit unsigned integers. In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
>>> Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant? It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
>>> Q2, how many bits should the 64-bit variant VM support for immediate Characters?
>>
>> Unicode has a max value of 0x10FFFF. That makes 21 bit. So no worries there.
>>
>> We should just not forget the leading-char stuff (Yoshiki, Andreas,...)
>
> AFAIU the leading char only makes sense when you have multiple CJK(V?) languages in use at the same time. In other cases Unicode (leadingChar = 0) is perfectly fine.
> IIRC there are 22 bits available for the codePoint and 8 for the leadingChar, so we're still good: all unicode characters fit.
>
>

\o/ hooray!

> Levente
>
>>
>>
>> Best regards
>>     -Tobias
>>
>>> Then something to consider is that it is conceptually possible to support something like WideCharacter, which would represent code points outside of the immediate Character range on the 32-bit variant, analogous to LargePositiveInteger beyond SmallInteger maxVal. This can be made to work seamlessly, just as it does currently with integers, and with Floats where SmallFloat64 is only used on 64-bits.
>>> It has implications in a few parts of the system:
>>> - failure code for WideString (VeryWideString?) at:[put:] primitives that would have to manage overflow into/access from WideCharacter instances
>>> - ImageSegment and other (un)pickling systems that need to convert to/from a bit-specific "wire" protocol/representation
>>> - 32-bit <=> 64-bit image conversion
>>> All this is easily doable (because we have models of doing it for Float and Integer general instances). But we need good specifications so we can implement the right thing from the get-go.
>>>> At the moment, we have 30 class variables each for one Unicode category number. These class vars map in alphabetical order to the integers from 0 to: 29. Is this tedious structure really necessary? For different purposes, I would like to get the category name of a specific code point from a client. The current design makes this impossible without writing additional mappings.
>>>> Tl;dr: I would like to propose to drop these class variables and use Symbols instead. They are comparable like integers, and as they are flyweights, this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with later added category names (though I don't know whether this will ever happen).
>>>> Examples:
>>>> Unicode generalTagOf: $a asUnicode. "#Ll"
>>>> Unicode class >> isLetterCode: charCode
>>>>     ^ (self generalTagOf: charCode) first = $L
>>>> Unicode class >> isAlphaNumericCode: charCode
>>>>     | tag |
>>>>     ^ (tag := self generalCategoryOf: charCode) first = $L
>>>>         or: [tag = #Nd]
>>>> How do you think about this proposal? Please let me know and I will go ahead! :D
>>>> Best,
>>>> Christoph
>>> Best, Eliot
>>> _,,,^..^,,,_ (phone)
["UnicodeData.2.cs"]
> If this is essential

Well, how do you define essential? You can still use your image with the old font definitions. However, for some newer code points such as 😁😎😍, Unicode generalCategoryLabelOf: and friends will answer "not assigned" without the upgrade. You can see the difference by browsing any comprehensive font in the FontImporter. But I am not aware of any code path that relies on the presence of newer Unicode data.

Apart from that, I already discussed with Marcel what the consequences of downloading data from a third-party server during an image update would be. There might be some images, most likely server images, that do not have free internet access due to a strict firewall. Hypothetically, this might even introduce security issues. So in the end, we decided to leave this optional for now. It will only break if a future patch of any package relies on exact Unicode data.
Best,
Christoph
________________________________
From: Squeak-dev squeak-dev-bounces@lists.squeakfoundation.org on behalf of Eliot Miranda eliot.miranda@gmail.com
Sent: Monday, 4 April 2022 20:54:59
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
On Apr 4, 2022, at 11:20 AM, Christoph.Thiede@student.hpi.uni-potsdam.de wrote:
Merged via Multilingual-ct.271, Multilingual-ct.272, MultilingualTests-ct.41, and ReleaseBuilder-ct.231.
Please run the following in your image to install the new Unicode data (and to uncover any regressions I may have missed :D):
Unicode reinitializeData.
If this is essential then it *must* be added as a post load script to one (or more) of the relevant packages. Asking “did you run Unicode reinitializeData?” when someone reports a strange bug isn’t acceptable.
Best, Christoph
---
Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
On 2022-02-28T16:09:06+01:00, christoph.thiede@student.hpi.uni-potsdam.de wrote:
Hi Marcel, thanks for the review! Below is an updated changeset. If you have no further objections, I would like to merge it within the next few days. :-)
=============== Summary ===============
Change Set:    UnicodeData
Date:          24 February 2022
Author:        Christoph Thiede

This changeset repairs the fetching & parsing of unicode category data. Usage:

    Unicode reinitializeData.
    Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests. Furthermore, the Unicode data are automatically reinitialized as part of the ReleaseBuilder.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
ReleaseBuilder class>>prepareSourceCode {preparing} · ct 2/28/2022 15:54 (changed)
  prepareSourceCode
      "Update packages. Remove foreign packages. Recompile."

      CurrentReadOnlySourceFiles cacheDuring:
          [self
              updateCorePackages;
              unloadForeignPackages;
              checkForDirtyPackages;
              loadWellKnownPackages;
              checkForUndeclaredSymbols;
              checkForNilCategories;
-             recompileAll]
+             recompileAll;
+             updateDatabases]

ReleaseBuilder class>>updateDatabases {scripts - support} · ct 2/28/2022 16:06
+ updateDatabases
+     Unicode reinitializeData.

Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
+ allCategoryTags
+     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)
Unicode class>>blocks320Comment {comments} ? ct 2/28/2022 15:50 (changed) blocks320Comment
"# Blocks-3.2.0.txt # Correlated with Unicode 3.2 # Start Code..End Code; Block Name 0000..007F; Basic Latin 0080..00FF; Latin-1 Supplement 0100..017F; Latin Extended-A 0180..024F; Latin Extended-B 0250..02AF; IPA Extensions 02B0..02FF; Spacing Modifier Letters 0300..036F; Combining Diacritical Marks 0370..03FF; Greek and Coptic 0400..04FF; Cyrillic 0500..052F; Cyrillic Supplementary 0530..058F; Armenian 0590..05FF; Hebrew 0600..06FF; Arabic 0700..074F; Syriac 0780..07BF; Thaana 0900..097F; Devanagari 0980..09FF; Bengali 0A00..0A7F; Gurmukhi 0A80..0AFF; Gujarati 0B00..0B7F; Oriya 0B80..0BFF; Tamil 0C00..0C7F; Telugu 0C80..0CFF; Kannada 0D00..0D7F; Malayalam 0D80..0DFF; Sinhala 0E00..0E7F; Thai 0E80..0EFF; Lao 0F00..0FFF; Tibetan 1000..109F; Myanmar 10A0..10FF; Georgian 1100..11FF; Hangul Jamo 1200..137F; Ethiopic 13A0..13FF; Cherokee 1400..167F; Unified Canadian Aboriginal Syllabics 1680..169F; Ogham 16A0..16FF; Runic 1700..171F; Tagalog 1720..173F; Hanunoo 1740..175F; Buhid 1760..177F; Tagbanwa 1780..17FF; Khmer 1800..18AF; Mongolian 1E00..1EFF; Latin Extended Additional 1F00..1FFF; Greek Extended 2000..206F; General Punctuation 2070..209F; Superscripts and Subscripts 20A0..20CF; Currency Symbols 20D0..20FF; Combining Diacritical Marks for Symbols 2100..214F; Letterlike Symbols 2150..218F; Number Forms 2190..21FF; Arrows 2200..22FF; Mathematical Operators 2300..23FF; Miscellaneous Technical 2400..243F; Control Pictures 2440..245F; Optical Character Recognition 2460..24FF; Enclosed Alphanumerics 2500..257F; Box Drawing 2580..259F; Block Elements 25A0..25FF; Geometric Shapes 2600..26FF; Miscellaneous Symbols 2700..27BF; Dingbats 27C0..27EF; Miscellaneous Mathematical Symbols-A 27F0..27FF; Supplemental Arrows-A 2800..28FF; Braille Patterns 2900..297F; Supplemental Arrows-B 2980..29FF; Miscellaneous Mathematical Symbols-B 2A00..2AFF; Supplemental Mathematical Operators 2E80..2EFF; CJK Radicals Supplement 2F00..2FDF; Kangxi Radicals 2FF0..2FFF; Ideographic Description 
Characters 3000..303F; CJK Symbols and Punctuation 3040..309F; Hiragana 30A0..30FF; Katakana 3100..312F; Bopomofo 3130..318F; Hangul Compatibility Jamo 3190..319F; Kanbun 31A0..31BF; Bopomofo Extended 31F0..31FF; Katakana Phonetic Extensions 3200..32FF; Enclosed CJK Letters and Months 3300..33FF; CJK Compatibility 3400..4DBF; CJK Unified Ideographs Extension A 4E00..9FFF; CJK Unified Ideographs A000..A48F; Yi Syllables A490..A4CF; Yi Radicals AC00..D7AF; Hangul Syllables D800..DB7F; High Surrogates DB80..DBFF; High Private Use Surrogates DC00..DFFF; Low Surrogates E000..F8FF; Private Use Area F900..FAFF; CJK Compatibility Ideographs FB00..FB4F; Alphabetic Presentation Forms FB50..FDFF; Arabic Presentation Forms-A FE00..FE0F; Variation Selectors FE20..FE2F; Combining Half Marks FE30..FE4F; CJK Compatibility Forms FE50..FE6F; Small Form Variants FE70..FEFF; Arabic Presentation Forms-B FF00..FFEF; Halfwidth and Fullwidth Forms FFF0..FFFF; Specials 10300..1032F; Old Italic 10330..1034F; Gothic 10400..1044F; Deseret 1D000..1D0FF; Byzantine Musical Symbols 1D100..1D1FF; Musical Symbols 1D400..1D7FF; Mathematical Alphanumeric Symbols 20000..2A6DF; CJK Unified Ideographs Extension B 2F800..2FA1F; CJK Compatibility Ideographs Supplement E0000..E007F; Tags F0000..FFFFF; Supplementary Private Use Area-A 100000..10FFFF; Supplementary Private Use Area-B
"
Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
+ generalCategoryIndexFromTag: tag
+     ^ (self allCategoryTags indexOf: tag) - 1

Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
+ generalCategoryLabelForTag: tag
+     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1

Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
+ generalCategoryTagOf: aCharacterCode
+     ^ (self generalCategoryOf: aCharacterCode)
+         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
+         ifNil: [#Cn]

Unicode class>>initialize {class initialization} · ct 2/28/2022 15:52 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+ 
+     self flag: #deduplicate. "Currently, we are downloading and parsing #unicodeData twice."
+     Compositions isEmptyOrNil ifTrue: [self initializeCompositionMappings].
+     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].

Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
+ initializeUnicodeData
+     "self initializeUnicodeData"
+     self parseUnicodeDataFrom: self unicodeData readStream.

Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
+     "self initializeUnicodeData."
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
+     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
      toNumber := [:quad | ('16r', quad) asNumber].

      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.

      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].

      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
+         tag := line copyFrom: fieldStart to: fieldEnd - 1.
+         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
+         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.

Unicode class>>reinitializeData {class initialization} · ct 2/28/2022 16:05
+ reinitializeData
+     Compositions := GeneralCategory := nil.
+     self initialize.
UnicodeTest
+ TestCase subclass: #UnicodeTest
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ 
+ UnicodeTest class
+     instanceVariableNames: ''
+ 
+ ""

UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
+ resources
+     ^ super resources copyWith: UnicodeTestResource

UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryLabel
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
+     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
+ 
+     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
+     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
+ 
+     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).

UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
+ testGeneralCategoryLabelForTag
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).

UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryTag
+     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
+     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
+ 
+     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
+     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
+ 
+     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).

UnicodeTestResource
+ TestResource subclass: #UnicodeTestResource
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ 
+ UnicodeTestResource class
+     instanceVariableNames: ''
+ 
+ ""

UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
+ setUp
+     super setUp.
+ 
+     "Test the functionality of this update logic"
+     Unicode initializeCompositionMappings.
+     Unicode initializeUnicodeData.
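
For readers who have not looked at the data format: #parseUnicodeDataFrom: in the diff walks semicolon-separated lines, taking the first field as the code point and the third as the category tag. The snippet below is a simplified workspace sketch of that idea and not the method from the changeset. The sample line is an illustrative entry in the UnicodeData.txt layout, and #findTokens: is used here for brevity even though it skips empty fields, which is presumably why the real method scans with #indexOf:startingAt: instead.

```smalltalk
"Simplified sketch; the sample line is illustrative."
| line fields codePoint tag |
line := '1F388;BALLOON;So;0;ON;;;;;N;;;;;'.
fields := line findTokens: ';'.                  "empty fields are dropped, so only the leading fields are reliable"
codePoint := ('16r', fields first) asNumber.     "same conversion as the toNumber block in the diff"
tag := (fields at: 3) asSymbol.
Transcript show: codePoint printString, ' -> ', tag printString; cr.   "prints: 127880 -> #So"
```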
Sent from Squeak Inbox Talk
On 2022-02-25T11:36:53+01:00, marcel.taeumel at hpi.de wrote:
Hi Christoph --
Thanks for doing this!
> Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
I think it is okay. We could make it explicit in the ReleaseBuilder or in some external CI script but having it part of the update stream is okay. The updates are fetched from the outside anyway, right? ;-)
Best,
Marcel

On 24.02.2022 22:16:42, christoph.thiede at student.hpi.uni-potsdam.de wrote:

Hi all, hi Marcel, hi Levente,

finally, here is a changeset that takes the first step towards updating our in-image Unicode database. After filing it in, please run:

    Unicode initializeUnicodeData.

In addition to the preamble (please read it first, see below), I still have a number of questions:
- Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
- How much effort shall we put in deduplicating the logic and the data in this class? This includes both the two similar parsing methods and the redundant specification of the Unicode character tags.
Best, Christoph
=============== Summary ===============
Change Set:        UnicodeData
Date:              24 February 2022
Author:            Christoph Thiede

This changeset repairs the fetching & parsing of Unicode category data. Usage:

    Unicode initializeUnicodeData.
    Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
+ allCategoryTags
+     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)

Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
+ generalCategoryIndexFromTag: tag
+     ^ (self allCategoryTags indexOf: tag) - 1

Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
+ generalCategoryLabelForTag: tag
+     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1

Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
+ generalCategoryTagOf: aCharacterCode
+     ^ (self generalCategoryOf: aCharacterCode)
+         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
+         ifNil: [#Cn]

Unicode class>>initialize {class initialization} · ct 2/24/2022 21:43 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].

Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
+ initializeUnicodeData
+     "self initializeUnicodeData"
+     self parseUnicodeDataFrom: self unicodeData readStream.

Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
+     "self initializeUnicodeData."
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
+     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
      toNumber := [:quad | ('16r', quad) asNumber].

      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.

      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].

      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
+         tag := line copyFrom: fieldStart to: fieldEnd - 1.
+         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
+         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.
UnicodeTest
+ TestCase subclass: #UnicodeTest
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ 
+ UnicodeTest class
+     instanceVariableNames: ''
+ 
+ ""

UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
+ resources
+     ^ super resources copyWith: UnicodeTestResource

UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryLabel
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
+     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
+ 
+     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
+     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
+ 
+     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).

UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
+ testGeneralCategoryLabelForTag
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).

UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryTag
+     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
+     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
+ 
+     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
+     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
+ 
+     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).

UnicodeTestResource
+ TestResource subclass: #UnicodeTestResource
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ 
+ UnicodeTestResource class
+     instanceVariableNames: ''
+ 
+ ""

UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
+ setUp
+     super setUp.
+ 
+     "Test the functionality of this update logic"
+     Unicode initializeCompositionMappings.
+     Unicode initializeUnicodeData.
Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
On 2020-09-11T00:49:24+02:00, leves at caesar.elte.hu wrote:
Hi Christoph,
On Wed, 9 Sep 2020, Thiede, Christoph wrote:
Hi Levente,
basically, I would only like to get rid of the class variables for every single Unicode category because they provide low explorability (it's hard to work with the numeric output of #generalCategoryOf:) and low extensibility (you need to recompile the class definition for adding UTF-16 support). If you are critical of increasing the size of the SparseLargeTable, I think we could also just make one or two extra dictionaries to map every category symbol to a number and vice versa. What do you think?
You mean an array to map the integers to symbols, right? :) Anyway, I don't think it's worth using symbols internally. For example, #isLetterCode: is 8-10% slower with the extra array lookup and checking the category symbol's first letter than the current method of integer comparisons.
Do you expect these constants to appear outside the Unicode class? If yes, then using symbols for those cases is probably a good solution. But for internal use, the integers are better.
Levente
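
Levente's comparison can be approximated in a workspace. The sketch below is not his benchmark: it assumes the 2022 changeset from this thread is loaded (for #allCategoryTags and #generalCategoryTagOf:) and that the letter tags Ll to Lu occupy a contiguous index range, as they do in #allCategoryTags. Timings depend on the image and VM, so no figures are given here.

```smalltalk
"Rough timing sketch; assumes the UnicodeData changeset is loaded."
| code tags firstLetter lastLetter viaInteger viaSymbol |
code := $a asUnicode.
tags := Unicode allCategoryTags.
firstLetter := (tags indexOf: #Ll) - 1.     "zero-based category index of the first letter tag"
lastLetter := (tags indexOf: #Lu) - 1.      "zero-based category index of the last letter tag"
viaInteger := [(Unicode generalCategoryOf: code) between: firstLetter and: lastLetter].
viaSymbol := [(Unicode generalCategoryTagOf: code) first = $L].
Transcript show: (viaInteger value = viaSymbol value) printString; cr.   "both checks should agree"
Transcript show: viaInteger bench; cr; show: viaSymbol bench; cr.
```

The symbol variant pays for one extra array access and one character comparison per call, which is the overhead the message above refers to.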
Best,
Christoph
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Levente Uzonyi <leves at caesar.elte.hu>
Sent: Tuesday, 8 September 2020 21:43:56
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode

Hi Christoph,
On Tue, 8 Sep 2020, Thiede, Christoph wrote:
Hi all,
> Your words suggest that it has already been published, but I can't find it anywhere.
Then I must have expressed myself badly. I have not yet published any code changes, but in my original post from March, you can find a short description of the design changes I'd like to implement. Essentially, I would like to replace the separate class variables for every known character class in favor of greater flexibility.
How would your changes affect GeneralCategory? Would it still be a SparseLargeTable with ByteArray as arrayClass? If you just replace those integers with symbols, the size of the table will be at least 4 or 8 times larger in 32 or 64 bit images, respectively.
Levente.
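
The factor of 4 or 8 follows from slot sizes: a ByteArray spends one byte per entry, while an Array of symbols spends one pointer (4 or 8 bytes) per entry. As a back-of-the-envelope sketch, using the table size 16rE0080 that appears in #parseUnicodeDataFrom:, and treating the result as an upper bound because a sparse table presumably does not allocate chunks that hold only the default value:

```smalltalk
"Upper-bound arithmetic only; ignores object headers and sparse chunks."
| entries |
entries := 16rE0080.    "917632 slots"
Transcript
    show: 'ByteArray: ', entries printString, ' bytes'; cr;                 "917632"
    show: 'Array, 32-bit: ', (entries * 4) printString, ' bytes'; cr;       "3670528"
    show: 'Array, 64-bit: ', (entries * 8) printString, ' bytes'; cr.       "7341056"
```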
Eliot, are there any remaining questions regarding the VM size? Character size should be sufficient as discussed below, and of course, I can test any changes in a 32-bit image, too. :-)
WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect: #asCharacter)
Best, Christoph
_
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Tobias Pape <Das.Linux at gmx.de>
Sent: Sunday, 6 September 2020 21:00:14
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
On 06.09.2020, at 20:40, Levente Uzonyi <leves at caesar.elte.hu> wrote:
On Sun, 6 Sep 2020, Tobias Pape wrote:
>
>> On 06.09.2020, at 19:15, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>> Hi Christoph, Hi All,
>>> On Mar 17, 2020, at 3:51 PM, Thiede, Christoph <Christoph.Thiede at student.hpi.uni-potsdam.de> wrote:
>>> Hi all! :-)
>>> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as 😺❤🤓 are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
>> And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values). In the 32-bit variant Characters are 30-bit unsigned integers. In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
>> Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant? It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
>> Q2, how many bits should the 64-bit variant VM support for immediate Characters?
>
> Unicode has a max value of 0x10FFFF. That makes 21 bit. So no worries there.
>
> We should just not forget the leading-char stuff (Yoshiki, Andreas,...)
AFAIU the leading char only makes sense when you have multiple CJK(V?) languages in use at the same time. In other cases Unicode (leadingChar = 0) is perfectly fine. IIRC there are 22 bits available for the codePoint and 8 for the leadingChar, so we're still good: all unicode characters fit.
\o/ hooray!
Levente
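
The bit counts in this exchange are easy to check in a workspace. This is plain integer arithmetic and does not depend on the changeset; the 22/8 split is the one recalled in the message above ("IIRC") and is not verified here.

```smalltalk
| maxCodePoint |
maxCodePoint := 16r10FFFF.                                               "largest Unicode code point"
Transcript show: maxCodePoint highBit printString; cr.                   "21 bits needed"
Transcript show: (maxCodePoint < (1 bitShift: 22)) printString; cr.      "true, so it fits in a 22-bit codePoint field"
Transcript show: (22 + 8) printString; cr.                               "30, codePoint plus leadingChar bits"
```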
>
> Best regards
>       -Tobias
>
>> Then something to consider is that it is conceptually possible to support something like WideCharacter, which would represent code points outside of the immediate Character range on the 32-bit variant, analogous to LargePositiveInteger beyond SmallInteger maxVal. This can be made to work seamlessly, just as it does currently with integers, and with Floats where SmallFloat64 is only used on 64-bits.
>> It has implications in a few parts of the system:
>> - failure code for WideString (VeryWideString?) at:[put:] primitives that would have to manage overflow into/access from WideCharacter instances
>> - ImageSegment and other (un)pickling systems that need to convert to/from a bit-specific "wire" protocol/representation
>> - 32-bit <=> 64-bit image conversion
>> All this is easily doable (because we have models of doing it for Float and Integer general instances). But we need good specifications so we can implement the right thing from the get-go.
>>> At the moment, we have 30 class variables each for one Unicode category number. These class vars map in alphabetical order to the integers from 0 to: 29. Is this tedious structure really necessary? For different purposes, I would like to get the category name of a specific code point from a client. The current design makes this impossible without writing additional mappings.
>>> Tl;dr: I would like to propose to drop these class variables and use Symbols instead. They are comparable like integers, and as they are flyweights, this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with later added category names (though I don't know whether this will ever happen).
>>> Examples:
>>> Unicode generalTagOf: $a asUnicode. "#Ll"
>>> Unicode class >> isLetterCode: charCode
>>>     ^ (self generalTagOf: charCode) first = $L
>>> Unicode class >> isAlphaNumericCode: charCode
>>>     | tag|
>>>     ^ (tag := self generalCategoryOf: charCode) first = $L
>>>         or: [tag = #Nd]
>>> How do you think about this proposal? Please let me know and I will go ahead! :D
>>> Best,
>>> Christoph
>> Best, Eliot
>> _,,,^..^,,,_ (phone)
Attachment: UnicodeData.3.cs (application/octet-stream, 10387 bytes), URL: http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20220228/2d10d9ef/attachment.obj
Hi Eliot, hi Christoph --
Unicode reinitializeData.
I think that this method has an unfortunate name. Since it downloads data from the Internet, it should be called #downloadAndInitializeData.
And that's the reason for it not being in the post-load script. We might put the raw info there, but it would be very surprising if "Update Squeak" fetches data other than from source.squeak.org.
Best,
Marcel

On 04.04.2022 21:17:17, Thiede, Christoph <christoph.thiede@student.hpi.uni-potsdam.de> wrote:

> If this is essential
Well, how do you define essential? You can still use your image with the old font definitions. However, for some newer code points such as 😁😎😍, Unicode generalCategoryLabelOf: and friends will answer "not assigned" without the upgrade. You can see the difference by browsing any comprehensive font in the FontImporter. But I am not aware of any code path that relies on the presence of newer Unicode data.
Apart from that, I was already discussing with Marcel what the consequences of downloading data from a third-party server during an image update would be. There might be images, most likely server images, that do not have free internet access due to a strict firewall. Hypothetically, this might even introduce security issues. So in the end, we decided to leave this optional for now. It will only break if a future patch of some package relies on exact Unicode data.
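
For images behind a firewall, the optional update could be wrapped so that a failed download is reported instead of opening a debugger. This is a sketch only, not part of the merged code: catching the general Error class is an assumption about which exceptions the download raises. Note also that #reinitializeData, as shown in the changeset, clears both tables before calling #initialize, so after a failed download they may stay empty until the next successful run.

```smalltalk
"Sketch; assumes the changeset's #reinitializeData is present."
[Unicode reinitializeData]
    on: Error
    do: [:error |
        "Report the failure and carry on with whatever data is left."
        Transcript show: 'Unicode data not updated: ', error description; cr].
```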
Best,
Christoph

From: Squeak-dev <squeak-dev-bounces@lists.squeakfoundation.org> on behalf of Eliot Miranda <eliot.miranda@gmail.com>
Sent: Monday, 4 April 2022 20:54:59
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
On Apr 4, 2022, at 11:20 AM, Christoph.Thiede@student.hpi.uni-potsdam.de wrote:
Merged via Multilingual-ct.271, Multilingual-ct.272, MultilingualTests-ct.41, and ReleaseBuilder-ct.231.
Please run the following in your image to install the new Unicode data (and to uncover any regressions I may have missed :D):
Unicode reinitializeData.
If this is essential then it *must* be added as a post load script to one (or more) of the relevant packages. Asking “did you run Unicode reinitializeData?” when someone reports a strange bug isn’t acceptable.
Best, Christoph
--- Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
On 2022-02-28T16:09:06+01:00, christoph.thiede@student.hpi.uni-potsdam.de wrote:
Hi Marcel, thanks for the review! Below is an updated changeset. If you have no further objections, I would like to merge it within the next few days. :-)
=============== Summary ===============
Change Set:        UnicodeData
Date:              24 February 2022
Author:            Christoph Thiede

This changeset repairs the fetching & parsing of Unicode category data. Usage:

    Unicode reinitializeData.
    Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests. Furthermore, the Unicode data are automatically reinitialized as part of the ReleaseBuilder.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
ReleaseBuilder class>>prepareSourceCode {preparing} · ct 2/28/2022 15:54 (changed)
  prepareSourceCode
      "Update packages. Remove foreign packages. Recompile."

      CurrentReadOnlySourceFiles cacheDuring:
          [self
              updateCorePackages;
              unloadForeignPackages;
              checkForDirtyPackages;
              loadWellKnownPackages;
              checkForUndeclaredSymbols;
              checkForNilCategories;
-             recompileAll]
+             recompileAll;
+             updateDatabases]

ReleaseBuilder class>>updateDatabases {scripts - support} · ct 2/28/2022 16:06
+ updateDatabases
+     Unicode reinitializeData.

Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
+ allCategoryTags
+     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)

Unicode class>>blocks320Comment {comments} · ct 2/28/2022 15:50 (changed)
  blocks320Comment
"# Blocks-3.2.0.txt # Correlated with Unicode 3.2 # Start Code..End Code; Block Name 0000..007F; Basic Latin 0080..00FF; Latin-1 Supplement 0100..017F; Latin Extended-A 0180..024F; Latin Extended-B 0250..02AF; IPA Extensions 02B0..02FF; Spacing Modifier Letters 0300..036F; Combining Diacritical Marks 0370..03FF; Greek and Coptic 0400..04FF; Cyrillic 0500..052F; Cyrillic Supplementary 0530..058F; Armenian 0590..05FF; Hebrew 0600..06FF; Arabic 0700..074F; Syriac 0780..07BF; Thaana 0900..097F; Devanagari 0980..09FF; Bengali 0A00..0A7F; Gurmukhi 0A80..0AFF; Gujarati 0B00..0B7F; Oriya 0B80..0BFF; Tamil 0C00..0C7F; Telugu 0C80..0CFF; Kannada 0D00..0D7F; Malayalam 0D80..0DFF; Sinhala 0E00..0E7F; Thai 0E80..0EFF; Lao 0F00..0FFF; Tibetan 1000..109F; Myanmar 10A0..10FF; Georgian 1100..11FF; Hangul Jamo 1200..137F; Ethiopic 13A0..13FF; Cherokee 1400..167F; Unified Canadian Aboriginal Syllabics 1680..169F; Ogham 16A0..16FF; Runic 1700..171F; Tagalog 1720..173F; Hanunoo 1740..175F; Buhid 1760..177F; Tagbanwa 1780..17FF; Khmer 1800..18AF; Mongolian 1E00..1EFF; Latin Extended Additional 1F00..1FFF; Greek Extended 2000..206F; General Punctuation 2070..209F; Superscripts and Subscripts 20A0..20CF; Currency Symbols 20D0..20FF; Combining Diacritical Marks for Symbols 2100..214F; Letterlike Symbols 2150..218F; Number Forms 2190..21FF; Arrows 2200..22FF; Mathematical Operators 2300..23FF; Miscellaneous Technical 2400..243F; Control Pictures 2440..245F; Optical Character Recognition 2460..24FF; Enclosed Alphanumerics 2500..257F; Box Drawing 2580..259F; Block Elements 25A0..25FF; Geometric Shapes 2600..26FF; Miscellaneous Symbols 2700..27BF; Dingbats 27C0..27EF; Miscellaneous Mathematical Symbols-A 27F0..27FF; Supplemental Arrows-A 2800..28FF; Braille Patterns 2900..297F; Supplemental Arrows-B 2980..29FF; Miscellaneous Mathematical Symbols-B 2A00..2AFF; Supplemental Mathematical Operators 2E80..2EFF; CJK Radicals Supplement 2F00..2FDF; Kangxi Radicals 2FF0..2FFF; Ideographic Description 
Characters 3000..303F; CJK Symbols and Punctuation 3040..309F; Hiragana 30A0..30FF; Katakana 3100..312F; Bopomofo 3130..318F; Hangul Compatibility Jamo 3190..319F; Kanbun 31A0..31BF; Bopomofo Extended 31F0..31FF; Katakana Phonetic Extensions 3200..32FF; Enclosed CJK Letters and Months 3300..33FF; CJK Compatibility 3400..4DBF; CJK Unified Ideographs Extension A 4E00..9FFF; CJK Unified Ideographs A000..A48F; Yi Syllables A490..A4CF; Yi Radicals AC00..D7AF; Hangul Syllables D800..DB7F; High Surrogates DB80..DBFF; High Private Use Surrogates DC00..DFFF; Low Surrogates E000..F8FF; Private Use Area F900..FAFF; CJK Compatibility Ideographs FB00..FB4F; Alphabetic Presentation Forms FB50..FDFF; Arabic Presentation Forms-A FE00..FE0F; Variation Selectors FE20..FE2F; Combining Half Marks FE30..FE4F; CJK Compatibility Forms FE50..FE6F; Small Form Variants FE70..FEFF; Arabic Presentation Forms-B FF00..FFEF; Halfwidth and Fullwidth Forms FFF0..FFFF; Specials 10300..1032F; Old Italic 10330..1034F; Gothic 10400..1044F; Deseret 1D000..1D0FF; Byzantine Musical Symbols 1D100..1D1FF; Musical Symbols 1D400..1D7FF; Mathematical Alphanumeric Symbols 20000..2A6DF; CJK Unified Ideographs Extension B 2F800..2FA1F; CJK Compatibility Ideographs Supplement E0000..E007F; Tags F0000..FFFFF; Supplementary Private Use Area-A 100000..10FFFF; Supplementary Private Use Area-B
"
Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
- generalCategoryIndexFromTag: tag
-     ^ (self allCategoryTags indexOf: tag) - 1
Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
- generalCategoryLabelForTag: tag
-     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1
Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
- generalCategoryTagOf: aCharacterCode
-     ^ (self generalCategoryOf: aCharacterCode)
-         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
-         ifNil: [#Cn]
Unicode class>>initialize {class initialization} · ct 2/28/2022 15:52 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
-
-     self flag: #deduplicate. "Currently, we are downloading and parsing #unicodeData twice."
-     Compositions isEmptyOrNil ifTrue: [self initializeCompositionMappings].
-     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].
Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
- initializeUnicodeData
-     "self initializeUnicodeData"
-     self parseUnicodeDataFrom: self unicodeData readStream.
Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
-     "self initializeUnicodeData."
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
      toNumber := [:quad | ('16r', quad) asNumber].
      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
-         tag := line copyFrom: fieldStart to: fieldEnd - 1.
-         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
-         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.
Unicode class>>reinitializeData {class initialization} · ct 2/28/2022 16:05
- reinitializeData
-     Compositions := GeneralCategory := nil.
-     self initialize.
UnicodeTest
- TestCase subclass: #UnicodeTest
-     instanceVariableNames: ''
-     classVariableNames: ''
-     poolDictionaries: ''
-     category: 'MultilingualTests-Encodings'
- UnicodeTest class
-     instanceVariableNames: ''
- ""
UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
- resources
-     ^ super resources copyWith: UnicodeTestResource
UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
- testGeneralCategoryLabel
-     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
-     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
-
-     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
-     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
-
-     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).
UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
- testGeneralCategoryLabelForTag
-     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).
UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
- testGeneralCategoryTag
-     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
-     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
-
-     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
-     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
-
-     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).
UnicodeTestResource
- TestResource subclass: #UnicodeTestResource
-     instanceVariableNames: ''
-     classVariableNames: ''
-     poolDictionaries: ''
-     category: 'MultilingualTests-Encodings'
- UnicodeTestResource class
-     instanceVariableNames: ''
- ""
UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
- setUp
-     super setUp.
-
-     "Test the functionality of this update logic"
-     Unicode initializeCompositionMappings.
-     Unicode initializeUnicodeData.
Sent from Squeak Inbox Talk
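For readers who want to try the field extraction from #parseUnicodeDataFrom: without downloading anything, here is a minimal workspace sketch. It assumes one hand-copied record in the semicolon-separated UnicodeData.txt layout and reuses the 16r-prefix conversion from the changeset. The variable names and the use of #subStrings: are illustrative only and not part of the changeset.

```smalltalk
| line fields codePoint tag |
"One sample record for $a, in the semicolon-separated UnicodeData.txt layout."
line := '0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041'.
"Only the first three fields are read, so it does not matter that #subStrings: drops the empty fields further right."
fields := line subStrings: ';'.
"Field 1 is the code point in hex; same conversion as the toNumber block in the changeset."
codePoint := ('16r', fields first) asNumber.
"Field 3 is the general category tag."
tag := fields third asSymbol.
{codePoint. tag}
```

Evaluated in a workspace, this should answer the code point 97 together with the tag #Ll. The changeset itself walks the line with #indexOf:startingAt: instead, which also copes with empty fields.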
On 2022-02-25T11:36:53+01:00, marcel.taeumel at hpi.de wrote:
Hi Christoph --
Thanks for doing this!
> Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
I think it is okay. We could make it explicit in the ReleaseBuilder or in some external CI script but having it part of the update stream is okay. The updates are fetched from the outside anyway, right? ;-)
Best, Marcel
On 24.02.2022 22:16:42, christoph.thiede at student.hpi.uni-potsdam.de wrote:
Hi all, Hi Marcel, Hi Levente,
finally, here is a changeset that takes the first step for updating our in-image Unicode database. After filing it in, please run:
    Unicode initializeUnicodeData.
In addition to the preamble (please read first below), I still have a number of questions:
- Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
- How much effort shall we put in deduplicating the logic and the data in this class? This includes both the two similar parsing methods and the redundant specification of the Unicode character tags.
Best, Christoph
=============== Summary ===============
Change Set:        UnicodeData
Date:              24 February 2022
Author:            Christoph Thiede
This changeset repairs the fetching & parsing of unicode category data. Usage:
    Unicode initializeUnicodeData.
    Unicode generalCategoryLabelOf: 16r1F388. "'Symbol, Other'"
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
- allCategoryTags
-     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)
Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
- generalCategoryIndexFromTag: tag
-     ^ (self allCategoryTags indexOf: tag) - 1
Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
- generalCategoryLabelForTag: tag
-     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1
Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
- generalCategoryTagOf: aCharacterCode
-     ^ (self generalCategoryOf: aCharacterCode)
-         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
-         ifNil: [#Cn]
Unicode class>>initialize {class initialization} · ct 2/24/2022 21:43 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
-     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].
Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
- initializeUnicodeData
-     "self initializeUnicodeData"
-     self parseUnicodeDataFrom: self unicodeData readStream.
Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
-     "self initializeUnicodeData."
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
      toNumber := [:quad | ('16r', quad) asNumber].
      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
-         tag := line copyFrom: fieldStart to: fieldEnd - 1.
-         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
-         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.
UnicodeTest
- TestCase subclass: #UnicodeTest
-     instanceVariableNames: ''
-     classVariableNames: ''
-     poolDictionaries: ''
-     category: 'MultilingualTests-Encodings'
- UnicodeTest class
-     instanceVariableNames: ''
- ""
UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
- resources
-     ^ super resources copyWith: UnicodeTestResource
UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
- testGeneralCategoryLabel
-     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
-     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
-
-     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
-     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
-
-     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).
UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
- testGeneralCategoryLabelForTag
-     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).
UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
- testGeneralCategoryTag
-     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
-     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
-
-     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
-     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
-
-     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).
UnicodeTestResource
- TestResource subclass: #UnicodeTestResource
-     instanceVariableNames: ''
-     classVariableNames: ''
-     poolDictionaries: ''
-     category: 'MultilingualTests-Encodings'
- UnicodeTestResource class
-     instanceVariableNames: ''
- ""
UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
- setUp
-     super setUp.
-
-     "Test the functionality of this update logic"
-     Unicode initializeCompositionMappings.
-     Unicode initializeUnicodeData.
Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
On 2020-09-11T00:49:24+02:00, leves at caesar.elte.hu wrote:
Hi Christoph,
On Wed, 9 Sep 2020, Thiede, Christoph wrote:
Hi Levente,
basically, I would only like to get rid of the class variables for every single Unicode category because they provide low explorability (it's hard to work with the numeric output of #generalCategoryOf:) and low extensibility (you need to recompile the class definition for adding UTF-16 support). If you are critical of increasing the size of the SparseLargeTable, I think we could also just make one or two extra dictionaries to map every category symbol to a number and vice versa. What do you think?
You mean an array to map the integers to symbols, right? :) Anyway, I don't think it's worth using symbols internally. For example, #isLetterCode: is 8-10% slower with the extra array lookup and checking the category symbol's first letter than the current method of integer comparisons.
Do you expect these constants to appear outside the Unicode class? If yes, then using symbols for those cases is probably a good solution. But for internal use, the integers are better.
Levente
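To make the two directions of that mapping concrete, here is a small workspace sketch. It copies the tag order from #allCategoryTags in the 2022 changeset quoted earlier in this archive; the block names are made up for illustration and do not exist in the image.

```smalltalk
| tags indexFromTag tagFromIndex |
tags := #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs).
"Symbol -> zero-based integer, as it would be stored in the table."
indexFromTag := [:tag | (tags indexOf: tag) - 1].
"Zero-based integer -> symbol, falling back to #Cn for unknown indices."
tagFromIndex := [:index | tags at: index + 1 ifAbsent: [#Cn]].
{indexFromTag value: #Ll. tagFromIndex value: 13. tagFromIndex value: 99}
```

With this order, #Ll maps to 5, index 13 maps back to #Nd, and an out-of-range index falls back to #Cn. Internal hot paths such as #isLetterCode: could keep comparing integers, so only the public accessors would pay for the array lookup.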
Best,
Christoph
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Levente Uzonyi <leves at caesar.elte.hu>
Sent: Tuesday, 8 September 2020 21:43:56
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
Hi Christoph,
On Tue, 8 Sep 2020, Thiede, Christoph wrote:
Hi all,
> Your words suggest that it has already been published, but I can't find it anywhere.
Then I must have expressed myself wrong. I did not yet publish any code changes, but in my original post from March, you can find a short description of the design changes I'd like to implement. Essentially, I would like to replace the separate class variables for every known character class in favor of greater flexibility.
How would your changes affect GeneralCategory? Would it still be a SparseLargeTable with ByteArray as arrayClass? If you just replace those integers with symbols, the size of the table will be at least 4 or 8 times larger in 32 or 64 bit images, respectively.
Levente.
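A rough back-of-the-envelope check of that factor, as a workspace sketch. It assumes a fully populated table of 16rE0080 slots (the size used in #parseUnicodeDataFrom:), which overstates the real footprint because a SparseLargeTable drops default-only chunks. Only the ratio matters here, and the variable names are illustrative.

```smalltalk
| slots bytesAsByteArray bytesAs32BitOops bytesAs64BitOops |
slots := 16rE0080.
"One byte per category integer when arrayClass is ByteArray."
bytesAsByteArray := slots * 1.
"One 4-byte object pointer per Symbol in a 32-bit image."
bytesAs32BitOops := slots * 4.
"One 8-byte object pointer per Symbol in a 64-bit image."
bytesAs64BitOops := slots * 8.
{bytesAsByteArray. bytesAs32BitOops. bytesAs64BitOops}
```

Under these assumptions the upper bounds are 917632, 3670528 and 7341056 bytes, i.e. roughly 0.9 MB versus 3.7 MB or 7.3 MB.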
Eliot, are there any remaining questions regarding the VM size? Character size should be sufficient as discussed below, and of course, I can test any changes in a 32-bit image, too. :-)
WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect: #asCharacter)
Best, Christoph
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Tobias Pape <Das.Linux at gmx.de>
Sent: Sunday, 6 September 2020 21:00:14
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
On 06.09.2020, at 20:40, Levente Uzonyi <leves at caesar.elte.hu> wrote:
On Sun, 6 Sep 2020, Tobias Pape wrote:
>> On 06.09.2020, at 19:15, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>> Hi Christoph, Hi All,
>>> On Mar 17, 2020, at 3:51 PM, Thiede, Christoph <Christoph.Thiede at student.hpi.uni-potsdam.de> wrote:
>>> Hi all! :-)
>>> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as 😺❤🤓 are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
>> And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values). In the 32-bit variant Characters are 30-bit unsigned integers. In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
>> Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant? It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
>> Q2, how many bits should the 64-bit variant VM support for immediate Characters?
> Unicode has a max value of 0x10FFFF. That makes 21 bit. So no worries there.
> We should just not forget the leading-char stuff (Yoshiki, Andreas,...)
AFAIU the leading char only makes sense when you have multiple CJK(V?) languages in use at the same time. In other cases Unicode (leadingChar = 0) is perfectly fine. IIRC there are 22 bits available for the codePoint and 8 for the leadingChar, so we're still good: all unicode characters fit.
\o/ hooray!
Levente
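The bit counts in this exchange are easy to double-check in a workspace. This sketch uses only plain Integer arithmetic; the 22-bit figure for the codePoint field is taken from the message above and is not verified against the VM sources here.

```smalltalk
| maxCodePoint bitsNeeded fitsIn22Bits |
maxCodePoint := 16r10FFFF.
"Position of the highest set bit, i.e. the number of bits needed."
bitsNeeded := maxCodePoint highBit.
"Compare against the largest value representable in 22 bits."
fitsIn22Bits := maxCodePoint <= ((1 bitShift: 22) - 1).
{bitsNeeded. fitsIn22Bits}
```

This answers 21 and true, which matches both statements: the largest code point needs 21 bits and fits into a 22-bit field with room to spare.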
> Best regards
>     -Tobias
>> Then something to consider is that it is conceptually possible to support something like WideCharacter, which would represent code points outside of the immediate Character range on the 32-bit variant, analogous to LargePositiveInteger beyond SmallInteger maxVal. This can be made to work seamlessly, just as it does currently with integers, and with Floats where SmallFloat64 is only used on 64-bits.
>> It has implications in a few parts of the system:
>> - failure code for WideString (VeryWideString?) at:[put:] primitives that would have to manage overflow into/access from WideCharacter instances
>> - ImageSegment and other (un)pickling systems that need to convert to/from a bit-specific "wire" protocol/representation
>> - 32-bit <=> 64-bit image conversion
>> All this is easily doable (because we have models of doing it for Float and Integer general instances). But we need good specifications so we can implement the right thing from the get-go.
>>> At the moment, we have 30 class variables each for one Unicode category number. These class vars map in alphabetical order to the integers from 0 to: 29. Is this tedious structure really necessary? For different purposes, I would like to get the category name of a specific code point from a client. The current design makes this impossible without writing additional mappings.
>>> Tl;dr: I would like to propose to drop these class variables and use Symbols instead. They are comparable like integers, and as they are flyweights, this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with later added category names (though I don't know whether this will ever happen).
>>> Examples:
>>> Unicode generalTagOf: $a asUnicode. "#Ll"
>>> Unicode class >> isLetterCode: charCode
>>>     ^ (self generalTagOf: charCode) first = $L
>>> Unicode class >> isAlphaNumericCode: charCode
>>>     | tag|
>>>     ^ (tag := self generalCategoryOf: charCode) first = $L
>>>         or: [tag = #Nd]
>>> How do you think about this proposal? Please let me know and I will go ahead! :D
>>> Best,
>>> Christoph
>> Best, Eliot
>> _,,,^..^,,,_ (phone)
["UnicodeData.2.cs"]
Would it be possible/practical to separate this?
* Download and transform into code/objects
* Distribute the generated code/objects via the update stream directly
One could run the download&generate step as needed to update the data. (CI, release build, manually) Or are there any reasons not to do that?
I was asking myself the same thing recently about the package that provides time zone information. It needs a Unix timezone database from the operating system to initialize, rather than providing Smalltalk objects/code directly in Monticello, based on the official online database.
Kind regards, Jakob
On Tue, 5 Apr 2022 at 08:59, Marcel Taeumel <marcel.taeumel@hpi.de> wrote:
Hi Eliot, hi Christoph --
Unicode reinitializeData.
I think that this method has an unfortunate name. Since it downloads data from the Internet, it should be called #downloadAndInitializeData.
And that's the reason for it not being in the post-load script. We might put the raw info there, but it would be very surprising if "Update Squeak" fetches data other than from source.squeak.org.
Best, Marcel
On 04.04.2022 21:17:17, Thiede, Christoph <christoph.thiede@student.hpi.uni-potsdam.de> wrote:
If this is essential
Well, how do you define essential? You can still use your image with the old font definitions. However, for some newer codepoints such as 😁😎😍, Unicode generalCategoryLabelOf: and friends will answer "not assigned" without the upgrade. You can watch the difference by browsing any comprehensive font in the FontImporter. But I am not aware of any code path that relies on the presence of newer Unicode data.
Apart from that, I was already discussing with Marcel what the consequences of downloading data from a third-party server during an image update would be. There might be some images, most likely server images, that do not have free internet access due to a strict firewall. Hypothetically, this might even introduce security issues. So in the end, we decided on leaving this optional for now. It will only break if a future patch of some package relies on exact Unicode data.
Best,
Christoph
From: Squeak-dev <squeak-dev-bounces@lists.squeakfoundation.org> on behalf of Eliot Miranda <eliot.miranda@gmail.com>
Sent: Monday, 4 April 2022 20:54:59
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
On Apr 4, 2022, at 11:20 AM, Christoph.Thiede@student.hpi.uni-potsdam.de wrote:
Merged via Multilingual-ct.271, Multilingual-ct.272, MultilingualTests-ct.41, and ReleaseBuilder-ct.231.
Please run the following in your image to install the new Unicode data (and to uncover any regressions I may have missed :D):
Unicode reinitializeData.
If this is essential then it *must* be added as a post load script to one (or more) of the relevant packages. Asking “did you run Unicode reinitializeData?” when someone reports a strange bug isn’t acceptable.
Best, Christoph
Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
On 2022-02-28T16:09:06+01:00, christoph.thiede@student.hpi.uni-potsdam.de wrote:
Hi Marcel, thanks for the review! Below is an updated changeset. If you
have no further objections, I would like to merge it within the next few days. :-)
=============== Summary ===============
Change Set:        UnicodeData
Date:              24 February 2022
Author:            Christoph Thiede
This changeset repairs the fetching & parsing of unicode category data.
Usage:
    Unicode reinitializeData.
    Unicode generalCategoryLabelOf: 16r1F388. "'Symbol, Other'"
This revision resolves some slips in the category tags, adds an
interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests. Furthermore, the Unicode data are automatically reinitialized as part of the ReleaseBuilder.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and
#parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see:
http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful
one.
Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
ReleaseBuilder class>>prepareSourceCode {preparing} · ct 2/28/2022 15:54 (changed)
  prepareSourceCode
      "Update packages. Remove foreign packages. Recompile."
      CurrentReadOnlySourceFiles cacheDuring:
          [self
              updateCorePackages;
              unloadForeignPackages;
              checkForDirtyPackages;
              loadWellKnownPackages;
              checkForUndeclaredSymbols;
              checkForNilCategories;
-             recompileAll]
-             recompileAll;
-             updateDatabases]
ReleaseBuilder class>>updateDatabases {scripts - support} · ct 2/28/2022 16:06
- updateDatabases
-     Unicode reinitializeData.
Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
- allCategoryTags
-     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)
Unicode class>>blocks320Comment {comments} · ct 2/28/2022 15:50 (changed)
  blocks320Comment
"# Blocks-3.2.0.txt # Correlated with Unicode 3.2 # Start Code..End Code; Block Name 0000..007F; Basic Latin 0080..00FF; Latin-1 Supplement 0100..017F; Latin Extended-A 0180..024F; Latin Extended-B 0250..02AF; IPA Extensions 02B0..02FF; Spacing Modifier Letters 0300..036F; Combining Diacritical Marks 0370..03FF; Greek and Coptic 0400..04FF; Cyrillic 0500..052F; Cyrillic Supplementary 0530..058F; Armenian 0590..05FF; Hebrew 0600..06FF; Arabic 0700..074F; Syriac 0780..07BF; Thaana 0900..097F; Devanagari 0980..09FF; Bengali 0A00..0A7F; Gurmukhi 0A80..0AFF; Gujarati 0B00..0B7F; Oriya 0B80..0BFF; Tamil 0C00..0C7F; Telugu 0C80..0CFF; Kannada 0D00..0D7F; Malayalam 0D80..0DFF; Sinhala 0E00..0E7F; Thai 0E80..0EFF; Lao 0F00..0FFF; Tibetan 1000..109F; Myanmar 10A0..10FF; Georgian 1100..11FF; Hangul Jamo 1200..137F; Ethiopic 13A0..13FF; Cherokee 1400..167F; Unified Canadian Aboriginal Syllabics 1680..169F; Ogham 16A0..16FF; Runic 1700..171F; Tagalog 1720..173F; Hanunoo 1740..175F; Buhid 1760..177F; Tagbanwa 1780..17FF; Khmer 1800..18AF; Mongolian 1E00..1EFF; Latin Extended Additional 1F00..1FFF; Greek Extended 2000..206F; General Punctuation 2070..209F; Superscripts and Subscripts 20A0..20CF; Currency Symbols 20D0..20FF; Combining Diacritical Marks for Symbols 2100..214F; Letterlike Symbols 2150..218F; Number Forms 2190..21FF; Arrows 2200..22FF; Mathematical Operators 2300..23FF; Miscellaneous Technical 2400..243F; Control Pictures 2440..245F; Optical Character Recognition 2460..24FF; Enclosed Alphanumerics 2500..257F; Box Drawing 2580..259F; Block Elements 25A0..25FF; Geometric Shapes 2600..26FF; Miscellaneous Symbols 2700..27BF; Dingbats 27C0..27EF; Miscellaneous Mathematical Symbols-A 27F0..27FF; Supplemental Arrows-A 2800..28FF; Braille Patterns 2900..297F; Supplemental Arrows-B 2980..29FF; Miscellaneous Mathematical Symbols-B 2A00..2AFF; Supplemental Mathematical Operators 2E80..2EFF; CJK Radicals Supplement 2F00..2FDF; Kangxi Radicals 2FF0..2FFF; Ideographic Description 
Characters 3000..303F; CJK Symbols and Punctuation 3040..309F; Hiragana 30A0..30FF; Katakana 3100..312F; Bopomofo 3130..318F; Hangul Compatibility Jamo 3190..319F; Kanbun 31A0..31BF; Bopomofo Extended 31F0..31FF; Katakana Phonetic Extensions 3200..32FF; Enclosed CJK Letters and Months 3300..33FF; CJK Compatibility 3400..4DBF; CJK Unified Ideographs Extension A 4E00..9FFF; CJK Unified Ideographs A000..A48F; Yi Syllables A490..A4CF; Yi Radicals AC00..D7AF; Hangul Syllables D800..DB7F; High Surrogates DB80..DBFF; High Private Use Surrogates DC00..DFFF; Low Surrogates E000..F8FF; Private Use Area F900..FAFF; CJK Compatibility Ideographs FB00..FB4F; Alphabetic Presentation Forms FB50..FDFF; Arabic Presentation Forms-A FE00..FE0F; Variation Selectors FE20..FE2F; Combining Half Marks FE30..FE4F; CJK Compatibility Forms FE50..FE6F; Small Form Variants FE70..FEFF; Arabic Presentation Forms-B FF00..FFEF; Halfwidth and Fullwidth Forms FFF0..FFFF; Specials 10300..1032F; Old Italic 10330..1034F; Gothic 10400..1044F; Deseret 1D000..1D0FF; Byzantine Musical Symbols 1D100..1D1FF; Musical Symbols 1D400..1D7FF; Mathematical Alphanumeric Symbols 20000..2A6DF; CJK Unified Ideographs Extension B 2F800..2FA1F; CJK Compatibility Ideographs Supplement E0000..E007F; Tags F0000..FFFFF; Supplementary Private Use Area-A 100000..10FFFF; Supplementary Private Use Area-B
"
Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
- generalCategoryIndexFromTag: tag
-     ^ (self allCategoryTags indexOf: tag) - 1

Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
- generalCategoryLabelForTag: tag
-     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1

Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
- generalCategoryTagOf: aCharacterCode
-     ^ (self generalCategoryOf: aCharacterCode)
-         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
-         ifNil: [#Cn]

Unicode class>>initialize {class initialization} · ct 2/28/2022 15:52 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
-
-     self flag: #deduplicate. "Currently, we are downloading and parsing #unicodeData twice."
-     Compositions isEmptyOrNil ifTrue: [self initializeCompositionMappings].
-     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].

Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
- initializeUnicodeData
-     "self initializeUnicodeData"
-     self parseUnicodeDataFrom: self unicodeData readStream.

Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
-     "self initializeUnicodeData."
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
      toNumber := [:quad | ('16r', quad) asNumber].
      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
-         tag := line copyFrom: fieldStart to: fieldEnd - 1.
-         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
-         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.

Unicode class>>reinitializeData {class initialization} · ct 2/28/2022 16:05
- reinitializeData
-     Compositions := GeneralCategory := nil.
-     self initialize.

UnicodeTest
- TestCase subclass: #UnicodeTest
-     instanceVariableNames: ''
-     classVariableNames: ''
-     poolDictionaries: ''
-     category: 'MultilingualTests-Encodings'
- UnicodeTest class
-     instanceVariableNames: ''
- ""

UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
- resources
-     ^ super resources copyWith: UnicodeTestResource

UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
- testGeneralCategoryLabel
-     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
-     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
-
-     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
-     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
-
-     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).

UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
- testGeneralCategoryLabelForTag
-     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).

UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
- testGeneralCategoryTag
-     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
-     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
-
-     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
-     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
-
-     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).

UnicodeTestResource
- TestResource subclass: #UnicodeTestResource
-     instanceVariableNames: ''
-     classVariableNames: ''
-     poolDictionaries: ''
-     category: 'MultilingualTests-Encodings'
- UnicodeTestResource class
-     instanceVariableNames: ''
- ""

UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
- setUp
-     super setUp.
-
-     "Test the functionality of this update logic"
-     Unicode initializeCompositionMappings.
-     Unicode initializeUnicodeData.
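
The tag/index conversion in the diff above can be tried out in a workspace without touching the Unicode class. The following is only a sketch: it copies the tag list from #allCategoryTags into a local variable, and the two blocks (indexFromTag, tagFromIndex) are hypothetical stand-ins for #generalCategoryIndexFromTag: and the lookup inside #generalCategoryTagOf:.

```smalltalk
"Workspace sketch: local copies only, nothing in the image is modified."
| tags indexFromTag tagFromIndex |
tags := #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs).

"Same arithmetic as #generalCategoryIndexFromTag: (zero-based index)."
indexFromTag := [:tag | (tags indexOf: tag) - 1].

"Same fallback as #generalCategoryTagOf: (an unknown index answers #Cn)."
tagFromIndex := [:index | tags at: index + 1 ifAbsent: [#Cn]].

indexFromTag value: #Ll.    "5"
indexFromTag value: #Nd.    "13"
tagFromIndex value: 13.     "#Nd"
tagFromIndex value: 99.     "#Cn"
```

Because #indexOf: answers 0 for a missing element, a tag that is not in the list maps to the index -1 in this sketch.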
Sent from Squeak Inbox Talk
On 2022-02-25T11:36:53+01:00, marcel.taeumel at hpi.de wrote:
Hi Christoph --
Thanks for doing this!
> Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
I think it is okay. We could make it explicit in the ReleaseBuilder or in some external CI script, but having it part of the update stream is okay. The updates are fetched from the outside anyway, right? ;-)
Best,
Marcel

On 24.02.2022 22:16:42, christoph.thiede at student.hpi.uni-potsdam.de <christoph.thiede at student.hpi.uni-potsdam.de> wrote:
Hi all, Hi Marcel, Hi Levente,
finally, here is a changeset that takes the first step toward updating our in-image Unicode database. After filing it in, please run:
    Unicode initializeUnicodeData.
In addition to the preamble (please read it first, below), I still have a number of questions:
- Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
- How much effort shall we put into deduplicating the logic and the data in this class? This includes both the two similar parsing methods and the redundant specification of the Unicode character tags.
Best,
Christoph
=============== Summary ===============
Change Set:        UnicodeData
Date:              24 February 2022
Author:            Christoph Thiede
This changeset repairs the fetching & parsing of Unicode category data. Usage:
    Unicode initializeUnicodeData.
    Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see:
http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
- allCategoryTags
-     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)

Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
- generalCategoryIndexFromTag: tag
-     ^ (self allCategoryTags indexOf: tag) - 1

Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
- generalCategoryLabelForTag: tag
-     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1

Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
- generalCategoryTagOf: aCharacterCode
-     ^ (self generalCategoryOf: aCharacterCode)
-         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
-         ifNil: [#Cn]

Unicode class>>initialize {class initialization} · ct 2/24/2022 21:43 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
-     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].

Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
- initializeUnicodeData
-     "self initializeUnicodeData"
-     self parseUnicodeDataFrom: self unicodeData readStream.

Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
-     "self initializeUnicodeData."
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
      toNumber := [:quad | ('16r', quad) asNumber].
      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
-         tag := line copyFrom: fieldStart to: fieldEnd - 1.
-         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
-         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.
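
To see what the parser extracts from one record, here is a small workspace sketch. It is not the method above: it splits the line with #subStrings: instead of scanning with #indexOf:startingAt:, and the sample line is only an illustrative record in the semicolon-separated UnicodeData.txt layout. The hex conversion reuses the '16r' idiom from the toNumber block.

```smalltalk
| line fields codePoint tag |
"Illustrative record: code point; name; general category; further fields."
line := '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;'.

"#subStrings: drops empty fields. That is harmless here because only the
first three fields are read, and those are never empty."
fields := line subStrings: ';'.

codePoint := ('16r', fields first) asNumber.    "65"
tag := (fields at: 3) asSymbol.                 "#Lu"
```

The real method keeps the empty fields in place, which matters once it walks on to the decimal digit value for #Nd characters.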

UnicodeTest
- TestCase subclass: #UnicodeTest
-     instanceVariableNames: ''
-     classVariableNames: ''
-     poolDictionaries: ''
-     category: 'MultilingualTests-Encodings'
- UnicodeTest class
-     instanceVariableNames: ''
- ""

UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
- resources
-     ^ super resources copyWith: UnicodeTestResource

UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
- testGeneralCategoryLabel
-     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
-     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
-
-     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
-     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
-
-     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).

UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
- testGeneralCategoryLabelForTag
-     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).

UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
- testGeneralCategoryTag
-     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
-     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
-
-     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
-     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
-
-     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).

UnicodeTestResource
- TestResource subclass: #UnicodeTestResource
-     instanceVariableNames: ''
-     classVariableNames: ''
-     poolDictionaries: ''
-     category: 'MultilingualTests-Encodings'
- UnicodeTestResource class
-     instanceVariableNames: ''
- ""

UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
- setUp
-     super setUp.
-
-     "Test the functionality of this update logic"
-     Unicode initializeCompositionMappings.
-     Unicode initializeUnicodeData.

Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
On 2020-09-11T00:49:24+02:00, leves at caesar.elte.hu wrote:
Hi Christoph,
On Wed, 9 Sep 2020, Thiede, Christoph wrote:
Hi Levente,
basically, I would only like to get rid of the class variables for every single Unicode category because they provide low explorability (it's hard to work with the numeric output of #generalCategoryOf:) and low extensibility (you need to recompile the class definition for adding UTF-16 support).
If you are critical of increasing the size of the SparseLargeTable, I think we could also just make one or two extra dictionaries to map every category symbol to a number and vice versa. What do you think?
You mean an array to map the integers to symbols, right? :) Anyway, I don't think it's worth using symbols internally. For example, #isLetterCode: is 8-10% slower with the extra array lookup and checking the category symbol's first letter than the current method of integer comparisons.
Do you expect these constants to appear outside the Unicode class? If yes, then using symbols for those cases is probably a good solution. But for internal use, the integers are better.
Levente
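
The two strategies compared above can be timed in a workspace. This is a rough sketch, not the actual #isLetterCode: implementation (which is not shown in this thread): it assumes the tag order of #allCategoryTags, in which the letter categories Ll, Lm, Lo, Lt and Lu occupy the indices 5 to 9, and it measures only the comparison itself, not the table lookup that precedes it.

```smalltalk
| tags category |
tags := #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs).
category := 5.    "zero-based index of #Ll in the list above"

"Integer comparison on the category number."
[category between: 5 and: 9] bench.

"Extra array lookup, then a check of the symbol's first letter."
[(tags at: category + 1) first = $L] bench.
```

#bench answers a short report of how often each block ran per second; the absolute numbers depend on the VM and machine, so only the ratio between the two is of interest.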
Best,
Christoph
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Levente Uzonyi <leves at caesar.elte.hu>
Sent: Tuesday, September 8, 2020 21:43:56
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
Hi Christoph,
On Tue, 8 Sep 2020, Thiede, Christoph wrote:
Hi all,
> Your words suggest that it has already been published, but I can't find it anywhere.
Then I must have expressed myself unclearly. I did not yet publish any code changes, but in my original post from March, you can find a short description of the design changes I'd like to implement. Essentially, I would like to replace the separate class variables for every known character class in favor of greater flexibility.
How would your changes affect GeneralCategory? Would it still be a SparseLargeTable with ByteArray as arrayClass? If you just replace those integers with symbols, the size of the table will be at least 4 or 8 times larger in 32 or 64 bit images, respectively.
Levente.
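
The factor of 4 or 8 follows from the slot width: a ByteArray stores one byte per entry, while an Array stores one pointer-sized slot per entry. A small workspace calculation makes this concrete. It uses the table size 16rE0080 from #parseUnicodeDataFrom: and treats the table as fully populated, which is an assumption: a SparseLargeTable does not keep chunks that hold only the default value, so the figures are upper bounds.

```smalltalk
| entries |
entries := 16rE0080.             "917632 entries"

entries * 1.                     "917632 bytes with one byte per entry"
entries * 4.                     "3670528 bytes with one 32-bit slot per entry"
entries * 8.                     "7341056 bytes with one 64-bit slot per entry"
entries * Smalltalk wordSize.    "the slot-based figure for the running image"
```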
Eliot, are there any remaining questions regarding the VM size? Character size should be sufficient as discussed below, and of course, I can test any changes in a 32-bit image, too. :-)
WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect: #asCharacter)
Best,
Christoph
_
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Tobias Pape <Das.Linux at gmx.de>
Sent: Sunday, September 6, 2020 21:00:14
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
> On 06.09.2020, at 20:40, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>
> On Sun, 6 Sep 2020, Tobias Pape wrote:
>
>>> On 06.09.2020, at 19:15, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>>> Hi Christoph, Hi All,
>>>> On Mar 17, 2020, at 3:51 PM, Thiede, Christoph <Christoph.Thiede at student.hpi.uni-potsdam.de> wrote:
>>>> Hi all! :-)
>>>> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as 😺❤🤓 are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
>>> And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values). In the 32-bit variant Characters are 30-bit unsigned integers. In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
>>> Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant? It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
>>> Q2, how many bits should the 64-bit variant VM support for immediate Characters?
>>
>> Unicode has a max value of 0x10FFFF. That makes 21 bit. So no worries there.
>>
>> We should just not forget the leading-char stuff (Yoshiki, Andreas,...)
>
> AFAIU the leading char only makes sense when you have multiple CJK(V?) languages in use at the same time. In other cases Unicode (leadingChar = 0) is perfectly fine.
> IIRC there are 22 bits available for the codePoint and 8 for the leadingChar, so we're still good: all unicode characters fit.
>
\o/ hooray!
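
The bit counts quoted above are easy to check in a workspace. The sketch below only restates the numbers from the messages (a largest code point of 16r10FFFF, 22 bits for the code point, 8 bits for the leadingChar); it does not inspect the actual Character layout of the VM.

```smalltalk
"Position of the highest set bit, i.e. the number of bits needed."
16r10FFFF highBit.                      "21"

"The largest code point fits into a 22-bit field."
16r10FFFF <= ((1 bitShift: 22) - 1).    "true"

"Code point bits plus leadingChar bits give the 30-bit immediate Character."
22 + 8.                                 "30"
```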
> Levente
>
>> Best regards
>>     -Tobias
>>
>>> Then something to consider is that it is conceptually possible to support something like WideCharacter, which would represent code points outside of the immediate Character range on the 32-bit variant, analogous to LargePositiveInteger beyond SmallInteger maxVal. This can be made to work seamlessly, just as it does currently with integers, and with Floats where SmallFloat64 is only used on 64-bits.
>>> It has implications in a few parts of the system:
>>> - failure code for WideString (VeryWideString?) at:[put:] primitives that would have to manage overflow into/access from WideCharacter instances
>>> - ImageSegment and other (un)pickling systems that need to convert to/from a bit-specific "wire" protocol/representation
>>> - 32-bit <=> 64-bit image conversion
>>> All this is easily doable (because we have models of doing it for Float and Integer general instances). But we need good specifications so we can implement the right thing from the get-go.
>>>> At the moment, we have 30 class variables each for one Unicode category number. These class vars map in alphabetical order to the integers from 0 to: 29. Is this tedious structure really necessary? For different purposes, I would like to get the category name of a specific code point from a client. The current design makes this impossible without writing additional mappings.
>>>> Tl;dr: I would like to propose to drop these class variables and use Symbols instead. They are comparable like integers, and as they are flyweights, this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with later added category names (though I don't know whether this will ever happen).
>>>> Examples:
>>>> Unicode generalTagOf: $a asUnicode. "#Ll"
>>>> Unicode class >> isLetterCode: charCode
>>>>     ^ (self generalTagOf: charCode) first = $L
>>>> Unicode class >> isAlphaNumericCode: charCode
>>>>     | tag|
>>>>     ^ (tag := self generalCategoryOf: charCode) first = $L
>>>>         or: [tag = #Nd]
>>>> How do you think about this proposal? Please let me know and I will go ahead! :D
>>>> Best,
>>>> Christoph
>>> Best, Eliot
>>> _,,,^..^,,,_ (phone)
["UnicodeData.2.cs"]
Hi Jakob --
> One could run the download&generate step as needed to update the data. (CI, release build, manually)
To integrate it as part of ReleaseBuilder class >> #prepareEnvironment would be my preferred way. Then it would be part of the CI.
Best,
Marcel

On 05.04.2022 14:41:14, Jakob Reschke jakres+squeak@gmail.com wrote:
Would it be possible/practical to separate this?
* Download and transform into code/objects
* Distribute the generated code/objects via the update stream directly
One could run the download&generate step as needed to update the data. (CI, release build, manually)
Or are there any reasons not to do that?
I was asking myself the same thing recently about the package that provides time zone information. It needs a Unix timezone database from the operating system to initialize, rather than providing Smalltalk objects/code directly in Monticello, based on the official online database.
Kind regards,
Jakob

On Tue, Apr 5, 2022 at 08:59, Marcel Taeumel <marcel.taeumel@hpi.de> wrote:
Hi Eliot, hi Christoph --
> Unicode reinitializeData.
I think that this method has an unfortunate name. Since it downloads data from the Internet, it should be called #downloadAndInitializeData.
And that's the reason for it not being in the post-load script. We might put the raw info there, but it would be very surprising if "Update Squeak" fetches data other than from source.squeak.org.
Best,
Marcel

On 04.04.2022 21:17:17, Thiede, Christoph <christoph.thiede@student.hpi.uni-potsdam.de> wrote:
> If this is essential
Well, how do you define essential? You can still use your image with the old font definitions. However, for some newer code points such as 😁😎😍, Unicode generalCategoryLabelOf: and friends will answer "not assigned" without the upgrade. You can see the difference by browsing any comprehensive font in the FontImporter. But I am not aware of any code path that relies on the presence of newer Unicode data.
Apart from that, I was already discussing with Marcel what the consequences of downloading data from a third-party server during an image update would be. There might be some images, most likely server images, that do not have free internet access due to a strict firewall. Hypothetically, this might even introduce security issues. So in the end, we decided on leaving this optional for now. It will only break if a future patch of some package relies on exact Unicode data.
Best,
Christoph

From: Squeak-dev <squeak-dev-bounces@lists.squeakfoundation.org> on behalf of Eliot Miranda <eliot.miranda@gmail.com>
Sent: Monday, April 4, 2022 20:54:59
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode

On Apr 4, 2022, at 11:20 AM, Christoph.Thiede@student.hpi.uni-potsdam.de wrote:
Merged via Multilingual-ct.271, Multilingual-ct.272, MultilingualTests-ct.41, and ReleaseBuilder-ct.231.
Please run the following in your image to install the new Unicode data (and to uncover any regressions I may have missed :D):
Unicode reinitializeData.
If this is essential then it *must* be added as a post load script to one (or more) of the relevant packages. Asking “did you run Unicode reinitializeData?” when someone reports a strange bug isn’t acceptable.
Best, Christoph
--- Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
On 2022-02-28T16:09:06+01:00, christoph.thiede@student.hpi.uni-potsdam.de wrote:
Hi Marcel, thanks for the review! Below is an updated changeset. If you have no further objections, I would like to merge it within the next few days. :-)
=============== Summary ===============
Change Set:        UnicodeData
Date:              24 February 2022
Author:            Christoph Thiede
This changeset repairs the fetching & parsing of Unicode category data. Usage:
    Unicode reinitializeData.
    Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests. Furthermore, the Unicode data are automatically reinitialized as part of the ReleaseBuilder.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
ReleaseBuilder class>>prepareSourceCode {preparing} · ct 2/28/2022 15:54 (changed)
  prepareSourceCode
      "Update packages. Remove foreign packages. Recompile."
      CurrentReadOnlySourceFiles cacheDuring:
          [self
              updateCorePackages;
              unloadForeignPackages;
              checkForDirtyPackages;
              loadWellKnownPackages;
              checkForUndeclaredSymbols;
              checkForNilCategories;
-             recompileAll]
-             recompileAll;
-             updateDatabases]

ReleaseBuilder class>>updateDatabases {scripts - support} · ct 2/28/2022 16:06
- updateDatabases
-     Unicode reinitializeData.

Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
- allCategoryTags
-     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)

Unicode class>>blocks320Comment {comments} · ct 2/28/2022 15:50 (changed)
  blocks320Comment
- "http://www.unicode.org/Public/3.2-Update/Blocks-3.2.0.txt"
"# Blocks-3.2.0.txt # Correlated with Unicode 3.2 # Start Code..End Code; Block Name 0000..007F; Basic Latin 0080..00FF; Latin-1 Supplement 0100..017F; Latin Extended-A 0180..024F; Latin Extended-B 0250..02AF; IPA Extensions 02B0..02FF; Spacing Modifier Letters 0300..036F; Combining Diacritical Marks 0370..03FF; Greek and Coptic 0400..04FF; Cyrillic 0500..052F; Cyrillic Supplementary 0530..058F; Armenian 0590..05FF; Hebrew 0600..06FF; Arabic 0700..074F; Syriac 0780..07BF; Thaana 0900..097F; Devanagari 0980..09FF; Bengali 0A00..0A7F; Gurmukhi 0A80..0AFF; Gujarati 0B00..0B7F; Oriya 0B80..0BFF; Tamil 0C00..0C7F; Telugu 0C80..0CFF; Kannada 0D00..0D7F; Malayalam 0D80..0DFF; Sinhala 0E00..0E7F; Thai 0E80..0EFF; Lao 0F00..0FFF; Tibetan 1000..109F; Myanmar 10A0..10FF; Georgian 1100..11FF; Hangul Jamo 1200..137F; Ethiopic 13A0..13FF; Cherokee 1400..167F; Unified Canadian Aboriginal Syllabics 1680..169F; Ogham 16A0..16FF; Runic 1700..171F; Tagalog 1720..173F; Hanunoo 1740..175F; Buhid 1760..177F; Tagbanwa 1780..17FF; Khmer 1800..18AF; Mongolian 1E00..1EFF; Latin Extended Additional 1F00..1FFF; Greek Extended 2000..206F; General Punctuation 2070..209F; Superscripts and Subscripts 20A0..20CF; Currency Symbols 20D0..20FF; Combining Diacritical Marks for Symbols 2100..214F; Letterlike Symbols 2150..218F; Number Forms 2190..21FF; Arrows 2200..22FF; Mathematical Operators 2300..23FF; Miscellaneous Technical 2400..243F; Control Pictures 2440..245F; Optical Character Recognition 2460..24FF; Enclosed Alphanumerics 2500..257F; Box Drawing 2580..259F; Block Elements 25A0..25FF; Geometric Shapes 2600..26FF; Miscellaneous Symbols 2700..27BF; Dingbats 27C0..27EF; Miscellaneous Mathematical Symbols-A 27F0..27FF; Supplemental Arrows-A 2800..28FF; Braille Patterns 2900..297F; Supplemental Arrows-B 2980..29FF; Miscellaneous Mathematical Symbols-B 2A00..2AFF; Supplemental Mathematical Operators 2E80..2EFF; CJK Radicals Supplement 2F00..2FDF; Kangxi Radicals 2FF0..2FFF; Ideographic Description 
Characters 3000..303F; CJK Symbols and Punctuation 3040..309F; Hiragana 30A0..30FF; Katakana 3100..312F; Bopomofo 3130..318F; Hangul Compatibility Jamo 3190..319F; Kanbun 31A0..31BF; Bopomofo Extended 31F0..31FF; Katakana Phonetic Extensions 3200..32FF; Enclosed CJK Letters and Months 3300..33FF; CJK Compatibility 3400..4DBF; CJK Unified Ideographs Extension A 4E00..9FFF; CJK Unified Ideographs A000..A48F; Yi Syllables A490..A4CF; Yi Radicals AC00..D7AF; Hangul Syllables D800..DB7F; High Surrogates DB80..DBFF; High Private Use Surrogates DC00..DFFF; Low Surrogates E000..F8FF; Private Use Area F900..FAFF; CJK Compatibility Ideographs FB00..FB4F; Alphabetic Presentation Forms FB50..FDFF; Arabic Presentation Forms-A FE00..FE0F; Variation Selectors FE20..FE2F; Combining Half Marks FE30..FE4F; CJK Compatibility Forms FE50..FE6F; Small Form Variants FE70..FEFF; Arabic Presentation Forms-B FF00..FFEF; Halfwidth and Fullwidth Forms FFF0..FFFF; Specials 10300..1032F; Old Italic 10330..1034F; Gothic 10400..1044F; Deseret 1D000..1D0FF; Byzantine Musical Symbols 1D100..1D1FF; Musical Symbols 1D400..1D7FF; Mathematical Alphanumeric Symbols 20000..2A6DF; CJK Unified Ideographs Extension B 2F800..2FA1F; CJK Compatibility Ideographs Supplement E0000..E007F; Tags F0000..FFFFF; Supplementary Private Use Area-A 100000..10FFFF; Supplementary Private Use Area-B
"

Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
- generalCategoryIndexFromTag: tag
-     ^ (self allCategoryTags indexOf: tag) - 1

Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
- generalCategoryLabelForTag: tag
-     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1

Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
- generalCategoryTagOf: aCharacterCode
-     ^ (self generalCategoryOf: aCharacterCode)
-         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
-         ifNil: [#Cn]

Unicode class>>initialize {class initialization} · ct 2/28/2022 15:52 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
-
-     self flag: #deduplicate. "Currently, we are downloading and parsing #unicodeData twice."
-     Compositions isEmptyOrNil ifTrue: [self initializeCompositionMappings].
-     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].

Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
- initializeUnicodeData
-     "self initializeUnicodeData"
-     self parseUnicodeDataFrom: self unicodeData readStream.

Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
-     "self initializeUnicodeData."
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
      toNumber := [:quad | ('16r', quad) asNumber].
      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
-         tag := line copyFrom: fieldStart to: fieldEnd - 1.
-         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
-         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.
Unicode class>>reinitializeData {class initialization} · ct 2/28/2022 16:05
- reinitializeData
-     Compositions := GeneralCategory := nil.
-     self initialize.
UnicodeTest
- TestCase subclass: #UnicodeTest
-     instanceVariableNames: ''
-     classVariableNames: ''
-     poolDictionaries: ''
-     category: 'MultilingualTests-Encodings'
- UnicodeTest class
-     instanceVariableNames: ''
- ""
UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
- resources
-     ^ super resources copyWith: UnicodeTestResource
UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
- testGeneralCategoryLabel
-     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
-     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
-
-     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
-     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
-
-     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).
UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
- testGeneralCategoryLabelForTag
-     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).
UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
- testGeneralCategoryTag
-     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
-     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
-
-     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
-     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
-
-     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).
UnicodeTestResource
- TestResource subclass: #UnicodeTestResource
-     instanceVariableNames: ''
-     classVariableNames: ''
-     poolDictionaries: ''
-     category: 'MultilingualTests-Encodings'
- UnicodeTestResource class
-     instanceVariableNames: ''
- ""
UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
- setUp
-     super setUp.
-
-     "Test the functionality of this update logic"
-     Unicode initializeCompositionMappings.
-     Unicode initializeUnicodeData.
Sent from Squeak Inbox Talk
On 2022-02-25T11:36:53+01:00, marcel.taeumel at hpi.de wrote:
Hi Christoph --
Thanks for doing this!
> Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
I think it is okay. We could make it explicit in the ReleaseBuilder or in some external CI script but having it part of the update stream is okay. The updates are fetched from the outside anyway, right? ;-)
Best, Marcel
On 24.02.2022 22:16:42, christoph.thiede at student.hpi.uni-potsdam.de wrote:
Hi all, Hi Marcel, Hi Levente,
finally, here is a changeset that takes the first step for updating our in-image Unicode database. After filing it in, please run:
    Unicode initializeUnicodeData.
In addition to the preamble (please read first below), I still have a number of questions:
- Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
- How much effort shall we put in deduplicating the logic and the data in this class? This includes both the two similar parsing methods and the redundant specification of the Unicode character tags.
Best, Christoph
=============== Summary ===============
Change Set:        UnicodeData
Date:              24 February 2022
Author:            Christoph Thiede
This changeset repairs the fetching & parsing of Unicode category data. Usage:
    Unicode initializeUnicodeData.
    Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests.
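For readers who want to try the tag interface described above, here is a small workspace sketch. It assumes the changeset has been filed in and the Unicode data has been initialized. The answers noted in the comments are taken from the tests and the tag list in this changeset, not from a fresh run.

```smalltalk
"Code point -> tag -> label, using the selectors added in this changeset."
| tag |
tag := Unicode generalCategoryTagOf: $a asUnicode.    "UnicodeTest expects #Ll"
Unicode generalCategoryLabelForTag: tag.              "UnicodeTest expects 'Letter, Lowercase'"

"Tag -> zero-based index, i.e. the position in #allCategoryTags minus 1."
Unicode generalCategoryIndexFromTag: #Cn.             "0, since Cn is the first tag"
Unicode generalCategoryIndexFromTag: #Nd.             "13, since Nd is the 14th tag"
```

The zero-based index is what #generalCategoryOf: keeps answering, so existing integer comparisons stay valid while clients get symbols on request.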
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
- allCategoryTags
-     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)
Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
- generalCategoryIndexFromTag: tag
-     ^ (self allCategoryTags indexOf: tag) - 1
Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
- generalCategoryLabelForTag: tag
-     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1
Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
- generalCategoryTagOf: aCharacterCode
-     ^ (self generalCategoryOf: aCharacterCode)
-         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
-         ifNil: [#Cn]
Unicode class>>initialize {class initialization} · ct 2/24/2022 21:43 (changed)
initialize
    " Unicode initialize "
    self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
-     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].
Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
- initializeUnicodeData
-     "self initializeUnicodeData"
-     self parseUnicodeDataFrom: self unicodeData readStream.
Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
-     "self initializeUnicodeData."
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
    toNumber := [:quad | ('16r', quad) asNumber].
    GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
    DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
    16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
    16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
    16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
    [(line := stream nextLine) size > 0] whileTrue: [
        fieldEnd := line indexOf: $; startingAt: 1.
        point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
        point > 16rE007F ifTrue: [
            GeneralCategory zapDefaultOnlyEntries.
            DecimalProperty zapDefaultOnlyEntries.
            ^ self].
        2 to: 3 do: [:i |
            fieldStart := fieldEnd + 1.
            fieldEnd := line indexOf: $; startingAt: fieldStart.
        ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
-         tag := line copyFrom: fieldStart to: fieldEnd - 1.
-         generalCategory := self generalCategoryIndexFromTag: tag.
        GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
-         generalCategory = Nd ifTrue: [
            4 to: 7 do: [:i |
                fieldStart := fieldEnd + 1.
                fieldEnd := line indexOf: $; startingAt: fieldStart.
            ].
            decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
            DecimalProperty at: point+1 put: decimalProperty asNumber.
        ].
    ].
    GeneralCategory zapDefaultOnlyEntries.
    DecimalProperty zapDefaultOnlyEntries.
UnicodeTest
- TestCase subclass: #UnicodeTest
-     instanceVariableNames: ''
-     classVariableNames: ''
-     poolDictionaries: ''
-     category: 'MultilingualTests-Encodings'
- UnicodeTest class
-     instanceVariableNames: ''
- ""
UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
- resources
-     ^ super resources copyWith: UnicodeTestResource
UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
- testGeneralCategoryLabel
-     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
-     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
-
-     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
-     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
-
-     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).
UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
- testGeneralCategoryLabelForTag
-     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).
UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
- testGeneralCategoryTag
-     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
-     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
-
-     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
-     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
-
-     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).
UnicodeTestResource
- TestResource subclass: #UnicodeTestResource
-     instanceVariableNames: ''
-     classVariableNames: ''
-     poolDictionaries: ''
-     category: 'MultilingualTests-Encodings'
- UnicodeTestResource class
-     instanceVariableNames: ''
- ""
UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
- setUp
-     super setUp.
-
-     "Test the functionality of this update logic"
-     Unicode initializeCompositionMappings.
-     Unicode initializeUnicodeData.
Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
On 2020-09-11T00:49:24+02:00, leves at caesar.elte.hu wrote:
Hi Christoph,
On Wed, 9 Sep 2020, Thiede, Christoph wrote:
Hi Levente,
basically, I would only like to get rid of the class variables for every single Unicode category because they provide low explorability (it's hard to work with the numeric output of #generalCategoryOf:) and low extensibility (you need to recompile the class definition for adding UTF-16 support). If you are critical of increasing the size of the SparseLargeTable, I think we could also just make one or two extra dictionaries to map every category symbol to a number and vice versa. What do you think?
You mean an array to map the integers to symbols, right? :) Anyway, I don't think it's worth using symbols internally. For example, #isLetterCode: is 8-10% slower with the extra array lookup and checking the category symbol's first letter than the current method of integer comparisons.
Do you expect these constants to appear outside the Unicode class? If yes, then using symbols for those cases is probably a good solution. But for internal use, the integers are better.
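The cost difference mentioned above can be checked in a workspace. The following is only a sketch, assuming an image in which #generalCategoryTagOf: from the later changeset is available; the numbers will vary by VM and machine, so no expected output is given.

```smalltalk
"Compare the integer-based check with a symbol-based one for the same code point."
| code |
code := $a asUnicode.

"Current implementation: integer comparisons on the stored category index."
[Unicode isLetterCode: code] bench.

"Symbol-based variant: extra array lookup plus a check of the tag's first letter."
[(Unicode generalCategoryTagOf: code) first = $L] bench.
```

Both blocks answer the same Boolean for a letter; only the throughput reported by #bench differs.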
Levente
Best,
Christoph
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Levente Uzonyi <leves at caesar.elte.hu>
Sent: Tuesday, 8 September 2020 21:43:56
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
Hi Christoph,
On Tue, 8 Sep 2020, Thiede, Christoph wrote:
Hi all,
> Your words suggest that it has already been published, but I can't find it anywhere.
Then I must have expressed myself wrong. I did not yet publish any code changes, but in my original post from March, you can find a short description of the design changes I'd like to implement. Essentially, I would like to replace the separate class variables for every known character class in favor of greater flexibility.
How would your changes affect GeneralCategory? Would it still be a SparseLargeTable with ByteArray as arrayClass? If you just replace those integers with symbols, the size of the table will be at least 4 or 8 times larger in 32 or 64 bit images, respectively.
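The factor of 4 or 8 follows from the slot width: a ByteArray stores one byte per entry, while an Array stores one object pointer per entry. A rough upper bound, assuming the table size 16rE0080 used in #parseUnicodeDataFrom: and ignoring both object headers and the savings from the table being sparse:

```smalltalk
"Dense upper bounds for the category table, in bytes."
| entries |
entries := 16rE0080.    "917632 entries"
entries * 1.            "917632  - ByteArray, one byte per entry"
entries * 4.            "3670528 - Array in a 32-bit image, 4-byte pointers"
entries * 8.            "7341056 - Array in a 64-bit image, 8-byte pointers"
```

The real footprint is smaller because chunks that hold only the default value are dropped, but the ratio between the variants stays the same.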
Levente.
Eliot, are there any remaining questions regarding the VM size? Character size should be sufficient as discussed below, and of course, I can test any changes in a 32-bit image, too. :-)
WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect: #asCharacter)
Best, Christoph
_
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Tobias Pape <Das.Linux at gmx.de>
Sent: Sunday, 6 September 2020 21:00:14
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
On 06.09.2020, at 20:40, Levente Uzonyi <leves at caesar.elte.hu> wrote:
On Sun, 6 Sep 2020, Tobias Pape wrote:
>
>> On 06.09.2020, at 19:15, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>> Hi Christoph, Hi All,
>>> On Mar 17, 2020, at 3:51 PM, Thiede, Christoph <Christoph.Thiede at student.hpi.uni-potsdam.de> wrote:
>>> Hi all! :-)
>>> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as 😺❤🤓 are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
>> And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values). In the 32-bit variant Characters are 30-bit unsigned integers. In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
>> Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant? It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
>> Q2, how many bits should the 64-bit variant VM support for immediate Characters?
>
> Unicode has a max value of 0x10FFFF. That makes 21 bit. So no worries there.
>
> We should just not forget the leading-char stuff (Yoshiki, Andreas,...)
AFAIU the leading char only makes sense when you have multiple CJK(V?) languages in use at the same time. In other cases Unicode (leadingChar = 0) is perfectly fine. IIRC there are 22 bits available for the codePoint and 8 for the leadingChar, so we're still good: all unicode characters fit.
\o/ hooray!
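The bit counts in this exchange can be verified with plain Integer arithmetic in a workspace. This is just a sanity check of the figures quoted above; the 22/8 split is taken from Levente's recollection and is not verified against the VM here.

```smalltalk
"Highest Unicode code point and the number of bits needed to store it."
16r10FFFF highBit.                "21"

"A 22-bit code point field is therefore large enough."
16r10FFFF < (1 bitShift: 22).     "true"

"22 bits for the code point plus 8 for the leadingChar."
22 + 8.                           "30, the size of an immediate Character value"
```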
Levente
>
> Best regards
>     -Tobias
>
>> Then something to consider is that it is conceptually possible to support something like WideCharacter, which would represent code points outside of the immediate Character range on the 32-bit variant, analogous to LargePositiveInteger beyond SmallInteger maxVal. This can be made to work seamlessly, just as it does currently with integers, and with Floats where SmallFloat64 is only used on 64-bits.
>> It has implications in a few parts of the system:
>> - failure code for WideString (VeryWideString?) at:[put:] primitives that would have to manage overflow into/access from WideCharacter instances
>> - ImageSegment and other (un)pickling systems that need to convert to/from a bit-specific "wire" protocol/representation
>> - 32-bit <=> 64-bit image conversion
>> All this is easily doable (because we have models of doing it for Float and Integer general instances). But we need good specifications so we can implement the right thing from the get-go.
>>> At the moment, we have 30 class variables each for one Unicode category number. These class vars map in alphabetical order to the integers from 0 to: 29. Is this tedious structure really necessary? For different purposes, I would like to get the category name of a specific code point from a client. The current design makes this impossible without writing additional mappings.
>>> Tl;dr: I would like to propose to drop these class variables and use Symbols instead. They are comparable like integers, and as they are flyweights, this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with later added category names (though I don't know whether this will ever happen).
>>> Examples:
>>> Unicode generalTagOf: $a asUnicode. "#Ll"
>>> Unicode class >> isLetterCode: charCode
>>>     ^ (self generalTagOf: charCode) first = $L
>>> Unicode class >> isAlphaNumericCode: charCode
>>>     | tag|
>>>     ^ (tag := self generalCategoryOf: charCode) first = $L
>>>         or: [tag = #Nd]
>>> How do you think about this proposal? Please let me know and I will go ahead! :D
>>> Best,
>>> Christoph
>> Best, Eliot
>> _,,,^..^,,,_ (phone)
["UnicodeData.2.cs"]
Hi Jakob, Hi Marcel,
the advantage of your proposed solution is that we would have more control over the process.
The disadvantage is that it would increase the package size and tangle logic + data together. At least, we're talking about ~300 kB for the Unicode data if I used SpaceTally correctly. :-)
Personally, I would prefer to stay with the existing practice because updating your Unicode data locally really seems to be optional at the moment.
Best,
Christoph
________________________________
From: Squeak-dev squeak-dev-bounces@lists.squeakfoundation.org on behalf of Taeumel, Marcel
Sent: Friday, 8 April 2022 09:56:36
To: squeak-dev
Subject: Re: [squeak-dev] Unicode
Hi Jakob --
> One could run the download&generate step as needed to update the data. (CI, release build, manually)
To integrate it as part of ReleaseBuilder class >> #prepareEnvironment would be my preferred way. Then it would be part of the CI.
Best, Marcel
On 05.04.2022 14:41:14, Jakob Reschke jakres+squeak@gmail.com wrote:
Would it be possible/practical to separate this?
* Download and transform into code/objects
* Distribute the generated code/objects via the update stream directly
One could run the download&generate step as needed to update the data. (CI, release build, manually) Or are there any reasons not to do that?
I was asking myself the same thing recently about the package that provides time zone information. It needs a Unix timezone database from the operating system to initialize, rather than providing Smalltalk objects/code directly in Monticello, based on the official online database.
Kind regards, Jakob
On Tue, 5 Apr 2022 at 08:59, Marcel Taeumel <marcel.taeumel@hpi.de> wrote:
Hi Eliot, hi Christoph --
Unicode reinitializeData.
I think that this method has an unfortunate name. Since it downloads data from the Internet, it should be called #downloadAndInitializeData.
And that's the reason for it not being in the post-load script. We might put the raw info there, but it would be very surprising if "Update Squeak" fetches data other than from source.squeak.org.
Best, Marcel
On 04.04.2022 21:17:17, Thiede, Christoph <christoph.thiede@student.hpi.uni-potsdam.de> wrote:
> If this is essential
Well, how do you define essential? You can still use your image with the old font definitions. However, for some newer code points such as 😁😎😍, Unicode generalCategoryLabelOf: and friends will answer "not assigned" without the upgrade. You can watch the difference by browsing any comprehensive font in the FontImporter. But I am not aware of any code path that relies on the presence of newer Unicode data.
Apart from that, I was already discussing with Marcel what the consequences of downloading data from a third-party server during an image update would be. There might be some images, most likely server images, that do not have free internet access due to a strict firewall. Hypothetically, this might even introduce security issues. So in the end, we decided on leaving this optional for now. It will only break if a future patch of some package relies on exact Unicode data.
Best,
Christoph
________________________________
From: Squeak-dev <squeak-dev-bounces@lists.squeakfoundation.org> on behalf of Eliot Miranda <eliot.miranda@gmail.com>
Sent: Monday, 4 April 2022 20:54:59
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
On Apr 4, 2022, at 11:20 AM, Christoph.Thiede@student.hpi.uni-potsdam.de wrote:
Merged via Multilingual-ct.271, Multilingual-ct.272, MultilingualTests-ct.41, and ReleaseBuilder-ct.231.
Please run the following in your image to install the new Unicode data (and to uncover any regressions I may have missed :D):
Unicode reinitializeData.
If this is essential then it *must* be added as a post load script to one (or more) of the relevant packages. Asking “did you run Unicode reinitializeData?” when someone reports a strange bug isn’t acceptable.
Best, Christoph
--- Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
On 2022-02-28T16:09:06+01:00, christoph.thiede@student.hpi.uni-potsdam.de wrote:
Hi Marcel, thanks for the review! Below is an updated changeset. If you have no further objections, I would like to merge it within the next few days. :-)
=============== Summary ===============
Change Set:        UnicodeData
Date:              24 February 2022
Author:            Christoph Thiede
This changeset repairs the fetching & parsing of Unicode category data. Usage:
    Unicode reinitializeData.
    Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests. Furthermore, the Unicode data are automatically reinitialized as part of the ReleaseBuilder.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
ReleaseBuilder class>>prepareSourceCode {preparing} · ct 2/28/2022 15:54 (changed)
prepareSourceCode
    "Update packages. Remove foreign packages. Recompile."
    CurrentReadOnlySourceFiles cacheDuring:
        [self
            updateCorePackages;
            unloadForeignPackages;
            checkForDirtyPackages;
            loadWellKnownPackages;
            checkForUndeclaredSymbols;
            checkForNilCategories;
-             recompileAll]
-             recompileAll;
-             updateDatabases]
ReleaseBuilder class>>updateDatabases {scripts - support} · ct 2/28/2022 16:06
- updateDatabases
-     Unicode reinitializeData.
Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
- allCategoryTags
-     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)
Unicode class>>blocks320Comment {comments} · ct 2/28/2022 15:50 (changed)
blocks320Comment
"# Blocks-3.2.0.txt # Correlated with Unicode 3.2 # Start Code..End Code; Block Name 0000..007F; Basic Latin 0080..00FF; Latin-1 Supplement 0100..017F; Latin Extended-A 0180..024F; Latin Extended-B 0250..02AF; IPA Extensions 02B0..02FF; Spacing Modifier Letters 0300..036F; Combining Diacritical Marks 0370..03FF; Greek and Coptic 0400..04FF; Cyrillic 0500..052F; Cyrillic Supplementary 0530..058F; Armenian 0590..05FF; Hebrew 0600..06FF; Arabic 0700..074F; Syriac 0780..07BF; Thaana 0900..097F; Devanagari 0980..09FF; Bengali 0A00..0A7F; Gurmukhi 0A80..0AFF; Gujarati 0B00..0B7F; Oriya 0B80..0BFF; Tamil 0C00..0C7F; Telugu 0C80..0CFF; Kannada 0D00..0D7F; Malayalam 0D80..0DFF; Sinhala 0E00..0E7F; Thai 0E80..0EFF; Lao 0F00..0FFF; Tibetan 1000..109F; Myanmar 10A0..10FF; Georgian 1100..11FF; Hangul Jamo 1200..137F; Ethiopic 13A0..13FF; Cherokee 1400..167F; Unified Canadian Aboriginal Syllabics 1680..169F; Ogham 16A0..16FF; Runic 1700..171F; Tagalog 1720..173F; Hanunoo 1740..175F; Buhid 1760..177F; Tagbanwa 1780..17FF; Khmer 1800..18AF; Mongolian 1E00..1EFF; Latin Extended Additional 1F00..1FFF; Greek Extended 2000..206F; General Punctuation 2070..209F; Superscripts and Subscripts 20A0..20CF; Currency Symbols 20D0..20FF; Combining Diacritical Marks for Symbols 2100..214F; Letterlike Symbols 2150..218F; Number Forms 2190..21FF; Arrows 2200..22FF; Mathematical Operators 2300..23FF; Miscellaneous Technical 2400..243F; Control Pictures 2440..245F; Optical Character Recognition 2460..24FF; Enclosed Alphanumerics 2500..257F; Box Drawing 2580..259F; Block Elements 25A0..25FF; Geometric Shapes 2600..26FF; Miscellaneous Symbols 2700..27BF; Dingbats 27C0..27EF; Miscellaneous Mathematical Symbols-A 27F0..27FF; Supplemental Arrows-A 2800..28FF; Braille Patterns 2900..297F; Supplemental Arrows-B 2980..29FF; Miscellaneous Mathematical Symbols-B 2A00..2AFF; Supplemental Mathematical Operators 2E80..2EFF; CJK Radicals Supplement 2F00..2FDF; Kangxi Radicals 2FF0..2FFF; Ideographic Description 
Characters 3000..303F; CJK Symbols and Punctuation 3040..309F; Hiragana 30A0..30FF; Katakana 3100..312F; Bopomofo 3130..318F; Hangul Compatibility Jamo 3190..319F; Kanbun 31A0..31BF; Bopomofo Extended 31F0..31FF; Katakana Phonetic Extensions 3200..32FF; Enclosed CJK Letters and Months 3300..33FF; CJK Compatibility 3400..4DBF; CJK Unified Ideographs Extension A 4E00..9FFF; CJK Unified Ideographs A000..A48F; Yi Syllables A490..A4CF; Yi Radicals AC00..D7AF; Hangul Syllables D800..DB7F; High Surrogates DB80..DBFF; High Private Use Surrogates DC00..DFFF; Low Surrogates E000..F8FF; Private Use Area F900..FAFF; CJK Compatibility Ideographs FB00..FB4F; Alphabetic Presentation Forms FB50..FDFF; Arabic Presentation Forms-A FE00..FE0F; Variation Selectors FE20..FE2F; Combining Half Marks FE30..FE4F; CJK Compatibility Forms FE50..FE6F; Small Form Variants FE70..FEFF; Arabic Presentation Forms-B FF00..FFEF; Halfwidth and Fullwidth Forms FFF0..FFFF; Specials 10300..1032F; Old Italic 10330..1034F; Gothic 10400..1044F; Deseret 1D000..1D0FF; Byzantine Musical Symbols 1D100..1D1FF; Musical Symbols 1D400..1D7FF; Mathematical Alphanumeric Symbols 20000..2A6DF; CJK Unified Ideographs Extension B 2F800..2FA1F; CJK Compatibility Ideographs Supplement E0000..E007F; Tags F0000..FFFFF; Supplementary Private Use Area-A 100000..10FFFF; Supplementary Private Use Area-B
"
Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
- generalCategoryIndexFromTag: tag
-     ^ (self allCategoryTags indexOf: tag) - 1
Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
- generalCategoryLabelForTag: tag
-     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1
Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
- generalCategoryTagOf: aCharacterCode
-     ^ (self generalCategoryOf: aCharacterCode)
-         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
-         ifNil: [#Cn]
Unicode class>>initialize {class initialization} · ct 2/28/2022 15:52 (changed)
initialize
    " Unicode initialize "
    self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
-
-     self flag: #deduplicate. "Currently, we are downloading and parsing #unicodeData twice."
-     Compositions isEmptyOrNil ifTrue: [self initializeCompositionMappings].
-     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].
Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
- initializeUnicodeData
-     "self initializeUnicodeData"
-     self parseUnicodeDataFrom: self unicodeData readStream.
Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
-     "self initializeUnicodeData."
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
    toNumber := [:quad | ('16r', quad) asNumber].
    GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
    DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
    16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
    16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
    16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
    [(line := stream nextLine) size > 0] whileTrue: [
        fieldEnd := line indexOf: $; startingAt: 1.
        point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
        point > 16rE007F ifTrue: [
            GeneralCategory zapDefaultOnlyEntries.
            DecimalProperty zapDefaultOnlyEntries.
            ^ self].
        2 to: 3 do: [:i |
            fieldStart := fieldEnd + 1.
            fieldEnd := line indexOf: $; startingAt: fieldStart.
        ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
-         tag := line copyFrom: fieldStart to: fieldEnd - 1.
-         generalCategory := self generalCategoryIndexFromTag: tag.
        GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
-         generalCategory = Nd ifTrue: [
            4 to: 7 do: [:i |
                fieldStart := fieldEnd + 1.
                fieldEnd := line indexOf: $; startingAt: fieldStart.
            ].
            decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
            DecimalProperty at: point+1 put: decimalProperty asNumber.
        ].
    ].
    GeneralCategory zapDefaultOnlyEntries.
    DecimalProperty zapDefaultOnlyEntries.
Unicode class>>reinitializeData {class initialization} · ct 2/28/2022 16:05
+ reinitializeData
+     Compositions := GeneralCategory := nil.
+     self initialize.
UnicodeTest
+ TestCase subclass: #UnicodeTest
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ UnicodeTest class
+     instanceVariableNames: ''
+ ""
UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
+ resources
+     ^ super resources copyWith: UnicodeTestResource
UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryLabel
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
+     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
+
+     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
+     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
+
+     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).
UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
+ testGeneralCategoryLabelForTag
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).
UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryTag
+     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
+     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
+
+     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
+     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
+
+     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).
UnicodeTestResource
+ TestResource subclass: #UnicodeTestResource
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ UnicodeTestResource class
+     instanceVariableNames: ''
+ ""
UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
+ setUp
+     super setUp.
+
+     "Test the functionality of this update logic"
+     Unicode initializeCompositionMappings.
+     Unicode initializeUnicodeData.
Sent from Squeak Inbox Talk
On 2022-02-25T11:36:53+01:00, marcel.taeumel at hpi.de wrote:
Hi Christoph --
Thanks for doing this!
> Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
I think it is okay. We could make it explicit in the ReleaseBuilder or in some external CI script but having it part of the update stream is okay. The updates are fetched from the outside anyway, right? ;-)
Best, Marcel

On 24.02.2022 22:16:42, christoph.thiede at student.hpi.uni-potsdam.de wrote:
Hi all, Hi Marcel, Hi Levente,
finally, here is a changeset that takes the first step towards updating our in-image Unicode database. After filing it in, please run:
    Unicode initializeUnicodeData.
In addition to the preamble (please read it first, below), I still have a number of questions:
- Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
- How much effort shall we put in deduplicating the logic and the data in this class? This includes both the two similar parsing methods and the redundant specification of the Unicode character tags.
Best, Christoph
=============== Summary ===============
Change Set:    UnicodeData
Date:          24 February 2022
Author:        Christoph Thiede
This changeset repairs the fetching & parsing of Unicode category data. Usage:
    Unicode initializeUnicodeData.
    Unicode generalCategoryLabelOf: 16r1F388. "'Symbol, Other'"
This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates the #initializeUnicodeData into the class initializer, and adds some tests.
Still present limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.htm...
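To make the parsing step in #parseUnicodeDataFrom: easier to follow outside the image, here is a minimal sketch in Python rather than Squeak. It is not the changeset's code: the function name parse_line is hypothetical, and it assumes the usual semicolon-separated UnicodeData.txt layout in which the third field is the general category tag and the seventh field is the decimal digit value.

```python
from typing import Optional, Tuple


def parse_line(line: str) -> Tuple[int, str, Optional[int]]:
    """Return (code point, general category tag, decimal value or None)."""
    fields = line.split(";")
    code_point = int(fields[0], 16)  # field 1: code point in hexadecimal
    tag = fields[2]                  # field 3: general category, e.g. "Nd"
    # Field 7 holds the decimal digit value; like the changeset, read it only for Nd.
    decimal = int(fields[6]) if tag == "Nd" and fields[6] else None
    return code_point, tag, decimal


print(parse_line("0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;"))
print(parse_line("0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;;0041;;0041"))
```

The two sample lines are the UnicodeData.txt entries for U+0035 and U+0061; the second one has no decimal value, so the sketch answers None for it.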
=============== Postscript ===============
"Postscript: Leave the line above, and replace the rest of this comment by a useful one. Executable statements should follow this comment, and should be separated by periods, with no exclamation points (!). Be sure to put any further comments in double-quotes, like this one."
=============== Diff ===============
Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
+ allCategoryTags
+     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)
Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
+ generalCategoryIndexFromTag: tag
+     ^ (self allCategoryTags indexOf: tag) - 1
Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
+ generalCategoryLabelForTag: tag
+     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1
Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
+ generalCategoryTagOf: aCharacterCode
+     ^ (self generalCategoryOf: aCharacterCode)
+         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
+         ifNil: [#Cn]
Unicode class>>initialize {class initialization} · ct 2/24/2022 21:43 (changed)
  initialize
      " Unicode initialize "
      self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].
Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
+ initializeUnicodeData
+     "self initializeUnicodeData"
+     self parseUnicodeDataFrom: self unicodeData readStream.
Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
  parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
+     "self initializeUnicodeData."
-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
+     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
      toNumber := [:quad | ('16r', quad) asNumber].
      GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
      DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
      16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
      [(line := stream nextLine) size > 0] whileTrue: [
          fieldEnd := line indexOf: $; startingAt: 1.
          point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
          point > 16rE007F ifTrue: [
              GeneralCategory zapDefaultOnlyEntries.
              DecimalProperty zapDefaultOnlyEntries.
              ^ self].
          2 to: 3 do: [:i |
              fieldStart := fieldEnd + 1.
              fieldEnd := line indexOf: $; startingAt: fieldStart.
          ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
+         tag := line copyFrom: fieldStart to: fieldEnd - 1.
+         generalCategory := self generalCategoryIndexFromTag: tag.
          GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
+         generalCategory = Nd ifTrue: [
              4 to: 7 do: [:i |
                  fieldStart := fieldEnd + 1.
                  fieldEnd := line indexOf: $; startingAt: fieldStart.
              ].
              decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
              DecimalProperty at: point+1 put: decimalProperty asNumber.
          ].
      ].
      GeneralCategory zapDefaultOnlyEntries.
      DecimalProperty zapDefaultOnlyEntries.
UnicodeTest
+ TestCase subclass: #UnicodeTest
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ UnicodeTest class
+     instanceVariableNames: ''
+ ""
UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
+ resources
+     ^ super resources copyWith: UnicodeTestResource
UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryLabel
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
+     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
+
+     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
+     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
+
+     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).
UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
+ testGeneralCategoryLabelForTag
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).
UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryTag
+     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
+     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
+
+     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
+     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
+
+     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).
UnicodeTestResource
+ TestResource subclass: #UnicodeTestResource
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ UnicodeTestResource class
+     instanceVariableNames: ''
+ ""
UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
+ setUp
+     super setUp.
+
+     "Test the functionality of this update logic"
+     Unicode initializeCompositionMappings.
+     Unicode initializeUnicodeData.
Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
On 2020-09-11T00:49:24+02:00, leves at caesar.elte.hu wrote:
Hi Christoph,
On Wed, 9 Sep 2020, Thiede, Christoph wrote:
Hi Levente,
basically, I only would like to get rid of the class variables for every single Unicode category because they provide low explorability (it's hard to work with the numeric output of #generalCategoryOf:) and extensibility (you need to recompile the class definition for adding UTF-16 support). If you are critical of increasing the size of the SparseLargeTable, I think we could also just make one or two extra dictionaries to map every category symbol to a number and vice versa. What do you think?
You mean an array to map the integers to symbols, right? :) Anyway, I don't think it's worth using symbols internally. For example, #isLetterCode: is 8-10% slower with the extra array lookup and checking the category symbol's first letter than the current method of integer comparisons.
Do you expect these constants to appear outside the Unicode class? If yes, then using symbols for those cases is probably a good solution. But for internal use, the integers are better.
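The split suggested here (integers internally, symbols at the boundary) can be sketched as follows. This is an illustration in Python, not Squeak code: the tag order is copied from #allCategoryTags in the changeset quoted in this thread, all function names are hypothetical, and treating "is a letter" as an integer range check is an assumption about how the internal comparison could look.

```python
# Tag order copied from #allCategoryTags; position - 1 is the category index.
ALL_CATEGORY_TAGS = (
    "Cn", "Cc", "Cf", "Co", "Cs", "Ll", "Lm", "Lo", "Lt", "Lu",
    "Mc", "Me", "Mn", "Nd", "Nl", "No", "Pc", "Pd", "Pe", "Pf",
    "Pi", "Po", "Ps", "Sc", "Sk", "Sm", "So", "Zl", "Zp", "Zs",
)


def index_from_tag(tag):
    """Zero-based category index, like #generalCategoryIndexFromTag:."""
    return ALL_CATEGORY_TAGS.index(tag)


def tag_from_index(index):
    """Tag for an index; anything unknown falls back to 'Cn' (not assigned)."""
    if isinstance(index, int) and 0 <= index < len(ALL_CATEGORY_TAGS):
        return ALL_CATEGORY_TAGS[index]
    return "Cn"


def is_letter_by_index(index):
    """Internal style: one integer range comparison, no table lookup."""
    return index_from_tag("Ll") <= index <= index_from_tag("Lu")


def is_letter_by_tag(index):
    """External style: look up the tag first, then inspect its first letter."""
    return tag_from_index(index).startswith("L")


print(index_from_tag("Nd"), tag_from_index(5), tag_from_index(None))
```

Both predicates agree for every index because the five letter tags are adjacent in the array; the tag-based variant pays for the extra lookup, which is the overhead measured above.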
Levente
Best,
Christoph
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Levente Uzonyi <leves at caesar.elte.hu>
Sent: Tuesday, 8 September 2020 21:43:56
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
Hi Christoph,
On Tue, 8 Sep 2020, Thiede, Christoph wrote:
Hi all,
> Your words suggest that it has already been published, but I can't find it anywhere.
Then I must have expressed myself wrong. I did not yet publish any code changes, but in my original post from March, you can find a short description of the design changes I'd like to implement. Essentially, I would like to replace the separate class variables for every known character class in favor of greater flexibility.
How would your changes affect GeneralCategory? Would it still be a SparseLargeTable with ByteArray as arrayClass? If you just replace those integers with symbols, the size of the table will be at least 4 or 8 times larger in 32 or 64 bit images, respectively.
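The factor of 4 or 8 follows from the slot width: a ByteArray stores one byte per entry, while an Array of symbols stores one object pointer per entry. A rough back-of-the-envelope sketch in Python, assuming the table size 16rE0080 used in the changeset quoted in this thread and ignoring that a sparse table drops chunks containing only the default value (so real footprints would be smaller, but the ratio stays the same):

```python
TABLE_SIZE = 0xE0080  # number of entries, as passed to SparseLargeTable in the changeset


def dense_size_bytes(entries, bytes_per_entry):
    """Upper bound on storage if every chunk of the table were materialized."""
    return entries * bytes_per_entry


for label, width in (("ByteArray", 1), ("pointers, 32-bit", 4), ("pointers, 64-bit", 8)):
    size = dense_size_bytes(TABLE_SIZE, width)
    print(f"{label:>17}: {size} bytes ({size / 2**20:.2f} MiB)")
```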
Levente.
Eliot, are there any remaining questions regarding the VM size? Character size should be sufficient as discussed below, and of course, I can test any changes in a 32-bit image, too. :-)
WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect: #asCharacter)
Best, Christoph
_
From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Tobias Pape <Das.Linux at gmx.de>
Sent: Sunday, 6 September 2020 21:00:14
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
On 06.09.2020, at 20:40, Levente Uzonyi <leves at caesar.elte.hu> wrote:
On Sun, 6 Sep 2020, Tobias Pape wrote:
>
>> On 06.09.2020, at 19:15, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>> Hi Christoph, Hi All,
>>> On Mar 17, 2020, at 3:51 PM, Thiede, Christoph <Christoph.Thiede at student.hpi.uni-potsdam.de> wrote:
>>> Hi all! :-)
>>> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as 😺❤🤓 are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
>> And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values). In the 32-bit variant Characters are 30-bit unsigned integers. In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
>> Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant? It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
>> Q2, how many bits should the 64-bit variant VM support for immediate Characters?
>
> Unicode has a max value of 0x10FFFF. That makes 21 bit. So no worries there.
>
> We should just not forget the leading-char stuff (Yoshiki, Andreas,...)
AFAIU the leading char only makes sense when you have multiple CJK(V?) languages in use at the same time. In other cases Unicode (leadingChar = 0) is perfectly fine. IIRC there are 22 bits available for the codePoint and 8 for the leadingChar, so we're still good: all unicode characters fit.
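The bit budget discussed here is easy to check. The following Python sketch assumes the layout exactly as recalled in the message above (22 bits for the code point, 8 bits for the leadingChar); it has not been verified against the VM sources, and the names pack and unpack are hypothetical.

```python
MAX_CODE_POINT = 0x10FFFF
CODE_POINT_BITS = 22    # as recalled in the message above (assumption)
LEADING_CHAR_BITS = 8   # as recalled in the message above (assumption)


def pack(leading_char, code_point):
    """Combine leadingChar and code point into one integer value."""
    if not 0 <= leading_char < (1 << LEADING_CHAR_BITS):
        raise ValueError("leadingChar does not fit")
    if not 0 <= code_point < (1 << CODE_POINT_BITS):
        raise ValueError("code point does not fit")
    return (leading_char << CODE_POINT_BITS) | code_point


def unpack(value):
    """Split a packed value back into (leadingChar, code point)."""
    return value >> CODE_POINT_BITS, value & ((1 << CODE_POINT_BITS) - 1)


print(MAX_CODE_POINT.bit_length())             # bits needed for the largest code point
print(pack(255, MAX_CODE_POINT).bit_length())  # bits needed for the largest packed value
```

Under these assumptions the largest packed value needs 30 bits, which matches the 30-bit immediate Characters described for the 32-bit variant.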
\o/ hooray!
Levente
>
> Best regards
>     -Tobias
>
>> Then something to consider is that it is conceptually possible to support something like WideCharacter, which would represent code points outside of the immediate Character range on the 32-bit variant, analogous to LargePositiveInteger beyond SmallInteger maxVal. This can be made to work seamlessly, just as it does currently with integers, and with Floats where SmallFloat64 is only used on 64-bits.
>> It has implications in a few parts of the system:
>> - failure code for WideString (VeryWideString?) at:[put:] primitives that would have to manage overflow into/access from WideCharacter instances
>> - ImageSegment and other (un)pickling systems that need to convert to/from a bit-specific "wire" protocol/representation
>> - 32-bit <=> 64-bit image conversion
>> All this is easily doable (because we have models of doing it for Float and Integer general instances). But we need good specifications so we can implement the right thing from the get-go.
>>> At the moment, we have 30 class variables each for one Unicode category number. These class vars map in alphabetical order to the integers from 0 to: 29. Is this tedious structure really necessary? For different purposes, I would like to get the category name of a specific code point from a client. The current design makes this impossible without writing additional mappings.
>>> Tl;dr: I would like to propose to drop these class variables and use Symbols instead. They are comparable like integers, and as they are flyweights, this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with later added category names (though I don't know whether this will ever happen).
>>> Examples:
>>> Unicode generalTagOf: $a asUnicode. "#Ll"
>>> Unicode class >> isLetterCode: charCode
>>>   ^ (self generalTagOf: charCode) first = $L
>>> Unicode class >> isAlphaNumericCode: charCode
>>>   | tag|
>>>   ^ (tag := self generalCategoryOf: charCode) first = $L
>>>        or: [tag = #Nd]
>>> How do you think about this proposal? Please let me know and I will go ahead! :D
>>> Best,
>>> Christoph
>> Best, Eliot
>> _,,,^..^,,,_ (phone)
["UnicodeData.2.cs"]
Attachment: UnicodeData.3.cs (application/octet-stream, 10387 bytes) URL: http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20220228/2d10d9ef/attachment.obj
Hi Christoph,
Is the tangling of logic and data not one characteristic of objects? 🙃
I tend to agree more with Eliot here. Having dozens of Unicode data configurations out there in the wild, independent of the update state of the image, is not good for maintainability.
Also it is kind of inconsistent: on the one hand Squeak does not want to rely on foreign libraries (via FFI) by default, and have everything in Smalltalk or within the image, on the other hand it loads basic databases from the outside... Batteries not included.
A real argument could be to suppose that too much time will pass between a new release of the Unicode database and some Squeak developer pushing the update data button and have it delivered to the update stream. Hence the CI proposal, or having it in the release process. If that does not suffice at some point in the future, Squeak will probably have vitality issues at that point anyway. Moreover everyone can push that button for their own image anyway, if they can find out how.
If the Monticello package size is the main concern, then Monticello needs a more space-efficient storage backend. In the meantime, have two packages, one with the database (changes infrequently) and one with the code that works with the database (can be changed more frequently without somebody complaining about having to download the database again).
Kind regards, Jakob
Thiede, Christoph <Christoph.Thiede@student.hpi.uni-potsdam.de> wrote on Thu, 5 May 2022, 19:24:
Hi Jakob, Hi Marcel,
the advantage of your proposed solution is that we would have more control over the process.
The disadvantage is that it would increase the package size and tangle logic + data together. At least, we're talking about ~300 kB for the Unicode data if I used SpaceTally correctly. :-)
Personally, I would prefer to stay with the existing practice because updating your Unicode data locally really seems to be optional at the moment.
Best,
Christoph
*From:* Squeak-dev <squeak-dev-bounces@lists.squeakfoundation.org> on behalf of Taeumel, Marcel *Sent:* Friday, 8 April 2022 09:56:36 *To:* squeak-dev *Subject:* Re: [squeak-dev] Unicode
Hi Jakob --
> One could run the download&generate step as needed to update the data. (CI, release build, manually)
To integrate it as part of ReleaseBuilder class >> #prepareEnvironment would be my preferred way. Then it would be part of the CI.
Best, Marcel
On 05.04.2022 14:41:14, Jakob Reschke <jakres+squeak@gmail.com> wrote:
Would it be possible/practical to separate this?
- Download and transform into code/objects
- Distribute the generated code/objects via the update stream directly
One could run the download&generate step as needed to update the data. (CI, release build, manually) Or are there any reasons not to do that?
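One possible shape for the "transform into code/objects" step is sketched below in Python. The proposal above does not specify any format, so everything here is an assumption: the run-length encoding, the names to_ranges and generate_source, and the generated variable CATEGORY_RUNS are all hypothetical. The idea is that the generator runs only when someone refreshes the data, and only its output is distributed.

```python
def to_ranges(categories):
    """Collapse {code point: tag} into sorted (first, last, tag) runs."""
    runs = []
    for code_point, tag in sorted(categories.items()):
        if runs and runs[-1][2] == tag and runs[-1][1] == code_point - 1:
            # Extend the current run: same tag and directly adjacent code point.
            runs[-1] = (runs[-1][0], code_point, tag)
        else:
            runs.append((code_point, code_point, tag))
    return runs


def generate_source(runs):
    """Emit source text that rebuilds the runs without any download."""
    lines = ["CATEGORY_RUNS = ["]
    lines += [f"    (0x{first:04X}, 0x{last:04X}, {tag!r})," for first, last, tag in runs]
    lines.append("]")
    return "\n".join(lines)


sample = {0x30: "Nd", 0x31: "Nd", 0x32: "Nd", 0x41: "Lu", 0x42: "Lu"}
print(generate_source(to_ranges(sample)))
```

Consecutive code points with the same category collapse into one entry, which keeps the generated code far smaller than one entry per code point.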
I was asking myself the same thing recently about the package that provides time zone information. It needs a Unix timezone database from the operating system to initialize, rather than providing Smalltalk objects/code directly in Monticello, based on the official online database.
Kind regards, Jakob
On Tue, 5 Apr 2022 at 08:59, Marcel Taeumel <marcel.taeumel@hpi.de> wrote:
Hi Eliot, hi Christoph --
Unicode reinitializeData.
I think that this method has an unfortunate name. Since it downloads data from the Internet, it should be called #downloadAndInitializeData.
And that's the reason for it not being in the post-load script. We might put the raw info there, but it would be very surprising if "Update Squeak" fetches data other than from source.squeak.org.
Best, Marcel
On 04.04.2022 21:17:17, Thiede, Christoph <christoph.thiede@student.hpi.uni-potsdam.de> wrote:
If this is essential
Well, how do you define essential? You can still use your image with the old font definitions. However, for some newer codepoints such as 😁😎😍, Unicode generalCategoryLabelOf: and friends will answer "not assigned" without the upgrade. You can watch the difference by browsing any comprehensive font in the FontImporter. But I am not aware of any code path that relies on the presence of newer Unicode data.
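For anyone who wants to cross-check what an up-to-date database should answer for these code points, an independent reference outside the image can help. The sketch below uses Python's standard unicodedata module; the helper name category_tag is hypothetical, and the answers depend on the Unicode version bundled with the interpreter (see unicodedata.unidata_version).

```python
import unicodedata


def category_tag(code_point):
    """General category tag of a code point according to Python's bundled database."""
    return unicodedata.category(chr(code_point))


print("Unicode version:", unicodedata.unidata_version)
for code_point in (ord("a"), ord("Z"), ord("5"), 0x1F388, 0x1F601, 0x1F60E, 0x1F60D):
    print(f"U+{code_point:04X} {category_tag(code_point)}")
```

The first four code points are the ones used in UnicodeTest; the last three are the emoji mentioned above.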
Apart from that, I was already discussing with Marcel what the consequences of downloading data from a third-party server during an image update would be. There might be some images, most likely server images, that do not have free internet access due to a strict firewall. Hypothetically, this might even introduce security issues. So in the end, we decided on leaving this optional for now. It will only break if a future patch of any package relies on exact Unicode data.
Best,
Christoph
*From:* Squeak-dev <squeak-dev-bounces@lists.squeakfoundation.org> on behalf of Eliot Miranda <eliot.miranda@gmail.com> *Sent:* Monday, 4 April 2022 20:54:59 *To:* The general-purpose Squeak developers list *Subject:* Re: [squeak-dev] Unicode
On Apr 4, 2022, at 11:20 AM, Christoph.Thiede@student.hpi.uni-potsdam.de wrote:
Merged via Multilingual-ct.271, Multilingual-ct.272, MultilingualTests-ct.41, and ReleaseBuilder-ct.231.
Please run the following in your image to install the new Unicode data (and to uncover any regressions I may have missed :D):
Unicode reinitializeData.
If this is essential then it *must* be added as a post load script to one (or more) of the relevant packages. Asking “did you run Unicode reinitializeData?” when someone reports a strange bug isn’t acceptable.
Best, Christoph
Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
I want to retract the batteries not included argument, because I conflated the timezones situation too much with the Unicode situation. For the Unicode, there is already something in the image, although it is outdated.
On Sun, 8 May 2022 at 12:23, Jakob Reschke <jakres+squeak@gmail.com> wrote:
Also it is kind of inconsistent: on the one hand Squeak does not want to rely on foreign libraries (via FFI) by default, and have everything in Smalltalk or within the image, on the other hand it loads basic databases from the outside... Batteries not included.
Hi Jakob, hi all,
I can see your point ... Given the current efficiency of Monticello (IMO less storage- and more transmission-related efficiency), I would hesitate to store the Unicode data in a package. But we could indeed decide to run the database update via the update stream in certain situations (maybe wrapping it in an exception handler so that server images behind a firewall do not get into trouble).
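The exception-handler idea can be sketched like this. It is written in Python for illustration, not as the Squeak implementation: update_unicode_data and its parameters are hypothetical, and the downloader is passed in so that the sketch runs without any network access.

```python
def update_unicode_data(download, current_data):
    """Try to refresh the data; keep what is already there if the download fails.

    Returns (data, updated) so that callers can log or report the outcome.
    """
    try:
        return download(), True
    except OSError as error:
        # Network failures (refused connections, timeouts, DNS errors) are
        # OSError subclasses in Python, so a firewalled image ends up here.
        print(f"Unicode data not updated: {error}")
        return current_data, False


def blocked_download():
    """Stand-in for a download attempt from behind a strict firewall."""
    raise ConnectionRefusedError("blocked by firewall")


print(update_unicode_data(lambda: "fresh data", "old data"))
print(update_unicode_data(blocked_download, "old data"))
```

The update continues either way; only the Unicode data stays at its previous state when the server cannot be reached.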
If you deem this important enough, we can also pin the version of the downloaded Unicode data (i.e., https://www.unicode.org/Public/14.0.0/ucd/UnicodeData.txt instead of https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt). But then we would need a process for updating the version number ... Maybe I could write a reminder test for this ...
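A small Python sketch of the pinning idea and the reminder check, using the two URLs given above. The function names are hypothetical, and how the latest released version number would be obtained is deliberately left open.

```python
BASE_URL = "https://www.unicode.org/Public"


def unicode_data_url(version=None):
    """URL of UnicodeData.txt, pinned to a version or pointing at the latest release."""
    if version is None:
        return f"{BASE_URL}/UCD/latest/ucd/UnicodeData.txt"
    return f"{BASE_URL}/{version}/ucd/UnicodeData.txt"


def is_pin_outdated(pinned, latest):
    """Reminder check: true if the pinned version is older than the latest known one."""
    def as_tuple(version):
        return tuple(int(part) for part in version.split("."))
    return as_tuple(pinned) < as_tuple(latest)


print(unicode_data_url("14.0.0"))
print(unicode_data_url())
print(is_pin_outdated("14.0.0", "15.0.0"))
```

Comparing versions as integer tuples rather than as strings keeps the check correct once a component reaches two digits.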
I don't know. You (the board) decide which option is right, and I will do my best to implement it. Please also decide whether this should still go into the current beta version. :-)
Best,
Christoph
________________________________
From: Squeak-dev <squeak-dev-bounces@lists.squeakfoundation.org> on behalf of Jakob Reschke <jakres+squeak@gmail.com> Sent: Sunday, 8 May 2022 12:55:36 To: The general-purpose Squeak developers list Subject: Re: [squeak-dev] Unicode
I want to retract the batteries not included argument, because I conflated the timezones situation too much with the Unicode situation. For the Unicode, there is already something in the image, although it is outdated.
Am So., 8. Mai 2022 um 12:23 Uhr schrieb Jakob Reschke <jakres+squeak@gmail.commailto:jakres%2Bsqueak@gmail.com>:
Also it is kind of inconsistent: on the one hand Squeak does not want to rely on foreign libraries (via FFI) by default, and have everything in Smalltalk or within the image, on the other hand it loads basic databases from the outside... Batteries not included.
On Thu, May 5, 2022 at 19:24, Thiede, Christoph <Christoph.Thiede@student.hpi.uni-potsdam.de> wrote:
Hi Jakob, Hi Marcel,
the advantage of your proposed solution is that we would have more control over the process.
The disadvantage is that it would increase the package size and tangle logic + data together. At least, we're talking about ~300 kB for the Unicode data if I used SpaceTally correctly. :-)
Personally, I would prefer to stay with the existing practice because updating your Unicode data locally really seems to be optional at the moment.
Best,
Christoph
________________________________ From: Squeak-dev <squeak-dev-bounces@lists.squeakfoundation.org> on behalf of Taeumel, Marcel Sent: Friday, April 8, 2022 09:56:36 To: squeak-dev Subject: Re: [squeak-dev] Unicode
Hi Jakob --
One could run the download&generate step as needed to update the data. (CI, release build, manually)
To integrate it as part of ReleaseBuilder class >> #prepareEnvironment would be my preferred way. Then it would be part of the CI.
Best, Marcel
On 05.04.2022 14:41:14, Jakob Reschke <jakres+squeak@gmail.com> wrote:
Would it be possible/practical to separate this?
* Download and transform into code/objects
* Distribute the generated code/objects via the update stream directly
One could run the download&generate step as needed to update the data. (CI, release build, manually) Or are there any reasons not to do that?
I was asking myself the same thing recently about the package that provides time zone information. It needs a Unix timezone database from the operating system to initialize, rather than providing Smalltalk objects/code directly in Monticello, based on the official online database.
Kind regards, Jakob
On Tue, Apr 5, 2022 at 08:59, Marcel Taeumel <marcel.taeumel@hpi.de> wrote:
Hi Eliot, hi Christoph --
Unicode reinitializeData.
I think that this method has an unfortunate name. Since it downloads data from the Internet, it should be called #downloadAndInitializeData.
And that's the reason for it not being in the post-load script. We might put the raw info there, but it would be very surprising if "Update Squeak" fetches data other than from source.squeak.org.
Best, Marcel
On 04.04.2022 21:17:17, Thiede, Christoph <christoph.thiede@student.hpi.uni-potsdam.de> wrote:
If this is essential
Well, how do you define essential? You can still use your image with the old font definitions. However, for some newer code points such as 😁😎😍, Unicode generalCategoryLabelOf: and friends will answer "not assigned" without the upgrade. You can see the difference by browsing any comprehensive font in the FontImporter. But I am not aware of any code path that relies on the presence of newer Unicode data.
Apart from that, I already discussed with Marcel what the consequences of downloading data from a third-party server during an image update would be. There might be images, most likely server images, that do not have free internet access due to a strict firewall. Hypothetically, this might even introduce security issues. So in the end, we decided to leave this optional for now. It will only break if a future patch of some package relies on exact Unicode data.
Best,
Christoph
________________________________ From: Squeak-dev <squeak-dev-bounces@lists.squeakfoundation.org> on behalf of Eliot Miranda <eliot.miranda@gmail.com> Sent: Monday, April 4, 2022 20:54:59 To: The general-purpose Squeak developers list Subject: Re: [squeak-dev] Unicode
On Apr 4, 2022, at 11:20 AM, Christoph.Thiede@student.hpi.uni-potsdam.de wrote:
Merged via Multilingual-ct.271, Multilingual-ct.272, MultilingualTests-ct.41, and ReleaseBuilder-ct.231.
Please run the following in your image to install the new Unicode data (and to uncover any regressions I may have missed :D):
Unicode reinitializeData.
If this is essential then it *must* be added as a post load script to one (or more) of the relevant packages. Asking “did you run Unicode reinitializeData?” when someone reports a strange bug isn’t acceptable.
Best, Christoph
--- Sent from Squeak Inbox Talk https://github.com/hpi-swa-lab/squeak-inbox-talk
Hi Christoph,
We talked about this in the board meeting today and the general consensus was that it should rather be distributed via the update stream than relying on the download from a third party website. A few hundred kilobytes are not expected to be much trouble.
We did not specifically talk about whether the release should wait for further changes here... when in doubt, probably not. Of course, if you do get it done in the next few days, it would be welcome, but I do not think that any pressure is warranted. :-)
Regarding the storage efficiency... Do you really need to store the whole UnicodeData.txt in the image or in Monticello? For example, what I had in mind on April 5 was like: let parseUnicodeData: generate a new initialize method rather than populate the GeneralCategory table directly. The generated initialize method would "hardcode" the GeneralCategory table based on the parsed file. Then you could commit that generated method and you would not need to keep the file in the package. That might reduce the size that Monticello has to deal with.
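To illustrate why a generated method could be much smaller than the raw file, here is a minimal sketch, in Python rather than Squeak so that it runs standalone. It assumes the documented UnicodeData.txt layout (semicolon-separated fields, code point in field 0, name in field 1, general category in field 2, and `<..., First>`/`<..., Last>` pairs for large ranges). The function names and the emitted literal-array layout are hypothetical and are not how the GeneralCategory table is actually stored.

```python
def parse_category_runs(lines):
    """Collapse UnicodeData.txt records into (first, last, category) runs."""
    runs = []
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            # Remember where a compressed range begins; its "Last" record follows.
            range_start = code_point
            continue
        first = code_point
        if name.endswith(", Last>") and range_start is not None:
            first = range_start
            range_start = None
        if runs and runs[-1][2] == category and runs[-1][1] + 1 == first:
            # Adjacent code points with the same category extend the previous run.
            runs[-1] = (runs[-1][0], code_point, category)
        else:
            runs.append((first, code_point, category))
    return runs


def render_literal_array(runs):
    """Render the runs as flat 'first last #Category' triples (hypothetical layout)."""
    body = " ".join(f"{first} {last} #{category}" for first, last, category in runs)
    return f"#({body})"


# A few records in UnicodeData.txt format, used here only as sample input.
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]

print(render_literal_array(parse_category_runs(SAMPLE)))
```

Only the category column survives, and neighbouring code points with the same category collapse into one run, so the committed method would carry far less than the full text file.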
Kind regards, Jakob
squeak-dev@lists.squeakfoundation.org