On Mon, 1 May 2017, Eliot Miranda wrote:
On Mon, May 1, 2017 at 5:11 PM, David T. Lewis <lewis@mail.msen.com> wrote:

On Mon, May 01, 2017 at 07:26:35PM +0200, Levente Uzonyi wrote:
> I presume that a general-purpose in-image solution would be more complex.
> String already has too many subclasses (6 in Squeak), while at the same
> time other kinds of new subclasses would be welcome too, e.g. Strings with
> 2-byte characters.
> Since these properties are orthogonal, there would be many new subclasses
> to cover all cases.
> Storing the size of the string in a different header word is a VM-specific
> detail, so I think caching the hash could also be hidden from the image.
>
> Levente
Actually, I meant something more like this:

  Object subclass: #LargeString
      instanceVariableNames: 'anyKindOfString cachedHashValueForTheString'
      classVariableNames: ''
      poolDictionaries: ''
      category: 'Probably a Bad Idea'

I was guessing that hashing very large strings would imply a somewhat
specialized problem domain, for which a wrapper class might make sense.
Certainly it would not be a general solution.
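A minimal sketch of how such a wrapper might fill in #hash, reusing Dave's
instance variable names (the initializer selector is hypothetical, and
invalidation on mutation is deliberately ignored, so this only works for
strings that are never modified after wrapping):

  LargeString >> setString: aString
      "Hypothetical initializer: wrap aString with an empty hash cache."
      anyKindOfString := aString.
      cachedHashValueForTheString := nil

  LargeString >> hash
      "Compute the wrapped string's hash once and reuse it thereafter.
       Assumes anyKindOfString is not mutated after wrapping."
      ^cachedHashValueForTheString
          ifNil: [cachedHashValueForTheString := anyKindOfString hash]

A real version would also need #= kept consistent with #hash.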
But this implies changing very many places in the VM that access a string as a vector of bytes. That's why Levente suggests hiding the hash in unused memory in Spur. It doesn't have the side effect of requiring a major rewrite.
I guess Dave didn't mean for the VM to treat LargeString like other Strings.
Levente
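For concreteness, the packing Levente describes in the quoted thread below
(a 28-bit hash plus one "valid" bit in the upper 29 bits of the 64-bit
overflow-length word, leaving 35 bits for the length) might look like this.
The exact bit layout is an assumption for illustration, not actual Spur code:

  "bit 63      - hash-is-valid flag (the '+1' bit)
   bits 35..62 - 28-bit cached hash
   bits 0..34  - 35-bit string length
   A mutating primitive such as at:put: would clear bit 63 to flush the cache."
  packedWord := ((validFlag bitShift: 63)
      bitOr: (hash bitShift: 35))
      bitOr: length.

  length    := packedWord bitAnd: (1 bitShift: 35) - 1.
  hash      := (packedWord bitShift: -35) bitAnd: 16rFFFFFFF.
  hashValid := (packedWord bitShift: -63) = 1.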
I am probably over my quota of bad ideas for today, so I'll stop now ;-)

Dave

>
> On Mon, 1 May 2017, David T. Lewis wrote:
>
> >
> >Does it need to be done in the VM? Why not make a class LargeString with
> >instance variables aString and myCalculatedHashValueForTheString. That way
> >you can cache the hash value and calculate it any way you want.
> >
> >I know very little about hashing, just wondering if this kind of thing can
> >be handled more easily in the image.
> >
> >Dave
> >
> >>
> >>On Mon, 1 May 2017, Levente Uzonyi wrote:
> >>
> >>>Well, I had started to write a reply, but I had to postpone it.
> >>>I mostly agree with your suggestions.
> >>>
> >>>One thing that can be done about large strings is to cache the calculated
> >>>hash value in the larger strings. Currently the object representation
> >>>changes when the string contains 255 or more characters. In that case an
> >>>additional 64-bit field is added to the object header to store its length.
> >>>If we were to use the upper 28+1 bits of that field to cache the hash,
> >>>there would still be 35 bits to encode length, which would be enough to
> >>>represent strings up to 8 GiB.
> >>
> >>Well, we can keep the whole range minus one bit, but then we can't store
> >>the hash for strings larger than 8 GiB.
> >>
> >>Levente
> >>
> >>>But this would require further VM changes (e.g. at:put: would have to
> >>>flush the cache).
> >>>
> >>>Levente
> >>>
> >>>On Mon, 1 May 2017, Martin McClure wrote:
> >>>
> >>>>I see no replies to this on any of the three lists it was sent to, so I
> >>>>guess I'll chime in.
> >>>>
> >>>>tl;dr:
> >>>>Making a primitive for #hashMultiply, probably a good idea, in some
> >>>>form, since doing it in Smalltalk is awkward.
> >>>>Only hashing every N-th character of large strings, probably a very bad
> >>>>idea. Performance might well get worse, the complexity does not seem
> >>>>justified, and it would open a sizeable security hole.
> >>>>
> >>>>More verbiage below for those interested.
> >>>>
> >>>>Regards,
> >>>>-Martin
> >>>>
> >>>>On 04/18/2017 07:09 PM, Eliot Miranda wrote:
> >>>>>Hi All,
> >>>>>
> >>>>>    the hash algorithm used for ByteString in Squeak and Pharo is good
> >>>>>for "small" strings and overkill for large strings.
> >>>>
> >>>>Why do you say it's overkill for large strings? Are there applications
> >>>>with large strings that are being negatively impacted by the current
> >>>>algorithm? Which ones, and impacted how?
> >>>>
> >>>>>It is important in many applications to get well-distributed string
> >>>>>hashes, especially over the range of strings that constitute things
> >>>>>like method names, URLs, etc. Consequently, the current algorithm
> >>>>>includes every character in a string. This works very well for
> >>>>>"small" strings and results in very slow hashes (and hence long
> >>>>>latencies, because the hash is an uninterruptible primitive) for large
> >>>>>strings, where large may be several megabytes.
> >>>>
> >>>>A simple solution for the uninterruptible primitive is to not make it a
> >>>>primitive. Make #hashMultiply a primitive (since this particular kind of
> >>>>numeric modulo computation is really painful in Smalltalk), and do the
> >>>>rest in a loop in Smalltalk. It sounds like you've done the
> >>>>#hashMultiply primitive already.
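For reference, the pure-Smalltalk arithmetic in question is along these lines.
The first method is essentially Squeak's overflow-avoiding hashMultiply, which
splits the multiplication by 1664525 (= 101 * 16384 + 9741) into 14-bit pieces
so intermediate results stay in SmallInteger range; the second is a sketch of
the Smalltalk loop Martin proposes around it:

  hashMultiply
      "Answer self * 1664525 bitAnd: 16r0FFFFFFF without any intermediate
       result leaving SmallInteger range."
      | low |
      low := self bitAnd: 16383.
      ^(9741 * low + ((9741 * (self bitShift: -14) + (101 * low)
              bitAnd: 16383) * 16384))
          bitAnd: 16r0FFFFFFF

  hashWithInitialHash: speciesHash
      "Hash every byte of the receiver, starting from speciesHash.
       With hashMultiply as a primitive, only this loop runs as ordinary
       bytecode and can be interrupted between iterations, which addresses
       the latency complaint about the monolithic hash primitive."
      | hash |
      hash := speciesHash bitAnd: 16rFFFFFFF.
      1 to: self size do: [:i |
          hash := (hash + (self basicAt: i)) hashMultiply].
      ^hash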
> >>>>
> >>>>If the overhead of calling a primitive for each character proves to be
> >>>>too much, even with the faster primitive calling methodologies you
> >>>>talked about in the "Cog Primitive Performance" thread on the Vm-dev
> >>>>list, a more complex primitive could take a range of bytes, so large
> >>>>strings would be done in batches, solving the latency problem.
> >>>>
> >>>>>
> >>>>>Let's look at the basic hash algorithm.
> >>>>[...]
> >>>>>
> >>>>>In looking at this I've added a primitive for hashMultiply; primitive
> >>>>>#159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for
> >>>>>SmallInteger and LargePositiveInteger receivers, as fast as possible
> >>>>>in the Cog JIT. With this machinery in place it's instructive to
> >>>>>compare the cost of the primitive against the non-primitive Smalltalk
> >>>>>code.
> >>>>>
> >>>>>First let me introduce a set of replacement hash functions, newHashN.
> >>>>>These hash all characters in strings up to a certain size, and then no
> >>>>>more than that number for larger strings. Here are newHash64 and
> >>>>>newHash2048, which use pure Smalltalk, including an inlined
> >>>>>hashMultiply written to avoid SmallInteger overflow. Also measured
> >>>>>are the obvious variants newHash128, newHash256, newHash512 &
> >>>>>newHash1024.
> >>>>>
> >>>>[...]
> >>>>>
> >>>>>So the idea here is to step through the string by 1 for string sizes
> >>>>>up to N - 1, and by greater than 1 for strings of size >= N, limiting
> >>>>>the maximum number of characters sampled to between N // 2 and N - 1.
> >>>>
> >>>>The history of computing is littered with the bones of those who have
> >>>>tried this kind of thing. It doesn't end well. Yes, you get a faster
> >>>>hash function. And then you find, for sets of data that you or your
> >>>>users actually want to use, that you get collisions like crazy, and much
> >>>>worse overall performance than you started with.
> >>>>
> >>>>Sure, it works OK for the sets of data that the designer *tested*, and
> >>>>probably for the sets of data that they *anticipated*. But real-world
> >>>>data is tricky. It includes data sets where the characters that differ
> >>>>are the ones that the hash thinks are unimportant, and there goes your
> >>>>performance, by orders of magnitude. For instance, early versions of
> >>>>Java used a limited number of characters to hash strings. One of the
> >>>>biggest compatibility-breaking changes they were forced to make in later
> >>>>Java versions was to consider *all* characters in hashing. It turned out
> >>>>that it was very common to hash URLs, and many distinct URLs had most of
> >>>>their characters in common.
> >>>>
> >>>>And you don't always get to choose your data -- sometimes you have an
> >>>>opponent who is actively looking to create collisions as a
> >>>>denial-of-service attack. There was a fair-sized kerfuffle about this a
> >>>>few years ago -- most web scripting languages made it too easy to mount
> >>>>this kind of attack.
> >>>>
> >>>>https://arstechnica.com/business/2011/12/huge-portions-of-web-vulnerable-to-hashing-denial-of-service-attack/
> >>>>
> >>>>"...an attacker can degenerate the hash table by sending lots of
> >>>>colliding keys. ...making it possible to exhaust hours of CPU time using
> >>>>a single HTTP request."
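Eliot's newHashN code itself is elided above; a hypothetical reconstruction
from his description might look like the sketch below (the stride choice is a
guess at one way to land between N // 2 and N - 1 sampled characters). This
sampling is exactly what Martin is arguing against:

  hashSampledAtMost: n initialHash: speciesHash
      "Step through the receiver by 1 when its size is below n, and by a
       larger stride otherwise, so that only roughly n // 2 to n - 1 bytes
       contribute to the hash."
      | hash step |
      hash := speciesHash bitAnd: 16rFFFFFFF.
      step := self size < n
          ifTrue: [1]
          ifFalse: [self size // (n // 2)].
      1 to: self size by: step do: [:i |
          hash := (hash + (self basicAt: i)) hashMultiply].
      ^hash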
> >>>>To guard against this kind of attack you need a randomized element in
> >>>>your hash (not a bad idea for Smalltalk, actually, and pretty easy --
> >>>>mixing in the identity hash of the collection might be sufficient) or a
> >>>>cryptographic hash (not worth the computational expense for most
> >>>>purposes). However, even adding a randomized element would not prevent
> >>>>this kind of attack if you predictably completely ignore some characters
> >>>>of the input string. That just makes it *so* easy to generate data that
> >>>>validates, and is not equal, but causes collisions.
> >>>>
> >>>>So really, for general-purpose use (i.e. what's built into the language)
> >>>>hash *every* character of *all* strings. If someone finds that this is a
> >>>>performance problem in a real-world situation, it can be addressed in an
> >>>>application-specific way.
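A sketch of the randomized element Martin suggests: seed the hash loop with a
value an attacker cannot predict. The selector is illustrative, not existing
library code, and it reuses the hashWithInitialHash: sketch shown earlier:

  hashWithSeed: seed
      "As the normal string hash, but start from an unpredictable seed so
       colliding keys cannot be precomputed offline."
      ^self hashWithInitialHash: (seed bitAnd: 16rFFFFFFF)

A hashed collection could then probe with, say,
aKey hashWithSeed: self identityHash rather than aKey hash. Note Martin's
caveat: this only helps if every character still contributes to the hash.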
--
_,,,^..^,,,_
best, Eliot