On Mon, May 1, 2017 at 9:09 AM, Martin McClure martin@hand2mouse.com wrote:
I see no replies to this on any of the three lists it was sent to, so I guess I'll chime in.
tl;dr: Making a primitive for #hashMultiply, probably a good idea, in some form, since doing it in Smalltalk is awkward. Only hashing every N-th character of large strings, probably a very bad idea. Performance might well get worse, the complexity does not seem justified, and it would open a sizeable security hole.
More verbiage below for those interested.
Regards, -Martin
On 04/18/2017 07:09 PM, Eliot Miranda wrote:
Hi All,
the hash algorithm used for ByteString in Squeak and Pharo is good
for "small" strings and overkill for large strings.
Why do you say it's overkill for large strings? Are there applications with large strings that are being negatively impacted by the current algorithm? Which ones, and impacted how?
It is important in many applications to get well distributed string hashes, especially over the range of strings that constitute things like method names, URLs, etc. Consequently, the current algorithm includes every character in a string. This works very well for "small" strings and results in very slow hashes (and hence long latencies, because the hash is an uninterruptible primitive) for large strings, where large may be several megabytes.
A simple solution for the uninterruptable primitive is to not make it a primitive. Make #hashMultiply a primitive (since this particular kind of numeric modulo computation is really painful in Smalltalk), and do the rest in a loop in Smalltalk. It sounds like you've done the #hashMultiply primitive already.
If the overhead of calling a primitive for each character proves to be too much, even with the faster primitive calling methodologies you talked about in the "Cog Primitive Performance" thread on the Vm-dev list, a more complex primitive could take a range of bytes, so large strings would be done in batches, solving the latency problem.
Let's look at the basic hash algorithm.
[...]
In looking at this I've added a primitive for hashMultiply; primitive #159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for SmallInteger and LargePositiveInteger receivers, as fast as possible in the Cog JIT. With this machinery in place it's instructive to compare the cost of the primitive against the non-primitive Smalltalk code.
First let me introduce a set of replacement hash functions, newHashN. These hash all characters in strings up to a certain size, and then no more than that number for larger strings. Here are newHash64 and newHash2048, which use pure Smalltalk, including an inlined hashMultiply written to avoid SmallInteger overflow. Also measured are the obvious variants newHash128, newHash256, newHash512 &
mewHash1024.
[...]
So the idea here is to step through the string by 1 for strings sizes up to N - 1, and by greater than 1 for strings of size >= N, limiting the maximum number of characters sampled to between N // 2 and N - 1.
The history of computing is littered with the bones of those who have tried this kind of thing. It doesn't end well. Yes, you get a faster hash function. And then you find, for sets of data that you or your users actually want to use, that you get collisions like crazy, and much worse overall performance than you started with.
Sure, it works OK for the sets of data that the designer *tested*, and probably for the sets of data that they *anticipated*. But real-world data is tricky. It includes data sets where the characters that differ are the ones that the hash thinks are unimportant, and there goes your performance, by orders of magnitude. For instance, early versions of Java used a limited number of characters to hash strings. One of the biggest compatibility-breaking changes they were forced to make in later Java versions was to consider *all* characters in hashing. It turned out that it was very common to hash URLs, and many distinct URLs had most of their characters in common.
Was curious about this and found the actual issues [1] if anybody else is interested.
[1] http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622
Cheers, Andrei
And you don't always get to choose your data -- sometimes you have an opponent who is actively looking to create collisions as a denial-of-service attack. There was a fair-sized kerfluffle about this a few years ago -- most web scripting languages made it too easy to mount this kind of attack. https://arstechnica.com/business/2011/12/huge- portions-of-web-vulnerable-to-hashing-denial-of-service-attack/ "...an attacker can degenerate the hash table by sending lots of colliding keys. ...making it possible to exhaust hours of CPU time using a single HTTP request." To guard against this kind of attack you need a randomized element in your hash (not a bad idea for Smalltalk, actually, and pretty easy -- mixing in the identity hash of the collection might be sufficient) or a cryptographic hash (not worth the computational expense for most purposes). However, even adding a randomized element would not prevent this kind of attack if you predictably completely ignore some characters of the input string. That just makes it *so* easy to generate data that validates, and is not equal, but causes collisions.
So really, for general-purpose use (i.e. what's built into the language) hash *every* character of *all* strings. If someone finds that this is a performance problem in a real-world situation, it can be addressed in an application-specific way.