On Mon, May 1, 2017 at 5:11 PM, David T. Lewis lewis@mail.msen.com wrote:
On Mon, May 01, 2017 at 07:26:35PM +0200, Levente Uzonyi wrote:
I presume that a general-purpose in-image solution would be more complex. String already has too many subclasses (6 in Squeak), while at the same time other kinds of new subclasses would be welcome too, e.g. strings with 2-byte characters. Since these properties are orthogonal, there would be many new subclasses to cover all cases. Storing the size of the string in a different header word is a VM-specific detail, so I think caching the hash could also be hidden from the image.
Levente
Actually, I meant something more like this:
Object subclass: #LargeString
	instanceVariableNames: 'anyKindOfString cachedHashValueForTheString'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Probably a Bad Idea'
I was guessing that hashing very large strings would imply a somewhat specialized problem domain, for which a wrapper class might make sense. Certainly it would not be a general solution.
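A minimal sketch of how such a wrapper might behave (hypothetical code, following the class definition above; not part of any image):

```smalltalk
LargeString >> hash
	"Compute the wrapped string's hash once, then answer the cached value."
	^cachedHashValueForTheString ifNil:
		[cachedHashValueForTheString := anyKindOfString hash]

LargeString >> at: index put: aCharacter
	"Any mutation must invalidate the cached hash."
	cachedHashValueForTheString := nil.
	^anyKindOfString at: index put: aCharacter
```

Every other mutating message would need the same invalidation, which is part of why this only makes sense for a specialized problem domain.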
But this implies changing very many places in the VM that access a string as a vector of bytes. That's why Levente suggests hiding the hash in unused memory in Spur. It doesn't have the side effect of requiring a major rewrite.
I am probably over my quota of bad ideas for today, so I'll stop now ;-)
Dave
On Mon, 1 May 2017, David T. Lewis wrote:
Does it need to be done in the VM? Why not make a class LargeString with instance variables aString and myCalculatedHashValueForTheString. That way you can cache the hash value and calculate it any way you want.
I know very little about hashing, just wondering if this kind of thing can be handled more easily in the image.
Dave
On Mon, 1 May 2017, Levente Uzonyi wrote:
Well, I had started to write a reply, but I had to postpone it. I mostly agree with your suggestions.
One thing that can be done about large strings is to cache the calculated hash value in the larger strings. Currently the object representation changes when the string contains 255 or more characters. In that case an additional 64-bit field is added to the object header to store its length. If we were to use the upper 28+1 bits of that field to cache the hash, there would still be 35 bits to encode length, which would be enough to represent strings up to 8 GiB.
Well, we can keep the whole range minus one bit, but then we can't store the hash for strings larger than 8 GiB.
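To make the proposed layout concrete, the packing could be expressed in pure Smalltalk like this (an illustration of the idea only, not actual Spur code; the extra "hash present" flag bit is an assumption):

```smalltalk
| length hash field |
length := 1024 * 1024 * 1024.	"a 1 GiB string"
hash := 16r0ABCDEF.	"a 28-bit string hash"
"Pack the hash (plus a validity bit) into the upper bits,
 the length into the low 35 bits of the 64-bit field."
field := ((hash bitOr: 16r10000000) bitShift: 35) bitOr: length.
(field bitShift: -35) bitAnd: 16r0FFFFFFF.	"recovers the hash"
field bitAnd: (1 bitShift: 35) - 1.	"recovers the length"
```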
Levente
But this would require further VM changes (e.g. at:put: would have to flush the cache).
Levente
On Mon, 1 May 2017, Martin McClure wrote:
I see no replies to this on any of the three lists it was sent to, so I guess I'll chime in.
tl;dr: Making a primitive for #hashMultiply is probably a good idea, in some form, since doing it in Smalltalk is awkward. Only hashing every N-th character of large strings is probably a very bad idea: performance might well get worse, the complexity does not seem justified, and it would open a sizeable security hole.
More verbiage below for those interested.
Regards, -Martin
On 04/18/2017 07:09 PM, Eliot Miranda wrote:
> Hi All,
>
> the hash algorithm used for ByteString in Squeak and Pharo is good for "small" strings and overkill for large strings.
Why do you say it's overkill for large strings? Are there applications with large strings that are being negatively impacted by the current algorithm? Which ones, and impacted how?
> It is important in many applications to get well distributed string hashes, especially over the range of strings that constitute things like method names, URLs, etc. Consequently, the current algorithm includes every character in a string. This works very well for "small" strings and results in very slow hashes (and hence long latencies, because the hash is an uninterruptible primitive) for large strings, where large may be several megabytes.
A simple solution for the uninterruptible primitive is to not make it a primitive. Make #hashMultiply a primitive (since this particular kind of numeric modulo computation is really painful in Smalltalk), and do the rest in a loop in Smalltalk. It sounds like you've done the #hashMultiply primitive already.
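The image-side loop might then look roughly like this (a sketch modeled on the existing String hashing code; the selector is made up, and it assumes #hashMultiply is the new primitive):

```smalltalk
ByteString >> hashWithInitialHash: speciesHash
	"Answer a 28-bit hash of the receiver, folding in every byte.
	 Only #hashMultiply is primitive; this loop is ordinary,
	 interruptible Smalltalk code."
	| hash |
	hash := speciesHash bitAnd: 16r0FFFFFFF.
	1 to: self size do: [:index |
		hash := (hash + (self basicAt: index)) hashMultiply].
	^hash
```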
If the overhead of calling a primitive for each character proves to be too much, even with the faster primitive calling methodologies you talked about in the "Cog Primitive Performance" thread on the Vm-dev list, a more complex primitive could take a range of bytes, so large strings would be done in batches, solving the latency problem.
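Such a batching primitive might have an interface along these lines (entirely hypothetical -- the selector, primitive name, module, and batch size are made up for illustration):

```smalltalk
ByteString >> hashBytesFrom: start to: stop initialHash: incomingHash
	"Hash bytes start..stop, continuing from incomingHash. The caller
	 loops over, say, 64 KiB batches, so execution stays interruptible
	 between batches. The primitive here is imaginary; the fallback
	 code below defines its semantics."
	| hash |
	hash := incomingHash.
	start to: stop do: [:i |
		hash := (hash + (self basicAt: i)) hashMultiply].
	^hash
```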
> Let's look at the basic hash algorithm.
[...]
> In looking at this I've added a primitive for hashMultiply; primitive #159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for SmallInteger and LargePositiveInteger receivers, as fast as possible in the Cog JIT. With this machinery in place it's instructive to compare the cost of the primitive against the non-primitive Smalltalk code.
>
> First let me introduce a set of replacement hash functions, newHashN. These hash all characters in strings up to a certain size, and then no more than that number for larger strings. Here are newHash64 and newHash2048, which use pure Smalltalk, including an inlined hashMultiply written to avoid SmallInteger overflow. Also measured are the obvious variants newHash128, newHash256, newHash512 & newHash1024.
>
> [...]
>
> So the idea here is to step through the string by 1 for string sizes up to N - 1, and by greater than 1 for strings of size >= N, limiting the maximum number of characters sampled to between N // 2 and N -
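For readers wondering about the "inlined hashMultiply written to avoid SmallInteger overflow": 1664525 (16r19660D) times a 28-bit value overflows the 31-bit SmallInteger range, so the pure-Smalltalk version splits the operand at 14 bits. Squeak's existing in-image code does essentially this:

```smalltalk
hashMultiply
	"Answer self * 1664525 bitAnd: 16r0FFFFFFF without intermediate
	 overflow: 16r19660D = (16r0065 * 16384) + 16r260D, and the
	 receiver is split into a 14-bit low part and the remaining
	 high part, so no product exceeds SmallInteger range."
	| low |
	low := self bitAnd: 16383.
	^(16r260D * low
		+ ((16r260D * (self bitShift: -14) + (16r0065 * low) bitAnd: 16383)
			* 16384))
		bitAnd: 16r0FFFFFFF
```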
The history of computing is littered with the bones of those who have tried this kind of thing. It doesn't end well. Yes, you get a faster hash function. And then you find, for sets of data that you or your users actually want to use, that you get collisions like crazy, and much worse overall performance than you started with.
Sure, it works OK for the sets of data that the designer *tested*, and probably for the sets of data that they *anticipated*. But real-world data is tricky. It includes data sets where the characters that differ are the ones that the hash thinks are unimportant, and there goes your performance, by orders of magnitude. For instance, early versions of Java used a limited number of characters to hash strings. One of the biggest compatibility-breaking changes they were forced to make in later Java versions was to consider *all* characters in hashing. It turned out that it was very common to hash URLs, and many distinct URLs had most of their characters in common.
And you don't always get to choose your data -- sometimes you have an opponent who is actively looking to create collisions as a denial-of-service attack. There was a fair-sized kerfuffle about this a few years ago -- most web scripting languages made it too easy to mount this kind of attack.
portions-of-web-vulnerable-to-hashing-denial-of-service-attack/
"...an attacker can degenerate the hash table by sending lots of colliding keys. ...making it possible to exhaust hours of CPU time using a single HTTP request."

To guard against this kind of attack you need a randomized element in your hash (not a bad idea for Smalltalk, actually, and pretty easy -- mixing in the identity hash of the collection might be sufficient) or a cryptographic hash (not worth the computational expense for most purposes). However, even adding a randomized element would not prevent this kind of attack if you predictably completely ignore some characters of the input string. That just makes it *so* easy to generate data that validates, and is not equal, but causes collisions.
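As a concrete picture of that per-collection randomization (a sketch only; the selector #hashWithInitialHash: is assumed rather than existing API), a hashed collection could seed the string hash with its own identity hash instead of sending #hash directly:

```smalltalk
scaledHashOf: aString
	"Fold this collection's identity hash into the string hash, so an
	 attacker cannot precompute keys that collide in every table at once."
	^aString hashWithInitialHash: self identityHash
```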
So really, for general-purpose use (i.e. what's built into the language) hash *every* character of *all* strings. If someone finds that this is a performance problem in a real-world situation, it can be addressed in an application-specific way.