On Mon, May 1, 2017 at 5:11 PM, David T. Lewis lewis@mail.msen.com wrote:
On Mon, May 01, 2017 at 07:26:35PM +0200, Levente Uzonyi wrote:
I presume that a general-purpose in-image solution would be more complex. String already has too many subclasses (6 in Squeak), while at the same time other kinds of new subclasses would be welcome too, e.g. strings with 2-byte characters. Since these properties are orthogonal, there would be many new subclasses to cover all cases. Storing the size of the string in a different header word is a VM-specific detail, so I think caching the hash could also be hidden from the image.
Levente
Actually, I meant something more like this:
Object subclass: #LargeString
	instanceVariableNames: 'anyKindOfString cachedHashValueForTheString'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Probably a Bad Idea'
I was guessing that hashing very large strings would imply a somewhat specialized problem domain, for which a wrapper class might make sense. Certainly it would not be a general solution.
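A minimal sketch of how such a wrapper might behave (hypothetical code, following the class definition above; not part of any image):

```smalltalk
LargeString >> hash
	"Compute the wrapped string's hash once, then answer the cached value."
	^cachedHashValueForTheString ifNil:
		[cachedHashValueForTheString := anyKindOfString hash]

LargeString >> at: index put: aCharacter
	"Any mutation must invalidate the cached hash."
	cachedHashValueForTheString := nil.
	^anyKindOfString at: index put: aCharacter
```

Every other mutating message would need the same invalidation, which is part of why this only makes sense for a specialized problem domain.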
But this implies changing very many places in the VM that access a string as a vector of bytes. That's why Levente suggests hiding the hash in unused memory in Spur. It doesn't have the side effect of requiring a major rewrite.
I am probably over my quota of bad ideas for today, so I'll stop now ;-)
Dave
On Mon, 1 May 2017, David T. Lewis wrote:
Does it need to be done in the VM? Why not make a class LargeString with instance variables aString and myCalculatedHashValueForTheString. That way you can cache the hash value and calculate it any way you want.
I know very little about hashing, just wondering if this kind of thing can be handled more easily in the image.
Dave
On Mon, 1 May 2017, Levente Uzonyi wrote:
Well, I had started to write a reply, but I had to postpone it. I mostly agree with your suggestions.
One thing that can be done about large strings is to cache the calculated hash value in the larger strings. Currently the object representation changes when the string contains 255 or more characters. In that case an additional 64-bit field is added to the object header to store its length. If we were to use the upper 28+1 bits of that field to cache the hash, there would still be 35 bits to encode length, which would be enough to represent strings up to 8 GiB.
Well, we can keep the whole range minus one bit, but then we can't store the hash for strings larger than 8 GiB.
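To make the proposed layout concrete, the packing could be expressed in pure Smalltalk like this (an illustration of the idea only, not actual Spur code; the extra "hash present" flag bit is an assumption):

```smalltalk
| length hash field |
length := 1024 * 1024 * 1024.	"a 1 GiB string"
hash := 16r0ABCDEF.	"a 28-bit string hash"
"Pack the hash (plus a validity bit) into the upper bits,
 the length into the low 35 bits of the 64-bit field."
field := ((hash bitOr: 16r10000000) bitShift: 35) bitOr: length.
(field bitShift: -35) bitAnd: 16r0FFFFFFF.	"recovers the hash"
field bitAnd: (1 bitShift: 35) - 1.	"recovers the length"
```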
Levente
But this would require further VM changes (e.g. at:put: would have to flush the cache).
Levente
On Mon, 1 May 2017, Martin McClure wrote:
I see no replies to this on any of the three lists it was sent to, so I guess I'll chime in.
tl;dr: Making a primitive for #hashMultiply is probably a good idea, in some form, since doing it in Smalltalk is awkward. Only hashing every N-th character of large strings is probably a very bad idea: performance might well get worse, the complexity does not seem justified, and it would open a sizeable security hole.
More verbiage below for those interested.
Regards, -Martin
On 04/18/2017 07:09 PM, Eliot Miranda wrote:
> Hi All,
>
> the hash algorithm used for ByteString in Squeak and Pharo is good for "small" strings and overkill for large strings.
Why do you say it's overkill for large strings? Are there applications with large strings that are being negatively impacted by the current algorithm? Which ones, and impacted how?
> It is important in many applications to get well distributed string hashes, especially over the range of strings that constitute things like method names, URLs, etc. Consequently, the current algorithm includes every character in a string. This works very well for "small" strings and results in very slow hashes (and hence long latencies, because the hash is an uninterruptible primitive) for large strings, where large may be several megabytes.
A simple solution for the uninterruptible primitive is to not make it a primitive. Make #hashMultiply a primitive (since this particular kind of numeric modulo computation is really painful in Smalltalk), and do the rest in a loop in Smalltalk. It sounds like you've done the #hashMultiply primitive already.
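The image-side loop might then look roughly like this (a sketch modeled on the existing String hashing code; the selector is made up, and it assumes #hashMultiply is the new primitive):

```smalltalk
ByteString >> hashWithInitialHash: speciesHash
	"Answer a 28-bit hash of the receiver, folding in every byte.
	 Only #hashMultiply is primitive; this loop is ordinary,
	 interruptible Smalltalk code."
	| hash |
	hash := speciesHash bitAnd: 16r0FFFFFFF.
	1 to: self size do: [:index |
		hash := (hash + (self basicAt: index)) hashMultiply].
	^hash
```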
If the overhead of calling a primitive for each character proves to be too much, even with the faster primitive calling methodologies you talked about in the "Cog Primitive Performance" thread on the Vm-dev list, a more complex primitive could take a range of bytes, so large strings would be done in batches, solving the latency problem.
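Such a batching primitive might have an interface along these lines (entirely hypothetical -- the selector, primitive name, module, and batch size are made up for illustration):

```smalltalk
ByteString >> hashBytesFrom: start to: stop initialHash: incomingHash
	"Hash bytes start..stop, continuing from incomingHash. The caller
	 loops over, say, 64 KiB batches, so execution stays interruptible
	 between batches. The primitive here is imaginary; the fallback
	 code below defines its semantics."
	| hash |
	hash := incomingHash.
	start to: stop do: [:i |
		hash := (hash + (self basicAt: i)) hashMultiply].
	^hash
```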
> Let's look at the basic hash algorithm.
[...]
> In looking at this I've added a primitive for hashMultiply; primitive #159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for SmallInteger and LargePositiveInteger receivers, as fast as possible in the Cog JIT. With this machinery in place it's instructive to compare the cost of the primitive against the non-primitive Smalltalk code.
>
> First let me introduce a set of replacement hash functions, newHashN. These hash all characters in strings up to a certain size, and then no more than that number for larger strings. Here are newHash64 and newHash2048, which use pure Smalltalk, including an inlined hashMultiply written to avoid SmallInteger overflow. Also measured are the obvious variants newHash128, newHash256, newHash512 & newHash1024.
>
> [...]
>
> So the idea here is to step through the string by 1 for string sizes up to N - 1, and by greater than 1 for strings of size >= N, limiting the maximum number of characters sampled to between N // 2 and N -
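For readers wondering about the "inlined hashMultiply written to avoid SmallInteger overflow": 1664525 (16r19660D) times a 28-bit value overflows the 31-bit SmallInteger range, so the pure-Smalltalk version splits the operand at 14 bits. Squeak's existing in-image code does essentially this:

```smalltalk
hashMultiply
	"Answer self * 1664525 bitAnd: 16r0FFFFFFF without intermediate
	 overflow: 16r19660D = (16r0065 * 16384) + 16r260D, and the
	 receiver is split into a 14-bit low part and the remaining
	 high part, so no product exceeds SmallInteger range."
	| low |
	low := self bitAnd: 16383.
	^(16r260D * low
		+ ((16r260D * (self bitShift: -14) + (16r0065 * low) bitAnd: 16383)
			* 16384))
		bitAnd: 16r0FFFFFFF
```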
The history of computing is littered with the bones of those who have tried this kind of thing. It doesn't end well. Yes, you get a faster hash function. And then you find, for sets of data that you or your users actually want to use, that you get collisions like crazy, and much worse overall performance than you started with.
Sure, it works OK for the sets of data that the designer *tested*, and probably for the sets of data that they *anticipated*. But real-world data is tricky. It includes data sets where the characters that differ are the ones that the hash thinks are unimportant, and there goes your performance, by orders of magnitude. For instance, early versions of Java used a limited number of characters to hash strings. One of the biggest compatibility-breaking changes they were forced to make in later Java versions was to consider *all* characters in hashing. It turned out that it was very common to hash URLs, and many distinct URLs had most of their characters in common.
And you don't always get to choose your data -- sometimes you have an opponent who is actively looking to create collisions as a denial-of-service attack. There was a fair-sized kerfuffle about this a few years ago -- most web scripting languages made it too easy to mount this kind of attack.
portions-of-web-vulnerable-to-hashing-denial-of-service-attack/
"...an attacker can degenerate the hash table by sending lots of colliding keys. ...making it possible to exhaust hours of CPU time using a single HTTP request."

To guard against this kind of attack you need a randomized element in your hash (not a bad idea for Smalltalk, actually, and pretty easy -- mixing in the identity hash of the collection might be sufficient) or a cryptographic hash (not worth the computational expense for most purposes). However, even adding a randomized element would not prevent this kind of attack if you predictably completely ignore some characters of the input string. That just makes it *so* easy to generate data that validates, and is not equal, but causes collisions.
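As a concrete picture of that per-collection randomization (a sketch only; the selector #hashWithInitialHash: is assumed rather than existing API), a hashed collection could seed the string hash with its own identity hash instead of sending #hash directly:

```smalltalk
scaledHashOf: aString
	"Fold this collection's identity hash into the string hash, so an
	 attacker cannot precompute keys that collide in every table at once."
	^aString hashWithInitialHash: self identityHash
```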
So really, for general-purpose use (i.e. what's built into the language) hash *every* character of *all* strings. If someone finds that this is a performance problem in a real-world situation, it can be addressed in an application-specific way.