Thanks David!
On Sun, Mar 12, 2017 at 3:52 PM, David T. Lewis lewis@mail.msen.com wrote:
On Fri, Mar 10, 2017 at 11:49:43PM -0500, David T. Lewis wrote:
On Fri, Mar 10, 2017 at 10:19:42AM +0100, Nicolas Cellier wrote:
2017-03-10 9:45 GMT+01:00 Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com>:
2017-03-10 3:47 GMT+01:00 David T. Lewis lewis@mail.msen.com:
On Thu, Mar 09, 2017 at 12:41:44PM -0800, Eliot Miranda wrote:
Hi David,
On Mon, Mar 6, 2017 at 5:50 PM, David T. Lewis <
lewis@mail.msen.com>
wrote:
> > In the VM, the millisecond clock wraps within the 32 bit integer
range:
> > #define MillisecondClockMask 0x1FFFFFFF > > In the Cuis image, Delay class>>handleTimerEvent does this: > > nextTick := nextTick min: SmallInteger maxVal. > > On a 64-bit Spur image, SmallInteger maxVal is
16rFFFFFFFFFFFFFFF,
but on
> a 32-bit image it is 16r3FFFFFFF. > > Could that be it? > > I don't really know how to test in Squeak. As you say, Squeak
is now
> using the microsecond clock in #handleTimerEvent. I do not see
anything
> in primitiveSignalAtMilliseconds that would behave any
differently on
> a 64 bit versus 32 bit image or VM, but I do not know how to
test to
> be sure. >
I bet that the following code from primitiveSignalAtMilliseconds
ends up
wrapping when given Delay primSignal: s atMilliseconds: Time primMillisecondClock
deltaMsecs := msecs - (self ioMSecs bitAnd: MillisecondClockMask). deltaMsecs < 0 ifTrue: [deltaMsecs := deltaMsecs + MillisecondClockMask + 1]. nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
and I bet you'll find that the VM will wake up in about 6 days
and 5
hours
;-)
No. The failure is specific to the 64 bit VM. Source code for the primitive is the same in either case.
I suppose we could fix this, but I'm *much* happier to simply not
use
primitiveSignalAtMilliseconds and stay with the simpler and
wrapping-immune
primitiveSignalAtUTCMicroseconds
Fair enough, but given that there is a demonstrated bug that
affects only
the 64-bit VM, and given that it expresses itself intermittently
and in
ways that affect only someone who is attempting to migrate their
existing V3
image to Spur, then I would say that it makes very good sense to take the
time
to fix it if we are able to do so. After all, there may be other
people
who will want to migrate V3 images to Spur, and there is no point in
making
the process needlessly difficult.
I do not have the solution, but maybe someone else can help. So I am asking for help here. Can someone with a working 64-bit build environment
please
check and see if what I said in an earlier email might make a
difference:
I see that ioMSecs() is declared as signed long (32 bits), but it
is
used in expression with a 64 bit usqInt. So maybe it needs a cast, or
maybe
the variables like msecs and deltaMsecs in
primitiveSignalAtMilliseconds
should be declared as 32 bit long and unsigned long to match the
actual
usage.
Thanks,
Dave
Hi David I just reviewed the generated code for primitiveSignalAtMilliseconds andhere is what i found:
The primitive will simply fail if fed with a negative integer,
thanks to:
msecs = positive32BitValueOf(msecsObj);
msecs is allways positive, since declared as usqInt msecs;
So far, so good.
Then deltaMsecs = msecs - ((ioMSecs()) & MillisecondClockMask); if (deltaMsecs < 0) { deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; } GIV(nextWakeupUsecs) = (ioUTCMicroseconds())
(deltaMsecs * 1000);
Due to the 29 bits Mask, the bitAnd: (&) operation will result into a positive int. 0 <= ((ioMSecs()) & MillisecondClockMask) <= 0x1FFFFFFF
However, when we do the Roll-over thing: deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; Then we create a giant delay (the 6 days mentionned by Eliot) which
is
HIGHLY QUESTIONNABLE whatever 32 or 64 bits!
The difference between 32 and 64 bits is in the next line: GIV(nextWakeupUsecs) = (ioUTCMicroseconds())
(deltaMsecs * 1000); Because sqInt deltaMsecs;
is 32 bits for a 32bits spur, 64bits for 64bits spur, then
(deltaMsecs *
- will do two different things:
- in 32 bits it will overflow leading to a negative delay and a
nextWakeupUsecs in the past (but that's invoking Undefined Behavior)
- in 64 bits it won't overflow leading to he 6 days delay
Does that explain?
if (deltaMsecs < 0) I would either fail the primitive or set the deltaMsecs to zero. What do you think?
Nicolas
Anyway, the RollOver protection sounds strange: if clock was high, and
if
we don't apply the MillisecondClockMask to (clock+delay), then (clock+delay) won't overflow (thanks to largeIntegers).
Eventually,
for delay > 2^32-2^29 msecs, the primitive will fail thru (positive32BitValueOf). That's something like 43 days delay max, more than enough...
Either the code should rather be something like this with roll-over protection deltaMsecs = (msecs - ioMSecs()) & MillisecondClockMask; or we should leave deltaMsecs as is and reject negative delays
Also note this in 32 bits: Since *1000 consumes 10bits, then delays above 21 bits msecs will
overflow
and be signalled immediately. Tha's about 2e6 msecs, 2e3 secs. Thus the maximum delay supported by
this
primitive is a bit less than 36 minutes (16r80000000/1000/1000/60). So maybe we should also reject large positive "unreasonable" delays,
that
isanthing above a few minutes and document it.
I think I understand what you are saying now. Let me see if I can restate the problem:
Before the change to microsecond clock, the interpreter VM did the rollover check only in #checkForInterrupts, and the arithmetic did not require a 64 bit microsecond range. Now we have rollover logic in #primitiveSignalAtMilliseconds, which leads to two issues, one of which masks the other on 32 bits.
Issue #1: If we pass to the primitive a millisecond count that is
slightly
in the past (possibly due to clock skew or other timing issues), then the primitive will test for deltaMsecs less than zero, and will incorrectly assume that a millisecond clock rollover has happened.
Issue #2: When adjusting for clock rollover on 32 bits, deltaMsecs is declared as 32 bit sqInt, which results in overflow when converting to microseconds. For 64 bit sqInt, there is no overflow. This error masks the first problem of Issue #1, which makes the primiitive appear to work correctly on 32 bits, even though the clock rollover logic is wrong.
I do not think that we can simply fail the primitive, or set
deltaMSeconds
to zero, in the case of deltaMSeconds being less than zero. We still need to be able to schedule delays over a time period when the millisecond clock rolls over, so that needs to work.
We do need to fix the arithmetic overflow condition, which probably just requires declaring deltaMsecs as usqLong. This matches the declaration for nextWakeupUsecs and should prevent arithmetic overflow. But fixing this for issue #2 will unmask issue #1.
For issue #1, the solution probably requires a heuristic. If deltaMsecs is a big negative value, then the clock has rolled over. If it is a small negative value (maybe less than a second, just to pick a number), then it is probably clock skew and the MillisecondClockMask + 1 adjustment should not be applied.
Does this sound right?
I now have a working Cog/Spur build on my Ubuntu box, so I was able to try a few things. It turns out that the arithmetic overflow that Nicolas spotted seems to fully account for the problem that Juan originally reported on the 64 but Cuis image. The overflow happened here:
nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
Declaring deltaMsecs as a 64 bit unsigned to match the declaration of nextWakeupUsecs resolved the image lockup problem that Juan identified. I committed the fix in VMMaker.oscog-dtl.2147.
Dave