Hi All,
responding to Andrew here because this is generally of interest to the vm-list.
On Mon, Sep 26, 2011 at 11:06 AM, Andrew Gaylard apg@4dst.com wrote:
Hmmm. Thanks for the advice -- we now build with -O3, and all's well. I've run the VM at full load (mostly compiling) for 30 hours without a hiccup. Interesting that -O2 is problematic, but -O3 isn't; I assumed that higher optimisations would make things less stable, not more so. And we get a 17% speed increase.
My GCC is:
$ gcc --version
gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3
So this really surprises me, since we see exactly the same thing with gcc version 3.4.6 20060404 (Red Hat 3.4.6-3). If we compile with -O1 or -O3 we get functional Cog VMs, but -O2 crashes on start-up or soon thereafter. I'm surprised that two very different versions of gcc show the same behaviour, but I guess I shouldn't be. At some point some of us (me included) really should put in the effort to understand what the issue is. It could be a gcc bug, or it could be that we're generating C code with ill-defined behaviour. I have to say that I suspect the latter, given how different gcc 3.4.x and gcc 4.4.x are (BTW Andrew also sees the same issue with gcc 4.1.x).
- Andrew
On 2011.09.25 23:12:50 -0700, Eliot Miranda eliot.miranda@gmail.com wrote:
On Sat, Sep 24, 2011 at 9:02 AM, Andrew Gaylard apg@4dst.com wrote:
Actually, it looks like I was wrong. After rebuilding everything from scratch, I've been unable to reproduce these crashes, except for the one with unix-4.4.7.image.
Sorry for the false alarm. r2495 looks pretty good, at both -O0 and -O1. It still crashes at -O2, but that's not a huge concern.
Which gcc are you using? Here at Cadence on a much older 32-bit machine using gcc 3.4.x we see crashes at -O2 but no crashes at -O0 -O1 & -O3 :)
On 2011.09.24 08:07:47 +0200, Andrew Gaylard apg@4dst.com wrote:
On 2011.09.23 13:26:06 -0700, Eliot Miranda eliot.miranda@gmail.com wrote:
Thank you, Andrew, you nailed it. I've found the bug via your stack trace below. Huge relief. Thanks! New VMs and explanation to the list soon.
Alas, we spoke too soon. -2495 exhibits the same symptoms; traces and gdb transcripts are attached.
- vm-*-2495.0.txt are from our basic.image, running the test-runner.
- vm-*-2495.1.txt are from Squeak4.2-10966.image, running the test-runner.
- vm-*-2495.2.txt are from unix-4.4.7.image, having just started up the VM.
The first two of these appear to be the same problem I encountered with -2493. The backtraces certainly look very similar.
The third one is rather different. Looking at the stack trace, the 'rcvr' variable in ceSendsupertonumArgs is 17039140, which is de-referenced in line 10733, causing a SEGV; the handler duly confirms the faulting address as si_addr = 0x103ff24:
$ perl -e 'print 0x103ff24'
17039140
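The same cross-check between the hexadecimal si_addr and the decimal 'rcvr' value can be done from C. This is only a minimal sketch; the helper name parse_hex_address is made up for illustration and is not part of the VM sources.

```c
#include <stdlib.h>

/* Hypothetical helper: parse a hexadecimal address string and return its
   numeric value. With base 16, strtoul accepts an optional "0x" prefix. */
unsigned long parse_hex_address(const char *text)
{
    return strtoul(text, NULL, 16);
}
```

Calling parse_hex_address("0x103ff24") gives 17039140, matching the value gdb shows for 'rcvr'.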
2011/9/26 Eliot Miranda eliot.miranda@gmail.com:
So this really surprises me, since we see exactly the same thing with gcc version 3.4.6 20060404 (Red Hat 3.4.6-3). If we compile with -O1 or -O3 we get functional Cog VMs, but -O2 crashes on start-up or soon thereafter. I'm surprised that two very different versions of gcc show the same behaviour, but I guess I shouldn't be. At some point some of us (me included) really should put in the effort to understand what the issue is. It could be a gcc bug, or it could be that we're generating C code with ill-defined behaviour. I have to say that I suspect the latter, given how different gcc 3.4.x and gcc 4.4.x are (BTW Andrew also sees the same issue with gcc 4.1.x).
Again some nasty kind of undefined shift on signed/unsigned ints? Or a macro expansion leading to subtle subexpression ordering (++i/++i)? Or one of the many dark zones here: http://www.vmunix.com/~gabor/c/draft.html#601
Being able to enumerate such a long list without any omission is already something! What a beautiful language!
Nicolas
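To make the shift suspicion concrete, here is a minimal sketch of how shifts can be written so that they do not depend on undefined or implementation-defined behaviour. Nothing here is taken from the Cog sources, and it is not a claim that this is the actual bug; the function names are hypothetical and 32-bit values are assumed.

```c
#include <stdint.h>

/* Left shift done in unsigned arithmetic. Left-shifting a negative signed
   int is undefined behaviour, and so is a shift count >= the type width,
   so the value is unsigned and the count is masked to 0..31. */
uint32_t shift_left_u32(uint32_t bits, unsigned count)
{
    return bits << (count & 31u);
}

/* Arithmetic (sign-extending) right shift that never applies >> to a
   negative value, whose result is implementation-defined in C. For a
   negative input, ~value is non-negative, so shifting it is well defined. */
int32_t arithmetic_shift_right(int32_t value, unsigned count)
{
    count &= 31u;
    if (value >= 0)
        return value >> count;
    return ~(~value >> count);
}
```

For example, arithmetic_shift_right(-8, 1) yields -4 regardless of how the compiler treats >> on negative operands.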
Hi:
On 27 Sep 2011, at 08:45, Nicolas Cellier wrote:
Again some nasty kind of undefined shift on signed/unsigned ints? Or a macro expansion leading to subtle subexpression ordering (++i/++i)? Or one of the many dark zones here: http://www.vmunix.com/~gabor/c/draft.html#601
Being able to enumerate such a long list without any omission is already something! What a beautiful language!
Similar to the problems previously mentioned in this thread, we have an optimization bug in the RoarVM codebase too.
It bites me with an infinite loop when I use a GCC newer than 4.2 or the Intel compiler and enable optimization above -O1 on two specific files.
Still, I think the bug is somewhere else entirely.
I tried going through the files and disabling optimization for particular functions, without any useful result... The failure jumps around. Perhaps there are multiple places where we rely on unspecified behaviour.
Are there any tools that could be useful to find such things?
I tried the Clang static analyzer, but without much success. It also gives a lot of warnings about potentially uninitialized variables in the primitives. It does not like the primitiveFail/successFlag checks at all.
Best regards Stefan
On Tue, Sep 27, 2011 at 12:05 AM, Stefan Marr squeak@stefan-marr.de wrote:
Are there any tools that could be useful to find such things?
The only way I know to do this quickly is to get a reproducible case that runs to failure from start-up without user intervention and run the two VMs side-by-side on that case. If the bug shows itself when tracing is turned on then it can be relatively easy to find the point at which the two diverge and backtrack from there.
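As a rough sketch of the "find the point at which the two diverge" step, assuming each VM's trace has already been read into an array of lines (the function name and the trace format are assumptions for illustration, not something either VM provides):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the index of the first line at which two
   traces differ. If one trace is a strict prefix of the other, return the
   length of the shorter one. If the traces are identical, return -1. */
long first_divergence(const char *const *good, size_t good_len,
                      const char *const *bad, size_t bad_len)
{
    size_t shorter = good_len < bad_len ? good_len : bad_len;
    for (size_t i = 0; i < shorter; i++) {
        if (strcmp(good[i], bad[i]) != 0)
            return (long)i;
    }
    return good_len == bad_len ? -1 : (long)shorter;
}
```

The returned index is where to start backtracking: everything before it was executed identically by the working and the failing build.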
I tried the Clang static analyzer, but without much success.
It also gives a lot of warnings about potentially uninitialized variables in the primitives. It does not like the primitiveFail/successFlag checks at all.
:)
Best regards Stefan
-- Stefan Marr Software Languages Lab Vrije Universiteit Brussel Pleinlaan 2 / B-1050 Brussels / Belgium http://soft.vub.ac.be/~smarr Phone: +32 2 629 2974 Fax: +32 2 629 3525