from preamble:
"Change Set: WeakGCFix-wbk
Date: 21 March 2004
Author: Bryce Kampjes
Weak references can leave dangling pointers if the weak object is in old space and it points to a new space object. The problem occurs when the new space object is incrementally collected but the old space weak object remains. The weak object now has a dangling pointer to where the new space object was.
This change set adds a little code to ObjectMemory>>sweepPhase so that it will finalise any weak rootTable objects.
There is still the chance that an interpreter variable may contain a pointer to a weak object. However interpreter variables have very specific uses. I'm not sure if they should be checked for weak references.
"!
This bug can, and does, cause VM crashes. An example crash output is included below. All the examples I've seen have involved the object dependency mechanism.
There is the possibility that other root pointers also need to be checked for weak objects. I don't know whether any of the interpreter's variables could contain weak objects; if they could, they could also cause these crashes.
Recursive not understood error encountered
1094730556 WeakKeyAssociation>key
1094730464 WeakIdentityKeyDictionary>scanFor:
1094730372 Set>findElementOrNil:
1094730280 Dictionary>at:ifAbsent:
1094730096 Object>myDependents
1094730004 Object>dependents
1094729912 Object>changed:
1094729820 Object>sunitChanged:
1094704936 [] in TestSuite>run:
Bryce
bryce@kampjes.demon.co.uk writes:
> from preamble:
> "Change Set: WeakGCFix-wbk
> Date: 21 March 2004
> Author: Bryce Kampjes
> Weak references can leave dangling pointers if the weak object is in old space and it points to a new space object. The problem occurs when the new space object is incrementally collected but the old space weak object remains. The weak object now has a dangling pointer to where the new space object was.
> This change set adds a little code to ObjectMemory>>sweepPhase so that it will finalise any weak rootTable objects.
> There is still the chance that an interpreter variable may contain a pointer to a weak object. However interpreter variables have very specific uses. I'm not sure if they should be checked for weak references.
> "!
> This bug can, and does, cause VM crashes. An example crash output is included below. All the examples I've seen have involved the object dependency mechanism.
Hm ... actually the log is pretty interesting. WeakKeyAssociation is a class which specifically prevents creating "old weak references" by:
	WeakKeyAssociation>>key: aKey
		key := WeakArray with: aKey
So the key is already there when the weak array is created, and since Squeak allocates memory monotonically it means that the WeakArray will come "after" the object it points to in memory. And since Squeak tenures equally monotonically, it will have the key tenured before the weak reference. In other words, by the way WeakKeyAssociation works it is impossible[*] to create the situation you were describing (reference being a root and object pointed to young).
[*] Even if #become: is used since it does the equivalent of a fullGC, implicitly tenuring the entire object space.
Cheers, - Andreas
When running Exupery's test suite it crashes every few runs. With a bad (freshly created) image it can crash every single time that TestExuperyPlugin>>testBasicRememberSetsBothOld is run. When the image is first created the other remembered set tests also cause it to crash. Images get more reliable with time.
With my fix I've run that test for well over a minute continuously without a single crash. That test takes a third of a second so that's over 200 times successfully.
Interestingly, the first time I run that method with the fixed VM it crashes. If I run any other test first, it is then stable.
I can easily reproduce these crashes with the Exupery development image when running Exupery's test suite. The reason I'm investigating is that a new image I created was crashing every time I ran that test. My older development image only crashes every couple of test runs. The problem is subtle: add a few extra expressions and it will go away.
For me, for now, that fix is enough to be able to produce an image to go with the next Exupery release. Without that fix, a newly created image will crash every time the test suite is run. If the troublesome methods are commented out and the test suite is run a few times then those tests will pass sometimes, but they still crash the VM every few runs. It would be nice to get a proper fix though.
Bryce
Hi Bryce
could you let us know the status of exupery? Do you know if people continued to work on the JIT in ST?
Stef
stéphane ducasse writes:
> Hi Bryce
> could you let us know the status of exupery? Do you know if people continued to work on the JIT in ST?
Hi Stef,
The status is this: the 0.03 release is nearly done, all I need to do is 1) decide what to do about this possible VM bug and 2) build some sars. The code is already on SqueakSource.
The 0.03 release is hopefully the last release that doesn't add any noticeable improvements. This one merely compiles all but a few bytecodes, leaving just blocks, extended bytecodes, and stack duplicate. Blocks are left because they look similar to the multiple method problem. Extended bytecodes and stack duplicate are left because I'm lazy and they don't seem to add much compared to the limit of only a single compiled method.
The next release will have one major new feature: the ability to have more than one compiled method at a time. This is basically a VM modification to support multiple compiled methods. I've already modified the VM to handle a single compiled method.
Handling multiple methods involves two distinct problems. First, calling multiple compiled methods, which really amounts to a hash table lookup similar to normal method lookup. Second, recognising a compiled method when returning into it after it has called another method. It's the second case that is hard. A simple solution would involve creating a subclass of MethodContext for compiled methods that contains a pointer to the compiled code, encoded as a SmallInteger to avoid GC nastiness. It gets a little more complex when the code cache gets flushed or an image gets saved and reloaded, potentially leaving dangling pointers into the code cache. These cases are solvable but need a little thought and probably a little discussion, which is why that work is bundled into its own release.
I don't know of any progress with the JIT in ST. There may have been some but not that I've heard of.
Oh, the best way to follow my progress is through my blog on SqueakPeople.
Bryce
P.S. 0.03 should be out in the next week or so. Not tonight, not with work in the morning. It really is just an exercise in getting working sars onto SqueakMap. I need sars to hold the .c files needed for some of the older tests. Hmm, a sar that has a Monticello package in it: that is the problem. I haven't thought about solving it yet, but I suspect it's easy.
Hi Bryce,
> When running Exupery's test suite it crashes every few runs. With a bad (freshly created) image it can crash every single time that TestExuperyPlugin>>testBasicRememberSetsBothOld is run. When the image is first created the other remembered set tests also cause it to crash. Images get more reliable with time.
Hearing the word "plugin" and considering that I am entirely unable to reproduce your example (the test is still running and absolutely nothing bad happened during -now- 3000 loops) makes me assume that there's something in the plugin which has a GC problem, e.g., a GC occurring in a place where you didn't remap some oop or other.
> Interestingly, the first time I run that method with the fixed VM it crashes. If I run any other test first, it is then stable.
I think that actually proves my point.
> I can easily reproduce these crashes with the Exupery development image when running Exupery's test suite. The reason I'm investigating is that a new image I created was crashing every time I ran that test. My older development image only crashes every couple of test runs. The problem is subtle: add a few extra expressions and it will go away.
Yes that too - it's precisely the kind of thing that happens if a GC hits you in a place you haven't thought about it. Let me tell you a rather nasty technique to debug (or rather: stress test) these problems:
If you look at "Smalltalk vmParameters" you will find one which defines the "allocations between GCs", i.e., how many allocations happen before an IGC takes place. If you set this to zero (or one? might try both) the system will run an IGC *every single time* an allocation happens. This is slow (it means about 1-2 ms for every allocated object) but it's the surest way I know to find out if there's a problem - if there is, the system will crash almost instantly.
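A minimal workspace sketch of that stress test. It assumes the "allocations between GCs" setting is VM parameter 5 (that is how the comment of SystemDictionary>>vmParameterAt: lists it in images of this vintage, but check your own image) and that the test named earlier in the thread is loaded. Treat it as an illustration, not as the exact procedure anyone in the thread ran:

```smalltalk
"Sketch only: parameter index 5 is an assumption, verify it in your image."
| saved |
saved := Smalltalk vmParameterAt: 5.	"remember the current setting"
[Smalltalk vmParameterAt: 5 put: 1.	"ask for an IGC on (nearly) every allocation; 0 is worth trying too"
 (TestExuperyPlugin selector: #testBasicRememberSetsBothOld) run]
	ensure: [Smalltalk vmParameterAt: 5 put: saved]	"always restore the old value"
```

The ensure: block matters here: if the VM survives the test but the parameter is left at 1, the image stays painfully slow.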
> For me, for now, that fix is enough to be able to produce an image to go with the next Exupery release. Without that fix, a newly created image will crash every time the test suite is run. If the troublesome methods are commented out and the test suite is run a few times then those tests will pass sometimes, but they still crash the VM every few runs. It would be nice to get a proper fix though.
I will vote against adding this fix unless you can provide evidence that such a crash is indeed possible on a stock VM/image pair. Like I said, from what I can see your fix is a no-op which would only slow down the garbage collector.
Cheers, - Andreas
Hi,
> Weak references can leave dangling pointers if the weak object is in old space and it points to a new space object.
Your observation troubles me greatly. This should not happen, and as far as I know it cannot happen. Do you have a reproducible example?
> The problem occurs when the new space object is incrementally collected but the old space weak object remains. The weak object now has a dangling pointer to where the new space object was.
This must not happen. Never. Not once. When a weak object becomes old it will be marked as "root" and root objects are always traced as if they are non-weak (the reason being that treating them weak would have complicated GC logic more than I liked).
The way this works is subtle however: When GC kicks in we trace the roots via #markAndTrace: and here, we have basically the same logic as in startObj *except* that we set lastField regardless of whether the root is weak or not (and lastField ultimately determines whether the GC logic finds an object via reference). Therefore, if a weak object is a root, it will be traced as if non-weak. You can test this by doing something like:
	weakRef := WeakValueAssociation new.
	Smalltalk garbageCollect. "make it old"
	weakRef value: Object new.
	Smalltalk garbageCollectMost. "incr. GC"
	weakRef value. "and it's still there"
As far as I can tell, your changes should have no effect whatsoever.
Troubled, - Andreas
Hello
Andreas Raab writes:
> Hi,
> > Weak references can leave dangling pointers if the weak object is in old space and it points to a new space object.
> Your observation troubles me greatly. This should not happen, and as far as I know it cannot happen. Do you have a reproducible example?
Try:
	weakRef := WeakValueAssociation new.
	Smalltalk garbageCollect. "make it old"
	TestClass gcTest: weakRef.
	Smalltalk garbageCollectMost. "incr. GC"
	weakRef value. "and it's now gone"
	TestClass class>>gcTest: weakRef
		weakRef value: Object new.
By moving the new object creation out into its own method we remove the risk of it getting captured by the context object.
Bryce
Hi Bryce,
> 	weakRef := WeakValueAssociation new.
> 	Smalltalk garbageCollect. "make it old"
> 	TestClass gcTest: weakRef.
> 	Smalltalk garbageCollectMost. "incr. GC"
> 	weakRef value. "and it's now gone"
> 	TestClass class>>gcTest: weakRef
> 		weakRef value: Object new.
> By moving the new object creation out into its own method we remove the risk of it getting captured by the context object.
I left this running in a loop to throw an error if weakRef's value is ever nil, and so far it hasn't happened in several hundred loops (it's slow because of the full GCs). This makes me wonder if you are using some kind of custom VM that you've built yourself - what platform, what VM *are* you on?
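For anyone who wants to repeat the experiment, a loop along these lines would do it. This is a sketch, not the code that was actually run: it assumes TestClass class>>gcTest: is defined as quoted above, and the iteration count of 500 is arbitrary.

```smalltalk
"Sketch only: assumes TestClass class>>gcTest: from the quoted message is defined."
| weakRef |
1 to: 500 do: [:i |
	weakRef := WeakValueAssociation new.
	Smalltalk garbageCollect.		"full GC, so weakRef ends up in old space"
	TestClass gcTest: weakRef.		"stores a freshly allocated (young) object as the value"
	Smalltalk garbageCollectMost.	"incremental GC"
	weakRef value isNil
		ifTrue: [self error: 'weak value was collected on iteration ', i printString]]
```

If the error never fires, the old weak object is being traced as a root as described; if it does fire, the iteration number gives a starting point for narrowing things down.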
Cheers, - Andreas
Hi Andreas, First I agree with you. My fix should NOT be included in the main VM/image.
It does however unfortunately fix my personal problems. When they first appeared about three months ago I fixed them by compiling the VM without optimisation. That this helped is unfortunately a bad sign, it indicates that the difference between working and non-working code is in the area where the compiler has a right to choose, or a bug in the compiler which is very unlikely, but so is a bug in the interpreter/garbage collector.
It is unfortunate that it fixes my problems but that isolates my problems a lot which is very fortunate. Yes, I'm working with a custom VM. But I managed to run my version of your test and produce a nil even with a stock VM but not a stock image. Thinking about it I don't think my version (with the message send) should be different to yours. Weird.
That my fix does fix my problems however does isolate it. It's something that can stop a root weak object from being collected. That implies that the mark bit is set, and yes that bit should not be set. I very much doubt that my code is setting that bit; that would involve it producing an otherwise good header word with a bad mark bit, which is highly unlikely.
Only in one place do I deal with headers, and that is the part of the test suite that crashes. However the test that crashes does nothing, and I've stepped through the machine code, instruction by instruction, multiple times over a three month period to know this. Actually that specific test verifies that Exupery is not adding anything to the root table when both objects in an assignment are old. Unfortunately, this also removes the chance of unexpected GCs because a call instruction is definitely noticeable.
So my situation is this: I have a bug that is possibly caused by the garbage collector and I have a fix that works. Unfortunately the fix works for the wrong reasons, which is at least enlightening, especially with your help. I can continue working using my fix but that leaves the real bug undiscovered. I can also spend more time chasing a better fix. Given that my fix fixes my problem it really isolates the kind of issue, which is not the sort of thing that my VM modifications could do, especially as I've single stepped through the machine code I'm running.
Currently, I feel that I should release the next Exupery version with a Linux VM that includes my fix. See how that VM works in real use rather than just under explicit testing for a few weeks. Hope inspiration strikes, or (more likely) a better time comes to chase this bug further. Releasing a modified VM is necessary to let people play with it without needing to compile the VM themselves.
Oh, the stock VM was a 3.4-2 Linux VM from Ian's site. My compiled VMs were modified versions built from Ned's SourceForge VM branch with the latest version of VMMaker.
Exupery does involve a few VM modifications to run. First, it needs to get the addresses of various VM variables for code generation. Second, it needs to modify the message sending code so it can override methods with compiled code. This is why until I had that fix I assumed the bug was due to my code. However to test rootTable updating I do run global collects frequently, this produces the bug that I see. I run identical code elsewhere without the garbage collect when testing the assignment which does not crash. The test that causes the crash does not update the rootTable, I've checked both by reading the assembly generated and also by single stepping through the machine code while watching the contents of the rootTable (only four entries in this case).
If there is interest, I'm happy to chase this further now. If it isn't impacting anybody else then I'll leave it until a better time. A better time would be when working with the Exupery/VM integration which is the guts of the next release. Or on things that involve GC interaction such as inlining code where type tests need types which are objects which the garbage collector can move.
Bryce
Hi Bryce,
> That my fix does fix my problems however does isolate it. It's something that can stop a root weak object from being collected. That implies that the mark bit is set, and yes that bit should not be set. I very much doubt that my code is setting that bit; that would involve it producing an otherwise good header word with a bad mark bit, which is highly unlikely.
There's a simple way to find out - just scan all the objects right *before* GC to see if any of them have the mark bit set before GC starts. If that happens you know you're dead in the water and having an IGC triggered at various places would allow you to pin-point where it happens.
> So my situation is this: I have a bug that is possibly caused by the garbage collector and I have a fix that works. Unfortunately the fix works for the wrong reasons, which is at least enlightening, especially with your help. I can continue working using my fix but that leaves the real bug undiscovered. I can also spend more time chasing a better fix. Given that my fix fixes my problem it really isolates the kind of issue, which is not the sort of thing that my VM modifications could do, especially as I've single stepped through the machine code I'm running.
It is *always* the case for GC problems that they show up in completely unrelated places. *ALWAYS*! Don't waste your time investigating that particular place which just happens to trigger a GC. Run IGCs at every allocation! Add sanity checks! The only thing you can say for sure is that the problem occurred "some time before" the GC was triggered.
> Currently, I feel that I should release the next Exupery version with a Linux VM that includes my fix. See how that VM works in real use rather than just under explicit testing for a few weeks.
That's entirely your choice - if you have faith that the fix you're using solves the problem, go for it. Though, I have to admit that having chased GC bugs before it is a dangerous assumption to have a fix where you don't understand why it works.
> Exupery does involve a few VM modifications to run. First, it needs to get the addresses of various VM variables for code generation. Second, it needs to modify the message sending code so it can override methods with compiled code. This is why until I had that fix I assumed the bug was due to my code.
I am still convinced it is in your code ;-)
> However to test rootTable updating I do run global collects frequently, this produces the bug that I see. I run identical code elsewhere without the garbage collect when testing the assignment which does not crash. The test that causes the crash does not update the rootTable, I've checked both by reading the assembly generated and also by single stepping through the machine code while watching the contents of the rootTable (only four entries in this case).
That doesn't mean anything. The only thing you can say for sure is that it happened some time before the GC. The code you're looking at might be *completely* unrelated.
> If there is interest, I'm happy to chase this further now. If it isn't impacting anybody else then I'll leave it until a better time. A better time would be when working with the Exupery/VM integration which is the guts of the next release. Or on things that involve GC interaction such as inlining code where type tests need types which are objects which the garbage collector can move.
That's entirely your decision to make.
Cheers, - Andreas
When executing the code below, I expected both values to be garbage collected, yet 's2' apparently isn't. Other attempts were using 'Smalltalk garbageCollect' and adding 's2' from another method (to make sure the block context wasn't preventing garbage collection). Any ideas?
	| weakDict |
	weakDict := WeakValueDictionary new.
	weakDict at: 1 put: Object new.
	weakDict at: 2 put: 's2'.
	Smalltalk garbageCollectMost.
	(weakDict at: 1) = nil ifFalse: ['at 1 not nil' inspect].
	(weakDict at: 2) = nil ifFalse: ['at 2 not nil' inspect]. "=> shows an inspector"
Thanks, Brian.
The literal string is part of the compiled method. Try:
weakDict at: 2 put: 's2' copy
instead. The same is true for other literals.
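Put together, the corrected experiment might look like the sketch below. It is an illustration only: it uses at:ifAbsent: because, depending on the image version, a WeakValueDictionary may treat a collected value as a missing key rather than answering nil, and it uses a full garbageCollect to be on the safe side.

```smalltalk
"Sketch only: whether the entries are actually reclaimed depends on the VM and image."
| weakDict |
weakDict := WeakValueDictionary new.
weakDict at: 1 put: Object new.
weakDict at: 2 put: 's2' copy.	"a fresh String, no longer the literal held by the compiled method"
Smalltalk garbageCollect.
{weakDict at: 1 ifAbsent: [nil]. weakDict at: 2 ifAbsent: [nil]}	"print it; both slots should now be nil"
```

The copy is referenced only by the weak dictionary, so nothing strong keeps it alive, whereas the original literal stays reachable for as long as the method that contains it does.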
Cheers, - Andreas
----- Original Message -----
From: "Brian Murphy-Dye" brian.murphydye@mac.com
To: "The general-purpose Squeak developers list" squeak-dev@lists.squeakfoundation.org
Sent: Monday, August 02, 2004 7:57 PM
Subject: WeakValueDictionary
> When executing the code below, I expected both values to be garbage collected, yet 's2' apparently isn't. Other attempts were using 'Smalltalk garbageCollect' and adding 's2' from another method (to make sure the block context wasn't preventing garbage collection). Any ideas?
> 	| weakDict |
> 	weakDict := WeakValueDictionary new.
> 	weakDict at: 1 put: Object new.
> 	weakDict at: 2 put: 's2'.
> 	Smalltalk garbageCollectMost.
> 	(weakDict at: 1) = nil ifFalse: ['at 1 not nil' inspect].
> 	(weakDict at: 2) = nil ifFalse: ['at 2 not nil' inspect]. "=> shows an inspector"
> Thanks, Brian.
Obvious when you point it out =o) Thanks.
On Aug 2, 2004, at 11:26 PM, Andreas Raab wrote:
> The literal string is part of the compiled method. Try:
> 	weakDict at: 2 put: 's2' copy
> instead. The same is true for other literals.
> Cheers,
> - Andreas