John M McIntosh wrote:
On Jul 14, 2007, at 2:48 PM, Andreas Raab wrote:
We've been running VMs with these changes[*] and they have fixed the Linux problems that we had. As an aside, one thing that we ran into (and that I just fixed a couple of days ago
So where are these changes? I was not clear on your comment about server versus desktop and how the issue is triggered. Do dual-processor Intel desktop machines count as server machines?
Can't say for sure. Only that our server's MTBF was somewhere between 24 and 48 hours because of that problem. After deploying the fix we've been going for three days straight with no problems (fingers crossed). If we can make it to a week or so I'll post the changes, since deploying them on such short notice was a somewhat desperate measure due to heavy customer complaints.
If you want to look at some code, the problematic places are pretty obvious: a process running Delay>>schedule, Delay>>unschedule, or Delay>>activate can be terminated while it is updating Delay-internal structures. When that happens, the result is a total system lockup since Delay resources are globally shared. Also, note that these operations run at the client's priority, which makes it quite possible for them to be preempted by a higher priority process and cause other problems. For example, consider a low priority process holding the Delay lock and a medium priority process sitting in a tight loop for some reason: the low priority process never gets to run and release the lock, so the entire system locks up because the timer interrupt watcher can't enter the semaphore. I have a couple of stack traces showing these and related problems.
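To make the first failure mode concrete, here is a rough Python analogy (not Squeak code, and not the actual fix): a holder that is "terminated" between acquiring and releasing a shared lock leaves it held forever, while a holder whose release is guaranteed on unwind does not. Every name in it (delay_lock, Terminated, schedule_unprotected, schedule_protected, run_in_thread, lock_is_free) is made up for illustration, and an exception stands in for process termination.

```python
import threading

# Stands in for the globally shared Delay lock (hypothetical name).
delay_lock = threading.Lock()


class Terminated(Exception):
    """Stands in for a process being terminated in the middle of an update."""


def schedule_unprotected():
    # Acquire the lock, then get "terminated" before the release is reached.
    delay_lock.acquire()
    raise Terminated("killed while updating shared state")
    # A matching delay_lock.release() would go here, but is never reached.


def schedule_protected():
    # The with-statement releases the lock even when the body raises.
    with delay_lock:
        raise Terminated("killed while updating shared state")


def run_in_thread(fn):
    # Run fn in its own thread and swallow the simulated termination.
    def target():
        try:
            fn()
        except Terminated:
            pass

    worker = threading.Thread(target=target)
    worker.start()
    worker.join()


def lock_is_free(timeout=0.2):
    # Try to take the lock the way any later client would have to.
    acquired = delay_lock.acquire(timeout=timeout)
    if acquired:
        delay_lock.release()
    return acquired


if __name__ == "__main__":
    run_in_thread(schedule_protected)
    print("free after protected holder died:", lock_is_free())
    run_in_thread(schedule_unprotected)
    print("free after unprotected holder died:", lock_is_free())
```

The unprotected variant is run last on purpose, because once it has run nothing can take the lock again. The sketch does not model the second failure mode (a medium priority loop starving the low priority lock holder), since Python threads have no priorities.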
The one saving grace for us was to have USR1 generate a full stack dump of all processes for forensic purposes. Without that, we'd be using Java on the servers by now (no kidding; this is still an option and depends largely on whether we can make Squeak reliable enough as a server).
Cheers, - Andreas