Chris Muller uploaded a new version of Kernel to project The Trunk: http://source.squeak.org/trunk/Kernel-cmm.1562.mcz
==================== Summary ====================
Name: Kernel-cmm.1562
Author: cmm
Time: 24 March 2024, 6:27:34.572036 pm
UUID: 886bf6b9-a56b-4131-accf-3867a3836658
Ancestors: Kernel-ct.1561
Allow processPreemptionYields: to persist across image saves, as purported by its comment.
=============== Diff against Kernel-ct.1561 ===============
Item was changed:
  ----- Method: ProcessorScheduler class>>startUp: (in category 'background process') -----
  startUp: resuming
- 	Smalltalk processPreemptionYields ifTrue: [
- 		Smalltalk processPreemptionYields: false].
  	resuming ifTrue: [
  		Smalltalk installLowSpaceWatcher.
  		self installIdleProcess].!
Ah, this was "we have a new VM feature" migration code. We no longer need it, I suppose.
Can we make sure that the ReleaseBuilder sets this to "false" by default? :-)
Best, Marcel
Hi Marcel,
true means Processes running at the same priority are given fair time, round-robin style, as higher-priority processes return or go back to sleep. This code forcing it to false appears to have been the root cause of the recent issues with source.squeak.org's responsiveness. Clients were timing out because one big request would cause a backlog of 10 small requests to queue up behind it and, often, time out. Simply setting this true in a startup patch script seems to have fixed it.
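(The patch amounts to a single line; this is a sketch rather than the actual script, which isn't shown here:)

    "Run from a startup script to restore round-robin scheduling
    among equal-priority Processes."
    Smalltalk processPreemptionYields: true.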
There are no good explanations in the code for false. The only one I can remember from this mailing list was that you can more easily get away with not protecting your code with Semaphores, and instead depend on the known process-switching behavior of the VM. Perhaps for ultra-high-performance applications? It feels application-specific to me. According to the comment, it also diverges from the Blue Book specification.
I don't mean to be disagreeable about the default, but true seems like the setting for smoother multitasking and more in the spirit of Squeak's liveness. What is the rationale for false?
Best, Chris
I guess we switched the default (Cog supports both, some other VMs do not) at some point without implementing an image-level scheduler? This would be better than the accidental round-robin behavior we get from the random interruptions by higher-priority processes (it kinda works but is far from "fair").
Here's the original discussion: https://lists.squeakfoundation.org/archives/list/squeak-dev@lists.squeakfoun...
Igor wrote about such a scheduler too: https://computeradventures.wordpress.com/2009/05/04/new-schedulingsignals-fo...
Vanessa
Hi Vanessa,
Here's the original discussion:
https://lists.squeakfoundation.org/archives/list/squeak-dev@lists.squeakfoun...
That's the one!
Wow... I wrote almost the exact same thing 14 years ago! Hilarious. Thanks for sharing that. :)
I guess we switched the default (Cog supports both, some other VMs do not) at some point without implementing an image-level scheduler? This would be better than the accidental round-robin behavior we get from the random interruptions by higher-priority processes (it kinda works but is far from "fair").
I was only being descriptive, not making any judgement about one being "better" or worse (except in terms of the default setting). I have no doubt that first-forked, first-to-finish scheduling with explicit yields is a desirable way to do same-priority Process scheduling in some cases.
Having said that, Squeak's networking and file operations create a *natural* switching that optimizes throughput by letting the VM process run as much in parallel with those OS-level things as possible. With processPreemptionYields set false, since those operations don't have explicit #yields, all of your same-level Processes are sitting ducks waiting for the front one's #waitForSendDone to finish over its 300-baud connection. Processes pile up, and the CPU is idle instead of working on them and getting THEIR client responses on their way (you want Squeak to make the OS do as much as possible in as little time as possible; an idle CPU is antithetical to that). It ends up being like serial execution without Processes at all in the first place. If I'm wrong, and it IS switching between those equal-priority processes in those circumstances, then that would seem to compromise the theoretical benefit of explicit cooperative scheduling. I hope I will be corrected and learn something, but it seems like it's either going to be slow, or implicit switching anyway.
Best, Chris
On Mon, Mar 25, 2024 at 9:11 PM Chris Muller ma.chris.m@gmail.com wrote:
With processPreemptionYields set false, since those operations don't have explicit #yields, all of your same-level Processes are sitting ducks waiting for the front one's #waitForSendDone to finish over its 300-baud connection.
That does not sound right. In particular, #waitForSendDone: waits on a semaphore, which resumes the next runnable process at the same priority. It would only be true if some process was busy-waiting without a semaphore or yielding. That would obviously prevent other processes from running, but I'm reasonably sure we got rid of those busy loops. Those were common back in the days of MVC, but even Morphic's main loop (a.k.a. the UI Process) yields.
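A minimal illustration of that, using only the standard Semaphore and forkAt: protocol (the Transcript output is just for observability):

    | sem |
    sem := Semaphore new.
    [Transcript show: 'A waits'; cr.
     sem wait. "suspends A; the next runnable process at this priority runs"
     Transcript show: 'A resumed'; cr] forkAt: Processor userBackgroundPriority.
    [Transcript show: 'B runs while A waits'; cr.
     sem signal] forkAt: Processor userBackgroundPriority.

B gets the CPU as soon as A blocks, with processPreemptionYields true or false.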
Didn't Dave say it was RFB causing the issues? Maybe that one has a busy-loop.
Vanessa
That does not sound right. In particular, #waitForSendDone: waits on a semaphore, which resumes the next runnable process at the same priority. It would only be true if some process was busy-waiting without a semaphore or yielding. That would obviously prevent other processes from running, but I'm reasonably sure we got rid of those busy loops. Those were common back in the days of MVC, but even Morphic's main loop (a.k.a. the UI Process) yields.
If you're right, then even with the setting false, "cooperative scheduling" still can't be sculpted merely out of a set of strategically placed #yields. Without knowing and carefully avoiding all the external semaphores, the difference between this setting being on vs. off seems very, very subtle. I'm still skeptical due to my observation of the immediate improvement that flipping it true provided in the source.squeak.org server image.
Didn't Dave say it was RFB causing the issues? Maybe that one has a busy-loop.
The cause of the slowness issues hasn't been definitively identified. RFB is a suspect, but I didn't see anything suspicious in the one he reverted from, and source.squeak.org is still running that version just fine. squeaksource.com is running with processPreemptionYields false. Both servers are currently running fine. Mysterious.
- Chris
On Mon, Mar 25, 2024 at 10:37 PM Chris Muller ma.chris.m@gmail.com wrote:
If you're right, then even with the setting false, "cooperative scheduling" still can't be sculpted merely out of a set of strategically placed #yields. Without knowing and carefully avoiding all the external semaphores, the difference between this setting being on vs. off seems very, very subtle. I'm still skeptical due to my observation of the immediate improvement that flipping it true provided in the source.squeak.org server image.
I mean it's true that if one request causes a lot of processing then all others would have to wait until it's ready to send something and waits. Our file I/O is mostly synchronous, so anything except for network stuff might starve other processes. In a way it's surprising Squeak works as a server at all, because it definitely wasn't designed for that.
I think we do not want accidental round-robin, because it is tricky to debug. Process-preemption should not yield. Then, one could experiment with a time-slicing Image-Level scheduler. But there is no actual need in general, I think.
Also, Morphic got more robust since we disabled this accidental round-robin. ☺️
Best, Marcel
I think there should be an actual round-robin scheduler that can be enabled in images that need it. It would control scheduling of the processes at its priority - 1.
It could be as simple as forcing those processes to yield at regular intervals (which pretty much was the old behavior caused accidentally by the timer process).
Much better would be a fair scheduler. Measuring how much time each process got might be tricky to do efficiently. But I'm sure someone will have a good idea. (I also remember the Linux scheduler wars ... interesting stuff).
Vanessa
On 2024-03-26 02:57, Vanessa Freudenberg wrote:
Didn't Dave say it was RFB causing the issues? Maybe that one has a busy-loop.
I was wrong. RFB is not the cause of the problems that I was seeing. The squeaksource.com service has been back to running on the latest RFB version with no further problems.
Dave
Hi Chris --
I don't mean to be disagreeable about the default, but true seems like the setting for smoother multitasking and more in the spirit of Squeak's liveness. What is the rationale for false?
"false" is more robust for forked user Morphic scripts and UI priority. Thus, "false" means more liveness and less debuggers. "true" means less control and potentially more debuggers. :-)
High-priority, blocking processes cannot be fixed with such "scheduling trick". They must be treated on their own. Let's favor modularity, single-responsibility, and local knowledge to maintain this rather large system. :-)
Best, Marcel
Hi Chris,
On Mon, Mar 25, 2024 at 4:30 PM Chris Muller asqueaker@gmail.com wrote:
Hi Marcel,
true means Processes running at the same priority are given fair time, round-robin style, as higher-priority processes return or go back to sleep.
This violates the real-time semantics of the scheduler, which are that
- processes are only preempted by higher-priority processes
- processes of the same priority are cooperative; another process of the same priority will only run if the currently running process at that priority yields or waits on a semaphore, etc.
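A workspace doodle makes the difference concrete (assuming it is run from the UI process with nothing else competing at that priority):

    | trace p |
    trace := OrderedCollection new.
    p := Processor activePriority.
    [trace add: #same] forkAt: p. "queued behind the current process"
    [trace add: #higher] forkAt: p + 1. "higher priority: preempts immediately"
    trace add: #main.
    Processor yield. "let the equal-priority fork run"
    trace
    "processPreemptionYields false: (higher main same) - being preempted does not requeue us.
     processPreemptionYields true: (higher same main) - preemption acts like an implicit yield."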
This code forcing it to false appears to have been the root cause of the recent issues with source.squeak.org's responsiveness.
That makes sense, but the issue is that the server is relying on ill-defined behaviour, and should instead implement its own scheduler amongst the time hogs.
For example, the server could introduce a high-priority process that wakes up periodically (looping on a delay). On waking up it inspects the current set of running request serving processes, and if the current one has taken more than its time-slice the supervisor process can send yield to it, moving it to the back of the queue.
In this way one can implement time slicing for a subset of processes while allowing the rest of the system to provide sound concurrent code with a minimum of synchronisation. See below.
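A minimal sketch of such a supervisor, under assumed names: currentRequestProcess is a placeholder the server's dispatch code would have to maintain, and suspend/resume stands in for "send yield to it", since resume appends a process to the back of its priority's run list.

    | quantum currentRequestProcess supervisor |
    quantum := 50. "time slice in milliseconds"
    currentRequestProcess := nil. "set by the server when a request starts"
    supervisor := [[true] whileTrue: [
        (Delay forMilliseconds: quantum) wait.
        "naive: assumes the hog is runnable; suspending a process that is
        blocked on a semaphore would break its wait"
        currentRequestProcess ifNotNil: [:proc | proc suspend; resume]]] newProcess.
    supervisor priority: Processor highIOPriority. "above the request servers"
    supervisor resume.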
Clients were timing out because one big request would cause a backlog of 10 small requests to queue up behind it and, often, time out. Simply setting this true in a startup patch script seems to have fixed it.
yes, but it's a hack fix that breaks the rest of the system. See below.
There are no good explanations in the code for false.
There are *excellent* reasons for false as the default. And other Smalltalk implementations (especially VisualWorks) have been modified to work this way. The old behaviour (Smalltalk processPreemptionYields == true) *IS A BUG*!! And I shout for good reason. When the scheduler does not preempt (and note from above that *does not* prevent a subsystem from implementing preemption/time-slicing, etc, amongst a subset of processes), then a set of processes can safely maintain concurrent access to resources, and any other manner of cooperative behaviour, simply by using yield and wait. When Smalltalk processPreemptionYields == true, however, any code can be preempted at arbitrary times, and no safety guarantees can be provided. There is only a fig leaf of a real-time cooperative scheduler, not a reality.
Hence the solution here is *to fix the squeaksource server*, *not* to break the carefully thought out fix to the scheduling semantics that Smalltalk processPreemptionYields == false provides!!!!
This was discussed in some detail (obviously insufficient) years ago when the change was introduced to the Cog VM.
The only one I can remember from this mailing list was that you can more easily get away with not protecting your code with Semaphores, and instead depend on the known process-switching behavior of the VM. Perhaps for ultra-high-performance applications? It feels application-specific to me. According to the comment, it also diverges from the Blue Book specification.
It's not necessarily to do with ultra-high-performance. It's much more to do with comprehensibility and predictability. A scheduler that provides preemption across priorities and round-robin cooperative scheduling within a priority has very desirable properties that make possible very simple implementation of robust concurrent code.
I don't mean to be disagreeable about the default, but true seems like the setting for smoother multitasking and more in the spirit of Squeak's liveness.
Wrong, wrong, wrong, and wrong again. Smalltalk processPreemptionYields == true is a horrible bug.
What is the rationale for false?
See "*excellent* reasons" above.
-- _,,,^..^,,,_ best, Eliot
+1 (maybe even +100, it's really not controversial)
Please continue in the "Process scheduling" thread. Someone needs to implement the scheduler Eliot describes. I did link to Igor's blog post about that. Maybe that helps?
Vanessa
Hi everyone, :)
First, interesting discussion, thanks for your passionate and stimulating engagement. I'm reminded of our past debates where I learned a lot over the years challenging your positions, including when you challenged mine.
Given how dug in you seem to be, I don't really see an economic path to consensus this time. However, I do hope you'll oblige me with another round of exploring the question to see if we can glean any final tidbits. I perceive your motivation to be to provide Squeak with the most flexibility out of the box with regard to supporting multiple, alternative concurrency disciplines like cooperative scheduling. I respect that, even though I don't think processPreemptionYields (PPY) false achieves it, my own view being motivated by "utility visible from the outside looking in at the Scheduler". That's why, when I tried to read Igor's blog, my eyes glazed over: it's written from the view inside the Scheduler looking out.
My goal now is just to alleviate _new_ concerns your last points have just raised in me. It was mentioned how the developer can / should write their own Scheduler to handle the same-priority Process selection however they want. My first question I hope you will help me with is, "If I wrote such a custom scheduler, but made it behave exactly like the current one does when PPY is set true by choosing the longest-suspended Process, would there be something fundamentally wrong or unsafe about that?"
Your answer to this should also help me figure out whether you're simply insisting that "the *default setting* MUST be false", OR saying, "a Scheduler that chooses the longest-waiting Process within the priority group should never be made, because to do that is fundamentally a bug".
Finally, with respect to the idea:
For example, the server could introduce a high-priority process that wakes up periodically (looping on a delay). On waking up it inspects the current set of running request serving processes, and if the current one has taken more than its time-slice the supervisor process can send yield to it, moving it to the back of the queue.
My final question is: "What are the possible goals behind the design of the Process-selection algorithm?" Your suggestion that "the current one has taken more than its time-slice" suggests this may be where you glossed over the details. The algorithm has to be ready to face the possibility that all the Processes within the priority group are "overdue", so it's a criterion of *order*, not explicit selection. And by pretending there is such a thing as a "current" Process within each priority group, you're *attributing additional artificial "priority"* to a single Process within each group even though its stated #priority is no higher than any of the others in its group. Show me one place in Squeak that mentions the notion of a "current" Process within a priority group. I would love to be proven wrong, but it doesn't exist.
So if the most *generic* responsibility of the algorithm is, "what is the BEST Process instance, generically, within a priority-group to run next, relative to each other," then at the top of my list are:
*selection algorithm:* resume the longest-suspended Process within the priority group. *rationale:* optimize for a smoother response-time curve amongst multiple clients.
*selection algorithm:* resume the shortest-suspended Process within the priority group. *rationale:* optimize for ??? (<--- anyone?)
*selection algorithm:* resume a Process within the priority group at random. *rationale:* optimize for ???
So what other possible orders are there, and which one is the best choice for generic behavior?
I know time is tight. For me too. I hope these questions provoked some useful thoughts. Thanks for any clarifications.
Best Regards, Chris
Writing a good scheduler is hard. I didn’t mention the Scheduler Wars for nothing.
Luckily, we have a much simpler problem than a multi-core machine with gazillions of processes. But we also have to admit that Squeak's default scheduling has atrocious behavior, whether preemption yields or not.
That’s because we have no actual scheduler that gives a fair share of time to all runnable processes. Instead, the VM simply gives 100% of CPU time to the runnable process with the highest priority. It will starve any other runnable process unless the high-priority process voluntarily gives up time. That’s “cooperative” scheduling, which was a bad idea in the 70s and unfortunately we never moved on (you may remember that the major reason Macs switched to Unix was scheduling).
An actual scheduler would give fair time to all runnable processes in the system. It would give more time to higher-priority processes because that’s why they were given a higher priority. But it would still make sure that within a time slice (Linux defaults to 100ms, at least in the old simple scheduler) every runnable process gets at least *some* time. A runaway process would not freeze the whole system.
Now, I haven’t thought of the full implications of that for Squeak yet. There is beauty in simplicity, and the cooperative scheduling we have is simple for sure. But it breaks down in cases where you need preemptive scheduling because the processes do not cooperate.
Vanessa
PS: here’s an article describing the simple scheduler Linux used for its first decade (before its O(n) runtime became a problem): https://dev.to/satorutakeuchi/a-brief-history-of-the-linux-kernel-s-process-...
On 2024-03-26, at 10:58 PM, Vanessa Freudenberg vanessa@codefrau.net wrote:
we have no actual scheduler that gives a fair share of time to all runnable processes.
I have to admit that I can't even work out what we really mean by "a fair share of time"! Obviously, what we *really* mean in any particular case is that "a fair share" equals "my processes get everything they want and to hell with the rest".
We do have a couple of relatively simple things almost ready to provide some improvement: a) asynchronous file access, which has had a prim plugin for... a couple of decades? AsynchFile is in the default image but has no users. b) sockety stuff, where it's almost defined that there will be an appreciable (in cpu terms) time waiting. I *think* the unix plugin does some stuff asynchronously, but not sure about higher-level prim calls.
Even just making that sort of thing trigger a yield event would surely improve matters for server-type usage?
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Useful random insult:- Living proof that nature does not abhor a vacuum.
On Wed, Mar 27, 2024 at 2:33 PM Tim Rowledge tim@rowledge.org wrote:
On 2024-03-26, at 10:58 PM, Vanessa Freudenberg vanessa@codefrau.net wrote:
we have no actual scheduler that gives a fair share of time to all runnable processes.
I have to admit that I can't even work out what we really mean by "a fair share of time"! Obviously, what we *really* mean in any particular case is that "a fair share" equals "my processes get everything they want and to hell with the rest"
LOL. Well Linux's CFS ("Completely Fair Scheduling") is around 10,000 lines of code ;)
But I'd say the minimum of fairness would be that all runnable processes should make progress all the time. How much time is given to each is debatable, except that higher priority processes should get relatively more time. And "all the time" means that every one of them should be serviced at least once every x milliseconds.
We do have a couple of relatively simple things almost ready to provide some improvement: a) asynchronous file access, which has had a prim plugin for... a couple of decades? AsynchFile is in the default image but has no users.
Using that is just a lot less convenient I guess, and we'd have to change a lot of code. But we could perhaps implement the current file prims in terms of the async prims, meaning that instead of waiting in the VM we'd wait in the image, so other processes could run in the meantime. That should improve performance dramatically for code that's I/O-heavy but distributed across multiple processes (e.g., loading multiple bitmap files for game graphics, or servicing multiple HTTP requests simultaneously).
b) sockety stuff, where it's almost defined that there will be an appreciable (in cpu terms) time waiting. I *think* the unix plugin does some stuff asynchronously, but not sure about higher level prim calls.
Socket code is pretty much async. Chris mentioned waitUntil... which uses semaphores the VM signals when the data is actually available/sent.
Even just making that sort of thing trigger a yield event would surely improve matters for server-type usage?
Not when the waiting happens in a primitive, as is the case with file access. We'd really need to use AsynchFile.
Vanessa
Just an observation that might, maybe, trigger some useful thoughts - I use Seaside quite a lot and I can't really think of any scheduling issues that have ever come up. I can be doing development work in a Seaside image that is actively serving pages without any problems getting annoying. I don't recall ever digging into Seaside scheduling or process fiddling, but perhaps a Clever Thing is being done?
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim A bug in the hand is better than one as yet undetected.
On 2024-03-27 23:50, Tim Rowledge wrote:
Just an observation that might, maybe, trigger some useful thoughts - I use Seaside quite a lot and I can't really think of any scheduling issues that have ever come up. I can be working doing development work in a Seaside image that is actively serving pages without any problems getting annoying. I don't recall ever digging into Seaside scheduling or process fiddling but perhaps a Clever Thing is being done?
My possibly useful thought is that your observation is right on target. Process scheduling is not going to be a noticeable problem unless the Process being scheduled is doing something unreasonably expensive.
In the case of the squeaksource.com issue, and possibly also the source.squeak.org image, the unreasonably expensive process may be the save-repository-to-serialized-object-file process. I think it is being triggered each time something changes in the repository, such as for example somebody uploading a new MCZ package. The repository object is very large (but 20 years ago it was very small) and it takes at least a minute to serialize it to disk now. Regardless of the scheduling priority in Squeak, this is going to take some time.
I noticed this on squeaksource.com because I was watching the load on dan.box.squeak.org, and right after I pushed an update to one of my projects the system got quite busy. I could see that the image was saving a data.obj copy of the repository after I did the update, and the CPU finally went back to normal a minute or two later when the save was complete.
Although source.squeak.org image works differently, I noticed in a previously saved copy of that image that had gone sluggish that the processes in the process browser seemed to be forked blocks waiting to do repository saves. I don't know if I interpreted this right but I can't help but think that the event-driven repository saves might be problematic.
Dave
Hi Dave, Hi All,
On Mar 27, 2024, at 6:28 PM, lewis@mail.msen.com wrote:
I noticed this on squeaksource.com because I was watching the load on dan.box.squeak.org, and right after I pushed an update to one of my projects the system got quite busy. I could see that the image was saving a data.obj copy of the repository after I did the update, and the CPU finally went back to normal a minute or two later when the save was complete. Although source.squeak.org image works differently, I noticed in a previously saved copy of that image that had gone sluggish that the processes in the process browser seemed to be forked blocks waiting to do repository saves. I don't know if I interpreted this right but I can't help but think that the event-driven repository saves might be problematic.
So introduce a queue for save requests and service them in a lower-priority process than the processes serving user requests. Interaction is with the in-image model. The lower-priority process doing saves can elide intervening saves if it gets behind, so the system saves as often as necessary when lightly loaded, and as often as possible while prioritizing user responsiveness when heavily loaded.
If the model needs to be locked while serialising then take a copy (which will be shallow at the leaves, deep in the branches, because only changeable data needs to be shallow copied), and serialise the copy. The copy operation should be much faster than the serialisation.
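A sketch of that scheme (saveRepository is a placeholder block standing in for the real copy-and-serialise code; SharedQueue and the priority names are standard Squeak):

    | saveRequests saveRepository saver |
    saveRequests := SharedQueue new.
    saveRepository := ["copy the model and serialise the copy to disk here"].
    saver := [[true] whileTrue: [
        saveRequests next. "block until at least one save is requested"
        [saveRequests nextOrNil notNil] whileTrue. "elide piled-up requests"
        saveRepository value]] newProcess.
    saver priority: Processor userBackgroundPriority - 1. "below request serving"
    saver resume.
    "producers just enqueue a token: saveRequests nextPut: #save"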
I’d happily collaborate on this but I need pointers to the code and instructions on how to interact with the running server.
Best, Eliot _,,,^..^,,,_ (phone)
Hi Eliot and Chris,
On 2024-03-28 15:25, Eliot Miranda wrote:
I'd happily collaborate on this but I need pointers to the code and instructions on how to interact with the running server.
Perhaps Chris can comment, I'm not actually familiar with that part of the SqueakSource code. In particular, I don't really know how and when the repository save events happen.
I'm definitely happy to help and to provide access to the squeaksource.com image (requires box-admin access to a login account though). I also have access to the source.squeak.org service, but would want to work through/with Chris on anything that gets done there.
Note that my observations were based on watching files being slowly written to disc while also watching /usr/bin/top. The activity also correlates with log messages written to the ss.log log file, so that's what made me suspect issues with the repository save mechanism.
Dave
Hi Dave,
I just downloaded squeaksource.8.image from dan and took a look. I see you abandoned the PersonalSqueakSource codebase back in Nov-2022. That's too bad. Part of what I'd hoped to accomplish with the renovation was not only a more responsive and resilient server, but also, through the relocation to /ss on source.squeak.org, to encourage your and the community's collaboration, so that we would eventually get to a point where questions like this:
I'd happily collaborate on this but I need pointers to the code and instructions on how to interact with the running server.
would be as universally known and natural as the Inbox process (although maybe that isn't saying much anyway). Your comment in the unmerge version (SqueakSource.sscom-dtl.1147) mentions merge issues and startup problems. I would've tried to help if you'd reached out. Perhaps we can learn and gain just as much by remaining forked and cherry-picking from each other whatever we deem most appropriate. I just noticed the performance improvement from Levente last September. Before, I'd dreamt that something like that would simply be committed to /ss by him, and that maybe it would send an email like /trunk and /inbox do. Then we admins could merge fixes into the servers whenever it was worthwhile to do so.
Note that my observations were based on watching files being slowly written to disc while also watching /usr/bin/top. The activity also correlates with log messages written to the ss.log log file, so that's what made me suspect issues with the repository save mechanism.
I don't think saving data.obj was or is related to the client slowness issues. Why? Because you're still rightly using SSFilesystem from PersonalSqueakSource (which is good!), which essentially does what Eliot described. It forks the save at Processor userBackgroundPriority - 1 (29), which is lower than client operations (30). And although there appears to be a bug that will cause other client save operations to be blocked during the long serialization process (see the attached fix for that, if you wish), *read* operations don't wait on any mutex, so they should remain completely unblocked. You'd still see 100% CPU during serialization, yes, but client responsiveness should still be fine because their priority-30 processes preempt the serialization process.
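For reference, the forking idiom looks like this; saveRepository here is a stand-in selector rather than the actual SSFilesystem method, and only the priority arithmetic matters:

    "Client requests run at userBackgroundPriority (30), so they preempt this process (29)."
    [SSFilesystem current saveRepository]
        forkAt: Processor userBackgroundPriority - 1.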
- Chris
Hi Chris,
The method update that you sent is adding a critical section for synchronization, but I see no evidence of synchronization problems in the ss.log log files. Each 'BEGIN SAVING' is always followed by a 'DONE SAVING', so it does not look like we ever have two processes running this method at the same time.
But the method itself is interesting. There is a 10-times retry loop in it:
"Because we're serializing a big object while it's changing, there is a possibility the serialization will fail."
triesRemaining := 10.
On the squeaksource.com server, a repository save takes an average of 5 1/2 minutes most of the time. If the retry loop is handling the case of the repository changing during that 5 1/2 minutes, it seems quite likely that this might happen when people are actively using the system. A 10-times retry policy might well lead to total processing time on the order of an hour if the system was busy and you then fell into the retry loop.
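To put a number on that: ten retries at roughly 5 1/2 minutes each bounds the loop at about 55 minutes, which is the hour scale seen in the durations below. I haven't checked the actual method, but such a retry loop is typically shaped like this (serializeRepository and log: are placeholder selectors):

    triesRemaining := 10.
    [triesRemaining > 0] whileTrue: [
        [self serializeRepository.
         triesRemaining := 0]    "success: leave the loop"
            on: Error
            do: [:ex |
                triesRemaining := triesRemaining - 1.
                self log: 'Error while saving: ', ex messageText]].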
Noticing this, I pulled the log messages out of the ss.log for all repository saves since January 1. I looked at total processing duration from 'BEGIN SAVING' until 'DONE SAVING' and pulled out the top 24 processing times sorted by duration. There are three cases of the repository save running for well over an hour as well as a couple of 45 minute runs. All of these happened between Mar 20 and Mar 24, which is exactly the time frame in which I got worried about slow response time and high CPU load:
2024-03-20T15:19:05.335842+00:00 SSFilesystem DONE SAVING => 0:01:16:51.227981
2024-03-23T05:00:30.784172+00:00 SSFilesystem DONE SAVING => 0:01:14:31.366004
2024-03-20T16:29:24.942607+00:00 SSFilesystem DONE SAVING => 0:01:10:19.604764
2024-03-24T06:24:27.642523+00:00 SSFilesystem DONE SAVING => 0:00:45:40.476352
2024-03-24T05:38:47.164168+00:00 SSFilesystem DONE SAVING => 0:00:45:35.499997
2024-03-09T08:29:10.761091+00:00 SSFilesystem DONE SAVING => 0:00:17:12.286009
2024-03-22T22:32:23.862167+00:00 SSFilesystem DONE SAVING => 0:00:14:33.731995
2024-03-24T21:29:08.624166+00:00 SSFilesystem DONE SAVING => 0:00:12:22.215981
2024-03-09T07:44:09.783081+00:00 SSFilesystem DONE SAVING => 0:00:09:45.765823
2024-03-08T21:35:23.817078+00:00 SSFilesystem DONE SAVING => 0:00:09:44.209998
2024-03-24T14:59:00.770168+00:00 SSFilesystem DONE SAVING => 0:00:08:48.055976
2024-03-08T21:22:01.479093+00:00 SSFilesystem DONE SAVING => 0:00:08:47.844014
2024-03-01T15:08:03.471507+00:00 SSFilesystem DONE SAVING => 0:00:08:37.066421
2024-03-06T00:04:33.281516+00:00 SSFilesystem DONE SAVING => 0:00:08:16.202437
2024-02-01T02:19:37.511613+00:00 SSFilesystem DONE SAVING => 0:00:08:11.774005
2024-03-01T14:50:12.219079+00:00 SSFilesystem DONE SAVING => 0:00:08:05.161995
2024-03-24T15:06:40.470254+00:00 SSFilesystem DONE SAVING => 0:00:07:39.698083
2024-03-01T15:15:41.72317+00:00 SSFilesystem DONE SAVING => 0:00:07:38.246062
2024-03-09T15:10:55.945114+00:00 SSFilesystem DONE SAVING => 0:00:07:30.252033
2024-03-06T00:14:41.921079+00:00 SSFilesystem DONE SAVING => 0:00:07:29.582
2024-03-01T21:17:34.817093+00:00 SSFilesystem DONE SAVING => 0:00:07:21.842016
2024-02-28T00:08:45.99831+00:00 SSFilesystem DONE SAVING => 0:00:06:51.106038
2024-02-05T11:08:41.027609+00:00 SSFilesystem DONE SAVING => 0:00:06:42.185927
2024-02-28T00:18:09.073611+00:00 SSFilesystem DONE SAVING => 0:00:06:41.339976
Dave
On 2024-03-29 05:02, Chris Muller wrote:
<snip>
Hi Eliot,
I don't know the answer, but I can say that any changes are always going to be very small, and that the serialized squeaksource.com repository occupies 146MB of disc space, so it is quite large.
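For anyone with image access who wants to answer Eliot's how-long/how-big question empirically, a full veryDeepCopy at least gives an upper bound for the selective copy he describes. SSRepository current is the model named in the ss.log messages; the file path is arbitrary:

    | copy copyMs saveMs |
    copyMs := [copy := SSRepository current veryDeepCopy] timeToRun.
    saveMs := [(ReferenceStream fileNamed: '/tmp/data-copy.obj')
        nextPut: copy;
        close] timeToRun.
    Transcript show: 'copy: ', copyMs printString, ' ms, serialize: ', saveMs printString, ' ms'; cr.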
It's worth mentioning that I have been looking after the old squeaksource.com service for many years now, and I cannot recall a single case in which I made use of one of the automatically saved data.obj files. For cases in which I did major image updates, I'm pretty sure I just serialized it myself right before completing the update. So, as a backup and recovery mechanism it's not terribly important, and simply turning off the automatic repository saves would hurt nothing.
However, the situation may be quite different on source.squeak.org, where I believe that loading the repository from a data.obj file may be part of the image startup process (I am not sure of this though).
Dave
On 2024-03-29 20:29, Eliot Miranda wrote:
Hi Both,
how long does it take to make, and how big is, a copy of the graph which only shallow-copies objects that reference only objects that don't change?
_,,,^..^,,,_ (phone)
On Mar 29, 2024, at 12:56 PM, lewis@mail.msen.com wrote:
<snip>
Hi Dave,
The method update that you sent is adding a critical section for synchronization, but I see no evidence of synchronization problems in the ss.log log files. Each 'BEGIN SAVING' is always followed by a 'DONE SAVING', so it does not look like we ever have two processes running this method at the same time.
The BEGIN SAVING and DONE SAVING are both within the same critical section, so their execution could never interleave. At first I thought the second saver would block waiting for the serialization, but after looking again, I saw the wait is forked, so the client won't be blocked. I was wrong, the method looks fine as it is.
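In other words, the structure is roughly the following (names illustrative, not the actual SqueakSource code):

    [saveMutex critical: [
        self log: 'BEGIN SAVING'.
        self serializeRepository.
        self log: 'DONE SAVING']]
            forkAt: Processor userBackgroundPriority - 1.
    "The single Mutex means BEGIN/DONE pairs can never interleave, and because
    the whole critical section runs in the forked process, the client that
    triggered the save returns immediately."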
But the method itself is interesting. There is a 10-times retry loop in it:
"Because we're serializing a big object while it's changing, there is a possibility the serialization will fail."
triesRemaining := 10.
On the squeaksource.com server, a repository save takes an average of 5 1/2 minutes most of the time. If the retry loop is handling the case of the repository changing during that 5 1/2 minutes, it seems quite likely that this might happen when people are actively using the system. A 10-times retry policy might well lead to total processing time on the order of an hour if the system was busy and you then fell into the retry loop.
Noticing this, I pulled the log messages out of the ss.log for all repository saves since January 1. I looked at total processing duration from 'BEGIN SAVING' until 'DONE SAVING' and pulled out the top 24 processing times sorted by duration. There are three cases of the repository save running for well over an hour as well as a couple of 45 minute runs. All of these happened between Mar 20 and Mar 24, which is exactly the time frame in which I got worried about slow response time and high CPU load:
2024-03-20T15:19:05.335842+00:00 SSFilesystem DONE SAVING => 0:01:16:51.227981
2024-03-23T05:00:30.784172+00:00 SSFilesystem DONE SAVING => 0:01:14:31.366004
2024-03-20T16:29:24.942607+00:00 SSFilesystem DONE SAVING => 0:01:10:19.604764
2024-03-24T06:24:27.642523+00:00 SSFilesystem DONE SAVING => 0:00:45:40.476352
2024-03-24T05:38:47.164168+00:00 SSFilesystem DONE SAVING => 0:00:45:35.499997
That's useful info on the serialization times. However, as you can see, the retries are logged as well. Here are the only retries grep found in the entire ss.log:
2021-01-10T20:58:55.771009+00:00 Error while saving SSRepository current to the data.obj file: FileWriteError: File /srv/squeaksourcecom/SqueakSource/ss/data.obj is closed
2021-01-10T20:59:05.77288+00:00 Error while saving SSRepository current to the data.obj file: Error: primClose: failed
2022-12-28T00:46:12.375508+00:00 Error while saving SSRepository current to the data.obj file: FileWriteError: File /srv/squeaksourcecom/SqueakSource/ss/data.obj is closed
2022-12-28T00:46:22.377481+00:00 Error while saving SSRepository current to the data.obj file: Error: primClose: failed
The last time a retry occurred was in 2022, and it was due to a FileWriteError (possibly disk full), not a serialization error. So those long ones in March were not due to retries, but something else. I'm still not satisfied that we got to the bottom of what happened, but glad both servers are continuing to run fine.
Have a good week. :)
- Chris
Hi Chris,
Fully agree, see also my notes in the "Process Scheduling" thread.
Dave
On 2024-04-01 21:31, Chris Muller wrote:
<snip>
I want to repeat Eliot's suggestion below because it seems to me a very good design approach. I'm going to tinker with the idea offline (not in any SqueakSource image). I'm also interested in the idea of doing the saving work in a separate forkSqueak OSProcess, letting it run at low priority with little or no performance impact on multi-core platforms. Just mentioning it here because the overall approach that Eliot describes below seems right to me.
Dave
On 2024-03-28 15:25, Eliot Miranda wrote:
Hi Dave, Hi All,
<snip>
So introduce a queue for save requests and service them in a lower-priority process than the processes serving user requests. Interaction is with the in-image model. The lower-priority process doing the saves can elide intervening saves if it gets behind, so the system saves as often as necessary when lightly loaded, and as often as possible, while prioritizing user responsiveness, when heavily loaded.
If the model needs to be locked while serialising, then take a copy (which will be shallow at the leaves, deep in the branches, because only changeable data needs to be shallow copied), and serialise the copy. The copy operation should be much faster than the serialisation.
I'd happily collaborate on this but I need pointers to the code and instructions on how to interact with the running server.
Best, Eliot
_,,,^..^,,,_ (phone)
On 2024-03-31, at 9:04 AM, lewis@mail.msen.com wrote: I'm also interested in the idea of doing the saving work in a separate forkSqueak OSProcess to let it run at low priority, and with little or no performance impact on multi-core platforms.
This is an idea I like for quite a lot of long-running operations, and this particular case seems ideal.
I can't remember any detail of how the squeaksource save is done, but forking the entire image usually takes close to zero time and then the forked image can take its time to complete the job. The possible complication I can imagine (and that may not apply, see above memory issue) would be if multiple save requests are made during the time a forked save is running. I *think* there can Only Be One data.obj and if we get multiple concurrent requests to save it we could probably just kill any running fork and let a new one replace whatever had been already written?
Oh, and this reminds me that the ARM64 VM still explodes if you try to do the fork; IIRC we concluded it was something to do with executable permission pages and resetting the phase of the chocolate clock, something like that.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim The severity of the itch is proportional to the reach.
Hi Tim, Hi Dave,
On Mar 31, 2024, at 10:15 AM, Tim Rowledge tim@rowledge.org wrote:
On 2024-03-31, at 9:04 AM, lewis@mail.msen.com wrote: I'm also interested in the idea of doing the saving work in a separate forkSqueak OSProcess to let it run at low priority, and with little or no performance impact on multi-core platforms.
This is an idea I like for quite a lot of long-running operations, and this particular case seems ideal.
I can't remember any detail of how the squeaksource save is done, but forking the entire image usually takes close to zero time and then the forked image can take its time to complete the job. The possible complication I can imagine (and that may not apply, see above memory issue) would be if multiple save requests are made during the time a forked save is running. I *think* there can Only Be One data.obj and if we get multiple concurrent requests to save it we could probably just kill any running fork and let a new one replace whatever had been already written?
The problems with the fork-and-save design are that:
1. it makes it harder to elide unneeded saves when the saving processes get behind. Keeping things in one image makes it easier to discard saves that can't be started and/or completed by the time a new version of the graph is ready to save.
2. whether the save processes take time from the server process depends at least somewhat on the OS. Given that the save processes are doing file i/o, the OS may prioritize the save processes above the server process.
Oh, and this reminds me that the ARM64 VM still explodes if you try to do the fork; IIRC we concluded it was something to do with executable permission pages and resetting the phase of the chocolate clock, something like that.
Submit an issue on github?
tim
tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim The severity of the itch is proportional to the reach.
On Mon, Apr 1, 2024 at 12:30 AM Eliot Miranda eliot.miranda@gmail.com wrote:
<snip>
This can be handled like a batch job: all started saves finish. If you make the batch in time, it gets saved; if not, you wait until the next batch. Add a delay after each save to collect more items into the batch. The guarantee is that each item is saved by commit time + delay time. The delay could be 24 hours, or maybe wait till 3am. Because the save is async anyway, you are just adding some time to the backup, which isn't used (much, if at all) anyway.
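A sketch of that policy, grafted onto the kind of worker loop sketched earlier in the thread; the five-minute Delay is an arbitrary example, and dirty, wakeup, and serializeCopy are illustrative names:

    worker := [[true] whileTrue: [
        wakeup wait.    "at least one change has arrived"
        dirty ifTrue: [
            (Delay forSeconds: 300) wait.    "linger so the batch can grow"
            dirty := false.
            self serializeCopy]]] forkAt: Processor userBackgroundPriority - 1.
    "Each change is on disk no later than its commit time + the delay + one save."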
Ron
Hi Ron,
On 2024-04-01 14:59, Ron Teitelbaum wrote:
<snip>
This is a good approach, and has the nice property of reducing the total number of expensive saves when there is a lot of activity on the system.
I do want to note for the record that after watching system activity and reviewing the ss.log log file, I see nothing actually wrong with the current repository save strategy.
Chris designed this to keep an up-to-date save whenever the repository changes, and also to avoid unnecessary saves. The saves are expensive and long-running, but the strategy works as intended. For example, a typical pattern appearing in the logs is:
- someone commits an update to the repository
- a repository save is started, and will run for perhaps 7 minutes (on squeaksource.com)
- meanwhile one or more additional MCZ updates are committed
- the repository save is completed, and a new one is started because of the intervening commits
- the second save finishes later, and no additional saves are started.
This is all good, and working as intended. I see no concurrency problems, and there are no "extra" saves given the intent of always keeping the repository saves as up to date as possible.
I still have no explanation for the cases in which we saw extremely long-running saves (a couple of cases in which the saves took well over an hour). All I know is that it happened, but I cannot see anything actually wrong in the existing repository save logic other than the fact that it is a very expensive and slow operation on a SqueakSource server with a large repository.
Dave
On Mon, Apr 1, 2024 at 3:22 PM lewis@mail.msen.com wrote:
<snip>
I still have no explanation for the cases in which we saw extremely long-running saves (a couple of cases in which the saves took well over an hour). All I know is that it happened, but I cannot see anything actually wrong in the existing repository save logic other than the fact that it is a very expensive and slow operation on a SqueakSource server with a large repository.
To me this sounds like a hardware or memory problem. If we started swapping memory, or if there were connection issues or data corruption on the way to the disk, a save could take that long. Updates, external backups, DoS attacks, or other network problems could also play a role.
Ron
Back in 1994 I was asked to mentor three students who had a "scientific initiation" scholarship and I proposed to add multimedia features to Smalltalk V/286.
While Self has the scheduler in the image built on top of a single "TWAINS" (Transfer and WAit for INterrupt or Signal) primitive, V/286 used the traditional Smalltalk-80 design of cooperative scheduling within the same priority and preemptive scheduling among different priorities. So we just had the highest-priority process waiting on a semaphore that was poked by a timer, and the actual scheduler lived in that process.
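In Squeak terms the pattern Jecel describes looks roughly like this; in V/286 the semaphore was signaled from a timer in the VM rather than from another process, and myPolicy and reschedule are placeholders:

    tick := Semaphore new.
    "The in-image scheduler outranks everything it schedules."
    [[true] whileTrue: [
        tick wait.
        myPolicy reschedule]] forkAt: Processor timingPriority.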
I taught the students about different kinds of real-time schedulers, like rate-monotonic, earliest-deadline-first, and many others. In the end I implemented all this myself, but it worked as expected. We never did get to the point of playing video or audio in parallel with other things, however.
In the 1994 project I didn't take multiple processors into account, but in a different project I had a 64 x 68020 Smalltalk machine which did deal with this. In 2024 it should be part of the discussion.
-- Jecel