On Mon, Apr 1, 2024 at 3:22 PM <lewis@mail.msen.com> wrote:

Hi Ron,

On 2024-04-01 14:59, Ron Teitelbaum wrote:

 
On Mon, Apr 1, 2024 at 12:30 AM Eliot Miranda <eliot.miranda@gmail.com> wrote:
This can be handled like a batch job: all started saves finish.  If you make the batch in time, your change gets saved; if not, you wait for the next batch.  Adding a delay after each save collects more items into the next batch.  The guarantee is that each item is saved within commit time + delay time.  The delay could be 24 hours, or the save could simply wait until 3am.  Because the save is async anyway, you are just adding some time to the backup, which isn't used (much, or at all) anyway.
 
Ron
 


This is a good approach, and has the nice property of reducing the total number of expensive saves when there is a lot of activity on the system.
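To make the idea concrete, the batching could look roughly like the sketch below. This is only an illustration, not SqueakSource code: 'dirty' stands in for a flag that each commit would set, and 'saveAction' stands in for the expensive repository save; a real version would live in a class rather than a workspace.

  | dirty saveAction |
  dirty := false.       "each commit would set this back to true"
  saveAction := [ Transcript show: 'saving repository ...'; cr ].   "stands in for the real save"
  [ [ true ] whileTrue: [
      dirty ifTrue: [
          dirty := false.        "everything committed so far goes into this batch"
          saveAction value ].    "the expensive, long-running save"
      "the delay collects further commits into the next batch; it could be
       an hour, 24 hours, or the loop could instead sleep until 3am"
      (Delay forSeconds: 60 * 60) wait ] ] fork.

With a loop like that, every commit is written out within one delay period plus one save, and a burst of commits costs only a single save.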

I do want to note for the record that after watching system activity and reviewing the ss.log file, I see nothing actually wrong with the current repository save strategy.

Chris designed this to keep an up-to-date save whenever the repository changes, and also to avoid unnecessary saves. The saves are expensive and long-running, but the strategy works as intended. For example, a typical pattern appearing in the logs is:

- someone commits an update to the repository

- a repository save is started, and will run for perhaps 7 minutes (on squeaksource.com)

- meanwhile one or more additional mcz updates are committed

- the repository save is completed, and a new one is started because of the intervening commits

- the second save finishes later, and no additional saves are started.

This is all good, and working as intended. I see no concurrency problems, and there are no "extra" saves given the intent of always keeping the repository saves as up to date as possible.
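For anyone who wants to see the shape of that logic, here is a rough sketch of the coalescing idea. It is not the actual SqueakSource code: 'noteCommit' stands in for whatever the commit path really invokes, and 'doSave' is just a placeholder for the long-running save.

  | lock saving pending doSave startSave noteCommit |
  lock := Semaphore forMutualExclusion.
  saving := false.
  pending := false.
  doSave := [ (Delay forSeconds: 5) wait ].   "placeholder for the ~7 minute save"
  startSave := [
      [ doSave value.
        lock critical: [
            pending
                ifTrue: [ pending := false. startSave value ]   "intervening commits: run one more save"
                ifFalse: [ saving := false ] ] ] fork ].
  noteCommit := [
      lock critical: [
          saving
              ifTrue: [ pending := true ]                       "a save is already running; remember to save again"
              ifFalse: [ saving := true. startSave value ] ] ].

Evaluating 'noteCommit value' three times in quick succession produces one save followed by exactly one follow-up save, no matter how many commits landed in between, which matches the two-save pattern in the log excerpt above.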

I still have no explanation for the cases in which we saw extremely long-running saves (in a couple of cases the save took well over an hour). All I know is that it happened, but I cannot see anything actually wrong in the existing repository save logic, other than the fact that it is a very expensive and slow operation on a SqueakSource server with a large repository.


To me this sounds like a hardware or memory problem.  If the server started swapping, or if there were connection issues or data corruption on the disk, a save could take that long.  Updates, external backups, DoS attacks, or other network problems could also play a role.

Ron

Dave