Hi Dave,

The method update that you sent adds a critical section for synchronization, but I see no evidence of synchronization problems in the ss.log files. Each 'BEGIN SAVING' is always followed by a 'DONE SAVING', so it does not look like we ever have two processes running this method at the same time.


The BEGIN SAVING and DONE SAVING are both within the same critical section, so their execution could never interleave.  At first I thought a second saver would block waiting for the serialization, but after looking again I saw that the wait is forked, so the client won't be blocked.  I was wrong; the method looks fine as it is.
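
Just to spell out the shape I'm picturing (a sketch with placeholder names like saveMutex, serializeRepository, and log:, not the actual SSFilesystem code):

            "saveMutex is assumed to be a Mutex (or Semaphore forMutualExclusion)
             held in an instance variable.  The whole save is forked, so the caller
             returns immediately; a second save simply queues up behind the mutex."
            [ saveMutex critical: [
                self log: 'BEGIN SAVING'.
                self serializeRepository.
                self log: 'DONE SAVING' ] ] fork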
 

But the method itself is interesting.  There is a retry loop in it that tries up to 10 times:

            "Because we're serializing a big object while its changing, there is a possibility the serialization will fail."
            triesRemaining := 10. 
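
I assume it has the usual shape, roughly something like this (a sketch with placeholder selectors, not Dave's actual method; the 10-second delay and the log text are guesses):

            triesRemaining := 10.
            [ triesRemaining > 0 ] whileTrue: [
                [ self serializeRepository. ^ self ]   "success -- we're done"
                    on: Error
                    do: [ :err |
                        self log: 'Error while saving: ', err messageText.
                        triesRemaining := triesRemaining - 1.
                        10 seconds asDelay wait ] ]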

On the squeaksource.com server, a repository save typically takes about 5 1/2 minutes. If the retry loop is handling the case of the repository changing during those 5 1/2 minutes, it seems quite likely that this could happen while people are actively using the system. A ten-retry policy might well lead to a total processing time on the order of an hour if the system were busy and it fell into the retry loop.
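
Back-of-the-envelope, assuming every attempt eats a full serialization:

            (10 * 5.5) minutes   "==> 0:00:55:00, i.e. nearly an hour"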

Noticing this, I pulled the log messages out of ss.log for all repository saves since January 1. I looked at the total processing duration from 'BEGIN SAVING' until 'DONE SAVING' and pulled out the top 24 processing times, sorted by duration. There are three cases of the repository save running for well over an hour, as well as a couple of 45-minute runs. All of these happened between Mar 20 and Mar 24, which is exactly the time frame in which I got worried about slow response times and high CPU load:

2024-03-20T15:19:05.335842+00:00 SSFilesystem DONE SAVING => 0:01:16:51.227981
2024-03-23T05:00:30.784172+00:00 SSFilesystem DONE SAVING => 0:01:14:31.366004
2024-03-20T16:29:24.942607+00:00 SSFilesystem DONE SAVING => 0:01:10:19.604764
2024-03-24T06:24:27.642523+00:00 SSFilesystem DONE SAVING => 0:00:45:40.476352
2024-03-24T05:38:47.164168+00:00 SSFilesystem DONE SAVING => 0:00:45:35.499997
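
For what it's worth, the pairing is straightforward to do in the image. Something along these lines (a rough sketch, assuming the timestamp-first line format above) computes the durations; subtracting two DateAndTimes gives a Duration, which prints in the same D:HH:MM:SS form shown above:

            | begins durations |
            begins := OrderedCollection new.
            durations := OrderedCollection new.
            (FileStream readOnlyFileNamed: 'ss.log') contentsOfEntireFile lines do: [ :line |
                (line includesSubstring: 'BEGIN SAVING') ifTrue: [
                    begins add: (DateAndTime fromString: (line copyUpTo: Character space)) ].
                (line includesSubstring: 'DONE SAVING') ifTrue: [
                    begins isEmpty ifFalse: [
                        durations add: (DateAndTime fromString: (line copyUpTo: Character space))
                            - begins removeLast ] ] ].
            (durations asSortedCollection: [ :a :b | a > b ]) asArray first: 24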

That listing is useful information about the serialization times.  However, as you can see, retries are logged as well.  Here are the only retries grep found in the entire ss.log:

2021-01-10T20:58:55.771009+00:00 Error while saving SSRepository current to the data.obj file:  FileWriteError: File /srv/squeaksourcecom/SqueakSource/ss/data.obj is closed
2021-01-10T20:59:05.77288+00:00 Error while saving SSRepository current to the data.obj file:  Error: primClose: failed
2022-12-28T00:46:12.375508+00:00 Error while saving SSRepository current to the data.obj file:  FileWriteError: File /srv/squeaksourcecom/SqueakSource/ss/data.obj is closed
2022-12-28T00:46:22.377481+00:00 Error while saving SSRepository current to the data.obj file:  Error: primClose: failed

The last time a retry occurred was in 2022, and it was due to a FileWriteError (possibly disk full), not a serialization error.  So those long runs in March were not due to retries but to something else.  I'm still not satisfied that we've gotten to the bottom of what happened, but I'm glad both servers are continuing to run fine.

Have a good week.  :)

 - Chris