I've been running a squeaksource system on a 64bit linux server for a year or so and just started having odd problems.
Talking with Chris about it, and looking into the assorted logs, led to the theory that something (unknown, and I can't see how it's possible) caused the image to save and quit at some point, when the expected behaviour is to *not* ever save. That led to some complaints about Magma-related details that don't matter here.
I had much fun (the usual linux user names/permissions stuff) but eventually got a new copy of the squeaksource image running. Yay!
Except after an indeterminate time (several hours, not a whole day) the webpage part of the system became inaccessible even though it had not actually crashed out. The log contained a bunch of lines like this -

@40000000603729ab18e70dac 2021-02-24T23:37:53.41687-05:00 CurrentEnvironment:
@40000000603729ab18e71d4c acceptHandler: Too many open files
@4000000060379f1127814ddc acceptHandler: aborting server 12 pss=0x14df900
@4000000060379f112782dc4c socketStatus: freeing invalidated pss=0x14df900
@400000006037b1ef0d99c214 acceptHandler: Too many open files
@400000006037b1ef0d9a5e54 acceptHandler: aborting server 12 pss=0x7f2f8c8d4630
@400000006037b1ef0d9c7194 socketStatus: freeing invalidated pss=0x7f2f8c8d4630
... which are errors from the VM socket handlers.
I've not seen this happening before that I recall. Any ideas? Stuff to check? Solutions?
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Strange OpCodes: TDB: Transfer and Drop Bits
Socket leaks were always a problem on the older squeaksource.com images, although I don't think it has been happening since I updated it to a Squeak 5.3 image, and I don't recall that it was ever a problem on source.squeak.org.
I don't know what the root cause was, and it's certainly possible that socket leak issues still exist but are not showing up with the usage or load on our servers.
If you are using image-based persistence (as opposed to a Magma back end), then the utility that I used for squeaksource.com is a handy workaround (actually it is still running on our squeaksource.com image, which is why I am not entirely sure the socket leak problem is actually gone).
Take a look in the http://www.squeaksource.com/SSImageInit repository at the socket leak monitor. Again, to be clear, you don't want to run this with Magma, but in any case it may give you an idea of how to handle it.
I do have to note that the error messages you are getting may be a different issue entirely. The socket leaks I was dealing with manifested as open file descriptors (visible in /proc/<pid>/fd/*). If you run out of file descriptors the image will hang, and if I recall correctly it may have crashed the VM. Possibly that would lead to the later symptoms you are seeing now.
Dave
On Thu, Feb 25, 2021 at 11:42:23AM -0800, tim Rowledge wrote:
Thanks Dave - that's all rather scary sounding.
It seems really weird that this has only just started and yet the system has been running happily for over a year. It's a 'PersonalSqueakSource' image built on a very late 5.2/5.3-alpha image.
Are there unix system settings for allowable numbers of sockets etc that might have been changed by a sysadmin? Anything like that?
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Strange OpCodes: STOP: No Op
On Thu, Feb 25, 2021 at 01:47:16PM -0800, tim Rowledge wrote:
Thanks Dave - that's all rather scary sounding.
Scary, no. But hard to debug? Yes. A file and/or socket handle leak just means that the image is failing to close connections properly, possibly because of some completely unrelated error that was not handled by closing the socket.
It seems really weird that this has only just started and yet the system has been running happily for over a year. It's a 'PersonalSqueakSource' image built on a very late 5.2 /5.3 alpha image.
Are there unix system settings for allowable numbers of sockets etc that might have been changed by a sysadmin? Anything like that?
Yes I'm sure it's configurable, but it's highly unlikely that a sysadmin would bother to fool around with something like that. I think the default is 1000 open file descriptors (including sockets, pipes, files) per process, and that is way more than you need for a squeaksource server.
Assuming you have access to the server, the best place to start is by looking in the directory /proc/<pid>/fd/* (where <pid> is the process ID of the Squeak VM; you can find it with something like "$ ps -aef | grep squeak"). If you look at that once a day for a week, you will be able to see a growing number of file descriptors hanging around. Eventually it gets to be too many, and bad things happen.
If you do /not/ see these file descriptors accumulating over time, then it would indicate some other kind of issue entirely.
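Dave's check can be wrapped in a couple of small helper functions; this is just a sketch, assuming a Linux /proc filesystem and that something like "pgrep squeak" finds the VM process:

```shell
# Count all open file descriptors for a given pid (Linux /proc)
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Count only the socket descriptors among them
count_sockets() {
    ls -l "/proc/$1/fd" 2>/dev/null | grep -c 'socket:'
}

# Demonstrate on this shell's own pid; for the VM you would use
# something like: count_fds "$(pgrep squeak)"
pid=$$
echo "fds: $(count_fds "$pid"), sockets: $(count_sockets "$pid")"
```

Running that once a day (by hand or from cron) and noting the numbers is enough to see whether descriptors are accumulating.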
Good luck and keep us posted.
Dave
On 2021-02-25, at 3:17 PM, David T. Lewis lewis@mail.msen.com wrote:
Assuming you have access to the server, the best place to start is by looking in the directory /proc/<pid>/fd/* (where <pid> is the process ID of the Squeak VM; you can find it with something like "$ ps -aef | grep squeak"). If you look at that once a day for a week, you will be able to see a growing number of file descriptors hanging around. Eventually it gets to be too many, and bad things happen.
After an hour or so there are 350 entries in that directory, virtually all sockets. I imagine somehow something has caused an issue that ... well, no idea. Pretty sure it isn't going to last for a week at this rate!
I've found a backup copy of the image from before the problem so I think I'll probably install that and see if things improve. Maybe a new VM. Strange stuff.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Useful Latin Phrases:- Fac me cocleario vomere! = Gag me with a spoon!
On 26. Feb 2021, at 05:04, tim Rowledge tim@rowledge.org wrote:
After an hour or so there are 350 entries in that directory, virtually all sockets. I imagine somehow something has caused an issue that ... well, no idea. Pretty sure it isn't going to last for a week at this rate!
Maybe the sockets were closed at the other end but we did not notice, so we hang on to them. I'm quite sure I saw that more than a few times when caring for Seaside on Squeak. There was often something strange going on with the sockets when the conversation between browser and Seaside was already done, but somehow Squeak hung on to the socket…
Best regards -Tobias
On Fri, Feb 26, 2021 at 08:40:17AM +0100, Tobias Pape wrote:
Maybe sockets the other end closed but we did not notice, so hang on to them. I'm quite sure I saw that not too few times when caring for Seaside on Squeak. There was often something strange with the sockets going on when the talking between browser and Seaside was already done but somehow the Squeak hang on to the socket…
It may be that the socket leak problem on squeaksource.com is "fixed" now simply because it is so much faster running on 64-bit Spur, so timeouts and other errors simply do not happen any more. In other words, an underlying problem may still be there, but it rarely happens for squeaksource on the faster VM.
The responsibility for closing the sockets is entirely on the Squeak side. Regardless of what the client does or does not do, the Squeak image must always close its socket in order to free the file descriptor.
Tim, one other thing to look out for is the event handling in the VM. Since last October it now uses the Linux-specific epoll mechanism. If the problem you see now started happening after updating to a newer VM, then consider this as a possible suspect.
Dave
On 2021-02-25, at 8:04 PM, tim Rowledge tim@rowledge.org wrote:
After an hour or so there are 350 entries in that directory, virtually all sockets. I imagine somehow something has caused an issue that ... well, no idea. Pretty sure it isn't going to last for a week at this rate!
This morning there were 976 fds. So I restarted and watched and .. 18. It's sitting there all happy and as if, as we say in the UK, "butter wouldn't melt in its mouth". What, me? Cause problems? How can you be so cruel to a young server?
Maybe a port scanning/attack of some sort flooded it?
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Oxymorons: Living dead
Still curious as to what VM you're using...
On Fri, Feb 26, 2021 at 2:04 PM tim Rowledge tim@rowledge.org wrote:
Didn't I mention that before? - the VM I can only identify as 5.0-201807260206 from the lib/squeak directory name. Levente thinks that's the default one for the original 5.3 alpha package, which is almost certainly what I downloaded to build this system 18 months ago. Sounds plausible to me.
If I have to restart anytime soon I'll try the latest VM and see if it makes any difference. So far today nothing is going wrong though; it has crept up to 20 fds open - 4 sockets right now.
On 2021-02-26, at 12:20 PM, Chris Muller asqueaker@gmail.com wrote:
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Calm down -- it's only ones and zeros.
On Fri, Feb 26, 2021 at 12:04:26PM -0800, tim Rowledge wrote:
This morning there were 976 fds. So I restarted and watched and .. 18. It's sitting there all happy and as if, as we say in the UK, "butter wouldn't melt in its mouth". What, me? Cause problems? How can you be so cruel to a young server?
Maybe a port scanning/attack of some sort flooded it?
Keep an eye on it, you probably still have a socket leak.
Here is what I saw on the squeaksource.com server a few minutes ago:
squeaksourcecom@dan:~$ ls -l /proc/16147/fd
total 0
lr-x------ 1 squeaksourcecom www-data 64 Feb 14 03:25 0 -> /dev/null
l-wx------ 1 squeaksourcecom www-data 64 Feb 14 03:25 1 -> pipe:[11697]
l-wx------ 1 squeaksourcecom www-data 64 Feb 14 03:25 2 -> pipe:[11697]
lr-x------ 1 squeaksourcecom www-data 64 Feb 14 03:25 3 -> /srv/squeaksourcecom/SqueakSource/ss/Pharo
lr-x------ 1 squeaksourcecom www-data 64 Feb 14 03:25 4 -> /srv/squeaksourcecom/SqueakSource/squeaksource.8.image
lr-x------ 1 squeaksourcecom www-data 64 Feb 14 03:25 6 -> /srv/squeaksourcecom/SqueakSource/SqueakV50.sources
lrwx------ 1 squeaksourcecom www-data 64 Feb 14 03:25 7 -> socket:[25825763]
lrwx------ 1 squeaksourcecom www-data 64 Feb 14 03:25 8 -> socket:[25825764]
lrwx------ 1 squeaksourcecom www-data 64 Feb 14 06:25 9 -> /srv/squeaksourcecom/SqueakSource/squeaksource.8.changes
And then a few minutes later:
squeaksourcecom@dan:~$ ls -l /proc/16147/fd
total 0
lr-x------ 1 squeaksourcecom www-data 64 Feb 14 03:25 0 -> /dev/null
l-wx------ 1 squeaksourcecom www-data 64 Feb 14 03:25 1 -> pipe:[11697]
l-wx------ 1 squeaksourcecom www-data 64 Feb 14 03:25 2 -> pipe:[11697]
lr-x------ 1 squeaksourcecom www-data 64 Feb 14 03:25 3 -> /srv/squeaksourcecom/SqueakSource/ss/metacello
lr-x------ 1 squeaksourcecom www-data 64 Feb 14 03:25 4 -> /srv/squeaksourcecom/SqueakSource/squeaksource.8.image
lr-x------ 1 squeaksourcecom www-data 64 Feb 14 03:25 6 -> /srv/squeaksourcecom/SqueakSource/SqueakV50.sources
lrwx------ 1 squeaksourcecom www-data 64 Feb 14 03:25 7 -> socket:[25825763]
lrwx------ 1 squeaksourcecom www-data 64 Feb 14 03:25 8 -> socket:[25825764]
lrwx------ 1 squeaksourcecom www-data 64 Feb 14 06:25 9 -> /srv/squeaksourcecom/SqueakSource/squeaksource.8.changes
If your server is not actively handling requests, you should see similar socket usage. You'll see more file descriptors when you are using Magma, but keep an eye on the socket entries, and they should not accumulate over time.
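One low-tech way to see exactly which descriptors are new, rather than just how many, is to diff two snapshots of the fd table. A sketch, using this shell's own pid as a stand-in for the VM's:

```shell
# Snapshot the fd table twice and show what changed in between.
# For the VM, replace $$ with its pid, e.g. $(pgrep squeak).
pid=$$
ls -l "/proc/$pid/fd" > fds-before.txt
sleep 1          # in practice you would wait minutes or hours
ls -l "/proc/$pid/fd" > fds-after.txt
# Lines starting with ">" are descriptors that appeared in the interval
diff fds-before.txt fds-after.txt || true
```

The socket:[inode] numbers in the new lines can then be looked up in /proc/net/tcp or with ss to find the peer addresses.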
Dave
Historically it's up to finalization to cleanup a socket. Anyone confirmed that code is there and runs with this VM?
.... John M. McIntosh. Corporate Smalltalk Consulting Ltd https://www.linkedin.com/in/smalltalk
------- Original Message ------- On Friday, February 26, 2021 1:12 PM, David T. Lewis lewis@mail.msen.com wrote:
On 27/02/21 2:42 am, David T. Lewis wrote:
Here is what I saw on the squeaksource.com server a few minutes ago:
squeaksourcecom@dan:~$ ls -l /proc/16147/fd
The socket count can be generated directly with:
$ ls -l /proc/$(pgrep squeak)/fd | grep -c socket:
This can then be visually graphed every, say, 3600s with:
bars='======================================================+'
while true; do
    echo $bars | cut -c 1-$(ls -l /proc/$(pgrep squeak)/fd | grep -c socket:)
    sleep 3600
done
If this shows a monotonic growth, then a leak can be suspected.
HTH .. Subbu
On 2021-02-26, at 10:32 PM, K K Subbu kksubbu.ml@gmail.com wrote:
On 27/02/21 2:42 am, David T. Lewis wrote:
Here is what I saw on the squeaksource.com server a few minutes ago:
squeaksourcecom@dan:~$ ls -l /proc/16147/fd
The socket count can be generated directly with:
$ ls -l /proc/$(pgrep squeak)/fd | grep -c socket:
That's pretty neat, thanks.
This can then be visually graphed every, say, 3600s with:
bars='======================================================+'
while true; do
    echo $bars | cut -c 1-$(ls -l /proc/$(pgrep squeak)/fd | grep -c socket:)
    sleep 3600
done
Goodness; shell script bargraphs. Whatever next?
If this shows a monotonic growth, then a leak can be suspected.
I'm pretty sure there is a leak, but a slow one. It's climbed from 19 to 30 over three days.
tim -- tim Rowledge; tim@rowledge.org; "You're a ghost driving a meat-coated skeleton made from stardust, riding a rock, hurtling through space. Fear nothing"
On 02/03/21 9:38 am, tim Rowledge wrote:
If this shows a monotonic growth, then a leak can be suspected.
I'm pretty sure there is a slow leak; but slow. It's climbed from 19 to 30 over three days
Tim,
Ugh! Slow leaks can be really annoying to diagnose. You may narrow down the specific leak with ss (socket stats):
$ ss -s
to get a breakdown by tcp/udp ipv4/ipv6 etc. or
$ ss state close-wait
to catch all sockets in the close-wait state. The argument is a filter condition; the man page gives examples of various filter expressions.
You may also use the watch command to sample it every day (-n 86400) and highlight the difference (-d):
$ watch -n 86400 -d ss state close-wait
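For unattended monitoring, the close-wait count can be reduced to a single number. A sketch (ss is part of iproute2; seeing which process owns each socket via -p generally needs root):

```shell
# Number of TCP sockets currently in CLOSE-WAIT, system-wide.
# Run "ss -tnp state close-wait" as root to see owning processes.
count_close_wait() {
    ss -tn state close-wait | tail -n +2 | wc -l
}

count_close_wait
```

A steadily rising count here, matching the growth in /proc/<pid>/fd, would point at connections the image is failing to close after the peer has hung up.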
BTW, an early trick I learned is to keep a small terminal running such commands in a corner (or second monitor these days). This lets me use the center of the screen for all active work. Any gross changes in shape (char graphs ;-)) or highlight will catch the corner of the eye.
HTH .. Subbu
vm-dev@lists.squeakfoundation.org