Just my two cents:

We ran into the “errno 9 select: Bad file descriptor” error two weeks back while playing around with systemd and socket activation.
In our case it was the following:
1. send request to socket -> image starts through socket activation
2. image initializes itself with the special systemd socket handle and keeps track of it (because we want to reuse it after restarting the image)
3. save the image
4. errno 9

What we found was that when the image was being saved, stderr would be closed and then reopened by the system, as would the systemd socket (which was a different problem). Unfortunately we had always assumed a fixed descriptor number for the systemd socket, namely 3. Now, when the image reopened stderr it could happen that another socket had taken fd 2, so stderr ended up on fd 3, and we then tried to use stderr as a TCP socket.
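
The robust approach is to ask systemd where the activation sockets actually are instead of hard-coding 3. A minimal C sketch, assuming libsystemd is available (sd_listen_fds() and SD_LISTEN_FDS_START are the real libsystemd API, link with -lsystemd; the surrounding program is illustrative):

    #include <stdio.h>
    #include <systemd/sd-daemon.h>   /* sd_listen_fds, SD_LISTEN_FDS_START */

    int main(void)
    {
        int n = sd_listen_fds(0);    /* number of fds passed in by systemd */
        if (n < 1) {
            fprintf(stderr, "not started through socket activation\n");
            return 1;
        }
        /* the first activated socket; SD_LISTEN_FDS_START happens to be 3,
           but going through the API keeps that assumption out of the code */
        int sock = SD_LISTEN_FDS_START;
        printf("activation socket is fd %d\n", sock);
        /* ... hand `sock` to the accept loop ... */
        return 0;
    }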

What I want to say is: really make sure that you're not doing something funny with sockets (e.g. assuming a fixed fd number) before you look at the VM.

Good luck!


Cheers,
Max

On 10.10.2014, at 05:56, vm-dev-request@lists.squeakfoundation.org wrote:

Date: Thu, 9 Oct 2014 23:56:03 -0400
From: "Ron Teitelbaum" <ron@usmedrec.com>
Subject: RE: [Vm-dev] Squeak socket problem ... help!
To: "'Squeak Virtual Machine Development Discussion'"
<vm-dev@lists.squeakfoundation.org>
Message-ID: <009601cfe43e$1f0ffb00$5d2ff100$@usmedrec.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi All!

Thanks for your help!!

Trying to reproduce the error I ran 1000 clients against our test cluster of
7 servers.

The fd count hovered around 250 and never went above 490 (there were 4
service providers, so the main server should only handle around 250 people).
David, thanks for the pointers on how to find the fd count.  Very cool.
Since the fd count stayed well within proper limits and went both up
and down, I think you are correct that we may be looking at the wrong thing.
Could it be that we are closing a socket twice?  (See 0xffa5e0cc M
Socket>destroy 0xd9ba02d4: a(n) Socket below.)  Göran mentioned finding a
report that on a very busy server you may close an already-closed socket
whose descriptor has already been reused.  On a not-so-busy system you can
get away with it, so you may not see this error until you put a very large
load on it.
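
A minimal C sketch of that descriptor-reuse hazard (illustrative only, not
the VM's actual code):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[1];
        int a = open("/dev/null", O_RDONLY);
        close(a);                            /* first close: fine */
        int b = open("/dev/null", O_RDONLY); /* POSIX hands out the lowest
                                                free fd, so b == a here */
        close(a);                            /* stale second close actually
                                                closes b's descriptor */
        if (read(b, buf, sizeof buf) < 0)
            perror("read");                  /* read: Bad file descriptor */
        return 0;
    }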

I captured the error: 

This repeated about 1000 times (no idea why it would repeat so many times):

errno 9
select: Bad file descriptor

Then:

Segmentation fault Fri Oct 10 02:46:17 2014


Squeak VM version: 4.0-2798 #1 Wed Nov 13 12:31:45 UTC 2013 gcc 4.4.7
[Production ITHB VM]
Built from: CoInterpreter VMMaker.oscog-eem.496 uuid:
f0401045-f2f7-470e-9940-3535be9c0334 Nov 13 2013
With: StackToRegisterMappingCogit VMMaker.oscog-eem.496 uuid:
f0401045-f2f7-470e-9940-3535be9c0334 Nov 13 2013
Revision: VM: r2798 http://www.squeakvm.org/svn/squeak/branches/Cog
Plugins: r2545 http://squeakvm.org/svn/squeak/trunk/platforms/Cross/plugins
Build host: Linux cogvm 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux
plugin path: /home/openqwaq/server/bin/forums/../qwaqvm/plugins [default:
/home/openqwaq/server/bin/cogvm/lib/squeak/4.0-2798/]


C stack backtrace & registers:
       eax 0xffa191a4 ebx 0xffa190c0 ecx 0xffa19158 edx 0xffa1910c
       edi 0xffa18f90 esi 0xffa18f90 ebp 0xffa19028 esp 0xffa19074
       eip 0xffa19288
*[0xffa19288]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805e54e]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805e84f]
[0x5ad410]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(sqSocketCloseConnection+0x26)[0x811f0a6]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(sqSocketDestroy+0x40)[0x811fad0]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x811c9f8]
[0xd7444848]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(interpret+0x76e)[0x8084eee]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(main+0x2b4)[0x805efa4]
/lib/libc.so.6(__libc_start_main+0xe6)[0x2f4d36]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805bd41]
[0xffa18c80]


Smalltalk stack dump:
0xffa5e0cc M Socket>destroy 0xd9ba02d4: a(n) Socket
0xffa5e0e4 M QServiceProviderClient>destroy 0xd9c33dd4: a(n) QServiceProviderClient
0xffa5e0fc M QServiceProviderClient>connectionClosed 0xd9c33dd4: a(n) QServiceProviderClient
0xffa5e114 M [] in QServiceProviderClient>processRequest 0xd9c33dd4: a(n) QServiceProviderClient
0xffa5e138 M Socket>waitForDataFor:ifClosed:ifTimedOut: 0xd9ba02d4: a(n) Socket
0xffa5e164 M QServiceProviderClient>processRequest 0xd9c33dd4: a(n) QServiceProviderClient
0xffa5e17c M [] in QServiceProviderClient>on:in: 0xd9c33dd4: a(n) QServiceProviderClient
0xffa5e194 M [] in QSafeProcessLoop>run 0xd9c35efc: a(n) QSafeProcessLoop
0xffa5e1b0 M BlockClosure>on:do: 0xda118824: a(n) BlockClosure
0xffa5e1d0 M QSafeProcessLoop>run 0xd9c35efc: a(n) QSafeProcessLoop
0xd9c646dc s [] in QSafeProcessLoop(Object)>fork:at:
0xd9c36020 s [] in BlockClosure>newProcess

Most recent primitives
primSocketConnectionStatus:
primSocketSendDone:
primSocket:sendData:startIndex:count:
millisecondClockValue
primitiveWait
millisecondClockValue
millisecondClockValue
primitiveEnterCriticalSection
primitiveEnterCriticalSection
at:put:
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
primitiveExitCriticalSection
new:
at:put:
at:put:
updateCrc:from:to:in:
updateCrc:from:to:in:
bitXor:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
signal
primitiveEnterCriticalSection
primitiveEnterCriticalSection
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
at:put:
at:put:
primitiveExitCriticalSection
primSocketConnectionStatus:
primSocketSendDone:
primSocketConnectionStatus:
primSocketSendDone:
primSocket:sendData:startIndex:count:
millisecondClockValue
primitiveWait
millisecondClockValue
primitiveWait
primitiveWait
primitiveWait
primitiveWait
primitiveWait
primitiveWait
millisecondClockValue
millisecondClockValue
basicNew:
at:put:
basicNew:
integerAt:put:size:signed:
basicNew:
primitiveARC4Process:from:to:m:x:y:
findSubstring:in:startingAt:matchTable:
basicNew
stringHash:initialHash:
hashBytes:startingWith:
basicNew
at:put:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
stringHash:initialHash:
basicNew
stringHash:initialHash:
primitiveEnterCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
at:put:
at:put:
signal
millisecondClockValue
millisecondClockValue
primitiveEnterCriticalSection
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
primitiveExitCriticalSection
new:
at:put:
at:put:
updateCrc:from:to:in:
updateCrc:from:to:in:
bitXor:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
primitiveEnterCriticalSection
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
at:put:
at:put:
primitiveExitCriticalSection
primSocketConnectionStatus:
primSocketSendDone:
primSocketConnectionStatus:
primSocketSendDone:
primSocket:sendData:startIndex:count:
millisecondClockValue
primitiveWait
basicNew
basicNew
primitiveEnterCriticalSection
new:
replaceFrom:to:with:startingAt:
at:put:
new:
replaceFrom:to:with:startingAt:
at:put:
primitiveExitCriticalSection
basicNew
basicNew
primitiveEnterCriticalSection
new:
replaceFrom:to:with:startingAt:
at:put:
new:
replaceFrom:to:with:startingAt:
at:put:
primitiveExitCriticalSection
basicNew
millisecondClockValue
primitiveWait
primitiveWait
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
signal
primitiveWait
primitiveEnterCriticalSection
replaceFrom:to:with:startingAt:
primSocketConnectionStatus:
primitiveExitCriticalSection
suspend
primSocketReceiveDataAvailable:
primSocketReceiveDataAvailable:
primSocketConnectionStatus:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
basicNew
millisecondClockValue
=
digitMultiply:neg:
<
basicNew
pointsTo:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
primitiveWait
at:put:
signal
signal
basicNew
primitiveWait
at:put:
signal
signal
primSocketConnectionStatus:
primSocketDestroy:

stack page bytes 4096 available headroom 3300 minimum unused headroom 2164

       (Segmentation fault)

From: David T. Lewis
Sent: Thursday, October 09, 2014 8:22 PM

On Fri, Oct 10, 2014 at 02:09:08AM +0200, Göran Krampe wrote:

Hi guys!

Hi Eliot, hope you are happy and well!


On 10/10/2014 01:28 AM, David T. Lewis wrote:
Ron and I (3DICC) have a problem with the Unix VM networking and I
am reaching out before burning too many hours on something one of
you C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck
for your trouble.

Cool.  This is likely easy to fix.  Your image is running out of
file descriptors.  Track open and close calls, e.g. add logging
around at least StandardFileStream>>#primOpen:writable:,
AsyncFile>>#primOpen:forWrite:semaIndex:, and
Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:
and their associated close calls, and see what's being opened without
being closed.  It should be easy to track down, but may be more
difficult to fix.
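
A toy C sketch of that kind of open/close bookkeeping (the LOG_CLOSE macro
is invented for illustration; in the image you would wrap the primitives
named above instead):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* hypothetical wrapper: log every close() with its call site so the
       log can be paired with the corresponding open() */
    #define LOG_CLOSE(fd) \
        (fprintf(stderr, "close(%d) at %s:%d\n", (fd), __FILE__, __LINE__), \
         close(fd))

    int main(void)
    {
        int fd = open("/dev/null", O_RDONLY);
        fprintf(stderr, "open -> %d\n", fd);
        LOG_CLOSE(fd);  /* opens that never get a matching close are leaks */
        return 0;
    }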

Good luck!

Aha. Soo... am I understanding this correctly - we are probably
leaking fds and when we go above 1024 this makes select() go bonkers
and eventually leads to the "Bad file descriptor" error?
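
For reference, the limit being hit: an fd_set can only describe descriptors
below FD_SETSIZE (1024 on Linux), so a select()-based loop simply cannot
handle higher-numbered descriptors. A minimal sketch:

    #include <stdio.h>
    #include <sys/select.h>

    int main(void)
    {
        fd_set set;
        int fd = 1500;   /* hypothetical descriptor above the limit */

        printf("FD_SETSIZE = %d\n", FD_SETSIZE);  /* 1024 on typical Linux */
        FD_ZERO(&set);
        if (fd >= FD_SETSIZE) {
            /* FD_SET(fd, &set) here would write past the end of `set`
               (undefined behavior) */
            fprintf(stderr, "fd %d cannot be used with select()\n", fd);
            return 1;
        }
        FD_SET(fd, &set);
        return 0;
    }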

I agree with what Eliot is saying and would add a few thoughts:

- Don't fix the wrong problem (DFtWP). Unless you have some reason to
believe that this server application would realistically have a need
to handle anything close to a thousand concurrent TCP sessions, don't
fix it by raising the per-process file handle limit, and don't fix it
by reimplementing the socket listening code.
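
For reference, the limit in question is RLIMIT_NOFILE (the thing
"ulimit -n" reports); a minimal C sketch for inspecting it:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        printf("soft fd limit: %llu, hard fd limit: %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
        return 0;
    }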

We haven't done the exact numbers, but we could probably hit several
hundred concurrent at least. 1024 seemed a bit "over the top" though
:)

The system in question is meant to serve more than 1000 concurrent
users, so we are in fact moving into this territory. We have been up
to around 600 so far.

- It is entirely possible that no one before you has ever tried to
run a server application with the per-process file handle limit
bumped up above the default 1024. So if that configuration does not
play nicely with the select() mechanism, you may well be the first to
have encountered this as an issue. But see above, don't fix it if it
ain't
broke.

Well, it most probably *is* broke - I mean - I haven't read anywhere
that our Socket code is limited to 1024 concurrent sockets and that
going above that limit causes the Socket code to stop working? :)

But I agree - I don't want to touch that code if we can simply avoid
this bug by making sure we stay below 1024.

But it sounds broke to me, nevertheless. ;)

Indeed it probably is.


- Most "out of file descriptor" problems involve resource leaks (as
Eliot is suggesting), and in those cases you will see a gradual
increase in file descriptors in /proc/<vmpid>/fd/ over time.
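
A small C sketch of watching that count over time, assuming Linux and the
pid passed on the command line (listing /proc/<pid>/fd from a shell works
just as well):

    #include <stdio.h>
    #include <dirent.h>

    /* count entries in /proc/<pid>/fd, i.e. the process's open fds */
    int count_fds(const char *pid)
    {
        char path[64];
        struct dirent *e;
        int n = 0;
        snprintf(path, sizeof path, "/proc/%s/fd", pid);
        DIR *d = opendir(path);
        if (!d)
            return -1;
        while ((e = readdir(d)) != NULL)
            if (e->d_name[0] != '.')   /* skip "." and ".." */
                n++;
        closedir(d);
        return n;
    }

    int main(int argc, char **argv)
    {
        printf("%d open fds\n", count_fds(argc > 1 ? argv[1] : "self"));
        return 0;
    }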

Thank you for the pointer.  That's a pretty cool directory to look at!

Eventually you run out of descriptors and something horrible happens.

We will start looking at that and other tools too.


- Sorry to repeat myself but this is by far the most important point:
DFtWP.

Sure :). This is why I posted - to get your input. And I have a
suspicion that the SAML issue I mentioned may be the code leaking, we
will start looking.

regards, Göran

Cool. Please keep us posted on progress. In the unlikely event that free
advice turns out to be worth anything, there is plenty more where that
came from ;-)

Dave


Ron Teitelbaum