Project

General

Profile

Actions

Bug #238

open

Bip uses 100% CPU

Added by Bojan Rajkovic over 12 years ago. Updated over 10 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
2011-08-06
Due date:
% Done:

0%

Estimated time:
Patch Available:
No
Found in Versions:
0.8.8
Confirmed:
No
Branch:
Security:
No
Help Needed:

Description

On a Debian sid install, Bip version 0.8.8 uses 100% CPU consistently. Attached is a config file with channels/passwords removed, but a total of 30 channels across 5 networks are used via Bip. GDB attaching only shows 1 thread, can't find any obvious busy loops. Attaching with strace shows many calls to gettimeofday ({...}, NULL), which are likely the cause of the 100% CPU usage. Attached is an strace log, with only a few seconds worth of runtime (almost 400 KB due to all the gettimeofday calls).


Files

bip.conf (1.8 KB) bip.conf configuration Bojan Rajkovic, 2011-08-07 00:01
bip-trace.log (383 KB) bip-trace.log trace log Bojan Rajkovic, 2011-08-07 00:01
Actions #1

Updated by Pierre-Louis Bonicoli over 12 years ago

Quick analyse of the log file: select indicate that file descriptor 72 is ready for writing, next read is called on the file descriptor 72 but the file descriptor is not available for reading.

Actions #2

Updated by Bojan Rajkovic over 12 years ago

Hmm—I cannot reproduce the error anymore. The only difference is that in the meantime I turned off SSL, because my client was having trouble with it. I'm starting to suspect that this issue may have been caused by my client screwing up SSL negotiations or the SSL socket itself. It would explain the large amount of gettimeofday calls (SSL being timing-dependent and all), and would neatly explain why this bug has mysteriously vanished. This should probably be marked as INVALID.

Actions #3

Updated by Michael Hutchinson almost 12 years ago

I see this once every few weeks, and I know a few other people who see this. It's not obvious what the trigger is. My bip instance is connecting to three IRC servers, once of which is SSL protected, and I connect to bip with Linkinus from a Mac, and YChat from Windows.

Is there anything I could do to debug bip when it's in this state?

Actions #4

Updated by Antoine Amarilli over 10 years ago

I can reproduce the 100% CPU usage problem on Debian wheezy with bip 0.8.8. I am connecting to 3 networks and using SSL to talk with the clients.

stracing shows an infinite sequence of:

select(258, [3 4 5 6 156 196 199 210 211 232 234 238 241 243 254 255 256 257], [34 35], [3 4 5 6 34 35 156 196 199 210 211 232 234 238 241 243 254 255 256 257], {0, 772000}) = 2 (out [34 35], left {0, 771997})
read(34, 0x7ff54e6470, 5)               = -1 EAGAIN (Resource temporarily unavailable)
read(35, 0x7ff56d3bc0, 5)               = -1 EAGAIN (Resource temporarily unavailable)

I'm not sure about how to debug the problem. I attached to the process with gdb and took two arbitrary backtraces:

(gdb) bt
#0  0x000003a3b1a18fc0 in BIO_int_ctrl () from /usr/lib/libcrypto.so.0.9.8
#1  0x000003a3b1d24990 in SSL_set_fd () from /usr/lib/libssl.so.0.9.8
#2  0x0000007ff3362724 in SSLize ()
#3  0x0000007ff3360cd7 in check_event_write ()
#4  0x0000007ff33615a9 in wait_event ()
#5  0x0000007ff336a41b in irc_main ()
#6  0x0000007ff335bdcf in main ()
(gdb) bt
#0  0x000003a3b1693553 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000007ff3361391 in wait_event ()
#2  0x0000007ff336a41b in irc_main ()
#3  0x0000007ff335bdcf in main ()

There is no infinite message sequence in bip's stdout.

Any ideas about where to go from here? I had to restart bip, but the problem should recur eventually...

Thanks!

Actions #5

Updated by Diego Ongaro over 10 years ago

I also see this problem occasionally on Debian wheezy with bip 0.8.8 and SSL clients.

I'm new to bip's internals, but I think the problem might occur when a socket is writable and CONN_NEED_SSLIZE. But setting up an SSL connection requires reading from it as well as writing to it, so sometimes SSLize can't make progress until there are bytes to read.

SSLize() calls SSL_accept() with a non-blocking socket, which according to the man page:

       If the underlying BIO is non-blocking, SSL_accept() will also
       return when the underlying BIO could not satisfy the needs of
       SSL_accept() to continue the handshake, indicating the problem by
       the return value -1.  In this case a call to SSL_get_error() with
       the return value of SSL_accept() will yield SSL_ERROR_WANT_READ or
       SSL_ERROR_WANT_WRITE. The calling process then must repeat the
       call after taking appropriate action to satisfy the needs of
       SSL_accept().  The action depends on the underlying BIO. When
       using a non-blocking socket, nothing is to be done, but select()
       can be used to check for the required condition. When using a
       buffering BIO, like a BIO pair, data must be written into or
       retrieved out of the BIO before being able to continue.

But SSLize ignores SSL_ERROR_WANT_READ and SSL_ERROR_WANT_WRITE entirely. I think in the case of SSL_ERROR_WANT_READ, the socket should be added to the select() read group.

Can any of the core developers chime in?

Actions

Also available in: Atom PDF