2000.03.19 02:42 "faxq maxing out CPU - let's crack this!!!!!", by Darren Nickerson

2000.03.19 02:42 "faxq maxing out CPU - let's crack this!!!!!", by Darren Nickerson

Folks,

As may of you know, the latest release of Sam Leffler's libtiff (now maintained by the great people at http://www.libtiff.org) broke HylaFAX(tm). Not too much of a problem for us until RedHat decided to adopt the new libtiff and release update RPMs for RedHat-6.1 (Cartman). Now we're in deep deep trouble, and it's essential that we resolve this issue - as far as I know it's the only MAJOR problem holding back a release of hylafax-4.1-beta3.

I sat down today and recreated the bug. It's easy to do... just install a build of the current HylaFAX CVS onto vanilla RedHat, then upgrade libtiff to:

[darren@ducent darren]$ rpm -qa | egrep tiff
libtiff-3.5.4-1
libtiff-devel-3.5.4-1

I set strace up on each of the relevant processes, and then submitted a fax, and watched. Sure enough, the fax was submitted and processing began (sendfax exited cleanly) and faxq ended up maxing out CPU, leaving hylafax completely hung and unresponsive, yet not making any syscalls. In fact, once the hang was initiated, the strace output did not grow at all.

Here's the last bit of each strace:

[darren@ducent darren]$ tail /tmp/*strace.out
==> /tmp/faxgetty_strace.out <==
18934 ioctl(8, SNDCTL_TMR_START, {B19200 -opost -isig -icanon -echo ...}) = 0
18934 ioctl(8, TCGETS, {B19200 -opost -isig -icanon -echo ...}) = 0
18934 flock(4, LOCK_EX)                 = 0
18934 _llseek(4, 0, [0], SEEK_SET)      = 0
18934 write(4, "Running and idle\n", 17) = 17
18934 _llseek(4, 0, [17], SEEK_CUR)     = 0
18934 ftruncate(4, 17)                  = 0
18934 flock(4, LOCK_UN)                 = 0
18934 write(6, "+ttyS0:RP3fd5e7ff:ff\0", 21) = 21
18934 select(9, [5 8], [], [8], NULL
==> /tmp/faxq_strace.out <==
18940 fstat(9, {st_mode=S_IFREG|0600, st_size=10910, ...}) = 0
18940 mmap(0, 10910, PROT_READ, MAP_SHARED, 9, 0) = 0x402fe000
18940 munmap(0x402fe000, 10910)         = 0
18940 close(9)                          = 0
18940 close(8)                          = 0
18940 open("docq/doc6.cover;31", O_RDONLY) = 8
18940 read(8, "II*\0\10\0\0\0", 8)      = 8
18940 fstat(8, {st_mode=S_IFREG|0600, st_size=25761, ...}) = 0
18940 mmap(0, 25761, PROT_READ, MAP_SHARED, 8, 0) = 0x402fe000
18940 brk(0x8088000)                    = 0x8088000

==> /tmp/hfaxd_strace.out <==
18936 close(3)                          = 0
18936 unlink("client/18936")            = 0
18936 _exit(1)                          = ?
18916 <... select resumed> )            = ? ERESTARTNOHAND (To be restarted)
18916 --- SIGCHLD (Child exited) ---
18916 wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 1], WNOHANG, NULL) = 18936
18916 wait4(-1, 0xbffff848, WNOHANG, NULL) = -1 ECHILD (No child processes)
18916 sigreturn()                       = ? (mask now [])
18916 rt_sigaction(SIGCHLD, {SIG_DFL}, NULL, 8) = 0
18916 select(5, [4], [], [], NULL
==> /tmp/sendfax_strace.out <==
18406 munmap(0x40018000, 4096)          = 0
18406 rt_sigaction(SIGHUP, {SIG_IGN}, {0x804b324, [], SA_RESTART|0x4000000}, 8) = 0
18406 rt_sigaction(SIGINT, {SIG_IGN}, {0x804b324, [], SA_RESTART|0x4000000}, 8) = 0
18406 rt_sigaction(SIGTERM, {SIG_IGN}, {0x804b324, [], SA_RESTART|0x4000000}, 8) = 0
18406 unlink("/tmp//sndfaxxrGI1A")      = 0
18406 unlink("/tmp//sndfax5lhY76")      = 0
18406 unlink("/tmp//sndfaxmJvGjc")      = 0
18406 unlink("/tmp//sndfax5lhY76")      = -1 ENOENT (No such file or directory)
18406 munmap(0x40275000, 4096)          = 0
18406 _exit(0)                          = ?

Looks to me like faxq goes silent just after opening the TIFF file (filehandle 8) and doing an mmap. I can provide the entire strace output for anyone who wants to look at this in greater detail... this is about as far as I can take this tonight. If you read this and went "ahhhhhhhhhhhhhh, I seeeeeeeeeeeee" then I'd love to hear from you ;-)

-Darren