I had a problem. My alarm server was having some major issues and would stop functioning at random times. There was no core dump to dig through for crossed signals (none of the odd null-termination things I am used to). After a little digging, I found that the process was getting sniped - the Linux kernel was killing it because the system was running out of memory. Not cool. I didn't really know how to go about fixing the problem, so I band-aided it with a cron job that rebooted the device once per week. I ran in that configuration for a couple of months while working on other priorities. Also not cool, but it kept the security system online while I handled my growing project list.
So I started researching ways to isolate the memory leaks. One command-line tool stood out as the best one to use, and it would work with a natively-compiled ELF binary.
Valgrind
Valgrind does a lot of things, and memory leak detection is just a small slice of it. I downloaded the source, then compiled and installed it (using the standard "./configure && make && make install" routine). After many Google searches, I finally came up with the following command to launch the server inside of Valgrind for memory leak detection:

G_SLICE=always-malloc G_DEBUG=gc-friendly valgrind -v --tool=memcheck --leak-check=full --num-callers=40 --track-origins=yes --read-var-info=yes ./eventserver/server/eventserver -bf ./eventserver/server/eventserver.conf > valgrind.log 2>&1
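As an aside for anyone who has never read memcheck output: it can help to run Valgrind against a throwaway program first. The snippet below is a hypothetical standalone example (it is not part of my eventserver code) that leaks a single allocation on purpose, just so you can see what a "definitely lost" record looks like before wading into a real log.

/* leaky.c - deliberately leaks one malloc()'d block.
 * Build:  gcc -o leaky leaky.c
 * Check:  valgrind --tool=memcheck --leak-check=full ./leaky
 */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *buf = malloc(64);        /* allocate a block...                 */
    strcpy(buf, "never freed");    /* ...use it...                        */
    buf = NULL;                    /* ...then drop the only pointer to it */
    return 0;                      /* memcheck: 64 bytes definitely lost  */
}

Memcheck reports those 64 bytes as "definitely lost" with a short stack trace pointing at the malloc() call, which is exactly the shape of output shown below.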
With the command above, I essentially wanted to track everything related to memory interaction - I was (after all) looking for a memory leak. I ran the server in this state for a while (in a screen session, to be precise) to let it build up stats. When I hit Control-C to terminate the application, I opened up the valgrind.log file and found a few stanzas similar to:

==12262== 891 (108 direct, 783 indirect) bytes in 9 blocks are definitely lost in loss record 86 of 93
==12262== at 0x482CF8C: malloc (vg_replace_malloc.c:236)
==12262== by 0xAA59: eventserver_key_set (config.c)
==12262== by 0xAB55: eventserver_sensor_create (config.c)
==12262== by 0x4D4DDA9: arduino_handle_request (arduino.c)
==12262== by 0xC795: eventserver_process_hook_socket_handle_incoming (process.c)
==12262== by 0xDC0D: eventserver_main_connection_handler (main.c)
==12262== by 0xE29D: main (main.c)
==12262==
==12262== 65,550 bytes in 5 blocks are possibly lost in loss record 91 of 93
==12262== at 0x482CF8C: malloc (vg_replace_malloc.c:236)
==12262== by 0x4DA5A23: my_malloc (in /usr/lib/libmysqlclient.so.16.0.0)
I wasn't sure what I was looking at yet, but it looked similar to a gdb stack trace. The unfortunate part was that the traces pointed at allocations made with malloc and memory pointer problems in calls like strchr, and since the files referenced called those functions numerous times, I didn't have much to go on. I went back to Google and found some additional information: with the right compilation flags, gcc will embed gdb debugging information, such as line numbers. I manually added the following to the CFLAGS in all of my make files:

-g -O0
The -O0 (capital "O" and the number zero) is an optimization flag that means "don't optimize anything, seriously, leave a lot of junk in the binary" (I swear that's what it says in the man page for gcc). The -g option means "add gdb debugger information". You can build with -O1 and get close to the right line numbers, but they might not always be accurate. However, -O0 runs molasses-slow, so use it sparingly. I thought I'd do it anyway.

I started the application again, let it run for a few days, and broke out of it once more. This time, I had line numbers in the output. Hallelujah!
==12262== 891 (108 direct, 783 indirect) bytes in 9 blocks are definitely lost in loss record 86 of 93
==12262== at 0x482CF8C: malloc (vg_replace_malloc.c:236)
==12262== by 0xAA59: eventserver_key_set (config.c:219)
==12262== by 0xAB55: eventserver_sensor_create (config.c:256)
==12262== by 0x4D4DDA9: arduino_handle_request (arduino.c:228)
==12262== by 0xC795: eventserver_process_hook_socket_handle_incoming (process.c:257)
==12262== by 0xDC0D: eventserver_main_connection_handler (main.c:494)
==12262== by 0xE29D: main (main.c:669)
==12262==
==12262== 65,550 bytes in 5 blocks are possibly lost in loss record 91 of 93
==12262== at 0x482CF8C: malloc (vg_replace_malloc.c:236)
==12262== by 0x4DA5A23: my_malloc (in /usr/lib/libmysqlclient.so.16.0.0)
This was great news! Suddenly, I could look back through the stack traces and see exactly where things were happening. Valgrind initially identified a couple hundred memory leak contexts and a couple hundred memory errors (the "definitely lost in loss record ## of ##" entries). On closer inspection, I realized that the vast majority of them shared a nearly identical stack trace and line number (maybe one or two lines off). I opened that file, jumped to the line number, and........ I was using the wrong variable to allocate a chunk of memory. I made the fix, re-compiled, and instantly narrowed the output down to about three basic problems: places where I was using functions that require NULL termination on variables that may not have had the termination, plus a leak from the MySQL client library. I fixed the NULL-termination errors, recompiled yet again, and was left with only the MySQL leak (which appears to be standard with that version, which is very old). No more memory leaks!
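For anyone who has not chased these particular classes of bug before, here is a hypothetical sketch of what the two fixes looked like in spirit. None of this is the actual eventserver code (the struct, function names, and buffer sizes are made up for illustration); it just shows the "wrong variable in the allocation" style of leak and the missing NULL-terminator problem that Valgrind was flagging.

#include <stdlib.h>
#include <string.h>

struct sensor {
    char *name;
    char *key;
};

/* Leak: malloc()'s result lands in the wrong member.  Whatever s->name
 * pointed to before becomes unreachable, and the key is never stored -
 * the kind of "definitely lost" record memcheck kept pointing at. */
void sensor_set_key_buggy(struct sensor *s, const char *key)
{
    s->name = malloc(strlen(key) + 1);   /* wrong variable: should be s->key */
    strcpy(s->name, key);
}

void sensor_set_key_fixed(struct sensor *s, const char *key)
{
    s->key = malloc(strlen(key) + 1);
    strcpy(s->key, key);
}

/* NULL termination: strncpy() does not terminate the destination when
 * the source fills it, so strchr() can read past the end of the buffer
 * (the "memory pointer problems" that showed up around strchr). */
void parse_buggy(const char *input)
{
    char buf[16];
    strncpy(buf, input, sizeof(buf));     /* may leave buf unterminated   */
    char *colon = strchr(buf, ':');       /* possible read past the buffer */
    (void)colon;
}

void parse_fixed(const char *input)
{
    char buf[16];
    strncpy(buf, input, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';          /* guarantee termination */
    char *colon = strchr(buf, ':');
    (void)colon;
}

int main(void)
{
    struct sensor s = { NULL, NULL };
    sensor_set_key_fixed(&s, "abc123");
    parse_fixed("zone:4:armed");
    free(s.key);
    return 0;
}

As for the leftover libmysqlclient record: "possibly lost" blocks allocated through my_malloc are often just the client library's own one-time global state, and calling mysql_library_end() at shutdown sometimes quiets them - though with a library that old, your mileage may vary.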