Debugging Core Files¶
Intro¶
This lesson is meant to familiarize you with using the pystack core subcommand to analyze a core
dump file. The steps will help you enable core dumps in your Linux environment, run a script that
crashes and leaves a core dump, and then analyze the core dump file with PyStack.
Ensuring core dumps are enabled¶
First, we need to ensure that crashing processes in our Linux environment leave core files behind.
The system allows setting a limit on the size of a core dump, to avoid filling up your disk when a process is crashing frequently. You can check the configured size limit by running:
ulimit -c
If that reports 0, it means that core files have been disabled on your machine, and you’ll need
to enable them before proceeding with this tutorial. You can enable them using the ulimit
command in your terminal:
# unlimited core size
ulimit -c unlimited
# or if you want to limit the size to 100 MB instead
ulimit -c 100000
Note that this command only enables core files for the shell session you ran the command in. If you
want to make this change permanent, you can add that line to your ~/.bashrc or ~/.zshrc file,
depending on your shell preference. For instance:
echo "ulimit -c unlimited" >> ~/.bashrc
source ~/.bashrc
Next, we want to check where core dump files will be written. Check what this command prints:
cat /proc/sys/kernel/core_pattern
That may contain placeholders, prefixed with %. You can learn more about what placeholders are
available in the “Naming of core dump files” section of man 5 core.
If you want to control the names used for generated core files, you can update the core_pattern
setting with sysctl. Note that this affects all users and processes on the machine! The naming
of core dump files can be configured like so:
sudo sysctl -w kernel.core_pattern="/tmp/core-%e.%p.%h.%t"
In this example, we configure the file pattern to include the executable name (%e), process ID
(%p), hostname (%h), and timestamp (%t), and to be written to the /tmp directory. If
you would like to learn more about core dumps, Red Hat has more information here.
Generating the Core Dump¶
You can test whether core dumps are being generated by running this command:
python3 -c 'import os; os.abort()'
If everything is configured properly, you should see a message telling you that the process aborted
and a core file was dumped. If your shell is bash, that might simply say Aborted (core dumped).
If you use zsh, you’d see something like zsh: abort      python3 -c 'import os; os.abort()'.
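You can also confirm programmatically that the process died from SIGABRT. Here is a small sketch using only the Python standard library; on POSIX systems, a negative subprocess return code means the child was killed by that signal:

```python
import signal
import subprocess
import sys

# Run a child Python process that aborts itself, just like the command above.
proc = subprocess.run([sys.executable, "-c", "import os; os.abort()"])

# On POSIX, a return code of -N means the child was killed by signal N.
assert proc.returncode == -signal.SIGABRT
print("child died from SIGABRT, as expected")
```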
Now, we need to find the core file. Given the core_pattern that we configured above with
sysctl, the core dump should have been written to /tmp and its name should start with
core-, so you can find the core file with:
ls /tmp/core-*
Analyzing the Core Dump¶
Now that we have generated the core dump file and located it on disk, we can use the pystack core
command to analyze it like so:
pystack core /tmp/core-<executable_name>.<process_id>.<hostname>.<timestamp>
The output will display the stack trace of the core dump file, which will help you identify the source of the error. In the case of explicitly telling the process to abort itself, that stack isn’t all that interesting:
Using executable found in the core file: /src/.venv/bin/python
Core file information:
state: R zombie: True niceness: 0
pid: 618 ppid: 473 sid: 473
uid: 1000 gid: 1000 pgrp: 618
executable: python3 arguments: python3 -c import os; os.abort()
The process died due receiving signal SIGABRT
Traceback for thread 618 [Has the GIL] (most recent call last):
(Python) File "<string>", line 1, in <module>
By default, PyStack shows you the same Python stack as the interpreter would print if an exception
occurred. Since the Python stack only has one frame on it, there’s not much to see here! For
debugging a crash, though, it’s generally best to use the --native flag to display the
interpreter’s native C frames, as well as any C, C++, or Rust frames from libraries you are using,
interleaved with the Python frames in the resulting stack trace.
pystack core --native /tmp/core-<executable_name>.<process_id>.<hostname>.<timestamp>
This will produce output similar to the following:
Using executable found in the core file: /src/.venv/bin/python
Core file information:
state: R zombie: True niceness: 0
pid: 618 ppid: 473 sid: 473
uid: 1000 gid: 1000 pgrp: 618
executable: python3 arguments: python3 -c import os; os.abort()
The process died due receiving signal SIGABRT
Traceback for thread 618 [Has the GIL] (most recent call last):
(C) File "???", line 0, in _start (python3)
(C) File "../csu/libc-start.c", line 360, in __libc_start_main@@GLIBC_2.34 (libc.so.6)
(C) File "../sysdeps/nptl/libc_start_call_main.h", line 58, in __libc_start_call_main (libc.so.6)
(C) File "???", line 0, in Py_BytesMain (python3)
(C) File "???", line 0, in Py_RunMain (python3)
(C) File "???", line 0, in PyRun_SimpleStringFlags (python3)
(C) File "???", line 0, in PyRun_StringFlags (python3)
(Python) File "<string>", line 1, in <module>
(C) File "./stdlib/abort.c", line 79, in abort (libc.so.6)
(C) File "../sysdeps/posix/raise.c", line 26, in raise (libc.so.6)
(C) File "./nptl/pthread_kill.c", line 89, in pthread_kill@@GLIBC_2.34 (libc.so.6)
(C) File "./nptl/pthread_kill.c", line 78, in __pthread_kill_internal (inlined) (libc.so.6)
(C) File "./nptl/pthread_kill.c", line 44, in __pthread_kill_implementation (inlined) (libc.so.6)
Note that this now includes not only the Python frames on the stack, but also the C frames that
led to the first Python call on the stack, as well as the C frames that the last Python call on the
stack called into. This lets us see how the interpreter implements the os.abort() function, as
well as how the C library (glibc) implements its own abort() function!
Exercise¶
Read through the code in core_tutorial.py and see if you can spot any bugs:
 1  import gc
 2
 3
 4  def types_found_in_tuples():
 5      elem_types = set()
 6
 7      for obj in gc.get_objects():
 8          if isinstance(obj, tuple):
 9              elem_types.update(map(type, obj))
10
11      for elem_type in elem_types:
12          yield elem_type
13
14
15  # printing all results with multiple calls to print
16  for t in types_found_in_tuples():
17      print(t)
18
19  # printing all results with one call to print
20  print(*types_found_in_tuples(), sep="\n")
Given the context of this exercise, you won’t be particularly surprised to learn that this code
crashes, and leaves a core file for you to investigate. Spend a few minutes investigating the
pystack core --native report for the core file generated by running this script, and see if you
can figure out what’s going wrong. If you aren’t familiar with Python’s internals this will be
tough, but these hints should help:
Hint 1
This program uses gc.get_objects to find every tuple in the program’s memory, and then finds the
set of distinct element types contained in tuples. The for loop on lines 16 and 17 runs
successfully to completion and produces a usable report. We know that it finishes because the
PyStack report shows that the crash happens inside the call on line 20.
Printing each type one at a time in a loop ought to be equivalent to telling print() to print
them all at once with newlines between each. Why isn’t it?
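For a well-behaved iterable, the two printing styles really are equivalent. A quick sanity check, with an ordinary list standing in for the generator:

```python
import io
from contextlib import redirect_stdout

items = ["alpha", "beta", "gamma"]

# Style 1: one print call per element, in a loop.
loop_output = io.StringIO()
with redirect_stdout(loop_output):
    for item in items:
        print(item)

# Style 2: a single print call, with * unpacking and newline separators.
star_output = io.StringIO()
with redirect_stdout(star_output):
    print(*items, sep="\n")

# Both styles produce identical output for a normal iterable.
assert loop_output.getvalue() == star_output.getvalue()
```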
Hint 2
Check the local variables with --locals as well. Can you spot anything wrong there?
Hint 3
--native mode shows us that the crash happens within a call to PySequence_Tuple(). Read
up on what that function does, and see if that gives you a clue about what’s going wrong here.
Hint 4
The documentation tells us that PySequence_Tuple() can be called on either a sequence or an
iterable. Can you tell from the stack what type of object it was being called on when the crash
happened?
Hint 5
What gets printed out if you run this Python code?
def args_type(*args):
    print(type(args))

args_type()
Solution¶
Just by using --native mode, we see some interesting things. First, the crash doesn’t happen
when we loop over the values yielded by types_found_in_tuples() with a for loop, but does
happen when we pass all the yielded values to print using * for unpacking. Second, the crash
happens inside of a call to PySequence_Tuple(), which is the C function equivalent to the
Python code tuple(iterable).
When we add --locals into the mix, we additionally see the very surprising value of obj:
obj: (<invalid object at 0x0>, <invalid object at 0x0>, ...)
That’s PyStack’s representation of a tuple whose first two elements are NULL pointers!
This is about as much as PyStack can tell us, and from here we need to figure things out on our own. But all the clues are there:
- types_found_in_tuples() was called into by PySequence_Tuple(), which constructs a new tuple from an iterable
- types_found_in_tuples() went on to discover a tuple that contains NULL pointers
- The problem only happens when we use * unpacking in the call to print, and doesn’t happen when we use a for loop
- Functions receive *args as a tuple!
What’s happening here is that, in order to execute print(*types_found_in_tuples()), the
interpreter needs to create a new tuple to pass to print, which will contain all of the values
yielded by types_found_in_tuples(). The interpreter creates that tuple, and then starts
iterating over the generator produced by calling types_found_in_tuples() in order to populate
that new tuple and get it ready to pass to print. But types_found_in_tuples() uses
gc.get_objects and discovers this new, partially initialized tuple, and chaos ensues when it tries
to use that object before the interpreter has finished assigning its elements.
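The ordering described here is easy to verify: with * unpacking, the generator is fully consumed into the callee’s argument tuple before the called function runs a single line. A small sketch (the names gen and callee are just placeholders for this demonstration):

```python
events = []

def gen():
    events.append("yield 1")
    yield 1
    events.append("yield 2")
    yield 2

def callee(*args):
    # By the time we get here, the argument tuple has been fully built.
    events.append("callee entered")
    return args

result = callee(*gen())

# The generator was exhausted before callee started running, and the
# yielded values arrived packed into a tuple.
assert events == ["yield 1", "yield 2", "callee entered"]
assert isinstance(result, tuple)
assert result == (1, 2)
```
In the crashing script, it is exactly this intermediate argument tuple, still full of NULL slots while the generator runs, that gc.get_objects stumbles across.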
Conclusion¶
In this tutorial, we learned how to use the core subcommand to get stack traces from core dumps.
We learned how to ensure core files are produced, and how to test that. We then analyzed the core
dump file using the pystack command to identify the source of the fault. Using the --native
flag, we were able to view what the Python interpreter itself was doing at the time of the crash,
which provided further insights into the cause of the error. By understanding how to analyze core
dump files, we can effectively debug and troubleshoot interpreter crashes.
Thank you for following along with this tutorial. Happy coding!