Debugging Core Files

Intro

This lesson is meant to familiarize you with using the pystack core subcommand to analyze a core dump file. The steps will help you enable core dumps in your Linux environment, run a script that crashes and leaves a core dump, and then analyze the core dump file with PyStack.

Ensuring core dumps are enabled

First, we need to ensure that crashing processes in our Linux environment leave core files behind.

The system lets you set a limit on the size of core dumps, to avoid filling up your disk when a process crashes frequently. You can check the configured size limit by running:

ulimit -c

If that reports 0, core files are disabled on your machine, and you’ll need to enable them before proceeding with this tutorial. You can enable them using the ulimit command in your terminal:

# unlimited core size
ulimit -c unlimited

# or limit core files to roughly 100 MB instead
# (in bash, ulimit -c is measured in 1024-byte blocks)
ulimit -c 100000

Note that this command only enables core files for the shell session you ran the command in. If you want to make this change permanent, you can add that line to your ~/.bashrc or ~/.zshrc file, depending on which shell you use. For instance:

echo "ulimit -c unlimited" >> ~/.bashrc
source ~/.bashrc
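You can also inspect the same limit from inside Python using the standard library’s resource module (a sketch; this mirrors what ulimit -c reports for the current process):

```python
import resource

# Read the current core-file size limits for this process.
# RLIM_INFINITY corresponds to "ulimit -c unlimited".
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("soft:", "unlimited" if soft == resource.RLIM_INFINITY else soft)
print("hard:", "unlimited" if hard == resource.RLIM_INFINITY else hard)

# A process may raise its own soft limit, up to the hard limit.
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
```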

Next, we want to check where core dump files will be written. Check what this command prints:

cat /proc/sys/kernel/core_pattern

That may contain placeholders, prefixed with %. You can learn more about what placeholders are available in the “Naming of core dump files” section of man 5 core.

If you want to control the names used for generated core files, you can update the core_pattern setting with sysctl. Note that this affects all users and processes on the machine! For example:

sudo sysctl -w kernel.core_pattern="/tmp/core-%e.%p.%h.%t"

In this example, we configure the file pattern to include the executable name (%e), process ID (%p), hostname (%h), and timestamp (%t), and to be written to the /tmp directory. If you would like to learn more about core dumps, Red Hat has more information here.
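Note that a sysctl -w change lasts only until the next reboot. To persist the setting, the conventional approach is a drop-in file under /etc/sysctl.d (a sketch; the file name here is arbitrary):

```shell
# Persist the core pattern across reboots via a sysctl drop-in file
echo 'kernel.core_pattern=/tmp/core-%e.%p.%h.%t' | sudo tee /etc/sysctl.d/60-core-pattern.conf

# Reload settings from all sysctl configuration files
sudo sysctl --system
```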

Generating the Core Dump

You can test whether core dumps are being generated by running this command:

python3 -c 'import os; os.abort()'

If everything is configured properly, you should see a message telling you that the process aborted and a core file was dumped. If your shell is bash, that might say simply Aborted (core dumped). If you use zsh, you’d see something like zsh: abort      python3 -c 'import os; os.abort()'.
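If you’d rather check the exit status programmatically, this small Python sketch runs the same crashing one-liner in a child interpreter; subprocess reports death-by-signal as a negative return code:

```python
import signal
import subprocess
import sys

# Run the crashing one-liner in a child interpreter.
proc = subprocess.run([sys.executable, "-c", "import os; os.abort()"])

# A return code of -N means the child was killed by signal N;
# os.abort() raises SIGABRT, so on Linux we expect -6.
print(proc.returncode == -signal.SIGABRT)
```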

Now, we need to find the core file. Given the core_pattern that we configured above with sysctl, the core dump should have been written to /tmp and its name should start with core-, so you can find the core file with:

ls /tmp/core-*
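If several runs have crashed, a short Python sketch can pick out the most recently written core file (this assumes the /tmp/core-* naming configured above):

```python
import glob
import os


def newest_core(pattern="/tmp/core-*"):
    """Return the most recently modified file matching pattern, or None."""
    candidates = glob.glob(pattern)
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


print(newest_core())
```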

Analyzing the Core Dump

Now that we have generated the core dump file and located it on disk, we can use the pystack core command to analyze it like so:

pystack core /tmp/core-<executable_name>.<process_id>.<hostname>.<timestamp>

The output will display the stack trace of the core dump file, which will help you identify the source of the error. In the case of explicitly telling the process to abort itself, that stack isn’t all that interesting:

Using executable found in the core file: /src/.venv/bin/python

Core file information:
state: R zombie: True niceness: 0
pid: 618 ppid: 473 sid: 473
uid: 1000 gid: 1000 pgrp: 618
executable: python3 arguments: python3 -c import os; os.abort()

The process died due receiving signal SIGABRT
Traceback for thread 618 [Has the GIL] (most recent call last):
    (Python) File "<string>", line 1, in <module>

By default PyStack shows you the same Python stack as the interpreter would print if an exception occurred. Since the Python stack only has one frame on it, there’s not much to see here! For debugging a crash, though, it’s generally best to use the --native flag to display the interpreter’s native C frames, as well as any C, C++, or Rust frames from libraries you are using, interleaved with the Python frames in the resulting stack trace.

pystack core --native /tmp/core-<executable_name>.<process_id>.<hostname>.<timestamp>

This will produce output similar to the following:

Using executable found in the core file: /src/.venv/bin/python

Core file information:
state: R zombie: True niceness: 0
pid: 618 ppid: 473 sid: 473
uid: 1000 gid: 1000 pgrp: 618
executable: python3 arguments: python3 -c import os; os.abort()

The process died due receiving signal SIGABRT
Traceback for thread 618 [Has the GIL] (most recent call last):
    (C) File "???", line 0, in _start (python3)
    (C) File "../csu/libc-start.c", line 360, in __libc_start_main@@GLIBC_2.34 (libc.so.6)
    (C) File "../sysdeps/nptl/libc_start_call_main.h", line 58, in __libc_start_call_main (libc.so.6)
    (C) File "???", line 0, in Py_BytesMain (python3)
    (C) File "???", line 0, in Py_RunMain (python3)
    (C) File "???", line 0, in PyRun_SimpleStringFlags (python3)
    (C) File "???", line 0, in PyRun_StringFlags (python3)
    (Python) File "<string>", line 1, in <module>
    (C) File "./stdlib/abort.c", line 79, in abort (libc.so.6)
    (C) File "../sysdeps/posix/raise.c", line 26, in raise (libc.so.6)
    (C) File "./nptl/pthread_kill.c", line 89, in pthread_kill@@GLIBC_2.34 (libc.so.6)
    (C) File "./nptl/pthread_kill.c", line 78, in __pthread_kill_internal (inlined) (libc.so.6)
    (C) File "./nptl/pthread_kill.c", line 44, in __pthread_kill_implementation (inlined) (libc.so.6)

Note that this now includes not only the Python frames on the stack, but also the C frames that led to the first Python call on the stack, as well as the C frames that the last Python call on the stack called into. This lets us see how the interpreter implements the os.abort() function, as well as how the C library (glibc) implements its own abort() function!

Exercise

Read through the code in core_tutorial.py and see if you can spot any bugs:

 1  import gc
 2
 3
 4  def types_found_in_tuples():
 5      elem_types = set()
 6
 7      for obj in gc.get_objects():
 8          if isinstance(obj, tuple):
 9              elem_types.update(map(type, obj))
10
11      for elem_type in elem_types:
12          yield elem_type
13
14
15  # printing all results with multiple calls to print
16  for t in types_found_in_tuples():
17      print(t)
18
19  # printing all results with one call to print
20  print(*types_found_in_tuples(), sep="\n")

Given the context of this exercise, you won’t be particularly surprised to learn that this code crashes, and leaves a core file for you to investigate. Spend a few minutes investigating the pystack core --native report for the core file generated by running this script, and see if you can figure out what’s going wrong. If you aren’t familiar with Python’s internals this will be tough, but these hints should help:

Hint 1

This program uses gc.get_objects to find every tuple in the program’s memory, and then finds the set of distinct element types contained in tuples. The for loop on lines 16 and 17 runs successfully to completion and produces a usable report. We know that it finishes because the PyStack report shows that the crash happens inside the call on line 20.

Printing each type one at a time in a loop ought to be equivalent to telling print() to print them all at once with newlines between each. Why isn’t it?

Hint 2

Check the local variables with --locals as well. Can you spot anything wrong there?

Hint 3

--native mode shows us that the crash happens within a call to PySequence_Tuple(). Read up on what that function does, and see if that gives you a clue about what’s going wrong here.

Hint 4

The documentation tells us that PySequence_Tuple() can be called on either a sequence or an iterable. Can you tell from the stack what type of object it was being called on when the crash happened?

Hint 5

What gets printed out if you run this Python code?

def args_type(*args):
    print(type(args))

args_type()

Solution

Toggle to see the explanation

Just by using --native mode, we see some interesting things. First, the crash doesn’t happen when we loop over the values yielded by types_found_in_tuples() with a for loop, but does happen when we pass all the yielded values to print using * for unpacking. Second, the crash happens inside of a call to PySequence_Tuple(), which is the C function equivalent to the Python code tuple(iterable).

When we add --locals into the mix, we additionally see the very surprising value of obj:

obj: (<invalid object at 0x0>, <invalid object at 0x0>, ...)

That’s PyStack’s representation of a tuple whose first two elements are NULL pointers!

This is about as much as PyStack can tell us, and from here we need to figure things out on our own. But all the clues are there:

  1. types_found_in_tuples() was called into by PySequence_Tuple(), which constructs a new tuple from an iterable

  2. types_found_in_tuples() went on to discover a tuple that contains NULL pointers

  3. The problem only happens when we use * unpacking in the call to print, and doesn’t happen when we use a for loop

  4. Functions receive *args as a tuple!

What’s happening here is that, in order to execute print(*types_found_in_tuples()), the interpreter needs to build a new tuple to pass to print, containing all of the values yielded by types_found_in_tuples(). The interpreter allocates that tuple, and then iterates over the generator in order to populate it. But types_found_in_tuples() uses gc.get_objects and discovers this new, partially initialized tuple, and chaos ensues when it tries to use that object before the interpreter has finished assigning its elements.
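A safe version of the discovery step is easy to try: gc.get_objects() really does hand back every container the garbage collector tracks, tuples included. This sketch just counts them rather than inspecting one that is still under construction:

```python
import gc

# Count the tuples the garbage collector can currently see.
# This is the same discovery mechanism the buggy script relies on;
# the crash only occurs when a *partially built* tuple is inspected.
tracked_tuples = [obj for obj in gc.get_objects() if isinstance(obj, tuple)]
print(len(tracked_tuples) > 0)
```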

Conclusion

In this tutorial, we learned how to use the core subcommand to get stack traces from core dumps. We learned how to ensure core files are produced, and how to test that. We then analyzed the core dump file using the pystack command to identify the source of the fault. Using the --native flag, we were able to view what the Python interpreter itself was doing at the time of the crash, which provided further insights into the cause of the error. By understanding how to analyze core dump files, we can effectively debug and troubleshoot interpreter crashes.

Thank you for following along with this tutorial. Happy coding!