Debugging Core Files ==================== Intro ----- This lesson is meant to familiarize you with using the ``pystack core`` subcommand to analyze a core dump file. The steps will help you enable core dumps in your Linux environment, run a script that crashes and leaves a core dump, and then analyze the core dump file with PyStack. Ensuring core dumps are enabled ------------------------------- First, we need to ensure that crashing processes in our Linux environment leave core files behind. The system allows setting a limit on the size of a core dump, to avoid filling up your disk when a process is crashing frequently. You can check the configured size limit by running: .. code:: shell ulimit -c If that reports ``0`` it means that core files have been disabled on your machine, and you'll need to enable them before proceeding with this tutorial. You can enable them using the ``ulimit`` command in your terminal: .. code:: shell # unlimited core size ulimit -c unlimited # or if you want to limit the size to 100 MB instead ulimit -c 100000 Note that this command only enables core files for the shell session you ran the command in. If you want to make this change permanent, you can add that line to your ``~/.bashrc`` or ``~/.zshrc`` file depending on your terminal preference. For instance: .. code:: shell echo "ulimit -c unlimited" >> ~/.bashrc source ~/.bashrc Next, we want to check where core dump files will be written. Check what this command prints: .. code:: shell cat /proc/sys/kernel/core_pattern That may contain placeholders, prefixed with ``%``. You can learn more about what placeholders are available in the "Naming of core dump files" section of `man 5 core `_. If you want to control the names used for generated core files, you can update the ``core_pattern`` setting with ``sysctl``. Note that this affects all users and processes on the machine! The naming of core dump files can be configured like so: .. code:: shell sudo sysctl -w kernel.core_pattern="/tmp/core-%e.%p.%h.%t" In this example, we configure the file pattern to include the executable name (``%e``), process ID (``%p``), hostname (``%h``), and timestamp (``%t``), and to be written to the ``/tmp`` directory. If you would like to learn more about core dumps, Red Hat has more information `here `_. Generating the Core Dump ------------------------ You can test whether core dumps are being generated by running this command: .. code:: shell python3 -c 'import os; os.abort()' If everything is configured properly, you should see a message telling you that the process aborted and a core file was dumped. If your shell is bash, that might say simply ``Aborted (core dumped)``. If you use zsh, you'd see something like ``zsh: abort python3 -c 'import os; os.abort()'``. Now, we need to find the core file. Given the ``core_pattern`` that we configured above with ``sysctl``, the core dump should have been written to ``/tmp`` and its name should start with ``core-``, so you can find the core file with: .. code:: shell ls /tmp/core-* Analyzing the Core Dump ----------------------- Now that we have generated the core dump file and located it on disk, we can use the ``pystack core`` command to analyze it like so: .. code:: shell pystack core /tmp/core-... The output will display the stack trace of the core dump file, which will help you identify the source of the error. In the case of explicitly telling the process to abort itself, that stack isn't all that interesting: .. code:: python Using executable found in the core file: /src/.venv/bin/python Core file information: state: R zombie: True niceness: 0 pid: 618 ppid: 473 sid: 473 uid: 1000 gid: 1000 pgrp: 618 executable: python3 arguments: python3 -c import os; os.abort() The process died due receiving signal SIGABRT Traceback for thread 618 [Has the GIL] (most recent call last): (Python) File "", line 1, in By default PyStack shows you the same Python stack as the interpreter would print if an exception occurred. Since the Python stack only has one frame on it, there's not much to see here! For debugging a crash, though, it's generally best to use the ``--native`` flag to display the interpreter's native C frames, as well as any C or C++ or Rust frames from libraries you are using, interleaved with the Python frames in the resulting stack trace. .. code:: shell pystack core --native /tmp/core-... This will produce output similar to the below: .. code:: python Using executable found in the core file: /src/.venv/bin/python Core file information: state: R zombie: True niceness: 0 pid: 618 ppid: 473 sid: 473 uid: 1000 gid: 1000 pgrp: 618 executable: python3 arguments: python3 -c import os; os.abort() The process died due receiving signal SIGABRT Traceback for thread 618 [Has the GIL] (most recent call last): (C) File "???", line 0, in _start (python3) (C) File "../csu/libc-start.c", line 360, in __libc_start_main@@GLIBC_2.34 (libc.so.6) (C) File "../sysdeps/nptl/libc_start_call_main.h", line 58, in __libc_start_call_main (libc.so.6) (C) File "???", line 0, in Py_BytesMain (python3) (C) File "???", line 0, in Py_RunMain (python3) (C) File "???", line 0, in PyRun_SimpleStringFlags (python3) (C) File "???", line 0, in PyRun_StringFlags (python3) (Python) File "", line 1, in (C) File "./stdlib/abort.c", line 79, in abort (libc.so.6) (C) File "../sysdeps/posix/raise.c", line 26, in raise (libc.so.6) (C) File "./nptl/pthread_kill.c", line 89, in pthread_kill@@GLIBC_2.34 (libc.so.6) (C) File "./nptl/pthread_kill.c", line 78, in __pthread_kill_internal (inlined) (libc.so.6) (C) File "./nptl/pthread_kill.c", line 44, in __pthread_kill_implementation (inlined) (libc.so.6) Note that this now includes not only the Python frames on the stack, but also the C frames that led to the first Python call on the stack, as well as the C frames that the last Python call on the stack called into. This lets us see how the interpreter implements the ``os.abort()`` function, as well as how the C library (glibc) implements its own ``abort()`` function! Exercise -------- Read through the code in ``core_tutorial.py`` and see if you can spot any bugs: .. literalinclude:: core_tutorial.py :linenos: Given the context of this exercise, you won't be particularly surprised to learn that this code crashes, and leaves a core file for you to investigate. Spend a few minutes investigating the ``pystack core --native`` report for the core file generated by running this script, and see if you can figure out what's going wrong. If you aren't familiar with Python's internals this will be tough, but these hints should help: .. raw:: html
Hint 1 This program uses `gc.get_objects` to find every tuple in the program's memory, and then finds the set of distinct element types contained in tuples. The ``for`` loop on lines 16 and 17 runs successfully to completion and produces a usable report. We know that it finishes because the PyStack report shows that the crash happens inside the call on line 20. Printing each type one at a time in a loop ought to be equivalent to telling ``print()`` to print them all at once with newlines between each. Why isn't it? .. raw:: html
Hint 2 Check the local variables with ``--locals`` as well. Can you spot anything wrong there? .. raw:: html
Hint 3 ``--native`` mode shows us that the crash happens within a call to :c:func:`PySequence_Tuple`. Read up on what that function does, and see if that gives you a clue about what's going wrong here. .. raw:: html
Hint 4 The documentation tells us that :c:func:`PySequence_Tuple` can be called on either a sequence or an iterable. Can you tell from the stack what type of object it was being called on when the crash happened? .. raw:: html
Hint 5 What gets printed out if you run this Python code? .. code:: python def args_type(*args): print(type(args)) args_type() .. raw:: html
Solution -------- .. raw:: html
Toggle to see the explanation Just by using ``--native`` mode, we see some interesting things. First, the crash doesn't happen when we loop over the values yielded by ``types_found_in_tuples()`` with a ``for`` loop, but does happen when we pass all the yielded values to ``print`` using ``*`` for unpacking. Second, the crash happens inside of a call to :c:func:`PySequence_Tuple`, which is the C function equivalent to the Python code ``tuple(iterable)``. When we add ``--locals`` into the mix, we additionally see the very surprising value of ``obj``: .. code:: text obj: (, , ...) That's PyStack's representation of a tuple whose first two elements are `NULL pointers `_! This is about as much as PyStack can tell us, and from here we need to figure things out on our own. But all the clues are there: 1. ``types_found_in_tuples()`` was called into by :c:func:`PySequence_Tuple`, which constructs a new tuple from an iterable 2. ``types_found_in_tuples()`` went on to discover a tuple that contains NULL pointers 3. The problem only happens when we use ``*`` unpacking in the call to ``print``, and doesn't happen when we use a ``for`` loop 4. Functions receive ``*args`` as a tuple! What's happening here is that, in order to execute ``print(*types_found_in_tuples())``, the interpreter needs to create a new tuple to pass to ``print``, which will contain all of the values yielded by ``types_found_in_tuples()``. The interpreter creates that tuple, and then starts iterating over the generator produced by calling ``types_found_in_tuples()`` in order to populate that new tuple and get it ready to pass to ``print``. But ``types_found_in_tuples()`` uses `gc.get_objects` and discovers this new, partially initialized tuple, and chaos ensues when it tries to use that object before the interpreter has finished assigning its elements. .. raw:: html
Conclusion ---------- In this tutorial, we learned how to use the ``core`` subcommand to get stack traces from core dumps. We learned how to ensure core files are produced, and how to test that. We then analyzed the core dump file using the ``pystack`` command to identify the source of the fault. Using the ``--native`` flag, we were able to view what the Python interpreter itself was doing at the time of the crash, which provided further insights into the cause of the error. By understanding how to analyze core dump files, we can effectively debug and troubleshoot interpreter crashes. Thank you for following along with this tutorial. Happy coding!