Debugging Core Files
====================

Intro
-----

This lesson is meant to familiarize you with using the ``pystack core`` subcommand to analyze a core
dump file. The steps will help you enable core dumps in your Linux environment, run a script that
crashes and leaves a core dump, and then analyze the core dump file with PyStack.

Ensuring core dumps are enabled
-------------------------------

First, we need to ensure that crashing processes in our Linux environment leave core files behind.

The system allows setting a limit on the size of a core dump, to avoid filling up your disk when
a process is crashing frequently. You can check the configured size limit by running:

.. code:: shell

    ulimit -c

If that reports ``0`` it means that core files have been disabled on your machine, and you'll need
to enable them before proceeding with this tutorial. You can enable them using the ``ulimit``
command in your terminal:

.. code:: shell

    # unlimited core size
    ulimit -c unlimited

    # or if you want to limit the size to 100 MB instead
    ulimit -c 100000

Note that this command only enables core files for the shell session you ran the command in. If you
want to make this change permanent, you can add that line to your ``~/.bashrc`` or ``~/.zshrc`` file
depending on your terminal preference. For instance:

.. code:: shell

    echo "ulimit -c unlimited" >> ~/.bashrc
    source ~/.bashrc

Next, we want to check where core dump files will be written. Check what this command prints:

.. code:: shell

    cat /proc/sys/kernel/core_pattern

That may contain placeholders, prefixed with ``%``. You can learn more about what placeholders are
available in the "Naming of core dump files" section of `man 5 core <https://man7.org/linux/man-pages/man5/core.5.html>`_.

If you want to control the names used for generated core files, you can update the ``core_pattern``
setting with ``sysctl``. Note that this affects all users and processes on the machine! The naming
of core dump files can be configured like so:

.. code:: shell

    sudo sysctl -w kernel.core_pattern="/tmp/core-%e.%p.%h.%t"

In this example, we configure the file pattern to include the executable name (``%e``), process ID
(``%p``), hostname (``%h``), and timestamp (``%t``), and to be written to the ``/tmp`` directory. If
you would like to learn more about core dumps, Red Hat has more information `here
<https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/developer_guide/debugging-crashed-application#debugging-crashed-application_understanding-core-dumps6>`_.

Generating the Core Dump
------------------------

You can test whether core dumps are being generated by running this command:

.. code:: shell

    python3 -c 'import os; os.abort()'

If everything is configured properly, you should see a message telling you that the process aborted
and a core file was dumped. If your shell is bash, that might say simply ``Aborted (core dumped)``.
If you use zsh, you'd see something like ``zsh: abort      python3 -c 'import os; os.abort()'``.

Now, we need to find the core file. Given the ``core_pattern`` that we configured above with
``sysctl``, the core dump should have been written to ``/tmp`` and its name should start with
``core-``, so you can find the core file with:

.. code:: shell

    ls /tmp/core-*

Analyzing the Core Dump
-----------------------

Now that we have generated the core dump file and located it on disk, we can use the ``pystack
core`` command to analyze it like so:

.. code:: shell

    pystack core /tmp/core-<executable_name>.<process_id>.<hostname>.<timestamp>

The output will display the stack trace of the core dump file, which will help you identify the
source of the error. In the case of explicitly telling the process to abort itself, that stack isn't
all that interesting:

.. code:: python

    Using executable found in the core file: /src/.venv/bin/python

    Core file information:
    state: R zombie: True niceness: 0
    pid: 618 ppid: 473 sid: 473
    uid: 1000 gid: 1000 pgrp: 618
    executable: python3 arguments: python3 -c import os; os.abort()

    The process died due receiving signal SIGABRT
    Traceback for thread 618 [Has the GIL] (most recent call last):
        (Python) File "<string>", line 1, in <module>

By default PyStack shows you the same Python stack as the interpreter would print if an exception
occurred. Since the Python stack only has one frame on it, there's not much to see here! For
debugging a crash, though, it's generally best to use the ``--native`` flag to display the
interpreter's native C frames, as well as any C or C++ or Rust frames from libraries you are using,
interleaved with the Python frames in the resulting stack trace.

.. code:: shell

    pystack core --native /tmp/core-<executable_name>.<process_id>.<hostname>.<timestamp>

This will produce output similar to the below:

.. code:: python

    Using executable found in the core file: /src/.venv/bin/python

    Core file information:
    state: R zombie: True niceness: 0
    pid: 618 ppid: 473 sid: 473
    uid: 1000 gid: 1000 pgrp: 618
    executable: python3 arguments: python3 -c import os; os.abort()

    The process died due receiving signal SIGABRT
    Traceback for thread 618 [Has the GIL] (most recent call last):
        (C) File "???", line 0, in _start (python3)
        (C) File "../csu/libc-start.c", line 360, in __libc_start_main@@GLIBC_2.34 (libc.so.6)
        (C) File "../sysdeps/nptl/libc_start_call_main.h", line 58, in __libc_start_call_main (libc.so.6)
        (C) File "???", line 0, in Py_BytesMain (python3)
        (C) File "???", line 0, in Py_RunMain (python3)
        (C) File "???", line 0, in PyRun_SimpleStringFlags (python3)
        (C) File "???", line 0, in PyRun_StringFlags (python3)
        (Python) File "<string>", line 1, in <module>
        (C) File "./stdlib/abort.c", line 79, in abort (libc.so.6)
        (C) File "../sysdeps/posix/raise.c", line 26, in raise (libc.so.6)
        (C) File "./nptl/pthread_kill.c", line 89, in pthread_kill@@GLIBC_2.34 (libc.so.6)
        (C) File "./nptl/pthread_kill.c", line 78, in __pthread_kill_internal (inlined) (libc.so.6)
        (C) File "./nptl/pthread_kill.c", line 44, in __pthread_kill_implementation (inlined) (libc.so.6)

Note that this now includes not only the Python frames on the stack, but also the C frames that
led to the first Python call on the stack, as well as the C frames that the last Python call on the
stack called into. This lets us see how the interpreter implements the ``os.abort()`` function, as
well as how the C library (glibc) implements its own ``abort()`` function!

Exercise
--------

Read through the code in ``core_tutorial.py`` and see if you can spot any bugs:

.. literalinclude:: core_tutorial.py
   :linenos:

Given the context of this exercise, you won't be particularly surprised to learn that this code
crashes, and leaves a core file for you to investigate. Spend a few minutes investigating the
``pystack core --native`` report for the core file generated by running this script, and see if you
can figure out what's going wrong. If you aren't familiar with Python's internals this will be
tough, but these hints should help:

.. raw:: html

   <details>
   <summary><i>Hint 1</i></summary>

This program uses `gc.get_objects` to find every tuple in the program's memory, and then finds the
set of distinct element types contained in tuples. The ``for`` loop on lines 16 and 17 runs
successfully to completion and produces a usable report. We know that it finishes because the
PyStack report shows that the crash happens inside the call on line 20.

Printing each type one at a time in a loop ought to be equivalent to telling ``print()`` to print
them all at once with newlines between each. Why isn't it?

.. raw:: html

   </details>
   <details>
   <summary><i>Hint 2</i></summary>

Check the local variables with ``--locals`` as well. Can you spot anything wrong there?

.. raw:: html

   </details>
   <details>
   <summary><i>Hint 3</i></summary>

``--native`` mode shows us that the crash happens within a call to :c:func:`PySequence_Tuple`. Read
up on what that function does, and see if that gives you a clue about what's going wrong here.

.. raw:: html

   </details>
   <details>
   <summary><i>Hint 4</i></summary>

The documentation tells us that :c:func:`PySequence_Tuple` can be called on either a sequence or an
iterable. Can you tell from the stack what type of object it was being called on when the crash
happened?

.. raw:: html

   </details>
   <details>
   <summary><i>Hint 5</i></summary>

What gets printed out if you run this Python code?

.. code:: python

    def args_type(*args):
        print(type(args))

    args_type()

.. raw:: html

   </details>

Solution
--------

.. raw:: html

   <details>
   <summary><i>Toggle to see the explanation</i></summary>

Just by using ``--native`` mode, we see some interesting things. First, the crash doesn't happen
when we loop over the values yielded by ``types_found_in_tuples()`` with a ``for`` loop, but does
happen when we pass all the yielded values to ``print`` using ``*`` for unpacking. Second, the crash
happens inside of a call to :c:func:`PySequence_Tuple`, which is the C function equivalent to the
Python code ``tuple(iterable)``.

When we add ``--locals`` into the mix, we additionally see the very surprising value of ``obj``:

.. code:: text

   obj: (<invalid object at 0x0>, <invalid object at 0x0>, ...)

That's PyStack's representation of a tuple whose first two elements are `NULL pointers
<https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/>`_!

This is about as much as PyStack can tell us, and from here we need to figure things out on our own.
But all the clues are there:

1. ``types_found_in_tuples()`` was called into by :c:func:`PySequence_Tuple`, which constructs a new
   tuple from an iterable
2. ``types_found_in_tuples()`` went on to discover a tuple that contains NULL pointers
3. The problem only happens when we use ``*`` unpacking in the call to ``print``, and doesn't happen
   when we use a ``for`` loop
4. Functions receive ``*args`` as a tuple!

What's happening here is that, in order to execute ``print(*types_found_in_tuples())``, the
interpreter needs to create a new tuple to pass to ``print``, which will contain all of the values
yielded by ``types_found_in_tuples()``. The interpreter creates that tuple, and then starts
iterating over the generator produced by calling ``types_found_in_tuples()`` in order to populate
that new tuple and get it ready to pass to ``print``. But ``types_found_in_tuples()`` uses
`gc.get_objects` and discovers this new, partially initialized tuple, and chaos ensues when it tries
to use that object before the interpreter has finished assigning its elements.

.. raw:: html

   </details>


Conclusion
----------

In this tutorial, we learned how to use the ``core`` subcommand to get stack traces from core dumps.
We learned how to ensure core files are produced, and how to test that. We then analyzed the core
dump file using the ``pystack`` command to identify the source of the fault. Using the ``--native``
flag, we were able to view what the Python interpreter itself was doing at the time of the crash,
which provided further insights into the cause of the error. By understanding how to analyze core
dump files, we can effectively debug and troubleshoot interpreter crashes.

Thank you for following along with this tutorial. Happy coding!