Intro to Snakemake
==================

.. _rules: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html

Under the hood, |showyourwork| is essentially a wrapper around Snakemake. The
code builds the article PDF by parsing the ``showyourwork.yml`` config file and
the ``ms.tex`` manuscript to build the computational graph for the workflow,
identifying which scripts it needs to execute and which datasets it needs to
download to produce all the figures in the article. If you poke around the
API, you'll see that |showyourwork| defines several Snakemake rules to do
these various tasks, then hands over full control to Snakemake.

If your article consists only of text and figures that can be generated by
running lightweight scripts, you probably don't need to worry about any of this.
But for certain use cases, it can be convenient to extend or even override some
of the |showyourwork| functionality by defining custom Snakemake rules.
Below we discuss a few examples of this.


The Snakefile
-------------

Every |showyourwork| article repository is instantiated with a blank Snakefile
at the repository root. This file gets included at the start of the main (build) step of the
workflow, and may thus be used to define custom rules or to run custom Python
code during the workflow. Almost everything you need to know about Snakefiles can
be found in the Snakemake documentation about `rules`_,
but we'll go over the basics below.

Snakefiles are, at their core, Python scripts with a little extra functionality.
Any valid Python script is also a valid Snakefile, so that should give you lots
of flexibility to define your custom commands. However, the main thing you probably
want to use the Snakefile for is to define custom *rules* for your workflow.
Snakefile rules tell Snakemake how to generate an ``output`` file from given
``input`` files, much like rules in a classic ``Makefile``. Snakemake rules
usually look something like this:

.. code-block:: python

    rule simulation:
        input:
            "dataset1.dat",
            "dataset2.dat"
        output:
            "results.dat"
        conda:
            "environment.yml"
        params:
            seed=42,
            iterations=1000,
            mode="fast"
        script:
            "src/scripts/run_simulation.py"

In this example, we've defined a rule called ``simulation``, which tells
Snakemake how to produce the output file ``results.dat``. Specifically,
this file can be generated by running the script ``src/scripts/run_simulation.py``
in an isolated conda environment with specs given in ``environment.yml``.
The rule also tells Snakemake that the files ``dataset1.dat`` and ``dataset2.dat``
are dependencies of ``results.dat``, meaning (1) the rule cannot be executed
if those files are not present (and there's no other Snakemake rule capable
of generating them) and (2) whenever either of those two files is modified,
this rule will be re-executed the next time the workflow runs in order to keep
``results.dat`` up to date with its inputs.
Finally, the rule specifies three parameters under the ``params`` key, which can
be accessed within the script via the ``snakemake.params`` dictionary
(e.g., ``snakemake.params["seed"]``). Note that there's
no need to explicitly import ``snakemake`` within ``run_simulation.py``, as
it gets automagically inserted into the namespace.
However, your code editor, linter or type checker may still show an error
or warning about ``snakemake`` being undefined. To fix this, when using
``snakemake`` version 9.17.3 or above, you can import it explicitly at the
top of your script using the following snippet:

.. code-block:: python

    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        from snakemake.iocontainers import snakemake

For ``snakemake`` versions lower than 9.17.0, the following can be used instead:

.. code-block:: python

    from snakemake.script import snakemake
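
To make the ``snakemake`` object more concrete, here is a minimal sketch of what
``src/scripts/run_simulation.py`` might look like for the ``simulation`` rule
above. The ``run_simulation`` function and its toy logic are hypothetical
placeholders and not part of |showyourwork|; only the ``snakemake.input``,
``snakemake.output`` and ``snakemake.params`` accessors come from Snakemake.

.. code-block:: python

    import random
    from pathlib import Path


    def run_simulation(input_paths, output_path, seed, iterations, mode):
        """Toy stand-in for an expensive simulation (hypothetical logic)."""
        rng = random.Random(seed)  # seeding keeps the result reproducible
        # Use the total size of the input files as a trivial dataset summary
        n_bytes = sum(Path(p).stat().st_size for p in input_paths)
        # In "fast" mode, run only a tenth of the requested iterations
        n_iter = iterations // 10 if mode == "fast" else iterations
        total = sum(rng.random() for _ in range(n_iter))
        Path(output_path).write_text(f"{n_bytes} {n_iter} {total:.6f}\n")


    # Snakemake injects a `snakemake` object when it runs this file via the
    # `script` key; the guard lets the file also be imported on its own.
    if "snakemake" in globals():
        run_simulation(
            input_paths=list(snakemake.input),
            output_path=snakemake.output[0],
            seed=snakemake.params["seed"],
            iterations=snakemake.params["iterations"],
            mode=snakemake.params["mode"],
        )

Keeping the actual work in a plain function is a design choice rather than a
requirement: it makes the script easy to test outside of a workflow run.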

.. note::

    The argument to the ``script`` key must be a Python script.
    If your script is in a different language, you can instead pass the
    ``shell`` key and provide a string containing the shell command
    Snakemake should execute to produce the output file, e.g.,
    ``jupyter execute notebook.ipynb``. If you do that, remember to include
    the script (``notebook.ipynb``) as an explicit input to your rule so
    that Snakemake can track dependencies properly!

    Note that Snakemake also provides a ``run`` key which allows users
    to specify Python code directly. To ensure commands are run in isolated
    conda environments (to maximize reproducibility), |showyourwork| does
    not support this. Please use either ``script`` or ``shell`` in your rules,
    and remember to always provide a conda environment file.
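
As a sketch of the ``shell`` approach described in the note above, a rule that
executes a notebook might look something like the following. The rule name and
the file paths are hypothetical, and the example assumes that the notebook
itself writes the declared output file.

.. code-block:: python

    # Hypothetical rule: names and paths are illustrative assumptions
    rule run_notebook:
        input:
            # Listing the notebook as an input lets Snakemake track changes to it
            "src/scripts/notebook.ipynb"
        output:
            "src/data/notebook_results.dat"
        conda:
            "environment.yml"
        shell:
            "jupyter execute {input}"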

There are a lot of other features supported within rules; for instance,
input files and parameters can be provided as *functions*, adding another
layer of flexibility to your workflow. Rules can also be declared within
for loops, if statements, etc. For the full list of features, please refer
to the Snakemake documentation on `rules`_.
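
For instance, a rule whose inputs are computed by a function could be sketched
as follows. The function ``simulation_inputs``, the ``{name}`` wildcard and the
file layout are all hypothetical choices made for illustration.

.. code-block:: python

    # Hypothetical input function: Snakemake calls it with the wildcards
    # matched from the requested output file
    def simulation_inputs(wildcards):
        return [f"src/data/{wildcards.name}_part{i}.dat" for i in (1, 2)]


    rule simulation_by_name:
        input:
            simulation_inputs
        output:
            "src/data/results_{name}.dat"
        conda:
            "environment.yml"
        script:
            "src/scripts/run_simulation.py"

With this rule, requesting ``src/data/results_planck.dat`` would make Snakemake
look for ``src/data/planck_part1.dat`` and ``src/data/planck_part2.dat``.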


Intermediate results
--------------------

An example usage of the Snakefile is discussed in the :doc:`zenodo` guide, where we show
how to define a Snakemake rule to generate **intermediate results**. The idea here
is that partitioning one's workflow into *pipeline* steps and *plotting* steps
can make life easier for the author (and the interested reader) when writing or
editing the article. For example, suppose one of the figures in an article
depends on running a computationally expensive simulation. If this simulation
is run within the script that generates the figure, *any* changes to that script
will result in a re-execution of the simulation the next time the article is
built. Thus, if one wanted to change something as simple as the color of one
of the lines in the figure, the entire simulation would have to be run again.

The way around this is to split the script into a simulation script and a plotting
script. The former generates an intermediate results file, and the latter loads
that file to do the plotting. This way, the plotting is decoupled from the
simulation, and changes to the plotting script will not trigger re-execution
of the expensive computation.

In the :doc:`zenodo` guide, we show how to define a custom Snakemake rule to
make this work. In that guide, we also discuss how |showyourwork| extends
the Snakemake ``cache`` command to allow caching of intermediate results on
Zenodo, which can help others avoid re-running expensive computations when
reproducing your work.
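
As a rough sketch of the simulation half of the split described above, the
expensive step could be given its own rule in the ``Snakefile``. The rule name
and file paths here are hypothetical; see the :doc:`zenodo` guide for the
complete, supported setup.

.. code-block:: python

    # Hypothetical rule: names and paths are illustrative assumptions
    rule run_expensive_simulation:
        output:
            # Intermediate results file that the plotting script later loads
            "src/data/simulation.dat"
        conda:
            "environment.yml"
        script:
            "src/scripts/simulation.py"

Because the rule does not depend on the plotting script, editing that script
leaves ``src/data/simulation.dat`` untouched.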


Variables in the TeX file
-------------------------

Another use case for custom rules is the definition of dynamic variables in
the TeX manuscript. For example, say I have a script called ``age_of_universe.py``
that infers the age of the universe from some cosmological dataset:

.. code-block:: python
    :caption: **File:** ``age_of_universe.py``

    import paths
    from my_awesome_code import get_age_of_universe

    # Load the data
    dataset = paths.data / "planck.dat"

    # Compute the age
    age = get_age_of_universe(dataset)

    # Write it to disk
    with open(paths.output / "age_of_universe.txt", "w") as f:
        print(f"{age:.3f}", file=f)

I would like to report this age in the text of my article, but I want to avoid
having to retype it every time I change my workflow in a way that affects this
quantity. We can easily automate this by defining a custom Snakemake rule:

.. code-block:: python
    :caption: **File:** ``Snakefile``

    rule age_of_universe:
        input:
            "src/data/planck.dat"
        output:
            "src/tex/output/age_of_universe.txt"
        script:
            "src/scripts/age_of_universe.py"

Then, in my TeX file, I can do the following:

.. code-block:: latex
    :caption: **File:** ``ms.tex``

    Based on a detailed analysis of Planck observations of the cosmic
    microwave background, we have determined the age of the universe
    to be \variable{output/age_of_universe.txt} Gyr.

That's it! This functionality can easily be adapted to automatically populate tables in
your article or anything else that can be generated programmatically from your
workflow. Note that |showyourwork| automatically parses ``\variable``
commands and adds their arguments as explicit dependencies of the manuscript,
so that any changes to these files will trigger a re-run of the compile step.
For more information on this command, see :ref:`latex_variable`.

Mixed figure environments
-------------------------

.. note::

    Coming soon: how to deal with ``\figure`` environments with figures
    that are generated by multiple different scripts, or if you'd like to
    include figures generated by a given script in multiple figure
    environments. It's easy if you define your own Snakemake rules.


Advanced usage
--------------

It is also possible to entirely override |showyourwork| rules. When ingesting
user-defined rules from the Snakefile, the code automatically gives precedence
to those rules over |showyourwork| rules (by setting a higher ``ruleorder`` for
all user rules). This means that if there are two rules that can generate the
same output, Snakemake will always favor the user-defined rule.
You can take advantage of this to provide custom rules to build individual
figures or even the article PDF itself.
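
As a sketch, a user-defined rule for building a single figure might look like
the following. The rule name and file paths are hypothetical; the point is only
that, if a |showyourwork| rule could produce the same output file, this rule
would be the one that runs.

.. code-block:: python

    # Hypothetical rule: names and paths are illustrative assumptions
    rule custom_figure:
        input:
            "src/data/results.dat"
        output:
            "src/tex/figures/custom_figure.pdf"
        conda:
            "environment.yml"
        script:
            "src/scripts/custom_figure.py"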

Using existing (data) files in a workflow by ignoring timestamps
----------------------------------------------------------------

When starting a project or during a rapid development phase, it can be useful to
tell Snakemake to ignore changes to a file's timestamp when running the build. For
example, you may have a slow rule that generates a data file by querying an external data
archive, and you just want to use a temporary subset or an existing copy of the
data. Snakemake supports this with the ``ancient()`` flag, which is wrapped around an input file.
See `how to ignore timestamps <https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#ignoring-timestamps>`_
in the Snakemake documentation for more information about how to use this in a rule.
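
A minimal sketch of a rule that uses ``ancient()`` is shown below. The rule name
and file paths are hypothetical; only the ``ancient()`` wrapper itself comes
from Snakemake.

.. code-block:: python

    # Hypothetical rule: names and paths are illustrative assumptions
    rule process_archive_data:
        input:
            # Treat the existing copy as older than the output, so a newer
            # timestamp on it does not trigger this rule again
            ancient("src/data/archive_query.dat")
        output:
            "src/data/processed.dat"
        conda:
            "environment.yml"
        script:
            "src/scripts/process_archive_data.py"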