========= Utilities ========= The module ``jug.utils`` has a few functions which are meant to be used in writing jugfiles. Identity -------- This is simply implemented as:: @TaskGenerator def identity(x): return x This might seem like the most pointless function, but it can be helpful in speeding things up. Consider the following case:: from glob import glob def load(fname): return open(fname).readlines() @TaskGenerator def process(inputs, parameter): ... inputs = [] for f in glob('*.data'): inputs.extend(load(f)) # inputs is a large list results = {} for p in range(1000): results[p] = process(inputs, p) How is this processed? Every time ``process`` is called, a new ``jug.Task`` is generated. This task has two arguments: ``inputs`` and an integer. When the hash of the task is computed, both its arguments are analysed. ``inputs`` is a large list of strings. Therefore, it is going to take a very long time to process all of the hashes. Consider the variation:: from jug.utils import identity # ... # same as above inputs = identity(inputs) results = {} for p in range(1000): results[p] = process(inputs, p) Now, the long list is only hashed once! It is transformed into a ``Task`` (we reuse the name ``inputs`` to keep things clear) and each ``process`` call can now compute its hash very fast. Using ``identity`` to induce dependencies ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``identity`` can also be used to introduce dependencies. One can define a helper function:: def value_after(val, token): from jug.utils import identity return identity( [val, token] )[0] Now, this function, will always return its first argument, but will only run once its second argument is available. Here is a typical use case: 1. Function ``process`` takes an output file name 2. Function ``postprocess`` takes as input the output filename of ``process`` Now, you want to run ``process`` and **then** ``postprocess``, but since communication is done with files, Jug does not see that these functions depend on each other. ``value_after`` is the solution:: token = process(input, ofile='output.txt') postprocess(value_after('output.txt', token)) This works independently of whatever ``process`` returns (even if it is ``None``). jug_execute ----------- This is a simple wrapper around ``subprocess.call()``. It adds two important pieces of functionality: 1. it checks the exit code and raises an exception if not zero (this can be disabled by passing ``check_exit=False``). 2. It takes an argument called ``run_after`` which is ignored but can be used to declare dependencies between tasks. Thus, it can be used to ensure that a specific process only runs after something else has run:: from jug.utils import jug_execute from jug import TaskGenerator @TaskGenerator def my_computation(input, ouput_filename): ... token = my_computation(input, 'output.txt') # We want to run gzip, but **only after** `my_computation` has run: jug_execute(['gzip', 'output.txt'], run_after=token) cached_glob ----------- ``cached_glob`` is a simple utility to perform the following common operation:: from glob import glob from jug import CachedFunction files = CachedFunction(glob, pattern) files.sort() Where ``pattern`` is a glob pattern can be simply written as:: from jug.utils import cached_glob files = cached_glob(pattern) NoHash ------ This is imported from the ``jug.unsafe`` module as it bypasses the hashing mechanism and can lead to incorrect results if used incorrectly. This marks certain arguments to Tasks as not being part of the hash that defines the results. It can be useful for arguments that do not change the results, but nonetheless need to be passsed to functions. Example usage:: jug_execute(['SemiBin2', 'single_easy_bin', '--cpus', NoHash('8'), '-i', 'input/contigs.fna.gz', '--bam', 'input/mapped.bam', '--output', 'output' ]) .. automodule:: jug.utils :members: