Metapipe Syntax

The syntax for .mp files is as follows.

Structure

Each command in the pipeline should be on its own line and have the following structure:

[command alias] command [flags] {input/output pattern} 

Example
-------

1. cut -f 1 some_file > {o}

In this example, the cut command is given the alias 1. It cuts column 1 from some_file and writes the result to an output file called some_file.1.

Note: An alias can be anything, as long as it’s unique in the pipeline.

Input Patterns

Consider the following command:

[COMMANDS]
python somescript {1||2||3}

[FILES]
1. some_file1.txt
2. some_file2.txt
3. some_file3.txt

This command will run the Python script three times in parallel, once for each file specified. The output will look something like this:

Output
------

python somescript some_file1.txt
python somescript some_file2.txt
python somescript some_file3.txt
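
The expansion described above can be sketched in a few lines of Python. This is only an illustration of the behaviour shown in this section, not metapipe's actual implementation; the names FILES, OR_PATTERN and expand_parallel are hypothetical.

```python
import re

# Hypothetical file table mirroring the [FILES] section above.
FILES = {"1": "some_file1.txt", "2": "some_file2.txt", "3": "some_file3.txt"}

# Matches an input pattern whose aliases are separated by "||", e.g. {1||2||3}.
OR_PATTERN = re.compile(r"\{([^{}|,]+(?:\|\|[^{}|,]+)+)\}")


def expand_parallel(command, files):
    """Return one command string per alias in the first {a||b||c} pattern."""
    match = OR_PATTERN.search(command)
    if match is None:
        # Nothing to expand: the command runs once, unchanged.
        return [command]
    aliases = match.group(1).split("||")
    return [
        command[:match.start()] + files[alias] + command[match.end():]
        for alias in aliases
    ]


for line in expand_parallel("python somescript {1||2||3}", FILES):
    print(line)
```

Each returned string corresponds to one independent job, which is what would allow the jobs to run in parallel.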

Running a script with multiple inputs

Let’s say that you have a script which takes multiple files as input. In this case the syntax becomes:

[COMMANDS]
python somescript {1,2,3}

[FILES]
1. some_file1.txt
2. some_file2.txt
3. some_file3.txt

Output
------

python somescript some_file1.txt some_file2.txt some_file3.txt
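
As a rough sketch of the comma form, the aliases inside the braces can be replaced by their file names joined with spaces, producing a single command. The helper below is hypothetical and only mirrors the example in this section; it is not taken from metapipe's source.

```python
import re

# Hypothetical file table mirroring the [FILES] section above.
FILES = {"1": "some_file1.txt", "2": "some_file2.txt", "3": "some_file3.txt"}

# Matches any {...} group that does not contain "|" (so {1||2} is ignored).
COMMA_PATTERN = re.compile(r"\{([^{}|]+)\}")


def expand_joined(command, files):
    """Replace each {a,b,c} group with the space-joined file names."""

    def substitute(match):
        aliases = [alias.strip() for alias in match.group(1).split(",")]
        if not all(alias in files for alias in aliases):
            # Not a list of known file aliases (e.g. {o}); leave it untouched.
            return match.group(0)
        return " ".join(files[alias] for alias in aliases)

    return COMMA_PATTERN.sub(substitute, command)


print(expand_joined("python somescript {1,2,3}", FILES))
```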

Multiple steps and file names

[TODO]

Output Patterns

Whenever a script takes an explicit output filename, you can use the output pattern syntax to tell metapipe where to insert the filename it generates.

[COMMANDS]
python somescript -o {o} {1||2||3}

[FILES]
1. some_file1.txt
2. some_file2.txt
3. some_file3.txt

Output
------

python somescript -o mp.1.1.output some_file1.txt
python somescript -o mp.1.2.output some_file2.txt
python somescript -o mp.1.3.output some_file3.txt

Metapipe will generate the filename with the command’s alias inside. An upcoming feature will provide more useful output names.
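
The naming scheme visible in the output above (mp.<command>.<sub-command>.output) can be mimicked with a small helper. This is a sketch inferred from the example only; output_name and fill_output are hypothetical names, and the real generator may differ.

```python
def output_name(command_number, sub_command_number):
    """Build a name following the mp.<command>.<sub-command>.output scheme."""
    return f"mp.{command_number}.{sub_command_number}.output"


def fill_output(expanded_commands, command_number):
    """Substitute {o} in each expanded command with its generated name."""
    return [
        command.replace("{o}", output_name(command_number, index))
        for index, command in enumerate(expanded_commands, start=1)
    ]


# The three commands produced by expanding {1||2||3} for command 1.
expanded = [
    "python somescript -o {o} some_file1.txt",
    "python somescript -o {o} some_file2.txt",
    "python somescript -o {o} some_file3.txt",
]
for line in fill_output(expanded, command_number=1):
    print(line)
```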

Implicit or Hardcoded output

When a script or command writes its output to a file whose name is not passed on the command line, but a later step in the pipeline needs that output, you can use an output pattern to tell metapipe what to look for.

Consider this:

[COMMANDS]
./do_count {1||2}
./analyze.sh {1.*}


[FILES]
1. foo.txt
2. bar.txt

This set of commands is invalid because the second command (./analyze.sh) doesn’t know what the output of command 1 is, since it isn’t specified. The ./do_count script generates its output based on the input filenames it is given.

Since we wrote the ./do_count script, we know that it generates files with a .counts extension. But since we don’t explicitly specify those files, metapipe cannot infer the file names generated by step 1, and the config file is invalid.

We can tell metapipe what the output should look like by using an output pattern.

[COMMANDS]
./do_count {1||2} {o:*.counts}
./analyze.sh {1.*}

[FILES]
1. foo.txt
2. bar.txt

The above example tells metapipe that command 1, whose output name is hardcoded in the script, will produce files ending in .counts. Now that the output of command 1 is known, command 2 will wait until command 1 finishes.

When the output marker has the form {o}, metapipe inserts a pregenerated filename into the command. The output marker {o:<pattern>} means that metapipe does not choose the output name; the script does, and the files it produces will match the given pattern. This means that later commands will be able to reference the files by name.
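
One way to picture the {o:<pattern>} form is as a shell-style glob applied to the files a command leaves behind. The sketch below uses Python's standard fnmatch module for this; whether metapipe matches patterns this way is an assumption, and the file names foo.counts and bar.counts are hypothetical examples of what ./do_count might write.

```python
from fnmatch import fnmatchcase


def matching_outputs(pattern, produced_files):
    """Return the produced files whose names match a glob-style pattern."""
    return sorted(name for name in produced_files if fnmatchcase(name, pattern))


# Hypothetical directory contents after command 1 has finished.
produced = ["foo.txt", "bar.txt", "foo.counts", "bar.counts"]

# Files that a later step referencing {1.*} could then be given.
print(matching_outputs("*.counts", produced))
```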

Multiple Inputs and Outputs

Often, a given shell command will either take multiple dynamic files as input or generate multiple files as output. In either case, metapipe provides a way to manage and track these files.

For multiple inputs, metapipe expects every input pattern in a command to list the same number of files, and it iterates over the patterns together, in order.

Example:

# Given the following:
[COMMANDS]
bash somescript {1||2||3} --conf {4||5||6}  > {o}

[FILES]
1. somefile.1
2. somefile.2
3. somefile.3
4. somefile.4
5. somefile.5
6. somefile.6

# Metapipe will return this:
bash somescript somefile.1 --conf somefile.4  > mp.1.1.output
bash somescript somefile.2 --conf somefile.5  > mp.1.2.output
bash somescript somefile.3 --conf somefile.6  > mp.1.3.output
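
The lockstep iteration shown above behaves like Python's zip over the alias lists. The following sketch makes that explicit under the assumption that aliases 1 to 6 map to somefile.1 through somefile.6; expand_lockstep is a hypothetical helper, not part of metapipe.

```python
import re

# Hypothetical file table: aliases 1..6 map to somefile.1..somefile.6.
FILES = {str(n): f"somefile.{n}" for n in range(1, 7)}

# Matches {a||b||...} groups; a plain {o} has no "||" and is skipped.
OR_PATTERN = re.compile(r"\{([^{}]+\|\|[^{}]+)\}")


def expand_lockstep(command, files):
    """Expand every {a||b||c} group together, one command per position."""
    groups = [m.group(1).split("||") for m in OR_PATTERN.finditer(command)]
    if not groups:
        return [command]
    if len({len(group) for group in groups}) != 1:
        raise ValueError("every input pattern must list the same number of files")
    results = []
    for row in zip(*groups):
        aliases = iter(row)
        # re.sub visits matches left to right, so each group gets its own alias.
        results.append(OR_PATTERN.sub(lambda m: files[next(aliases)], command))
    return results


for line in expand_lockstep("bash somescript {1||2||3} --conf {4||5||6}", FILES):
    print(line)
```

Raising an error on mismatched lengths is a design choice of this sketch; it reflects the stated expectation that the number of inputs per pattern is the same.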

When a command has multiple outputs, metapipe names the output files as follows (numbered in order from left to right):

mp.{command_number}.{sub_command_number}-{output_number}.output

Example:

# Given an input like the one below:
[COMMANDS]
bash somescript {1||2||3} --log {o} -r {o}

[FILES]
1. somefile.1
2. somefile.2
3. somefile.3

# metapipe will generate the following:
bash somescript somefile.1 --log mp.1.1-1.output -r mp.1.1-2.output
bash somescript somefile.2 --log mp.1.2-1.output -r mp.1.2-2.output
bash somescript somefile.3 --log mp.1.3-1.output -r mp.1.3-2.output
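
The left-to-right numbering can be reproduced by replacing one {o} marker at a time. This is a minimal sketch of the naming shown in the example, with fill_outputs as a hypothetical helper; it is not metapipe's own code.

```python
def fill_outputs(command, command_number, sub_command_number):
    """Replace each {o} marker, numbering the outputs from left to right."""
    count = command.count("{o}")
    prefix = f"mp.{command_number}.{sub_command_number}"
    if count <= 1:
        # A single output carries no "-<n>" suffix in the earlier examples.
        return command.replace("{o}", f"{prefix}.output")
    for output_number in range(1, count + 1):
        # Replace only the leftmost remaining marker on each pass.
        command = command.replace("{o}", f"{prefix}-{output_number}.output", 1)
    return command


print(fill_outputs("bash somescript somefile.1 --log {o} -r {o}", 1, 1))
```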

Sample config.mp file

[COMMANDS]
# Trim and cut your sample fastq files.
# Metapipe will handle naming the output files for you!
# Metapipe will trim all of the files at once! (see parallel)
trimmomatic -o {o} {*.fastq.gz||}

# Metapipe will manage your dependencies for you!
# Take all the outputs of step 1 and feed them to cutadapt.
cutadapt -o {o} {1.*||}

# Next you need to align them.
htseq <alignment options> -o {o} {2.*||}

# Of course, now you'll have some custom code to put all the data together. 
# That's fine too!

# Oh no! You hardcode the output name? No problem! Just tell metapipe 
# what the filename is.
python my_custom_code.py {3.*} #{o:hardcoded_output.csv}

# Now you want to compare your results to some controls? Ok!
# Metapipe will compare your hardcoded_output to all 3 controls at the same time!
python my_compare_script.py --compare-to {1||2||3} {4.1} 

# Finally, you want to make some pretty graphs? No problem!
# But wait! You want R 2.0 for this code? Just create an alias for R!
Rscript my_cool_graphing_code.r {5.*} > {o}

[FILES]
1. controls.1.csv
2. controls.2.csv
3. controls.3.csv

[PATHS]
Rscript ~/path/to/my/custom/R/version
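
To show how the three sections of a config file fit together, here is a minimal sketch that splits a config into its [COMMANDS], [FILES] and [PATHS] lines. It assumes that lines starting with # are comments and that a bracketed line on its own starts a section, as in the sample above; split_sections and SAMPLE are hypothetical, and metapipe's real parser may work differently.

```python
def split_sections(text):
    """Group non-empty, non-comment lines under their [SECTION] header."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


# A shortened, hypothetical config in the same shape as the sample above.
SAMPLE = """\
[COMMANDS]
# Comments and blank lines are skipped.
cutadapt -o {o} {1.*||}

[FILES]
1. controls.1.csv

[PATHS]
Rscript ~/path/to/my/custom/R/version
"""

sections = split_sections(SAMPLE)
for name, lines in sections.items():
    print(name, lines)
```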