This tutorial is only a quick overview of the language, for rapid testing. For a complete description, you should read the documentation.
A Logol grammar can be defined in 2 ways: as text or as a graphical model.
The basis of a grammar is the rule definition.
A sample grammar is presented below. The order of declarations does not matter, but the rule must end the grammar.
mod1()==>"aaa","acgt":{_R1},?R1
mod1()==*>SEQ1
The rule is defined by mod1()==*>SEQ1.
mod1() is the model we want to match against the input sequence. The models defined in the rule will try to find a match at any position in the sequence.
SEQ1 is a constant designating the sequence; it is not related to the input file name. Just keep it with this syntax.
The second definition, mod1()==>"aaa","acgt":{_R1},?R1, defines the pattern to match in the sequence.
To declare mod1, we write mod1()==> at the start of the definition. A model is always declared with mod followed by a unique number.
Then we declare the list of patterns, each pattern separated by a comma (meaning search at the next position), read from left to right.
"aaa": we search the exact string "aaa" in the sequence.
"acgt":{_R1}: then we search the exact string "acgt", right after the previous pattern match (position+1), and we save the result in variable R1. Variables hold not only the content found but all the information related to the match (position, content, number of errors, ...).
?R1: after the previous pattern match position, we search for the content (?) of the variable R1, i.e. we search "acgt" again.
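As a rough analogy only (Logol does not necessarily work this way internally), the grammar above behaves much like a regular expression with a back-reference: the capturing group plays the role of the variable R1, and \1 plays the role of ?R1. The Python sketch below illustrates this on a short DNA string. The name find_mod1 is hypothetical, and the positions are 0-based Python offsets, not Logol's own coordinates.

```python
import re

# Regex analogy of: mod1()==>"aaa","acgt":{_R1},?R1
# (acgt) captures what the grammar saves in R1; \1 re-reads it like ?R1.
MOD1 = re.compile(r"aaa(acgt)\1")


def find_mod1(sequence):
    """Return (start, end, saved_content) for the first match, or None."""
    match = MOD1.search(sequence)
    if match is None:
        return None
    return match.start(), match.end(), match.group(1)


print(find_mod1("ccaaaacgtacgt"))
```

Unlike a Logol variable, the regex group only keeps the matched text and its offsets; it has no notion of error counts.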
If you do not use Logol from the command line, just skip this chapter.
Logol is executed against a grammar (textual or graphical) and a sequence in FASTA format. For a description of configuration and options, please refer to the documentation.
LogolExec executes a search in a single sequence only.
Create a sequence file in a directory (test.fasta):
>sample sequence
ccaaaacgtacgtttttttcccccc
Copy the example grammar above into a file named test.logol in the same directory and execute:
LogolExec.sh -g test.logol -s test.fasta -dna
calling logol with parameters -s test.fasta -g test.logol -dna
For help, use option -h
INFO org.irisa.genouest.logol.Logol - Using configuration file: ...logol.properties
INFO org.irisa.genouest.logol.Logol - option g called with test.logol
INFO org.irisa.genouest.logol.Logol - option s called with test.fasta
INFO org.irisa.genouest.logol.Logol - No maximum solutions defined, using defaults
INFO org.irisa.genouest.logol.Logol - option dna called
INFO org.irisa.genouest.logol.Logol - Start analyse to create grammar analyser
Executing prolog for pre-analyse
INFO org.irisa.genouest.logol.Logol - Analyse in progress..
INFO org.irisa.genouest.logol.Logol -...test.logol.57ec7ad1-ecad-497f-9389-d6d7b6873a2c.1-25.xml: 1
INFO org.irisa.genouest.logol.Logol - Number of output match: 1
INFO org.irisa.genouest.logol.Logol - JOB OVER !!!
INFO org.irisa.genouest.logol.Logol - RESULT FILES ARE AVAILABLE:
INFO org.irisa.genouest.logol.Logol - - ...test.logol.57ec7ad1-ecad-497f-9389-d6d7b6873a2c.0-25-all.xml
The program uses a unique identifier per run (test.logol.57ec7ad1-ecad-497f-9389-d6d7b6873a2c), and the output file name is displayed in the output (test.logol.57ec7ad1-ecad-497f-9389-d6d7b6873a2c.0-25-all.xml).
The number of matches found is also displayed: Number of output match: 1
The result file is an XML file containing all the matches with their details and their model correspondence.
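To reproduce the run above, the two input files can be generated by a small script. The Python sketch below is only a convenience: the helper name write_inputs is hypothetical, and placing the two grammar declarations on separate lines is an assumption about the intended layout of test.logol.

```python
from pathlib import Path

# Grammar and sequence taken from the tutorial; the line break between the
# two grammar declarations is an assumption about the file layout.
GRAMMAR = 'mod1()==>"aaa","acgt":{_R1},?R1\nmod1()==*>SEQ1\n'
FASTA = ">sample sequence\nccaaaacgtacgtttttttcccccc\n"


def write_inputs(directory):
    """Write test.logol and test.fasta into directory and return both paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    grammar_path = directory / "test.logol"
    fasta_path = directory / "test.fasta"
    grammar_path.write_text(GRAMMAR)
    fasta_path.write_text(FASTA)
    return grammar_path, fasta_path
```

After calling write_inputs(".") in a working directory, the command LogolExec.sh -g test.logol -s test.fasta -dna can be run from that same directory.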
LogolMultiExec is identical to LogolExec but can be executed on a multi-FASTA input file. It is in charge of dispatching the sequences to LogolExec as a set of single sequences.
The program outputs a zip file containing all the result files, one per sequence.
It also provides additional options to convert the output to GFF or FASTA.
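To picture what dispatching a multi-FASTA file as single sequences means, here is a minimal Python sketch that splits multi-FASTA text into individual records. It is not LogolMultiExec's actual implementation, and the function name split_multi_fasta is hypothetical.

```python
def split_multi_fasta(text):
    """Split multi-FASTA text into a list of (header, sequence) records."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


for name, sequence in split_multi_fasta(">seq1\nccaaaacgt\nacgt\n>seq2\nttttcccc\n"):
    print(name, sequence)
```

Each record could then be written to its own file and searched on its own, which is the idea behind one result file per sequence.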
As said before, the grammar can be a little complex, and you should take care when writing rules, spaces, etc.
Before going further, you should at least know the following to write your own grammar:
These rules do not apply to graphical design, as the graphical-model-to-grammar converter takes care of rewriting the data according to them.
We previously searched for a string that matches perfectly. However, in the real world, we often face mutations and errors.
Logol supports the definition of Hamming (cost) and distance (indel) errors:
mod1()==>"ccc","agcga":{$[1,1],$$[1,1]}
mod1()==*>SEQ1
In this example, we allow a Hamming error range $[1,1] of 1 and an indel error range $$[1,1] of 1.
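To make the two error types concrete, the Python sketch below counts substitutions (Hamming) and insertions/deletions (indels) between two strings and checks a count against a range. It only illustrates the two notions: the function names are hypothetical, and how Logol itself counts and combines both error types is described in its documentation, not here.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def indel_distance(a, b):
    """Minimum number of single-character insertions/deletions turning a into b."""
    # Longest common subsequence (LCS) by dynamic programming, row by row.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return len(a) + len(b) - 2 * prev[-1]


def in_range(value, low, high):
    """True when value lies in the inclusive range [low, high]."""
    return low <= value <= high


print(hamming("agcga", "agcta"))        # one substitution
print(indel_distance("agcga", "agga"))  # one deleted character
print(in_range(hamming("agcga", "agcta"), 1, 1))
```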
A range is defined by a lower limit and a higher limit. Those limits can be integers, variables or a mix of those:
Constraints are divided into two classes:
VAR1:{ String constraint , String constraint, ... }:{ Structure constraint , Structure constraint, ... }
The length constraint accepts the no-limit character "_" as its maximum: [3,_] means at least 3 characters.
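The behaviour of an open-ended range can be sketched in a few lines of Python. This is an illustration only: parse_range and length_ok are hypothetical names, and the sketch handles integer limits and "_" but not variables.

```python
def parse_range(text):
    """Parse '[low,high]' into (low, high); high is None when written as '_'."""
    inner = text.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError(f"not a range: {text!r}")
    low, high = (part.strip() for part in inner[1:-1].split(","))
    return int(low), (None if high == "_" else int(high))


def length_ok(length, bounds):
    """True when length fits the bounds; a None upper bound means no limit."""
    low, high = bounds
    return length >= low and (high is None or length <= high)


bounds = parse_range("[3,_]")
print(bounds, length_ok(2, bounds), length_ok(3, bounds), length_ok(500, bounds))
```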
Each constraint type is described in more detail in the documentation.
The language supports a flexible syntax. If there is no String constraint, the syntax can be: VAR1:{ Structure constraint , Structure constraint, ... }
While models defined in the rule can match at any location in the sequence, we sometimes need to find patterns with holes between them.
Those holes are called spacers here. There are 2 kinds of spacers.
Spacers are managed like any other pattern, i.e. they can be saved and constraints can be applied.
Unlimited spacers define a hole whose size is unknown and possibly large.
mod1()==>"acgt",.*,"ccc"
mod1()==*>SEQ1
"ccc" will be searched at any position after the "acgt" match.
Unlimited spacers should not be saved, but length constraints can be applied.
The other kind of spacer is a small spacer for which a range is always defined.
mod1()==>"acgt",SPACER1:{#[1,7]},"ccc"
mod1()==*>SEQ1
In this example, the spacer has a size between 1 and 7.
You should prefer this kind of spacer when you know the maximum size of the spacer and that size is relatively small.
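The difference between the two kinds of spacers can be pictured with regular expressions, used here purely as an analogy: .* stands for the unlimited spacer and .{1,7} for a spacer of 1 to 7 characters. The test strings are made up for the illustration.

```python
import re

UNLIMITED = re.compile(r"acgt.*ccc")    # analogy for "acgt",.*,"ccc"
LIMITED = re.compile(r"acgt.{1,7}ccc")  # analogy for "acgt",SPACER1:{#[1,7]},"ccc"

short_gap = "acgt" + "t" * 3 + "ccc"   # 3 characters between the two patterns
long_gap = "acgt" + "t" * 10 + "ccc"   # 10 characters between the two patterns

for name, seq in (("short", short_gap), ("long", long_gap)):
    print(name, bool(UNLIMITED.search(seq)), bool(LIMITED.search(seq)))
```

The unlimited form accepts both strings, while the bounded form rejects the 10-character gap. A regex reports a single greedy match, so it says nothing about how many solutions Logol would report.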
Views are a way to group patterns for 2 reasons:
mod1()==>"aaaa",("cgtacg"|"cgta"):{#[3,5]}
mod1()==*>SEQ1
We apply here a constraint on a group: ("cgtacg"|"cgta"). This code applies an OR condition (don't worry, you will see it in the next chapter).
What matters here is that the constraint {#[3,5]} makes sure that, whatever the group of patterns matches, its total length stays within the defined range.
In our case, both "cgtacg" and "cgta" could match after "aaaa" in the sample sequence. However, as the maximum length allowed by the constraint is 5, the 6-character match "cgtacg" is discarded.
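The filtering effect of a length constraint on a view can be mimicked in Python: collect the alternatives that match at the current position, then keep only those whose length fits the range. The function view_matches is a hypothetical illustration, not Logol's algorithm, and the string is the beginning of the sample sequence.

```python
def view_matches(sequence, start, alternatives, min_len, max_len):
    """Return the alternatives found at start whose length fits [min_len, max_len]."""
    found = [alt for alt in alternatives if sequence.startswith(alt, start)]
    return [alt for alt in found if min_len <= len(alt) <= max_len]


seq = "ccaaaacgtacgt"
after_aaaa = seq.find("aaaa") + len("aaaa")  # position right after "aaaa"
print(view_matches(seq, after_aaaa, ["cgtacg", "cgta"], 3, 5))
```

Both alternatives are present right after "aaaa", but only "cgta" survives the length filter.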
This construct allows matching any one of 2 or more patterns at the same location.
The following example will match "acgt" or "aggt" right after "aaaa".
mod1()==>"aaaa",("acgt"|"aggt")
mod1()==*>SEQ1
A slightly more complex example, combining a OR and a SPACER.
mod1()==>"aaaa",((.*,"acgt")|(.*,"aggt"))
mod1()==*>SEQ1
This example applies an OR on two groups of patterns, (.*,"acgt") and (.*,"aggt").
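For readers used to regular expressions, the two OR examples are close in spirit to regex alternation. The Python sketch below is only an analogy on made-up strings; it does not reflect how Logol enumerates its solutions.

```python
import re

SIMPLE_OR = re.compile(r"aaaa(?:acgt|aggt)")           # "aaaa",("acgt"|"aggt")
OR_WITH_SPACER = re.compile(r"aaaa(?:.*acgt|.*aggt)")  # "aaaa",((.*,"acgt")|(.*,"aggt"))

adjacent = "cc" + "aaaa" + "aggt" + "cc"  # "aggt" right after "aaaa"
distant = "cc" + "aaaa" + "tt" + "aggt"   # "aggt" two characters later

print(bool(SIMPLE_OR.search(adjacent)), bool(SIMPLE_OR.search(distant)))
print(bool(OR_WITH_SPACER.search(adjacent)), bool(OR_WITH_SPACER.search(distant)))
```

Without the spacer, the alternative must start immediately after "aaaa"; with it, the alternative may start further along.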
Morphisms are used to transform a string into another.
Logol predefines some morphisms, but it is also possible to define your own.
A morphism is oriented, i.e. a + applies the transformation to the string as is, while a - reverses the string first.
mod1()==>"aaa",+"wc" "tgca"
mod1()==*>SEQ1
This morphism, +"wc", will search for the complement of "tgca" (acgt) after "aaa".
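The effect of the complement morphism and of its orientation can be sketched in Python. The function wc below is a hypothetical stand-in for illustration; it assumes a lowercase DNA alphabet and follows the description above, where - reverses the string before the transformation is applied.

```python
# Lowercase DNA complement table: a<->t, c<->g.
COMPLEMENT = str.maketrans("acgt", "tgca")


def wc(text, orientation="+"):
    """Complement text; with orientation '-', reverse it first."""
    if orientation not in ("+", "-"):
        raise ValueError("orientation must be '+' or '-'")
    if orientation == "-":
        text = text[::-1]
    return text.translate(COMPLEMENT)


print(wc("tgca", "+"))  # complement of "tgca"
print(wc("aacg", "-"))  # reversed, then complemented
```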
To ease the description of the global model, and to reuse some parts of the code, Logol provides models.
We used a model in the rule description. We can also use models outside of the rule.
There is a difference, however. A model called from another model will try to match at position+1 only, not anywhere in the sequence.
Models can also use variables as input and output.
mod1()==>"aaa","acgt":{_R1},mod2(R1)
mod2(X1)==>?X1
mod1()==*>SEQ1
In mod1()==>, we call model mod2(R1). We pass variable R1 to mod2 as an argument.
Then we define mod2 with mod2(X1)==>?X1.
mod2 will search the content (?) of variable X1, which is an input argument. In our case, the input argument will match the variable R1.
In the result file, mod2 will appear as a pattern match, and will include all the details of mod2 sub patterns.
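The difference between a model used in the rule and a model called from another model can be sketched with plain Python functions. In this hypothetical illustration, rule tries mod1 at every start position, whereas mod2 is only tried at the position where mod1 calls it. Positions are 0-based Python offsets with an exclusive end, which is an assumption of the sketch rather than Logol's convention.

```python
def match_literal(seq, pos, text):
    """Return the position after text if it occurs exactly at pos, else None."""
    return pos + len(text) if seq.startswith(text, pos) else None


def mod2(seq, pos, x1):
    # mod2(X1)==>?X1 : search the content of the input variable, at pos only.
    return match_literal(seq, pos, x1)


def mod1(seq, pos):
    # mod1()==>"aaa","acgt":{_R1},mod2(R1)
    pos = match_literal(seq, pos, "aaa")
    if pos is None:
        return None
    end = match_literal(seq, pos, "acgt")
    if end is None:
        return None
    r1 = seq[pos:end]  # content saved in R1
    return mod2(seq, end, r1)


def rule(seq):
    # mod1()==*>SEQ1 : try mod1 at every position of the sequence.
    matches = []
    for start in range(len(seq)):
        end = mod1(seq, start)
        if end is not None:
            matches.append((start, end))
    return matches


print(rule("ccaaaacgtacgt"))
```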
Sometimes, we want to match (or not to match, that is the question) a pattern occurring multiple times.
Repeats can help to do so. A repeat can be applied to a pattern or a view.
mod1()==>"aaa",repeat("acgt" ,[0,1])+[0,2]
mod1()==*>SEQ1
We define a repeat on "acgt" with repeat("acgt". We accept a spacer of 0 to 1 characters between repeats with ,[0,1], and the pattern can occur between 0 and 2 times with +[0,2].
The repeated element can be a pattern or a view, including a variable. Repeats also support overlap between occurrences, by using a semicolon instead of the comma.
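As an analogy, a repeat with a spacer range and an occurrence range can be written as a regular expression. The hypothetical helper below builds one for a literal unit; it treats the spacer as sitting only between consecutive occurrences, which is our reading of the description above, and it does not cover the overlapping (semicolon) form.

```python
import re


def repeat_regex(unit, gap, count):
    """Regex for unit repeated count=(lo, hi) times, with gap=(lo, hi) characters between occurrences."""
    gap_lo, gap_hi = gap
    lo, hi = count
    if hi < 1 or lo > hi:
        raise ValueError("count must satisfy lo <= hi and hi >= 1")
    u = re.escape(unit)
    # First occurrence, then up to hi-1 further occurrences, each preceded by a spacer.
    more = f"(?:.{{{gap_lo},{gap_hi}}}{u}){{{max(lo - 1, 0)},{hi - 1}}}"
    body = f"{u}{more}"
    return f"(?:{body})?" if lo == 0 else body


# Analogy for: "aaa",repeat("acgt" ,[0,1])+[0,2]
PATTERN = re.compile("aaa" + repeat_regex("acgt", (0, 1), (0, 2)))

for candidate in ("aaa", "aaaacgt", "aaaacgtacgt", "aaaacgttacgt", "aaaacgtttacgt"):
    print(candidate, bool(PATTERN.fullmatch(candidate)))
```

The last candidate is rejected as a whole because its two occurrences are separated by 2 characters, more than the spacer range allows.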
2 consecutive strings can overlap:
mod1()==>"ccaaaa";("aacgt","acg")
mod1()==*>SEQ1
In this example, "aacgt" can overlap the previous pattern match "ccaaaa" on the "aa" or on the "a".
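Which overlaps are possible between two literal strings can be checked by comparing the end of the first with the start of the second. The Python helper below (possible_overlaps, a hypothetical name) lists the overlap lengths for the example above; it only illustrates the idea and says nothing about how Logol explores them.

```python
def possible_overlaps(previous, following):
    """Lengths k for which the last k characters of previous equal the first k of following."""
    limit = min(len(previous), len(following))
    return [k for k in range(1, limit + 1) if previous[-k:] == following[:k]]


for k in possible_overlaps("ccaaaa", "aacgt"):
    print(k, "aacgt"[:k])
```

For "ccaaaa" and "aacgt", the shared pieces are "a" (length 1) and "aa" (length 2).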