Logol in a nutshell

Introduction

This tutorial is only a quick overview of the language, for rapid testing. For a complete description, you should read the documentation.

A Logol grammar can be defined in two ways: as a textual grammar or as a graphical model.

The textual grammar: Follows a strict language definition and can be a little complex. However, it can easily be modified with a text editor on a local computer. Do not be put off by the syntax: after a few exercises it becomes easier, and modifications are quick to make. Complete documentation is also provided, covering the syntax of each part of the language with detailed examples. Grammar files usually have the extension .logol
The graphical model: Created with the Logol Designer. You can drag and drop components, zoom, copy, and so on. The Designer gives you a logical view of the grammar, so the different logical groups and loops are easy to see. To update a graphical model, you can edit it in the Designer or convert it to a textual grammar. At runtime, the graphical model is always converted to a textual grammar first. Graphical grammars usually have the extension .lgd

Complete documentation on both the textual and the graphical model is available in the tutorial.

Skeleton

The basis of a grammar is the rule definition.

A sample grammar is presented below. The order of the declarations does not matter, except that the rule must be the last line of the grammar.

mod1()==>"aaa","acgt":{_R1},?R1
mod1()==*>SEQ1

The rule is defined by mod1()==*>SEQ1.
mod1() is the model we want to match against the input sequence. The models defined in the rule will try to find a match at any position in the sequence.
SEQ1 is a constant defining the sequence. It is not related to the input file name; just keep it with this syntax.

The second definition mod1()==>"aaa","acgt":{_R1},?R1 defines the pattern to match in the sequence.

To declare mod1 we use: mod1()==>. A model is always declared with mod followed by a unique number.

Then we declare the list of patterns, read from left to right. Patterns are separated by a comma, which means "search at the next position".

"aaa": we search the exact string "aaa" in the sequence.
"acgt":{_R1}: then we search the exact string "acgt", right after the previous pattern match (position+1), and we save the result in variable R1. A variable holds not only the content found but all the information related to the match (position, content, number of errors, ...).
?R1: after previous pattern match position, we search for the content (?) of the variable R1, i.e. we search "acgt" again.
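To make the left-to-right reading concrete, here is a minimal Python sketch of what this grammar asks for. It is not how Logol works internally; the function name match_mod1 is made up for the illustration, and the sequence is the sample sequence used in the command-line chapter of this tutorial.

```python
def match_mod1(sequence):
    """Return (start, end) spans where "aaa", "acgt" and the saved "acgt" follow each other."""
    matches = []
    # The rule tries the model at every position of the sequence.
    for start in range(len(sequence)):
        pos = start
        # "aaa": exact string at the current position
        if not sequence.startswith("aaa", pos):
            continue
        pos += 3
        # "acgt":{_R1}: exact string right after, saved in R1
        if not sequence.startswith("acgt", pos):
            continue
        r1 = sequence[pos:pos + 4]
        pos += 4
        # ?R1: the content of R1 must come right after
        if not sequence.startswith(r1, pos):
            continue
        pos += len(r1)
        matches.append((start, pos))
    return matches


SAMPLE = "ccaaaacgtacgtttttttcccccc"
print(match_mod1(SAMPLE))  # [(2, 13)]
```

Positions in this sketch are 0-based with an exclusive end, which is a Python convention and not necessarily the one used in Logol result files.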

Now, test it...

We have defined here a very basic grammar, with exact strings only and a single model. You can test this pattern in the Experiment section.

Command-line execution

If you do not use Logol from the command line, just skip this chapter.

Logol is executed against a grammar (textual or graphical) and a sequence in FASTA format. For a description of configuration and options, please refer to the documentation.

LogolExec.sh

LogolExec executes a search in a single sequence only.

Create a sequence file in a directory (test.fasta):

>sample sequence
ccaaaacgtacgtttttttcccccc

Copy the example grammar above into a file named test.logol in the same directory and execute:

LogolExec.sh -g test.logol -s test.fasta -dna
calling logol with parameters -s test.fasta -g test.logol -dna
For help, use option -h
INFO org.irisa.genouest.logol.Logol  - Using configuration file: ...logol.properties
INFO org.irisa.genouest.logol.Logol  - option g called with test.logol
INFO org.irisa.genouest.logol.Logol  - option s called with test.fasta
INFO org.irisa.genouest.logol.Logol  - No maximum solutions defined, using defaults
INFO org.irisa.genouest.logol.Logol  - option dna called
INFO org.irisa.genouest.logol.Logol  - Start analyse to create grammar analyser
Executing prolog for pre-analyse
INFO org.irisa.genouest.logol.Logol  - Analyse in progress..
INFO org.irisa.genouest.logol.Logol  -...test.logol.57ec7ad1-ecad-497f-9389-d6d7b6873a2c.1-25.xml: 1
INFO org.irisa.genouest.logol.Logol  - Number of output match: 1
INFO org.irisa.genouest.logol.Logol  - JOB OVER !!!
INFO org.irisa.genouest.logol.Logol  - RESULT FILES ARE AVAILABLE: 
INFO org.irisa.genouest.logol.Logol  - - ...test.logol.57ec7ad1-ecad-497f-9389-d6d7b6873a2c.0-25-all.xml

The program uses a unique identifier per run (test.logol.57ec7ad1-ecad-497f-9389-d6d7b6873a2c), and the name of the output file is displayed in the output (test.logol.57ec7ad1-ecad-497f-9389-d6d7b6873a2c.0-25-all.xml).

The number of matches found is also displayed: Number of output match: 1

The result file is an XML file containing all the matches with their details and their corresponding model.
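If you prefer to script the preparation steps, the following Python sketch writes the two files used above and builds the command line shown in this chapter. The function name prepare_run is hypothetical, and the sketch only builds the command; it does not run Logol.

```python
from pathlib import Path

# Grammar and sequence taken from this tutorial; the rule is the last line.
GRAMMAR = 'mod1()==>"aaa","acgt":{_R1},?R1\nmod1()==*>SEQ1\n'
FASTA = ">sample sequence\nccaaaacgtacgtttttttcccccc\n"


def prepare_run(directory):
    """Write test.logol and test.fasta into directory and return the command to run there."""
    workdir = Path(directory)
    workdir.mkdir(parents=True, exist_ok=True)
    (workdir / "test.logol").write_text(GRAMMAR)
    (workdir / "test.fasta").write_text(FASTA)
    # Same options as in the example above: grammar, sequence, DNA mode.
    return ["LogolExec.sh", "-g", "test.logol", "-s", "test.fasta", "-dna"]
```

Assuming LogolExec.sh is on your PATH, the returned list can be passed to subprocess.run with cwd set to the same directory.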

LogolMultiExec.sh

LogolMultiExec is identical to LogolExec but can be executed on a multi-FASTA input file. It dispatches the sequences to LogolExec one at a time, as single sequences.

The program outputs a zip file containing all the result files, one per sequence.

It also provides additional options to convert the output to GFF or FASTA.

What you should know

As said before, the grammar can be a little complex, and you should take care when writing rules, spaces, etc.

Before going further, you should at least know the following to write your own grammar:

Variables must start with UPPERCASE letters followed by digits, e.g. MYVAR1
Models, which can be seen as functions in traditional programming, MUST be named mod followed by digits, e.g. mod10
String constants, e.g. "acgt", must be lowercase.
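These naming rules can be checked before running a grammar. The Python sketch below is only an illustration: the regular expressions are our reading of the three rules above, and they assume that the digits of a variable name are optional and that string constants contain lowercase letters only, which the rules do not state explicitly.

```python
import re

# Assumption: one or more uppercase letters, then optional digits (e.g. MYVAR1).
VARIABLE = re.compile(r"[A-Z]+[0-9]*")
# "mod" followed by at least one digit (e.g. mod10).
MODEL = re.compile(r"mod[0-9]+")
# Assumption: double-quoted, lowercase letters only (e.g. "acgt").
STRING_CONSTANT = re.compile(r'"[a-z]+"')


def is_variable(name):
    return VARIABLE.fullmatch(name) is not None


def is_model(name):
    return MODEL.fullmatch(name) is not None


def is_string_constant(text):
    return STRING_CONSTANT.fullmatch(text) is not None


print(is_variable("MYVAR1"), is_model("mod10"), is_string_constant('"acgt"'))
```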

Those rules do not apply to the graphical design, because the converter from graphical model to grammar takes care of rewriting the data according to those rules.

Pattern with errors

So far we have searched for strings that match perfectly. In the real world, however, we often face mutations, errors, and so on.

Logol supports the definition of two kinds of errors, Hamming errors (substitution cost) and distance errors (indels):

mod1()==>"ccc","agcga":{$[1,1],$$[1,1]}
mod1()==*>SEQ1

In this example, we allow a Hamming error range $[1,1] (1 substitution) and an indel error range $$[1,1] (1 indel).

A range is defined by a lower limit and a higher limit. Those limits can be integers, variables or a mix of those:

$[1,10] is a range between 1 and 10 substitution errors
$[1, $R1] sets the maximum number of errors to the number of substitutions found in R1, a previously saved variable
$[1, $R1 + 2] like above, but the value taken from the variable is incremented (or decremented)
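As an illustration of what a substitution range means, the Python sketch below counts mismatches (Hamming distance) between the pattern and each window of the sequence and keeps the windows whose error count falls inside the range. It covers substitutions only, not indels, and it is not Logol's implementation; the function names and the test sequence are made up for the example.

```python
def hamming(a, b):
    """Number of positions at which two strings of equal length differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def find_with_substitutions(sequence, pattern, low, high):
    """Return (position, errors) for windows whose substitution count lies in [low, high]."""
    hits = []
    for pos in range(len(sequence) - len(pattern) + 1):
        errors = hamming(sequence[pos:pos + len(pattern)], pattern)
        if low <= errors <= high:
            hits.append((pos, errors))
    return hits


# "agcta" differs from "agcga" by one substitution.
print(find_with_substitutions("cccagcta", "agcga", 1, 1))  # [(3, 1)]
```

Note that with a lower limit of 1, as in this sketch, a perfect match is not reported, since it has 0 errors.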

Constraints usage

Constraints are divided into two classes:

String constraints: begin, end, content, length
Structure constraints: substitution cost, indel cost

They must be written with the following syntax:
VAR1:{ String constraint , String constraint, ... }:{ Structure constraint , Structure constraint, ... }

For its upper limit, the length constraint supports the "no limit" character "_": [3,_] means at least 3 characters.
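The meaning of an open-ended range can be sketched in a few lines of Python. The helper names parse_range and length_allowed are hypothetical, and the sketch only handles integer limits, not variables.

```python
def parse_range(text):
    """Turn "[3,_]" into (3, None) and "[3,5]" into (3, 5)."""
    low, high = text.strip()[1:-1].split(",")
    high = high.strip()
    # "_" stands for "no upper limit".
    return int(low), (None if high == "_" else int(high))


def length_allowed(length, text):
    low, high = parse_range(text)
    return length >= low and (high is None or length <= high)


print(length_allowed(2, "[3,_]"), length_allowed(300, "[3,_]"))  # False True
```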

Each constraint type is described in more detail in the documentation.

The language supports a flexible syntax. If there is no String constraint, the syntax can be: VAR1:{ Structure constraint , Structure constraint, ... }

Spacers

While the models defined in the rule can match at any location in the sequence, we sometimes need to find patterns with holes between them.

Those holes are named here spacers. We have 2 kinds of spacers.

Spacers are managed like any other pattern, i.e. they can be saved and constraints can be applied.

Unlimited spacer

These spacers define a hole whose size is unknown and possibly large.

mod1()==>"acgt",.*,"ccc"

mod1()==*>SEQ1

"ccc" will be searched at any position after the "acgt" match.

Unlimited spacers should not be saved, but length constraints can be applied.

Limited range spacer

This kind of spacer refers to small spacers where a range is always defined.

mod1()==>"acgt",SPACER1:{#[1,7]},"ccc"

mod1()==*>SEQ1

In this example, the spacer has a size between 1 and 7.

Prefer this kind of spacer when you know the maximum size of the spacer and that size is relatively small.
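The effect of a spacer range can be pictured with the Python sketch below, which looks for a left string, then a right string separated by a gap whose length lies in a given range. Passing max_gap=None mimics a spacer without an upper limit. The function name is made up, the sequence is the sample sequence from the command-line chapter, and Logol's own search strategy may differ.

```python
def find_with_spacer(sequence, left, right, min_gap, max_gap=None):
    """Return (left_start, gap_length) pairs; max_gap=None means no upper limit."""
    hits = []
    start = sequence.find(left)
    while start != -1:
        after_left = start + len(left)
        # The right string may start min_gap characters after the left one at the earliest.
        right_pos = sequence.find(right, after_left + min_gap)
        while right_pos != -1:
            gap = right_pos - after_left
            if max_gap is not None and gap > max_gap:
                break
            hits.append((start, gap))
            right_pos = sequence.find(right, right_pos + 1)
        start = sequence.find(left, start + 1)
    return hits


SAMPLE = "ccaaaacgtacgtttttttcccccc"
# Spacer of 1 to 7 characters between "acgt" and "ccc".
print(find_with_spacer(SAMPLE, "acgt", "ccc", 1, 7))  # [(9, 6), (9, 7)]
```

With no upper limit the same call reports many more combinations, which is one reason to bound the spacer whenever a maximum size is known.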

Views

Views are a way to group patterns for 2 reasons:

Get a logical view of the group in the result file
Apply some constraints to the group. Views support the same constraints as other patterns.

mod1()==>"aaaa",("cgtacg"|"cgta"):{#[3,5]}
mod1()==*>SEQ1

Here we apply a constraint to the group ("cgtacg"|"cgta"). This code uses an OR condition (don't worry, you will see it in the next chapter).

What matters here is that the constraint {#[3,5]} makes sure that, whatever the group of patterns matches, its overall length stays within the defined range.

In our case, both "cgtacg" and "cgta" could match after "aaaa" in the sample sequence. However, as the maximum length allowed by the constraint is 5, the 6-character match "cgtacg" is discarded.
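The filtering effect of a length constraint on a group can be sketched in Python as follows. This is an illustration only: the function name match_view is hypothetical, and the sequence is the sample sequence from the command-line chapter.

```python
def match_view(sequence, prefix, alternatives, min_len, max_len):
    """Return (prefix_start, alternative) for alternatives that follow prefix and fit the length range."""
    hits = []
    start = sequence.find(prefix)
    while start != -1:
        pos = start + len(prefix)
        for alt in alternatives:
            # The alternative must match right after the prefix and
            # its length must stay inside [min_len, max_len].
            if sequence.startswith(alt, pos) and min_len <= len(alt) <= max_len:
                hits.append((start, alt))
        start = sequence.find(prefix, start + 1)
    return hits


SAMPLE = "ccaaaacgtacgtttttttcccccc"
print(match_view(SAMPLE, "aaaa", ["cgtacg", "cgta"], 3, 5))  # [(2, 'cgta')]
```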

OR conditions

An OR condition allows 2 or more patterns to be tried at the same location.

The following example will match "acgt" or "aggt" right after "aaaa".

mod1()==>"aaaa",("acgt"|"aggt")
mod1()==*>SEQ1

A slightly more complex example, combining an OR and a spacer.

mod1()==>"aaaa",((.*,"acgt")|(.*,"aggt"))
mod1()==*>SEQ1

This example applies an OR to two groups of patterns, (.*,"acgt") and (.*,"aggt").

Morphisms

Morphisms are used to transform a string into another.

Logol predefines some morphisms, but it is also possible to define your own:

wc: complement (DNA or RNA pairing)
wooble: wobble pairing
p2d: protein to DNA conversion

A morphism is oriented: a + applies the transformation to the string as it is, while a - reverses the string first.

mod1()==>"aaa",+"wc" "tgca"
mod1()==*>SEQ1

The morphism +"wc" will search for the complement of "tgca" (acgt) after "aaa".
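To see what the two orientations produce, here is a small Python sketch of a DNA complement. It assumes that - means "reverse the string, then apply the transformation", which is our reading of the sentence above; the function name apply_wc is made up, and RNA pairing is not covered.

```python
# DNA complement table: a<->t, c<->g.
WC = str.maketrans("acgt", "tgca")


def apply_wc(text, orientation="+"):
    """Complement a DNA string; with "-", reverse it first (assumed semantics)."""
    if orientation == "-":
        text = text[::-1]
    return text.translate(WC)


print(apply_wc("tgca"))       # acgt
print(apply_wc("aacg", "-"))  # cgtt
```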

Multiple models

To ease the description of the global model, and to reuse some parts of the code, Logol lets you define several models.

We already used a model in the rule description. Models can also be used outside the rule.

There is a difference, however. A model called from another model tries to match only at position+1, not anywhere in the sequence.

Models can also use variables as input and output.

mod1()==>"aaa","acgt":{_R1},mod2(R1)
mod2(X1)==>?X1
mod1()==*>SEQ1

In mod1()==>, we call model mod2(R1). We pass variable R1 to mod2 as argument.

Then we define mod2 with mod2(X1)==>?X1.

mod2 will search the content (?) of variable X1, which is an input argument. In our case, the input argument will match the variable R1.

In the result file, mod2 will appear as a pattern match, and will include all the details of mod2 sub patterns.

Repeats

Sometimes, we want to match (or not to match, that is the question) a pattern occurring multiple times.

Repeats can help to do so. A repeat can be applied to a pattern or a view.

mod1()==>"aaa",repeat("acgt" ,[0,1])+[0,2]
mod1()==*>SEQ1

We define a repeat on "acgt" with repeat("acgt". The ,[0,1] part accepts a spacer of 0 to 1 characters between consecutive repeats, and the +[0,2] part lets the pattern occur between 0 and 2 times.

The repeated element can be a view or a pattern, including a variable. Repeats also support overlap between occurrences: use a semicolon instead of the comma.
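The Python sketch below enumerates what such a repeat could match from a given position. It is a rough illustration under two assumptions of ours: the spacer is only allowed between two occurrences (not before the first one), and overlap is not handled. The function name match_repeat is hypothetical, and the sequence is the sample sequence from the command-line chapter.

```python
def match_repeat(sequence, pos, unit, gap_range, count_range):
    """Return sorted (count, end) pairs for count occurrences of unit starting at pos."""
    min_gap, max_gap = gap_range
    min_count, max_count = count_range
    results = set()

    def extend(cursor, count):
        if count >= min_count:
            results.add((count, cursor))
        if count == max_count:
            return
        # Assumption: no spacer before the first occurrence.
        gaps = [0] if count == 0 else range(min_gap, max_gap + 1)
        for gap in gaps:
            if sequence.startswith(unit, cursor + gap):
                extend(cursor + gap + len(unit), count + 1)

    extend(pos, 0)
    return sorted(results)


SAMPLE = "ccaaaacgtacgtttttttcccccc"
# Position 5 is right after the "aaa" found at position 2.
print(match_repeat(SAMPLE, 5, "acgt", (0, 1), (0, 2)))  # [(0, 5), (1, 9), (2, 13)]
```

Because the lower limit of the count is 0, the empty repeat is also a valid result, which is why (0, 5) appears in the output.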

Overlap

Two consecutive strings can overlap:

mod1()==>"ccaaaa";("aacgt","acg")
mod1()==*>SEQ1

In this example, "aacgt" can overlap the previous pattern match "ccaaaa" on "aa" or on "a".
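The possible overlaps between two strings can be listed with a short Python sketch: an overlap of length k exists when the last k characters of the first string equal the first k characters of the second. The function name possible_overlaps is made up for the illustration and says nothing about how Logol computes overlaps.

```python
def possible_overlaps(left, right):
    """Return the lengths k > 0 for which the end of left equals the start of right."""
    longest = min(len(left), len(right))
    return [k for k in range(1, longest + 1) if left[-k:] == right[:k]]


print(possible_overlaps("ccaaaa", "aacgt"))  # [1, 2]
```

The result [1, 2] corresponds to the overlaps on "a" and on "aa" mentioned above.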