Logol in a nutshell


Introduction

This tutorial is only a quick overview of the language, for rapid testing. For a complete description, you should read the documentation.

A Logol grammar can be defined in two ways: as a textual grammar or as a graphical model.

The textual grammar
It follows a strict language definition and can be a little complex. It can, however, be easily modified with a text editor on a local computer. Do not be put off by its syntax: after a few exercises it becomes easier, and modifications are straightforward. We also provide complete documentation covering the syntax of each part of the language, with detailed examples. Grammar files usually have the extension .logol
The graphical model
Created with the Logol Designer. You can drag and drop components, zoom, copy, etc. With the Designer, you get a logical view of the grammar and can easily see the different logical groups and loops. To update a graphical model, you can edit it in the Designer or convert it to a textual grammar. At runtime, the graphical model is always converted to a textual grammar first. Graphical grammars usually have the extension .lgd
Complete documentation on both the textual and graphical models is available in the tutorial.

Skeleton

The basis of a grammar is the rule definition.

A sample grammar is presented below. The order of the declarations does not matter, but the rule must end the grammar.

mod1()==>"aaa","acgt":{_R1},?R1
mod1()==*>SEQ1
      

The rule is defined by mod1()==*>SEQ1.
mod1() is the model we want to match against the input sequence. The models named in the rule are searched for at every position in the sequence.
SEQ1 is a constant that designates the sequence; it is not related to the input file name. Just keep this syntax.


The second definition mod1()==>"aaa","acgt":{_R1},?R1 defines the pattern to match in the sequence.

To declare mod1 we use: mod1()==>. A model is always declared with mod followed by a unique number.

Then we declare the list of patterns, read from left to right, each separated by a comma (meaning: search at the next position).
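As a rough illustration of what this grammar asks for, the sketch below reproduces the same search in Python with a regular expression. It is not how Logol evaluates a grammar: the regular expression and the names SEQUENCE, PATTERN and matches are assumptions made for this example, and positions are 0-based Python indexes. The sequence is the sample sequence used in the command-line section.

```python
import re

# Sample sequence from the command-line section of this tutorial.
SEQUENCE = "ccaaaacgtacgtttttttcccccc"

# "aaa"         -> the literal aaa
# "acgt":{_R1}  -> the literal acgt, captured in group 2 (plays the role of R1)
# ?R1           -> the captured content again (backreference \2)
# The lookahead lets the search start at every position, in the spirit of
# the rule mod1()==*>SEQ1; group 1 holds the whole matched text.
PATTERN = re.compile(r"(?=(aaa(acgt)\2))")

matches = [(m.start(), m.group(1)) for m in PATTERN.finditer(SEQUENCE)]
print(matches)  # [(2, 'aaaacgtacgt')]
```

Only one start position satisfies the three patterns in sequence, which is consistent with the single match reported in the command-line run.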

Now, test it...

We have defined here a very basic grammar, with exact strings only and a single model. You can test this pattern in the Experiment section.

Command-line execution

If you do not use Logol from the command line, just skip this chapter.

Logol is executed against a grammar (textual or graphical) and a sequence in FASTA format. For a description of configuration and options, please refer to the documentation.

LogolExec.sh

LogolExec executes a search in a single sequence only.

Create a sequence file in a directory (test.fasta):

>sample sequence
ccaaaacgtacgtttttttcccccc
	   

Copy the example grammar above into a file named test.logol in the same directory and execute:

LogolExec.sh -g test.logol -s test.fasta -dna
calling logol with parameters -s test.fasta -g test.logol -dna
For help, use option -h
INFO org.irisa.genouest.logol.Logol  - Using configuration file: ...logol.properties
INFO org.irisa.genouest.logol.Logol  - option g called with test.logol
INFO org.irisa.genouest.logol.Logol  - option s called with test.fasta
INFO org.irisa.genouest.logol.Logol  - No maximum solutions defined, using defaults
INFO org.irisa.genouest.logol.Logol  - option dna called
INFO org.irisa.genouest.logol.Logol  - Start analyse to create grammar analyser
Executing prolog for pre-analyse
INFO org.irisa.genouest.logol.Logol  - Analyse in progress..
INFO org.irisa.genouest.logol.Logol  -...test.logol.57ec7ad1-ecad-497f-9389-d6d7b6873a2c.1-25.xml: 1
INFO org.irisa.genouest.logol.Logol  - Number of output match: 1
INFO org.irisa.genouest.logol.Logol  - JOB OVER !!!
INFO org.irisa.genouest.logol.Logol  - RESULT FILES ARE AVAILABLE: 
INFO org.irisa.genouest.logol.Logol  - - ...test.logol.57ec7ad1-ecad-497f-9389-d6d7b6873a2c.0-25-all.xml      
       

The program uses a unique identifier per run (test.logol.57ec7ad1-ecad-497f-9389-d6d7b6873a2c), and the name of the output file is displayed in the output (test.logol.57ec7ad1-ecad-497f-9389-d6d7b6873a2c.0-25-all.xml).

The number of matches found is also displayed: Number of output match: 1

The result file is an XML file containing all the matches, with their details and the model they correspond to.
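The steps above can also be scripted. The following Python sketch writes the two input files and builds the command line shown earlier. It only uses the options that appear in this tutorial (-g, -s, -dna). The helper names prepare_inputs, build_command and run are hypothetical, and it assumes that LogolExec.sh is reachable on the PATH.

```python
from pathlib import Path
import subprocess
import tempfile

# Grammar and sequence taken from this tutorial.
GRAMMAR = 'mod1()==>"aaa","acgt":{_R1},?R1\nmod1()==*>SEQ1\n'
FASTA = ">sample sequence\nccaaaacgtacgtttttttcccccc\n"


def prepare_inputs(directory):
    """Write test.logol and test.fasta into directory and return both paths."""
    directory = Path(directory)
    grammar = directory / "test.logol"
    sequence = directory / "test.fasta"
    grammar.write_text(GRAMMAR)
    sequence.write_text(FASTA)
    return grammar, sequence


def build_command(grammar, sequence):
    """Build the argument list for the command shown in the tutorial."""
    return ["LogolExec.sh", "-g", str(grammar), "-s", str(sequence), "-dna"]


def run(command, workdir):
    """Run the command in workdir (assumption: LogolExec.sh is on the PATH)."""
    return subprocess.run(command, cwd=workdir, check=True)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    grammar, sequence = prepare_inputs(workdir)
    print(" ".join(build_command(grammar.name, sequence.name)))
    # Call run(...) here once Logol is installed.
```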

LogolMultiExec.sh

LogolMultiExec is identical to LogolExec but can be executed on a multi-FASTA input file. It dispatches the sequences to LogolExec as a set of single sequences.

The program outputs a zip file containing all the result files, one per sequence.

It also provides additional options to convert the output to GFF or FASTA.
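To make the dispatching idea concrete, here is a minimal Python sketch that splits multi-FASTA text into single-sequence records. It only illustrates the principle; it is not LogolMultiExec's actual code, and the function name split_fasta is hypothetical.

```python
def split_fasta(text):
    """Split multi-FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines of one record are concatenated.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">seq1\nccaaaacgt\nacgttt\n>seq2\nacgtacgt\n"
print(split_fasta(example))
# [('seq1', 'ccaaaacgtacgttt'), ('seq2', 'acgtacgt')]
```

Each record could then be written to its own file and searched as a single sequence.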

What you should know

As said before, the grammar can be a little complex, and you should take care when writing rules, spaces, etc.

Before going further, you should at least know the following to write your own grammar:

These rules do not apply to graphical designs, as the graphical-model-to-grammar converter takes care of rewriting the data according to these rules.

Pattern with errors

We previously searched for a string that matches perfectly. However, in the real world, we often face mutations, errors, etc.

Logol supports the definition of Hamming (cost) and distance (indel) errors:

mod1()==>"ccc","agcga":{$[1,1],$$[1,1]}
mod1()==*>SEQ1
       

In this example, we allow a Hamming error range $[1,1] (from 1 to 1 substitution) and an indel error range $$[1,1] (from 1 to 1 insertion or deletion).
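For readers unfamiliar with the two error types, the Python sketch below computes them for short strings. It is only meant to show what is being counted: the function names hamming and indel_distance are illustrative, and treating the indel error as an insertion/deletion-only distance is an assumption about how Logol counts it.

```python
def hamming(a, b):
    """Number of positions at which two strings of equal length differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def indel_distance(a, b):
    """Minimum number of single-character insertions or deletions
    needed to turn a into b (substitutions are not allowed)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


print(hamming("agcga", "agtga"))        # 1: one substitution (c -> t)
print(indel_distance("agcga", "agga"))  # 1: one deletion (c)
```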

A range is defined by a lower limit and an upper limit. These limits can be integers, variables or a mix of both.

Constraints usage

Constraints are divided into two classes: string constraints and structure constraints.

Their definition must follow this syntax:
VAR1:{ String constraint , String constraint, ... }:{ Structure constraint , Structure constraint, ... }

For the upper bound, a length constraint supports the no-limit character "_": [3,_] means at least 3 characters.
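The meaning of an open-ended range can be sketched in a few lines of Python. This is an illustration only: parse_range and length_ok are hypothetical helpers, and the sketch handles integer bounds and "_" but not variables.

```python
def parse_range(text):
    """Parse '[low,high]' into (low, high); high is None when written '_'."""
    low, high = text.strip().strip("[]").split(",")
    high = high.strip()
    return int(low), (None if high == "_" else int(high))


def length_ok(s, bounds):
    """Check that len(s) lies within bounds; None means no upper limit."""
    low, high = bounds
    return len(s) >= low and (high is None or len(s) <= high)


print(parse_range("[3,_]"))                      # (3, None)
print(length_ok("acgt", parse_range("[3,_]")))   # True
print(length_ok("ac", parse_range("[3,_]")))     # False
```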

Each constraint type is described in more detail in the documentation.

The language supports a flexible syntax. If there is no String constraint, the syntax can be: VAR1:{ Structure constraint , Structure constraint, ... }

Spacers

While the models defined in the rule can match at any location in the sequence, we sometimes need to find patterns with holes between them.

These holes are called spacers. There are 2 kinds of spacers.

Spacers are managed like any other pattern, i.e. they can be saved and constraints can be applied to them.

unlimited spacer

These spacers define a hole of unknown size, usually a potentially large one.

mod1()==>"acgt",.*,"ccc"

mod1()==*>SEQ1
		

"ccc" will be searched for at any position after the "acgt" match.

Unlimited spacers should not be saved, but length constraints can be applied.

limited range spacer

This kind of spacer refers to small spacers where a range is always defined.

mod1()==>"acgt",SPACER1:{#[1,7]},"ccc"

mod1()==*>SEQ1
		

In this example, the spacer has a size between 1 and 7.

Prefer this kind of spacer when you know the maximum size of the spacer and that size is relatively small.
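The difference between the two kinds of spacers can be sketched in Python by enumerating the gaps between two literals. This is not Logol's search strategy. The helper names find_all and find_with_spacer are hypothetical, positions are 0-based, and allowing an empty gap for the unlimited spacer is an assumption.

```python
SEQUENCE = "ccaaaacgtacgtttttttcccccc"  # sample sequence of this tutorial


def find_all(seq, word):
    """Return every start position of word in seq, overlapping hits included."""
    return [i for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i)]


def find_with_spacer(seq, left, right, min_gap=0, max_gap=None):
    """Return (left_start, right_start, gap) for every pair of hits whose
    gap lies in [min_gap, max_gap]; max_gap=None means no upper limit."""
    hits = []
    for start in find_all(seq, left):
        left_end = start + len(left)
        for right_start in find_all(seq, right):
            gap = right_start - left_end
            if gap >= min_gap and (max_gap is None or gap <= max_gap):
                hits.append((start, right_start, gap))
    return hits


# "acgt",.*,"ccc": any gap is accepted
unlimited = find_with_spacer(SEQUENCE, "acgt", "ccc")
# "acgt",SPACER1:{#[1,7]},"ccc": the gap must be between 1 and 7
limited = find_with_spacer(SEQUENCE, "acgt", "ccc", 1, 7)

print(len(unlimited))  # 8
print(limited)         # [(9, 19, 6), (9, 20, 7)]
```

On the sample sequence, the bounded spacer keeps 2 of the 8 candidate pairs, which shows why a known maximum size reduces the search.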

Views

Views are a way to group patterns for 2 reasons:

mod1()==>"aaaa",("cgtacg"|"cgta"):{#[3,5]}
mod1()==*>SEQ1
       

Here we apply a constraint to the group ("cgtacg"|"cgta"). This code applies an OR condition (don't worry, you will see it in the next chapter).

What matters here is that the constraint {#[3,5]} ensures that, whatever the group of patterns matches, its overall length stays within the defined range.

In our case, both "cgtacg" and "cgta" could match after "aaaa" in the sample sequence. However, as the maximum length allowed by the constraint is 5, the match "cgtacg" (6 characters) is discarded.
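The effect of the length constraint on the group can be checked with a small Python sketch. It is a simplified model of this one example, not of views in general, and the function name view_matches is hypothetical.

```python
SEQUENCE = "ccaaaacgtacgtttttttcccccc"  # sample sequence of this tutorial


def view_matches(seq, prefix, alternatives, min_len, max_len):
    """For each occurrence of prefix, keep the alternatives that match right
    after it and whose length lies within [min_len, max_len]."""
    results = []
    for start in range(len(seq)):
        if not seq.startswith(prefix, start):
            continue
        pos = start + len(prefix)
        for alt in alternatives:
            if seq.startswith(alt, pos) and min_len <= len(alt) <= max_len:
                results.append((start, alt))
    return results


# Without a useful bound, both alternatives match after "aaaa".
print(view_matches(SEQUENCE, "aaaa", ["cgtacg", "cgta"], 0, 100))
# [(2, 'cgtacg'), (2, 'cgta')]

# With the range [3,5], the 6-character alternative is discarded.
print(view_matches(SEQUENCE, "aaaa", ["cgtacg", "cgta"], 3, 5))
# [(2, 'cgta')]
```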

OR conditions

This language construct allows 2 or more alternative patterns to be matched at the same location.

The following example matches "acgt" or "aggt" right after "aaaa".

mod1()==>"aaaa",("acgt"|"aggt")
mod1()==*>SEQ1
       

A slightly more complex example, combining an OR and a spacer.

mod1()==>"aaaa",((.*,"acgt")|(.*,"aggt"))
mod1()==*>SEQ1
       

This example applies an OR to the groups of patterns (.*,"acgt") and (.*,"aggt").
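Readers who know regular expressions may find the following Python analogy useful. It is only an approximation: the regular expressions and the helper first_match are assumptions made for this sketch, and a regular expression search returns only the first match, whereas Logol may report more. The sequence "ttaaaaaggtcc" is a made-up example.

```python
import re

SEQUENCE = "ccaaaacgtacgtttttttcccccc"  # sample sequence of this tutorial

# "aaaa",("acgt"|"aggt"): one of the two words right after aaaa
ADJACENT = re.compile(r"aaaa(?:acgt|aggt)")
# "aaaa",((.*,"acgt")|(.*,"aggt")): each branch carries its own spacer
WITH_SPACER = re.compile(r"aaaa(?:.*?acgt|.*?aggt)")


def first_match(pattern, seq):
    """Return (start, matched text) of the first match, or None."""
    m = pattern.search(seq)
    return None if m is None else (m.start(), m.group(0))


print(first_match(ADJACENT, "ttaaaaaggtcc"))  # (2, 'aaaaaggt')
print(first_match(ADJACENT, SEQUENCE))        # None
print(first_match(WITH_SPACER, SEQUENCE))     # (2, 'aaaacgtacgt')
```

In the sample sequence, "aaaa" is followed directly by "cgta", so the first grammar finds nothing there, while the version with spacers can skip ahead to the next "acgt".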

Morphisms

Morphisms are used to transform a string into another.

Logol predefines some morphisms, but it is also possible to define your own.

A morphism is oriented: a + applies the transformation to the string as is, while a - reverses the string first.

mod1()==>"aaa",+"wc" "tgca"
mod1()==*>SEQ1
       

The morphism +"wc" searches for the complement of "tgca" (acgt) after "aaa".
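The orientation of a morphism can be illustrated with a short Python sketch. It assumes that "wc" is the usual DNA complement (a/t, c/g), which agrees with the tgca/acgt example above. The function wc and its orientation parameter are hypothetical names.

```python
SEQUENCE = "ccaaaacgtacgtttttttcccccc"  # sample sequence of this tutorial

# Assumed complement table for DNA: a<->t, c<->g.
COMPLEMENT = str.maketrans("acgt", "tgca")


def wc(s, orientation="+"):
    """Complement s; with orientation '-', reverse the string first."""
    if orientation == "-":
        s = s[::-1]
    return s.translate(COMPLEMENT)


print(wc("tgca"))        # acgt
print(wc("tgca", "-"))   # tgca

# "aaa",+"wc" "tgca" then amounts to searching for "aaa" followed by "acgt".
print(SEQUENCE.find("aaa" + wc("tgca")))  # 2
```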

Multiple models

To ease the description of the global model, and to reuse some parts of the code, Logol provides models.

We used a model in the rule description. Models can also be used outside the rule.

There is a difference, however: a model called from another model is only tried at the next position (position+1), not anywhere in the sequence.

Models can also use variables as input and output.

mod1()==>"aaa","acgt":{_R1},mod2(R1)
mod2(X1)==>?X1
mod1()==*>SEQ1
       

In mod1()==>, we call model mod2(R1). We pass variable R1 to mod2 as argument.

Then we define mod2 with mod2(X1)==>?X1.

mod2 will search for the content (?) of variable X1, which is an input argument. In our case, the input argument is bound to the variable R1.

In the result file, mod2 will appear as a pattern match and will include all the details of the mod2 sub-patterns.
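The difference between a model used in the rule and a model called from another model can be pictured with plain Python functions. This is a conceptual sketch only: the functions literal, mod2, mod1 and rule are hypothetical stand-ins, positions are 0-based, and each function returns the position that follows its match, or None.

```python
SEQUENCE = "ccaaaacgtacgtttttttcccccc"  # sample sequence of this tutorial


def literal(seq, pos, word):
    """Return the position after word if it occurs exactly at pos, else None."""
    return pos + len(word) if seq.startswith(word, pos) else None


def mod2(seq, pos, x1):
    # mod2(X1)==>?X1: match the content of the input variable, at pos only
    return literal(seq, pos, x1)


def mod1(seq, pos):
    pos = literal(seq, pos, "aaa")
    if pos is None:
        return None
    end = literal(seq, pos, "acgt")
    if end is None:
        return None
    r1 = seq[pos:end]           # "acgt":{_R1} saves the matched text in R1
    return mod2(seq, end, r1)   # called model: tried at this position only


def rule(seq):
    # mod1()==*>SEQ1: the rule tries mod1 at every start position
    found = []
    for start in range(len(seq)):
        end = mod1(seq, start)
        if end is not None:
            found.append((start, end))
    return found


print(rule(SEQUENCE))  # [(2, 13)]
```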

Repeats

Sometimes we want to match (or not to match, that is the question) a pattern occurring multiple times.

Repeats help to do so. A repeat can be applied to a pattern or a view.

mod1()==>"aaa",repeat("acgt" ,[0,1])+[0,2]
mod1()==*>SEQ1
       

We define a repeat on "acgt" with repeat("acgt". We accept a spacer of 0 to 1 character between consecutive repeats with ,[0,1], and the pattern can occur between 0 and 2 times with +[0,2].

The repeated element can be a pattern or a view, including a variable. Repeat also supports overlap between repeats, using a semicolon instead of the comma.
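The two ranges of a repeat can be explored with the Python sketch below, which enumerates how many times a unit can be repeated from a given position. It is an illustration under assumptions: repeat_matches is a hypothetical helper, overlap between repeats is not modelled, and the spacer is only applied between consecutive occurrences.

```python
SEQUENCE = "ccaaaacgtacgtttttttcccccc"  # sample sequence of this tutorial


def repeat_matches(seq, pos, unit, gap, count):
    """Return (occurrences, end position) for every way of repeating unit
    from pos, with a gap in [gap[0], gap[1]] between consecutive occurrences
    and a number of occurrences in [count[0], count[1]]."""
    min_gap, max_gap = gap
    min_count, max_count = count
    results = []

    def extend(end, n):
        if n >= min_count:
            results.append((n, end))
        if n == max_count:
            return
        # No spacer before the first occurrence.
        gaps = [0] if n == 0 else range(min_gap, max_gap + 1)
        for g in gaps:
            start = end + g
            if seq.startswith(unit, start):
                extend(start + len(unit), n + 1)

    extend(pos, 0)
    return results


# "aaa" matches at position 2, so the repeat starts at position 5.
print(repeat_matches(SEQUENCE, 5, "acgt", (0, 1), (0, 2)))
# [(0, 5), (1, 9), (2, 13)]
```

With a lower bound of 0, the empty repeat is also a valid result, which is why (0, 5) appears in the output.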

Overlap

Two consecutive strings can overlap:

mod1()==>"ccaaaa";("aacgt","acg")
mod1()==*>SEQ1
       

In this example, "aacgt" can overlap the previous pattern match "ccaaaa" on "aa" or "a".
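To see which overlap actually occurs in the sample sequence, the following Python sketch tries every overlap size between the first two strings. It is a simplified model of this example only: overlapping_starts is a hypothetical helper, it ignores the trailing "acg", and it reports 0-based positions.

```python
SEQUENCE = "ccaaaacgtacgtttttttcccccc"  # sample sequence of this tutorial


def overlapping_starts(seq, first, second):
    """Return (first_start, second_start, overlap) whenever second begins
    inside the end of a match of first, sharing 1 or more characters."""
    hits = []
    max_overlap = min(len(first), len(second)) - 1
    for start in range(len(seq) - len(first) + 1):
        if not seq.startswith(first, start):
            continue
        end = start + len(first)
        for overlap in range(1, max_overlap + 1):
            if seq.startswith(second, end - overlap):
                hits.append((start, end - overlap, overlap))
    return hits


print(overlapping_starts(SEQUENCE, "ccaaaa", "aacgt"))
# [(0, 4, 2)]
```

In the sample sequence, "aacgt" starts 2 characters before the end of "ccaaaa", so the overlap is on "aa".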