Copyright (c) 2002-2004 Infrae. All rights reserved.
See also LICENSE.txt

Meta::
  
  Valid for:  Silva 0.9.3+
  Author:     Christian Zagrodnick
  Email:      cz+gocept.com
  CVS:        $Id: DEVELOPER.txt,v 1.1.22.1 2004/04/29 16:57:02 roman Exp $


Silva Document Developer Notes
or "How the heck does this parser work?"


Problem

  Users don't care about syntax. And they do want a bold asterisks. Or italic
  pluses. *wink*

  So there is the need for a parser which eats anything.  The old parser was
  not powerful enough to handle various cases.


Solution

  What we have now is a constraint satisfaction algorithm. This is modelled as
  a heuristic search problem, i.e. an artifical intelligence approach.

  
  The Great Picture

    NOTE: The description is noth 100% correct. It's main purpose is to get an
    understanding of how the things work.

    The editable is split into tokens. The available tokens kinds are defined
    in class "Token". Each token kind has a unique id. 
    
    The parser has a mapping from regular expressions to token ids. In each
    parsing step all regular expressions are matched against the remaining
    (i.e. not yet parsed) part of the editable. The result is a list of
    potential tokens. Such a token, together with the remaining text, makes a
    ParserState. 

    All generated ParserState instances are beeing weighted. That's where the
    heuristic method comes in. The best ParserState is selected and continued
    with. 

    When there is no more text to parse from the editable the current parser
    state's token list is interpreted by the Interpreter class. If the
    interpreter fails due to a token list which cannot be converted to a valid
    xml structure the ParserState is droped. The parser then continues with the
    next best ParserState.

    
  Heuristics Function in Detail

    One of the most important parts to get good and quick results is the
    heuristics function. 

    h: ParserState --> [0, 4]

    The lower h gets the better.

    The following assumptions have been made:

      * More text was consumed is better than less
      
      * Fewer tokens are better than more
      
      * The lower floor(token_id/10) the better

      * Unmatched parenthesis are bad
     
    h = (average of floor(token_id/10)/10 + 
        number of tokens / number of consumed characters) *
        parenthesis bias
      
    The parenthesis bias takes into account that usually parenthesis are
    matching. I.e. you don't have "foo **(bar**)" as editable. If no
    parenthesis are missmatching the bias is 1 and moves towards 2 for an
    infinte amount of unmathing parenthesis. The bias is nessecary to get
    correct results of link tokens, i.e. ((foo|http://bla?x=(bar))) vs.
    (see ((foo|http://bar))).
    
Terms

  editable -- Text entered by the user; that is what the parser gets. In terms
    of datastructures this is a unicode string.

  xml -- xml usually referrs to Silva xml here
  
  
