(* Lexer specification for LIGO, to be processed by [ocamllex].
The underlying design principles are:
   (1) enforce stylistic constraints at a lexical level, in order to
       reject early any potentially misleading or poorly written
       LIGO contracts;
   (2) provide precise error messages, with hints as to how to fix
       the issue, which is achieved by consulting the lexical
       right-context of lexemes;
(3) be as independent as possible from the LIGO version, so
upgrades have as little impact as possible on this
specification: this is achieved by using the most general
regular expressions to match the lexing buffer and broadly
distinguish the syntactic categories, and then delegating a
finer, second analysis to an external module making the
tokens (hence a functor below);
(4) support unit testing (lexing of the whole input with debug
traces).
A limitation to the independence with respect to the LIGO version
lies in the errors that the external module building the tokens
   (which may be version-dependent) may have to report. Indeed, these
   errors have to be contextualised by the lexer in terms of input
   source regions, so that useful error messages can be printed;
   therefore, they are part of the signature [TOKEN] that
   parameterises the functor generated here. For instance, if, in a
   future release of
LIGO, new tokens are added, and the recognition of their lexemes
entails new errors, the signature [TOKEN] will have to be augmented
and this lexer specification changed. However, in practice, it is
more likely that instructions or types will be added, instead of
new kinds of tokens.
*)
module Region = Simple_utils.Region
module Pos = Simple_utils.Pos
(* TOKENS *)
(* The signature [TOKEN] exports an abstract type [token], so a lexer
   can be a functor over tokens. This makes it possible to externalise
version-dependent constraints in any module whose signature matches
[TOKEN]. Generic functions to construct tokens are required.
   Note the predicate [is_eof], which characterises the virtual token
for end-of-file, because it requires special handling. Some of
those functions may yield errors, which are defined as values of
the type [int_err] etc. These errors can be better understood by
reading the ocamllex specification for the lexer ([Lexer.mll]).
*)
type lexeme = string
module type TOKEN =
  sig
    type token

    (* Errors *)

    type int_err   = Non_canonical_zero
    type ident_err = Reserved_name
    type nat_err   = Invalid_natural
                   | Non_canonical_zero_nat
    type sym_err   = Invalid_symbol
    type attr_err  = Invalid_attribute
    (* Injections *)

    val mk_int    : lexeme -> Region.t -> (token, int_err) result
    val mk_nat    : lexeme -> Region.t -> (token, nat_err) result
    val mk_mutez  : lexeme -> Region.t -> (token, int_err) result
    val mk_ident  : lexeme -> Region.t -> (token, ident_err) result
    val mk_sym    : lexeme -> Region.t -> (token, sym_err) result
    val mk_string : lexeme -> Region.t -> token
    val mk_bytes  : lexeme -> Region.t -> token
    val mk_constr : lexeme -> Region.t -> token
    val mk_attr   : string -> lexeme -> Region.t -> (token, attr_err) result
    val eof       : Region.t -> token

    (* Predicates *)

    val is_eof : token -> bool

    (* Projections *)

    val to_lexeme : token -> lexeme
    val to_string : token -> ?offsets:bool -> [`Byte | `Point] -> string
    val to_region : token -> Region.t
    (* Style *)

    type error

    val error_to_string : error -> string

    exception Error of error Region.reg

    val format_error :
      ?offsets:bool ->
      [`Byte | `Point] ->
      error Region.reg ->
      file:bool ->
      string Region.reg

    val check_right_context :
      token ->
      (Lexing.lexbuf -> (Markup.t list * token) option) ->
      Lexing.lexbuf ->
      unit
  end
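To make the role of [TOKEN] concrete, here is a minimal, self-contained sketch of a toy module in the same style, with [Region.t] stubbed as [unit]. Everything here ([ToyToken], the stub [Region], the reserved-word list) is illustrative only; the real implementation lives in the version-dependent LexToken.mll and uses [Simple_utils.Region].

```ocaml
(* Stub for Simple_utils.Region, for illustration only. *)
module Region = struct
  type t = unit
  type 'a reg = {region : t; value : 'a}
end

type lexeme = string

module ToyToken = struct
  type token =
    | Int of int
    | Ident of string
    | EOF

  type int_err   = Non_canonical_zero
  type ident_err = Reserved_name

  (* Reject integer literals with a redundant leading zero,
     e.g. "007", in the spirit of design principle (1). *)
  let mk_int lexeme (_ : Region.t) =
    if String.length lexeme > 1 && lexeme.[0] = '0'
    then Error Non_canonical_zero
    else Ok (Int (int_of_string lexeme))

  let reserved = ["let"; "in"; "fun"]

  let mk_ident lexeme (_ : Region.t) =
    if List.mem lexeme reserved
    then Error Reserved_name
    else Ok (Ident lexeme)

  let eof (_ : Region.t) = EOF

  let is_eof = function EOF -> true | _ -> false
end

let () =
  assert (ToyToken.mk_int "007" () = Error ToyToken.Non_canonical_zero);
  assert (ToyToken.mk_int "42"  () = Ok (ToyToken.Int 42));
  assert (ToyToken.mk_ident "let" () = Error ToyToken.Reserved_name);
  assert (ToyToken.is_eof (ToyToken.eof ()))
```

Note how the smart constructors return [result] values rather than raising: the lexer can then attach a source region to the error before reporting it.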
(* The signature of the lexer *)

module type S =
  sig
    module Token : TOKEN
    type token = Token.token

    (* The scanner [init] is meant to be called first, to read the
       BOM. Then [scan] is called. *)

    val init : token LexerLib.state -> Lexing.lexbuf -> token LexerLib.state
    val scan : token LexerLib.state -> Lexing.lexbuf -> token LexerLib.state

    (* Errors (specific to the generic lexer, not to the tokens) *)

    type error

    val error_to_string : error -> string

    exception Error of error Region.reg

    val format_error :
      ?offsets:bool -> [`Byte | `Point] ->
      error Region.reg -> file:bool -> string Region.reg
  end
(* The functorised interface
Note that the module parameter [Token] is re-exported as a
submodule in [S].
*)
module Make (Token : TOKEN) : S with module Token = Token
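The functor pattern above, including the re-export of the [Token] parameter as a submodule, can be sketched in miniature as follows. The names here ([TOY_TOKEN], [ToyLexer], [count_until_eof]) are hypothetical and stand in for [TOKEN], [Make] and the real scanning entry points, which depend on [LexerLib] and [Lexing].

```ocaml
(* A reduced token signature: only what this toy functor needs. *)
module type TOY_TOKEN =
  sig
    type token
    val is_eof : token -> bool
  end

module ToyLexer (Token : TOY_TOKEN) =
  struct
    module Token = Token  (* re-exported, as [Make] re-exports [Token] in [S] *)

    (* Drain a token source until [is_eof] holds, counting the
       tokens seen before end-of-file. *)
    let count_until_eof (next : unit -> Token.token) =
      let rec go n =
        if Token.is_eof (next ()) then n else go (n + 1)
      in go 0
  end

(* A concrete token module, then the functor application. *)
module T = struct
  type token = Tok of string | EOF
  let is_eof = function EOF -> true | _ -> false
end

module L = ToyLexer (T)

let () =
  let stream = ref [T.Tok "a"; T.Tok "b"; T.EOF] in
  let next () =
    match !stream with
    | t :: rest -> stream := rest; t
    | []        -> T.EOF in
  assert (L.count_until_eof next = 2)
```

Because [ToyLexer] only depends on the abstract [TOY_TOKEN] interface, swapping in a different token module (say, one for another concrete syntax) requires no change to the lexer body, which is exactly the version-independence argued for in design principle (3).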