Inferno's SEXPRS(6 )

NAME

sexprs - symbolic expressions

DESCRIPTION

S-expressions (`symbolic expressions') provide a way for programs to store and exchange tree-structured text and binary data. The Limbo module sexprs(2) provides the variant defined by Rivest in Internet Draft draft-rivest-sexp-00.txt (4 May 1997), as used for instance by the Simple Public Key Infrastructure (SPKI). It provides a `canonical' form of S-expression, and an `advanced' form for display. They can convey binary data directly and efficiently, unlike some other schemes such as XML. The two forms are closely related and all can be read or written by sexprs(2), including a variant sometimes used for transport on links that are not 8-bit safe.

An S-expression is either a sequence of bytes (a byte string), or a parenthesised list of smaller S-expressions. All forms start with the fundamental rules below, in extended BNF:

sexpr	::=	string | list
list	::=	'(' sexpr* ')'

They give the recursive structure. The various representations ultimately differ only in how the byte string is represented and whether white space such as blanks or newlines can appear.

Furthermore, the definition of string is also common to all forms:

string	::=	display? simple-string
display	::=	'[' simple-string ']'

The optional bracketed display string provides information on how to present the associated byte string to a user. (``It has no other function. Many of the MIME types work here.'') Although supported by sexprs(2), it is largely unused by Inferno applications and is usually left out. The canonical and advanced forms differ in their definitions of simple-string. They always denote sequences of 8-bit bytes, but with different syntax (encodings). Two strings are equal iff their simple-strings encode the same byte strings (for both data and display).

Canonical form must be used when exchanging S-expressions between computers, and when digitally signing an expression. It is defined by the complete set of rules below:

sexpr	::=	string | list
list	::=	'(' sexpr* ')'
string	::=	display? simple-string
display	::=	'[' simple-string ']'
simple-string	::=	raw
raw	::=	nbytes ':' byte*
nbytes	::=	[1-9][0-9]+ |  0

Its simple-string is a raw byte string. The primitive byte represents an 8-bit byte. The length of every byte string is given explicitly by a preceding decimal value nbytes (with no leading zeroes). There is no white space. It is `canonical' because it is uniquely defined for each S-expression. It is efficient to parse even on small computers.

Advanced form is more elaborate, and has two main differences: not all byte strings need an explicit length, and binary data can be represented in printable form, either using hexadecimal or base 64 encodings, or using quoted strings (with escape sequences similar to those of Limbo or C). Unquoted text is called a token, and is restricted by the standard to a specific alphabet: it must contain only letters, digits, or characters from the set -./_:*+=, and must not start with a digit. The latter restriction is imposed to allow byte counts to be distinguished from tokens without lookahead, but has the consequence that decimal numbers must be quoted, as must non-ASCII characters in utf(6) encoding. Upper- and lower-case letters are distinct. The advanced transport syntax is defined by the complete set of rules below:

sexpr	::=	string | list
list	::=	'(' ( sexpr | whitespace )* ')'
string	::=	display? simple-string
display	::=	'[' simple-string ']'
simple-string	::=	raw | token | base-64 | hexadecimal |  quoted-string
raw	::=	nbytes ':' byte*
nbytes	::=	[1-9][0-9]+ |  0
token	::=	token-start token-char*
base-64	::=	decimal? '|' ( base-64-char | whitespace )* '|'
hexadecimal	::=	'#' ( hex-digit | whitespace )* '#'
quoted-string	::=	nbytes? quoted-string-body  
quoted-string-body	::=	'"' byte* '"'
token-start	::=	[-./_:*+=a-zA-Z]
token-char	::=	token-start | [0-9]
hex-digit	::=	[0-9a-fA-F]
base-64-char	::=	[a-zA-Z0-9+/=]

Whitespace is any sequence of blank, tab, newline or carriage-return characters; note that it can appear only at the places shown. The bytes in a quoted-string-body are interpreted according to the quoting rules for Limbo (or C). That is, the bytes are enclosed in quotes, and may contain the escape sequences for the following characters: backspace (\b), form-feed (\f), newline (\n), carriage-return (\r), tab (\t), and vertical tab (\v), octal escape \ooo (all three digits must be given), hexadecimal escape \xhh (both digits must be given), \\ for backslash, \' for single quote, and and \" to include a quote in a string. Note that a quoted string can have an optional nbytes, but it gives the length of the byte string resulting after interpreting character escapes.

Both canonical and advanced forms can contain binary data verbatim. Sometimes that is troublesome for storage or transport. At the lexical level any sexpr can therefore be replaced by the following:

'{' ( base-64-char | whitespace )* '}'

where the text between the braces is the base-64 encoding of the sexpr expressed in canonical or advanced form. The S-expression parser will replace the sequence by its decoded, and resume parsing at the start of that byte string. Note the difference in syntax and interpretation from rule base-64 above, which encodes a simple-string, not an sexpr.

EXAMPLES

The following S-expression is in canonical form:

(12:hello world!(5:inner0:))

It is a list of two elements: the string hello world!, and another list also with two elements, the string inner and an empty string. All the bytes in the example are printable characters, but they could have been arbitrary binary values.

The following is an S-expression in advanced form:

(hello-world
    (* "3" "5.6")
    (best-of-3 (5:inner0:)))

Note that advanced form contains canonical form as a subset; here it is used for the innermost list.

SEE ALSO

sexprs(2), json(6), ubfa(6)

R. Rivest, ``S-expressions'', Network Working Group Internet Draft (4 May 1997), reproduced in /lib/sexp.

SEXPRS(6 )

Rev: Thu Feb 15 14:43:48 GMT 2007