INTERNET-DRAFT                               Charles H. Lindsey
Usenet Format Working Group                  University of Manchester
                                             July 2001

2.4. Syntax Notation

   This standard uses the Augmented Backus Naur Form described in [RFC
   2234].  A discussion of this is outside the bounds of this standard,
   but it is expected that implementors will be able quickly to
   understand it with reference to that defining document.

   Much of the syntax of News Articles is based on the corresponding
   syntax defined in [RFC 2822] or in the MIME specifications [RFC 2045]
   et seq, which is deemed to have been incorporated into this standard
   as required. However, there are some important differences arising
   from the fact that [RFC 2822] does not recognise anything other than
   US-ASCII characters, that it does not recognise the MIME headers [RFC
   2045], and that it includes much syntax described as "obsolete".

        NOTE: News parsers historically have been much less permissive
        than Mail parsers, and this is reflected in the modifications
        referred to, and in some further specific rules.

   The following syntactic forms therefore supersede the corresponding
   rules given in [RFC 2822] and [RFC 2045], thus allowing UTF-8
   characters [RFC 2279] to appear in certain contexts (the five rules
   beginning with "strict-" reflect the corresponding original rules from
   [RFC 2822]).

      UTF8-xtra-2-head= %xC2-DF
      UTF8-xtra-3-head= %xE0 %xA0-BF / %xE1-EC %x80-BF /
                        %xED %x80-9F / %xEE-EF %x80-BF
      UTF8-xtra-4-head= %xF0 %x90-BF / %xF1-F7 %x80-BF
      UTF8-xtra-5-head= %xF8 %x88-BF / %xF9-FB %x80-BF
      UTF8-xtra-6-head= %xFC %x84-BF / %xFD    %x80-BF
      UTF8-xtra-tail  = %x80-BF
      UTF8-xtra-char  = UTF8-xtra-2-head 1( UTF8-xtra-tail ) /
                        UTF8-xtra-3-head 1( UTF8-xtra-tail ) /
                        UTF8-xtra-4-head 2( UTF8-xtra-tail ) /
                        UTF8-xtra-5-head 3( UTF8-xtra-tail ) /
                        UTF8-xtra-6-head 4( UTF8-xtra-tail )
      text            = %d1-9 /            ; all UTF-8 characters except
                        %d11-12 /          ; US-ASCII NUL, CR and LF
                        %d14-127 /
                        UTF8-xtra-char
      ctext           = NO-WS-CTL /        ; all of <text> except
                        %d33-39 /          ; SP, HTAB, "(", ")"
                        %d42-91 /          ; and "\"
                        %d93-126 /
                        UTF8-xtra-char
      qtext           = NO-WS-CTL /        ; all of <text> except
                        %d33 /             ; SP, HTAB, "\" and DQUOTE
                        %d35-91 /
                        %d93-126 /
                        UTF8-xtra-char
      utext           = NO-WS-CTL /        ; Non white space controls
                        %d33-126 /         ; The rest of US-ASCII
                        UTF8-xtra-char
      strict-text     = %d1-9 /            ; text restricted to
                        %d11-12 /          ; US-ASCII
                        %d14-127
      strict-qtext    = NO-WS-CTL /        ; qtext restricted to
                        %d33 /             ; US-ASCII
                        %d35-91 /
                        %d93-127
      strict-quoted-pair
                      = "\" strict-text


      strict-qcontent = strict-qtext / strict-quoted-pair
      strict-quoted-string
                      = [CFWS]
                           DQUOTE *([FWS] strict-qcontent) [FWS] DQUOTE
                           [CFWS]

   The syntax for UTF8-xtra-char excludes those redundant sequences of
   octets which cannot occur in UTF-8, as defined by [RFC 2279], either
   because they would not be the shortest possible encodings of some UCS
   character, or they would represent one of the characters D800 through
   DFFF, disallowed in UCS because of their surrogate use in the UTF-16
   encoding.  These sequences MUST NOT be generated by posting agents.
   Where they occur inadvertently, they MAY be passed on untouched by
   other agents, but they MUST NOT ever be interpreted as valid
   characters.
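
   As an illustration only, and not as part of this standard, the
   UTF8-xtra-char rule can be transcribed into a regular expression
   over octets. The Python sketch below assumes that the candidate
   character is already available as a byte string; the names TAIL,
   HEAD2 through HEAD6 and is_utf8_xtra_char are hypothetical. Since
   the rule follows [RFC 2279], the sketch accepts 5- and 6-octet forms
   which decoders written to later definitions of UTF-8 may reject.

```python
import re

# Each pattern mirrors one ABNF rule; octet ranges are copied from the grammar.
TAIL = rb"[\x80-\xBF]"
HEAD2 = rb"[\xC2-\xDF]"
HEAD3 = (rb"(?:\xE0[\xA0-\xBF]|[\xE1-\xEC][\x80-\xBF]"
         rb"|\xED[\x80-\x9F]|[\xEE-\xEF][\x80-\xBF])")
HEAD4 = rb"(?:\xF0[\x90-\xBF]|[\xF1-\xF7][\x80-\xBF])"
HEAD5 = rb"(?:\xF8[\x88-\xBF]|[\xF9-\xFB][\x80-\xBF])"
HEAD6 = rb"(?:\xFC[\x84-\xBF]|\xFD[\x80-\xBF])"

# UTF8-xtra-char: a head followed by the stated number of tail octets.
UTF8_XTRA_CHAR = re.compile(b"|".join([
    HEAD2 + TAIL,
    HEAD3 + TAIL,
    HEAD4 + TAIL + b"{2}",
    HEAD5 + TAIL + b"{3}",
    HEAD6 + TAIL + b"{4}",
]))


def is_utf8_xtra_char(octets: bytes) -> bool:
    """True if octets form exactly one UTF8-xtra-char (hypothetical helper)."""
    return UTF8_XTRA_CHAR.fullmatch(octets) is not None
```

   Under these assumptions, is_utf8_xtra_char(b"\xc3\xa9") holds,
   whereas the overlong form b"\xc0\xaf" and the surrogate encoding
   b"\xed\xa0\x80" are both rejected, as is any US-ASCII octet.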

   Wherever in this standard the syntax is stated to be taken from [RFC
   2822], it is to be understood as the syntax defined by [RFC 2822]
   after making the above changes, but NOT including any syntax defined
   in section 4 ("Obsolete syntax") of [RFC 2822].  Software compliant
   with this standard MUST NOT generate any of the syntactic forms
   defined in that Obsolete Syntax, although it MAY accept such
   syntactic forms. Certain syntax from the MIME specifications [RFC
   2045] et seq is also considered a part of this standard (see 6.21).

   The following syntactic forms, taken from [RFC 2234] or from [RFC
   2822], are repeated here for convenience only:

      ALPHA           = %x41-5A /          ; A-Z
                        %x61-7A            ; a-z
      CR              = %x0D               ; carriage return
      CRLF            = CR LF
      DIGIT           = %x30-39            ; 0-9
      HTAB            = %x09               ; horizontal tab
      LF              = %x0A               ; line feed
      SP              = %x20               ; space
      NO-WS-CTL       = %d1-8 /            ; US-ASCII control characters
                        %d11 /             ; which do not include the
                        %d12 /             ; carriage return, line feed,
                        %d14-31 /          ; and whitespace characters
                        %d127
      specials        = "(" / ")" /        ; Special characters used in
                        "<" / ">" /        ;  other parts of the syntax
                        "[" / "]" /
                        ":" / ";" /
                        "@" / "\" /
                        "," / "." /
                        DQUOTE
      WSP             = SP / HTAB          ; Whitespace characters
      FWS             = ([*WSP CRLF] 1*WSP); Folding whitespace
      ccontent        = ctext / quoted-pair / comment
      comment         = "(" *([FWS] ccontent) [FWS] ")"
      CFWS            = *([FWS] comment) (([FWS] comment) / FWS )
      DQUOTE          = %d34              ; quote mark
      quoted-pair     = "\" text



      atext           = ALPHA / DIGIT /
                        "!" / "#" /        ; Any character except
                        "$" / "%" /        ; controls, SP, and specials.
                        "&" / "'" /        ; Used for atoms
                        "*" / "+" /
                        "-" / "/" /
                        "=" / "?" /
                        "^" / "_" /
                        "`" / "{" /
                        "|" / "}" /
                        "~"
      atom            = [CFWS] 1*atext [CFWS]
      dot-atom        = [CFWS] dot-atom-text [CFWS]
      dot-atom-text   = 1*atext *( "." 1*atext )
      qcontent        = qtext / quoted-pair
      quoted-string   = [CFWS]
                           DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                           [CFWS]
      word            = atom / quoted-string
      phrase          = 1*word
      unstructured    = *( [FWS] utext ) [FWS]
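
   By way of example only, the atext and dot-atom-text rules can be
   checked with a short regular expression. The Python sketch below
   takes the character list from the atext rule of [RFC 2822]; it
   deliberately omits the optional CFWS which surrounds an atom or a
   dot-atom, and the names ATEXT, DOT_ATOM_TEXT and is_dot_atom_text
   are hypothetical.

```python
import re

# atext: ALPHA / DIGIT / the listed punctuation (no controls, SP or specials).
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"

# dot-atom-text = 1*atext *( "." 1*atext )
DOT_ATOM_TEXT = re.compile(ATEXT + r"+(?:\." + ATEXT + r"+)*")


def is_dot_atom_text(s: str) -> bool:
    """True if s matches dot-atom-text exactly (no surrounding CFWS)."""
    return DOT_ATOM_TEXT.fullmatch(s) is not None
```

   With this sketch "comp.lang.c" is accepted, whereas "a..b", ".a" and
   "foo@bar" are not, because a "." must separate non-empty runs of
   atext and "@" is one of the specials.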

        NOTE: CFWS occurs at many places in the syntax in order to allow
        comments and extra whitespace to be inserted almost anywhere.
        The syntax is in fact ambiguous insofar as it may be impossible
        to tell in which of several possible ways a given comment or WS
        was produced. However, this does not lead to semantic ambiguity
        because, unless specifically stated otherwise, the presence or
        absence of a comment or additional WS has no semantic meaning
        and, in particular, it is a matter of indifference whether it
        forms a part of the syntactic construct preceding it or the one
        following it.
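
   Since CFWS carries no meaning, a parser may simply discard it
   between tokens without deciding which construct it belongs to. The
   Python sketch below is one hypothetical way of doing so; the names
   skip_cfws and _skip_comment are assumptions. It is more lenient than
   the grammar, since it does not validate ctext and it tolerates
   consecutive folds, so it should be read as an illustration and not
   as a conforming parser.

```python
def _skip_comment(s: str, pos: int) -> int:
    """Return the index just past the comment that opens at s[pos] == "("."""
    depth = 0
    n = len(s)
    while pos < n:
        ch = s[pos]
        if ch == "\\":
            # quoted-pair: the next character is taken literally
            if pos + 1 >= n:
                raise ValueError("dangling backslash in comment")
            pos += 2
            continue
        if ch == "(":
            depth += 1          # comments may nest
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return pos + 1
        pos += 1
    raise ValueError("unterminated comment")


def skip_cfws(s: str, pos: int = 0) -> int:
    """Return the index of the first character after any CFWS at pos."""
    n = len(s)
    while pos < n:
        ch = s[pos]
        if ch in " \t":
            pos += 1            # WSP
        elif s.startswith("\r\n", pos) and pos + 2 < n and s[pos + 2] in " \t":
            pos += 3            # a fold: CRLF followed by WSP
        elif ch == "(":
            pos = _skip_comment(s, pos)
        else:
            break
    return pos
```

   A tokenizer built on this sketch would call skip_cfws before and
   after each token, so that it makes no difference whether a given
   comment is regarded as trailing one token or leading the next.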

        NOTE: Following [RFC 2234], literal text included in the syntax
        is to be regarded as case-insensitive.  However, in
        contradistinction to [RFC 2822], the Netnews protocols are
        sensitive to case in some instances (as in newsgroup names, some
        header parameters, etc.). Care has been taken to indicate this
        explicitly where required.

   The complete syntax defined in this standard is repeated, for
   convenience, in Appendix B.


Diffs to previous draft

--- {draft-04}	Wed Jul 11 21:54:57 2001
+++ {draft-05}	Wed Jul 11 21:54:59 2001
@@ -5,10 +5,10 @@
    understand it with reference to that defining document.
 
    Much of the syntax of News Articles is based on the corresponding
-   syntax defined in [MESSFOR] or in the Mime specifications [RFC 2045]
+   syntax defined in [RFC 2822] or in the MIME specifications [RFC 2045]
    et seq, which is deemed to have been incorporated into this standard
    as required. However, there are some important differences arising
-   from the fact that [MESSFOR] does not recognise anything other than
+   from the fact that [RFC 2822] does not recognise anything other than
    US-ASCII characters, that it does not recognise the MIME headers [RFC
    2045], and that it includes much syntax described as "obsolete".
 
@@ -17,14 +17,23 @@
         referred to, and in some further specific rules.
 
    The following syntactic forms therefore supersede the corresponding
-   rules given in [MESSFOR] and [RFC 2045], thus allowing UTF-8
-   characters [RFC 2044] to appear in certain contexts (the four rules
+   rules given in [RFC 2822] and [RFC 2045], thus allowing UTF-8
+   characters [RFC 2279] to appear in certain contexts (the five rules
    begining with "strict-" reflect the corresponding original rules from
-   [MESSFOR]).
+   [RFC 2822]).
 
-      UTF8-xtra-head  = %d192-253
-      UTF8-xtra-tail  = %d128-191
-      UTF8-xtra-char  = UTF8-xtra-head 1*UTF8-xtra-tail
+      UTF8-xtra-2-head= %xC2-DF
+      UTF8-xtra-3-head= %xE0 %xA0-BF / %xE1-EC %x80-BF /
+                      = %xED %x80-9F / %xEE-EF %x80-BF
+      UTF8-xtra-4-head= %xF0 %x90-BF / %xF1-F7 %x80-BF
+      UTF8-xtra-5-head= %xF8 %x88-BF / %xF9-FB %x80-BF
+      UTF8-xtra-6-head= %xFC %x84-BF / %xFD    %x80-BF
+      UTF8-xtra-tail  = %x80-BF
+      UTF8-xtra-char  = UTF8-xtra-2-head 1( UTF8-xtra-tail ) /
+                        UTF8-xtra-3-head 1( UTF8-xtra-tail ) /
+                        UTF8-xtra-4-head 2( UTF8-xtra-tail ) /
+                        UTF8-xtra-5-head 3( UTF8-xtra-tail ) /
+                        UTF8-xtra-6-head 4( UTF8-xtra-tail )
       text            = %d1-9 /            ; all UTF-8 characters except
                         %d11-12 /          ; US-ASCII NUL, CR and LF
                         %d14-127 /
@@ -53,29 +62,33 @@
                       = "\" strict-text
 
 
-
+      strict-qcontent = strict-qtext / strict-quoted-pair
       strict-quoted-string
-                      = [CFWS] DQUOTE
-                           *([FWS] (strict-qtext / strict-quoted-pair))
-                           [FWS] DQUOTE [CFWS]
-
-        NOTE: There are sequences of octets which cannot legitimately
-        occur in UTF-8, even a few permitted by the above syntax. These
-        SHOULD NOT be generated by posting agents but, where they occur
-        inadavertently, they SHOULD be passed on untouched by other
-        agents.
+                      = [CFWS]
+                           DQUOTE *([FWS] strict-qcontent) [FWS] DQUOTE
+                           [CFWS]
+
+   The syntax for UTF8-xtra-char excludes those redundant sequences of
+   octets which cannot occur in UTF-8, as defined by [RFC 2279], either
+   because they would not be the shortest possible encodings of some UCS
+   character, or they would represent one of the characters D800 through
+   DFFF, disallowed in UCS because of their surrogate use in the UTF-16
+   encoding.  These sequences MUST NOT be generated by posting agents.
+   Where they occur inadavertently, they MAY be passed on untouched by
+   other agents, but they MUST NOT ever be interpreted as valid
+   characters.
 
-   Wherever in this standard the syntax is stated to be taken from
-   [MESSFOR], it is to be understood as the syntax defined by [MESSFOR]
+   Wherever in this standard the syntax is stated to be taken from [RFC
+   2822], it is to be understood as the syntax defined by [RFC 2822]
    after making the above changes, but NOT including any syntax defined
-   in section 4 ("Obsolete syntax") of [MESSFOR].  Software compliant
+   in section 4 ("Obsolete syntax") of [RFC 2822].  Software compliant
    with this standard MUST NOT generate any of the syntactic forms
    defined in that Obsolete Syntax, although it MAY accept such
    syntactic forms. Certain syntax from the MIME specifications [RFC
    2045] et seq is also considered a part of this standard (see 6.21).
 
-   The following syntactic forms, taken from [RFC 2234] or from
-   [MESSFOR], are repeated here for convenience only:
+   The following syntactic forms, taken from [RFC 2234] or from [RFC
+   2822], are repeated here for convenience only:
 
       ALPHA           = %x41-5A /          ; A-Z
                         %x61-7A            ; a-z
@@ -90,11 +103,26 @@
                         %d12 /             ; carriage return, line feed,
                         %d14-31 /          ; and whitespace characters
                         %d127
+      specials        = "(" / ")" /        ; Special characters used in
+                        "<" / ">" /        ;  other parts of the syntax
+                        "[" / "]" /
+                        ":" / ";" /
+                        "@" / "\" /
+                        "," / "." /
+                        DQUOTE
       WSP             = SP / HTAB          ; Whitespace characters
       FWS             = ([*WSP CRLF] 1*WSP); Folding whitespace
+      ccontent        = ctext / quoted-pair / comment
+      comment         = "(" *([FWS] ccontent) [FWS] ")"
+      CFWS            = *([FWS] comment) (([FWS] comment) / FWS )
+      DQUOTE          = %d34              ; quote mark
+      quoted-pair     = "\" text
+
+
+
       atext           = ALPHA / DIGIT /
                         "!" / "#" /        ; Any character except
-                        "$" / "%" /        ; controls SP, and specials.
+                        "$" / "%" /        ; controls, SP, and specials.
                         "&" / "'" /        ; Used for atoms
                         "*" / "+" /
                         "-" / "/" /
@@ -106,15 +134,12 @@
       atom            = [CFWS] 1*atext [CFWS]
       dot-atom        = [CFWS] dot-atom-text [CFWS]
       dot-atom-text   = 1*atext *( "." 1*atext )
-      comment         = "(" *([FWS]
-                           (ctext / quoted-pair / comment)) [FWS] ")"
-      CFWS            = *([FWS] comment) (([FWS] comment) / FWS )
-      DQUOTE          = %d34              ; quote mark
-      quoted-pair     = "\" text
-
-      quoted-string   = [CFWS] DQUOTE
-                           *([FWS] (qtext / quoted-pair))
-                           [FWS] DQUOTE [CFWS]
+      qcontent        = qtext / quoted-pair
+      quoted-string   = [CFWS]
+                           DQUOTE *([FWS] qcontent) [FWS] DQUOTE
+                           [CFWS]
+      word            = atom / quoted-string
+      phrase          = 1*word
       unstructured    = *( [FWS] utext ) [FWS]
 
         NOTE: CFWS occurs at many places in the syntax in order to allow
@@ -130,7 +155,7 @@
 
         NOTE: Following [RFC 2234], literal text included in the syntax
         is to be regarded as case-insensitive.  However, in
-        contradistinction to [MESSFOR], the Netnews protocols are
+        contradistinction to [RFC 2822], the Netnews protocols are
         sensitive to case in some instances (as in newsgroup names, some
         header parameters, etc.). Care has been taken to indicate this
         explicitly where required.