INTERNET-DRAFT                               Charles H. Lindsey
Usenet Format Working Group                  University of Manchester
                                             July 2001

5.5. Newsgroups

   The Newsgroups header's content specifies the newsgroup(s) in which
   the article is intended to appear. It is an inheritable header
   (4.2.2.2) which then becomes the default Newsgroups header of any
   followup, unless a Followup-To header is present to prescribe
   otherwise.
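
   As an illustration only, the inheritance rule can be sketched in
   Python. The function name and the dictionary representation of the
   precursor's headers are assumptions made for this example, and any
   special Followup-To values defined elsewhere in this standard are
   not handled.

```python
def followup_newsgroups(precursor_headers):
    """Return the default Newsgroups content for a followup.

    precursor_headers is a hypothetical dict mapping header names to
    their content, taken from the article being followed up.
    """
    followup_to = precursor_headers.get("Followup-To")
    if followup_to:
        # A Followup-To header prescribes otherwise.
        return followup_to
    # Otherwise the Newsgroups header is inherited unchanged.
    return precursor_headers["Newsgroups"]
```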

   References to "Unicode" or "the latest version of the Unicode
   Standard" mean [UNICODE 3.1], which contains guarantees of strict
   future upwards compatibility (e.g. no character will be removed or
   change classification). Implementors should be aware that currently
   unassigned code points (Unicode category Cn) may become valid
   characters in future versions of Unicode. Since the poster of an
   article might have access to a newer version of that standard,
   relaying and serving agents MUST accept such characters, but posting
   agents (and indeed all agents) MUST NOT generate them.
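
   A minimal sketch of the generating side of this rule, assuming
   Python's standard unicodedata module. The function names are
   hypothetical, and the answer depends on the Unicode version bundled
   with the interpreter (unicodedata.unidata_version), which is exactly
   why relaying and serving agents cannot use such a test to reject
   articles.

```python
import unicodedata


def unassigned_code_points(name):
    """List the characters of name that are in category Cn according
    to the Unicode tables shipped with this Python build."""
    return [ch for ch in name if unicodedata.category(ch) == "Cn"]


def may_generate(name):
    """True if an agent using these tables may generate name.

    Relaying and serving agents should not call this to refuse an
    article: the poster may have used a newer version of Unicode.
    """
    return not unassigned_code_points(name)
```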

      Newsgroups-content  = newsgroup-name
                               *( *FWS ng-delim *FWS newsgroup-name )
                               *FWS
      newsgroup-name      = component *( "." component )
      component           = 1*component-glyph
      ng-delim            = ","
      component-glyph     = combiner-base *combiner-mark
      combiner-base       = combiner-ASCII / combiner-extended
      combiner-ASCII      = "0"-"9" / %x41-5A / %x61-7A / "+" / "-" / "_"
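      combiner-extended   = <any character with a Unicode code value of
                             0080 or greater and a combining class of 0,
                             but excluding any character in Unicode
                             categories Cc, Cf, Cs, Zs, Zl, and Zp>
      combiner-mark       = <any character with a Unicode code value of
                             0080 or greater and a combining class other
                             than 0>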

        NOTE: the excluded characters are control characters (Cc),
        format control characters (Cf), surrogates (Cs), and separators
        (Zs, Zl, Zp). In particular, this excludes all whitespace
        characters.
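
   The grammar can be exercised with a short Python sketch. It takes
   combiner-extended to be a non-ASCII character of combining class 0
   outside categories Cc, Cf, Cs, Zs, Zl and Zp, and combiner-mark to
   be a non-ASCII character of non-zero combining class. The function
   names are hypothetical and the sketch relies on the Unicode tables
   in Python's unicodedata module; it is not a normative parser.

```python
import string
import unicodedata

EXCLUDED_CATEGORIES = {"Cc", "Cf", "Cs", "Zs", "Zl", "Zp"}
COMBINER_ASCII = set(string.ascii_letters + string.digits + "+-_")


def is_combiner_base(ch):
    if ord(ch) < 0x80:
        return ch in COMBINER_ASCII
    return (unicodedata.combining(ch) == 0
            and unicodedata.category(ch) not in EXCLUDED_CATEGORIES)


def is_combiner_mark(ch):
    return ord(ch) >= 0x80 and unicodedata.combining(ch) != 0


def split_glyphs(component):
    """Split one component into component-glyphs, or raise ValueError."""
    glyphs = []
    for ch in component:
        if is_combiner_base(ch):
            glyphs.append(ch)          # starts a new component-glyph
        elif is_combiner_mark(ch) and glyphs:
            glyphs[-1] += ch           # attaches to the preceding base
        else:
            raise ValueError("character U+%04X not allowed here" % ord(ch))
    if not glyphs:
        raise ValueError("empty component")
    return glyphs


def parse_newsgroup_name(name):
    """Return a list of components, each a list of component-glyphs."""
    return [split_glyphs(component) for component in name.split(".")]
```

   For instance, "comp.lang.c++" yields three components, and a "q"
   followed by U+0302 (combining circumflex) yields one component
   holding a single two-character component-glyph.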

   Each component MUST be invariant under Unicode normalization NFKC
   (cf. the weaker normalization requirement for other headers in
   section 4.4.1 which specified no more than normalization NFC).


        NOTE: Alternatively, this restriction could have been expressed
        by saying:
          o All characters with a compatibility decomposition are
            forbidden;
        or else
          o All characters with property NFKC-NO are forbidden.
        The effect is to exclude variant forms of characters, such as
        superscripts and subscripts, wide and narrow forms, font
        variants, encircled forms, ligatures, and so on, as their use
        could cause confusion.

        As a result of this restriction, a name has only one valid
        form. Implementations can assume that a straight comparison of
        characters or octets is sufficient to compare two newsgroup-
        names.

        NOTE: An implementation is not required to apply NFKC, or any
        other normalization, to newsgroup names. Only agencies that
        create new groups need to be careful to obey this restriction
        (7.1).  However, if a posting agent neglects to normalize a
        newsgroup-name entered manually, this may lead to the user
        posting to a non-existent group without understanding why.
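
   One way to test the invariance requirement, sketched with Python's
   standard unicodedata.normalize function. The helper name is
   hypothetical, and the result reflects the Unicode version shipped
   with the interpreter rather than [UNICODE 3.1] specifically.

```python
import unicodedata


def is_nfkc_invariant(component):
    """True if applying NFKC leaves the component unchanged."""
    return unicodedata.normalize("NFKC", component) == component
```

   Under this check "caf" followed by the precomposed U+00E9 passes,
   whereas "x" followed by U+00B2 (superscript two) and the ligature
   U+FB01 do not, since each has a compatibility decomposition.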

   Newsgroup-names containing non-ASCII characters MUST be encoded in
   UTF-8 and not according to [RFC 2047].

   Components beginning with underline ("_") are reserved for use by
   future versions of this standard and MUST NOT occur in newsgroup
   names (whether in Newsgroups headers or in newgroup control messages
   (7.1)).  However, such names MUST be accepted.

   Components beginning with "+" or "-" are reserved for use by
   implementations and MUST NOT occur in newsgroup names (whether in
   Newsgroups headers or in newgroup control messages). Implementors may
   assume that this rule will not change in any future version of this
   standard.

        NOTE: For example, implementors may safely use leading "+" and
        "-" to "escape" other entities within something that looks like
        a newsgroup-name.
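
   A check that an agent creating or generating names might apply to
   these two reservations is sketched below. The function name and the
   wording of the returned reasons are assumptions. An agent that
   merely receives such a name is still bound by the acceptance
   requirement stated above.

```python
def reserved_reason(name):
    """Return why name uses a reserved component, or None if it does not."""
    for component in name.split("."):
        if component.startswith("_"):
            return "reserved for future versions of the standard"
        if component.startswith(("+", "-")):
            return "reserved for use by implementations"
    return None
```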

   Agencies responsible for the administration of particular hierarchies
   Ought to place additional restrictions on the characters they allow
   in newsgroup-names within those hierarchies (such as to accord with
   the languages commonly used within those hierarchies, or to avoid
   perceived ambiguities pertinent to those languages). Where there is
   no such specific policy, the following restrictions SHOULD be applied
   to newsgroup names.

        NOTE: These restrictions are intended to reflect existing
        practice, with some additions to accommodate foreseeable
        enhancements, and are intended both to avoid certain technical
        difficulties and to avoid unnecessary confusion. It may well be
        that experience will allow future extensions to this standard to
        relax some or all of these restrictions.



   The specific restrictions (to be applied in the absence of
   established policies to the contrary) are:

   1. The following characters are forbidden, subject to the comments
      and notes at the end of the list:

      characters in category Cn (Other, Not assigned)         [1]
      characters in category Co (Other, Private Use)          [2]
      characters in category Lt (Letter, Titlecase)           [3]
      characters in category Lu (Letter, Uppercase)           [3]
      characters in category Me (Mark, Enclosing)             [4]
      characters in category Pd (Punctuation, Dash)           [4][5]
      characters in category Pe (Punctuation, Close)          [4]
      characters in category Pf (Punctuation, Final quote)    [4]
      characters in category Pi (Punctuation, Initial quote)  [4]
      characters in category Po (Punctuation, Other)          [4]
      characters in category Ps (Punctuation, Open)           [4]
      characters in category Sc (Symbol, Currency)            [4]
      characters in category Sk (Symbol, Modifier)            [4]
      characters in category Sm (Symbol, Math)                [4][5]
      characters in category So (Symbol, Other)               [4]

      [1] As new characters are added to Unicode, their code points move
          from category Cn to some other category. As stated above,
          implementors should be prepared for this.

      [2] Specific private use characters can be used within a hierarchy
          or co-operating subnet that has agreed meanings for them.

      [3] Traditionally, newsgroup-names have been written in lowercase.
          Posting agents MAY convert these characters to the
          corresponding lowercase forms.
      [That may be better left unsaid, or rewritten]

      [4] Traditionally newsgroup names have only used letters, digits,
          and the three special characters "+", "-" and "_". These
          categories correspond to characters outside that set.

      [5] Although the characters "+" and "-" are within categories Pd
          and Sm, they are not forbidden.

   2. A component name is forbidden to consist entirely of digits.

        NOTE: This requirement was in [RFC 1036] but nevertheless
        several such groups have appeared in practice and implementors
        should be prepared for them. A common implementation technique
        uses each component as the name of a directory and uses numeric
        filenames for each article within a group. Such an
        implementation needs to be careful when this could cause a clash
        (e.g. between article 123 of group xxx.yyy and the directory for
        group xxx.yyy.123).
[Open issue: a number of people think this should not be a default
requirement but simply be a NOTE; wording for such is further down.]

   3. A component is limited to 30 component-glyphs and a newsgroup-name
      to 71 component-glyphs. Whilst there is no longer any technical
      reason to limit the length of a component (formerly, it was
      limited to 14 octets) nor of a newsgroup-name, it should be noted
      that these names are also used in the newsgroups line (7.1.2)
      where an overall policy limit applies and, moreover, excessively
      long names can be exceedingly inconvenient in practical use.

        NOTE: To all intents and purposes, a component-glyph is what a
        user might regard as a single "character" as displayed on his
        screen, though it might be transmitted as several actual
        characters (e.g. q-circumflex is two characters).
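
   The three default restrictions can be sketched together in Python.
   Several points here are assumptions rather than statements of this
   standard: "digits" is read as Unicode category Nd, the 71 limit is
   counted over component-glyphs only (the "." separators are not
   counted), a component-glyph is counted as one character of combining
   class 0 in an already well-formed name, and all names are
   hypothetical.

```python
import unicodedata

FORBIDDEN_CATEGORIES = {"Cn", "Co", "Lt", "Lu", "Me", "Pd", "Pe", "Pf",
                        "Pi", "Po", "Ps", "Sc", "Sk", "Sm", "So"}
EXEMPT_CHARACTERS = {"+", "-"}      # note [5]
MAX_COMPONENT_GLYPHS = 30
MAX_NAME_GLYPHS = 71


def glyph_count(text):
    """Count component-glyphs: each character of combining class 0
    starts one (assumes text already matches the syntax)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


def policy_violations(name):
    """Return a list of default-policy problems found in name."""
    problems = []
    components = name.split(".")
    for component in components:
        # Restriction 1: forbidden categories.
        for ch in component:
            category = unicodedata.category(ch)
            if category in FORBIDDEN_CATEGORIES and ch not in EXEMPT_CHARACTERS:
                problems.append("%s: U+%04X is in category %s"
                                % (component, ord(ch), category))
        # Restriction 2: all-digit components.
        if component and all(unicodedata.category(ch) == "Nd"
                             for ch in component):
            problems.append("%s: consists entirely of digits" % component)
        # Restriction 3: length of a component.
        if glyph_count(component) > MAX_COMPONENT_GLYPHS:
            problems.append("%s: more than %d component-glyphs"
                            % (component, MAX_COMPONENT_GLYPHS))
    # Restriction 3: length of the whole name.
    if sum(glyph_count(c) for c in components) > MAX_NAME_GLYPHS:
        problems.append("%s: more than %d component-glyphs"
                        % (name, MAX_NAME_GLYPHS))
    return problems
```

   A hierarchy with its own policy would replace the tables and limits
   at the top rather than the logic.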

   Serving and relaying agents MUST accept any newsgroup-name that meets
   the above requirements, even if it violates one or more of the
   policy restrictions. Posting and injecting agents MAY reject articles
   containing newsgroup-names that do not meet these restrictions, and
   posting agents MAY attempt to correct them (e.g. by lowercasing).
   However, because of the large and changing tables required to do
   these checks and corrections throughout the whole of Unicode, this
   standard does not require them to do so. Rather, the onus is placed
   on those who create new newsgroups (7.1) to check the mandatory
   requirements, to consider the effects of relaxing the other
   restrictions, and to consider how all this may affect propagation of
   the group.

   Since future extensions to this standard and the Unicode standard,
   plus any relaxations of the default restrictions introduced by
   specific hierarchies, might invalidate some such checks, warnings,
   and adjustments, implementations MUST incorporate means to disable
   them. In particular, implementations must be prepared for a
   relaxation of the normalization requirements (e.g. from NFKC down to
   NFC), which have been made rather stringent due to a lack of
   practical experience in this area.
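
   One possible shape for such a means of disabling checks is a small
   configuration object, sketched here in Python. The class, its field
   names and the helper function are all hypothetical. The
   normalization field can be switched from "NFKC" to "NFC", or to
   None, to model the relaxation anticipated above.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass
class NameCheckConfig:
    lowercase: bool = True                  # attempt correction by lowercasing
    normalization: Optional[str] = "NFKC"   # "NFKC", "NFC", or None to disable


def prepare_name(name, config):
    """Apply only the corrections and checks that config enables."""
    if config.lowercase:
        name = name.lower()
    form = config.normalization
    if form is not None and unicodedata.normalize(form, name) != name:
        raise ValueError("%r is not invariant under %s" % (name, form))
    return name
```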

[Alternative text for Open issue]

        NOTE: Components composed entirely of digits were forbidden by
        [RFC 1036] but have nevertheless been used in practice, and are
        therefore permitted by this specification. A common
        implementation technique uses each component as the name of a
        directory and uses numeric filenames for each article within a
        group. Such an implementation needs to be careful when this
        could cause a clash (e.g. between article 123 of group xxx.yyy
        and the directory for group xxx.yyy.123).
[Open issue: delete the above text if we retain the default requirement
above.]

        NOTE: The newsgroup-name as encoded in UTF-8 should be regarded
        as the canonical form. Reading agents may convert it to whatever
        character set they are able to display (see 4.4.1) and serving
        agents may possibly need to convert it to some form more
        suitable as a filename. Simple algorithms for both kinds of
        conversion are readily available.  Observe that the syntax does
        not allow comments within the Newsgroups header; this is to
        simplify processing by relaying and serving agents which have a
        requirement to process this header extremely rapidly.
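
   As a sketch of one such filename conversion, the Python below maps
   each component to a directory name by percent-encoding the UTF-8
   octets of non-ASCII characters. The spool location, the choice of
   percent-encoding and the function names are assumptions, not part
   of this standard. The sketch also shows the clash that arises with
   all-digit components when articles are stored under numeric
   filenames.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def group_dir(spool, group):
    """Directory for a group: one path element per component."""
    # quote() percent-encodes the UTF-8 octets of non-ASCII characters;
    # "+" is kept literal alongside letters, digits, "-" and "_".
    parts = [quote(component, safe="+") for component in group.split(".")]
    return PurePosixPath(spool).joinpath(*parts)


def article_path(spool, group, number):
    """File for one article: its number inside the group directory."""
    return group_dir(spool, group) / str(number)
```

   With these definitions, article 123 of group xxx.yyy and the
   directory for group xxx.yyy.123 resolve to the same path, so an
   implementation using this layout needs some way to tell them apart.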

   The inclusion of folding white space within a Newsgroups-content is a
   newly introduced feature in this standard. It MUST be accepted by all
   conforming implementations (relaying agents, serving agents and
   reading agents).  Posting agents should be aware that such postings
   may be rejected by overly-critical old-style relaying agents. When a
   sufficient number of relaying agents are in conformance, posting
   agents SHOULD generate such whitespace in the form of <CRLF WSP> so as
   to keep the length of lines in the relevant headers (notably
   Newsgroups and Followup-To) to no more than 79 characters (or
   other agreed policy limit - see 4.5).  Before such critical mass
   occurs, injecting agents MAY reformat such headers by removing
   whitespace inserted by the posting agent, but relaying agents MUST
   NOT do so.
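
   A posting agent might fold the header along the following lines.
   The sketch assumes that a fold is a CRLF followed by a single space
   placed after an ng-delim, and that the limit is counted in
   characters; the function name and default limit parameter are
   hypothetical.

```python
def fold_newsgroups(names, limit=79):
    """Build a Newsgroups header, folding after commas where needed."""
    lines = []
    current = "Newsgroups: "
    for index, name in enumerate(names):
        # Every name except the last is followed by the delimiter.
        piece = name + ("," if index < len(names) - 1 else "")
        if index > 0 and len(current) + len(piece) > limit:
            lines.append(current)
            current = " " + piece      # continuation line starts with WSP
        else:
            current += piece
    lines.append(current)
    # A single name longer than the limit cannot be folded and is left
    # on an over-long line.
    return "\r\n".join(lines)
```

   Removing each CRLF together with the space that follows it recovers
   the unfolded header, which is what an injecting agent would do when
   reformatting as permitted above.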

   Posters SHOULD use only the names of existing newsgroups in the
   Newsgroups header. However, it is legitimate to cross-post to
   newsgroup(s) which do not exist on the posting agent's host, provided
   that at least one of the newsgroups DOES exist there, and followup
   agents SHOULD accept this (posting agents MAY accept it, but Ought at
   least to alert the poster to the situation and request confirmation).
   Relaying agents MUST NOT rewrite Newsgroups headers in any way, even
   if some or all of the newsgroups do not exist on the relaying agent's
   host. Serving agents MUST NOT create new newsgroups simply because an
   unrecognised newsgroup-name occurs in a Newsgroups header (see 7.1
   for the correct method of newsgroup creation).
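
   The cross-posting condition can be sketched as follows, assuming
   the agent holds the set of groups known on its host. The function
   and parameter names are hypothetical.

```python
def check_crosspost(newsgroups, local_groups):
    """Return (acceptable, unknown).

    acceptable is True when at least one listed group exists locally;
    unknown lists the groups a posting agent might ask the poster to
    confirm.
    """
    unknown = [group for group in newsgroups if group not in local_groups]
    acceptable = len(unknown) < len(newsgroups)
    return acceptable, unknown
```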

   The Newsgroups header is intended for use in Netnews articles rather
   than in mail messages. It MAY be used in a mail message to indicate
   that it is a copy also posted to the listed newsgroups, but it SHOULD
   NOT be used in a mail-only reply to a Netnews article (thus the
   "inheritable" property of this header applies only to followups to a
   newsgroup, and not to followups to the poster). Moreover, if a
   newsgroup-name contains any non-ASCII character, it MAY be encoded
   using the mechanism defined in [RFC 2047] when sent by mail but, if
   it is subsequently returned to the Netnews environment, it MUST then
   be re-encoded into UTF-8.
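
   A gateway returning such a header from mail to Netnews might
   re-encode it as sketched below, using the decode_header and
   make_header functions of Python's standard email.header module. The
   function name and the example group name are hypothetical.

```python
from email.header import decode_header, make_header


def newsgroups_from_mail(value):
    """Decode any RFC 2047 encoded-words in value and return the
    Newsgroups content as UTF-8 octets."""
    text = str(make_header(decode_header(value)))
    return text.encode("utf-8")
```

   For example, "=?UTF-8?Q?norsk.f=C3=B8o?=" becomes the UTF-8 octets
   of "norsk.f" followed by U+00F8 and "o", while a purely ASCII value
   passes through unchanged.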

Previous draft (04): 5.5. Newsgroups

Diffs to previous draft

--- {draft-04}	Wed Jul 11 21:55:22 2001
+++ {draft-05}	Wed Jul 11 21:55:23 2001
@@ -5,20 +5,212 @@
    followup, unless a Followup-To header is present to prescribe
    otherwise.
 
+   References to "Unicode" or "the latest version of the Unicode
+   Standard" mean [UNICODE 3.1] contains guarantees of strict future
+   upwards compatibility (e.g. no character will be removed or change
+   classification). Implementors should be aware that currently
+   unassigned code points (Unicode category Cn) may become valid
+   characters in future versions of Unicode. Since the poster of an
+   article might have access to a newer version of that standard,
+   relaying and serving agents MUST accept such characters, but posting
+   agents (and indeed all agents) MUST NOT generate them.
+
       Newsgroups-content  = newsgroup-name
                                *( *FWS ng-delim *FWS newsgroup-name )
                                *FWS
       newsgroup-name      = component *( "." component )
-      component           = component-start
-                               *( component-start / component-other )
-      component-start     = Un-lowercase / Un-digit
-      Un-lowercase        = <Unicode Letter, Lowercase> /
-                            <Unicode Letter, Other>
-      Un-digit            = <Unicode Number, Decimal Digit> /
-                            <Unicode Number, Other>
-      component-other     = "+" / "-" / "_"
+      component           = 1*component-glyph
       ng-delim            = ","
-   where the <Unicode ...> items are as described in [UNICODE].
+      component-glyph     = combiner-base *combiner-mark
+      combiner-base       = combiner-ASCII / combiner-extended
+      combiner-ASCII      = "0"-"9" / %x41-5A / %x61-7A / "+" / "-" / "_"
+      combiner-extended   = <any character with a Unicode code value of
+                             0080 or greater and a combining class of 0,
+                             but excluding any character in Unicode
+                             categories Cc, Cf, Cs, Zs, Zl, and Zp>
+      combiner-mark       = <any character with a Unicode code value of
+                             0080 or greater and a combining class other
+                             than 0>
+
+        NOTE: the excluded characters are control characters (Cc),
+        format control characters (Cf), surrogates (Cs), and separators
+        (Zs, Zl, Zp). In particular, this excludes all whitespace
+        characters.
+
+   Each component MUST be invariant under Unicode normalization NFKC
+   (cf. the weaker normalization requirement for other headers in
+   section 4.4.1 which specified no more than normalization NFC).
+
+
+        NOTE: Alternatively, this restriction could have been expressed
+        by saying:
+          o All characters with a compatibility decomposition are
+            forbidden;
+        or else
+          o All characters with property NFKC-NO are forbidden.
+        The effect is to exclude variant forms of characters, such as
+        superscripts and subscripts, wide and narrow forms, font
+        variants, encircled forms, ligatures, and so on, as their use
+        could cause confusion.
+
+        As a result of of this restriction, a name has only one valid
+        form. Implementations can assume that a straight comparison of
+        characters or octets is sufficient to compare two newsgroup-
+        names.
+
+        NOTE: An implementation is not required to apply NFKC, or any
+        other normalization, to newsgroup names. Only agencies that
+        create new groups need to be careful to obey this restriction
+        (7.1).  However, if a posting agent neglects to normalize a
+        newsgroup-name entered manually, this may lead to the user
+        posting to a non-existent group without understanding why.
+
+   Newsgroup-names containing non-ASCII characters MUST be encoded in
+   UTF-8 and not according to [RFC 2047].
+
+   Components beginning with underline ("_") are reserved for use by
+   future versions of this standard and MUST NOT occur in newsgroup
+   names (whether in Newsgroup headers or in newgroup control messages
+   (7.1)).  However, such names MUST be accepted.
+
+   Components beginning with "+" or "-" are reserved for use by
+   implementations and MUST NOT occur in newsgroup names (whether in
+   Newsgroup headers or in newgroup control messages). Implementors may
+   assume that this rule will not change in any future version of this
+   standard.
+
+        NOTE: For example, implementors may safely use leading "+" and
+        "-" to "escape" other entities within something that looks like
+        a newsgroup-name.
+
+   Agencies responsible for the administration of particular hierarchies
+   Ought to place additional restrictions on the characters they allow
+   in newsgroup-names within those hierarchies (such as to accord with
+   the languages commonly used within those hierarchies, or to avoid
+   perceived ambiguities pertinent to those languages). Where there is
+   no such specific policy, the following restrictions SHOULD be applied
+   to newsgroup names.
+
+        NOTE: These restrictions are intended to reflect existing
+        practice, with some additions to accomodate foreseeable
+        enhancements, and are intended both to avoid certain technical
+        difficulties and to avoid unnecessary confusion. It may well be
+        that experience will allow future extensions to this standard to
+        relax some or all of these restrictions.
+
+
+
+   The specific restrictions (to be applied in the absence of
+   established policies to the contrary) are:
+
+   1. The following characters are forbidden, subject to the comments
+      and notes at the end of the list:
+
+      characters in category Cn (Other, Not assigned)         [1]
+      characters in category Co (Other, Private Use)          [2]
+      characters in category Lt (Letter, Titlecase)           [3]
+      characters in category Lu (Letter, Uppercase)           [3]
+      characters in category Me (Mark, Enclosing)             [4]
+      characters in category Pd (Punctuation, Dash)           [4][5]
+      characters in category Pe (Punctuation, Close)          [4]
+      characters in category Pf (Punctuation, Final quote)    [4]
+      characters in category Pi (Punctuation, Initial quote)  [4]
+      characters in category Po (Punctuation, Other)          [4]
+      characters in category Ps (Punctuation, Open)           [4]
+      characters in category Sc (Symbol, Currency)            [4]
+      characters in category Sk (Symbol, Modifier)            [4]
+      characters in category Sm (Symbol, Math)                [4][5]
+      characters in category So (Symbol, Other)               [4]
+
+      [1] As new characters are added to Unicode, the code point moves
+          from category Cn to some other category. As stated above,
+          implementors should be prepared for this.
+
+      [2] Specific private use characters can be used within a hierarchy
+          or co-operating subnet that has agreed meanings for them.
+
+      [3] Traditionally, newsgroup-names have been written in lowercase.
+          Posting agents MAY convert these characters to the
+          corresponding lowercase forms.
+      [That may be better left unsaid, or rewritten]
+
+      [4] Traditionally newsgroup names have only used letters, digits,
+          and the three special characters "+", "-" and "_". These
+          categories correspond to characters outside that set.
+
+      [5] Although the characters "+" and "-" are within categories Pd
+          and Sm, they are not forbidden.
+
+   2. A component name is forbidden to consist entirely of digits.
+
+        NOTE: This requirement was in [RFC 1036] but nevertheless
+        several such groups have appeared in practice and implementors
+        should be prepared for them. A common implementation technique
+        uses each component as the name of a directory and uses numeric
+        filenames for each article within a group. Such an
+        implementation needs to be careful when this could cause a clash
+        (e.g. between article 123 of group xxx.yyy and the directory for
+        group xxx.yyy.123).
+[Open issue a number of people think this should not be a default
+requirement but simply be a NOTE; wording for such is further down.]
+
+   3. A component is limited to 30 component-glyphs and a newsgroup-name
+      to 71 component-glyphs. Whilst there is no longer any technical
+      reason to limit the length of a component (formerly, it was
+      limited to 14 octets) nor of a newsgroup-name, it should be noted
+      that these names are also used in the newsgroups line (7.1.2)
+      where an overall policy limit applies and, moreover, excessively
+      long names can be exceedingly inconvenient in practical use.
+
+        NOTE: To all intents and purposes, a component-glyph is what a
+        user might regard as a single "character" as displayed on his
+        screen, though it might be transmitted as several actual
+        characters (e.g. q-circumflex is two characters).
+
+   Serving and relaying agents MUST accept any newsgroup-name that meets
+   the above requirements, even if they violate one or more of the
+   policy restrictions. Posting and injecting agents MAY reject articles
+   containing newsgroup-names that do not meet these restrictions, and
+   posting agents MAY attempt to correct them (e.g. by lowercasing).
+   However, because of the large and changing tables required to do
+   these checks and corrections throughout the whole of Unicode, this
+   standard does not require them to do so. Rather, the onus is placed
+   on those who create new newsgroups (7.1) to check the mandatory
+   requirements, to consider the effects of relaxing the other
+   restrictions, and to consider how all this may affect propagation of
+   the group.
+
+   Since future extensions to this standard and the Unicode standard,
+   plus any relaxations of the default restrictions introduced by
+   specific hierarchies, might invalidate some such checks, warnings,
+   and adjustments, implementations MUST incorporate means to disable
+   them. In particular, implementations must be prepared for a
+   relaxation of the normalization requirements (e.g. from NFKC down to
+   NFC), which have been made rather stringent due to a lack of
+   practical experience in this area.
+
+[Alternative text for Open issue]
+
+        NOTE: Components composed entirely of digits were forbidden by
+        [RFC 1036] but have nevertheless been used in practice, and are
+        therefore permitted by this specification. A common
+        implementation technique uses each component as the name of a
+        directory and uses numeric filenames for each article within a
+        group. Such an implementation needs to be careful when this
+        could cause a clash (e.g. between article 123 of group xxx.yyy
+        and the directory for group xxx.yyy.123).
+[Open issue: delete the above text if we retain the default requirement
+above.]
+
+        NOTE: The newsgroup-name as encoded in UTF-8 should be regarded
+        as the canonical form. Reading agents may convert it to whatever
+        character set they are able to display (see 4.4.1) and serving
+        agents may possibly need to convert it to some form more
+        suitable as a filename. Simple algorithms for both kinds of
+        conversion are readily available.  Observe that the syntax does
+        not allow comments within the Newsgroups header; this is to
+        simplify processing by relaying and serving agents which have a
+        requirement to process this header extremely rapidly.
 
    The inclusion of folding white space within a Newsgroups-content is a
    newly introduced feature in this standard. It MUST be accepted by all
@@ -33,60 +225,6 @@
    occurs, injecting agents MAY reformat such headers by removing
    whitespace inserted by the posting agent, but relaying agents MUST
    NOT do so.
-
-   A newsgroup-name consists of one or more components. Components MAY
-   contain non-ASCII letters, but these MUST be encoded in UTF-8 and not
-   according to [RFC 2047].  A component MUST contain at least one
-   letter (and MUST, according to the syntax, begin with a letter or
-   digit). Components SHOULD begin with a letter.  Composite characters
-   (made by overlaying one character with another) and format
-   characters, as allowed in certain parts of Unicode and needed by
-   certain languages, must use whatever canonical conventions apply to
-   those parts of Unicode (such conventions are not defined in this
-   Standard). The use of "_" in a component is deprecated. Serving
-   agents MAY refuse to accept newsgroups using such a component.
-
-        NOTE: Components composed entirely of digits would cause
-        problems for the commonly used implementation technique of using
-        the component as the name of a directory, whilst also using
-        sequential numbers to distinguish the articles within a group.
-        Components containing other non-permitted characters could cause
-        problems when newsgroup-names appear in URLs [RFC 1738] (for
-        example an '@' character would prevent distinguishing between
-        newsgroup-names and message identifiers).
-
-        NOTE: According to the syntax, uppercase letters cannot occur in
-        newsgroup-names, but this standard imposes no requirement on
-        software to check this condition, since it would be unreasonable
-        to expect it to do so in parts of Unicode for which it was not
-        configured (in general, a table lookup is required). Rather, it
-        is the responsibility of those creating new newsgroups (7.1) not
-        to violate it. It is, moreover, to be expected that a newsgroup
-        created in violation of this condition will not be propagated
-        particularly well.
-
-   Whilst there is no longer any technical reason to limit the length of
-   a component (formerly, it was limited to 14 characters) nor to limit
-   the total length of a newsgroup-name, it should be noted that these
-   names are also used in the newsgroups line (7.1.2) where an overall
-   policy limit applies, and moreover excessively long names can be
-   exceedingly inconvenient in practical use.  Agencies responsible for
-   individual hierarchies Ought therefore, as a matter of policy, to set
-   reasonable limits for the length of a component and of a newsgroup-
-   name. In the absence of such explicit policies, the default limits
-   are 30 characters and 71 characters respectively.
-[If the checkpolicies proposal is included in the Standard, there should
-be a reference to it here.]
-
-        NOTE: The newsgroup-name as encoded in UTF-8 should be regarded
-        as the canonical form. Reading agents may convert it to whatever
-        character set they are able to display (see 4.4.1) and serving
-        agents may possibly need to convert it to some form more
-        suitable as a filename. Simple algorithms for both kinds of
-        conversion are readily available.  Observe that the syntax does
-        not allow comments within the Newsgroups header; this is to
-        simplify processing by relaying and serving agents which have a
-        requirement to process this header extremely rapidly.
 
    Posters SHOULD use only the names of existing newsgroups in the
    Newsgroups header. However, it is legitimate to cross-post to