INTERNET-DRAFT                               Charles H. Lindsey
Usenet Format Working Group                  University of Manchester
                                             July 2001

4.4.1. Character Sets within Article Headers

   Within article headers, characters are represented as octets
   according to the UTF-8 encoding scheme [RFC 2279] or [ISO/IEC 10646],
   and hence all the characters in Unicode [UNICODE 3.1] or in the
   Universal Multiple-Octet Coded Character Set (UCS) [ISO/IEC 10646]
   (which is essentially a superset of Unicode and expected to remain
   so) are potentially available. However, processing all octets in the
   same manner as US-ASCII characters should ensure correct behaviour in
   most situations.

        NOTE: UTF-8 is an encoding for 16-bit (and even 32-bit) character
        sets with the property that any octet less than 128 immediately
        represents the corresponding US-ASCII character, thus ensuring
        upwards compatibility with previous practice.  Non-ASCII
        characters from Unicode are represented by sequences of octets
        satisfying the syntax of a UTF8-xtra-char (2.4), which excludes
        certain octet sequences not explicitly permitted by [RFC 2279].
        Unicode includes all characters from the ISO-8859 series of
        character sets [ISO 8859] (which includes all Cyrillic, Greek
        and Arabic characters) together with the more elaborate
        characters used in Asian countries. See the following section
        for the appropriate treatment of Unicode characters by reading
        agents.
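
   The property described in the NOTE above can be illustrated with a
   minimal Python sketch. The helper names utf8_octets and split_header
   are hypothetical and not part of this standard; the sketch only
   assumes a standard UTF-8 codec.

```python
def utf8_octets(ch):
    """Return the UTF-8 octets of a string as a list of integers."""
    return list(ch.encode("utf-8"))


def split_header(raw):
    """Split a raw header line (bytes) at the first colon.

    No decoding is needed: every octet of a multi-octet UTF-8 sequence
    is 128 or greater, so it can never be mistaken for the US-ASCII
    colon (octet 58).
    """
    name, _, body = raw.partition(b":")
    return name, body.strip()


# A US-ASCII character is one octet below 128; the others are
# sequences in which every octet is 128 or greater.
for ch in ["A", "\u00e9", "\u0416", "\u65e5"]:
    octets = utf8_octets(ch)
    print("U+%04X" % ord(ch), ["0x%02X" % o for o in octets])

print(split_header("Subject: caf\u00e9".encode("utf-8")))
```

   This is why an agent that treats header octets in the same manner as
   US-ASCII can usually still locate header-names and delimiters
   correctly.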

   Notwithstanding the great flexibility permitted by UTF-8, there is a
   need for restraint in its use in order that the essential components
   of headers may be discerned using reading agents that cannot present
   the full Unicode range. In particular, header-names and tokens,
   together with certain other components of headers defined elsewhere
   in this standard (notably msg-ids, date-times, dot-atoms, domains
   and path-identities), MUST be in US-ASCII.  Comments, phrases
   (as in addresses) and unstructureds (as in Subject headers) MAY use
   the full range of UTF-8 characters, but SHOULD nevertheless be
   invariant under Unicode normalization NFC [UNICODE 3.1].
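
   A posting agent might check these two requirements roughly as in the
   following Python sketch. The function names are hypothetical, and
   the sketch assumes that the unicodedata module of the running
   interpreter is an acceptable stand-in for [UNICODE 3.1]; it
   implements whatever Unicode version ships with that interpreter.

```python
import unicodedata


def is_us_ascii(text):
    """True if every character lies in the US-ASCII range."""
    return all(ord(c) < 128 for c in text)


def is_nfc_invariant(text):
    """True if applying normalization NFC leaves the text unchanged."""
    return unicodedata.normalize("NFC", text) == text


def check_header(name, body):
    """Return a list of problems found in one header (illustrative only).

    Only the header-name and the NFC property of the body are checked;
    a real agent would also have to parse structured bodies to find
    msg-ids, date-times and the other US-ASCII-only components.
    """
    problems = []
    if not is_us_ascii(name):
        problems.append("header-name is not US-ASCII (MUST)")
    if not is_nfc_invariant(body):
        problems.append("body is not invariant under NFC (SHOULD)")
    return problems


print(check_header("Subject", "caf\u00e9"))    # precomposed e-acute
print(check_header("Subject", "cafe\u0301"))   # e + combining acute
```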

        NOTE: The effect of normalization NFC is to place composite
        characters (made by overlaying one character with another) into
        a canonical form (usually represented by a single character
        where one is available - thus E-acute is preferred over E
        followed by a non-spacing acute accent), and to make a
        consistent choice among equivalent forms (e.g. the Angstrom sign
        is replaced by A-ring). At least for the main European
        languages, for which all the needed composites are already
        available as single characters, it is unlikely that posting
        agents will need to take any special steps to ensure
        normalization.
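
   The two examples in the NOTE above can be reproduced with the Python
   sketch below. It assumes the unicodedata module from the Python
   standard library; to_nfc is a hypothetical helper name.

```python
import unicodedata


def to_nfc(text):
    """Return text in Unicode normalization form NFC."""
    return unicodedata.normalize("NFC", text)


# E followed by a non-spacing acute accent (U+0045 U+0301) is
# composed into the single character E-acute (U+00C9).
decomposed = "E\u0301"
composed = to_nfc(decomposed)
print(len(decomposed), "->", len(composed), "U+%04X" % ord(composed))

# The Angstrom sign (U+212B) is replaced by A-ring (U+00C5).
print("U+%04X" % ord(to_nfc("\u212b")))

# Text already in composed form is left unchanged.
print(to_nfc("\u00c9") == "\u00c9")
```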

   In the particular case of newsgroup-names (see 5.5) there are more
   stringent requirements regarding the use of UTF-8 and Unicode.

   Where the use of non-ASCII characters, encoded in UTF-8, is permitted
   as above, they MAY also be encoded using the MIME mechanism defined
   in [RFC 2047], but this usage is deprecated within news articles
   (even though it is required in mail messages) since it is less
   legible in older reading agents which support neither it nor UTF-8.
   Nevertheless, reading agents SHOULD support this usage, but only in
   those contexts explicitly mentioned in [RFC 2047].
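
   For a reading agent, decoding the [RFC 2047] form might look like
   the following Python sketch, which relies on the email.header module
   of the Python standard library. The function name
   decode_encoded_words is hypothetical, and the sketch does not check
   that the encoded-word appears in a context permitted by [RFC 2047];
   a real agent would need to do that separately.

```python
from email.header import decode_header, make_header


def decode_encoded_words(value):
    """Decode any RFC 2047 encoded-words in a header body to text."""
    return str(make_header(decode_header(value)))


# "Q" encoding of UTF-8 octets.
print(decode_encoded_words("=?UTF-8?Q?caf=C3=A9?="))

# "B" (base64) encoding of ISO-8859-1 octets.
print(decode_encoded_words("=?ISO-8859-1?B?Y2Fm6Q==?="))

# A body without encoded-words passes through unchanged.
print(decode_encoded_words("plain subject"))
```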

Previous draft (04): 4.4.1. Character Sets within Article Headers

Diffs to previous draft

--- {draft-04}	Wed Jul 11 21:55:15 2001
+++ {draft-05}	Wed Jul 11 21:55:16 2001
@@ -1,37 +1,55 @@
 
4.4.1.  Character Sets within Article Headers
    Within article headers, characters are represented as octets
-   according to the UTF-8 encoding scheme [ISO 10646] or [RFC 2279] and
-   hence all the characters in the Universal Multiple-Octet Coded
-   Character Set (UCS) [ISO 10646] (which is essentially a superset of
-   Unicode [UNICODE] and expected to remain so) are potentially
-   available. However, interpreting the octets directly as US-ASCII
-   characters should ensure correct behaviour in most situations.
+   according to the UTF-8 encoding scheme [RFC 2279] or [ISO/IEC 10646],
+   and hence all the characters in Unicode [UNICODE 3.1] or in the
+   Universal Multiple-Octet Coded Character Set (UCS) [ISO/IEC 10646]
+   (which is essentially a superset of Unicode and expected to remain
+   so) are potentially available. However, processing all octets in the
+   same manner as US-ASCII characters should ensure correct behaviour in
+   most situations.
 
         NOTE: UTF-8 is an encoding for 16bit (and even 32bit) character
         sets with the property that any octet less than 128 immediately
         represents the corresponding US-ASCII character, thus ensuring
         upwards compatibility with previous practice.  Non-ASCII
-        characters from UCS are represented by sequences of octets
-        satisfying the syntax of a UTF8-xtra-char (2.4).  Only those
-        octet sequences explicitly permitted by [RFC 2044] shall be
-        used.  UCS includes all characters from the ISO-8859 series of
-        characters sets [ISO 8859] (which includes all Greek and Arabic
-        characters) as well as the more elaborate characters used in
-        Japan and China. See the following section for the appropriate
-        treatment of UCS characters by reading agents.
+        characters from Unicode are represented by sequences of octets
+        satisfying the syntax of a UTF8-xtra-char (2.4), which excludes
+        certain octet sequences not explicitly permitted by [RFC 2279].
+        Unicode includes all characters from the ISO-8859 series of
+        characters sets [ISO 8859] (which includes all Cyrillic, Greek
+        and Arabic characters) together with the more elaborate
+        characters used in Asian countries. See the following section
+        for the appropriate treatment of Unicode characters by reading
+        agents.
 
    Notwithstanding the great flexibility permitted by UTF-8, there is
    need for restraint in its use in order that the essential components
    of headers may be discerned using reading agents that cannot present
-   the full UCS range. In particular, header-names and tokens MUST be in
-   US-ASCII, and certain other components of headers, as defined
+   the full Unicode range. In particular, header-names and tokens MUST
+   be in US-ASCII, and certain other components of headers, as defined
    elsewhere in this standard - notably msg-ids, date-times, dot-atoms,
    domains and path-identities - MUST be in US-ASCII.  Comments, phrases
    (as in addresses) and unstructureds (as in Subject headers) MAY use
-   the full range of UTF-8 characters. For newsgroup-names see 5.5.
+   the full range of UTF-8 characters, but SHOULD nevertheless be
+   invariant under Unicode normalization NFC [UNICODE 3.1].
+
+        NOTE: The effect of normalization NFC is to place composite
+        characters (made by overlaying one character with another) into
+        a canonical form (usually represented by a single character
+        where one is available - thus E-acute is preferred over E
+        followed by a non-spacing acute accent), and to make a
+        consistent choice among equivalent forms (e.g. the Angstrom sign
+        is replaced by A-ring). At least for the main European
+        languages, for which all the needed composites are already
+        available as single characters, it is unlikely that posting
+        agents will need to take any special steps to ensure
+        normalization.
+
+   In the particular case of newsgroup-names (see 5.5) there are more
+   stringent requirements regarding the use of UTF-8 and Unicode.
 
    Where the use of non-ASCII characters, encoded in UTF-8, is permitted
-   as above, they MAY also be encoded using the Mime mechanism defined
+   as above, they MAY also be encoded using the MIME mechanism defined
    in [RFC 2047], but this usage is deprecated within news articles
    (even though it is required in mail messages) since it is less
    legible in older reading agents which support neither it nor UTF-8.