Discussion: Why SOME chars nonASCII?
q***@outlook.com
2021-06-23 23:17:19 UTC
It seems absurd to me that a recent [few years] fad is to make some
chars 2-bytes, amongst existing one-byte-ASCII-strings.
What is the motive for this?
-- CRG
Eli the Bearded
2021-06-24 00:57:09 UTC
Post by q***@outlook.com
It seems absurd to me that a recent [few years] fad is to make some
chars 2-bytes, amongst existing one-byte-ASCII-strings.
What is the motive for this?
Mostly it is because of all of the non-English languages that don't fit
in the seven bits of ASCII. Even the eight bit ISO-8859-x family doesn't
cover lots of well-used languages. UTF-8 gives you most living languages
and many dead ones. UTF-8 isn't strictly "2-bytes", it is a variable
width encoding with ASCII compatibility for ASCII characters. High bit
sequences can be two, three, or four octets.
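For example: 'A' (U+0041) stays the single octet 41; the pound sign £ (U+00A3)
becomes the two octets C2 A3; the euro sign € (U+20AC) becomes the three octets
E2 82 AC; and an emoji such as U+1F600 becomes the four octets F0 9F 98 80.
Anything that is pure ASCII stays byte-for-byte identical.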

C encoding verifier I wrote:

/* Bit patterns for legitimate UTF-8:
 *
 * non-highbit:
 *   0bbbbbbb
 * two octet highbit:
 *   110bbbbb 10bbbbbb
 * three octet highbit:
 *   1110bbbb 10bbbbbb 10bbbbbb
 * four octet highbit:
 *   11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
 */

/* low bit (no highbit)
 * 0bbbbbbb
 * note that null is low bit
 */
#define UTF8_LOWBIT(oct) (0x00 == ((oct) & 0x80))

/* any continuation octet
 * 10bbbbbb
 */
#define UTF8_CONTINUATION(oct) (0x80 == ((oct) & 0xC0))

/* start of two octet
 * 110bbbbb
 */
#define UTF8_SEQUENCE_2(oct) (0xC0 == ((oct) & 0xE0))

/* start of three octet
 * 1110bbbb
 */
#define UTF8_SEQUENCE_3(oct) (0xE0 == ((oct) & 0xF0))

/* start of four octet
 * 11110bbb
 */
#define UTF8_SEQUENCE_4(oct) (0xF0 == ((oct) & 0xF8))


/* checks a string str of length len for legit UTF-8 bit patterns.
 * null will not terminate the string -- those are legit 7bit ASCII.
 * returns byte offset of first non-legit sequence or -1 if 100% okay.
 */
int
check_utf8(str, len)
unsigned char* str;
int len;
{
    int seq, pos, run, octet;
    run = 0;

    for(pos = 0; pos < len; pos++) {
        octet = str[pos];

        /* start of a sequence */
        if(run == 0) {
            seq = pos;
        }

        if(UTF8_LOWBIT(octet)) {
            if( run != 0 ) {
                /* whoops, wanted highbit there */
                return seq;
            }
            continue;
        }

        if(UTF8_CONTINUATION(octet)) {
            if( run ) {
                /* one of our expected run */
                run --;
                continue;
            }
            /* whoops, not the right spot for this */
            return seq;
        }

        if( run ) {
            /* whoops, should have had a continuation octet above */
            return seq;
        }

        if(UTF8_SEQUENCE_2(octet)) {
            run = 1; /* one more */
            continue;
        }

        if(UTF8_SEQUENCE_3(octet)) {
            run = 2; /* two more */
            continue;
        }

        if(UTF8_SEQUENCE_4(octet)) {
            run = 3; /* three more */
            continue;
        }

        /* yikes! fall through! */
        return seq;
    }

    return -1;
} /* check_utf8() */
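
A quick way to exercise it -- not part of mailx, just a sketch that assumes
the macros and check_utf8() above are in the same file:

#include <stdio.h>

int
main()
{
    /* "cafe" with U+00E9: C3 A9 is the correct UTF-8 encoding */
    unsigned char good[] = "caf\xC3\xA9";
    /* the same accent as a bare Latin-1 0xE9 octet, followed by ASCII */
    unsigned char bad[]  = "caf\xE9 au lait";

    printf("good: %d\n", check_utf8(good, (int) sizeof(good) - 1)); /* -1 */
    printf("bad:  %d\n", check_utf8(bad,  (int) sizeof(bad) - 1));  /*  3 */
    return 0;
}

The first call reports the whole string as okay; the second reports offset 3,
the position of the bare 0xE9.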

https://github.com/Eli-the-Bearded/eli-mailx/blob/master/utf-8.c

Elijah
------
using K&R style to match the rest of mailx
Mike Spencer
2021-06-24 01:22:32 UTC
Post by Eli the Bearded
Post by q***@outlook.com
It seems absurd to me that a recent [few years] fad is to make some
chars 2-bytes, amongst existing one-byte-ASCII-strings.
What is the motive for this?
Mostly it is because of all of the non-English languages that don't fit
in the seven bits of ASCII. Even the eight bit ISO-8859-x family doesn't
cover lots of well-used languages.
Several of my correspondents (using Mac or Windoes) writing in
English do this in their own text and in text/articles copied from the
net.

Oddly, the non-ASCII chars are almost all punctuation: left & right
double & single quotes, em dash, ellipses and the degree symbol. Very
occasionally, there are French or Spanish names with non-ASCII chars
but the big nuisance is the punctuation. And of course, they send it
as quoted-printable.

I have an Emacs macro that finds the QP strings for the punctuation
and reverts them to ASCII before rmail-decode-quoted-printable but
it's a PITA.
Post by Eli the Bearded
UTF-8 gives you most living languages and many dead ones. UTF-8
isn't strictly "2-bytes", it is a variable width encoding with ASCII
compatibility for ASCII characters. High bit sequences can be two,
three, or four octets.
[snip]
--
Mike Spencer Nova Scotia, Canada
Eli the Bearded
2021-06-24 02:19:09 UTC
Post by Mike Spencer
Several of my correspondents (using Mac or Windoes) writing in
English do this in their own text and in text/articles copied from the
net.
Yes, a related problem. With UTF-8 come a lot more punctuation options,
and a large number of programs silently "correct" things. Some people
believe very dearly that the fancy punctuation is better; others believe
the opposite.
Post by Mike Spencer
Oddly, the non-ASCII chars are almost all punctuation: left & right
double & single quotes, em dash, ellipses and the degree symbol. Very
occasionally, there are French or Spanish names with non-ASCII chars
but the big nuisance is the punctuation. And of course, they send it
as quoted-printable.
Quoted-printable is how to make UTF-8 seven-bit safe and _mostly_
readable. That's the real goal of QP: making it _mostly_ readable if you
don't have software that can display it. Base64 is not readable and gets
used sometimes.
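For example, the right single quote U+2019 is the three octets E2 80 99 in
UTF-8; quoted-printable spells that =E2=80=99, so a curly-quoted "It’s"
arrives as It=E2=80=99s -- mangled, but still guessable without a decoder.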
Post by Mike Spencer
I have an Emacs macro that finds the QP strings for the punctuation
and reverts them to ASCII before rmail-decode-quoted-printable but
it's a PITA.
I have vim settings for the same purpose, and a simple Perl script
for use outside of vim. The Perl script will look for my vim
configuration first, and if it doesn't find it, use a built-in set of
rules.

https://qaz.wtf/tmp/textify

I basically only try to fix punctuation issues I've encountered.
I do not try to replace accented vowels, for example.

Elijah
------
knows German rules for that, but not, say, French ones
Mike Spencer
2021-06-24 05:26:08 UTC
Post by Eli the Bearded
I basically only try to fix punctuation issues I've encountered.
I do not try to replace accented vowels, for example.
Same. After undoing QP, each piece of UTF-8 punctuation appears in Emacs as
three escaped octal byte values, making for hard reading.
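(The right single quote U+2019, for example, is the octets E2 80 99 and so
shows up as \342\200\231.)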
--
Mike Spencer Nova Scotia, Canada
Sylvain Robitaille
2021-07-07 21:28:04 UTC
Post by Mike Spencer
I have an Emacs macro that finds the QP strings for the punctuation
and reverts them to ASCII before rmail-decode-quoted-printable but
it's a PITA.
I have vim settings for the same purpose, ...
Care to share your vim settings? I see that your Perl script reads it
in, or defaults to its own, but I'm certainly curious about what you've
done in vim ...
--
----------------------------------------------------------------------
Sylvain Robitaille ***@encs.concordia.ca

Systems analyst / AITS Concordia University
Faculty of Engineering and Computer Science Montreal, Quebec, Canada
----------------------------------------------------------------------
Eli the Bearded
2021-07-08 17:15:26 UTC
Post by Sylvain Robitaille
I have vim settings for the same purpose, ...
Care to share your vim settings? I see that your Perl script reads it
in, or defaults to its own, but I'm certainly curious about what you've
done in vim ...
The complete vim settings are basically the same as in the perl script,
but here:

base64 -d <<_B64_VIMRC > highbit_vimrc
IiBzbWFydCBxdW90ZXMKbWFwISDigJkgJwptYXAhIOKAmCAnCm1hcCEg4oCcICIKbWFwISDi
gJ0gIgptYXAhIOKAsyAiCiIgYnVsbGV0Cm1hcCEg4pePICoKIiBlbGxpcHNpcwptYXAhIOKA
piAuLi4KIiBuLWRhc2gKbWFwISDigJMgLS0KIiBtLWRhc2gKbWFwISDigJQgLS0KIiBVKzIy
MTIgbWludXMKbWFwISDiiJIgLQoiIFUrMjAxMCBoeXBoZW4KbWFwISDigJAgLQoiIGx5bngg
YnJva2VuIFVURi04Cm1hcCEgw6LCgMKcICIKbWFwISDDosKAwp0gIgptYXAhIMOiwoDCmSAn
Cm1hcCEgw6LCgMKUIC0tCm1hcCEgw6LCgMKmIC4uLgoiCiIgZmluZCBub24tYXNjaWkKbWFw
IDxGNT4gL1teCSAtfl08Y3I+Cg==
_B64_VIMRC
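
For anyone reading along without a decoder handy, the blob decodes to roughly
the following; <Tab> and the <U+00xx> markers stand in for a literal tab and
for the raw C1 control octets in the "lynx broken UTF-8" entries, which have
no printable glyph:

" smart quotes
map! ’ '
map! ‘ '
map! “ "
map! ” "
map! ″ "
" bullet
map! ● *
" ellipsis
map! … ...
" n-dash
map! – --
" m-dash
map! — --
" U+2212 minus
map! − -
" U+2010 hyphen
map! ‐ -
" lynx broken UTF-8
map! â<U+0080><U+009C> "
map! â<U+0080><U+009D> "
map! â<U+0080><U+0099> '
map! â<U+0080><U+0094> --
map! â<U+0080><U+00A6> ...
"
" find non-ascii
map <F5> /[^<Tab> -~]<cr>

The map! entries rewrite the characters in insert and command-line modes, and
the last map gives <F5> a search for any character that isn't a tab or
printable ASCII.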

Elijah
------
yay for multiple encodings raw in one file
Sylvain Robitaille
2021-07-12 22:59:00 UTC
Post by Eli the Bearded
The complete vim settings are basically the same as in the perl script,
base64 -d <<_B64_VIMRC > highbit_vimrc
IiBzbWFydCBxdW90ZXMKbWFwISDigJkgJwptYXAhIOKAmCAnCm1hcCEg4oCcICIKbWFwISDi
gJ0gIgptYXAhIOKAsyAiCiIgYnVsbGV0Cm1hcCEg4pePICoKIiBlbGxpcHNpcwptYXAhIOKA
piAuLi4KIiBuLWRhc2gKbWFwISDigJMgLS0KIiBtLWRhc2gKbWFwISDigJQgLS0KIiBVKzIy
MTIgbWludXMKbWFwISDiiJIgLQoiIFUrMjAxMCBoeXBoZW4KbWFwISDigJAgLQoiIGx5bngg
YnJva2VuIFVURi04Cm1hcCEgw6LCgMKcICIKbWFwISDDosKAwp0gIgptYXAhIMOiwoDCmSAn
Cm1hcCEgw6LCgMKUIC0tCm1hcCEgw6LCgMKmIC4uLgoiCiIgZmluZCBub24tYXNjaWkKbWFw
IDxGNT4gL1teCSAtfl08Y3I+Cg==
_B64_VIMRC
Beautiful. Thank you.
--
----------------------------------------------------------------------
Sylvain Robitaille ***@encs.concordia.ca

Systems analyst / AITS Concordia University
Faculty of Engineering and Computer Science Montreal, Quebec, Canada
----------------------------------------------------------------------
Richmond
2021-07-07 21:45:22 UTC
Post by Mike Spencer
Several of my correspondents (using Mac or Windoes) writing in
English do this in their own text and in text/articles copied from the
net.
Oddly, the non-ASCII chars are almost all punctuation: left & right
double & single quotes, em dash, ellipses and the degree symbol. Very
occasionally, there are French or Spanish names with non-ASCII chars
but the big nuisance is the punctuation. And of course, they send it
as quoted-printable.
Surely, as most of the web is UTF-8, it is good to use that as the standard.

There is no £ in seven-bit ASCII; there is one in extended ASCII and in
the ISO-8859 sets, but it causes confusion when email programs do not
state the encoding used.
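For example, £ is the single octet A3 in ISO-8859-1 but the two octets C2 A3
in UTF-8; read with the wrong assumption it shows up as Â£ or as a
replacement character instead of £.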
Richard Kettlewell
2021-06-24 07:54:05 UTC
Post by Eli the Bearded
Post by q***@outlook.com
It seems absurd to me that a recent [few years] fad is to make some
chars 2-bytes, amongst existing one-byte-ASCII-strings.
What is the motive for this?
Mostly it is because of all of the non-English languages that don't fit
in the seven bits of ASCII. Even the eight bit ISO-8859-x family doesn't
cover lots of well-used languages. UTF-8 gives you most living languages
and many dead ones. UTF-8 isn't strictly "2-bytes", it is a variable
width encoding with ASCII compatibility for ASCII characters. High bit
sequences can be two, three, or four octets.
[...]
Post by Eli the Bearded
https://github.com/Eli-the-Bearded/eli-mailx/blob/master/utf-8.c
That has several bugs...

1) It accepts non-minimal sequences such as F0808080.

2) It accepts sequences mapping to UTF-16 surrogates, such as EDA080.

3) It accepts sequences mapping outside the Unicode code point range,
such as F7808080.

See https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf D92 for the
specification.
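
For reference, Table 3-7 next to D92 restricts which octet may follow each
lead octet, and checking just that one extra octet rules out all three
classes. A sketch of such a check -- not the mailx code, only an illustration
of the table:

/* Allowed ranges for the octet after the lead octet, per Table 3-7 of
 * the Unicode standard (definition D92).  Added to the bit-pattern
 * checks above, this rejects non-minimal forms, UTF-16 surrogates, and
 * code points past U+10FFFF.
 */
static int
lead_and_next_ok(lead, next)
int lead;
int next;
{
    if (lead >= 0xC2 && lead <= 0xDF)   /* two octet sequences         */
        return next >= 0x80 && next <= 0xBF;
    if (lead == 0xE0)                   /* excludes non-minimal E0 8x  */
        return next >= 0xA0 && next <= 0xBF;
    if (lead >= 0xE1 && lead <= 0xEC)
        return next >= 0x80 && next <= 0xBF;
    if (lead == 0xED)                   /* excludes surrogates ED A0.. */
        return next >= 0x80 && next <= 0x9F;
    if (lead >= 0xEE && lead <= 0xEF)
        return next >= 0x80 && next <= 0xBF;
    if (lead == 0xF0)                   /* excludes non-minimal F0 8x  */
        return next >= 0x90 && next <= 0xBF;
    if (lead >= 0xF1 && lead <= 0xF3)
        return next >= 0x80 && next <= 0xBF;
    if (lead == 0xF4)                   /* excludes > U+10FFFF         */
        return next >= 0x80 && next <= 0x8F;
    return 0;   /* C0, C1, and F5..FF are never valid lead octets */
}

The later continuation octets only ever need to be in 80..BF, which the
existing UTF8_CONTINUATION test already covers.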
--
https://www.greenend.org.uk/rjk/
Eli the Bearded
2021-07-01 20:44:00 UTC
Post by Richard Kettlewell
Post by Eli the Bearded
https://github.com/Eli-the-Bearded/eli-mailx/blob/master/utf-8.c
That has several bugs...
1) It accepts non-minimal sequences such as F0808080.
2) It accepts sequences mapping to UTF-16 surrogates, such as EDA080.
3) It accepts sequences mapping outside the Unicode code point range,
such as F7808080.
Interesting critique. I may fix those, but I'm not sure they'll ever be
relevant to the level of strictness I need. I'm looking to catch
mislabeled "charset"s not devious attacks.
Post by Richard Kettlewell
See https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf D92 for the
specification.
The Unicode website was briefly down, so I had put off responding to
this until I could check that. Would have been nice if you had included
a page number since that document is large and doesn't include a TOC.
Page 123 as numbered in document, page 54 as numbered by my PDF reader.

Elijah
------
now recalls non-minimal UTF-8 being used to escape an Apache document root once
Richard Kettlewell
2021-07-02 08:44:14 UTC
Post by Eli the Bearded
Post by Richard Kettlewell
Post by Eli the Bearded
https://github.com/Eli-the-Bearded/eli-mailx/blob/master/utf-8.c
That has several bugs...
1) It accepts non-minimal sequences such as F0808080.
2) It accepts sequences mapping to UTF-16 surrogates, such as EDA080.
3) It accepts sequences mapping outside the Unicode code point range,
such as F7808080.
Interesting critique. I may fix those, but I'm not sure they'll ever be
relevant to the level of strictness I need. I'm looking to catch
mislabeled "charset"s not devious attacks.
It was advertised as checking for “legitimate UTF-8”, not “UTF-8 but
also some other stuff that is not UTF-8”.
Post by Eli the Bearded
Post by Richard Kettlewell
See https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf D92 for the
specification.
The Unicode website was briefly down, so I had put off responding to
this until I could check that. Would have been nice if you had
included a page number since that document is large and doesn't
include a TOC. Page 123 as numbered in document, page 54 as numbered
by my PDF reader.
I didn’t think D92 would be hard to search for.
--
https://www.greenend.org.uk/rjk/