[SLL] LANG=C and Red Hat

Joshua Daniel Franklin jdf.lists at gmail.com
Thu Dec 6 11:05:57 PST 2007


On Dec 6, 2007 6:28 AM, Jesse Keating wrote:
> On Thu, 6 Dec 2007 07:04:59 -0600 (CST) reed wrote:
> > I am just curious: does this happen to you a lot? And why are the
> > defaults like this? Or if it is not a default setup, what is
> > configured wrong and why is it so common?
>
> I largely never run into issues because all the terminal emulators I
> use support UTF8, perhaps you're using software that does not support
> UTF8.  LANG=C is rather hostile to non-English speaking folks.

I used to use mrxvt and had the garbage character problem because it
doesn't have UTF-8 support. Now I'm just using gnome-terminal but
there is a rxvt-unicode (available in Fedora).

By the way, if you want to know whether your terminal is borking Unicode,
here are some notes I've got.

echo -e '\xc2\xa3' # that's UTF-8 for a sterling symbol.

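If you see capital-a-circumflex followed by a sterling symbol, your terminal
is treating the two bytes as ISO-8859-1. If you see just the sterling symbol,
it is decoding UTF-8 correctly.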
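To take the terminal out of the picture, you can look at the raw bytes and at
what the environment claims the encoding is. This is only a rough sketch: the
function names are made up for illustration, and printf with octal escapes
stands in for echo -e because not every shell's echo understands -e.

```shell
# Hex dump of the two bytes sent for the sterling symbol.
# Octal \302\243 is the same as hex C2 A3.
pound_bytes() {
    printf '\302\243' | od -An -tx1 | tr -d ' \n'
}

# The locale setting that decides the character encoding:
# LC_ALL overrides LC_CTYPE, which overrides LANG.
effective_ctype() {
    echo "${LC_ALL:-${LC_CTYPE:-${LANG:-unset}}}"
}

pound_bytes; echo      # c2a3
effective_ctype        # something like en_US.UTF-8, or C if LANG=C is in force
```

If the second command prints C or POSIX, programs are being told the world is
plain ASCII, whatever the terminal emulator itself can display.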

echo -e '\xc2\xa3' > pound-symbol
cat pound-symbol
vi pound-symbol

If vi (or nano or whatever) doesn't show the sterling symbol but it displays
OK when you cat it, then vi/nano is messing you about. If cat shows something
else as well, it would be interesting to know what, because then it is the
terminal emulator that is broken.
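A quick way to settle what is actually in the file, independent of both the
editor and the terminal, is to dump its bytes. A minimal sketch, assuming od is
available (file_hex is just an illustrative helper name):

```shell
# Print a file's bytes as one run of hex digits.
file_hex() {
    od -An -tx1 "$1" | tr -d ' \n'
}

# Write the sterling symbol plus a newline, as echo -e would.
printf '\302\243\n' > pound-symbol
file_hex pound-symbol; echo     # c2a30a
```

If you get c2a30a the file is fine and any garbage you see comes from whatever
is displaying it.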

Note that the HTML hex entity reference &#xA3; (£) has no 'C2' in it, because
the entity names the code point (U+00A3) rather than the UTF-8 bytes. The
UTF-8 for Unicode 128-2047 (0x80-0x07FF) is spread across two bytes. The first
byte will have the two high bits set and the third bit clear (i.e. 0xC2 to
0xDF). The second byte will have the top bit set and the second bit clear
(i.e. 0x80 to 0xBF). These bit patterns keep multi-byte sequences distinct
from one-byte ASCII; a codepage like ISO-8859-1 reads the same two bytes as
two separate characters.
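The arithmetic behind those two bytes can be worked through in the shell. This
is only an illustration of the two-byte case described above; utf8_two_byte is
a hypothetical helper, not part of any tool.

```shell
# Encode a code point in the two-byte range (128-2047) as UTF-8 hex bytes.
utf8_two_byte() {
    codepoint=$1
    if [ "$codepoint" -lt 128 ] || [ "$codepoint" -gt 2047 ]; then
        echo "not in the two-byte range" >&2
        return 1
    fi
    # First byte:  110xxxxx, carrying the top 5 bits of the code point.
    # Second byte: 10xxxxxx, carrying the low 6 bits.
    printf '%02x %02x\n' \
        $(( 0xC0 | (codepoint >> 6) )) \
        $(( 0x80 | (codepoint & 0x3F) ))
}

utf8_two_byte $(( 0xA3 ))     # c2 a3
```

For U+00A3 the low six bits happen to give back A3 once the top bit is set,
which is why the second byte looks the same as the ISO-8859-1 value.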

By the way, there's a lot of confusion about UTF-8 "wasting" bits vs ASCII;
for chars 0-127 UTF-8 is one byte, same as ASCII. So for virtually all your
code and scripts UTF-8 == ASCII. The exceptions are things like MS Word Smart
Quotes, which should be exorcised anyway. But to detect them you need to be
using Unicode-aware tools that will complain when your files are mixing
ISO-8859-1 and UTF-8. If you have the odd ǽ in your files you can batch
convert from your local codepage with iconv.
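As a rough sketch of both halves (detecting and converting), iconv can serve as
the complaining tool: asking it to convert UTF-8 to UTF-8 fails on input that
is not valid UTF-8. The file names and the is_utf8 helper are made up for the
example, and ISO-8859-1 is only an assumption about what the local codepage is.

```shell
# Succeeds only if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# A sample ISO-8859-1 file: octal \351 is e-acute in that codepage.
printf 'caf\351\n' > latin1.txt

is_utf8 latin1.txt || echo "latin1.txt is not valid UTF-8"

# Convert from the assumed source codepage, then check again.
iconv -f ISO-8859-1 -t UTF-8 latin1.txt > utf8.txt
is_utf8 utf8.txt && echo "utf8.txt is valid UTF-8"
```

For a batch, the same two commands can go inside a for loop over the files.
Writing to a new file rather than over the original is the safer habit,
since a wrong guess at the source codepage converts without any error.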

There are two excellent tools I use all the time, mainly for web stuff:
* unum.pl which allows you to look up Unicode/HTML characters by name or number
http://www.fourmilab.ch/webtools/unum/

* FeedParser (RSS/Atom, but you can wrap any file) which "sets the bozo bit"
for wrong encodings, such as a mixed UTF-8 and Windows-1252 file:
http://feedparser.org/docs/character-encoding.html

