> Since when have HTML chars begun to be assigned to the range 127-159?
> E.g., in your test document, &#145; and &#146; for the smart
> single quotes and &#151; for the em dash. Traditionally,
> 127-159 were left unused. They're not mentioned in the HTML 4.01
> spec, http://www.w3.org/TR/html401/html40.txt. Are these perhaps
> IE extensions? Yet, my ancient copy of Netscape for OS/2 renders
> them correctly.
Look at http://www.w3.org/TR/html4/charset.html
which is part of section 5 of the HTML 4.01 specification.
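On the 127-159 question itself, here is a minimal Python sketch. It assumes the references in the test document (145 and 146 for the smart single quotes, 151 for the em dash) were written with Windows-1252 byte values in mind. HTML 4 numeric references are meant to name Unicode code points, where 128-159 are control characters, but browsers have long read them as Windows-1252, and Python's html.unescape follows the same lenient convention. The helper name cp1252_char is made up for illustration.

```python
import html


def cp1252_char(code: int) -> str:
    """Return the character that Windows-1252 assigns to one byte value."""
    return bytes([code]).decode("cp1252")


# Numeric references in the 128-159 range, as used in the question.
for ref, code in [("&#145;", 145), ("&#146;", 146), ("&#151;", 151)]:
    lenient = html.unescape(ref)  # browser-style lenient decoding
    assert lenient == cp1252_char(code)
    print(ref, "->", f"U+{ord(lenient):04X}")
```

The portable spellings of these three characters would be &#8216;, &#8217;, and &#8212;, which name the Unicode code points directly.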
And if you write XHTML (which you might as well be doing, since it's
more versatile and only a bit stricter), look at
http://www.w3.org/TR/xhtml1/#a_dtd_XHTML-1.0-Transitional
where you will find the following:
"A.2. Entity Sets
The XHTML entity sets are the same as for HTML 4, but have been
modified to be valid XML 1.0 entity declarations. Note the entity for
the Euro currency sign (&euro; or &#8364; or &#x20AC;) is defined as
part of the special characters.
A.2.1. Latin-1 characters
The file DTD/xhtml-lat1.ent is a normative part of this specification.
The annotated contents of this file are available in this separate
section for completeness.
A.2.2. Special characters
The file DTD/xhtml-special.ent is a normative part of this
specification. The annotated contents of this file are available in
this separate section for completeness.
A.2.3. Symbols
The file DTD/xhtml-symbol.ent is a normative part of this
specification. The annotated contents of this file are available in
this separate section for completeness."
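
As a quick check of the note about the Euro sign, this small Python sketch looks the entity up in the standard library's HTML 4 entity table (html.entities.name2codepoint) and confirms that the named, decimal, and hexadecimal spellings all decode to the same character. It only illustrates the equivalence; it is not part of the spec.

```python
import html
from html.entities import name2codepoint

code = name2codepoint["euro"]  # code point behind the named entity
print(code, hex(code))

# Named, decimal, and hexadecimal spellings of the same character.
spellings = ["&euro;", f"&#{code};", f"&#x{code:X};"]
decoded = {html.unescape(s) for s in spellings}
assert decoded == {"\u20ac"}
```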
---
If you do XHTML, and if your browser can read it (which the newer ones
can), you can actually use pretty much any Unicode character on your
page by encoding it as, for example, "&#8733;" (which is the
"proportional to" symbol), the syntax being:
& (tells the browser a special character is starting)
# (tells the browser a Unicode character is coming in decimal format),
nnnn (the decimal number of the Unicode character; it need not be exactly four digits),
; (tells the browser the special character is over now)
If you want to put in the character as a hexadecimal value, replace "#"
with "#x".
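
The syntax above can be sketched in a few lines of Python. The function name numeric_reference is hypothetical; it simply assembles the pieces in the order described and then uses html.unescape from the standard library to check that the result decodes back to the original character.

```python
import html


def numeric_reference(char: str, hexadecimal: bool = False) -> str:
    """Build &#nnnn; (decimal) or &#xhhhh; (hex) for a single character."""
    code = ord(char)  # the Unicode code point of the character
    return f"&#x{code:X};" if hexadecimal else f"&#{code};"


proportional_to = "\u221d"  # the "proportional to" symbol
print(numeric_reference(proportional_to))                    # &#8733;
print(numeric_reference(proportional_to, hexadecimal=True))  # &#x221D;

# Round trip: decoding the reference gives the character back.
assert html.unescape(numeric_reference(proportional_to)) == proportional_to
```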
After years of trying to write pages that needed to display both
German and Turkish characters, I was totally overjoyed when I recently
discovered this trick. I know a multinational corporate site where all
the non-Latin-1 characters are stored in the &#nnnn; format and the
site seems to display fine in Hebrew, Japanese, Chinese, etc.
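
For whole strings, one way to get the same effect in Python is the built-in "xmlcharrefreplace" error handler, which leaves characters the target encoding can represent alone and turns everything else into &#nnnn; references. This is only a sketch: the function name and the sample text are made up, and it does not escape &, <, or >, which a real page would also need.

```python
def to_latin1_safe_html(text: str) -> str:
    """Keep Latin-1 characters; replace all others with &#nnnn; references."""
    encoded = text.encode("latin-1", errors="xmlcharrefreplace")
    return encoded.decode("latin-1")


# German characters fit in Latin-1; some Turkish ones do not.
sample = "Straße, İstanbul, ağaç"
print(to_latin1_safe_html(sample))  # Straße, &#304;stanbul, a&#287;aç
```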
You can (and should) also declare the character set of your HTML or
XHTML document, in addition to using these numeric references.
The main W3C XHTML page has this header:
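<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>XHTML 1.0: The Extensible HyperText Markup Language (Second
Edition)</title>
<link rel="stylesheet" type="text/css"
href="http://www.w3.org/StyleSheets/TR/W3C-REC.css" />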
Note that they declare what language it's in, which helps the browser
figure out how to display it and figure out what character set it's
probably using.
If you go to the BBC Arabic news page
http://news.bbc.co.uk/hi/arabic/news/, the header is:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<title>BBC Arabic News | الصفحة الرئيسية</title>
Note the first meta tag here which lists the character set. It's
better to always include that information (don't know why the W3C
didn't on their page!), since it tells the browser how to display the
characters.
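
To see why the declaration matters, here is a small Python sketch of what happens when the bytes of the BBC page title are read with the wrong character set. The charset used here, windows-1256, is my assumption: it is not shown in the header above, but it is the Arabic code page under which the bytes decode to sensible text. Read as Latin-1 instead, the same bytes come out as accented Western European letters.

```python
# The Arabic part of the title as it appears when read as Latin-1.
garbled = "ÇáÕÝÍÉ ÇáÑÆíÓíÉ"

# Recover the original bytes, then decode them with the assumed charset.
raw_bytes = garbled.encode("latin-1")
title = raw_bytes.decode("cp1256")  # assumption: page is windows-1256
print(title)
```

Without a charset declaration, a browser has to guess which of these two readings is the right one.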