If you convert to Unicode using native2ascii, you can still use XML character references; the viewer will still recognize them.

Chapter 7 ✦ Foreign Languages and Non-Roman Text

How to Write XML in Other Character Sets

Unless told otherwise, an XML processor assumes that text entity characters are encoded in UTF-8. Since UTF-8 includes ASCII as a subset, ASCII text is easily parsed by XML processors as well.

The only character set other than UTF-8 that an XML processor is required to understand is raw Unicode. If you cannot convert your text into either UTF-8 or raw Unicode, you can leave the text in its native character set and tell the XML processor which set that is. This should be a last resort, though, because there's no guarantee that an arbitrary XML processor can process other encodings. Nonetheless, Netscape Navigator and Internet Explorer both do a pretty good job of interpreting the common character sets.

To warn the XML processor that you're using a non-Unicode encoding, you include an encoding attribute in the XML declaration at the start of the file. For example, to specify that the entire document uses Latin-1 by default (unless overridden by another encoding declaration in a nested entity), you would use this XML declaration:

    <?xml version="1.0" encoding="ISO-8859-1"?>

An external parsed entity stored in a different encoding can begin with its own encoding declaration, placed before any of that entity's character data appears.

Table 7-7 lists the official names of the most common character sets used today, as they would be given in XML encoding attributes. For encodings not found in this list, consult the official list maintained by the Internet Assigned Numbers Authority (IANA) at http://www.isi.edu/in-notes/iana/assignments/character-sets.

Table 7-7: Names of Common Character Sets

Character Set Name    Languages/Countries
US-ASCII              English
UTF-8                 Compressed Unicode
UTF-16                Compressed UCS
ISO-10646-UCS-2       Raw Unicode
ISO-10646-UCS-4       Raw UCS
ISO-8859-1            Latin-1, Western Europe
ISO-8859-2            Latin-2, Eastern Europe
ISO-8859-3            Latin-3, Southern Europe
ISO-8859-4            Latin-4, Northern Europe
ISO-8859-5            ASCII plus Cyrillic
ISO-8859-6            ASCII plus Arabic
ISO-8859-7            ASCII plus Greek
ISO-8859-8            ASCII plus Hebrew
ISO-8859-9            Latin-5, Turkish
ISO-8859-10           Latin-6, ASCII plus the Nordic languages
ISO-8859-11           ASCII plus Thai
ISO-8859-13           Latin-7, ASCII plus the Baltic Rim languages, particularly Latvian
ISO-8859-14           Latin-8, ASCII plus Gaelic and Welsh
ISO-8859-15           Latin-9, Latin-0; Western Europe
ISO-2022-JP           Japanese
Shift_JIS             Japanese, Windows
EUC-JP                Japanese, Unix
Big5                  Chinese, Taiwan
GB2312                Chinese, mainland China
KOI8-R                Russian
ISO-2022-KR           Korean
EUC-KR                Korean, Unix
ISO-2022-CN           Chinese

Summary

In this chapter you learned:

✦ Web pages should identify the encoding they use.
✦ What a script is, how it relates to languages, and the four things a script requires.
✦ How scripts are used in computers with character sets, fonts, glyphs, and input methods.
✦ What character sets are commonly used on different platforms and that most are based on ASCII.
✦ How to write XML in Unicode without a Unicode editor (write the document in ASCII and include Unicode character references).
✦ When writing XML in other encodings, include an encoding attribute in the XML declaration.

In the next chapter, you'll begin exploring DTDs and how they enable you to define and enforce a vocabulary, syntax, and grammar for your documents.

✦ ✦ ✦
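The last two points of the summary can be tried out with a short script. What follows is only a minimal sketch, not something from the chapter itself: it assumes Python and its standard xml.etree.ElementTree parser, and the sample text and the name element are made up for illustration. It writes the same text once as pure ASCII with character references and once as Latin-1 bytes with an encoding declaration, then checks that a parser recovers the same string from both.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text; it contains no "&" or "<", so no extra escaping is needed.
text = "Café Müller"

# Approach 1: keep the file pure ASCII and replace every non-ASCII
# character with a numeric character reference such as &#233;.
ascii_body = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
ascii_doc = f'<?xml version="1.0"?>\n<name>{ascii_body}</name>'.encode("ascii")

# Approach 2: store the bytes in Latin-1 and announce that encoding
# in the XML declaration so the parser does not assume UTF-8.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    f"<name>{text}</name>"
).encode("iso-8859-1")

# Both documents should yield the same character data once parsed.
parsed_ascii = ET.fromstring(ascii_doc).text
parsed_latin1 = ET.fromstring(latin1_doc).text

print(ascii_body)
print(parsed_ascii == parsed_latin1 == text)
```

The first approach survives any ASCII-safe editor or transport, at the cost of readability in the source. The second keeps the source readable in a Latin-1 editor but relies on the processor supporting that encoding, which is why the chapter treats it as a last resort.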