An overview of the major orthographic issues in intelligent searching and processing of Korean texts and webpages
This report provides an overview of the linguistic and orthographic issues related to the processing of Korean texts, especially webpages. It has two basic aims:
Korean orthography is not as regular as is commonly believed. Though hangul is often described as "logical" and "scientific," creating the impression that it is a highly regular "phonetic script" (with good phoneme-grapheme correspondence), in reality many phonological changes, such as voicing, nasalization, and resyllabification, occur as a result of syllable coda (patchim) liaison, which has a major impact on Korean TTS (text-to-speech) systems. Of crucial importance to text processing is the significant amount of orthographic variation.
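The resyllabification mentioned above can be illustrated with a minimal Python sketch. It relies on the standard Unicode arithmetic for precomposed hangul syllables; the function names (`decompose`, `compose`, `liaison`) are hypothetical, and the rule is deliberately simplified: it only moves a single-consonant coda into a following empty onset (ㅇ) and ignores palatalization, consonant clusters, and ㅎ deletion, all of which a real TTS front end would need to handle.

```python
# Jamo tables in Unicode order for precomposed hangul syllables (U+AC00..U+D7A3).
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"                      # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"                  # 21 vowels
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 codas (index 0 = none)
BASE = 0xAC00


def decompose(ch):
    """Return (lead, vowel, tail) indices, or None if ch is not a hangul syllable."""
    idx = ord(ch) - BASE
    if not 0 <= idx < 19 * 21 * 28:
        return None
    return idx // (21 * 28), (idx % (21 * 28)) // 28, idx % 28


def compose(lead, vowel, tail=0):
    """Build a precomposed syllable from jamo indices."""
    return chr(BASE + (lead * 21 + vowel) * 28 + tail)


IEUNG = LEADS.index("ㅇ")
# Simplifying assumption: only single consonants that can also be onsets are moved;
# coda ㅇ (ng) never moves and ㅎ is excluded because it is usually dropped.
MOVABLE = set(LEADS) - {"ㅇ", "ㅎ"}


def liaison(word):
    """Approximate the pronounced form by moving a coda into a following empty onset."""
    chars = list(word)
    for i in range(len(chars) - 1):
        cur, nxt = decompose(chars[i]), decompose(chars[i + 1])
        if cur is None or nxt is None:
            continue
        tail = TAILS[cur[2]]
        if nxt[0] == IEUNG and tail in MOVABLE:
            chars[i] = compose(cur[0], cur[1])                        # drop the coda
            chars[i + 1] = compose(LEADS.index(tail), nxt[1], nxt[2])  # reuse it as onset
    return "".join(chars)
```

Under these assumptions, `liaison("한국어")` yields `한구거`, which shows why the written form alone is not a reliable guide to pronunciation.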
The existence of orthographic variants and the morphological complexity of Korean pose a special challenge to the developers of linguistic tools, especially in the field of information retrieval, since the same word or phrase can be written in multiple, often unpredictable, ways. This report focuses on the major types of orthographic variation in Korean, and provides a brief analysis of the linguistic issues to be considered by software developers.
The most important type of orthographic variation in Korean is the use of variant hangul spellings in the writing of loanwords, which are very common in Korean. There are also differences in the spelling of loanwords between North Korea and South Korea (described below). Table 1 shows some orthographic variants of common loanwords from English.
English | Variant 1 | Variant 2 |
---|---|---|
cake | 케이크 (keikeu) | 케잌 (keik) |
yellow | 옐로우 (yelrou) | 옐로 (yelro) |
juice | 주스 (juseu) | 쥬스 (jyuseu) |
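One common way to handle such loanword variants at indexing time is to map every known variant to a single key. The sketch below is only an illustration: the names `VARIANT_GROUPS`, `CANONICAL`, and `normalize_token` are hypothetical, the data comes from the table above, and treating the first member of each group as the key is an arbitrary choice, not a claim about which spelling is standard.

```python
# Variant groups from the table above; the first member is used as the index key
# purely by convention (an assumption of this sketch).
VARIANT_GROUPS = [
    ("케이크", "케잌"),   # cake
    ("옐로우", "옐로"),   # yellow
    ("주스", "쥬스"),     # juice
]

# Lookup table: every variant (including the key itself) points to the group key.
CANONICAL = {variant: group[0] for group in VARIANT_GROUPS for variant in group}


def normalize_token(token):
    """Return the group key for a known variant, or the token unchanged."""
    return CANONICAL.get(token, token)


def index_keys(tokens):
    """Normalize a token list so that variant spellings share one index entry."""
    return [normalize_token(t) for t in tokens]
```

With this table, a document containing 쥬스 and a query for 주스 end up under the same key. Unknown tokens pass through untouched.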
Another significant kind of orthographic variation occurs in the writing of non-Korean personal names, as shown in the table below.
English | Variant 1 | Variant 2 |
---|---|---|
Marx | 마르크스 (mareukeuseu) | 막스 (makseu) |
Mao Zedong | 마오쩌뚱 (maojjeottung) | 모택동 (motaekdong) |
Clinton | 클린턴 (keulrinteon) | 클린톤 (keulrinton) |
Orthographic variation in loanwords and in non-Korean personal names is fairly common, and must be given special attention in the development of IR and translation tools.
An important factor contributing to the complexity of the Korean writing system is the use of multiple scripts. Korean is written in a mixture of three scripts: an alphabetic syllabary called hangul, Chinese characters called hanja, and the Latin alphabet called romaja. Orthographic variation across scripts is not uncommon: the same word can be written in hangul, hangul mixed with numerals, hanja, or even English. Although the use of hanja in Korean is declining, hanja still plays a significant role in newspapers, technical journals, classical works such as the Buddhist scriptures, and proper nouns.
The table below shows the major patterns of cross-script variation in Korean.
Type of Variation | English | Var 1 | Var 2 | Var 3 |
---|---|---|---|---|
Hanja vs. hangul | many people | 大勢 (daese) | 대세 (daese) | |
Hangul vs. hybrid | shirt | 와이셔츠 (waisyeocheu) | Y셔츠 (waisyeocheu) | |
Hangul vs. hybrid | T-shirt | 티셔츠 (tisyeocheu) | T셔츠 (tisyeocheu) | |
Hangul vs. numeral vs. hanja | one o'clock | 한시 (hansi) | 1시 (hansi) | 一時 (hansi) |
Hangul vs. numeral vs. hanja | 5 people | 다섯명 (daseosmyeong) | 5명 (daseosmyeong) | 5名 (daseosmyeong) |
English vs. hangul | television | TV | 텔레비젼 (telrebijyeon) | 테레비 (terebi) |
English vs. hangul | sex | sex | 섹스 (sekseu) | |
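A first step in handling cross-script variants is usually to detect which scripts a token contains. The following sketch classifies characters by Unicode block; the function names (`char_script`, `token_scripts`) are hypothetical, and the ranges cover only the most common blocks (hangul syllables and jamo, the main CJK ideograph blocks, ASCII letters and digits), so rare extension blocks are ignored.

```python
def char_script(ch):
    """Classify one character as hangul, hanja, numeral, latin, or other."""
    cp = ord(ch)
    # Precomposed syllables, conjoining jamo, and compatibility jamo.
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF or 0x3130 <= cp <= 0x318F:
        return "hangul"
    # CJK Unified Ideographs, Extension A, and Compatibility Ideographs.
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF or 0xF900 <= cp <= 0xFAFF:
        return "hanja"
    if ch.isdigit():
        return "numeral"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"


def token_scripts(token):
    """Return the sorted list of scripts that occur in a token."""
    return sorted({char_script(c) for c in token} - {"other"})
```

A token whose result has more than one entry, such as `Y셔츠` or `5名`, is a hybrid form and a candidate for lookup in a cross-script variant table.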
Another factor contributing to the irregularity of hangul orthography is the difference in spelling between South Korea and North Korea. This is a result of an independent language policy implemented by the North Korean government in the postwar era. The major differences are in the writing of loanwords (often influenced by Russian), a strong preference for native Korean words over loanwords, and the writing of non-Korean proper nouns.
Typical examples are shown in the table below.
English | North Korea | South Korea |
---|---|---|
Osaka | 오사까 (osakka) | 오사카 (osaka) |
San Diego | 싼디아고 (ssandiago) | 쌘디애고 (ssaendiaego) |
Cuba | 꾸바 (kkuba) | 쿠바 (kuba) |
Bush | 부슈 (busyu) | 부시 (busi) |

Type of Variation | English | North Korea | South Korea |
---|---|---|---|
Loanwords | plus | 쁠류스 (ppeullyuseu) | 플러스 (peuleoseu) |
Loanwords | missile | 미싸일 (missail) | 미사일 (misail) |
Loanwords | bus | 뻐스 (ppeoseu) | 버스 (beoseu) |
Russian vs. English loanword | group | 그루빠 (guruppa) | 그룹 (geurup) |
Russian vs. English loanword | campaign | 깜빠니아 (kkampania) | 캠페인 (kaempein) |
Morphophonemic | abuse | 람용 (ramyong) | 남용 (namyong) |
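The morphophonemic row (람용 vs. 남용) reflects the South Korean initial sound rule, under which word-initial ㄹ and, before certain vowels, ㄴ are changed in Sino-Korean words. The sketch below generates a candidate South Korean spelling from a North Korean one. It is a simplified illustration, not a complete implementation: the function name `south_candidate` is hypothetical, only the first syllable is examined, and since the rule does not apply to loanwords such as 라디오, the output should be checked against a lexicon before use.

```python
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
BASE = 0xAC00
PER_LEAD = 21 * 28  # number of syllables sharing one initial consonant


def south_candidate(word):
    """Apply a simplified initial sound rule to the first syllable of a word."""
    if not word:
        return word
    idx = ord(word[0]) - BASE
    if not 0 <= idx < 19 * PER_LEAD:
        return word  # first character is not a precomposed hangul syllable
    lead, rest = divmod(idx, PER_LEAD)
    vowel = VOWELS[rest // 28]
    if LEADS[lead] == "ㄹ":
        # ㄹ is dropped (written ㅇ) before i/y vowels, otherwise it becomes ㄴ.
        new_lead = "ㅇ" if vowel in "ㅑㅕㅖㅛㅠㅣ" else "ㄴ"
    elif LEADS[lead] == "ㄴ" and vowel in "ㅕㅛㅠㅣ":
        new_lead = "ㅇ"
    else:
        return word
    return chr(BASE + LEADS.index(new_lead) * PER_LEAD + rest) + word[1:]
```

For the example in the table, `south_candidate("람용")` returns `남용`. Indexing both the original and the candidate form is a cautious way to use such a rule without a lexicon.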
The Korean writing system has undergone several reforms during its history, including the invention of hangul by King Sejong in 1443, several reforms of the hangul script, and the abolition of hanja in North Korea. The most recent reform of hangul orthography took place in 1988.
Though the new orthography is now fairly well established, the old orthography is still in common use. Because the affected words are frequent and fairly numerous, they should be considered in developing linguistic tools. The table below shows typical examples.
English | As of 1988 | Before 1988 |
---|---|---|
Messenger | 심부름꾼 (simbureumkkun) | 심부름군 (simbureumgun) |
Worker | 일꾼 (ilkkun) | 일군 (ilgun) |
Color | 빛깔 (bitkkal) | 빛갈 (bitgal) |
Hanja is written in the traditional character forms. Unlike in the PRC and Japan, language reforms in Korea did not include the simplification of character forms, so the problem of traditional versus simplified forms is not a major issue. Nevertheless, the Japanese occupation of the Korean Peninsula (1910 to 1945) had a strong influence, and many simplified Japanese character forms, including abbreviated forms, came into common use. Typical examples are shown in the table below.
Type of Variation | English | Standard | Variant |
---|---|---|---|
Variant forms | fermentation | 醗酵 (balhyo) | 発酵 (balhyo) |
Abbreviated form | 10 years old | 十歳 (sipse) | 十才 (sipse) |
Traditional form | development | 發達 (baldal) | 発達 (baldal) |
There are various other types of orthographic variation, a detailed treatment of which is beyond the scope of this report. The table below shows the most important ones.
Type of Variation | English | Var 1 | Var 2 |
---|---|---|---|
Abbreviations | farmers cooperative | 농협 (nonghyeop) | 농업협동조합 (nongeophyeopdongjohap) |
Abbreviations | Korean Tobacco | KT | 한국담배인삼공사 (hangukdambaeinsamgongsa) |
Word spacing | top class | 톱 클래스 (topkeulraeseu) | 톱클래스 (topkeulraeseu) |
Word spacing | North Sea | 북 해 (bukhae) | 북해 (bukhae) |
Word spacing | Caribbean Sea | 카리브 해 (karibeuhae) | 카리브해 (karibeuhae) |
Punctuation | period | . | 。 |
Punctuation | quotes | “” | 『』 |
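Word-spacing and punctuation variants of the kind listed above can often be neutralized by computing a matching key. The sketch below is one possible approach, with hypothetical names (`PUNCT_MAP`, `match_key`); it assumes that removing all whitespace is acceptable for matching purposes, which discards word boundaries and so is better suited to a secondary lookup key than to the stored text itself.

```python
# Map the East Asian punctuation variants from the table to their counterparts.
PUNCT_MAP = str.maketrans({"。": ".", "『": "“", "』": "”"})


def match_key(text):
    """Return a key that ignores word spacing and unifies punctuation variants."""
    unified = text.translate(PUNCT_MAP)
    return "".join(unified.split())  # split() drops every run of whitespace
```

With this key, `톱 클래스` and `톱클래스` compare equal, as do `카리브 해` and `카리브해`.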
Two advanced forms of variant expansion are synonym expansion and cross-language expansion, the details of which are beyond the scope of this report. In a nutshell, synonym expansion generates a picklist of synonyms and other semantically related variants; cross-language expansion generates a picklist of Korean equivalents to a foreign-language source term. These are shown in the table below.
Type of Variation | English | Var 1 | Var 2 | Var 3 | Var 4 |
---|---|---|---|---|---|
Synonyms | money | 돈 (don) | 현금 (hyeongeum) | 통화 (tonghwa) | 화폐 (hwapye) |
Synonyms | calendar | 카렌더 (karendeo) | 달력 (dalryeok) | | |
Cross-language | Tokyo | Tokyo | 도쿄 (dokyo) | 동경 (donggyeong) | |
Cross-language | country | country | 국가 (gukga) | 나라 (nara) | |
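The picklist idea described above can be sketched as a simple query-time expansion. This is an illustration only: the data is limited to the rows of the table, and the names `SYNONYM_GROUPS`, `CROSS_LANGUAGE`, and `expand` are hypothetical. A production system would draw on a large lexical database rather than hard-coded lists.

```python
# Synonym groups and cross-language equivalents taken from the table above.
SYNONYM_GROUPS = [
    ["돈", "현금", "통화", "화폐"],   # money
    ["카렌더", "달력"],               # calendar
]
CROSS_LANGUAGE = {
    "tokyo": ["도쿄", "동경"],
    "country": ["국가", "나라"],
}


def expand(term):
    """Return a picklist: the term itself, then its synonyms and Korean equivalents."""
    picklist = [term]
    for group in SYNONYM_GROUPS:
        if term in group:
            picklist += [t for t in group if t != term]
    # Foreign-language source terms are matched case-insensitively.
    picklist += CROSS_LANGUAGE.get(term.lower(), [])
    return picklist
```

Keeping the original term first lets a search front end rank exact matches above expanded ones.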
A particularly important class of synonyms in Korean is native words as opposed to those derived from other languages, especially Sinitic words and English-derived loanwords, as shown in the table below.