Orthographic Variation in Korean

An overview of the major orthographic issues in intelligent searching and processing of Korean texts and webpages


Jack Halpern
CEO
The CJK Dictionary Institute, Inc.
株式会社日中韓辭典研究所
©2001-2008 The CJK Dictionary Institute, Inc.




Index to This Document
  1. Introduction
  2. Hangul Variants in Loanwords
  3. Cross-Script Orthographic Variants
  4. North vs. South Korean Orthography
  5. Miscellaneous Variants
  6. Synonym and Cross-Language Expansion
  7. Mapping Tables and Lexical Databases
  8. Documents for Reference

1. Introduction

This report provides an overview of the linguistic and orthographic issues related to the processing of Korean texts, especially webpages. It has two basic aims:

  1. To enable software developers to acquire a basic knowledge of the special problems resulting from the complexity of the Korean script, especially orthographic variation.
  2. To help software developers build linguistic tools for processing Korean, such as intelligent information retrieval tools, input method editors, and machine translation systems.

Korean orthography is not as regular as most people tend to believe. Hangul is often described as "logical" and "scientific," which creates the impression that it is a highly regular "phonetic script" with good phoneme-grapheme correspondence. In reality, many phonological changes, such as voicing, nasalization, and resyllabification, occur as a result of syllable coda (patchim) liaison, and these have a major impact on Korean TTS (text-to-speech) systems. For text processing, a factor of crucial importance is the significant amount of orthographic variation.
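The resyllabification mentioned above can be illustrated with Unicode arithmetic on precomposed hangul syllables. The following is a minimal sketch, not a pronunciation model: it splits a syllable into initial, vowel, and final indices and applies only the simplest liaison rule, in which a single final consonant moves into a following syllable that begins with the silent initial ㅇ. The function names are illustrative, and voicing, nasalization, palatalization, ㅎ deletion, and consonant clusters are deliberately ignored.

```python
# Jamo inventories in the order used by precomposed syllables (U+AC00..U+D7A3).
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
SILENT_INITIAL = INITIALS.index("ㅇ")


def decompose(ch):
    """Return (initial, vowel, final) indices, or None if ch is not a hangul syllable."""
    offset = ord(ch) - 0xAC00
    if not 0 <= offset < 11172:
        return None
    return offset // 588, (offset % 588) // 28, offset % 28


def compose(initial, vowel, final):
    """Inverse of decompose: build the precomposed syllable."""
    return chr(0xAC00 + (initial * 21 + vowel) * 28 + final)


def resyllabify(word):
    """Move a simple final consonant onto a following vowel-initial syllable."""
    chars = list(word)
    for i in range(len(chars) - 1):
        cur, nxt = decompose(chars[i]), decompose(chars[i + 1])
        if cur is None or nxt is None:
            continue
        coda = FINALS[cur[2]]
        # Simplification: only finals that can also be initials, excluding ㅇ and ㅎ.
        if nxt[0] == SILENT_INITIAL and coda and coda in INITIALS and coda not in "ㅇㅎ":
            chars[i] = compose(cur[0], cur[1], 0)
            chars[i + 1] = compose(INITIALS.index(coda), nxt[1], nxt[2])
    return "".join(chars)
```

For example, `resyllabify("한국어")` yields a spelling that mirrors the pronunciation more closely than the standard orthography does, which is the kind of gap a TTS front end has to bridge.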

The existence of orthographic variants and the morphological complexity of Korean pose a special challenge to the developers of linguistic tools, especially in the field of information retrieval, since the same word or phrase can be written in multiple, often unpredictable, ways. This report focuses on the major types of orthographic variation in Korean, and provides a brief analysis of the linguistic issues to be considered by software developers.


2. Hangul Variants in Loanwords

The most important type of orthographic variation in Korean is the use of variant hangul spellings in the writing of loanwords, which are very common in Korean. There are also differences in the spelling of loanwords between North Korea and South Korea (described below). Table 1 shows some orthographic variants of common loanwords from English.

Table 1
Hangul Variants in Loanwords
English | Variant 1 | Variant 2
cake | 케이크 (keikeu) | 케잌 (keik)
yellow | 옐로우 (yelrou) | 옐로 (yelro)
juice | 주스 (juseu) | 쥬스 (jyuseu)


Another significant kind of orthographic variation occurs in the writing of non-Korean personal names, as shown in the table below.

Table 2
Non-Korean Personal Names
English | Variant 1 | Variant 2
Marx | 마르크스 (mareukeuseu) | 막스 (makseu)
Mao Zedong | 마오쩌뚱 (maojjeottung) | 모택동 (motaekdong)
Clinton | 클린턴 (keulrinteon) | 클린톤 (keulrinton)

Orthographic variation in loanwords and in non-Korean personal names is fairly common, and must be given special attention in the development of IR and translation tools.
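As a rough illustration of how such variants might be handled in an IR tool, the sketch below normalizes known variants to a single form and expands a query term to all of its known spellings. The entries come from Tables 1 and 2; the choice of which spelling is treated as canonical, and the names `LOANWORD_VARIANTS`, `normalize`, and `expand`, are assumptions made for the example.

```python
# Variant -> canonical spelling. Entries are taken from Tables 1 and 2;
# which member of each pair is "canonical" is an assumption for illustration.
LOANWORD_VARIANTS = {
    "케잌": "케이크",
    "옐로우": "옐로",
    "쥬스": "주스",
    "막스": "마르크스",
    "모택동": "마오쩌뚱",
    "클린톤": "클린턴",
}


def normalize(term):
    """Map a known variant to its canonical form; unknown terms pass through."""
    return LOANWORD_VARIANTS.get(term, term)


def expand(term):
    """Return every known spelling of term, canonical form first."""
    canonical = normalize(term)
    variants = sorted(v for v, c in LOANWORD_VARIANTS.items() if c == canonical)
    return [canonical] + variants
```

Normalization is suited to index time, when each document term is stored under one key, while expansion is suited to query time, when the index has to be left untouched.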

3. Cross-Script Orthographic Variants

An important factor contributing to the complexity of the Korean writing system is the use of multiple scripts. Korean is written in a mixture of three scripts: an alphabetic syllabary called hangul, Chinese characters called hanja, and the Latin alphabet, called romaja. Orthographic variation across scripts is not uncommon: the same word can be written in hangul, hangul mixed with numerals, hanja, or even English. Although the use of hanja in Korean is declining, hanja still play a significant role in newspapers, technical journals, classical works such as the Buddhist scriptures, and proper nouns.
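Detecting which scripts a token mixes is a natural first step in handling cross-script variants. The following sketch classifies characters by Unicode range and groups a token into runs of the same script. It is a coarse approximation under stated assumptions: only precomposed hangul syllables (U+AC00 to U+D7A3) and the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are recognized, and the function names are hypothetical.

```python
def script_of(ch):
    """Classify one character by Unicode range (coarse approximation)."""
    code = ord(ch)
    if 0xAC00 <= code <= 0xD7A3:
        return "hangul"
    if 0x4E00 <= code <= 0x9FFF:
        return "hanja"
    if ch.isascii() and ch.isalpha():
        return "latin"
    if ch.isascii() and ch.isdigit():
        return "numeral"
    return "other"


def script_runs(token):
    """Split a token into (script, text) runs of consecutive same-script characters."""
    runs = []
    for ch in token:
        script = script_of(ch)
        if runs and runs[-1][0] == script:
            runs[-1] = (script, runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return runs
```

A token that yields more than one run, such as a Latin letter followed by hangul, is a candidate for lookup in a cross-script mapping table.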

The table below shows the major patterns of cross-script variation in Korean.

Table 3
Cross-Script Orthographic Variants
Type of Variation | English | Var 1 | Var 2 | Var 3
Hanja vs. hangul | many people | 大勢 (daese) | 대세 (daese) |
Hangul vs. hybrid | shirt | 와이셔츠 (waisyeocheu) | Y셔츠 (waisyeocheu) |
Hangul vs. hybrid | T-shirt | 티셔츠 (tisyeocheu) | T셔츠 (tisyeocheu) |
Hangul vs. numeral vs. hanja | one o'clock | 한시 (hansi) | 1시 (hansi) | 一時 (hansi)
Hangul vs. numeral vs. hanja | 5 people | 다섯명 (daseosmyeong) | 5명 (daseosmyeong) | 5名 (daseosmyeong)
English vs. hangul | television | TV | 텔레비젼 (telrebijyeon) | 테레비 (terebi)
English vs. hangul | sex | sex | 섹스 (sekseu) |

4. North vs. South Korean Orthography

Another factor contributing to the irregularity of hangul orthography is the difference in spelling between South Korea and North Korea, which results from the independent language policy implemented by the North Korean government in the postwar era. The major differences are in the writing of loanwords (often influenced by Russian), a strong preference for native Korean words over loanwords, and the writing of non-Korean proper nouns.

Typical examples are shown in the table below.

Table 4
North vs. South Korean Proper Nouns
English | North Korea | South Korea
Osaka | 오사까 (osakka) | 오사카 (osaka)
San Diego | 싼디아고 (ssandiago) | 쌘디애고 (ssaendiaego)
Cuba | 꾸바 (kkuba) | 쿠바 (kuba)
Bush | 부슈 (busyu) | 부시 (busi)

Table 5
Other North Korean vs. South Korean Differences
Type of Variation | English | North Korea | South Korea
Loanwords | plus | 쁠류스 (ppeullyuseu) | 플러스 (peuleoseu)
Loanwords | missile | 미싸일 (missail) | 미사일 (misail)
Loanwords | bus | 뻐스 (ppeoseu) | 버스 (beoseu)
Russian vs. English loanword | group | 그루빠 (geuruppa) | 그룹 (geurup)
Russian vs. English loanword | campaign | 깜빠니아 (kkampania) | 캠페인 (kaempein)
Morphophonemic | abuse | 람용 (ramyong) | 남용 (namyong)
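The morphophonemic difference in the last row (람용 vs. 남용) follows a regular pattern in which a word-initial ㄹ or ㄴ kept in the North is changed in the South. The sketch below applies a simplified version of that pattern to the first syllable of a word. It is an illustration under stated assumptions: the pattern applies to Sino-Korean vocabulary only, the code makes no attempt to check etymology, the vowel set is a simplification, and the function name is hypothetical.

```python
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
# Simplified set of vowels before which initial ㄹ/ㄴ becomes the silent ㅇ.
Y_OR_I = set("ㅑㅕㅖㅛㅠㅣ")


def north_to_south_initial(word):
    """Rewrite a word-initial ㄹ or ㄴ the way Southern spelling does.

    Assumption: the caller has already established that the word is
    Sino-Korean; native words and loanwords must not be passed in.
    """
    if not word:
        return word
    offset = ord(word[0]) - 0xAC00
    if not 0 <= offset < 11172:
        return word  # first character is not a precomposed hangul syllable
    initial, vowel, final = offset // 588, (offset % 588) // 28, offset % 28
    consonant = INITIALS[initial]
    before_y_or_i = VOWELS[vowel] in Y_OR_I
    if consonant == "ㄹ":
        new = "ㅇ" if before_y_or_i else "ㄴ"
    elif consonant == "ㄴ" and before_y_or_i:
        new = "ㅇ"
    else:
        return word
    first = chr(0xAC00 + (INITIALS.index(new) * 21 + vowel) * 28 + final)
    return first + word[1:]
```

Because the rule is unsafe outside Sino-Korean vocabulary, it is best used to generate candidate variants that are then confirmed against a lexicon, rather than applied blindly.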

5. Miscellaneous Variants

5.1 New vs. Old Orthography

The Korean writing system has gone through several reforms during its history, including the invention of hangul by King Sejong in 1443, several reforms of the hangul script, and the abolition of hanja in North Korea. The most recent reform of hangul orthography took place in 1988. (For more information on Korean orthographic reforms, see Korean Writing Reforms.)

Though the new orthography is now fairly well established, the old orthography is still in common use. Because the affected words are of high frequency and their number is not insignificant, they should be considered in developing linguistic tools. The table below shows typical examples.


Table 6
New vs. Old Orthography
English | As of 1988 | Before 1988
Messenger | 심부름꾼 (simbureumkkun) | 심부름군 (simbureumgun)
Worker | 일꾼 (ilkkun) | 일군 (ilgun)
Color | 빛깔 (bitkkal) | 빛갈 (bitgal)

5.2 Hanja Variants

Hanja are written in the traditional character forms. Unlike in the PRC and Japan, language reforms in Korea did not include the simplification of character forms, so the problem of traditional versus simplified forms is not a major issue. Nevertheless, the Japanese occupation of the Korean Peninsula (1910 to 1945) had a strong influence, and many simplified Japanese character forms, including abbreviated forms, came into common use. Typical examples are shown in the table below.

Table 7
Hanja Variants
Type of Variation | English | Standard | Variant
Variant forms | fermentation | 醗酵 (balhyo) | 発酵 (balhyo)
Abbreviated form | 10 years old | 十歳 (sipse) | 十才 (sipse)
Traditional form | development | 發達 (baldal) | 発達 (baldal)

5.3 Miscellaneous Variants

There are various other types of orthographic variation, a detailed treatment of which is beyond the scope of this report. The table below shows the most important ones.

Table 8
Miscellaneous Variants
Type of Variation | English | Var 1 | Var 2
Abbreviations | farmers cooperative | 농협 (nonghyeop) | 농업협동조합 (nongeophyeopdongjohap)
Abbreviations | Korean Tobacco | KT | 한국담배인삼공사 (hangukdambaeinsamgongsa)
Word spacing | top class | 톱 클래스 (topkeulraeseu) | 톱클래스 (topkeulraeseu)
Word spacing | North Sea | 북 해 (bukhae) | 북해 (bukhae)
Word spacing | Caribbean Sea | 카리브 해 (karibeuhae) | 카리브해 (karibeuhae)
Punctuation | period | |
Punctuation | quotes | “” | 『』
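Word-spacing and punctuation variants can often be neutralized by matching on a derived key rather than on the surface string. The sketch below is one possible approach, not a prescribed method: it folds the corner-bracket quotes shown in the table to curly quotes and removes all whitespace. The name `index_key` and the direction of the quote folding are assumptions.

```python
# Hypothetical folding table: corner-bracket quotes are mapped to curly quotes.
PUNCTUATION_FOLD = str.maketrans({"『": "“", "』": "”"})


def index_key(text):
    """Build a spacing- and punctuation-insensitive key for matching."""
    folded = text.translate(PUNCTUATION_FOLD)
    # str.split() with no arguments splits on any run of whitespace.
    return "".join(folded.split())
```

Since removing every space can conflate phrases that are genuinely distinct, the key is intended for candidate matching only; the original text should be kept for display and ranking.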

6. Synonym and Cross-Language Expansion

Synonym expansion and cross-language expansion are advanced forms of variant expansion. The details are beyond the scope of this report. In a nutshell, synonym expansion generates a picklist of synonyms and other semantically related variants; cross-language expansion generates a picklist of Korean equivalents to a foreign language source term. These are shown in the table below, along with some miscellaneous variant types such as abbreviations and loanwords.

 
Table 9
Synonym and Cross-Language Expansion
Type of Variation | English | Var 1 | Var 2 | Var 3 | Var 4
Synonyms | money | 돈 (don) | 현금 (hyeongeum) | 통화 (tonghwa) | 화폐 (hwapye)
Synonyms | calendar | 카렌더 (karendeo) | 달력 (dalryeok) | |
Cross-language | Tokyo | Tokyo | 도쿄 (dokyo) | 동경 (donggyeong) |
Cross-language | country | country | 국가 (gukga) | 나라 (nara) |

A particularly important class of synonyms in Korean is native words as opposed to those derived from other languages, especially Sinitic words and English-derived loanwords, as shown in the table below.

Table 10
Native vs. non-Native Words
Type of Variation | English | Var 1 | Var 2
Native vs. Sinitic | light bulb | 불알 (bulal) | 전구 (jeongu)
Native vs. Sinitic | wig | 덧머리 (deotmeori) | 가발 (gabal)
Native vs. loanword | ice cream | 얼음보숭이 (eoleumbosungi) | 아이스크림 (aiseukeurim)
Native vs. loanword | juice | 과일단물 (gwaildanmul) | 주스 (juseu)
Sinitic vs. loanword | ball-point pen | 원주필 (wonjupil) | 볼펜 (bolpen)
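The picklist idea described in this section can be sketched with two small lookup structures, one for synonym groups and one for cross-language equivalents. The data below is drawn from Tables 9 and 10; the structure, the names `SYNONYM_GROUPS`, `CROSS_LANGUAGE`, and `picklist`, and the decision to match English keys case-insensitively are assumptions made for illustration.

```python
# Each inner list is one group of interchangeable terms (from Tables 9 and 10).
SYNONYM_GROUPS = [
    ["현금", "통화", "화폐"],
    ["달력", "카렌더"],
    ["전구", "불알"],
    ["아이스크림", "얼음보숭이"],
]

# Foreign-language source term -> Korean equivalents (from Table 9).
CROSS_LANGUAGE = {
    "tokyo": ["도쿄", "동경"],
    "country": ["국가", "나라"],
}


def synonyms(term):
    """Return the other members of the first group containing term."""
    for group in SYNONYM_GROUPS:
        if term in group:
            return [t for t in group if t != term]
    return []


def picklist(query):
    """Return candidate search terms for a Korean or English query."""
    key = query.casefold()
    if key in CROSS_LANGUAGE:
        return list(CROSS_LANGUAGE[key])
    return [query] + synonyms(query)
```

A production system would need sense disambiguation, since members of a synonym group are rarely interchangeable in every context; the sketch treats each group as flat.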

7. Mapping Tables and Lexical Databases

A central component of tools for processing Korean orthographic and other variants is a database that combines hard-coded mapping tables of orthographic variants with lexical databases for synonym and cross-language expansion, fine-tuned to the needs of variant expansion and normalization. Below is a list of the components required for building such a database (these are currently under development at The CJK Dictionary Institute).

  1. A comprehensive mapping table of hangul variants, especially in loanwords.
  2. A mapping table of cross-script orthographic variants.
  3. A mapping table of North vs. South Korean variants, especially in loanwords.
  4. Mapping tables for miscellaneous orthographic variants, including hanja variants, old vs. new orthography, and abbreviations.
  5. A dictionary of semantically classified synonym groups (Korean thesaurus).
  6. An English-Korean dictionary covering general vocabulary and important proper names designed for CLIR (Cross-Language Information Retrieval).
  7. A collection of orthographic variation rules for generating and identifying variants not listed in the database.
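The interplay between hard-coded tables (items 1 to 4) and variation rules (item 7) might look roughly like the sketch below: a term is first looked up in a mapping table, and rules are applied only when no entry is found. The single rule shown, which rewrites ㅑ/ㅕ/ㅛ/ㅠ as ㅏ/ㅓ/ㅗ/ㅜ after ㅈ, ㅉ, or ㅊ, is inferred from pairs such as 쥬스 and 주스 in Table 1; treating it as a general loanword rule is an assumption, as are all names in the code. It must not be applied to native words, where such syllables can be legitimate.

```python
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
PLAIN_VOWEL = {"ㅑ": "ㅏ", "ㅕ": "ㅓ", "ㅛ": "ㅗ", "ㅠ": "ㅜ"}

# Hard-coded mapping table; the two entries are taken from Table 1.
VARIANT_TABLE = {"케잌": "케이크", "옐로우": "옐로"}


def fold_palatal_glide(word):
    """Rule (assumed, loanwords only): after ㅈ/ㅉ/ㅊ, drop the y-glide of the vowel."""
    out = []
    for ch in word:
        offset = ord(ch) - 0xAC00
        if 0 <= offset < 11172:  # precomposed hangul syllable
            initial, vowel, final = offset // 588, (offset % 588) // 28, offset % 28
            if INITIALS[initial] in "ㅈㅉㅊ" and VOWELS[vowel] in PLAIN_VOWEL:
                vowel = VOWELS.index(PLAIN_VOWEL[VOWELS[vowel]])
                ch = chr(0xAC00 + (initial * 21 + vowel) * 28 + final)
        out.append(ch)
    return "".join(out)


RULES = [fold_palatal_glide]


def normalize(term):
    """Table lookup first; fall back to rules for terms not listed."""
    if term in VARIANT_TABLE:
        return VARIANT_TABLE[term]
    for rule in RULES:
        term = rule(term)
    return term
```

Giving the table priority over the rules keeps irregular pairs under manual control, while the rules catch regular variants that nobody has listed yet.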

8. Documents for Reference

See the following links for more information: