The Challenges of Japanese Speech Technology

日本語音声技術への挑戦

1. Is Japanese a "phonetic" language?

There is a common fallacy that the kana syllabary is "phonetic"; that is, that the kana representation of Japanese is "pronounced as it is written and written as it is pronounced." Technically, this should be called "phonemic" rather than phonetic. In a truly phonemic orthography, there is a one-to-one correspondence between grapheme and phoneme, but the kana syllabary is not truly phonemic. This is the first hurdle that needs to be overcome in developing Japanese speech technology systems.

The CJK Dictionary Institute (CJKI) specializes in Chinese, Japanese and Korean (CJK) lexicography and is engaged in the development of phonological and phonetic databases that accurately indicate how Japanese and Chinese words are pronounced in actual speech. This document briefly describes some of the main issues in Japanese phonology and phonetics without attempting to cover any of them in depth. It does, however, clarify the difficult challenges, from a phonetic point of view, that must be tackled in achieving natural speech synthesis. The explanations are rather technical and require background knowledge of Japanese phonology and phonetics.

For more information on CJKI's phonological databases, please see our Japanese Phonetic Database.

2. Conventions used in this document

We will illustrate the conventions used in this document by using 新聞 shimbun 'newspaper' as an example.

Slashes for phonemic transcription, e.g. /siNbuN/.
Katakana for kana orthography, e.g. シンブン.
Square brackets for phonetic transcription in IPA, e.g. [ɕimbɯɴ].
Italics for Hepburn romanization, e.g. shimbun.
Single quotes for English equivalents of Japanese words, e.g. 'newspaper'.

The five standard Japanese vowels are transcribed in IPA (International Phonetic Alphabet) using broad transcription, as shown in the table below. This is not a precise representation of the actual pronunciation. For example, both エ /e/ and オ /o/ are lower than the cardinal vowels [e] and [o], so that [e̞] and [o̞] are closer. In particular, the realization of ウ /u/ is variable, as shown in the Narrow column below, and cannot be rendered precisely by any IPA symbol. It is a somewhat fronted back vowel pronounced with compressed lips, and is different from the close back unrounded vowel represented by [ɯ]. Nevertheless, /u/ is traditionally transcribed by [ɯ], and we will follow that convention.

Kana	Phonemic	Broad	Narrow
ア	a	a	ä
イ	i	i	i
ウ	u	ɯ	ɯ̹̈ ~ ü
エ	e	e	e̞
オ	o	o	o̞

Another issue is how to transcribe fricatives and affricates such as シ, チ and ジ. Traditionally, perhaps through the influence of English, these are often transcribed as [ʃi], [tʃi] and [dʒi], but we have opted for [ɕi], [tɕi] and [dʑi] because strictly speaking these sounds are pronounced slightly further back than the corresponding English ones; that is, they are alveolopalatal rather than palatoalveolar (an even more precise rendering could be [cɕi] for チ and [ɟʑi] for ジ).

3. Is kana a phonemic orthography?

The fact is that, although the kana syllabary is on the whole fairly phonemic, in that each kana symbol represents one phoneme (such as ア = /a/) or a specific sequence of two phonemes (such as カ = /k/ + /a/), kana can be ambiguous, as when ウ represents either /o/ or /u/. For example, ウ in トウ as the reading of 塔 'tower' represents the phoneme /o/ (phonetically [o]), lengthening the previous /o/, while in トウ as the reading of 問う 'ask' it represents /u/ (phonetically [ɯ]), so that the word consists of two distinct vowel phonemes.

The kana syllabary is not a true phonemic orthography because of various one-to-many ambiguities, such as:

Variation in long vowel representation, e.g. both ウ and オ are used for long /o/, as in トウリ (党利 'party interests') and トオリ (通り 'road'), and both エ and イ are used for long /e/, as in ネエサン /neesan/ (姉さん 'older sister') and ケイサン /keesan/ (計算 'calculation').
Similarly, イ can represent either /i/ or /e/ in ケイ、メイ etc. For example, メイン 'main' can be /mein/ or /meen/.
Historical kana ambiguity, such as ハ representing either /ha/ or /wa/. For example, the topic marker /wa/ is written ハ, as in ワタシハ (私は 'I'), yet it is pronounced exactly like ワ /wa/.
Some kana pairs, such as ジ/ヂ /zi/ and ズ/ヅ /zu/, represent exactly the same phonemes. For example, /zi/ (phonetically [dʑi]) is normally written ジ, as in ジブン (自分 'self'), but is written ヂ in チヂム (縮む 'shrink').
Allophonic alternation, such as the phoneme /t/ being realized as [tɕ] before /i/ and as [ts] before /u/, e.g. チ /ti/ is [tɕi] and ツ /tu/ is [tsɯ], while タ /ta/ is [ta].

The phonemic component of our phonological database eliminates such grapheme-to-phoneme ambiguity by enabling normalization of a Japanese lexeme expressed as a string of kana characters to an unambiguous string of phonemes.
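
As a minimal sketch of why such normalization has to be keyed on the lexeme rather than on the kana string alone, the Python fragment below uses a tiny lookup table. The names PHONEMIC_LEXICON and to_phonemic are illustrative assumptions and do not describe the structure of any actual CJKI database; the entries are taken from the examples in this section.

```python
# Hypothetical mini-lexicon: (orthography, kana) -> phonemic transcription.
# The entries come from the examples above; a real database would be far larger.
PHONEMIC_LEXICON = {
    ("塔", "トウ"): "too",       # ウ lengthens the preceding /o/
    ("問う", "トウ"): "tou",     # ウ is a separate vowel /u/
    ("通り", "トオリ"): "toori",  # long /o/ written with オ
    ("党利", "トウリ"): "toori",  # long /o/ written with ウ
}


def to_phonemic(orthography, kana):
    """Return the phonemic string for a lexeme; raise KeyError if it is unknown."""
    return PHONEMIC_LEXICON[(orthography, kana)]


# The same kana string トウ maps to two different phoneme strings.
print(to_phonemic("塔", "トウ"))
print(to_phonemic("問う", "トウ"))
```

The kana string by itself cannot serve as the key, since トウ would then have to map to both results at once.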

4. Allophonic variation -- the real challenge

Grapheme-to-phoneme ambiguity is only half the battle -- the easy half. The real challenge is to convert the phonemes into phones, the actual speech sounds. Since Japanese, like all other languages, has allophonic variation that depends on the phonetic environment, there is often no phoneme-to-phone correspondence on a one-to-one basis. Accurate phoneme-to-phone transformation is a prerequisite to natural speech synthesis.

Let us briefly examine some of the issues in the variation of Japanese pronunciation from a phonetic, rather than phonemic, point of view. The environments that lead to allophonic variation in Japanese are complex (though perhaps simpler than in languages like Korean). Kana, even if modified to eliminate one-to-many ambiguities (such as by adding diacritics to indicate whether ウ represents [ɯ] or [o]) or converted to a romanized phonemic transcription, cannot represent the precise speech sounds, or phones, as they are actually pronounced in various environments; that is, the differences in their phonetic realizations.

Below is a brief description of the main allophonic and other phonetic changes, such as devoicing of vowels and nasalization of moraic /N/, which occur mostly unconsciously in the natural speech of native speakers. These have a marked effect on the naturalness of synthesized speech, and must be considered in the development of speech synthesis technology.

Vowel devoicing
A salient feature of the standard Tokyo dialect is that the unaccented high vowels /i/ and /u/ tend to be devoiced between voiceless consonants and in certain other environments. For example, /su/ in /ainosuke/ (愛之助) is realized as a devoiced [sɯ̥], rather than the normal [sɯ]. Optionally, devoicing also occurs word finally after voiceless consonants, so that 続く /tuzuku/ could be either [tsɯ̥zɯkɯ̥] or [tsɯ̥zɯkɯ]. Devoicing may also occur for /o/ and even /a/, but these cases are of lesser importance.
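
The core devoicing rule can be sketched in Python as follows. This is a simplified illustration, not a complete model: it covers only the environment between voiceless consonants and the optional word-final case, and ignores the other environments mentioned above. The function name devoice, its parameters and the set of voiceless phonemes are assumptions made for this example.

```python
VOICELESS = {"k", "s", "t", "h", "p"}   # assumed set of voiceless consonant phonemes
HIGH_VOWELS = {"i", "u"}
RING_BELOW = "\u0325"                   # IPA diacritic marking a devoiced segment


def devoice(phonemes, accented=frozenset(), final=False):
    """Mark devoiced high vowels in a list of single-phoneme strings.

    accented: indices of accented vowels, which are never devoiced here
    final:    if True, also devoice a word-final high vowel that follows
              a voiceless consonant (the optional case)
    """
    out = []
    last = len(phonemes) - 1
    for i, p in enumerate(phonemes):
        devoiced = False
        if (p in HIGH_VOWELS and i not in accented
                and i > 0 and phonemes[i - 1] in VOICELESS):
            if i < last:
                devoiced = phonemes[i + 1] in VOICELESS
            else:
                devoiced = final
        out.append(p + RING_BELOW if devoiced else p)
    return out


# /ainosuke/: the /u/ between /s/ and /k/ is marked as devoiced.
print("".join(devoice(list("ainosuke"))))
```

Passing final=True for /tuzuku/ marks the last /u/ as well, which corresponds to the optional word-final devoicing.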

Nasalization of /g/

/g/ is pronounced as [g] word initially, but is often nasalized word-internally and realized as [ŋ] (for some speakers as [^ŋg]), a phonetically distinct allophone of /g/ which is normally realized as the voiced velar stop [g]. For example, in /kage/ (影), /ge/ represents a nasalized allophone of /ge/ realized as [ŋe], so it is pronounced [kaŋe]. Nasalized /g/ is gradually falling into disuse among the younger generations. For some speakers, especially in fast speech, intervocalic /g/ in certain environments is realized as a voiced velar fricative [ɣ], so that /kage/ becomes [kaɣe].

Orthographic	Kana	Phonetic	Remarks
影	カゲ	[kage]	unnasalized by most speakers
影	カゲ	[kaŋe]	nasalized intervocalically
影	カゲ	[kaɣe]	fricativized in fast speech
画像	ガゾウ	[gazoː]	never nasalized word initially
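
Because the choice among these variants depends on the speaker and the speech rate, a synthesizer could enumerate the candidates and leave the selection to a speaker profile. The following Python sketch illustrates this; the function g_allophones and its parameters are hypothetical names introduced here.

```python
def g_allophones(word_initial, intervocalic=False):
    """List candidate phones for /g/ in a given position, plain stop first."""
    if word_initial:
        return ["g"]               # never nasalized word initially
    candidates = ["g", "ŋ"]        # plain stop or nasalized variant
    if intervocalic:
        candidates.append("ɣ")     # fricativized variant heard in fast speech
    return candidates


# /kage/ (影): the /g/ is word-internal and stands between vowels.
print(["ka" + g + "e" for g in g_allophones(word_initial=False, intervocalic=True)])
```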

Nasal assimilation

The phoneme /N/ (ン), a moraic nasal, is realized as six different allophones governed by complex rules of nasal assimilation involving coarticulation. Even linguistically naive native speakers, for whom allophonic variation is mostly unconscious, notice that /N/ is pronounced as [m] in certain environments, such as in /siNbuN/ (新聞 シンブン), realized as [ɕimbɯɴ]. But the other allophonic rules for /N/ are subtle and mostly go unnoticed by native speakers. The most important rules (stated somewhat informally) are that /N/ is realized as a bilabial nasal [m] when followed by the bilabials /m/, /b/ or /p/, as a velar nasal [ŋ] when followed by the velar stops /k/ and /g/, as [n] before /t/, /d/, /n/ and /r/, as a palatal nasal [ɲ] before a palatalized /n/ (as in コンニャク), as a nasalized vowel of various qualities before vowels, semivowels and some consonants, and as a uvular nasal [ɴ] word finally. The examples below illustrate the six different realizations of /N/.

Orthographic	Kana	Phonetic
自分	ジブン	dʑibɯɴ
純子	ジュンコ	dʑɯŋko
慎一	シンイチ	ɕiĩtɕi̥
新聞	シンブン	ɕimbɯɴ
運動	ウンドウ	ɯndoː
蒟蒻	コンニャク	koɲɲakɯ
本	ホン	hoɴ, hõ

For natural speech synthesis, it is important to generate the correct allophone of /N/.
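
The informal rules above can be sketched as a small Python function. It is an illustration under simplifying assumptions: the function name realize_N is hypothetical, the input is the phonemic string that follows /N/ (empty when /N/ is word final), and the nasalized vowel is returned as a placeholder because its quality depends on the neighbouring vowels.

```python
NASALIZED_VOWEL = "V\u0303"   # placeholder: the actual vowel quality varies


def realize_N(rest):
    """Return a broad allophone of moraic /N/ given the phonemes that follow it."""
    if not rest:
        return "ɴ"                      # word finally: uvular nasal
    if rest.startswith("nj"):
        return "ɲ"                      # before palatalized /n/, as in コンニャク
    first = rest[0]
    if first in {"m", "b", "p"}:
        return "m"                      # before bilabials
    if first in {"k", "g"}:
        return "ŋ"                      # before velar stops
    if first in {"t", "d", "n", "r"}:
        return "n"
    return NASALIZED_VOWEL              # vowels, semivowels, remaining consonants


# /siNbuN/: the first /N/ precedes /b/, the second is word final.
print(realize_N("buN"), realize_N(""))
```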

Spirantization of affricates

A subtle feature of Japanese allophonic variation is the spirantization of certain affricates word-internally. For example, /zi/ is realized as an alveolopalatal affricate [dʑi] word-initially, as in [dʑibɯɴ] (自分 'self'), but intervocalically as the alveolopalatal fricative [ʑi] (or sometimes [^dʑi]), as in [haʑi] (恥 'shame'). Similarly, /zu/ is realized as [dzɯ] word initially but as [zɯ] in other positions. A related phenomenon is a tendency, especially among young Tokyo females, to pronounce /si/ as [si], with an alveolar fricative, rather than the normal [ɕi], with an alveolopalatal fricative.

Orthographic	Phonemic	Phonetic
自分	/zibuN/	[dʑibɯɴ (ɟʑibɯɴ)]
恥	/hazi/	[haʑi (ha^dʑi)]
地震	/zisiN/	[dʑiɕiɴ (ɟʑiɕiɴ)]
塩	/sio/	[ɕio (sio)]

Spirantization is essentially a kind of weakening (lenition) of affricates in most non-initial positions, the degree of which depends on the speaker, so that the fricative/affricate alternation can be said to be in free variation.

Sequential voicing

Sequential voicing, or rendaku (連濁 /rendaku/), is a common phenomenon governed by complex rules, in which the initial consonant of the second element of a compound is frequently voiced. For example, 寿司 'sushi' is pronounced [sɯɕi] in isolation, but its initial [s] becomes [z] in the compound いなり寿司 'inarizushi', pronounced [inaɾizɯɕi].

Orthographic	Voiceless	Voiced
いなり寿司	[inaɾi] + [sɯɕi]	[inaɾizɯɕi]
物語	[mono] + [kataɾi]	[monogataɾi]
二人連れ	[fɯtaɾi] + [tsɯɾe]	[fɯtaɾizɯɾe]
棒立ち	[boː] + [tatɕi]	[boːdatɕi]
花火	[hana] + [hi]	[hanabi]

The last item in the table shows a change from the glottal fricative [h] to the bilabial stop [b], which is phonetically unrelated to voicing but is a common rendaku phenomenon. Attempting to predict sequential voicing by rules is futile as it is often unpredictable. The only safe way is to store such voiced lexemes in a hardcoded database.
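
A lookup of this kind might be sketched in Python as follows. The table RENDAKU_LEXICON and the function compound are hypothetical names for illustration; the entries are copied from the table above, and plain concatenation is used as the fallback when no voiced form is stored.

```python
# Hypothetical store of compounds whose second element undergoes rendaku.
# Keys are (first element, second element) in broad phonetic transcription.
RENDAKU_LEXICON = {
    ("inaɾi", "sɯɕi"): "inaɾizɯɕi",
    ("mono", "kataɾi"): "monogataɾi",
    ("boː", "tatɕi"): "boːdatɕi",
    ("hana", "hi"): "hanabi",
}


def compound(first, second):
    """Join two elements, preferring a stored voiced form over concatenation."""
    return RENDAKU_LEXICON.get((first, second), first + second)


print(compound("hana", "hi"))
print(compound("mono", "kataɾi"))
```

The design choice here is that nothing is predicted: a compound is voiced only if the database says so, which avoids overgenerating voiced forms.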

Palatalization
Certain consonants followed by /j/ are palatalized. This is represented in kana by small ャ, ュ and ョ, as in ギャ [gʲa], ギュ [gʲɯ] and ギョ [gʲo]. (A table of these is shown in the document Kana and Romanization.) Certain consonants followed by /i/, especially /n/, are also palatalized, so that /niQpon/ (日本 'Japan') is pronounced [nʲip̚poɴ]. Light palatalization may also occur in such sequences as /mi/ and /ki/, as shown below.

Orthographic	Phonemic	Phonetic
客	/kjaku/	[kʲakɯ]
日本	/niQpon/	[nʲip̚poɴ]
民	/tami/	[tamʲi]
滝	/taki/	[takʲi]

Consonant Gemination

Geminated consonants in Japanese consist of two identical consonants (moraic obstruents) interrupted by a pause, with each consonant belonging to a different mora. This is represented in kana by small ッ. According to the mainstream interpretation of Japanese phonology, this is phonemicized as an archiphoneme represented by /Q/. For example, /tatta/ (立った 'stood') becomes /taQta/.

/Q/ represents a variety of phonetically distinct sounds depending on the following consonant. In cases like /haQsai/ (八歳 'eight years old'), it doubles the consonant and is realized as [hassai], but in cases like /taQta/, the quality of the first [t] is different because it is unreleased, whereas the second [t] is a normal stop. Strictly speaking, /taQta/ is thus realized as [tat̚ta], rather than [tatta]. Moreover, if the geminated consonant is an affricate, only the plosive portion of the affricate is repeated, so that /haQtjuu/ (発注 'ordering goods') is realized as [hat̚tɕɯː], not as [hatɕtɕɯː].

Orthographic	Phonemic	Phonetic
日本	/niQpon/	[nʲip̚poɴ]
八歳	/haQsai/	[hassai]
発注	/haQtjuu/	[hat̚tɕɯː]
発車	/haQsja/	[haɕɕa]
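
The three cases described above (fricative, stop, affricate) can be sketched in Python. The function realize_Q and the small phone inventories are assumptions for illustration; word-final /Q/ is not handled here.

```python
UNRELEASED = "\u031A"                       # IPA "no audible release" diacritic
STOPS = {"p", "t", "k"}
AFFRICATE_ONSETS = {"tɕ": "t", "ts": "t"}   # plosive portion of each affricate


def realize_Q(next_phone):
    """Return the phone that realizes /Q/ before the given consonant phone."""
    if next_phone in AFFRICATE_ONSETS:
        # only the plosive portion of the affricate is repeated, unreleased
        return AFFRICATE_ONSETS[next_phone] + UNRELEASED
    if next_phone in STOPS:
        return next_phone + UNRELEASED      # unreleased copy of the stop
    return next_phone                       # fricatives are simply doubled


# /haQsai/ and /taQta/ from the text above.
print("ha" + realize_Q("s") + "sai")
print("ta" + realize_Q("t") + "ta")
```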

Vowel Glottalization

Japanese vowels are sometimes preceded or followed by a glottal stop, often in short words standing alone, and for emphasis. In some cases, the word-final glottal stop is clearly audible and is represented orthographically by small ッ, as in アッ /aQ/ 'Oh!', pronounced [aʔ].

Orthographic	Phonemic	Phonetic
アッ	/aQ/	[aʔ̚]
鵜	/u/	[ʔɯʔ]
サッ	/saQ/	[saʔ]

5. Pitch Accent

The accentual system of Japanese is a mora-based pitch accent, which is distinct from typical tone languages like Chinese. In Chinese, the tone for each syllable must be specified, whereas in Japanese it is only necessary to specify the accented mora, from which the pitch pattern of the entire word can be determined by phonological rules.

For example, in /anata/ (あなた 'you'), the second mora /na/ is high pitched. All morae following the accented one are lowered. In addition, there is a rule that the first mora is always lowered, unless it is the accented one. This means that /anata/ gets a pitch pattern of LHL (low-high-low), pronounced [anáta]. If no accent is specified, the word is considered accentless, as in /katati/ (形 'shape'), but lowering the pitch of the first mora results in a pitch pattern of LHH.

Below are some examples of Japanese accent patterns (the first three from our Japanese surnames database). The "L" or "H" in parentheses indicate the pitch of particles immediately following the word. The number indicates the accented mora; that is, the mora immediately following the accented mora drops in pitch.

In accentless words (about 80% of the Japanese lexicon), represented by "0", the first mora must be lowered by the above rule, and the rest remain high up to and including the following particle (such as the subject marker が /ga/). This is in contrast to words whose final mora is accented (尾高型 /odakagata/), such as /kagami/ (鏡 'mirror') (accent pattern "3"), in which the pitch drops immediately after the word. That is, /katati ga/ (accentless) has a pitch pattern of LHH(H), whereas /kagami ga/ has a pattern of LHH(L).

Orthographic	Kana	Phonetic	Accent	Pitch Pattern	Remarks
井川	イカワ	[ikawa]	1	HLL(L)	first mora accented
井田	イダ	[ida]	0	LH(H)	accentless
磯貝	イソガイ	[isoŋai]	2	LHLL(L)	second mora accented
鏡	カガミ	[kaŋami]	3	LHH(L)	last mora accented
形	カタチ	[katatɕi]	0	LHH(H)	accentless
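
The rules described in this section are simple enough to express as a short function. The following Python sketch derives the pitch pattern from the number of morae and the accent number; the function name pitch_pattern is an assumption for this example, and the output format follows the Pitch Pattern column above.

```python
def pitch_pattern(n_morae, accent):
    """Derive a pitch pattern such as 'LHL(L)' from mora count and accent number.

    accent is 0 for accentless words, otherwise the 1-based index of the
    accented mora. The letter in parentheses is the pitch of a following particle.
    """
    if n_morae < 1 or not 0 <= accent <= n_morae:
        raise ValueError("accent must lie between 0 and the number of morae")
    if accent == 0:
        # accentless: first mora low, the rest high, particle stays high
        word = ["L"] + ["H"] * (n_morae - 1)
        particle = "H"
    else:
        # high up to and including the accented mora, low afterwards
        word = ["H" if i <= accent else "L" for i in range(1, n_morae + 1)]
        if accent != 1:
            word[0] = "L"       # first mora is lowered unless it is accented
        particle = "L"
    return "".join(word) + "(" + particle + ")"


for name, morae, accent in [("ikawa", 3, 1), ("ida", 2, 0), ("isogai", 4, 2),
                            ("kagami", 3, 3), ("katati", 3, 0)]:
    print(name, pitch_pattern(morae, accent))
```

Storing only the accent number per word and deriving the pattern on demand keeps the database compact, which is the advantage of a pitch-accent system noted at the start of this section.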

6. The Role of Phonological Databases

The days of metallic, flat voices (as spoken by robots in science fiction movies) are over. Users are increasingly expecting natural speech from computers, not just properly pronounced, but also properly accented. This means that it is not only necessary to eliminate the phonemic ambiguities resulting from kana orthography, but also to pay close attention to the generation of allophonic variants and correct accent patterns.

Speech technology systems, no matter how advanced or sophisticated, must have access to phonetic/phonological databases. To this end, the CJK Dictionary Institute is engaged in research and development of comprehensive CJK phonological databases (see Japanese Phonetic Database) that, among other things, provide accurate phonemic and phonetic transcriptions. Our goal is to contribute to the advancement of CJK speech technology by providing software developers with comprehensive phonological databases as well as linguistic consulting.

The CJK Dictionary Institute
