Adapting to Various UTAU Voicebanks

This article first gives a general explanation of how the system works, then explains how ScoreDraft (UtauDraft) adapts to different kinds of UTAU voicebanks, including things users should be aware of when using it.

Big Thanks to UTAU and all Voicebank Makers

First, I want to give big thanks to UTAU and to the voicebank makers who follow UTAU’s protocol. Building a singing synthesizer from scratch is already challenging work, let alone building voicebanks at the same time. The existence of UTAU and its large number of third-party voicebank makers makes it possible for ScoreDraft to take a shortcut. In return, ScoreDraft gives UTAU lovers one more way to have fun with their favorite UTAU voicebanks.

UtauDraft as an extension of ScoreDraft

Here, I’m in a way reiterating what has been said in the introduction.
ScoreDraft (and PyScoreDraft) define how singing (and other musical information) is represented. How the inputs are processed is decided by the extensions. As one of the extensions, UtauDraft tries to be compatible with as many kinds of UTAU voicebanks as possible, including 単独音, 連続音, VCV, and CVVC. To better interpret the voicebank data, UtauDraft also makes use of the oto.ini file, the .frq files, and the prefix.map file that come with the voicebanks. The architecture allows other singing extensions that adapt to other protocols to co-exist, but there is no such plan so far (KeLa not counted).

The basic usage of UtauDraft is to extract your UTAU voicebanks right into the /python_test/UTAUVoice folder, just like the “UTAU\voice” folder when using UTAU. In most cases, the folder name of the voicebank needs to be changed, because UtauDraft uses it to define a Python function that initializes a singer object for the voicebank. For example, if the folder of a voicebank is named “Ayaka”, there will be a function ScoreDraft.Ayaka_UTAU() for initializing an instance of the singer Ayaka.
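The naming rule can be checked before copying a voicebank over. The sketch below is only an illustration: the helper `singer_function_name` is hypothetical and not part of ScoreDraft, and it assumes that the function name is simply the folder name followed by `_UTAU`, as in the Ayaka example.

```python
def singer_function_name(folder_name):
    """Return the initializer name a voicebank folder would map to.

    Assumption: the name is '<folder>_UTAU', as in the Ayaka example.
    This helper is hypothetical and is not part of ScoreDraft.
    """
    name = folder_name + "_UTAU"
    # A folder name with spaces or punctuation cannot become a function name.
    if not name.isidentifier():
        raise ValueError(
            "folder name %r does not form a valid Python function name; "
            "rename the folder first" % folder_name)
    return name


print(singer_function_name("Ayaka"))
```

A folder name containing spaces, such as "Kasane Teto", would raise ValueError here, which is the kind of name that needs to be changed.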

Basic Lyric Handling

The standards of voicebanks are highly diversified, with lyric definitions being a main aspect of such diversity.

By default, you can use the lyrics as defined in the oto.ini. When the file prefix.map is present, the default behavior is that the prefixes will be automatically applied according to prefix.map and the pitches of notes, so you should not include the prefixes when writing the lyrics. To change that behavior (to manual), call “singer.setUsePrefixMap(False)” like:


singer=ScoreDraft.Ayaka_UTAU()
singer.setUsePrefixMap(False)

With the prefix map turned off, you can include the prefixes manually when writing the lyrics.
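To make the automatic behavior more concrete, here is a minimal sketch of how a prefix map could be applied. It assumes the common UTAU layout of prefix.map, one line per note in the form `note<TAB>prefix<TAB>suffix`; the function names and the sample data are made up for illustration, and this is not UtauDraft's actual code.

```python
def parse_prefix_map(text):
    """Parse prefix.map content into {note_name: (prefix, suffix)}.

    Assumed layout: 'note<TAB>prefix<TAB>suffix', one note per line.
    """
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        note = fields[0].strip()
        prefix = fields[1] if len(fields) > 1 else ""
        suffix = fields[2] if len(fields) > 2 else ""
        table[note] = (prefix, suffix)
    return table


def apply_prefix_map(lyric, note_name, table):
    """Decorate a plain lyric with the prefix/suffix registered for a note."""
    prefix, suffix = table.get(note_name, ("", ""))
    return prefix + lyric + suffix


# Hypothetical sample: a low and a high recording set, selected by suffix.
SAMPLE = "C4\t\t_low\nC5\t\t_high\n"
table = parse_prefix_map(SAMPLE)
print(apply_prefix_map("ka", "C5", table))
```

A note that is not listed in the map leaves the lyric unchanged in this sketch.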

Most UTAU voicebanks contain overlapping components. Such a design does a great job of making the generated sound smooth and natural. However, it increases the effort needed to write down notes and lyrics, especially when each syllable needs to be mapped to more than one lyric string in oto.ini.

The more intuitive way of writing lyrics is to write a lyric per syllable without the redundancy. A syllable is also the smallest unit that needs to be aligned with pitches, so this is a natural choice for a singing synthesizer.

UTAU has various plugins (including presamp) that help convert per-syllable lyrics to the ones defined in oto.ini. For CVVC/VCCV voicebanks, that process also involves dividing syllables into diphones.

In ScoreDraft, similar lyric conversions can be done by lyric-converters. Several lyric-converters are provided alongside ScoreDraft:

  • ScoreDraft.JPVCVConverter: for Japanese 連続音
  • ScoreDraft.TsuroVCVConverter: for Tsuro’s VCV Chinese
  • ScoreDraft.CVVCChineseConverter: for CVVChinese
  • ScoreDraft.TTEnglishConverter: for Delta CV VC English
  • ScoreDraft.VCCVEnglishConverter: for CZ’s VCCV English

The lyric-converters are basically callback functions defined separately, which can be modified/extended by users for their own needs.

The next sections explain how each kind of UTAU voicebank is used.

Japanese voicebanks

Japanese voicebanks are generally easy to handle, because the language has a well defined set of syllables that have fixed representations in written language.

However, an inherent limitation of Shift-JIS encoded Japanese voicebanks is that they do not work well on systems other than Windows with a Japanese or Chinese code page. This is a well-known issue with UTAU, and the same applies to ScoreDraft. Where UTAU works, ScoreDraft should also work. The Python source code should be written in UTF-8. Lyrics are automatically converted to Shift-JIS when referencing oto.ini and filenames (only on Windows, since it won’t work on Linux anyway). For Linux, a promising solution is to convert both the filenames and the content of oto.ini to UTF-8, or to use romaji.
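For the Linux route, converting the content of oto.ini is straightforward with Python's standard codecs. The sketch below is an assumption about how one might do it, not a tool shipped with ScoreDraft. It uses `cp932`, Python's name for the Windows variant of Shift-JIS, and the function names are hypothetical. The .wav filenames would still need to be fixed separately.

```python
from pathlib import Path


def sjis_to_utf8(data):
    """Re-encode Shift-JIS (Windows code page 932) bytes as UTF-8 bytes."""
    return data.decode("cp932").encode("utf-8")


def convert_oto_file(src, dst):
    """Read a Shift-JIS oto.ini and write a UTF-8 copy to dst."""
    Path(dst).write_bytes(sjis_to_utf8(Path(src).read_bytes()))


# In-memory demonstration with a made-up oto.ini line.
sample = "あ.wav=あ,10,100,-200,60,20\n".encode("cp932")
converted = sjis_to_utf8(sample)
```

Writing to a separate destination file keeps the original voicebank untouched, so it can still be used with UTAU on Windows.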

単独音

単独音 is the kind of voicebank that is easiest to use. Just write the lyrics as they are defined in oto.ini. No converter is needed.

An example:


import ScoreDraft
from ScoreDraft.Notes import *

doc=ScoreDraft.Document()
doc.setTempo(120)

seq= [ ('あ', mi(5,48), 'り', so(5,48), 'が', ti(5,48), 'とぅ', do(6,144), ti(5,144), so(5,144))]
doc.sing(seq, ScoreDraft.uta_UTAU())

doc.mixDown('uta.wav')

(A more complete demo can be found here)

連続音

The lyric converter defined in JPVCVConverter.py should be used for 連続音 voicebanks. An example:


import ScoreDraft
from ScoreDraft.Notes import *

doc=ScoreDraft.Document()
doc.setTempo(120)

seq= [ ('あ', mi(5,48), 'り', so(5,48), 'が', ti(5,48), 'とぅ', do(6,144), ti(5,144), so(5,144))]

Ayaka= ScoreDraft.Ayaka2_UTAU()
Ayaka.setLyricConverter(ScoreDraft.JPVCVConverter)
doc.sing(seq, Ayaka)
doc.mixDown('jpvcv.wav')

(A more complete demo can be found here)

Conversion for 連続音 or VCV style voicebanks is relatively simple. Lyrics still correspond to syllables, as with 単独音. The vowel of the previous syllable is added as a prefix, which selects among the different recorded versions of each syllable.
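The prefixing rule can be sketched in a few lines. To keep the example free of encoding concerns it uses romaji syllables, assumes that the vowel of a syllable is its last letter (with the syllabic "n" treated as its own vowel), and writes each alias as "vowel, space, syllable" with "-" marking the start of a phrase. This is an illustration of the idea, not the code in JPVCVConverter.py.

```python
def vowel_of(syllable):
    """Return the trailing vowel of a romaji syllable (assumed: last letter)."""
    if syllable == "n":
        return "n"
    return syllable[-1]


def vcv_convert(syllables):
    """Prefix each syllable with the vowel of the syllable before it."""
    converted = []
    previous_vowel = "-"  # "-" marks the start of a phrase
    for syllable in syllables:
        converted.append(previous_vowel + " " + syllable)
        previous_vowel = vowel_of(syllable)
    return converted


print(vcv_convert(["a", "ri", "ga", "to"]))
```

For kana lyrics the same loop would apply, with `vowel_of` replaced by a kana-to-vowel lookup table.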

Chinese voicebanks

Like Japanese, Chinese also has a well-defined set of syllables. Using the Pinyin system, these syllables also have fixed representations. Because Pinyin is written in ASCII, there are no encoding issues. Therefore, Chinese is actually the best supported language in ScoreDraft.

整音 (Isolated Syllables)

These voicebanks are conceptually similar to 単独音 and are used in the same way: just writing the lyrics as defined in oto.ini should do. My favorite voicebank of this class is GePing. Here’s an example using it:

import ScoreDraft
from ScoreDraft.Notes import *

ScoreDraft.setDefaultNumberOfChannels(1)

doc=ScoreDraft.Document()
doc.refFreq=207.65
doc.tempo=90

seq= [ ("ni", la(4,24), do(5,48), la(4,24)), ("tiao", re(5,24), mi(5,48)), ("zhuo", re(5,24)), ("dan", re(5,16), mi(5,16), re(5,16), do(5,144) ) ]
seq=seq+ [ ("wo", ti(4,24), la(4,48), ti(4,24)), ("qian", re(5,72)), ("zhe", mi(5,24)), ("ma", do(5,48), la(4, 144))]

doc.sing(seq, ScoreDraft.GePing_UTAU())
doc.mixDown('GePing.wav')

(A more complete demo can be found here)

樗式VCV (Tsuro’s VCV)

The lyric converter defined in TsuroVCVConverter.py should be used. An example:

import ScoreDraft
from ScoreDraft.Notes import *

doc=ScoreDraft.Document()
line= ("zheng", re(5,24), "yue", do(5,48), "li", re(5,24), "cai", mi(5,36), so(5,12), "hua", mi(5,24), la(4,24))
line+=("wu", re(5,24), "you", do(5,48), "hua", re(5,24), "cai", mi(5,24), do(5,12), re(5,12), mi(5,24), BL(24))
seq = [line]

line= ("er", mi(5,12), so(5,12), "yue", la(5,48), "jian", do(6,24), "cai", la(5, 48), "hua", so(5,24), mi(5,24))
line+= ("hua", la(5,24), "you", mi(5,48), "zheng", re(5,12), mi(5,12), "kai", do(5,24), BL(24))
seq += [line]

line= ("er", do(5,12), re(5,12), "yue", mi(5,12), so(5,12), "jian", mi(5,48), "cai", re(5,24), "hua", do(5,24))
line+=("hua", la(4,24), "you", do(5,48), "zheng", la(4,12), do(5,12), "kai", la(4,96), BL(96))
seq += [line]

doc.setReferenceFrequency(440.0)

WanEr=  ScoreDraft.WanEr_UTAU()

WanEr.setLyricConverter(ScoreDraft.TsuroVCVConverter)

doc.sing(seq, WanEr)
doc.mixDown('vcv.wav')

As a VCV system, this class of voicebank is converted in a way similar to Japanese 連続音. The vowel component is extracted from each syllable and used as the prefix of the next syllable.

CVVChinese

The lyric converter defined in CVVCChineseConverter.py should be used. An example:

import ScoreDraft
from ScoreDraft.Notes import *

line= ("zheng", re(5,24), "yue", do(5,48), "li", re(5,24), "cai", mi(5,36), so(5,12), "hua", mi(5,24), la(4,24))
line+=("wu", re(5,24), "you", do(5,48), "hua", re(5,24), "cai", mi(5,24), do(5,12), re(5,12), mi(5,24), BL(24))
seq = [line]

line= ("er", mi(5,12), so(5,12), "yue", la(5,48), "jian", do(6,24), "cai", la(5, 48), "hua", so(5,24), mi(5,24))
line+= ("hua", la(5,24), "you", mi(5,48), "zheng", re(5,12), mi(5,12), "kai", do(5,24), BL(24))
seq += [line]

line= ("er", do(5,12), re(5,12), "yue", mi(5,12), so(5,12), "jian", mi(5,48), "cai", re(5,24), "hua", do(5,24))
line+=("hua", la(4,24), "you", do(5,48), "zheng", la(4,12), do(5,12), "kai", la(4,96), BL(96))
seq += [line]

Ayaka = ScoreDraft.Ayaka_UTAU()
Ayaka.setLyricConverter(ScoreDraft.CVVCChineseConverter)

doc=ScoreDraft.Document()
doc.setReferenceFrequency(440.0)
doc.sing(seq, Ayaka)
doc.mixDown('cvvc2.wav')

(More complete demos can be found here and here)

Conversion of CVVChinese is done by dividing each syllable into two diphones, one CV, one VC (in most cases). For Chinese, the consonant of the VC part always comes from the next syllable, so the conversion rule is still relatively simple.
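The division rule can be illustrated with a small sketch. It is a simplification under stated assumptions, not the code in CVVCChineseConverter.py: the CV part is written as the whole Pinyin syllable, the VC part as "final, space, next initial", and "y" and "w" are treated as initials. The alias spellings in a real CVVChinese oto.ini may differ.

```python
# Pinyin initials, two-letter ones first so that "zh" wins over "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_pinyin(syllable):
    """Split a Pinyin syllable into (initial, final); initial may be empty."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable


def cvvc_split(syllables):
    """Emit one CV unit per syllable plus a VC unit leading into the next one."""
    units = []
    for i, syllable in enumerate(syllables):
        units.append(syllable)  # CV part (assumed: the whole syllable)
        if i + 1 < len(syllables):
            _, final = split_pinyin(syllable)
            next_initial, _ = split_pinyin(syllables[i + 1])
            # No VC unit when the next syllable starts with a vowel.
            if next_initial:
                units.append(final + " " + next_initial)
    return units


print(cvvc_split(["zheng", "yue", "li"]))
```

The skipped VC unit before a vowel-initial syllable such as "er" is one example of the exceptions behind "in most cases".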

XiaYuYao style CVVC

XiaYuYao is a very famous voicebank made in Taiwan. Although it is also based on Pinyin and is also CVVC, its recording list is different from that of CVVChinese. I tried to write a lyric converter for XiaYuYao, but I found it difficult to fully understand its recording system. You can find the file XiaYYConverter.py, which may work in a lot of cases but is likely to fail in others.

English

English is the most difficult of the 3 languages I’ve tried. First, its phones and syllables cannot be covered by a single standard: there are at least a British style and an American style, which sound very different. Second, the number of distinct syllables is very large, so we cannot (and should not) try to enumerate all of them. Third, the written form of English, while it does indicate the pronunciation, is not reliable. Therefore, we need to rely on (less readable) phonetic representations, and several different phonetic standards exist.

The mature form of an English singing synthesis system, like Vocaloid, should allow writing English directly, with the synthesizer doing all the conversion and mapping internally to a form that can be used to look up the recordings. That would be too complicated; even UTAU does not do that.

The second best form is to use syllable-aligned phonetic lyrics, then convert the syllables to smaller units (diphones) that can be found in the recordings. For Delta style English voicebanks, presamp can be used to do the conversion. This form is what ScoreDraft is now aiming to achieve. Writing English lyrics in this form is very similar to writing Japanese/Chinese lyrics as described previously. The only difference is that the syllables are more variable. For example, there can be consonants at both ends of a syllable.

The last form, when we don’t have any tool to process the lyrics, is to write the diphones manually. I was quite surprised at first to know that this is practiced quite commonly among UTAU users.

Delta CV VC English

Delta style, also called Teto-English or TTEnglish, is the recording standard used by Kasane Teto English. This standard has a helper tool, TTEnglishHelper, that helps convert English to phonetics and to diphones. TTEnglishHelper has 2 output forms: the diphone form and a form “for presamp”, which is actually a syllable-aligned phonetic form. The syllable form is much easier to write.

ScoreDraft provides a lyric converter in TTEnglishConverter.py which converts from the syllable form to the diphone form, like what presamp does. Although something similar is done for CVVChinese, the complexity is very different. For CVVChinese, each syllable is almost always divided into 1 CV diphone followed by 1 VC diphone, and the consonant of the VC diphone comes from the next syllable. The case of TTEnglish is much more complicated. The consonant of a VC diphone is not necessarily the same as the one in the succeeding CV diphone. There can also be consecutive CV diphones, because some VCs are missing from the recordings.

There does seem to be a finite set of rules defined somewhere, but I chose to write the converter in a more brute-force way. I first grabbed all possible lyrics as defined in oto.ini, then, during lyric converting, I basically try to find a best match to the input lyric sequence using the lyric set from oto.ini. This strategy can also be used for other kinds of English voicebanks even though they have different symbols and rules defined.
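The matching idea can be sketched as follows. This is not the actual TTEnglishConverter implementation, only a minimal illustration under assumptions: the input is already split into phonemes, a CV-style alias is written without a space and a VC-style alias with one, and both function names are hypothetical. The oto.ini parsing relies on the usual line layout `file.wav=alias,offset,consonant,cutoff,preutterance,overlap`.

```python
def load_aliases(oto_text):
    """Collect the alias of every entry in oto.ini content.

    When an entry has an empty alias, the file name without its
    extension is used instead.
    """
    aliases = set()
    for line in oto_text.splitlines():
        if "=" not in line:
            continue
        filename, params = line.split("=", 1)
        alias = params.split(",")[0].strip()
        if not alias:
            alias = filename.rsplit(".", 1)[0]
        aliases.add(alias)
    return aliases


def match_diphones(phonemes, aliases):
    """For each adjacent phoneme pair, keep the first candidate found in oto.ini."""
    result = []
    for left, right in zip(phonemes, phonemes[1:]):
        # Try the CV-style spelling first, then the VC-style spelling.
        for candidate in (left + right, left + " " + right):
            if candidate in aliases:
                result.append(candidate)
                break
        # If neither exists, the transition is skipped (missing recording).
    return result


# Made-up oto.ini content covering the syllable 'k@l'.
OTO = "k@.wav=k@,0,0,0,0,0\n@_l.wav=@ l,0,0,0,0,0\n"
print(match_diphones(["k", "@", "l"], load_aliases(OTO)))
```

Because the candidates are looked up in the voicebank's own alias set, the same loop can be pointed at a voicebank with different symbols without changing the code.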

An example using TTEnglishConverter.py:

import ScoreDraft
from ScoreDraft.Notes import *

doc=ScoreDraft.Document()
doc.setTempo(100)

seq = [ ('twIN', do(5,48), 'k@l', do(5,48), 'twIN', so(5,48), 'k@l', so(5,48), 'It',la(5,48), '@l',la(5,48), 'stAr',so(5,96))]

Teto = ScoreDraft.TetoEng_UTAU()
Teto.setLyricConverter(ScoreDraft.TTEnglishConverter)

doc.sing(seq, Teto)
doc.mixDown('teto.wav')

(A more complete demo can be found here)

To write the above lyric, I looked up “twinkle twinkle little star” with TTEnglishHelper and used the presamp style of lyric. The output of TTEnglishConverter.py in this case is:

[('- tw', 0.1, False, 'wI', 0.4, True, 'I N', 0.1, False),
 ('k@', 0.4, True, '@ l', 0.1, False, 'tw', 0.1, False),
 ('wI', 0.4, True, 'I N', 0.1, False),
 ('k@', 0.4, True, '@ l', 0.1, False),
 ('lI', 0.4, True, 'I t', 0.1, False),
 ('t@', 0.4, True, '@ l', 0.1, False, 'st', 0.1, False),
 ('tA', 0.4, True, 'A r', 0.1, False)]

The converted lyrics are consistent with the CV VC form of TTEnglishHelper’s output.

CZ’s VCCV English

CZ’s VCCV English is the standard that I looked at most recently. It is the first standard that made me aware that lyrics are not the only source of diversity among UTAU voicebanks. CZ’s VCCV English voicebanks are recorded and OTOed in a very unique way. During OTOing, the ending position of the vowel part of a VC diphone is considered unimportant. Even the ending of the consonant is not considered important. The arbitrariness of these markers is compensated for by setting a consonant velocity for each VC diphone during USTing.

UtauDraft was not originally designed to work that way. There is a concept similar to consonant velocity inside ScoreDraft. By default, it is calculated automatically from the consonant ending marker in the oto. The existence of VCCV English forced me to provide another option. Now users can call ‘singer.setCZMode()’ to make UtauDraft handle the VC diphones in a special way that adapts to CZ’s VCCV English voicebanks.

The lyric converter for CZ’s VCCV English is VCCVEnglishConverter.py, which works in a way similar to TTEnglishConverter.py. An example using it:

import ScoreDraft
from ScoreDraft.Notes import *

doc=ScoreDraft.Document()
doc.setTempo(100)

seq = [ ("ma", mi(5,24), "ma", re(5,24), mi(5,48)), BL(24)]
seq +=[ ("do",mi(5,24),"yO", so(5,24), "ri", la(5,24), "mem", mi(5,12),re(5,12), "b3", re(5,72)), BL(24)]
seq +=[ ("dhx",do(5,12), re(5,12), "Od", mi(5,24), "str0l", so(5,24), "h@t", so(5,72)), BL(24)]
seq +=[ ("yO", mi(5,12),ti(5,36)),BL(12),("gAv", la(5,24), "t6",so(5,12), "mE", mi(5,96))]

singer = ScoreDraft.Yami_UTAU()
singer.setLyricConverter(ScoreDraft.VCCVEnglishConverter)
singer.setCZMode()

doc.sing(seq, singer)
doc.mixDown('vccv.wav')

(A more complete demo can be found here)

Epilogue

The voicebank adaptation work is not over yet. There are many voicebanks that I have not tried, and I’m pretty sure there are many tricks to use (and abuse) UTAU that I don’t know. I will do my best to help those who need to use kinds of voicebanks not covered by this article. And I would very much appreciate it if you could provide me with your lyric converters for voicebanks that I don’t know about or that are handled incorrectly by the current lyric converters.

 
