SingingGadgets: How to Use

In the previous blog, I briefly introduced the 2 different usages of the project SingingGadgets: you can either use the low-level “gadgets” provided by package SingingGadgets, or use package ScoreDraft in a way very similar to using the project ScoreDraft. This blog goes into details about the 2 usages. For package SingingGadgets, we will look at the new interfaces it provides. For package ScoreDraft, this blog assumes that the reader already knows the basic usage of the project ScoreDraft (if not, see here), and then explains the differences between the 2 ScoreDrafts.

Package SingingGadgets: the Low-level Gadgets

The source-code repository of SingingGadgets comes with an example, Test.py, showing how to directly use the low-level interfaces provided by package SingingGadgets, as listed below.


import wave
import SingingGadgets as sg

def loadWav(file):
	wavS16=bytes()
	with wave.open(file, mode='rb') as wavFile:
		wavS16=wavFile.readframes(wavFile.getnframes())
	return sg.S16ToF32Voice(wavS16)

def saveWav(wavF32, file, amp=1.0):
	wavS16=sg.F32ToS16Voice(wavF32, amp)
	with wave.open(file, mode='wb') as wavFile:
		wavFile.setnchannels(1)
		wavFile.setsampwidth(2)
		wavFile.setframerate(44100)
		wavFile.setnframes(len(wavS16)//2)
		wavFile.writeframes(wavS16)

src={
	'wav': loadWav('fa.wav'),
	'frq': sg.LoadFrqUTAU('fa_wav.frq')
}

sentence= {
	'pieces': [ 
		{
			'src': src,
			'map': [(24.04, -73.457, sg.notVowel), (97.497, 0.0, sg.preVowel), (177.999, 153.959, sg.isVowel), (362.911, 454.59)]
		},
		{
			'src': src,
			'map': [(24.04, 426.543, sg.notVowel), (97.497, 500.0, sg.preVowel), (177.999, 653.959, sg.isVowel), (362.911, 954.59)]
		},
		{
			'src': src,
			'map': [(24.04, 926.543, sg.notVowel), (97.497, 1000.0, sg.preVowel), (177.999, 1153.959, sg.isVowel), (362.911, 1500.0)]
		}
	],
	'piece_map': [(0, 0), (0, 426.543), (1,454.59), (1,926.543), (2,954.59), (2,1500)],
	'freq_map': [(264.0, 0), (264.0, 500), (264.0*1.125, 500), (264.0*1.125, 1000), (264.0*1.25,1000), (264.0*1.25,1500)],
	'volume_map': [(1.0, 0), (1.0, 1450), (0.0, 1500)]
}

#res=sg.GenerateSentence(sentence)
res=sg.GenerateSentenceCUDA(sentence)

#outData=res['data']
#maxValue=sg.MaxValueF32Voice(outData)
#saveWav(res['data'],'out.wav', 1.0/maxValue)

track = sg.TrackBuffer(1)
track.writeBlend(res)
track.moveCursor(2000)
track.writeBlend(res)
sg.WriteTrackBufferToWav(track,'out.wav')

The core function of the package SingingGadgets is a singing synthesizer that works only on raw waveform data of the voice source, accompanied by information about how to transform and concatenate the input waveforms. The singing waveform is generated one sentence (continuous voice in one breath) at a time.

In the above code, the code block “sentence = {…}” defines the input data structure. The outermost layer of the data structure is a dictionary containing 4 fields: ‘pieces’, ‘piece_map’, ‘freq_map’ and ‘volume_map’. The ‘pieces’ field contains information about all input waveforms and how they are mapped temporally to the output sentence. The ‘piece_map’ field contains weighting information controlling when and how to transit from one piece to another, much like the “cross-fading” concept in UTAU. The ‘freq_map’ field contains the output frequency (or pitch) information. The ‘volume_map’ field contains the output volume envelope.

The ‘pieces’ field consists of a list of dictionaries, each of which contains 2 fields: ‘src’ and ‘map’. The ‘src’ field contains the information of an input waveform, which includes the waveform itself plus frequency detection data. The ‘map’ field contains the temporal mapping information from the input waveform to the target waveform.

First, let’s take a close look at the ‘src’ field of a ‘piece’. It points to another dictionary containing 2 fields: ‘wav’ and ‘frq’. The ‘wav’ data required here is raw PCM data in the form of Python ‘bytes’ containing 32-bit float values. For voice synthesis we are currently restricted to mono, so each float value represents 1 PCM sample. The Python module ‘wave’ provides functionality to read/write ‘wav’ files into/from 16-bit signed int ‘bytes’, so SingingGadgets provides utility functions to convert between 32-bit float and 16-bit signed int: SingingGadgets.S16ToF32Voice() and SingingGadgets.F32ToS16Voice(). In the example, these utilities are used together with the ‘wave’ module. The ‘frq’ field points to another dictionary consisting of 3 fields: ‘interval’ (number of samples between 2 frequency data points), ‘key’ (representative frequency of the input) and ‘data’ (the actual frequency data points). SingingGadgets currently provides a utility function SingingGadgets.LoadFrqUTAU() to load frequency detection data directly from ‘frq’ files generated by UTAU’s resampler.exe, which often come as part of UTAU voicebanks.
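
For a quick sanity check, the ‘src’ structure and the conversion utilities can be exercised as below. This is a minimal sketch reusing loadWav() from the example; the fields printed are exactly the ones described above.

# a quick look at the 'src' structure, reusing loadWav() from the example
# and an UTAU-generated .frq file
src = {
	'wav': loadWav('fa.wav'),             # mono PCM, 32-bit float samples as Python bytes
	'frq': sg.LoadFrqUTAU('fa_wav.frq')   # frequency detection data
}

wavF32 = src['wav']
wavS16 = sg.F32ToS16Voice(wavF32, 1.0)    # back to 16-bit signed int bytes
print(len(wavF32)//4, len(wavS16)//2)     # same number of samples either way

frq = src['frq']
print(frq['interval'])                    # number of samples between 2 frequency data points
print(frq['key'])                         # representative frequency of the input
print(len(frq['data']))                   # the actual frequency data points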

Second, let’s get an idea of what the ‘map’ field of a ‘piece’ is saying by looking at the chart above. The ‘map’ field consists of a list of control points. The 1st item of each control point gives a temporal location in the source waveform in milliseconds. The 2nd item gives a temporal location on the target time axis in milliseconds. The 3rd item defines whether the section following the control point should be treated as a vowel, not a vowel, or a section transiting from not-vowel to vowel (the section after the red marker). The 3rd item is not required for the last control point of each ‘map’.
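
To make that concrete, here is the ‘map’ of the first piece from the example again, annotated point by point (the comments are my reading of the flags and numbers):

# (source ms, target ms, type of the section that follows this point)
map0 = [
	(24.04,   -73.457, sg.notVowel),  # not a vowel; note the negative target time:
	                                  # this section is placed before target time 0
	(97.497,    0.0,   sg.preVowel),  # section transiting from not-vowel to vowel
	(177.999, 153.959, sg.isVowel),   # vowel section
	(362.911, 454.59)                 # last control point: no section type required
]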

Third, let’s look at the ‘piece_map’ field of a sentence. It basically defines a piece-id curve represented as float values. It also consists of a list of control points. The 1st item of each control point gives a piece-id. The 2nd item gives a temporal location on the target time axis in milliseconds. We can draw a curve by connecting the control points as shown in the chart above. If a time interval is covered by an integer value of piece-id, then the generated sound in that interval will come from a single piece of source. If a time interval has piece-id values between 2 integers, then it is a transitional interval from one piece of source to another. The contributions of the 2 sources will be weighted by the fractional part of the piece-id.
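
Reading the ‘piece_map’ of the example the same way (the interval comments are my interpretation of the numbers):

# (piece-id, target ms)
piece_map = [
	(0,    0),        # piece 0 alone ...
	(0,  426.543),    # ... until 426.543 ms
	(1,  454.59),     # 426.543-454.59 ms: fractional piece-id, cross-fade from piece 0 to piece 1
	(1,  926.543),    # piece 1 alone
	(2,  954.59),     # 926.543-954.59 ms: cross-fade from piece 1 to piece 2
	(2, 1500)         # piece 2 alone until the end of the sentence
]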

The last 2 fields of a sentence, ‘freq_map’ and ‘volume_map’, are 2 more curves defined by control points similar to ‘piece_map’. In ‘freq_map’, the control points are target frequencies in Hz followed by target temporal locations, while in ‘volume_map’, the control points are volume multipliers followed by target temporal locations.
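
The same convention applies to the last 2 curves of the example: ‘freq_map’ pairs a frequency in Hz with a target time, and ‘volume_map’ pairs a volume multiplier with a target time.

# three flat pitch segments: 264 Hz, then a whole tone (9/8) higher,
# then a major third (5/4) above 264 Hz
freq_map = [
	(264.0,          0), (264.0,        500),
	(264.0*1.125,  500), (264.0*1.125, 1000),
	(264.0*1.25,  1000), (264.0*1.25,  1500)
]

# full volume until 1450 ms, then a 50 ms fade-out to silence
volume_map = [(1.0, 0), (1.0, 1450), (0.0, 1500)]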

Once you have defined a sentence structure, you can feed it to SingingGadgets.GenerateSentence() or SingingGadgets.GenerateSentenceCUDA(). GenerateSentence() always calls the CPU singing generator, while GenerateSentenceCUDA() should be smart enough to use the GPU branch only when it is available and the running environment allows it, falling back to the CPU branch otherwise.

The return value of the “GenerateSentence” functions is a dictionary structure consisting of 6 fields (a short usage sketch follows this list):

‘sample_rate’: will always be 44100

‘num_channels’: will always be 1 for singing

‘data’: raw PCM data as 32-bit float bytes

‘align_pos’: position of the logical origin point, in number of samples

‘volume’: volume value used for blending

‘pan’: pan value used for blending
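
As referenced above, here is a short sketch of using these fields directly; it simply mirrors the commented-out lines in the example (‘sentence_only.wav’ is an arbitrary output name):

# normalize the generated sentence and save it on its own, without a TrackBuffer
outData = res['data']
maxValue = sg.MaxValueF32Voice(outData)
saveWav(outData, 'sentence_only.wav', 1.0/maxValue)

numSamples = len(outData)//4                      # 4 bytes per 32-bit float sample
print(numSamples/res['sample_rate'], 'seconds')   # duration of the generated sentence
print(res['align_pos'])                           # logical origin, in samples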

The return value can be directly blended into a TrackBuffer object, as the example shows. The class SingingGadgets.TrackBuffer has evolved from ScoreDraft.TrackBuffer in the project ScoreDraft. TrackBuffer objects can be used to store massive waveform data. The data is cached in a temporary file, so the memory footprint stays extremely small. Functions for mixing, as well as for reading/writing wav files, are provided. See SingingGadgets/TrackBuffer.py for the details of the class TrackBuffer. Please be careful about a few differences between this TrackBuffer class and the one in the project ScoreDraft. In this project, the getCursor()/setCursor()/moveCursor() functions use milliseconds as the unit, instead of number of samples as in the project ScoreDraft. Also note that the writeBlend() function does not move the cursor automatically; the user needs to call moveCursor() or setCursor() manually if desired.
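
For example, to stack several sentences back to back on one track, you can track the cursor position yourself. This is only a sketch: it assumes setCursor() takes an absolute position in milliseconds and that res1 and res2 are results returned by the generate functions; the duration arithmetic (4 bytes per 32-bit float sample) is my own bookkeeping, while the TrackBuffer calls themselves come from the package.

track = sg.TrackBuffer(1)

posMs = 0.0
for res in [res1, res2]:
	track.setCursor(posMs)       # cursor positions are in milliseconds in this project
	track.writeBlend(res)        # writeBlend() does not move the cursor ...
	posMs += 1000.0*(len(res['data'])//4)/res['sample_rate']   # ... so advance the position manually

sg.WriteTrackBufferToWav(track, 'two_sentences.wav')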

Differences between the 2 ScoreDrafts

In the source-code repository, TestScoreDraft.py is given as an example showing how to use the package ScoreDraft. The file contains several sections demoing different kinds of voicebanks. I’m not pasting the code here. The point is that it is very similar to using the project ScoreDraft. Actually, since the re-implementation of the instrument and percussion functions is also finished, the 2 ScoreDrafts are now basically the same in usage, except for the following:

Difference 1: The TrackBuffer class is in the package SingingGadgets, not the package ScoreDraft, because it is considered a low-level interface.

Difference 2: Initialization of Instruments/Percussions/Singers based on pre-deployed data assumes different search paths in the 2 ScoreDrafts. In the project ScoreDraft, the search paths are located within the ScoreDraft package folder. In the ScoreDraft of SingingGadgets, the search paths are based on the starting path of the final app. It is not encouraged to put sample data into the installation path of SingingGadgets.

Difference 3: Dynamic Tempo Maps are also supported here, but you need to use milliseconds instead of number of samples as the unit for positions on the destination timeline. SingingGadgets uses milliseconds to represent time consistently. The original ScoreDraft should have done the same, but chose to use number of samples in many cases. I don’t think I’m going to fix that though 🙁

Difference 4: Qt-based modules are not re-implemented in SingingGadgets. However, a Meteor module is provided, which only supports writing to .meteor data files, and can be used with the web-based visualizer.

 
