Prosody application note: speech processing replay and record data formats

Introduction

This document describes the data formats provided by Prosody for recording and replaying speech data, outlines the characteristics of each format, and gives some indication of when each format is applicable.

WAV files, VOX files and raw data streams

The Prosody speech API can record and replay speech data either to/from files (using the high-level play/record API) or to/from data buffers (using the low-level calls sm_replay_start() and sm_record_start()).

Speech data consists of a sequence of octets encoding a speech signal. Prosody can replay and record speech data encoded in a number of different ways; these different encodings are called speech data formats. Each Prosody speech data format has different characteristics, and the choice of data format used in an application depends on the requirements of the application and the capabilities of the platform hosting the application.

When Prosody is used to record speech data, a (mu-law or A-law) speech signal switched to a Prosody channel input is encoded into a sequence of octets in an explicitly specified data format.

When Prosody is used to replay speech data, a sequence of octets in an explicitly specified data format is decoded and a (mu-law or A-law) speech signal is generated on a Prosody channel output. It is therefore essential that the Prosody module uses the correct decoding algorithm to generate the output speech signal from the speech data.

When speech data used with the Prosody speech processing firmware is stored in a file, the high level speech processing API provides a set of calls for handling two commonly used file formats:

RAW or VOX files contain only encoded speech data (they have no header), so it is not possible to determine the speech data format of a RAW file from its contents alone.

The WAV file format is a Microsoft-defined format that includes a header containing information about the encoded speech data that follows. Within the header, a data format identifier indicates the data format of the speech data contained in the WAV file. These identifier values are defined by Microsoft (or registered with Microsoft by product vendors); examples include WAVE_FORMAT_PCM, WAVE_FORMAT_ALAW and WAVE_FORMAT_MULAW (see the table in the Prosody Supported Data Formats section below).

WAV data formats require more than the identifier to completely describe them. For example, the identifier WAVE_FORMAT_PCM simply states that the data in the file is in the form of uniformly quantised samples. The sampling rate is the frequency at which the samples must be output in order to reproduce the original speed and pitch. The sample size is the number of bits in each sample (the numerical accuracy of the sample). Further, the data in a WAV file can be mono (one channel) or stereo (two parallel channels). Thus the data formats commonly used by WAV players (and recorders) combine a format identifier with a sampling rate, a sample size and a channel count; a sketch of the header fields that carry this information is given below.
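As an illustration (this is a sketch for inspecting a WAV header, not part of the Prosody API), the fields of the 'fmt ' chunk described by WAVEFORMATEX [1] can be modelled in C as follows; the format tag values shown are the standard Microsoft registrations:

    #include <stdint.h>

    /* Standard Microsoft format tag values (subset) */
    #define WAVE_FORMAT_PCM        0x0001
    #define WAVE_FORMAT_ALAW       0x0006
    #define WAVE_FORMAT_MULAW      0x0007
    #define WAVE_FORMAT_OKI_ADPCM  0x0010
    #define WAVE_FORMAT_IMA_ADPCM  0x0011   /* also known as DVI ADPCM */

    /* Fields of the 'fmt ' chunk, as described by WAVEFORMATEX [1].
       All values are stored little-endian in the file. */
    struct wav_fmt_chunk {
        uint16_t format_tag;        /* data format identifier, e.g. WAVE_FORMAT_ALAW */
        uint16_t channels;          /* 1 = mono, 2 = stereo */
        uint32_t samples_per_sec;   /* sampling rate, e.g. 8000 */
        uint32_t avg_bytes_per_sec; /* average data rate in octets per second */
        uint16_t block_align;       /* octets per sample frame */
        uint16_t bits_per_sample;   /* sample size, e.g. 8 or 16 */
        /* an optional cbSize field and extra bytes may follow for non-PCM formats */
    };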

WAVE_FORMAT_ALAW and WAVE_FORMAT_MULAW are almost always sampled at 8000 Hz and are both implicitly 8 bits per sample, giving a data rate of 8 bits x 8000 samples/second = 64 kilobits per second (Kbps).

CTI-specific data formats, such as OKI ADPCM, are described in the sections below.

Prosody, being a telephony system, handles only mono signals, since telephones have no facility for stereo.

WAV file data format identifiers are not the same as the type parameter supplied to Prosody replay/record API calls. It is not, in general, possible to replay an arbitrary type of WAV file data using the Prosody API: only certain data format identifiers are directly supported. In these cases Prosody can map the WAV data format identifier to one of its equivalent data formats and use the speech data in the WAV file directly; otherwise the WAV speech data must be transcoded into a Prosody-supported data format. A sketch of such a mapping is given below.
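The following sketch shows how such a mapping might look in an application. It is illustrative only: it reuses the WAVE_FORMAT_* values from the previous sketch and assumes that the kSMDataFormat* constants listed in the Prosody Supported Data Formats table below are in scope from the Prosody API headers; the exact types of those constants and the calls used to apply the chosen format are not shown.

    /* Illustrative mapping from a WAV data format identifier to one of the
       Prosody data formats listed in the table below.  Returns 0 on success,
       -1 if the WAV data is not directly supported and must be transcoded.
       The int out-parameter is an assumption; use whatever type the Prosody
       headers declare for the kSMDataFormat* constants. */
    static int wav_tag_to_prosody_format(uint16_t format_tag,
                                         uint16_t bits_per_sample,
                                         int *prosody_format)
    {
        switch (format_tag) {
        case WAVE_FORMAT_ALAW:      *prosody_format = kSMDataFormatALawPCM;  return 0;
        case WAVE_FORMAT_MULAW:     *prosody_format = kSMDataFormatULawPCM;  return 0;
        case WAVE_FORMAT_OKI_ADPCM: *prosody_format = kSMDataFormatOKIADPCM; return 0;
        case WAVE_FORMAT_IMA_ADPCM: *prosody_format = kSMDataFormatIMAADPCM; return 0;
        case WAVE_FORMAT_PCM:
            *prosody_format = (bits_per_sample == 16) ? kSMDataFormat16bit
                                                      : kSMDataFormat8bit;
            return 0;
        default:
            return -1;   /* not directly supported: transcode before replay */
        }
    }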

Customers can handle other speech file formats (neither WAV nor RAW) by writing their own code on top of the low-level buffer-based record/replay API.

Whereas some traditional CTI installations use RAW files, Prosody developers are encouraged to use WAV files: the file header can be interrogated by a playback/record application using Prosody API calls, which can then choose the appropriate Prosody algorithm for playback or recording of the file. If the file is RAW, some knowledge independent of the file itself is required, and it is possible to attempt playback of a particular file using the wrong algorithm.

Another advantage of storing recorded data in WAV files is that, if the data format is supported by other applications such as a WAV player, the recording can be rendered on a desktop computer without requiring a Prosody card. A sketch of a suitable header for such a file is given below.
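As an example, the following sketch (application code, not part of the Prosody API) writes a minimal WAV header for an 8000 samples/second mono A-law recording so that the file can be opened by a desktop WAV player. The chunk layout follows the WAVEFORMATEX description in [1]; strictly, non-PCM WAV files should also carry a 'fact' chunk, which many players do not require.

    #include <stdio.h>
    #include <stdint.h>

    /* Write a little-endian value of the given width in octets */
    static void put_le(FILE *f, uint32_t v, int octets)
    {
        for (int i = 0; i < octets; i++)
            fputc((v >> (8 * i)) & 0xFF, f);
    }

    /* Write a minimal WAV header for 8000 Hz, mono, 8-bit A-law data.
       data_bytes is the number of encoded octets that will follow. */
    static void write_alaw_wav_header(FILE *f, uint32_t data_bytes)
    {
        const uint16_t format_tag  = 0x0006;   /* WAVE_FORMAT_ALAW */
        const uint16_t channels    = 1;        /* mono */
        const uint32_t sample_rate = 8000;
        const uint16_t bits        = 8;
        const uint16_t block_align = channels * bits / 8;
        const uint32_t byte_rate   = sample_rate * block_align;

        fwrite("RIFF", 1, 4, f);
        put_le(f, 4 + (8 + 18) + (8 + data_bytes), 4); /* size of all data after this field */
        fwrite("WAVE", 1, 4, f);

        fwrite("fmt ", 1, 4, f);
        put_le(f, 18, 4);                /* fmt chunk size (WAVEFORMATEX incl. cbSize) */
        put_le(f, format_tag, 2);
        put_le(f, channels, 2);
        put_le(f, sample_rate, 4);
        put_le(f, byte_rate, 4);
        put_le(f, block_align, 2);
        put_le(f, bits, 2);
        put_le(f, 0, 2);                 /* cbSize = 0: no extra format bytes */

        fwrite("data", 1, 4, f);
        put_le(f, data_bytes, 4);        /* the encoded A-law octets follow this header */
    }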

Compatibility

Since the term WAV (or WAVE) file refers only to the header on the file, and since there are many data formats (e.g. WAVE_FORMAT_ALAW, WAVE_FORMAT_G721), it is virtually impossible for a single application to support all data formats. Most desktop WAVE players will support WAVE_FORMAT_ALAW, WAVE_FORMAT_MULAW and WAVE_FORMAT_PCM (but only at certain sampling rates and bit rates).

On Windows 2000 platforms, it is the Audio Compression Manager (ACM) that determines which types of WAV file may be rendered to a sound card by media players. The media player interacts with ACM drivers to convert data between data formats.

Prosody Supported Data Formats

Prosody format description    Registered WAV type tag    Bits per sample
kSMDataFormatALawPCM          WAVE_FORMAT_ALAW           8
kSMDataFormatULawPCM          WAVE_FORMAT_MULAW          8
kSMDataFormatOKIADPCM         WAVE_FORMAT_OKI_ADPCM      4
kSMDataFormatIMAADPCM         WAVE_FORMAT_IMA_ADPCM      4
kSMDataFormat8bit             WAVE_FORMAT_PCM            8
kSMDataFormat16bit            WAVE_FORMAT_PCM            16
kSMDataFormatSigned8bit       WAVE_FORMAT_PCM            8
kSMDataFormatSpeex            WAVE_FORMAT_SPEEX          +

Note: + See the Speex website for details.

Linear PCM

Pulse Code Modulation (PCM) is the method of periodically measuring the voltage of a waveform (a speech signal) and storing that voltage as an integer, which can take a finite number of values. In linear PCM (16-bit or 8-bit), the integers used for the samples are simply fractions of the full-scale deflection of the signal. The rate at which these values are measured is the sampling rate, and the conversion to integer values is known as quantisation. A minimal sketch of this quantisation is given below.
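A minimal sketch of uniform quantisation (illustrative C, not Prosody firmware code), mapping a full-scale value in the range -1.0 to +1.0 to 16-bit and 8-bit linear PCM samples:

    #include <stdint.h>

    /* Uniformly quantise a full-scale value in the range -1.0 .. +1.0
       to a signed 16-bit linear PCM sample. */
    static int16_t quantise16(double v)
    {
        if (v > 1.0)  v = 1.0;        /* clip to full scale */
        if (v < -1.0) v = -1.0;
        return (int16_t)(v * 32767.0);
    }

    /* The same signal quantised to 8 bits keeps only 256 levels.
       (8-bit PCM may be stored signed or unsigned; signed is shown here.) */
    static int8_t quantise8(double v)
    {
        if (v > 1.0)  v = 1.0;
        if (v < -1.0) v = -1.0;
        return (int8_t)(v * 127.0);
    }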

Mu-law

Mu-law (sometimes written u-law or µ-law) is the format used for transport of speech data in the American and Japanese telephone networks. It is specified by ITU G.711 [2]. This is a non-uniform (logarithmic) quantisation of signal samples, using 8 data bits per sample at a rate of 8000 samples per second, hence 64 kilobits per second. The subjective sound quality is equivalent to linear sampling at 14 bits at the same sampling rate. Mu-law is commonly supported by WAVE players.
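To make the non-uniform quantisation concrete, the following is a minimal sketch of G.711 mu-law encoding of a 16-bit linear sample, based on the commonly used reference implementation (it is not Prosody firmware code):

    /* Encode one signed 16-bit linear sample as a G.711 mu-law octet. */
    static unsigned char linear_to_ulaw(int sample)   /* sample: -32768 .. 32767 */
    {
        const int BIAS = 0x84;      /* 132: bias makes the leading-bit search simple */
        const int CLIP = 32635;

        int sign = (sample >> 8) & 0x80;              /* keep the sign bit */
        if (sign) sample = -sample;                   /* work on the magnitude */
        if (sample > CLIP) sample = CLIP;             /* clip to avoid overflow */
        sample += BIAS;

        /* Find the position of the leading set bit among bits 14..7 */
        int exponent = 7;
        for (int mask = 0x4000; (sample & mask) == 0 && exponent > 0; mask >>= 1)
            exponent--;
        int mantissa = (sample >> (exponent + 3)) & 0x0F;

        /* G.711 transmits the octet with its bits inverted */
        return (unsigned char)~(sign | (exponent << 4) | mantissa);
    }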

A-law

A-law is the format used for transport of speech data on telephone networks around the world except in America and Japan. It is specified by ITU G.711 [2]. This is a non-uniform (logarithmic) quantisation of signal samples, using 8 data bits per sample at a rate of 8000 samples per second, hence 64 kilobits per second. The subjective sound quality is equivalent to linear sampling at 13 bits at the same sampling rate. A-law is commonly supported by WAVE players.

OKI ADPCM

OKI ADPCM has become an industry standard for storage of speech signals at reduced data rates of 32 Kbps and 24 Kbps. This coding method is differential and uses four data bits per sample, sampled at 8000 or 6000 samples per second respectively (4 bits x 8000 samples/s = 32 Kbps; 4 bits x 6000 samples/s = 24 Kbps). When recording at 24 Kbps, the network sampling rate of 8000 samples per second is subsampled to 6000 samples per second by DSP software before the ADPCM algorithm is applied; the reverse is done for replay. The supported file formats are .vox and .vap.

IMA ADPCM

IMA ADPCM (also known as 'DVI ADPCM' or simply 'ADPCM') was developed by Intel and approved by the Interactive Multimedia Association (IMA) as a low-complexity, cross-platform coding scheme for speech. Unfortunately, the data format can differ between implementations; in particular, a legacy build of Microsoft Windows used an incompatible format. The format supported by Prosody is consistent with Microsoft Windows NT 4.0, Windows 9x and Windows 2000. IMA ADPCM is commonly supported by WAVE players.

Sampling rate

While a data format determines how much information is recorded for each sample, the total amount of information recorded about a signal also depends on the sampling rate. Prosody supports several sampling rates, including 6000, 8000 and 11000 samples per second, which are discussed below.

Since signals are sent across the telephone network at 8000 samples per second, it is normally pointless to record or play files at a higher sampling rate: the extra sound quality cannot cross the phone network and merely serves to make the files bigger. However, since PC sound cards are typically optimised for CD sampling rates (16-bit linear at 44100 samples/second), it may be more convenient to have files which use a rate of 11000 samples per second. This is sufficiently close to a quarter of the CD rate (11025 samples/second) that there is no perceptible difference (it is less than 1% off). Some PC sound cards do not handle 8000 samples/second very well, so this may also influence the choice of rate.

The rate of 6000 samples per second offers a 25% saving in file size compared with 8000 samples per second, at the cost of some sound quality.

Other types of ADPCM

There are two other very common types of ADPCM. Neither is compatible with OKI ADPCM.

Applicability of Supported Data Formats

The choice of data format for use in an application may depend on some or all of the following factors: the data rate, the resulting speech quality, the encoding and decoding cost, portability to other applications, platform/network system performance, and platform disk capacity. These are summarised in the table and sections below.

Summary of Data Format Characteristics

Data format          Data rate    Speech quality    Encoding cost    Decoding cost    Portability
16-bit Linear PCM    Very high    Best              Low              Low              Universal
A-law or Mu-law      High         Best              Low              Low              Wide
OKI ADPCM 32 Kbps    Moderate     Moderate          Moderate         Moderate         Within CT
OKI ADPCM 24 Kbps    Moderate     Low               Moderate         Moderate         Within CT
IMA ADPCM 32 Kbps    Moderate     Moderate          Moderate         Moderate         Wide

Platform/Network System Performance

Each active channel of data (recording or replaying) generally requires 2 Kbyte buffers of data octets to be supplied at regular intervals. With uncompressed data this is roughly every 250 ms; with 2:1 compressed data, roughly every 500 ms; and so on. The arithmetic is sketched below.
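The buffer intervals quoted above follow directly from the data rates. A small sketch of the arithmetic (the 2 Kbyte buffer size is taken from the text; the 60-channel figure is an arbitrary example, and the exact results of 256 ms and 512 ms are rounded to 250 ms and 500 ms above):

    #include <stdio.h>

    int main(void)
    {
        const double buffer_octets = 2048.0;         /* one 2 Kbyte buffer */
        const double sample_rate   = 8000.0;         /* samples per second on the network */

        /* Octets per second for 8-bit (uncompressed) and 4-bit (2:1 compressed) formats */
        double rate_8bit = sample_rate * 8.0 / 8.0;  /* 8000 octets/s */
        double rate_4bit = sample_rate * 4.0 / 8.0;  /* 4000 octets/s */

        printf("8-bit data : one buffer every %.0f ms\n", 1000.0 * buffer_octets / rate_8bit); /* ~256 ms */
        printf("4-bit ADPCM: one buffer every %.0f ms\n", 1000.0 * buffer_octets / rate_4bit); /* ~512 ms */

        /* Aggregate throughput for, say, 60 active A-law channels */
        int channels = 60;
        printf("60 channels of A-law: %.0f Kbytes/s sustained\n",
               channels * rate_8bit / 1024.0);       /* ~469 Kbytes/s */
        return 0;
    }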

The sustainable channel count depends on how reliably the system can deliver data buffers to active channels. This is a function of CPU performance, other loading on the CPU, local disk controller performance, and network performance if replay data comes over a LAN.

It is evident that the more compressed the speech data is, the less often data has to be supplied to active channels and thus the less demand this makes on system performance.

Platform disk capacity

If long duration or large numbers of recordings are going to be stored and disk capacity for these recordings is an issue, then consideration should be given to the compression ratio offered by a particular data format.

See the Prosody Supported Data Formats table above for the bits-per-sample figure for each data format. Each PCM sample is 8 bits, so if a data format uses N bits per sample, the compression ratio relative to 8-bit PCM is 8/N. A worked example is sketched below.
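For example, the following back-of-envelope sketch (the one-hour duration is arbitrary) computes the compression ratio and the resulting file size for the data formats in the table above:

    #include <stdio.h>

    int main(void)
    {
        const double sample_rate = 8000.0;    /* samples per second */
        const double seconds     = 3600.0;    /* a one-hour recording, as an example */

        /* bits per sample from the Prosody Supported Data Formats table */
        struct { const char *name; int bits; } fmt[] = {
            { "16-bit linear PCM", 16 },
            { "A-law / Mu-law",     8 },
            { "OKI/IMA ADPCM",      4 },
        };

        for (int i = 0; i < 3; i++) {
            double ratio  = 8.0 / fmt[i].bits;   /* compression ratio 8/N vs 8-bit PCM */
            double mbytes = sample_rate * fmt[i].bits / 8.0 * seconds / (1024.0 * 1024.0);
            printf("%-18s ratio %.1f : one hour = %.1f Mbytes\n",
                   fmt[i].name, ratio, mbytes);  /* ~54.9, ~27.5 and ~13.7 Mbytes */
        }
        return 0;
    }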

References

  1. The structure of a WAV file header is described in MSDN online library section: Platform SDK/Graphics and Multimedia Services/Windows Multimedia/Multimedia Reference/Multimedia Structures/WAVEFORMATEX
  2. ITU-T Recommendation G.711, Pulse Code Modulation (PCM) of voice frequencies, available from http://www.itu.int/ITU-T/index.html

Document reference: AN 1386