Prosody application note: speech processing replay and record data formats

Introduction

This document describes the data formats provided by Prosody for recording and replaying speech data, outlines the characteristics of each format, and gives some indication of when each format is applicable.

WAV files, VOX files and raw data streams

The Prosody speech API can record and replay speech data either to/from files (using the high-level play/record API) or to/from data buffers (using the low-level calls sm_replay_start() and sm_record_start()).

Speech data consists of a sequence of octets encoding a speech signal. Prosody can replay and record speech data encoded in a number of different ways; these different encodings are called speech data formats. Each Prosody speech data format has different characteristics, and the choice of data format used in an application depends on the requirements of the application and the capabilities of the platform hosting the application.

When Prosody is used to record speech data, a (mu-law or A-law) speech signal switched to a Prosody channel input is encoded into a sequence of octets in an explicitly specified data format.

When Prosody is used to replay speech data, a sequence of octets in an explicitly specified data format is decoded and a (mu-law or A-law) speech signal is generated on a Prosody channel output. It is therefore essential that the Prosody module uses the correct decoding algorithm to generate the output speech signal from the speech data.

When speech data used with the Prosody speech processing firmware is stored in a file, the high level speech processing API provides a set of calls for handling two commonly used file formats:

RAW or VOX files contain only encoded speech data (they have no header), so it is not possible to determine the speech data format of a RAW file from its contents alone.

The WAV file format is a Microsoft-defined format that includes a header containing information about the encoded speech data that follows. Within the header, a data format identifier indicates the data format of the speech data contained in the WAV file. These identifier values are defined by Microsoft (or registered with Microsoft by product vendors); examples include WAVE_FORMAT_PCM, WAVE_FORMAT_ALAW and WAVE_FORMAT_MULAW (see the table in the Prosody Supported Data Formats section below).

WAV data formats require more than the identifier to completely describe them. For example, the identifier WAVE_FORMAT_PCM simply states that the data in the file is in the form of uniformly quantised samples. The sampling rate is the frequency at which the samples must be output in order to reproduce the original speed and pitch. The sample size is the number of bits in each sample (the numerical accuracy of the sample). Further, the data in a WAV file can be mono (one channel) or stereo (two parallel channels). Thus the data formats commonly used by WAV players (and recorders) combine a format identifier with a sampling rate, a sample size and a channel count; a sketch of the header fields that carry this information is given below.
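As an illustration (this is a sketch for inspecting a WAV header, not part of the Prosody API), the fields of the 'fmt ' chunk described by WAVEFORMATEX [1] can be modelled in C as follows; the format tag values shown are the standard Microsoft registrations:

    #include <stdint.h>

    /* Standard Microsoft format tag values (subset) */
    #define WAVE_FORMAT_PCM        0x0001
    #define WAVE_FORMAT_ALAW       0x0006
    #define WAVE_FORMAT_MULAW      0x0007
    #define WAVE_FORMAT_OKI_ADPCM  0x0010
    #define WAVE_FORMAT_IMA_ADPCM  0x0011   /* also known as DVI ADPCM */

    /* Fields of the 'fmt ' chunk, as described by WAVEFORMATEX [1].
       All values are stored little-endian in the file. */
    struct wav_fmt_chunk {
        uint16_t format_tag;        /* data format identifier, e.g. WAVE_FORMAT_ALAW */
        uint16_t channels;          /* 1 = mono, 2 = stereo */
        uint32_t samples_per_sec;   /* sampling rate, e.g. 8000 */
        uint32_t avg_bytes_per_sec; /* average data rate in octets per second */
        uint16_t block_align;       /* octets per sample frame */
        uint16_t bits_per_sample;   /* sample size, e.g. 8 or 16 */
        /* an optional cbSize field and extra bytes may follow for non-PCM formats */
    };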

WAVE_FORMAT_ALAW and WAVE_FORMAT_MULAW are almost always sampled at 8000 Hz and are both implicitly 8 bits per sample, giving a data rate of 8 bits x 8000 samples/second = 64 kilobits per second (Kbps).

CTI-specific data formats, such as OKI ADPCM, are described in the sections below.

Prosody, being a telephony system, handles only mono signals, since telephones have no facility for stereo.

WAV file data format identifiers are not the same as the type parameter supplied to Prosody replay/record API calls. It is not, in general, possible to replay an arbitrary type of WAV file data using the Prosody API: only certain data format identifiers are directly supported. In these cases Prosody can map the WAV data format identifier to one of its equivalent data formats and use the speech data in the WAV file directly; otherwise the WAV speech data must be transcoded into a Prosody-supported data format. A sketch of such a mapping is given below.
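The following sketch shows how such a mapping might look in an application. It is illustrative only: it reuses the WAVE_FORMAT_* values from the previous sketch and assumes that the kSMDataFormat* constants listed in the Prosody Supported Data Formats table below are in scope from the Prosody API headers; the exact types of those constants and the calls used to apply the chosen format are not shown.

    /* Illustrative mapping from a WAV data format identifier to one of the
       Prosody data formats listed in the table below.  Returns 0 on success,
       -1 if the WAV data is not directly supported and must be transcoded.
       The int out-parameter is an assumption; use whatever type the Prosody
       headers declare for the kSMDataFormat* constants. */
    static int wav_tag_to_prosody_format(uint16_t format_tag,
                                         uint16_t bits_per_sample,
                                         int *prosody_format)
    {
        switch (format_tag) {
        case WAVE_FORMAT_ALAW:      *prosody_format = kSMDataFormatALawPCM;  return 0;
        case WAVE_FORMAT_MULAW:     *prosody_format = kSMDataFormatULawPCM;  return 0;
        case WAVE_FORMAT_OKI_ADPCM: *prosody_format = kSMDataFormatOKIADPCM; return 0;
        case WAVE_FORMAT_IMA_ADPCM: *prosody_format = kSMDataFormatIMAADPCM; return 0;
        case WAVE_FORMAT_PCM:
            *prosody_format = (bits_per_sample == 16) ? kSMDataFormat16bit
                                                      : kSMDataFormat8bit;
            return 0;
        default:
            return -1;   /* not directly supported: transcode before replay */
        }
    }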

Customers can handle other speech file formats (neither WAV nor RAW) by writing their own code on top of the low-level buffer-based record/replay API.

Whereas some traditional CTI installations use RAW files, Prosody developers are encouraged to use WAV files: the file header can be interrogated by a playback/record application using Prosody API calls, which can then choose the appropriate Prosody algorithm for playback or recording of the file. If the file is RAW, some knowledge independent of the file itself is required, and it is possible to attempt playback of a particular file using the wrong algorithm.

Another advantage of storing recorded data in WAV files is that, if the data format is supported by other applications such as a WAV player, the recording can be rendered on a desktop computer without requiring a Prosody card. A sketch of a suitable header for such a file is given below.
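As an example, the following sketch (application code, not part of the Prosody API) writes a minimal WAV header for an 8000 samples/second mono A-law recording so that the file can be opened by a desktop WAV player. The chunk layout follows the WAVEFORMATEX description in [1]; strictly, non-PCM WAV files should also carry a 'fact' chunk, which many players do not require.

    #include <stdio.h>
    #include <stdint.h>

    /* Write a little-endian value of the given width in octets */
    static void put_le(FILE *f, uint32_t v, int octets)
    {
        for (int i = 0; i < octets; i++)
            fputc((v >> (8 * i)) & 0xFF, f);
    }

    /* Write a minimal WAV header for 8000 Hz, mono, 8-bit A-law data.
       data_bytes is the number of encoded octets that will follow. */
    static void write_alaw_wav_header(FILE *f, uint32_t data_bytes)
    {
        const uint16_t format_tag  = 0x0006;   /* WAVE_FORMAT_ALAW */
        const uint16_t channels    = 1;        /* mono */
        const uint32_t sample_rate = 8000;
        const uint16_t bits        = 8;
        const uint16_t block_align = channels * bits / 8;
        const uint32_t byte_rate   = sample_rate * block_align;

        fwrite("RIFF", 1, 4, f);
        put_le(f, 4 + (8 + 18) + (8 + data_bytes), 4); /* size of all data after this field */
        fwrite("WAVE", 1, 4, f);

        fwrite("fmt ", 1, 4, f);
        put_le(f, 18, 4);                /* fmt chunk size (WAVEFORMATEX incl. cbSize) */
        put_le(f, format_tag, 2);
        put_le(f, channels, 2);
        put_le(f, sample_rate, 4);
        put_le(f, byte_rate, 4);
        put_le(f, block_align, 2);
        put_le(f, bits, 2);
        put_le(f, 0, 2);                 /* cbSize = 0: no extra format bytes */

        fwrite("data", 1, 4, f);
        put_le(f, data_bytes, 4);        /* the encoded A-law octets follow this header */
    }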

Compatibility

Since the term WAV (or WAVE) file refers only to the header on the file, and since there are many data formats (e.g. WAVE_FORMAT_ALAW, WAVE_FORMAT_G721), it is virtually impossible for a single application to support all data formats. Most desktop WAVE players will support WAVE_FORMAT_ALAW, WAVE_FORMAT_MULAW and WAVE_FORMAT_PCM (but only at certain sampling rates and bit rates).

On Windows 2000 platforms, it is the Audio Compression Manager (ACM) that determines which types of WAV file may be rendered to a sound card by media players. The media player interacts with ACM drivers to convert data between data formats.

Prosody Supported Data Formats

Prosody format description    Registered WAV type tag    Bits per sample
kSMDataFormatALawPCM          WAVE_FORMAT_ALAW           8
kSMDataFormatULawPCM          WAVE_FORMAT_MULAW          8
kSMDataFormatOKIADPCM         WAVE_FORMAT_OKI_ADPCM      4
kSMDataFormatIMAADPCM         WAVE_FORMAT_IMA_ADPCM      4
kSMDataFormat8bit             WAVE_FORMAT_PCM            8
kSMDataFormat16bit            WAVE_FORMAT_PCM            16
kSMDataFormatSigned8bit       WAVE_FORMAT_PCM            8
kSMDataFormatSpeex            WAVE_FORMAT_SPEEX          +

Note: + See the Speex website for details.

Linear PCM

Pulse Code Modulation (PCM) is the method of periodically measuring the voltage of a waveform (a speech signal) and storing that voltage as an integer, which can take a finite number of values. In linear PCM (16-bit or 8-bit), the integers used for the samples are simply fractions of the full-scale deflection of the signal. The rate at which these values are measured is the sampling rate, and the conversion to integer values is known as quantisation. A minimal sketch of this quantisation is given below.
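A minimal sketch of uniform quantisation (illustrative C, not Prosody firmware code), mapping a full-scale value in the range -1.0 to +1.0 to 16-bit and 8-bit linear PCM samples:

    #include <stdint.h>

    /* Uniformly quantise a full-scale value in the range -1.0 .. +1.0
       to a signed 16-bit linear PCM sample. */
    static int16_t quantise16(double v)
    {
        if (v > 1.0)  v = 1.0;        /* clip to full scale */
        if (v < -1.0) v = -1.0;
        return (int16_t)(v * 32767.0);
    }

    /* The same signal quantised to 8 bits keeps only 256 levels.
       (8-bit PCM may be stored signed or unsigned; signed is shown here.) */
    static int8_t quantise8(double v)
    {
        if (v > 1.0)  v = 1.0;
        if (v < -1.0) v = -1.0;
        return (int8_t)(v * 127.0);
    }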

Mu-law

Mu-law (sometimes written u-law or µ-law) is the format used for transport of speech data in the American and Japanese telephone networks. It is specified by ITU G.711 [2]. This is a non-uniform (logarithmic) quantisation of signal samples, using 8 data bits per sample at a rate of 8000 samples per second, hence 64 kilobits per second. The subjective sound quality is equivalent to linear sampling at 14 bits at the same sampling rate. Mu-law is commonly supported by WAVE players.
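To make the non-uniform quantisation concrete, the following is a minimal sketch of G.711 mu-law encoding of a 16-bit linear sample, based on the commonly used reference implementation (it is not Prosody firmware code):

    /* Encode one signed 16-bit linear sample as a G.711 mu-law octet. */
    static unsigned char linear_to_ulaw(int sample)   /* sample: -32768 .. 32767 */
    {
        const int BIAS = 0x84;      /* 132: bias makes the leading-bit search simple */
        const int CLIP = 32635;

        int sign = (sample >> 8) & 0x80;              /* keep the sign bit */
        if (sign) sample = -sample;                   /* work on the magnitude */
        if (sample > CLIP) sample = CLIP;             /* clip to avoid overflow */
        sample += BIAS;

        /* Find the position of the leading set bit among bits 14..7 */
        int exponent = 7;
        for (int mask = 0x4000; (sample & mask) == 0 && exponent > 0; mask >>= 1)
            exponent--;
        int mantissa = (sample >> (exponent + 3)) & 0x0F;

        /* G.711 transmits the octet with its bits inverted */
        return (unsigned char)~(sign | (exponent << 4) | mantissa);
    }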

A-law

A-law is the format used for transport of speech data on telephone networks around the world except in America and Japan. It is specified by ITU G.711 [2]. This is a non-uniform (logarithmic) quantisation of signal samples, using 8 data bits per sample at a rate of 8000 samples per second, hence 64 kilobits per second. The subjective sound quality is equivalent to linear sampling at 13 bits at the same sampling rate. A-law is commonly supported by WAVE players.

OKI ADPCM

OKI ADPCM has become an industry standard for storage of speech signals at reduced data rates of 32 Kbps and 24 Kbps. This coding method is differential and uses four data bits per sample, sampled at 8000 or 6000 samples per second respectively (4 bits x 8000 samples/s = 32 Kbps; 4 bits x 6000 samples/s = 24 Kbps). When recording at 24 Kbps, the network sampling rate of 8000 samples per second is subsampled to 6000 samples per second by DSP software before the ADPCM algorithm is applied; the reverse is done for replay. The supported file formats are .vox and .vap.

IMA ADPCM

IMA ADPCM (also known as 'DVI ADPCM' or simply 'ADPCM') was developed by Intel and approved by the Interactive Multimedia Association (IMA) as a low-complexity, cross-platform coding scheme for speech. Unfortunately, the data format can differ between implementations; in particular, a legacy build of Microsoft Windows used an incompatible format. The format supported by Prosody is consistent with Microsoft Windows NT 4.0, Windows 9x and Windows 2000. IMA ADPCM is commonly supported by WAVE players.

Sampling rate

While a data format determines how much information is recorded for each sample, the total amount of information recorded about a signal also depends on the sampling rate. Prosody supports several sampling rates, including 6000, 8000 and 11000 samples per second, which are discussed below.

Since signals are sent across the telephone network at 8000 samples per second, it is normally pointless to record or play files at a higher sampling rate: the extra sound quality cannot cross the phone network and merely serves to make the files bigger. However, since PC sound cards are typically optimised for CD sampling rates (16-bit linear at 44100 samples/second), it may be more convenient to have files which use a rate of 11000 samples per second. This is sufficiently close to a quarter of the CD rate (11025 samples/second) that there is no perceptible difference (it is less than 1% off). Some PC sound cards do not handle 8000 samples/second very well, so this may also influence the choice of rate.

The rate of 6000 samples per second offers a 25% saving in file size compared with 8000 samples per second, at the cost of some sound quality.

Other types of ADPCM

There are two other very common types of ADPCM. Neither is compatible with OKI ADPCM.

Applicability of Supported Data Formats

The choice of data format for use in an application may depend on some or all of the following factors: the data rate, the resulting speech quality, the encoding and decoding cost, portability to other applications, platform/network system performance, and platform disk capacity. These are summarised in the table and sections below.

Summary of Data Format Characteristics

Data format          Data rate    Speech quality    Encoding cost    Decoding cost    Portability
16-bit Linear PCM    Very high    Best              Low              Low              Universal
A-law or Mu-law      High         Best              Low              Low              Wide
OKI ADPCM 32 Kbps    Moderate     Moderate          Moderate         Moderate         Within CT
OKI ADPCM 24 Kbps    Moderate     Low               Moderate         Moderate         Within CT
IMA ADPCM 32 Kbps    Moderate     Moderate          Moderate         Moderate         Wide

Platform/Network System Performance

Each active channel of data (recording or replaying) generally requires 2 Kbyte buffers of data octets to be supplied at regular intervals. With uncompressed data this is roughly every 250 ms; with 2:1 compressed data, roughly every 500 ms; and so on. The arithmetic is sketched below.
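The buffer intervals quoted above follow directly from the data rates. A small sketch of the arithmetic (the 2 Kbyte buffer size is taken from the text; the 60-channel figure is an arbitrary example, and the exact results of 256 ms and 512 ms are rounded to 250 ms and 500 ms above):

    #include <stdio.h>

    int main(void)
    {
        const double buffer_octets = 2048.0;         /* one 2 Kbyte buffer */
        const double sample_rate   = 8000.0;         /* samples per second on the network */

        /* Octets per second for 8-bit (uncompressed) and 4-bit (2:1 compressed) formats */
        double rate_8bit = sample_rate * 8.0 / 8.0;  /* 8000 octets/s */
        double rate_4bit = sample_rate * 4.0 / 8.0;  /* 4000 octets/s */

        printf("8-bit data : one buffer every %.0f ms\n", 1000.0 * buffer_octets / rate_8bit); /* ~256 ms */
        printf("4-bit ADPCM: one buffer every %.0f ms\n", 1000.0 * buffer_octets / rate_4bit); /* ~512 ms */

        /* Aggregate throughput for, say, 60 active A-law channels */
        int channels = 60;
        printf("60 channels of A-law: %.0f Kbytes/s sustained\n",
               channels * rate_8bit / 1024.0);       /* ~469 Kbytes/s */
        return 0;
    }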

The sustainable channel count depends on how reliably the system can deliver data buffers to active channels. This is a function of CPU performance, other loading on the CPU, local disk controller performance, and network performance if replay data comes over a LAN.

It is evident that the more compressed the speech data is, the less often data has to be supplied to active channels and thus the less demand this makes on system performance.

Platform disk capacity

If long duration or large numbers of recordings are going to be stored and disk capacity for these recordings is an issue, then consideration should be given to the compression ratio offered by a particular data format.

See the Prosody Supported Data Formats table above for the bits-per-sample figure for each data format. Each PCM sample is 8 bits, so if a data format uses N bits per sample, the compression ratio relative to 8-bit PCM is 8/N. A worked example is sketched below.
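For example, the following back-of-envelope sketch (the one-hour duration is arbitrary) computes the compression ratio and the resulting file size for the data formats in the table above:

    #include <stdio.h>

    int main(void)
    {
        const double sample_rate = 8000.0;    /* samples per second */
        const double seconds     = 3600.0;    /* a one-hour recording, as an example */

        /* bits per sample from the Prosody Supported Data Formats table */
        struct { const char *name; int bits; } fmt[] = {
            { "16-bit linear PCM", 16 },
            { "A-law / Mu-law",     8 },
            { "OKI/IMA ADPCM",      4 },
        };

        for (int i = 0; i < 3; i++) {
            double ratio  = 8.0 / fmt[i].bits;   /* compression ratio 8/N vs 8-bit PCM */
            double mbytes = sample_rate * fmt[i].bits / 8.0 * seconds / (1024.0 * 1024.0);
            printf("%-18s ratio %.1f : one hour = %.1f Mbytes\n",
                   fmt[i].name, ratio, mbytes);  /* ~54.9, ~27.5 and ~13.7 Mbytes */
        }
        return 0;
    }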

References

  1. The structure of a WAV file header is described in MSDN online library section: Platform SDK/Graphics and Multimedia Services/Windows Multimedia/Multimedia Reference/Multimedia Structures/WAVEFORMATEX
  2. ITU-T Recommendation G.711, Pulse Code Modulation (PCM) of voice frequencies, available from http://www.itu.int/ITU-T/index.html

Document reference: AN 1386