The WAV file format was defined by Microsoft. However, programs differ in which variants of the format they understand. This document explains some of these differences and how they affect the high-level WAV file library functions provided with Prosody.
The WAV file format is built from very simple components called chunks. A chunk consists of a four-byte code, a four-byte length, and optional data. The length is the length of the optional data (so it is zero if there is no data). If the length of the data is odd, an extra zero byte of padding must be appended after the data; this byte is not included in the length. The lengths are stored least significant byte first (little-endian). The contents of a chunk depend on its type, which is determined by its code. Some chunks contain a sequence of sub-chunks, while others contain other kinds of data. Chunk type codes are, by convention, four printable characters, which also serve as the names of the types. For example, the RIFF chunk starts with 0x52 0x49 0x46 0x46, which is 'R' 'I' 'F' 'F'.
A WAV file consists of a single chunk, which must be a RIFF chunk. The data of the RIFF chunk must contain a four-byte code (which must be 'W' 'A' 'V' 'E') followed by a sequence of sub-chunks. The minimal WAV file has only two sub-chunks in the RIFF chunk: a 'fmt ' chunk and a 'data' chunk, in that order. The 'fmt ' chunk specifies the data format and the 'data' chunk contains the actual data.
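
The chunk rules above can be sketched in a few lines of Python. This is an illustration only and is not part of the Prosody library; the function names `pack_chunk` and `pack_wave` are hypothetical.

```python
import struct


def pack_chunk(code, data=b""):
    """Serialise one chunk: four-byte code, four-byte length, data, padding."""
    if len(code) != 4:
        raise ValueError("chunk code must be exactly four bytes")
    # The length field counts only the data, never the padding byte.
    header = code + struct.pack("<I", len(data))
    # Odd-length data is followed by a single zero byte of padding.
    padding = b"\x00" if len(data) % 2 else b""
    return header + data + padding


def pack_wave(fmt_data, sample_data):
    """Build a minimal WAV image: RIFF('WAVE', 'fmt ' chunk, 'data' chunk)."""
    body = (b"WAVE"
            + pack_chunk(b"fmt ", fmt_data)
            + pack_chunk(b"data", sample_data))
    return pack_chunk(b"RIFF", body)
```

Note that the padding byte of a sub-chunk is counted in the length of the enclosing RIFF chunk, because from the point of view of the RIFF chunk it is simply part of its data.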
Here is a diagram of a very small WAV file:
Content | | | | Explanation | Chunk |
---|---|---|---|---|---|
52 | 49 | 46 | 46 | The 'RIFF' code | RIFF |
2c | 00 | 00 | 00 | The length of the RIFF chunk | |
57 | 41 | 56 | 45 | The 'WAVE' code | |
66 | 6d | 74 | 20 | The 'fmt ' code | fmt |
10 | 00 | 00 | 00 | The length of the fmt chunk | |
11 | 22 | 33 | 44 | Contents of the fmt chunk | |
55 | 66 | 77 | 88 | | |
99 | aa | bb | cc | | |
dd | ee | ff | 00 | | |
64 | 61 | 74 | 61 | The 'data' code | data |
07 | 00 | 00 | 00 | The length of the data chunk | |
11 | 22 | 33 | 44 | Contents of the data chunk | |
55 | 66 | 77 | | | |
00 | | | | Padding | |
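
As a cross-check of the diagram, the following sketch walks the same 52 bytes and lists the sub-chunks it finds. It is a minimal reader written for this illustration, not the Prosody implementation; `iter_chunks`, `read_wave` and `SAMPLE` are hypothetical names.

```python
import struct


def iter_chunks(buf, offset=0, end=None):
    """Yield (code, data) for each chunk found in buf[offset:end]."""
    end = len(buf) if end is None else end
    while offset + 8 <= end:
        code = buf[offset:offset + 4]
        (length,) = struct.unpack_from("<I", buf, offset + 4)
        start = offset + 8
        if start + length > end:
            raise ValueError("chunk %r runs past the end of its container" % code)
        yield code, buf[start:start + length]
        # Step over the data and the padding byte that follows odd-length data.
        offset = start + length + (length & 1)


def read_wave(buf):
    """Return the (code, data) sub-chunks of a WAV file image."""
    chunks = list(iter_chunks(buf))
    if len(chunks) != 1 or chunks[0][0] != b"RIFF":
        raise ValueError("expected a single RIFF chunk")
    body = chunks[0][1]
    if body[:4] != b"WAVE":
        raise ValueError("RIFF chunk does not start with the 'WAVE' code")
    return list(iter_chunks(body, 4))


# The bytes of the diagram above, row by row.
SAMPLE = bytes.fromhex(
    "52494646 2c000000 57415645"
    "666d7420 10000000 11223344 55667788 99aabbcc ddeeff00"
    "64617461 07000000 11223344 556677 00"
)
```

Calling `read_wave(SAMPLE)` yields a 'fmt ' chunk with 16 bytes of data followed by a 'data' chunk with 7 bytes of data; the padding byte is skipped and never returned as data.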
The original WAV file specification ("Multimedia Programming Interface and Data Specifications 1.0", published in 1991 by Microsoft and IBM) described a 16-byte fmt chunk. This covered the basic PCM formats and was equivalent to the PCMWAVEFORMAT structure. However, when other encodings were added to the specification, some required extra information, so the fmt chunk was extended as defined in "New Multimedia Data Types and Data Techniques", published in 1994 by Microsoft. Firstly, an extra two-byte length field was appended. Then, depending on the encoding, further bytes were added, with the new length field indicating how many of these are present. Unfortunately, the two-byte length was specified to be present, with the value zero, even if there are no further bytes in the fmt chunk. As a result, basic PCM formats are encoded differently under the new specification than under the old one, even though no extra information is present, and quite a lot of software cannot read both the old and the new formats.
Note that it is especially unfortunate that the specification was modified in this way, since the original already permitted the fmt chunk to contain further data, and the extra two-byte length field is unnecessary because the size of the fmt chunk is already known from the chunk's own length field.
The structure equivalent to the 1994 version of the fmt chunk is called WAVEFORMATEX.
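
A reader that accepts both the 1991 and the 1994 layouts might look roughly like the sketch below. The field names follow Microsoft's PCMWAVEFORMAT and WAVEFORMATEX structures; the function `parse_fmt` is hypothetical, and treating every fmt chunk longer than 16 bytes as the 1994 layout is a simplifying assumption of this sketch, not something either specification guarantees.

```python
import struct


def parse_fmt(data):
    """Decode the data of a 'fmt ' chunk in the 1991 or 1994 layout."""
    if len(data) < 16:
        raise ValueError("fmt chunk shorter than 16 bytes")
    # The first 16 bytes are common to both layouts.
    (tag, channels, rate, avg_bytes,
     block_align, bits) = struct.unpack_from("<HHIIHH", data, 0)
    info = {
        "wFormatTag": tag,
        "nChannels": channels,
        "nSamplesPerSec": rate,
        "nAvgBytesPerSec": avg_bytes,
        "nBlockAlign": block_align,
        "wBitsPerSample": bits,
    }
    if len(data) == 16:
        # 1991 layout: no two-byte length field at all.
        info["layout"] = "1991"
        info["extra"] = b""
        return info
    if len(data) < 18:
        raise ValueError("fmt chunk has a truncated two-byte length field")
    # 1994 layout: a two-byte length followed by that many extra bytes.
    (cb_size,) = struct.unpack_from("<H", data, 16)
    if 18 + cb_size > len(data):
        raise ValueError("two-byte length exceeds the fmt chunk")
    info["layout"] = "1994"
    info["extra"] = data[18:18 + cb_size]
    return info
```

For basic PCM the two layouts decode to the same fields; only the presence of the two zero bytes at offset 16 distinguishes them.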
Microsoft have further modified the WAV file format ("Multiple Channel Audio Data and WAVE files", published by Microsoft in 2001), again extending the fmt chunk. This version is equivalent to the WAVEFORMATEXTENSIBLE structure. It adds more fields, which are included in the length recorded in the two-byte length field (as well as in the length field of the fmt chunk itself, of course). While this alone would not cause any incompatibility, the field which describes the data encoding (previously at the beginning of the fmt chunk, and sometimes called the wFormatTag) is now in this extension, with the old encoding field simply indicating that the true encoding (called SubFormat) is to be found in its new place. This means that applications conforming to the new version of the specification can produce files which cannot be understood by programs conforming to the previous version, even when those files do not need or use any of the new extensions. Obviously, this is a rather unfortunate situation, especially as applications written and tested against this latest specification may not understand files created by programs conforming to the older versions.
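
To make the relocation of the encoding field concrete, here is a small sketch that reports where the encoding of a fmt chunk is to be found. The constant 0xFFFE (WAVE_FORMAT_EXTENSIBLE), the SubFormat offset of 24 bytes, the minimum extension length of 22 bytes, and the PCM GUID are taken from Microsoft's published WAVEFORMATEXTENSIBLE layout rather than from this document; `find_encoding` is a hypothetical name.

```python
import struct
import uuid

# Value of the old encoding field that redirects the reader to SubFormat.
WAVE_FORMAT_EXTENSIBLE = 0xFFFE
# SubFormat value that Microsoft defines for plain PCM.
KSDATAFORMAT_SUBTYPE_PCM = uuid.UUID("00000001-0000-0010-8000-00aa00389b71")


def find_encoding(fmt_data):
    """Return ("tag", n) for 1991/1994 files or ("subformat", UUID) for 2001 files."""
    if len(fmt_data) < 16:
        raise ValueError("fmt chunk shorter than 16 bytes")
    (tag,) = struct.unpack_from("<H", fmt_data, 0)
    if tag != WAVE_FORMAT_EXTENSIBLE:
        # The encoding is where older readers expect it.
        return ("tag", tag)
    if len(fmt_data) < 40:
        raise ValueError("extensible fmt chunk shorter than 40 bytes")
    (cb_size,) = struct.unpack_from("<H", fmt_data, 16)
    if cb_size < 22:
        raise ValueError("two-byte length too small for the 2001 extension")
    # The 16-byte SubFormat GUID occupies the last 16 of the 22 extension bytes.
    return ("subformat", uuid.UUID(bytes_le=fmt_data[24:40]))
```

A reader that supports only the earlier versions could use a check of this kind to reject 2001-style files with a clear message instead of misinterpreting 0xFFFE as an unknown encoding.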
The high-level library supplied with Prosody can understand files conforming to either the original (1991) version of the specification or the first (1994) revision. It cannot understand files which make use of the extensions in the 2001 revision, even if the files use a data encoding that Prosody handles.
Files created by the Prosody library conform to the 1994 version of the specification because this is currently the most widely supported version. However, the library can trivially be modified to produce files conforming to the 1991 version by defining the C preprocessor symbol OMIT_CBSIZE when building it. Note that quite a few applications have problems with the 1991 format.
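
The difference between the two output forms is just two bytes in the fmt chunk. The following Python sketch shows the idea for basic PCM; it is not the Prosody library's code, and `build_pcm_fmt` and its `omit_cbsize` parameter are hypothetical stand-ins for the effect of the OMIT_CBSIZE build option.

```python
import struct

WAVE_FORMAT_PCM = 1  # Microsoft's encoding value for basic PCM


def build_pcm_fmt(channels, sample_rate, bits_per_sample, omit_cbsize=False):
    """Return a complete 'fmt ' chunk for PCM in the 1994 or 1991 layout."""
    block_align = channels * bits_per_sample // 8
    avg_bytes_per_sec = sample_rate * block_align
    body = struct.pack("<HHIIHH", WAVE_FORMAT_PCM, channels, sample_rate,
                       avg_bytes_per_sec, block_align, bits_per_sample)
    if not omit_cbsize:
        # 1994 layout: a two-byte length of zero, with nothing following it.
        body += struct.pack("<H", 0)
    # Both body sizes (16 and 18) are even, so no padding byte is needed.
    return b"fmt " + struct.pack("<I", len(body)) + body
```

With the default setting the chunk length field is 0x12 (18 bytes of data); with `omit_cbsize=True` it is 0x10 (16 bytes), matching the fmt chunk in the diagram earlier in this document.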
Document reference: AN ????