Prosody speech processing: API: sm_asr_listen_for

This function is deprecated.

Prototype Definition

int sm_asr_listen_for(struct sm_asr_listen_for_parms *listenp)

Parameters

*listenp
a structure of the following type:
typedef struct sm_asr_listen_for_parms {
	tSMChannelId channel;					/* in */
	tSM_INT vocab_item_count;				/* in */
	tSM_UT32 *vocab_item_ids;				/* in */
	tSM_INT *vocab_recog_ids;				/* in */
	enum kSMASRMode {
		kSMASRModeDisabled,
		kSMASRModeOneShot,
		kSMASRModeContinuous,
	} asr_mode;						/* in */
	struct sm_asr_characteristics {
		tSM_INT vfr_max_frames;				/* in */
		double vfr_diff_threshold;			/* in */
		tSM_INT pse_max_frames;				/* in */
		tSM_INT pse_min_frames;				/* in */
		double vit_soft_threshold;			/* in */
		double vit_hard_threshold;			/* in */
		double vit_snr_adjust;				/* in */
	} *specific_parameters;					/* in */
	tSMChannelId sidetone;					/* in */
} SM_ASR_LISTEN_FOR_PARMS;

Description

This function controls which spoken words may be recognised on the specified channel, and is also used to start or stop speech recognition on the channel.

While listening for speech, a channel has an active vocabulary consisting of a set of vocabulary items which may be any subset of the current vocabulary loaded into the module using sm_add_input_vocab(). vocab_item_count specifies how many items are in the active vocabulary, and the array pointed to by vocab_item_ids specifies the items by ID value (as returned by sm_add_input_vocab()). The active vocabulary must include the "noise" vocabulary item. Whenever one of the items in the active vocabulary is recognised, it will be identified to the application (via sm_get_recognised()) using the corresponding user ID specified in the array vocab_recog_ids.

User IDs should be non-zero because a zero ID value is returned by sm_get_recognised() to indicate that no item in the active vocabulary could be matched with the utterance. More than one vocabulary item can be made to return the same user ID (so that, for example, the words "Zero", "Nought" and "Oh" could all return the same user ID). Conversely, the same vocabulary item may return different user IDs on different channels, so that multiple applications can assign user IDs independently of one another on different channels.

When a result is awaiting collection from sm_get_recognised(), and an event has previously been associated with the channel (see sm_channel_set_event()), that event will be signalled.

If specific_parameters is set to a non-zero value, then the module's default set of ASR parameters is overridden with the set pointed to by specific_parameters.

Non-speech sounds (including call-progress and DTMF tones) and echoes of replayed voice prompts may produce unwanted ASR results. These two problems can be ameliorated (respectively) by these two methods:

  1. Enabling DTMF recognition together with ASR: this automatically inhibits ASR for the duration of any recognised DTMF tone and then restarts it immediately afterwards.
  2. Indicating which output may be the source of any echo (using the sidetone field). This selectively reduces the recogniser's sensitivity to any echo derived from the specified channel output.

Bear in mind that speech recognition is a computationally intensive process, and the load on the Prosody Processor is proportional to the size and complexity of the active vocabulary. There is a limit to the combined sizes of all the channels' active vocabularies on a module. The application must ensure that sm_asr_listen_for() is not invoked such that the resource limits of the Prosody Processor are exceeded by the combined load of all activities including speech recognition.

Requires the module iwr to have been downloaded. Since Prosody X and Prosody S do not implement ASR, this function always returns an error.

Fields

channel (Deprecated)
The channel on which to listen.
vocab_item_count (Deprecated)
The number of items in vocab_item_ids and vocab_recog_ids.
vocab_item_ids (Deprecated)
The vocabulary items to make active.
vocab_recog_ids (Deprecated)
The identifiers to use when reporting recognition results.
asr_mode (Deprecated)
The recognition mode to use.
One of these values:
kSMASRModeDisabled
Inhibit ASR recognition.
kSMASRModeOneShot
Initiate ASR recognition. Once a word has been detected or a timeout has occurred, ASR recognition is inhibited.
kSMASRModeContinuous
Repeatedly initiate ASR recognition with the same active vocabulary and recognition parameters. Continue producing recognition results until the ASR is disabled.
specific_parameters (Deprecated)
Pointer to ASR parameters, or zero to use the defaults.

Where a parameter is specified as a number of frames, this refers to units of time, with 1 frame being equivalent to 16ms.

Recognition performance can sometimes be improved by adjusting some of these parameters. If any of these values needs changing, it is normally best to change only one value at a time: the procedure is time-consuming, and changing more than one value at once makes it too difficult to tell which change was responsible for any difference in performance.

If there is no better way to choose an optimum value for any of these parameters, it is best initially to change from the default by a factor of two (or by one half, if it needs to be made smaller). If that improves matters, try changing by another factor of two, and so on, until performance becomes worse. At that point, back-track to mid-way between the values already tried, on either side of the best one. Repeat this back-tracking until there is no significant difference between the performance of the two best values.

vfr_max_frames
Default 8. See discussion under vfr_diff_threshold.
vfr_diff_threshold
Default 0.5.

The recogniser uses a simple variable-frame-rate analysis of incoming speech in order to reduce the average computational loading. In practice, its main function is to skip over extended periods of silence without performing more than the absolute minimum of processing. vfr_max_frames sets the maximum number of input frames which will be examined before a frame is passed on to the recogniser proper, while vfr_diff_threshold is the change in the input frames required in order to trigger a frame to be passed on. Setting it to zero has the same effect as setting vfr_max_frames to 1, i.e. it disables the VFR mechanism and passes all frames straight to the recogniser. vfr_diff_threshold normally satisfies this condition:

0 <= vfr_diff_threshold <= 1.0

In cases where vocabulary size is small (or words are short), there may be an increase in accuracy if vfr_max_frames is set to a small value, say 1 or 2, and / or vfr_diff_threshold is decreased. This will cause fewer frames of data to be skipped, but it will increase the average computational loading on the respective module. Thus this method should only be used if the loading is not too heavy. Conversely, if a module is too heavily loaded, vfr_diff_threshold can be increased somewhat to reduce the loading (but at the expense of reduced accuracy). It is unlikely that an increase in vfr_max_frames will have any useful effect, and it only really makes sense to set it such that:

1 <= vfr_max_frames <= pse_min_frames

pse_max_frames
Default 63. This, together with pse_min_frames, determines the maximum and minimum latency in the ASR system (i.e. the delay between the speech ending and a result being produced). If pse_min_frames is too small, recognition accuracy will suffer because a premature result may be produced if there is a brief hesitation or quiet interval within an utterance. If pse_max_frames is too large, there will occasionally be long delays before a result is produced. In cases where lines are noisy or there are problems with excessive echo, it may be advantageous to increase pse_min_frames and / or decrease pse_max_frames, subject to these constraints:

1 <= pse_min_frames <= pse_max_frames <= 125

where the value 125 is two seconds.

pse_min_frames
Default 21. See discussion under pse_max_frames.
vit_soft_threshold
Default 0.16. Increase this value to make more recognition results be classified as uncertain, decrease it for fewer. The value zero causes no result to be reported as uncertain. See discussion under vit_hard_threshold.
vit_hard_threshold
Default 0.08. Increase this value to make more recognition results be classified as rejected, decrease it for fewer. The value zero causes no result to be reported as rejected.

It is unlikely that values greater than 1.0 would ever be used for either parameter, and vit_soft_threshold is normally set greater than vit_hard_threshold (consistent with the defaults above):

0 <= vit_hard_threshold < vit_soft_threshold < 1.0

vit_snr_adjust
Default 0.0. This is designed to facilitate compensation for an unusual signal-to-noise ratio, or for poor speech quality. It is a log-probability, so it can be positive or negative. It affects the likelihood of speech being classified as background noise, and vice versa. Optimising this parameter is not always straightforward, however: timeouts can be caused by it being either too positive or too negative.

If vit_snr_adjust is too positive, background noise may be detected as speech, both before and after the end of the utterance. The result is a timeout, because the recogniser never detects the end of the utterance. If it is too negative, the speech is less likely to be detected at all, which also results in a timeout.

It is sometimes possible to tell whether vit_snr_adjust is too positive or too negative from the nature of the recognition errors. If the errors appear to be due to the initial and/or final parts of the spoken words being ignored, it should be made more positive. If they appear to be due to background noise immediately before and/or after the spoken words being treated as part of the utterance, it should be made more negative.

If the errors offer no clues as to the cause of the problem, and the recogniser is giving too many timeouts and/or poor accuracy, vit_snr_adjust should initially be made slightly more positive. Only if that causes the timeouts to increase should it be made negative.

A suitable amount to add or subtract is 5.

sidetone (Deprecated)
The channel whose output may be echoed back into the input being analysed.

Returns

0 if the call completed successfully, otherwise a standard error.

This function is part of the Prosody speech processing API.