Prosody speech processing: API: sm_asr_listen_for

This function is deprecated.

Prototype Definition

int sm_asr_listen_for(struct sm_asr_listen_for_parms *listenp)

Parameters

*listenp
a structure of the following type:
typedef struct sm_asr_listen_for_parms {
	tSMChannelId channel;					/* in */
	tSM_INT vocab_item_count;				/* in */
	tSM_UT32 *vocab_item_ids;				/* in */
	tSM_INT *vocab_recog_ids;				/* in */
	enum kSMASRMode {
		kSMASRModeDisabled,
		kSMASRModeOneShot,
		kSMASRModeContinuous,
	} asr_mode;						/* in */
	struct sm_asr_characteristics {
		tSM_INT vfr_max_frames;				/* in */
		double vfr_diff_threshold;			/* in */
		tSM_INT pse_max_frames;				/* in */
		tSM_INT pse_min_frames;				/* in */
		double vit_soft_threshold;			/* in */
		double vit_hard_threshold;			/* in */
		double vit_snr_adjust;				/* in */
	} *specific_parameters;					/* in */
	tSMChannelId sidetone;					/* in */
} SM_ASR_LISTEN_FOR_PARMS;

Description

This function controls which spoken words may be recognised on the specified channel, and is also used to start or stop speech recognition on the channel.

While listening for speech, a channel has an active vocabulary consisting of a set of vocabulary items which may be any subset of the current vocabulary loaded into the module using sm_add_input_vocab(). vocab_item_count specifies how many items are in the active vocabulary, and the array pointed to by vocab_item_ids specifies the items by ID value (as returned by sm_add_input_vocab()). The active vocabulary must include the "noise" vocabulary item. Whenever one of the items in the active vocabulary is recognised, it will be identified to the application (via sm_get_recognised()) using the corresponding user ID specified in the array vocab_recog_ids.

User IDs should be non-zero because a zero ID value is returned by sm_get_recognised() to indicate that no item in the active vocabulary could be matched with the utterance. More than one vocabulary item can be made to return the same user ID (so that, for example, the words "Zero", "Nought" and "Oh" could all return the same user ID). Conversely, the same vocabulary item may return different user IDs on different channels, so that multiple applications can assign user IDs independently of one another on different channels.

When a result is awaiting collection from sm_get_recognised(), and an event has previously been associated with the channel (see sm_channel_set_event()), that event will be signalled.

If specific_parameters is set to a non-zero value, then the module's default set of ASR parameters is overridden with the set pointed to by specific_parameters.

Non-speech sounds (including call-progress and DTMF tones) and echoes of replayed voice prompts may produce unwanted ASR results. These two problems can be ameliorated (respectively) by these two methods:

  1. Enabling DTMF recognition together with ASR: this automatically inhibits ASR for the duration of any recognised DTMF tone and then restarts it immediately afterwards.
  2. Indicating which output may be the source of any echo (using the sidetone field). This selectively reduces the recogniser's sensitivity to any echo derived from the specified channel output.

Bear in mind that speech recognition is a computationally intensive process, and the load on the Prosody Processor is proportional to the size and complexity of the active vocabulary. There is a limit to the combined sizes of all the channels' active vocabularies on a module. The application must ensure that sm_asr_listen_for() is not invoked such that the resource limits of the Prosody Processor are exceeded by the combined load of all activities including speech recognition.

Requires the module iwr to have been downloaded. Since Prosody X and Prosody S do not implement ASR, this function always returns an error.

Fields

channel (Deprecated)
The channel on which to listen.
vocab_item_count (Deprecated)
The number of items in vocab_item_ids and vocab_recog_ids.
vocab_item_ids (Deprecated)
The vocabulary items to make active.
vocab_recog_ids (Deprecated)
The identifiers to use when reporting recognition results.
asr_mode (Deprecated)
The recognition mode to use.
One of these values:
kSMASRModeDisabled
Inhibit ASR recognition.
kSMASRModeOneShot
Initiate ASR recognition. Once a word has been detected or a timeout has occurred, ASR recognition is inhibited.
kSMASRModeContinuous
Repeatedly initiate ASR recognition with the same active vocabulary and recognition parameters. Continue producing recognition results until the ASR is disabled.
specific_parameters (Deprecated)
Pointer to ASR parameters, or zero to use the defaults.

Where a parameter is specified as a number of frames, this refers to units of time, with 1 frame being equivalent to 16ms.

Recognition performance can sometimes be improved by adjusting some of these parameters. If any of these values needs changing, it is normally best to change only one value at a time: the procedure is time-consuming, and changing more than one value at once makes it too difficult to tell which change was responsible for any difference in performance.

If there is no better way to choose an optimum value for any of these parameters, it is best initially to change from the default by a factor of two (or by one half, if it needs to be made smaller). If that improves matters, try changing by another factor of two, and so on, until performance becomes worse. At that point, back-track to mid-way between the values already tried, on either side of the best one. Repeat this back-tracking until there is no significant difference between the performance of the two best values.

vfr_max_frames
Default 8. See discussion under vfr_diff_threshold.
vfr_diff_threshold
Default 0.5.

The recogniser uses a simple variable-frame-rate analysis of incoming speech in order to reduce the average computational loading. In practice, its main function is to skip over extended periods of silence without performing more than the absolute minimum of processing. vfr_max_frames sets the maximum number of input frames which will be examined before a frame is passed on to the recogniser proper, while vfr_diff_threshold is the change in the input frames required in order to trigger a frame to be passed on. Setting it to zero has the same effect as setting vfr_max_frames to 1, i.e. it disables the VFR mechanism and passes all frames straight to the recogniser. vfr_diff_threshold normally satisfies this condition:

0 <= vfr_diff_threshold <= 1.0

In cases where vocabulary size is small (or words are short), there may be an increase in accuracy if vfr_max_frames is set to a small value, say 1 or 2, and / or vfr_diff_threshold is decreased. This will cause fewer frames of data to be skipped, but it will increase the average computational loading on the respective module. Thus this method should only be used if the loading is not too heavy. Conversely, if a module is too heavily loaded, vfr_diff_threshold can be increased somewhat to reduce the loading (but at the expense of reduced accuracy). It is unlikely that an increase in vfr_max_frames will have any useful effect, and it only really makes sense to set it such that:

1 <= vfr_max_frames <= pse_min_frames

pse_max_frames
Default 63. This, together with pse_min_frames, determines the maximum and minimum latency in the ASR system (i.e. the delay between the speech ending and a result being produced). If pse_min_frames is too small, recognition accuracy will suffer because a premature result may be produced if there is a brief hesitation or quiet interval within an utterance. If pse_max_frames is too large, there will occasionally be long delays before a result is produced. In cases where lines are noisy or there are problems with excessive echo, it may be advantageous to increase pse_min_frames and / or decrease pse_max_frames, subject to these constraints:

1 <= pse_min_frames <= pse_max_frames <= 125

where the value 125 is two seconds.

pse_min_frames
Default 21. See discussion under pse_max_frames.
vit_soft_threshold
Default 0.16. Increase this value to make more recognition results be classified as uncertain, decrease it for fewer. The value zero causes no result to be reported as uncertain. See discussion under vit_hard_threshold.
vit_hard_threshold
Default 0.08. Increase this value to make more recognition results be classified as rejected, decrease it for fewer. The value zero causes no result to be reported as rejected.

It is unlikely that values greater than 1.0 would ever be used for either parameter, and vit_soft_threshold is normally set greater than vit_hard_threshold (consistent with the defaults above):

0 <= vit_hard_threshold < vit_soft_threshold < 1.0

vit_snr_adjust
Default 0.0. This is designed to facilitate compensation for an unusual signal-to-noise ratio, or for poor speech quality. It is a log-probability, so it can be positive or negative. It affects the likelihood of speech being classified as background noise, and vice versa. Optimising this parameter is not always straightforward, however: timeouts can be caused by it being either too positive or too negative.

If vit_snr_adjust is too positive, background noise may be detected as speech, both before and after the end of the utterance. The result is a timeout, because the recogniser never detects the end of the utterance. If it is too negative, the speech is less likely to be detected at all, which also results in a timeout.

It is sometimes possible to tell whether vit_snr_adjust is too positive or too negative from the nature of the recognition errors. If the errors appear to be due to the initial and/or final parts of the spoken words being ignored, it should be made more positive. If they appear to be due to background noise immediately before and/or after the spoken words being treated as part of the utterance, it should be made more negative.

If the errors offer no clues as to the cause of the problem, and the recogniser is giving too many timeouts and/or poor accuracy, vit_snr_adjust should initially be made slightly more positive. Only if that causes the timeouts to increase should it be made negative.

A suitable amount to add or subtract is 5.

sidetone (Deprecated)
The channel whose output may be echoed back into the input being analysed.

Returns

0 if the call completed successfully, otherwise a standard error.

This function is part of the Prosody speech processing API.