SPEECHSC                                                        S. Maes
Internet Draft                                                      IBM
Document: draft-maes-speechsc-use-cases-00                  A. Sakrajda
Category: Informational                                             IBM
Expires: December, 2002                                   June 23, 2002

         Usage Scenarios for Speech Service Control (SPEECHSC)

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC 2026 [1].

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Discussion of this and related documents is on the MRCP list. To
subscribe, send the message "subscribe mrcp" to
majordomo@snowshore.com. The public archive is at
http://flyingfox.snowshore.com/mrcp_archive/maillist.html.

NOTE: This mailing list will be superseded by an official working group
mailing list, cats@ietf.org, once the WG is formally chartered.

1. Abstract

This document proposes usage scenarios for SPEECHSC.

2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [3].

OPEN ISSUES: This document highlights questions that are, as yet,
undecided as "OPEN ISSUES".

3. Introduction

This document proposes different usage scenarios for SPEECHSC.

SPEECHSC targets support for different frameworks:

- To enable a terminal-based application (located on the client or
  local to the audio sub-system) to remotely control speech engine
  resources. Examples include:

  - A wireless handset-based application that uses remote speech
    engines. This would typically be the case of a multimodal
    application in a "fat client" configuration, with a voice browser
    embedded on the client that uses remote speech engines.

  - A voice application running on a client with local embedded
    engines used for some of the tasks, and use of remote speech
    engines when the task is too complex for the local engine, when
    the task requires a specialized engine, when it would not be
    possible to download the speech data files (grammars, etc.)
    without introducing significant delays, or when for IP, security
    or privacy reasons it is not appropriate to download such data
    files to the client, to perform the processing on the client, or
    to send results from the client.

- To enable an application located in the network to remotely control
  different speech engines located in the network. For example:

  - To distribute the processing and perform load balancing.

  - To allow the use of engines optimized for specific tasks.

  - To enable third-party services specialized in providing speech
    engine capabilities.

With respect to [2], in the first case the speech application and
media processing unit are conceptually collocated on a terminal. In
the latter case, they are distributed in the network. Note that
nothing should impose that the input and output audio sub-systems be
located on the same system.

In general, encoded speech with conventional codecs (e.g. AMR) or
DSR-optimized codecs (e.g. ETSI ES 201 108) is exchanged between the
terminal and the speech engines (uplink from the audio sub-system to
the speech engines, and downlink from the speech engine to the audio
sub-system). This can advantageously be complemented with side speech
meta-information (in-band or out-of-band) that facilitates speech
processing. Examples of such meta-information could include:
speech/no-speech, barge-in information, terminal events and settings,
audio sub-system acquired parameters (e.g. noise level) or settings,
audio sub-system control, and application-specific exchanges. ETSI
Aurora and 3GPP SES (Speech Enabled Services) are designing such
Distributed Speech Recognition frameworks. In the rest of the
document, we assume the existence of an appropriate framework to
exchange encoded speech and possibly speech meta-information.
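As a purely illustrative example of the speech meta-information
discussed above, the sketch below (written in Python only for
concreteness) shows one possible way such events could be represented
and serialized alongside the encoded speech stream. Neither SPEECHSC
nor the DSR frameworks mentioned above define this structure; the
event names and fields are hypothetical.

   # Illustrative only: SPEECHSC does not define such a structure; the
   # event names and fields below are hypothetical.
   import json
   import time
   from dataclasses import dataclass, field
   from enum import Enum


   class MetaEventType(Enum):
       SPEECH_DETECTED = "speech-detected"    # speech/no-speech decision
       NO_SPEECH = "no-speech"
       BARGE_IN = "barge-in"                  # user spoke over a prompt
       NOISE_LEVEL = "noise-level"            # acquired acoustic parameter
       TERMINAL_SETTING = "terminal-setting"  # e.g. codec or gain change


   @dataclass
   class SpeechMetaEvent:
       """One piece of speech meta-information carried next to the
       encoded speech stream, in-band or out-of-band."""
       event: MetaEventType
       timestamp: float = field(default_factory=time.time)
       value: dict = field(default_factory=dict)

       def to_json(self) -> str:
           # A textual serialization that could travel out-of-band.
           return json.dumps({"event": self.event.value,
                              "timestamp": self.timestamp,
                              "value": self.value})


   # Example: the audio sub-system reports the measured noise level so
   # that a remote engine can adapt its processing.
   if __name__ == "__main__":
       print(SpeechMetaEvent(MetaEventType.NOISE_LEVEL,
                             value={"snr_db": 12.5}).to_json())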
4. Use cases

4.1 Single engine speech service on uplink

The following usage scenarios MUST be supported by SPEECHSC (an
illustrative control sketch follows the list):

- A terminal-based application uses a server-side speech recognition
  engine. The application MUST be able to determine engine readiness
  or reserve the services of an engine (this may require discovery).
  The selected engine may depend on the nature of the processing to
  perform and on network or engine workload. It MUST then be able to
  load appropriate speech data files (e.g. vocabulary, grammar,
  acoustic models, language model) into the speech engine and set some
  engine parameter values. This may be decided based on the audio
  codec (static or dynamic settings), the acoustic environment, etc.
  The application MUST be able to specify the nature of the processing
  to be performed, based on what is supported by the engine, and
  determine where, when and how results should be sent. If errors or
  other relevant events take place on the terminal, the application
  MUST be able to immediately notify the speech engine (e.g. speech
  detection, background noise, barge-in event) and, conversely, it
  MUST be possible to specify where engine events should be sent.
  Speech recognition results drive the application and result in
  client-side GUI or voice updates (e.g. using a local TTS engine).
  Through a local API, the application can control the audio I/O
  sub-system and any other processing (barge-in detection, speech
  detection, filtering, pre-processing such as noise subtraction) and
  encoding locally applied to speech input signals.

- A terminal-based application uses a server-side speaker recognition
  engine. Usages are similar to the previous case. Security or privacy
  considerations MAY require mechanisms to establish trust between the
  terminal and the engine as well as possible encryption of the
  results. Trust management is outside the scope of SPEECHSC.
  Encryption could be achieved by relying on SPEECHSC to set the
  encryption details.

- A server-side application drives a dialog with a user connected to a
  media processing entity by relying on remote speech recognition
  engines. Examples include VoIP or PSTN access to a voice gateway.
  The application MUST be able to determine engine readiness or
  reserve the services of an engine (this may require discovery). The
  selected engine may depend on the nature of the processing to
  perform and on network or engine workload. It MUST then be able to
  load appropriate speech data files (e.g. vocabulary, grammar,
  acoustic models, language model) into the speech engine and set some
  engine parameter values. This may be decided based on the audio
  codec (static or dynamic settings), the acoustic environment, etc.
  The application MUST be able to specify the nature of the processing
  to be performed, based on what is supported by the engine, and
  determine where, when and how results should be sent. If errors or
  other relevant events take place on the media processing entity, the
  application MUST be able to immediately notify the speech engine
  and, conversely, it MUST be possible to specify where engine events
  should be sent. Speech recognition results drive the application and
  result in voice updates (e.g. using a TTS engine local to the
  application). It MUST be possible for the application to control the
  processing performed by the media processing entity (e.g. speech
  detection, barge-in detection, noise subtraction, etc.) as in the
  case of the terminal-based application. Some encoding and processing
  has been applied to the speech by the audio sub-system before it
  reaches the media processing entity. This cannot be controlled by
  the application, but media conversions and post-processing MUST be
  possible. Note that speech engine data files can be loaded from the
  application location or from a server-side location.
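The following non-normative sketch (Python, for illustration only)
walks through the control flow of the first scenario above: reserving
an engine, loading speech data files, setting engine parameter values,
specifying where results should be sent, and notifying the engine of
terminal events. SPEECHSC defines no API or wire syntax at this point;
the class, method and parameter names (SpeechscClient, reserve,
load_data, set_params, start, notify) are invented for this example.

   # Illustrative only: SPEECHSC defines no API or wire syntax; all
   # names below are invented for this sketch.


   class SpeechscClient:
       """Stand-in for a terminal-side control channel towards remote
       speech engines."""

       def reserve(self, service: str) -> str:
           # Determine engine readiness / reserve an engine instance;
           # selection may depend on the task and on engine workload.
           print(f"RESERVE {service}")
           return "engine-42"  # hypothetical engine identifier

       def load_data(self, engine: str, uri: str) -> None:
           # Load speech data files (vocabulary, grammar, acoustic or
           # language models) into the reserved engine.
           print(f"LOAD {engine} {uri}")

       def set_params(self, engine: str, **params) -> None:
           # Static or dynamic engine settings, e.g. derived from the
           # audio codec or the acoustic environment.
           print(f"SET {engine} {params}")

       def start(self, engine: str, result_sink: str) -> None:
           # Specify the processing to perform and where, when and how
           # results should be sent.
           print(f"START {engine} results -> {result_sink}")

       def notify(self, engine: str, event: str) -> None:
           # Immediately relay terminal events (speech detection,
           # background noise, barge-in, ...) to the engine.
           print(f"EVENT {engine} {event}")


   if __name__ == "__main__":
       c = SpeechscClient()
       asr = c.reserve("speech-recognition")
       c.load_data(asr, "http://example.com/grammars/date.grxml")
       c.set_params(asr, codec="amr", language="en-US")
       c.start(asr, result_sink="sip:app@terminal.example.com")
       c.notify(asr, "barge-in")

The same flow applies to the server-side application scenario, with
the media processing entity taking the place of the terminal.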
4.2 Single engine speech service on downlink

The following usage scenarios MUST be supported by SPEECHSC (an
illustrative control sketch follows the list):

- A terminal-based application MUST be able to rely on server-side
  engines to provide audio prompts. These prompts may be generated by
  a TTS engine or retrieved as audio recordings. The application MUST
  be able to determine engine readiness or reserve the services of an
  engine (this may require discovery). The selected engine may depend
  on the nature of the processing to perform and on network or engine
  workload. It MUST then be able to load appropriate speech data files
  into the speech engine and set some engine parameter values (e.g.
  voice characteristics). The application MUST be able to specify the
  nature of the processing to be performed, based on what is supported
  by the engine, and determine where, when and how the downlink audio
  should be sent. If errors or other relevant events (e.g. barge-in)
  take place on the terminal, the application MUST be able to
  immediately notify the speech engine and, conversely, it MUST be
  possible to specify where engine events (e.g. begin or end of
  prompt) should be sent. Through a local API, the application can
  control the audio I/O sub-system and any other processing, encoding
  or decoding locally applied to prompts.

- A server-side application drives a dialog with a user connected to a
  media processing entity by relying on remote prompt generators.
  Examples include VoIP or PSTN access to a voice gateway. These
  prompts may be generated by a TTS engine or retrieved as audio
  recordings. The application MUST be able to determine engine
  readiness or reserve the services of an engine (this may require
  discovery). The selected engine may depend on the nature of the
  processing to perform and on network or engine workload. It MUST
  then be able to load appropriate speech data files into the speech
  engine and set some engine parameter values (e.g. voice
  characteristics). The application MUST be able to specify the nature
  of the processing to be performed, based on what is supported by the
  engine, and determine where, when and how the downlink audio should
  be sent. If errors or other relevant events (e.g. barge-in) take
  place on the media processing entity, the application MUST be able
  to immediately notify the speech engine and, conversely, it MUST be
  possible to specify where engine events (e.g. begin or end of
  prompt) should be sent. It MUST be possible for the application to
  control the processing performed by the media processing entity
  (e.g. media conversion) as in the case of the terminal-based
  application.
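The downlink scenarios above can be sketched in the same informal
style. The example below (Python, illustration only) shows a
hypothetical prompt session: setting voice characteristics, requesting
that a prompt be synthesized and streamed to a given audio sink, and
relaying a barge-in event so that the engine stops generating the
prompt. None of these names are defined by SPEECHSC.

   # Illustrative only: the class and method names below are invented;
   # SPEECHSC does not define them.


   class PromptSession:
       """Stand-in for the control channel towards a remote TTS engine
       or audio-recording server (downlink)."""

       def __init__(self, engine_id: str, audio_sink: str):
           self.engine_id = engine_id    # engine reserved beforehand
           self.audio_sink = audio_sink  # where downlink audio is sent
           self.playing = False

       def set_voice(self, **characteristics) -> None:
           # Engine parameter values, e.g. voice characteristics.
           print(f"SET {self.engine_id} {characteristics}")

       def play(self, text: str) -> None:
           # Ask the engine to synthesize (or fetch) the prompt and
           # stream it to the audio sink.
           print(f"SPEAK {self.engine_id} -> {self.audio_sink}: {text!r}")
           self.playing = True

       def on_barge_in(self) -> None:
           # A terminal or media-processing-entity event relayed to the
           # engine so that prompt generation stops immediately.
           if self.playing:
               print(f"STOP {self.engine_id} (barge-in)")
               self.playing = False


   if __name__ == "__main__":
       s = PromptSession("tts-7",
                         audio_sink="rtp://terminal.example.com:5004")
       s.set_voice(gender="female", rate="medium")
       s.play("Please say the name of the city you are calling.")
       s.on_barge_in()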
4.3 Uplink and downlink remote engines

The following usage scenarios MUST be supported by SPEECHSC:

- Terminal-based applications MUST be able to rely on server-side
  speech recognition or speaker recognition engines that process the
  uplink speech input, and on remote prompt generators (TTS or
  recorded prompts) to generate the resulting prompts. The same
  considerations as above in terms of additional events and processing
  apply. In addition, it MUST be possible for events generated by one
  engine to be transmitted to another (e.g. server-side speech
  detection should be transmitted to the TTS engine to stop generation
  of a prompt).

- The same applies to server-side applications.

4.4 Serial combination

For the following usage scenarios, we do not distinguish between
terminal-side and server-side applications. They SHOULD be supported
by SPEECHSC (an illustrative sketch follows the list):

- Control of any pre-processing applied to uplink speech may be
  advantageously supported by SPEECHSC. Each processor is treated as a
  separate engine and results from one engine are passed to another.
  The output of pre-processing engines may consist of events, partial
  or annotated results, and a modified audio stream. It MUST be
  possible through SPEECHSC to specify the nature of the processing
  and where, when and how the output should be sent (to other
  engines). Such capabilities are assumed for all the following usage
  scenarios and are not discussed further.

- Terminal-based speech recognition engines may perform local
  recognition and generate a tentative recognition result or produce
  partial results. SPEECHSC SHOULD support exchange of the partial or
  tentative results with a server-side engine for comparison or more
  detailed processing. It SHOULD be possible through SPEECHSC to
  specify the nature of the processing and where, when and how the
  different outputs should be sent.

- Applications may serially combine speech recognition and speaker
  recognition engines. For example:

  - The identity of the user (verified or not) or the user class may
    be passed to a speech engine, for example for selection of the
    most appropriate acoustic models tuned to the user's
    characteristics or to select a particular grammar.

  - The results of a speech recognition engine may be passed to a
    speaker identification or verification engine in order to let the
    speaker recognition algorithm take into account the transcription
    or alignments.

  SPEECHSC SHOULD support the configuration and control of the
  sequential processing and the appropriate exchange of results,
  output and events between engines, as well as the final results.

- Conversational applications that support natural language processing
  may post-process the result of speech recognition to extract
  annotated attribute-value pairs that will be used by the dialog
  manager (e.g. the application) to determine the focus and intent of
  the user based on the last input and the context of previous inputs.
  SPEECHSC SHOULD support the configuration and control of the speech
  recognition engine and NL engines (e.g. classer, parser,
  attribute-value pair extractor), the exchanges between engines, as
  well as the final results.

- Conversational applications that support natural language generation
  (NLG) may rely on an NLG engine to generate textual prompts based on
  the state and context of the application. These prompts are then
  synthesized by TTS engines. SPEECHSC SHOULD support the
  configuration and control of the NLG engine and TTS engines, the
  exchanges between engines, as well as the final results.

Numerous other serial combinations of engines and processing can be
considered. While it may be possible to support such use cases without
supporting the configuration and control of serial combinations of
engines, such deployment configurations would then require numerous
round trips between the application and the different engines. The
resulting hub-centric architecture may not be the most efficient way
to proceed.
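As an informal illustration of a serial combination, the sketch below
(Python, illustration only) describes a hypothetical pipeline in which
each engine is told where to send its output, so that intermediate
results flow directly from one engine to the next instead of
round-tripping through the application. The stage names, data-file URI
and routing fields are invented for this example.

   # Illustrative only: a hypothetical description of a serial engine
   # combination; none of these structures are defined by SPEECHSC.
   from dataclasses import dataclass, field
   from typing import List, Optional


   @dataclass
   class EngineStage:
       engine_id: str                     # e.g. "noise-reduction-1"
       service: str                       # nature of the processing
       data_files: List[str] = field(default_factory=list)
       next_stage: Optional[str] = None   # where output and events go


   def chain(stages: List[EngineStage]) -> None:
       """Route each stage's output to the next stage; the last stage
       reports its results back to the application."""
       for current, following in zip(stages, stages[1:]):
           current.next_stage = following.engine_id
       stages[-1].next_stage = "application"
       for s in stages:
           print(f"{s.engine_id}: {s.service} -> {s.next_stage}")


   if __name__ == "__main__":
       chain([
           EngineStage("noise-reduction-1", "noise subtraction"),
           EngineStage("asr-3", "speech recognition",
                       data_files=["http://example.com/lm/travel.arpa"]),
           EngineStage("nl-2", "attribute-value pair extraction"),
       ])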
4.5 Parallel combination

Similarly, SPEECHSC SHOULD support the control of parallel
combinations of engines. Again, there is no need to distinguish
between terminal-side and server-side applications, and we can assume
that additional processing may be serially added. The following usage
scenarios SHOULD be supported by SPEECHSC (an illustrative sketch
follows the list):

- Conversational biometric applications where authentication relies on
  the parallel use of speech and speaker recognition engines. Through
  a joint and possibly iterative algorithm, the speaker recognition
  engine may drive the data files to use and the expected output of
  the speech recognition engine. Concurrently, the speech recognition
  result (ID claim, alignment, recognized text) may drive the speaker
  recognition engine. Partial processing or results may be shared.
  SPEECHSC SHOULD support the configuration and control of the
  different engines, the exchanges between engines, as well as the
  final results.

Numerous other parallel combinations of engines and processing can be
considered. While it may be possible to support such use cases without
supporting the configuration and control of parallel combinations of
engines, such deployment configurations would then require numerous
round trips between the application and the different engines. The
resulting hub-centric architecture may not be the most efficient way
to proceed.
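The conversational biometrics scenario above could be described
informally as follows (Python, illustration only): the same uplink
audio is fanned out to a speech recognition engine and a speaker
recognition engine, and each engine is configured to share partial
results with the other. All identifiers and fields are hypothetical.

   # Illustrative only: a hypothetical configuration for the
   # conversational biometrics scenario; all names are invented.
   from dataclasses import dataclass, field
   from typing import Dict, List, Tuple


   @dataclass
   class ParallelCombination:
       audio_source: str                                      # shared uplink
       engines: Dict[str, str] = field(default_factory=dict)  # id -> service
       cross_feeds: List[Tuple[str, str, str]] = field(default_factory=list)

       def describe(self) -> None:
           # The same uplink audio is fanned out to every engine ...
           for engine_id, service in self.engines.items():
               print(f"{self.audio_source} -> {engine_id} ({service})")
           # ... and partial results are exchanged between engines.
           for src, dst, what in self.cross_feeds:
               print(f"{src} -> {dst}: {what}")


   if __name__ == "__main__":
       ParallelCombination(
           audio_source="rtp://gateway.example.com:5006",
           engines={"asr-3": "speech recognition",
                    "sv-1": "speaker verification"},
           cross_feeds=[
               ("asr-3", "sv-1", "recognized text and alignments"),
               ("sv-1", "asr-3", "identity claim and acoustic model choice"),
           ],
       ).describe()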
5. Security Considerations

Security or privacy considerations MAY require mechanisms to establish
trust between the application, the audio I/O sub-systems and the
engines. Also, engine remote control may enable a third party to
request speech data files (e.g. a grammar or vocabulary) that are
considered proprietary (e.g. a hand-crafted complex grammar) or that
contain private information (e.g. the list of names of the customers
of a bank). The SPEECHSC activity SHOULD address how to maintain
control over the distribution of the speech data files needed by web
services, and therefore not only the authentication of SERCP exchanges
but also of the target speech engine web services. SPEECHSC may also
require encryption, integrity protection or digital signature of the
input, output and results.

6. References

[1] Bradner, S., "The Internet Standards Process -- Revision 3",
    BCP 9, RFC 2026, October 1996.

[2] Burger, E. and Oran, D., "Requirements for Distributed Control of
    ASR, SV and TTS Resources", draft-burger-speechsc-reqts-00,
    June 13, 2002.

[3] Bradner, S., "Key words for use in RFCs to Indicate Requirement
    Levels", BCP 14, RFC 2119, March 1997.

7. Author's Addresses

Stéphane H. Maes
IBM T.J. Watson Research Center
PO Box 218
Yorktown Heights, NY 10598
Phone: +1-914-945-2908
Email: smaes@us.ibm.com

Andrzej Sakrajda
IBM T.J. Watson Research Center
PO Box 218
Yorktown Heights, NY 10598
Phone: +1-914-945-4362
Email: ansa@us.ibm.com