SPEECHSC                                                        S. Maes
Internet Draft                                                      IBM
Document: draft-maes-speechsc-use-cases-00                  A. Sakrajda
Category: Informational                                             IBM
Expires: December, 2002                                   June 23, 2002

         Usage Scenarios for Speech Service Control (SPEECHSC)

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC 2026 [1].

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Discussion of this and related documents is on the MRCP list. To
subscribe, send the message "subscribe mrcp" to
majordomo@snowshore.com. The public archive is at
http://flyingfox.snowshore.com/mrcp_archive/maillist.html.

NOTE: This mailing list will be superseded by an official working group
mailing list, cats@ietf.org, once the WG is formally chartered.

1. Abstract

This document proposes usage scenarios for SPEECHSC.

2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [3].

OPEN ISSUES: This document highlights questions that are, as yet,
undecided as "OPEN ISSUES".

3. Introduction

This document proposes different usage scenarios for SPEECHSC.

SPEECHSC targets support for different frameworks:

- To enable a terminal-based application (located on the client or
  local to the audio sub-system) to remotely control speech engine
  resources. Examples include:

  - A wireless handset-based application that uses remote speech
    engines. This would typically be the case of a multimodal
    application in a "fat client" configuration, with a voice browser
    embedded on the client that uses remote speech engines.

  - A voice application running on a client with local embedded
    engines used for some of the tasks, and use of remote speech
    engines when the task is too complex for the local engine, when
    the task requires a specialized engine, when it would not be
    possible to download the speech data files (grammars, etc.)
    without introducing significant delays, or when for IP, security
    or privacy reasons it is not appropriate to download such data
    files to the client, to perform the processing on the client, or
    to send results from the client.

- To enable an application located in the network to remotely control
  different speech engines located in the network. For example:

  - To distribute the processing and perform load balancing.

  - To allow the use of engines optimized for specific tasks.

  - To enable third-party services specialized in providing speech
    engine capabilities.

With respect to [2], in the first case the speech application and
media processing unit are conceptually collocated on a terminal. In
the latter case, they are distributed in the network. Note that
nothing should impose that the input and output audio sub-systems be
located on the same system.

In general, encoded speech with conventional codecs (e.g. AMR) or
DSR-optimized codecs (e.g. ETSI ES 201 108) is exchanged between the
terminal and the speech engines (uplink from the audio sub-system to
the speech engines, and downlink from the speech engine to the audio
sub-system). This can advantageously be complemented with side speech
meta-information (in-band or out-of-band) that facilitates speech
processing. Examples of such meta-information could include:
speech/no-speech, barge-in information, terminal events and settings,
audio sub-system acquired parameters (e.g. noise level) or settings,
audio sub-system control, and application-specific exchanges. ETSI
Aurora and 3GPP SES (Speech Enabled Services) are designing such
Distributed Speech Recognition frameworks. In the rest of the
document, we assume the existence of an appropriate framework to
exchange encoded speech and possibly speech meta-information.
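As a purely illustrative example of the speech meta-information
discussed above, the sketch below (written in Python only for
concreteness) shows one possible way such events could be represented
and serialized alongside the encoded speech stream. Neither SPEECHSC
nor the DSR frameworks mentioned above define this structure; the
event names and fields are hypothetical.

   # Illustrative only: SPEECHSC does not define such a structure; the
   # event names and fields below are hypothetical.
   import json
   import time
   from dataclasses import dataclass, field
   from enum import Enum


   class MetaEventType(Enum):
       SPEECH_DETECTED = "speech-detected"    # speech/no-speech decision
       NO_SPEECH = "no-speech"
       BARGE_IN = "barge-in"                  # user spoke over a prompt
       NOISE_LEVEL = "noise-level"            # acquired acoustic parameter
       TERMINAL_SETTING = "terminal-setting"  # e.g. codec or gain change


   @dataclass
   class SpeechMetaEvent:
       """One piece of speech meta-information carried next to the
       encoded speech stream, in-band or out-of-band."""
       event: MetaEventType
       timestamp: float = field(default_factory=time.time)
       value: dict = field(default_factory=dict)

       def to_json(self) -> str:
           # A textual serialization that could travel out-of-band.
           return json.dumps({"event": self.event.value,
                              "timestamp": self.timestamp,
                              "value": self.value})


   # Example: the audio sub-system reports the measured noise level so
   # that a remote engine can adapt its processing.
   if __name__ == "__main__":
       print(SpeechMetaEvent(MetaEventType.NOISE_LEVEL,
                             value={"snr_db": 12.5}).to_json())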
4. Use cases

4.1 Single engine speech service on uplink

The following usage scenarios MUST be supported by SPEECHSC (an
illustrative control sketch follows the list):

- A terminal-based application uses a server-side speech recognition
  engine. The application MUST be able to determine engine readiness
  or reserve the services of an engine (this may require discovery).
  The selected engine may depend on the nature of the processing to
  perform and on network or engine workload. It MUST then be able to
  load appropriate speech data files (e.g. vocabulary, grammar,
  acoustic models, language model) into the speech engine and set some
  engine parameter values. This may be decided based on the audio
  codec (static or dynamic settings), the acoustic environment, etc.
  The application MUST be able to specify the nature of the processing
  to be performed, based on what is supported by the engine, and
  determine where, when and how results should be sent. If errors or
  other relevant events take place on the terminal, the application
  MUST be able to immediately notify the speech engine (e.g. speech
  detection, background noise, barge-in event) and, conversely, it
  MUST be possible to specify where engine events should be sent.
  Speech recognition results drive the application and result in
  client-side GUI or voice updates (e.g. using a local TTS engine).
  Through a local API, the application can control the audio I/O
  sub-system and any other processing (barge-in detection, speech
  detection, filtering, pre-processing such as noise subtraction) and
  encoding locally applied to speech input signals.

- A terminal-based application uses a server-side speaker recognition
  engine. Usages are similar to the previous case. Security or privacy
  considerations MAY require mechanisms to establish trust between the
  terminal and the engine as well as possible encryption of the
  results. Trust management is outside the scope of SPEECHSC.
  Encryption could be achieved by relying on SPEECHSC to set the
  encryption details.

- A server-side application drives a dialog with a user connected to a
  media processing entity by relying on remote speech recognition
  engines. Examples include VoIP or PSTN access to a voice gateway.
  The application MUST be able to determine engine readiness or
  reserve the services of an engine (this may require discovery). The
  selected engine may depend on the nature of the processing to
  perform and on network or engine workload. It MUST then be able to
  load appropriate speech data files (e.g. vocabulary, grammar,
  acoustic models, language model) into the speech engine and set some
  engine parameter values. This may be decided based on the audio
  codec (static or dynamic settings), the acoustic environment, etc.
  The application MUST be able to specify the nature of the processing
  to be performed, based on what is supported by the engine, and
  determine where, when and how results should be sent. If errors or
  other relevant events take place on the media processing entity, the
  application MUST be able to immediately notify the speech engine
  and, conversely, it MUST be possible to specify where engine events
  should be sent. Speech recognition results drive the application and
  result in voice updates (e.g. using a TTS engine local to the
  application). It MUST be possible for the application to control the
  processing performed by the media processing entity (e.g. speech
  detection, barge-in detection, noise subtraction, etc.) as in the
  case of the terminal-based application. Some encoding and processing
  has been applied to the speech by the audio sub-system before it
  reaches the media processing entity. This cannot be controlled by
  the application, but media conversions and post-processing MUST be
  possible. Note that speech engine data files can be loaded from the
  application location or from a server-side location.
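The following non-normative sketch (Python, for illustration only)
walks through the control flow of the first scenario above: reserving
an engine, loading speech data files, setting engine parameter values,
specifying where results should be sent, and notifying the engine of
terminal events. SPEECHSC defines no API or wire syntax at this point;
the class, method and parameter names (SpeechscClient, reserve,
load_data, set_params, start, notify) are invented for this example.

   # Illustrative only: SPEECHSC defines no API or wire syntax; all
   # names below are invented for this sketch.


   class SpeechscClient:
       """Stand-in for a terminal-side control channel towards remote
       speech engines."""

       def reserve(self, service: str) -> str:
           # Determine engine readiness / reserve an engine instance;
           # selection may depend on the task and on engine workload.
           print(f"RESERVE {service}")
           return "engine-42"  # hypothetical engine identifier

       def load_data(self, engine: str, uri: str) -> None:
           # Load speech data files (vocabulary, grammar, acoustic or
           # language models) into the reserved engine.
           print(f"LOAD {engine} {uri}")

       def set_params(self, engine: str, **params) -> None:
           # Static or dynamic engine settings, e.g. derived from the
           # audio codec or the acoustic environment.
           print(f"SET {engine} {params}")

       def start(self, engine: str, result_sink: str) -> None:
           # Specify the processing to perform and where, when and how
           # results should be sent.
           print(f"START {engine} results -> {result_sink}")

       def notify(self, engine: str, event: str) -> None:
           # Immediately relay terminal events (speech detection,
           # background noise, barge-in, ...) to the engine.
           print(f"EVENT {engine} {event}")


   if __name__ == "__main__":
       c = SpeechscClient()
       asr = c.reserve("speech-recognition")
       c.load_data(asr, "http://example.com/grammars/date.grxml")
       c.set_params(asr, codec="amr", language="en-US")
       c.start(asr, result_sink="sip:app@terminal.example.com")
       c.notify(asr, "barge-in")

The same flow applies to the server-side application scenario, with
the media processing entity taking the place of the terminal.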
4.2 Single engine speech service on downlink

The following usage scenarios MUST be supported by SPEECHSC (an
illustrative control sketch follows the list):

- A terminal-based application MUST be able to rely on server-side
  engines to provide audio prompts. These prompts may be generated by
  a TTS engine or retrieved as audio recordings. The application MUST
  be able to determine engine readiness or reserve the services of an
  engine (this may require discovery). The selected engine may depend
  on the nature of the processing to perform and on network or engine
  workload. It MUST then be able to load appropriate speech data files
  into the speech engine and set some engine parameter values (e.g.
  voice characteristics). The application MUST be able to specify the
  nature of the processing to be performed, based on what is supported
  by the engine, and determine where, when and how the downlink audio
  should be sent. If errors or other relevant events (e.g. barge-in)
  take place on the terminal, the application MUST be able to
  immediately notify the speech engine and, conversely, it MUST be
  possible to specify where engine events (e.g. begin or end of
  prompt) should be sent. Through a local API, the application can
  control the audio I/O sub-system and any other processing, encoding
  or decoding locally applied to prompts.

- A server-side application drives a dialog with a user connected to a
  media processing entity by relying on remote prompt generators.
  Examples include VoIP or PSTN access to a voice gateway. These
  prompts may be generated by a TTS engine or retrieved as audio
  recordings. The application MUST be able to determine engine
  readiness or reserve the services of an engine (this may require
  discovery). The selected engine may depend on the nature of the
  processing to perform and on network or engine workload. It MUST
  then be able to load appropriate speech data files into the speech
  engine and set some engine parameter values (e.g. voice
  characteristics). The application MUST be able to specify the nature
  of the processing to be performed, based on what is supported by the
  engine, and determine where, when and how the downlink audio should
  be sent. If errors or other relevant events (e.g. barge-in) take
  place on the media processing entity, the application MUST be able
  to immediately notify the speech engine and, conversely, it MUST be
  possible to specify where engine events (e.g. begin or end of
  prompt) should be sent. It MUST be possible for the application to
  control the processing performed by the media processing entity
  (e.g. media conversion) as in the case of the terminal-based
  application.
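The downlink scenarios above can be sketched in the same informal
style. The example below (Python, illustration only) shows a
hypothetical prompt session: setting voice characteristics, requesting
that a prompt be synthesized and streamed to a given audio sink, and
relaying a barge-in event so that the engine stops generating the
prompt. None of these names are defined by SPEECHSC.

   # Illustrative only: the class and method names below are invented;
   # SPEECHSC does not define them.


   class PromptSession:
       """Stand-in for the control channel towards a remote TTS engine
       or audio-recording server (downlink)."""

       def __init__(self, engine_id: str, audio_sink: str):
           self.engine_id = engine_id    # engine reserved beforehand
           self.audio_sink = audio_sink  # where downlink audio is sent
           self.playing = False

       def set_voice(self, **characteristics) -> None:
           # Engine parameter values, e.g. voice characteristics.
           print(f"SET {self.engine_id} {characteristics}")

       def play(self, text: str) -> None:
           # Ask the engine to synthesize (or fetch) the prompt and
           # stream it to the audio sink.
           print(f"SPEAK {self.engine_id} -> {self.audio_sink}: {text!r}")
           self.playing = True

       def on_barge_in(self) -> None:
           # A terminal or media-processing-entity event relayed to the
           # engine so that prompt generation stops immediately.
           if self.playing:
               print(f"STOP {self.engine_id} (barge-in)")
               self.playing = False


   if __name__ == "__main__":
       s = PromptSession("tts-7",
                         audio_sink="rtp://terminal.example.com:5004")
       s.set_voice(gender="female", rate="medium")
       s.play("Please say the name of the city you are calling.")
       s.on_barge_in()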
4.3 Uplink and downlink remote engines

The following usage scenarios MUST be supported by SPEECHSC:

- Terminal-based applications MUST be able to rely on server-side
  speech recognition or speaker recognition engines that process the
  uplink speech input, and on remote prompt generators (TTS or
  recorded prompts) to generate the resulting prompts. The same
  considerations as above in terms of additional events and processing
  apply. In addition, it MUST be possible for events generated by one
  engine to be transmitted to another (e.g. server-side speech
  detection should be transmitted to the TTS engine to stop generation
  of a prompt).

- The same applies to server-side applications.

4.4 Serial combination

For the following usage scenarios, we do not distinguish between
terminal-side and server-side applications. They SHOULD be supported
by SPEECHSC (an illustrative sketch follows the list):

- Control of any pre-processing applied to uplink speech may be
  advantageously supported by SPEECHSC. Each processor is treated as a
  separate engine and results from one engine are passed to another.
  The output of pre-processing engines may consist of events, partial
  or annotated results, and a modified audio stream. It MUST be
  possible through SPEECHSC to specify the nature of the processing
  and where, when and how the output should be sent (to other
  engines). Such capabilities are assumed for all the following usage
  scenarios and are not discussed further.

- Terminal-based speech recognition engines may perform local
  recognition and generate a tentative recognition result or produce
  partial results. SPEECHSC SHOULD support exchange of the partial or
  tentative results with a server-side engine for comparison or more
  detailed processing. It SHOULD be possible through SPEECHSC to
  specify the nature of the processing and where, when and how the
  different outputs should be sent.

- Applications may serially combine speech recognition and speaker
  recognition engines. For example:

  - The identity of the user (verified or not) or the user class may
    be passed to a speech engine, for example for selection of the
    most appropriate acoustic models tuned to the user's
    characteristics or to select a particular grammar.

  - The results of a speech recognition engine may be passed to a
    speaker identification or verification engine in order to let the
    speaker recognition algorithm take into account the transcription
    or alignments.

  SPEECHSC SHOULD support the configuration and control of the
  sequential processing and the appropriate exchange of results,
  output and events between engines, as well as the final results.

- Conversational applications that support natural language processing
  may post-process the result of speech recognition to extract
  annotated attribute-value pairs that will be used by the dialog
  manager (e.g. the application) to determine the focus and intent of
  the user based on the last input and the context of previous inputs.
  SPEECHSC SHOULD support the configuration and control of the speech
  recognition engine and NL engines (e.g. classer, parser,
  attribute-value pair extractor), the exchanges between engines, as
  well as the final results.

- Conversational applications that support natural language generation
  (NLG) may rely on an NLG engine to generate textual prompts based on
  the state and context of the application. These prompts are then
  synthesized by TTS engines. SPEECHSC SHOULD support the
  configuration and control of the NLG engine and TTS engines, the
  exchanges between engines, as well as the final results.

Numerous other serial combinations of engines and processing can be
considered. While it may be possible to support such use cases without
supporting the configuration and control of serial combinations of
engines, such deployment configurations would then require numerous
round trips between the application and the different engines. The
resulting hub-centric architecture may not be the most efficient way
to proceed.
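As an informal illustration of a serial combination, the sketch below
(Python, illustration only) describes a hypothetical pipeline in which
each engine is told where to send its output, so that intermediate
results flow directly from one engine to the next instead of
round-tripping through the application. The stage names, data-file URI
and routing fields are invented for this example.

   # Illustrative only: a hypothetical description of a serial engine
   # combination; none of these structures are defined by SPEECHSC.
   from dataclasses import dataclass, field
   from typing import List, Optional


   @dataclass
   class EngineStage:
       engine_id: str                     # e.g. "noise-reduction-1"
       service: str                       # nature of the processing
       data_files: List[str] = field(default_factory=list)
       next_stage: Optional[str] = None   # where output and events go


   def chain(stages: List[EngineStage]) -> None:
       """Route each stage's output to the next stage; the last stage
       reports its results back to the application."""
       for current, following in zip(stages, stages[1:]):
           current.next_stage = following.engine_id
       stages[-1].next_stage = "application"
       for s in stages:
           print(f"{s.engine_id}: {s.service} -> {s.next_stage}")


   if __name__ == "__main__":
       chain([
           EngineStage("noise-reduction-1", "noise subtraction"),
           EngineStage("asr-3", "speech recognition",
                       data_files=["http://example.com/lm/travel.arpa"]),
           EngineStage("nl-2", "attribute-value pair extraction"),
       ])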
4.5 Parallel combination

Similarly, SPEECHSC SHOULD support the control of parallel
combinations of engines. Again, there is no need to distinguish
between terminal-side and server-side applications, and we can assume
that additional processing may be serially added. The following usage
scenarios SHOULD be supported by SPEECHSC (an illustrative sketch
follows the list):

- Conversational biometric applications where authentication relies on
  the parallel use of speech and speaker recognition engines. Through
  a joint and possibly iterative algorithm, the speaker recognition
  engine may drive the data files to use and the expected output of
  the speech recognition engine. Concurrently, the speech recognition
  result (ID claim, alignment, recognized text) may drive the speaker
  recognition engine. Partial processing or results may be shared.
  SPEECHSC SHOULD support the configuration and control of the
  different engines, the exchanges between engines, as well as the
  final results.

Numerous other parallel combinations of engines and processing can be
considered. While it may be possible to support such use cases without
supporting the configuration and control of parallel combinations of
engines, such deployment configurations would then require numerous
round trips between the application and the different engines. The
resulting hub-centric architecture may not be the most efficient way
to proceed.
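The conversational biometrics scenario above could be described
informally as follows (Python, illustration only): the same uplink
audio is fanned out to a speech recognition engine and a speaker
recognition engine, and each engine is configured to share partial
results with the other. All identifiers and fields are hypothetical.

   # Illustrative only: a hypothetical configuration for the
   # conversational biometrics scenario; all names are invented.
   from dataclasses import dataclass, field
   from typing import Dict, List, Tuple


   @dataclass
   class ParallelCombination:
       audio_source: str                                      # shared uplink
       engines: Dict[str, str] = field(default_factory=dict)  # id -> service
       cross_feeds: List[Tuple[str, str, str]] = field(default_factory=list)

       def describe(self) -> None:
           # The same uplink audio is fanned out to every engine ...
           for engine_id, service in self.engines.items():
               print(f"{self.audio_source} -> {engine_id} ({service})")
           # ... and partial results are exchanged between engines.
           for src, dst, what in self.cross_feeds:
               print(f"{src} -> {dst}: {what}")


   if __name__ == "__main__":
       ParallelCombination(
           audio_source="rtp://gateway.example.com:5006",
           engines={"asr-3": "speech recognition",
                    "sv-1": "speaker verification"},
           cross_feeds=[
               ("asr-3", "sv-1", "recognized text and alignments"),
               ("sv-1", "asr-3", "identity claim and acoustic model choice"),
           ],
       ).describe()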
5. Security Considerations

Security or privacy considerations MAY require mechanisms to establish
trust between the application, the audio I/O sub-systems and the
engines. Also, engine remote control may enable a third party to
request speech data files (e.g. a grammar or vocabulary) that are
considered proprietary (e.g. a hand-crafted complex grammar) or that
contain private information (e.g. the list of names of the customers
of a bank). The SPEECHSC activity SHOULD address how to maintain
control over the distribution of the speech data files needed by web
services, and therefore not only the authentication of SERCP exchanges
but also of the target speech engine web services. SPEECHSC may also
require encryption, integrity protection or digital signature of the
input, output and results.

6. References

[1] Bradner, S., "The Internet Standards Process -- Revision 3",
    BCP 9, RFC 2026, October 1996.

[2] Burger, E. and Oran, D., "Requirements for Distributed Control of
    ASR, SV and TTS Resources", draft-burger-speechsc-reqts-00,
    June 13, 2002.

[3] Bradner, S., "Key words for use in RFCs to Indicate Requirement
    Levels", BCP 14, RFC 2119, March 1997.

7. Author's Addresses

Stéphane H. Maes
IBM T.J. Watson Research Center
PO Box 218
Yorktown Heights, NY 10598
Phone: +1-914-945-2908
Email: smaes@us.ibm.com

Andrzej Sakrajda
IBM T.J. Watson Research Center
PO Box 218
Yorktown Heights, NY 10598
Phone: +1-914-945-4362
Email: ansa@us.ibm.com