Internet Engineering Task Force Saravanan Shanmugham Internet-Draft Cisco Systems Inc. draft-ietf-speechsc-mrcpv2-01 January 16, 2004 Expires: July 16, 2004 Media Resource Control Protocol Version 2(MRCPv2) Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Copyright Notice Copyright (C) The Internet Society (1999). All Rights Reserved. Abstract This document describes a proposal for a Media Resource Control Protocol Version 2(MRCPv2) and aims to meet the requirements specified in the SPEECHSC working group requirements document. It is based on the Media Resource Control Protocol (MRCP), also called MRCPv1 developed jointly by Cisco Systems, Inc., Nuance Communications, and Speechworks Inc. The MRCPv2 protocol will control media service resources like speech synthesizers, recognizers, signal generators, signal detectors, fax servers etc. over a network. This protocol depends on a session management protocol such as the Session Initiation Protocol (SIP) to S. Shanmugham, et. al. Page 1 MRCPv2 Protocol January 2004 establish a separate MRCPv2 control session between the client and the media server. It also depends on SIP to establish the media pipe and associated parameters between the media source or sink and the media server. Once this is done, the MRCPv2 protocol exchange can happen over the control session established above allowing the client to command and control the media processing resources that may exist on the media server. Table of Contents Status of this Memo..............................................1 Copyright Notice.................................................1 Abstract.........................................................1 Table of Contents................................................2 1. Introduction:...............................................4 2. Architecture:...............................................5 2.1. MRCPv2 Media Resources:...................................6 2.2. Server and Resource Addressing............................7 3. MRCPv2 Protocol Basics......................................7 3.1. Connecting to the Media Server............................7 3.2. Managing Resource Control Channels........................8 3.3. Media Streams and RTP Ports..............................14 3.4. MRCPv2 Message Transport.................................15 4. Notational Conventions.....................................16 5. MRCPv2 Specification.......................................16 5.1. Request..................................................17 5.2. Response.................................................18 5.2.1. Status Codes.............................................18 5.3. Event....................................................19 5.4. Message Headers..........................................20 5.4.1. 
Channel-Identifier.......................................21 5.4.2. Active-Request-Id-List...................................21 5.4.3. Proxy-Sync-Id............................................22 5.4.4. Accept-Charset...........................................22 5.4.5. Content-Type.............................................22 5.4.6. Content-Id...............................................22 5.4.7. Content-Base.............................................22 5.4.8. Content-Encoding.........................................23 5.4.9. Content-Location.........................................23 5.4.10. Content-Length.........................................24 5.4.11. Cache-Control..........................................24 5.4.12. Logging-Tag............................................25 6. Resource Discovery.........................................26 7. Speech Synthesizer Resource................................27 7.1. Synthesizer State Machine................................27 7.2. Synthesizer Methods......................................28 7.3. Synthesizer Events.......................................28 7.4. Synthesizer Header Fields................................28 7.4.1. Jump-Target..............................................29 7.4.2. Kill-On-Barge-In.........................................30 S Shanmugham IETF-Draft Page 2 MRCPv2 Protocol January 2004 7.4.3. Speaker Profile..........................................30 7.4.4. Completion Cause.........................................30 7.4.5. Voice-Parameters.........................................31 7.4.6. Prosody-Parameters.......................................31 7.4.7. Vendor Specific Parameters...............................32 7.4.8. Speech Marker............................................32 7.4.9. Speech Language..........................................32 7.4.10. Fetch Hint.............................................33 7.4.11. Audio Fetch Hint.......................................33 7.4.12. Fetch Timeout..........................................33 7.4.13. Failed URI.............................................33 7.4.14. Failed URI Cause.......................................34 7.4.15. Speak Restart..........................................34 7.4.16. Speak Length...........................................34 7.5. Synthesizer Message Body.................................35 7.5.1. Synthesizer Speech Data..................................35 7.6. SET-PARAMS...............................................36 7.7. GET-PARAMS...............................................37 7.8. SPEAK....................................................38 7.9. STOP.....................................................39 7.10. BARGE-IN-OCCURRED........................................40 7.11. PAUSE....................................................41 7.12. RESUME...................................................42 7.13. CONTROL..................................................43 7.14. SPEAK-COMPLETE...........................................45 7.15. SPEECH-MARKER............................................46 8. Speech Recognizer Resource.................................47 8.1. Recognizer State Machine.................................47 8.2. Recognizer Methods.......................................48 8.3. Recognizer Events........................................48 8.4. Recognizer Header Fields.................................48 8.4.1. Confidence Threshold.....................................49 8.4.2. 
Sensitivity Level........................................50 8.4.3. Speed Vs Accuracy........................................50 8.4.4. N Best List Length.......................................50 8.4.5. No Input Timeout.........................................50 8.4.6. Recognition Timeout......................................51 8.4.7. Waveform URL.............................................51 8.4.8. Completion Cause.........................................51 8.4.9. Recognizer Context Block.................................52 8.4.10. Recognition Start Timers...............................52 8.4.11. Vendor Specific Parameters.............................53 8.4.12. Speech Complete Timeout................................53 8.4.13. Speech Incomplete Timeout..............................54 8.4.14. DTMF Interdigit Timeout................................54 8.4.15. DTMF Term Timeout......................................54 8.4.16. DTMF-Term-Char.........................................55 8.4.17. Fetch Timeout..........................................55 8.4.18. Failed URI.............................................55 8.4.19. Failed URI Cause.......................................55 8.4.20. Save Waveform..........................................55 S Shanmugham IETF-Draft Page 3 MRCPv2 Protocol January 2004 8.4.21. New Audio Channel......................................56 8.4.22. Speech Language........................................56 8.5. Recognizer Message Body..................................56 8.5.1. Recognizer Grammar Data..................................56 8.5.2. Recognizer Result Data...................................59 8.5.3. Recognizer Context Block.................................60 8.6. SET-PARAMS...............................................60 8.7. GET-PARAMS...............................................61 8.8. DEFINE-GRAMMAR...........................................61 8.9. RECOGNIZE................................................65 8.10. STOP.....................................................67 8.11. GET-RESULT...............................................68 8.12. START-OF-SPEECH..........................................69 8.13. RECOGNITION-START-TIMERS.................................69 8.14. RECOGNITON-COMPLETE......................................69 8.15. DTMF Detection...........................................71 9. Examples:..................................................71 10. Reference Documents........................................78 11. Appendix...................................................79 ABNF Message Definitions........................................79 Full Copyright Statement........................................87 Authors' Addresses..............................................88 1. Introduction: The MRCPv2 protocol is designed to provide a mechanism for a client device requiring audio/video stream processing to control media processing resources on the network. Some of these media processing resources could be speech recognition, speech synthesis engines, speaker verification or speaker identification engines. This allows a vendor to implement distributed Interactive Voice Response platforms such as VoiceXML [7] browsers. This protocol is designed to leverage and build upon a session management protocols such as Session Initiation Protocol (SIP) and Session Description Protocol (SDP). The SIP protocol described in [2] defines session control messages used during the setup and tear down stages of a SIP session. 
In addition, the SIP re-INVITE can be used during a SIP session to change the characteristics of the session, generally to create or delete media/control channels or to change the properties of existing media/control channels related to the SIP session. In this SIP exchange, SDP is used to describe the parameters of the media pipe associated with that session. The MRCPv2 protocol depends on SIP and SDP to create the session and set up the media channels to the media server. It also depends on SIP and SDP to establish an MRCPv2 control channel between the client and the server for every media processing resource that the client requires for that session. The MRCPv2 protocol exchange between the client and the media resource can then happen on that control channel. The MRCPv2 protocol exchange on this control channel does not change the state of the SIP session, the media, or other parameters of the session that SIP initiated. It merely controls and affects the state of the media processing resource associated with that MRCPv2 channel.

The MRCPv2 protocol defines the messages to control the different media processing resources and the state machines required to guide their operation. It also describes how these messages are carried over a transport layer such as TCP or SCTP.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [9].

2. Architecture:

The system consists of a client that requires the generation or processing of media streams and a media resource server that has the resources or engines to process or generate these streams. The client establishes a session with the server, using SIP and SDP, in order to use its media processing resources. The MRCPv2 media server is addressed by a SIP URI. The session management protocol (SIP) will use SDP with the offer/answer model described in RFC 3264 to describe and set up the MRCPv2 control channels. Separate MRCPv2 control channels are needed to control the different media processing resources associated with that session. Within a SIP session, the individual resource control channels for the different resources are added or removed through the SDP offer/answer model and the SIP re-INVITE dialog.

The server, through the SDP exchange, provides the client with a unique channel identifier and a TCP port number. The client MAY then open a new TCP connection with the server using this port number. Multiple MRCPv2 channels can share a TCP connection between the client and the server. All MRCPv2 messages exchanged between the client and the server will also carry the specified channel identifier, which MUST be unique among all MRCPv2 control channels that are active on that server. The client can use this channel to control the media processing resource associated with that channel.

The session management protocol (SIP) will also establish media pipes between the client (or the source/sink of media) and the media server using SDP m-lines. A media pipe may be shared by one or more media processing resources under that SIP session, or each media processing resource may have its own media pipe.
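As a non-normative illustration of the channel and connection model described above, the following Python sketch shows how a client might record the channels returned in SDP answers and decide which TCP connection a given Channel-Identifier maps to. The class and function names here are invented for illustration only and are not defined by this specification.

   # Illustrative only: client-side bookkeeping for MRCPv2 control channels.
   # ControlChannel and ChannelTable are hypothetical names, not from the draft.
   from dataclasses import dataclass

   @dataclass(frozen=True)
   class ControlChannel:
       """One MRCPv2 resource control channel, identified by 'hexid@resourcetype'."""
       channel_id: str      # e.g. "32AECB234338"
       resource_type: str   # e.g. "speechsynth"
       host: str            # media server address from the SDP answer
       port: int            # TCP listen port from the answer's control m-line

       @classmethod
       def from_sdp_answer(cls, channel_attr: str, host: str, port: int) -> "ControlChannel":
           # channel_attr is the value of the "a=channel:" attribute,
           # e.g. "32AECB234338@speechsynth"
           hexid, resource = channel_attr.split("@", 1)
           return cls(hexid, resource, host, port)

   class ChannelTable:
       """Tracks channels and which (host, port) connection each one can share."""
       def __init__(self):
           self._channels = {}  # maps "hexid@resourcetype" -> ControlChannel

       def add(self, ch: ControlChannel) -> None:
           # The channel identifier is unique per server, so it is a safe key.
           self._channels[f"{ch.channel_id}@{ch.resource_type}"] = ch

       def connection_key(self, channel_identifier: str) -> tuple:
           ch = self._channels[channel_identifier]
           # Channels that resolve to the same (host, port) may share one TCP
           # connection; the Channel-Identifier header on each message tells
           # the messages apart.
           return (ch.host, ch.port)

   if __name__ == "__main__":
       table = ChannelTable()
       table.add(ControlChannel.from_sdp_answer("32AECB234338@speechsynth",
                                                "mediaserver.com", 32416))
       table.add(ControlChannel.from_sdp_answer("32AECB234339@speechrecog",
                                                "mediaserver.com", 32416))
       # Both channels map to the same connection key, so one TCP connection
       # can carry both control channels.
       assert table.connection_key("32AECB234338@speechsynth") == \
              table.connection_key("32AECB234339@speechrecog")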
S Shanmugham IETF-Draft Page 5 MRCPv2 Protocol January 2004 MRCPv2 client MRCPv2 Media Resource Server |--------------------| |-----------------------------| ||------------------|| ||---------------------------|| || Application Layer|| || TTS | ASR | SV | SI || ||------------------|| ||Engine|Engine|Engine|Engine|| ||Media Resource API|| ||---------------------------|| ||------------------|| || Media Resource Management || || SIP | MRCPv2 || ||---------------------------|| ||Stack | || || SIP | MRCPv2 || || | || || Stack | || ||------------------|| ||---------------------------|| || TCP/IP Stack ||----MRCPv2---|| TCP/IP Stack || || || || || ||------------------||-----SIP-----||---------------------------|| |--------------------| |-----------------------------| | / SIP / | / |-------------------| RTP | | / | Media Source/Sink |-------------/ | | |-------------------| 2.1. MRCPv2 Media Resources: The MRCPv2 media server may offer one or more of the following media processing resources to its clients. Speech Recognition The media server may offer speech recognition engines that the client can allocate, control and have it recognize the spoken input contained in the audio stream. Speech Synthesis The media server may offer speech synthesis engines that the client can allocate, control and have it generate synthesized voice into the audio stream. Speaker Identification The media server may offer speaker recognition engines that the client can allocate, control and have it recognize the speaker from voice in the audio stream. Speaker Verification The media server may offer speaker Verification engines that the client can allocate, control and have it verify and authenticate the speaker based on his voice. S Shanmugham IETF-Draft Page 6 MRCPv2 Protocol January 2004 2.2. Server and Resource Addressing The MRCPv2 server as a whole is a generic SIP server and the MRCPv2 media processing resources it offers are addressed by specific SIP URL registered by the server. Example: sip:mrcpv2@mediaserver.com 3. MRCPv2 Protocol Basics MRCPv2 requires the use of a transport layer protocol such as TCP or SCTP to guarantee reliable sequencing and delivery of MRCPv2 control messages between the client and the server. One or more TCP or SCTP connections between the client and the server can be shared between different MRCPv2 channels to the server. The individual messages carry the channel identifier to differentiate messages on different channels. The message format for MRCPv2 is text based with mechanisms to carry embedded binary data. This allows data like recognition grammars, recognition results, synthesizer speech markup etc. to be carried in the MRCPv2 message between the client and the server resource. The protocol does not address session and media establishment and management and relies of SIP and SDP to do this. 3.1. Connecting to the Media Server The MRCPv2 protocol depends on a session establishment and management protocol such as SIP in conjunction with SDP. The client finds and reaches a MRCPv2 server across the SIP network using the INVITE and other SIP dialog exchanges. The SDP offer/answer exchange model over SIP is used to establish resource control channels for each resource. The SDP offer/answer exchange is also used to establish media pipes between the source or sink of audio and the media server. Example 1: Opening a session to the media server. This does not immediately allocate any resource control channels yet. 
C->S: INVITE sip:mresources@mediaserver.com SIP/2.0
      Via: SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf9
      Max-Forwards: 70
      To: MediaServer
      From: sarvi ;tag=1928301774
      Call-ID: a84b4c76e66710
      CSeq: 314161 INVITE
      Contact:
      Content-Type: application/sdp
      Content-Length: ...

      v=0
      o=sarvi 2890844526 2890842808 IN IP4 126.16.64.4
      s=-
      c=IN IP4 224.2.17.12

S->C: SIP/2.0 200 OK
      Via: SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf9
      To: MediaServer
      From: sarvi ;tag=1928301774
      Call-ID: a84b4c76e66710
      CSeq: 314161 INVITE
      Contact:
      Content-Type: application/sdp
      Content-Length: ...

      v=0
      o=sarvi 2890844526 2890842808 IN IP4 126.16.64.4
      s=-
      c=IN IP4 224.2.17.12

C->S: ACK sip:mresources@mediaserver.com SIP/2.0
      Via: SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bf9
      Max-Forwards: 70
      To: MediaServer ;tag=a6c85cf
      From: Sarvi ;tag=1928301774
      Call-ID: a84b4c76e66710
      CSeq: 314162 ACK
      Content-Length: 0

3.2. Managing Resource Control Channels

The client needs a separate MRCPv2 resource control channel to control each media processing resource under the SIP session. A unique channel identifier string identifies each of these resource control channels. The channel identifier string consists of a hexadecimal number specifying the channel ID, followed by a string token specifying the type of resource, separated by an "@". The server generates the hexadecimal channel ID and MUST make sure it does not clash with any other MRCP channel allocated to that server.

MRCPv2 defines the following types of media processing resources. Additional resource types, their associated methods/events and state machines can be added by future specifications proposing to extend the capabilities of MRCPv2.

      Resource Type     Resource Description
      -------------     --------------------
      speechrecog       Speech Recognition
      dtmfrecog         DTMF Recognition
      speechsynth       Speech Synthesis
      simplesynth       Poor Speech Synthesizer
      audioplayer       Simple Audio Player
      speakidentify     Speaker Identification
      speakverify       Speaker Verification

The SIP INVITE or re-INVITE dialog exchange, and the SDP offer/answer exchange it carries, will contain m-lines describing the resource control channels the client wants to allocate. There MUST be one SDP m-line for each MRCPv2 resource that needs to be controlled. This m-line will have a media type field of "control" and a transport type field of "TCP" or "SCTP". The port number field of the m-line MUST contain 9 in the SDP offer from the client and MUST contain the TCP listen port on the server in the SDP answer. The client MAY then set up a TCP or TLS connection to that server port or share an already established connection to that port. The format field of the m-line MUST contain "application/mrcpv2".

The client must specify the resource type identifier in the "resource" attribute associated with the control m-line of the SDP offer. The server MUST respond with the full Channel-Identifier (which includes the resource type identifier and a unique hexadecimal identifier) in the "channel" attribute associated with the control m-line of the SDP answer.

When the client wants to add a media processing resource to the session, it MUST initiate a re-INVITE dialog.
The SDP offer/answer exchange contained in this SIP dialog will contain an additional control m-line for the new resource that needs to be allocated. The media server, on seeing the new m-line, will allocate the resource and respond with a corresponding control m-line in the SDP answer response. When the client wants to de-allocate the resource from this session, it MUST initiate a SIP re-INVITE dialog with the media server and MUST offer the control m-line with a port 0. The server MUST then answer the control m-line with a response of port 0. Example 2: This exchange continues from example 1 and adds a resource control channel for a synthesizer. Since a synthesizer would be generating an audio stream, this interaction also creates a receive-only audio stream for the server to send audio to. C->S: S Shanmugham IETF-Draft Page 9 MRCPv2 Protocol January 2004 INVITE sip:mresources@mediaserver.com SIP/2.0 Via: SIP/2.0/TCP client.atlanta.example.com:5060; branch=z9hG4bK74bf9 Max-Forwards: 70 To: MediaServer From: sarvi ;tag=1928301774 Call-ID: a84b4c76e66710 CSeq: 314161 INVITE Contact: Content-Type: application/sdp Content-Length: ... v=0 o=sarvi 2890844526 2890842808 IN IP4 126.16.64.4 s=- c=IN IP4 224.2.17.12 m=control 9 TCP application/mrcpv2 a=resource:speechsynth a=cmid:1 m=audio 49170 RTP/AVP 0 96 a=rtpmap:0 pcmu/8000 a=recvonly a=mid:1 S->C: SIP/2.0 200 OK Via: SIP/2.0/TCP client.atlanta.example.com:5060; branch=z9hG4bK74bf9 To: MediaServer From: sarvi ;tag=1928301774 Call-ID: a84b4c76e66710 CSeq: 314161 INVITE Contact: Content-Type: application/sdp Content-Length: ... v=0 o=sarvi 2890844526 2890842808 IN IP4 126.16.64.4 s=- c=IN IP4 224.2.17.12 m=control 32416 TCP application/mrcpv2 a=channel:32AECB234338@speechsynth a=cmid:1 m=audio 48260 RTP/AVP 00 96 a=rtpmap:0 pcmu/8000 a=sendonly a=mid:1 C->S: ACK sip:mresources@mediaserver.com SIP/2.0 Via: SIP/2.0/TCP client.atlanta.example.com:5060; S Shanmugham IETF-Draft Page 10 MRCPv2 Protocol January 2004 branch=z9hG4bK74bf9 Max-Forwards: 70 To: MediaServer ;tag=a6c85cf From: Sarvi ;tag=1928301774 Call-ID: a84b4c76e66710 CSeq: 314162 ACK Content-Length: 0 Example 3: This exchange continues from example 2 allocates an additional resource control channel for a recognizer. Since a recognizer would need to receive an audio stream for recognition, this interaction also updates the audio stream to sendrecv making it a 2-way audio stream. C->S: INVITE sip:mresources@mediaserver.com SIP/2.0 Via: SIP/2.0/TCP client.atlanta.example.com:5060; branch=z9hG4bK74bf9 Max-Forwards: 70 To: MediaServer From: sarvi ;tag=1928301774 Call-ID: a84b4c76e66710 CSeq: 314163 INVITE Contact: Content-Type: application/sdp Content-Length: ... 
v=0 o=sarvi 2890844526 2890842809 IN IP4 126.16.64.4 s=- c=IN IP4 224.2.17.12 m=control 9 TCP application/mrcpv2 a=resource:speechrecog a=cmid:1 m=control 9 TCP application/mrcpv2 a=resource:speechsynth a=cmid:1 m=audio 49170 RTP/AVP 0 96 a=rtpmap:0 pcmu/8000 a=rtpmap:96 telephone-event/8000 a=fmtp:96 0-15 a=sendrecv a=mid:1 S->C: SIP/2.0 200 OK Via: SIP/2.0/TCP client.atlanta.example.com:5060; branch=z9hG4bK74bf9 To: MediaServer From: sarvi ;tag=1928301774 S Shanmugham IETF-Draft Page 11 MRCPv2 Protocol January 2004 Call-ID: a84b4c76e66710 CSeq: 314163 INVITE Contact: Content-Type: application/sdp Content-Length: 131 v=0 o=sarvi 2890844526 2890842809 IN IP4 126.16.64.4 s=- c=IN IP4 224.2.17.12 m=control 32416 TCP application/mrcpv2 a=channel:32AECB234338@speechrecog a=cmid:1 m=control 32416 TCP application/mrcpv2 a=channel:32AECB234339@speechsynth a=cmid:1 m=audio 48260 RTP/AVP 0 96 a=rtpmap:0 pcmu/8000 a=rtpmap:96 telephone-event/8000 a=fmtp:96 0-15 a=sendrecv a=mid:1 C->S: ACK sip:mresources@mediaserver.com SIP/2.0 Via: SIP/2.0/TCP client.atlanta.example.com:5060; branch=z9hG4bK74bf9 Max-Forwards: 70 To: MediaServer ;tag=a6c85cf From: Sarvi ;tag=1928301774 Call-ID: a84b4c76e66710 CSeq: 314164 ACK Content-Length: 0 Example 4: This exchange continues from example 3 and de-allocates recognizer channel. Since a recognizer would not need to receive an audio stream any more, this interaction also updates the audio stream to recvonly. C->S: INVITE sip:mresources@mediaserver.com SIP/2.0 Via: SIP/2.0/TCP client.atlanta.example.com:5060; branch=z9hG4bK74bf9 Max-Forwards: 70 To: MediaServer From: sarvi ;tag=1928301774 Call-ID: a84b4c76e66710 CSeq: 314163 INVITE Contact: S Shanmugham IETF-Draft Page 12 MRCPv2 Protocol January 2004 Content-Type: application/sdp Content-Length: ... v=0 o=sarvi 2890844526 2890842809 IN IP4 126.16.64.4 s=- c=IN IP4 224.2.17.12 m=control 0 TCP application/mrcpv2 a=resource:speechrecog a=cmid:1 m=control 9 TCP application/mrcpv2 a=resource:speechsynth a=cmid:1 m=audio 49170 RTP/AVP 0 96 a=rtpmap:0 pcmu/8000 a=recvonly a=mid:1 S->C: SIP/2.0 200 OK Via: SIP/2.0/TCP client.atlanta.example.com:5060; branch=z9hG4bK74bf9 To: MediaServer From: sarvi ;tag=1928301774 Call-ID: a84b4c76e66710 CSeq: 314163 INVITE Contact: Content-Type: application/sdp Content-Length: 131 v=0 o=sarvi 2890844526 2890842809 IN IP4 126.16.64.4 s=- c=IN IP4 224.2.17.12 m=control 0 TCP application/mrcpv2 a=channel:32AECB234338@speechrecog a=cmid:1 m=control 32416 TCP application/mrcpv2 a=channel:32AECB234339@speechsynth a=cmid:1 m=audio 48260 RTP/AVP 0 96 a=rtpmap:0 pcmu/8000 a=sendonly a=mid:1 C->S: ACK sip:mresources@mediaserver.com SIP/2.0 Via: SIP/2.0/TCP client.atlanta.example.com:5060; branch=z9hG4bK74bf9 Max-Forwards: 70 To: MediaServer ;tag=a6c85cf S Shanmugham IETF-Draft Page 13 MRCPv2 Protocol January 2004 From: Sarvi ;tag=1928301774 Call-ID: a84b4c76e66710 CSeq: 314164 ACK Content-Length: 0 3.3. Media Streams and RTP Ports The client or the server would need to add audio (or other media) pipes between the client and the server and associate them with the resource that would process or generate the media. One or more resources could be associated with a single media channel or each resource could be assigned a separate media channel. For example, a synthesizer and a recognizer could be associated to the same media pipe(m=audio line), if it is opened in "sendrecv" mode. 
Alternatively, the recognizer could have its own "sendonly" audio pipe and the synthesizer could have its own "recvonly" audio pipe. The association between control channels and their corresponding media channels is established through the "mid" attribute defined in RFC 3388 [20]. If there is more than one audio m-line, then each audio m-line MUST have a "mid" attribute. Each control m-line MUST have a "cmid" attribute that matches the "mid" attribute of the audio m-line it is associated with.

   cmid-attribute     = "a=cmid:" identification-tag
   identification-tag = token

A single audio m-line can be associated with multiple resources, or each resource can have its own audio m-line. For example, if the client wants to allocate a recognizer and a synthesizer and associate them with a single 2-way audio pipe, the SDP offer should contain two control m-lines and a single audio m-line with an attribute of "sendrecv". Each of the control m-lines should have a "cmid" attribute whose value matches the "mid" of the audio m-line. If the client wants to allocate a recognizer and a synthesizer each with its own separate audio pipe, the SDP offer would carry two control m-lines (one for the recognizer and another for the synthesizer) and two audio m-lines (one with the attribute "sendonly" and another with the attribute "recvonly"). The "cmid" attribute of the recognizer control m-line would match the "mid" value of the "sendonly" audio m-line, and the "cmid" attribute of the synthesizer control m-line would match the "mid" attribute of the "recvonly" m-line.

When a server receives media (say, audio) on a media pipe that is associated with more than one media processing resource, it is the responsibility of the server to receive and fork it to the resources that need it. If multiple resources in a session are generating audio (or other media) that needs to be sent on a single associated media pipe, it is the responsibility of the server to mix the streams before sending them on the media pipe. The media stream in either direction may contain more than one synchronization source (SSRC) identifier, due to multiple sources contributing to the media on the pipe, and the client and server SHOULD be able to deal with this. If a media server does not have the capability to mix or fork media in the above cases, then the server SHOULD disallow the client from associating multiple such resources with a single audio pipe by rejecting the SDP offer.

3.4. MRCPv2 Message Transport

The MRCPv2 resource messages defined in this document are transported over a TCP or SCTP pipe between the client and the server. The setting up of this TCP pipe and the resource control channel is discussed in Section 3.2. Multiple resource control channels between a client and a server that belong to different SIP sessions can share one or more TCP or SCTP pipes between them. The individual MRCPv2 messages carry the MRCPv2 channel identifier in their Channel-Identifier header field, which MUST be used to differentiate MRCPv2 messages belonging to different resource channels. All MRCPv2-based media servers MUST support TCP for transport and MAY support SCTP.

Example 1:

   C->S: MRCP/2.0 483 SPEAK 543257
         Channel-Identifier: 32AECB23433802@speechsynth
         Voice-gender: neutral
         Voice-category: teenager
         Prosody-volume: medium
         Content-Type: application/synthesis+ssml
         Content-Length: 104

         You have 4 new messages. The first is from Stephanie
         Williams and arrived at 3:45pm.
The subject is ski trip S->C: MRCP/2.0 81 543257 200 IN-PROGRESS Channel-Identifier: 32AECB23433802@speechsynth S->C: MRCP/2.0 89 SPEAK-COMPLETE 543257 COMPLETE S Shanmugham IETF-Draft Page 15 MRCPv2 Protocol January 2004 Channel-Identifier: 32AECB23433802@speechsynth Most examples from here on show only the MRCPv2 messages and do not show the SIP messages and headers that may have been used to establish the MRCPv2 control channel. 4. Notational Conventions Since many of the definitions and syntax are identical to HTTP/1.1, this specification only points to the section where they are defined rather than copying it. For brevity, [HX.Y] is to be taken to refer to Section X.Y of the current HTTP/1.1 specification (RFC 2616 [1]). All the mechanisms specified in this document are described in both prose and an augmented Backus-Naur form (ABNF). It is described in detail in RFC 2234 [3]. The complete message format in ABNF form is provided in Appendix section 12.1 and is the normative format definition. 5. MRCPv2 Specification The MRCPv2 PDU is textual using an ISO 10646 character set in the UTF-8 encoding (RFC 2044) to allow many different languages to be represented. However, to assist in compact representations, MRCPv2 also allows other character sets such as ISO 8859-1 to be used when desired. The MRCPv2 protocol headers and field names use only the US-ASCII subset of UTF-8. Internationalization only applies to certain fields like grammar, results, speech markup etc, and not to MRCPv2 as a whole. Lines are terminated by CRLF, but receivers SHOULD be prepared to also interpret CR and LF by themselves as line terminators. Also, some parameters in the PDU may contain binary data or a record spanning multiple lines. Such fields have a length value associated with the parameter, which indicates the number of octets immediately following the parameter. All MRCPv2 messages, responses and events MUST carry the Channel- Identifier header field in it for the server or client to differentiate messages from different control channels that may share the same TCP connection. The MRCPv2 message set consists of requests from the client to the server, responses from the server to the client and asynchronous events from the server to the client. All these messages consist of a start-line, one or more header fields (also known as "headers"), an empty line (i.e. a line with nothing preceding the CRLF) indicating the end of the header fields, and an optional message body. generic-message = start-line S Shanmugham IETF-Draft Page 16 MRCPv2 Protocol January 2004 message-header CRLF [ message-body ] start-line = request-line | response-line | event-line message-header = 1*(generic-header | resource-header) resource-header = recognizer-header | synthesizer-header The message-body contains resource-specific and message-specific data that needs to be carried between the client and server as a MIME entity. The information contained here and the actual MIME- types used to carry the data are specified later when addressing the specific messages. If a message contains data in the message body, the header fields will contain content-headers indicating the MIME-type and encoding of the data in the message body. 5.1. Request A MRCPv2 request consists of a Request line followed by zero or more parameters as part of the message headers and an optional message body containing data specific to the request message. 
The request message from a client to the server includes, within the first line, the method to be applied, a method tag for that request, and the version of the protocol in use.

   request-line   = mrcp-version SP message-length SP method-name
                    SP request-id CRLF

The mrcp-version field is the MRCPv2 protocol version that is being used by the client.

   mrcp-version   = "MRCP" "/" 1*DIGIT "." 1*DIGIT

The message-length field specifies the length of the message and MUST be the second token from the beginning of the message. This is to make the framing and parsing of the message simpler to do.

   message-length = 1*DIGIT

The request-id field is a unique identifier created by the client and sent to the server. The server resource MUST use this identifier in its response to this request. If the request does not complete with the response, future asynchronous events associated with this request MUST carry the request-id.

   request-id     = 1*DIGIT

The method-name field identifies the specific request that the client is making to the server. Each resource supports a certain list of requests or methods that can be issued to it; these are addressed in later sections.

   method-name    = synthesizer-method | recognizer-method

5.2. Response

After receiving and interpreting the request message, the server resource responds with an MRCPv2 response message. It consists of a status line optionally followed by a message body.

   response-line  = mrcp-version SP message-length SP request-id
                    SP status-code SP request-state CRLF

The mrcp-version field used here is similar to the one used in the request line and indicates the version of the MRCPv2 protocol running on the server.

The request-id used in the response MUST match the one sent in the corresponding request message.

The status-code field is a 3-digit code representing the success or failure or other status of the request.

The request-state field indicates whether the job initiated by the request is PENDING, IN-PROGRESS or COMPLETE. The COMPLETE status means that the request was processed to completion and that there will be no more events from that resource to the client with that request-id. The PENDING status means that the job has been placed on a queue and will be processed in first-in-first-out order. The IN-PROGRESS status means that the request is being processed and is not yet complete. A PENDING or IN-PROGRESS status indicates that further event messages will be delivered with that request-id.

   request-state  = "COMPLETE" | "IN-PROGRESS" | "PENDING"

5.2.1. Status Codes

The status codes are classified as Success (2xx) codes and Failure (4xx) codes.

   Success 2xx
     200  Success
     201  Success with some optional parameters ignored

   Failure 4xx
     401  Method not allowed
     402  Method not valid in this state
     403  Unsupported Parameter
     404  Illegal Value for Parameter
     405  Not found (e.g., Resource URI not initialized or doesn't
          exist)
     406  Mandatory Parameter Missing
     407  Method or Operation Failed (e.g., grammar compilation
          failed in the recognizer; detailed cause codes MAY be
          available through a resource-specific header field)
     408  Unrecognized or unsupported message entity
     409  Unsupported Parameter Value
     421-499  Resource-specific failure codes

5.3. Event

The server resource may need to communicate a change in state or the occurrence of a certain event to the client.
These messages are used when a request does not complete immediately and the response returns a status of PENDING or IN-PROGRESS. The intermediate results and events of the request are indicated to the client through the event message from the server. Events have the request-id of the request that is in progress and generating these events and status value. The status value is COMPLETE if the request is done and this was the last event, else it is IN-PROGRESS. event-line = mrcp-version SP message-length SP event-name SP request-id SP request-state CRLF The mrcp-version used here is identical to the one used in the Request/Response Line and indicates the version of MRCPv2 protocol running on the server. The request-id used in the event MUST match the one sent in the request that caused this event. The request-state indicates if the Request/Command causing this event is complete or still in progress, and is the same as the one mentioned in section 5.3. The final event will contain a COMPLETE status indicating the completion of the request. The event-name identifies the nature of the event generated by the media resource. The set of valid event names are dependent on the resource generating it, and will be addressed in later sections. event-name = synthesizer-event | recognizer-event S Shanmugham IETF-Draft Page 19 MRCPv2 Protocol January 2004 5.4. Message Headers MRCPv2 header fields, which include general-header (section 5.5) and resource-specific-header (section 7.4 and section 8.4), follow the same generic format as that given in Section 3.1 of RFC 822 [8]. Each header field consists of a name followed by a colon (":") and the field value. Field names are case-insensitive. The field value MAY be preceded by any amount of LWS, though a single SP is preferred. Header fields can be extended over multiple lines by preceding each extra line with at least one SP or HT. message-header = field-name ":" [ field-value ] field-name = token field-value = *( field-content | LWS ) field-content = The field-content does not include any leading or trailing LWS: linear white space occurring before the first non-whitespace character of the field-value or after the last non-whitespace character of the field-value. Such leading or trailing LWS MAY be removed without changing the semantics of the field value. Any LWS that occurs between field-content MAY be replaced with a single SP before interpreting the field value or forwarding the message downstream. The order in which header fields with differing field names are received is not significant. However, it is "good practice" to send general-header fields first, followed by request-header or response- header fields, and ending with the entity-header fields. Multiple message-header fields with the same field-name MAY be present in a message if and only if the entire field-value for that header field is defined as a comma-separated list [i.e., #(values)]. It MUST be possible to combine the multiple header fields into one "field-name: field-value" pair, without changing the semantics of the message, by appending each subsequent field-value to the first, each separated by a comma. The order in which header fields with the same field-name are received is therefore significant to the interpretation of the combined field value, and thus a proxy MUST NOT change the order of these field values when a message is forwarded. 
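As a non-normative illustration of the header rules above (case-insensitive field names, folding of continuation lines, and combining of repeated comma-list header fields in received order), the following Python sketch shows one way a receiver might normalize headers. The function name is invented here and is not part of the protocol; this is a sketch, not a complete parser.

   # Illustrative sketch of the RFC 822-style header handling described above.
   def parse_headers(lines):
       """Return an insertion-ordered dict keyed by lower-cased field name.

       - A line starting with SP or HT continues the previous header (folding).
       - Field names are case-insensitive (stored lower-cased here).
       - Repeated fields are combined into one comma-separated value,
         preserving the order in which they were received.
       """
       headers = {}
       current = None
       for line in lines:
           if line[:1] in (" ", "\t") and current is not None:
               # Folded continuation line: append to the previous field value.
               headers[current] = (headers[current] + " " + line.strip()).strip()
               continue
           name, _, value = line.partition(":")
           current = name.strip().lower()
           value = value.strip()
           if current in headers:
               # Same field name repeated: append with a comma, keeping order.
               headers[current] += ", " + value
           else:
               headers[current] = value
       return headers

   if __name__ == "__main__":
       raw = [
           "Channel-Identifier: 32AECB234338@speechsynth",
           "active-request-id-list: 543257",
           "Active-Request-Id-List: 543258",
       ]
       hdrs = parse_headers(raw)
       print(hdrs["active-request-id-list"])   # -> "543257, 543258"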
Generic Headers generic-header = channel-identifier | active-request-id-list | proxy-sync-id S Shanmugham IETF-Draft Page 20 MRCPv2 Protocol January 2004 | content-id | content-type | content-length | content-base | content-location | content-encoding | cache-control | logging-tag All headers in MRCPv2 will be case insensitive consistent with HTTP and SIP protocol header definitions. Channel-Identifier All MRCPv2 methods, responses and events MUST contain the Channel- Identifier header field. The value of this field is a hexadecimal string and is allocated by the media server when the control channel was added to the session through a SDP offer/answer exchange. The last 2 digits of the Channel-Identifier field specify one of the media processing resource types listed in Section 3.2. channel-identifier = "Channel-Identifier" ":" channel-id CRLF Channel-id = 1*HEXDIG "@" 1*ALPHA Active-Request-Id-List In a request, this field indicates the list of request-ids that it should apply to. This is useful when there are multiple Requests that are PENDING or IN-PROGRESS and you want this request to apply to one or more of these specifically. In a response, this field returns the list of request-ids that the operation modified or were in progress or just completed. There could be one or more requests that returned a request-state of PENDING or IN-PROGRESS. When a method affecting one or more PENDING or IN-PROGRESS requests is sent from the client to the server, the response MUST contain the list of request-ids that were affected in this header field. The active-request-id-list is only used in requests and responses, not in events. For example, if a STOP request with no active-request-id-list is sent to a synthesizer resource(a wildcard STOP) which has one or more SPEAK requests in the PENDING or IN-PROGRESS state, all SPEAK requests MUST be cancelled, including the one IN-PROGRESS and the response to the STOP request would contain the request-id of all the SPEAK requests that were terminated in the active-request-id-list. In this case, no SPEAK-COMPLETE or RECOGNITION-COMPLETE events will be sent for these terminated requests. S Shanmugham IETF-Draft Page 21 MRCPv2 Protocol January 2004 active-request-id-list = "Active-Request-Id-List" ":" request-id *("," request-id) CRLF Proxy-Sync-Id When any server resource generates a barge-in-able event, it will generate a unique Tag and send it as a header field in an event to the client. The client then acts as a proxy to the server resource and sends a BARGE-IN-OCCURRED method to the synthesizer server resource with the Proxy-Sync-Id it received from the server resource. When the recognizer and synthesizer resources are part of the same session, they may choose to work together to achieve quicker interaction and response. Here the proxy-sync-id helps the resource receiving the event, proxied by the client, to decide if this event has been processed through a direct interaction of the resources. proxy-sync-id = "Proxy-Sync-Id" ":" 1*ALPHA CRLF Accept-Charset See [H14.2]. This specifies the acceptable character set for entities returned in the response or events associated with this request. This is useful in specifying the character set to use in the NLSML results of a RECOGNITON-COMPLETE event. Content-Type See [H14.17]. Note that the content types suitable for MRCPv2 are restricted to speech markup, grammar, recognition results etc. and are specified later in this document. 
The multi-part content type "multi-part/mixed" is supported to communicate multiple of the above mentioned contents, in which case the body parts cannot contain any MRCPv2 specific headers. Content-Id This field contains an ID or name for the content, by which it can be referred to. The definition of this field is available in RFC 2111 and is needed in multi-part messages. In MRCPv2 whenever the content needs to be stored, by either the client or the server, it is stored associated with this ID. Such content can be referenced during the session in URI form using the session: URI scheme described in a later section. Content-Base The content-base entity-header field may be used to specify the base URI for resolving relative URLs within the entity. S Shanmugham IETF-Draft Page 22 MRCPv2 Protocol January 2004 content-base = "Content-Base" ":" absoluteURI CRLF Note, however, that the base URI of the contents within the entity- body may be redefined within that entity-body. An example of this would be a multi-part MIME entity, which in turn can have multiple entities within it. Content-Encoding The content-encoding entity-header field is used as a modifier to the media-type. When present, its value indicates what additional content coding have been applied to the entity-body, and thus what decoding mechanisms must be applied in order to obtain the media- type referenced by the content-type header field. Content-encoding is primarily used to allow a document to be compressed without losing the identity of its underlying media type. content-encoding = "Content-Encoding" ":" 1#content-coding CRLF Content coding is defined in [H3.5]. An example of its use is Content-Encoding: gzip If multiple encoding have been applied to an entity, the content coding MUST be listed in the order in which they were applied. Content-Location The content-location entity-header field MAY BE used to supply the resource location for the entity enclosed in the message when that entity is accessible from a location separate from the requested resource's URI. content-location = "Content-Location" ":" ( absoluteURI | relativeURI ) CRLF The content-location value is a statement of the location of the resource corresponding to this particular entity at the time of the request. The media server MAY use this header field to optimize certain operations. When providing this header field the entity being sent should not have been modified, from what was retrieved from the content-location URI. For example, if the client provided a grammar markup inline, and it had previously retrieved it from a certain URI, that URI can be provided as part of the entity, using the content-location header field. This allows a resource like the recognizer to look into its cache to see if this grammar was previously retrieved, compiled and cached. In which case, it might optimize by using the previously compiled grammar object. S Shanmugham IETF-Draft Page 23 MRCPv2 Protocol January 2004 If the content-location is a relative URI, the relative URI is interpreted relative to the content-base URI. Content-Length This field contains the length of the content of the message body (i.e. after the double CRLF following the last header field). Unlike HTTP, it MUST be included in all messages that carry content beyond the header portion of the message. If it is missing, a default value of zero is assumed. It is interpreted according to [H14.13]. 
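The interaction between the Content-Length header field (which counts only the message body) and the message-length token in the start line (which, per Section 5.1, counts the entire message and is its second token) is easy to get wrong. The following Python sketch is a non-normative illustration of one way to assemble a message so that both lengths are consistent; the function name is invented for illustration.

   # Illustrative sketch (not normative): assembling an MRCPv2-style request.
   CRLF = "\r\n"

   def build_request(method, request_id, headers, body=""):
       body_bytes = body.encode("utf-8")
       hdrs = dict(headers)
       if body_bytes:
           # Content-Length is the length of the message body only.
           hdrs["Content-Length"] = str(len(body_bytes))

       def render(total_len):
           start = "MRCP/2.0 %d %s %d" % (total_len, method, request_id)
           lines = [start] + ["%s: %s" % (k, v) for k, v in hdrs.items()]
           return (CRLF.join(lines) + CRLF + CRLF).encode("utf-8") + body_bytes

       # message-length includes its own digits, so iterate until stable.
       length = 0
       while True:
           candidate = render(length)
           if len(candidate) == length:
               return candidate
           length = len(candidate)

   if __name__ == "__main__":
       msg = build_request(
           "SPEAK", 543257,
           {"Channel-Identifier": "32AECB23433802@speechsynth",
            "Content-Type": "application/synthesis+ssml"},
           body="You have 4 new messages.",
       )
       print(msg.decode("utf-8"))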
Cache-Control If the media server plans on implementing caching it MUST adhere to the cache correctness rules of HTTP 1.1 (RFC2616), when accessing and caching HTTP URI. In particular, the expires and cache-control headers of the cached URI or document must be honored and will always take precedence over the Cache-Control defaults set by this header field. The cache-control directives are used to define the default caching algorithms on the media server for the session or request. The scope of the directive is based on the method it is sent on. If the directives are sent on a SET-PARAMS method, it SHOULD apply for all requests for documents the media server may make in that session. If the directives are sent on any other messages they MUST only apply to document requests the media server needs to make for that method. An empty cache-control header on the GET-PARAMS method is a request for the media server to return the current cache-control directives setting on the server. cache-control = "Cache-Control" ":" 1#cache-directive CRLF cache-directive = "max-age" "=" delta-seconds | "max-stale" "=" delta-seconds | "min-fresh" "=" delta-seconds delta-seconds = 1*DIGIT Here delta-seconds is a time value to be specified as an integer number of seconds, represented in decimal, after the time that the message response or data was received by the media server. These directives allow the media server to override the basic expiration mechanism. max-age Indicates that the client is ok with the media server using a response whose age is no greater than the specified time in seconds. S Shanmugham IETF-Draft Page 24 MRCPv2 Protocol January 2004 Unless a max-stale directive is also included, the client is not willing to accept the media server using a stale response. min-fresh Indicates that the client is willing to accept the media server using a response whose freshness lifetime is no less than its current age plus the specified time in seconds. That is, the client wants the media server to use a response that will still be fresh for at least the specified number of seconds. max-stale Indicates that the client is willing to accept the media server using a response that has exceeded its expiration time. If max-stale is assigned a value, then the client is willing to accept the media server using a response that has exceeded its expiration time by no more than the specified number of seconds. If no value is assigned to max-stale, then the client is willing to accept the media server using a stale response of any age. The media server cache MAY BE requested to use stale response/data without validation, but only if this does not conflict with any "MUST"-level requirements concerning cache validation (e.g., a "must-revalidate" cache-control directive) in the HTTP 1.1 specification pertaining the URI. If both the MRCPv2 cache-control directive and the cached entry on the media server include "max-age" directives, then the lesser of the two values is used for determining the freshness of the cached entry for that request. Logging-Tag This header field MAY BE sent as part of a SET-PARAMS/GET-PARAMS method to set the logging tag for logs generated by the media server. Once set, the value persists until a new value is set or the session is ended. The MRCPv2 server SHOULD provide a mechanism to subset its output logs so that system administrators can examine or extract only the log file portion during which the logging tag was set to a certain value. 
MRCPv2 clients using this feature SHOULD take care to ensure that no two clients specify the same logging tag. In the event that two clients specify the same logging tag, the effect on the MRCPv2 server's output logs is undefined.

   logging-tag = "Logging-Tag" ":" 1*ALPHA CRLF

6. Resource Discovery

The capabilities of a media server's resources can be discovered by using the SIP OPTIONS method to request the capabilities of the media server. The media server SHOULD respond to such a request with an SDP description of its capabilities according to RFC 3264. The MRCPv2 capabilities are described by a single m-line containing the media type "control", a transport type of "TCP" or "SCTP", and a format of "application/mrcpv2". There should be one "resource" attribute for each resource that the media server supports, with the resource type identifier as its value.

The SDP description SHOULD also contain m-lines describing the audio capabilities and the coders it supports.

Example 4: The client uses the SIP OPTIONS method to query the capabilities of the MRCPv2 media server.

   C->S: OPTIONS sip:mrcp@mediaserver.com SIP/2.0
         Max-Forwards: 70
         To:
         From: Sarvi ;tag=1928301774
         Call-ID: a84b4c76e66710
         CSeq: 63104 OPTIONS
         Contact:
         Accept: application/sdp
         Content-Length: 0

   S->C: SIP/2.0 200 OK
         To: ;tag=93810874
         From: Sarvi ;tag=1928301774
         Call-ID: a84b4c76e66710
         CSeq: 63104 OPTIONS
         Contact:
         Allow: INVITE, ACK, CANCEL, OPTIONS, BYE
         Accept: application/sdp
         Accept-Encoding: gzip
         Accept-Language: en
         Supported: foo
         Content-Type: application/sdp
         Content-Length: 274

         v=0
         o=sarvi 2890844526 2890842807 IN IP4 126.16.64.4
         s=SDP Seminar
         i=A session for processing media
         c=IN IP4 224.2.17.12/127
         m=control 9 TCP application/mrcpv2
         a=resource:speechsynth
         a=resource:speechrecog
         a=resource:speakverify
         m=audio 0 RTP/AVP 0 1 3
         a=rtpmap:0 PCMU/8000
         a=rtpmap:1 1016/8000
         a=rtpmap:3 GSM/8000

7. Speech Synthesizer Resource

This resource is capable of converting text provided by the client into a speech stream in real time. Depending on the implementation and capability of this resource, the client can control parameters like voice characteristics, speaker speed, etc.

The synthesizer resource is controlled by MRCPv2 requests from the client. Similarly, the resource can respond to these requests or generate asynchronous events to the client to indicate certain conditions during the processing of the stream.

This section applies to the following resource types:

   1. speechsynth
   2. basicsynth
   3. audioplayer

The difference between the above three resources is in their level of support for rendering SSML. The "audioplayer" resource MUST support the