Internet Engineering Task Force                               D. Burnett
Internet-Draft                                     Nuance Communications
draft-burnett-mrcpext-00                                      P. Forgues
Expires: April 17, 2004                            Nuance Communications
                                                               C. Galles
                                                        Intervoice, Inc.
                                                        October 17, 2003

       MRCP Extensions: Media Resource Control Protocol Extensions

Status of this Memo

This document is an Internet-Draft and is subject to all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.html

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

Copyright Notice

Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

The Media Resource Control Protocol (MRCP) is an application level protocol to control media service resources like Speech Synthesizers, Recognizers, Signal Generators, Signal Detectors, Fax Servers etc. over a network. This document captures the extensions required to implement Voice Enrollment, Speaker Verification and Hotword recognition as well as to augment the recognizer functionality using MRCP. The extensions are largely orthogonal to existing features of MRCP and to each other, with an eye towards backwards compatibility with existing features and independence of the extensions from each other to simplify integration.

This document is published as an Internet-Draft as input for further IETF development in this area.

Burnett, et al. IETF-Draft Page 2 MRCP Extensions October 2003

Table of Contents

   Status of this Memo
   Abstract
   1. Introduction
   2. Architecture
   3. Notational Conventions
   4. Recognizer resource extensions
      4.1. Recognizer Resource Extensions Methods
      4.2. Recognizer Resource Extensions Events
      4.3. Recognizer Resource Extensions Header Fields
         4.3.1. Recording-URL
         4.3.2. Required-Phrase
         4.3.3. Phrase-Status
         4.3.4. Interpret-Text
      4.4. RECORD
      4.5. INTERPRET
      4.6. RECORDING-COMPLETE
      4.7. INTERPRETATION-COMPLETE
   5. Enrollment
      5.1. Enrollment State Machine
      5.2. Enrollment Methods
      5.3. Enrollment Events
      5.4. Enrollment Header Fields
         5.4.1. Num-Min-Consistent-Pronunciations
         5.4.2. Consistency-Threshold
         5.4.3. Clash-Threshold
         5.4.4. Personal-Grammar-URI
         5.4.5. Phrase-Id
         5.4.6. Phrase-NL
         5.4.7. Weight
         5.4.8. Save-Waveform
         5.4.9. Waveform-URL
         5.4.10. New-Phrase-Id
         5.4.11. Phrase-Text
         5.4.12. Completion-Cause
         5.4.13. Num-Clashes
         5.4.14. Num-Good-Repetitions
         5.4.15. Num-Repetitions-Still-Needed
         5.4.16. Consistency-Status
         5.4.17. Clash-Phrase-Ids
      5.5. Enrollment Methods
         5.5.1. START-ENROLLMENT-SESSION
         5.5.2. RECOGNIZE
         5.5.3. STOP
         5.5.4. PAUSE-ENROLLMENT-SESSION
         5.5.5. RESUME-ENROLLMENT-SESSION
         5.5.6. ENROLLMENT-ROLLBACK
         5.5.7. END-ENROLLMENT-SESSION
         5.5.8. ABORT-ENROLLMENT-SESSION
         5.5.9. MODIFY-PHRASE
         5.5.10. ADD-PHRASE
         5.5.11. DELETE-PHRASE
         5.5.12. RECOGNITION-COMPLETE
   6. Speaker Verification and Identification
      6.1. Speaker Verification/Identification Resource
      6.2. SETUP Verification/Identification Resource
      6.3. Speaker Verification State Machine
      6.4. Speaker Verification Methods
      6.5. Verification Events
      6.6. Verification Header Fields
         6.6.1. Voiceprint-URI
         6.6.2. Voiceprint-Identifier
         6.6.3. Voiceprint-Group
         6.6.4. Verification-Mode
         6.6.5. Adapt-Model
         6.6.6. Abort-Model
         6.6.7. Buffering-Mode
         6.6.8. Security-Level
         6.6.9. Num-Min-Verification-Phrases
         6.6.10. Num-Max-Verification-Phrases
         6.6.11. Completion-Cause
         6.6.12. No-Input-Timeout
         6.6.13. Save-Waveform
         6.6.14. Waveform-URL
         6.6.15. Vendor-Specific
         6.6.16. Voiceprint-Exists
         6.6.17. Is-Valid-Utterance
         6.6.18. Num-Valid-Utterances
         6.6.19. Decision
         6.6.20. Num-Frames
         6.6.21. Device
         6.6.22. Gender
         6.6.23. Matched
         6.6.24. Adapted
         6.6.25. Verification-Score
         6.6.26. Group-Name
         6.6.27. Member
         6.6.28. Score
      6.7. Verification Session Methods
         6.7.1. VER-START-SESSION
         6.7.2. VER-END-SESSION
         6.7.3. VER-SET-VOICEPRINT
         6.7.4. VER-DELETE-VOICEPRINT
         6.7.5. VERIFY
         6.7.6. VER-BUFFERING-START
         6.7.7. VER-BUFFERING-CONTROL
         6.7.8. VER-BUFFERING-STOP
         6.7.9. VER-FROM-BUFFER
         6.7.10. VER-ROLLBACK
         6.7.11. VER-STOP
         6.7.12. VER-START-TIMERS
         6.7.13. SET-PARAMS
         6.7.14. GET-PARAMS
      6.8. Verification Session Events
         6.8.1. VERIFICATION-COMPLETE
         6.8.2. START-OF-SPEECH
   7. Hotword Recognition
      7.1. Hotword State Machine
         7.1.1. Addressing Resources
      7.2. Hotword Header Fields
         7.2.1. Hotword-Max-Seconds
         7.2.2. Hotword-Min-Seconds
      7.3. Hotword Methods
         7.3.1. SETUP
         7.3.2. RECOGNIZE
   8. RTSP based Examples:
      8.1. Enrollment
      8.2. Speaker Verification and Identification
      8.3. Hotword Recognition
   9. Security Considerations
   10. Reference Documents
   Acknowledgements
   Full Copyright Statement
   Authors' Addresses

1. Introduction

The Media Resource Control Protocol (MRCP) [3] is an application level protocol to control media service resources like Speech Synthesizers, Recognizers, Signal Generators, Signal Detectors, Fax Servers etc. over a network. This protocol is designed to work with session control protocols like RTSP (Real Time Streaming Protocol) or SIP (Session Initiation Protocol), which help establish control connections to external media streaming devices, and with media delivery mechanisms like RTP (Real-time Transport Protocol).

MRCP supports basic recognition and speech synthesis (TTS) capabilities. This document captures the extensions required to implement Voice Enrollment, Speaker Verification and Hotword recognition as well as to augment the recognizer functionality using MRCP. Already having functional implementations of [3], the authors developed these extensions within that framework. It is expected that these methods will also prove useful as information for the IETF in its standardization efforts beyond this draft version of MRCP.

A major goal of the Recognition, Enrollment, Speaker Verification and Hotword recognition extensions is to be backward compatible, i.e.
to implement them in such a way that previous functionality is available without change. In addition, the MRCP extensions used for Enrollment, Speaker Verification and Identification and Hotword recognition are independent from one another. This means a client can implement only the set of methods needed for a particular integration. For example, only the Enrollment methods and responses need to be implemented by a client, provided the server has implemented those methods.

The extensions for Enrollment do not need a separate resource type because they are implemented as part of the recognition resource. Speaker Verification and Hotword recognition, unlike Enrollment, were defined as new resource types: Speaker Verification creates a verification resource, and Hotword recognition attaches a special kind of Recognizer resource to the session in addition to the primary Recognizer resource.

There is no need to change the underlying protocols to support Enrollment, Speaker Verification or Hotword recognition. Like the original MRCP specification, the extensions rely on a protocol like the Real Time Streaming Protocol (RTSP) or Session Initiation Protocol (SIP) to establish and maintain the session. The session control protocol is also responsible for establishing the media connection from the client to the network server.

The MRCP protocol extensions define the requests, responses and events needed to control Voice Enrollment, Speaker Verification and Hotword recognition features. It is assumed the state machine for a recognition resource is preserved.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [5].

Please send any feedback on this document directly to the authors.

2. Architecture

There is no change in architecture from the original MRCP specification.
It is assumed that Enrollment is done by a Recognizer resource. Therefore, an appropriate SETUP message needs to be sent and a media stream established between a client and server before these functions are used.

Speaker Verification and Hotword recognition are slightly different. For Speaker Verification, a new verification resource is defined. This verification resource can be used on its own or be attached to a session where a recognition resource is already set up. Hotword recognition differs in that a second Recognizer resource needs to be attached to the same session. The state machine for this second recognizer is the same as for the primary Recognizer resource.

The following sections describe each of the MRCP extensions separately: (1) Recognizer resource extensions, (2) Enrollment, (3) Speaker Verification and Identification and (4) Hotword recognition.

3. Notational Conventions

Most of the definitions and syntax follow the same format used in the MRCP draft submission. The only new syntax required represents the short floating-point numbers used to indicate relative weight in some of the header fields. A weight is normalized to the range 0 to 1.

   WEIGHT = ( "0" [ "." 0*3DIGIT ] ) | ( "1" [ "." 0*3("0") ] )
   FLOAT  = [ "+" / "-" ] 1*DIGIT [ "." 0*DIGIT ]

4. Recognizer resource extensions

The only new functionality added to the recognizer resource is the inclusion of the INTERPRET and RECORD methods and the associated INTERPRETATION-COMPLETE and RECORDING-COMPLETE events.

4.1. Recognizer Resource Extensions Methods

The following methods are supported by the recognizer resource in addition to those already defined in [3].

   recognizer-extension-method = "RECORD"
                               | "INTERPRET"

4.2. Recognizer Resource Extensions Events

The recognizer resource may now generate the following events in addition to those already defined in [3].

   recognizer-extension-event = "RECORDING-COMPLETE"
                              | "INTERPRETATION-COMPLETE"

4.3. Recognizer Resource Extensions Header Fields

The recognizer resource extensions define new header fields to augment the request, response or event messages they are associated with.

   recognizer-extension-header = "Recording-URL"     ; Section 4.3.1
                               | "Required-Phrase"   ; Section 4.3.2
                               | "Phrase-Status"     ; Section 4.3.3
                               | "Interpret-Text"    ; Section 4.3.4

   Parameter         Support     Methods/Events/Responses

   recording-url     MANDATORY   RECORD, SET-PARAMS, GET-PARAMS
   required-phrase   MANDATORY   RECOGNIZE, SET-PARAMS, GET-PARAMS
   phrase-status     MANDATORY   RECOGNITION-COMPLETE
   interpret-text    MANDATORY   INTERPRET

4.3.1. Recording-URL

This header field specifies the location where the audio stream recorded by a call to the RECORD method should be saved. Currently, this should only be a URL using the 'file' scheme. If this URL is relative, it is resolved against the current working directory of the MRCP server process.

This header field MAY be used only when invoking the RECORD, SET-PARAMS and GET-PARAMS methods.

   recording-url = "Recording-URL" ":" Url CRLF

4.3.2. Required-Phrase

This header field specifies the required or expected phrase to be spoken during recognition. The required phrase is a hint to the recognizer resource to examine its N-best list to determine if the required phrase is contained somewhere in the list (even if it is not the top choice). This header field MAY occur in the RECOGNIZE, SET-PARAMS, and GET-PARAMS methods. An empty string for this header field means that no phrase is required. The default value is an empty string.

Use of the Required-Phrase header field causes the RECOGNITION-COMPLETE event to include a header field, "Phrase-Status", with a value of "valid" or "invalid" to indicate whether the required phrase was found in the N-best list.
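
The N-best check described above can be sketched as follows. This is a minimal illustration in Python and not part of the protocol: the function name phrase_status, the list-of-strings representation of the N-best list, and the case-insensitive, whitespace-normalized comparison are all assumptions, since this document does not specify how a server compares a hypothesis with the required phrase.

```python
from typing import List, Optional


def phrase_status(n_best: List[str], required_phrase: str) -> Optional[str]:
    """Return the Phrase-Status value for a recognition result.

    Hypothetical helper: returns None when no required phrase is set
    (empty string, the default), because Phrase-Status is only
    reported when Required-Phrase is used.
    """
    if not required_phrase:
        return None

    # Assumed comparison rule: ignore case and collapse whitespace.
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    target = normalize(required_phrase)
    # "valid" if the phrase appears anywhere in the N-best list,
    # even if it is not the top choice.
    if any(normalize(hypothesis) == target for hypothesis in n_best):
        return "valid"
    return "invalid"
```

A client would read the resulting value from the Phrase-Status header of the RECOGNITION-COMPLETE event rather than compute it itself; the sketch only illustrates the server-side rule.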
A scenario in which the required phrase may be useful is voice verification against an expected response. If the caller does not speak a valid phrase, the client can use a phrase status of "invalid" to roll back a verification resource utterance.

   required-phrase = "Required-Phrase" ":" 1*ALPHA CRLF

4.3.3. Phrase-Status

This header field provides an indicator of the validity of the caller utterance when a required phrase is used. Utterances that produce a recognition result matching the required phrase somewhere in the N-best list yield a Phrase-Status of "valid", while recognition results that do not match the required phrase anywhere in the N-best list yield a Phrase-Status of "invalid".

   phrase-status        = "Phrase-Status" ":" phrase-status-string CRLF
   phrase-status-string = "valid" | "invalid"

4.3.4. Interpret-Text

This header field is used to provide the text string for which a natural language interpretation is desired. This header field MUST be used when invoking the INTERPRET method as it cannot be set with the SET-PARAMS method.

   interpret-text = "Interpret-Text" ":" 1*OCTET CRLF

4.4. RECORD

The RECORD method does not invoke the recognizer resource but simply endpoints and records the input audio stream. It saves the endpointed audio to the URL supplied in the recording-url header field. Currently, this URL can only use the 'file' scheme.

If a RECOGNIZE, INTERPRET or another RECORD operation is already in progress, invoking this method will cause the response to have a status code of 402, "Method not valid in this state", and a COMPLETE request state. If the recording-url is not valid, a status code of 404, "Illegal Value for Parameter", will be returned in the response. If it is impossible for the server to create the requested file, a status code of 407, "Method or Operation Failed", will be returned.
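
The status-code rules for RECORD can be sketched as a small decision function. This is an illustrative Python sketch under stated assumptions, not a server implementation: the function name record_response and its boolean inputs are hypothetical, the order in which the checks are applied is an assumption, and the COMPLETE request state for the 404 and 407 cases is assumed by analogy with the 402 case. A request that passes all checks is answered with 200 and an IN-PROGRESS request state.

```python
from typing import Tuple
from urllib.parse import urlparse


def record_response(operation_in_progress: bool,
                    recording_url: str,
                    can_create_file: bool) -> Tuple[int, str]:
    """Return (status code, request state) for a RECORD request."""
    # A RECOGNIZE, INTERPRET or RECORD operation is already running.
    if operation_in_progress:
        return 402, "COMPLETE"    # Method not valid in this state

    parsed = urlparse(recording_url)
    # Only the 'file' scheme is accepted. An empty scheme is treated
    # here as a relative URL, which Recording-URL also permits.
    if parsed.scheme not in ("", "file") or not parsed.path:
        return 404, "COMPLETE"    # Illegal Value for Parameter

    # The server could not create the requested file.
    if not can_create_file:
        return 407, "COMPLETE"    # Method or Operation Failed

    return 200, "IN-PROGRESS"     # recording has started
```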
If the recording-url is valid, the recording operation is initiated and the response will indicate an IN-PROGRESS request state. The server MAY generate a subsequent START-OF-SPEECH event when speech is detected. Upon completion of the recording operation, the server will generate a RECORDING-COMPLETE event.

Example:

   C->S:RECORD 456234 MRCP/1.0
        Recording-URL: file://mediaserver/recordings/myfile.wav

   S->C:MRCP/1.0 456234 200 IN-PROGRESS

   S->C:START-OF-SPEECH 456234 IN-PROGRESS MRCP/1.0

   S->C:RECORDING-COMPLETE 456234 COMPLETE MRCP/1.0
        Completion-Cause: 000 success

4.5. INTERPRET

The INTERPRET method from the client to the server takes as input an interpret-text header, containing the text for which the semantic interpretation is desired, and returns, via the INTERPRETATION-COMPLETE event, an interpretation result which is very similar to the one returned from a RECOGNIZE method invocation. Only the portions of the result relevant to acoustic matching are excluded. The interpret-text header MUST be included in the INTERPRET request.

Recognizer grammar data is treated in the same way as it is when issuing a RECOGNIZE method call.

If a RECOGNIZE, RECORD or another INTERPRET operation is already in progress, invoking this method will cause the response to have a status code of 402, "Method not valid in this state", and a COMPLETE request state.

Example:

   C->S:INTERPRET 234567 MRCP/1.0
        Interpret-Text: may I speak to Andre Roy
        Content-Type: application/grammar+xml
        Content-Id: request1@form-level.store
        Content-Length: 104

        oui yes may I speak to Michel Tremblay Andre Roy

   S->C:MRCP/1.0 234567 200 IN-PROGRESS

   S->C:INTERPRETATION-COMPLETE 234567 COMPLETE MRCP/1.0
        Completion-Cause: 000 success
        Content-Type: application/x-nlsml
        Content-Length: 276

        Andre Roy may I speak to Andre Roy

4.6. RECORDING-COMPLETE

This event from the recognition resource to the client indicates that the RECORD operation is complete. The request state MUST be set to COMPLETE.

The completion-cause header MUST be included in this event. It MUST be set to one of the following values defined for the recognizer resource:

   Cause-Code   Cause-Name         Description

   000          success            RECORD completed successfully
   002          no-input-timeout   RECORD completed with no audio recorded due to lack of input
   006          error              RECORD operation terminated due to an error

When the completion-cause is "000 success", the URL specified via the recording-url header in the RECORD method invocation will contain the recorded audio. The client may then use this URL to retrieve the audio.

Example:

   C->S:RECORD 456234 MRCP/1.0
        Recording-URL: file://mediaserver/recordings/myfile.wav

   S->C:MRCP/1.0 456234 200 IN-PROGRESS

   S->C:START-OF-SPEECH 456234 IN-PROGRESS MRCP/1.0

   S->C:RECORDING-COMPLETE 456234 COMPLETE MRCP/1.0
        Completion-Cause: 000 success

4.7. INTERPRETATION-COMPLETE

This event from the recognition resource to the client indicates that the INTERPRET operation is complete. The interpretation result is sent in the body of the MRCP message. The request state MUST be set to COMPLETE.

The completion-cause header MUST be included in this event and MUST be set to one of the following two values defined for the recognizer resource:

   Cause-Code   Cause-Name   Description

   000          success      INTERPRET completed successfully
   006          error        INTERPRET terminated due to an error

Example:

   C->S:INTERPRET 234567 MRCP/1.0
        Interpret-Text: may I speak to Andre Roy
        Content-Type: application/grammar+xml
        Content-Id: request1@form-level.store
        Content-Length: 104

        oui yes may I speak to Michel Tremblay Andre Roy

   S->C:MRCP/1.0 234567 200 IN-PROGRESS

   S->C:INTERPRETATION-COMPLETE 234567 COMPLETE MRCP/1.0
        Completion-Cause: 000 success
        Content-Type: application/x-nlsml
        Content-Length: 276

        Andre Roy may I speak to Andre Roy

5. Enrollment

This document captures the extensions required to implement Voice Enrollment, Speaker Verification and Hotword recognition using MRCP. This section describes the methods, responses and events needed for doing Enrollment.

Enrollment can be performed using a person's voice or by building the personal grammar using text entry. For example, a list of contacts can be created and maintained by recording the contacts' names in the person's voice or by editing the list of contacts using a Web-based tool. These techniques are called Voice Enrollment and Text-based Enrollment, respectively.

Voice Enrollment has a concept of an enrollment session. Adding a new phrase to a personal grammar involves the initial enrollment followed by enough repeated utterances before the new phrase is committed to the personal grammar. Each time an utterance is recorded, it is compared for similarity with the other samples, and a clash test is performed against other entries in the personal grammar to ensure there are no similar and confusable entries.

5.1. Enrollment State Machine

Starting an enrollment session does not change the state of the recognizer resource, i.e. it remains idle. Once an enrollment session is started, utterances are enrolled by calling the RECOGNIZE method repeatedly. The state of the recognizer resource goes from IDLE to RECOGNIZING each time RECOGNIZE is called.

5.2. Enrollment Methods

Enrollment supports the following methods.

   enrollment-method = "START-ENROLLMENT-SESSION"
                     | "RECOGNIZE"
                     | "STOP"
                     | "PAUSE-ENROLLMENT-SESSION"
                     | "RESUME-ENROLLMENT-SESSION"
                     | "ENROLLMENT-ROLLBACK"
                     | "END-ENROLLMENT-SESSION"
                     | "ABORT-ENROLLMENT-SESSION"
                     | "MODIFY-PHRASE"
                     | "ADD-PHRASE"
                     | "DELETE-PHRASE"

5.3. Enrollment Events

Enrollment may generate the following events.

   enrollment-event = "RECOGNITION-COMPLETE"

5.4. Enrollment Header Fields

An Enrollment request may contain header fields containing request options and information to augment the Request, Response or Event message it is associated with.

Some of the header fields in the following list, such as Save-Waveform and Waveform-URL, come from the MRCP Recognizer resource. They are repeated here because they are also related to enrollment operations.

   enrollment-header = num-min-consistent-pronunciations  ; Section 5.4.1
                     | consistency-threshold              ; Section 5.4.2
                     | clash-threshold                    ; Section 5.4.3
                     | personal-grammar-uri               ; Section 5.4.4
                     | phrase-id                          ; Section 5.4.5
                     | phrase-nl                          ; Section 5.4.6
                     | weight                             ; Section 5.4.7
                     | save-waveform                      ; Section 5.4.8
                     | waveform-url                       ; Section 5.4.9
                     | new-phrase-id                      ; Section 5.4.10
                     | phrase-text                        ; Section 5.4.11
                     | completion-cause                   ; Section 5.4.12

   Parameter                           Support     Methods/Events

   num-min-consistent-pronunciations   MANDATORY   START-ENROLLMENT-SESSION, SET-PARAMS, GET-PARAMS
   consistency-threshold               OPTIONAL    START-ENROLLMENT-SESSION, SET-PARAMS, GET-PARAMS
   clash-threshold                     OPTIONAL    START-ENROLLMENT-SESSION, SET-PARAMS, GET-PARAMS
   personal-grammar-uri                MANDATORY   START-ENROLLMENT-SESSION, SET-PARAMS, GET-PARAMS,
                                                   MODIFY-PHRASE, ADD-PHRASE, DELETE-PHRASE
   phrase-id                           MANDATORY   ADD-PHRASE, DELETE-PHRASE, MODIFY-PHRASE,
                                                   END-ENROLLMENT-SESSION
   phrase-nl                           MANDATORY   ADD-PHRASE, MODIFY-PHRASE, END-ENROLLMENT-SESSION
   weight                              OPTIONAL    ADD-PHRASE, MODIFY-PHRASE, END-ENROLLMENT-SESSION
   save-waveform                       MANDATORY   SET-PARAMS, GET-PARAMS, RECOGNIZE
   waveform-url                        MANDATORY   RECOGNITION-COMPLETE
   new-phrase-id                       OPTIONAL    MODIFY-PHRASE
   phrase-text                         MANDATORY   ADD-PHRASE
   completion-cause                    MANDATORY   RECOGNITION-COMPLETE

For enrollment-specific header fields that can appear as part of SET-PARAMS or GET-PARAMS methods, the following general rule applies: the START-ENROLLMENT-SESSION method must be called before these header fields can be set through the SET-PARAMS method or retrieved through the GET-PARAMS method.

   enrollment-result-elements = num-clashes                   ; Section 5.4.13
                              | num-good-repetitions          ; Section 5.4.14
                              | num-repetitions-still-needed  ; Section 5.4.15
                              | consistency-status            ; Section 5.4.16
                              | clash-phrase-id               ; Section 5.4.17

5.4.1. Num-Min-Consistent-Pronunciations

This parameter MAY be specified in a START-ENROLLMENT-SESSION, SET-PARAMS, or GET-PARAMS method and is used to specify the minimum number of consistent pronunciations that must be obtained to voice enroll a new phrase. The minimum value is 1. The default value is 2.

   num-min-consistent-pronunciations = "Num-Min-Consistent-Pronunciations" ":" 1*DIGIT CRLF

5.4.2. Consistency-Threshold

This parameter MAY be sent as part of the START-ENROLLMENT-SESSION, SET-PARAMS, or GET-PARAMS method. Used during voice enrollment, this parameter specifies how similar an utterance needs to be to a previously enrolled pronunciation of the same phrase to be considered "consistent." The higher the threshold, the closer the match between an utterance and previous pronunciations must be for the pronunciation to be considered consistent. The range for this threshold is 0 to 100.

   consistency-threshold = "Consistency-Threshold" ":" 1*DIGIT CRLF

5.4.3. Clash-Threshold

This parameter MAY be sent as part of the START-ENROLLMENT-SESSION, SET-PARAMS, or GET-PARAMS method. Used during voice enrollment, this parameter specifies how similar the pronunciations of two different phrases can be before they are considered to be clashing.
For example, pronunciations of phrases such as "John Smith" and "Jon Smits" may be so similar that they are difficult to distinguish correctly. A smaller threshold reduces the number of clashes detected. The range for this threshold is 0 to 100. The default value for this field is platform specific. clash-threshold = "Clash-Threshold" ":" 1*DIGIT CRLF 5.4.4. Personal-Grammar-URI Burnett, et al. IETF-Draft Page 16 MRCP Extensions October 2003 This parameter specifies the speaker-trained grammar to be used or referenced during enrollment operations. For example, a contact list for user "Jeff" could be stored at the Personal-Grammar- URI="http://myserver/myenrollmentdb/jeff-list". There is no default value for this header field. personal-grammar-uri = "Personal-Grammar-URI" ":" Url CRLF 5.4.5. Phrase-Id This header identifies a phrase in a personal grammar and will also be returned when doing recognition. This header field MAY occur in ADD-PHRASE, DELETE-PHRASE, MODIFY-PHRASE and END-ENROLLMENT-SESSION requests. There is no default value for this header field. phrase-id = "Phrase-ID" ":" 1*ALPHA CRLF 5.4.6. Phrase-NL This is a string specifying the natural language statement to execute when the phrase is recognized. This header field MAY occur in ADD-PHRASE, MODIFY-PHRASE and END-ENROLLMENT-SESSION requests. There is no default value for this header field. phrase-nl = "Phrase-NL" ":" 1*ALPHA CRLF 5.4.7. Weight The value of this header field represents the occurrence likelihood of this branch of the grammar. The weights are normalized to sum to one at compilation time, so use the value of Æ1Æ if you want all branches to have the same weight. This header field MAY occur in ADD-PHRASE, MODIFY-PHRASE and END-ENROLLMENT-SESSION requests. The default value is 1. weight = "Weight" ":" WEIGHT CRLF 5.4.8. 
Save-Waveform This header field is from the recognizer resource and it allows the client to indicate to the recognizer that it MUST save the audio stream that was used during the enrollment session. The recognizer MUST then record the recognized audio and make it available to the client in the form of a URL returned in the waveform-url header field in the RECOGNITION-COMPLETE event. If there was an error in recording the stream or the audio clip is otherwise not available, the recognizer MUST return an empty waveform-url header field. Burnett, et al. IETF-Draft Page 17 MRCP Extensions October 2003 save-waveform = "Save-Waveform" ":" Boolean-value CRLF 5.4.9. Waveform-URL This header field is from the recognizer resource. If the Save- Waveform header field is set to true, the recognizer MUST record the incoming audio stream of the recognition into a file and provide a URL for the client to access it. This header MUST be present in the RECOGNITION-COMPLETE event if the Save-Waveform header field was set to true. The URL value of the header MUST be empty if there was some error preventing the server from recording. Otherwise, the URL generated by the server MUST be unique across the server and all its recognition and enrollment sessions. waveform-url ="Waveform-URL" ":" Url CRLF 5.4.10. New-Phrase-Id This header field replaces the id used to identify the phrase in a personal grammar. The recognizer returns the new id when using an enrollment grammar. This header field MAY occur in MODIFY-PHRASE requests. new-phrase-id = "New-Phrase-ID" ":" 1*ALPHA CRLF 5.4.11. Phrase-Text This represents the text that will be returned by the recognizer when a text enrolled phrase is recognized. This parameter is plain text. This header field MAY occur in ADD-PHRASE requests. phrase-text = "Phrase-Text" ":" 1*ALPHA CRLF 5.4.12. 
Completion-Cause

This header field is from the recognizer resource and it MUST be specified in a RECOGNITION-COMPLETE event coming from the recognizer resource to the client. This indicates the reason behind the RECOGNIZE request completion. The error codes used for Enrollment should not clash with those for normal recognition. There are no completion-cause values specific to enrollment, so please refer to the original MRCP specification for valid completion causes.

   completion-cause = "Completion-Cause" ":" 1*DIGIT SP 1*ALPHA CRLF

5.4.13. Num-Clashes

This is not a header field, but part of the recognition results. It is returned in a RECOGNITION-COMPLETE event. Its value represents the number of clashes that this pronunciation has with other pronunciations in an active enrollment session. The header field Clash-Threshold determines the sensitivity of the clash measurement. Clash testing can be turned off completely by setting Clash-Threshold to 0.

   num-clashes = "num-clashes" ":" 1*DIGIT CRLF

5.4.14. Num-Good-Repetitions

This is not a header field, but part of the recognition results. It is returned in a RECOGNITION-COMPLETE event. Its value represents the number of consistent pronunciations obtained so far in an active enrollment session.

   num-good-repetitions = "num-good-repetitions" ":" 1*DIGIT CRLF

5.4.15. Num-Repetitions-Still-Needed

This is not a header field, but part of the recognition results. It is returned in a RECOGNITION-COMPLETE event. Its value represents the number of consistent pronunciations that must still be obtained before the new phrase can be added to the enrollment grammar. The number of consistent pronunciations required is determined by the parameter Num-Min-Consistent-Pronunciations, whose default value is two. The returned value must be 0 before the system will allow you to end an enrollment session for a new phrase.
   num-repetitions-still-needed = "num-repetitions-still-needed" ":" 1*DIGIT CRLF

5.4.16. Consistency-Status

This is not a header field, but part of the recognition results. It is returned in a RECOGNITION-COMPLETE event. This is used to indicate how consistent the repetitions are when learning a new phrase. It can have the values of CONSISTENT, INCONSISTENT and UNDECIDED.

   consistency-status = "consistency-status" ":" 1*ALPHA CRLF

5.4.17. Clash-Phrase-Ids

This is not a header field, but part of the recognition results. It is returned in a RECOGNITION-COMPLETE event. This gets filled with the phrase ids of the clashing pronunciation(s). This field is absent if there are no clashes. This MAY occur in RECOGNITION-COMPLETE events.

   phrase-id = "phrase-id" ":" 1*ALPHA CRLF

5.5. Enrollment Methods

5.5.1. START-ENROLLMENT-SESSION

The START-ENROLLMENT-SESSION method sent from the client to the server starts a new enrollment session during which the client may call RECOGNIZE to enroll a new utterance. This consists of a set of calls to RECOGNIZE in which the caller speaks a phrase several times so the system can "learn" it. You then add the phrase to a personal grammar (speaker-trained grammar), and the system can recognize it later.

Only one enrollment session may be active at a time. The Personal-Grammar-URI identifies the grammar that is used during enrollment to store the personal list of phrases. Once RECOGNIZE is called, the result is returned in a RECOGNITION-COMPLETE event and may contain either an enrollment result OR a recognition result for a regular recognition.

Calling END-ENROLLMENT-SESSION ends the ongoing enrollment session, which is typically done after a sequence of successful calls to RECOGNIZE. Alternatively a call to ABORT-ENROLLMENT-SESSION terminates the enrollment session without committing the new enrollments to the database.
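
The session lifecycle described above can be sketched from the client side. The Python below is a minimal illustration under stated assumptions, not part of the protocol: the EnrollmentSession class, the format_request helper and the example grammar URI are hypothetical names introduced here. The choice to send END-ENROLLMENT-SESSION only once num-repetitions-still-needed has been reported as 0, and ABORT-ENROLLMENT-SESSION otherwise, follows the description of that result element.

```python
from typing import Dict, Optional

CRLF = "\r\n"


def format_request(method: str, request_id: int,
                   headers: Optional[Dict[str, str]] = None) -> str:
    """Build a request in the 'METHOD request-id MRCP/1.0' shape used by the examples."""
    lines = [f"{method} {request_id} MRCP/1.0"]
    lines += [f"{name}: {value}" for name, value in (headers or {}).items()]
    return CRLF.join(lines) + CRLF + CRLF


class EnrollmentSession:
    """Hypothetical client-side bookkeeping for a single enrollment session."""

    def __init__(self) -> None:
        self.active = False
        self.still_needed: Optional[int] = None

    def start(self, request_id: int, grammar_uri: str) -> str:
        # Only one enrollment session may be active at a time.
        if self.active:
            raise RuntimeError("an enrollment session is already active")
        self.active = True
        self.still_needed = None
        return format_request("START-ENROLLMENT-SESSION", request_id,
                              {"Personal-Grammar-URI": grammar_uri})

    def record_result(self, num_repetitions_still_needed: int) -> None:
        # Value taken from the enrollment result in a RECOGNITION-COMPLETE event.
        self.still_needed = num_repetitions_still_needed

    def finish(self, request_id: int, phrase_id: str) -> str:
        # Commit only when no further repetitions are needed; otherwise abort.
        if not self.active:
            raise RuntimeError("no enrollment session is active")
        self.active = False
        if self.still_needed == 0:
            return format_request("END-ENROLLMENT-SESSION", request_id,
                                  {"Phrase-ID": phrase_id})
        return format_request("ABORT-ENROLLMENT-SESSION", request_id)
```

A client would call start() once, call record_result() after each RECOGNITION-COMPLETE event, and then call finish() to obtain the request that closes the session.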
The Personal-Grammar-URI, which specifies the grammar to contain the new enrolled phrase, will be created if it does not exist. Also, the personal grammar may ONLY contain phrases added via an enrollment session.

Example:

   C->S: START-ENROLLMENT-SESSION 543258 MRCP/1.0
         Num-Min-Consistent-Pronunciations: 2
         Consistency-Threshold: 30
         Clash-Threshold: 12
         Personal-Grammar-URI:

   S->C: MRCP/1.0 543258 200 COMPLETE

5.5.2. RECOGNIZE

The RECOGNIZE method from the client to the server starts an ongoing enrollment/recognition during which either the phrase is learned, or recognition occurs against the grammar passed to RECOGNIZE. A START-OF-SPEECH event followed by a RECOGNITION-COMPLETE event should be expected.

There can only be a single RECOGNIZE operation IN-PROGRESS at a time and this method MUST be called during an ongoing START-ENROLLMENT-SESSION if enrollment is desired.

If the RECOGNIZE request contains a Content-Id header field then the resulting grammar (which includes the personal grammar as a sub-grammar) can be referenced from elsewhere by using "session:my-grammar".

Example:

   C->S: RECOGNIZE 543259 MRCP/1.0
         Content-Type: application/grammar+xml
         Content-Id: my-grammar
         Content-Length: 123

         help cancel

   S->C: MRCP/1.0 543259 200 IN-PROGRESS

   S->C: START-OF-SPEECH 543259 200 MRCP/1.0

5.5.3. STOP

The STOP method from the client to the server may only be called during an ongoing RECOGNIZE operation and is used to abort that recognition. No RECOGNITION-COMPLETE event will follow. There is no difference in behavior for regular recognition versus an enrollment. It is included here for completeness.

Example:

   C->S: STOP 543258 MRCP/1.0

   S->C: MRCP/1.0 543258 200 COMPLETE
         Active-Request-Id-List: 543259

5.5.4. PAUSE-ENROLLMENT-SESSION

The PAUSE-ENROLLMENT-SESSION method from the client to the server may only be called during an ongoing START-ENROLLMENT-SESSION.
It may NOT be called during an ongoing RECOGNIZE operation.

This operation will pause the enrollment session. Any RECOGNIZE requests sent by the client after the session is paused will only return recognition results, not enrollment results.

This method is quietly ignored if the resource is already paused. A response indicating a success status will be returned in those cases.

Example:

   C->S: PAUSE-ENROLLMENT-SESSION 543260 MRCP/1.0

   S->C: MRCP/1.0 543260 200 COMPLETE

5.5.5. RESUME-ENROLLMENT-SESSION

The RESUME-ENROLLMENT-SESSION method from the client to the server may only be called during an ongoing START-ENROLLMENT-SESSION that has been paused. It may NOT be called during an ongoing RECOGNIZE operation.

This will resume the enrollment session. Any RECOGNIZE requests sent by the client after the session is resumed can return recognition or enrollment results.

This method is quietly ignored if the resource is already resumed. A response indicating a success status will be returned in those cases.

Example:

   C->S: RESUME-ENROLLMENT-SESSION 543261 MRCP/1.0

   S->C: MRCP/1.0 543261 200 COMPLETE

5.5.6. ENROLLMENT-ROLLBACK

The ENROLLMENT-ROLLBACK method discards the last live utterances from the RECOGNIZE operation. This method should be invoked when the caller provides undesirable input such as non-speech noises, side-speech, commands, utterances from the RECOGNIZE grammar, etc. Note that this method does not provide a stack of rollback states. Executing ENROLLMENT-ROLLBACK twice in succession without an intervening recognition operation has no effect on the second attempt.

Example:

   C->S: ENROLLMENT-ROLLBACK 543261 MRCP/1.0

   S->C: MRCP/1.0 543261 200 COMPLETE

5.5.7. END-ENROLLMENT-SESSION

The END-ENROLLMENT-SESSION method can only be called during an active enrollment session, which was started by calling the method START-ENROLLMENT-SESSION.
It may NOT be called during an ongoing RECOGNIZE operation. It should be called only when successive calls to RECOGNIZE have succeeded and Num-Repetitions-Still-Needed has been returned as 0 in the RECOGNITION-COMPLETE event.

The Phrase-ID passed to this method will be used to identify this phrase in the grammar and will be returned as the speech input when doing a RECOGNIZE on the grammar. The Phrase-NL similarly will be returned in a RECOGNITION-COMPLETE event in the same manner as other NL in a grammar. The tag-format of this NL is vendor specific.

If the client has specified Save-Waveform as true, the response should contain the location/URL of a recording of the best repetition of the learned phrase.

Example:

   C->S: END-ENROLLMENT-SESSION 543262 MRCP/1.0
         Phrase-Id:
         Phrase-NL:
         Weight: 1
         Save-Waveform: true

   S->C: MRCP/1.0 543262 200 COMPLETE
         Waveform-URL:

5.5.8. ABORT-ENROLLMENT-SESSION

The ABORT-ENROLLMENT-SESSION method may only be called during an ongoing enrollment session and is used to abort that session. It may NOT be called during an ongoing RECOGNIZE operation. After calling this function, you cannot call END-ENROLLMENT-SESSION and the phrase is not added to the personal grammar.

Example:

   C->S: ABORT-ENROLLMENT-SESSION 543263 MRCP/1.0

   S->C: MRCP/1.0 543263 200 COMPLETE

5.5.9. MODIFY-PHRASE

The MODIFY-PHRASE method sent from the client to the server is used to change the phrase ID, NL phrase and/or weight for a given phrase in a personal grammar.

If no fields are supplied then calling this method has no effect and it is silently ignored.

Example:

   C->S: MODIFY-PHRASE 543265 MRCP/1.0
         Personal-Grammar-URI:
         Phrase-Id:
         New-Phrase-Id:
         Phrase-NL:
         Weight: 1

   S->C: MRCP/1.0 543265 200 COMPLETE

5.5.10. ADD-PHRASE

The ADD-PHRASE method sent from the client to the server is used to add a text phrase to a personal grammar. The phrase must be simple text with no special characters.
As with voice enrollment, a Phrase Id, NL phrase and weight MAY be supplied.

Example:

   C->S: ADD-PHRASE 543266 MRCP/1.0
         Personal-Grammar-URI:
         Phrase-Id:
         Phrase-Text:
         Phrase-NL:
         Weight: 1

   S->C: MRCP/1.0 543266 200 COMPLETE

5.5.11. DELETE-PHRASE

The DELETE-PHRASE method sent from the client to the server is used to delete a phrase in a personal grammar added through voice enrollment or text enrollment. If the specified phrase doesn't exist, this method has no effect and it is silently ignored.

Example:

   C->S: DELETE-PHRASE 543266 MRCP/1.0
         Personal-Grammar-URI:
         Phrase-Id:

   S->C: MRCP/1.0 543266 200 COMPLETE

5.5.12. RECOGNITION-COMPLETE

The RECOGNITION-COMPLETE event follows a method call to RECOGNIZE and is used to communicate to the client the results of the enrollment. Note that the event can contain recognition or enrollment results depending on what was spoken.

Example:

   S->C: RECOGNITION-COMPLETE 543259 200 MRCP/1.0
         Completion-Cause: 000 success
         Content-Type: application/x-nlsml
         Content-Length: 123

         2 1 1 consistent Jeff Andre

6. Speaker Verification and Identification

This document captures the extensions required to implement Voice Enrollment, Speaker Verification / Identification and Hotword recognition using MRCP. This section describes the methods, responses and events needed for doing Speaker Verification / Identification.

6.1. Speaker Verification/Identification Resource

Speaker verification is a voice authentication feature that can be used to identify the speaker in order to grant the user access to sensitive information and transactions. To do this, a recorded utterance is compared to a voiceprint previously stored for that user.
Verification consists of two phases: a designation phase to establish the claimed identity of the caller and an execution phase in which a voiceprint is either created (training) or used to authenticate the claimed identity (verification).

Speaker identification identifies the speaker from a set of valid users, such as family members. Identification can be performed on a small set of users or for a large population. This feature is useful for applications where multiple users share the same account number, but where the individual speaker must be uniquely identified from the group. Speaker identification is also done in two phases, a designation phase and an execution phase.

A speaker verification resource can share the same session as an existing recognizer resource, or a speaker verification session can be SETUP to operate in standalone mode, without a recognizer resource sharing the same session. In order to share the same session, the SETUP message for the verification resource should include the RTSP session identifier of the recognizer resource it wishes to share. If no session identifier is specified, an independent verification resource, running on the same physical server or a separate one, will be set up. Some of the speaker verification methods, described below, apply only to a specific mode of operation.

The verification resource supports some buffering methods that allow the user to buffer the verification data from one or more utterances and then process this set of utterances as a single entity. This is different from collecting waveforms and processing them using the verification methods that operate directly on the incoming audio stream because the buffering mechanism does not simply accumulate utterance data to a buffer.
In particular, when both the recognition and verification resources share the same session, additional information gathered by the recognition resource is saved with these buffers to improve verification performance.

6.2. SETUP Verification/Identification Resource

The SETUP method from the client to the server is used to open a resource for verification/identification from a media server. If the session-id header field is specified in the SETUP method, the verification/identification resource would share the same session with other resources in the session. Otherwise, a new session would be created for the verification/identification resource. The resource name is 'verification-resource'.

Example:

This example assumes the verification resource would share a session that is already created.

   C->S: SETUP rtsp://media.server.com/media/verification-resource RTSP/1.0
         CSeq: 1
         Transport: RTP/AVP;unicast;client_port=46456-46457
         Session: 0a030258_00003815_3bc4873a_0001_0000

   S->C: RTSP/1.0 200 OK
         CSeq: 1
         Transport: RTP/AVP;unicast;client_port=46456-46457;
                    server_port=46460-46461
         Session: 0a030258_00003815_3bc4873a_0001_0000

6.3. Speaker Verification State Machine

Speaker Verification has the concept of training, verification and buffering sessions. Starting one of these sessions does not change the state of the verification resource, i.e. it remains idle. Once a verification or training session is started, then utterances are trained or verified by calling the VERIFY or VER-FROM-BUFFER method. The state of the Speaker Verification resource goes from IDLE to VERIFYING state each time VERIFY or VER-FROM-BUFFER is called.

6.4. Speaker Verification Methods

Speaker Verification supports the following methods.
   verification-method = "VER-START-SESSION"
                       | "VER-END-SESSION"
                       | "VER-SET-VOICEPRINT"
                       | "VER-DELETE-VOICEPRINT"
                       | "VERIFY"
                       | "VER-BUFFERING-START"
                       | "VER-BUFFERING-CONTROL"
                       | "VER-BUFFERING-STOP"
                       | "VER-FROM-BUFFER"
                       | "VER-ROLLBACK"
                       | "VER-STOP"
                       | "VER-START-TIMERS"
                       | "SET-PARAMS"
                       | "GET-PARAMS"

6.5. Verification Events

Speaker Verification may generate the following events.

   verification-event = "VERIFICATION-COMPLETE"
                      | "START-OF-SPEECH"

6.6. Verification Header Fields

A Speaker Verification request may contain header fields containing request options and information to augment the Request, Response or Event message it is associated with.

The verification result elements will be returned in a VERIFICATION-COMPLETE event containing an NLSML document [4], having a MIME-type application/x-nlsml. The current specification proposes some element names which could be incorporated into a namespace.

   verification-header = voiceprint-uri               ; Section 6.6.1
                       | voiceprint-identifier        ; Section 6.6.2
                       | voiceprint-group             ; Section 6.6.3
                       | verification-mode            ; Section 6.6.4
                       | adapt-model                  ; Section 6.6.5
                       | abort-model                  ; Section 6.6.6
                       | buffering-mode               ; Section 6.6.7
                       | security-level               ; Section 6.6.8
                       | num-min-verification-phrases ; Section 6.6.9
                       | num-max-verification-phrases ; Section 6.6.10
                       | completion-cause             ; Section 6.6.11
                       | no-input-timeout             ; Section 6.6.12
                       | save-waveform                ; Section 6.6.13
                       | waveform-url                 ; Section 6.6.14
                       | vendor-specific              ; Section 6.6.15
                       | voiceprint-exists            ; Section 6.6.16

   Parameter                     Support    Methods/Events

   voiceprint-uri                MANDATORY  VER-SET-VOICEPRINT, VER-DELETE-VOICEPRINT
   voiceprint-identifier         MANDATORY  VER-SET-VOICEPRINT, VER-DELETE-VOICEPRINT
   voiceprint-group              Optional   VER-SET-VOICEPRINT, VER-DELETE-VOICEPRINT
   verification-mode             MANDATORY  SET-PARAMS, GET-PARAMS, VERIFY, VER-FROM-BUFFER
   adapt-model                   Optional   VER-START-SESSION
   abort-model                   Optional   VER-END-SESSION
   buffering-mode                Optional   VER-BUFFERING-CONTROL
   security-level                Optional   SET-PARAMS, GET-PARAMS, VERIFY, VER-FROM-BUFFER
   num-min-verification-phrases  Optional   SET-PARAMS, GET-PARAMS, VERIFY, VER-FROM-BUFFER
   num-max-verification-phrases  Optional   SET-PARAMS, GET-PARAMS, VERIFY, VER-FROM-BUFFER
   completion-cause              MANDATORY  VERIFICATION-COMPLETE, VER-SET-VOICEPRINT, VER-DELETE-VOICEPRINT
   no-input-timeout              MANDATORY  SET-PARAMS, GET-PARAMS, VERIFY
   save-waveform                 MANDATORY  SET-PARAMS, GET-PARAMS, VERIFY
   waveform-url                  MANDATORY  VERIFICATION-COMPLETE
   vendor-specific               MANDATORY  SET-PARAMS, GET-PARAMS
   voiceprint-exists             MANDATORY  VER-SET-VOICEPRINT, VER-DELETE-VOICEPRINT

   verification-result-elements = is-valid-utterance  ; Section 6.6.17
                                | num-valid-utterance ; Section 6.6.18
                                | decision            ; Section 6.6.19
                                | num-frames          ; Section 6.6.20
                                | device              ; Section 6.6.21
                                | gender              ; Section 6.6.22
                                | matched             ; Section 6.6.23
                                | adapted             ; Section 6.6.24
                                | verification-score  ; Section 6.6.25
                                | group-name          ; Section 6.6.26
                                | member              ; Section 6.6.27
                                | score               ; Section 6.6.28

6.6.1. Voiceprint-URI

This parameter specifies the voiceprint repository to be used or referenced during speaker verification or identification operations. This header field is required in the VER-SET-VOICEPRINT and VER-DELETE-VOICEPRINT methods. If this header field is set through the SET-PARAMS method, it can be silently ignored.

   voiceprint-uri = "Voiceprint-URI" ":" Url CRLF

6.6.2. Voiceprint-Identifier

This header field specifies the claimed identity for voice verification applications. The claimed identity may be used to specify an existing voiceprint or to establish a new voiceprint. This header field is required in VER-SET-VOICEPRINT and VER-DELETE-VOICEPRINT method executions in preparation for verification application operations. The Voiceprint-Identifier is not required for identification applications except in the VER-DELETE-VOICEPRINT method when the client needs to remove an identity from a voiceprint group.
   voiceprint-identifier = "Voiceprint-Identifier" ":" 1*ALPHA CRLF

6.6.3. Voiceprint-Group

This header field specifies the voiceprint group for speaker identification operations. The voiceprint group narrows the potential voiceprint identification candidates to a subset of the voiceprints in the repository. This header field may appear in VER-SET-VOICEPRINT and VER-DELETE-VOICEPRINT method executions for speaker identification operations. If this header field is absent, then verification, not identification, operations will be executed.

   voiceprint-group = "Voiceprint-Group" ":" 1*ALPHA CRLF

6.6.4. Verification-Mode

This header field specifies the mode of the verification resource in a VERIFY or VER-FROM-BUFFER method execution. Acceptable values indicate whether the verification session should ignore audio ("idle"), train a voiceprint ("train"), or verify/identify using an existing voiceprint ("verify"). The default value for the verification resource mode is "idle". While the mode is idle, the verification resource only applies utterance end-pointing to incoming speech and potentially adds utterances to the audio buffer. Setting this header field to "train" or "verify" requires that the voiceprint or voiceprint group identifier attributes have already been set through the VER-SET-VOICEPRINT method.

Training and verification sessions both require the voiceprint URI to be specified at the start of the session. In many usage scenarios, however, the system cannot know the speaker's claimed identity until the speaker says, for example, their account number. In order to allow the first few utterances of a dialog to be both recognized and verified, the verification resource on the MRCP server retains an audio buffer. In this audio buffer, the MRCP server will accumulate recognized utterances in memory.
The application can later execute a verification method and apply the buffered utterances to the current verification session. The buffering methods are used for this purpose. When buffering is used, subsequent input utterances are added to the audio buffer for later analysis.

Some voice user interfaces may require additional user input that should not be analyzed for verification. For example, the user's input may have been recognized with low confidence and thus require a confirmation cycle. In such cases, the client should not execute the VERIFY or VER-FROM-BUFFER methods to collect and analyze the caller's input. A separate recognizer resource can analyze the caller's response without any participation on behalf of the verification resource.

Once the following conditions have been met:

   1. Voiceprint identity has been successfully established through the voiceprint identifier header fields of the VER-SET-VOICEPRINT method, and

   2. the verification mode has been set to one of "train" or "verify",

the verification resource may begin providing verification information during verification operations. The verification resource MUST reach one of the two major states ("train" or "verify") if the above two conditions hold, or it MUST report an error condition in the MRCP status code to indicate why the verification resource is not ready for action.

The value of verification-mode is persistent within a verification session. Changing the mode to a different value than the previous setting causes the verification resource to report an error if the previous setting was either "train" or "verify". If the mode is changed back to its previous value, the operation may continue.
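
The readiness conditions and the persistence rule can be illustrated with a small client-side model. The Python below is a sketch under stated assumptions rather than a definitive reading of the rule: the ModeTracker class and ModeError exception are hypothetical names, and the sketch assumes that "idle" may be interleaved freely while a switch between "train" and "verify" within a session is the change that is reported as an error.

```python
from typing import Optional

MODES = ("idle", "train", "verify")


class ModeError(Exception):
    """Stands in for an error reported by the verification resource."""


class ModeTracker:
    """Hypothetical model of verification-mode handling within one session."""

    def __init__(self) -> None:
        self.voiceprint_set = False              # condition 1: VER-SET-VOICEPRINT succeeded
        self.major_mode: Optional[str] = None    # first "train" or "verify" chosen
        self.mode = "idle"                       # default mode

    def set_voiceprint(self) -> None:
        self.voiceprint_set = True

    def set_mode(self, mode: str) -> str:
        if mode not in MODES:
            raise ValueError(f"unknown verification mode: {mode}")
        if mode != "idle":
            # "train" and "verify" both require an established voiceprint.
            if not self.voiceprint_set:
                raise ModeError("voiceprint not set; call VER-SET-VOICEPRINT first")
            # Assumption: the major mode is fixed for the rest of the session.
            if self.major_mode not in (None, mode):
                raise ModeError(f"cannot change from {self.major_mode} to {mode}")
            self.major_mode = mode
        self.mode = mode
        return self.mode
```

Under this reading, the sequence verify, idle, verify is accepted, while verify followed by train raises ModeError.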
For example:

   MRCP                               MRCP
   Server                             Client
     |                                  |
     |<--------VERIFY: mode verify------|
     |<--------VERIFY-------------------|
     |<--------VERIFY: mode idle--------|
     |<--------VERIFY-------------------|
     |<--------VERIFY: mode verify------|

The above sequence of VERIFY method requests would start a verification operation. When the verification resource is placed into idle, any subsequent audio would be ignored until the final update to verification-mode. At that time, the verification operation would continue, using the original utterances and any subsequent utterances.

   verification-mode        = "Verification-Mode" ":" verification-mode-string CRLF
   verification-mode-string = "idle" | "train" | "verify"

6.6.5. Adapt-Model

This header field indicates the desired behavior of the verification resource after a successful verification execution. If the value of this parameter is "true", the audio collected during the verification session is used to update the voiceprint to account for ongoing changes in a speaker's incoming speech characteristics. If the value is "false" (the default), the voiceprint is not updated with the latest audio. This header field MAY only occur in the VER-START-SESSION method.

   adapt-model = "Adapt-Model" ":" Boolean-value CRLF

6.6.6. Abort-Model

The Abort-Model header field indicates the desired behavior of the verification resource upon session termination. If the value of this parameter is "true", the pending changes to a voiceprint due to verification training or verification adaptation are discarded. If the value is "false" (the default), the pending changes for a training session or a successful verification session are committed to the voiceprint repository. A value of "true" for Abort-Model overrides a value of "true" for the Adapt-Model header field. This header field MAY only occur in the VER-END-SESSION method.

   abort-model = "Abort-Model" ":" Boolean-value CRLF

6.6.7.
Buffering-Mode

The Buffering-Mode header field is used to indicate which action, pausing or resuming, should be applied to a buffering session. It MUST only be used with the VER-BUFFERING-CONTROL method.

   buffering-mode = "Buffering-Mode" ":" ("pause" | "resume") CRLF

6.6.8. Security-Level

The Security-Level header field determines the range of verification scores in which a decision of 'accepted' may be declared. This header field MAY occur in SET-PARAMS, GET-PARAMS, VERIFY and VER-FROM-BUFFER methods. It can be "high" (highest security level), "medium-high", "medium" (normal security level), "medium-low", or "low" (low security level). The default value is platform specific.

   security-level        = "Security-Level" ":" security-level-string CRLF
   security-level-string = "high" | "medium-high" | "medium" | "medium-low" | "low"

6.6.9. Num-Min-Verification-Phrases

The Num-Min-Verification-Phrases header field is used to specify the minimum number of valid utterances before a positive decision is given for verification. The value for this parameter is an integer and the default value is 1. The verification resource should not announce a decision of 'accepted' unless Num-Min-Verification-Phrases utterances are available. The minimum value is 1.

   num-min-verification-phrases = "Num-Min-Verification-Phrases" ":" 1*DIGIT CRLF

6.6.10. Num-Max-Verification-Phrases

The Num-Max-Verification-Phrases header field is used to specify the number of valid utterances required before a decision is forced for verification. The verification resource MUST NOT return a decision of 'undecided' once Num-Max-Verification-Phrases have been collected and used to determine a verification score. The value for this parameter is an integer and the minimum value is 1.

   num-max-verification-phrases = "Num-Max-Verification-Phrases" ":" 1*DIGIT CRLF

6.6.11.
Completion-Cause

This header field MUST be part of a VERIFICATION-COMPLETE event coming from the verification resource to the client. This indicates the reason behind the VERIFY or VER-FROM-BUFFER method completion. This header field MUST be sent in the VERIFY, VER-FROM-BUFFER and VER-SET-VOICEPRINT responses, if they return with a failure status and a COMPLETE state.

   completion-cause = "Completion-Cause" ":" 1*DIGIT SP 1*ALPHA CRLF

   Cause-Code  Cause-Name                  Description

   000         success                     VERIFY or VER-FROM-BUFFER request completed
                                           successfully. The verify decision can be
                                           "accepted", "rejected", or "undecided".
   001         error                       VERIFY or VER-FROM-BUFFER request terminated
                                           prematurely due to a verification resource
                                           or system error.
   002         no-input-timeout            VERIFY request completed with no result due
                                           to a no-input-timeout.
   003         buffer-empty                VER-FROM-BUFFER request completed with no
                                           result due to empty buffer.
   004         invalid-phrase              VERIFY or VER-FROM-BUFFER request completed,
                                           but the required phrase was not found by a
                                           co-operative recognizer resource. This
                                           completion code is a hint that the utterance
                                           should be removed.
   005         out-of-sequence             Verification operation failed due to
                                           out-of-sequence method invocations. For
                                           example calling VERIFY before
                                           VER-SET-VOICEPRINT.
   006         voiceprint-uri-failure      Failure accessing voiceprint URI.
   007         voiceprint-uri-missing      Voiceprint-uri is not specified.
   008         voiceprint-id-missing       Voiceprint-identification is not specified.
   009         voiceprint-id-not-exist     Voiceprint-identification doesn't exist in
                                           the voiceprint repository.
   010         voiceprint-group-not-exist  Voiceprint-group doesn't exist.

6.6.12. No-Input-Timeout

The No-Input-Timeout header field sets the length of time from the start of the verification timers (see VER-START-TIMERS) until the declaration of a no-input event in the VERIFICATION-COMPLETE server event message. The value is in milliseconds.
This header field MAY occur in VERIFY, SET-PARAMS or GET-PARAMS. The value for this field ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform specific. The default value for this field is platform specific.

   no-input-timeout = "No-Input-Timeout" ":" 1*DIGIT CRLF

6.6.13. Save-Waveform

This header field allows the client to indicate to the verification resource that it MUST save the audio stream that was used for verification/identification. The verification resource MUST then record the audio and make it available to the client in the form of a URI returned in the waveform-url header field in the VERIFICATION-COMPLETE event. If there was an error in recording the stream or the audio clip is otherwise not available, the verification resource MUST return an empty waveform-url header field. The default value for this field is "false". This header field MAY appear in the VERIFY method, but NOT in the VER-FROM-BUFFER method since it can control whether or not to save the waveform for live verification / identification operations only.

   save-waveform = "Save-Waveform" ":" boolean-value CRLF

6.6.14. Waveform-URL

If the save-waveform header field is set to true, the verification resource MUST record the incoming audio stream of the verification into a file and provide a URI for the client to access it. This header MUST be present in the VERIFICATION-COMPLETE event if the save-waveform header field is set to true. The URL value of the header MUST be NULL if there was some error condition preventing the server from recording. Otherwise, the URL generated by the server SHOULD be globally unique across the server and all its verification sessions. The URL SHOULD be available until the session is torn down. Since the save-waveform header field applies only to live verification / identification operations, the waveform-url will only be returned in the VERIFICATION-COMPLETE event for live verification / identification operations.
   waveform-url = "Waveform-URL" ":" Url CRLF

6.6.15. Vendor-Specific

This set of headers allows the client to set Vendor Specific parameters.

   vendor-specific         = "Vendor-Specific-Parameters" ":"
                             vendor-specific-av-pair
                             *[";" vendor-specific-av-pair] CRLF
   vendor-specific-av-pair = vendor-av-pair-name "="
                             vendor-av-pair-value

This header can be sent in the SET-PARAMS method and is used to set vendor-specific parameters on the server. The vendor-av-pair-name can be any vendor-specific field name and conforms to the XML vendor-specific attribute naming convention. The vendor-av-pair-value is the value to set the attribute to, and needs to be quoted.

When asking the server to get the current value of these parameters, this header can be sent in the GET-PARAMS method with the list of vendor-specific attribute names to get, separated by semicolons. This header field MAY occur in SET-PARAMS or GET-PARAMS.

6.6.16. Voiceprint-Exists

This header field is returned in a VER-SET-VOICEPRINT or VER-DELETE-VOICEPRINT response. This is the status of the voiceprint specified in the VER-SET-VOICEPRINT method. For the VER-DELETE-VOICEPRINT method this field indicates the status of the voiceprint as the method execution started.

   Voiceprint-Exists = "Voiceprint-Exists" ":" Boolean-value CRLF

6.6.17. Is-Valid-Utterance

This is not a header field, but part of the verification results. It is returned in a VERIFICATION-COMPLETE event. Its value indicates if verification has determined that the last utterance is valid. A verification utterance is valid if it matches the required verification phrase, as determined by the recognizer. If the utterance was valid, you can get other information such as the acceptance decision and the score. The value can be TRUE or FALSE.

   is-valid-utterance = "is-valid-utterance" ":" Boolean-value CRLF

6.6.18. Num-Valid-Utterances

This is not a header field, but part of the verification results.
It is returned in a VERIFICATION-COMPLETE event. Its value indicates the cumulative number of valid utterances found during verification. A verification utterance is valid if it matches the required verification phrase, as determined by the recognizer.

   num-valid-utterance = "num-valid-utterance" ":" 1*DIGIT CRLF

6.6.19. Decision

This is not a header field, but part of the verification results. It is returned in a VERIFICATION-COMPLETE event. Its value indicates the decision as determined by verification. It can have the values of accepted, rejected or undecided.

   decision        = "decision" ":" decision-string CRLF
   decision-string = "accepted" | "rejected" | "undecided"

6.6.20. Num-Frames

This is not a header field, but part of the verification results. It is returned in a VERIFICATION-COMPLETE event. Its value indicates the number of 10 millisecond speech frames in the last utterance or in the accumulated set of utterances.

   num-frames = "num-frames" ":" 1*DIGIT CRLF

6.6.21. Device

This is not a header field, but part of the verification results. It is returned in a VERIFICATION-COMPLETE event. Its value indicates the apparent type of device used by the caller as determined by verification. It can have the values of cellular-phone, electret-phone and carbon-button-phone.

   device        = "device" ":" device-string CRLF
   device-string = "cellular-phone" | "electret-phone" | "carbon-button-phone"

6.6.22. Gender

This is not a header field, but part of the verification results. It is returned in a VERIFICATION-COMPLETE event. Its value indicates the apparent gender of the speaker as determined by verification. It can have the values of male or female.

   gender = "gender" ":" ("male" | "female") CRLF

6.6.23. Matched

This is not a header field, but part of the verification results. It is returned in a VERIFICATION-COMPLETE event.
When verification is trying to confirm the voiceprint, this indicates if the last utterance and the voiceprints are of the same gender and used the same type of device. It is not returned during verification training. The value can be TRUE or FALSE.

   matched = "matched" ":" Boolean-value CRLF

6.6.24. Adapted

This is not a header field, but part of the verification results. It is returned in a VERIFICATION-COMPLETE event. When verification is trying to confirm the voiceprint, this indicates if the voiceprint has been adapted as a consequence of analyzing the source utterances. It is not returned during verification training. The value can be TRUE or FALSE.

   adapted = "adapted" ":" Boolean-value CRLF

6.6.25. Verification-Score

This is not a header field, but part of the verification results. It is returned in a VERIFICATION-COMPLETE event. Its value indicates the score of the last utterance as determined by verification.

During verification, the higher the score the more likely it is that the speaker is the same one as the one who spoke the voiceprint utterances. During training, the higher the score the more likely the speaker is to have spoken all of the analyzed utterances. If there are no such utterances the score is -100.

   verification-score = "verification-score" ":" FLOAT CRLF

6.6.26. Group-Name

This is not a header field, but part of the verification results. It is returned in a VERIFICATION-COMPLETE event. Its value indicates the name of the group used in speaker identification.

   group-name = "group-name" ":" 1*ALPHA CRLF

6.6.27. Member

This is not a header field, but part of the verification results. It is returned in a VERIFICATION-COMPLETE event. Its value indicates the member in a group identified by its URI. There is one URI for each member in the group.

   member = "member" ":" 1*ALPHA CRLF

6.6.28. Score

This is not a header field, but part of the verification results.
It is returned in a VERIFICATION-COMPLETE event. This is the score associated with the identified member of the group, as returned in the member result.

   Score = "score" ":" 1*ALPHA CRLF

6.7. Verification Session Methods

These methods allow the client to control the mode and target of verification or identification operations within the context of a session. All the verification input cycles that occur within a session may be used to create, update, or validate against the voiceprint specified during the session. At the beginning of each session the verification resource is reset to a known state.

Verification/identification operations can be executed against live or buffered audio. The verification resource provides methods for controlling collection of audio data into an audio buffer, methods for collecting and evaluating live audio data, and methods for controlling the verification resource and adjusting its configured behavior.

The following methods provide controls for collecting buffered audio data from live caller utterances and subsequently evaluating the buffered audio against voiceprints:

   buffered-audio-method = "VER-FROM-BUFFER"
                         | "VER-BUFFERING-START"
                         | "VER-BUFFERING-CONTROL"
                         | "VER-BUFFERING-STOP"

The following methods provide controls for collection and/or evaluation of live audio utterances:

   live-audio-method = "VERIFY"
                     | "VER-START-TIMERS"

The following methods provide controls for configuring the verification resource and for establishing resource states:

   live-or-buffered-audio-method = "VER-START-SESSION"
                                 | "VER-END-SESSION"
                                 | "VER-SET-VOICEPRINT"
                                 | "VER-DELETE-VOICEPRINT"
                                 | "VER-ROLLBACK"
                                 | "VER-STOP"
                                 | "SET-PARAMS"
                                 | "GET-PARAMS"

6.7.1. VER-START-SESSION

The VER-START-SESSION method starts a Speaker Verification/Identification Session. Execution of this method forces the verification resource into a known initial state.
If this method is called during an ongoing verification session, the previous session is implicitly aborted. Upon completion of the VER-START-SESSION method, the verification resource MUST terminate any ongoing verification sessions, and clear any voiceprint designation.

The header field "Adapt-Model" may also be present in the start session method to indicate whether or not to adapt a voiceprint with data collected during the session (if the voiceprint verification phase succeeds). By default the voiceprint model should NOT be adapted with data from a verification session.

Before a verification/identification session is started, only the audio buffering operations VER-BUFFERING-START, VER-BUFFERING-CONTROL and VER-BUFFERING-STOP, the VER-ROLLBACK operation, and the generic SET-PARAMS and GET-PARAMS operations can be performed. The media server should return 402 (Method not valid in this state) for all other operations, such as VERIFY or VER-SET-VOICEPRINT.

Only a single session can be active at a time.

Example:

   C->S: VER-START-SESSION 314161 MRCP/1.0
         Adapt-Model: true

   S->C: MRCP/1.0 314161 200 COMPLETE

6.7.2. VER-END-SESSION

The VER-END-SESSION method terminates an ongoing verification session and releases the verification voiceprint model in one of three ways:

   a. aborting - the voiceprint adaptation or creation may be aborted so that the voiceprint remains unchanged (or is not created).

   b. committing - when terminating a voiceprint training session, the new voiceprint is committed to the repository.

   c. adapting - an existing voiceprint is modified using a successful verification.

The header field "Abort-Model" may be included in the VER-END-SESSION to control whether or not to abort any pending changes to the voiceprint. The default behavior is to commit (not abort) any pending changes to the designated voiceprint.
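The session rules of Sections 6.7.1 and 6.7.2 can be illustrated with a small model. This is a minimal sketch, not part of the protocol: the class, attribute and function names are hypothetical, and treating VER-END-SESSION as always answerable with 200 is an assumption. Only the method names, the header names and the 402 status code are taken from the text.

```python
# Methods the draft allows before a verification session has been started.
PRE_SESSION_METHODS = frozenset({
    "VER-BUFFERING-START", "VER-BUFFERING-CONTROL", "VER-BUFFERING-STOP",
    "VER-ROLLBACK", "SET-PARAMS", "GET-PARAMS",
})


def _is_true(value):
    """Interpret a boolean header value; a missing header counts as false."""
    return (value or "false").strip().lower() == "true"


class VerificationResource:
    """Hypothetical model of the session gating rules (illustration only)."""

    def __init__(self):
        self.session_active = False
        self.adapt_model = False   # default: do not adapt the voiceprint
        self.voiceprint = None
        self.last_outcome = None   # "committed" or "aborted"

    def handle(self, method, headers=None):
        headers = headers or {}
        if method == "VER-START-SESSION":
            # Any previous session is implicitly aborted and the
            # voiceprint designation is cleared.
            self.session_active = True
            self.voiceprint = None
            self.adapt_model = _is_true(headers.get("Adapt-Model"))
            return 200
        if method == "VER-END-SESSION":
            # Assumption: always answered with 200. The default is to
            # commit pending changes unless Abort-Model is true.
            if self.session_active:
                aborted = _is_true(headers.get("Abort-Model"))
                self.last_outcome = "aborted" if aborted else "committed"
            self.session_active = False
            self.voiceprint = None
            return 200
        if not self.session_active and method not in PRE_SESSION_METHODS:
            return 402  # Method not valid in this state
        if method == "VER-SET-VOICEPRINT":
            self.voiceprint = headers.get("Voiceprint-Identifier")
        return 200
```

With this model, a VERIFY sent before VER-START-SESSION yields 402, while VER-BUFFERING-START is accepted at any time.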
The VER-END-SESSION method may be safely executed multiple times without first executing the VER-START-SESSION method. Any additional executions of this method without an intervening use of the VER-START-SESSION method have no effect on the system.

Example: This example assumes there is a training session or a verification session in progress.

   C->S: VER-END-SESSION 314174 MRCP/1.0
         Abort-Model: true

   S->C: MRCP/1.0 314174 200 COMPLETE

6.7.3. VER-SET-VOICEPRINT

The VER-SET-VOICEPRINT method causes the verification resource to establish the voiceprint to be used for verification, identification, or training purposes. At this time the desired mode of the verification resource is not yet known.

The VER-SET-VOICEPRINT method can also be used to query whether or not a voiceprint exists. The response to the VER-SET-VOICEPRINT method request will contain an indication of the status of the designated voiceprint in the "Voiceprint-Exists" header field, allowing the client to determine whether to use the current voiceprint for verification, train a new voiceprint, or choose a different voiceprint.

A voiceprint location may be completely specified by providing the URI of the voiceprint repository along with attributes to locate a single voiceprint within the repository. The voiceprint repository is specified through the "Voiceprint-URI" header field, in which a URI describing the location of the voiceprint repository is given. The attributes used to locate a specific record or records within the repository depend on whether the client intends to use speaker verification or speaker identification.

In the case of speaker verification, only a single attribute is required to uniquely locate a voiceprint record within the repository. The "Voiceprint-Identity" header field MUST describe a unique voiceprint record within a given repository.
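As an illustration of the query usage described above, the following sketch builds a VER-SET-VOICEPRINT request and reads the "Voiceprint-Exists" header from the response. The function names are hypothetical, and terminating the header block with an empty line is an assumption of this sketch. The header name "Voiceprint-Identifier" follows the spelling used in the examples of this document, and the repository URI and identifiers are the ones used in the example of Section 8.2.

```python
def build_set_voiceprint(request_id, repository_uri, identifier):
    """Serialize a VER-SET-VOICEPRINT request (hypothetical helper)."""
    lines = [
        f"VER-SET-VOICEPRINT {request_id} MRCP/1.0",
        f"Voiceprint-URI: {repository_uri}",
        f"Voiceprint-Identifier: {identifier}",
    ]
    # Assumption: headers end with CRLF and an empty line closes the block.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_response(text):
    """Split a response into (status code, request state, headers)."""
    head, _, _body = text.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, _request_id, status, state = status_line.split()
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status), state, headers


def voiceprint_exists(text):
    """Return True/False from Voiceprint-Exists, or None if unavailable."""
    status, _state, headers = parse_response(text)
    value = headers.get("voiceprint-exists")
    if status != 200 or value is None:
        return None
    return value.lower() == "true"
```

A client could call voiceprint_exists() on each response while walking an n-best list of candidate identifiers, and start verification with the first identifier for which it returns True.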
In the case of speaker identification, an attribute describing the set or group of speakers from which to select a specific identity must be supplied in the VER-SET-VOICEPRINT message. The header field "Voiceprint-Group" specifies the group of voiceprints from which an identity is determined. If a new voiceprint is to be added to an existing voiceprint group, then both the voiceprint group and the new voiceprint identifier must be supplied.

In most cases, the voiceprint operations, VER-SET-VOICEPRINT and VER-DELETE-VOICEPRINT, would operate on the same voiceprint repository, but using different voiceprint records or group names. For simplicity, the 'Voiceprint-URI' header field can be omitted if it has already been set by a previous voiceprint operation. Note, however, that VER-START-SESSION clears any voiceprint designation, including the 'Voiceprint-URI'. Unlike the 'Voiceprint-URI', the 'Voiceprint-Identifier' header field MUST be specified in every voiceprint operation, and the 'Voiceprint-Group' header field MUST be specified in every voiceprint operation for identification.

Example 1: This example assumes a verification session is in progress and the voiceprint exists in the voiceprint repository.

   C->S: VER-SET-VOICEPRINT 314168 MRCP/1.0
         Voiceprint-URI:
         Voiceprint-Identifier:

   S->C: MRCP/1.0 314168 200 COMPLETE
         Voiceprint-URI:
         Voiceprint-Identifier:
         Voiceprint-Exists: true

Example 2: This example assumes a verification session is in progress and the voiceprint doesn't exist in the voiceprint repository.

   C->S: VER-SET-VOICEPRINT 314168 MRCP/1.0
         Voiceprint-URI:
         Voiceprint-Identifier:

   S->C: MRCP/1.0 314168 200 COMPLETE
         Voiceprint-URI:
         Voiceprint-Identifier:
         Voiceprint-Exists: false

Example 3: This example assumes a verification session is in progress and the 'Voiceprint-URI' header field is a bad URI.
   C->S: VER-SET-VOICEPRINT 314168 MRCP/1.0
         Voiceprint-URI:
         Voiceprint-Identifier:

   S->C: MRCP/1.0 314168 405 COMPLETE
         Voiceprint-URI:
         Voiceprint-Identifier:
         Completion-Cause: 006 voiceprint-uri-failure

Example 4: This example assumes an identification session is in progress and the group doesn't exist in the voiceprint repository.

   C->S: VER-SET-VOICEPRINT 314168 MRCP/1.0
         Voiceprint-URI:
         Voiceprint-Group:

   S->C: MRCP/1.0 314168 200 COMPLETE
         Voiceprint-URI:
         Voiceprint-Group:
         Completion-Cause: 010 voiceprint-group-not-exist

6.7.4. VER-DELETE-VOICEPRINT

The VER-DELETE-VOICEPRINT method removes a voiceprint from a repository or speaker identification group. For removal of a speaker identification voiceprint, three attributes describing the voiceprint repository, group, and voiceprint identifier are required. For removal of a speaker verification voiceprint, two attributes describing the repository and the specific voiceprint are needed.

If a single voiceprint record is specified with no group identifier information, the voiceprint record is deleted.

If a group identifier is specified but no specific voiceprint within the group, the group record is deleted, and all the voiceprints associated with that group are deleted.

If both a voiceprint record and a group identifier are specified, that voiceprint is deleted, and the group identifier is updated to no longer reference that voiceprint. If, after removing the reference to that voiceprint, the group identifier is empty, the group record is also removed.

If a voiceprint record or a voiceprint group doesn't exist, the VER-DELETE-VOICEPRINT method can silently ignore the message and still return a 200 status code.

Example: This example demonstrates a message to remove a specific voiceprint.

   C->S: VER-DELETE-VOICEPRINT 314168 MRCP/1.0
         Voiceprint-URI:
         Voiceprint-Identifier:

   S->C: MRCP/1.0 314168 200 COMPLETE

6.7.5.
VERIFY

The VERIFY method is used to send the utterance's audio stream to the verification resource, which will then process it according to the current Verification-Mode, either to train the voiceprint or verify the user.

When both a recognizer and verification resource share the same session, the VERIFY method MUST be called prior to calling the RECOGNIZE method on the recognizer resource. In such cases, media server vendors will know that verification must be enabled for a subsequent call to RECOGNIZE.

Example:

   C->S: VERIFY 543260 MRCP/1.0

   S->C: MRCP/1.0 543260 200 IN-PROGRESS

When the VERIFY request is done, the MRCP server should send a 'VERIFICATION-COMPLETE' event to the client.

6.7.6. VER-BUFFERING-START

The VER-BUFFERING-START method starts a buffering session. Upon completion of the VER-BUFFERING-START method, the audio buffer associated with the verification resource MUST be cleared.

Note that the audio buffer is independent of a verification session, so that a verification session may be started and terminated while the audio buffer continues to maintain its audio data. The lifespan of the data in the audio buffer is determined solely by the VER-BUFFERING-START and VER-BUFFERING-STOP methods during the life of the verification resource. The audio buffer is initially cleared out when a verification resource is successfully allocated from an MRCP server.

If another buffering session is in progress, this method will fail. Only a single buffering session may be in progress at a time.

Example:

   C->S: VER-BUFFERING-START 314163 MRCP/1.0

   S->C: MRCP/1.0 314163 200 COMPLETE

6.7.7. VER-BUFFERING-CONTROL

The VER-BUFFERING-CONTROL method is used to either pause or resume an active buffering session. The "Buffering-Mode" parameter MUST be used when invoking this method.

When invoked with Buffering-Mode set to pause, this method causes an active buffering session to be paused.
Subsequent utterances are not buffered.

When invoked with Buffering-Mode set to resume, this method resumes a buffering session and subsequent utterances will be buffered.

Example:

   C->S: VER-BUFFERING-CONTROL 314165 MRCP/1.0
         Buffering-Mode: pause

   S->C: MRCP/1.0 314165 200 COMPLETE

6.7.8. VER-BUFFERING-STOP

The VER-BUFFERING-STOP method terminates the active buffering session, and frees the memory holding buffered utterances.

Example:

   C->S: VER-BUFFERING-STOP 314167 MRCP/1.0

   S->C: MRCP/1.0 314167 200 COMPLETE

6.7.9. VER-FROM-BUFFER

The VER-FROM-BUFFER method begins an ongoing evaluation of the currently buffered audio against the voiceprint established through the VER-SET-VOICEPRINT method. Execution of this method without first establishing the voiceprint repository and identifier attributes produces an error response. Since a verification session may only have a single voiceprint identity at any given time, this method may not be started repeatedly without first receiving a completion response or sending a VER-STOP message.

Embedded with the request for audio evaluation is a header field to describe the desired usage of the verification resource. The value of the "Verification-Mode" header field MUST be one of either "train" or "verify".

The buffered audio is not consumed by this evaluation operation and thus VER-FROM-BUFFER may be called repeatedly using different voiceprints. Such usage is desirable to implement an n-best processing strategy to determine a voiceprint identity.

The processing initiated under a VER-FROM-BUFFER method may be terminated using the VER-STOP method.

For the VER-FROM-BUFFER method, the media server can optionally return an "IN-PROGRESS" response followed by the "VERIFICATION-COMPLETE" event.

Example: This example illustrates the usage of some buffering methods. In this scenario the client first performed a live verification, but the utterance is rejected.
In the meantime, the utterance is also saved to the audio buffer. Then, another voiceprint is used to do verification against the audio buffer and the utterance is accepted. Here, we assume both 'num-min-verification-phrases' and 'num-max-verification-phrases' are 1.

   C->S: VER-START-SESSION 314161 MRCP/1.0
         Adapt-Model: true

   S->C: MRCP/1.0 314161 200 COMPLETE

   C->S: VER-SET-VOICEPRINT 314162 MRCP/1.0
         Voiceprint-URI:
         Voiceprint-Identifier:

   S->C: MRCP/1.0 314162 200 COMPLETE
         Voiceprint-URI:
         Voiceprint-Identifier:
         Voiceprint-Exists: true

   C->S: VER-BUFFERING-START 314163 MRCP/1.0

   S->C: MRCP/1.0 314163 200 COMPLETE

   C->S: VERIFY 314164 MRCP/1.0

   S->C: MRCP/1.0 314164 200 IN-PROGRESS

   S->C: VERIFICATION-COMPLETE 314164 COMPLETE MRCP/1.0
         Completion-Cause: 000 success
         Content-Type: application/x-nlsml
         Content-Length: 123

         true 50 cellular-phone female rejected -50 1 50 cellular-phone female rejected -50

   C->S: VER-SET-VOICEPRINT 314165 MRCP/1.0
         Voiceprint-Identifier:

   S->C: MRCP/1.0 314165 200 COMPLETE
         Voiceprint-URI:
         Voiceprint-Identifier:
         Voiceprint-Exists: true

   C->S: VER-FROM-BUFFER 314166 MRCP/1.0
         Verification-Mode: verify

   S->C: MRCP/1.0 314166 200 IN-PROGRESS

   S->C: VERIFICATION-COMPLETE 314166 COMPLETE MRCP/1.0
         Completion-Cause: 000 success
         Content-Type: application/x-nlsml
         Content-Length: 123

         true 50 cellular-phone female accepted 50 1 50 cellular-phone female accepted 50

   C->S: VER-BUFFERING-STOP 314167 MRCP/1.0

   S->C: MRCP/1.0 314167 200 COMPLETE

   C->S: VER-END-SESSION 314168 MRCP/1.0

   S->C: MRCP/1.0 314168 200 COMPLETE

6.7.10. VER-ROLLBACK

The VER-ROLLBACK method discards the last buffered utterance or discards the last live utterance (when the mode is "train" or "verify"). This method should be invoked when the caller provides undesirable input such as non-speech noises, side-speech, out-of-grammar utterances, commands, etc.
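The discard behaviour can be pictured with a short sketch. It assumes a single level of undo, in line with this section's statement that VER-ROLLBACK keeps no stack of rollback states. The class and method names are hypothetical, and the list of strings merely stands in for collected utterances.

```python
class UtteranceLog:
    """Hypothetical model of VER-ROLLBACK with one level of undo."""

    def __init__(self):
        self.utterances = []
        self._can_rollback = False

    def add(self, utterance):
        """Record an utterance collected by a verification operation."""
        self.utterances.append(utterance)
        self._can_rollback = True

    def rollback(self):
        """Discard the last utterance; a repeated rollback is a no-op."""
        if self._can_rollback and self.utterances:
            self.utterances.pop()
        # Only one rollback is possible until a new utterance arrives.
        self._can_rollback = False
        return 200
```

In this model a client that receives a noisy utterance calls rollback() once; calling it again before the next utterance leaves the earlier, valid utterances untouched.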
Note that this method does not provide a stack of rollback states. Executing VER-ROLLBACK twice in succession without an intervening recognition operation has no effect on the second attempt.

Example:

   C->S: VER-ROLLBACK 314165 MRCP/1.0

   S->C: MRCP/1.0 314165 200 COMPLETE

6.7.11. VER-STOP

The VER-STOP method from the client to the server tells the verification resource to stop the VERIFY or VER-FROM-BUFFER request if one is active. If such a request is active and the VER-STOP request successfully terminated it, then the response header contains an active-request-id-list header field containing the request-id of the VERIFY or VER-FROM-BUFFER request that was terminated. In this case, no VERIFICATION-COMPLETE event will be sent for the terminated request. If there was no verify request active, then the response MUST NOT contain an active-request-id-list header field. Either way the response MUST contain a status of 200 (Success).

The VER-STOP method aborts an ongoing evaluation operation against live audio or buffered audio.

Example: This example assumes a voiceprint identity has already been established.

   C->S: VERIFY 314177 MRCP/1.0
         Verification-Mode: verify

   S->C: MRCP/1.0 314177 200 IN-PROGRESS

   C->S: VER-STOP 314178 MRCP/1.0

   S->C: MRCP/1.0 314178 200 COMPLETE
         Active-Request-Id-List: 314177

6.7.12. VER-START-TIMERS

This request is sent from the client to the verification resource to start the no-input timer, usually once the audio prompts to the caller have played to completion.

Example:

   C->S: VER-START-TIMERS 543260 MRCP/1.0

   S->C: MRCP/1.0 543260 200 COMPLETE

6.7.13. SET-PARAMS

The SET-PARAMS method, from the client to the server, tells the verification resource to set and modify its configuration parameters. If the server resource does not recognize an OPTIONAL parameter it MUST ignore that field. Many of the parameters accepted by SET-PARAMS can also be sent in another method, such as VERIFY.
The difference is that a value set through SET-PARAMS, such as the security level, applies to all future requests where applicable, whereas a value passed in a VERIFY request applies only to that request.

Example:

   C->S: SET-PARAMS 543256 MRCP/1.0
         Security-Level: high
         No-Input-Timeout: 5000

   S->C: MRCP/1.0 543256 200 COMPLETE

6.7.14. GET-PARAMS

The GET-PARAMS method, from the client to the server, asks the verification resource for its current values for parameters in the request. The client can request specific parameters from the server by sending it one or more empty parameter headers with no values. The server should then return the settings for those specific parameters only. When the client does not send a specific list of empty parameter headers, the verification resource should return the settings for all parameters. This wildcard use can be very expensive, as the number of settable parameters can be large depending on the vendor. Hence it is RECOMMENDED that the client does not use the wildcard GET-PARAMS operation very often.

Example:

   C->S: GET-PARAMS 543256 MRCP/1.0
         Security-Level:
         No-Input-Timeout:

   S->C: MRCP/1.0 543256 200 COMPLETE
         Security-Level: high
         No-Input-Timeout: 5000

6.8. Verification Session Events

6.8.1. VERIFICATION-COMPLETE

The VERIFICATION-COMPLETE event follows a call to VERIFY or VER-FROM-BUFFER and is used to communicate to the client the verification results. This event will contain only verification results.

Example:

   S->C: VERIFICATION-COMPLETE 543259 COMPLETE MRCP/1.0
         Completion-Cause: 000 success
         Content-Type: application/x-nlsml
         Content-Length: 123

         true 50 cellular-phone female accepted 50 3 150 cellular-phone female accepted 25 123456 Martha-smith 75

6.8.2.
START-OF-SPEECH

The START-OF-SPEECH event is returned from the server to the client once the server has detected speech. This event is always returned by the verification resource when speech has been detected, regardless of whether the recognizer and verification resources share the same session.

Example:

   S->C: START-OF-SPEECH 543259 IN-PROGRESS MRCP/1.0

7. Hotword Recognition

This document captures the extensions required to implement Voice Enrollment, Speaker Verification and Hotword recognition using MRCP. This section describes the methods, responses and events needed for doing Hotword recognition.

A new type of Speech Recognizer resource is presented that can be used for Hotword recognition. Unlike the primary recognizer resource, which is driven by the client for each recognition request, the secondary Hotword recognition resource is attached to the session and listens continuously until a particular command phrase is spoken.

The Hotword recognition resource can be the only recognition resource in a session or it can be attached to the same session as a primary recognizer resource, and consequently connected to the same audio stream. When a client sends a SETUP request to add a Hotword recognizer resource to an existing session, the MRCP server attaches the Hotword recognition resource in eavesdropping mode on the RTP stream already established by the primary resource.

7.1. Hotword State Machine

The difference between a Hotword recognition resource and the primary recognition resource is minor. The RECOGNIZE method is the only method allowed on a Hotword recognition resource. The only event generated is RECOGNITION-COMPLETE. The resource goes from IDLE to RECOGNIZING and back to IDLE just like a regular recognizer resource.
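The two-state cycle can be sketched as follows. This is an illustrative model only: the class and method names are hypothetical, and rejecting a RECOGNIZE request while one is already active is an assumption of the sketch rather than a rule stated in this document.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    RECOGNIZING = "RECOGNIZING"


class HotwordRecognizer:
    """Hypothetical model of the Hotword resource state machine."""

    def __init__(self):
        self.state = State.IDLE
        self.events = []

    def recognize(self):
        """RECOGNIZE: prime the resource, IDLE -> RECOGNIZING."""
        if self.state is not State.IDLE:
            # Assumption: a second RECOGNIZE while active is an error.
            raise RuntimeError("RECOGNIZE already in progress")
        self.state = State.RECOGNIZING

    def on_hotword_detected(self, phrase):
        """Emit RECOGNITION-COMPLETE and return to IDLE if primed."""
        if self.state is not State.RECOGNIZING:
            return None  # not primed, nothing is reported
        self.state = State.IDLE
        event = ("RECOGNITION-COMPLETE", phrase)
        self.events.append(event)
        return event
```

After each reported event the model is back in IDLE, so recognize() has to be called again before another Hotword can be reported.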
A Hotword recognition resource, unlike a normal recognizer resource, will not send a START-OF-SPEECH event while it is trying to locate a Hotword. The first event that will be returned once the Hotword is detected is a RECOGNITION-COMPLETE event. After a RECOGNITION-COMPLETE event is reported, the Hotword recognition resource must be primed once again by sending another RECOGNIZE request.

7.1.1. Addressing Resources

To request that a Hotword recognition resource be added to a session, a different URI must be specified in the SETUP message. The same rules apply as for other resources. That is, if no session is specified in the SETUP message, then this is considered to be the first resource added to a session. For subsequent SETUP requests, the MRCP client should indicate to the server that these resources belong to the same session by including the same session id in the SETUP request message.

There is no special order required when requesting synthesizer, recognizer or Hotword-recognizer resources.

7.2. Hotword Header Fields

Hotword recognition requests may contain the following header fields.

   Hotword-header = Hotword-Max-Seconds   ; Section 7.2.1
                  | Hotword-Min-Seconds   ; Section 7.2.2

7.2.1. Hotword-Max-Seconds

This parameter MAY be sent in a RECOGNIZE request to enable Hotword listening. It specifies the maximum length of an utterance that should be considered for Hotword. This parameter, along with Hotword-Min-Seconds, can be used to tune performance by preventing the recognizer from evaluating utterances that are too short or too long to be the Hotword. Despite the header name, the value is in milliseconds. The default is 1700 milliseconds.

   hotword-max-seconds = "Hotword-Max-Seconds" ":" 1*DIGIT CRLF

7.2.2. Hotword-Min-Seconds

This parameter MAY be sent in a RECOGNIZE request to enable Hotword listening. It specifies the minimum length of an utterance (in milliseconds) that can be considered for Hotword.
This parameter, along with Hotword-Max-Seconds, can be used to tune performance by preventing the recognizer from evaluating utterances that are too short or too long to be the Hotword. The value is in milliseconds. The default is 300 milliseconds.

   hotword-min-seconds = "Hotword-Min-Seconds" ":" 1*DIGIT CRLF

7.3. Hotword Methods

7.3.1. SETUP

The SETUP method from the client to the server is used to attach a Hotword recognizer resource to the session.

Example:

   C->S: SETUP rtsp://media.server.com/media/hotword-asr RTSP/1.0
         CSeq: 3
         Transport: RTP/AVP;unicast;client_port=8000-8001;
                    mode=record
         Session: 12345678

   S->C: RTSP/1.0 200 OK
         CSeq: 3
         Transport: RTP/AVP;unicast;client_port=8000-8001;
                    server_port=9000-9001;mode=record
         Session: 12345678

7.3.2. RECOGNIZE

The RECOGNIZE method from the client to the server starts an ongoing Hotword recognition. This operation can be stopped using the STOP method. Otherwise, the RECOGNITION-COMPLETE event will be returned when the Hotword has been recognized. The client must call RECOGNIZE once again to restart Hotword recognition.

Example:

   C->S: ANNOUNCE rtsp://media.server.com/media/hotword-asr RTSP/1.0
         Cseq: 314
         Session: 12345678
         Content-Type: application/mrcp
         Content-Length: 276

         RECOGNIZE 543259 MRCP/1.0
         Content-Type: application/grammar+xml
         Content-Length: 123
         Hotword-Min-Seconds: 300
         Hotword-Max-Seconds: 1700

   S->C: RTSP/1.0 200 OK
         Cseq: 314
         Content-Type: application/mrcp
         Content-Length: 67

         MRCP/1.0 543259 200 IN-PROGRESS

   S->C: ANNOUNCE rtsp://media.server.com/media/hotword-asr RTSP/1.0
         Cseq: 315
         Session: 12345678
         Content-Type: application/mrcp
         Content-Length: 123

         RECOGNITION-COMPLETE 543259 200 MRCP/1.0
         Completion-Cause: 000 Normal
         Content-Type: application/x-nlsml
         Content-Length: 76

         Wakeup Wakeup

   C->S: RTSP/1.0 200 OK
         Cseq: 315

8.
RTSP based Examples:

This section contains examples of typical sessions between a client and the server.

8.1. Enrollment

This example illustrates a typical enrollment session. First, an enrollment session must be started before proceeding to learn new phrases.

   C->S: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
         Cseq: 406
         Session: 12345678
         Content-Type: application/mrcp
         Content-Length: 123

         START-ENROLLMENT-SESSION 543258 MRCP/1.0
         Num-Min-Consistent-Pronunciations: 2
         Consistency-Threshold: 3000
         Clash-Threshold: 1200
         Personal-Grammar-URI:

   S->C: RTSP/1.0 200 OK
         Cseq: 406
         Content-Type: application/mrcp
         Content-Length: 86

         MRCP/1.0 543258 200 COMPLETE

Then, the application can proceed to enroll an utterance by iterating over the following command.

   C->S: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
         Cseq: 407
         Session: 12345678
         Content-Type: application/mrcp
         Content-Length: 276

         RECOGNIZE 543259 MRCP/1.0
         Content-Type: application/grammar+xml
         Content-Length: 123

         help cancel

   S->C: RTSP/1.0 200 OK
         Cseq: 407
         Content-Type: application/mrcp
         Content-Length: 67

         MRCP/1.0 543259 200 IN-PROGRESS

   S->C: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
         Cseq: 408
         Session: 12345678
         Content-Type: application/mrcp
         Content-Length: 87

         START-OF-SPEECH 543259 200 MRCP/1.0

   C->S: RTSP/1.0 200 OK
         Cseq: 408

The recognizer resource returns the enrollment status after each attempt to enroll an utterance. This repeats until the required number of pronunciations are consistent and there are no clashes with other pronunciations in the personal grammar.

   S->C: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
         Cseq: 409
         Session: 12345678
         Content-Type: application/mrcp
         Content-Length: 276

         RECOGNITION-COMPLETE 543259 200 MRCP/1.0
         Completion-Cause: 000 Normal
         Content-Type: application/x-nlsml
         Content-Length: 123

         2 1 1 consistent Jeff Andre

   C->S: RTSP/1.0 200 OK
         Cseq: 409

Finally, when the application is satisfied with the enrollment results, the enrollment is committed to the personal grammar by ending the enrollment session, as follows.

   C->S: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
         Cseq: 410
         Session: 12345678
         Content-Type: application/mrcp
         Content-Length: 123

         END-ENROLLMENT-SESSION 543260 MRCP/1.0
         Phrase-Id:
         Phrase-NL:
         Weight: 1
         Save-Waveform: true

   S->C: RTSP/1.0 200 OK
         Cseq: 410
         Content-Type: application/mrcp
         Content-Length: 67

         MRCP/1.0 543260 200 COMPLETE
         Waveform-URL:

8.2. Speaker Verification and Identification

This example illustrates a verification session. Prompts are assumed to be played by other means; the MRCP synthesizer resource is left out for simplicity.

Opening the recognizer. This is the first resource for this session. The server and client agree on a single Session ID 12345678 and a set of RTP/RTCP ports on both sides.

   C->S: SETUP rtsp://media.server.com/media/recognizer RTSP/1.0
         CSeq: 2
         Transport: RTP/AVP;unicast;client_port=46456-46457
         Content-Type: application/sdp
         Content-Length: 190

         v=0
         o=- 123 456 IN IP4 10.0.0.1
         s=Media Server
         p=+1-888-555-1212
         c=IN IP4 0.0.0.0
         t=0 0
         m=audio 0 RTP/AVP 0 96
         a=rtpmap:0 pcmu/8000
         a=rtpmap:96 telephone-event/8000
         a=fmtp:96 0-15

   S->C: RTSP/1.0 200 OK
         CSeq: 2
         Transport: RTP/AVP;unicast;client_port=46456-46457;
                    server_port=46460-46461
         Session: 12345678
         Content-Length: 190
         Content-Type: application/sdp

         v=0
         o=- 3211724219 3211724219 IN IP4 10.3.2.88
         s=Media Server
         c=IN IP4 0.0.0.0
         t=0 0
         m=audio 46460 RTP/AVP 0 96
         a=rtpmap:0 pcmu/8000
         a=rtpmap:96 telephone-event/8000
         a=fmtp:96 0-15

Opening a verification resource. Uses the existing session ID and ports.
C->S: SETUP rtsp://media.server.com/media/verification-resource RTSP/1.0
      CSeq: 3
      Transport: RTP/AVP;unicast;client_port=46456-46457;
                 mode=record;ttl=127
      Session: 12345678

S->C: RTSP/1.0 200 OK
      CSeq: 3
      Transport: RTP/AVP;unicast;client_port=46456-46457;
                 server_port=46460-46461;mode=record;ttl=127
      Session: 12345678

Start a verification session.

C->S: ANNOUNCE rtsp://media.server.com/media/verification-resource RTSP/1.0
      CSeq: 4
      Session: 12345678
      Content-Type: application/mrcp
      Content-Length: 53

      VER-START-SESSION 314161 MRCP/1.0
      Adapt-Model: true

S->C: RTSP/1.0 200 OK
      CSeq: 4
      Session: 12345678
      Content-Length: 30
      Content-Type: application/mrcp

      MRCP/1.0 314161 200 COMPLETE

Start buffering the utterance.

C->S: ANNOUNCE rtsp://media.server.com/media/verification-resource RTSP/1.0
      CSeq: 5
      Session: 12345678
      Content-Type: application/mrcp
      Content-Length: 37

      VER-BUFFERING-START 314162 MRCP/1.0

S->C: RTSP/1.0 200 OK
      CSeq: 5
      Session: 12345678
      Content-Length: 30
      Content-Type: application/mrcp

      MRCP/1.0 314162 200 COMPLETE

Start a recognition request, for example to collect the account
number.

C->S: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
      CSeq: 6
      Session: 12345678
      Content-Type: application/mrcp
      Content-Length: 188

      RECOGNIZE 314163 MRCP/1.0
      No-Input-Timeout: 7000
      Recognizer-Start-Timers: false
      Save-Waveform: true
      N-Best-List-Length: 2
      Content-Type: text/uri-list
      Content-Length: 33

      builtin:grammar/digits?length=5

S->C: RTSP/1.0 200 OK
      CSeq: 6
      Session: 12345678
      Content-Length: 33
      Content-Type: application/mrcp

      MRCP/1.0 314163 200 IN-PROGRESS

S->C: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
      CSeq: 1
      Session: 12345678
      Content-Length: 65
      Content-Type: application/mrcp

      START-OF-SPEECH 314163 IN-PROGRESS MRCP/1.0
      Proxy-Sync-Id: 1

C->S: RTSP/1.0 200 OK
      CSeq: 1

The recognition result contains two choices.
S->C: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
      CSeq: 2
      Session: 12345678
      Content-Length: 3511
      Content-Type: application/mrcp

      RECOGNITION-COMPLETE 314163 COMPLETE MRCP/1.0
      Completion-Cause: 000 success
      Waveform-URL: http://media.server.com/waveforms/utt01.wav
      Content-Type: application/x-nlsml
      Content-Length: 3280

      13579 one three five seven nine 13479 one three four seven nine

C->S: RTSP/1.0 200 OK
      CSeq: 2

Check whether the first choice from the N-best list exists in the
voiceprint repository.

C->S: ANNOUNCE rtsp://media.server.com/media/verification-resource RTSP/1.0
      CSeq: 7
      Session: 12345678
      Content-Type: application/mrcp
      Content-Length: 119

      VER-SET-VOICEPRINT 314164 MRCP/1.0
      Voiceprint-URI: http://media.server.com/VoicePrints
      Voiceprint-Identifier: 13579

Voiceprint ID 13579 does not exist.

S->C: RTSP/1.0 200 OK
      CSeq: 7
      Session: 12345678
      Content-Length: 139
      Content-Type: application/mrcp

      MRCP/1.0 314164 200 COMPLETE
      Voiceprint-URI: http://media.server.com/VoicePrints
      Voiceprint-Identifier: 13579
      Voiceprint-Exists: false

Check the second choice in the N-best list.

C->S: ANNOUNCE rtsp://media.server.com/media/verification-resource RTSP/1.0
      CSeq: 8
      Session: 12345678
      Content-Type: application/mrcp
      Content-Length: 119

      VER-SET-VOICEPRINT 314165 MRCP/1.0
      Voiceprint-URI: http://media.server.com/VoicePrints
      Voiceprint-Identifier: 13479

Voiceprint ID 13479 exists.

S->C: RTSP/1.0 200 OK
      CSeq: 8
      Session: 12345678
      Content-Length: 138
      Content-Type: application/mrcp

      MRCP/1.0 314165 200 COMPLETE
      Voiceprint-URI: http://media.server.com/VoicePrints
      Voiceprint-Identifier: 13479
      Voiceprint-Exists: true

Start verification against voiceprint 13479.
C->S: ANNOUNCE rtsp://media.server.com/media/verification-resource RTSP/1.0
      CSeq: 9
      Session: 12345678
      Content-Type: application/mrcp
      Content-Length: 54

      VER-FROM-BUFFER 314166 MRCP/1.0
      Verify-Mode: verify

S->C: RTSP/1.0 200 OK
      CSeq: 9
      Session: 12345678
      Content-Length: 33
      Content-Type: application/mrcp

      MRCP/1.0 314166 200 IN-PROGRESS

The caller is verified (assuming that num-min-verification-phrases
and num-max-verification-phrases are both 1).

S->C: ANNOUNCE rtsp://media.server.com/media/verification-resource RTSP/1.0
      CSeq: 3
      Session: 12345678
      Content-Type: application/mrcp
      Content-Length: 183

      VERIFICATION-COMPLETE 314166 COMPLETE MRCP/1.0
      Completion-Cause: 000 success
      Content-Type: application/x-nlsml
      Content-Length: 123

      true 50 cellular-phone female accepted 50 1 50 cellular-phone female accepted 50

C->S: RTSP/1.0 200 OK
      CSeq: 3

Stop the audio buffering session and clear the audio buffer.

C->S: ANNOUNCE rtsp://media.server.com/media/verification-resource RTSP/1.0
      CSeq: 10
      Session: 12345678
      Content-Type: application/mrcp
      Content-Length: 39

      VER-BUFFERING-STOP 314167 MRCP/1.0

S->C: RTSP/1.0 200 OK
      CSeq: 10
      Session: 12345678
      Content-Length: 30
      Content-Type: application/mrcp

      MRCP/1.0 314167 200 COMPLETE

End the verification session.

C->S: ANNOUNCE rtsp://media.server.com/media/verification-resource RTSP/1.0
      CSeq: 11
      Session: 12345678
      Content-Type: application/mrcp
      Content-Length: 33

      VER-END-SESSION 314168 MRCP/1.0

S->C: RTSP/1.0 200 OK
      CSeq: 11
      Session: 12345678
      Content-Length: 30
      Content-Type: application/mrcp

      MRCP/1.0 314168 200 COMPLETE

Tear down the recognizer and verification resources.

C->S: TEARDOWN rtsp://media.server.com/media/verification-resource RTSP/1.0
      CSeq: 12
      Session: 12345678

S->C: RTSP/1.0 200 OK
      CSeq: 12

C->S: TEARDOWN rtsp://media.server.com/media/recognizer RTSP/1.0
      CSeq: 13
      Session: 12345678

S->C: RTSP/1.0 200 OK
      CSeq: 13

8.3.
Hotword Recognition

Will be provided later.

9. Security Considerations

The primary additional security considerations raised by the
extensions in this document concern the use of speaker identification
and verification as security functions.

One such consideration is that individualized voiceprints are used to
identify or confirm the identity of a caller. The privacy and
integrity of these voiceprints are of high importance. Fortunately,
voiceprints are not transferred between client and server; they are
maintained by the server using the server's own security mechanisms.

Another consideration particular to these functions is the
consequence of manipulating the media (speech) stream. Some
verification technologies in use today are susceptible to
impersonation or "replay" attacks, and all are susceptible to a
denial of access attack in which an otherwise acceptable media stream
is garbled. We recommend that standard media-securing protocols such
as SRTP be used in these cases.

10. Reference Documents

[1] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
    Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
    HTTP/1.1", RFC 2616, June 1999.

[2] Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time Streaming
    Protocol (RTSP)", RFC 2326, April 1998.

[3] Shanmugham, S., et al., "A Media Resource Control Protocol
    Developed by Cisco, Nuance, and Speechworks", Internet-Draft
    draft-shanmugham-mrcp-04 (work in progress), May 1, 2003.

[4] World Wide Web Consortium, "Natural Language Semantics Markup
    Language (NLSML) for the Speech Interface Framework", W3C Working
    Draft, 30 May 2001.

[5] Bradner, S., "Key words for use in RFCs to Indicate Requirement
    Levels", RFC 2119, March 1997.

Acknowledgements

The authors would like to thank the following additional individuals
for their contributions to this document:

   Andre Gillet (Nuance Communications)
   Saravanan Shanmugham (Cisco Systems, Inc.)
Full Copyright Statement

Copyright (C) The Internet Society (2003). All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Authors' Addresses

   Daniel C. Burnett
   Nuance Communications
   1005 Hamilton Court
   Menlo Park, CA 94025-1422
   USA
   Email: burnett@nuance.com

   Pierre Forgues
   Nuance Communications Ltd.
   111 Duke Street
   Suite 4100
   Montreal, Quebec
   Canada H3C 2M1
   Email: forgues@nuance.com

   Charles Galles
   Intervoice, Inc.
   17811 Waterview Parkway
   Dallas, Texas 75252
   Email: charles.galles@intervoice.com

This document expires on April 17, 2004.