Internet Engineering Task Force                               D. Burnett
Internet-Draft                                     Nuance Communications
draft-burnett-mrcpext-01                                      P. Forgues
Expires: June 24, 2004                             Nuance Communications
                                                               C. Galles
                                                        Intervoice, Inc.
                                                       December 24, 2003
MRCP Extensions: Media Resource Control Protocol Extensions
Status of this Memo
This document is an Internet-Draft and is subject to all provisions
of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
Copyright Notice
Copyright (C) The Internet Society (2003). All Rights Reserved.
Abstract
The Media Resource Control Protocol (MRCP) is an application level
protocol to control media service resources like Speech
Synthesizers, Recognizers, Signal Generators, Signal Detectors, Fax
Servers etc. over a network. This document captures the extensions
required to implement Voice Enrollment, Speaker Verification and
Hotword recognition as well as to augment the recognizer
functionality using MRCP. The extensions are largely orthogonal to
existing features of MRCP and to each other, with an eye towards
backwards compatibility with existing features and independence of
the extensions from each other to simplify integration.
This document is published as an Internet-Draft as input for further
IETF development in this area.
MRCP Extensions October 2003
Burnett, et al. IETF-Draft Page 2
Table of Contents
Status of this Memo
Abstract
1. Introduction
2. Architecture
3. Notational Conventions
4. Recognizer resource extensions
   4.1. Recognizer Resource Extensions Methods
   4.2. Recognizer Resource Extensions Events
   4.3. Recognizer Resource Extensions Header Fields
      4.3.1. Interpret-Text
      4.3.2. Enroll-Utterance
      4.3.3. Ver-Buffer-Utterance
   4.4. RECORD
   4.5. Record Header Fields
      4.5.1. Recording-URL
      4.5.2. Ver-Buffer-Utterance
   4.6. INTERPRET
   4.7. RECORDING-COMPLETE
   4.8. INTERPRETATION-COMPLETE
5. Enrollment
   5.1. Enrollment State Machine
   5.2. Enrollment Methods
   5.3. Enrollment Events
   5.4. Enrollment Header Fields
      5.4.1. Num-Min-Consistent-Pronunciations
      5.4.2. Consistency-Threshold
      5.4.3. Clash-Threshold
      5.4.4. Personal-Grammar-URI
      5.4.5. Phrase-Id
      5.4.6. Phrase-NL
      5.4.7. Weight
      5.4.8. Save-Best-Waveform
      5.4.9. Waveform-URL
      5.4.10. New-Phrase-Id
      5.4.11. Confusable-Phrases-URI
      5.4.12. Abort-Phrase-Enrollment
      5.4.13. Completion-Cause
   5.5. Enrollment Result Elements
      5.5.1. Num-Clashes
      5.5.2. Num-Good-Repetitions
      5.5.3. Num-Repetitions-Still-Needed
      5.5.4. Consistency-Status
      5.5.5. Clash-Phrase-Ids
      5.5.6. Transcriptions
      5.5.7. Confusable-Phrases
   5.6. Enrollment Methods
      5.6.1. START-PHRASE-ENROLLMENT
      5.6.2. RECOGNIZE
      5.6.3. STOP
      5.6.4. ENROLLMENT-ROLLBACK
      5.6.5. END-PHRASE-ENROLLMENT
      5.6.6. MODIFY-PHRASE
      5.6.7. DELETE-PHRASE
      5.6.8. RECOGNITION-COMPLETE
6. Speaker Verification and Identification
   6.1. Speaker Verification/Identification Resource
   6.2. SETUP Verification/Identification Resource
   6.3. Speaker Verification State Machine
   6.4. Speaker Verification Methods
   6.5. Verification Events
   6.6. Verification Header Fields
      6.6.1. Voiceprint-URI
      6.6.2. Voiceprint-Identifier
      6.6.3. Voiceprint-Group
      6.6.4. Verification-Mode
      6.6.5. Adapt-Model
      6.6.6. Abort-Model
      6.6.7. Security-Level
      6.6.8. Num-Min-Verification-Phrases
      6.6.9. Num-Max-Verification-Phrases
      6.6.10. No-Input-Timeout
      6.6.11. Save-Waveform
      6.6.12. Waveform-URL
      6.6.13. Vendor-Specific
      6.6.14. Voiceprint-Exists
      6.6.15. Ver-Buffer-Utterance
      6.6.16. Input-Waveform-Url
      6.6.17. Verification-Type
      6.6.18. Digit-Sequence
      6.6.19. Completion-Cause
   6.7. Verification Result Elements
      6.7.1. Decision
      6.7.2. Num-Frames
      6.7.3. Device
      6.7.4. Gender
      6.7.5. Matched
      6.7.6. Adapted
      6.7.7. Verification-Score
      6.7.8. Group-Name
      6.7.9. Member
      6.7.10. Score
      6.7.11. Vendor-Specific-Results
   6.8. Verification Session Methods
      6.8.1. VER-START-SESSION
      6.8.2. VER-END-SESSION
      6.8.3. VER-SET-VOICEPRINT
      6.8.4. VER-DELETE-VOICEPRINT
      6.8.5. VERIFY
      6.8.6. VER-FROM-BUFFER
      6.8.7. VER-ROLLBACK
      6.8.8. VER-STOP
      6.8.9. VER-START-TIMERS
      6.8.10. SET-PARAMS
      6.8.11. GET-PARAMS
   6.9. Verification Session Events
      6.9.1. VERIFICATION-COMPLETE
      6.9.2. START-OF-SPEECH
7. Hotword Recognition
   7.1. Hotword State Machine
      7.1.1. Addressing Resources
   7.2. Hotword Header Fields
      7.2.1. Hotword-Max-Duration
      7.2.2. Hotword-Min-Duration
   7.3. Hotword Methods
      7.3.1. SETUP
      7.3.2. RECOGNIZE
8. RTSP based Examples:
   8.1. Enrollment
   8.2. Speaker Verification and Identification
   8.3. Hotword Recognition
9. Security Considerations
10. Reference Documents
Acknowledgements
Full Copyright Statement
Authors' Addresses
1. Introduction
The Media Resource Control Protocol (MRCP) [3] is an application
level protocol to control media service resources like Speech
Synthesizers, Recognizers, Signal Generators, Signal Detectors, Fax
Servers etc. over a network. This protocol is designed to work with
streaming protocols like RTSP (Real Time Streaming Protocol) or SIP
(Session Initiation Protocol) which help establish control
connections to external media streaming devices, and media delivery
mechanisms like RTP (Real-time Transport Protocol). MRCP supports basic
recognition and speech synthesis (TTS) capabilities.
This document captures the extensions required to implement Voice
Enrollment, Speaker Verification and Hotword recognition as well as
to augment the recognizer functionality using MRCP. Already having
functional implementations of [3], the authors developed these
extensions within that framework. It is expected that these methods
will also prove useful as information for the IETF in its
standardization efforts beyond this draft version of MRCP.
A major goal of the Recognition, Enrollment, Speaker Verification
and Hotword recognition extensions is to be backward compatible,
i.e. to implement them in such a way that previous functionality is
available without change. In addition, the MRCP extensions used for
Enrollment, Speaker Verification and Identification and Hotword
recognition are independent from one another. This means a client
can implement only the set of methods needed for a particular
integration. For example, only the Enrollment methods and responses
need to be implemented by a client, provided the server has
implemented those methods.
The extensions for Enrollment do not need a separate resource type
because they are implemented as part of the recognition resource.
Speaker Verification and Hotword recognition were defined as new
resource types since they essentially consist of either creating a
verification resource or attaching a special kind of Recognizer
resource to the session in addition to the primary Recognizer
resource (unlike Enrollment).
There is no need to change the underlying protocols to support
Enrollment, Speaker Verification or Hotword recognition. Like the
original MRCP specification, the extensions rely on a protocol like
the Real Time Streaming Protocol (RTSP) or Session Initiation
Protocol (SIP) to establish and maintain the session. The session
control protocol is also responsible for establishing the media
connection from the client to the network server.
The MRCP protocol extensions define the requests, responses and
events needed to control Voice Enrollment, Speaker Verification and
Hotword recognition features. It is assumed the state machine for a
recognition resource is preserved.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119[5].
Please send any feedback on this document directly to the authors.
2. Architecture
There is no change in architecture from the original MRCP
specification. It is assumed that Enrollment is done by a
Recognizer resource. Therefore, an appropriate SETUP message needs
to be sent and a media stream established between a client and
server before these functions are used.
Speaker Verification and Hotword recognition are slightly different.
For Speaker Verification, a new verification resource is defined.
This verification resource can be used on its own or be attached to
a session where a recognition resource is already set up.
Hotword recognition differs in that a second Recognizer resource
needs to be attached to the same session. The state machine for
this second recognizer is the same as for the primary Recognizer
resource.
The following sections describe each of the MRCP extensions
separately: (1) Recognizer resource extensions, (2) Enrollment, (3)
Speaker Verification and Identification and (4) Hotword recognition.
3. Notational Conventions
Most of the definitions and syntax follow the same format used in
the MRCP draft submission. The only new field type required
represents short floating-point numbers, which are needed to
indicate the relative weight in some of the header fields. A weight
is normalized to the range 0 to 1.
WEIGHT = ( "0" [ "." 0*3DIGIT ] ) | ( "1" [ "." 0*3("0") ] )
FLOAT = [ "+" / "-" ] 1*DIGIT [ "." 0*DIGIT ]
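As an informal illustration, the two productions above can be checked with regular expressions. The following Python sketch is not part of the protocol; the function names is_weight and is_float are hypothetical, and the patterns are a direct transcription of the ABNF as written.

```python
import re

# WEIGHT = ( "0" [ "." 0*3DIGIT ] ) | ( "1" [ "." 0*3("0") ] )
WEIGHT_RE = re.compile(r"0(?:\.[0-9]{0,3})?|1(?:\.0{0,3})?")

# FLOAT = [ "+" / "-" ] 1*DIGIT [ "." 0*DIGIT ]
FLOAT_RE = re.compile(r"[+-]?[0-9]+(?:\.[0-9]*)?")


def is_weight(text: str) -> bool:
    """Return True if text matches the WEIGHT production exactly."""
    return WEIGHT_RE.fullmatch(text) is not None


def is_float(text: str) -> bool:
    """Return True if text matches the FLOAT production exactly."""
    return FLOAT_RE.fullmatch(text) is not None
```

Note that, as written, the grammar accepts a trailing decimal point with no digits (for example "0." or "-3.") and limits a WEIGHT to three fractional digits.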
4. Recognizer resource extensions
The only new functionality added to the recognizer resource is the
inclusion of the INTERPRET and RECORD methods and the associated
INTERPRETATION-COMPLETE and RECORDING-COMPLETE events.
4.1. Recognizer Resource Extensions Methods
The following methods are supported by the recognizer resource in
addition to those already defined in [3].
recognizer-extension-method = "RECORD"
| "INTERPRET"
4.2. Recognizer Resource Extensions Events
The recognizer resource may now generate the following events in
addition to those already defined in [3].
recognizer-extension-event = "RECORDING-COMPLETE"
| "INTERPRETATION-COMPLETE"
4.3. Recognizer Resource Extensions Header Fields
The recognizer resource extensions define new header fields to
augment the request, response or event messages they are associated
with.
recognizer-extension-header = "Interpret-Text"; Section 4.3.1
| "Enroll-Utterance"; Section 4.3.2
| "Ver-Buffer-Utterance"; Sec. 4.3.3
Parameter               Support     Methods/Events/Responses
interpret-text          MANDATORY   INTERPRET
enroll-utterance        OPTIONAL    RECOGNIZE
ver-buffer-utterance    OPTIONAL    RECOGNIZE, VERIFY, RECORD
4.3.1. Interpret-Text
This header field is used to provide the text string for which a
natural language interpretation is desired. This header field MUST
be used when invoking the INTERPRET method as it cannot be set with
the SET-PARAMS method.
interpret-text = "Interpret-Text" ":" 1*OCTET CRLF
4.3.2. Enroll-Utterance
This header field is used to indicate to the Recognizer resource to
consider this utterance for training a phrase for Voice Enrollment.
If this flag is specified then this utterance will be considered
when doing proximity testing between repetitions of the same phrase
and for doing clash testing with other phrases in the grammar. This
header field is OPTIONAL in the RECOGNIZE method. The default value
for this field is false.
enroll-utterance = "Enroll-Utterance" ":" Boolean-value CRLF
4.3.3. Ver-Buffer-Utterance
This header field is used to indicate that this utterance should be
considered for Speaker Verification. This way, an application can
buffer utterances while doing regular recognition or verification
activities and speaker verification can later be requested on the
buffered utterances. This header field is OPTIONAL in the
RECOGNIZE, VERIFY or RECORD method. The default value for this field
is false.
ver-buffer-utterance = "Ver-Buffer-Utterance" ":" Boolean-value CRLF
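The default handling of the two Boolean header fields can be sketched as follows. This Python fragment is illustrative only. It assumes that Boolean-value takes the literals "true" and "false" as in the base MRCP specification, that header names are matched exactly, and that headers have already been parsed into a dictionary. The function name utterance_flags is hypothetical.

```python
from typing import Dict

# Header fields from Sections 4.3.2 and 4.3.3; both default to false.
_FLAG_HEADERS = ("Enroll-Utterance", "Ver-Buffer-Utterance")


def utterance_flags(headers: Dict[str, str]) -> Dict[str, bool]:
    """Return the effective value of each flag, applying the default."""
    flags = {}
    for name in _FLAG_HEADERS:
        value = headers.get(name, "false").strip()
        # Assumption: only the literals "true" and "false" are valid.
        if value not in ("true", "false"):
            raise ValueError(f"invalid Boolean-value for {name}: {value!r}")
        flags[name] = value == "true"
    return flags
```

A RECOGNIZE request carrying only "Enroll-Utterance: true" would therefore be treated as an enrollment utterance that is not buffered for verification.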
4.4. RECORD
The RECORD method does not invoke the recognizer resource but simply
endpoints and records the input audio stream. It saves the
endpointed audio to the URL supplied in the recording-url header
field. Currently, this URL can only use the 'file' scheme.
If a RECOGNIZE, INTERPRET or another RECORD operation is already in
progress, invoking this method will cause the response to have a
status code of 402, "Method not valid in this state", and a COMPLETE
request state.
If the recording-url is not valid, a status code of 404, "Illegal
Value for Parameter", will be returned in the response. If it is
impossible for the server to create the requested file, a status
code of 407, "Method or Operation Failed", will be returned.
If the recording-url is valid, the recording operation is initiated
and the response will indicate an IN-PROGRESS request state. The
server MAY generate a subsequent START-OF-SPEECH event when speech
is detected. Upon completion of the recording operation, the server
will generate a RECORDING-COMPLETE event.
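The response rules above can be summarized in a short Python sketch. It is illustrative rather than normative. The function name record_response and its boolean inputs are hypothetical, the order in which the checks are applied is an assumption, and so is the COMPLETE request state for the 404 and 407 cases, which the text above does not state. A URL without a scheme is treated as a relative 'file' URL.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def record_response(busy: bool,
                    recording_url: Optional[str],
                    can_create_file: bool) -> Tuple[int, str]:
    """Return (status-code, request-state) for a RECORD request."""
    if busy:
        # A RECOGNIZE, INTERPRET or RECORD operation is already in progress.
        return 402, "COMPLETE"
    # Only the 'file' scheme (or a relative URL) is accepted.
    if not recording_url or urlparse(recording_url).scheme not in ("file", ""):
        return 404, "COMPLETE"  # request state assumed, not stated in the text
    if not can_create_file:
        return 407, "COMPLETE"  # request state assumed, not stated in the text
    # Recording starts; RECORDING-COMPLETE follows later as an event.
    return 200, "IN-PROGRESS"
```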
4.5. Record Header Fields
A Record request may contain header fields containing request
options and information to augment the Request, Response or Event
message it is associated with.
record-header = recording-url           ; Section 4.5.1
              | ver-buffer-utterance    ; Section 4.5.2
Parameter               Support     Methods/Events
recording-url           MANDATORY   RECORD, SET-PARAMS, GET-PARAMS
ver-buffer-utterance    OPTIONAL    RECOGNIZE, VERIFY, RECORD
4.5.1. Recording-URL
This header field specifies the location where the audio stream
recorded by a call to the RECORD method should be saved. Currently,
this should only be a URL using the 'file' scheme. If this URL is
relative, it is resolved relative to the current working directory
of the MRCP server process.
This header field MAY be used only when invoking the RECORD,
SET-PARAMS and GET-PARAMS methods.
recording-url = "Recording-URL" ":" Url CRLF
Example:
C->S:RECORD 456234 MRCP/1.0
Recording-URL: file://mediaserver/recordings/myfile.wav
S->C:MRCP/1.0 456234 200 IN-PROGRESS
S->C:START-OF-SPEECH 456234 IN-PROGRESS MRCP/1.0
S->C:RECORDING-COMPLETE 456234 COMPLETE MRCP/1.0
Completion-Cause: 000 success
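The three kinds of start-line that appear in this exchange (request, response and event) can be told apart by where the version token sits. The Python sketch below is illustrative. The field layouts are inferred from the example above, not taken from a formal grammar, and the function name parse_start_line is hypothetical. The "C->S:" and "S->C:" prefixes are direction annotations and must be removed before parsing.

```python
from typing import Dict


def parse_start_line(line: str) -> Dict[str, str]:
    """Classify an MRCP start-line as a response, event or request."""
    parts = line.split()
    if len(parts) == 4 and parts[0].startswith("MRCP/"):
        # Response: version request-id status-code request-state
        version, request_id, status, state = parts
        return {"kind": "response", "version": version,
                "request-id": request_id, "status-code": status,
                "request-state": state}
    if len(parts) == 4 and parts[3].startswith("MRCP/"):
        # Event: event-name request-id request-state version
        name, request_id, state, version = parts
        return {"kind": "event", "name": name, "request-id": request_id,
                "request-state": state, "version": version}
    if len(parts) == 3 and parts[2].startswith("MRCP/"):
        # Request: method-name request-id version
        method, request_id, version = parts
        return {"kind": "request", "name": method,
                "request-id": request_id, "version": version}
    raise ValueError(f"unrecognized start-line: {line!r}")
```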
4.5.2. Ver-Buffer-Utterance
This header field is used to indicate that this utterance should be
considered for Speaker Verification. This way, an application can
buffer utterances while doing regular recognition or verification
activities and speaker verification can later be requested on the
buffered utterances. This header field is OPTIONAL in the
RECOGNIZE, VERIFY or RECORD method.
ver-buffer-utterance = "Ver-Buffer-Utterance" ":" Boolean-value CRLF
4.6. INTERPRET
The INTERPRET method from the client to the server takes as input an
interpret-text header, containing the text for which the semantic
interpretation is desired, and returns, via the INTERPRETATION-
COMPLETE event, an interpretation result which is very similar to
the one returned from a RECOGNIZE method invocation. Only portions
of the result relevant to acoustic matching are excluded from the
result. The interpret-text header MUST be included in the INTERPRET
request.
Recognizer grammar data is treated in the same way as it is when
issuing a RECOGNIZE method call.
If a RECOGNIZE, RECORD or another INTERPRET operation is already in
progress, invoking this method will cause the response to have a
status code of 402, "Method not valid in this state", and a COMPLETE
request state.
Example:
C->S:INTERPRET 234567 MRCP/1.0
Interpret-Text: may I speak to Andre Roy
Content-Type: application/grammar+xml
Content-Id: request1@form-level.store
Content-Length: 104
ouiyes
may I speak to
Michel TremblayAndre Roy
S->C:MRCP/1.0 234567 200 IN-PROGRESS
S->C:INTERPRETATION-COMPLETE 234567 COMPLETE MRCP/1.0
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 276
Andre Roy
may I speak to Andre Roy
4.7. RECORDING-COMPLETE
This event from the recognition resource to the client indicates
that the RECORD operation is complete. The request state MUST be
set to COMPLETE.
The completion-cause header MUST be included in this event. It MUST
be set to one of the following values defined for the recognizer
resource:
Cause-Code   Cause-Name         Description
000          success            RECORD completed successfully
002          no-input-timeout   RECORD completed with no audio
                                recorded due to lack of input
006          error              RECORD operation terminated
                                due to an error
When the completion-cause is "000 success", the URL specified via
the recording-url header in the RECORD method invocation will
contain the recorded audio. The client may then use this URL to
retrieve the audio.
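A client might check the completion-cause of a RECORDING-COMPLETE event as sketched below in Python. This is illustrative only. The function names are hypothetical, and the pattern accepts hyphens in the cause name so that "no-input-timeout" parses, which is an assumption about the intended syntax.

```python
import re
from typing import Tuple

# Completion causes permitted for RECORDING-COMPLETE (table above).
RECORD_CAUSES = {"000": "success", "002": "no-input-timeout", "006": "error"}

# Assumption: cause names may contain hyphens.
_CAUSE_RE = re.compile(r"Completion-Cause:\s*([0-9]+) ([A-Za-z-]+)")


def parse_record_completion_cause(header_line: str) -> Tuple[str, str]:
    """Return (cause-code, cause-name), rejecting values outside the table."""
    match = _CAUSE_RE.fullmatch(header_line.strip())
    if match is None:
        raise ValueError(f"malformed header: {header_line!r}")
    code, name = match.groups()
    if RECORD_CAUSES.get(code) != name:
        raise ValueError(f"cause not valid for RECORD: {code} {name}")
    return code, name


def recording_available(header_line: str) -> bool:
    """True only for '000 success', when the URL holds the recorded audio."""
    return parse_record_completion_cause(header_line)[0] == "000"
```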
Example:
C->S:RECORD 456234 MRCP/1.0
Recording-URL: file://mediaserver/recordings/myfile.wav
S->C:MRCP/1.0 456234 200 IN-PROGRESS
S->C:START-OF-SPEECH 456234 IN-PROGRESS MRCP/1.0
S->C:RECORDING-COMPLETE 456234 COMPLETE MRCP/1.0
Completion-Cause: 000 success
4.8. INTERPRETATION-COMPLETE
This event from the recognition resource to the client indicates
that the INTERPRET operation is complete. The interpretation result
is sent in the body of the MRCP message. The request state MUST be
set to COMPLETE.
The completion-cause header MUST be included in this event and MUST
be set to one of the following two values defined for the recognizer
resource:
Cause-Code   Cause-Name   Description
000          success      INTERPRET completed successfully
006          error        INTERPRET terminated due to an error
Example:
C->S:INTERPRET 234567 MRCP/1.0
Interpret-Text: may I speak to Andre Roy
Content-Type: application/grammar+xml
Content-Id: request1@form-level.store
Content-Length: 104
ouiyes
may I speak to
Michel TremblayAndre Roy
S->C:MRCP/1.0 234567 200 IN-PROGRESS
S->C:INTERPRETATION-COMPLETE 234567 COMPLETE MRCP/1.0
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 276
Andre Roy
may I speak to Andre Roy
5. Enrollment
This document captures the extensions required to implement Voice
Enrollment, Speaker Verification and Hotword recognition using MRCP.
This section describes the methods, responses and events needed for
doing Enrollment.
Enrollment is performed using a person's voice. For example, a list
of contacts can be created and maintained by recording the contacts'
names in the caller's voice. This technique is sometimes also
called speaker-dependent recognition.
Voice Enrollment has the concept of an enrollment session. A session
to add a new phrase to a personal grammar involves an initial
enrollment followed by enough repeated utterances to train the
phrase before it is committed to the personal grammar. Each time an
utterance is recorded, it is compared for similarity with the other
samples, and a clash test is performed against other entries in the
personal grammar to ensure there are no similar and confusable
entries.
Most vendors perform the enrollment feature using a Recognizer
resource. Which utterances are considered for enrollment of a new
phrase is controlled by setting a header field in the RECOGNIZE
request, rather than by pausing or resuming the phrase enrollment
session. This mechanism is explained in more detail in the
following sections.
5.1. Enrollment State Machine
Starting an enrollment session does not change the state of the
recognizer resource, i.e. it remains idle. Once an enrollment
session is started, utterances are enrolled by calling the
RECOGNIZE method repeatedly. The state of the recognizer resource
goes from IDLE to RECOGNIZING each time RECOGNIZE is called.
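The state behavior described in this section can be modelled with a small Python class. This is a sketch under stated assumptions, not a definitive implementation. The class and method names are hypothetical, the return to IDLE on RECOGNITION-COMPLETE is assumed from the base recognizer state machine, and rejecting RECOGNIZE outside IDLE is likewise an assumption.

```python
class RecognizerEnrollmentState:
    """Tracks the recognizer state across an enrollment session."""

    def __init__(self) -> None:
        self.state = "IDLE"
        self.enrollment_active = False

    def start_phrase_enrollment(self) -> None:
        # Starting an enrollment session leaves the resource state unchanged.
        self.enrollment_active = True

    def recognize(self, enroll_utterance: bool = False) -> bool:
        """Enter RECOGNIZING; return True if the utterance is enrolled."""
        if self.state != "IDLE":
            raise RuntimeError("RECOGNIZE is only accepted in the IDLE state")
        self.state = "RECOGNIZING"
        return self.enrollment_active and enroll_utterance

    def recognition_complete(self) -> None:
        # Assumption: RECOGNITION-COMPLETE returns the resource to IDLE.
        self.state = "IDLE"
```

Each enrolled repetition then corresponds to one recognize() call followed by one recognition_complete() call.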
5.2. Enrollment Methods
Enrollment supports the following methods.
enrollment-method = "START-PHRASE-ENROLLMENT"
| "RECOGNIZE"
| "STOP"
| "ENROLLMENT-ROLLBACK"
| "END-PHRASE-ENROLLMENT"
| "MODIFY-PHRASE"
| "DELETE-PHRASE"
5.3. Enrollment Events
Enrollment may generate the following events.
enrollment-event = "RECOGNITION-COMPLETE"
5.4. Enrollment Header Fields
An Enrollment request may contain header fields containing request
options and information to augment the Request, Response or Event
message it is associated with.
enrollment-header =
num-min-consistent-pronunciations ; Section 5.4.1
| consistency-threshold ; Section 5.4.2
| clash-threshold ; Section 5.4.3
| personal-grammar-uri ; Section 5.4.4
| phrase-id ; Section 5.4.5
| phrase-nl ; Section 5.4.6
| weight ; Section 5.4.7
| save-best-waveform ; Section 5.4.8
| waveform-url ; Section 5.4.9
| new-phrase-id ; Section 5.4.10
| confusable-phrases-uri ; Section 5.4.11
| abort-phrase-enrollment ; Section 5.4.12
| completion-cause ; Section 5.4.13
Parameter                  Support     Methods/Events
num-min-consistent         MANDATORY   START-PHRASE-ENROLLMENT,
  -pronunciations                      SET-PARAMS, GET-PARAMS
consistency-threshold      OPTIONAL    START-PHRASE-ENROLLMENT,
                                       SET-PARAMS, GET-PARAMS
clash-threshold            OPTIONAL    START-PHRASE-ENROLLMENT,
                                       SET-PARAMS, GET-PARAMS
personal-grammar-uri       MANDATORY   START-PHRASE-ENROLLMENT,
                                       SET-PARAMS, GET-PARAMS,
                                       MODIFY-PHRASE, DELETE-PHRASE
phrase-id                  MANDATORY   MODIFY-PHRASE, DELETE-PHRASE,
                                       START-PHRASE-ENROLLMENT
phrase-nl                  MANDATORY   MODIFY-PHRASE,
                                       START-PHRASE-ENROLLMENT
weight                     OPTIONAL    MODIFY-PHRASE,
                                       START-PHRASE-ENROLLMENT
save-best-waveform         OPTIONAL    SET-PARAMS, GET-PARAMS, RECOGNIZE
waveform-url               MANDATORY   RECOGNITION-COMPLETE
new-phrase-id              OPTIONAL    MODIFY-PHRASE
confusable-phrases-uri     OPTIONAL    RECOGNIZE
abort-phrase-enrollment    OPTIONAL    END-PHRASE-ENROLLMENT
completion-cause           MANDATORY   RECOGNITION-COMPLETE
For enrollment-specific header fields that can appear as part of
SET-PARAMS or GET-PARAMS methods, the following general rule
applies: the START-PHRASE-ENROLLMENT method must be called before
these header fields can be set through the SET-PARAMS method or
retrieved through the GET-PARAMS method.
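The general rule above can be expressed as a simple check. The Python sketch below is illustrative. The set of header fields is read from the table in this section (those listing SET-PARAMS or GET-PARAMS), the function name is hypothetical, and the caller is assumed to track whether START-PHRASE-ENROLLMENT has been called.

```python
# Enrollment header fields that the table lists for SET-PARAMS/GET-PARAMS.
ENROLLMENT_PARAMS = frozenset({
    "num-min-consistent-pronunciations",
    "consistency-threshold",
    "clash-threshold",
    "personal-grammar-uri",
    "save-best-waveform",
})


def enrollment_param_allowed(header_name: str, enrollment_started: bool) -> bool:
    """Apply the rule for one header in a SET-PARAMS or GET-PARAMS request."""
    if header_name.lower() not in ENROLLMENT_PARAMS:
        # Not an enrollment parameter, so this rule does not restrict it.
        return True
    return enrollment_started
```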
5.4.1. Num-Min-Consistent-Pronunciations
This parameter MAY be specified in a START-PHRASE-ENROLLMENT,
SET-PARAMS, or GET-PARAMS method and is used to specify the minimum
number of consistent pronunciations that must be obtained to voice
enroll a new phrase. The minimum value is 1. The default value is
platform specific and MAY be greater than 1.
num-min-consistent-pronunciations =
"Num-Min-Consistent-Pronunciations" ":" 1*DIGIT CRLF
5.4.2. Consistency-Threshold
This parameter MAY be sent as part of the START-PHRASE-ENROLLMENT,
SET-PARAMS, or GET-PARAMS method. Used during voice-enrollment,
this parameter specifies how similar an utterance needs to be to a
previously enrolled pronunciation of the same phrase to be
considered "consistent." The higher the threshold, the closer the
match between an utterance and previous pronunciations must be for
the pronunciation to be considered consistent. The range for this
threshold is 0 to 100.
consistency-threshold = "Consistency-Threshold" ":" 1*DIGIT CRLF
5.4.3. Clash-Threshold
This parameter MAY be sent as part of the START-PHRASE-ENROLLMENT,
SET-PARAMS, or GET-PARAMS method. Used during voice-enrollment, this
parameter specifies how similar the pronunciations of two different
phrases can be before they are considered to be clashing. For
example, pronunciations of phrases such as "John Smith" and "Jon
Smits" may be so similar that they are difficult to distinguish
correctly. A smaller threshold reduces the number of clashes
detected. The range for this threshold is 0 to 100. The default
value for this field is platform specific.
clash-threshold = "Clash-Threshold" ":" 1*DIGIT CRLF
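Both Consistency-Threshold and Clash-Threshold take a digit string in the range 0 to 100. A receiver could validate the value as in the following Python sketch, which is illustrative only. The function name parse_threshold is hypothetical, and raising ValueError is just one way to signal an illegal value.

```python
def parse_threshold(value: str) -> int:
    """Parse a 1*DIGIT threshold value and enforce the 0..100 range."""
    text = value.strip()
    # 1*DIGIT: one or more ASCII digits, no sign and no decimal point.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"threshold is not a digit string: {value!r}")
    number = int(text)
    if not 0 <= number <= 100:
        raise ValueError(f"threshold out of range 0..100: {number}")
    return number
```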
5.4.4. Personal-Grammar-URI
This parameter specifies the speaker-trained grammar to be used or
referenced during enrollment operations. For example, a contact
list for user "Jeff" could be stored at the URI
"http://myserver/myenrollmentdb/jeff-list". There is no default
value for this header field.
personal-grammar-uri = "Personal-Grammar-URI" ":" Url CRLF
5.4.5. Phrase-Id
This header identifies a phrase in a personal grammar and will also
be returned when doing recognition. This header field MAY occur in
START-PHRASE-ENROLLMENT, MODIFY-PHRASE or DELETE-PHRASE requests.
There is no default value for this header field.
phrase-id = "Phrase-ID" ":" 1*ALPHA CRLF
5.4.6. Phrase-NL
This is a string specifying the natural language statement to
execute when the phrase is recognized. This header field MAY occur
in START-PHRASE-ENROLLMENT and MODIFY-PHRASE requests. There is no
default value for this header field.
phrase-nl = "Phrase-NL" ":" 1*ALPHA CRLF
5.4.7. Weight
The value of this header field represents the occurrence likelihood
of this branch of the grammar. The weights are normalized to sum to
one at compilation time, so a value of '1' can be used for every
branch to give all branches the same weight. This header field MAY
occur in START-PHRASE-ENROLLMENT and MODIFY-PHRASE requests.
weight = "Weight" ":" WEIGHT CRLF
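The normalization mentioned above can be illustrated with a short Python sketch. How a given server normalizes weights at compilation time is not specified here; this is one straightforward reading (divide each weight by the total), and the function name and phrase identifiers are hypothetical.

```python
from typing import Dict


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale per-phrase weights so that they sum to one."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be greater than zero")
    return {phrase: weight / total for phrase, weight in weights.items()}


# Three phrases that all carry "Weight: 1" end up equally likely.
equal = normalize_weights({"alice": 1.0, "bob": 1.0, "carol": 1.0})
```

With three branches all set to 1, each branch receives one third of the probability mass, which is why a uniform value of 1 yields equal weighting.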
5.4.8. Save-Best-Waveform
This header field allows the client to indicate to the recognizer
that it MUST save the audio stream for the best repetition of the
phrase that was used during the enrollment session. The recognizer
MUST then record the recognized audio and make it available to the
client in the form of a URL returned in the waveform-url header
field in the RECOGNITION-COMPLETE event. If there was an error in
recording the stream or the audio clip is otherwise not available,
the recognizer MUST return an empty waveform-url header field.
save-best-waveform = "Save-Best-Waveform" ":" Boolean-value CRLF
5.4.9. Waveform-URL
If the Save-Best-Waveform header field is set to true, the
recognizer MUST record the incoming audio stream of the recognition
into a file and provide a URL for the client to access it. This
header MUST be present in the RECOGNITION-COMPLETE event if the
Save-Best-Waveform header field was set to true. The URL value of
the header MUST be empty if there was some error preventing the
server from recording. Otherwise, the URL generated by the server
MUST be unique across the server and all its recognition and
enrollment sessions.
waveform-url = "Waveform-URL" ":" Url CRLF
5.4.10. New-Phrase-Id
This header field replaces the id used to identify the phrase in a
personal grammar. The recognizer returns the new id when using an
enrollment grammar. This header field MAY occur in MODIFY-PHRASE
requests.
new-phrase-id = "New-Phrase-ID" ":" 1*ALPHA CRLF
5.4.11. Confusable-Phrases-URI
This optional header field specifies the grammar that defines
invalid phrases for enrollment. For example, typical applications
do not allow an enrolled phrase that is also a command word. This
header field MAY occur in RECOGNIZE requests.
confusable-phrases-uri
= "Confusable-Phrases-URI" ":" Url CRLF
5.4.12. Abort-Phrase-Enrollment
This header field can optionally be specified in the END-PHRASE-
ENROLLMENT method to abort the phrase enrollment, rather than
committing the phrase to the personal grammar.
abort-phrase-enrollment = "Abort-Phrase-Enrollment" ":"
Boolean-value CRLF
5.4.13. Completion-Cause
This header field is from the recognizer resource and it MUST be
specified in a RECOGNITION-COMPLETE event coming from the recognizer
resource to the client. This indicates the reason behind the
RECOGNIZE request completion.
The error codes used for Enrollment should not clash with those for
normal recognition. There are no completion-cause values specific
to enrollment, so please refer to the original MRCP specification
for valid completion causes.
completion-cause = "Completion-Cause" ":" 1*DIGIT SP
1*ALPHA CRLF
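As a rough illustration of the header syntax above, a client might split a Completion-Cause line into its numeric code and its name as follows. The function name parse_completion_cause is hypothetical. Note one deliberate deviation, which is an assumption of this sketch: the pattern also accepts hyphens in the cause name, because hyphenated names such as no-input-timeout appear elsewhere in this document, even though the rule above says 1*ALPHA.

```python
import re

# Digits, one space, then a cause name made of letters (and, as an
# assumption of this sketch, hyphens).
_CAUSE_RE = re.compile(
    r"^Completion-Cause:\s*(\d+) ([A-Za-z-]+)$", re.IGNORECASE
)


def parse_completion_cause(line):
    """Return (code, name) from a Completion-Cause header line."""
    match = _CAUSE_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a valid Completion-Cause header: %r" % line)
    return int(match.group(1)), match.group(2)


code, name = parse_completion_cause("Completion-Cause: 000 success\r\n")
```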
5.5. Enrollment Result Elements
Enrollment results can contain the following elements:
enrollment-result-elements =
num-clashes ; Section 5.5.1
| num-good-repetitions ; Section 5.5.2
| num-repetitions-still-needed; Section 5.5.3
| consistency-status ; Section 5.5.4
| clash-phrase-ids ; Section 5.5.5
| transcriptions ; Section 5.5.6
| confusable-phrases ; Section 5.5.7
5.5.1. Num-Clashes
This is not a header field, but part of the recognition results. It
is returned in a RECOGNITION-COMPLETE event. Its value represents
the number of clashes that this pronunciation has with other
pronunciations in an active enrollment session. The header field
Clash-Threshold determines the sensitivity of the clash measurement.
Clash testing can be turned off completely by setting Clash-
Threshold to 0.
num-clashes = "" 1*DIGIT "" CRLF
5.5.2. Num-Good-Repetitions
This is not a header field, but part of the recognition results. It
is returned in a RECOGNITION-COMPLETE event. Its value represents
the number of consistent pronunciations obtained so far in an active
enrollment session.
num-good-repetitions = "" 1*DIGIT
"" CRLF
5.5.3. Num-Repetitions-Still-Needed
This is not a header field, but part of the recognition results. It
is returned in a RECOGNITION-COMPLETE event. Its value represents
the number of consistent pronunciations that must still be obtained
before the new phrase can be added to the enrollment grammar. The
number of consistent pronunciations required is determined by the
parameter Num-Min-Consistent-Pronunciations, whose default value is
two. The returned value must be 0 before the system will allow you
to end an enrollment session for a new phrase.
num-repetitions-still-needed =
"" 1*DIGIT
"" CRLF
5.5.4. Consistency-Status
This is not a header field, but part of the recognition results. It
is returned in a RECOGNITION-COMPLETE event. This is used to
indicate how consistent the repetitions are when learning a new
phrase. It can have the values of CONSISTENT, INCONSISTENT and
UNDECIDED.
consistency-status = "" 1*ALPHA
"" CRLF
5.5.5. Clash-Phrase-Ids
This is not a header field, but part of the recognition results. It
is returned in a RECOGNITION-COMPLETE event. This gets filled with
the phrase ids of the clashing pronunciation(s). This field is
absent if there are no clashes. This MAY occur in RECOGNITION-
COMPLETE events.
phrase-id = "" 1*ALPHA "" CRLF
clash-phrase-ids = "" 1*phrase-id
"" CRLF
5.5.6. Transcriptions
This is not a header field, but part of the recognition results. It
is optionally returned in a RECOGNITION-COMPLETE event. This gets
filled with the transcriptions returned in the last repetition of
the phrase being enrolled. This MAY occur in RECOGNITION-COMPLETE
events.
transcription = "" 1*OCTET "" CRLF
transcriptions = "" 1*transcription
"" CRLF
5.5.7. Confusable-Phrases
This is not a header field, but part of the recognition results. It
is optionally returned in a RECOGNITION-COMPLETE event. This gets
filled with the list of phrases from a command grammar that are
confusable with the phrase being added to the personal grammar.
This MAY occur in RECOGNITION-COMPLETE events.
confusable-phrase = "" 1*OCTET "" CRLF
confusable-phrases = "" 1*confusable-phrase
"" CRLF
5.6. Enrollment Methods
5.6.1. START-PHRASE-ENROLLMENT
The START-PHRASE-ENROLLMENT method sent from the client to the
server starts a new phrase enrollment session during which the
client may call RECOGNIZE to enroll a new utterance. This consists
of a set of calls to RECOGNIZE in which the caller speaks a phrase
several times so the system can "learn" it. You then add the phrase
to a personal grammar (speaker-trained grammar), and the system can
recognize it later.
Only one phrase enrollment session may be active at a time. The
Personal-Grammar-URI identifies the grammar that is used during
enrollment to store the personal list of phrases. Once RECOGNIZE is
called, the result is returned in a RECOGNITION-COMPLETE event and
may contain either an enrollment result OR a recognition result for
a regular recognition.
Calling END-PHRASE-ENROLLMENT ends the ongoing phrase enrollment
session, which is typically done after a sequence of successful
calls to RECOGNIZE. This method can be called to commit the new
phrase to the personal grammar or to abort the phrase enrollment
session.
The Personal-Grammar-URI, which specifies the grammar to contain the
new enrolled phrase, will be created if it does not exist. Also, the
personal grammar may ONLY contain phrases added via a phrase
enrollment session.
The Phrase-ID passed to this method will be used to identify this
phrase in the grammar and will be returned as the speech input when
doing a RECOGNIZE on the grammar. The Phrase-NL similarly will be
returned in a RECOGNITION-COMPLETE event in the same manner as other
NL in a grammar. The tag-format of this NL is vendor specific.
If the client has specified Save-Best-Waveform as true, then the
response after ending the phrase enrollment session should contain
the location/URL of a recording of the best repetition of the
learned phrase.
Example:
C->S: START-PHRASE-ENROLLMENT 543258 MRCP/1.0
Num-Min-Consistent-Pronunciations: 2
Consistency-Threshold: 30
Clash-Threshold: 12
Personal-Grammar-URI:
Phrase-Id:
Phrase-NL:
Weight: 1
Save-Best-Waveform: true
S->C: MRCP/1.0 543258 200 COMPLETE
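A client might assemble a request like the one in the example with a small helper such as the following Python sketch. The helper name build_request is hypothetical, and the grammar URI, phrase ID and NL value are made-up placeholders, not values taken from this document. The sketch only builds the MRCP message text; it does not cover the transport that carries it, and it does not compute Content-Length for requests that have a body.

```python
CRLF = "\r\n"


def build_request(method, request_id, headers, body=""):
    """Serialize a request: start line, headers, blank line, body."""
    lines = [f"{method} {request_id} MRCP/1.0"]
    lines.extend(f"{name}: {value}" for name, value in headers)
    return CRLF.join(lines) + CRLF + CRLF + body


# The grammar URI, phrase ID and NL value below are hypothetical.
message = build_request(
    "START-PHRASE-ENROLLMENT",
    543258,
    [
        ("Num-Min-Consistent-Pronunciations", 2),
        ("Consistency-Threshold", 30),
        ("Clash-Threshold", 12),
        ("Personal-Grammar-URI", "http://example.com/grammars/user1"),
        ("Phrase-Id", "jeff"),
        ("Phrase-NL", "callJeff"),
        ("Weight", 1),
        ("Save-Best-Waveform", "true"),
    ],
)
```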
5.6.2. RECOGNIZE
The RECOGNIZE method from the client to the server starts an ongoing
enrollment/recognition during which either the phrase is learned, or
recognition occurs against the grammar passed to RECOGNIZE. A START-
OF-SPEECH event followed by a RECOGNITION-COMPLETE event should be
expected.
There can only be a single RECOGNIZE operation IN-PROGRESS at a time
and this method MUST be called during an ongoing START-PHRASE-
ENROLLMENT if enrollment is desired.
If the RECOGNIZE request contains a Content-Id header field then the
resulting grammar (which includes the personal grammar as a sub-
grammar) can be referenced from elsewhere by using "session:my-
grammar".
Example:
C->S: RECOGNIZE 543259 MRCP/1.0
Content-Type: application/grammar+xml
Content-Id: my-grammar
Content-Length: 123
help cancel
S->C: MRCP/1.0 543259 200 IN-PROGRESS
S->C: START-OF-SPEECH 543259 200 MRCP/1.0
5.6.3. STOP
The STOP method from the client to the server may only be called
during an ongoing RECOGNIZE operation and is used to abort that
recognition. No RECOGNITION-COMPLETE event will follow.
There is no difference in behavior for regular recognition versus an
enrollment. It is included here for completeness.
Example:
C->S: STOP 543258 MRCP/1.0
S->C: MRCP/1.0 543258 200 COMPLETE
Active-Request-Id-List: 543259
5.6.4. ENROLLMENT-ROLLBACK
The ENROLLMENT-ROLLBACK method discards the last live utterances
from the RECOGNIZE operation. This method should be invoked when the
caller provides undesirable input such as non-speech noises, side-
speech, commands, utterances from the RECOGNIZE grammar, etc. Note
that this method does not provide a stack of rollback states: if
ENROLLMENT-ROLLBACK is executed twice in succession without an
intervening recognition operation, the second invocation has no
effect.
Example:
C->S: ENROLLMENT-ROLLBACK 543261 MRCP/1.0
S->C: MRCP/1.0 543261 200 COMPLETE
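The single-level rollback behavior described above can be modeled with a small Python sketch. The class EnrollmentState and its methods are hypothetical, and the model simplifies each RECOGNIZE operation to one stored utterance; it is meant only to show that a second consecutive rollback changes nothing.

```python
class EnrollmentState:
    """Toy model of a session with one level of rollback (no stack)."""

    def __init__(self):
        self.utterances = []
        self._can_roll_back = False

    def recognize(self, utterance):
        # Each recognition adds an utterance and re-arms the rollback.
        self.utterances.append(utterance)
        self._can_roll_back = True

    def rollback(self):
        """Discard the most recent utterance; return True if it did."""
        if not self._can_roll_back:
            return False  # second rollback in a row has no effect
        self.utterances.pop()
        self._can_roll_back = False
        return True


state = EnrollmentState()
state.recognize("first repetition")
state.recognize("cough")
first = state.rollback()   # removes "cough"
second = state.rollback()  # no intervening RECOGNIZE, so nothing happens
```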
5.6.5. END-PHRASE-ENROLLMENT
The END-PHRASE-ENROLLMENT method can only be called during an active
phrase enrollment session, which was started by calling the method
START-PHRASE-ENROLLMENT. It may NOT be called during an ongoing
RECOGNIZE operation. It should be called when successive calls to
RECOGNIZE have succeeded and Num-Repetitions-Still-Needed has been
returned as 0 in the RECOGNITION-COMPLETE event to commit the new
phrase in the grammar. Alternatively, it can be called by
specifying the Abort-Phrase-Enrollment header to abort the phrase
enrollment session.
If the client has specified Save-Best-Waveform as true in the START-
PHRASE-ENROLLMENT request, then the response should contain the
location/URL of a recording of the best repetition of the learned
phrase.
Example:
C->S: END-PHRASE-ENROLLMENT 543262 MRCP/1.0
S->C: MRCP/1.0 543262 200 COMPLETE
Waveform-URL:
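The preconditions for END-PHRASE-ENROLLMENT described above could be checked on the client side roughly as in the following Python sketch. The function name and its arguments are hypothetical; a real client would derive them from its own session bookkeeping and from the Num-Repetitions-Still-Needed value in the last RECOGNITION-COMPLETE event.

```python
def end_enrollment_headers(session_active, recognize_in_progress,
                           repetitions_still_needed, abort=False):
    """Return the headers to send with END-PHRASE-ENROLLMENT.

    Raises RuntimeError when the method should not be called yet.
    """
    if not session_active:
        raise RuntimeError("no active phrase enrollment session")
    if recognize_in_progress:
        raise RuntimeError("RECOGNIZE is still in progress")
    if abort:
        # Aborting is allowed regardless of how many repetitions remain.
        return [("Abort-Phrase-Enrollment", "true")]
    if repetitions_still_needed != 0:
        raise RuntimeError("more consistent repetitions are needed")
    return []  # commit the phrase: no extra headers required


commit_headers = end_enrollment_headers(True, False, 0)
abort_headers = end_enrollment_headers(True, False, 1, abort=True)
```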
5.6.6. MODIFY-PHRASE
The MODIFY-PHRASE method sent from the client to the server is used
to change the phrase ID, NL phrase and/or weight for a given phrase
in a personal grammar.
If no fields are supplied then calling this method has no effect and
it is silently ignored.
Example:
C->S: MODIFY-PHRASE 543265 MRCP/1.0
Personal-Grammar-URI:
Phrase-Id:
New-Phrase-Id:
Phrase-NL:
Weight: 1
S->C: MRCP/1.0 543265 200 COMPLETE
5.6.7. DELETE-PHRASE
The DELETE-PHRASE method sent from the client to the server is used
to delete a phrase in a personal grammar added through voice
enrollment or text enrollment. If the specified phrase doesn’t
exist, this method has no effect and it is silently ignored.
Example:
C->S: DELETE-PHRASE 543266 MRCP/1.0
Personal-Grammar-URI:
Phrase-Id:
S->C: MRCP/1.0 543266 200 COMPLETE
5.6.8. RECOGNITION-COMPLETE
The RECOGNITION-COMPLETE event follows a method call to RECOGNIZE
and is used to communicate to the client the results of the
enrollment. Note that the event can contain recognition or
enrollment results depending on what was spoken.
Example:
S->C: RECOGNITION-COMPLETE 543259 200 MRCP/1.0
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 123
2 1 1
consistent Jeff Andre m ay b r ow k er m ax r aa k ah call 10
6. Speaker Verification and Identification
This document captures the extensions required to implement Voice
Enrollment, Speaker Verification / Identification and Hotword
recognition using MRCP. This section describes the methods,
responses and events needed for doing Speaker Verification /
Identification.
6.1. Speaker Verification/Identification Resource
Speaker verification is a voice authentication feature that can be
used to identify the speaker in order to grant the user access to
sensitive information and transactions. To do this, a recorded
utterance is compared to a voiceprint previously stored for that
user. Verification consists of two phases: a designation phase to
establish the claimed identity of the caller and an execution phase
in which a voiceprint is either created (training) or used to
authenticate the claimed identity (verification).
Speaker identification identifies the speaker from a set of valid
users, such as family members. Identification can be performed on a
small set of users or for a large population. This feature is
useful for applications where multiple users share the same account
number, but where the individual speaker must be uniquely identified
from the group. Speaker identification is also done in two phases,
a designation phase and an execution phase.
A speaker verification resource can either share a session with an
existing recognizer resource or be SETUP to operate in standalone
mode, without a recognizer resource sharing the same session. In
order to share a session, the SETUP message for the verification
resource should include the RTSP session identifier of the
recognizer resource whose session it wishes to share. If no session
identifier is specified, an independent verification resource,
running on the same physical server or a separate one, will be set
up.
Some of the speaker verification methods, described below, apply
only to a specific mode of operation.
The verification resource supports some buffering methods that allow
the user to buffer the verification data from one or more utterances
and then process this set of utterances as a single entity. This is
different from collecting waveforms and processing them using the
verification methods that operate directly on the incoming audio
stream because the buffering mechanism does not simply accumulate
utterance data to a buffer. In particular, when both the
recognition and verification resources share the same session,
additional information gathered by the recognition resource is saved
with these buffers to improve verification performance.
6.2. SETUP Verification/Identification Resource
The SETUP method from the client to the server is used to open a
resource for verification/identification from a media server. If a
session-id header field is specified in the SETUP method, the
verification/identification resource shares the same session with
the other resources in that session. Otherwise, a new session is
created for the verification/identification resource. The resource
name is ’verification-resource’.
Example:
This example assumes the verification resource would share a session
that is already created.
C->S: SETUP rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 1
Transport: RTP/AVP;unicast;client_port=46456-46457
Session: 0a030258_00003815_3bc4873a_0001_0000
S->C: RTSP/1.0 200 OK
CSeq: 1
Transport: RTP/AVP;unicast;client_port=46456-46457;
server_port=46460-46461
Session: 0a030258_00003815_3bc4873a_0001_0000
6.3. Speaker Verification State Machine
Speaker Verification has the concept of training, verification and
buffering sessions. Starting one of these sessions does not change
the state of the verification resource, i.e. it remains IDLE. Once
a verification or training session is started, utterances are
trained or verified by calling the VERIFY or VER-FROM-BUFFER method.
The state of the Speaker Verification resource goes from IDLE to
VERIFYING each time VERIFY or VER-FROM-BUFFER is called.
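The transitions described above can be sketched as a two-state machine in Python. The class and method names are hypothetical. Two points are assumptions of this sketch rather than statements from this section: the resource returns to IDLE when the operation completes, and a VERIFY issued while already VERIFYING is treated as an error.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    VERIFYING = "VERIFYING"


class VerificationResource:
    def __init__(self):
        self.state = State.IDLE

    def on_method(self, method):
        if method in ("VERIFY", "VER-FROM-BUFFER"):
            if self.state is not State.IDLE:
                # Assumption: overlapping operations are rejected.
                raise RuntimeError("a verification is already running")
            self.state = State.VERIFYING
        # Other methods, such as VER-START-SESSION, leave the state as is.
        return self.state

    def on_complete(self):
        # Assumption: completion returns the resource to IDLE.
        self.state = State.IDLE
        return self.state


resource = VerificationResource()
after_start = resource.on_method("VER-START-SESSION")
after_verify = resource.on_method("VERIFY")
after_complete = resource.on_complete()
```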
6.4. Speaker Verification Methods
Speaker Verification supports the following methods.
verification-method = "VER-START-SESSION"
| "VER-END-SESSION"
| "VER-SET-VOICEPRINT"
| "VER-DELETE-VOICEPRINT"
| "VERIFY"
| "VER-FROM-BUFFER"
| "VER-ROLLBACK"
| "VER-STOP"
| "VER-START-TIMERS"
| "SET-PARAMS"
| "GET-PARAMS"
6.5. Verification Events
Speaker Verification may generate the following events.
verification-event = "VERIFICATION-COMPLETE"
| "START-OF-SPEECH"
6.6. Verification Header Fields
A Speaker Verification request may contain header fields containing
request options and information to augment the Request, Response or
Event message it is associated with.
The verification result elements will be returned in a VERIFICATION-
COMPLETE event containing an NLSML document [4], having a MIME-type
application/x-nlsml.
verification-header =
voiceprint-uri ; Section 6.6.1
| voiceprint-identifier ; Section 6.6.2
| voiceprint-group ; Section 6.6.3
| verification-mode ; Section 6.6.4
| adapt-model ; Section 6.6.5
| abort-model ; Section 6.6.6
| security-level ; Section 6.6.7
| num-min-verification-phrases; Section 6.6.8
| num-max-verification-phrases; Section 6.6.9
| no-input-timeout ; Section 6.6.10
| save-waveform ; Section 6.6.11
| waveform-url ; Section 6.6.12
| vendor-specific ; Section 6.6.13
| voiceprint-exists ; Section 6.6.14
| ver-buffer-utterance ; Section 6.6.15
| input-waveform-url ; Section 6.6.16
| verification-type ; Section 6.6.17
| digit-sequence ; Section 6.6.18
| completion-cause ; Section 6.6.19
Parameter                      Support    Methods/Events

voiceprint-uri                 MANDATORY  VER-SET-VOICEPRINT,
                                          VER-DELETE-VOICEPRINT
voiceprint-identifier          MANDATORY  VER-SET-VOICEPRINT,
                                          VER-DELETE-VOICEPRINT
voiceprint-group               OPTIONAL   VER-SET-VOICEPRINT,
                                          VER-DELETE-VOICEPRINT
verification-mode              MANDATORY  SET-PARAMS, GET-PARAMS,
                                          VERIFY, VER-FROM-BUFFER
adapt-model                    OPTIONAL   VER-START-SESSION
abort-model                    OPTIONAL   VER-END-SESSION
security-level                 OPTIONAL   SET-PARAMS, GET-PARAMS,
                                          VERIFY, VER-FROM-BUFFER
num-min-verification-phrases   OPTIONAL   SET-PARAMS, GET-PARAMS,
                                          VERIFY, VER-FROM-BUFFER
num-max-verification-phrases   OPTIONAL   SET-PARAMS, GET-PARAMS,
                                          VERIFY, VER-FROM-BUFFER
no-input-timeout               MANDATORY  SET-PARAMS, GET-PARAMS,
                                          VERIFY
save-waveform                  MANDATORY  SET-PARAMS, GET-PARAMS,
                                          VERIFY
waveform-url                   MANDATORY  VERIFICATION-COMPLETE
vendor-specific                MANDATORY  SET-PARAMS, GET-PARAMS
voiceprint-exists              MANDATORY  VER-SET-VOICEPRINT,
                                          VER-DELETE-VOICEPRINT
ver-buffer-utterance           OPTIONAL   RECOGNIZE, VERIFY, RECORD
input-waveform-url             OPTIONAL   VERIFY
verification-type              OPTIONAL   START-PHRASE-ENROLLMENT
digit-sequence                 OPTIONAL   START-PHRASE-ENROLLMENT
completion-cause               MANDATORY  VERIFICATION-COMPLETE,
                                          VER-SET-VOICEPRINT,
                                          VER-DELETE-VOICEPRINT
6.6.1. Voiceprint-URI
This header field specifies the voiceprint repository to be used or
referenced during speaker verification or identification operations.
This header field is required in the VER-SET-VOICEPRINT and
VER-DELETE-VOICEPRINT methods. If this header field is set through
the SET-PARAMS method, it can be silently ignored.
voiceprint-uri = "Voiceprint-URI" ":" Url CRLF
6.6.2. Voiceprint-Identifier
This header field specifies the claimed identity for voice
verification applications. The claimed identity may be used to
specify an existing voiceprint or to establish a new voiceprint.
This header field is required in VER-SET-VOICEPRINT and VER-DELETE-
VOICEPRINT method executions in preparation for verification
application operations. The Voiceprint-Identifier is not required
for identification applications except in the VER-DELETE-VOICEPRINT
method when the client needs to remove an identity from a voiceprint
group.
voiceprint-identifier = "Voiceprint-Identifier" ":" 1*ALPHA CRLF
6.6.3. Voiceprint-Group
This header field specifies the voiceprint group for speaker
identification operations. The voiceprint group narrows the
potential voiceprint identification candidates to a subset of the
voiceprints in the repository. This header field may appear in VER-
SET-VOICEPRINT and VER-DELETE-VOICEPRINT method executions for
speaker identification operations. If this header field is absent,
then verification, not identification, operations will be executed.
voiceprint-group = "Voiceprint-Group" ":" 1*ALPHA CRLF
6.6.4. Verification-Mode
This header field specifies the mode of the verification resource in
a VERIFY or VER-FROM-BUFFER method execution. Acceptable values
indicate whether the verification session should train a voiceprint
("train") or verify/identify using an existing voiceprint
("verify").
Setting this header field to "train" or "verify" requires that the
voiceprint or voiceprint group identifier attributes have already
been set through the VER-SET-VOICEPRINT method.
Training and verification sessions both require the voiceprint URI
to be specified at the start of the session. In many usage
scenarios, however, the system cannot know the speaker’s claimed
identity until the speaker says, for example, their account number.
In order to allow the first few utterances of a dialog to be both
recognized and verified, the verification resource on the MRCP
server retains an audio buffer. In this audio buffer, the MRCP
server will accumulate recognized utterances in memory. The
application can later execute a verification method and apply the
buffered utterances to the current verification session. The
buffering methods are used for this purpose. When buffering is used,
subsequent input utterances are added to the audio buffer for later
analysis.
Some voice user interfaces may require additional user input that
should not be analyzed for verification. For example, the user’s
input may have been recognized with low confidence and thus require
a confirmation cycle. In such cases, the client should not execute
the VERIFY or VER-FROM-BUFFER methods to collect and analyze the
caller’s input. A separate recognizer resource can analyze the
caller’s response without any participation on behalf of the
verification resource.
Once the following conditions have been met:
1. Voiceprint identity has been successfully established through the
voiceprint identifier header fields of the VER-SET-VOICEPRINT
method, and
2. the verification mode has been set to one of "train" or "verify",
the verification resource may begin providing verification
information during verification operations. The verification
resource MUST reach one of the two major states ("train" or
"verify") if the above two conditions hold, or it MUST report an
error condition in the MRCP status code to indicate why the
verification resource is not ready for action.
The value of verification-mode is persistent within a verification
session. Changing the mode to a different value than the previous
setting causes the verification resource to report an error if the
previous setting was either "train" or "verify". If the mode is
changed back to its previous value, the operation may continue.
verification-mode = "Verification-Mode" ":"
verification-mode-string
verification-mode-string = "train"
| "verify"
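The persistence rule for Verification-Mode can be illustrated with the following Python sketch. The class ModeTracker is hypothetical; it returns True when an operation may proceed and False when the resource would report an error. How the error is signaled in the MRCP status code is not modeled here.

```python
class ModeTracker:
    """Tracks the persistent Verification-Mode of one session."""

    VALID_MODES = ("train", "verify")

    def __init__(self):
        self.mode = None  # nothing set yet in this session

    def request(self, mode):
        """Return True if an operation in this mode may proceed."""
        if mode not in self.VALID_MODES:
            raise ValueError("unknown verification mode: %r" % mode)
        if self.mode is None:
            self.mode = mode  # the first setting becomes persistent
            return True
        # A different value is an error; the original value is kept, so
        # switching back to it lets the operation continue.
        return mode == self.mode


tracker = ModeTracker()
results = [
    tracker.request("train"),   # first setting
    tracker.request("verify"),  # differs from "train": error
    tracker.request("train"),   # back to the previous value
]
```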
6.6.5. Adapt-Model
This header field indicates the desired behavior of the verification
resource after a successful verification execution. If the value of
this parameter is "true", the audio collected during the
verification session is used to update the voiceprint to account for
ongoing changes in a speaker’s incoming speech characteristics. If
the value is "false" (the default), the voiceprint is not updated
with the latest audio. This header field MAY only occur in VER-
START-SESSION method.
adapt-model = "Adapt-Model" ":" Boolean-value CRLF
6.6.6. Abort-Model
The Abort-Model header field indicates the desired behavior of the
verification resource upon session termination. If the value of this
parameter is "true", the pending changes to a voiceprint due to
verification training or verification adaptation are discarded. If
the value is "false" (the default), the pending changes for a
training session or a successful verification session are committed
to the voiceprint repository. A value of "true" for Abort-Model
overrides a value of "true" for the Adapt-Model header field. This
header field MAY only occur in VER-END-SESSION method.
abort-model = "Abort-Model" ":" Boolean-value CRLF
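The interaction between Abort-Model and Adapt-Model can be summarized in a small decision function. The following Python sketch is illustrative only; the function name and its parameters are assumptions, and it treats a training session as always having pending changes.

```python
def should_commit_voiceprint(mode, verification_succeeded,
                             adapt_model=False, abort_model=False):
    """Decide whether pending voiceprint changes are committed."""
    if abort_model:
        # Abort-Model: true overrides Adapt-Model: true.
        return False
    if mode == "train":
        # Training changes are committed unless aborted.
        return True
    # In "verify" mode there is only something to commit when
    # adaptation was requested and the verification succeeded.
    return adapt_model and verification_succeeded


adapted = should_commit_voiceprint("verify", True, adapt_model=True)
aborted = should_commit_voiceprint("verify", True, adapt_model=True,
                                   abort_model=True)
```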
6.6.7. Security-Level
The Security-Level header field determines the range of verification
scores in which a decision of ’accepted’ may be declared. This
header field MAY occur in SET-PARAMS, GET-PARAMS, VERIFY and VER-
FROM-BUFFER methods. It can be "high" (highest security level),
"medium-high", "medium" (normal security level), "medium-low", or
"low" (low security level). The default value is platform specific.
security-level = "Security-Level" ":" security-level-string CRLF
security-level-string = "high" |
"medium-high" |
"medium" |
"medium-low" |
"low"
6.6.8. Num-Min-Verification-Phrases
The Num-Min-Verification-Phrases header field is used to specify the
minimum number of valid utterances required before a positive
decision is given for verification. The value for this parameter is
an integer; the minimum value is 1 and the default value is 1. The
verification resource should not announce a decision of ’accepted’
unless at least Num-Min-Verification-Phrases utterances are
available.
num-min-verification-phrases = "Num-Min-Verification-Phrases" ":"
1*DIGIT CRLF
6.6.9. Num-Max-Verification-Phrases
The Num-Max-Verification-Phrases header field is used to specify the
number of valid utterances required before a decision is forced for
verification. The verification resource MUST NOT return a decision
of ’undecided’ once Num-Max-Verification-Phrases have been collected
and used to determine a verification score. The value for this
parameter is an integer and the minimum value is 1.
num-max-verification-phrases = "Num-Max-Verification-Phrases" ":"
1*DIGIT CRLF
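One way to picture how the minimum and maximum phrase counts constrain the reported decision is the following Python sketch. The function gate_decision is hypothetical. In particular, the choice to hold back an early 'accepted' as 'undecided', and the forced_decision value used once the maximum is reached, are assumptions of this sketch; this document only says that 'accepted' should not be announced too early and that 'undecided' MUST NOT be returned once the maximum has been collected.

```python
def gate_decision(raw_decision, num_valid_utterances, num_min=1,
                  num_max=None, forced_decision="rejected"):
    """Apply the min/max phrase counts to a raw verification decision."""
    if num_max is not None and num_max < num_min:
        raise ValueError("num_max must not be smaller than num_min")
    if raw_decision == "accepted" and num_valid_utterances < num_min:
        # Not enough utterances yet for a positive decision.
        return "undecided"
    if (raw_decision == "undecided" and num_max is not None
            and num_valid_utterances >= num_max):
        # The maximum has been reached: a decision is forced.
        return forced_decision
    return raw_decision


early = gate_decision("accepted", 1, num_min=2)
forced = gate_decision("undecided", 3, num_min=1, num_max=3)
```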
6.6.10. No-Input-Timeout
The No-Input-Timeout header field sets the length of time from the
start of the verification timers (see VER-START-TIMERS) until the
declaration of a no-input event in the VERIFICATION-COMPLETE server
event message. The value is in milliseconds. This header field MAY
occur in VERIFY, SET-PARAMS or GET-PARAMS. The value for this field
ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform specific.
The default value for this field is platform specific.
no-input-timeout = "No-Input-Timeout" ":" 1*DIGIT CRLF
6.6.11. Save-Waveform
This header field allows the client to indicate to the verification
resource that it MUST save the audio stream that was used for
verification/identification. The verification resource MUST then
record the audio and make it available to the client in the form of
a URL returned in the waveform-url header field in the
VERIFICATION-COMPLETE event. If there was an error in recording the
stream or the audio clip is otherwise not available, the
verification resource MUST return an empty waveform-url header
field. The default value for this field is "false". This header
field MAY appear in the VERIFY method, but NOT in the VER-FROM-
BUFFER method since it can control whether or not to save the
waveform for live verification / identification operations only.
save-waveform = "Save-Waveform" ":" Boolean-value CRLF
6.6.12. Waveform-URL
If the save-waveform header field is set to true, the verification
resource MUST record the incoming audio stream of the verification
into a file and provide a URL for the client to access it. This
header MUST be present in the VERIFICATION-COMPLETE event if the
save-waveform header field is set to true. The URL value of the
header MUST be NULL if there was some error condition preventing the
server from recording. Otherwise, the URL generated by the server
SHOULD be globally unique across the server and all its verification
sessions. The URL SHOULD be available until the session is torn
down. Since the save-waveform header field applies only to live
verification / identification operations, the waveform-url will only
be returned in the VERIFICATION-COMPLETE event for live verification
/ identification operations.
waveform-url = "Waveform-URL" ":" Url CRLF
6.6.13. Vendor-Specific
This set of headers allows the client to set Vendor Specific
parameters.
vendor-specific = "Vendor-Specific-Parameters" ":"
vendor-specific-av-pair
*[";" vendor-specific-av-pair] CRLF
vendor-specific-av-pair = vendor-av-pair-name "="
vendor-av-pair-value
This header can be sent in the SET-PARAMS method and is used to set
vendor-specific parameters on the server. The vendor-av-pair-name
can be any vendor-specific field name and conforms to the XML
vendor-specific attribute naming convention. The vendor-av-pair-
value is the value to set the attribute to, and needs to be quoted.
When asking the server to get the current value of these parameters,
this header can be sent in the GET-PARAMS method with the list of
vendor-specific attribute names to get separated by a semicolon.
This header field MAY occur in SET-PARAMS or GET-PARAMS.
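A minimal Python sketch of formatting and parsing this header is shown below. The function names and the parameter names (com.example.*) are hypothetical. The parser is deliberately naive: it assumes that quoted values contain neither semicolons nor double quotes, which a production parser would need to handle.

```python
def format_vendor_params(params):
    """Build a Vendor-Specific-Parameters header line (without CRLF)."""
    pairs = [f'{name}="{value}"' for name, value in params.items()]
    return "Vendor-Specific-Parameters: " + ";".join(pairs)


def parse_vendor_params(header_line):
    """Split the header value into a dict of name -> unquoted value."""
    _, _, value = header_line.partition(":")
    params = {}
    for pair in value.strip().split(";"):
        if not pair.strip():
            continue  # tolerate an empty value or a trailing semicolon
        name, _, raw = pair.partition("=")
        params[name.strip()] = raw.strip().strip('"')
    return params


line = format_vendor_params(
    {"com.example.threshold": "0.7", "com.example.model": "fast"}
)
parsed = parse_vendor_params(line)
```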
6.6.14. Voiceprint-Exists
This header field is returned in a VER-SET-VOICEPRINT or VER-DELETE-
VOICEPRINT response. This is the status of the voiceprint specified
in the VER-SET-VOICEPRINT method. For the VER-DELETE-VOICEPRINT
method this field indicates the status of the voiceprint as the
method execution started.
voiceprint-exists = "Voiceprint-Exists" ":" Boolean-value CRLF
6.6.15. Ver-Buffer-Utterance
This header field is used to indicate that this utterance should be
considered for Speaker Verification. This way, an application can
buffer utterances while doing regular recognition or verification
activities and speaker verification can later be requested on the
buffered utterances. This header field is OPTIONAL in the
RECOGNIZE, VERIFY or RECORD method.
ver-buffer-utterance = "Ver-Buffer-Utterance" ":" Boolean-value CRLF
6.6.16. Input-Waveform-Url
This optional header field specifies an audio file that has to be
processed according to the current verification mode, either to
train the voiceprint or verify the user. This enables the client to
implement the buffering use case also in the case where the
recognizer and verification resources live in two sessions. It MAY
be part of the VERIFY method.
input-waveform-url = "Input-Waveform-URL" ":" Url CRLF
6.6.17. Verification-Type
This optional header field specifies whether this is text-
independent, text-dependent or digit-string-based verification. It
MAY be part of the VERIFY method. The default for this field is
"text-independent".
verification-type = "Verification-Type" ":"
verification-type-string
verification-type-string = "text-independent"
| "text-dependent"
| "digits"
6.6.18. Digit-Sequence
This optional header field specifies the digit sequence to use for
verification if the verification type is "digits". It MAY be part
of the VERIFY method.
digit-sequence = "Digit-Sequence" ":" 1*DIGIT CRLF
6.6.19. Completion-Cause
This header field MUST be part of a VERIFICATION-COMPLETE event
coming from the verification resource to the client. This indicates
the reason behind the VERIFY or VER-FROM-BUFFER method completion.
This header field MUST be sent in the VERIFY, VER-FROM-BUFFER and
VER-SET-VOICEPRINT responses if they return with a failure status
and a COMPLETE state.
completion-cause = "Completion-Cause" ":" 1*DIGIT SP
1*ALPHA CRLF
Cause-Code  Cause-Name                  Description
000         success                     VERIFY or VER-FROM-BUFFER request
                                        completed successfully. The verify
                                        decision can be "accepted",
                                        "rejected", or "undecided".
001         error                       VERIFY or VER-FROM-BUFFER request
                                        terminated prematurely due to a
                                        verification resource or system
                                        error.
002         no-input-timeout            VERIFY request completed with no
                                        result due to a no-input-timeout.
003         too-much-speech-timeout     VERIFY request completed with no
                                        result due to too much speech.
004         speech-too-early            VERIFY request completed with no
                                        result because the caller spoke
                                        too soon.
005         buffer-empty                VER-FROM-BUFFER request completed
                                        with no result due to empty buffer.
006         out-of-sequence             Verification operation failed due
                                        to out-of-sequence method
                                        invocations. For example, calling
                                        VERIFY before VER-SET-VOICEPRINT.
007         voiceprint-uri-failure      Failure accessing voiceprint URI.
008         voiceprint-uri-missing      Voiceprint-uri is not specified.
009         voiceprint-id-missing       Voiceprint-identification is not
                                        specified.
010         voiceprint-id-not-exist     Voiceprint-identification doesn’t
                                        exist in the voiceprint repository.
011         voiceprint-group-not-exist  Voiceprint-group doesn’t exist.
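A client typically needs to split the Completion-Cause value into its numeric code and its name. The Python sketch below is one way to do that, assuming the value has already been separated from the header-field name. The function names are hypothetical. Note that the pattern also accepts hyphens in the cause name, because the names listed in the table (for example "no-input-timeout") contain them even though the ABNF above says 1*ALPHA.

```python
import re

# Cause-code digits, one space, then a cause-name made of letters and hyphens.
_COMPLETION_CAUSE = re.compile(r"^(\d+) ([A-Za-z][A-Za-z-]*)$")


def parse_completion_cause(value):
    """Split a Completion-Cause value into (code, name)."""
    match = _COMPLETION_CAUSE.match(value.strip())
    if match is None:
        raise ValueError("malformed Completion-Cause: %r" % value)
    return int(match.group(1)), match.group(2)


def verification_completed_normally(value):
    """True when the cause code is 000 (success)."""
    code, _name = parse_completion_cause(value)
    return code == 0
```

A cause of 000 only says that the request ran to completion; the verify decision itself still has to be read from the verification results.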
6.7. Verification Result Elements
Verification results can contain the following elements:
verification-result-elements = decision ; Section 6.7.1
| num-frames ; Section 6.7.2
| device ; Section 6.7.3
| gender ; Section 6.7.4
| matched ; Section 6.7.5
| adapted ; Section 6.7.6
| verification-score ; Section 6.7.7
| group-name ; Section 6.7.8
| member ; Section 6.7.9
| score ; Section 6.7.10
| vendor-specific-results ; Section 6.7.11
6.7.1. Decision
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the decision as determined by verification. It can have the values
of accepted, rejected or undecided.
decision-string = "accepted" | "rejected" | "undecided"
decision = "<decision>" decision-string "</decision>" CRLF
6.7.2. Num-Frames
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the number of 10-millisecond speech frames in the last utterance or
in the accumulated set of utterances.
num-frames = "<num-frames>" 1*DIGIT "</num-frames>" CRLF
6.7.3. Device
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the apparent type of device used by the caller as determined by
verification. It can have the values cellular-phone,
electret-phone, carbon-button-phone, and unknown.
device-string = "cellular-phone" | "electret-phone"
| "carbon-button-phone" | "unknown"
device = "<device>" device-string "</device>" CRLF
6.7.4. Gender
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the apparent gender of the speaker as determined by verification. It
can have the values of male, female or unknown.
gender-string = "male" | "female" | "unknown"
gender = "<gender>" gender-string "</gender>" CRLF
6.7.5. Matched
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. When verification is
trying to confirm the voiceprint, this indicates if the last
utterance and the voiceprints are of the same gender and used the
same type of device. It is not returned during verification
training. The value can be TRUE or FALSE.
matched = "<matched>" Boolean-value "</matched>" CRLF
6.7.6. Adapted
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. When verification is
trying to confirm the voiceprint, this indicates if the voiceprint
has been adapted as a consequence of analyzing the source
utterances. It is not returned during verification training. The
value can be TRUE or FALSE.
adapted = "<adapted>" Boolean-value "</adapted>" CRLF
6.7.7. Verification-Score
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the score of the last utterance as determined by verification.
During verification, the higher the score the more likely it is that
the speaker is the same one as the one who spoke the voiceprint
utterances. During training, the higher the score the more likely
the speaker is to have spoken all of the analyzed utterances. If
there are no such utterances the score is -100.
verification-score = "<verification-score>" FLOAT
"</verification-score>" CRLF
6.7.8. Group-Name
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the name of the group used in speaker identification.
group-name = "<group-name>" 1*ALPHA "</group-name>" CRLF
6.7.9. Member
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the member in a group identified by its URI. There is one URI for
each member in the group.
member = "<member>" 1*ALPHA "</member>" CRLF
6.7.10. Score
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. This is the score
associated with the identified member of the group, as returned in
the member result.
score = "<score>" 1*ALPHA "</score>" CRLF
6.7.11. Vendor-Specific-Results
This section describes how vendor-specific results are expressed
using XML syntax. Vendor-specific elements and
attributes MUST belong to the vendor’s own namespace. In the result
structure, they must either be prefixed by a namespace prefix
declared within the result or must be children of an element
identified as belonging to the vendor’s namespace. For details on
how to use XML Namespaces, see [6]. Section 2 of [6] provides
details on how to declare namespaces and namespace prefixes. Here is
an example:
50 cellular-phone female rejected -50 high sadness 50 cellular-phone female rejected -50
6.8. Verification Session Methods
These methods allow the client to control the mode and target of
verification or identification operations within the context of a
session. All the verification input cycles that occur within a
session may be used to create, update, or validate against the
voiceprint specified during the session. At the beginning of each
session the verification resource is reset to a known state.
Verification/identification operations can be executed against live
or buffered audio. The verification resource provides methods for
collecting and evaluating live audio data, and methods for
controlling the verification resource and adjusting its configured
behavior.
There are no specific methods for collecting buffered audio data.
Buffering is accomplished by calling RECOGNIZE or RECORD with the
Ver-Buffer-Utterance header field. When the VER-FROM-BUFFER method
is subsequently called, verification is performed using the set of
buffered audio.
buffered-audio-method = "VER-FROM-BUFFER"
The following methods provide controls for verification of live
audio utterances:
live-audio-method = "VERIFY"
| "VER-START-TIMERS"
The following methods provide controls for configuring the
verification resource and for establishing resource states:
live-or-buffered-audio-method = "VER-START-SESSION"
| "VER-END-SESSION"
| "VER-SET-VOICEPRINT"
| "VER-DELETE-VOICEPRINT"
| "VER-ROLLBACK"
| "VER-STOP"
| "SET-PARAMS"
| "GET-PARAMS"
6.8.1. VER-START-SESSION
The VER-START-SESSION method starts a Speaker
Verification/Identification Session. Execution of this method
forces the verification resource into a known initial state. If this
method is called during an ongoing verification session, the
previous session is implicitly aborted.
Upon completion of the VER-START-SESSION method, the verification
resource MUST terminate any ongoing verification sessions, and clear
any voiceprint designation.
The header field "Adapt-Model" may also be present in the start
session method to indicate whether or not to adapt a voiceprint with
data collected during the session (if the voiceprint verification
phase succeeds). By default the voiceprint model should NOT be
adapted with data from a verification session.
Before a verification/identification session is started, only
VER-ROLLBACK and the generic SET-PARAMS and GET-PARAMS operations
can be performed. The media server should return 402 (Method not
valid in this state) for all other operations, such as VERIFY and
VER-SET-VOICEPRINT.
Only a single session can be active at a time.
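The state rules above can be sketched as a small server-side check. This Python example is illustrative only: the class and method names are hypothetical, and it models just the rules stated in this document (VER-START-SESSION resets the resource and clears the voiceprint designation, a limited set of methods is valid with no session, and, per Section 6.8.2, VER-END-SESSION is harmless when no session is active).

```python
# Methods that are valid when no session is active. VER-END-SESSION is
# included because Section 6.8.2 says it can be executed safely without
# a preceding VER-START-SESSION.
VALID_WITHOUT_SESSION = {
    "VER-START-SESSION", "VER-END-SESSION",
    "VER-ROLLBACK", "SET-PARAMS", "GET-PARAMS",
}


class VerificationResourceState:
    """Hypothetical model of the session state of one verification resource."""

    def __init__(self):
        self.session_active = False
        self.voiceprint = None

    def status_for(self, method):
        """Return the status code this sketch would answer for `method`."""
        if method == "VER-START-SESSION":
            # Implicitly aborts any previous session and clears the
            # voiceprint designation.
            self.session_active = True
            self.voiceprint = None
            return 200
        if not self.session_active and method not in VALID_WITHOUT_SESSION:
            return 402  # Method not valid in this state
        if method == "VER-END-SESSION":
            self.session_active = False
        return 200
```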
Example:
C->S: VER-START-SESSION 314161 MRCP/1.0
Adapt-Model: true
S->C: MRCP/1.0 314161 200 COMPLETE
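The request in the example above consists of a request line (method, request-id, protocol version) followed by header fields. A client could assemble such a message as sketched below in Python. The function name is hypothetical, and the blank line that terminates the header section is an assumption based on the HTTP-like framing used by MRCP and RTSP.

```python
CRLF = "\r\n"


def build_request(method, request_id, headers=None, version="MRCP/1.0"):
    """Build an MRCP request with no message body.

    `headers` is an optional dict of header-field names to string values.
    """
    lines = ["%s %d %s" % (method, request_id, version)]
    for name, value in (headers or {}).items():
        lines.append("%s: %s" % (name, value))
    # Assumed framing: an empty line ends the header section.
    return CRLF.join(lines) + CRLF + CRLF


start_session = build_request("VER-START-SESSION", 314161,
                              {"Adapt-Model": "true"})
```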
6.8.2. VER-END-SESSION
The VER-END-SESSION method terminates an ongoing verification
session and releases the verification voiceprint model in one of
three ways:
a. aborting - the voiceprint adaptation or creation may be aborted
so that the voiceprint remains unchanged (or is not created).
b. committing - when terminating a voiceprint training session, the
new voiceprint is committed to the repository.
c. adapting - an existing voiceprint is modified using a successful
verification.
The header field "Abort-Model" may be included in the VER-END-
SESSION to control whether or not to abort any pending changes to
the voiceprint. The default behavior is to commit (not abort) any
pending changes to the designated voiceprint.
The VER-END-SESSION method may be safely executed multiple times
without first executing the VER-START-SESSION method. Any additional
executions of this method without an intervening use of the VER-
START-SESSION method have no effect on the system.
Example:
This example assumes there is a training session or a verification
session in progress.
C->S: VER-END-SESSION 314174 MRCP/1.0
Abort-Model: true
S->C: MRCP/1.0 314174 200 COMPLETE
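The responses shown in these examples share one shape: protocol version, request-id, status code, and request state on the first line, followed by optional header fields. The Python sketch below parses that shape. The function name and the returned dictionary layout are assumptions for illustration, and the sketch ignores any message body.

```python
def parse_response(text):
    """Parse an MRCP response into its status-line parts and header fields."""
    head, _sep, _body = text.partition("\r\n\r\n")
    lines = head.split("\r\n")
    version, request_id, status, state = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        if not line:
            continue
        name, _colon, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return {
        "version": version,
        "request_id": int(request_id),
        "status": int(status),
        "state": state,
        "headers": headers,
    }
```

For instance, a client could inspect the "Voiceprint-Exists" entry of the parsed headers after a VER-SET-VOICEPRINT response to decide between training and verification.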
6.8.3. VER-SET-VOICEPRINT
The VER-SET-VOICEPRINT method causes the verification resource to
establish the voiceprint to be used for verification, identification,
or training purposes. At this time the desired mode of the
verification resource is not yet known.
The VER-SET-VOICEPRINT method can also be used to query whether or not
a voiceprint exists. The response to the VER-SET-VOICEPRINT method
request will contain an indication of the status of the designated
voiceprint in the "Voiceprint-Exists" header field, allowing the client
to determine whether to use the current voiceprint for verification,
train a new voiceprint, or choose a different voiceprint.
A voiceprint location may be completely specified by providing the URI
of the voiceprint repository along with attributes to locate a single
voiceprint within the repository. The voiceprint repository is
specified through the "Voiceprint-URI" header field, in which a URI
describing the location of the voiceprint repository is given. The
attributes used to locate a specific record or records within the
repository depend on whether the client intends to use speaker
verification or speaker identification.
In the case of speaker verification, only a single attribute is
required to uniquely locate a voiceprint record within the repository.
The "Voiceprint-Identifier" header field MUST describe a unique
voiceprint record within a given repository.
In the case of speaker identification, an attribute describing the set
or group of speakers from which to select a specific identity must be
supplied in the VER-SET-VOICEPRINT message. The header field
"Voiceprint-Group" specifies the group of voiceprints from which an
identity is determined. If a new voiceprint is to be added to an
existing voiceprint group, then both the voiceprint group and the new
voiceprint identifier must be supplied.
In most cases, the voiceprint operations VER-SET-VOICEPRINT and
VER-DELETE-VOICEPRINT operate on the same voiceprint repository but
use different voiceprint records or group names. For simplicity, the
"Voiceprint-URI" header field can be omitted if it has already been
set by a previous voiceprint operation. However, VER-START-SESSION
clears any voiceprint designation, including the "Voiceprint-URI".
Unlike the "Voiceprint-URI", the "Voiceprint-Identifier" header field
MUST be specified in every voiceprint operation, and the
"Voiceprint-Group" header field MUST be specified in every voiceprint
operation used for identification.
Example 1:
This example assumes a verification session is in progress and the
voiceprint exists in the voiceprint repository.
C->S: VER-SET-VOICEPRINT 314168 MRCP/1.0
Voiceprint-URI:
Voiceprint-Identifier:
S->C: MRCP/1.0 314168 200 COMPLETE
Voiceprint-URI:
Voiceprint-Identifier:
Voiceprint-Exists: true
Example 2:
This example assumes a verification session is in progress and the
voiceprint doesn’t exist in the voiceprint repository.
C->S: VER-SET-VOICEPRINT 314168 MRCP/1.0
Voiceprint-URI:
Voiceprint-Identifier:
S->C: MRCP/1.0 314168 200 COMPLETE
Voiceprint-URI:
Voiceprint-Identifier:
Voiceprint-Exists: false
Example 3:
This example assumes a verification session is in progress and the
’Voiceprint-URI’ header field is a bad URI.
C->S: VER-SET-VOICEPRINT 314168 MRCP/1.0
Voiceprint-URI:
Voiceprint-Identifier:
S->C: MRCP/1.0 314168 405 COMPLETE
Voiceprint-URI:
Voiceprint-Identifier:
Completion-Cause: 007 voiceprint-uri-failure
Example 4:
This example assumes an identification session is in progress and
the group doesn’t exist in the voiceprint repository.
C->S: VER-SET-VOICEPRINT 314168 MRCP/1.0
Voiceprint-URI:
Voiceprint-Group:
S->C: MRCP/1.0 314168 200 COMPLETE
Voiceprint-URI:
Voiceprint-Group:
Completion-Cause: 011 voiceprint-group-not-exist
6.8.4. VER-DELETE-VOICEPRINT
The VER-DELETE-VOICEPRINT method removes a voiceprint from a
repository or speaker identification group. For removal of a speaker
identification voiceprint, three attributes describing the
voiceprint repository, group, and voiceprint identifier are
required. For removal of a speaker verification voiceprint, two
attributes describing the repository and the specific voiceprint are
needed.
If a single voiceprint record is specified with no group identifier
information, the voiceprint record is deleted.
If a group identifier is specified but no specific voiceprint within
the group, the group record is deleted, and all the voiceprints
associated with that group are deleted.
If both a voiceprint record and a group identifier are specified,
that voiceprint is deleted, and the group identifier is updated to
no longer reference that voiceprint. If, after removing the
reference to that voiceprint, the group identifier is empty, the
group record is also removed.
If the specified voiceprint record or voiceprint group does not
exist, the VER-DELETE-VOICEPRINT method can silently ignore the
message and still return a 200 status code.
Example:
This example demonstrates a message to remove a specific voiceprint.
C->S: VER-DELETE-VOICEPRINT 314168 MRCP/1.0
Voiceprint-URI:
Voiceprint-Identifier:
S->C: MRCP/1.0 314168 200 COMPLETE
6.8.5. VERIFY
The VERIFY method is used to send the utterance’s audio stream to
the verification resource, which will then process it according to
the current Verification-Mode, either to train the voiceprint or
verify the user.
When a recognizer and a verification resource share the same
session, the VERIFY method MUST be called prior to calling the
RECOGNIZE method on the recognizer resource. This ordering tells
the media server that verification must be enabled for the
subsequent call to RECOGNIZE.
Example:
C->S: VERIFY 543260 MRCP/1.0
S->C: MRCP/1.0 543260 200 IN-PROGRESS
When the VERIFY request is done, the MRCP server should send a
VERIFICATION-COMPLETE event to the client.
6.8.6. VER-FROM-BUFFER
The VER-FROM-BUFFER method begins an ongoing evaluation of the
currently buffered audio against the voiceprint established through
the VER-SET-VOICEPRINT method. Execution of this method without
first establishing the voiceprint repository and identifier
attributes produces an error response. Since a verification session
may only have a single voiceprint identity at any given time, this
method may not be started repeatedly without first receiving a
completion response or sending a VER-STOP message.
Embedded with the request for audio evaluation is a header field
that describes the desired usage of the verification resource. The
value of the "Verification-Mode" header field MUST be either "train"
or "verify".
The buffered audio is not consumed by this evaluation operation and
thus VER-FROM-BUFFER may be called repeatedly using different
voiceprints. Such usage is desirable to implement an n-best
processing strategy to determine a voiceprint identity.
The processing initiated under a VER-FROM-BUFFER method may be
terminated using the VER-STOP method.
For the VER-FROM-BUFFER method, the media server can optionally
return an "IN-PROGRESS" response followed by the
"VERIFICATION-COMPLETE" event.
Example:
This example illustrates the usage of some buffering methods. In
this scenario the client first performs a live verification, but
the utterance is rejected. The utterance is also saved to the audio
buffer. Then another voiceprint is used to verify against the audio
buffer, and the utterance is accepted. Here, we assume that both
"num-min-verification-phrases" and "num-max-verification-phrases"
are 1.
C->S: VER-START-SESSION 314161 MRCP/1.0
Adapt-Model: true
S->C: MRCP/1.0 314161 200 COMPLETE
C->S: VER-SET-VOICEPRINT 314162 MRCP/1.0
Voiceprint-URI:
Voiceprint-Identifier:
S->C: MRCP/1.0 314162 200 COMPLETE
Voiceprint-URI:
Voiceprint-Identifier:
Voiceprint-Exists: true
C->S: VERIFY 314164 MRCP/1.0
Ver-Buffer-Utterance: true
S->C: MRCP/1.0 314164 200 IN-PROGRESS
S->C: VERIFICATION-COMPLETE 314164 COMPLETE MRCP/1.0
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 123
50 cellular-phone female rejected -50 50 cellular-phone female rejected -50
C->S: VER-SET-VOICEPRINT 314165 MRCP/1.0
Voiceprint-Identifier:
S->C: MRCP/1.0 314165 200 COMPLETE
Voiceprint-URI:
Voiceprint-Identifier:
Voiceprint-Exists: true
C->S: VER-FROM-BUFFER 314166 MRCP/1.0
Verification-Mode: verify
S->C: MRCP/1.0 314166 200 IN-PROGRESS
S->C: VERIFICATION-COMPLETE 314166 COMPLETE MRCP/1.0
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 123
50 cellular-phone female accepted 50 50 cellular-phone female accepted 50
C->S: VER-END-SESSION 314168 MRCP/1.0
S->C: MRCP/1.0 314168 200 COMPLETE
6.8.7. VER-ROLLBACK
The VER-ROLLBACK method discards the last buffered utterance or
the last live utterance (when the mode is "train" or "verify").
This method should be invoked when the caller provides undesirable
input such as non-speech noises, side-speech, out-of-grammar
utterances, commands, etc. Note that this method does not provide a
stack of rollback states: when VER-ROLLBACK is executed twice in
succession without an intervening recognition operation, the second
attempt has no effect.
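The single-level rollback behavior can be sketched as follows. This Python example is only a model of the rule that a second consecutive VER-ROLLBACK does nothing; the class and its method names are hypothetical.

```python
class UtteranceHistory:
    """Hypothetical store of utterances collected during a session."""

    def __init__(self):
        self.utterances = []
        self._rollback_available = False

    def add(self, utterance):
        """Record a newly collected utterance; re-arms rollback."""
        self.utterances.append(utterance)
        self._rollback_available = True

    def rollback(self):
        """Discard the last utterance once; repeated calls have no effect."""
        if self._rollback_available and self.utterances:
            self.utterances.pop()
        self._rollback_available = False
```

After two utterances and two rollbacks in a row, only the most recent utterance has been discarded, because there is no stack of rollback states.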
Example:
C->S: VER-ROLLBACK 314165 MRCP/1.0
S->C: MRCP/1.0 314165 200 COMPLETE
6.8.8. VER-STOP
The VER-STOP method from the client to the server tells the
verification resource to stop the VERIFY or VER-FROM-BUFFER request
if one is active. If such a request is active and the VER-STOP
request successfully terminates it, then the response contains an
active-request-id-list header field holding the request-id of the
VERIFY or VER-FROM-BUFFER request that was terminated. In this case,
no VERIFICATION-COMPLETE event will be sent for the terminated
request. If no verify request was active, then the response MUST NOT
contain an active-request-id-list header field. Either way, the
response MUST contain a status of 200 (Success).
The VER-STOP method aborts an ongoing evaluation operation against
live audio or buffered audio.
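The response rules for VER-STOP reduce to a small decision, sketched here in Python. The function name and the representation of the active request as an optional integer are assumptions for this illustration.

```python
def ver_stop_response(active_request_id):
    """Return (status, headers) for a VER-STOP request.

    `active_request_id` is the request-id of the VERIFY or
    VER-FROM-BUFFER request being terminated, or None if nothing
    is active.
    """
    headers = {}
    if active_request_id is not None:
        # Only present when a request was actually terminated.
        headers["Active-Request-Id-List"] = str(active_request_id)
    return 200, headers
```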
Example:
This example assumes a voiceprint identity has already been
established.
C->S: VERIFY 314177 MRCP/1.0
Verification-Mode: verify
S->C: MRCP/1.0 314177 200 IN-PROGRESS
C->S: VER-STOP 314178 MRCP/1.0
S->C: MRCP/1.0 314178 200 COMPLETE
Active-Request-Id-List: 314177
6.8.9. VER-START-TIMERS
This request is sent from the client to the verification resource to
start the no-input timer, usually once the audio prompts to the
caller have played to completion.
Example:
C->S: VER-START-TIMERS 543260 MRCP/1.0
S->C: MRCP/1.0 543260 200 COMPLETE
6.8.10. SET-PARAMS
The SET-PARAMS method, from the client to the server, tells the
verification resource to set and modify its configuration
parameters. If the server resource does not recognize an OPTIONAL
parameter it MUST ignore that field. Many of the parameters in the
SET-PARAMS method can also be used in other methods, such as VERIFY.
The difference is scope: a value such as the security-level set with
SET-PARAMS applies to all future requests, whenever applicable,
whereas a security-level passed in a VERIFY request applies only to
that request.
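The difference in scope between SET-PARAMS and per-request header fields can be sketched as a merge in which request-level values take precedence without altering the stored settings. The Python function below is illustrative; its name is hypothetical, and the "low" value used in the usage lines is an assumed Security-Level value chosen only for the example.

```python
def effective_params(session_params, request_headers):
    """Combine values set via SET-PARAMS with headers of one request.

    Request headers override session values for this request only;
    `session_params` itself is left unchanged.
    """
    merged = dict(session_params)
    merged.update(request_headers)
    return merged


session = {"Security-Level": "high", "No-Input-Timeout": "5000"}
# "low" is a hypothetical value used only for this illustration.
this_request = effective_params(session, {"Security-Level": "low"})
next_request = effective_params(session, {})
```

Here `this_request` sees the overridden security level, while `next_request` falls back to the value established by SET-PARAMS.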
Example:
C->S: SET-PARAMS 543256 MRCP/1.0
Security-Level: high
No-Input-Timeout: 5000
S->C: MRCP/1.0 543256 200 COMPLETE
6.8.11. GET-PARAMS
The GET-PARAMS method, from the client to the server, asks the
verification resource for the current values of the parameters named
in the request. The client can request specific parameters from the
server by sending one or more parameter headers with no values. The
server should then return the settings for those specific parameters
only. When the client does not send a list of empty parameter
headers, the verification resource should return the settings for
all parameters. This wildcard use can be expensive, as the number of
settable parameters can be large depending on the vendor. Hence it
is RECOMMENDED that the client not use the wildcard GET-PARAMS
operation very often.
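The selection rule for GET-PARAMS can be sketched in a few lines of Python. The function name and the dictionary representation of the resource's settings are assumptions; omitting unknown parameter names from the reply is likewise a choice of this sketch rather than something the text specifies.

```python
def get_params(current, requested_names):
    """Return the settings a GET-PARAMS response would carry.

    `current` maps parameter names to their values on the resource.
    `requested_names` lists the empty headers sent by the client; an
    empty list means the wildcard form (return everything).
    """
    if not requested_names:
        return dict(current)
    # Unknown names are left out of the reply in this sketch.
    return {name: current[name] for name in requested_names if name in current}
```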
Example:
C->S: GET-PARAMS 543256 MRCP/1.0
Security-Level:
No-Input-Timeout:
S->C: MRCP/1.0 543256 200 COMPLETE
Security-Level: high
No-Input-Timeout: 5000
6.9. Verification Session Events
6.9.1. VERIFICATION-COMPLETE
The VERIFICATION-COMPLETE event follows a call to VERIFY or VER-
FROM-BUFFER and is used to communicate to the client the
verification results. This event will contain only verification
results.
Example:
S->C: VERIFICATION-COMPLETE 543259 COMPLETE MRCP/1.0
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 123
50 cellular-phone female accepted 50 150 cellular-phone female accepted 25 123456 Martha-smith
75
6.9.2. START-OF-SPEECH
The START-OF-SPEECH event is returned from the server to the client
once the server has detected speech. This event is always returned
by the verification resource when speech has been detected,
regardless of whether the recognizer and verification resources
share the same session.
Example:
S->C: START-OF-SPEECH 543259 IN-PROGRESS MRCP/1.0
7. Hotword Recognition
This document captures the extensions required to implement Voice
Enrollment, Speaker Verification and Hotword recognition using MRCP.
This section describes the methods, responses and events needed for
doing Hotword recognition.
A new type of Speech Recognizer resource is presented that can be
used for Hotword recognition. Unlike the primary recognizer
resource, which is driven by the client for each recognition
request, the secondary Hotword recognition resource is attached to
the session and listens continuously until a particular command
phrase is spoken.
The Hotword recognition resource can be the only recognition
resource in a session or it can be attached to the same session as a
primary recognizer resource, and consequently connected to the same
audio stream. When a client sends a SETUP request to add a Hotword
recognizer resource to an existing session, then the MRCP server
attaches the Hotword recognition resource in eavesdropping mode on
the RTP stream already established by the primary resource.
7.1. Hotword State Machine
The difference between a Hotword recognition resource and the
primary recognition resource is minor. The RECOGNIZE and STOP
methods are the only methods allowed on a Hotword recognition
resource. The only event generated is RECOGNITION-COMPLETE. The
resource goes from IDLE to RECOGNIZING and back to IDLE just like a
regular recognizer resource.
A Hotword recognition resource, unlike a normal recognizer resource,
will not send a START-OF-SPEECH event while it is trying to locate a
Hotword. The first event that will be returned once the Hotword is
detected is a RECOGNITION-COMPLETE event.
After a RECOGNITION-COMPLETE event is reported, the Hotword
recognition resource must be primed once again by sending another
RECOGNIZE request.
The Hotword recognition resource can also be stopped by calling the
STOP method.
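The state machine described in this section can be sketched as follows. This Python model is illustrative only; the class and method names are hypothetical, and it covers just the transitions stated above (RECOGNIZE primes the resource, STOP or a detected Hotword returns it to IDLE, and no START-OF-SPEECH event is ever produced).

```python
class HotwordRecognizerState:
    """Hypothetical model of the Hotword recognition resource states."""

    def __init__(self):
        self.state = "IDLE"

    def recognize(self):
        """RECOGNIZE primes the resource to listen for the Hotword."""
        self.state = "RECOGNIZING"

    def stop(self):
        """STOP returns the resource to IDLE."""
        self.state = "IDLE"

    def hotword_detected(self):
        """Return the event to send, or None if the resource is not primed."""
        if self.state != "RECOGNIZING":
            return None
        self.state = "IDLE"
        return "RECOGNITION-COMPLETE"
```

Because the resource drops back to IDLE after reporting, a second detection yields nothing until the client sends another RECOGNIZE request.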
7.1.1. Addressing Resources
To request that a Hotword recognition resource be added to a
session, a different URI must be specified in the SETUP message. The
same rules apply as for other resources. That is, if no session is
specified in the SETUP message, then this is considered to be the
first resource added to a session. For subsequent SETUP requests,
the MRCP client should indicate to the server that these resources
belong to the same session by including the same session ID in the
SETUP request message.
There is no special order required when requesting synthesizer,
recognizer or Hotword-recognizer resources.
7.2. Hotword Header Fields
Hotword recognition requests may contain the following header
fields.
Hotword-header = Hotword-Max-Duration ; Section 7.2.1
| Hotword-Min-Duration ; Section 7.2.2
7.2.1. Hotword-Max-Duration
This parameter MAY be sent in a RECOGNIZE request to enable Hotword
listening. It specifies the maximum length of an utterance that
should be considered for Hotword recognition. This parameter, along
with Hotword-Min-Duration, can be used to tune performance by
preventing the recognizer from evaluating utterances that are too
short or too long to be the Hotword. The value of this field is in
milliseconds. The default is 1700 milliseconds.
hotword-max-duration = "Hotword-Max-Duration" ":" 1*DIGIT CRLF
7.2.2. Hotword-Min-Duration
This parameter MAY be sent in a RECOGNIZE request to enable Hotword
listening. It specifies the minimum length of an utterance that can
be considered for Hotword recognition. This parameter, along with
Hotword-Max-Duration, can be used to tune performance by preventing
the recognizer from evaluating utterances that are too short or too
long to be the Hotword. The value of this field is in milliseconds.
The default is 300 milliseconds.
hotword-min-duration = "Hotword-Min-Duration" ":" 1*DIGIT CRLF
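The two duration header fields act together as a pre-filter on candidate utterances. The Python sketch below shows that filter using the defaults stated above. The function name is hypothetical, and treating both bounds as inclusive is an assumption, since the text does not say how boundary values are handled.

```python
DEFAULT_MIN_DURATION_MS = 300
DEFAULT_MAX_DURATION_MS = 1700


def is_hotword_candidate(duration_ms, headers=None):
    """Decide whether an utterance is worth evaluating as the Hotword.

    `headers` is an optional dict of RECOGNIZE header fields; values
    are strings of digits, in milliseconds.
    """
    headers = headers or {}
    min_ms = int(headers.get("Hotword-Min-Duration", DEFAULT_MIN_DURATION_MS))
    max_ms = int(headers.get("Hotword-Max-Duration", DEFAULT_MAX_DURATION_MS))
    # Inclusive bounds are an assumption of this sketch.
    return min_ms <= duration_ms <= max_ms
```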
7.3. Hotword Methods
7.3.1. SETUP
The SETUP method from the client to the server is used to attach a
Hotword recognizer resource to the session.
Example:
C->S: SETUP rtsp://media.server.com/media/hotword-asr RTSP/1.0
CSeq: 3
Transport: RTP/AVP;unicast;client_port=8000-8001;mode=record
Session: 12345678
S->C: RTSP/1.0 200 OK
CSeq: 3
Transport: RTP/AVP;unicast;client_port=8000-8001;
server_port=9000-9001;mode=record
Session: 12345678
7.3.2. RECOGNIZE
The RECOGNIZE method from the client to the server starts an ongoing
Hotword recognition. This operation can be stopped using the STOP
method. Otherwise, the RECOGNITION-COMPLETE event will be returned
when the Hotword has been recognized.
The client must call RECOGNIZE once again to re-start Hotword
recognition.
Example:
C->S: ANNOUNCE rtsp://media.server.com/media/hotword-asr RTSP/1.0
Cseq: 314
Session: 12345678
Content-Type: application/mrcp
Content-Length: 276
RECOGNIZE 543259 MRCP/1.0
Content-Type: application/grammar+xml
Content-Length: 123
Hotword-Min-Duration: 300
Hotword-Max-Duration: 1700
S->C: RTSP/1.0 200 OK
Cseq: 314
Content-Type: application/mrcp
Content-Length: 67
MRCP/1.0 543259 200 IN-PROGRESS
S->C: ANNOUNCE rtsp://media.server.com/media/hotword-asr RTSP/1.0
Cseq: 315
Session: 12345678
Content-Type: application/mrcp
Content-Length: 123
RECOGNITION-COMPLETE 543259 COMPLETE MRCP/1.0
Completion-Cause: 000 Normal
Content-Type: application/x-nlsml
Content-Length: 76
Wakeup
Wakeup
C->S: RTSP/1.0 200 OK
Cseq: 315
8. RTSP based Examples:
This section contains examples of typical sessions between a client
and the server.
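In the examples of this section, each MRCP message travels as the body of an RTSP ANNOUNCE request. The Python sketch below shows one way a client might wrap a message, computing Content-Length from the encoded body instead of hard-coding it. The function name is hypothetical, and the header set simply mirrors the ANNOUNCE requests shown in this document.

```python
def wrap_in_announce(url, cseq, session, mrcp_message):
    """Wrap an MRCP message in an RTSP ANNOUNCE request and return bytes."""
    body = mrcp_message.encode("utf-8")
    head = "\r\n".join([
        "ANNOUNCE %s RTSP/1.0" % url,
        "CSeq: %d" % cseq,
        "Session: %s" % session,
        "Content-Type: application/mrcp",
        # Length of the MRCP message in bytes, not characters.
        "Content-Length: %d" % len(body),
    ])
    return head.encode("ascii") + b"\r\n\r\n" + body


mrcp = "VER-START-SESSION 314161 MRCP/1.0\r\nAdapt-Model: true\r\n\r\n"
packet = wrap_in_announce(
    "rtsp://media.server.com/media/verification-resource",
    4, "12345678", mrcp)
```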
8.1. Enrollment
This example illustrates a typical enrollment session.
First, you need to start an enrollment session before proceeding to
learn new phrases.
C->S: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
Cseq: 406
Session: 12345678
Content-Type: application/mrcp
Content-Length: 123
START-PHRASE-ENROLLMENT 543258 MRCP/1.0
Num-Min-Consistent-Pronunciations: 2
Consistency-Threshold: 3000
Clash-Threshold: 1200
Personal-Grammar-URI:
Phrase-Id:
Phrase-NL:
Weight: 1
Save-Best-Waveform: true
S->C: RTSP/1.0 200 OK
Cseq: 406
Content-Type: application/mrcp
Content-Length: 86
MRCP/1.0 543258 200 COMPLETE
Then, the application can proceed to enroll an utterance by
iterating over the following command.
C->S: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
Cseq: 407
Session: 12345678
Content-Type: application/mrcp
Content-Length: 276
RECOGNIZE 543259 MRCP/1.0
Content-Type: application/grammar+xml
Content-Length: 123
help cancel
S->C: RTSP/1.0 200 OK
Cseq: 407
Content-Type: application/mrcp
Content-Length: 67
MRCP/1.0 543259 200 IN-PROGRESS
S->C: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
Cseq: 408
Session: 12345678
Content-Type: application/mrcp
Content-Length: 87
START-OF-SPEECH 543259 IN-PROGRESS MRCP/1.0
C->S: RTSP/1.0 200 OK
Cseq: 408
The recognizer resource returns the enrollment status after each
attempt to enroll an utterance. This repeats until the required
number of pronunciations are consistent and there are no clashes
with other pronunciations in the personal grammar.
S->C: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
Cseq: 409
Session: 12345678
Content-Type: application/mrcp
Content-Length: 276
RECOGNITION-COMPLETE 543259 COMPLETE MRCP/1.0
Completion-Cause: 000 Normal
Content-Type: application/x-nlsml
Content-Length: 123
2 1 1
consistent Jeff Andre
C->S: RTSP/1.0 200 OK
Cseq: 409
Finally, when the application is satisfied with the enrollment
results then the enrollment is committed to the personal grammar by
ending the enrollment session, as follows.
C->S: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
Cseq: 410
Session: 12345678
Content-Type: application/mrcp
Content-Length: 123
END-PHRASE-ENROLLMENT 543260 MRCP/1.0
S->C: RTSP/1.0 200 OK
Cseq: 410
Content-Type: application/mrcp
Content-Length: 67
MRCP/1.0 543260 200 COMPLETE
Waveform-URL:
8.2. Speaker Verification and Identification
This example illustrates a verification session. Prompts are assumed
to be played by other means, so the MRCP synthesizer resource is
left out for simplicity.
Opening the recognizer. This is the first resource for this
session. The server and client agree on a single Session ID 12345678
and set of RTP/RTCP ports on both sides.
C->S:SETUP rtsp://media.server.com/media/recognizer RTSP/1.0
CSeq: 2
Transport:RTP/AVP;unicast;client_port=46456-46457
Content-Type: application/sdp
Content-Length: 190
v=0
o=- 123 456 IN IP4 10.0.0.1
s=Media Server
p=+1-888-555-1212
c=IN IP4 0.0.0.0
t=0 0
m=audio 0 RTP/AVP 0 96
a=rtpmap:0 pcmu/8000
a=rtpmap:96 telephone-event/8000
a=fmtp:96 0-15
S->C:RTSP/1.0 200 OK
CSeq: 2
Transport:RTP/AVP;unicast;client_port=46456-46457;
server_port=46460-46461
Session: 12345678
Content-Length: 190
Content-Type: application/sdp
v=0
o=- 3211724219 3211724219 IN IP4 10.3.2.88
s=Media Server
c=IN IP4 0.0.0.0
t=0 0
m=audio 46460 RTP/AVP 0 96
a=rtpmap:0 pcmu/8000
a=rtpmap:96 telephone-event/8000
a=fmtp:96 0-15
Open a verification resource, reusing the existing session ID and
ports.
C->S:SETUP rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 3
Transport: RTP/AVP;unicast;client_port=46456-46457;
mode=record;ttl=127
Session: 12345678
S->C:RTSP/1.0 200 OK
CSeq: 3
Transport: RTP/AVP;unicast;client_port=46456-46457;
server_port=46460-46461;mode=record;ttl=127
Session: 12345678
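The port reuse shown in the two SETUP exchanges above can be checked
mechanically. The following is a minimal sketch, not part of the
protocol definition: the function name parse_transport is
hypothetical, and the parser handles only the Transport parameters
that appear in these examples.

```python
def parse_transport(value):
    """Parse an RTSP Transport header value into (spec, params).

    Handles only the forms used in this example: flag parameters
    (e.g. "unicast"), key=value parameters, and port ranges.
    """
    parts = [p.strip() for p in value.split(";") if p.strip()]
    spec, params = parts[0], {}
    for part in parts[1:]:
        key, sep, val = part.partition("=")
        if sep and key.endswith("_port") and "-" in val:
            low, high = val.split("-")
            params[key] = (int(low), int(high))  # RTP port, RTCP port
        else:
            params[key] = val if sep else True
    return spec, params


# Transport values taken from the recognizer and verification SETUP responses.
recognizer = parse_transport(
    "RTP/AVP;unicast;client_port=46456-46457;server_port=46460-46461")
verifier = parse_transport(
    "RTP/AVP;unicast;client_port=46456-46457;"
    "server_port=46460-46461;mode=record;ttl=127")

same_ports = all(
    recognizer[1][k] == verifier[1][k] for k in ("client_port", "server_port"))
```

Here same_ports is true when both resources share the RTP/RTCP ports
negotiated for the session, as the example intends.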
Start a verification session.
C->S:ANNOUNCE rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 4
Session: 12345678
Content-Type: application/mrcp
Content-Length: 53
VER-START-SESSION 314161 MRCP/1.0
Adapt-Model: true
S->C:RTSP/1.0 200 OK
CSeq: 4
Session: 12345678
Content-Length: 30
Content-Type: application/mrcp
MRCP/1.0 314161 200 COMPLETE
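The tunnelling used throughout these examples, in which an MRCP
request travels as the body of an RTSP ANNOUNCE, can be sketched as
follows. This is an illustrative sketch only: the helper names
build_mrcp_request and wrap_in_announce are hypothetical, and
Content-Length is computed from the encoded body, so it may differ
from the illustrative lengths printed in the example messages.

```python
CRLF = "\r\n"


def build_mrcp_request(method, request_id, headers=None, body=""):
    """Build an MRCP/1.0 request: 'METHOD request-id MRCP/1.0' plus headers."""
    lines = [f"{method} {request_id} MRCP/1.0"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # A blank line separates the MRCP headers from any message body.
    return CRLF.join(lines) + CRLF + CRLF + body


def wrap_in_announce(url, cseq, session, mrcp_message):
    """Carry an MRCP message as the body of an RTSP ANNOUNCE request."""
    payload = mrcp_message.encode("utf-8")
    head = [
        f"ANNOUNCE {url} RTSP/1.0",
        f"CSeq: {cseq}",
        f"Session: {session}",
        "Content-Type: application/mrcp",
        f"Content-Length: {len(payload)}",  # length of the MRCP body in bytes
    ]
    return (CRLF.join(head) + CRLF + CRLF).encode("ascii") + payload


mrcp = build_mrcp_request("VER-START-SESSION", 314161, {"Adapt-Model": "true"})
packet = wrap_in_announce(
    "rtsp://media.server.com/media/verification-resource", 4, "12345678", mrcp)
```

Computing the length from the encoded payload, rather than from the
character count, keeps the header correct if a body ever contains
non-ASCII text.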
Start a recognition request, getting the account number for example.
C->S:ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
CSeq: 6
Session: 12345678
Content-Type: application/mrcp
Content-Length: 188
RECOGNIZE 314163 MRCP/1.0
No-Input-Timeout: 7000
Recognizer-Start-Timers: false
Save-Waveform: true
Ver-Buffer-Utterance: true
N-Best-List-Length: 2
Content-Type: text/uri-list
Content-Length: 33
builtin:grammar/digits?length=5
S->C:RTSP/1.0 200 OK
CSeq: 6
Session: 12345678
Content-Length: 33
Content-Type: application/mrcp
MRCP/1.0 314163 200 IN-PROGRESS
S->C:ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
CSeq: 1
Session: 12345678
Content-Length: 65
Content-Type: application/mrcp
START-OF-SPEECH 314163 IN-PROGRESS MRCP/1.0
Proxy-Sync-Id: 1
C->S:RTSP/1.0 200 OK
CSeq: 1
The recognition result contains 2 choices.
S->C:ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
CSeq: 2
Session: 12345678
Content-Length: 3511
Content-Type: application/mrcp
RECOGNITION-COMPLETE 314163 COMPLETE MRCP/1.0
Completion-Cause: 000 success
Waveform-URL: http://media.server.com/waveforms/utt01.wav
Content-Type: application/x-nlsml
Content-Length: 3280
13579
one three five seven nine
13479
one three four seven nine
C->S:RTSP/1.0 200 OK
CSeq: 2
Check whether the first choice from the n-best list exists in the
voiceprint repository.
C->S:ANNOUNCE rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 7
Session: 12345678
Content-Type: application/mrcp
Content-Length: 119
VER-SET-VOICEPRINT 314164 MRCP/1.0
Voiceprint-URI: http://media.server.com/VoicePrints
Voiceprint-Identifier: 13579
Voiceprint ID 13579 does not exist.
S->C:RTSP/1.0 200 OK
CSeq: 7
Session: 12345678
Content-Length: 139
Content-Type: application/mrcp
MRCP/1.0 314164 200 COMPLETE
Voiceprint-URI: http://media.server.com/VoicePrints
Voiceprint-Identifier: 13579
Voiceprint-Exists: false
Check the second choice in the n-best list.
C->S:ANNOUNCE rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 8
Session: 12345678
Content-Type: application/mrcp
Content-Length: 119
VER-SET-VOICEPRINT 314165 MRCP/1.0
Voiceprint-URI: http://media.server.com/VoicePrints
Voiceprint-Identifier: 13479
Voiceprint ID 13479 exists.
S->C:RTSP/1.0 200 OK
CSeq: 8
Session: 12345678
Content-Length: 138
Content-Type: application/mrcp
MRCP/1.0 314165 200 COMPLETE
Voiceprint-URI: http://media.server.com/VoicePrints
Voiceprint-Identifier: 13479
Voiceprint-Exists: true
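The two VER-SET-VOICEPRINT exchanges above amount to walking the
n-best list until a voiceprint is found. A minimal client-side sketch
of that loop follows. The names parse_mrcp_response and
first_enrolled are hypothetical, and send_set_voiceprint stands in
for whatever transport code issues the request and returns the MRCP
response text; none of these are defined by this document.

```python
def parse_mrcp_response(text):
    """Split an MRCP/1.0 response into (request_id, status, state, headers)."""
    lines = text.strip().splitlines()
    _version, request_id, status, state = lines[0].split()
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(request_id), int(status), state, headers


def first_enrolled(candidates, send_set_voiceprint):
    """Return the first n-best candidate whose voiceprint exists, else None."""
    for candidate in candidates:
        response = send_set_voiceprint(candidate)
        _, status, _, headers = parse_mrcp_response(response)
        if status == 200 and headers.get("voiceprint-exists") == "true":
            return candidate
    return None


def fake_server(identifier):
    """Stand-in for the server in this example: only 13479 is enrolled."""
    exists = "true" if identifier == "13479" else "false"
    return (
        "MRCP/1.0 314165 200 COMPLETE\r\n"
        "Voiceprint-URI: http://media.server.com/VoicePrints\r\n"
        f"Voiceprint-Identifier: {identifier}\r\n"
        f"Voiceprint-Exists: {exists}\r\n"
    )


chosen = first_enrolled(["13579", "13479"], fake_server)
```

With the stand-in server, chosen is "13479", matching the outcome of
the message trace.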
Start verification against voiceprint 13479.
C->S:ANNOUNCE rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 9
Session: 12345678
Content-Type: application/mrcp
Content-Length: 54
VER-FROM-BUFFER 314166 MRCP/1.0
Verify-Mode: verify
S->C:RTSP/1.0 200 OK
CSeq: 9
Session: 12345678
Content-Length: 33
Content-Type: application/mrcp
MRCP/1.0 314166 200 IN-PROGRESS
The caller is verified (assume num-min-verification-phrases and num-
max-verification-phrases are 1).
S->C:ANNOUNCE rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 3
Session: 12345678
Content-Type: application/mrcp
Content-Length: 183
VERIFICATION-COMPLETE 314166 COMPLETE MRCP/1.0
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 123
50 cellular-phone female accepted 50 50 cellular-phone female accepted 50
C->S:RTSP/1.0 200 OK
CSeq: 3
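The messages in this example use three start-line shapes: requests
end with the protocol version, responses begin with it, and events
carry a request state before it. A receiver can tell them apart from
token positions alone, as the sketch below illustrates. The function
name classify_start_line and the returned dictionary layout are
assumptions made for illustration.

```python
MRCP_VERSION = "MRCP/1.0"


def classify_start_line(line):
    """Classify an MRCP start line as a request, response, or event."""
    tokens = line.split()
    if len(tokens) == 4 and tokens[0] == MRCP_VERSION:
        # Response: version request-id status-code request-state
        return {"kind": "response", "request_id": int(tokens[1]),
                "status": int(tokens[2]), "state": tokens[3]}
    if len(tokens) == 4 and tokens[3] == MRCP_VERSION:
        # Event: event-name request-id request-state version
        return {"kind": "event", "name": tokens[0],
                "request_id": int(tokens[1]), "state": tokens[2]}
    if len(tokens) == 3 and tokens[2] == MRCP_VERSION:
        # Request: method-name request-id version
        return {"kind": "request", "method": tokens[0],
                "request_id": int(tokens[1])}
    raise ValueError(f"unrecognized MRCP start line: {line!r}")


response = classify_start_line("MRCP/1.0 314166 200 IN-PROGRESS")
event = classify_start_line("VERIFICATION-COMPLETE 314166 COMPLETE MRCP/1.0")
request = classify_start_line("VER-END-SESSION 314168 MRCP/1.0")
```

Matching on the shared request-id (314166 here) is what lets a
client tie the VERIFICATION-COMPLETE event back to the
VER-FROM-BUFFER request that was left IN-PROGRESS.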
End the verification session.
C->S:ANNOUNCE rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 11
Session: 12345678
Content-Type: application/mrcp
Content-Length: 33
VER-END-SESSION 314168 MRCP/1.0
S->C:RTSP/1.0 200 OK
CSeq: 11
Session: 12345678
Content-Length: 30
Content-Type: application/mrcp
MRCP/1.0 314168 200 COMPLETE
Tear down the recognizer and verification resources.
C->S:TEARDOWN rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 12
Session: 12345678
S->C:RTSP/1.0 200 OK
CSeq: 12
C->S:TEARDOWN rtsp://media.server.com/media/recognizer RTSP/1.0
CSeq: 13
Session: 12345678
S->C:RTSP/1.0 200 OK
CSeq: 13
8.3. Hotword Recognition
Will be provided later.
9. Security Considerations
The primary additional security considerations raised by the
extensions in this document concern the use of speaker
identification and verification as security functions. One such
consideration is that individualized voiceprints are used to
identify or confirm the identity of a caller, so the privacy and
integrity of these voiceprints are of high importance. Fortunately,
voiceprints are not transferred between client and server; they are
maintained by the server using the server’s own security
mechanisms.
Another consideration particular to these functions is the
consequence of manipulating the media (speech) stream. Some
verification technologies in use today are susceptible to
impersonation or "replay" attacks, and all are susceptible to a
denial-of-access attack in which an otherwise acceptable media
stream is garbled. We recommend that standard media-securing
protocols such as SRTP be used in these cases.
10. Reference Documents
[1] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
[2] Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time
Streaming Protocol (RTSP)", RFC 2326, April 1998.
[3] Shanmugham, S., et al., "A Media Resource Control Protocol
Developed by Cisco, Nuance, and Speechworks", Internet-Draft
draft-shanmugham-mrcp-04 (work in progress), May 1, 2003.
[4] World Wide Web Consortium, "Natural Language Semantics Markup
Language (NLSML) for the Speech Interface Framework", W3C
Working Draft, 30 May 2001.
[5] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", RFC 2119, March 1997.
[6] Bray, T., et al., "Namespaces in XML", W3C Recommendation, 14
January 1999. See
http://www.w3.org/TR/1999/REC-xml-names-19990114.
Acknowledgements
The authors would like to thank the following additional individuals
for their contributions to this document:
Andre Gillet (Nuance Communications)
Klaus Reifenrath (Scansoft)
Saravanan Shanmugham (Cisco Systems, Inc.)
Full Copyright Statement
Copyright (C) The Internet Society (2003). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights
defined in the Internet Standards process must be followed, or as
required to translate it into languages other than English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Authors’ Addresses
Daniel C. Burnett
Nuance Communications
1005 Hamilton Court
Menlo Park, CA 94025-1422
USA
Email: burnett@nuance.com
Pierre Forgues
Nuance Communications Ltd.
111 Duke Street
Suite 4100
Montreal, Quebec
Canada H3C 2M1
Email: forgues@nuance.com
Charles Galles
Intervoice, Inc.
17811 Waterview Parkway
Dallas, Texas 75252
Email: charles.galles@intervoice.com
This document expires on June 24, 2004.