Internet Engineering Task Force D. Burnett
Internet-Draft Nuance Communications
draft-burnett-mrcpext-00 P. Forgues
Expires: April 17, 2004 Nuance Communications
C. Galles
Intervoice, Inc.
October 17, 2003
MRCP Extensions: Media Resource Control Protocol Extensions
Status of this Memo
This document is an Internet-Draft and is subject to all provisions
of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
Copyright Notice
Copyright (C) The Internet Society (2003). All Rights Reserved.
Abstract
The Media Resource Control Protocol (MRCP) is an application level
protocol to control media service resources like Speech
Synthesizers, Recognizers, Signal Generators, Signal Detectors, Fax
Servers etc. over a network. This document captures the extensions
required to implement Voice Enrollment, Speaker Verification and
Hotword recognition as well as to augment the recognizer
functionality using MRCP. The extensions are largely orthogonal to
existing features of MRCP and to each other, with an eye towards
backwards compatibility with existing features and independence of
the extensions from each other to simplify integration.
MRCP Extensions October 2003
This document is published as an Internet-Draft as input for further
IETF development in this area.
Burnett, et al. IETF-Draft Page 2
Table of Contents
Status of this Memo
Abstract
1. Introduction
2. Architecture
3. Notational Conventions
4. Recognizer resource extensions
4.1. Recognizer Resource Extensions Methods
4.2. Recognizer Resource Extensions Events
4.3. Recognizer Resource Extensions Header Fields
4.3.1. Recording-URL
4.3.2. Required-Phrase
4.3.3. Phrase-Status
4.3.4. Interpret-Text
4.4. RECORD
4.5. INTERPRET
4.6. RECORDING-COMPLETE
4.7. INTERPRETATION-COMPLETE
5. Enrollment
5.1. Enrollment State Machine
5.2. Enrollment Methods
5.3. Enrollment Events
5.4. Enrollment Header Fields
5.4.1. Num-Min-Consistent-Pronunciations
5.4.2. Consistency-Threshold
5.4.3. Clash-Threshold
5.4.4. Personal-Grammar-URI
5.4.5. Phrase-Id
5.4.6. Phrase-NL
5.4.7. Weight
5.4.8. Save-Waveform
5.4.9. Waveform-URL
5.4.10. New-Phrase-Id
5.4.11. Phrase-Text
5.4.12. Completion-Cause
5.4.13. Num-Clashes
5.4.14. Num-Good-Repetitions
5.4.15. Num-Repetitions-Still-Needed
5.4.16. Consistency-Status
5.4.17. Clash-Phrase-Ids
5.5. Enrollment Methods
5.5.1. START-ENROLLMENT-SESSION
5.5.2. RECOGNIZE
5.5.3. STOP
5.5.4. PAUSE-ENROLLMENT-SESSION
5.5.5. RESUME-ENROLLMENT-SESSION
5.5.6. ENROLLMENT-ROLLBACK
5.5.7. END-ENROLLMENT-SESSION
5.5.8. ABORT-ENROLLMENT-SESSION
5.5.9. MODIFY-PHRASE
5.5.10. ADD-PHRASE
5.5.11. DELETE-PHRASE
5.5.12. RECOGNITION-COMPLETE
6. Speaker Verification and Identification
6.1. Speaker Verification/Identification Resource
6.2. SETUP Verification/Identification Resource
6.3. Speaker Verification State Machine
6.4. Speaker Verification Methods
6.5. Verification Events
6.6. Verification Header Fields
6.6.1. Voiceprint-URI
6.6.2. Voiceprint-Identifier
6.6.3. Voiceprint-Group
6.6.4. Verification-Mode
6.6.5. Adapt-Model
6.6.6. Abort-Model
6.6.7. Buffering-Mode
6.6.8. Security-Level
6.6.9. Num-Min-Verification-Phrases
6.6.10. Num-Max-Verification-Phrases
6.6.11. Completion-Cause
6.6.12. No-Input-Timeout
6.6.13. Save-Waveform
6.6.14. Waveform-URL
6.6.15. Vendor-Specific
6.6.16. Voiceprint-Exists
6.6.17. Is-Valid-Utterance
6.6.18. Num-Valid-Utterances
6.6.19. Decision
6.6.20. Num-Frames
6.6.21. Device
6.6.22. Gender
6.6.23. Matched
6.6.24. Adapted
6.6.25. Verification-Score
6.6.26. Group-Name
6.6.27. Member
6.6.28. Score
6.7. Verification Session Methods
6.7.1. VER-START-SESSION
6.7.2. VER-END-SESSION
6.7.3. VER-SET-VOICEPRINT
6.7.4. VER-DELETE-VOICEPRINT
6.7.5. VERIFY
6.7.6. VER-BUFFERING-START
6.7.7. VER-BUFFERING-CONTROL
6.7.8. VER-BUFFERING-STOP
6.7.9. VER-FROM-BUFFER
6.7.10. VER-ROLLBACK
6.7.11. VER-STOP
6.7.12. VER-START-TIMERS
6.7.13. SET-PARAMS
6.7.14. GET-PARAMS
6.8. Verification Session Events
6.8.1. VERIFICATION-COMPLETE
6.8.2. START-OF-SPEECH
7. Hotword Recognition
7.1. Hotword State Machine
7.1.1. Addressing Resources
7.2. Hotword Header Fields
7.2.1. Hotword-Max-Seconds
7.2.2. Hotword-Min-Seconds
7.3. Hotword Methods
7.3.1. SETUP
7.3.2. RECOGNIZE
8. RTSP based Examples
8.1. Enrollment
8.2. Speaker Verification and Identification
8.3. Hotword Recognition
9. Security Considerations
10. Reference Documents
Acknowledgements
Full Copyright Statement
Authors' Addresses
1. Introduction
The Media Resource Control Protocol (MRCP) [3] is an application
level protocol to control media service resources like Speech
Synthesizers, Recognizers, Signal Generators, Signal Detectors, Fax
Servers etc. over a network. This protocol is designed to work with
streaming protocols like RTSP (Real Time Streaming Protocol) or SIP
(Session Initiation Protocol) which help establish control
connections to external media streaming devices, and media delivery
mechanisms like RTP (Real-time Transport Protocol). MRCP supports basic
recognition and speech synthesis (TTS) capabilities.
This document captures the extensions required to implement Voice
Enrollment, Speaker Verification and Hotword recognition as well as
to augment the recognizer functionality using MRCP. Already having
functional implementations of [3], the authors developed these
extensions within that framework. It is expected that these methods
will also prove useful as information for the IETF in its
standardization efforts beyond this draft version of MRCP.
A major goal of the Recognition, Enrollment, Speaker Verification
and Hotword recognition extensions is to be backward compatible,
i.e. to implement them in such a way that previous functionality is
available without change. In addition, the MRCP extensions used for
Enrollment, Speaker Verification and Identification and Hotword
recognition are independent from one another. This means a client
can implement only the set of methods needed for a particular
integration. For example, only the Enrollment methods and responses
need to be implemented by a client, provided the server has
implemented those methods.
The extensions for Enrollment do not need a separate resource type
because they are implemented as part of the recognition resource.
Speaker Verification and Hotword recognition were defined as new
resource types since they essentially consist of either creating a
verification resource or attaching a special kind of Recognizer
resource on the session in addition to the primary Recognizer
resource (unlike Enrollment).
There is no need to change the underlying protocols to support
Enrollment, Speaker Verification or Hotword recognition. Like the
original MRCP specification, the extensions rely on a protocol like
the Real Time Streaming Protocol (RTSP) or Session Initiation
Protocol (SIP) to establish and maintain the session. The session
control protocol is also responsible for establishing the media
connection from the client to the network server.
The MRCP protocol extensions define the requests, responses and
events needed to control Voice Enrollment, Speaker Verification and
Hotword recognition features. It is assumed the state machine for a
recognition resource is preserved.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119[5].
Please send any feedback on this document directly to the authors.
2. Architecture
There is no change in architecture from the original MRCP
specification. It is assumed that Enrollment is done by a
Recognizer resource. Therefore, an appropriate SETUP message needs
to be sent and a media stream established between a client and
server before these functions are used.
Speaker Verification and Hotword recognition are slightly different.
For Speaker verification, a new verification resource is now
defined. This verification resource can be used on its own or be
attached to a session where a recognition is already set up.
Hotword recognition differs in that a second Recognizer resource
needs to be attached to the same session. The state machine for
this second recognizer is the same as for the primary Recognizer
resource.
The following sections describe each MRCP extension separately:
(1) Recognizer resource extensions, (2) Enrollment, (3) Speaker
Verification and Identification, and (4) Hotword recognition.
3. Notational Conventions
Most of the definitions and syntax follow the same format used in
the MRCP draft submission. The only new rules required represent
the short floating-point numbers that indicate relative weight in
some of the header fields. A weight is normalized in the range of
0 to 1.
WEIGHT = ( "0" [ "." 0*3DIGIT ] ) | ( "1" [ "." 0*3("0") ] )
FLOAT = [ "+" / "-" ] 1*DIGIT [ "." 0*DIGIT ]
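As an illustration only, the WEIGHT and FLOAT rules above can be
transcribed into regular expressions. The following Python sketch is
not part of the protocol; the names WEIGHT_RE, FLOAT_RE, is_weight and
is_float are hypothetical. Note that the rules as written accept a
trailing "." with no digits after it (for example "0." or "2.").

```python
import re

# WEIGHT: "0" with up to three fractional digits, or "1" with up to
# three fractional zeros. [0-9] is used instead of \d so that only
# ASCII digits are accepted.
WEIGHT_RE = re.compile(r"0(?:\.[0-9]{0,3})?|1(?:\.0{0,3})?")

# FLOAT: optional sign, one or more digits, optional fractional part.
FLOAT_RE = re.compile(r"[+-]?[0-9]+(?:\.[0-9]*)?")


def is_weight(text: str) -> bool:
    """Return True if text matches the WEIGHT rule in full."""
    return WEIGHT_RE.fullmatch(text) is not None


def is_float(text: str) -> bool:
    """Return True if text matches the FLOAT rule in full."""
    return FLOAT_RE.fullmatch(text) is not None
```

A receiver might call is_weight() on the value of a Weight header
field before converting it to a number.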
4. Recognizer resource extensions
The only new functionality added to the recognizer resource is the
inclusion of the INTERPRET and RECORD methods and the associated
INTERPRETATION-COMPLETE and RECORDING-COMPLETE events.
4.1. Recognizer Resource Extensions Methods
The following methods are supported by the recognizer resource in
addition to those already defined in [3].
recognizer-extension-method = "RECORD"
| "INTERPRET"
4.2. Recognizer Resource Extensions Events
The recognizer resource may now generate the following events in
addition to those already defined in [3].
recognizer-extension-event = "RECORDING-COMPLETE"
| "INTERPRETATION-COMPLETE"
4.3. Recognizer Resource Extensions Header Fields
The recognizer resource extensions define new header fields to
augment the request, response or event messages they are associated
with.
recognizer-extension-header = "Recording-URL" ; Section 4.3.1
| "Required-Phrase" ; Section 4.3.2
| "Phrase-Status" ; Section 4.3.3
| "Interpret-Text" ; Section 4.3.4
Parameter        Support    Methods/Events/Responses
recording-url    MANDATORY  RECORD, SET-PARAMS, GET-PARAMS
required-phrase  MANDATORY  RECOGNIZE, SET-PARAMS, GET-PARAMS
phrase-status    MANDATORY  RECOGNITION-COMPLETE
interpret-text   MANDATORY  INTERPRET
4.3.1. Recording-URL
This header field specifies the location where the audio stream
recorded by a call to the RECORD method should be saved. Currently,
this URL should only use the 'file' scheme. If the URL is relative,
it is resolved against the current working directory of the MRCP
server process.
This header field MAY be used only when invoking the RECORD,
SET-PARAMS and GET-PARAMS methods.
recording-url = "Recording-URL" ":" Url CRLF
4.3.2. Required-Phrase
This header field specifies the required or expected phrase to be
spoken during recognition. The required phrase is a hint to the
recognizer resource to examine its n-best list to determine if the
required phrase is contained somewhere in the list (even if it is
not the top choice). This header field MAY occur in the RECOGNIZE,
SET-PARAMS, and GET-PARAMS methods. An empty string for this header
field means that no phrase is required. The default value is an
empty string.
Use of the Required-Phrase header field causes the
RECOGNITION-COMPLETE event to include a "Phrase-Status" header field
with a value of "valid" or "invalid" to indicate whether the required
phrase was found in the n-best list.
A scenario in which the required phrase may be useful is in voice
verification against an expected response. If the caller does not
speak a valid phrase, the client can use a phrase status of
"invalid" to roll back a verification resource utterance.
required-phrase = "Required-Phrase" ":" 1*ALPHA CRLF
4.3.3. Phrase-Status
This header field indicates the validity of the caller utterance
when a required phrase is used. Utterances that produce a recognition
result matching the required phrase anywhere in the recognizer's
n-best list yield a Phrase-Status of "valid". Recognition results
that do not match the required phrase anywhere in the n-best list
yield a Phrase-Status of "invalid".
phrase-status = "Phrase-Status" ":" phrase-status-string CRLF
phrase-status-string = "valid" | "invalid"
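The relationship between Required-Phrase and Phrase-Status can be
sketched as follows. This Python fragment is illustrative only. The
function name phrase_status and the comparison rule (case-insensitive,
with runs of whitespace collapsed) are assumptions; this document does
not define how a match against the n-best list is determined.

```python
from typing import Optional, Sequence


def _normalize(text: str) -> str:
    # Assumed comparison rule: ignore case and collapse whitespace.
    return " ".join(text.lower().split())


def phrase_status(n_best: Sequence[str],
                  required_phrase: str) -> Optional[str]:
    """Return "valid" or "invalid", or None if no phrase is required."""
    if not required_phrase.strip():
        # An empty Required-Phrase means no phrase is required, so
        # there is no Phrase-Status to report.
        return None
    wanted = _normalize(required_phrase)
    # The phrase may appear anywhere in the list, not only first.
    found = any(_normalize(entry) == wanted for entry in n_best)
    return "valid" if found else "invalid"
```

For example, a required phrase that is only the second entry of the
n-best list still yields "valid".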
4.3.4. Interpret-Text
This header field is used to provide the text string for which a
natural language interpretation is desired. This header field MUST
be used when invoking the INTERPRET method as it cannot be set with
the SET-PARAMS method.
interpret-text = "Interpret-Text" ":" 1*OCTET CRLF
4.4. RECORD
The RECORD method does not invoke the recognizer resource but simply
endpoints and records the input audio stream. It saves the
endpointed audio to the URL supplied in the recording-url header
field. Currently, this URL can only use the 'file' scheme.
If a RECOGNIZE, INTERPRET or another RECORD operation is already in
progress, invoking this method will cause the response to have a
status code of 402, "Method not valid in this state", and a COMPLETE
request state.
If the recording-url is not valid, a status code of 404, "Illegal
Value for Parameter", will be returned in the response. If it is
impossible for the server to create the requested file, a status
code of 407, "Method or Operation Failed", will be returned.
If the recording-url is valid, the recording operation is initiated
and the response will indicate an IN-PROGRESS request state. The
server MAY generate a subsequent START-OF-SPEECH event when speech
is detected. Upon completion of the recording operation, the server
will generate a RECORDING-COMPLETE event.
Example:
C->S:RECORD 456234 MRCP/1.0
Recording-URL: file://mediaserver/recordings/myfile.wav
S->C:MRCP/1.0 456234 200 IN-PROGRESS
S->C:START-OF-SPEECH 456234 IN-PROGRESS MRCP/1.0
S->C:RECORDING-COMPLETE 456234 COMPLETE MRCP/1.0
Completion-Cause: 000 success
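The status codes described in this section for the RECORD response can
be summarized in a short decision function. The Python sketch below is
illustrative only: the names REASONS and record_status_code and the
three boolean inputs are hypothetical, and the order in which the
conditions are checked is an assumption, since this section does not
specify it.

```python
# Reason phrases for the failure codes, as quoted in this section.
REASONS = {
    402: "Method not valid in this state",
    404: "Illegal Value for Parameter",
    407: "Method or Operation Failed",
}


def record_status_code(operation_in_progress: bool,
                       url_is_valid: bool,
                       file_can_be_created: bool) -> int:
    """Pick the status code for the response to a RECORD request."""
    if operation_in_progress:
        # A RECOGNIZE, INTERPRET or RECORD is already running.
        return 402
    if not url_is_valid:
        return 404
    if not file_can_be_created:
        return 407
    # Recording starts; the response carries IN-PROGRESS.
    return 200
```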
4.5. INTERPRET
The INTERPRET method from the client to the server takes as input an
interpret-text header, containing the text for which the semantic
interpretation is desired, and returns, via the INTERPRETATION-
COMPLETE event, an interpretation result which is very similar to
the one returned from a RECOGNIZE method invocation. Only portions
of the result relevant to acoustic matching are excluded from the
result. The interpret-text header MUST be included in the INTERPRET
request.
Recognizer grammar data is treated in the same way as it is when
issuing a RECOGNIZE method call.
If a RECOGNIZE, RECORD or another INTERPRET operation is already in
progress, invoking this method will cause the response to have a
status code of 402, "Method not valid in this state", and a COMPLETE
request state.
Example:
C->S:INTERPRET 234567 MRCP/1.0
Interpret-Text: may I speak to Andre Roy
Content-Type: application/grammar+xml
Content-Id: request1@form-level.store
Content-Length: 104
ouiyes
may I speak to
Michel TremblayAndre Roy
S->C:MRCP/1.0 234567 200 IN-PROGRESS
S->C:INTERPRETATION-COMPLETE 234567 COMPLETE MRCP/1.0
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 276
Andre Roy
may I speak to Andre Roy
4.6. RECORDING-COMPLETE
This event from the recognition resource to the client indicates
that the RECORD operation is complete. The request state MUST be
set to COMPLETE.
The completion-cause header MUST be included in this event. It MUST
be set to one of the following values defined for the recognizer
resource:
Cause-Code  Cause-Name        Description
000         success           RECORD completed successfully
002         no-input-timeout  RECORD completed with no audio
                              recorded due to lack of input
006         error             RECORD operation terminated
                              due to an error
When the completion-cause is "000 success", the URL specified via
the recording-url header in the RECORD method invocation will
contain the recorded audio. The client may then use this URL to
retrieve the audio.
Example:
C->S:RECORD 456234 MRCP/1.0
Recording-URL: file://mediaserver/recordings/myfile.wav
S->C:MRCP/1.0 456234 200 IN-PROGRESS
S->C:START-OF-SPEECH 456234 IN-PROGRESS MRCP/1.0
S->C:RECORDING-COMPLETE 456234 COMPLETE MRCP/1.0
Completion-Cause: 000 success
4.7. INTERPRETATION-COMPLETE
This event from the recognition resource to the client indicates
that the INTERPRET operation is complete. The interpretation result
is sent in the body of the MRCP message. The request state MUST be
set to COMPLETE.
The completion-cause header MUST be included in this event and MUST
be set to one of the following two values defined for the recognizer
resource:
Cause-Code  Cause-Name  Description
000         success     INTERPRET completed successfully
006         error       INTERPRET terminated due to an error
Example:
C->S:INTERPRET 234567 MRCP/1.0
Interpret-Text: may I speak to Andre Roy
Content-Type: application/grammar+xml
Content-Id: request1@form-level.store
Content-Length: 104
ouiyes
may I speak to
Michel TremblayAndre Roy
S->C:MRCP/1.0 234567 200 IN-PROGRESS
S->C:INTERPRETATION-COMPLETE 234567 COMPLETE MRCP/1.0
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 276
Andre Roy
may I speak to Andre Roy
5. Enrollment
This document captures the extensions required to implement Voice
Enrollment, Speaker Verification and Hotword recognition using MRCP.
This section describes the methods, responses and events needed for
doing Enrollment.
Enrollment can be performed using a person's voice or by building
the personal grammar using text entry. For example, a list of
contacts can be created and maintained by recording the contacts'
names in the person's own voice or by editing the list of contacts
using a Web-based tool. These techniques are called Voice Enrollment
and Text-based Enrollment, respectively.
Voice Enrollment has a concept of an enrollment session. Adding a
new phrase to a personal grammar involves an initial enrollment
followed by as many repeated utterances as are needed before the new
phrase is committed to the personal grammar. Each time an utterance is recorded,
it is compared for similarity with the other samples and a clash
test is performed against other entries in the personal grammar to
ensure there are no similar and confusable entries.
5.1. Enrollment State Machine
Starting an enrollment session does not change the state of the
recognizer resource, i.e. it remains idle. Once an enrollment
session is started, utterances are enrolled by calling the
RECOGNIZE method repeatedly. The state of the recognizer resource
goes from IDLE to RECOGNIZING each time RECOGNIZE is called.
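The behaviour described above can be pictured with a small model. The
Python sketch below is illustrative only. The class name
RecognizerModel and its method names are hypothetical. The return to
IDLE when RECOGNITION-COMPLETE is sent is a simplifying assumption;
the original MRCP specification [3] defines the full recognizer state
machine.

```python
class RecognizerModel:
    """Toy model of a recognizer resource used for enrollment."""

    def __init__(self) -> None:
        self.state = "IDLE"
        self.enrolling = False  # True while an enrollment session is open

    def start_enrollment_session(self) -> None:
        # Opening a session does not change the resource state.
        self.enrolling = True

    def recognize(self) -> None:
        if self.state != "IDLE":
            raise RuntimeError("RECOGNIZE is only modelled from IDLE")
        # Each RECOGNIZE call moves the resource to RECOGNIZING.
        self.state = "RECOGNIZING"

    def recognition_complete(self) -> None:
        # Assumption: the resource returns to IDLE after the event.
        self.state = "IDLE"
```

In this model, enrolling a phrase is a loop of recognize() and
recognition_complete() calls made while enrolling is True.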
5.2. Enrollment Methods
Enrollment supports the following methods.
enrollment-method = "START-ENROLLMENT-SESSION"
| "RECOGNIZE"
| "STOP"
| "PAUSE-ENROLLMENT-SESSION"
| "RESUME-ENROLLMENT-SESSION"
| "ENROLLMENT-ROLLBACK"
| "END-ENROLLMENT-SESSION"
| "ABORT-ENROLLMENT-SESSION"
| "MODIFY-PHRASE"
| "ADD-PHRASE"
| "DELETE-PHRASE"
5.3. Enrollment Events
Enrollment may generate the following events.
enrollment-event = "RECOGNITION-COMPLETE"
5.4. Enrollment Header Fields
An Enrollment request may contain header fields containing request
options and information to augment the Request, Response or Event
message it is associated with.
Some of the header fields in the following list, such as
Save-Waveform and Waveform-URL, are from the MRCP Recognizer
resource. They are repeated here because they also relate to
enrollment operations.
enrollment-header =
num-min-consistent-pronunciations ; Section 5.4.1
| consistency-threshold ; Section 5.4.2
| clash-threshold ; Section 5.4.3
| personal-grammar-uri ; Section 5.4.4
| phrase-id ; Section 5.4.5
| phrase-nl ; Section 5.4.6
| weight ; Section 5.4.7
| save-waveform ; Section 5.4.8
| waveform-url ; Section 5.4.9
| new-phrase-id ; Section 5.4.10
| phrase-text ; Section 5.4.11
| completion-cause ; Section 5.4.12
Parameter                          Support    Methods/Events
num-min-consistent-pronunciations  MANDATORY  START-ENROLLMENT-SESSION,
                                              SET-PARAMS, GET-PARAMS
consistency-threshold              Optional   START-ENROLLMENT-SESSION,
                                              SET-PARAMS, GET-PARAMS
clash-threshold                    Optional   START-ENROLLMENT-SESSION,
                                              SET-PARAMS, GET-PARAMS
personal-grammar-uri               MANDATORY  START-ENROLLMENT-SESSION,
                                              SET-PARAMS, GET-PARAMS,
                                              MODIFY-PHRASE, ADD-PHRASE,
                                              DELETE-PHRASE
phrase-id                          MANDATORY  ADD-PHRASE, DELETE-PHRASE,
                                              MODIFY-PHRASE,
                                              END-ENROLLMENT-SESSION
phrase-nl                          MANDATORY  ADD-PHRASE, MODIFY-PHRASE,
                                              END-ENROLLMENT-SESSION
weight                             Optional   ADD-PHRASE, MODIFY-PHRASE,
                                              END-ENROLLMENT-SESSION
save-waveform                      MANDATORY  SET-PARAMS, GET-PARAMS,
                                              RECOGNIZE
waveform-url                       MANDATORY  RECOGNITION-COMPLETE
new-phrase-id                      Optional   MODIFY-PHRASE
phrase-text                        MANDATORY  ADD-PHRASE
completion-cause                   MANDATORY  RECOGNITION-COMPLETE
For enrollment-specific header fields that can appear as part of
SET-PARAMS or GET-PARAMS methods, the following general rule
applies: The START-ENROLLMENT-SESSION method must be called before
these header fields can be set through the SET-PARAMS method or
retrieved through the GET-PARAMS method.
enrollment-result-elements =
num-clashes ; Section 5.4.13
| num-good-repetitions ; Section 5.4.14
| num-repetitions-still-needed; Section 5.4.15
| consistency-status ; Section 5.4.16
| clash-phrase-id ; Section 5.4.17
5.4.1. Num-Min-Consistent-Pronunciations
This parameter MAY be specified in a START-ENROLLMENT-SESSION,
SET-PARAMS, or GET-PARAMS method and is used to specify the minimum
number of consistent pronunciations that must be obtained to voice
enroll a new phrase. The minimum value is 1. The default value is 2.
num-min-consistent-pronunciations =
"Num-Min-Consistent-Pronunciations" ":" 1*DIGIT CRLF
5.4.2. Consistency-Threshold
This parameter MAY be sent as part of the START-ENROLLMENT-SESSION,
SET-PARAMS, or GET-PARAMS method. Used during voice-enrollment,
this parameter specifies how similar an utterance needs to be to a
previously enrolled pronunciation of the same phrase to be
considered "consistent." The higher the threshold, the closer the
match between an utterance and previous pronunciations must be for
the pronunciation to be considered consistent. The range for this
threshold is 0 to 100.
consistency-threshold = "Consistency-Threshold" ":" 1*DIGIT CRLF
5.4.3. Clash-Threshold
This parameter MAY be sent as part of the START-ENROLLMENT-SESSION,
SET-PARAMS, or GET-PARAMS method. Used during voice-enrollment, this
parameter specifies how similar the pronunciations of two different
phrases can be before they are considered to be clashing. For
example, pronunciations of phrases such as "John Smith" and "Jon
Smits" may be so similar that they are difficult to distinguish
correctly. A smaller threshold reduces the number of clashes
detected. The range for this threshold is 0 to 100. The default
value for this field is platform specific.
clash-threshold = "Clash-Threshold" ":" 1*DIGIT CRLF
5.4.4. Personal-Grammar-URI
This parameter specifies the speaker-trained grammar to be used or
referenced during enrollment operations. For example, a contact
list for user "Jeff" could be stored at the Personal-Grammar-
URI="http://myserver/myenrollmentdb/jeff-list". There is no default
value for this header field.
personal-grammar-uri = "Personal-Grammar-URI" ":" Url CRLF
5.4.5. Phrase-Id
This header identifies a phrase in a personal grammar and will also
be returned when doing recognition. This header field MAY occur in
ADD-PHRASE, DELETE-PHRASE, MODIFY-PHRASE and END-ENROLLMENT-SESSION
requests. There is no default value for this header field.
phrase-id = "Phrase-ID" ":" 1*ALPHA CRLF
5.4.6. Phrase-NL
This is a string specifying the natural language statement to
execute when the phrase is recognized. This header field MAY occur
in ADD-PHRASE, MODIFY-PHRASE and END-ENROLLMENT-SESSION requests.
There is no default value for this header field.
phrase-nl = "Phrase-NL" ":" 1*ALPHA CRLF
5.4.7. Weight
The value of this header field represents the occurrence likelihood
of this branch of the grammar. The weights are normalized to sum to
one at compilation time, so a value of '1' for every branch gives
all branches the same weight. This header field MAY occur in
ADD-PHRASE, MODIFY-PHRASE and END-ENROLLMENT-SESSION requests. The
default value is 1.
weight = "Weight" ":" WEIGHT CRLF
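The normalization of weights at grammar compilation time can be
pictured with the following Python sketch. It is illustrative only:
the function name normalize_weights and the use of a dictionary keyed
by phrase identifier are assumptions, and a real recognizer may
normalize differently.

```python
from typing import Dict


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale the per-phrase weights so that they sum to one."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be greater than zero")
    return {phrase_id: w / total for phrase_id, w in weights.items()}
```

With four phrases that all carry the default weight of 1, each phrase
ends up with a normalized weight of 0.25.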
5.4.8. Save-Waveform
This header field is from the recognizer resource and it allows the
client to indicate to the recognizer that it MUST save the audio
stream that was used during the enrollment session. The recognizer
MUST then record the recognized audio and make it available to the
client in the form of a URL returned in the waveform-url header
field in the RECOGNITION-COMPLETE event. If there was an error in
recording the stream or the audio clip is otherwise not available,
the recognizer MUST return an empty waveform-url header field.
save-waveform = "Save-Waveform" ":" Boolean-value CRLF
5.4.9. Waveform-URL
This header field is from the recognizer resource. If the Save-
Waveform header field is set to true, the recognizer MUST record the
incoming audio stream of the recognition into a file and provide a
URL for the client to access it. This header MUST be present in the
RECOGNITION-COMPLETE event if the Save-Waveform header field was set
to true. The URL value of the header MUST be empty if there was
some error preventing the server from recording. Otherwise, the URL
generated by the server MUST be unique across the server and all its
recognition and enrollment sessions.
waveform-url = "Waveform-URL" ":" Url CRLF
5.4.10. New-Phrase-Id
This header field replaces the id used to identify the phrase in a
personal grammar. The recognizer returns the new id when using an
enrollment grammar. This header field MAY occur in MODIFY-PHRASE
requests.
new-phrase-id = "New-Phrase-ID" ":" 1*ALPHA CRLF
5.4.11. Phrase-Text
This represents the text that will be returned by the recognizer
when a text enrolled phrase is recognized. This parameter is plain
text. This header field MAY occur in ADD-PHRASE requests.
phrase-text = "Phrase-Text" ":" 1*ALPHA CRLF
5.4.12. Completion-Cause
This header field is from the recognizer resource and it MUST be
specified in a RECOGNITION-COMPLETE event coming from the recognizer
resource to the client. This indicates the reason behind the
RECOGNIZE request completion.
The error codes used for Enrollment should not clash with those for
normal recognition. There are no completion-cause values specific
to enrollment, so please refer to the original MRCP specification
for valid completion causes.
completion-cause = "Completion-Cause" ":" 1*DIGIT SP
1*ALPHA CRLF
5.4.13. Num-Clashes
This is not a header field, but part of the recognition results. It
is returned in a RECOGNITION-COMPLETE event. Its value represents
the number of clashes that this pronunciation has with other
pronunciations in an active enrollment session. The header field
Clash-Threshold determines the sensitivity of the clash measurement.
Clash testing can be turned off completely by setting Clash-
Threshold to 0.
num-clashes = "num-clashes" ":" 1*DIGIT CRLF
5.4.14. Num-Good-Repetitions
This is not a header field, but part of the recognition results. It
is returned in a RECOGNITION-COMPLETE event. Its value represents
the number of consistent pronunciations obtained so far in an active
enrollment session.
num-good-repetitions = "num-good-repetitions" ":" 1*DIGIT CRLF
5.4.15. Num-Repetitions-Still-Needed
This is not a header field, but part of the recognition results. It
is returned in a RECOGNITION-COMPLETE event. Its value represents
the number of consistent pronunciations that must still be obtained
before the new phrase can be added to the enrollment grammar. The
number of consistent pronunciations required is determined by the
parameter Num-Min-Consistent-Pronunciations, whose default value is
two. The returned value must be 0 before the system will allow you
to end an enrollment session for a new phrase.
num-repetitions-still-needed =
"num-repetitions-still-needed" ":" 1*DIGIT CRLF
5.4.16. Consistency-Status
This is not a header field, but part of the recognition results. It
is returned in a RECOGNITION-COMPLETE event. This is used to
indicate how consistent the repetitions are when learning a new
phrase. It can have the values of CONSISTENT, INCONSISTENT and
UNDECIDED.
consistency-status = "consistency-status" ":" 1*ALPHA CRLF
5.4.17. Clash-Phrase-Ids
This is not a header field, but part of the recognition results. It
is returned in a RECOGNITION-COMPLETE event. This gets filled with
the phrase ids of the clashing pronunciation(s). This field is
absent if there are no clashes. This MAY occur in RECOGNITION-
COMPLETE events.
phrase-id = "phrase-id" ":" 1*ALPHA CRLF
Phrase-Id à
5.5. Enrollment Methods
5.5.1. START-ENROLLMENT-SESSION
The START-ENROLLMENT-SESSION method sent from the client to the
server starts a new enrollment session during which the client may
call RECOGNIZE to enroll a new utterance. This consists of a set of
calls to RECOGNIZE in which the caller speaks a phrase several times
so the system can "learn" it. You then add the phrase to a personal
grammar (speaker-trained grammar), and the system can recognize it
later.
Only one enrollment session may be active at a time. The Personal-
Grammar-URI identifies the grammar that is used during enrollment to
store the personal list of phrases. Once RECOGNIZE is called, the
result is returned in a RECOGNITION-COMPLETE event and may contain
either an enrollment result OR a recognition result for a regular
recognition.
Calling END-ENROLLMENT-SESSION ends the ongoing enrollment session,
which is typically done after a sequence of successful calls to
RECOGNIZE. Alternatively a call to ABORT-ENROLLMENT-SESSION
terminates the enrollment session without committing the new
enrollments to the database.
The Personal-Grammar-URI, which specifies the grammar to contain the
new enrolled phrase, will be created if it does not exist. Also, the
personal grammar may ONLY contain phrases added via an enrollment
session.
Example:
C->S: START-ENROLLMENT-SESSION 543258 MRCP/1.0
Num-Min-Consistent-Pronunciations: 2
Consistency-Threshold: 30
Clash-Threshold: 12
Personal-Grammar-URI:
S->C: MRCP/1.0 543258 200 COMPLETE
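A client might assemble the request shown above roughly as follows.
This is a sketch only: the helper name build_request is
hypothetical, the blank line terminating the header section is
assumed by analogy with RTSP and HTTP, and the Personal-Grammar-URI
value is a made-up placeholder.

```python
def build_request(method, request_id, headers):
    """Serialize a request in the layout used by the examples:
    a request line followed by one 'Name: value' line per header."""
    lines = [f"{method} {request_id} MRCP/1.0"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Assumption: the header section ends with an empty line.
    return "\r\n".join(lines) + "\r\n\r\n"


message = build_request(
    "START-ENROLLMENT-SESSION",
    543258,
    {
        "Num-Min-Consistent-Pronunciations": 2,
        "Consistency-Threshold": 30,
        "Clash-Threshold": 12,
        # Hypothetical URI, for illustration only.
        "Personal-Grammar-URI": "http://example.com/grammars/user42",
    },
)
```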
5.5.2. RECOGNIZE
The RECOGNIZE method from the client to the server starts an ongoing
enrollment/recognition during which either the phrase is learned, or
recognition occurs against the grammar passed to RECOGNIZE. A START-
OF-SPEECH event followed by a RECOGNITION-COMPLETE event should be
expected.
There can only be a single RECOGNIZE operation IN-PROGRESS at a time
and this method MUST be called during an ongoing START-ENROLLMENT-
SESSION if enrollment is desired.
If the RECOGNIZE request contains a Content-Id header field then the
resulting grammar (which includes the personal grammar as a sub-
grammar) can be referenced from elsewhere by using "session:my-
grammar".
Example:
C->S: RECOGNIZE 543259 MRCP/1.0
Content-Type: application/grammar+xml
Content-Id: my-grammar
Content-Length: 123
help cancel
S->C: MRCP/1.0 543259 200 IN-PROGRESS
S->C: START-OF-SPEECH 543259 200 MRCP/1.0
5.5.3. STOP
The STOP method from the client to the server may only be called
during an ongoing RECOGNIZE operation and is used to abort that
recognition. No RECOGNITION-COMPLETE event will follow.
There is no difference in behavior for regular recognition versus an
enrollment. It is included here for completeness.
Example:
C->S: STOP 543258 MRCP/1.0
S->C: MRCP/1.0 543258 200 COMPLETE
Active-Request-Id-List: 543259
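The response in the example above can be taken apart with a few
lines of code. The sketch below assumes the response line layout
seen in the examples (version, request id, status code, request
state) and assumes that Active-Request-Id-List is a comma-separated
list; the names Response and parse_response are hypothetical.

```python
from typing import NamedTuple


class Response(NamedTuple):
    version: str
    request_id: int
    status_code: int
    request_state: str
    headers: dict


def parse_response(text):
    """Parse 'MRCP/1.0 <request-id> <status> <state>' plus headers."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    version, request_id, status, state = lines[0].split()
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return Response(version, int(request_id), int(status), state, headers)


resp = parse_response(
    "MRCP/1.0 543258 200 COMPLETE\r\n"
    "Active-Request-Id-List: 543259\r\n"
)
# Assumption: multiple request ids would be separated by commas.
stopped = [int(x) for x in resp.headers["Active-Request-Id-List"].split(",")]
```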
5.5.4. PAUSE-ENROLLMENT-SESSION
The PAUSE-ENROLLMENT-SESSION method from the client to the server
may only be called during an ongoing START-ENROLLMENT-SESSION. It
may NOT be called during an ongoing RECOGNIZE operation.
This operation will pause the enrollment session. Any RECOGNIZE
requests sent by the client after the session is paused will only
return recognition results, not enrollment results.
This method is quietly ignored if the resource is already paused. A
response indicating a success status will be returned in those
cases.
Example:
C->S: PAUSE-ENROLLMENT-SESSION 543260 MRCP/1.0
S->C: MRCP/1.0 543260 200 COMPLETE
5.5.5. RESUME-ENROLLMENT-SESSION
The RESUME-ENROLLMENT-SESSION method from the client to the server
may only be called during an ongoing START-ENROLLMENT-SESSION that
has been paused. It may NOT be called during an ongoing RECOGNIZE
operation.
This will resume the enrollment session. Any RECOGNIZE requests
sent by the client after the session is resumed can return
recognition or enrollment results.
This method is quietly ignored if the resource is already resumed.
A response indicating a success status will be returned in those
cases.
Example:
C->S: RESUME-ENROLLMENT-SESSION 543261 MRCP/1.0
S->C: MRCP/1.0 543261 200 COMPLETE
5.5.6. ENROLLMENT-ROLLBACK
The ENROLLMENT-ROLLBACK method discards the last live utterances
from the RECOGNIZE operation. This method should be invoked when the
caller provides undesirable input such as non-speech noises, side-
speech, commands, utterance from the RECOGNIZE grammar, etc. Note
that this method does not provide a stack of rollback states.
Executing ENROLLMENT-ROLLBACK twice in succession without an
intervening recognition operation has no effect the second time.
Example:
C->S: ENROLLMENT-ROLLBACK 543261 MRCP/1.0
S->C: MRCP/1.0 543261 200 COMPLETE
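The absence of a rollback stack can be pictured with the toy model
below. It is a sketch under simplifying assumptions: each RECOGNIZE
is taken to contribute exactly one utterance, and the class and
method names are hypothetical.

```python
class EnrollmentUtterances:
    """Toy model: ENROLLMENT-ROLLBACK undoes only the latest RECOGNIZE."""

    def __init__(self):
        self.utterances = []
        self._can_roll_back = False

    def recognize(self, utterance):
        self.utterances.append(utterance)
        self._can_roll_back = True

    def rollback(self):
        # No stack of rollback states: a second call in a row is a no-op.
        if self._can_roll_back:
            self.utterances.pop()
            self._can_roll_back = False


session = EnrollmentUtterances()
session.recognize("jeff")
session.recognize("(cough)")
session.rollback()  # discards "(cough)"
session.rollback()  # no intervening RECOGNIZE, so nothing happens
```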
5.5.7. END-ENROLLMENT-SESSION
The END-ENROLLMENT-SESSION method can only be called during an
active enrollment session, which was started by calling the method
START-ENROLLMENT-SESSION. It may NOT be called during an ongoing
RECOGNIZE operation. It should be called only when successive calls
to RECOGNIZE have succeeded and Num-Repetitions-Still-Needed has
been returned as 0 in the RECOGNITION-COMPLETE event. The Phrase-ID
passed to this method will be used to identify this phrase in the
grammar and will be returned as the speech input when doing a
RECOGNIZE on the grammar. The Phrase-NL similarly will be returned
in a RECOGNITION-COMPLETE event in the same manner as other NL in a
grammar. The tag-format of this NL is vendor specific.
If the client has specified Save-Waveform as true, the response
should contain the location/URL of a recording of the best
repetition of the learned phrase.
Example:
C->S: END-ENROLLMENT-SESSION 543262 MRCP/1.0
Phrase-Id:
Phrase-NL:
Weight: 1
Save-Waveform: true
S->C: MRCP/1.0 543262 200 COMPLETE
Waveform-URL:
5.5.8. ABORT-ENROLLMENT-SESSION
The ABORT-ENROLLMENT-SESSION method may only be called during an
ongoing enrollment session and is used to abort that session. It
may NOT be called during an ongoing RECOGNIZE operation. After
calling this function, you cannot call END-ENROLLMENT-SESSION and
the phrase is not added to the personal grammar.
Example:
C->S: ABORT-ENROLLMENT-SESSION 543263 MRCP/1.0
S->C: MRCP/1.0 543263 200 COMPLETE
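The ordering rules stated for the session methods in Sections 5.5.1
through 5.5.8 can be summarized in a small model. This is only a
sketch: the class is hypothetical, and raising an exception stands
in for whatever failure status a server would actually return, which
this document does not specify.

```python
class EnrollmentSession:
    """Toy model of the method ordering rules for enrollment."""

    def __init__(self):
        self.active = False       # between START- and END-/ABORT-
        self.paused = False
        self.recognizing = False  # a RECOGNIZE is IN-PROGRESS

    def _require_idle_session(self, method):
        if not self.active:
            raise RuntimeError(f"{method}: no enrollment session is active")
        if self.recognizing:
            raise RuntimeError(f"{method}: RECOGNIZE is in progress")

    def start(self):
        if self.active:
            raise RuntimeError("only one enrollment session may be active")
        self.active, self.paused = True, False

    def recognize(self):
        if self.recognizing:
            raise RuntimeError("only one RECOGNIZE may be IN-PROGRESS")
        self.recognizing = True
        # Enrollment results are possible only in an unpaused session.
        return self.active and not self.paused

    def recognition_complete(self):
        self.recognizing = False

    def pause(self):
        self._require_idle_session("PAUSE-ENROLLMENT-SESSION")
        self.paused = True    # already paused: quietly ignored

    def resume(self):
        self._require_idle_session("RESUME-ENROLLMENT-SESSION")
        self.paused = False   # already resumed: quietly ignored

    def end(self, num_repetitions_still_needed):
        self._require_idle_session("END-ENROLLMENT-SESSION")
        if num_repetitions_still_needed != 0:
            raise RuntimeError("more consistent repetitions are needed")
        self.active = False

    def abort(self):
        self._require_idle_session("ABORT-ENROLLMENT-SESSION")
        self.active = False


s = EnrollmentSession()
s.start()
may_enroll = s.recognize()
s.recognition_complete()
s.pause()
may_enroll_paused = s.recognize()
s.recognition_complete()
s.resume()
s.end(num_repetitions_still_needed=0)
```

In this model a RECOGNIZE issued while the session is paused can
only yield recognition results, and END-ENROLLMENT-SESSION is
refused until no further repetitions are needed.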
5.5.9. MODIFY-PHRASE
The MODIFY-PHRASE method sent from the client to the server is used
to change the phrase ID, NL phrase and/or weight for a given phrase
in a personal grammar.
If no fields are supplied then calling this method has no effect and
it is silently ignored.
Example:
C->S: MODIFY-PHRASE 543265 MRCP/1.0
Personal-Grammar-URI:
Phrase-Id:
New-Phrase-Id:
Phrase-NL:
Weight: 1
S->C: MRCP/1.0 543265 200 COMPLETE
5.5.10. ADD-PHRASE
The ADD-PHRASE method sent from the client to the server is used to
add a text phrase to a personal grammar. The phrase must be simple
text with no special characters. As with voice enrollment, a
Phrase-Id, NL phrase and weight MAY be supplied.
Example:
C->S: ADD-PHRASE 543266 MRCP/1.0
Personal-Grammar-URI:
Phrase-Id:
Phrase-Text:
Phrase-NL:
Weight: 1
S->C: MRCP/1.0 543266 200 COMPLETE
5.5.11. DELETE-PHRASE
The DELETE-PHRASE method sent from the client to the server is used
to delete a phrase in a personal grammar added through voice
enrollment or text enrollment. If the specified phrase doesn't
exist, this method has no effect and it is silently ignored.
Example:
C->S: DELETE-PHRASE 543266 MRCP/1.0
Personal-Grammar-URI:
Phrase-Id:
S->C: MRCP/1.0 543266 200 COMPLETE
5.5.12. RECOGNITION-COMPLETE
The RECOGNITION-COMPLETE event follows a method call to RECOGNIZE
and is used to communicate to the client the results of the
enrollment. Note that the event can contain recognition or
enrollment results depending on what was spoken.
Example:
S->C: RECOGNITION-COMPLETE 543259 200 MRCP/1.0
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 123
2 1 1
consistent Jeff Andre
6. Speaker Verification and Identification
This document captures the extensions required to implement Voice
Enrollment, Speaker Verification / Identification and Hotword
recognition using MRCP. This section describes the methods,
responses and events needed for doing Speaker Verification /
Identification.
6.1. Speaker Verification/Identification Resource
Speaker verification is a voice authentication feature that can be
used to identify the speaker in order to grant the user access to
sensitive information and transactions. To do this, a recorded
utterance is compared to a voiceprint previously stored for that
user. Verification consists of two phases: a designation phase to
establish the claimed identity of the caller and an execution phase
in which a voiceprint is either created (training) or used to
authenticate the claimed identity (verification).
Speaker identification identifies the speaker from a set of valid
users, such as family members. Identification can be performed on a
small set of users or for a large population. This feature is
useful for applications where multiple users share the same account
number, but where the individual speaker must be uniquely identified
from the group. Speaker identification is also done in two phases,
a designation phase and an execution phase.
A speaker verification resource can either share a session with an
existing recognizer resource, or be SETUP to operate in standalone
mode, without a recognizer resource sharing the same session. In
order to share a session, the SETUP message for the verification
resource should include the RTSP session identifier of the
recognizer resource whose session it wishes to share. If no session
identifier is specified, an independent verification resource,
running on the same physical server or a separate one, will be set
up.
Some of the speaker verification methods, described below, apply
only to a specific mode of operation.
The verification resource supports some buffering methods that allow
the user to buffer the verification data from one or more utterances
and then process this set of utterances as a single entity. This is
different from collecting waveforms and processing them using the
verification methods that operate directly on the incoming audio
stream because the buffering mechanism does not simply accumulate
utterance data to a buffer. In particular, when both the
recognition and verification resources share the same session,
additional information gathered by the recognition resource is saved
with these buffers to improve verification performance.
6.2. SETUP Verification/Identification Resource
The SETUP method from the client to the server is used to open a
resource for verification/identification from a media server. If
the session-id header field is specified in the SETUP method, the
verification/identification resource shares the same session
with other resources in the session. Otherwise, a new session is
created for the verification/identification resource. The
resource name is "verification-resource".
Example:
This example assumes the verification resource would share a session
that is already created.
C->S: SETUP rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 1
Transport: RTP/AVP;unicast;client_port=46456-46457
Session: 0a030258_00003815_3bc4873a_0001_0000
S->C: RTSP/1.0 200 OK
CSeq: 1
Transport: RTP/AVP;unicast;client_port=46456-46457;
server_port=46460-46461
Session: 0a030258_00003815_3bc4873a_0001_0000
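The difference between a shared and a standalone verification
resource comes down to whether the Session header is sent. The
sketch below builds both variants of the SETUP request shown above;
build_setup is a hypothetical helper, and the blank line ending the
header section is the usual RTSP convention.

```python
def build_setup(url, cseq, client_ports, session_id=None):
    """Build an RTSP SETUP request for the verification resource."""
    lo, hi = client_ports
    lines = [
        f"SETUP {url} RTSP/1.0",
        f"CSeq: {cseq}",
        f"Transport: RTP/AVP;unicast;client_port={lo}-{hi}",
    ]
    if session_id is not None:
        # Share the session of an existing resource.
        lines.append(f"Session: {session_id}")
    return "\r\n".join(lines) + "\r\n\r\n"


URL = "rtsp://media.server.com/media/verification-resource"

shared = build_setup(
    URL, 1, (46456, 46457),
    session_id="0a030258_00003815_3bc4873a_0001_0000",
)
standalone = build_setup(URL, 1, (46456, 46457))
```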
6.3. Speaker Verification State Machine
Speaker Verification has the concept of training, verification and
buffering sessions. Starting one of these sessions does not change
the state of the verification resource, i.e. it remains idle. Once
a verification or training session is started, utterances are
trained or verified by calling the VERIFY or VER-FROM-BUFFER method.
The state of the Speaker Verification resource goes from IDLE to
VERIFYING each time VERIFY or VER-FROM-BUFFER is called.
6.4. Speaker Verification Methods
Speaker Verification supports the following methods.
verification-method = "VER-START-SESSION"
| "VER-END-SESSION"
| "VER-SET-VOICEPRINT"
| "VER-DELETE-VOICEPRINT"
| "VERIFY"
| "VER-BUFFERING-START"
| "VER-BUFFERING-CONTROL"
| "VER-BUFFERING-STOP"
| "VER-FROM-BUFFER"
| "VER-ROLLBACK"
| "VER-STOP"
| "VER-START-TIMERS"
| "SET-PARAMS"
| "GET-PARAMS"
6.5. Verification Events
Speaker Verification may generate the following events.
verification-event = "VERIFICATION-COMPLETE"
| "START-OF-SPEECH"
6.6. Verification Header Fields
A Speaker Verification request may contain header fields containing
request options and information to augment the Request, Response or
Event message it is associated with.
The verification result elements will be returned in a VERIFICATION-
COMPLETE event containing an NLSML document [4], having a MIME-type
application/x-nlsml. The current specification proposes some
element names which could be incorporated to an
namespace
verification-header =
voiceprint-uri ; Section 6.6.1
| voiceprint-identifier ; Section 6.6.2
| voiceprint-group ; Section 6.6.3
| verification-mode ; Section 6.6.4
| adapt-model ; Section 6.6.5
| abort-model ; Section 6.6.6
| buffering-mode ; Section 6.6.7
| security-level ; Section 6.6.8
| num-min-verification-phrases; Section 6.6.9
| num-max-verification-phrases; Section 6.6.10
| completion-cause ; Section 6.6.11
| no-input-timeout ; Section 6.6.12
| save-waveform ; Section 6.6.13
| waveform-url ; Section 6.6.14
| vendor-specific ; Section 6.6.15
| voiceprint-exists ; Section 6.6.16
Parameter                     Support    Methods/Events
voiceprint-uri                MANDATORY  VER-SET-VOICEPRINT,
                                         VER-DELETE-VOICEPRINT
voiceprint-identifier         MANDATORY  VER-SET-VOICEPRINT,
                                         VER-DELETE-VOICEPRINT
voiceprint-group              Optional   VER-SET-VOICEPRINT,
                                         VER-DELETE-VOICEPRINT
verification-mode             MANDATORY  SET-PARAMS, GET-PARAMS,
                                         VERIFY, VER-FROM-BUFFER
adapt-model                   Optional   VER-START-SESSION
abort-model                   Optional   VER-END-SESSION
buffering-mode                Optional   VER-BUFFERING-CONTROL
security-level                Optional   SET-PARAMS, GET-PARAMS,
                                         VERIFY, VER-FROM-BUFFER
num-min-verification-phrases  Optional   SET-PARAMS, GET-PARAMS,
                                         VERIFY, VER-FROM-BUFFER
num-max-verification-phrases  Optional   SET-PARAMS, GET-PARAMS,
                                         VERIFY, VER-FROM-BUFFER
completion-cause              MANDATORY  VERIFICATION-COMPLETE,
                                         VER-SET-VOICEPRINT,
                                         VER-DELETE-VOICEPRINT
no-input-timeout              MANDATORY  SET-PARAMS, GET-PARAMS,
                                         VERIFY
save-waveform                 MANDATORY  SET-PARAMS, GET-PARAMS,
                                         VERIFY
waveform-url                  MANDATORY  VERIFICATION-COMPLETE
vendor-specific               MANDATORY  SET-PARAMS, GET-PARAMS
voiceprint-exists             MANDATORY  VER-SET-VOICEPRINT,
                                         VER-DELETE-VOICEPRINT
verification-result-elements =
is-valid-utterance ; Section 6.6.17
| num-valid-utterance ; Section 6.6.18
| decision ; Section 6.6.19
| num-frames ; Section 6.6.20
| device ; Section 6.6.21
| gender ; Section 6.6.22
| matched ; Section 6.6.23
| adapted ; Section 6.6.24
| verification-score ; Section 6.6.25
| group-name ; Section 6.6.26
| member ; Section 6.6.27
| score ; Section 6.6.28
6.6.1. Voiceprint-URI
This parameter specifies the voiceprint repository to be used or
referenced during speaker verification or identification operations.
This header field is required in the VER-SET-VOICEPRINT and
VER-DELETE-VOICEPRINT methods. If this header field is set through
the SET-PARAMS method, it can be silently ignored.
voiceprint-uri = "Voiceprint-URI" ":" Url CRLF
6.6.2. Voiceprint-Identifier
This header field specifies the claimed identity for voice
verification applications. The claimed identity may be used to
specify an existing voiceprint or to establish a new voiceprint.
This header field is required in VER-SET-VOICEPRINT and VER-DELETE-
VOICEPRINT method executions in preparation for verification
application operations. The Voiceprint-Identifier is not required
for identification applications except in the VER-DELETE-VOICEPRINT
method when the client needs to remove an identity from a voiceprint
group.
voiceprint-identifier = "Voiceprint-Identifier" ":" 1*ALPHA CRLF
6.6.3. Voiceprint-Group
This header field specifies the voiceprint group for speaker
identification operations. The voiceprint group narrows the
potential voiceprint identification candidates to a subset of the
voiceprints in the repository. This header field may appear in VER-
SET-VOICEPRINT and VER-DELETE-VOICEPRINT method executions for
speaker identification operations. If this header field is absent,
then verification, not identification, operations will be executed.
voiceprint-group = "Voiceprint-Group" ":" 1*ALPHA CRLF
6.6.4. Verification-Mode
This header field specifies the mode of the verification resource in
a VERIFY or VER-FROM-BUFFER method execution. Acceptable values
indicate whether the verification session should ignore audio
("idle"), train a voiceprint ("train"), or verify/identify using an
existing voiceprint ("verify").
The default value for the verification resource mode is "idle".
While the mode is idle, the verification resource only applies
utterance end-pointing to incoming speech and potentially adds
utterances to the audio buffer.
Setting this header field to "train" or "verify" requires that the
voiceprint or voiceprint group identifier attributes have already
been set through the VER-SET-VOICEPRINT method.
Training and verification sessions both require the voiceprint URI
to be specified at the start of the session. In many usage
scenarios, however, the system cannot know the speaker's claimed
identity until the speaker says, for example, their account number.
In order to allow the first few utterances of a dialog to be both
recognized and verified, the verification resource on the MRCP
server retains an audio buffer. In this audio buffer, the MRCP
server will accumulate recognized utterances in memory. The
application can later execute a verification method and apply the
buffered utterances to the current verification session. The
buffering methods are used for this purpose. When buffering is used,
subsequent input utterances are added to the audio buffer for later
analysis.
Some voice user interfaces may require additional user input that
should not be analyzed for verification. For example, the user's
input may have been recognized with low confidence and thus require
a confirmation cycle. In such cases, the client should not execute
the VERIFY or VER-FROM-BUFFER methods to collect and analyze the
caller's input. A separate recognizer resource can analyze the
caller's response without any participation on the part of the
verification resource.
Once the following conditions have been met:
1. Voiceprint identity has been successfully established through the
voiceprint identifier header fields of the VER-SET-VOICEPRINT
method, and
2. the verification mode has been set to one of "train" or "verify",
the verification resource may begin providing verification
information during verification operations. The verification
resource MUST reach one of the two major states ("train" or
"verify") if the above two conditions hold, or it MUST report an
error condition in the MRCP status code to indicate why the
verification resource is not ready for action.
The value of verification-mode is persistent within a verification
session. Changing the mode to a different value than the previous
setting causes the verification resource to report an error if the
previous setting was either "train" or "verify". If the mode is
changed back to its previous value, the operation may continue. For
example:
MRCP                              MRCP
Server                           Client
|                                  |
|<--------VERIFY: mode verify------|
|<--------VERIFY-------------------|
|<--------VERIFY: mode idle--------|
|<--------VERIFY-------------------|
|<--------VERIFY: mode verify------|
The above sequence of VERIFY method requests would start a
verification operation. When the verification resource is placed
into idle, any subsequent audio would be ignored until the final
update to verification-mode. At that time, the verification
operation would continue, using the original utterances and any
subsequent utterances.
verification-mode = "Verification-Mode" ":"
verification-mode-string CRLF
verification-mode-string = "idle"
| "train"
| "verify"
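The persistence of the verification mode across VERIFY requests can
be modeled as follows. This is a sketch of one possible reading of
the text: it treats "idle" as a pause that is always permitted, and
treats a switch between "train" and "verify" as the error case. The
class name is hypothetical, and raising an exception stands in for
an MRCP error status.

```python
VALID_MODES = {"idle", "train", "verify"}


class VerificationMode:
    """Toy model of the persistent Verification-Mode value."""

    def __init__(self):
        self.mode = "idle"   # default mode
        self._major = None   # first of "train"/"verify" seen, if any

    def verify(self, mode=None):
        """Handle one VERIFY; return True if its audio is analyzed."""
        if mode is not None:
            if mode not in VALID_MODES:
                raise ValueError(f"unknown verification mode: {mode}")
            if mode != "idle":
                if self._major not in (None, mode):
                    # Assumed reading of the mode-change error rule.
                    raise RuntimeError(
                        "cannot switch between train and verify")
                self._major = mode
            self.mode = mode
        return self.mode != "idle"


resource = VerificationMode()
# The same sequence of requests as in the diagram above.
analyzed = [
    resource.verify("verify"),
    resource.verify(),
    resource.verify("idle"),
    resource.verify(),
    resource.verify("verify"),
]
```

Under these assumptions the third and fourth requests are the ones
whose audio is ignored.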
6.6.5. Adapt-Model
This header field indicates the desired behavior of the verification
resource after a successful verification execution. If the value of
this parameter is "true", the audio collected during the
verification session is used to update the voiceprint to account for
ongoing changes in a speaker's incoming speech characteristics. If
the value is "false" (the default), the voiceprint is not updated
with the latest audio. This header field MAY only occur in the
VER-START-SESSION method.
adapt-model = "Adapt-Model" ":" Boolean-value CRLF
6.6.6. Abort-Model
The Abort-Model header field indicates the desired behavior of the
verification resource upon session termination. If the value of this
parameter is "true", the pending changes to a voiceprint due to
verification training or verification adaptation are discarded. If
the value is "false" (the default), the pending changes for a
training session or a successful verification session are committed
to the voiceprint repository. A value of "true" for Abort-Model
overrides a value of "true" for the Adapt-Model header field. This
header field MAY only occur in the VER-END-SESSION method.
abort-model = "Abort-Model" ":" Boolean-value CRLF
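The interaction between Adapt-Model and Abort-Model can be written
as a single predicate. The sketch below is an interpretation of the
two sections above rather than normative behavior; the function name
and the session_kind labels "train" and "verify" are assumptions
made for this example.

```python
def commits_voiceprint(session_kind, succeeded, adapt_model=False,
                       abort_model=False):
    """Return True if pending voiceprint changes are committed when
    the session ends."""
    if abort_model:
        return False  # Abort-Model overrides Adapt-Model
    if session_kind == "train":
        return True
    if session_kind == "verify":
        # Adaptation applies only after a successful verification.
        return succeeded and adapt_model
    raise ValueError(f"unknown session kind: {session_kind}")
```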
6.6.7. Buffering-Mode
The Buffering-Mode header field is used to indicate which action, of
pausing or resuming, should be applied to a buffering session. It
MUST only be used with the VER-BUFFERING-CONTROL method.
buffering-mode = "Buffering-Mode" ":" ("pause" | "resume") CRLF
6.6.8. Security-Level
The Security-Level header field determines the range of verification
scores in which a decision of "accepted" may be declared. This
header field MAY occur in SET-PARAMS, GET-PARAMS, VERIFY and VER-
FROM-BUFFER methods. It can be "high" (highest security level),
"medium-high", "medium" (normal security level), "medium-low", or
"low" (low security level). The default value is platform specific.
security-level = "Security-Level" ":" security-level-string CRLF
security-level-string = "high" |
"medium-high" |
"medium" |
"medium-low" |
"low"
6.6.9. Num-Min-Verification-Phrases
The Num-Min-Verification-Phrases header field is used to specify the
minimum number of valid utterances before a positive decision is
given for verification. The value for this parameter is an integer
and the default value is 1. The verification resource should not
announce a decision of "accepted" unless at least Num-Min-
Verification-Phrases utterances are available. The minimum value is 1.
num-min-verification-phrases = "Num-Min-Verification-Phrases" ":"
1*DIGIT CRLF
6.6.10. Num-Max-Verification-Phrases
The Num-Max-Verification-Phrases header field is used to specify the
number of valid utterances required before a decision is forced for
verification. The verification resource MUST NOT return a decision
of "undecided" once Num-Max-Verification-Phrases have been collected
and used to determine a verification score. The value for this
parameter is an integer and the minimum value is 1.
num-max-verification-phrases = "Num-Max-Verification-Phrases" ":"
1*DIGIT CRLF
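The two phrase-count limits act as gates on the decision. The sketch
below shows one way a verification resource might apply them. Both
fallbacks are assumptions made for this example: an early "accepted"
is downgraded to "undecided", and a forced decision that would
otherwise be "undecided" becomes "rejected". This document only
states which outcomes are not allowed.

```python
def gate_decision(raw_decision, num_valid, num_min=1, num_max=None):
    """Apply Num-Min/Num-Max-Verification-Phrases to a raw decision."""
    if raw_decision not in ("accepted", "rejected", "undecided"):
        raise ValueError(f"unknown decision: {raw_decision}")
    if raw_decision == "accepted" and num_valid < num_min:
        # Assumed fallback: too few valid utterances to accept.
        return "undecided"
    if (raw_decision == "undecided" and num_max is not None
            and num_valid >= num_max):
        # Assumed fallback once a decision is forced.
        return "rejected"
    return raw_decision
```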
6.6.11. Completion-Cause
This header field MUST be part of a VERIFICATION-COMPLETE event
coming from the verification resource to the client. This indicates
the reason behind the VERIFY or VER-FROM-BUFFER method completion.
This header field MUST be sent in the VERIFY, VER-FROM-BUFFER and
VER-SET-VOICEPRINT responses if they return with a failure status
and a COMPLETE state.
completion-cause = "Completion-Cause" ":" 1*DIGIT SP
1*ALPHA CRLF
Cause-Code  Cause-Name                  Description
000         success                     VERIFY or VER-FROM-BUFFER request
                                        completed successfully. The verify
                                        decision can be "accepted",
                                        "rejected", or "undecided".
001         error                       VERIFY or VER-FROM-BUFFER request
                                        terminated prematurely due to a
                                        verification resource or system
                                        error.
002         no-input-timeout            VERIFY request completed with no
                                        result due to a no-input-timeout.
003         buffer-empty                VER-FROM-BUFFER request completed
                                        with no result due to empty buffer.
004         invalid-phrase              VERIFY or VER-FROM-BUFFER request
                                        completed, but the required phrase
                                        was not found by a co-operative
                                        recognizer resource. This
                                        completion code is a hint that the
                                        utterance should be removed.
005         out-of-sequence             Verification operation failed due
                                        to out-of-sequence method
                                        invocations. For example calling
                                        VERIFY before VER-SET-VOICEPRINT.
006         voiceprint-uri-failure      Failure accessing voiceprint URI.
007         voiceprint-uri-missing      Voiceprint-uri is not specified.
008         voiceprint-id-missing       Voiceprint-identification is not
                                        specified.
009         voiceprint-id-not-exist     Voiceprint-identification doesn't
                                        exist in the voiceprint repository.
010         voiceprint-group-not-exist  Voiceprint-group doesn't exist.
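A client could check a received Completion-Cause header against the
cause codes listed above as sketched below. The function and
constant names are hypothetical. Note that the ABNF gives the cause
name as 1*ALPHA while the listed names contain hyphens; this sketch
accepts hyphens, which is an assumption about the intended syntax.

```python
import re

CAUSES = {
    0: "success",
    1: "error",
    2: "no-input-timeout",
    3: "buffer-empty",
    4: "invalid-phrase",
    5: "out-of-sequence",
    6: "voiceprint-uri-failure",
    7: "voiceprint-uri-missing",
    8: "voiceprint-id-missing",
    9: "voiceprint-id-not-exist",
    10: "voiceprint-group-not-exist",
}

# Assumption: hyphens are allowed in the cause name.
_PATTERN = re.compile(r"^Completion-Cause:\s*(\d+) ([A-Za-z-]+)$")


def parse_completion_cause(line):
    """Return (code, name) for a Completion-Cause header line."""
    match = _PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"malformed header: {line!r}")
    code, name = int(match.group(1)), match.group(2)
    if CAUSES.get(code) != name:
        raise ValueError(f"code {code} does not match name {name!r}")
    return code, name
```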
6.6.12. No-Input-Timeout
The No-Input-Timeout header field sets the length of time from the
start of the verification timers (see VER-START-TIMERS) until the
declaration of a no-input event in the VERIFICATION-COMPLETE server
event message. The value is in milliseconds. This header field MAY
occur in VERIFY, SET-PARAMS or GET-PARAMS. The value for this field
ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform specific.
The default value for this field is platform specific.
no-input-timeout = "No-Input-Timeout" ":" 1*DIGIT CRLF
6.6.13. Save-Waveform
This header field allows the client to indicate to the verification
resource that it MUST save the audio stream that was used for
verification/identification. The verification resource MUST then
record the audio and make it available to the client in the form of
a URL returned in the waveform-url header field in the
VERIFICATION-COMPLETE event. If there was an error in recording the
stream or the audio clip is otherwise not available, the
verification resource MUST return an empty waveform-url header
field. The default value for this field is "false". This header
field MAY appear in the VERIFY method, but NOT in the VER-FROM-
BUFFER method, since it controls whether or not to save the
waveform for live verification / identification operations only.
save-waveform = "Save-Waveform" ":" Boolean-value CRLF
6.6.14. Waveform-URL
If the save-waveform header field is set to true, the verification
resource MUST record the incoming audio stream of the verification
into a file and provide a URL for the client to access it. This
header MUST be present in the VERIFICATION-COMPLETE event if the
save-waveform header field is set to true. The URL value of the
header MUST be NULL if there was some error condition preventing the
server from recording. Otherwise, the URL generated by the server
SHOULD be globally unique across the server and all its verification
sessions. The URL SHOULD be available until the session is torn
down. Since the save-waveform header field applies only to live
verification / identification operations, the waveform-url will only
be returned in the VERIFICATION-COMPLETE event for live verification
/ identification operations.
waveform-url = "Waveform-URL" ":" Url CRLF
6.6.15. Vendor-Specific
This set of headers allows the client to set Vendor Specific
parameters.
vendor-specific = "Vendor-Specific-Parameters" ":"
vendor-specific-av-pair
*[";" vendor-specific-av-pair] CRLF
vendor-specific-av-pair = vendor-av-pair-name "="
vendor-av-pair-value
This header can be sent in the SET-PARAMS method to set
vendor-specific parameters on the server. The vendor-av-pair-name
can be any vendor-specific field name and conforms to the XML
vendor-specific attribute naming convention. The
vendor-av-pair-value is the value to assign to the attribute, and
needs to be quoted. To retrieve the current values of these
parameters, the client sends this header in the GET-PARAMS method
with a semicolon-separated list of the vendor-specific attribute
names to retrieve. This header field MAY occur in SET-PARAMS or
GET-PARAMS.
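As an illustration of the syntax above, the following Python sketch
parses the value of a Vendor-Specific-Parameters header into
name/value pairs. It assumes that every value is enclosed in double
quotes and contains no escaped quotes; the function name and the
parameter names used to exercise it are hypothetical and are not
defined by this document.

```python
import re

# One av-pair: name = "value", followed by ";" or the end of the string.
# Assumption: values are double-quoted and contain no escaped quotes.
_AV_PAIR = re.compile(r'\s*([^=;"\s]+)\s*=\s*"([^"]*)"\s*(?:;|$)')


def parse_vendor_specific(value):
    """Parse 'name="v1";other="v2"' into a dict of strings."""
    pairs = {}
    pos = 0
    while pos < len(value):
        match = _AV_PAIR.match(value, pos)
        if match is None:
            raise ValueError(f"malformed av-pair at offset {pos}")
        pairs[match.group(1)] = match.group(2)
        pos = match.end()
    return pairs
```

Because the quoted value is matched as a unit, a semicolon inside a
value is not mistaken for a pair separator.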
6.6.16. Voiceprint-Exists
This header field is returned in a VER-SET-VOICEPRINT or VER-DELETE-
VOICEPRINT response. This is the status of the voiceprint specified
in the VER-SET-VOICEPRINT method. For the VER-DELETE-VOICEPRINT
method this field indicates the status of the voiceprint as the
method execution started.
voiceprint-exists = "Voiceprint-Exists" ":" boolean-value CRLF
6.6.17. Is-Valid-Utterance
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
if verification has determined that the last utterance is valid. A
verification utterance is valid if it matches the required
verification phrase, as determined by the recognizer. If the
utterance was valid, you can get other information such as the
acceptance decision and the score. The value can be TRUE or FALSE.
is-valid-utterance = "is-valid-utterance" ":" Boolean-value CRLF
6.6.18. Num-Valid-Utterances
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the cumulative number of valid utterances found during verification.
A verification utterance is valid if it matches the required
verification phrase, as determined by the recognizer.
num-valid-utterance = "num-valid-utterance" ":" 1*DIGIT CRLF
6.6.19. Decision
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the decision as determined by verification. It can have the values
of accepted, rejected or undecided.
decision = "decision" ":" decision-string CRLF
decision-string = "accepted" | "rejected" | "undecided"
6.6.20. Num-Frames
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the number of 10 millisecond speech frames in the last utterance or
in the cumulated set of utterances.
num-frames = "num-frames" ":" 1*DIGIT CRLF
6.6.21. Device
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the apparent type of device used by the caller as determined by
verification. It can have the values of cellular-phone, electret-
phone and carbon-button-phone.
device = "device" ":" device-string CRLF
device-string = "cellular-phone" | "electret-phone"
| "carbon-button-phone"
6.6.22. Gender
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the apparent gender of the speaker as determined by verification. It
can have the values of male or female.
gender = "gender" ":" ("male" | "female") CRLF
6.6.23. Matched
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. When verification is
trying to confirm the voiceprint, this indicates if the last
utterance and the voiceprints are of the same gender and used the
same type of device. It is not returned during verification
training. The value can be TRUE or FALSE.
matched = "matched" ":" Boolean-value CRLF
6.6.24. Adapted
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. When verification is
trying to confirm the voiceprint, this indicates if the voiceprint
has been adapted as a consequence of analyzing the source
utterances. It is not returned during verification training. The
value can be TRUE or FALSE.
adapted = "adapted" ":" Boolean-value CRLF
6.6.25. Verification-Score
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the score of the last utterance as determined by verification.
During verification, the higher the score the more likely it is that
the speaker is the same one as the one who spoke the voiceprint
utterances. During training, the higher the score the more likely
the speaker is to have spoken all of the analyzed utterances. If
there are no such utterances the score is -100.
verification-score = "verification-score" ":" FLOAT CRLF
6.6.26. Group-Name
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the name of the group used in speaker identification.
group-name = "group-name" ":" 1*ALPHA CRLF
6.6.27. Member
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. Its value indicates
the member in a group identified by its URI. There is one URI for
each member in the group.
member = "member" ":" 1*ALPHA CRLF
6.6.28. Score
This is not a header field, but part of the verification results. It
is returned in a VERIFICATION-COMPLETE event. This is the score
associated with the identified member of the group, as returned in
the member result.
score = "score" ":" FLOAT CRLF
6.7. Verification Session Methods
These methods allow the client to control the mode and target of
verification or identification operations within the context of a
session. All the verification input cycles that occur within a
session may be used to create, update, or validate against the
voiceprint specified during the session. At the beginning of each
session the verification resource is reset to a known state.
Verification/identification operations can be executed against live
or buffered audio. The verification resource provides methods for
controlling collection of audio data into an audio buffer, methods
for collecting and evaluating live audio data, and methods for
controlling the verification resource and adjusting its configured
behavior.
The following methods provide controls for collecting buffered audio
data from live caller utterances and subsequently evaluating the
buffered audio against voiceprints:
buffered-audio-method = "VER-FROM-BUFFER"
| "VER-BUFFERING-START"
| "VER-BUFFERING-CONTROL"
| "VER-BUFFERING-STOP"
The following methods provide controls for collection and/or
evaluation of live audio utterances:
live-audio-method = "VERIFY"
| "VER-START-TIMERS"
The following methods provide controls for configuring the
verification resource and for establishing resource states:
live-or-buffered-audio-method = "VER-START-SESSION"
| "VER-END-SESSION"
| "VER-SET-VOICEPRINT"
| "VER-DELETE-VOICEPRINT"
| "VER-ROLLBACK"
| "VER-STOP"
| "SET-PARAMS"
| "GET-PARAMS"
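The three method groups above can be captured in a small lookup
table. The following Python sketch is illustrative only; the
constant and function names are hypothetical, while the method names
are taken from the grammar above.

```python
BUFFERED_AUDIO_METHODS = frozenset({
    "VER-FROM-BUFFER", "VER-BUFFERING-START",
    "VER-BUFFERING-CONTROL", "VER-BUFFERING-STOP",
})
LIVE_AUDIO_METHODS = frozenset({"VERIFY", "VER-START-TIMERS"})
LIVE_OR_BUFFERED_AUDIO_METHODS = frozenset({
    "VER-START-SESSION", "VER-END-SESSION", "VER-SET-VOICEPRINT",
    "VER-DELETE-VOICEPRINT", "VER-ROLLBACK", "VER-STOP",
    "SET-PARAMS", "GET-PARAMS",
})


def classify_method(name):
    """Return the grammar rule name that a verification method belongs to."""
    if name in BUFFERED_AUDIO_METHODS:
        return "buffered-audio-method"
    if name in LIVE_AUDIO_METHODS:
        return "live-audio-method"
    if name in LIVE_OR_BUFFERED_AUDIO_METHODS:
        return "live-or-buffered-audio-method"
    raise ValueError(f"not a verification method: {name}")
```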
6.7.1. VER-START-SESSION
The VER-START-SESSION method starts a Speaker
Verification/Identification Session. Execution of this method
forces the verification resource into a known initial state. If this
method is called during an ongoing verification session, the
previous session is implicitly aborted.
Upon completion of the VER-START-SESSION method, the verification
resource MUST terminate any ongoing verification sessions, and clear
any voiceprint designation.
The header field "Adapt-Model" may also be present in the start
session method to indicate whether or not to adapt a voiceprint with
data collected during the session (if the voiceprint verification
phase succeeds). By default the voiceprint model should NOT be
adapted with data from a verification session.
Before a verification/identification session is started, only the
audio buffering operations (VER-BUFFERING-START,
VER-BUFFERING-CONTROL, VER-BUFFERING-STOP and VER-ROLLBACK) and the
generic SET-PARAMS and GET-PARAMS operations can be performed. The
media server should return 402 (Method not valid in this state) for
all other operations, such as VERIFY or VER-SET-VOICEPRINT.
Only a single session can be active at any one time.
Example:
C->S: VER-START-SESSION 314161 MRCP/1.0
Adapt-Model: true
S->C: MRCP/1.0 314161 200 COMPLETE
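The state rule described in this section (only a limited set of
methods is valid before a session has been started) can be sketched
as a simple check. This is a minimal Python illustration, not a
server implementation. It assumes that VER-START-SESSION,
VER-END-SESSION and VER-STOP are also accepted without a session,
which is a reading of Sections 6.7.2 and 6.7.11 rather than
something stated here; the names are hypothetical.

```python
# Methods the text lists as valid before a session is started.
ALLOWED_WITHOUT_SESSION = frozenset({
    "VER-BUFFERING-START", "VER-BUFFERING-CONTROL",
    "VER-BUFFERING-STOP", "VER-ROLLBACK",
    "SET-PARAMS", "GET-PARAMS",
    # Assumptions: starting a session must be possible, VER-END-SESSION
    # is described as safely repeatable, and VER-STOP always returns 200.
    "VER-START-SESSION", "VER-END-SESSION", "VER-STOP",
})


def status_for(method, session_active):
    """Return 200 if the method is valid in the current state, else 402."""
    if session_active or method in ALLOWED_WITHOUT_SESSION:
        return 200
    return 402  # Method not valid in this state
```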
6.7.2. VER-END-SESSION
The VER-END-SESSION method terminates an ongoing verification
session and releases the verification voiceprint model in one of
three ways:
a. aborting - the voiceprint adaptation or creation may be aborted
so that the voiceprint remains unchanged (or is not created).
b. committing - when terminating a voiceprint training session, the
new voiceprint is committed to the repository.
c. adapting - an existing voiceprint is modified using a successful
verification.
The header field "Abort-Model" may be included in the VER-END-
SESSION to control whether or not to abort any pending changes to
the voiceprint. The default behavior is to commit (not abort) any
pending changes to the designated voiceprint.
The VER-END-SESSION method may be safely executed multiple times
without first executing the VER-START-SESSION method. Any additional
executions of this method without an intervening use of the VER-
START-SESSION method have no effect on the system.
Example:
This example assumes there is a training session or a verification
session in progress.
C->S: VER-END-SESSION 314174 MRCP/1.0
Abort-Model: true
S->C: MRCP/1.0 314174 200 COMPLETE
6.7.3. VER-SET-VOICEPRINT
The VER-SET-VOICEPRINT method causes the verification resource to
establish the voiceprint to be used for verification, identification,
or training purposes. At this time the desired mode of the
verification resource is not yet known.
The VER-SET-VOICEPRINT method can also be used to query whether or not
a voiceprint exists. The response to the VER-SET-VOICEPRINT method
request will contain an indication of the status of the designated
voiceprint in the "Voiceprint-Exists" header field, allowing the client
to determine whether to use the current voiceprint for verification,
train a new voiceprint, or choose a different voiceprint.
A Voiceprint location may be completely specified by providing the URI
of the voiceprint repository along with attributes to locate a single
voiceprint within the repository. The voiceprint repository is
specified through the "Voiceprint-URI" header field, in which a URI
describing the location of the voiceprint repository is given. The
attributes used to locate a specific record or records within the
repository depend on whether the client intends to use speaker
verification or speaker identification.
In the case of speaker verification, only a single attribute is
required to uniquely locate a voiceprint record within the repository.
The "Voiceprint-Identifier" header field MUST describe a unique
voiceprint record within a given repository.
In the case of speaker identification, an attribute describing the set
or group of speakers from which to select a specific identity must be
supplied in the VER-SET-VOICEPRINT message. The header field
"Voiceprint-Group" specifies the group of voiceprints from which an
identity is determined. If a new voiceprint is to be added to an
existing voiceprint group, then both the voiceprint group and the new
voiceprint identifier must be supplied.
In most cases, the voiceprint operations, VER-SET-VOICEPRINT and
VER-DELETE-VOICEPRINT, operate on the same voiceprint repository but
use different voiceprint records or group names. For simplicity, the
"Voiceprint-URI" header field can be omitted if it has already been set
by a previous voiceprint operation. Note, however, that
VER-START-SESSION clears any voiceprint designation, including the
"Voiceprint-URI". Unlike the "Voiceprint-URI", the
"Voiceprint-Identifier" header field MUST be specified in every
voiceprint operation, and the "Voiceprint-Group" header field MUST be
specified in every voiceprint operation used for identification.
Example 1:
This example assumes a verification session is in progress and the
voiceprint exists in the voiceprint repository.
C->S: VER-SET-VOICEPRINT 314168 MRCP/1.0
Voiceprint-URI:
Voiceprint-Identifier:
S->C: MRCP/1.0 314168 200 COMPLETE
Voiceprint-URI:
Voiceprint-Identifier:
Voiceprint-Exists: true
Example 2:
This example assumes a verification session is in progress and the
voiceprint does not exist in the voiceprint repository.
C->S: VER-SET-VOICEPRINT 314168 MRCP/1.0
Voiceprint-URI:
Voiceprint-Identifier:
S->C: MRCP/1.0 314168 200 COMPLETE
Voiceprint-URI:
Voiceprint-Identifier:
Voiceprint-Exists: false
Example 3:
This example assumes a verification session is in progress and the
"Voiceprint-URI" header field contains an invalid URI.
C->S: VER-SET-VOICEPRINT 314168 MRCP/1.0
Voiceprint-URI:
Voiceprint-Identifier:
S->C: MRCP/1.0 314168 405 COMPLETE
Voiceprint-URI:
Voiceprint-Identifier:
Completion-Cause: 006 voiceprint-uri-failure
Example 4:
This example assumes an identification session is in progress and
the group does not exist in the voiceprint repository.
C->S: VER-SET-VOICEPRINT 314168 MRCP/1.0
Voiceprint-URI:
Voiceprint-Group:
S->C: MRCP/1.0 314168 200 COMPLETE
Voiceprint-URI:
Voiceprint-Group:
Completion-Cause: 010 voiceprint-group-not-exist
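The header rules for VER-SET-VOICEPRINT can be sketched as a small
resolution step on the server side. The Python below is a hedged
illustration, not a normative algorithm. It follows the per-mode
description and Example 4 in treating the identifier as optional
when a group is supplied for identification, which is one possible
interpretation of the text; the function name, its arguments and the
header values used to exercise it are hypothetical.

```python
def resolve_voiceprint(headers, current_uri=None, identification=False):
    """Return (uri, identifier, group) or raise ValueError.

    headers: dict of header name -> value for one request.
    current_uri: repository URI remembered from an earlier operation,
    or None if none was set (for example after VER-START-SESSION).
    """
    # The URI may be omitted once a previous operation has set it.
    uri = headers.get("Voiceprint-URI") or current_uri
    if not uri:
        raise ValueError("Voiceprint-URI is not set")
    identifier = headers.get("Voiceprint-Identifier") or None
    group = headers.get("Voiceprint-Group") or None
    if identification:
        # Identification needs the group to select an identity from.
        if group is None:
            raise ValueError("Voiceprint-Group is required")
    elif identifier is None:
        # Verification needs exactly one voiceprint record.
        raise ValueError("Voiceprint-Identifier is required")
    return uri, identifier, group
```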
6.7.4. VER-DELETE-VOICEPRINT
The VER-DELETE-VOICEPRINT method removes a voiceprint from a
repository or speaker identification group. For removal of a speaker
identification voiceprint, three attributes describing the
voiceprint repository, group, and voiceprint identifier are
required. For removal of a speaker verification voiceprint, two
attributes describing the repository and the specific voiceprint are
needed.
If a single voiceprint record is specified with no group identifier
information, the voiceprint record is deleted.
If a group identifier is specified but no specific voiceprint within
the group, the group record is deleted, and all the voiceprints
associated with that group are deleted.
If both a voiceprint record and a group identifier are specified,
that voiceprint is deleted, and the group identifier is updated to
no longer reference that voiceprint. If, after removing the
reference to that voiceprint, the group identifier is empty, the
group record is also removed.
If a voiceprint record or a voiceprint group does not exist, the
VER-DELETE-VOICEPRINT method can silently ignore the message and
still return a 200 status code.
Example:
This example demonstrates a message to remove a specific voiceprint.
C->S: VER-DELETE-VOICEPRINT 314168 MRCP/1.0
Voiceprint-URI:
Voiceprint-Identifier:
S->C: MRCP/1.0 314168 200 COMPLETE
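The deletion cases described in this section can be sketched against
an in-memory repository. This Python illustration assumes a
repository modelled as a set of voiceprint identifiers plus a
mapping from group name to member identifiers; that data model and
the function name are hypothetical.

```python
def delete_voiceprint(voiceprints, groups, identifier=None, group=None):
    """Apply VER-DELETE-VOICEPRINT semantics and return the status code.

    voiceprints: set of voiceprint identifiers in the repository.
    groups: dict mapping a group name to a set of member identifiers.
    """
    if group is None:
        # Single record, no group information: delete the record.
        voiceprints.discard(identifier)
    elif identifier is None:
        # Group only: delete the group and all voiceprints in it.
        for member in groups.pop(group, set()):
            voiceprints.discard(member)
    else:
        # Both: delete the record and drop it from the group; remove
        # the group record as well if it becomes empty.
        voiceprints.discard(identifier)
        members = groups.get(group)
        if members is not None:
            members.discard(identifier)
            if not members:
                del groups[group]
    # Missing records or groups are silently ignored.
    return 200
```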
6.7.5. VERIFY
The VERIFY method is used to send the utterance's audio stream to
the verification resource, which will then process it according to
the current Verification-Mode, either to train the voiceprint or
verify the user.
When both a recognizer and verification resource share the same
session, the VERIFY method MUST be called prior to calling the
RECOGNIZE method on the recognizer resource. In such cases, media
server vendors will know that verification must be enabled for a
subsequent call to RECOGNIZE.
Example:
C->S: VERIFY 543260 MRCP/1.0
S->C: MRCP/1.0 543260 200 IN-PROGRESS
When the VERIFY request is done, the MRCP server should send a
"VERIFICATION-COMPLETE" event to the client.
6.7.6. VER-BUFFERING-START
The VER-BUFFERING-START method starts a buffering session. Upon
completion of the VER-BUFFERING-START method, the audio buffer
associated with the verification resource MUST be cleared. Note that
the audio buffer is independent of a verification session, so that a
verification session may be started and terminated while the audio
buffer continues to maintain its audio data. The lifespan of the
data in the audio buffer is determined solely by the VER-BUFFERING-
START and VER-BUFFERING-STOP methods during the life of the
verification resource.
The audio buffer is initially cleared out when a verification
resource is successfully allocated from an MRCP server.
If another buffering session is in progress, this method will fail.
Only a single buffering session may be in progress at a time.
Example:
C->S: VER-BUFFERING-START 314163 MRCP/1.0
S->C: MRCP/1.0 314163 200 COMPLETE
6.7.7. VER-BUFFERING-CONTROL
The VER-BUFFERING-CONTROL method is used to either pause or resume
an active buffering session. The "Buffering-Mode" parameter MUST be
used when invoking this method.
When invoked with Buffering-Mode set to pause, this method causes an
active buffering session to be paused. Subsequent utterances are
not buffered.
When invoked with Buffering-Mode set to resume, this method resumes
a buffering session and subsequent utterances will be buffered.
Example:
C->S: VER-BUFFERING-CONTROL 314165 MRCP/1.0
Buffering-Mode: pause
S->C: MRCP/1.0 314165 200 COMPLETE
6.7.8. VER-BUFFERING-STOP
The VER-BUFFERING-STOP method terminates the active buffering
session, and frees the memory holding buffered utterances.
Example:
C->S: VER-BUFFERING-STOP 314167 MRCP/1.0
S->C: MRCP/1.0 314167 200 COMPLETE
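The behavior of the three buffering methods can be summarized as a
small state machine. The following Python sketch is illustrative
only: the class, its state names and the use of Boolean return
values in place of MRCP status codes are assumptions made for the
example.

```python
class AudioBuffer:
    """Toy model of the verification resource's audio buffer."""

    def __init__(self):
        self.state = "stopped"  # "stopped", "buffering" or "paused"
        self.utterances = []

    def start(self):
        """VER-BUFFERING-START: fails if a session is in progress."""
        if self.state != "stopped":
            return False
        self.utterances.clear()  # the buffer is cleared on start
        self.state = "buffering"
        return True

    def control(self, mode):
        """VER-BUFFERING-CONTROL with Buffering-Mode pause or resume."""
        if self.state == "stopped":
            return False  # no active buffering session
        if mode not in ("pause", "resume"):
            raise ValueError(f"unknown Buffering-Mode: {mode}")
        self.state = "paused" if mode == "pause" else "buffering"
        return True

    def on_utterance(self, audio):
        """Store an utterance only while buffering is active."""
        if self.state == "buffering":
            self.utterances.append(audio)

    def stop(self):
        """VER-BUFFERING-STOP: end the session and free the audio."""
        self.utterances.clear()
        self.state = "stopped"
```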
6.7.9. VER-FROM-BUFFER
The VER-FROM-BUFFER method begins an ongoing evaluation of the
currently buffered audio against the voiceprint established through
the VER-SET-VOICEPRINT method. Execution of this method without
first establishing the voiceprint repository and identifier
attributes produces an error response. Since a verification session
may only have a single voiceprint identity at any given time, this
method may not be started repeatedly without first receiving a
completion response or sending a VER-STOP message.
Embedded with the request for audio evaluation is a header field to
describe the desired usage of the verification resource. The value
of the "Verification-Mode" header field MUST be one of either
"train" or "verify".
The buffered audio is not consumed by this evaluation operation and
thus VER-FROM-BUFFER may be called repeatedly using different
voiceprints. Such usage is desirable to implement an n-best
processing strategy to determine a voiceprint identity.
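The n-best strategy mentioned above can be sketched as a loop over
candidate voiceprints. In this Python illustration the `evaluate`
callable is hypothetical: it stands for issuing VER-SET-VOICEPRINT
followed by VER-FROM-BUFFER for one candidate and returning the
verification score reported for it.

```python
def n_best(candidates, evaluate, n=3):
    """Return up to n (identifier, score) pairs, best score first.

    evaluate(identifier) is assumed to run one VER-FROM-BUFFER
    evaluation against the unchanged audio buffer and return a score.
    """
    scored = [(identifier, evaluate(identifier)) for identifier in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

Since the buffered audio is not consumed, every candidate is scored
against the same utterances.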
The processing initiated under a VER-FROM-BUFFER method may be
terminated using the VER-STOP method.
For the VER-FROM-BUFFER method, the media server can optionally
return an "IN-PROGRESS" response followed by the
"VERIFICATION-COMPLETE" event.
Example:
This example illustrates the usage of some buffering methods. In
this scenario the client first performs a live verification, but
the utterance is rejected. In the meantime, the utterance is also
saved to the audio buffer. Then, another voiceprint is used to
verify against the audio buffer, and the utterance is accepted.
Here, we assume both "num-min-verification-phrases" and
"num-max-verification-phrases" are 1.
C->S: VER-START-SESSION 314161 MRCP/1.0
Adapt-Model: true
S->C: MRCP/1.0 314161 200 COMPLETE
C->S: VER-SET-VOICEPRINT 314162 MRCP/1.0
Voiceprint-URI:
Voiceprint-Identifier:
S->C: MRCP/1.0 314162 200 COMPLETE
Voiceprint-URI:
Voiceprint-Identifier:
Voiceprint-Exists: true
C->S: VER-BUFFERING-START 314163 MRCP/1.0
S->C: MRCP/1.0 314163 200 COMPLETE
C->S: VERIFY 314164 MRCP/1.0
S->C: MRCP/1.0 314164 200 IN-PROGRESS
S->C: VERIFICATION-COMPLETE 314164 COMPLETE MRCP/1.0
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 123
true 50 cellular-phone female rejected -50 1 50 cellular-phone female rejected -50
C->S: VER-SET-VOICEPRINT 314165 MRCP/1.0
Voiceprint-Identifier:
S->C: MRCP/1.0 314165 200 COMPLETE
Voiceprint-URI:
Voiceprint-Identifier:
Voiceprint-Exists: true
C->S: VER-FROM-BUFFER 314166 MRCP/1.0
Verification-Mode: verify
S->C: MRCP/1.0 314166 200 IN-PROGRESS
S->C: VERIFICATION-COMPLETE 314166 COMPLETE MRCP/1.0
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 123
true 50 cellular-phone female accepted 50 1 50 cellular-phone female accepted 50
C->S: VER-BUFFERING-STOP 314167 MRCP/1.0
S->C: MRCP/1.0 314167 200 COMPLETE
C->S: VER-END-SESSION 314168 MRCP/1.0
S->C: MRCP/1.0 314168 200 COMPLETE
6.7.10. VER-ROLLBACK
The VER-ROLLBACK method discards the last buffered utterance or
the last live utterance (when the mode is "train" or "verify").
This method should be invoked when the caller provides undesirable
input such as non-speech noises, side-speech, out-of-grammar
utterances, commands, etc. Note that this method does not provide a
stack of rollback states: if VER-ROLLBACK is executed twice in
succession without an intervening recognition operation, the second
invocation has no effect.
Example:
C->S: VER-ROLLBACK 314165 MRCP/1.0
S->C: MRCP/1.0 314165 200 COMPLETE
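The single-level rollback behavior can be modelled as follows. This
Python sketch is illustrative only; the class and method names are
hypothetical and the utterances are represented as opaque values.

```python
class UtteranceLog:
    """Collected utterances with a single level of rollback."""

    def __init__(self):
        self.utterances = []
        self._can_rollback = False

    def add(self, utterance):
        """Record an utterance; this re-arms the rollback."""
        self.utterances.append(utterance)
        self._can_rollback = True

    def rollback(self):
        """Discard the last utterance; a repeated call has no effect."""
        if not self._can_rollback:
            return False
        self.utterances.pop()
        self._can_rollback = False
        return True
```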
6.7.11. VER-STOP
The VER-STOP method from the client to the server tells the
verification resource to stop the VERIFY or VER-FROM-BUFFER request,
if one is active. If such a request is active and the VER-STOP
request successfully terminates it, then the response contains an
active-request-id-list header field containing the request-id of the
VERIFY or VER-FROM-BUFFER request that was terminated. In this case,
no VERIFICATION-COMPLETE event will be sent for the terminated
request. If no VERIFY or VER-FROM-BUFFER request was active, then
the response MUST NOT contain an active-request-id-list header
field. Either way, the response MUST contain a status of 200
(Success).
The VER-STOP method aborts an ongoing evaluation operation against
live audio or buffered audio.
Example:
This example assumes a voiceprint identity has already been
established.
C->S: VERIFY 314177 MRCP/1.0
Verification-Mode: verify
S->C: MRCP/1.0 314177 200 IN-PROGRESS
C->S: VER-STOP 314178 MRCP/1.0
S->C: MRCP/1.0 314178 200 COMPLETE
Active-Request-Id-List: 314177
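The response rule for VER-STOP can be sketched as a small formatting
function. This Python illustration assumes that at most one VERIFY
or VER-FROM-BUFFER request is active at a time; the function name is
hypothetical, and only the start line and header block are produced.

```python
def build_ver_stop_response(stop_request_id, active_request_id=None):
    """Format the response to VER-STOP as CRLF-terminated lines.

    active_request_id is the request-id of the VERIFY or
    VER-FROM-BUFFER request that was terminated, or None if no such
    request was active.
    """
    lines = [f"MRCP/1.0 {stop_request_id} 200 COMPLETE"]
    if active_request_id is not None:
        # Present only when a request was actually terminated.
        lines.append(f"Active-Request-Id-List: {active_request_id}")
    return "".join(line + "\r\n" for line in lines)
```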
6.7.12. VER-START-TIMERS
This request is sent from the client to the verification resource to
start the no-input timer, usually once the audio prompts to the
caller have played to completion.
Example:
C->S: VER-START-TIMERS 543260 MRCP/1.0
S->C: MRCP/1.0 543260 200 COMPLETE
6.7.13. SET-PARAMS
The SET-PARAMS method, from the client to the server, tells the
verification resource to set and modify its configuration
parameters. If the server resource does not recognize an OPTIONAL
parameter, it MUST ignore that field. Many of the parameters in the
SET-PARAMS method can also be used in other methods, such as the
VERIFY method. The difference is that a value such as the
security-level set using SET-PARAMS applies to all future requests,
whenever applicable, whereas a security-level passed in a VERIFY
request applies only to that request.
Example:
C->S: SET-PARAMS 543256 MRCP/1.0
Security-Level: high
No-Input-Timeout: 5000
S->C: MRCP/1.0 543256 200 COMPLETE
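The difference between a parameter set with SET-PARAMS and one
passed in a single request can be sketched as a merge in which the
per-request value takes precedence. The Python below is a minimal
illustration; the function name and the value "low" used to exercise
it are hypothetical.

```python
def effective_params(session_params, request_headers):
    """Return the parameters that apply to one request.

    session_params: values established earlier with SET-PARAMS.
    request_headers: values passed in this request only.
    The session-level values are left untouched, so the override
    does not carry over to later requests.
    """
    merged = dict(session_params)
    merged.update(request_headers)
    return merged
```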
6.7.14. GET-PARAMS
The GET-PARAMS method, from the client to the server, asks the
verification resource for its current values for parameters in the
request. The client can request specific parameters from the server
by sending it one or more empty parameter headers with no values.
The server should then return the settings for those specific
parameters only. When the client does not send a specific list of
empty parameter headers, the verification resource should return the
settings for all parameters. This wildcard use can be expensive, as
the number of settable parameters can be large depending on the
vendor. Hence it is RECOMMENDED that the client not use the wildcard
GET-PARAMS operation very often.
Example:
C->S: GET-PARAMS 543256 MRCP/1.0
Security-Level:
No-Input-Timeout:
S->C: MRCP/1.0 543256 200 COMPLETE
Security-Level: high
No-Input-Timeout: 5000
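The selection rule for GET-PARAMS can be sketched as follows. This
Python illustration assumes the resource keeps its settings in a
dictionary and that unknown parameter names are simply omitted from
the response, which the text does not specify; the function name is
hypothetical.

```python
def get_params(settings, requested_names):
    """Return the settings to include in a GET-PARAMS response.

    requested_names: names of the empty parameter headers sent by the
    client; an empty list means the wildcard form (all parameters).
    """
    if not requested_names:
        return dict(settings)
    # Assumption: names the resource does not know are left out.
    return {name: settings[name] for name in requested_names if name in settings}
```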
6.8. Verification Session Events
6.8.1. VERIFICATION-COMPLETE
The VERIFICATION-COMPLETE event follows a call to VERIFY or VER-
FROM-BUFFER and is used to communicate to the client the
verification results. This event will contain only verification
results.
Example:
S->C: VERIFICATION-COMPLETE 543259 COMPLETE MRCP/1.0
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 123
true 50 cellular-phone female accepted 50 3 150 cellular-phone female accepted 25
123456 Martha-smith
75
6.8.2. START-OF-SPEECH
The START-OF-SPEECH event is returned from the server to the client
once the server has detected speech. This event is always returned
by the verification resource when speech has been detected,
regardless of whether the recognizer and verification resources
share the same session.
Example:
S->C: START-OF-SPEECH 543259 IN-PROGRESS MRCP/1.0
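The examples in this section use three start-line shapes: requests,
responses and events. The following Python sketch tells them apart
by the position of the protocol version token. It is a simplified
illustration based on the examples shown here, not a complete
parser, and the function name and result keys are hypothetical.

```python
def parse_start_line(line):
    """Classify an MRCP start line as a request, response or event."""
    parts = line.split()
    if len(parts) == 3 and parts[2].startswith("MRCP/"):
        # METHOD request-id MRCP/1.0
        return {"kind": "request", "method": parts[0],
                "request_id": int(parts[1])}
    if len(parts) == 4 and parts[0].startswith("MRCP/"):
        # MRCP/1.0 request-id status-code request-state
        return {"kind": "response", "request_id": int(parts[1]),
                "status": int(parts[2]), "state": parts[3]}
    if len(parts) == 4 and parts[3].startswith("MRCP/"):
        # EVENT request-id request-state MRCP/1.0
        return {"kind": "event", "event": parts[0],
                "request_id": int(parts[1]), "state": parts[2]}
    raise ValueError(f"unrecognized start line: {line!r}")
```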
7. Hotword Recognition
This document captures the extensions required to implement Voice
Enrollment, Speaker Verification and Hotword recognition using MRCP.
This section describes the methods, responses and events needed for
doing Hotword recognition.
A new type of Speech Recognizer resource is presented that can be
used for Hotword recognition. Unlike the primary recognizer
resource, which is driven by the client for each recognition
request, the secondary Hotword recognition resource is attached to
the session and listens continuously until a particular command
phrase is spoken.
The Hotword recognition resource can be the only recognition
resource in a session or it can be attached to the same session as a
primary recognizer resource, and consequently connected to the same
audio stream. When a client sends a SETUP request to add a Hotword
recognizer resource to an existing session, then the MRCP server
attaches the Hotword recognition resource in eavesdropping mode on
the RTP stream already established by the primary resource.
7.1. Hotword State Machine
The difference between a Hotword recognition resource and the
primary recognition resource is minor. The RECOGNIZE method is the
only method allowed on a Hotword recognition resource. The only
event generated is RECOGNITION-COMPLETE. The resource goes from
IDLE to RECOGNIZING and back to IDLE just like a regular recognizer
resource.
A Hotword recognition resource, unlike a normal recognizer resource,
will not send a START-OF-SPEECH event while it is trying to locate a
Hotword. The first event that will be returned once the Hotword is
detected is a RECOGNITION-COMPLETE event.
After a RECOGNITION-COMPLETE event is reported, the Hotword
recognition resource must be primed once again by sending another
RECOGNIZE request.
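The Hotword state machine described above can be sketched in a few
lines. This Python model is illustrative only: matching the Hotword
by comparing text is a stand-in for real recognition, and the class
and method names are hypothetical.

```python
class HotwordRecognizer:
    """Toy model of the IDLE -> RECOGNIZING -> IDLE cycle."""

    def __init__(self, hotword):
        self.hotword = hotword
        self.state = "IDLE"

    def recognize(self):
        """RECOGNIZE: prime the resource to listen for the Hotword."""
        if self.state != "IDLE":
            raise RuntimeError("RECOGNIZE is already in progress")
        self.state = "RECOGNIZING"

    def on_utterance(self, text):
        """Return the event to send, or None if nothing is reported.

        No START-OF-SPEECH is generated; the only event is
        RECOGNITION-COMPLETE, after which the resource is IDLE again
        and must be primed with another RECOGNIZE.
        """
        if self.state != "RECOGNIZING" or text != self.hotword:
            return None
        self.state = "IDLE"
        return "RECOGNITION-COMPLETE"
```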
7.1.1. Addressing Resources
To request that a Hotword recognition resource be added to a
session, a different URI must be specified in the SETUP message.
The same rules apply as for other resources. That is, if no session
is specified in the SETUP message, then this is considered to be the
first resource added to a session. For subsequent SETUP requests,
the MRCP client should indicate to the server that these resources
belong to the same session by supplying the same session ID in the
SETUP request message.
There is no special order required when requesting synthesizer,
recognizer or Hotword-recognizer resources.
7.2. Hotword Header Fields
Hotword recognition requests may contain the following header
fields.
Hotword-header = Hotword-Max-Seconds ; Section 7.2.1
| Hotword-Min-Seconds ; Section 7.2.2
7.2.1. Hotword-Max-Seconds
This parameter MAY be sent in a RECOGNIZE request to enable Hotword
listening. It specifies the maximum length of an utterance that
should be considered for Hotword. This parameter, along with
Hotword-Min-Seconds, can be used to tune performance by preventing
the recognizer from evaluating utterances that are too short or too
long to be the Hotword. The value is in milliseconds. The default
is 1700 milliseconds.
hotword-max-seconds = "Hotword-Max-Seconds" ":" 1*DIGIT CRLF
7.2.2. Hotword-Min-Seconds
This parameter MAY be sent in a RECOGNIZE request to enable Hotword
listening. It specifies the minimum length of an utterance that can
be considered for Hotword. This parameter, along with
Hotword-Max-Seconds, can be used to tune performance by preventing
the recognizer from evaluating utterances that are too short or too
long to be the Hotword. The value is in milliseconds. The default
is 300 milliseconds.
hotword-min-seconds = "Hotword-Min-Seconds" ":" 1*DIGIT CRLF
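The effect of the two duration limits can be sketched as a simple
pre-filter. This Python illustration takes the values in
milliseconds with the defaults given in the text, and assumes that
both bounds are inclusive, which the text does not state; the
function name is hypothetical.

```python
def is_hotword_candidate(duration_ms, min_ms=300, max_ms=1700):
    """Return True if an utterance is worth evaluating as the Hotword.

    Utterances shorter than min_ms or longer than max_ms are skipped.
    Assumption: an utterance exactly at either limit is accepted.
    """
    return min_ms <= duration_ms <= max_ms
```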
7.3. Hotword Methods
7.3.1. SETUP
The SETUP method from the client to the server is used to attach a
Hotword recognizer resource to the session.
Example:
C->S: SETUP rtsp://media.server.com/media/hotword-asr RTSP/1.0
CSeq: 3
Transport: RTP/AVP;unicast;client_port=8000-8001; mode=record
Session: 12345678
S->C: RTSP/1.0 200 OK
CSeq: 3
Transport: RTP/AVP;unicast;client_port=8000-8001;
server_port=9000-9001;mode=record
Session: 12345678
7.3.2. RECOGNIZE
The RECOGNIZE method from the client to the server starts an ongoing
Hotword recognition. This operation can be stopped using the STOP
method. Otherwise, the RECOGNITION-COMPLETE event will be returned
when the Hotword has been recognized.
The client must call RECOGNIZE once again to re-start Hotword
recognition.
Example:
C->S: ANNOUNCE rtsp://media.server.com/media/hotword-asr RTSP/1.0
Cseq: 314
Session: 12345678
Content-Type: application/mrcp
Content-Length: 276
RECOGNIZE 543259 MRCP/1.0
Content-Type: application/grammar+xml
Content-Length: 123
Hotword-Min-Seconds: 300
Hotword-Max-Seconds: 1700
S->C: RTSP/1.0 200 OK
Cseq: 314
Content-Type: application/mrcp
Content-Length: 67
MRCP/1.0 543259 200 IN-PROGRESS
S->C: ANNOUNCE rtsp://media.server.com/media/hotword-asr RTSP/1.0
Cseq: 315
Session: 12345678
Content-Type: application/mrcp
Content-Length: 123
RECOGNITION-COMPLETE 543259 COMPLETE MRCP/1.0
Completion-Cause: 000 Normal
Content-Type: application/x-nlsml
Content-Length: 76
Wakeup
Wakeup
C->S: RTSP/1.0 200 OK
Cseq: 315
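In the exchange above, each MRCP message travels as the body of an
RTSP ANNOUNCE message, so the RTSP Content-Length must match the
size of the embedded MRCP message. The following Python sketch shows
one way a client might build such a message. The function name is
hypothetical, and the header set is limited to the headers that
appear in the example.

```python
def build_announce(url, cseq, session, mrcp_message):
    """Wrap an MRCP message in an RTSP ANNOUNCE and return the bytes."""
    body = mrcp_message.encode("utf-8")
    head = (
        f"ANNOUNCE {url} RTSP/1.0\r\n"
        f"CSeq: {cseq}\r\n"
        f"Session: {session}\r\n"
        "Content-Type: application/mrcp\r\n"
        # Length of the MRCP body in bytes, not characters.
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body
```

Computing the length from the encoded body avoids hand-maintained
values drifting out of step with the message they describe.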
8. RTSP based Examples:
This section contains examples of typical sessions between a client
and the server.
8.1. Enrollment
This example illustrates a typical enrollment session.
First, you need to start an enrollment session before proceeding to
learn new phrases.
C->S: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
Cseq: 406
Session: 12345678
Content-Type: application/mrcp
Content-Length: 123
START-ENROLLMENT-SESSION 543258 MRCP/1.0
Num-Min-Consistent-Pronunciations: 2
Consistency-Threshold: 3000
Clash-Threshold: 1200
Personal-Grammar-URI:
S->C: RTSP/1.0 200 OK
Cseq: 406
Content-Type: application/mrcp
Content-Length: 86
MRCP/1.0 543258 200 COMPLETE
Then, the application can proceed to enroll an utterance by
iterating over the following command.
C->S: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
Cseq: 407
Session: 12345678
Content-Type: application/mrcp
Content-Length: 276
RECOGNIZE 543259 MRCP/1.0
Content-Type: application/grammar+xml
Content-Length: 123
help cancel
S->C: RTSP/1.0 200 OK
Cseq: 407
Content-Type: application/mrcp
Content-Length: 67
MRCP/1.0 543259 200 IN-PROGRESS
S->C: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
Cseq: 408
Session: 12345678
Content-Type: application/mrcp
Content-Length: 87
START-OF-SPEECH 543259 IN-PROGRESS MRCP/1.0
C->S: RTSP/1.0 200 OK
Cseq: 408
The recognizer resource returns the enrollment status after each
attempt to enroll an utterance. This repeats until the required
number of consistent pronunciations has been collected and there are
no clashes with other pronunciations in the personal grammar.
S->C: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
Cseq: 409
Session: 12345678
Content-Type: application/mrcp
Content-Length: 276
RECOGNITION-COMPLETE 543259 COMPLETE MRCP/1.0
Completion-Cause: 000 Normal
Content-Type: application/x-nlsml
Content-Length: 123
2 1 1
consistent Jeff Andre
C->S: RTSP/1.0 200 OK
Cseq: 409
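The repeat-until-consistent behavior described above can be sketched
as a simple client-side loop. This is only an illustration under
stated assumptions: recognize_once stands for one full
RECOGNIZE / RECOGNITION-COMPLETE round trip, and the dictionary keys
"consistent" and "clashes" are hypothetical names for values a
client would extract from the enrollment result.

```python
def enroll_phrase(recognize_once, required_consistent=2, max_attempts=5):
    """Repeat recognition until enrollment succeeds or attempts run out.

    recognize_once: caller-supplied callable performing one RECOGNIZE
        round trip and returning a dict with the hypothetical keys
        'consistent' (number of consistent pronunciations so far) and
        'clashes' (number of clashes with the personal grammar).
    Returns the number of attempts used, or None if enrollment did not
    converge within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        status = recognize_once()
        enough = status["consistent"] >= required_consistent
        if enough and status["clashes"] == 0:
            return attempt
    return None


if __name__ == "__main__":
    # Simulated enrollment results for three successive utterances.
    simulated = iter([
        {"consistent": 1, "clashes": 0},
        {"consistent": 2, "clashes": 1},
        {"consistent": 2, "clashes": 0},
    ])
    print(enroll_phrase(lambda: next(simulated)))
```

The default of 2 for required_consistent mirrors the
Num-Min-Consistent-Pronunciations value used in this example;
max_attempts is an application policy choice, not something the
protocol mandates.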
Finally, when the application is satisfied with the enrollment
results, it commits the enrollment to the personal grammar by
ending the enrollment session, as follows.
C->S: ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
Cseq: 410
Session: 12345678
Content-Type: application/mrcp
Content-Length: 123
END-ENROLLMENT-SESSION 543260 MRCP/1.0
Phrase-Id:
Phrase-NL:
Weight: 1
Save-Waveform: true
S->C: RTSP/1.0 200 OK
Cseq: 410
Content-Type: application/mrcp
Content-Length: 67
MRCP/1.0 543260 200 COMPLETE
Waveform-URL:
8.2. Speaker Verification and Identification
This example illustrates a verification session. Prompts are assumed
to be played by other means; the MRCP synthesizer resource is
omitted for simplicity.
Open the recognizer. This is the first resource for this session.
The server and client agree on a single session ID, 12345678, and a
set of RTP/RTCP ports on both sides.
C->S:SETUP rtsp://media.server.com/media/recognizer RTSP/1.0
CSeq: 2
Transport:RTP/AVP;unicast;client_port=46456-46457
Content-Type: application/sdp
Content-Length: 190
v=0
o=- 123 456 IN IP4 10.0.0.1
s=Media Server
p=+1-888-555-1212
c=IN IP4 0.0.0.0
t=0 0
m=audio 0 RTP/AVP 0 96
a=rtpmap:0 pcmu/8000
a=rtpmap:96 telephone-event/8000
a=fmtp:96 0-15
S->C:RTSP/1.0 200 OK
CSeq: 2
Transport:RTP/AVP;unicast;client_port=46456-46457;
server_port=46460-46461
Session: 12345678
Content-Length: 190
Content-Type: application/sdp
v=0
o=- 3211724219 3211724219 IN IP4 10.3.2.88
s=Media Server
c=IN IP4 0.0.0.0
t=0 0
m=audio 46460 RTP/AVP 0 96
a=rtpmap:0 pcmu/8000
a=rtpmap:96 telephone-event/8000
a=fmtp:96 0-15
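The port agreement in the SETUP exchange above is carried in the
RTSP Transport header. As a rough sketch, a client could extract the
negotiated port pairs as shown below. The helper names
parse_transport and port_pair are hypothetical, and the folded
header from the example is assumed to have been joined into a single
line before parsing.

```python
def parse_transport(value):
    """Split a Transport header value into (spec, params).

    Parameters without '=' (such as 'unicast') map to True.
    """
    parts = [p.strip() for p in value.split(";") if p.strip()]
    spec, params = parts[0], {}
    for part in parts[1:]:
        name, sep, val = part.partition("=")
        params[name] = val if sep else True
    return spec, params


def port_pair(params, key):
    """Return (rtp_port, rtcp_port) for a 'client_port' or 'server_port' entry.

    If only one port is given, the RTCP port is assumed to be the
    next higher port.
    """
    low, _, high = params[key].partition("-")
    rtp = int(low)
    rtcp = int(high) if high else rtp + 1
    return rtp, rtcp


if __name__ == "__main__":
    header = ("RTP/AVP;unicast;client_port=46456-46457;"
              "server_port=46460-46461")
    spec, params = parse_transport(header)
    print(spec, port_pair(params, "client_port"), port_pair(params, "server_port"))
```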
Open a verification resource. This reuses the existing session ID
and ports.
C->S:SETUP rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 3
Transport: RTP/AVP;unicast;client_port=46456-46457;
mode=record;ttl=127
Session: 12345678
S->C:RTSP/1.0 200 OK
CSeq: 3
Transport: RTP/AVP;unicast;client_port=46456-46457;
server_port=46460-46461;mode=record;ttl=127
Session: 12345678
Start a verification session.
C->S:ANNOUNCE rtsp://media.server.com/media/verification-resource
RTSP/1.0
Cseq: 4
Session: 12345678
Content-Type: application/mrcp
Content-Length: 53
VER-START-SESSION 314161 MRCP/1.0
Adapt-Model: true
S->C:RTSP/1.0 200 OK
CSeq: 4
Session: 12345678
Content-Length: 30
Content-Type: application/mrcp
MRCP/1.0 314161 200 COMPLETE
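Each MRCP response in these examples begins with a status line such
as "MRCP/1.0 314161 200 COMPLETE". A minimal sketch of parsing that
line follows; the MrcpResponse type and parse_response_line function
are illustrative names, not part of the protocol.

```python
from typing import NamedTuple

# Request states used by MRCP responses and events.
REQUEST_STATES = ("PENDING", "IN-PROGRESS", "COMPLETE")


class MrcpResponse(NamedTuple):
    version: str
    request_id: int
    status_code: int
    request_state: str


def parse_response_line(line):
    """Parse 'MRCP/1.0 <request-id> <status-code> <request-state>'."""
    tokens = line.split()
    if len(tokens) != 4 or not tokens[0].startswith("MRCP/"):
        raise ValueError(f"not an MRCP response line: {line!r}")
    version, request_id, status_code, state = tokens
    if state not in REQUEST_STATES:
        raise ValueError(f"unknown request state: {state!r}")
    return MrcpResponse(version, int(request_id), int(status_code), state)


if __name__ == "__main__":
    response = parse_response_line("MRCP/1.0 314161 200 COMPLETE")
    print(response.request_id, response.status_code, response.request_state)
```

A client can use request_id to pair the response with the request it
issued, and request_state to decide whether further events for that
request are still expected.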
Start buffering utterance.
C->S:ANNOUNCE rtsp://media.server.com/media/verification-resource
RTSP/1.0
Cseq: 5
Session: 12345678
Content-Type: application/mrcp
Content-Length: 37
VER-BUFFERING-START 314162 MRCP/1.0
S->C:RTSP/1.0 200 OK
CSeq: 5
Session: 12345678
Content-Length: 30
Content-Type: application/mrcp
MRCP/1.0 314162 200 COMPLETE
Start a recognition request, for example to collect the account
number.
C->S:ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
CSeq: 6
Session: 12345678
Content-Type: application/mrcp
Content-Length: 188
RECOGNIZE 314163 MRCP/1.0
No-Input-Timeout: 7000
Recognizer-Start-Timers: false
Save-Waveform: true
N-Best-List-Length: 2
Content-Type: text/uri-list
Content-Length: 33
builtin:grammar/digits?length=5
S->C:RTSP/1.0 200 OK
CSeq: 6
Session: 12345678
Content-Length: 33
Content-Type: application/mrcp
MRCP/1.0 314163 200 IN-PROGRESS
S->C:ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
CSeq: 1
Session: 12345678
Content-Length: 65
Content-Type: application/mrcp
START-OF-SPEECH 314163 IN-PROGRESS MRCP/1.0
Proxy-Sync-Id: 1
C->S:RTSP/1.0 200 OK
CSeq: 1
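The messages so far show three kinds of MRCP start line: requests
(method, request ID, version), responses (version, request ID,
status code, request state) and events (event name, request ID,
request state, version). The sketch below tells them apart by token
count and by the position of the version token. The function name
classify_start_line is hypothetical.

```python
def classify_start_line(line):
    """Return 'request', 'response' or 'event' for an MRCP start line.

    request:  <method> <request-id> MRCP/x.y
    response: MRCP/x.y <request-id> <status-code> <request-state>
    event:    <event-name> <request-id> <request-state> MRCP/x.y
    """
    tokens = line.split()
    if len(tokens) == 4 and tokens[0].startswith("MRCP/"):
        return "response"
    if len(tokens) == 4 and tokens[3].startswith("MRCP/"):
        return "event"
    if len(tokens) == 3 and tokens[2].startswith("MRCP/"):
        return "request"
    raise ValueError(f"unrecognized MRCP start line: {line!r}")


if __name__ == "__main__":
    for start_line in (
        "RECOGNIZE 314163 MRCP/1.0",
        "MRCP/1.0 314163 200 IN-PROGRESS",
        "START-OF-SPEECH 314163 IN-PROGRESS MRCP/1.0",
    ):
        print(classify_start_line(start_line), "<-", start_line)
```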
The recognition result contains 2 choices.
S->C:ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0
CSeq: 2
Session: 12345678
Content-Length: 3511
Content-Type: application/mrcp
RECOGNITION-COMPLETE 314163 COMPLETE MRCP/1.0
Completion-Cause: 000 success
Waveform-URL: http://media.server.com/waveforms/utt01.wav
Content-Type: application/x-nlsml
Content-Length: 3280
13579
one three five seven nine
13479
one three four seven nine
C->S:RTSP/1.0 200 OK
CSeq: 2
Check whether the first choice from the n-best list exists in the
voiceprint repository.
C->S:ANNOUNCE rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 7
Session: 12345678
Content-Type: application/mrcp
Content-Length: 119
VER-SET-VOICEPRINT 314164 MRCP/1.0
Voiceprint-URI: http://media.server.com/VoicePrints
Voiceprint-Identifier: 13579
Voiceprint ID 13579 doesn't exist.
S->C:RTSP/1.0 200 OK
CSeq: 7
Session: 12345678
Content-Length: 139
Content-Type: application/mrcp
MRCP/1.0 314164 200 COMPLETE
Voiceprint-URI: http://media.server.com/VoicePrints
Voiceprint-Identifier: 13579
Voiceprint-Exists: false
Check the second choice in the n-best list.
C->S:ANNOUNCE rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 8
Session: 12345678
Content-Type: application/mrcp
Content-Length: 119
VER-SET-VOICEPRINT 314165 MRCP/1.0
Voiceprint-URI: http://media.server.com/VoicePrints
Voiceprint-Identifier: 13479
Voiceprint ID 13479 exists.
S->C:RTSP/1.0 200 OK
CSeq: 8
Session: 12345678
Content-Length: 138
Content-Type: application/mrcp
MRCP/1.0 314165 200 COMPLETE
Voiceprint-URI: http://media.server.com/VoicePrints
Voiceprint-Identifier: 13479
Voiceprint-Exists: true
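The two VER-SET-VOICEPRINT exchanges above amount to walking the
n-best list until a candidate with an existing voiceprint is found.
A sketch of that selection logic is given below. The callable
query_voiceprint is an assumption standing in for the network round
trip; it is expected to return the MRCP header lines of the
response. All function names are illustrative.

```python
def parse_headers(text):
    """Parse 'Name: value' lines into a dict keyed by lowercase name."""
    headers = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def voiceprint_exists(response_headers):
    """True if the response carries 'Voiceprint-Exists: true'."""
    value = parse_headers(response_headers).get("voiceprint-exists", "false")
    return value.lower() == "true"


def first_enrolled_candidate(candidates, query_voiceprint):
    """Return the first n-best candidate whose voiceprint exists, else None.

    query_voiceprint: caller-supplied callable that issues
        VER-SET-VOICEPRINT for an identifier and returns the response
        header text (hypothetical transport hook).
    """
    for identifier in candidates:
        if voiceprint_exists(query_voiceprint(identifier)):
            return identifier
    return None


if __name__ == "__main__":
    enrolled = {"13479"}  # simulated repository contents

    def fake_query(identifier):
        exists = "true" if identifier in enrolled else "false"
        return (
            "Voiceprint-URI: http://media.server.com/VoicePrints\r\n"
            f"Voiceprint-Identifier: {identifier}\r\n"
            f"Voiceprint-Exists: {exists}\r\n"
        )

    print(first_enrolled_candidate(["13579", "13479"], fake_query))
```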
Start verification against voiceprint 13479.
C->S:ANNOUNCE rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 9
Session: 12345678
Content-Type: application/mrcp
Content-Length: 54
VER-FROM-BUFFER 314166 MRCP/1.0
Verify-Mode: verify
S->C:RTSP/1.0 200 OK
CSeq: 9
Session: 12345678
Content-Length: 33
Content-Type: application/mrcp
MRCP/1.0 314166 200 IN-PROGRESS
The caller is verified (assume num-min-verification-phrases and num-
max-verification-phrases are 1).
S->C:ANNOUNCE rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 3
Session: 12345678
Content-Type: application/mrcp
Content-Length: 183
VERIFICATION-COMPLETE 314166 COMPLETE MRCP/1.0
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 123
true 50 cellular-phone female accepted 50 1 50 cellular-phone female accepted 50
C->S:RTSP/1.0 200 OK
CSeq: 3
Stop the audio buffering session and clear the audio buffer.
C->S:ANNOUNCE rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 10
Session: 12345678
Content-Type: application/mrcp
Content-Length: 39
VER-BUFFERING-STOP 314167 MRCP/1.0
S->C:RTSP/1.0 200 OK
CSeq: 10
Session: 12345678
Content-Length: 30
Content-Type: application/mrcp
MRCP/1.0 314167 200 COMPLETE
End the verification session.
C->S:ANNOUNCE rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 11
Session: 12345678
Content-Type: application/mrcp
Content-Length: 33
VER-END-SESSION 314168 MRCP/1.0
S->C:RTSP/1.0 200 OK
CSeq: 11
Session: 12345678
Content-Length: 30
Content-Type: application/mrcp
MRCP/1.0 314168 200 COMPLETE
Tear down the recognizer and verification resources.
C->S:TEARDOWN rtsp://media.server.com/media/verification-resource
RTSP/1.0
CSeq: 12
Session: 12345678
S->C:RTSP/1.0 200 OK
CSeq: 12
C->S:TEARDOWN rtsp://media.server.com/media/recognizer RTSP/1.0
CSeq: 13
Session: 12345678
S->C:RTSP/1.0 200 OK
CSeq: 13
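In the session above, client requests carry CSeq values 2 through
13, while the server-initiated ANNOUNCE messages use the server's
own sequence (1 through 3). Each RTSP response echoes the CSeq of
the request it answers. The following sketch shows one way a sender
might allocate CSeq values and match responses to outstanding
requests; the CSeqTracker class is a hypothetical helper, and each
side would keep its own instance.

```python
class CSeqTracker:
    """Allocate CSeq values for outgoing RTSP requests and match responses."""

    def __init__(self, first=1):
        self._next = first
        self._pending = {}

    def send(self, method):
        """Record an outgoing request and return the CSeq to use for it."""
        cseq = self._next
        self._next += 1
        self._pending[cseq] = method
        return cseq

    def on_response(self, cseq):
        """Return the method answered by this response.

        Raises KeyError if no request with that CSeq is outstanding.
        """
        return self._pending.pop(cseq)

    def pending(self):
        """Return the CSeq values still awaiting a response, in order."""
        return sorted(self._pending)


if __name__ == "__main__":
    client = CSeqTracker(first=2)   # the example session starts at CSeq 2
    setup = client.send("SETUP")
    announce = client.send("ANNOUNCE")
    print(client.on_response(setup), client.pending())
```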
8.3. Hotword Recognition
Will be provided later.
9. Security Considerations
The primary additional security considerations raised by the
extensions in this document have to do with the use of speaker
identification and verification as security functions. One such
consideration is that individualized voiceprints are used to
identify or confirm the identity of a caller. The privacy and
integrity of these voiceprints are of high importance. Fortunately,
voiceprints are not transferred between client and server but are
rather maintained by the server using the server's own security
mechanisms.
Another consideration particular to these functions is the
consequence of manipulating the media (speech) stream. Some
verification technologies in use today are susceptible to
impersonation or "replay" attacks, and all are susceptible to a
denial of access attack by garbling an otherwise acceptable media
stream. We recommend that standard media-securing protocols such as
SRTP be used in these cases.
10. Reference Documents
[1] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
[2] Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time
Streaming Protocol (RTSP)", RFC 2326, April 1998.
[3] Shanmugham, S., et al., "A Media Resource Control Protocol
Developed by Cisco, Nuance, and Speechworks.", Internet-draft
draft-shanmugham-mrcp-04, (work in progress), May 1, 2003
[4] World Wide Web Consortium, "Natural Language Semantics Markup
Language (NLSML) for the Speech Interface Framework", W3C
Working Draft, 30 May 2001.
[5] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", RFC 2119, March 1997.
Acknowledgements
The authors would like to thank the following additional individuals
for their contributions to this document:
Andre Gillet (Nuance Communications)
Saravanan Shanmugham (Cisco Systems, Inc.)
Full Copyright Statement
Copyright (C) The Internet Society (2003). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights
defined in the Internet Standards process must be followed, or as
required to translate it into languages other than English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Authors' Addresses
Daniel C. Burnett
Nuance Communications
1005 Hamilton Court
Menlo Park, CA 94025-1422
USA
Email: burnett@nuance.com
Pierre Forgues
Nuance Communications Ltd.
111 Duke Street
Suite 4100
Montreal, Quebec
Canada H3C 2M1
Email: forgues@nuance.com
Charles Galles
Intervoice, Inc.
17811 Waterview Parkway
Dallas, Texas 75252
Email: charles.galles@intervoice.com
This document expires on April 17, 2004.