Tuesday, 25 March 2008

VoiceXML and 3G video

I've been working with VoiceXML (www.w3c.org/voice, http://www.w3.org/TR/2004/REC-voicexml20-20040316/) for quite a while now at work: developing a VoiceXML 2.0 interpreter, running it with HP OCMP media platforms and with a SIP audio media platform, and creating VXML applications for products and customers.

Latest requirement: 3G video. How do you create an audio IVR service, or update an existing one, so that it has video capabilities?

Well, using a video-capable platform (such as HP's OCMP 4.0 or the NMS VoiceXML gateway), it's pretty easy to do the basics: play a 3gp file to get a video image up. Generally, the audio tag just does it if the file referenced is in the right format. But I find 2 big issues in creating a video service:
1/ creating those pesky video files for prompting
2/ dealing with 'dynamic' data

For audio services, the situation is pretty simple:
Point 1: use a studio or a 'batch' interface to a speech synthesis (TTS) engine to get .wav files
Point 2: use TTS directly in the VXML and say whatever you want (a name, a date, a value, etc.)

For video, it's not so easy. In fact, what we're missing is a sort of TTS for video: data + markup -> video stream or video file. Thus was born the "renderer server"!

In its current form, it takes markup (xhtml at the moment) and an output format (.3gp files currently), and does what's necessary to create a static 2-second video file with the correct format and encoding to work in a 3G call. This is already pretty much what we need for point 1!

You can try the thing here:
https://www.eloquant.net/rend/file/render/genVideo.html
(ok, time for license and disclaimer: this is provided purely for demo or personal interest, with no support or guarantees; if it breaks and takes your arm off then you've only yourself to blame)
Note that it only likes 'pure' xhtml and has no 'plugins' such as Flash or JavaScript...
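
As a rough illustration of what 'pure' xhtml means in practice, a small Python pre-check could flag a page before it is sent for rendering. This is only a sketch: the set of unsupported elements is an assumption based on the note above, not the renderer's actual rule set, and the function names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: elements of this kind are what the renderer cannot handle.
UNSUPPORTED = {"script", "object", "embed", "applet"}


def local_name(tag):
    """Strip a namespace prefix such as '{http://www.w3.org/1999/xhtml}'."""
    return tag.rsplit("}", 1)[-1].lower()


def check_pure_xhtml(markup):
    """Return a list of problems; an empty list means the page looks renderable."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError as err:
        # xhtml has to be well-formed XML before anything else.
        return ["not well-formed XML: %s" % err]
    problems = []
    for element in root.iter():
        name = local_name(element.tag)
        if name in UNSUPPORTED:
            problems.append("unsupported element <%s>" % name)
    return problems
```

Note that a plain XML parser like this one also rejects HTML-only entities such as `&nbsp;`, so numeric character references are the safer choice in the markup.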

I used this in a batch mode to create my static video prompts for my first demo video service!
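
A batch run of this kind could look roughly like the Python sketch below, which builds one small xhtml page per prompt and prepares an HTTP POST for each. The endpoint URL, the form field names (`markup`, `format`, `output`) and the prompt texts are all hypothetical: the renderer's real interface is not described in this post.

```python
from urllib.parse import urlencode
from urllib.request import Request
from xml.sax.saxutils import escape

# Hypothetical endpoint; substitute whatever the renderer really exposes.
RENDER_URL = "https://renderer.example.invalid/render"

# Minimal xhtml page used for every static prompt.
PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>{title}</title></head>"
    "<body><h1>{text}</h1></body></html>"
)

# Example prompts: output file stem -> text shown on screen.
PROMPTS = {"welcome": "Welcome", "menu": "1: News  2: Weather", "goodbye": "Goodbye"}


def build_request(name, text):
    """Prepare (but do not send) one render request for a static prompt."""
    page = PAGE.format(title=escape(name), text=escape(text))
    # Field names are assumptions made for this sketch.
    form = urlencode({"markup": page, "format": "3gp", "output": name + ".3gp"})
    return Request(RENDER_URL, data=form.encode("utf-8"), method="POST")


def build_batch(prompts):
    """Prepare one request per prompt, in a stable (sorted) order."""
    return [build_request(name, text) for name, text in sorted(prompts.items())]
```

Each prepared request could then be sent with `urllib.request.urlopen(request)` once the real URL and field names are known.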

Next : dynamic data
Often, in a vxml service, we get some text or details that need to be communicated to the caller. When we're in audio mode, TTS is used. Easy enough, but limited (there's only so much info you can absorb via your ears!). In video mode, we can potentially communicate a LOT more (a picture is worth a thousand words and all that). But we need to be able to generate that picture on the fly, as 3G-compatible files.

Requesting the generation is currently pretty simple: a subdialog call to the renderer, with the xhtml content, and it does the job! (Even better would be to call the renderer from the vxml interpreter whenever we get a text prompt during a video call, but that requires modifications to the vxml interpreter which I can't yet do.)

Then we have to get the video back. The best method would be to generate it as an RTP H.264 stream, sent directly to the 3G gateway. But I haven't got an RTP encoder from the static images working yet. Otherwise, an RTSP-controlled RTP stream back to the media platform would do. Nope, still blocked at the RTP encoder level...

So instead, the renderer generates the same 3gp file, and the vxml just 'plays' it. As these platforms generally just repeat the last image from a video when they run out of file, that's good enough for the moment.

In the VXML, this just boils down to
a/ create the xhtml to mark up the image you want (generally with inline CSS to deal with the small mobile screens)
b/ get the renderer server to create or update the 3gp file via a subdialog call
c/ play the resulting file directly from the renderer server
[I was going to embed some vxml here but the tags break the html...]
So get it here:
http://eloquant.nerim.net/blog_bits/render_demo.vxml
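
The real VXML is in the file linked above. Independently of it, steps a/ to c/ might be sketched in Python as a small generator that writes out a VoiceXML 2.0 page. Everything specific here is an assumption made for illustration: the renderer host, the `render.vxml` and `/files/` paths, the variable names `markup` and `output`, and the 176x144 (QCIF) screen size. It also assumes the renderer answers the subdialog request with a VXML document that simply returns.

```python
import json
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

RENDERER = "http://renderer.example.invalid"  # hypothetical host
VXML_NS = "http://www.w3.org/2001/vxml"


def build_xhtml(text):
    """a/ xhtml with inline CSS sized for a small screen (QCIF assumed)."""
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body '
        'style="margin:0;width:176px;height:144px;font-size:14px">'
        "<p>" + escape(text) + "</p></body></html>"
    )


def build_vxml(text, filename):
    """Return a VXML page that asks for a render, then plays the result."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "show"})
    # json.dumps yields a quoted, escaped string usable as an ECMAScript literal.
    ET.SubElement(form, "var", {"name": "markup", "expr": json.dumps(build_xhtml(text))})
    ET.SubElement(form, "var", {"name": "output", "expr": json.dumps(filename)})
    # b/ subdialog call that posts the markup and target file name to the renderer.
    sub = ET.SubElement(form, "subdialog", {
        "name": "render",
        "src": RENDERER + "/render.vxml",
        "namelist": "markup output",
        "method": "post",
    })
    # c/ once the subdialog returns, play the file straight from the renderer.
    filled = ET.SubElement(sub, "filled")
    prompt = ET.SubElement(filled, "prompt")
    ET.SubElement(prompt, "audio", {"src": RENDERER + "/files/" + filename})
    return ET.tostring(vxml, encoding="unicode")
```

Calling `print(build_vxml("Balance: 42 EUR", "balance.3gp"))` prints a page that could be served to the interpreter. Building the document with ElementTree rather than by string concatenation means the xhtml ends up correctly escaped inside the `expr` attribute.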
