Imagine finding the perfect training video, but it is an hour long and you need to find the part that covers your specific interest. Or perhaps you are an advertiser who would like content-aware advertising in specific online videos: your ad shows up on a page when a particular word is mentioned in a video, or when a video is searched for that word. Does that sound familiar? It should, as it is the Google model for advertising, and now it can apply to video.
The solution that I will cover in this article allows you to transcribe the spoken word in videos into text and lets a user search that text to find relevant areas of the video. The example that we will cover is only the beginning. I could imagine an entire searchable library of video, or possibly even a web search engine that finds a video and starts playing it at the point where the search term was spoken. Before we jump in, it might be helpful to see an example.
In order to create the transcription you will need either Premiere® Pro CS4 or Adobe® Soundbooth® CS4. My opinion is that if you are going to do any serious video editing you should consider Adobe® Creative Suite® 4 Production Premium as it contains everything you will need to edit, composite and publish high quality video projects.
I should start by telling you what I am not going to cover. I am not going to cover how to get started with Premiere Pro, Soundbooth and the Flash Media Server. There are a number of great resources for each of these products. What I am going to cover is the specific functionality in each of these products that relates to creating searchable video. Let's start with Adobe Premiere Pro.
Once your project is created and the video and sound are edited to your satisfaction, it is time to transcribe the video. This is a very straightforward step. Simply open or expand the Metadata panel and you should see a button at the bottom of the panel called Transcribe. If you do not see the panel, you can open it from the Window menu. The following panel allows you to transcribe as well as edit an existing transcription.
When you click the Transcribe button you will see the following dialog that allows you to choose the language, quality and whether to identify speakers.
The language choice is pretty straightforward. As for the quality, I see no reason to ever choose anything other than high. When you choose to identify speakers, you will notice something like <Name>Speaker 1</Name> in the exported XML. I can see this being an interesting addition to an application, but for this example I did not take advantage of this feature.
This process can take some time to complete. Keep in mind that an accurate transcription relies on two very important criteria. First and foremost, a good quality sound track is critically important. If you have background noise or a muffled speaker, you will not get a very accurate transcription. Second is the dictionary that is used to recognize words. To give you some context, imagine attending a lecture where the speaker spoke in a language that you did not understand: you could hear every sound and still not be able to write down the words. A transcription engine is in the same position when it meets vocabulary that is not in its dictionary. A custom library is a library with specific word definitions that allow Premiere to more accurately identify and transcribe the spoken word. This is an area that you will see expanded on in future versions of the Production Premium Creative Suite.
When the transcription process is complete, the XML that represents that transcription is stored as XMP metadata as part of the video file. It is possible to access and use XMP metadata that is stored in a SWF, FLV or F4V file via ActionScript, but this should be carefully considered. If you create a long format video that contains a great deal of speech, the finished transcription can represent a fairly large increase in file size. As an example, the following represents two files of the same length encoded with the same settings; the larger file has metadata and the smaller does not.
Accessing transcription metadata stored in the final video file can also cause problems for video streamed from the Flash Media Server because of the way that text is stored in the video file. I found that the best way to use this transcription to create online searchable video was to export it to an XML file so that it did not remain part of the video file. This may be addressed in the future, but for now, if your video is longer than, say, 15 minutes and you plan to stream it, it would be best to export the transcription to an XML file that is stored outside the video file.
I would suggest that you exclude the metadata from the final video that will be displayed online. To exclude metadata during export from Premiere Pro CS4, click the panel menu button on the right-hand side of the options panel in the export dialog box. In the popup, deselect Include Source XMP Metadata.
One more slight wrinkle in the process is that you cannot export the transcription as xml from Premiere. The only way to export the transcription is from Soundbooth.
Editing The Transcription:
Anyone familiar with speech to text knows that it is not possible to get one hundred percent accuracy. With that said, it is possible to get a high degree of accuracy given the criteria mentioned earlier in this article. There are a number of tools that can help you clean up the transcription in both Soundbooth and Premiere. Once the transcription process is complete and you see the text in the Metadata panel, you should see the play, loop and transcribe buttons. You should also see a search field at the top of the Metadata panel in both Soundbooth and Premiere. The search field allows you to search for words that are in the transcription. This can prove very useful when you find a word that was not transcribed properly and may show up in multiple locations throughout the transcription. Once you find a word that needs to be corrected, simply double click the word and you will be able to edit it. Depending on how the word was recognized during the transcription process, you may find that you need additional functionality. If you right click on a word in the Metadata panel you will see the additional features.
Exporting The XML:
As I mentioned previously, you cannot export the XML file required for this example from Premiere; that must be done in Soundbooth. Soundbooth has many fantastic features, including volume correction and my favorite, visual audio healing. You might wonder why I am mentioning Soundbooth features under the Exporting The XML heading. Since you have to open the audio in Soundbooth to export the transcription anyway, it might be a good idea to leave any last minute touch-ups for your audio track until just before you export the transcription as XML. Then it is a simple matter to right click the audio track in Premiere and select Edit in Adobe Soundbooth and then Render and Replace.
From Soundbooth you will be able to clean up your audio, if required, and export the transcription as xml. The changes that you make in Soundbooth will show up in Premiere as soon as you save thanks to Dynamic Link. Once you are in Soundbooth you should see that the transcription has come over with the audio track. To export the xml file simply go to the File menu and then down to Export then select Speech Transcription.
Once the XML file is exported, it is time to finish encoding your video into either an FLV or an F4V. If you have made changes to the audio in Soundbooth, save and return to Premiere to finish encoding your video.
When considering the functionality required to create searchable video, streaming versus progressive delivery is an important consideration. Progressive video is accessed over HTTP. Streaming video is accessed over RTMP. For searchable video there are a couple of key differences that should help you choose the right technology. First, navigation within the video is key to the idea of searchable video. With progressive delivery you would have to download all of the video from the start up to the place the user is navigating to. If the video is fairly long, this can be very frustrating for a user. You can imagine doing a search, finding the word you are looking for close to the end of, say, a forty minute video, and having to wait for that amount of video to download before being able to jump to that portion of the video. Streaming, on the other hand, simply starts sending the video from the part requested. Streaming makes searchable video usable even for long format video.
Another key feature of the Flash Media Server that assists in making searchable video a reality is something called enhanced seek. The following is a definition from the online documentation:
“Enhanced seeking is a Boolean flag in the Application.xml file. By default, this flag is set to false. When a play occurs, the server seeks to the closest video keyframe possible and starts from that keyframe. For example, if you want to play at time 15, and there are keyframes only at time 11 and time 17, seeking will start from time 17 instead of time 15. This is an approximate seeking method that works well with compressed streams.
If the flag is set to true, some compression is invoked on the server. Using the previous example, if the flag is set to true, the server creates a keyframe–based on the preexisting keyframe at time 11–for each keyframe from 11 through 15. Even though a keyframe does not exist at the seek time, the server generates a keyframe, which involves some processing time on the server.”
It should be fairly easy to see how searchable video benefits from this functionality, but to make sure, I will put it in the context of this example. Imagine doing a number of searches on a long video. Without enhanced seek, unless there is a keyframe on every frame, some percentage of searches would take you past the point where the word was spoken in the video, leaving you to wonder whether the word was found at all.
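To make the difference concrete, here is a minimal sketch in Python of the seeking behavior that the documentation describes. It is only an illustration of the quoted description, not the server's actual implementation; the function name and parameters are my own.

```python
from bisect import bisect_left


def seek_start_time(keyframes, requested, enhanced_seek=False):
    """Return the time playback would start from for a seek request.

    keyframes is a non-empty, sorted list of keyframe times in seconds.
    """
    if enhanced_seek:
        # The server generates a keyframe at the requested time.
        return requested
    # Otherwise playback starts from the closest existing keyframe,
    # which may be before or after the requested time.
    i = bisect_left(keyframes, requested)
    candidates = keyframes[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda k: abs(k - requested))


# The example from the documentation: keyframes at 11 and 17, seek to 15.
print(seek_start_time([11, 17], 15))                      # closest keyframe
print(seek_start_time([11, 17], 15, enhanced_seek=True))  # exact time
```

With the flag off, a search hit at second 15 starts playing at second 17, two seconds after the word was spoken. With the flag on, playback starts at 15.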
The Final Step:
At this point you should have a video encoded as either an FLV or an F4V and an XML file that represents the transcription from the video file. In the second paragraph of this article there is a link to an example; the sample provided with this article has very close to the same functionality as that example. You can download the sample with the link provided at the bottom of this article. Once you have downloaded and unpacked those files you should have the following files:
You should also have a skins folder. All of the files and the skins folder should be in the same directory on a web server or in the webroot in the Flash Media Server installation directory. The following is the typical path for the webroot folder when installed on Windows XP.
C:\Program Files\Adobe\Flash Media Server 3.5\webroot
This is of course required so that you can access the content from a web browser. The searchableVideo.html file is the HTML wrapper that hosts the SWF. This is not required; you can use your own HTML page to host the searchableVideo.swf. It is easy to add the searchableVideo.swf to your own HTML page from within Dreamweaver. Just add these files and the skins folder to your Dreamweaver site files, create a new page or open an existing page, and simply drag the SWF file (searchableVideo.swf) onto that page. Dreamweaver will create the object and embed code for you. Keep in mind that the searchableVideoPrefs.xml file needs to be in the same directory as the searchableVideo.swf on the web server.
Organizing files for access over the web is much easier now with the Flash Media Server 3.5 because it has an embedded web server. You can simply drop the HTML and SWF files that you would like someone to access using a web browser into the webroot folder in the installation folder of the Flash Media Server. Once you have the files set up, it is time to edit the searchableVideoPrefs.xml file, which allows you to specify the location of the transcription XML file and the video. You can also make changes to some of the player functionality and appearance from this same searchableVideoPrefs.xml file.
I would suggest that you open the searchableVideoPrefs.xml file in either Dreamweaver or an XML editor; do not use a word processor. Once the file is open you should see the <sourceURI> tag. It is important that you do not edit any of the tags, only the content between them. For instance, when you are editing the <sourceURI>, which represents the location of the video, make sure that you do not change either <sourceURI> or </sourceURI> in any way, only the text between those tags. This is true for any tags in this file. If you wanted to play a video on a server with a DNS name of foo.com, and the video was in a Flash Media Server application called lectureVideos, and the name of the video was chem101.flv, the <sourceURI> tag should look like the following:
This may seem complex but it is really fairly simple. When you install the Flash Media Server on Windows it has two pre-installed applications, vod and live. These applications are represented as folders in the following location: C:\Program Files\Adobe\Flash Media Server 3.5\applications. The name of the application is the name of the folder in that location. In the example that we mentioned, you would have a folder called lectureVideos in the applications directory mentioned above (e.g. C:\Program Files\Adobe\Flash Media Server 3.5\applications\lectureVideos). Inside of that folder you will have a folder called streams, inside of that a folder called _definst_, and inside of that folder your video file. So the entire path to your video file on the Flash Media Server would be C:\Program Files\Adobe\Flash Media Server 3.5\applications\lectureVideos\streams\_definst_\chem101.flv. Remember that if you are using an F4V file you must prefix the file name with mp4:. For instance, if the previous example was an F4V and not an FLV, the sourceURI tag would be the following:
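The relationship between the application name, the file on disk and the address that goes between the <sourceURI> tags can be sketched in a few lines of Python. This is only an illustration: the helper names are my own, and the address assumes the usual rtmp://server/application/stream form, so check the value against the sample searchableVideoPrefs.xml before relying on it.

```python
from pathlib import PureWindowsPath

# Typical Windows install location; adjust for your own server.
FMS_APPLICATIONS = PureWindowsPath(
    r"C:\Program Files\Adobe\Flash Media Server 3.5\applications"
)


def stream_name(file_name):
    """Prefix F4V file names with mp4:, leave FLV file names unchanged."""
    if file_name.lower().endswith(".f4v"):
        return "mp4:" + file_name
    return file_name


def source_uri(host, application, file_name):
    """Build the address of the video (assumed rtmp://host/app/stream form)."""
    return f"rtmp://{host}/{application}/{stream_name(file_name)}"


def server_path(application, file_name):
    """Location of the file on disk for the application's default instance."""
    return FMS_APPLICATIONS / application / "streams" / "_definst_" / file_name


print(source_uri("foo.com", "lectureVideos", "chem101.flv"))
print(source_uri("foo.com", "lectureVideos", "chem101.f4v"))
print(server_path("lectureVideos", "chem101.flv"))
```

Note that the application name appears in both the address and the folder path, while the streams and _definst_ folders exist only on disk and are not part of the address.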
Now you should have the hang of editing the searchableVideoPrefs.xml file. Next we are going to edit the location of the transcription XML file that was generated in Premiere and exported from Soundbooth. The same precaution holds true for editing this tag: do not change the tags in any way, only the contents. This location can either be relative to the SWF or a complete HTTP URI. For instance both of the following are correct:
In the first example, the RedWorkflow.xml (i.e. the transcription XML file) would be in an assets folder in the same folder as the searchableVideo.swf and searchableVideoPrefs.xml files. In the second example, the RedWorkflow.xml file could be stored on the same or a different server and accessed over HTTP.
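If you are unsure where a relative location ends up, the following Python sketch shows how a relative and a complete location resolve against the address of the SWF. The host names and folder names here are hypothetical, and the sketch uses standard URL resolution rules, which I am assuming match what the player does.

```python
from urllib.parse import urljoin


def resolve_transcript_url(swf_url, transcript_location):
    """Resolve the transcription location against the SWF's own address.

    A relative location is resolved from the folder that holds the SWF;
    a complete http address is returned unchanged.
    """
    return urljoin(swf_url, transcript_location)


# Hypothetical address of the player on a web server.
swf = "http://foo.com/video/searchableVideo.swf"

print(resolve_transcript_url(swf, "assets/RedWorkflow.xml"))
print(resolve_transcript_url(swf, "http://media.foo.com/xml/RedWorkflow.xml"))
```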
Let's take a look at how the finished example should look.
In the following image I have highlighted and labeled the important areas of the player so that it will be easier to follow my references:
The Text Cloud section represents the most frequently occurring words in the transcription. This allows the viewer to get an idea of the kind of content in the video. The viewer can click on any of the keywords in the Text Cloud in order to display those results in the Found Items section. The Relevant Text section is displayed when a viewer clicks an item in the Found Items section; it displays the text surrounding the found text. The Video Display shows the video as well as markers representing the found text. A viewer can roll over a marker to display the surrounding text as well as seek to that portion of the video. The Search Field allows a user to enter and search on one or more words, and the Found Items section displays the result. It is also possible to sort the columns in the Found Items section by clicking the column headings.
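The idea behind the Search Field, Found Items and Relevant Text sections can be sketched in Python. This is not the code of the sample player, and the transcript here is a made-up list of words with start times in seconds, not the actual format of the exported XML.

```python
def find_word(transcript, query, context=2):
    """Find every occurrence of query in a timed transcript.

    transcript is a list of (word, start_seconds) tuples. Returns a list
    of (start_seconds, surrounding_text) tuples, one per occurrence.
    """
    words = [word for word, _ in transcript]
    hits = []
    for i, (word, start) in enumerate(transcript):
        # Ignore case and trailing punctuation when comparing.
        if word.lower().strip(".,!?") == query.lower():
            surrounding = " ".join(words[max(i - context, 0):i + context + 1])
            hits.append((start, surrounding))
    return hits


# Hypothetical transcript data for illustration only.
transcript = [
    ("Today", 0.0), ("we", 0.4), ("cover", 0.6), ("metadata", 1.1),
    ("in", 1.8), ("video.", 2.0), ("Metadata", 3.2), ("is", 3.9),
    ("searchable.", 4.1),
]

for start, text in find_word(transcript, "metadata"):
    print(start, text)
```

Each hit carries the time to seek to and the text to show in the Relevant Text section, which is all the player needs to place a marker on the video.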
There are a number of other settings in the searchableVideoPrefs.xml that you can use to change the functionality and appearance of the searchable video player. First is minwordLength, which sets the minimum word length for the text cloud at the bottom of the example. A setting of 5 means that no words shorter than 5 letters will show up in the text cloud. Next is videoInitStartSeconds, which determines the initial start time of the video when the player is first launched. A few other settings affect the appearance of the player, and they are as follows:
globalTextColor : sets the color for ALL text on screen
swfBackgroundColor : sets the background color of the player
foundTextHighLightColor : changes the color of the text in the relevant text display
textBackGroundColor : changes the background color for the text cloud and the relevant text area
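To show what the minimum word length setting described above does to the text cloud, here is a minimal sketch in Python. It is not the player's code; the function and its parameters are my own, and it simply counts words and drops the short ones.

```python
from collections import Counter


def text_cloud(words, min_word_length=5, top_n=10):
    """Return the most frequent words that meet the minimum length.

    The result is a list of (word, count) tuples, most frequent first.
    """
    # Normalize case and strip punctuation so "Video." and "video" match.
    cleaned = (word.strip(".,!?;:\"'").lower() for word in words)
    counts = Counter(word for word in cleaned if len(word) >= min_word_length)
    return counts.most_common(top_n)


# Hypothetical transcript text for illustration only.
sample = "The video metadata makes video search easy and video metadata small"
print(text_cloud(sample.split(), min_word_length=5, top_n=2))
```

With a minimum length of 5, common short words such as "the" and "and" never reach the cloud, which is why a higher setting tends to give a better picture of what the video is about.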
Well that should get you started. If you have problems, interests or comments please leave them on this site so that I can respond. There are many possible additions and changes that I can think of for this example. I am eagerly waiting to see and hear how people have used this example so please leave me a comment.