Notes from Writing HTML5 Media
This last weekend I finished my latest book for O'Reilly: HTML5 Media. This is one of O'Reilly's shorter books (about 100 pages), primarily aimed at the eBook market, though you can get a hard copy with print-on-demand.
The book focuses on the HTML5 audio and video elements. I cover how to use the elements in a web page, go into detail on the attributes for each element, and cover video and audio codec support. I also devote a couple of chapters to developing with both elements, including how to create a custom control, as well as integrating the media elements with the canvas element and SVG.
I enjoyed working on this book. I enjoyed working with the media elements, though I'm more partial to the video element. Working on the book was also a learning experience and even, at times, an eyebrow-raising experience. I thought I would share with you all some of the notes I wrote while working on the book.
WebVTT versus TTML
The WHATWG group started working on a subtitle/caption format based on the SRT (SubRip) text format. The original name was WebSRT, but it was recently renamed to WebVTT. The LeanBack Player web site provides a good review of WebVTT.
WebVTT is a pretty basic format, consisting of line numbers, timelines, and text with formatting options. There are plans to add additional capabilities, but what we have now should meet most needs.
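To give a feel for how basic the format is, here's a minimal sketch that builds a WebVTT file from a list of cues. The `Cue` shape and the `toWebVTT` and `formatTimestamp` functions are names I made up for this sketch, not part of any browser API. It assumes the format as currently drafted: a `WEBVTT` header line, then blocks made of an identifier, a timing line with `-->`, and the text. Details could still change.

```typescript
// Hypothetical cue shape for this sketch; not part of any browser API.
interface Cue {
  id: string;    // cue identifier (SRT uses a running number here)
  start: number; // start time in seconds
  end: number;   // end time in seconds
  text: string;  // caption text
}

// Left-pad a number with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) {
    s = "0" + s;
  }
  return s;
}

// Format seconds as hh:mm:ss.ttt. WebVTT puts a dot before the
// milliseconds, where SRT uses a comma.
function formatTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + "." + pad(ms, 3);
}

// Join the header and the cue blocks, separated by blank lines.
function toWebVTT(cues: Cue[]): string {
  const blocks = cues.map(
    (cue) =>
      cue.id + "\n" +
      formatTimestamp(cue.start) + " --> " + formatTimestamp(cue.end) + "\n" +
      cue.text
  );
  return ["WEBVTT", ...blocks].join("\n\n") + "\n";
}
```

Calling `toWebVTT` with a single cue from 1.5 to 4 seconds produces a header, a blank line, and one three-line cue block.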
There's been interest in bringing WebVTT over to the W3C. However, the W3C already has a timed text specification, TTML. TTML is an XML-based format that is more sophisticated than WebVTT, but also more complicated to use.
I covered WebVTT in the book in detail, but only briefly mentioned TTML. I didn't spend time on TTML for two reasons: existing support, and the industry movement away from XML.
On that second point: TTML is an XML format. Now, XML might have been the approach to take a half dozen years ago, when most everything at the W3C was heading in an XML direction. In the last several years, though, we've seen the popularity of the RDF/XML serialization fade in favor of Turtle or RDFa, and XHTML2 abandoned in favor of HTML5. SVG is still holding on, but now there are rumblings of an API that will generate SVG or canvas API calls, and basically hide most of the XMLness of SVG from view. I vaguely remember reading something somewhere that the folks working on TTML were even thinking of creating a JSON version of the spec.
Whether intended or by accident, there is a subtle but noticeable shift away from XML in the W3C. At the same time, there is a strong core of support for XML formats in the W3C. Given these two seemingly contradictory paths, I'm thinking we should just skip the interim pain and anguish of yet another format war, and go right to the end point. So I covered SRT and WebVTT and only mentioned TTML in passing.
Protecting the Users from the Big Bad Web Developers
I like HTML5 video and audio, I really do. I had a great deal of fun writing this book. However, despite my affection for these elements I must also admit to some irritation with their design and implementation. (Well, other than the fact that an entire block of the specification changed mysteriously one night, requiring a sudden and unexpected re-write in one of my chapters.)
The part about the HTML5 media elements I like the least is the seeming level of distrust directed at web page authors and developers.
For instance, if you're creating a custom control and remove the controls attribute, you may think you then have complete control over the media playback. You don't, though—at least, not in most browsers.
In the section of the HTML5 spec related to the media element's user interface, implementors are advised to provide playback control in some manner regardless of whether the controls attribute is present or not:
Even when the attribute is absent, however, user agents may provide controls to affect playback of the media resource (e.g. play, pause, seeking, and volume controls), but such features should not interfere with the page's normal rendering. For example, such features could be exposed in the media element's context menu.
If you right-click on a video element in Firefox, you're given the options to play or pause the video, mute the volume, play the video in fullscreen, show or hide the controls, as well as save the video or play the video by itself in another page. Chrome provides options to play, pause, or mute the video, as well as show or hide the controls, open the video in another tab, or save the video. Opera's context menu options are similar to Chrome's, minus the option to open the video in a new tab. IE10 provides play, pause, mute options, the ability to save the video, and the ability to control playback speed. Safari is the only browser that doesn't provide context menu options to control the video. At least, not yet.
There is absolutely no way to directly control what does or does not display in the context menu that the browser provides. There is no way to control some of the actions that people can take in the context menu, such as preventing the fullscreen display of the video, if you don't want it played fullscreen.
If you're providing custom controls for the video, you have to account for the fact that the video playback is being managed by the context menu as well as your controls. One of my examples in the book provides a video playback control that consists of separate buttons for play, pause, and stop. These buttons are enabled or disabled based on what action the user takes. It seems like a simple act to just disable and enable the appropriate buttons at the same time you play or pause the video, but you actually have to capture two sets of events: the click events from the buttons, and the play and pause events from the video.
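Here's a rough sketch of the two-sets-of-events idea, cut down to play and pause buttons. This is not the example from the book. The `MediaLike` and `ButtonLike` interfaces and the `wireControls` function are hypothetical names; the interfaces only mirror the few members of a video element and a button that the sketch touches, so it can run without a browser.

```typescript
// Minimal stand-ins for the DOM members this sketch uses (hypothetical names).
interface MediaLike {
  paused: boolean;
  play(): void;
  pause(): void;
  addEventListener(type: string, listener: () => void): void;
}

interface ButtonLike {
  disabled: boolean;
  addEventListener(type: string, listener: () => void): void;
}

function wireControls(media: MediaLike, playBtn: ButtonLike, pauseBtn: ButtonLike): void {
  // Button state is always derived from the media element itself,
  // never set directly inside the click handlers.
  const sync = (): void => {
    playBtn.disabled = !media.paused;
    pauseBtn.disabled = media.paused;
  };

  // First set of events: clicks on the custom buttons only request a change.
  playBtn.addEventListener("click", () => media.play());
  pauseBtn.addEventListener("click", () => media.pause());

  // Second set: play and pause events from the media element, which also
  // fire when playback is changed from the browser's context menu.
  media.addEventListener("play", sync);
  media.addEventListener("pause", sync);

  sync(); // set the initial button state
}
```

Because the buttons only change state in response to the media element's own events, they stay correct no matter who started or paused playback.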
Of course, the amount of extra code to do something like enable and disable buttons based on playback is trivial. But what isn't trivial is controlling which options are made available to the user. If, for whatever reason, you don't want the video to be played fullscreen, there is absolutely no way to prevent this from happening with Firefox.
The only way to prevent the context menu from displaying for the video is to provide a transparent div overlay for the video, so that the context menu reflects the div element, not the video. That or turn the video element's display off, and play the video by redrawing it into a canvas element—a case of overkill, just to be able to control video playback.
The conflict between the context menu and customization isn't the only web developer/author restriction.
There are the times when the web page author wants the audio or video to begin automatically when the page loads. The media elements do provide attributes for this: autoplay and loop. To ensure automatic playback, the author removes the controls attribute, adds autoplay and possibly loop, and when the page loads, the media element begins playing. The web page author can also remove the audio element completely from display so all that's left is the sound. The video element is, of course, left displayed, but the control UI should not be showing. We can't control the context menu options, but at least the control UI isn't displaying. Well, not unless scripting is disabled, that is.
If the user has scripting disabled, the control UI is automatically re-displayed with the media element—even if you don't want it to be displayed. If scripting is disabled, you cannot control the visibility of the control UI. According to the HTML5 specification:
If the attribute is present, or if scripting is disabled for the media element, then the user agent should expose a user interface to the user.
I've been told by members of the HTML WG that "should" in this context is equivalent to "must". The two terms are not the same, but I gather that they become one in HTML5 land.
Currently, only Opera provides a visual control UI when scripting is disabled. Firefox doesn't display a visual control UI when scripting is disabled regardless of whether the controls attribute is present or not. Safari, Chrome, and IE currently do not display the control UI. If "should" is equivalent to "must", then Firefox, Safari, Chrome, and IE are all in error in their handling of disabled scripting and the media elements. I imagine bugs will be filed, if they haven't already been filed, and these browsers will also automatically add the control UI when scripting is disabled.
I hear people cheering. You're cheering, aren't you? You're all insanely happy with the power given the end user with the HTML5 video and audio elements.
Most of us remember those times when we opened a web page and some horrid music was blaring, or a video automatically played with some idiot in a suit talking about his constipation. If we're at work, we keep our machines permanently muted, lest something embarrassing blare out at an inopportune moment. If screen readers are not sophisticated enough to automatically lower background sound when a page is opened, the background sound competes badly with the reader.
Automatic audio, bad. Automatic video, bad.
I also imagine most of you have forgotten your visits to sites where you expected music or a video to play, and how much you enjoyed a well crafted multimedia experience.
Consider sites devoted to movies. Currently the last of the Harry Potter movies, the latest Transformers, and the new Spielberg Super 8 are playing in movie theaters. All three movies have their own web sites. If you open all three movie sites, you'll find extensive use of both audio and video media.
The Harry Potter site opens with a preview of the movie with its own custom control that automatically starts playing as soon as it is sufficiently loaded. Among the options not provided with this video are the ability to open the movie out of context of the frame, such as opening the video in fullscreen. At most you can start or stop the video, choose a different video format, or skip the video and go to the site offerings.
The site offerings page has audio playing in the background. In addition, the bottom of the page features an animated video of owls. You can do nothing to stop either.
The Transformers movie site also provides background sound, as well as a video that begins to play automatically and loops continuously in the splash page. When you enter the site, another video plays continuously in the background of the page. Again, sound is used. There is very little about the site that is text-based: it's all eye and ear candy.
The Super 8 movie site provides an automatically playing trailer with its own control. The page also has background audio when the trailer is finished. One of the sections of the site is the Editing Room. This page features a video playing automatically in an old 8mm style. Once this video is finished, rows of film are displayed. You can click a control that opens another video, again in Super 8 style, providing a back story for the movie. You're provided with controls to play the video and mute the sound.
None of these movie sites provide a context menu for their videos, other than what you would expect to see with a Flash movie. None of the sites allow you to open the videos in fullscreen or play in a separate tab, because the videos are part of an integrated whole. The sites don't allow you to switch off audio that I can see. I realize that automatically playing audio can be irritating for some, and can play havoc with screen readers, but again, none of this is unexpected for a movie site.
These types of sites will never be created using HTML5, not because HTML5 isn't capable of creating most of the effects, but because HTML5 deliberately circumvents finer control over the video element. Can you imagine what would happen with the Transformers site with scripting disabled? The browser would then automatically plunk the control UI over the video in the page, which would ruin the overall effect the page creator was trying to make.
The unfortunate consequence of making HTML5 video and audio unattractive for these sites is that once they start using Flash for one component of the site, they continue to use Flash for every component of the sites. If you open these pages and use a screen reader such as NVDA, the only sound you'll get is the background audio because every last bit of the site is in Flash: the text, the menus, all of it.
We want these sites to consider using HTML5 instead of just Flash, because if they do, the sites will end up being more accessible rather than less. Yes, even if the HTML5 media elements don't have a control UI, and audio and video are played automatically. If we want to convince people to use something other than Flash, we need to ensure they have the same level of control that they had with Flash. Currently, the HTML5 video and audio elements do not provide this level of control.
HTML Media and Security
During the recent brouhaha related to WebGL security, the HTML5 editor, Ian Hickson, discovered that the video element, as it was currently defined, would not allow the cross-domain access that the img element provides. In other words, if the video you linked in with the src attribute was not from the same domain as your web page, the video wouldn't play. This restriction was lifted, and the video (and track) resources are now treated the same as image resources.
However, one of the safety features related to cross-domain resource access was the concept of canvas tainting. If the image or video drawn into a canvas element is from another domain, the canvas is marked as tainted (the origin-clean flag is set to false). When the canvas is tainted, the toDataURL, getImageData, and measureText methods generate a security exception. You couldn't circumvent the same-origin restriction by using Ajax, either, because it would not allow cross-domain resource access.
Of course, much of this has changed because of the WebGL security issues. Originally WebGL was limited to using only same-origin image access for canvas textures, but a more recent version of the specification allowed for cross-domain image access. WebGL developers wanted to add images (and potentially video) from other domains as textures for their 3D creations. Unfortunately, when the WebGL specification and implementations enabled cross-domain image access, they also opened up a security violation: WebGL could be manipulated in such a way as to create a "data leak", giving the web pages access to actual image (and video) data.
In order to allow WebGL to proceed without having to tackle the functionality causing the data leak (I'm told a daunting task), the WebGL community requested and received a new attribute that can be added to the img, audio, and video elements in HTML5: crossorigin. This attribute allows same-origin privileges with cross-domain resources, as long as the resource server concurs with this use. This is a concept known as Cross-Origin Resource Sharing, or CORS.
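To make the rule concrete, here's a simplified sketch of the decision as I understand it: can script read pixel data back after a given image or video is drawn into a canvas? The `DrawnResource` shape and the `canReadPixels` function are hypothetical names for illustration. The sketch assumes the server signals its agreement with an `Access-Control-Allow-Origin` response header, and it ignores credentials and every other corner of CORS.

```typescript
// Hypothetical description of a resource drawn into a canvas.
interface DrawnResource {
  origin: string;                   // origin the resource was served from, e.g. "http://media.example.com"
  crossorigin: boolean;             // was the crossorigin attribute set on the element?
  allowOriginHeader: string | null; // Access-Control-Allow-Origin value the server sent, if any
}

// Returns true if script on the page may read pixels back
// (toDataURL, getImageData) after drawing this resource.
function canReadPixels(pageOrigin: string, res: DrawnResource): boolean {
  if (res.origin === pageOrigin) {
    return true; // same origin: the canvas is never tainted
  }
  if (!res.crossorigin) {
    return false; // cross-domain without the attribute: the canvas is tainted
  }
  // With the attribute, the server has to concur, either for
  // everyone ("*") or for this page's origin specifically.
  return res.allowOriginHeader === "*" || res.allowOriginHeader === pageOrigin;
}
```

The point to notice is the last line: the attribute by itself does nothing unless the resource server sends the matching header.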
CORS is another specification in work at the W3C. It originated as a way for web developers to access cross-domain resources using XMLHttpRequest (Ajax). The concept has since been expanded to include workarounds for the same-origin security restrictions in other uses, including the newest related to canvas tainting.
It sounds all peaches and cream except that there are issues related to the concept, especially when accessing image and video data from cloud services such as Amazon's AWS or centralized image systems, such as Flickr. For CORS and the crossorigin attribute to work, these services must be willing to support CORS. The WebGL and other developers assumed the sites would be more than willing to do so. However, I know that Amazon has already expressed reservation about supporting CORS, and I wouldn't be surprised if there were some reluctance on the part of other services.
I also had reservations about the breathlessly quick addition of crossorigin to HTML5, starting with the unanswered question, "What would WebGL have done if HTML5 was too far along in the recommendation track to add this change?" I still have concerns about quickly adding in functionality that routes around security protocols because another specification needs to have this functionality because of a security violation. I've long been a fan of 3D efforts on the web, beginning with the earlier VRML and continuing with my interest in WebGL (I covered it in my Painting the Web book). However, I'm even more of a fan of web security. That and a stable specification. What would have happened if WebGL had made this request after HTML5 had progressed to candidate recommendation status?
Yes, I am a stick in the mud. I like stable specifications and secure web pages. I'm just old fashioned that way.
Anyway, for those wanting to integrate the HTML5 video and canvas elements, be aware of this very new functionality. You won't find it included in the HTML5 Last Call document; you'll only find it in the HTML5 editor's draft.
You would expect to find tables with audio and video browser container/codec support littering the internet, and you do. The only problem is, none of the tables seem to agree.
Trying to determine exactly what container/codec each browser supports is actually a pain in the butt. I'm sure each and every browser has a page somewhere that explicitly lists what it supports in all possible environments. Wherever these pages are, though, must be one of the better kept web secrets.
It's not as if there's a simple yes/no answer to audio or video codec support. After all, if you use the HTMLMediaElement's canPlayType method with various audio or video codecs, you'll get either "maybe", "probably", or an empty string. Maybe and probably are not normally viewed as decisive words. It also doesn't help when Chrome answers either maybe or probably to everything.
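About the best you can do with answers like these is rank them. The sketch below takes a list of candidate sources and returns the first one that earns a "probably", falling back to the first "maybe". The `SourceCandidate` shape and the `pickSource` function are hypothetical names, and the `canPlayType` parameter is passed in as a plain function so the sketch runs outside a browser.

```typescript
// The three answers canPlayType can give.
type CanPlay = "" | "maybe" | "probably";

// Hypothetical candidate shape for this sketch.
interface SourceCandidate {
  src: string;
  type: string; // MIME type, optionally with a codecs parameter
}

// Prefer the first "probably", then the first "maybe".
// Returns null if every answer is the empty string.
function pickSource(
  canPlayType: (type: string) => CanPlay,
  candidates: SourceCandidate[]
): SourceCandidate | null {
  let fallback: SourceCandidate | null = null;
  for (const candidate of candidates) {
    const answer = canPlayType(candidate.type);
    if (answer === "probably") {
      return candidate;
    }
    if (answer === "maybe" && fallback === null) {
      fallback = candidate;
    }
  }
  return fallback;
}
```

In a page you would pass something like `(t) => video.canPlayType(t)` as the first argument. Even a "probably" is a hint rather than a promise, so an error handler on the media element is still worth having.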
Then there are the quirks.
Firefox and Chrome only like uncompressed WAV files. Opera and Safari don't seem to mind compressed WAV files. Technically, though, all four browsers "support" WAV.
Both these statements are true: only Safari supports AAC; Safari, Chrome, and IE support AAC.
If you use a tool such as the Free MP3/Wma/Ogg Converter (http://www.freemp3wmaconverter.com/), you're given an option to convert your sound file to several different formats, including AAC and M4A. Many people will tell you AAC and M4A are one and the same. Well, yes and no.
The AAC option creates an AAC file that is packaged in a streaming format called Audio Data Transport System (ADTS). The M4A option is an AAC file that's packaged in MPEG-4. Since Safari can play whatever QuickTime can play on a system, and QuickTime can play the ADTS AAC file, the AAC file only plays in Safari. Chrome and IE can also play the AAC file, but only if it's wrapped in the MPEG-4 container, which Safari also supports.
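If you're not sure which packaging a given file uses, the first few bytes usually give it away. Here's a minimal sketch, with the hypothetical names `AudioContainer` and `sniffAacContainer`. It assumes the file starts directly with audio data (no ID3 tag or other prefix) and only tells you about the wrapper, not whether any particular browser will play it.

```typescript
type AudioContainer = "adts" | "mp4" | "unknown";

// Guess the packaging of an AAC file from its first bytes.
function sniffAacContainer(bytes: Uint8Array): AudioContainer {
  // An ADTS frame starts with a 12-bit sync word of all ones, and its two
  // layer bits are always zero. The 0xF6 mask checks the low four sync bits
  // and the layer bits, which keeps MP3 frames from matching.
  if (bytes.length >= 2 && bytes[0] === 0xff && (bytes[1] & 0xf6) === 0xf0) {
    return "adts";
  }
  // An MPEG-4 file normally begins with a box whose type,
  // stored at bytes 4 through 7, is the ASCII string "ftyp".
  if (
    bytes.length >= 8 &&
    bytes[4] === 0x66 && // f
    bytes[5] === 0x74 && // t
    bytes[6] === 0x79 && // y
    bytes[7] === 0x70    // p
  ) {
    return "mp4";
  }
  return "unknown";
}
```

A file that comes back as "adts" is the raw streaming form, and one that comes back as "mp4" is the M4A form.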
But wait...there's more!
No, no. I'm just joshing you.
Well, there really is more but I don't want to be cruel.
The confusion about support is further exacerbated by the politics surrounding container/codec support. Yes, Chrome supports MP4. No, Chrome does not support MP4. Yes, Ogg is the open source community's fair haired child. No, WebM is the open source community's fair haired child ... they just don't know it yet. Speaking of WebM, yes, WebM is a video container/codec, but it's also an audio container/codec—just leave out the video track.
Remember when everything was going to be Ogg and life was simpler?
Anyway, to add to the audio/video container/codec noise on the internet, my own versions of browser/codec support for the HTML5 audio and video elements.
Are they accurate? Sure. Why not.
What day is it?
*Make darn sure the WAV file is uncompressed
*Google has announced that Chrome will not support H.264. However, there are faint traces of support—ghosts if you will—still left in Chrome.
Official HTML5 Video Mascot
The official HTML5 video mascot is ....