SDP in WebRTC? Who cares…

I would like to believe that I’m not hopelessly confused and outdated with regard to what is going on with RTCWEB. Last I checked, my head is not stuck in the sand, nor have I been buried under a rock for the last several years. I recently watched the February 7th, 2013 netcast talking about the data channel and questions about how it relates to SDP and the SDP ‘application m-line’.

For the love of all that is human, why is SDP part of RTCWEB efforts at all?

To be clear, I’m talking about a few specific aspects of SDP: the format, the exchange of SDP between browsers and media negotiations via the offer/answer model (and all that it implies regarding the negotiation of media streams). Come to think of it, all that makes SDP, well… SDP. I know what some will say: We need to exchange some kind of blob-like information between browsers so they can talk, that’s why SDP is used. And I would respond “of course”! Beyond arguing how arcane SDP is as a format, RTCWEB was specifically designed not to do signaling stuff at all. That part was purposefully (and wisely IMHO) left out of the specification so that the future was wide open for whatever it might hold in creative solutions.

So why was it so wise to take out call flow signaling but then decide to keep media signaling, specifically the SDP and the offer/answer model? To be clear, I’m not suggesting we dump SDP in favour of something else, like JSON blobs or JavaScript structured data with unspecified exchange formats. Nor am I proposing a stateless exchange of SDP (breaking offer/answer). I’m saying that the API is operating at entirely too high a level.

What we really need in order to do future stuff in the browsers (yet remain compatible with the past) is a good API for a lower level media engine to create, destroy, control and manage media streams. That’s it. Write an engine that doesn’t take SDP, but manages much lower level streams and allow the programmer to dictate how they are plumbed together, which are active and inactive, and give events for the streams as they progress.

There are sources for audio/video such as microphones, camera or pre-recorded sources. There are destinations, like the speaker and the video monitor or perhaps even a recording destination. Then there’s the wiring and controlling of the pipelines between mixers, codecs engines and finally RTP streams that can be opened given the proper information (basically a set of ICE candidates that can optionally be updated e.g. trickle ICE). Media engines are well-understood things with many reference implementations to draw upon and abstract for the sake of JavaScript.

That’s the API I want. There’s no SDP offer/answer needed. There’s no shortage of really smart people out there who would know how to produce a great API proposal.
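As a rough illustration of the kind of API described above, here is a minimal sketch in TypeScript. Everything in it is hypothetical: `MediaEngine`, `MediaNode`, `RtpStream`, the event names and the codec string are invented for this example and do not correspond to any browser API or to Open Peer's actual interfaces. It only models the bookkeeping (open a stream from ICE candidates, wire it to a source, trickle in another candidate, activate it, close it) with no SDP anywhere.

```typescript
// Hypothetical sketch only: none of these types exist in any browser or library.
type MediaKind = "audio" | "video";
type StreamEventName = "wired" | "active" | "inactive" | "closed";

interface IceCandidate { address: string; port: number; priority: number; }

// A source (microphone, camera, file) or a destination (speaker, display, recorder).
class MediaNode {
  constructor(readonly name: string, readonly kind: MediaKind) {}
}

class RtpStream {
  active = false;
  readonly candidates: IceCandidate[] = [];
  private listeners: Array<(e: StreamEventName) => void> = [];

  constructor(readonly kind: MediaKind, readonly codec: string, initial: IceCandidate[]) {
    this.candidates.push(...initial);
  }
  onEvent(listener: (e: StreamEventName) => void): void { this.listeners.push(listener); }
  emit(e: StreamEventName): void { this.listeners.forEach((l) => l(e)); }
  // Trickle ICE: more candidates may arrive after the stream has been opened.
  addCandidate(c: IceCandidate): void { this.candidates.push(c); }
  setActive(on: boolean): void { this.active = on; this.emit(on ? "active" : "inactive"); }
}

class MediaEngine {
  private wires = new Map<RtpStream, MediaNode>();

  openStream(kind: MediaKind, codec: string, candidates: IceCandidate[]): RtpStream {
    return new RtpStream(kind, codec, candidates);
  }
  // The application, not an offer/answer exchange, decides the plumbing.
  wire(node: MediaNode, stream: RtpStream): void {
    if (node.kind !== stream.kind) throw new Error("cannot wire " + node.kind + " to " + stream.kind);
    this.wires.set(stream, node);
    stream.emit("wired");
  }
  wiredTo(stream: RtpStream): MediaNode | undefined { return this.wires.get(stream); }
  close(stream: RtpStream): void {
    this.wires.delete(stream);
    stream.active = false;
    stream.emit("closed");
  }
}

// Usage: the addresses are documentation-range placeholders.
const engine = new MediaEngine();
const mic = new MediaNode("microphone", "audio");
const events: StreamEventName[] = [];
const outbound = engine.openStream("audio", "opus", [{ address: "192.0.2.10", port: 5004, priority: 100 }]);
outbound.onEvent((e) => events.push(e));
engine.wire(mic, outbound);
outbound.addCandidate({ address: "198.51.100.7", port: 6000, priority: 50 });
outbound.setActive(true);
engine.close(outbound);
```

A wrapper that produces a simple "blob to give to the remote side" could be layered on top of something like this by whoever wants one.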

Some will argue that this is way too complex for JavaScript programmers. Nonsense! The stuff these JavaScript guys can do is mind blowing. Plus, it’s vital and necessary to allow for incredible flexibility and power for future protocols, future services (including in browser peer-to-peer conferencing or multiparty call control), while maintaining existing compatibility to legacy protocols. Yes, there are some JavaScript programmers that would be too intimidated to touch such an API. That’s completely fine. The ambitious smart guys will write the wrappers for those who want a simple “give me a blob to give to the remote side”.

There’s no security threat introduced by managing streams with a solid API. As long as ICE is used to verify there are two endpoints who agree to talk and user listening acquisition is granted, why shouldn’t the rest be under the control of JavaScript?

Likewise, this will not make any more silos between the browsers than those that already exist because basically both sides need to have the same signaling regardless. Is it really a big stretch that both sides would have the same JavaScript to run the media engines too?

Such an API would lower the bar for browsers to interoperate at the media level. It removes the concerns about SDP compatibility, including the untold extensions that will be needed to handle more powerful features and the complex behaviours associated with SDP offer/answer, such as rollback and ‘m=’ stability. If the browsers support RTP, ICE and codecs, and can stream, then they are pretty much compatible even if their individual API sets aren’t up to par with their counterparts.

This also solves an issue regarding the data channel. There is no need for the data channel to be tied to an offer/answer exchange in the media at all. They are separate things entirely (as well they should be). For example, in Open Peer’s case the data channel gets formed in advance to maintain our document subscription model between peers, and media channels are opened and closed as required.

Those who still want to do full on SDP can do SDP. Those who want stateless SDP-like exchanges can do exactly that. Those who want to negotiate media once and leave the streams alone can do so.

Perhaps there are those that would argue that the JavaScript APIs build the SDP hidden behind the scenes and the SDP can be treated as an opaque type and thus the appropriate low level API already exists. But they are missing the point. The moment the SDP is required the future is tied to offer/answer.

As an example, let’s examine Open Peer’s use case. Open Peer does not have, nor does it need or want, a stateful offer/answer model. It also doesn’t support or require media renegotiation. Open Peer offers the ports and codecs (including offering the same port sets to multiple parties) and establishes the desired media. Call and media state is completely separated out. From then on, if alternative media is needed, a new media dialog is created to replace the existing one, a ‘quick swap’ happens, and the media streams are rewired to the correct inputs and outputs without renegotiation (at least, not a renegotiation in the offer/answer sense). Further, Open Peer allows either side to change its media without waiting for the other party’s answer.

Media is complicated for good reason as there are many use cases. The entire IETF/W3C discussion around video constraints illustrates some of the complexities and competing desires for just one single media type. If we tie ourselves to SDP we are limiting ourselves big time, and some of the cool future stuff will be horribly hampered by it.

Let’s face it, browsers are moving toward becoming sandboxed operating systems. So why not give them the appropriately low-level API they deserve, one that allows for flexible, futuristic application writing? Complicated and powerful HTML 5 APIs are being well received, so why can’t the same be true for lower level RTCWEB APIs?

I know Microsoft has argued the API is too high level and they’ve even gone to the trouble of submitting their own specification with CU-RTC Web and splintering and fragmenting efforts. I don’t presume to represent this stance regarding SDP, nor will I go into the merits of their offering, but I think they are right in principle. And for saying so, I’ve got my rotten tomato and egg shield in position.

Stop Talking to Yourself. Go beyond the RTCWEB Silo!

RTCWEB / WebRTC is designed to let two or more browser-enabled devices communicate P2P (peer-to-peer) with audio, video or data. But there’s a big catch. The browsers can’t communicate out of the box unless some undefined “external process” gathers information about each browser and hands the information to the other browser.

This mystical external process is known as “on the wire signaling”. Gathering information from a browser/peer needed to communicate isn’t incredibly difficult for a moderately talented programmer, nor is exchanging the required information. All that would be required is some kind of go-between web server and a socket or two. This solution is relatively simple and there are other companies setting themselves up to provide that kind of service offering.
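To show how little the basic go-between involves, here is a minimal sketch of such a relay in TypeScript. It is an in-memory stand-in under stated assumptions: `SignalingRelay`, `register` and `send` are hypothetical names, the blobs are opaque strings, and a real deployment would sit behind a web server and sockets rather than direct function calls.

```typescript
// Hypothetical in-memory stand-in for the go-between server.
// A real one would deliver over sockets instead of calling a function.
type Deliver = (from: string, blob: string) => void;

class SignalingRelay {
  private peers = new Map<string, Deliver>();

  // A browser/peer announces itself and says how to reach it.
  register(peerId: string, deliver: Deliver): void {
    this.peers.set(peerId, deliver);
  }

  // Hand one peer's opaque blob to another. Returns false if the recipient is unknown.
  send(from: string, to: string, blob: string): boolean {
    const deliver = this.peers.get(to);
    if (!deliver) return false;
    deliver(from, blob);
    return true;
  }
}

const relay = new SignalingRelay();
const inboxAlice: string[] = [];
const inboxBob: string[] = [];
relay.register("alice", (from, blob) => { inboxAlice.push(from + ":" + blob); });
relay.register("bob", (from, blob) => { inboxBob.push(from + ":" + blob); });

const delivered = relay.send("alice", "bob", "alice-session-description");
relay.send("bob", "alice", "bob-session-description");
const missing = relay.send("alice", "carol", "hello?"); // carol never registered
```

Note what the sketch leaves out: it does not check who "alice" really is, does not ask "bob" whether he wants the call, and can read every blob it relays.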

But that kind of signaling will quickly become unwieldy to manage in the real world and misses many critical use cases and components in much larger deployments. Such a model presumes that both ends want to communicate and does not define how they want to communicate, let alone address very complex security issues.

So what does make up a robust and complete P2P communication solution?

A well thought out P2P solution should address these concerns:

  1. Initiation of communication between peers that are not actively expecting communication
  2. Exchanging the types of communication desired (audio/video/text/etc.)
  3. Allow peers the option to allow or disallow communication
  4. Allow peers to disengage communication at any time gracefully
  5. Changing the nature of the communication at any time (adding or removing media types like audio/video/text, media on hold, transferring sessions to other participants, etc.)
  6. Handle users’ identities so that users on independent systems can interoperate (and identify themselves when communicating)
  7. Handle users logged into multiple locations as the same user
  8. Find users to communicate with by their known identities (social, generic, 3rd party, etc)
  9. Validate the identity of the user you think you are connecting with
  10. Secure communication channels in a way that even servers involved in the “communication setup” are not able to decrypt information exchanged between peers
  11. Handle group conversations amongst peers without needing servers to relay the data
  12. Handle communication to applications outside the browser (e.g. interoperate with mobile apps)

A well designed P2P platform should be designed to enable users on various websites to talk beyond each respective web silo. Users of one website can find and communicate with users on another website and even to users on mobile devices.

It should work with your existing identity model. Alice and Bob on your website are still known as Alice and Bob in the P2P network. You don’t need to administer and map a separate database of usernames and passwords that would be required with other legacy signaling protocols.

The network should allow users to locate other users by their social IDs, phone numbers, email addresses or by using your own custom defined identities – social or otherwise. It should be built with strong security in mind. Each user has their own private and public key, which when tied to an identity model yields strong proof of identity with completely private communications between peers.

A developer should be able to take the open source libraries and rapidly build and deploy powerful client applications with all of these features built-in and deploy without the headache of managing a communications network. No web developer I have ever met volunteered to be the one trying to figure out the complex ins-and-outs of everything that a good P2P design will resolve. So, I ask you, do you as a developer really want to be stuck in a little silo of communication, maintaining your own custom communication signalling protocol?

If you are looking to leverage WebRTC in a browser or if you just want to build a powerful communications feature into an app, you owe it to yourself to do the research. Before you get headlong into your project and find out the tech you chose was not up to the challenge, take a look around. Libraries like the one found in the Open Peer project could very well fit the bill.

Authored by Robin Raymond, edited by Erik Lagerway

In the Trenches with RTCWEB and Real-time Video

The concept of video streaming seems extraordinarily simple. One side has a camera and the other side has a screen. All one has to do is move the video images from the camera to the screen and that’s it. But alas, it’s nowhere near that simple.

Cameras are input sources, but they have a variety of modes in which they can operate. Each camera has its own dimensions (width/height), aspect ratio (the ratio of width/height) and frame rate. Cameras are often capable of recording at selectable input formats, for example SD or HD formats, which dictate their pixel dimensions and aspect ratios (e.g. 4:3 or 16:9). If a camera opens in one format and switches to another there can be a time penalty before the video starts streaming again from the camera (thus switching modes needs to be minimized or avoided entirely). On portable devices, the camera can be oriented in a variety of ways and dynamically change its pixel dimensions and aspect ratio on the fly as the device is physically rotated.

Some devices have multiple camera inputs (e.g. front camera or rear camera). Each source input need not be identical in dimensions nor capability and the user can choose the input on the fly. Further, there are even cameras that record multiple angles (e.g. 3D) simultaneously, but I’m not sure if that should be covered right now even though 3D TVs are all the rage (at least from Hollywood’s perspective).

If I could equate cameras to a famous movie quote: they are like a box of chocolates, you never know what you are going to get.

Cameras aren’t the only sources though. Pre-recorded video can be used as a source just as much as a camera. Pre-recorded video has a fixed width, height and aspect ratio, but it must be considered as a potential video source.

The side receiving video typically renders the video onto a display. These output displays are also known as a type of video sink. There are other types of video sinks though, such as a video recording sink or even a videoconferencing sink. Each has its own unique attributes, and the output width and height of these video sinks vary greatly.

Some video recording sinks work best when they receive the maximum resolution possible, while others might desire a fixed width/height (as the recording is intended for later viewing on a particular fixed size output device). When video is displayed in a webpage, it might be rendered to a fixed width/height or there might be flexibility in the size of the output. For example, the page can have a fixed width, but the video height could be adjustable (up to a maximum viewable area), or vice versa with the width being the adjustable axis. In some cases both dimensions can adjust automatically larger or smaller.

Some output areas are adjustable in size when manually manipulated by a user. In such cases the user can dynamically resize the output areas larger or smaller as desired (up to a maximum width and height). Other output screens are fixed in size entirely and can never be adjusted. Still other devices adjust their output dimensions based upon the physical rotation of the device.

The problem is how do you fit the source size into the video sink’s area? A camera can be completely different in dimensions and aspect ratio than the area for the video sink. The old adage “how do you fit a square peg in a round hole” seems to apply.

In the simplest case, the video source (camera) would be exactly the same size as the output area or the output area would be adjustable in the range to match the camera source. But what happens when they don’t match?

The good news is that video can be scaled up or down to fit. The bad news is that scaling has several problems. Making a small image into a big image makes the image appear pixelated and ugly. Making a bigger image smaller is better (except there are consequences for processing and bandwidth).

Aspect ratio is also a big problem. Anyone who’s watched a widescreen movie on a standard screen TV (or vice versa) will understand this problem. There are basically three solutions. One is to shrink the wide image to fit into the narrow area and put “black bars” above and below the image, known as letterboxing (or pillarboxing on the other axis). Another is to expand the image enough, while maintaining aspect ratio, that there are no black bars (with the side effect that some of the image is cropped because it’s too big to fit in the viewing area). The third is to stretch the image, making things look taller or fatter. Fortunately that technique is largely discredited, although still selectively used at times.
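The three approaches come down to a choice of scale factor. The sketch below, in TypeScript, is only an illustration: `Size`, `FitMode` and `fit` are names made up for this example. It returns the dimensions of the scaled image; with letterboxing the result is no larger than the sink (the remainder is black bars), and with cropping it is no smaller (the overflow is cut off).

```typescript
interface Size { width: number; height: number; }
type FitMode = "letterbox" | "crop" | "stretch";

// Returns the size of the scaled image for a given sink area and fit mode.
function fit(source: Size, sink: Size, mode: FitMode): Size {
  if (mode === "stretch") {
    // Ignore aspect ratio entirely: the image is distorted to fill the sink.
    return { width: sink.width, height: sink.height };
  }
  const scaleX = sink.width / source.width;
  const scaleY = sink.height / source.height;
  // Letterbox: the smaller factor keeps the whole image inside the sink.
  // Crop: the larger factor covers the whole sink and overflows on one axis.
  const scale = mode === "letterbox" ? Math.min(scaleX, scaleY) : Math.max(scaleX, scaleY);
  return {
    width: Math.round(source.width * scale),
    height: Math.round(source.height * scale),
  };
}

const hdSource: Size = { width: 1920, height: 1080 }; // 16:9
const sdSink: Size = { width: 640, height: 480 };     // 4:3

const letterboxed = fit(hdSource, sdSink, "letterbox"); // 640 x 360, 60px bars above and below
const cropped = fit(hdSource, sdSink, "crop");          // 853 x 480, 213px cut off horizontally
const stretched = fit(hdSource, sdSink, "stretch");     // 640 x 480, distorted
```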

Some people might argue that displaying video using a letterboxing/pillarboxing technique is too undesirable to ever be used. They would prefer video was enlarged to fill the display area with any superfluous image edges automatically cropped off. Videophiles might gasp at such a suggestion, as the very idea of discarding part of an image borders on sacrilege. In practical terms, it’s both user preference and context that determine which technique is best.

As an example of why context is important, consider video rendered to the entire view screen (i.e. full screen mode). In this context, letterboxing/pillarboxing might be perfectly acceptable, as those black bars become part of the background of the video terminal. In a different context, black bars in the middle of a beautifully formatted web page might be horrifically ugly and unacceptable under any circumstance.

The complexities for video are far from over. When users place video calls, the source and the video sink are often not physically located together. That means that the video has to go from the source to the video sink located on different machines/devices and across a network.

When video is transmitted across a network pipe, a few important considerations must be factored in. A network pipe has a maximum bandwidth that fluctuates with usage and saturation. Attempt to send too large a video and the video will become choppy and glitch badly. Even where the network pipe is sufficiently large, bandwidth has a cost associated, thus it’s wasteful to send a super high quality image to a device that is incapable of rendering it at the original quality. To waste less bandwidth, a codec is used to compress images and thus preserve network bandwidth as much as possible (the cost being that the bigger an image, the more CPU is required to compress the image using the codec).

As a general rule…

  • A source should never send video to the remote video sink that ends up being discarded, or that is at a higher quality than the receiver is capable of rendering, as it’s a waste of bandwidth as well as CPU processing power. For example, do not send HD video to a device only capable of displaying SD quality.

Too bad this general rule above has to have an exception. There are cases where the video cannot be scaled down before sending, although rare. Nonetheless, this exception cannot be ignored. Some servers offer pre-recorded video and do not scale the video at the source because doing so would require expensive hardware processing power to transcode the recorded video. Likewise, a simple device might be too underpowered or hard wired to its output format to be capable of scaling the video appropriately for the remote video sink.

The question becomes which end (source or sink) manipulates the video? And then there are the questions of how and what does each side need to know to do the right thing to get the video in the correct format for the other side?

I can offer a few suggestions that will help. Again, as general rules…

  • A source should always attempt to send what a video sink expects and nothing more
  • A source should never attempt to stretch the source image larger than the original source image’s dimensions.
  • If the source is incapable of adjusting the dimensions to the video sink completely, it does so as much as it is capable and then the video sink must finish the job of adjusting the image before final rendering.
  • The source must understand the video sink can change dimensions and aspect ratio anytime with a moment’s notice. As such, there must be a set of active “current” video properties the source must be aware of at all times with regard to the video sink.
  • The “current” properties include the active width and height of the video sink (or maximum width or height should the area be automatically adjustable). The area needs to be flagged as safe for letterboxing/pillarboxing or not. If the area is unable to accept letterbox or pillarbox then the image must ultimately be adjusted to fill the rendered output area. Under such a situation the source could and should pre-crop the image before sending knowing the final dimensions used.
  • The source needs to know the maximum possible resolution the output video sink is capable of producing to not waste its own CPU opening a camera at a higher resolution than will ever be possible to render (e.g. an iPad sending to an iPhone device). Unfortunately, this needs to be a list of maximum rendered output dimensions as a device might have multiple combinations (such as an iPhone device suddenly turned on its side).
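The rules above can be sketched as a small planning function on the source side. This is a minimal TypeScript illustration under stated assumptions: `SinkProperties`, `SendPlan` and `planSend` are hypothetical names, the sink's "current" properties are reduced to a width, a height and a letterbox-safe flag, and the list of maximum output dimensions is left out for brevity.

```typescript
interface Size { width: number; height: number; }

// The "current" properties the source tracks about the remote video sink.
interface SinkProperties {
  current: Size;          // active width/height of the render area
  letterboxSafe: boolean; // may black bars be shown in this area?
}

interface SendPlan {
  crop: Size; // region of the source image to keep
  send: Size; // dimensions actually encoded and sent
}

function planSend(source: Size, sink: SinkProperties): SendPlan {
  let crop: Size = { width: source.width, height: source.height };

  if (!sink.letterboxSafe) {
    // No black bars allowed: pre-crop to the sink's aspect ratio before sending.
    const sinkRatio = sink.current.width / sink.current.height;
    const sourceRatio = source.width / source.height;
    if (sourceRatio > sinkRatio) {
      crop = { width: Math.round(source.height * sinkRatio), height: source.height };
    } else {
      crop = { width: source.width, height: Math.round(source.width / sinkRatio) };
    }
  }

  // Send only what the sink expects, and never stretch beyond the original size:
  // the scale factor is capped at 1, so any upscaling is left to the sink.
  const scale = Math.min(1, sink.current.width / crop.width, sink.current.height / crop.height);
  return {
    crop,
    send: { width: Math.round(crop.width * scale), height: Math.round(crop.height * scale) },
  };
}

const camera: Size = { width: 1280, height: 720 };

// Sink area embedded in a page where bars are unacceptable.
const noBars = planSend(camera, { current: { width: 640, height: 480 }, letterboxSafe: false });
// Full-screen sink where bars are fine.
const withBars = planSend(camera, { current: { width: 640, height: 480 }, letterboxSafe: true });
// Small source, large sink: the source sends its native size and the sink scales up.
const small = planSend({ width: 320, height: 240 }, { current: { width: 1920, height: 1080 }, letterboxSafe: true });
```

Because the sink can change at a moment's notice, a function like this would be re-run whenever the sink reports new "current" properties.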

I’m skeptical whether a reciprocal minimum resolution is ever needed (or even possible). For example, an area may be deemed letterbox/pillarbox unsafe and the image is just too small to fit a minimum dimension (and thus would have to be stretched upon rendering). In the TV world, an image is simply stretched to fit upon output (typically while maintaining aspect ratio). Yes, a stretched image can become pixelated and that sucks, but there are smoothing algorithms that do a reasonable job within reasonable limitations. People playing DVDs on Blu-ray players with HD TVs are familiar with such processes, which magically output the DVD video image at the HD TV output size. Perhaps a “one pixel by one pixel” source connected to an HD (1920×1080) output would be the extreme case of unacceptable, but what would anyone expect in such a circumstance? That’s like hooking up an Atari 2600 to an HD TV. There’s only so much that can be done to smooth out the image, as the source image quality just isn’t available. But that doesn’t mean the image shouldn’t be displayed at all!

Another special case happens when a source cannot be scaled down for whatever reason before transmission and the receiving video sink is incapable of scaling it down further to display (due to bandwidth or CPU limitations on the device). The CPU limitation might be known in advance, but the bandwidth might not. In theory the sink could report failures to the source and cause a scale back in frame rate (i.e. cause the sender to send fewer images rather than smaller images). If CPU and bandwidth conditions are known in advance, then a maximum acceptable dimension and bandwidth could be elected by the video sink, and a source that cannot adjust its dimensions to fit would then be unable to connect.
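One possible shape for that frame rate scale-back is sketched below in TypeScript. The policy (halve on a reported failure, creep back up by one frame per second otherwise) and the limits of 5 and 30 fps are assumptions chosen purely for illustration; `adjustFrameRate` is a hypothetical helper, not part of any specification.

```typescript
interface FrameRateLimits { minFps: number; maxFps: number; }

// Decide the next frame rate from the sink's latest report.
// Assumed policy: back off quickly on failure, recover slowly on success.
function adjustFrameRate(
  currentFps: number,
  sinkReportedFailure: boolean,
  limits: FrameRateLimits = { minFps: 5, maxFps: 30 },
): number {
  if (sinkReportedFailure) {
    return Math.max(limits.minFps, Math.floor(currentFps / 2));
  }
  return Math.min(limits.maxFps, currentFps + 1);
}

// A run of reports from the sink: three failures, then two clean intervals.
const reports = [true, true, true, false, false];
const history: number[] = [];
let fps = 30;
for (const failed of reports) {
  fps = adjustFrameRate(fps, failed);
  history.push(fps);
}
// history is [15, 7, 5, 6, 7]
```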

Aside from the difficulties in building good RTC video technology, those involved in RTCWEB / WebRTC have yet to agree on which codecs are Mandatory to Implement (MTI), which isn’t helping things at all. Since MTI Video is on the agenda for IETF 86 in Orlando maybe we will see it happen soon. If there is a decision (that’s a big IF), what is likely to happen is that there will be two or more MTI video codecs, which means we will need to support codec swapping and all the heavy lifting related thereto.

I have not even touched on the IPR issues around real-time video, but if patents around video were the only problem, perhaps RTCWEB would be ready by now. The truth is that video patents are not likely to be the biggest concern that needs to be addressed when it comes to real time video. It’s just that “doing it right” in a browser, using JavaScript, on various devices… is rather complex.

Update: IETF 86 – Orlando | RTCWEB Agenda

IETF 86 Registration – is open, early bird discounts end March 1.


Eric Rescorla was kind enough to publish an agenda for the RTCWEB-related Working Group meetings…

RTCWEB I: Tuesday 0900-1130
0900 – 0905  Administrivia
0905 – 1100  Video Codec MTI discussion
  – draft-alvestrand-rtcweb-vp8-00 (30 min)
  – draft-burman-rtcweb-h264-proposal-00 + draft-dbenham-webrtcvideomti-00 + draft-marjou-rtcweb-video-codec-00 (30 min)
  – General discussion (30 min)
  – Call the question of which mandatory to implement video codec to select (5 min)
  – Next steps (20 min)
1100 – 1115  draft-ietf-rtcweb-use-cases-and-requirements (LC comments)
1115 – 1130  draft-ietf-rtcweb-overview (LC comments)

MMUSIC I: Tuesday 1520-1830
1520 – 1525  Administrivia
1525 – 1605  Trickle ICE (Ivov)
  – Under what conditions can you do full trickle (discovery)
  – When can checking stop? [UPDATE vs. in-band vs…?]
  – SDP/SIP encoding for additional trickle candidates
1605 – 1650  Bundle
  – draft-ejzak-mmusic-bundle-alternatives (15 min)
  – draft-ietf-mmusic-sdp-bundle-negotiation (15 min)
  – discussion/consensus calls (15 min)
1650 – 1700  [Break]
1700 – 1800  SDP attribute analysis for multiplexing
  – draft-nandakumar-mmusic-sdp-mux-attributes
1800 – 1830  [Other MMUSIC business]

RTCWEB II: Thursday 0900-1130
0900 – 0905  Administrivia
0905 – 1025  Data Channel negotiation
  – draft-jesup-rtcweb-data-protocol (15 min)
  – draft-thomson-rtcweb-data (15 min)
  – draft-marcon-rtcweb-data-channel-management (15 min)
  – Discussion (30 min)
  – Consensus call (5 min)
1025 – 1100  JSEP Update (Uberti)
  – draft-ietf-rtcweb-jsep
1100 – 1115  RTCP-XR (Westerlund)
1115 – 1130  RTP Concepts and relation
  – draft-burman-rtcweb-mmusic-media-structure

MMUSIC II: Friday 0900-1100
0900 – 0905  Administrivia
0905 – 0925  Requirements (draft-jennings-mmusic-media-req)
0925 – 1015  PC-track to m-line mapping (draft-jennings-rtcweb-plan)
1015 – 1100  MSID changes to match multiplexing (draft-jennings-rtcweb-plan)

Will we see some progress on MTI Video Codecs for RTCWEB at this next IETF meeting? We can always hope. The chances of this actually happening are not in our favor; many have tried before and failed. Does it matter? It might be good if we, as developers, knew which codecs we could rely on in WebRTC endpoints. There is also plenty of work around SDP, dataChannel and trickle ICE underway that needs discussion as well. We plan on participating remotely.

The next post from Robin Raymond will be a doozy! At Hookflash, we are no fans of SDP. From where we are sitting, SDP really has little place being in WebRTC at all. Robin’s post later in the week will explain our reasoning. Subscribe to the feed to be notified when that and other posts are published.

From the Sidelines, My Introduction into RTCWEB

I’ve been following the RTCWEB standardization for a while now from an architecture and technology standpoint. For the most part, I’ve been quiet and I’ve assumed a rather neutral stance in regards to the RTCWEB process when it comes to Open Peer, but my opinion has changed and I can no longer maintain a neutral standpoint.

There are many companies taking stances who all need to have a say in what happens because they want to make sure their technology does not get left out in the cold when RTCWEB comes into reality, as most people think this technology will be huge with consumers and businesses. The big guys with SIP, XMPP and Skype have various established offerings and they are married to existing technology that is difficult to change. They need to make sure that RTCWEB closely follows, or at the very least does not hinder, their own technology; otherwise they will get left behind. The process of adapting existing systems to a new standard is understandably costly.

Thus, I have to ask, who am I with Open Peer to come along and push back against these tides as a new protocol when I have much more flexibility in our implementation than existing deployed systems? Further, the Open Peer protocol particularities didn’t even exist until recently and it has been under revision as Hookflash tested the implementation. We’ve just recently published our specification and source code and we’ve just undergone a significant update based on internal and external feedback from our initial implementations.

To be honest, architecting, designing and implementing a brand new protocol with such an ambitious scale for a small company has kept me extremely engaged and busy. I could listen to what’s happening from a 1,000-foot high perspective, but unfortunately that has also been a factor in my personal ability to participate. I don’t think it’s a great secret for those already involved that it takes an immense devotion of time to follow the details, let alone participate in the long, drawn-out procedures of ratifying a specification as complex as RTCWEB that spans two organizations, namely the IETF and W3C groups. It is unfortunate that such time commitments are so huge, as I think having those on the front lines much more actively involved would be healthy, but I digress.

In reality though, Hookflash is in a unique position with Open Peer. I am working on this protocol with a clean slate and a future thinking sense. I do not have the old technology shackles and I didn’t have to design with legacy deployed services in mind which would no doubt confound my decision making process. Likewise, I’ve had the experience of these legacy systems to help avoid their pitfalls (specifically as the original author of the X-Lite/X-Pro SIP softphone client for CounterPath years back with SIP).

For those unaware, Open Peer is an open peer-to-peer signaling protocol that has an initial implementation in C++ and Hookflash is in the process of writing a pure JavaScript version. The idea is to allow secure peer-to-peer signaling communication straight from browser-to-browser and capability to talk to native mobile device applications as well.

The Open Peer implementation goes beyond basic call flow signaling and even beyond peer-to-peer signaling and incorporates identity and federation concepts with strong privacy and security considerations in mind.

Having just completed the next iteration of the protocol that is going through internal testing, I plan to spend much more time actively examining the details of RTCWEB standards. Even though I’m later to the table representing a newer company with newer technology, I hope the input will be welcome to the discussion. I do understand that decisions may be too immovable to change people’s minds and there is a large number of established legacy systems, but hopefully coming from a unique perspective will help bring fresh blood and deeper insight. Forgive me if I argue points already lost, but I will always explain my reasoning for wanting to push certain aspects even if I am ignored in the end.

Ultimately, Open Peer will leverage RTCWEB and the implementation will adapt accordingly to the mutually agreed standards. I’m still going to give my opinion for whatever it is worth and I hope to prove it worthy, unique and valuable.

There are many bright people involved in the process and many companies with unique corporate political angles and agendas. My perspective and motivation will be straight up front. I want RTCWEB to succeed as soon as possible, but with an equal emphasis on ensuring the technology is sound from the future perspective as well, obviously in relation to plans with Open Peer utilizing RTCWEB.
