18 Nov 2013

Sub-surfaces. Now.

Wayland sub-surfaces are a feature that has been brewing for a long, long time, and they have finally made it into Wayland core in a recent commit, with the corresponding commit in Weston. The design for sub-surfaces started some time in December 2012, when the task was given to me at Collabora. It went through several RFCs and was finally merged into Weston in May 2013. After that there have been only small changes, if any, and sub-surfaces matured (or were forgotten? I had other things to do) over several months. Now they are coming out in Wayland 1.4 (that is the plan), but what are they really?

Introduction

The basic visual (and UI) building block in Wayland (the protocol) is a wl_surface. Basically everything on screen is represented as wl_surfaces in the protocol: mouse cursors, windows, icons, etc. A surface gets its content and size by attaching a wl_buffer to it, which is a handle to a pixel container. A surface has many attributes, like the input region: the region of the surface where it can receive input events. Input events, e.g. pointer motion, that happen on the surface but outside of the input region get directed to what is below the surface. The input region can be empty, but it cannot extend beyond the surface dimensions.
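To make that concrete, here is a minimal sketch in C of putting content on a surface; 'compositor' is assumed to be a wl_compositor global already bound from the registry, and 'buffer', 'width' and 'height' are assumed to come from elsewhere. Note that the surface still needs a role before it can be mapped:

    struct wl_surface *surface = wl_compositor_create_surface(compositor);

    wl_surface_attach(surface, buffer, 0, 0);        /* set pending content */
    wl_surface_damage(surface, 0, 0, width, height); /* mark what changed */
    wl_surface_commit(surface);                      /* apply pending state */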

It so happens that cursor, shell surface (window), and drag icon are also surface roles. Under a desktop shell, a surface cannot become visible (mapped) unless it has a role and it fulfils the requirements of that particular role. For example, a client can set a cursor surface only when it has the pointer focus. Without a role the compositor would not know what to do with a surface. Roles are exclusive: a surface can have only one role at a time. How a role is assigned depends on the protocol for the particular role; there is no generic set_role interface.

A window is a wl_surface with a suitable shell role; there is no separate object type "window" in the protocol. A window being a single wl_surface means that its contents must come from a single wl_buffer at a time. For most applications that is just fine, but there are a few exceptions where it makes things less than optimal when you want to take full advantage of hardware acceleration features.

The problem

Let us consider a video player in a window. Window decorations and GUI elements are usually rendered in an RGB color format on the CPU. Video usually decodes into some YUV color format. To create one complete wl_buffer for the window, the application must merge these: convert the video into RGB and combine it with the GUI elements. And it has to do that for every single video frame, whether the GUI elements change or not. This causes several performance penalties. If your graphics card is capable of showing YUV-formatted content directly in an overlay, you cannot take advantage of that. If you have video decoding hardware, you probably have to access and copy the produced YUV images with the CPU, while doing a color conversion. Getting CPU access to a hardware rendered buffer may be expensive to begin with, and then color conversion means you are doing a copy. When you finally have that wl_buffer finished and send it to the compositor, the compositor will likely just have to upload it to the GPU again, making another expensive copy. All this hassle and pain is just to get the GUI elements and the video drawn into the same wl_buffer.

Another example is an OpenGL window, or an OpenGL canvas in a window. You definitely do not want to make the GL rendered buffer CPU-accessible, as that can be very expensive. The obvious workaround is to upload your other GUI elements into textures, and combine them with the GL canvas in GL. That could be fairly performant, but it is also very painful to achieve, especially if your toolkit has not been designed to work like that.

A more complex example is a Web browser, where you can have any number of video and GL widgets around the page.

Enter sub-surfaces

Sub-surface is a wl_surface role which means that the surface is an integral sub-part of a window. A sub-surface must always have a parent surface, and the parent surface can have any role. Therefore a window can be constructed from any number of wl_surface objects by choosing one of them to be the main surface, which gets a role from the shell, while the others are sub-surfaces. Nesting is also allowed, so you can have sub-sub-surfaces, etc.

The tree of sub-surfaces starting from the main surface defines a window. The application sets a sub-surface's position on the parent surface, and the compositor will keep the sub-surface glued to the parent. The compositor does not clip sub-surfaces to the parent surface. This means you could implement decorations as four surfaces around the content surface; compared to one big surface for the decorations, you avoid wasting memory for the part that will always be behind the content surface. (This approach may have a visual downside, though.) It also means that, for window management purposes, the size of the window comes from the union of the whole (sub-)surface tree.
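As a rough sketch of what building such a tree looks like in C client code (assuming 'compositor' and 'subcompositor' are globals bound from the registry; the variable names are mine):

    struct wl_surface *main_surface =
            wl_compositor_create_surface(compositor);
    struct wl_surface *video_surface =
            wl_compositor_create_surface(compositor);

    /* This assigns video_surface the sub-surface role, with
     * main_surface as its parent. */
    struct wl_subsurface *video_sub =
            wl_subcompositor_get_subsurface(subcompositor, video_surface,
                                            main_surface);

    /* The position is relative to the parent's top-left corner, and
     * takes effect when the parent surface's state is next applied. */
    wl_subsurface_set_position(video_sub, 0, 30);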

In the windowed video player example, the video can be put on a wl_surface of its own, and the decorations into another. If there are subtitles on top of the video, those could be a third wl_surface. If the compositor accepts the YUV color format the video decoder produces, you can decode straight into a wl_buffer's storage and attach that wl_buffer to the wl_surface. No more copying or color conversions in the application. When the compositor gets the YUV buffer, it could use GLSL shaders to convert it into RGBA while it composites, or put the buffer into a hardware overlay directly. In the overlay case, the data produced by the (hardware) video decoder gets scanned out on the graphics chip zero-copy! After decoding, the data is not copied or converted even once, which is the optimal path. Of course, in practice there are many implementation details to get right before reaching the optimal path.

Atomicity

Updates to one wl_surface are made atomic with the commit request. A tree of sub-surfaces needs to be updated atomically, too. This is especially important when resizing a window.

A sub-surface's commit request acts specially when the sub-surface is in synchronized mode. A commit on the sub-wl_surface does not immediately apply the pending surface state; instead, the pending state is cached. The cache is just another copy of the surface state, in addition to the pending and current sets of state. The cached state gets applied when the parent wl_surface gets new state applied (note: not directly on the parent surface's commit, but when the parent gets new state applied). Relying on the cache mechanism, an application can submit new state for the whole tree of surfaces, and then apply it all with a single request: commit on the main surface.
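In code, an atomic update of such a two-surface tree, with the sub-surface in the default synchronized mode, could look like this sketch (continuing with the made-up names from the earlier sketch):

    /* New content for the sub-surface: this commit is only cached,
     * nothing changes on screen yet. */
    wl_surface_attach(video_surface, video_buffer, 0, 0);
    wl_surface_damage(video_surface, 0, 0, vid_w, vid_h);
    wl_surface_commit(video_surface);

    /* New content for the main surface: this commit applies both the
     * main surface's pending state and the sub-surface's cached state
     * in one go. */
    wl_surface_attach(main_surface, gui_buffer, 0, 0);
    wl_surface_damage(main_surface, 0, 0, win_w, win_h);
    wl_surface_commit(main_surface);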

Input handling considerations

When a window has sub-surfaces completely overlapping with its main surface, it is often easiest to set the input region of all sub-surfaces to empty. This will cause all input events to be reported on the main surface, and in the main surface coordinates. Otherwise the input events on a sub-surface are reported in the sub-surface's coordinates.
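Setting an empty input region is cheap; a sketch, again with the assumed 'compositor' global (set_input_region has copy semantics, so the region object can be destroyed right away):

    struct wl_region *empty = wl_compositor_create_region(compositor);

    wl_surface_set_input_region(video_surface, empty);
    wl_region_destroy(empty);
    /* The input region is double-buffered state, so it takes effect
     * when the surface's state is next applied. */
    wl_surface_commit(video_surface);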

Independent application sub-modules

A use case that strongly affected the design of the sub-surface protocol was application plugin level embedding. An application creates a wl_surface, turns it into a sub-surface, and gives control of that wl_surface to a sub-module or a plugin.

Let us say the plugin is a video sink running in its own thread, and the host application is a Web browser. The browser initializes the video sink and gives it the wl_surface to play on. The video sink decodes the video and pushes frames to the wl_surface. To avoid waking up the browser for every video frame and requiring it to commit on its main surface to let each video frame become visible, the browser can set the sub-surface to desynchronized mode. In desynchronized mode, commits on the sub-surface apply the pending state directly, just like without the sub-surface role. The video sink can run on its own. The browser is still able to control the sub-surface's position on the main surface, glitch-free.
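The mode switch itself is a single request; the per-frame loop below is a sketch with made-up buffer and size names:

    /* Browser side: let the sub-surface run free. */
    wl_subsurface_set_desync(video_sub);

    /* Video sink side, for every frame: this commit applies
     * immediately, without waiting for the main surface. */
    wl_surface_attach(video_surface, frame_buffer, 0, 0);
    wl_surface_damage(video_surface, 0, 0, vid_w, vid_h);
    wl_surface_commit(video_surface);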

However, resizing gets more complicated, which was also a cause for some criticism. When the browser decides it needs to resize the sub-surface the video sink is using, it sets the sub-surface to synchronized mode temporarily, which means the video on screen stops updating, as all surface state updates now go into the cache. Then the browser signals the new size to the video sink, and the sink acknowledges when it has committed the first buffer with the new size. In the meantime, the browser has repainted its other window parts as needed, and then commits on its main surface. This produces an atomic window update on screen. Finally the browser sets the sub-surface back to desynchronized, free-running mode. If all goes fast, the result is a glitch-free resize without missing a frame. If things take time, the user still sees a window resize without any flickers, but the video content may freeze for a moment.
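Sketched in code, the browser's side of that dance might look like the following; signal_resize() and wait_for_resize_ack() stand for whatever application-specific channel the browser and the video sink share, they are not Wayland API:

    /* Freeze the video: from now on its commits go into the cache. */
    wl_subsurface_set_sync(video_sub);

    signal_resize(sink, new_w, new_h);   /* application-specific */
    wait_for_resize_ack(sink);           /* sink has committed the new size */

    /* Repaint the decorations etc., then apply the whole surface
     * tree atomically. */
    wl_surface_attach(main_surface, resized_gui_buffer, 0, 0);
    wl_surface_damage(main_surface, 0, 0, new_w, new_h);
    wl_surface_commit(main_surface);

    /* Back to free-running mode. */
    wl_subsurface_set_desync(video_sub);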

Multiple input handlers

It is possible that sub-modules want to handle input on their wl_surfaces, which happen to be sub-surfaces. Sub-modules may even create new wl_surfaces, regardless of whether they will be part of the sub-surface tree of a window or not. In such cases, there are a couple of catches.

The first catch is that when input focus moves to a sub-surface, the input events are given in that surface's coordinates, as said before.

The bigger catch is how input actually targets surfaces in the client-side code. Actual input events for keyboards and pointer devices do not carry the target wl_surface as a parameter. The targeted surface is given by enter events, wl_pointer.enter(surface) for instance. In C code, it means a callback with the following signature gets called:
    void pointer_enter(void *data, struct wl_pointer *wl_pointer,
                       uint32_t serial, struct wl_surface *surface,
                       wl_fixed_t surface_x, wl_fixed_t surface_y)
You get a struct wl_surface* saying which surface the following pointer events will target. I assume that toolkits will call wl_surface_get_user_data(surface) to get a pointer to their internal structure, and then continue with that.

What if the wl_surface was not created by the toolkit to begin with? What if the surface was created by a sub-module, or a sub-module unexpectedly set a non-empty input region on a sub-surface? Then get_user_data will give you a pointer that points to something other than what you thought, and the application likely crashes.

When a toolkit gets an enter event for a surface it does not know about, it must not try to use the user_data pointer. I see two obvious ways to detect such surfaces: maintain a hash table of known wl_surface pointers, or use a magic value at the beginning of the struct used as user_data. Neither is nice, but I do not see a way around it, and this is not limited to sub-surfaces or sub-sub-surfaces. Enter events may refer to any wl_surface object created through the Wayland connection.
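As an illustration of the magic value idea, a sketch (the struct layout and the constant are of course made up):

    #define MY_WIDGET_MAGIC 0x57a71c3d

    struct my_widget {
            uint32_t magic;              /* always MY_WIDGET_MAGIC */
            /* ... toolkit private data ... */
    };

    static void pointer_enter(void *data, struct wl_pointer *wl_pointer,
                              uint32_t serial, struct wl_surface *surface,
                              wl_fixed_t surface_x, wl_fixed_t surface_y)
    {
            struct my_widget *w = wl_surface_get_user_data(surface);

            /* Do not trust the pointer until the magic checks out. */
            if (!w || w->magic != MY_WIDGET_MAGIC)
                    return;

            /* ... deliver the event to widget w ... */
    }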

Therefore I would propose the following:
  • Always be prepared to receive an unknown wl_surface on enter and similar events.
  • When writing sub-modules and plugin interfaces, specify whether input is allowed, and whose responsibility it is to set the input region to empty.

Out of scope

When I started designing the sub-surface protocol, a huge question was what to leave out of it. The following are not provided by sub-surfaces:
  • Embedding content from other Wayland clients. The sub-surface extension does not implement any "foreign surface" interfaces, or anything like what X allows by just taking the Window XID and passing it to another client to use. The current consensus seems to be that this should be solved by implementing a mini-compositor in the hosting application.
  • Clipping or scaling. The buffer you attach to a sub-surface decides the size of the sub-surface. There is another extension coming for clipping and scaling.
  • Any kind of message passing between application components. That is better solved in application specific ways.

Summary

Sub-surfaces are intended for special cases, where you need to build a window from several buffers that are composited together, to make efficient use of the hardware resources. They are not meant for widgets in general, nor for pushing parts of application rendering to the compositor. Sub-surfaces are also not meant for things that are not integral parts of a window, like tooltips, menus, or drop-down boxes. These "transient" surface types should be offered by the shell protocol.

Thanks to Collabora, reviewers on wayland-devel@ and in IRC, my work colleagues, and everyone who has helped me with this. Special thanks to Giulio Camuffo for testing the decorations-in-four-sub-surfaces use case. I hope I didn't forget anyone.

8 comments:

Unknown said...

I understand this sub-surface solution as a nice workaround for two cases:
1. architectures that don't support an EGLImage destination for the video decoder (no dmabuf, for example)
2. devices without GPU shaders (no YUV -> RGB possibility)

And your video player example isn't such a good example.

In our world (user space applications) we have applications rendered with some kind of framework (GTK, Qt). In Qt, for example, the whole UI (play time, time toolbar, subtitles) is just a bunch of vertices in OpenGL. If you don't change anything, all buffers stay the same => nothing changes, just a bunch of draw calls to draw the video player UI => nothing complicated. If you add the video output as an OpenGL texture (EGLImage), everything goes the same way; only one more GL texture bind with a different texture ID is added. Zero copy on every API level. Look at an apitrace and make some nice Qt/QML video player.

The video output image as an EGLImage is already in kernel space (RPi and NVIDIA handle it this way), so the sub-surface solution is useless in this situation. You want to do all the nice shader effects from user space, because you know best there and it is easy to do it all with one API (OpenGL).
http://www.youtube.com/watch?v=P4kv-AoAJ-Q

Sub-surfaces are really useless for video if EGLImage is implemented properly (zero copy). That is the reason why NVIDIA, Broadcom, etc. are moving to EGL on Linux (one memory space for all big data). It is the same bad idea as the XVideo output: an old idea/solution for current use cases. And nobody really uses it this way in real applications. It is nice for testing purposes :)

Rob said...

Michal Lazo, even if your video is an eglImage, what you describe requires combining it with the rest of the application surface, and then as a second step combining it with the framebuffer. Versus the subsurface approach where those two steps can be combined into one. So it is very much a good idea for video (so long as the application takes advantage of it).

The comparison to xvideo isn't very good. xvideo is pretty horrible for hw decode (without resorting to some driver specific hacks).

pq said...

Michal Lazo, Rob is right. If your only path is making an EGLImage from a hw-decoded video frame and combining that with the UI elements in the app, then you are forcing at least one copy (the GL rendering pass to create your window image) in the app, compared to no copying.

For the RPi the sub-surface path is even more important, because at the protocol level it allows you to go directly from a decoded video frame into scanout. Doing a copy in between with GLES2 will be a noticeable performance hit. Also, we stopped using GLES2 for compositing in Weston on the RPi, because the direct overlay path feels much faster.

Even if GLSL shading is available, using a YUV overlay instead can be better due to all the filtering that sophisticated overlay hw does, and it certainly will be better for performance.

And like Rob said, even in the lamest case, where overlays are not usable, we can eliminate one copy, and have the compositor do the only remaining copy while it composites.

Using sub-surfaces can avoid redundant compositing in the clients (redundant because the compositor would do it anyway), and sometimes they allow the compositor to skip compositing. This is especially good for video, and still useful for OpenGL apps (the video player you described is really just an OpenGL app).

Unknown said...
This comment has been removed by the author.
Unknown said...

ok Rob,
you are right that my example needs one more copy.
In my case OpenGL makes one scene composition => a surface update for every picture, and then the compositor composites one more time.

But I see at least one situation where it is hard to use a sub-surface for video:
1. some background picture
2. video with 50% opacity
3. something around the video + something over the video

1 and 3 are handled by your application in OpenGL.
2 will be handled by the compositor.

How can I compose the final scene correctly in this situation? And I don't want to make 1 and 3 two different surfaces in my application. If you move some objects on the scene and you want 2 and 3 to stay in sync (at the right position on the scene), 3 will always be one frame behind in the final composition. It will be like the overlay solution in Windows 95: when you moved the player window, the video was one frame behind. But in my case 1, 2 and 3 are in one application, and that is the worst part.
It is the same as the iPhone video player when you rotate the screen. Which part of your framework will really handle the rotation? And I want to rotate 1, 2 and 3 at the same time.

Video isn't always a full screen background or an overlay on the scene!
I pretty much understand the benefit of sub-surfaces for video, as the video will get on screen really fast, with better sync and better timing with audio.

Unknown said...

But if I do all the composition in the application and the application is full screen, Weston's composition will be faster: with one surface on top in Z order and full screen it is all win-win, and that should be really fast.

This is only the full screen use case, but it is OpenGL doing the composition (in the application) vs. Weston (one surface) with some other kind of composition accelerator (DirectFB, pixman, OpenGL, other DSP-accelerated hw, ...).

pq said...

Michal Lazo, if you really set out to find use cases that are poor fits for sub-surfaces, then you will find use cases that are poor fits for sub-surfaces. You do not have to use sub-surfaces for all imaginable video things. That does not undermine the usefulness of sub-surfaces for other use cases.

Nothing says that sub-surfaces *must* be put on hw overlays by the compositor. No. The compositor is free to composite them any way it can. If hw overlays can be used for any surfaces on a particular compositor refresh cycle, that is good. If not, it just composites as normal. The compositor makes the decision on how to use overlays for each refresh cycle. Clients have no say in that; clients only give the compositor opportunities to make better use of the overlay hardware.

Also in case you didn't notice, the sub-surface protocol allows you to keep the whole set of surfaces perfectly in sync when you need to. There is no lagging behind on resizes, no matter how many surfaces you use. And there simply cannot be any lagging behind on window moves, because on Wayland moves do not concern the client at all.

Rotations are handled by the compositor, per-output and per-surface/window, any way the compositor wants, and it does not need the clients to cooperate or even respond to do that.

Unknown said...

Thank you for a good posting. It seems to be similar to XReparentWindow: http://tronche.com/gui/x/xlib/window-and-session-manager/XReparentWindow.html