Pekka Paalanen

Wayland color-management, SDR vs. HDR, and marketing

2025-03-13T11:40:00.000+02:00

This time I have three topics.

First, I want to promote the blog post I wrote to celebrate the landing of the Wayland color-management extension into wayland-protocols staging area. It's a brief historique of the journey.

Second, I want to discuss SDR and HDR video modes on monitors and TVs. I have seen people expect that the same sRGB content displayed on the SDR video mode and the HDR (BT.2100/PQ) video mode on the same monitor will look the same, and they can arbitrarily switch between the modes at any time. I have argued that this is a false expectation. Why?

Monitors tend to have a slew of settings. I tend to call them monitor "knobs". There are brightness, contrast, color temperature, picture mode, dynamic contrast, sharpness, gamma, and whatever. Many people have noticed that when the video source puts the monitor into BT.2100/PQ video mode, the monitor locks out some settings, often brightness and/or contrast included. So, SDR and HDR video modes do not play by the same rules. Hence, one cannot generally expect a match even if the video source does everything correctly.

Third, there is marketing. Have a look at the first third of this video. They discuss video streaming services, TV selling, and HDR from the picture quality point of view. My take of that is, that (some? most?) monitors and TVs come with a screaming broken picture out-of-the-box because marketing has to sell them. If all displays displayed a given content as intended, they would all look the same, major technology differences notwithstanding, but marketing wants to make each individual stand out.

Have you heard of TV calibration services? If I buy a new TV from a local electronics department store, they offer a calibration service, for a considerable additional fee. Why would anyone need a calibration service, the factory settings should be good, right?

A Pixel's Color

2022-01-26T13:59:00.000+02:00

(This post was first published with Collabora on Jan 25, 2022.)

A Pixel's Color

My work on Wayland and Weston color management and HDR support has been full of learning new concepts and terms. Many of them are crucial for understanding how color works. I started out so ignorant that I did not know how to blend two pixels together correctly. I did not even know that I did not know - I was just doing the obvious blend, and that was wrong. Now I think I know what I know and do not know, and I also feel that most developers around window systems and graphical applications are as uneducated as I was.

Color knowledge is surprisingly scarce in my field it seems. It is not enough that I educate myself. I need other people to talk to, to review my work, and to write patches that I will be reviewing. With the hope of making it even a little bit easier to understand what is going on with color I wrote the article: A Pixel's Color.

The article goes through most of the important concepts, trying to give you, a programmer, a vague idea of what they are. It does not explain everything too well, because I want you to be able to read through it, but it still got longer than I expected. My intention is to tell you about things you might not know about, so that you would at least know what you do not know.

A warm thank you to everyone who reviewed and commented on the article.

A New Documentation Repository

Originally Wayland CM&HDR extension merge request included documentation about how color management would work on Wayland. The actual protocol extension specification cannot even begin to explain all that.

To make that documentation easier to revise and contribute to, I proposed to move it into a new repository: color-and-hdr. That also allowed us to widen the scope of the documentation, so we can easily include things outside of Wayland: EGL, Vulkan WSI, DRM KMS, and more.

I hope that color-and-hdr documentation repository gains traction and becomes a community maintained effort in gathering information about color and HDR on Linux, and that we can eventually move it out of my personal namespace to become truly community owned.

Testing 4x4 matrix inversion precision

2021-02-27T12:47:00.000+02:00

It is extremely rare that a hobby software project of mine gets completed, but now it has happened. Behold! Fourbyfour!

Have you ever had to implement a mathematical algorithm, say, matrix inversion? You want it to be fast and measuring the speed is fairly simple, right. But what about correctness? Or precision? Behavior around inputs that are on the edge? You can hand-pick a few example inputs, put those into your test suite, and verify the result is what you expect. If you do not pick only trivial inputs, this is usually enough to guarantee your algorithm does not have fundamental mistakes. But what about those almost invalid inputs, can you trust your algorithm to not go haywire on them? How close to invalid can your inputs be before things break down? Does your algorithm know when it stops working and tell you?

Inverting a square matrix requires that the inverse matrix exists to begin with. Matrices that do not mathematically have an inverse matrix are called singular. Can your matrix inversion algorithm tell you when you are trying to invert a matrix that cannot be inverted, or does it just give you a bad result pretending it is ok?

Working with computers often means working with floating-point numbers. With floating-point, the usual mathematics is not enough, it can actually break down. You calculate something and the result a computer gives you is total nonsense, like 1+2=2 in spirit. In the case of matrix inversion, it's not enough that the input matrix is not singular mathematically, it needs to be "nice enough" numerically as well. How do you test your matrix inversion algorithm with this in mind?

These questions I tried to answer with Fourbyfour. The README has the links to the sub-pages discussing how I solved this, so I will not repeat it here. However, as the TL;DR, if there is one thing you should remember, it is this:

Do not use the matrix determinant to test if a matrix is invertible!

Yes, the determinant is zero for a singular matrix. No, close to zero determinant does not tell you how close to singular the matrix is. There are better ways.

However, the conclusion I came to is that if you want a clear answer for a specific input matrix, is it invertible, the only way to know for sure is to actually invert it, multiply the input matrix with the inverse you computed, and measure how far off from the identity matrix it is. Of course, you also need to set a threshold, how close to identity matrix is close enough for your application, because with numerical algorithms, you will almost never get the exact answer. Also, pick an appropriate matrix norm for the matrix difference.

The reason for this conclusion is what one of the tools I wrote tells me about a matrix that would be typical for a display server with two full-HD monitors. The matrix is simply the pixel offset of the second monitor on the desktop. The analysis of the matrix is the example I used to demonstrate fourbyfour-analyse. If you read through it, you should be shocked. The mathematics, as far as I can understand, seems to tell us that if you use 32-bit floating-point, inverting this matrix gives us a result that leads to no correct digits at all. Obviously this is nonsense, the inverse is trivial and algorithms should give the exact correct result. However, the math does not lie (unless I did). If I did my research right, then what fourbyfour-analyse tells us is true, with an important detail: it is the upper error bound. It guarantees that we cannot get errors larger than that (heh, zero correct digits is pretty hard to make much worse). But I also read that there is no better error bound possible for a generic matrix inversion algorithm. (If you take the obvious-to-human constraints into account that those elements must be one and those must be zero, the analysis would likely be very different.) Therefore the only thing left to do is to actually go on with the matrix inversion and then verify the result.

Here is a list of the cool things the Fourbyfour project does or has:

Generates random matrices arbitrarily close to singular in a controlled way. If you simply generated random matrices for testing, they would almost never be close to singular. With this code, you can define how close to singular you want the matrices to be, to really torture your inversion algorithms.
Generates random matrices with a given determinant value. This is orthogonal to choosing how close to singular the generated matrices are. You can independently pick the determinant value and the condition number, and have the random matrices have both simultaneously.
Plot a graph about a matrix inversion algorithm's behavior when inputs get closer to singular, to see exactly when it breaks down.
A tutorial on how mathematical matrix notation works and how it relates to row- vs. column-major layouts (spoiler: it does not).
A comparison between Graphene and Weston matrix inversion algorithms.

In this project I also tried out several project quality assurance features:

Use Gitlab CI to run the test suite for the main branch, tags, and all merge requests, but not for other git branches.
Use Freedesktop ci-templates to easily generate the Docker image in CI under which CI testing will run.
Generate LCOV test code coverage report from CI.
Use reuse lint tool in CI to ensure every single file has a defined, machine-readable license. Using well-known licenses clearly is important if you want your code to be attractive. Fourbyfour also uses the Developer Certificate of Origin.
Use ci-fairy to ensure every commit has Singed-off-by and every merge request allows maintainer pushes.
Good CI test coverage. Test even the pseudo-random number generator in the test suite that it roughly follows the intended distribution.
CONTRIBUTING file. I believe that every open source project regardless of size needs this to set up the people's expectations when they see your project, whether you expect or accept contributions or not.

I'm really happy this project is now "done", well, version 1.0.0 so to say. One thing I have realized it is still missing is a determinant sweep mode. The precision testing mode sweeps over condition numbers and allows plotting the inversion behavior. It should have another mode where the sweep controls the determinant value, with some fixed condition number for the random test matrices. This determinant mode could point out inversion algorithms that use determinant value for matrix singularity testing and show how it leads to completely arbitrary results.

I you want to learn about numerical methods for matrices, I recommend the book Gene H. Golub, Charles F. van Loan, Matrix Computations. The Johns Hopkins University Press. I used the third edition, 1996, when implementing the Weston matrix inversion years ago.

Developing Wayland Color Management and High Dynamic Range

2020-11-20T19:44:00.001+02:00

(This post was first published with Collabora on Nov 19, 2020.) (Fixed a broken link on Jan 28, 2021.)

Wayland (the protocol and architecture) is still lacking proper consideration for color management. Wayland also lacks support for high dynamic range (HDR) imagery which has been around in movie and broadcasting industry for a while now (e.g. Netflix HDR UI).

While there are well established tools and workflows for how to do color management on X11, even X11 has not gained support for HDR. There were plans for it (Alex Goins, DeepColor Visuals), but as far as I know nothing really materialized from them. Right now, the only way to watch HDR content on a HDR monitor in Linux is to use the DRM KMS API directly, in other words, not use any window system, which means not using any desktop environment. Kodi is one of the very few applications that can do this at all.

This is a story about starting the efforts to fix the situation on Wayland.

History and People

Color management for Wayland has been talked about on and off for many years by dozens of people. To me it was obvious from the start that color management architecture on Wayland must be fundamentally different from X11. I thought the display server must be part of the color management stack instead of an untrusted, unknown entity that must be bypassed and overridden by applications that fight each other for who gets to configure the display. This opinion was wildly controversial and it took a long time to get my point across, but over the years some color management experts started to open up to new ideas and other people joined in the opinion as well. Whether these new ideas are actually better than the ways of old remains to be seen, though. I think the promise of getting everything and more to work better is far too great to not try it out.

The discussions started several times over the years, but they always dried out mostly without any tangible progress. Color management is a wide, deep and difficult topic, and the required skills, knowledge, interest, and available time did not come together until fairly recently. People did write draft protocol extensions, but I would claim that it was not really until Sebastian Wick started building on top of them that things started moving forward. But one person cannot push such a huge effort alone even for the simple reason that there must be at least one reviewer before anything can be merged upstream. I was very lucky that since summer 2020 I have been able to work on Wayland color management and HDR for improving ChromeOS, letting me support Sebastian's efforts on a daily basis. Vitaly Prosyak joined the effort this year as well, researching how to combine the two seemingly different worlds of ICC and HDR, and how tone-mapping could be implemented.

I must also note the past efforts of Harish Krupo, who submitted a major Weston merge request, but unfortunately at the time reviewers in Weston upstream were not much available. Even before that, there were experiments by Ville Syrjälä. All these are now mostly superseded by the on-going work.

Currently the active people around the topic are me (Collabora), Vitaly Prosyak (AMD), and Naveen Kumar (Intel). Sebastian Wick (unaffilated) is still around as well. None of us is a color management or HDR expert by trade, so we are all learning things as we go.

Design

The foundation for the color management protocol are ICC profile files for describing both output and content color spaces. The aim is for ICCv4, also allowing ICCv2, as these are known and supported well in general. Adding iccMAX support or anything else will be possible any time in the future.

As color management is all about color spaces and gamuts, and high dynamic range (HDR) is also very much about color spaces and gamuts plus extended luminance range, Sebastian and I decided that Wayland color management extension should cater for both from the beginning. Combining traditional color management and HDR is a fairly new thing as far as I know, and I'm not sure we have much prior art to base upon, so this is an interesting research journey as well. There is a lot of prior art on HDR and color management separately, but they tend to have fundamental differences that makes the combination not obvious.

To help us keep focused and explain to the community about what we actually intend with Wayland color management and HDR support, I wrote the section "Wayland Color Management and HDR Design Goals" in color.rst (draft). I very much recommend you to read it so that you get a picture what we (or I, at least) want to aim for.

Elle Stone explains in their article how color management should work on X11. As I wanted to avoid repeating the massive email threads that were had on the wayland-devel mailing list, I wrote the section "Color Pipeline Overview" in color.rst (draft) more or less as a response to her article, trying to explain in what ways Wayland will be different from X11. I think that understanding that section is paramount before anyone makes any comment on our efforts with the Wayland protocol extension.

HDR brings even more reasons to put color space conversions in the display server than just the idea that all applications should be color managed if not explicitly then implicitly. Most of the desktop applications (well, literally all right now) are using Standard Dynamic Range (SDR). SDR is a fuzzy concept referring to all traditional, non-HDR image content. Therefore, your desktop is usually 100% SDR. You run your fancy new HDR monitor in SDR mode, which means it looks just like any old monitor with nothing fancy. What if you want to watch a HDR video? The monitor won't display HDR in SDR mode. If you simply switch the monitor to HDR mode, you will be blinded by all the over-bright SDR applications. Switching monitor modes may also cause flicker and take a bit of time. That would be a pretty bad user experience, right?

A solution is to run your monitor in HDR mode all the time, and have the window system compositor convert all SDR application windows appropriately to the HDR luminance, so that they look normal in spite of the HDR mode. There will always be applications that will never support HDR at all, so the compositor doing the conversion is practically the only way.

For the protocol, we are currently exploring the use of relative luminance. The reason is that people look at monitors in wildly varying viewing environments, under standard office lighting for example. The environment and personal preferences affect what monitor brightness you want. Also monitors themselves can be wildly different in their capabilities. Most prior art on HDR uses absolute luminance, but absolute luminance has the problem that it assumes a specific viewing environment, usually a dark room, similar to a movie theatre. If a display server would show a movie with the absolute luminance it was mastered for, in most cases it would be far too dark to see. Whether using relative luminance at the protocol level turns out to be a good idea or not, we shall see.

Development

The Wayland color management and HDR protocol extension proposal is known as wayland/wayland-protocols!14 (MR14). Because it is a very long running merge request (the bar for landing a new protocol into wayland-protocols is high) and there are several people working on it, we started using sub-merge-requests to modify the proposal. You can find the sub-MRs in Sebastian's fork. If you have a change to propose, that is how to do it.

Obviously using sub-MRs also splits the review discussions into multiple places, but in this case I think it is a good thing, because the discussion threads in Gitlab are already massive.

There are several big and small open questions we haven't had the time to tackle yet even among the active group; questions that I feel we should have some tentative answers before asking for wider community comments. There is also no set schedule, so don't hold your breath. This work is likely to take months still before there is a complete tentative protocol, and probably years until these features are available in your favourite Wayland desktop environments.

If you are an expert on the topics of color management or HDR displays and content, you are warmly welcome to join the development.

If you are an interested developer or an end user looking to try out things, sorry, there is nothing really for you yet.

Waltham: a generic Wayland-style IPC over network

2016-10-28T14:48:00.000+03:00

I have recently been occupied with a new project (and being with a cold all this week), so I have not been much present in the Wayland community. Now I can finally say what I and Emilio have been up to: Waltham! For more information, please see our annoucement.

Wayland has been accepted as a Google Summer of Code organization

2016-03-01T13:58:00.000+02:00

Now is a high time to start discussing what you might want to do, for both student candidates and possible mentors.

Students, have a look at our project idea examples to get a feeling of what kind of projects you could propose. First you will need to contribute at least a small but significant patch to show that you understand the workflow, we have put some first task ideas together.

There are our application instructions for students. Of course all the pages are reachable from the Wayland GSoC wiki page and also the Wayland organization page.

If you want to become a mentor, please contact me or Kat, the contact details are on the Wayland GSoC wiki page.

Note, that students can also apply under the X.Org Foundation organization since Wayland is within their scope too and they also have other excellent graphics project ideas. You are welcome to submit your Wayland proposals to both projects.

A programmer's view on digital images: the essentials

2016-02-16T16:03:00.000+02:00

How is an uncompressed raster image laid out in computer memory? How is a pixel represented? What are stride and pitch and what do you need them for? How do you address a pixel in memory? How do you describe an image in memory?

I tried to find a web page for dummies explaining all that, and all I could find was this. So, I decided to write it down myself with the things I see as essential.

An image and a pixel

Wikipedia explains the concept of raster graphics, so let us take that idea as a given. An image, or more precisely, an uncompressed raster image, consists of a rectangular grid of pixels. An image has a width and height measured in pixels, and the total number of pixels in an image is obviously width×height.

A pixel can be addressed with coordinates x,y after you have decided where the origin is and which way the coordinate axes go.

A pixel has a property called color, and it may or may not have opacity (or occupancy). Color is usually described as three numerical values, let us call them "red", "green", and "blue", or R, G, and B. If opacity (or occupancy) exists, it is usually called "alpha" or A. What R, G, B, and A actually mean is irrelevant when looking at how they are stored in memory. The relevant thing is that each of them is encoded with a certain number of bits. Each of R, G, B, and A is called a channel.

When describing how much memory a pixel takes, one can use units of bits or bytes per pixel. Both can be abbreviated as "bpp", so be careful which one it is and favour more explicit names in code. Also bits per channel is used sometimes, and channels can have a different number of bits per pixel each. For example, rgb565 format is 16 bits per pixel, 2 bytes per pixel, 5 bits per R and B channels, and 6 bits per G channel.

A pixel in memory

Pixels do not come in arbitrary sizes. A pixel is usually 32 or 16 bits, or 8 or even 1 bit. 32 and 16 bit quantities are easy and efficient to process on 32 and 64 bit CPUs. Your usual RGB-image with 8 bits per channel is most likely in memory with 32 bit pixels, the extra 8 bits per pixel are simply unused (often marked with X in pixel format names). True 24 bits per pixel formats are rarely used in memory because trading some memory for simpler and more efficient code or circuitry is almost always a net win in image processing. The term "depth" is often used to describe how many significant bits a pixel uses, to distinguish from how many bits or bytes it occupies in memory. The usual RGB-image therefore has 32 bits per pixel and a depth of 24 bits.

How channels are packed in a pixel is specified by the pixel format. There are dozens of pixel formats. When decoding a pixel format, you first have to understand if it is referring to an array of bytes (particularly used when each channel is 8 bits) or bits in a unit. A 32 bits per pixel format has a unit of 32 bits, that is uint32_t in C parlance, for instance.

The difference between an array of bytes and bits in a unit is the CPU architecture endianess. If you have two pixel formats, one written in array of bytes form and one written in bits in a unit form, and they are equivalent on big-endian architecture, then they will not be equivalent on little-endian architecture. And vice versa. This is important to remember when you are mapping one set of pixel formats to another, between OpenGL and anything else, for instance. Figure 1 shows three different pixel format definitions that produce identical binary data in memory.

Figure 1. Three equivalent pixel formats with 8 bits for each channel. The writing convention here is to list channels from highest to lowest bits in a unit. That is, abgr8888 has r in bits 0-7, g in bits 8-15, etc.

It is also possible, though extremely rare, that architecture endianess also affects the order of bits in a byte. Pixman, undoubtedly inheriting it from X11 pixel format definitions, is the only place where I have seen that.

An image in memory

The usual way to store an image in memory is to store its pixels one by one, row by row. The origin of the coordinates is chosen to be the top-left corner, so that the leftmost pixel of the topmost row has coordinates 0,0. First there are all the pixels of the first row, then the second row, and so on, including the last row. A two-dimensional image has been laid out as a one-dimensional array of pixels in memory. This is shown in Figure 2.

Figure 2. The usual layout of pixels of an image in memory.

There are not only the width×height number of pixels, but each row also has some padding. The padding area is not used for storing anything, it only aligns the length of the row. Having padding requires a new concept: image stride.

Padding is often necessary due to hardware reasons. The more specialized and efficient hardware for pixel manipulation, the more likely it is that it has specific requirements on the row start and length alignment. For example, Pixman and therefore also Cairo (image backend particularly) require that rows are aligned to 4 byte boundaries. This makes it easier to write efficient image manipulations using vectorized or other instructions that may even process multiple pixels at the same time.

Stride or pitch

Image width is practically always measured in pixels. Stride on the other hand is related to memory addresses and therefore it is often given in bytes. Pitch is another name for the same concept as stride, but can be in different units.

You may have heard rules of thumb that stride is in bytes and pitch is in pixels, or vice versa. Stride and pitch are used interchangeably, so be sure of the conventions used in the code base you might be working on. Do not trust your instinct on bytes vs. pixels here.

Addressing a pixel

How do you compute the memory address of a given pixel x,y? The canonical formula is:

pixel_address = data_begin + y * stride_bytes + x * bytes_per_pixel.

The formula stars with the address of the first pixel in memory data_begin, then skips to row y while each row is stride_bytes long, and finally skips to pixel x on that row.

In C code, if we have 32 bit pixels, we can write

uint32_t *p = data_begin;
p += y * stride_bytes / sizeof(uint32_t);
p += x;

Notice, how the type of p affects the computations, counting in units of uint32_t instead of bytes.

Let us assume the pixel format in this example is argb8888 which is defined in bits of a unit form, and we want to extract the R value:

uint32_t v = *p;
uint8_t r = (v >> 16) & 0xff;

Finally, Figure 3 gives a cheat sheet.

Figure 3. How to compute the address of a pixel.

Now we have covered the essentials, and you can stop reading. The rest is just good to know.

Not everyone has the "right" way up

In the above we have assumed that the image origin is the top-left corner, and rows are stored top-most first. The most notable exception to this is the OpenGL API, which defines image data to be in bottom-most row first. (Traditionally also BMP file format does this.)

Multi-planar formats

In the above, we have talked about single-planar formats. That means that there is only a single two-dimensional array of pixels forming an image. Multi-planar formats use two or more two-dimensional arrays for forming an image.

A simple example with an RGB-image would be to store R channel in the first plane (2D-array) and GB channels in the second plane. Pixels on the first plane have only R value, while pixels on the second plane have G and B values. However, this example is not used in practice.

Common and real use cases for multi-planar images are various YUV color formats. Y channel is stored on the first plane, and UV channels are stored on the second plane, for instance. A benefit of this is that e.g. the UV plane can be sub-sampled - its resolution could be only half of the plane with Y, saving some memory.

Tiled formats

If you have read about GPUs, you may have heard of tiling or tiled formats (tiled renderer is a different thing). These are special pixel layouts, where an image is not stored row by row but a rectangular block by block. Tiled formats are far too wild and various to explain here, but if you want a taste, take a look at Nouveau's documentation on G80 surface formats.

Weston repaint scheduling

2015-02-11T16:59:00.002+02:00

Now that Presentation feedback has finally landed in Weston (feedback, flags), people are starting to pay attention to the output timings as now you can better measure them. I have seen a couple of complaints already that Weston has an extra frame of latency, and this is true. I also have a patch series to fix it that I am going to propose.

To explain how the patch series affects Weston's repaint loop, I made some JSON-timeline recordings before and after, and produced some graphs with Wesgr. Here I will explain how the repaint loop works timing-wise.

Original post Feb 11, 2015.
Update Mar 20, 2015: the patches have landed in Weston.

The old algorithm

The old repaint scheduling algorithm in Weston repaints immediately on receiving the pageflip completion event. This maximizes the time available for the compositor itself to repaint, but it also means that clients can never hit the very next vblank / pageflip.

Figure 1. The old algorithm, the client paints as response to frame callbacks.

Frame callback events are sent at the "post repaint" step. This gives clients almost a full frame's time to draw and send their content before the compositor goes to "begin repaint" again. In Figure 1. you see, that if a client paints extremely fast, the latency to screen is almost two frame periods. The frame latency can never be less than one frame period, because the compositor samples the surface contents (the "repaint flush" point) immediately after the previous vblank.

Figure 2. The old algorithm, the client paints as response to Presentation feedback events.

While frame callback driven clients still get to the full frame rate, the situation is worse if the client painting is driven by presentation_feedback.presented events. The intent is to draw and show a new frame as soon as the old frame was shown. Because Weston starts repaint immediately on the pageflip completion, which is essentially the same time when Presentation feedback is sent, the client cannot hit the repaint of this frame and gets postponed to the next. This is the same two frame latency as with frame callbacks, but here the framerate is halved because the client waits for the frame to be actually shown before continuing, as is evident in Figure 2.

Figure 3. The old algorithm, client posts a frame while the compositor is idle.

Figure 3. shows a less relevant case, where the compositor is idle while a client posts a new frame ("damage commit"). When the compositor is idle graphics-wise (the gray background in the figure), it is not repainting continuously according to the output scanout cycle. To start painting again, Weston waits for an extra vblank first, then repaints, and then the new frame is shown on the next vblank. This is also a 1-2 frame period latency, but it is unrelated to the other two cases, and is not changed by the patches.

The modification to the algorithm

The modification is simple, yet perhaps counter-intuitive at first. We reduce the latency by adding a delay. The "delay before repaint" is in all the figures, and the old algorithm is essentially using a zero delay. The compositor's repaint is delayed so that clients have a chance to post a new frame before the compositor samples the surface contents.

A good amount of delay is a hard question. Too small delay and clients do not have time to act. Too long delay and the compositor itself will be in danger of missing the vblank deadline. I do not know what a good amount is or how to derive it, so I just made it configurable. You can set the repaint window length in milliseconds in weston.ini. The repaint window is the time from starting repaint to the deadline, so the delay is the frame period minus the repaint window. If the repaint window is too long for a frame period, the algorithm will reduce to the old behaviour.

The new algorithm

The following figures are made with a 60 Hz refresh and a 7 millisecond repaint window.

Figure 4. The new algorithm, the client paints as response to frame callback.

When a client paints as response to the frame callback (Figure 4), it still has a whole frame period of time to paint and post the frame. The total latency to screen is a little shorter now, by the length of the delay before compositor's repaint. It is a slight improvement.

Figure 5. The new algorithm, the client paints as response to Presentation feedback.

A significant improvement can be seen in Figure 5. A client that uses the Presentation extension to wait for a frame to be actually shown before painting again is now able to reach the full output frame rate. It just needs to paint and post a new frame during the delay before compositor's repaint. This mode of operation provides the shortest possible latency to screen as the client is able to target the very next vblank. The latency is below one frame period if the deadlines are met.

Discussion

This is a relatively simple change that should reduce display latency, but analyzing how exactly it affects things is not trivial. That is why Wesgr was born.

This change does not really allow clients to wait some additional time before painting to reduce the latency even more, because nothing tells clients when the compositor will repaint exactly. The risk of missing an unknown deadline grows the later a client paints. Would knowing the deadline have practical applications? I'm not sure.

These figures also show the difference between the frame callback and Presentation feedback. When a client's repaint loop is driven by frame callbacks, it maximizes the time available for repainting, which reduces the possibility to miss the deadline. If a client drives its repaint loop by Presentation feedback events, it minimizes the display latency at the cost of increased risk of missing the deadline.

All the above ignores a few things. First, we assume that the time of display is the point of vblank which starts to scan out the new frame. Scanning out a frame actually takes most of the frame period, it's not instantaneous. Going deeper, updating the framebuffer during scanout period instead of vblank could allow reducing latency even more, but the matter becomes complicated and even somewhat subjective. I hear some people prefer tearing to reduce the latency further. Second, we also ignore any fencing issues that might come up in practise. If a client submits a GPU job that takes a long while, there is a good chance it will cause everything to miss a deadline or more.

As usual, this work and most of the development of JSON-timeline and Wesgr were sponsored by Collabora.

PS. Latency and timing issues are nothing new. Owen Taylor has several excellent posts on related subjects in his blog.

Wayland protocol design: object lifespan

2014-07-25T22:01:00.000+03:00

Now that we have a few years of experience with the Wayland protocol, I thought I would put some of my observations in writing. This, what will hopefully become a series rather than just one post, considers how to design Wayland protocol extensions the right way.

This first post considers protocol object lifespan and the related races between the compositor/server and the client. I assume that the reader is already aware of the Wayland protocol basics. If not, I suggest reading Chapter 4. Wayland Protocol and Model of Operation.

How protocol objects are created

On a new Wayland connection, the only object that exists is the wl_display which is a specially constructed object. You always have it, and there is no wire protocol for creating it.

The only thing the client can create next is a wl_registry through the wl_display. Registry is the root of the whole interface (class) hierarchy. Wl_registry advertises the global objects by numerical name, and using wl_registry.bind request to bind to a global is the first normal way to create a protocol object.

Binding is slightly special still, as the protocol specification in XML for wl_registry uses the new_id argument type, but does not specify the interface (class) for the new object. In the wire protocol, this special argument gets turned into three arguments: interface name (string), interface version (uint32_t), and the new object ID (uint32_t). This is unique in the Wayland core protocol.

The usual way to create a new protocol object is for the client to send a request that has a new_id type of argument. The protocol specification (XML) defines what the interface is, so there is no need to communicate the interface type over the wire. All that is needed on the wire is the new object ID. Almost all object creation happens this way.

Although rare, also the server may create protocol objects for the client. This happens by having a new_id type of argument in an event. Every time the client receives this event, it receives a new protocol object.

As all requests and events are always part of some interface (like a member of a class), this creates an interface hierarchy. For example, wl_compositor objects are created from wl_registry, and wl_surface objects are created from wl_compositor.

Object creation never fails. Once the request or event is sent, the new objects it creates exists, period. This keeps the protocol asynchronous, as there is no need to reply or check that the creation succeeded.

How protocol objects are destroyed

There are two ways to destroy a protocol object. By far the most common one is to have a request in the interface that is specified to be a destructor. Most often this request is called "destroy". When the client code calls the function wl_foobar_destroy(), the request is sent to the server and the client side proxy (struct wl_proxy) for the object gets destroyed. The server then handles the destructor request at some point in the future.

The other way is to destroy the object by an event. In that case, no destructor must be defined in the interface's protocol specification, and the event must be clearly documented to be destructive as there is no automation nor safeties for this. This is for cases where the server decides when an object dies, and requires extreme care in protocol design to work right in all cases. When a client receives such an event, all it can do is destroy the proxy. The (in)famous example of an interface like this is wl_callback.

Enter the boogeyman: races

It is very important that both the client and the server agree on which protocol objects exist. If the client sends a request on, or references as an argument, an object that does not exist in the server's opinion, the server raises a protocol error, and disconnects the client. Obviously this should never happen, nor should it happen that the server sends an event to an object that the client destroyed.

Wayland being a completely asynchronous protocol, we have no implicit guarantees. The server may send an event at the same time as the client destroys the object, and now the event targets an object the client does not know about anymore. Rather than the client shooting itself dead (that's the server's job), we have a trick in libwayland-client: it silently ignores events to destroyed objects, until the server confirms that the object is truly gone.

This works very well for interfaces where the destructor is a request. If the client first sends the destructor request and then sends another request on the destroyed object, it just shot its own head off - no race needed.

Things get tricky for the other case, destructor events. The server may send the destructor event at the same time the client is sending a request on the same object. When the server finally gets the request, the object is already gone, and the client gets taken behind the shed and shot. Therefore pretty much the only safe way to use destructor events is if the interface does not define any requests at all. Ever, not even in future extensions. Furthermore, objects with that interface should not be used as arguments anywhere, or you may hit the race. That is why destructor events are difficult to use right.

The boogeyman's brother

There is yet another nasty race with events that create objects, i.e. server-created objects. If the client is destroying the (parent) object at the same time as the server is sending an event on that object, creating a new (child) object, the server cannot know if the client actually handled the event or not. If the client ignored the event, it will never tell the server to destroy that new object, and you leak in the server.

You could try to make your way out of that pitfall by writing in your protocol specification, that when the (parent) object is destroyed, all the child objects will be destroyed implicitly. But then the client must not send the destructor request for the child objects after it has destroyed the parent, because otherwise the server sees requests on objects it does not know about, and kicks you in the groin, hard. If the child interface defines a destructor, the client cannot destroy its proxies after destroying the parent object. If the child interface does not define a destructor, you can never free the server-side resources until the parent gets destroyed.

The client could destroy all the child objects with a defined destructor in one go, and then immediately destroy the parent object. I am not sure if that works, but it might. If it does not, you have to specify a whole tear-down protocol sequence. The client tells the server it wants to destroy the parent object, the server acks and guarantees it no longer sends any events on it, then the client actually destroys the parent object. Hey, you have a round-trip and just turned a beautiful asynchronous protocol into synchronous, congratulations!

Concluding with recommendations

Here are my recommendations when designing Wayland protocol extensions:

Always make sure there is a guaranteed way to destroy all objects. This may sound obvious, but we have fixed several cases in the Wayland core protocol where there was no way to destroy a created protocol object such, that all resources on both server and client side could be freed. And there are still some cases not fixed.
Always define a destructor request. If you have any doubt whether your new interface needs a destructor request, just put it there. It is more awkward to add later than normal requests. If you do not have one, the client cannot tell the server to free those protocol object resources.
Do not use destructor events. They are hard to design right, and extending the interface later will be a bitch. The client cannot tell the server to free the resources, so objects with destructor events should be short-lived, and the destruction must be guaranteed.
Do not use server-side created objects without a serious thought. Designing the destruction sequence such that it never leaks nor explodes is tricky.

From pre-history to beyond the global thermonuclear war

2014-06-05T14:00:00.000+03:00

This is a short and vague glimpse to the interfaces that the Linux kernel offers to user space for display and graphics management, from the history to what is hot and new, to what might perhaps be coming after. The topic came current for me when I started preparing Weston for global thermonuclear war.

The pre-history

In the age of dragons, kernel mode setting did not exist. There was only user space mode setting, where the job of the kernel driver (if any) was simply to give user space direct access to the graphics card registers. A user space driver (well, Xorg video DDX, really, err... or what it was at the time of XFree86) would then poke the card registers to set a mode. The kernel had no idea of anything.

The kernel DRM infrastructure was started as an out-of-tree kernel module for cooperating between multiple programs wanting to access the graphics card's resources. Later it was (partially?) merged into the kernel tree (the year is a lie, 2.3.18 came out in 1999), and much much later it was finally deleted from the libdrm repository.

The middle age

For some time, the kernel DRM existed alongside user space mode setting. It was a dark time full of crazy hacks to keep it all together with duct tape, barbwire and luck. GPUs and hardware accelerated OpenGL started to come up.

The new age

With the invent of kernel mode setting (KMS), the DRM kernel drivers got in charge of the graphics card resources: outputs, video modes, memory allocations, hotplug! User space mode setting became obsolete and was eventually killed. The kernel driver was finally actually in control of the graphics hardware.

KMS probably started with just setting the main framebuffer (primary plane) for each "CRTC" and programming the video mode. A CRTC is for "cathode-ray tube controller", but essentially means a block that reads memory (a framebuffer) and produces a bitstream according to video mode timings. The bitstream is directed into an "encoder", which turns it into a proper physical/analogue signal, like VGA or digital DVI. The signal then exits the graphics card though a "connector". CRTC, encoder, and connector are the basic concepts in KMS API. Quite often these can be combined in some restricted ways, like a single CRTC feeding two encoders for clone mode.

Even ancient hardware supported hardware cursors: a small sprite that was composited into the outgoing video signal on the fly, which meant that it was very cheap to move around. Cursor being so special, and often with funny color format (alpha!), got its very own DRM ioctl.

There were also hardware overlays (additional or secondary planes) on some hardware. While the primary framebuffer covers the whole display, an overlay is another buffer (just like the cursor) that gets mixed into the bitstream at the CRTC level. It is like basic compositing done on the scanout hardware level. Overlays usually had additional benefits, for example they could apply scaling or color space conversion (hello, video players) very efficiently. Overlays being different, they too got their very own DRM ioctls.

The KMS user space ABI was anything but atomic. With the X11 tradition, it wasn't too important how to update the displays, as long as the end result eventually was what you wanted. Race conditions in content updates didn't matter too much either, as X was racy as hell anyway. You update the CRTC. Then you update each overlay. You might update the cursor, too. By luck, all these updates could hit the same vblank. Or not. Or you don't hit vblank at all, and get tearing. No big deal, as X was essentially all about front-buffer rendering anyway. (And then there were huge efforts in trying to fix it all up with X, GLX, Mesa and GL-compositors, and avoid tearing, and it ended up complicated.)

With the advent of X compositing managers, that did not play well with the awkward X11 protocol (Xv) or the hardware overlays, and with rise of the GPU power and OpenGL, it was thought that hardware overlays would eventually die out. Turned out the benefits of hardware overlays were too great to abandon, and with Wayland we again have a decent chance to make the most of them while still enjoying compositing.

The global thermonuclear war (named after a git branch by Rob Clark)

The quality of display updates became important. People do not like tearing. Someone actually wanted to update the primary framebuffer and the overlays on the same vblank, guaranteed. And the cursor as the cherry on top.

We needed one ABI to rule them all.

Universal planes brings framebuffers (primary planes), overlays (secondary planes) and cursors (cursor planes) together under the same API. No more type specific ioctls, but common ioctls shared by them all. As these objects are still somewhat different, overlays having wildly differing features and vendors wanting to expose their own stuff, object properties were invented.

An object property is essentially a {key, value} pair. In the API, the name of a key is a string. Each object has its own set of keys. To use a key, you must know it by name, fetch the handle, and then use the handle when setting the value. Handles seem to be per-object, so make sure to fetch them separately for each.

Atomic mode setting and nuclear pageflip are two sides of the same feature. Atomicity is achieved by gathering a set of property changes, and then pushing them all into the kernel in a single ioctl call. Then that call either succeeds or fails as a whole. Libdrm offers a drmModePropertySet for gathering the changes. Everything is exposed as properties: the attached FB, overlay position, video mode, etc.

Atomic mode setting means setting the output modes of a single graphics device, more or less. Devices may have hard to express limitations. A simple example is the available scanout memory bandwidth: You can drive either two mid-resolution outputs, or one high-resolution output. Or maybe some crtc-encoder-connector combination is not possible with a particular other combination for another output. Collecting the video mode, encoder and connector setup over the whole grahics card into a single operation avoids flicker. Either the whole set succeeds, or it fails. Without atomic mode setting, changing multiple outputs would not only take longer, but if some step failed, you'd have to undo all earlier steps (and hope the undo steps don't fail). Plus, there would be no way to easily test if a certain combination is possible. Atomic mode setting fixes all this.

Nuclear pageflip is about synchronizing the update of a single output (monitor) and making that atomic. This means that when user space wants to update the primary framebuffer, move the cursor, and update a couple of overlays, all those changes happen at the same vblank. Again it all either succeeds or fails. "Every frame is perfect."

And then there shall be ponies (at the end of the rainbow)

Once the global thermonuclear war is over, we have the perfect ABI for driving display updates.

Well, almost. Enter NVidia G-Sync, or AMD's FreeSync which is actually backed by a VESA standard. Dynamically variable refresh rate. We have no way yet for timing display updates in DRM. All we can do is kick out a display update, and it will hopefully land on the next vblank, whenever that is. But we can't tell the DRM when we would like it to be. Everything so far assumes, that the display refresh rate is a constant, apart from an explicit mode switch. Though I have heard that e.g. Chrome for Intel (i915, LVDS/eDP reclocking) has some hacks that opportunistically drops the refresh rate to save power.

There is also a culprit in the DRM of today (Jun 3rd, 2014). You can schedule a pageflip, but if you have pending rendering on that framebuffer for the same GPU as were you are presenting it, the pageflip will not happen until the rendering completes. And you do not know when it will complete, which means you do not know if you will hit the very next vblank or something later.

If the rendering GPU is not the same graphics device that presents the framebuffer, you do not get synchronization at all. That means that you may be scanning out an incomplete rendering for a frame or two, or you have to stall the GPU to make sure it is done before scheduling the page flip. This should be fixed with the fences related to dma-bufs (Hi, Maarten Lankhorst).

And so the unicorn keeps on running.

Improving presentation on Wayland

2014-01-31T14:03:00.000+02:00

In the last two (or three?) weeks at Collabora I have been looking into a Wayland protocol extension that would allow accurately timed presentation. Accurate timing is essential for two quite different use cases: video playback with audio/video synchronization, and interactive GUI with just-in-time redrawing. Video playback with A/V sync was our primary goal when we started working on this, and there is Frederic Plourde's first proposal from October 2013. Since then I have realized that also other kinds of applications need timings, especially feedback on when their content updates were shown, and when is the next chance to show an update (vblank). Primarily my re-design started with the aim to improve resizing performance when I got the assignment from Daniel Stone to push Wayland presentation protocol forward. The RFC v2 of Wayland presentation extension is now out for review and discussion.

I looked at various timing and content posting related APIs, like EGL and its extensions, GLX_OML_sync_control, and the new X11 Present and Keith's blog posts about it. I found a couple of things I could not understand what they were for and asked about them on a few mailing lists. The replies and further pondering resulted in a conclusion that I do not have to support the MSC modulus matching logic, and eglSwapInterval for intervals greater than one could be implemented if anyone really needs it.

I took a more detailed look at X11 Present for XWayland purposes. The Wayland presentation protocol extension I am proposing is not meant to solve all problems in supporting X11 Present in XWayland, but the investigation gave me some faith that with small additional XWayland extensions it could be done. Axel Davy is already implementing Present on core Wayland protocol as far as it is possible anyway, and we had lots of interesting discussions.

I am not going into any details of the RFC v2 proposal here, as the email should contain exhaustive documentation on the design. If no significant flaws are found, the next steps would be to implement this in Weston and see how it works.

Sub-surfaces. Now.

2013-11-18T17:05:00.000+02:00

Wayland sub-surfaces is a feature that has been brewing for a long long time, and finally it has made it into Wayland core in a recent commit + the Weston commit. The design for sub-surfaces started some time in December 2012, when the task was given to me at Collabora. It went through several RFCs and was finally merged into Weston in May 2013. After that there have been only small changes if any, and sub-surfaces matured (Or was forgotten? I had other things to do.) over several months. Now it is coming out in Wayland 1.4 (plan), but what is it really?

Introduction

The basic visual (and UI) building block in Wayland (the protocol) is a wl_surface. Basically everything on screen is represented as wl_surfaces in the protocol: mouse cursors, windows, icons, etc. A surface gets its content and size by attaching a wl_buffer to it, which is a handle to a pixel container. A surface has many attributes, like the input region: the region of the surface where it can receive input events. Input events, e.g. pointer motion, that happen on the surface but outside of the input region get directed to what is below the surface. The input region can be empty, but it cannot extend beyond the surface dimensions.

It so happens, that cursor, shell surface (window), and drag icon are also surface roles. Under a desktop shell, a surface cannot become visible (mapped) unless it has a role, and it fills the requirements of that particular role. For example, a client can set a cursor surface only when it has the pointer focus. Without a role the compositor would not know what do with a surface. Roles are exclusive: a surface can have only one role at a time. How a role is assigned depends on the protocol for the particular role, there is no generic set_role-interface.

A window is a wl_surface with a suitable shell role, there is no separate object type "window" in the protocol. A window being a single wl_surface means that its contents must come from a single wl_buffer at a time. For most applications that is just fine, but there are few exceptions where it makes things less than optimal when you want to take advantage of hardware acceleration features to the fullest.

The problem

Let us consider a video player in a window. Window decorations and GUI elements are usually rendered in an RGB color format on the CPU. Video usually decodes into some YUV color format. To create one complete wl_buffer for the window, the application must merge these: convert the video into RGB and combine it with the GUI elements. And it has to do that for every single video frame, whether the GUI elements change or not. This causes several performance penalties. If your graphics card is capable of showing YUV-formatted content directly in an overlay, you cannot take advantage of that. If you have video decoding hardware, you probably have to access and copy the produced YUV images with the CPU, while doing a color conversion. Getting CPU access to a hardware rendered buffer may be expensive to begin with, and then color conversion means you are doing a copy. When you finally have that wl_buffer finished and send it to the compositor, the compositor will likely just have to upload it to the GPU again, making another expensive copy. All this hassle and pain is just to get the GUI elements and the video drawn into the same wl_buffer.

Another example is an OpenGL window, or an OpenGL canvas in a window. You definitely do not want to make the GL rendered buffer CPU-accessible, as that can be very expensive. The obvious workaround is to upload your other GUI elements into textures, and combine them with the GL canvas in GL. That could be fairly performant, but it is also very painful to achieve, especially if your toolkit has not been designed to work like that.

A more complex example is a Web browser, where you can have any number of video and GL widgets around the page.

Enter sub-surfaces

Sub-surface is a wl_surface role, that means the surface is an integral sub-part of a window. A sub-surface must always have a parent surface, and the parent surface can have any role. Therefore a window can be constructed from any number of wl_surface objects by choosing one of them to be the main surface which gets a role from the shell, and others are sub-surfaces. Also nesting is allowed, so you can have sub-sub-surfaces etc.

The tree of sub-surfaces starting from the main surface defines a window. The application sets the sub-surface's position on the parent surface, and the compositor will keep the sub-surface glued to the parent. The compositor does not clip sub-surfaces to the parent surface. This means you could implement decorations as four surfaces around the content surface, and compared to one big surface for decorations, you avoid wasting memory for the part that will always be behind the content surface. (This approach may have a visual downside, though.) It also means, that for window management purposes, the size of the window comes from the union of the whole (sub-)surface tree.

In the windowed video player example, the video can be put on a wl_surface of its own, and the decorations into another. If there are sub-titles on top of the video, that could be a third wl_surface. If the compositor accepts the YUV color format the video decoder produces, you can decode straight into a wl_buffer's storage, and attach that wl_buffer to the wl_surface. No more copying or color conversions in the application. When the compositor gets the YUV buffer, it could use GLSL shaders to convert it into RGBA while it composites, or put the buffer into a hardware overlay directly. In the overlay case, the data produced by the (hardware) video decoder gets scanned out on the graphics chip zero-copy! After decoding, the data is not copied or converted even once, which is the optimal path. Of course, in practice there are many implementation details to get right before reaching the optimal path.

Atomicity

Updates to one wl_surface are made atomic with the commit request. A tree of sub-surfaces needs to be updated atomically, too. This is important especially in resizing a window.

A sub-surface's commit request acts specially, when the sub-surface is in synchronized mode. A commit on the sub-wl_surface does not immediately apply the pending surface state, but instead the pending state is cached. The cache is just another copy of the surface state, in addition to the pending and current sets of state. The cached state gets applied when the parent wl_surface gets new state applied (Note: not straight on the parent surface's commit, but when it gets new state applied.) Relying on the cache mechanism, an application can submit new state for the whole tree of surfaces, and then apply it all with a single request: commit on the main surface.

Input handling considerations

When a window has sub-surfaces completely overlapping with its main surface, it is often easiest to set the input region of all sub-surfaces to empty. This will cause all input events to be reported on the main surface, and in the main surface coordinates. Otherwise the input events on a sub-surface are reported in the sub-surface's coordinates.

Independent application sub-modules

A use case than was strongly affecting the design of the sub-surface protocol was application plugin level embedding. An application creates a wl_surface, turns it into a sub-surface, and gives control of that wl_surface to a sub-module or a plugin.

Let us say the plugin is a video sink running in its own thread, and the host application is a Web browser. The browser initializes the video sink and gives it the wl_surface to play on. The video sink decodes the video and pushes frames to the wl_surface. To avoid waking up the browser for every video frame and requiring it to commit on its main surface to let each video frame become visible, the browser can set the sub-surface to desynchronized mode. In desynchronized mode, commits on the sub-surface apply the pending state directly, just like without the sub-surface role. The video sink can run on its own. The browser is still able to control the sub-surface's position on the main surface, glitch-free.

However, resizing gets more complicated, which was also a cause for some criticism. When the browser decides it needs to resize the sub-surface the video sink is using, it sets the sub-surface to synchronized mode temporarily, which means the video on screen stops updating, as all surface state updates now go into the cache. Then the browser signals the new size to the video sink, and the sink acknowledges when it has committed the first buffer with the new size. In the mean time, the browser has repainted its other window parts as needed, and then commits on its main surface. This produces an atomic window update on screen. Finally the browser sets the sub-surface back to the free-running mode. If all goes fast, the result is a glitch-free resize without missing a frame. If things take time, the user still sees a window resize without any flickers, but the video content may freeze for a moment.

Multiple input handlers

It is possible that sub-modules want to handle input on their wl_surfaces, which happen to be sub-surfaces. Sub-modules may even create new wl_surfaces, regardless whether they will be part of the sub-surface tree of a window or not. In such cases, there are a couple of catches.

The first catch is, that when input focus moves to a sub-surface, the input events are given in that surfaces coordinates, like said before.

The bigger catch is how input actually targets surfaces in the client side code. Actual input events for keyboards and pointer devices do not carry the target wl_surface as a parameter. The targeted surface is given by enter events, wl_pointer.enter(surface) for instance. In C code, it means a callback with the following signature gets called:

void pointer_enter(void *data, struct wl_pointer *wl_pointer, uint32_t serial, struct wl_surface *surface, wl_fixed_t surface_x, wl_fixed_t surface_y)

You get a struct wl_surface* saying which surface the following pointer events will target. I assume, that toolkits will call wl_surface_get_user_data(surface) to get a pointer to their internal structure, and then continue with that.

What if the wl_surface is not created by the toolkit to begin with? What if the surface was created by a sub-module, or a sub-module unexpectedly set a non-empty input region on a sub-surface? Then, get_user_data will give you a pointer which points to something else that you thought, and the application likely crashes.

When a toolkit gets an enter event for a surface it does not know about, it must not try to use the user_data pointer. I see two obvious ways to detect such surfaces: maintain a hash table of known wl_surface pointers, or use a magic value in the beginning of the struct used as user_data. Neither is nice, but I do not see a way around it, and this is not limited to sub-surfaces or sub-sub-surfaces. Enter events may refer to any wl_surface objects created through the Wayland connection.

Therefore I would propose the following:

Always be prepared to receive an unknown wl_surface on enter and similar events.
When writing sub-modules and plugin interfaces, specify whether input is allowed, and whose responsibility is to set the input region to empty.

Out of scope

When I started designing the sub-surface protocol, a huge question was what to leave out of it. The following are not provided by sub-surfaces:

Embedding content from other Wayland clients. The sub-surface extension does not implement any "foreign surface" interfaces, or anything like what X allows by just taking the Window XID and passing it to another client to use. The current consensus seems to be that this should be solved by implementing a mini-compositor in the hosting application.
Clipping or scaling. The buffer you attach to a sub-surface will decide the size of the sub-surface. There is another extension coming for clipping and scaling.
Any kind of message passing between application components. That is better solved in application specific ways.

Summary

Sub-surfaces are intended for special cases, where you need to build a window from several buffers that are composited together, to make efficient use of the hardware resources. They are not meant for widgets in general, nor for pushing parts of application rendering to the compositor. Sub-surfaces are also not meant for things that are not integral parts of a window, like tooltips, menus, or drop-down boxes. These "transient" surface types should be offered by the shell protocol.

Thanks to Collabora, reviewers on wayland-devel@ and in IRC, my work colleagues, and everyone who has helped me with this. Special thanks to Giulio Camuffo for testing the decorations in 4 sub-surfaces use case. I hope I didn't forget anyone.

Weston on Raspberry Pi Accelerated

2013-05-24T09:27:00.000+03:00

Raspberry Pi is a nice tiny computer with a relatively powerful VideoCore graphics processor, and an ARM core bolted on the side running Linux. Around October 2012 I was bringing Wayland to it, and in November the Weston rpi-backend was merged upstream. Unfortunately, somehow I did not get around to write about it. In spring 2013 I did a follow-on project on the rpi-backend for the Raspberry Pi Foundation as part of my work for Collabora. We are now really pushing Wayland forward on the Raspberry Pi, and strengthening Collabora's Wayland expertise on all fronts. In the following I will explain what I did and how the new rpi-backend for Weston works in technical terms. If you are more interested in why this was done, I refer you to the excellent post by Daniel Stone: Weston on Raspberry Pi.

Bringing Wayland to Raspberry Pi in 2012

Raspberry Pi has EGL and GL ES 2 support, so the easiest way to bring Wayland was to port Weston. Fortunately unlike most Android-based devices, Raspberry Pi supports normal Linux distributions, and specifically Raspbian, which is a variant of Debian. That means very standard Linux desktop stuff, and easy to target. Therefore I only had to write a new Raspberry Pi specific backend to Weston. I could not use any existing backend, because the graphics stack does not support DRM nor GBM, and running on top of (fbdev) X server would void the whole point. No other usable backends existed at the time.

The proprietary graphics API on RPi is Dispmanx. Dispmanx basically offers a full 2D compositor, but since Weston composited with GL ES 2, I only needed enough Dispmanx to get a full-screen surface for EGL. Half of the patch was just boilerplate to support input and VT handling. All that was fairly easy, but left the Dispmanx API largely unused, not hooking up to the real performance of the VideoCore. Sure, GL ES 2 is accelerated on the VideoCore, too, but it is a much more complex API.

I continued to take more advantage of the hardware compositor Dispmanx exposes. At the time, the way to do that was to implement support for Weston planes. Weston planes were developed for taking advantage of overlay hardware. A backend can take suitable surfaces out from the scenegraph and composite them directly in hardware, bypassing the GL ES 2 renderer of Weston. A major motivation behind it was to offload video display to dedicated hardware, and avoid YUV-RGB color conversion and scaling in GL shaders. Planes allow also the use of hardware cursors.

The hardware compositor on RPi is partially firmware-based. This means that it does not have a constant limit in number of overlays. Standard PC hardware has at most a few overlays if any, the hardware cursor included. The RPi hardware however offers a lot more. In fact, it is possible to assign all surfaces into overlay elements. That is what I implemented, and in an ideal case (no surface transformations) I managed to put everything into overlay elements, and the GL renderer was left with nothing to do.

The hardware compositor does have its limitations. It can do alpha blending, but it cannot rotate surfaces. It also does have a limit on how many elements it can handle, but the actual number depends on many things. Therefore, I had an automatic fallback to the GL renderer. The Weston plane infrastructure made that very easy.

The fallback had some serious downsides, though. There was no way to synchronize all the overlay elements with the GL rendering, and switches between fallback and overlays caused glitches. What is worse, memory consumption exploded through the roof. We only support wl_shm buffers, which need to be copied into GL textures and Dispmanx resources (hardware buffers). As we would jump between GL and overlays arbitrarily and per surface, and I did not want to copy each attached buffer to both of texture and resouce, I had to keep the wl_shm buffer around, just in case it needs to jump and copy as needed. That means that clients will be double-buffered, as they do not get the buffer back until they send a new one. In Dispmanx, the elements, too, need to be double-buffered to ensure that there cannot be glitches, so they needed two resources per element. In total, that means 2 wl_shm buffers, 1 GL texture, and 2 resources. That is 5 surface-sized buffers for every surface! But it worked.

The first project ended, and time passed. Weston got the pixman-renderer, and the renderer interfaces matured. EGL and GL were decoupled from the Weston core. This made the next project possible.

Introducing the Rpi-renderer in Spring 2013

Since Dispmanx offers a full hardware compositor, it was decided that the GL renderer is dropped from Weston's rpi-backend. We lose arbitrary surface transformations like rotation, but on all other aspects it is a win: memory usage, glitches, code and APIs, and presumably performance and power consumption. Dispmanx allows scaling, output transforms, and alpha channel mixed with full-surface alpha. No glitches as we do not jump between GL and overlays anymore. All on-screen elements can be properly synchronized. Clients are able to use single buffering. The Weston renderer API is more complete than the plane API. We do not need to manipulate complex GL state and create vertex buffers, or run the geometry decimation code; we only compute clips, positions, and sizes.

The rpi-backend's plane code had all the essential bits for Dispmanx to implement the rpi-renderer, so lots of the code was already there. I took me less than a week to kick out the GL renderer and have the rpi-renderer show the desktop for the first time. The rest of a month's time was spent on adding features and fixing issues, pretty much.

Graphics Details

Rpi-backend

The rpi-renderer and rpi-backend are tied together, since they both need to do their part on driving the Dispmanx API. The rpi-backend does all the usual stuff like opens evdev input devices, and initializes Dispmanx. It configures a single output, and manages its updates. The repaint callback for the output starts a Dispmanx update cycle, calls into the rpi-renderer to "draw" all surfaces, and then submits the update.

Update submission is asynchronous, which means that Dispmanx does a callback in a different thread, when the update is completed and on screen, including the synchronization to vblank. Using a thread is slightly inconvenient, since that does not plug in to Weston's event loop directly. Therefore I use a trick: rpi_flippipe is essentially a pipe, a pair of file descriptors connected together. Write something into one end, and it pops out the other end. The callback rpi_flippipe_update_complete(), which is called by Dispmanx in a different thread, only records the current timestamp and writes it to the pipe. The other end of the pipe has been registered with Weston's event loop, so eventually rpi_flippipe_handler() gets called in the right thread context, and we can actually handle the completion by calling rpi_output_update_complete().

Rpi-renderer

Weston's renderer API is pretty small:

There are hooks for surface create and destroy, so you can track per-surface renderer private state.
The attach hook is called when a new buffer is committed to a surface.
The flush_damage hook is called only for wl_shm buffers, when the compositor is preparing to composite a surface. That is where e.g. GL texture updates happen in the GL renderer, and not on every commit, just in case the surface is not on screen right now.
The surface_set_color callback informs the renderer that this surface will not be getting a buffer, but instead it must be painted with the given color. This is used for effects, like desktop fade-in and fade-out, by having a black full-screen solid color surface whose alpha channel is changed.
The repaint_output is the workhorse of a renderer. In Weston core, weston_output_repaint() is called for each output when the output needs to be repainted. That calls into the backend's output repaint callback, which then calls the renderer's hook. The renderer then iterates over all surfaces in a list, painting them according to their state as needed.
Finally, the read_pixels hook is for screen capturing.

The rpi-renderer per-surface state is struct rpir_surface. Among other things, it contains a handle to a Dispmanx element (essentially an overlay) that shows this surface, and two Dispmanx resources (hardware pixel buffers); the front and the back. To show a picture, a resource is assigned to an element for scanout.

The attach callback basically only grabs a reference to the given wl_shm buffer. When Weston core starts an output repaint cycle, it calls flush_damage, where the buffer contents are copied to the back resource. Damage is tracked, so that in theory, only the changed parts of the buffer are copied. In reality, the implementation of vc_dispmanx_resource_write_data() does not support arbitrary sub-region updates, so we are forced to copy full scanlines with the same stride as the resource was created with. If stride does not match, the resource is reallocated first. Then flush_damage drops the wl_shm buffer reference, allowing the compositor to release the buffer, and the client can continue single-buffered. The pixels are saved in the back resource.

Copying the buffer involves also another quirk. Even though the Dispmanx API allows to define an image with a pre-multiplied alpha channel, and mix that with a full-surface (element) alpha, a hardware issue causes it to produce wrong results. Therefore we cannot use pre-multiplied alpha, since we want the full-surface alpha to work. This is solved by setting the magic bit 31 of the pixel format argument, which causes vc_dispmanx_resource_write_data() to un-pre-multiply, that is divide, the alpha channel using the VideoCore. The contents of the resource become not pre-multiplied, and mixing with full-surface alpha works.

The repaint_output callback first recomputes the output transformation matrix, since Weston core computes it in GL coordinate system, and we use framebuffer coordinates more or less. Then the rpi-renderer iterates over all surfaces in the repaint list. If a surface is completely obscured by opaque surfaces, its Dispmanx element is removed. Otherwise, the element is created as necessary and updated to the new front resource. The element's source and destination pixel rectangles are computed from the surface state, and clipped by the resource and the output size. Also output transformation is taken into account. If the destination rectangle turns out empty, the element is removed, because every existing Dispmanx element requires VideoCore cycles, and it is best to use as few elements as possible. The new state is set to the Dispmanx element.

After all surfaces in the repaint list are handled, rpi_renderer_repaint_output() goes over all other Dispmanx elements on screen, and removes them. This makes sure that a surface that was hidden, and therefore is not in the repaint list, will really get removed from the screen. Then execution returns to the rpi-backend, which submits the whole update in a single batch.

Once the update completes, the rpi-backend calls rpi_renderer_finish_frame(), which releases unneeded Dispmanx resources, and destroys orphaned per-surface state. These operations cannot be done any earlier, since we need to be sure the related Dispmanx elements have really been updated or removed to avoid possible visual glitches.

The rpi-renderer implements surface_set_color by allocating a 1×1 Dispmanx resource, writing the color into that single pixel, and then scaling it to the required size in the element. Dispmanx also offers a screen capturing function, which stores a snapshot of the output into a resource.

Conclusion

While losing some niche features, we gained a lot by pushing all compositing into the VideoCore and the firmware. Memory consumption is now down to a reasonable level of three buffers per surface, or just two if you force single-buffering of Dispmanx elements. Two is on par with Weston's GL renderer on DRM. We leverage the 2D hardware for compositing directly, which should perform better. Glitches and jerks should be gone. You may still be able to cause the compositing to malfunction by opening too many windows, so instead of the compositor becoming slow, you get bad stuff on screen, which is probably the only downside here. "Too many" is perhaps around 20 or more windows visible at the same time, depending.

If the user experience of Weston on Raspberry Pi was smooth earlier, especially compared to X (see the video), it is even smoother now. Just try the desktop zoom (Win+MouseWheel), for instance! Also, my fellow collaborans wrote some new desktop effects for Weston in this project. Should you have a company needing assistance with Wayland, Collabora is here to help.

The code is available in the git branch raspberrypi-dispmanx, and in the Wayland mailing list. On May 23rd, 2013, the Raspberry Pi specific patches are already merged upstream, and the demo candy patches are waiting for review.

Further related links:
Raspberry Pi Foundation, Wayland preview
Collabora, press release

Broken connection to DiNovo bluetooth device: a solution

2013-02-03T22:59:00.000+02:00

I have a Logitech DiNovo Mini (combined keyboard & touchpad), which at first worked just fine on my Gentoo laptop, Asus G50V, using the laptop's built-in bluetooth adapter, Bluez major version 4 (4.101-r5 today), and manual connection. Then I tried to connect the DiNovo to other devices, both without and with the USB-bluetooth-dongle that came with the DiNovo. Then I wanted it back to my laptop. There was a time when it worked only if I temporarily removed the battery from DiNovo. In the end, after several weeks if not months, it did not work anymore, at all. Blindly poking around, I now found how to fix it.

The DiNovo showed a green light, saying it got a connection, but on the laptop, all I could see was the device appearing and very soon disappearing in udev (confirmed with udevadm monitor). I tried to pair it again, many times, and while the pairing seemed to succeed, the device just did not work. I also installed blueman, which indicated the same: when I touched the DiNovo, it connected, and was immediately disconnected. In the system log I got:

bluetoothd[280]: Refusing input device connect: Operation already in progress (114)
bluetoothd[280]: Refusing connection from ##:##:##:##:##:##: setup in progress
(bluetoothd:280): GLib-WARNING **: Invalid file descriptor.

Ok, so is it trying to connect twice for some odd reason? Where is the state kept? Could I manually fix it?

Apparently, I could. In the /var/lib/bluetooth/*/ directory I saw several files that seemed to be about the bluetooth settings on my Gentoo. Not knowing anything about how Bluez works, I looked at the files there, to see if I could spot something suspicious. Luckily they were all plain-enough text files, so I did spot something.

The file /var/lib/bluetooth/*/spd had two lines with my DiNovo's device address on it. The first line was a long one, the second line short. Not knowing what I'm doing, I stopped the bluetooth service, removed the short line, and restarted the bluetooth service. Like magic, the DiNovo started working again, connecting automatically. No errors in the system log anymore, either.

I have not used the DiNovo much after the repair yet, remains to be seen if I broke anything, but so far so good. Apparently when I was playing around with the DiNovo, somehow that file got a second entry for the same device, and caused malfunction. Is it my fault or a bug, I do not know. Googling did not give any helpful hints on solving this, so I am recording this note here, hoping it helps someone.

-- A note, barely readable, scratched with a broken SD-card on the wall of a passageway in the huge, monster crawling dungeon they call the Intternets.

On supporting Wayland GL clients and proprietary embedded platforms

2012-11-21T15:58:00.000+02:00

How would one start implementing support for graphics hardware accelerated Wayland clients on an embedded platform that only has proprietary drivers?

This is a question I have answered more than once recently. Presumably you already have some ways to implement a Wayland compositor, some APIs you can use to composite and get images on screen. You may have wl_shm based clients already working, and now you want hardware rendered clients to work. Where to start?

First I will explain the architecture a bit. There are basically three things related to Wayland graphics:

the client side support for graphics hardware acceleration
the compositor's support for hardware accelerated clients
the compositor's own rendering or compositing, and output to screen

Usually the graphics hardware accelerated applications (clients) use EGL for the window system glue, and GL ES (2) for rendering. The client side is the EGL Wayland platform, plus wayland-egl API. The EGL Wayland platform means, that you can pass Wayland types as EGL native types, for instance a struct wl_display * as the EGLNativeDisplayType parameter of the eglGetDisplay() function.

The compositor's support for hardware accelerated clients is the server side of Wayland-enabled libEGL. Normally it consists of the EGL_WL_bind_wayland_display extension. For a compositor programmer, this extension allows you to create an EGLImageKHR object from a struct wl_buffer *, and then bind that as a GL texture, so you can composite it.

The compositor's own rendering mechanisms are largely irrelevant to the client support. The only requirement is, that the compositor can effectively use the kinds of buffers the clients send to it. If you turn a wl_buffer object via EGLImageKHR into a GL texture, you would better be compositing with a GL API, naturally. Apart from that, it does not matter what APIs the compositor uses for compositing and displaying.

Now, what do we actually need for supporting hardware accelerated clients?

First, forget about using wl_shm buffers, they are not suitable for hardware accelerated clients. Buffers that GPUs render into are often badly suited for CPU usage, or not directly accessible by the CPU at all. Due to GPU requirements, you likely cannot make a GPU to render into an shm buffer, either. Therefore to get the pixel data into an shm buffer you would need to do a copy, like glReadPixels(). Then you send the shm buffer to the server, and the server needs to copy the pixels again to make them accessible to the GPU for compositing, e.g. by calling glTexImage2D(). That is two copies between CPU and GPU domains, and that is slow. I would say unusably slow. It is far better to not move the pixels into CPU domain at all, and avoid all copying.

Therefore, the most important thing is graphics buffer sharing or passing. Buffer sharing works by creating a handle for a buffer, and passing that handle to another process which then uses the handle to make the GPU access again the same buffer. On your graphics platform, find out:

Do such handles exist at all?
How do you create a buffer and a handle?
How do you direct GL ES rendering into that buffer?
What is the handle? Does it contain integers, open file descriptors, or opaque pointers? Integers and file descriptors are not a problem, but you cannot pass (host virtual) pointers from one process to another.
How do you create something usable, like an EGLImageKHR or a GL texture, from the handle?

It would be good to test that the buffer passing actually works, too.

Once you know what the handle is, and whether clients can allocate their own buffers (preferred), or must the compositor hand out buffers to clients for some obscure security reasons, you can think about how to use the Wayland protocol to pass buffers around. You must invent a new Wayland protocol extension. The extension should allow a client to create a wl_buffer object from the handle. All the standard Wayland interfaces deal with wl_buffer objects, and the server will detect the type of each wl_buffer object when used by a client. Examples of the protocol extension are wl_drm of Mesa, and my experimental android_wlegl.

I recommend you do the first implementation of the protocol extension completely ad hoc. Hack the server to work with your buffer types, and write a custom client that directly uses the protocol without any fancy API like wayland-egl. Once you confirm it works, you can design the real implementation, whether it should be in a wrapper library around the proprietary libEGL or something else.

EGL is the standard interface to implement accelerated Wayland client support and it conveniently hides the details from both servers and clients, but it is not the only way. If you control both server and client code, you can use any other API, or create your own. That is a big if, though.

The key point is buffer sharing, copying will kill your system performance. After you know how to share graphics buffers, your work has only begun. Good luck!

Wayland on Android: upgrade to 4.0.4 and new build integration

2012-09-24T13:46:00.000+03:00

We at Collabora have been working on a new Android build system integration with autotools projects, still based on Androgenizer (git). Now we have our own repo manifest repository, and a tool called anagrman for managing optional feature packages (aggregates). Wayland on Android is one feature package, and the first to become available. We also upgraded to Ice Cream Sandwhich 4.0.4_r2.1. Instead of a snapshot release, this particular announcement is about live branches.

Weston (git) was upgraded to upstream as of Sept. 18th, 2012, though there are no user visible changes on Android. This brings the 0.95 protocol, and the new evdev input rework from upstream, which went through some changes since the 4.0.1_r1.2-b Wayland on Android release. Weston now has its GLESv2 renderer separated in code, and shaders are faster and simpler (I have not done any shader benchmarks myself).

Libxkbcommon lost its final dependencies to kbproto and xproto, and our Android build files are upstream. Thanks to Daniel Stone, we can use libxkbcommon straight from upstream.

The new Android build system integration requires you to manually download only anagrman in addition to Android repo. There is no make wayland-aggregate-configure step anymore, all generated Android makefiles are created during the full build. Also androgenizer and wayland-scanner are built automatically as needed. All this is possible using the makefile update feature of GNU Make. If there is a rule that can update a makefile, GNU Make will update the makefile as needed. If any makefiles were updated, Make will then start from scratch, reading in all makefiles again, before continuing to the actual build phase. This causes Make to reload all Android makefiles 2-3 times during the first build. It should also solve any dependency issues of the explicit configure step, like when one project's configure depends on another project's fully built library. A big thank you to Helio Chissini de Castro for doing most of the build system work.

This work is available in two ways:

The source code you can build yourself, see README. Use the module wayland where anagrman is invoked.
A snapshot image you can flash into a Samsung Galaxy Nexus: wayland-on-android-20120924.zip (105 MB) (sha1: b75cb4e325bc7a2960464128ae7f569c21741471). You can flash it with the command: $ fastboot update wayland-on-android-20120924.zip

The ready-made image is configured to launch Weston at boot instead of SurfaceFlinger (the setting is in device/samsung/maguro/system.prop in the source tree). The source code, however, does not start Weston nor SurfaceFlinger automatically. You have to use the commands # adb root and # adb shell to log into the phone, and run one of:

# setprop service.compositor surfaceflinger
# setprop service.compositor weston

You can also run Weston by just # weston & and start other demo clients manually. Weston will automatically start simple-touch finger drawing demo, and the power button will cause Weston to power off the phone. The available demo apps are: simple-touch, simple-shm, flower, and clickdot. Any GL based demos are not included, since the EGL Wayland platform for clients is still unimplemented.

Even though this is a live release, i.e. not tagged to a specific revision, I do not expect much changes in the near future. We are now researching other ways to enable Wayland on Android and other embedded-like devices.

Wayland on Android snapshot release: input

2012-07-13T15:40:00.000+03:00

It is time to announce the android-4.0.1_r1.2-b snapshot release of the Wayland on Android project at Collabora! We give you: input support in Weston and a finger-painting demo!

Collabora will have people at GUADEC demoing this on real devices, though not me personally.

Click to see the video!

This release provides ports of the following projects (git repositories, really) to Android 4.0.1 on Samsung Galaxy Nexus:

wayland
weston
pixman
mtdev
cairo
libxkbcommon, and its dependencies xproto, kbproto, and xkeyboard-config
libffi
mouse cursors

It also includes some changes to Android internals, and the aggregate for building it all.

This is just a snapshot release of a work in progress, and you cannot do much with it. Everything an end user would have known about Android is still gone.

In Weston, the three device buttons are working, and the touchscreen is working. Unfortunately, the only application really supporting touch devices is simple-touch, but I turned that into a demo that is automatically launched. If you install this release into a Galaxy Nexus device, it will boot into Weston and you can play with simple-touch. The power button is hooked up in Weston to power off the device immediately, so a computer is not necessary to show and exit the demo.

The main advancement compared to my previous posts is that the touchscreen is fully working now. Also, this time I am providing a proper release:

an image you can flash into your device, if you have the right tools: android-4.0.1_r1.2-b.tar.gz (106 MB) (sha1: 1bd52cef8b53574452b4e2feac76c5191e815884)
instructions for building a full Android OS with Weston: README

You can get the fastboot tool needed for flashing the images from the Android SDK, I think. I have never used the SDK myself, I have always gone with the full AOSP tree.

Please, if you try this on your device, let me know how it went. If you find problems that I can fix, I might push the fixes to the android-4.0.1_r1.2-b release branches, and update the ChangeLog for this release, but I will not provide new images. Before August I probably won't react, though.

If you look at the histories of the git branches mentioned towards the top, you will find many ugly hacky commits. All commits marked as HACK will be replaced by the proper changes during the course of this project. We are planning to send almost all changes to respective upstream projects, too. The input enablement patch series in Weston needs a rewrite, before it gets upstream.

Thanks to the whole Android team at Collabora for making this happen!

Forwarding evdev devices to Android

2012-07-06T17:16:00.002+03:00

I'm working on adding input support for Weston's Android backend, and to test a normal keyboard and a mouse, I needed a way to get those as evdev devices on Android. I don't have any Bluetooth devices here, so I started hacking up evdev forwarding from a laptop PC. I could not find much information on anyone doing this before.

First I tried the various forwarding types of adb other than TCP. None of them seemed to work at all. Therefore I needed to use TCP, since that worked. Thanks to a co-worker at Collabora, I found out that busybox has a netcat tool I could use on Android. Laptop of course has the original netcat. Nethid is a quick & dirty program for creating a fake evdev device node using uinput, so I fixed that to work on Android.

First on the host, set up TCP forwarding (port number just made up):

$ adb forward tcp:7776 tcp:7776

On the Android device, start uinput:

$ busybox nc -l -p 7776 | uinput-pipe

Then on the host start feeding input events:

$ cat /dev/input/event10 | nc localhost 7776

And Weston now finds my forwarded evdev input device and receives input from it!

But there are some minor problems with this. Something in this pipe spaghetti is caching data, so input events arrive in bursts, not as soon as possible. Also, something is eating over 99% of the REL_X events, so it is quite impossible to move the cursor horizontally. REL_Y events seem go through fine. Strange...

Anyway, this fills its purpose: I was able to test that pointer devices work on my Weston android backend, and I also got mouse cursors to show up after some fixes. The video below is the proof.

Wayland on Android: intermediate report

2012-06-19T15:06:00.001+03:00

We at Collabora had an internal meeting, where I presented the Wayland on Android project and its current state. You can find the slides by clicking the picture below.

Note: epdfview may get the colors wrong, at least Evince shows it right.

Some links related to the slides:

Weston on Android: desktop-shell

2012-05-24T10:40:00.000+03:00

Last week I got weston-desktop-shell working on Android, along with some toytoolkit clients. This means I have Cairo androgenized, and it can even render text. I did have trouble with Cairo's configure script, so the build lacks all thread-safety. For the curious, all sources can be found via my wayland-aggregate. Right now I'm working on Android's libEGL, trying to add Wayland support.

Galaxy Nexus sideways on a table, running Weston and desktop-shell.

Wayland anti-FUD

2012-05-11T10:16:00.000+03:00

I was replying to an email, and got side-tracked into writing some Wayland anti-FUD. There are lots of myths about Wayland out there, so I thought to better make it into a blog post.

This post is about the very small overhead of a Wayland (system) compositor, and why Wayland over network will be much better than X-over-ssh.

I predict that on desktops and other systems that may have accounts for more than one person, there will actually be two Wayland compositors stacked. There is a system compositor at the bottom, handling fast user switching, replacing VT switching, etc., and then a session compositor that actually provides the desktop environment. This is not my idea, it has been written in the Wayland FAQ under "Is Wayland replacing the X server?" for a long time.

My point is: Wayland compositors will not make 3D games suck because of compositing. While explaining why, I also continue to explaining why network transparency will not suck either. Now, do not mix up these things, I am not claiming that remoting 3D games over network will magically become feasible.

The overhead of adding a system compositor in the Wayland stack will be very small. A system compositor normally does not do any real work for compositing, it only takes the buffer handle from a client compositor, and flips it onto the screen. No rendering and no image copying involved in the system compositor.

It is the same with a full-screen game vs. any Wayland compositor: the compositor will not do any real work. A game renders its image into a buffer, passes the buffer handle to the compositor, and the compositor tells the hardware to scan out the buffer. No extra copying, no extra rendering.

The overhead that will appear with adding a system compositor, is relaying input events and buffer flips. The amount of data is small, and at least buffer flips will happen at most once per vertical refresh per monitor. There is also the idea of relaying input events only once per frame. This means that CPU process context switches will increase only by few per frame, when adding another compositor in the stack. Ideally the increase is 2 per frame: a switch to system compositor, system compositor handles input and output, and a switch back.

The overhead can be this small, because the protocol has been designed to avoid round-trips. A round-trip means that one process is waiting for another to reply before it can continue. The protocol also favors batching: accumulate a bunch of data, and then send as a batch. Both of these principles minimize the number CPU process context switches.

Because of these design principles, no Wayland developer is worried about the performance of a possible network transparency layer. Minimizing CPU context switches translates directly to minimizing the effect of network latency. Some believe, that even a simple Wayland network transport which practically just relays the Wayland protocol messages as is, and adds transferring of buffer data, will clearly outperform the traditional X-over-ssh.

Now, if you still claim that X-over-ssh would be better, you a) underestimate the effect of latency, and b) forget that modern applications do not send small rendering commands through the X protocol like 10-20 years ago. Modern applications render their content client-side, and send images to the X server. Wayland simply makes images the only way to send content to the server, allowing to drop the whole rendering machinery from the server and avoiding a huge amount of protocol.

First light from Weston on Android

2012-04-27T15:33:00.000+03:00

A couple of months ago, Collabora assigned me first to research and then make a proof of concept port of Wayland on Android. I had never even seen an Android before. Yesterday, Weston on Android achieved first light!

Galaxy Nexus running Weston and simple-shm.

That is a Samsung Galaxy Nexus smart phone, running a self-built image of Android 4.0.1. Weston is driving the screen, where you see the simple-shm Wayland client. There is no desktop nor wallpaper, because right now, simple-shm is the only ported client.

How is that possible? Android has no DRI, no DRM, no KMS (the DRM API), no GBM, no Mesa, and for this device the graphics drivers are proprietary and I do not have access to the closed driver source.

Fortunately, Android's self-invented graphics stack has pretty similar requirements to Weston. All it took was to write a new Android specific backend for Weston, that interfaces to the Android APIs. Writing it took roughly three days.

And the rest of the two months? I spent some time in studying Android's graphics stack, but the majority of the time sunk into porting the minimum required library dependencies, libwayland, Weston, and simple-shm to the Android platform and build environment. Simply getting the Android build system to build stuff properly took a huge effort, and then I got to write workarounds to features missing in Android's C library (Bionic). Features, that we have taken for granted on standard Linux operating systems for years. I also had to completely remove signal handling and timers from libwayland, because signalfd and timerfd interfaces do not exist in Bionic. Those need to be reinvented still.

Android has gralloc and fb hardware abstraction layer (HAL) APIs. Hardware vendors are required to implement these APIs, and provide EGL and GL support libraries. These implementations are usually closed and proprietary. On top of these is the Android wrapper-libEGL, written in C++, open source. My first thought was to use the gralloc and fb HAL APIs directly, but turned out that the wrapper-libEGL does not support using them in the application side. Instead, I was forced to use some Android C++ API (there is no C API for this, as far as I can tell) to get access to the framebuffer in an EGL-compatible way. In the end, I had to write a lot less code than using the HALs directly.

The Android backend for Weston so far only provides basic graphics output to the screen, and offers (presumably) accelerated GLES2 via EGL for the server. No input devices are hooked up yet, so you cannot interact with Weston. I do not know how to get pageflip completion events (if possible?), so that is hacked over.

Simple-shm is the only client that runs for now. There is no support for EGL/GL in Wayland clients. Toytoolkit clients are waiting for Cairo and dependencies to be ported.

The framebuffer can be used by one program at a time. Normally that program is SurfaceFlinger, the Android system compositor. To be able to run Weston, I have to kill SurfaceFlinger and make sure it stays down. Killing SurfaceFlinger also kills the whole Android UI infrastructure. You cannot even power off the phone by pressing or holding down the physical power button!

A video about simple-shm running on Galaxy Nexus:

The sources with Android build integration and other hacks can be found here:

The wayland_aggregate is how I actually connect to the Android build system. Building is not trivial, and you cannot simply do a checkout and start compiling. You have to get the right Android tree for your device, add my local manifest (which still points to repositories on my hard drive, i.e., won't work for you), download and extract binary blobs, and whatnot. Be warned, I will rebase the above branches.

This is the beginning of pushing a Wayland stack into Android. Next I need to clean up, send stuff upstream, add input support, find out about that pageflip, reinvent signal handling and timerfd, and then move on to the second major task: supporting Wayland GL clients. I hope it is possible to implement the Wayland platform in the wrapper-libEGL.

This work is sponsored by Collabora, Ltd, and I also thank my fellow collaborans for guidance through the vast jungles of Android.

Improved appearance for the simplest Wayland client

2012-04-27T10:17:00.000+03:00

Of the Wayland demo clients in the Weston repository, simple-shm is the simplest. All the related code is in that one file, and it interfaces directly with libwayland. It does not use GL or EGL, so it can be ran on systems where the EGL stack does not support the Wayland platform nor extensions. However, what it renders, is surprising:

The original simple-shm client on a Weston desktop.

The square with apparently garbage texture is the original simple-shm. To any graphics developer, who does not know any better, that immediately looks like something is wrong with the image stride somewhere in the graphics stack. That really is what it was supposed to look like, not a bug.

I decided to propose a different rendering, that would not look so much like a bug, and had some real diagnostic value.

The proposed appearance of simple-shm, the way it is supposed to look like.

The new appearance has some vertical bars moving from left to right, some horizontal bars moving upwards, and some circles that shrink into the center. With these, you can actually see if there is a stride bug somewhere, or non-uniform scaling. There is one more diagnostic feature.

This is how the proposed simple-shm looks like when the X-channel is mistaken as alpha.

Simple-shm uses XRGB buffers. If the compositor does not properly ignore the X-channel, and uses it as alpha, you will see a cross over the image. Depending on whether the compositor repaints what is below simple-shm or not, the cross will either saturate to white or show the background through. It is best to have a bright background picture to clearly see it.

I do hope no-one gets hypnotized by the animation. ;-)

What does EGL do in the Wayland stack

2012-03-10T11:51:00.001+02:00

Recently I drew some diagrams of how an EGL library relates to the Wayland stack. Here I am presenting the Mesa EGL version of them with the details explained.

Mesa EGL with Wayland, and simplified X as comparison.

X11 part

The X11 part of the diagram is very much simplified. It completely ignores indirect rendering, DRI1, details of DRI2, and others. It only shows, that a direct rendering X11 EGL application uses the X11 protocol to create an X11 window, and the Mesa EGL X11 platform uses the DRI2 protocol in some way to communicate with the X server. Naturally the application also uses one of the OpenGL interfaces. The X server has hardware or platform specific drivers that are generally referred to as DDX. On the Linux DRI stack, these call into libdrm and the various driver specific sub-libraries. In the end they use the kernel DRM services, like kernel mode setting (KMS). All this in the diagram is just for comparison with a Wayland stack.

Wayland server

The Wayland server in the diagram is Weston with the DRM backend. The server does its rendering using GL ES 2, which it initialises by calling EGL. Since the server runs on "bare KMS", it uses the EGL DRM platform, which could really be called as the GBM platform, since it relies on the Mesa GBM interface. Mesa GBM is an abstraction of the graphics driver specific buffer management APIs (for instance the various libdrm_* libraries), implemented internally by calling into the Mesa GPU drivers.

Mesa GBM provides graphics memory buffers to Weston. Weston then uses EGL calls to bind them into GL objects, and renders into them with GL ES 2. A rendered buffer is shown on an output (monitor) by queuing a page flip via the libdrm KMS API.

If the EGL implementation offers the extension EGL_WL_bind_wayland_display, Weston will use it to register its wl_display object (facing the clients) to EGL. In practice, the Mesa EGL then adds a new global Wayland object to the wl_display. That object (or interface) is called wl_drm, and the server will automatically advertise that to all clients. Clients will use wl_drm for DRM authentication, getting the right DRM device node, and sharing graphics buffers with the server without copying pixels.

Wayland client

A Wayland client, naturally, connects to a Wayland server, and gets the main Wayland protocol object wl_display. The client creates a window, which is a Wayland object of type wl_surface. All what follows is enabled by the Wayland platform support in Mesa EGL.

The client passes the wl_display object to eglGetDisplay() and receives an EGLDisplay to be used with EGL calls. Then comes the trick that is denoted by the double-arrowed blue line from Wayland client to Mesa EGL in the diagram. The client calls the wayland-egl API (implemented in Mesa) function wl_egl_window_create() to get the native window handle. Normally you would just use the "real" native window object wl_surface (or an X11 Window if you were using X). The native window handle is used to create the EGLSurface EGL handle. Wayland has this extra step and the wayland-egl API because a wl_surface carries no information of its size. When the EGL library allocates buffers, it needs to know the size, and wayland-egl API is the only way to tell that.

Once EGL Wayland platform knows the size, it can allocate a graphics buffer by calling the Mesa GPU driver. Then this graphics buffer needs to be mapped into a Wayland protocol object wl_buffer. A wl_buffer object is created by sending a request through the wl_drm interface carrying the name of the (DRM) graphics buffer. In the server side, wl_drm requests are handled in the Mesa EGL library, where the corresponding server side part of the wl_buffer object is created. In the diagram this is shown as the blue dotted arrow from EGL Wayland platform to itself. Now, whenever the wl_buffer object is referenced in the Wayland protocol, the server knows exactly what it is.

The client now has an EGLSurface ready, and renders into it by using one of the GL APIs or OpenVG offered by Mesa. Finally, the client calls eglSwapBuffers() to show the result in its Wayland window.

The buffer swap in Mesa EGL Wayland platform uses the Wayland core protocol and sends an attach request to the wl_surface, with the wl_buffer as an argument. This is the blue dotted arrow from EGL Wayland platform to Wayland server.

Weston itself processes the attach request. It knows the buffer is not a shm buffer, so it passes the wl_buffer object to the Mesa EGL library in an eglCreateImageKHR() function call. In return Weston gets an EGLImage handle, which is then turned into a 2D texture, and used in drawing the surface (window). This operation is enabled by EGL_WL_bind_wayland_display extension.

Summary

The important facts, that should be apparent in the diagram, are:

There are two different EGL platforms in play: one for the server, and one for the clients.
A Wayland server does not contain any graphics hardware or driver specific code, it is all in the generic libraries of DRM, EGL and GL (libdrm and Mesa).
Everything about wl_drm is an implementation detail internal to the EGL library in use.

The system dependent part of Weston is the backend, which somehow must be able to drive the outputs. The new abstractions in Mesa (GBM API) make it completely hardware agnostic on standard Linux systems. Therefore every Wayland server implementation does not need its own set of graphics drivers, like X does.

It is also worth to note, that 3D graphics on X uses very much the same drivers as Wayland. However, due to the Wayland requirements from the EGL framework (extensions, EGL Wayland platform), proprietary driver stacks need to specifically implement Wayland support, or they need to be wrapped into a meta-EGL-library, that glues Wayland support on top. Proprietary drivers also need to provide a way to use accelerated graphics without X, for a Wayland server to run without X beneath. Therefore the desktop proprietary drivers like Nvidia's have a long way to go, as currently nvidia does not implement EGL at all, no support for Wayland, and no support for running without X, or even setting a video mode without X.

Due to the way wl_drm is totally encapsulated into Mesa EGL and how the interfaces are defined for the EGL Wayland platform and the EGL extension, another EGL implementor can choose their very own way of sharing graphics buffers instead of using wl_drm.

There are already plans to change to some of the architecture described in this article, so it is possible that details in the diagram become outdated fairly soon. This article also does not consider a purely software rendered Wayland stack, which certainly would lift all these requirements, but quite likely be too slow in practice for the desktop.

See also: the authoritative description of the Wayland architecture

Wayland R&D at Collabora

2012-02-02T14:57:00.000+02:00

While being contracted by Collabora, I started a Wayland R&D project in October 2011 with the primary goal of getting to know Wayland, and strengthening Wayland expertise in Collabora. During the four months I started the wl_shell_surface protocol for desktops, added screen locking, ported an X screensaver to Wayland with new protocol, and most recently implemented surface transformations in Weston (the reference compositor, originally the wayland-demos compositor). All this sponsored by Collabora.

The project started by getting wayland-demos running under X, and then looking into the bugs I hit. To rule out problems in hardware GL renderer, I also got the demos running with softpipe and llvmpipe. Trying to fix segmentation faults and other obvious problems was my stepping stone into the Wayland code base.

My first real piece of work was screen locking. That included adding special protocol for it, having a way to have privileged Wayland clients, implementing locking in the shell plugin in the compositor, and writing an unlock dialog for the desktop-shell client. Those are the obvious parts. I also had to extend the shell plugin interface, find a way to hide surfaces so they do not render while the screen is locked, and of course bug hunting and patch set rebasing and rewriting, before screen locking landed upstream.

Next was porting an X screensaver as a regular Wayland client. Once that worked, I extended the protocol by adding a screensaver interface, and made the shell plugin automatically start the screensaver application. Handling screensavers would have been a walk in the park, except I needed shell-specific data to be attached to all surfaces. I wrote a hacky solution, but in the end, Kristian Høgsberg wanted me to add a whole new interface into the shell protocol for this. It became the wl_shell_surface interface, and all demo clients needed to adopt it. Yet that was not all. Since we are used to have per-monitor screensavers, I needed my screensaver to set different instances for each monitor. Hence I had to add output event callbacks in the toytoolkit.

A cleanup phase came next, I took Valgrind and ran. I fixed a pile of memory leaks and wrote missing destructor functions all over, in compositor, clients and the toytoolkit, at the same time collecting a Valgrind suppressions list to ease Valgrinding in the future. This work included adding some ad hoc way of cleanly exiting demo clients.

In January there were some discussions on maximised and full-screen surfaces, what they are and how they should be implemented. Surface scaling was raised as one point. Weston already had the zoom effect, and full-screen scaling would be another surface transformation, so I decided to write a transformation matrix stack for supporting any number of simultaneous transformations. It turned out to be a three week task.

Implementing surface transformations required changes all over Weston. First, I needed a way to invert the transformation which is a 4-by-4 matrix. After searching in vain for a MIT-licenced C implementation I wrote one myself, based on LU-decomposition. I believe LU-decomposition is more efficient on a 4x4 matrix than the cofactor method. Along the inversion routines, I wrote a unit test application for testing the speed and precision of the inversion. Detecting and dealing with non-invertible transformations is also important.

Going through the transformation stack every time you need to transform a point might be costly, so I added a cached total transform and its inverse. Implementing input redirection was a simple matter of applying the inverse total transform to pointer coordinates. Needing a way to test transformations, I added a Weston key binding for rotating surfaces, and modified an existing demo application to mark the clicked point. Adding functions for explicitly converting between display global coordinates and surface local coordinates (surface local are the only ones a client knows of) clarified some of the coordinate computations.

Surface painting and damage region tracking needed fixes, too. Previously, a zoomed surface was repainted as a whole, and it forced a full display redraw, i.e. damaging the whole display. Transformed surface repaint needed to start honoring the repaint regions, so we could avoid excessive repainting. Damage and repaint regions are tracked as global coordinate axis aligned rectangles. Whenever a transformed surface is damaged (requires repainting), we need to compute the bounding box for the damage instead of simply using the global x, y of the top-left corner and the surface width, height. Then during surface painting, we take the list of damage rectangles, and render only those. Surface local coordinates (texture coordinates) are computed via the inverse transformation. This method may result in sampling outside of a surface's buffer (texture), so those samples need to be discarded in the fragment shader.

Other things that needed fixing after the surface transformations were window move and resize. Before fixing, moving a surface would not follow the pointer but move in the surface local orientation. Resize needed the same orientation fix, and another fix in relative surface motion that a client can set in the surface's attach request.

What you mostly see as the result of the surface transformations work is, that you can rotate any normal window, no application support needed. The pointer position on screen, over a window, accurately corresponds to what the application receives as the local pointer location. I did not realise it at the time, but this input redirection working flawlessly became an appreciated feature. Apparently it is hard or impossible to do in X, I would not know. In Wayland, and for me, it was just another relatively easy bug to be fixed. The window rotation feature was meant purely for debugging surface transformations.

Two rotated windows and some flowers.

There are still further issues to be fixed with surface transformations. Relative surfaces, like pop-up windows and menus, are not transformed and appear at a wrong location. Pointer cursors are not transformed; you would want the text bar cursor to be aligned with the text orientation. Continuously resizing a transformed window from its (locally) top-left corner makes the window drift away. We are probably still damaging larger regions than absolutely necessary for repaints. Repaint optimisation of opaque surfaces does not work with transformations.

During all this work of four months there were also the usual bug hunts, enhancements and fixes all over. For example, decorationless EGL apps, which turned out to have been a bug in Cairo, and moving the configuration file parser into a helper library that is shared between clients and the compositor.

Now, I am done with the Wayland R&D project and moving into another project at Collabora. In the new project I will continue working on Wayland, Weston, and the demos.