Navigation

Phaser Dev Log - Summer 2021

Published on 14th September 2021

Sorry for the delay since the last Dev Log. I know it has been a couple of months. I can assure you that I have not been idle during this time. Far from it, in fact. But I've also been unable to write a Dev Log either and I think it's worth taking a little time to explore the reasons why, before digging into the absolute mountain that is "What's New".

A long and twisted road

I'm sure most of you would agree with me that development typically follows a pattern of evolution. You code something and get it working. You've successfully dragged it out of the primordial ooze and it's now tottering around on its own. Perhaps later down the line, you introduce an environmental change, and all of a sudden your fledgling code needs restructuring to carry on living. Sometimes I think of a better way to do it while I'm working elsewhere in the API, or, as if often the case, I won't be happy with the performance of it. Sure, it's walking, but it's just too slow.

Regularly I will cull large parts of code to try a completely different approach. Usually, this pays off. However, it's also equivalent to taking steps backward. Redoing features I already had worked in place just to see if the newer approach is better. And sometimes, it isn't.

As I'm working on Phaser 4 entirely on my own I don't have other people to bounce ideas off or check my code. Now and again I can present a question on Discord, but no one in the world understands the codebase as I do, or what I'm trying to achieve, so it's understandably hard for people there to give much insight. I don't blame them. I mean, how could they? However, it makes development a very solitary process a lot of the time. And when I hit a roadblock I'm literally the only person who can figure out a way around it.

Sometimes, this just all gets a bit too much.

To give this some context: In June I started really performance testing Phaser 4 and I just wasn't happy with how it sat. It worked, sure, but it didn't work as fast as I wanted and I knew I could do better. In order to reach the kind of performance I required, I would have to try things I'd never tried before in the history of Phaser (or my own development experience) and keep re-evaluating it as I went. This meant writing hundreds of lines of code, in order to test a theory, only to find it all needed throwing away again when it didn't pan out. Believe me, this gets extremely demoralizing, really fast as I sunk more and more time into trying to achieve something I'd never done before.

I'm not going to start throwing around the D-word. I, thankfully, do not suffer from depression. Yes, I'll have off-days, like most of us do, but it's nothing clinical. I am tenacious to the nth degree. I simply will not give up once I've started down a path, often to the detriment of other tasks I need to do. This is why there hasn't been any news posted on the Phaser site in August. I become very insular and draw into my coding shell, finding it hard to talk about what I'm doing when a feature hasn't been completed. This is why this is the first Dev Log since the end of June. Everything was in such a state of flux I didn't want to write about it. I didn't feel proud enough, there was nothing to celebrate. I felt there wasn't any point because I knew it was all going to change. And it did.

Here we are in mid-September and the fact I'm freely writing again is evidence that I've pulled through all of the coding headaches I had over the summer. I'm now happier with the place Phaser 4 is in than I have ever been. It would be wonderful if there was a way to hire another developer to work with me. I'm positive it would have had a significant effect on the recent re-coding, but there simply isn't the income to sustain this right now. I'm eternally thankful to my patrons and sponsors who keep this ship afloat, but it's fair to say I'd need a few more of you before I can afford to put someone else on the payroll.

So for now, I'll continue the battle on my own. I know that as soon as I get a 1.0.0 release out there, lots of you will be willing to get involved and help code. Right now, that wouldn't be a good use of your time though, because things are still evolving fast. However, let's jump in so I can talk about the state of where Phaser 4 is today.

Bitten by ECS

Phaser 4 uses bitECS to manage its entity component system. bitECS describes itself as a "Functional, minimal, data-oriented, ultra-high performance ECS library written using JavaScript TypedArrays." - and honestly, for the most part, I have really enjoyed working with it.

Internally, all Game Objects are Entities. They have a bunch of components, such as a Transform Component, a Color Component, a Permissions Component, a Hierarchy Component, and a few others, depending on what type of Game Object they are. bitECS manages all of this. I spent weeks writing my components to fit into the bitECS style, running hundreds of performance tests and tweaking it and tweaking until I was happy.

As I've said from the start, Phaser 4 is almost entirely modular and consists of thousands of independent functions, which all compile to their own modules that you can import directly into your code. There are still a couple of internal managers, but on the whole, it's about as functional as I can make it.

Now, you don't have to use any of the ECS approaches in your own code if you don't wish to. Nothing stops you from creating an instance of a Sprite and adding it into the display list, as you've always done. Classes are not outlawed in Phaser 4. However, they are optional. This is the first version of Phaser where you can just pull in the rendering functions directly and use them, without instantiating anything. I covered this in detail in the previous Dev Log, so please read that for code examples.

You don't even need to create a Scene any longer if you don't wish to. Take a look at the following code:

We use the `CreateGame` function here to create a game instance and then load an image and draw it. Notice how these actions are Promise-based, too. The above just works and what's more, it's only 14 kB all bundled. And that will be even smaller once I've removed a load of debugging stats that are currently in the Game instance!

As you can see, we didn't create any World, Sprites, or even Scenes.

What does this have to do with ECS? Well, internally it's all still running under it. The Game is our root entity. When a Game Object is created, it is given a unique id and added into the entity world. Any components it requires are added to it directly. When you add the Sprite to a Phaser World, it's given the World's tag.

I'm using bitECS almost as much as a data store as I am an entity-component system. All entities and components are carefully stored in a structure of typed arrays. When you define a component, you do so by specifying the type of number it can contain. This allows for insanely fast iteration and, for those of you into networked games, serialization.

Let's take a look at the guts of the Color Component as an example:

As you can see, we've got a component with RGB properties that are clamped uint8s, an alpha, a 16-element color matrix, and a 4 element offset array. Sprites are then given this component and internally Phaser uses and populates these properties as required. For example, the SetAlpha function looks like this:

The entity id is passed in, along with the alpha value. It's set into the component (and in this case, we flag the entity's parents as being dirty) and that's all. You can access a Game Objects id via the read-only 'id' property, so any Sprite could be passed to this function.

Rather than shoving a load of properties onto Game Objects and bloating the class sizes up, instead, they're stored in components. This also allows you to create your own Game Objects by just ensuring they have the right components and skip using the ones Phaser provides at all. Or, you can use a hybrid approach, so a Phaser Sprite combined with your own systems. Have a look at this example:

In the above, I define a new component called 'speedComponent' which has two properties for its x and y speed. I then create 1000 Sprites in a for-loop and add the speed component to each of them. The component values are initialized with a random speed and the Sprite is added into the World.

I then create a query, which will return all entities that have the speed component. This is used in the Update Particle System. It runs the query, gets the entities back, iterations them, and sets their x/y positions accordingly. Finally, the system is invoked every frame on the world update. The final result looks like this:

Click here to run the demo.

Of course, you don't have to use ECS at all! It's entirely your choice. It will always be there, ready for those who like this style of coding, because Phaser is using it internally. But if you want to just remain in the land of OOP that is perfectly ok! I will continue to create all of the Phaser 4 examples using the traditional OOP approach, too, as that is the most expected way. Just be aware though, it isn't the only one.

The new Display List

As mentioned at the start of this Dev Log I was stuck in a coding black-hole for a while. What was it that consumed so much of my time this summer? The answer is somewhat innocuous: it was simply the Display List.

This probably sounds crazy on the surface, yet there's a good reason for it. I didn't want a slow display list. I don't want any part of Phaser 4 to be slow, in fact. I want it to equal and surpass other popular libraries. That's my end goal and always has been.

Phaser 2 had a very Flash-like Display List. That is, you could add children to any Game Object. For example, you could take a Sprite and then add other Sprites as children of it. When the parent Sprite was updated, such as rotating or scaling, the children would be influenced as well.

This is common and expected behavior in game engines. Although not all games require it, it gives you a lot of control. However, we chose not to support this approach in Phaser 3, instead opting for a 'flat' display list that went for speed over flexibility. It was, in hindsight, a bad decision. This is why the Container Game Object in Phaser 3 is quite tricky to work with, as it was shoe-horned in after the fact and never really meant to be there.

I wasn't going to face that issue with Phaser 4, so from the very start, it has been built with a fully transformed display list. Anything can be a parent and Containers are the root element of all Sprites, Text, and other objects. It has always been like this, since the very first version of Phaser 4 that I built.

The issue with display lists is that they're inherently slow.

There's no getting away from it. The more complex your list, the more calculations the CPU is having to perform. If you update a parent Container, that update needs to flow down to all children, so they have the correct transform for rendering, multiplying all the transform matrices as it works down the tree. They also don't lend themselves well to data locality or iteration. The structure makes perfect sense on-screen but goes against what a CPU really wants in terms of ordering.

For some games, this doesn't matter, because the object count is low so you don't even notice it, at least on a decent device, anyway. On a feature phone though, every millisecond counts.

UI heavy games that render the UI in-game rather than in HTML, benefit from strong parent-child relationships that display lists offer. More 'flat' games, that instead just want to blast tens of thousands of sprites around, don't need it, so why should they suffer from their cost? Lots of questions remained open and lots of experimentation began.

As I started Phaser 4, a parent would store its direct ancestors in an array. If you called AddChild(parent, child) it would add that child onto the end of the parents 'children' array. If you removed a child, it would be sliced out of the array, and so on. APIs that have this approach offer functions such as 'AddChildAt' in order to insert a child at a specific position on the list, as the list determines the order in which things are rendered. This is a very common approach and is what Pixi, along with many others, does.

When you want to iterate an array-based system like this, you tend to have a function similar to this:

The above code is from Pixi 6. As you can see, it iterates the children in the array, calling 'updateTransform' on each one, which in turn would do the same for its own children and so on, repeating on itself, building up an internal stack until the process finishes.

The downside is that it really doesn't scale very well when you get into large volumes of Game Objects. For a start, in most games, the display list is constantly changing. For example: in a shooter, if an alien ship explodes, maybe you want bits of debris flying out? The debris should fit into the display list, so it doesn't accidentally render behind or in front of the wrong objects, but it's also likely to be very short-lived, so will be added and then removed from the list really quickly. Imagine this happening for hundreds of sprites.

Array operations in JavaScript are pretty fast these days, but modifying an array by adding or removing elements has a real measurable cost, made exponentially worse the larger the array is. Even an operation as mundane as 'indexOf' can start costing a lot as the array grows. Try it on an array with 100,000 elements and you'll see what I mean.

Another massive cost is in calculating the Game Objects transform for rendering. This is the process of taking its position, scale, rotation, skew, and texture frame and turning that into values its vertices can use to render with. While also factoring in any influence the parent may have. You also want to calculate the bounds of the object, which can be really handy for camera culling.

The math that happens to do all of this isn't complex, just a few additions and multiplications. Yet when you're doing it every single frame, for tens or hundreds of thousands of Sprites, your frame rate will tank as all of the CPU time is spent on it.

Do all of this in anger and it will hamstring your game before it's even had a chance to process its own game logic. I was noticing that iterating through the display list was becoming one of the biggest costs in Phaser 4 and I was convinced there had to be a better way than using arrays.

Data locality - caches do matter!

First of all, I looked at the importance of data locality on performance.

In a typical set-up, you may have a Sprite with an 'x' property that is simply a value, which you read from during rendering. In more advanced set-ups, such as Pixi, 'x' is part of an observable vector. When the property is modified the vector 'observes' this has happened and notifies a callback about it. This flags the object as being 'dirty' so it recognizes that it needs to update its transform during the next render step. The concept of something being 'dirty' or not allows the renderer to skip lots of math for clean objects that didn't change since the last frame.

This is the approach I started out with for Phaser 4 as well. However, Objects in JavaScript are not stored consecutively in memory. As the game step runs and the objects are iterated, the browser is jumping around all over the place in memory accessing their properties. This kills off all chances of data locality helping iteration speed.

I don't want to get into the complexities of CPU caches here, suffice to say the subject has been well written about and this YouTube video by SimonDev will give you a nice top-level overview of it in the context of a browser, showing the real difference it can make.

Because I had swapped to using bitECS all of the transform data was now stored in a single Typed Array, rather than being object properties. When iterating the entities array in sequence, the processor was able to load the next batch of data in the array into memory and the result was remarkable. Instead of taking a few hundred ms per property, it took a fraction of that time once the first property had been invoked:

Net result? Data locality matters a lot more than you may expect when you start to scale out your game.

The crazy cost of Maps

Phaser 4 contains a Game Object Cache. This is essentially a single location where all Game Objects are stored and can be retrieved based on their ID. So, of course, I used a Map for this. It was a single line of code:

Nice and easy, elegant, worked a treat.

Then I really got into profiling Phaser 4 and determining how many Sprites it could actually push around the screen. Proper fully-fledged sprites with complete transforms, textures, etc. And as part of profiling, I uncovered a somewhat depressing truth: at scale, Maps are crazy slow.

I replaced the above line of code with this:

A basic sparse array, where objects are inserted based on their numeric key with two simple functions to get and set them.

The difference was night and day. I ran a profile on a test with 50,000 updating Sprites on the display list, all visible on-screen at the same time and moving. Each Sprite was just 2x2 pixels in size to avoid GPU fill rate issues. The call to Map.get was eating up a shocking amount of processing time, causing our frame rate to dip to 55fps. The cost was massive:

This was just one call of many in an 8-second profile sample.

That's a shocking amount of time to get an object based on a numeric key. I ran the exact same test using my sparse array approach and it ran at a solid 60 fps. In profiling, the new 'get' function didn't even register a time it was so insignificant.

My sparse array approach is taking less than half the time of the Map during rendering and updating.

Honestly, this is quite sad. And the take-away from this absolutely shouldn't be "don't use Maps!" because that couldn't be further from the truth. However, if you are needing to store large amounts of data with a numeric key and either iterate or get from that data frequently, good old arrays are still the best way to do it. Making this change helped me get another step closer to the peak performance I was after. There was still more to do, though.

The Return of the Linked List

I had spent a while implementing ECS throughout the core parts of Phaser 4 and I was moderately happy with the iteration speed, but there was still the massive elephant in the room to deal with: The Display List. As I mentioned, I wanted to get rid of the array style parent-child approach, because it doesn't scale well if the list grew large or needed to constantly change, as most games do.

There's a well-known data structure that is suited to the task of constant hierarchy manipulation with minimal cost: A Linked List. A lot of you may be familiar with this already, but for those who aren't the concept is simple: Each entity maintains a pointer to the next entity in the list and the previous entity that came before it. Technically this is known as a doubly-linked list, because you link in both directions, not just forwards.

There are pros and cons to linked lists. The biggest pro is that inserting children is really fast. All you need to do is get the entity you want to insert after, find out what it is connected to on the right (i.e. it's "next" value), and just insert the new entity into this place, linking them together. Adding an entity to a list involves only updating the pointers of the 2 nodes on either side. The rest of the tree doesn't have to know anything about this change and, most importantly, no array manipulation took place. Removing entities is equally as fast and easy.

In a 'traditional' linked list you would join all nodes together in a sequence. The image below that I created shows this. The dark lines are the parent-child relationships and the orange ones the sequence:

This sequence allows you to follow the list from the head (in this case World), all the way through to the tail (K). Often a list will be circular, hence the connection from K back to World again. By visiting each node and then moving to its 'next' pointer the entire tree can be traversed. I've marked this on the diagram with numbers, showing the jumps involved so you can see the order in which nodes are visited. In a doubly-linked list, you can easily reverse this process too, by starting at the tail and following the 'prev' pointers to the head.

However, this isn't beneficial for the type of Display List a game requires. The reason is that if, for example, I wanted to move node J (and its children N and O) and insert it as a child of node M, I would have to ensure that all of the surrounding nodes are updated. In this case, D would have to be re-linked to K. K would have to have its prev link to O severed and patched to D. M would need to stop linking to F and instead link to J. O would have to link to G instead of K, G would need to break its link with F and so on. I probably didn't even cover them all! Suddenly moving one child to another parent becomes quite the ordeal.

That isn't all, though. Remember how I said that, if a parent isn't dirty, we don't need to visit its children? You can't do this easily with a traditional list sequence, because the leaf nodes are connected together. In the image above, if node E was clean, we need to jump right to node F, but there is no connection there because it isn't next in the list.

It doesn't look too complex in this example, because it's artificially small, but expand this out to a typical game with thousands of nodes, and these kinds of operations start to cost a lot more and become brittle, something we desperately need to avoid.

So instead of one giant linked list, which incidentally is what Phaser 2 uses, my approach was a series of isolated lists. I actually found out, several weeks after coding it, that this is known as a 'tree of linked lists' :)

What this means is that all of the children of a parent are linked together in sequence, however, the first and last entities don't link anywhere else, they terminate at each end. This means you can move a child to any other parent on the list and it will take all of its offspring with it, without breaking multiple parts of the list. Let's take our diagram above and redo it using my alternative method of linking:

In this approach, each node maintains a link to its next sibling (i.e. A to B), its previous sibling (i.e. B to A), and its parent (i.e. E to A). If a node is a parent it maintains a link to the first child and last child only. If the parent only has one child, both of these are the same. If a node is a head (i.e. A) it doesn't have any previous sibling set. The same is true if it's the tail, with no next sibling set.

In short, what we've got is a tree of linked lists, where each 'level' of our tree is its own isolated list.

The benefits of this approach are that you get the main speed advantage of a linked list: Costing very little to rapidly insert or remove nodes with no impact on the rest of the tree, but you also get to easily move nodes and all of their children around, something that's traditionally difficult to do. Using our previous example, if we move node J to be a child of M, all we need to do is set D's "first child" to point at K, nullify Ks "previous sibling" pointer, and set M to be the parent of J.

If one part of your tree breaks off (i.e. you delete a Game Object that had lots of children) the only nodes that need worry about it are the direct siblings and maybe the parent. The entire tree doesn't need to recalculate or re-balance itself, or anything equally expensive.

At this point I started benchmarking my list approach vs. the array approach I had before and the difference was dramatic, to say the least:

The stats in the above chart were derived from a demo that had a set of static sprites in the background and a set of moving (constantly updating) sprites over the top. At 102,970 sprites (5000 updating, the rest static) you can see both approaches managed a solid 60 fps (the 2nd column in the chart). So far, so good.

Bumping it up to 25,000 Updating Sprites and the array version was dropping frames already, at around 54-56 fps, while the Linked List approach was steady. Increasing to 50,000 Updating Sprites the array was into the 30 fps range, while the Linked List carried on strong at 59 fps average. Doubling this to 100,000 Updating Sprites killed the array version, down to just 21 fps average with a 14 fps low, while the Linked List managed it at 36 fps.

I ran many more tests to ensure it wasn't just a fluke. Across the board, it was consistently faster. The work I had done on the list paid off, but I was still convinced I could do better.

The above test was mostly comparing the cost of iterating the updating Sprites. Yet it was still taking a lot of time running through the pre-render sequence, which involved checking various dirty flags, recalculating transforms, and camera culling.

The problem with a Display List hierarchy is that transforms need to feed down from the top to the bottom. Parents need to update their children. This is an expensive journey to take, so you only want to do it when absolutely necessary. You also don't want to use recursion, if you can at all help it.

The original version of the Pre-Render step was performing the following tasks in order:

1) Update all dirty local transforms. If a Sprite had moved, its transform was refreshed.
2) Update all dirty world transforms. If a parent had moved, pass that on to all of its children, all the way down.
3) Update the vertex positions of all those who had been impacted by step 1 or 2.
4) Update the color component values for all children that had dirty colors.
5) Update the 'in view' calculation for all children if the camera was dirty (had moved)

That's a lot to do every frame, but it all needed doing! More importantly, you have to update local transforms before you can update world ones because one requires the other. And you can't update the vertex positions until the transforms are correct. Finally, you can't update the camera culling 'in view' setting until you have the vertex positions, because it's derived from the bounds. The color update is about the only thing that could go anywhere.

While this looks like a lot of tasks, I wasn't just arbitrarily running them every frame. I had carefully built a Dirty Component whose sole job was to maintain the dirty state of all of these systems. Sprite has moved? Its Local Transform flag was dirty. Color has changed? The Color dirty flag was set, and so on. This allowed pre-render to quickly skip over anything that wasn't dirty while iterating the display list. Because if it wasn't dirty, it didn't need updating.

However, some actions would make another entity dirty. For example, a parent with a dirty local transform meant that all children needed to be updated, too, even if they weren't previously dirty. Hence why pre-render was broken down into the 5 steps shown. I constantly saw this appearing as the most expensive part of the process when profiling it. It needed a different approach. I just didn't know what.

I spent days on this problem, and honestly, was losing all hope. I mean sure, what I had did actually work well and was definitely faster than the array approach before, yet it still didn't feel like I had achieved what I wanted. There had to be a better way.

Rusty Transforms

I figured I would try recoding the Transform steps of this flow in Rust, compiled to Web Assembly. After all, WASM is touted as the 'holy grail' of speed. As if it will somehow give your JavaScript app superpowers to rival native platforms. You can probably infer from my tone that this isn't what happens at all.

I ordered a few books on Rust. I like books, after all, and sat down and started getting set up. I won't bore you with the details, but it's a shockingly messy process and what works on one OS won't on another. If you thought npm package management was bad, wait until you've tried using Crates. When I mentioned this on Twitter, all the responses I got were "it's better than C++". Hardly a glowing endorsement.

Anyway, I digress. After a few false starts, I was able to code in Rust, compile to .wasm, and test within Phaser. I mean sure, the wasm file was adding 1.5 MB to the bundle (in dev mode), but that's ok if it means super-powers. One of the most expensive things you can do in Web Assembly is pass objects to and from it. This has got better, but it's still bloody expensive. Web Assembly, at its heart, accepts only numeric data types. You've got integers and floats in 32 and 64-bit variations and that's it. No strings, no objects, nothing. Just numbers. However, for my needs, that was ok. Because bitECS is entirely numeric, too, all I needed to do was operate on numbers. So far, so good.

Web Assembly also has a block of Shared Memory you can create. This memory is accessible from both wasm and JavaScript. All you need to do is create a function in your wasm code that returns a pointer to the shared memory:

This allows your JS code to operate on the exact same memory buffer that wasm is using. The importance of this is that you don't need to transfer any data to and from wasm, you can just grab a pointer to the buffer and fill it with data, call your wasm function to process that data and read the results back out of the buffer again. A perfect set-up, because both sides are operating on the exact same block.

I set about coding a Transform update function in Rust, which was really easy to do. Rust has some lovely features when it comes to working with numbers and buffers. If you're interested, here's the full code:

Even if you don't speak Rust, you can likely tell what is going on. It's some basic transform math, reading from one buffer and writing the results to another. It does this across all entities in sequence. I wasn't calling this function for every entity, I was calling it just once for optimal speed.

Keen to get profiling I built a test and cracked open Dev Tools. I mean, all of those "Learn WASM" examples out there had shown a massive increase in speed when doing something real-world like generating a Fibonacci sequence (ok, ok, I'll stop now), so surely it'd do well for me?

I'm sure you can anticipate the results already. It was slower. During profiling, it was especially slow where Rust was calling f32::cos and f32::sin. I was beyond confused, so I posted my code to a Rust forum and asked for feedback and all I got was that my approach was sound and, in this case, V8 was basically producing compiled code that outperformed compiled Rust.

I was devastated. I mean I didn't *want* to have to code parts of Phaser in Rust because it introduced so much more complexity in testing and bundling. But I could have lived with that had it increased the speed! To see that it couldn't even match JS was a real eye-opener. And after lots of hard work, somewhat upsetting. It reinforces my belief that wasm is only any good for games if the vast majority of your game code is running inside of it, or your app isn't trying to do anything in real-time and is happy to sit there managing the UI while wasm does some heavy lifting in the background (think video encoding or data manipulation).

Ultimately, I'm glad I spent the time doing this otherwise I would have been left with the lingering feeling that, just maybe, things could be a lot quicker if only I embraced wasm. Now I know, for this use-case, it won't make any difference. It was, at least, cathartic. You can find all of my Rust code, wasm files, and more in the repo if you dig hard enough, but I'm not linking them here as that ship has sailed.

After all this, I felt like I was right back to square one. I had a nice linked-list implementation, yet still had no idea how to make it quicker.

If at first, you don't succeed ...

I'm a notebook kind of person. I don't mean in the computing sense. I mean in the physical paper-bound sense. I carry a notebook everywhere with me and am constantly writing in it. I'll plan out features and code, scribble down To-Do lists, record metrics, all kinds of things. I can't live without them and have Phaser notebooks going back years and years, full of ideas. Some implemented, some not.

The more entrenched in a problem I get, the more pages I fill up. Diagrams scrawled next to things to try, concepts crossed out when they fail. It's just how my brain works. I like to get a rough idea on paper before turning to code.

While going back through my notebook I came across something I had jotted down. It was a picture of a linked list and written below it, underlined several times, was the phrase: "Transform Order and Display Order are NOT the same things!". This got me thinking that maybe the issue was simply how I was traversing my list, rather than the list itself.

I came up with a plan that worked by starting at the root (a World Game Object). It then jumped to the first of its children and processed it. Processing involved checking all of those 5 steps I mentioned above (dirty transform, color, etc). If that child also had children of its own, it would jump to the first one of those. That would be processed too. It carried on down the list until it reached the very bottom. At each step of the way processing the nodes it came across because all a child needs is to know that all of its parents are up to date.

Because I'm using a tree of linked lists, when I got to the bottom of a branch it was easy to jump to the next sibling across. At this stage, the process repeated. If this sibling had children of its own, we looped and dived down. If not, we carried on moving across. Each time we met a parent, it was added to a stack. Each time we hit the bottom of a branch, we pop the parent off the stack and continue iterating through its siblings.

This is very hard to visualize with words alone, and actually quite tricky to draw too, but I've tried my best in the following diagram. Here I've taken the layout from before and added lines to show how the flow operates under this new system:

What is important is that with this concept we traverse the list in skippable transform order. When we reach a child, we resolutely guarantee that all of their parents have already been fully updated. But most importantly, we can also quickly bailout. If we come to a parent and it doesn't have any dirty children, and the parent itself isn't dirty, we just pass right over it to the next sibling on the list.

This works at any level, at any depth. A whole section of the tree can be entirely skipped if required, without needing to iterate down into it in order to check its status or find out what the 'next' node is. This is because parents know if any of their children are dirty.

If nothing in the entire World has updated this frame, we don't process a thing. We can simply skip the entire lot.

It took me a while to refine and hone the code required for this approach, but once I had, the results spoke volumes and I was seeing huge performance increases. Numbers speak louder than words:

In a test with 10,000 Updating Sprites (all moving and animating every frame) and 266,242 static Sprites in a Layer:

On my Core i7-3770K running Windows 10 (GeForce GTX 1660), using the old Linked List approach it was taking 5.1 ms to run the pre-render step. Using the new method described above, which I called Branch Search, it was completed in 0.59 ms. A staggering 764% improvement! Buoyed by this result I tried increasing the counts.

Now with 25,000 Updating Sprites (and still a quarter of a million static ones), the old system took 6.7 ms to run the pre-render step. The new approach? 1.6 ms. With 50,000 Updating Sprites, the old one ran at 8.7 ms. The new system is just 3.4 ms. At 100,000 Updating Sprites, the old system was taking 13.59 ms which is almost all of our fps budget spent on processing alone! The new one took 6.8 ms.

The difference on my iMac M1 was just as pronounced. With 266,242 static sprites and 10,000 Updating ones, it went from 4.89 ms down to just 0.5 ms. At 50,000 Updating sprites, 6 ms down to 2.5 ms. At 100,000, 7.8 ms down to 4.5 ms. Because of the way the search works, with its quick bail-out feature, adding more static sprites barely impacted anything.

I upped the test so it had 649,042 static Sprites and 50,000 fully Updating Sprites and it ran the pre-render in just 2.3 ms and was a buttery smooth 60 fps. I increased it yet again, this time to 1,114,962 Sprites, again with 50,000 of those fully updating, and it was still just 2.09 ms time spent processing.

Finally, finally, after weeks of hard work and soul searching, I felt validated :)

This wasn't just faster. It was orders of magnitude faster. All that hard work had paid off.

Below is the final routine in its refined and simple glory:

You can view it here in the repo.

This can cope with complex display lists, of any depth, with millions of entries, in mere milliseconds. Leaving you plenty of time to deal with you know the important stuff, like the actual game logic (and rendering, of course!).

Having completed this work I felt empowered and knew I could move onto the next stage: implementing a Spatial Hash Grid.

Grid Runner

The concept of a spatial hash grid is quite simple and well covered elsewhere, so I'm not going to go into any depth. If you want to learn more, read this excellent article on it. Suffice to say, the idea is that you divide your world into segments and your Game Objects are inserted into whichever segment their bounds falls within:

When it comes to drawing, the renderer can say "Hey! My Camera is here and covers this area. What have you got that intersects with this?" and the Spatial Grid will do a simple rectangle check, get back a list of segments and return the Game Objects they contain. This partitioning can dramatically reduce the cost of iteration because in a densely populated world instead of having to check every Sprite to see if it intersects with the Camera, you only need to check those the grid returns.

Spatial Hash Grids are best used for static Sprites, such as background scenery or tiles. Although the Sprites can move, the cost of their movement if done en-masse can be quite heavy, as they have to constantly recalculate which grid segment they are in. For that reason, a spatial grid is something you should only deploy when you know it's going to be a benefit.

In Phaser 4 they are their own type of Game Object to which you can add children. I don't use them internally by default, because for some games they could have a detrimental performance effect. But, for the right game, the improvement can be phenomenal.

Here is how you create and populate one:

You don't need to do anything more than this. Internally, the Spatial Grid Layer will handle the rest. Just add children to it like any other Container or Layer. As I said, you can manipulate those children if needed, but you should try to keep it to a minimum. Changing their texture frame doesn't cost anything, however, so feel free to put animated Sprites in here. Just keep their transformations to a minimum.

You can also define the size of the grid segments. In the example above they're 256 x 256. You've also the ability to control if the children should have their 'update' method called, or not. Finally, and most importantly, they can have their own sorting callback:

When the renderer asks the Spatial Grid Layer for a list of children, the resulting array is sorted based on your callback first. This lets you have fine-grained control over the display order. The example above sorts the sprites based on their y position, allowing them to be depth sorted.

By default, the Spatial Grid in Phaser 4 will always return the Game Objects in the exact same order in which you added them to it. This order is maintained and explicitly used. Most spatial grid implementations you find on npm have no concept of the insertion order, and will often return the contents in whatever order they feel like (even changing every frame!). This may work well for a mapping application, but it's no good for games where display order is essential. So this is a feature I've implemented, but you can still override it with your own sort function.

Want to see this in action? Sure you do :)

Click here to run this example. It takes a while to start due to the ecs preallocation.

This is for desktop only and it'll consume a whole bunch of RAM, but you should see the results in action. Use the cursors to scroll around the map and click to add new land to it.

Every time you click it adds more and more land, so after a while, the process of adding the land itself will make the frame rate dip! It also generates a bunch of new moving fireballs with each click, too. On my PC the fps will dip during the generation after around half a million sprites. On my Mac, after about 900,000. Once the land is generated, the fps returns back to normal.

The demo is hard-capped at 2 million Sprites but see how far you can go :) Have a look to see what fps rate you get while scrolling around a world with thousands upon thousands of sprites in it. I think you'll be pleasantly surprised. If you open Dev Tools and type in 'game.renderStats' you'll see a breakdown of what that frame cost to generate. You can also click the FPS meter in the top-left to cycle through the different stats.

I'm very happy to have this feature available. It's not enforced upon you, you don't have to use it, yet lots of games will benefit massively from it existing.

The Power of Containers

Lots of Sprites is always nice, but how about something a little different? The following is an example of an animation created using different texture components, with a sprite for each part, joined together and tweened. It looks like this:

It's best seen moving, so have a look at the demo.

Pretty neat, huh?

The code is no more complicated than this:

I then apply a few tweens to the joints to make it all move in a ghostly manner. It's very similar to how an application like Spine works, just a little more manual in the approach. Perfect for more interesting animated characters in your games without the need for 3rd party apps or libs.

Compressed Texture Support

Performance comes in many guises. As well as all of the insanely hard work detailed above, I've also implemented comprehensive support for Compressed Textures. Phaser 4 can now load and render the following types: ETC, ETC1, ATC, ASTC, BPTC, RGTC, PVRTC, S3TC, and S3TCRB across both KTX and PVR formats, with a fall back to true color (PNG, JPG) as needed.

All you need to do is load them, much like you would a standard image:

Phaser will query the browser to see which formats it supports and go through the object you've given it, in order, and load the first one supported. Just to make things tricky, much like audio, different platforms support different formats. ASTC is perfect for iOS and will work in Mobile Safari. It's the most flexible and modern format with support for alpha channels. The other formats differ based on GPU and OS.

Creating compressed textures can be a bit of a chore, but it's possible to integrate them into your asset workflow with some patience and experimentation. It's also really important for mobile performance and can have a significant impact on rendering speed and GPU memory consumption. In short, use them where you can.

ColorMatrix Resurrections

For many years now, Phaser has given you the ability to tint a Game Object. This was a cheap effect, handled in the shader, that applied an additive tint effect to the texture colors during rendering. It was fast to process, didn't cost much in terms of buffer size or speed, and was quite handy for quick visual effects.

It was also massively restricted in what it could do. Want to tint a Sprite white? Nope, not happening, due to the additive nature of the shader. Want to make it quickly flash red when you get damaged? Again, not easy, because you have to apply a tween to the color values and hope they blend properly with your base texture.

This was something I wanted to address and I wanted it to be built right into the core of Phaser 4. Thus, the new Color Matrix component was born. All Game Objects from the Container and up (so Sprites, Text, etc) have the color component by default, accessed via the .color property.

The core component is very light. The Color Component has a few handy properties, such as the ability to set the red, green, blue, and alpha channels on their own, or set a global tint. You can also control if the color should influence children, or not. Where it gets really powerful, however, is by using some of the included color effect functions, or creating your own.

As with everything in Phaser 4 these functions are entirely independent. If you don't use one, it won't be bundled into your build. This allows me to offer you loads of really nice effects without worrying about the impact on your game, because it'll only include what _you_ choose, not what I choose.

There are 21 effects as of today, including the following: BlackWhite, Brightness, Brown, ColorTone, Contrast, Desaturate, DesaturateLuminance, Grayscale, Hue, Kodachrome, LSD, Negative, Night, Polaroid, Predator, Saturate, Sepia, ShiftToBGR, Technicolor, Tint, and VintagePinhole. Have a look at the following demos to see them in action:

Matrix Effects Demo 1

Matrix Effects 2 Demo

Matrix Effects 3 Demo

Matrix Effects 4 Demo

Matrix Effects 5 Demo

Matrix Effects 6 Demo

Matrix Effects 7 Demo

Using a color effect is really easy. Just import the one you want and call it:

Lots of the effects have optional parameters that control their intensity, such as Brightness in the code above. You can also create your own. They all work in the same way: by setting values into a color matrix and color offset array. Here is the Contrast effect to show what the code looks like:

You can also apply color effects to a World and have it influence all children. Which is an easy way to perform effects such as making a World appear sepia-toned or grayscale during a cut-scene, or to tint it or fade it out. Have a look at the following demo to see this in action:

Click here to view the World Color Demo

Using the Color Component you can now truly tint Sprites to any color, so flashes of white or red are now easy, especially as you can control each color channel on its own.

The color values are all bundled into the WebGL buffers, so using them won't cause a batch flush or add any extra WebGL operations. They're all accounted for already.

I'm very happy to have this available built-in by default and that it's so easy to extend for your own effects. I'm sure you'll find many uses for it and I can't wait to see what :)

preRender and postRender Hooks

All Game Objects have two empty methods defined: preRender and postRender. As you can probably guess from the names they are invoked before the Game Object renders, and after it has been rendered (or more accurately, added to the render batch), respectively.

On the surface this doesn't sound like much, yet if you factor in all of the Direct Draw features Phaser 4 has, suddenly it opens up a whole new world of possibilities. Let me give you an example:

Let's say you wanted to give a Sprite a trail effect. Sure, you could start getting messy with shaders, or you could use the old-school method of having a series of sprites following it, each one slightly alpha'd a little, following in the wake of the main Sprite. This is easier to explain with a demo, so have a look at the link below:

Click here to run this demo

Quite pretty, right? The trail is made up of 96 images, drawn to the previous locations of the Sprite, each one with a diminishing alpha value. Sure, I could have created 96 Sprites, added them all to the display list, and kept track of them, updating them each frame. Or, you could use a preRender hook, like in the example above. Here's the code for it:

Most of this code is concerned with making sure the trail is managed and updated, but the core of it is the call to DrawImage, where we've passed in a texture and a position to draw it. Because this happens in PRE Render, it appears behind our Star Sprite. If we swap this to be postRenderGL instead, it will still work, but the trail will appear over the top of the Sprite.

There are many use-cases for this. From useful debugging, such as drawing Sprite Bounds, to special effects like above. If you wanted to emit a shower of sparks as something exploded, you could easily do it from here, without needing to create any Game Object instances or modify the display list at all.

Where do we stand today?

As you can see, I've been busy. Really busy. Not very communicative, but certainly not idle. While I really wish I could hire another developer, that isn't going to happen without an increase in funding, so I'm soldiering on solo for now.

I'm happy that Phaser 4 finally has the most solid fundamental core of any release ever created. It has some of the best code I've ever written in my life within it, and performance absolutely shines as a result, while still being extremely flexible and compact. I wanted no compromises. I wanted power, performance, and genuinely useful game-making features, and that is what I'm striving for and achieving.

I haven't even touched upon all of the changes I've made, although these are all the important ones by far. I still want to talk about the Render Layers, Effect Layers, and loads more! So I'll be sure to explore these in the coming Dev Logs.

As I'm writing this it's mid-September. I'm going to spend a couple more days tidying things up and then I need to return to Phaser 3. There are a bunch of pull requests to be merged, issues that should be looked at, and all of the compressed texture code I wrote for Phaser 4 will be ported to Phaser 3 for the forthcoming 3.60 release. Hopefully, this release won't take more than a couple of weeks, after which I will return to Phaser 4 development.

Thank you for all of your support everyone. I really mean that. And for those of you who are patrons, an extra special thanks to you. I know it can seem like a while between updates, but please never question my commitment to this, or to you. Phaser literally is my life and I'm giving it my all. Hopefully, you can see that.

Until the next Dev Log please feel free to catch me on Discord because sadly Patreon only allows comments if you are an actual patron, which is sucky of them. Discord is where it's at.

Cheers,

Rich