Wednesday, June 5, 2013

GEn3CIS: An HTML5 Based 3D Engine for Gaming and Simulations: Part 2

Game Engine for 3D Complex Interactive Simulations: Part 2

For background on the motivations behind GEn3CIS, check out Part 1 of this series...

We considered several architectures for GEn3CIS, each with its own strengths and weaknesses.  Our goal in this part of the series is to take you through our design decisions and logic and lay out the base architecture for GEn3CIS.  In the context of this article, a sub-system is any engine component such as Graphics, Physics, Artificial Intelligence, Networking, Input, Sound, etc.  Also, since the focus of the article is building a multi-threaded system in JavaScript, the terms thread and WebWorker are used synonymously.

Option 1: 
Run each Sub System in a thread

This is the current implementation used in JaHOVA OS, so if you read Part 1 of the series then you know this will not produce the results we are looking for, but it is still worth discussing.

On the surface this seems like the easiest architecture to implement.  While getting everything up and running is a bit on the trivial side, getting each sub-system to play nice with the others is a very different story.

Sub-System Setup

The major advantage of this architecture is that the sub-systems do not require any re-work.  You can take the same serially executed sub-systems and drop them into a multi-threaded environment.  This is the biggest draw to using this layout.  But as you probably already know, if you simplify one thing, you are probably complicating another.

Memory Syncing

One of the main drawbacks to this design is keeping the memory shared between the sub-systems synced.  While implementing this kind of system in a language such as C++ causes a dramatic increase in complexity, JavaScript limits access to memory between individual threads (WebWorkers).  Each WebWorker has its own memory space and is completely isolated from every other process.  This prevents having to implement any sort of Critical Sections or Mutex control.  The overall accessibility and communication model between WebWorkers actually mimics that of the Erlang programming language.  Henning Diedrich of Eonblast gave a great talk at GDC Online in Austin last year called "Why ... Erlang".  There are over 100 slides in his presentation, and can you believe he went through all of them in an hour?  Crazy!

Due to the way WebWorkers communicate, copies of world/entity data had to be sent to each sub-system.  The sub-system could then make any changes/updates to the dataset and return it to the main application to be synced to the "master copy".  The issue comes when two sub-systems want to modify the same dataset: which has priority?  In small systems this is not a huge issue, but as the complexity of the scene increases, keeping everything synced can become a real problem.

One way around this is to only allow a specific sub-system to update a specific dataset.  This prevents the syncing conflict, as no two sub-systems can update the same piece of data.  This was the approach taken with JaHOVA OS.
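That per-dataset ownership rule is easy to enforce with a map from dataset name to owning sub-system.  A small sketch of the idea (the dataset and sub-system names are illustrative, not JaHOVA OS internals):

```javascript
// Illustrative ownership table: only the listed sub-system may write a dataset.
const owners = {
  positions: 'physics',
  health:    'ai',
  volume:    'sound',
};

// Apply a worker's returned updates to the master copy, but only for
// datasets the requesting sub-system actually owns.
function syncToMaster(master, subSystem, updates) {
  for (const [dataset, value] of Object.entries(updates)) {
    if (owners[dataset] !== subSystem) {
      throw new Error(`${subSystem} may not write ${dataset}`);
    }
    master[dataset] = value;
  }
  return master;
}

const master = { positions: [], health: [] };
syncToMaster(master, 'physics', { positions: [{ x: 1 }] }); // allowed
// syncToMaster(master, 'ai', { positions: [] });           // would throw
```

The trade-off, of course, is exactly the inflexibility described above: a dataset can never be written by more than one sub-system.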


The second issue this implementation brings is expandability.  As of this moment, 4 and 8 core machines are becoming common, but 12, 16, and 32 core systems are on the horizon.

Wouldn't it be great if software actually utilized all of the power afforded it by the hardware?  

With this implementation you have no expandability: if you have 4 sub-systems, you use 4 threads.  It doesn't matter if the hardware only has 2 cores, you will run 4 threads.  Conversely, if the system has 8 cores, you still only use 4... I think we can do better.

Option 2: 
Run each sub-system serially, but thread the sub-system

Sub-System Setup

As you can probably guess, this is one of the biggest downfalls of this implementation.  It requires you to rewrite every serially driven sub-system into an equivalent parallel-processed system, which is not a trivial task.

Memory Syncing

The greatest advantage of this implementation is the removal of memory syncing issues.  Since each sub-system is executed serially, no two sub-systems will try to update a shared resource at the same time.  This completely removes any need for Critical Sections or Mutex controls in your code.  A major win!


Since each sub-system can generate as many threads as required for execution, this design should allow for expandability to utilize higher core systems efficiently.
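The scaling story comes down to splitting each sub-system's dataset into one chunk per worker, with the worker count sized to the hardware (browsers report the core count via navigator.hardwareConcurrency; the helper below takes the count as a parameter so the chunking logic stands on its own — a sketch, not engine code):

```javascript
// Split a flat entity array into one chunk per worker, sized to the core count.
function chunkForCores(entities, coreCount) {
  const size = Math.ceil(entities.length / coreCount);
  const chunks = [];
  for (let i = 0; i < entities.length; i += size) {
    chunks.push(entities.slice(i, i + size));
  }
  return chunks;
}

// On a 2-core machine the sub-system spawns 2 workers; on 8 cores, 8.
const entities = [1, 2, 3, 4, 5, 6, 7, 8];
console.log(chunkForCores(entities, 4)); // [[1,2],[3,4],[5,6],[7,8]]
console.log(chunkForCores(entities, 2)); // [[1,2,3,4],[5,6,7,8]]
```

The same code adapts itself to 2, 8, or 32 cores, which is precisely the expandability Option 1 lacked.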

Can We Do Better?

While Option 2 shows great promise and is actually a common architecture used in production in other languages, it still left a bit to be desired.

Sub-System Has Control

One of the main issues I have with Option 2 is that all control over optimization exists in the sub-system.  I suppose this is not an issue if you only plan on using your own sub-systems, but if you plan to allow others to add 3rd party sub-systems, you are giving quite a bit of control over to those 3rd party systems.  I would rather the control rest with the main engine (GEn3CIS) and have the sub-systems request execution from the engine.

Idle Time

The serial execution removed all resource syncing issues, but is it overkill?  If one sub-system is completely done executing on a portion of a dataset, why should a second sub-system have to wait for an unrelated, non-blocking process to complete before beginning execution?  I would prefer an architecture that could be simplified down to Option 2, but with the expandability to let non-blocked processes begin execution early.

Option 3:
Parallel executed sub-systems via Thread Controller with Thread Pool

This layout requires that each sub-system be designed for parallel execution, but it also requires a more Functional Programming approach than what is normally seen with regular Object Oriented (OO) design.  The idea of moving away from OO code design to Functional code design has its controversies... here is a good read from John Carmack of id Software on Functional Programming in C++.  Each sub-system must be able to create functional code blocks that can be executed over a dataset passed into the code block.  The functional code blocks are passed to the Thread Controller, which can then organize and execute the code blocks in any available thread inside the Thread Pool.  This moves the overall engine design from Object Oriented to Data Oriented.  Niklas Frykholm of BitSquid has an interesting presentation, "Practical Examples in Data Oriented Design", showing some of the advantages of Data Oriented engine design, although I don't think they will all translate over to a JavaScript based system... but maybe ASM.js will give some performance boost.
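A "functional code block" in this sense is just a pure function bundled with the dataset it operates over, plus enough metadata for the Thread Controller to reason about it.  The descriptor shape below is my own sketch of the idea, not the engine's actual format:

```javascript
// A job: a pure function, the slice of data it runs over, and the
// datasets it reads/writes so a controller can detect conflicts later.
function makeJob(name, fn, data, reads, writes) {
  return { name, fn, data, reads, writes };
}

// Example: a physics integration step expressed as a pure function
// over plain data -- no object methods, no shared state.
const integrate = (entities) =>
  entities.map((e) => ({ ...e, x: e.x + e.vx }));

const job = makeJob('physics.integrate',
                    integrate,
                    [{ x: 0, vx: 2 }],
                    ['positions', 'velocities'],
                    ['positions']);

console.log(job.fn(job.data)); // [{ x: 2, vx: 2 }]
```

Because the function carries no hidden state, the controller is free to run it on any thread in the pool.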

In its simplest form, the Thread Controller can collect the requests from each sub-system and execute them one system at a time, which gives the same functional execution as Option 2.  But since other systems have already queued requests with the Thread Controller, it now has the ability to begin executing requests from waiting sub-systems once they are no longer blocked by memory access.  This gives the control back to the engine (GEn3CIS) while also limiting idle time.

This design also allows for complete control over thread creation, and therefore can make sure that every sub-system takes complete advantage of all the cores available on the given hardware platform.  Intel created a paper on "Programming a Parallel Game Engine" that has some similarity to the design shown here as well.

So without further ado, I present the proposed GEn3CIS Architecture...

Few Concerns...

I have just gone off the Multi-threading Deep End in JavaScript... can WebWorkers really handle this?
How many WebWorkers can actually be run at once?
Is this just going to crash browsers?

At this point I feel these are all very valid questions.  I have done some background performance testing which, as mentioned in Part 1, was presented at various conferences.  Check the presentations out if you want to dive deeper, but I have a recap below trying to answer some of these concerns.

Have I just gone off the Multi-threading Deep End in JavaScript?

Ya, probably...

Can WebWorkers Really Handle This?

While that has yet to be seen, I can say that after doing quite a bit of testing, the overhead caused by using WebWorkers is minimal.  Data/message transmission latency is quite low (<1ms).

How many WebWorkers can actually be run at once?

While I don't have a specific answer to this question, I did try creating 10 threads on the fly and using each one to control a ball on the screen.  In this demo, all of the physics associated with each ball's movement is actually calculated in a thread, which sends the updated position back to the main application.  The main application then updates the ball's position (and shadow) on the screen.  You can actually run the demo live here.

Is this just going to crash browsers?
That is a very real possibility...

So we have shown the initial architectural design of how GEn3CIS is going to look, but how is it going to work?
That is Part 3 of the series...
