Monday, June 3, 2013

Pipelines as a generic framework for multi-threading

In this post I'm going to talk about a new abstraction (to the C2C codebase, that is) available in the form of some framework classes, intended to provide a fairly generic mechanism for a lot of our multi-threading needs.

Fundamentally, turn processing in C2C consists of doing a once-per-turn action for each game entity.  In some cases this may involve doing the same for 'owned subservient entities'.  So, deconstructing from the highest level down (taking a 10,000 foot view), it looks something like this:

For each plot
    process plot
For each player
    for each city
        process city
    for each unit
        process unit

Each entity processed (a city say) will normally have a number of aspects to be updated in its processing, so for example the line 'process city' in the above breaks down to something a bit like this:

process invasions
process damage against attacking adjacent units
process growth
process culture
process build queue
process religion spread
...

Most (though not all) of the actions involved in processing a city (or any entity) primarily modify that entity, but not its peers (other cities in this case).  That means that in principle we should be able to process different cities at the same time on multiple threads.

This is where pipelines come in.  The processing of an entity can be thought of as a set of independent (but serialized) steps.  Some of these steps have to be performed in a set order even relative to other (peer) entities, whereas others are entirely independent and have no constraints on how they are ordered.  For example, cities can process their culture spread independently - it makes no difference if city A performs its culture spread before or after city B.  On the other hand this does not (fully) apply to production - suppose city A and city B both want to choose what to build.  If one of them chooses a Wonder then that choice is no longer valid for the other, so if the order of processing of city A and city B were non-deterministic the results would be too (which one got to build the Wonder might vary).  This means that some aspects of the processing have to synchronize in some way to ensure a deterministic order of processing (or else multiplayer out-of-sync errors will result).

Imagine a production line, where each station performs one of the actions involved in processing an entity.  Taking the city example, and its associated actions, as listed above: station one processes invasions, station two processes damage against attacking units, and so on.  The city entity passes down this production line, being updated at each stage.

Now in a traditional, physical production line, we would speed things up by keeping all the stations working in parallel, feeding new city entities into the line as each one moves along a stage.  This is one model that can be used to deploy multiple threads (one thread per production line stage), but it doesn't work very well for us, because to work efficiently it requires all stages to move at the same 'speed'.  If one slows down it slows the entire production line.  In our case some actions (e.g. - choosing what to build in a city) are likely to be both much longer than others, and also potentially very variable in their execution time.

So another way to split the work up is to have lots of production lines operating in parallel, and a single entity progressing down each of them at their own rate.  For stages of the production line that don't need to synchronize, this allows each to keep working independently.  However, as pointed out above, some actions need synchronization (the building of a Wonder being one example).  In effect this introduces 'gates' on the production line such that no entity can progress past a 'gate' until ALL entities (on the parallel production lines) have caught up.

At first sight this requirement to synchronize means idle workers!  But we get round this by reducing our headcount!  Instead of having one employee run each production line we have a ticketing system like you sometimes see in grocery stores or post offices.  Each time an entity progresses to a new stage of our production line, it posts a ticket saying 'hey guys - I'm waiting for one of you to come and work on me'.  Workers take tickets in a priority order (earliest production line stage first, within that a fixed ordering of production lines) and man whatever station the ticket refers to, performing whatever changes are needed to move the entity through that step.  This means that (provided we have significantly more production lines than workers) most of the time when one line is waiting for others to catch up there will still be work to do on those others, and our workers won't spend all their time taking tea breaks.

The class framework

Ok, enough hokey analogies.  The framework (new source files, CvPipeline.cpp and CvPipeline.h) implements a generic pipeline model as below (public and overridable API only shown). Note - there will be a concrete example later!

Work items

CvPipelineWorkItem is an abstract class, representing an entity to be worked on (e.g. - a city in a city processing pipeline).  Its role is to hold state information about the entity being worked on as it progresses through the pipeline.  Here is the API:
 
typedef enum
{
 WORKITEM_STATE_NONE = -1,
 WORKITEM_STATE_QUEUED,
 WORKITEM_STATE_PROCESSING,
} PipelineWorkItemState;
 
// A pipeline work item - this records the state of the work item as
// it flows through the pipeline.  Appropriate subclasses should be
// defined for concrete items, sufficient to record their state
// changes as they flow through the pipeline stages
class CvPipelineWorkItem
{
public:
 CvPipelineWorkItem();
 virtual ~CvPipelineWorkItem();

 // Get the priority of this item (determines dequeue order
 // at each stage and may return different values at
 // different stages if necessary).  Must be deterministic
 // based on item state however to avoid OOS issues.
 // In cases where requeuing may result in different states
 // the priority used should be related to the item
 // identifier (e.g. - city id) to ensure deterministic
 // processing order in synchronous stages
 virtual int GetPriority(void) const = 0;

 // Requeue the workitem to the last non-synchronous stage
 // not yet completed
 // This is intended for use with optimistic locking systems
 // to handle a lock failure down-stream of the pipeline
 // stage that took the lock
 void Requeue(void);

 // Abandon a work item
 void Abandon(void);

 // Note that processing has completed (for the current
 // pipeline stage) on this work item
 void Complete(int iKey);

 // Get a queuing key - this should be queried before
 // processing and passed to Complete() on completion
 int GetKey(void) const;

 // Initialize the seed to use in the item's sync rand
 void SetRandSeed(unsigned long ulSeed);

 // A synchronous random number, which is OOS safe even in
 // multi-threaded pipeline usage.  Always use this in
 // pipeline stage processing rather than the regular
 // synced random number generator
 int GetSyncRand(int iNum, const char* pszLog);

 // Get the current workflow state of this item
 PipelineWorkItemState GetState(void) const;
}; 

You just need to override GetPriority() to return some deterministic ordering between entities (e.g. - city id).  You may also need to add state variables to record whatever specific state is important for your usage.
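
For illustration, a minimal work item subclass for the city case might look something like this.  This is a sketch only - the real CvCityTurnPipelineWorkItem used later carries more state (e.g. - the production choice made during the choose stage), and CvCity::getID() is assumed as the ordering key:

// Sketch of a minimal work item subclass for city processing.  City id
// provides a deterministic priority, so synchronous stages dequeue
// cities in the same order on all machines
class CvCityTurnPipelineWorkItem : public CvPipelineWorkItem
{
public:
    explicit CvCityTurnPipelineWorkItem(CvCity* pCity) : m_pCity(pCity) {}
    virtual ~CvCityTurnPipelineWorkItem() {}

    virtual int GetPriority(void) const { return m_pCity->getID(); }

    CvCity* getCity(void) const { return m_pCity; }

private:
    CvCity* m_pCity;
};

Stage code should also use the item's GetSyncRand() for any randomized decisions, which keeps the random stream per-entity and hence OOS safe.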

Pipeline Stages

The entities represented by the work items will flow through a number of pipeline stages (the stations on our production line).  Each stage is represented by an instance of the CvPipelineStage class (or more specifically one derived from it for a particular usage):

// Class representing a processing stage in a pipeline.  A stage
// may be synchronous or asynchronous.  A synchronous stage will
// not begin executing until all workflow items have been
// completely processed up to the end of the previous stage,
// whereas an asynchronous one will process items as soon as
// they are queued
// A stage may have multiple threads allocated to its processing
// if desired
// Define appropriate subclasses of this class to provide the
// processing
class CvPipelineStage
{
public:
 CvPipelineStage(const char* name,
                 bool bSynchronous,
                 int iMaxThreads);

 virtual ~CvPipelineStage();

 // Queue a work item into the stage
 void EnqueueWorkItem(CvPipelineWorkItem* item);

 // Process a work item
 virtual void ProcessWorkItem(CvPipelineWorkItem* item) = 0;
};

In this case you just need to override ProcessWorkItem() to provide code to actually do whatever work this stage is supposed to be responsible for.

Pipelines

A pipeline consists of a number of stages, so the API basically just lets you string stages together and then run the resulting pipeline.

// A workflow pipeline that processes work items through
// multiple pipeline processing stages
class CvPipeline
{
public:
 CvPipeline();
 virtual ~CvPipeline();

 // Add a stage to the pipeline.  Must occur before the
 // pipeline's Begin() is called
 // Note - a stage is assumed to be owned by the pipeline
 // once added, and will be deleted by it when it is finished
 // with it.
 // FUTURE - encapsulate better within a stage factory
 void AddStage(CvPipelineStage* stage);

 // Queue a work item to be processed by the pipeline
 void EnqueueWorkItem(CvPipelineWorkItem* item);

 // Commence processing of queued work items
 void Begin(void);

 // Complete processing - will synchronously wait for all
 // work items to complete
 void End(void);
};

Concrete Example

As an initial use, I decided to multi-thread enable the choice of what to produce in cities, since this is a reasonably substantial block of time, that is largely independent between cities.  The fact that it both needs to consider multi-player synchronization issues, and is not entirely independent between cities (e.g. - the Wonder example cited earlier) makes it a good proof-of-concept to implement first.

Processing a city can be split into three conceptual stages as a first cut:
  1. Do some stuff before considering production
  2. Do the production
  3. Do the stuff that comes after production
This is just a way to slice up the existing code in CvCity::doTurn() into three parts so that we can attempt to parallelize the middle one.  So to begin with we'll need pipeline stages for the 'stuff' in (1) and (3) - this is completely mechanical - we can literally just cut&paste the code from the original monolithic CvCity::doTurn into three separate methods:
  • CvCity::doTurnPreProduction
  • CvCity::doTurnProduction
  • CvCity::doTurnPostProduction
The first and last of these we don't care about the contents of - they are just whatever CvCity::doTurn used to do on each side of the production processing.  Each becomes a pipeline stage, wherein the override of CvPipelineStage::ProcessWorkItem() is a call to one of the methods above.  Because we are not (currently) trying to parallelize those pieces, they will just be synchronous single-threaded stages.  For instance:

class CvCityTurnPreProductionPipelineStage : public CvPipelineStage
{
public:
    CvCityTurnPreProductionPipelineStage(const char* name) :
     CvPipelineStage(name, true, 1)
    {
    }
    virtual ~CvCityTurnPreProductionPipelineStage()
    {
    }

    virtual void ProcessWorkItem(CvPipelineWorkItem* item)
    {
        CvCityTurnPipelineWorkItem* cityItem = (CvCityTurnPipelineWorkItem*)item;
        int iKey = cityItem->GetKey();

        cityItem->getCity()->doTurnPreProduction();
        cityItem->Complete(iKey);
    }
};

The middle part is what we want to run across several threads, so clearly it has to involve an asynchronous stage, and since it's the choice of what to produce we are seeking to split across threads, the asynchronous stage must involve that.  However, actually making a choice (and adding it to the build queue) affects what is a valid choice in other cities (national units, wonders, etc.).  Furthermore the act of building something will influence the desirability of other things (potentially in other cities too), so we cannot actually enact the builds until everyone has made their initial choices.  This means we need to split production up into several sub-stages.  The actual list of sub-stages and their purposes is as follows:
  1. Enact current production - process the existing production queue and build whatever is at its head if we have enough production to do so.
  2. Current production sync point - this is a pseudo stage that is actually a no-op, but because it is tagged as synchronous no cities can get past this point until every city has completed the previous (current production) stage.  This ensures that the basis for choices (the current state of the empire) is consistent regardless of processing order.
  3. Choose production - this is the key asynchronous stage, which means all cities partake of it at once.  The choice they make is stored in the work item state for each city - it is NOT enacted in this stage, because to do so would impact other cities' choices, and introduce non-determinism that would result in multi-player OOS errors
  4. Enact production - this is a single-threaded synchronous stage, which means cities process through it one at a time in priority order, only after every city has made its initial choice.  At this point we may be in a situation where cities have made mutually incompatible choices (two have decided to go for the same Wonder, say).  It is in this stage that we detect and handle this.  Because this stage processes cities one at a time strictly in priority order, it is deterministic which of the cities trying to build the hypothetical Wonder will process first (and succeed).  Assuming the choice is legal (we can push it onto the build queue) this stage then continues by processing the head item of the build queue - if it has sufficient production to complete it then the order is popped from the build queue and production adjusted.  Note that the popping of the order from the build queue merely adds whatever is to be built to a pending completion list for the city - it doesn't actually process its effect yet.  I'll come back to why this is necessary shortly.
    If we could not add the choice we had made to the build queue (second city trying to build the wonder in our canonical example) then we call CvPipelineWorkItem::Requeue() - this causes what is known as a 'pipeline stall', which means that once processing of the current stage finishes, the overall pipeline processing will go back to the last asynchronous stage rather than forward to the next stage.  It also requeues (hence the name) the work item for the city that could not enact its build back on the previous asynchronous stage (choosing what to produce) so that it can choose again and get its second preference.  Because the choice stage is asynchronous, work on this will commence immediately, even while the remainder of the cities are still progressing through the next (enact) stage - it is this parallelism that means the enact stage cannot actually process the effects of the build (that would introduce non-determinism into the requeued choice processing).  This flow is sketched in code just after this list.
    The enact stage also handles multiple production.  With multiple production enabled, if a build completes we also requeue back to the choice stage unless the build queue already contains further items (in which case we drain it first).  This is exactly the same requeue mechanism that is triggered by inability to enact a choice - it's just a way of getting a new choice made, whether because we exhausted the build queue in multiple production, or because a higher priority city pre-empted our choice before we got to enact it.
  5. Complete production - currently a synchronous single-threaded stage, but it can probably be split into two and the expensive part made asynchronous in future - this stage actually processes the effects of the production that occurred (the building effects, instantiating units, etc.)
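
To make the requeue mechanics concrete, here is a sketch of the shape the enact-new-production stage's processing takes.  This is illustrative only - getChoice(), pushNewChoice() and processBuildQueueHead() are hypothetical placeholder names rather than the actual code:

virtual void ProcessWorkItem(CvPipelineWorkItem* item)
{
    CvCityTurnPipelineWorkItem* cityItem = (CvCityTurnPipelineWorkItem*)item;
    int iKey = cityItem->GetKey();

    // Try to enact the choice made in the asynchronous choose stage
    if (!cityItem->getCity()->pushNewChoice(cityItem->getChoice()))
    {
        // A higher-priority city pre-empted this choice (the Wonder
        // case) - stall the pipeline and go back to choose again
        cityItem->Requeue();
    }
    else
    {
        // Process the head of the build queue; completed orders only
        // go onto a pending list here - their effects are processed
        // in the later completion stage
        cityItem->getCity()->processBuildQueueHead();
        cityItem->Complete(iKey);
    }
}
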
The city processing code in CvPlayer::doTurn used to be this simple loop:

 for (pLoopCity = firstCity(&iLoop); pLoopCity != NULL; pLoopCity = nextCity(&iLoop))
 {
  pLoopCity->doTurn();
 }

Turning this into a partially parallelized pipeline yields this code (still fairly simple, if a little more verbose):

 CvPipeline cityProcessingPipeline;
 CvPipelineStage* preProductionStage = new CvCityTurnPreProductionPipelineStage("PreProduction");
 CvPipelineStage* productionEnactCurrentStage = new CvCityTurnEnactCurrentProductionPipelineStage("EnactCurrentProduction");
 CvPipelineSyncPointStage* currentProductionSyncPoint = new CvPipelineSyncPointStage("CurrentProductionSyncPoint");
 CvPipelineStage* productionChooseStage = new CvCityTurnChooseProductionPipelineStage("ChooseProduction");
 CvPipelineStage* productionEnactNewStage = new CvCityTurnEnactNewProductionPipelineStage("EnactNewProduction");
 CvPipelineStage* productionCompletionStage = new CvCityTurnCompleteProductionPipelineStage("CompleteProduction");
 CvPipelineStage* postProductionStage = new CvCityTurnPostProductionPipelineStage("PostProduction");

 cityProcessingPipeline.AddStage(preProductionStage);
 cityProcessingPipeline.AddStage(productionEnactCurrentStage);
 cityProcessingPipeline.AddStage(currentProductionSyncPoint);
 cityProcessingPipeline.AddStage(productionChooseStage);
 cityProcessingPipeline.AddStage(productionEnactNewStage);
 cityProcessingPipeline.AddStage(productionCompletionStage);
 cityProcessingPipeline.AddStage(postProductionStage);

 cityProcessingPipeline.Begin();

 for (pLoopCity = firstCity(&iLoop); pLoopCity != NULL; pLoopCity = nextCity(&iLoop))
 {
  cityProcessingPipeline.EnqueueWorkItem(new CvCityTurnPipelineWorkItem(pLoopCity));
 }

 cityProcessingPipeline.End();

Multi-Player Synchronization

IP-connected multi-player mode in Civ IV relies on the AIs on all machines making the same decisions when the AI turns are processed.  Any difference in decision taken (by AIs, automation, or governors) will result in OOS errors.  Since many decisions are made with a pseudo-random element (in order to provide varying experiences) this means:
  • The same set of random numbers must be used for the same decision points on all machines
  • The same state must exist at the point a decision is taken in regard to all factors that impact that decision
These constraints make the non-deterministic processing order between threads problematic.  The CvPipeline framework tackles this in two ways:
  1. By defining certain stages to be synchronous we can set up 'gates' at which all entities synchronize.  In the city example the split of pipeline stages ensures that the state present when build choices are made is always consistent, regardless of processing order between threads in the asynchronous stages.  This is due to:
    1. The use of a synchronous stage between enactment of current production and new production choice (ensures the state in which all current production is complete is present before any choices are made)
    2. The separation of processing the build queue from processing the effects of actually building the resulting buildings/units into two synchronous stages (ensures that requeued choices never see the results of other enacted choices that might be processing at the same time)
  2. By defining a random number API against a work item, rather than using a global one.  This ensures that each city (in this case) generates its own deterministic random number stream, and hence is not impacted by random number rolls in the processing of other cities that may be happening in parallel

Miscellaneous Notes for modders and adventurous users

Global Defines

The number of threads the choice stage of the city processing pipeline uses is defined by a global define in 'A_New_Dawn_GlobalDefines.xml' called 'NUM_CITY_PIPELINE_THREADS'.  This is currently set rather conservatively, to 2.  Setting this to 1 will ensure single-threaded operation.  I'm currently running with it set to 4.

Helper classes

In addition to the 3 framework classes, CvPipeline.h also defines a concrete helper class called 'CvPipelineSyncPointStage'.  This is just a no-op synchronous single-threaded stage that can be inserted to provide synchronization checkpoints.
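
Given the framework described above, its implementation is presumably little more than the following sketch:

class CvPipelineSyncPointStage : public CvPipelineStage
{
public:
    // A synchronous single-threaded stage with no work of its own
    CvPipelineSyncPointStage(const char* name) :
        CvPipelineStage(name, true, 1)
    {
    }

    virtual void ProcessWorkItem(CvPipelineWorkItem* item)
    {
        // No-op - the stage's only effect is the barrier implied by
        // being synchronous (nothing passes until all items have
        // completed every earlier stage)
        item->Complete(item->GetKey());
    }
};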

Cautions

The big gotchas in using this framework are:
  • Designing the necessary synchronization points so that the asynchronous bits can be OOS safe - this is probably what needs the most up-front thought in any new usage.  The best advice is probably to ensure that asynchronous stages themselves do not change the persistent state of anything - mostly they should be performing expensive calculations necessary to decide what changes to make, and leave the actual making of those changes to down-stream synchronous stages.  That, coupled with appropriate assignment of work item priority and pipeline stalling via requeueing when necessary, should enable the result to be OOS safe.
  • Making sure that the stage processing function for any asynchronous stages is really thread-safe!  That includes everything it calls.  You'll probably find yourself having to dot critical sections around as you go, but the main piece of advice I can give here is to keep the asynchronous bit limited to fairly isolated code that ideally only operates on the entity represented by the work item it is processing (anything else it modifies will require locks as well as consideration of potential OOS effects)

Future development

I'm planning to extend the use of asynchronous stages to other elements of the city production pipeline, such as:
  • Processing added/removed buildings (after all decisions about WHAT is going to be built have been made, actually performing those changes in a non-deterministic order should be fine)
  • Worked tile choice
  • Maybe other things (like culture spread), if any look expensive enough on profiles to justify it
I'm also planning to make use of CvPipeline in some UnitAI processing.  Many of the unitAI routines are of the form:

for each plot within a certain range
  evaluate a cost function for performing an action at the plot
  if the action has some value evaluate a path to it for the unit
    if it is possible to path there
      normalize the cost by how long it will take to get there
      if the normalized cost is the best so far
        note this plot as the best target found
if we found a best target plot at all
  queue up orders to move there and enact a mission

In a few cases both the cost function evaluation and the pathing are expensive operations.  In such cases recoding this as a two stage pipeline (cost and path) with the costing stage running asynchronously could be beneficial.  Profiles suggest CvUnitAI::AI_attackCityMove() might be a good candidate.

Another potential use is in tender finalization, which basically asks every city to evaluate a cost function for each unit tender request, in order to decide where it is best to get them built.  City tender processing can almost certainly be done as an asynchronous stage in this.

Sunday, June 2, 2013

Multi-threading support (infrastructure)

Recently, I have been looking into multi-thread enabling some of the time-consuming elements of the end-turn processing.

My initial focus was on path generation, which turned out to be an interesting exercise: I was able to increase performance by 50% or so through general optimizations that were unrelated to threading, but were illuminated by recasting the algorithm to support it.  However, ultimately multi-threading at the level of individual paths (and probably individual unit groups) was not very successful, and in the end I didn't push these changes to the main codebase, because the gains were too small to justify both the complexity and the potential for harming performance by loading more cores, and thereby reducing turbo-mode availability to the primary thread.

Subsequent to that I started looking at the processing of cities in turn-end processing, which proved a much more fruitful area.

The reason city processing is a better candidate than (individual) path generation basically comes down to granularity.  If the work you want to split up (across threads) only takes a small amount of time anyway, the latency effects of thread context switches (background threads won't get going on a problem until they first get a time-slice) quickly become more significant, and in the case of path generation it was simply the case that the total time to generate a path is comparable with the time-slice latency, which makes it much harder to achieve real gains by attempting to split the processing up.  City end turn processing, on the other hand, takes an appreciable fraction of a second, and processing all of a player's cities in the turn end typically takes many seconds.

Anyway, the approach I've taken to multi-threading aspects of the city turn processing will be the subject of my next blog entry, since it involves a re-usable generic pattern that can be applied to many areas.  On the way to that I wanted to stop off and talk about profiling a bit, and multi-threaded profiling in particular.

Profiling Basics

Some time ago (one of the first things I did on joining the C2C team a couple of years ago), I rewrote the profiler that came with the Firaxis code, so I could get more useful output from it to do performance analysis.  A brief digression on what the profiler does, and some basics as to how it works are in order, before we get to the impact of multi-threading on it.  I'll start off pretty basic here for anyone that doesn't have much background in the area:

What is profiling?

Profiling is a means of measuring, ideally in some detail, how long different parts of a program take to execute.

Why do we want to profile?

It's not generally possible to optimize effectively without knowing what is taking all the time in the first place.  For example, it's easy to spot a piece of inefficient code that can be improved, but if it's only executed a small number of times and takes a small percentage of overall execution time, it doesn't matter how much you optimize it - you'll never have much effect on the overall time taken. As such we need to know where the time is spent in order to know what to focus our optimization efforts on.

So what specific things does a profiler measure?

This will vary from profiler to profiler, but for the purposes of our discussion we want to measure the time taken by individual methods and functions in the code.  We might also want to know things like:
  • How many times is a function executed?
  • What is its average execution time?
  • What is the total execution time for all calls to the method/function?
  • Where is it being called from?

How do we tell the profiler which functions to measure?

Again this will vary depending on the profiler and the environment.  In some environments, a profiler can identify functions and allow the user to select which to profile dynamically, without impacting the code being profiled at all.  The C++ environment of the Civ4 DLL is not one of these!  Instead we rely on the programmer placing profiling macros in the code, usually at function starts.  In the version we release to the end users (the 'Final Release' build in C2C nomenclature) these macros will expand to nothing at all (so they have no impact or function in that build), but in a profilable build (the 'Release' build in C2C) they expand into function calls to some profiling code, which I'll be talking more about later in this blog entry.
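
By way of illustration, the build-dependent expansion is conceptually along these lines.  This is a sketch - the guard name (FINAL_RELEASE here) and the macro internals in the actual codebase may differ, and the real expansion is shown later in this post:

#ifdef FINAL_RELEASE
// End-user builds: the macro compiles away to nothing at all
#define PROFILE_FUNC()
#else
// Profilable builds: declare the aggregate sample for this function
// and a scoped measurement object for this invocation
#define PROFILE_FUNC() \
    static ProfileSample __sample; \
    CProfileScope __scope(&__sample)
#endif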

How do we view the results of profiling?

Profilers may have their own UI, or may output to files.  C2C's profiler outputs tab-separated records for each profile point to a special profile log file, which can then be loaded into a spreadsheet for easy manipulation (sorting by time, for example).  The following is an example: the profiler output from the latest build, while it executes the infamous 'Talin 6 minute turn' which I've been using as a test case for the past month or so (it doesn't take 6 minutes any longer, as you can see):

[Screenshot: profiler output loaded into a spreadsheet, sorted by the 'Time' column]

This output has been sorted by 'Time', so it shows the things taking longest first.  'root' is a pseudo-entry that means 'the entirety of everything that got profiled', which in this case is effectively a synonym for 'CvGame::update()', the entry point for the entire end-turn calculation.  'PipelineThread' is also a pseudo-entry, which I'll have more to say about later (in the next blog entry in fact), but everything after that corresponds to a function, or more granular profiled section, in the code.  As you can see, some major constituents are the kinds of things we'd expect:
  • An end turn consists of ending the turn for every AI player, so a lot of it is in CvPlayer::doTurn, which is the end turn processing for an individual player
  • A large chunk is deciding what to do with units (AI_unitUpdate)
  • Pathing is still a major cost (CvPathGenerator::generatePath) - as you can see this turn required us to calculate over 65000 unit paths
Just because a function has a large 'time' value doesn't necessarily mean it's the problem in itself - functions near the root of the call tree will inevitably have large times, but note the columns 'Child time' and 'Self time' - for any given profile entry these are the amount of the time accrued to that function spent in (respectively) child functions that it calls, and itself.  As such, it is often entries with large 'self time' values that are the actual issues.

Also note the '#calls' column - this is the number of calls to the function made within the profiled turn.  Some functions may not take long for an individual call, but if they are used a lot it can add up (check out CvUnit::getCommander, for example - it is called 47 million times).  Such functions may well be worthwhile targets of optimization efforts, either to make them as streamlined as possible or (often more effectively) to look for higher level algorithmic changes that can reduce the number of times we need to call them.

How the profiler works

When adding profiling, it is very important to ensure that the overhead imposed by the act of profiling does not come to dominate the overall execution times, and thereby distort the very figures it is generating.  As such, the profiling code itself must be very light-weight - we can't have it doing things like allocating memory or performing other lengthy operations when processing a function invocation!

The profiler (prior to any attempt to add multi-threading) worked by having the profiling macros (typically placed at the start of functions) declare a static structure and a locally-scoped object.  The locally scoped object is responsible for the measurement of that particular function invocation (its constructor records the start time, its destructor the end time), and the static structure holds the aggregate totals for that function.  Thus the following code:

int myFunction()
{
    PROFILE_FUNC();

    // do some stuff
    ...
}

expands to something like this, through the definition of the macro PROFILE_FUNC:

int myFunction()
{
    static ProfileSample __myFunctionSample;
    CProfileScope __myFunctionScope(&__myFunctionSample);

    // do some stuff
    ...
}

Internally the constructor for CProfileScope will squirrel away the pointer to the static ProfileSample structure, and record the start time of the function call.  When the function exits, it will go out of scope and its destructor is called.  The destructor will calculate the time and add it to an accumulated value in the static structure.  It will also add a reference to the static structure to a list of profile entries that have been hit in this run.  At the end of profiling the list of entries hit will be iterated over, and their values dumped out to form the log which gives rise to spreadsheets like the earlier example.
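
In simplified, single-threaded form, CProfileScope amounts to something like the following sketch.  QueryPerformanceCounter (from <windows.h>) is used here for illustration, and the field names on ProfileSample are assumptions:

class CProfileScope
{
public:
    CProfileScope(ProfileSample* pSample) : m_pSample(pSample)
    {
        // Record the start time of this function invocation
        QueryPerformanceCounter(&m_start);
    }

    ~CProfileScope()
    {
        LARGE_INTEGER end;
        QueryPerformanceCounter(&end);

        // Accumulate elapsed time and call count into the static sample
        m_pSample->TotalTime += (end.QuadPart - m_start.QuadPart);
        m_pSample->NumCalls++;
    }

private:
    ProfileSample* m_pSample;
    LARGE_INTEGER m_start;
};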

In practice there is slightly more to it than this, but the gist is as simple as the above.  Specific other aspects are:
  • You need to do the 'add a reference' bit efficiently, so the table index of the reference added is cached in the static structure, so that subsequent calls don't have to do it again.
  • In order to track call paths we want to record the parent from which an entry is (first) called, so the profiler maintains a stack of pointers to the static structures declared by each macro invocation, storing the id of the current one in a child when it is called.
  • Recursion needs some special handling, because you don't want the inner invocations of a function to add to its time, or the outer invocation will be double-counting.  To handle this, the static structure also contains a current 'EntryCount' which amounts to a recursion depth - the destructor will only accumulate the time if this is the outer exit (EntryCount going to 0)

Enabling multi-threading support

Enabling multi-threading on the above structure raises some issues:
  • You cannot use simple EntryCount and ParentId fields in the static structure, because 2 or more independent threads may be in it at once, and need to store independent values
  • You cannot use a global stack - each thread has its own independent stack to keep track of
  • You need to be careful to lock access to parts of the data structure that multiple threads can be manipulating at once - for example - when accumulating a time into a sample's total you cannot just say something like:
    Total += elapsed;
    This is because the addition (in this case of 64-bit quantities in a 32-bit program) is not atomic, so two threads doing it at once can mangle the result

Addressing some of these points is easier than others!

The global stack and parent linkage are easiest, and have an obvious solution: replace the stack with a linked list, and put the linkage information in the locally scoped object rather than the static sample structure (we still keep the id of the first parent we see there, to give the 'parent' dumped in the final output, which was always only the first call path a function is seen on).  In this scheme we use thread-local storage (variables declared as thread-local have different values on different threads, but they need to be used carefully as you cannot have very many of them in total) to hold a 'currentSample', which is a pointer to the sample structure of the profile scope we're currently in, on any given thread.  Using this we can replace the global stack with a series of linked lists (one per thread), each rooted in the thread-local 'currentSample' entry.
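
As a sketch of the linkage idea (using MSVC's __declspec(thread) for brevity - the real code may equally use the TLS APIs, and the member names are illustrative):

// Head of this thread's list of active profile scopes - replaces the
// old global stack with one implicit list per thread
__declspec(thread) CProfileScope* currentSample = NULL;

CProfileScope::CProfileScope(ProfileSample* pSample) : m_pSample(pSample)
{
    // Link ourselves under whatever scope this thread is currently in
    m_parent = currentSample;
    currentSample = this;

    // Remember the first parent ever seen, for the 'parent' column of
    // the final output
    if (m_parent != NULL && pSample->ParentId == -1)
    {
        pSample->ParentId = m_parent->m_pSample->Id;
    }

    QueryPerformanceCounter(&m_start);
}

CProfileScope::~CProfileScope()
{
    // ... accumulate time as before, then pop this thread's list
    currentSample = m_parent;
}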

The EntryCount is a bit trickier.  In principle we can continue to use a single value on the sample structure, which now becomes the TOTAL entry count (between all threads).  Of course if we do that then incrementing, decrementing and testing it all need to do so in a thread-safe manner (which I'll come to in a moment).  In this scheme we need to explicitly check for recursion by walking up the linked list to determine if the same sample occurs higher up - this would be expensive except that it can be omitted in the special (but very common) case where the total entry count is going to 0 on exit anyway (in which case there cannot possibly be a recursive entry, since that would mean the count would still have to be at least 1).

Finally we need to lock access to the structures across update of them (and testing of them in some cases, like the recursion checking's access to EntryCount).

This is where things get interesting.

My initial (naïve) attempt was pretty much what I've described above, with critical sections protecting the key manipulations in the profile entry and exit functions.  This turned out to be problematic however, as I'll explain shortly.  First, a brief digression into locking and thread safety...

Locking and Thread Safety

If you have 2 or more threads concurrently accessing a single data structure, and at least one of them can be making changes to it, then you need to take precautions to ensure that a thread switch at a 'bad' time (like when one thread is half way through modifying the structure) cannot cause another thread to operate on a corrupted (aka partially updated) set of data, getting an incorrect result or (often) just plain crashing.  Note that this can occur even on single-core systems, since when the OS chooses to pre-emptively switch threads is outside of the control of the application. To achieve thread safety you need to either use locks of one sort or another, or develop a lockless algorithm that relies on guaranteed atomicity of certain operations.

Broadly speaking there are 3 levels of primitives available that you can choose to employ:

  1. OS-level synchronization primitives - semaphores, mutexes, events, etc.  I'm not going to go into details of all the various options available (it's extensive), but canonically consider a mutex.  This is an entity that can be 'owned' by a single thread, and has an API that allows threads to 'request' and 'release' it.  A thread requesting an un-owned mutex will take ownership, and continue to process.  When it is done it releases it.  If a thread requests a mutex that is already owned, the OS will block that thread's execution until the owner releases it.  Hence you can use mutexes to guarantee that only a single thread is in some critical piece of code at any one time.
  2. Process-level synchronization primitives - critical sections, essentially.  A critical section is in essence a mutex that only operates within a single process.
  3. Interlocked functions, which provide (processor level) guarantees of atomicity.  Intel x86 CPUs have a capability for some instructions to be 'locked', such that their operation is guaranteed to be atomic across the system, even in multi-CPU systems.  These are exposed via Windows (or compiler intrinsics, but not in the ancient compiler we're stuck with) as the Interlocked... set of APIs.  Canonically:
    • InterlockedIncrement - adds 1 to a (32 bit) variable atomically, returning the value it ended up as
    • InterlockedDecrement - subtracts 1 atomically, returning the resulting value
    • InterlockedCompareExchange - compare a variable against a provided value, and if they are equal set it to another value - all occurring atomically
Each has pros and cons:
  • The OS-level primitives are easy to use and come in many useful forms.  However, a call to one always requires a context switch into kernel space, which is expensive (think order of 1000 CPU cycles expensive!)
  • Critical sections are relatively fast in at least the uncontended case (that is, if the section is not owned, it is fairly fast to take ownership and later to release it), but they only operate within a single process (which is fine for us).  In the contended case they still cause a thread switch, and hence a drop into kernel space, so contention gets expensive.  (There is an API that allows you to set a spin-wait count on them to mitigate this, but at the level we're concerned with that's a detail rather than a fundamental shift in behavior - albeit one we want to take advantage of when we use critical sections, in many cases.)
  • Interlocked functions are much harder to code complex algorithms with, and are generally suitable for very small-scale usage (like using InterlockedIncrement rather than wrapping the '++' operator in a critical section).  However, if something really needs the tightest possible optimization, it is sometimes possible to construct lockless algorithms around the use of the Interlocked functions.  Even Interlocked functions have some performance overhead, which I'll come back to.
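
As an example of the kind of small-scale lockless construct the Interlocked functions enable, here is the standard compare-exchange retry loop (a generic pattern, not code from the profiler):

// Lock-free add of 'delta' to a shared 32-bit value.  If another
// thread changed the value between our read and our exchange, the
// exchange fails and we simply retry
void AtomicAdd(volatile LONG* pShared, LONG delta)
{
    LONG oldVal;

    do
    {
        oldVal = *pShared;
    } while (InterlockedCompareExchange(pShared, oldVal + delta, oldVal) != oldVal);
}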

What I discovered, with my initial naïve attempt, was that the use of critical sections in profiler function entry processing was highly distorting (the very act of profiling added tens of percent to many functions, and hundreds to some very frequently called ones).  This was even in the single threaded case - where contention occurred due to multiple threads executing the same code, it got much, much worse!  In short, having to take a critical section inside of the processing of profiling a function entry was not acceptable.

I then spent some time and was able to recode it using only Interlocked functions, so that it became non-blocking, which totally removes the contention problem.  However, to my surprise, even this produced massive distortions!  At that point I started digging into the exact cost of an Interlocked function.  Although I'd realized it had some overhead (CPU caches have to synchronize at what are known as 'memory fences', and out-of-order execution pipelines in the CPU have to be flushed and restarted), I had not comprehended just how expensive this can be.  Consider:

result = ++iValue;

vs

result = InterlockedIncrement(&iValue);

Semantically they are the same (well, they are in a single-threaded environment at least), but the first will execute (subject to compiler optimizations on where it has put result and iValue) in as little as a single CPU cycle (250 picoseconds, say), whereas the second will typically cost hundreds of cycles, due to pipeline stalls and so on!!

So, we need a totally lockless algorithm, which is a bit harder to achieve!

What I ended up doing is this:
  1. Only support up to some (small) finite maximum number of concurrent threads (to be profiled at least) - I chose 8
  2. For the key shared fields in the sample structure (EntryCount, and the accumulated time totals) replace the single value with an array of MAX_THREADS+1 values
  3. On first call from a thread, take a critical section lock (only on thread entry and exit will we need to do this), and find an unused slot in a table of size MAX_THREADS (i.e. - give the thread a unique index in the arrays declared in (2) to use).  Record in this table which slots are occupied, and save the thread's slot id in a thread-local variable
  4. On exit from the root call for a thread (thread exit from profiling essentially) take the lock again, and for each sample in the entire profile, add the accumulator values from the thread's slot into an aggregate maintained in a reserved slot (the extra (MAX_THREADS+1)th slot) - thus as threads exit (from profiling) their profile data is added into the aggregate total.  Mark the thread's slot id as unused again in the slots table as the thread exits profiling.  This step is very expensive, but it only occurs at thread exit, so it's essentially irrelevant on the overall scale of the entire profile.
  5. At profile exit dump the values from the reserved slot (which has aggregated all individual thread's contributions) to the output log.
The net result of all this is no need to ever lock anything except at thread profile start and end.
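
A sketch of the slot mechanism (all names illustrative, and initialization of the critical section elided):

#define MAX_PROFILED_THREADS 8

// Slot occupancy table - only touched under the lock, and only at
// thread profiling entry/exit
static bool gSlotInUse[MAX_PROFILED_THREADS];
static CRITICAL_SECTION gSlotLock;

// Each profiled thread remembers its own slot index
__declspec(thread) int tlsSlotIndex = -1;

void EnterProfilingThread(void)
{
    EnterCriticalSection(&gSlotLock);

    for (int i = 0; i < MAX_PROFILED_THREADS; i++)
    {
        if (!gSlotInUse[i])
        {
            gSlotInUse[i] = true;

            // All subsequent accumulation is lock-free: each sample's
            // EntryCount and time totals are arrays indexed by this slot
            tlsSlotIndex = i;
            break;
        }
    }

    LeaveCriticalSection(&gSlotLock);
}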

Profiler changes DLL modders may need to be aware of

Macros

The macros used are unchanged, with the exception of the addition of:

PROFILE_THREAD(name)

usage example:

void MyThread(void)
{
    PROFILE_THREAD("MyThreadRoot");

    // Do stuff
    ...
}

This macro should be placed (typically) in the root thread function of any thread you wish to enable profiling on (apart from the main thread, which is always profiled).  It serves the twin purpose of turning on profiling for that thread (by default a thread will not be), and declaring a root profiling entry.

Global Defines

I have also added some new global defines as follows (in 'A_New_Dawn_GlobalDefines.xml'):
  • (Boolean) 'ENABLE_BACKGROUND_PROFILING' (default value '1').  If this is set to 0, all profiling of threads other than the main thread will be disabled
  • (String) 'PROFILER_ALTERNATE_SAMPLE_SET_SOURCE'.  If this is set to the name of a profile section (e.g. - the name of a profiled method), the time spent in the named profile section will accrue into the 'Alternate sample set' (see below) against each profile entry within whose call stack it occurs.  For example, if we see a function showing up a lot on the overall profile and want to know more about the call paths it gets executed in, we can set it as the alternate sample, and this will tell us under which ancestor methods the (expensive) calls are taking place.  Pathing is a good example - suppose I know that a lot of time is spent pathing, but I want to look into optimizing the use of the pathing (reducing the number of calls) rather than the pathing engine itself.  To do that I need to know which callers are accounting for the most time (in this case I'll find that AI_attackCityMove is responsible for a lot of it, as it happens)

Changes to the output table in the profile log

The log example above is from the new profiler.  Compared to the old version there are a few extra columns:
  • Main Thread Time - this column shows how much of the total time (reported in 'Time') was on the main thread.  This is important to know, because (unless we're trying to optimize heat output at least), it is only main thread elapsed time that ultimately matters (the rest is happening in parallel)
  • Alternate Time - this is time accrued by the alternate sample as discussed above.  In the example spreadsheet shown earlier, the 'alternate sample set' was set to 'CvTeam::setHasTech'.  The results are unremarkable, but note that although most time accrues from CvPlayer::doResearch, there are also some calls from other paths

Other Interesting Things I learnt

One other interesting discovery occurred while I was testing this and the general framework for multi-threading (which I'll describe in my next entry, but here is as good a place as any to mention it).

It turns out that something (I think the boost threading library, but that's just a hypothesis, which I have not confirmed) catches structured exceptions on background threads (at least those launched via boost::thread).  The net effect of this is that (without further coding at least), if a crash condition (typically a bad memory reference or divide by zero) occurs on a background thread, no minidump is created, the thread silently terminates, and execution continues (or at least it attempts to)!

I have rectified this for the threads we currently have (that are boost-launched) with some explicit code to call the minidump creator, and then exit the process.

Saturday, June 1, 2013

Introduction

I've been meaning to start an ongoing blog about interesting modifications to the C2C codebase (and/or interesting experiences in making those changes) for some time, but never really got round to it, so now seems as good a time as any to finally commit some thoughts to 'paper' (unlike Mr. Ive, I like skeuomorphism, even if it's only the metaphorical kind).

First of all, a few words of introduction to C2C for those who somehow got here without being familiar with it!

C2C (aka 'Caveman 2 Cosmos') is a mod for the game 'Civilization IV', a turn-based strategy game in which you build an empire, expanding and learning new technologies as you go.  See http://forums.civfanatics.com/showthread.php?t=444594 for specifics on the C2C mod.

Firaxis (the authors of the 'Civilization' series games) chose to open the game to modding in fairly extreme ways, both by exposing an XML-based entity definition system (by which new unit types, buildings, technologies, etc. can be defined) and a scripting interface (which allows functionality to be scripted in Python), and by open-sourcing a DLL which implements the game world and the AI for computer-controlled players.  Only the graphical front-end is closed, but since pretty much the entire ruleset of the game, and all the AI, is in the open-sourced DLL part, the changes that can be made are almost limitless.

C2C takes this to something of an extreme, by making very extensive changes to the DLL (significantly modifying, and adding to the ruleset, which in turn is exposed to the XML and scripting layers via an enhanced API and XML schema).

My involvement is primarily with modifications to the DLL code (which is C++), principally in the following areas:
  • Performance optimization - many areas of the original Firaxis code have been reworked to improve their performance, and especially their scalability (C2C defines an order of magnitude more entities than the original game, and also tends to be run on far larger maps, both of which stress aspects of scalability in the original code).  Areas that have been significantly reworked include:
    • New path calculation engine (to determine best legal routing for units across the map)
    • New trade network calculation code
    • Introduction of multi-core processing (the original was entirely single-threaded)
    • Greatly optimized AI evaluation calculations (what to build, what to research, etc.)
  • AI improvements - as new capabilities are added, the AI has to be taught how and when to make use of them.  Also there is an ongoing effort to increase the effectiveness of AI players, and to improve the scalability of the more time-consuming AI algorithms (which is really performance again)
My intention, with this blog, is to make postings about particular changes that are interesting in their own right, that illustrate some unexpected discoveries (either about C2C, Civilization, or programming generally), or that need to be explained in some defined place for the benefit of other modders.

I may well write posts on some things that have already been done, because they meet one or more of the above criteria (e.g. - the new pathing engine, viewports, and so on), but the thing that triggered me to start this now was the need to write up changes I'm making for multi-threading of the turn-processing.  Consequently the next couple of posts will be about that in one way or another...