The queue system thus consists of the RabbitMQ/Redis services on top of which we have 2 Celery workers:
relayd runs on the front-end server, and serves primarily to relay data back into the front-end Drupal system about backend Operations and Tasks (status, log data, etc).
dispatcherd runs on the back-end server, and serves primarily to dispatch Tasks to a backend plugin (currently Ansible, but eventually anything), either to provision something (Platform, Site, Server, Service) or to do something with the provisioned resources (run a backup, perform updates, etc).
We’d originally assumed the Celery queue would be shared between these two worker tasks, but that doesn’t seem to be the right model.
Instead, we’ve determined that Celery uses the concept of different queues or exchanges (what we’re thinking of as “channels” on top of the underlying “bus”), that use AMQP routing to get tasks to the correct worker. Workers in turn specify which queue or queues they want to listen to when they start up, and we specify which queue to put things on when we post a task.
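As a rough illustration of the routing described above, a Celery settings module could map each worker's tasks onto its own queue. This is a minimal sketch, not our actual configuration: the module name, queue names, task-name patterns, and broker URL are all assumptions for illustration. Only the broker_url and task_routes setting names and the worker's -Q option come from Celery itself.

```python
# celeryconfig.py: hypothetical settings module, loadable with
# app.config_from_object("celeryconfig").

# Assumed broker location; point this at the real RabbitMQ service.
broker_url = "amqp://guest:guest@localhost:5672//"

# Route tasks by name pattern to a per-worker queue.
# The queue names and patterns here are assumptions.
task_routes = {
    "relayd.*": {"queue": "relayd"},
    "dispatcherd.*": {"queue": "dispatcherd"},
}

# Each worker then listens only on its own queue, for example
# (the app module names are hypothetical):
#   celery -A relayd worker -Q relayd
#   celery -A dispatcherd worker -Q dispatcherd
```

With routes like these, posting a task named relayd.queue_valid lands on the relayd queue without the caller naming the queue explicitly.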
Worker applications seem to need to be able to handle any task that gets put on the queue they’re listening on, and if a task comes in for a worker that doesn’t have a method to handle it, things fail badly.
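To make the failure mode concrete, here is a toy in-memory model of workers listening on queues. It is not Celery code, and every class and function name in it is hypothetical. It only shows why a task posted to a queue whose worker has no handler for it cannot be processed.

```python
class UnknownTaskError(Exception):
    """Raised when a worker receives a task it has no handler for."""


class Worker:
    def __init__(self, name, queues):
        self.name = name
        self.queues = set(queues)  # queues this worker listens on
        self.handlers = {}         # task name -> callable

    def register(self, task_name, handler):
        self.handlers[task_name] = handler

    def handle(self, task_name, *args):
        if task_name not in self.handlers:
            raise UnknownTaskError(f"{self.name} cannot handle {task_name!r}")
        return self.handlers[task_name](*args)


def deliver(queue, task_name, workers, *args):
    """Hand the task to the first worker listening on the queue.

    A real broker may pick any listening worker; taking the first one
    keeps this sketch deterministic.
    """
    for worker in workers:
        if queue in worker.queues:
            return worker.handle(task_name, *args)
    raise LookupError(f"no worker listens on {queue!r}")


relayd = Worker("relayd", queues=["relayd"])
relayd.register("relayd.queue_valid", lambda: "relayd ok")

dispatcherd = Worker("dispatcherd", queues=["dispatcherd"])
dispatcherd.register("dispatcherd.queue_valid", lambda: "dispatcherd ok")

workers = [relayd, dispatcherd]
```

Delivering relayd.queue_valid on the relayd queue succeeds, while delivering dispatcherd.queue_valid on the relayd queue raises UnknownTaskError. That is the situation separate queues are meant to avoid.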
As such, we’ve refactored the AbstractTaskQueue and related classes to take an “exchange” argument, and similarly configured the workers to specify a particular queue/exchange/channel when they start up. This needs to be better documented, and we probably need to understand the Celery, RabbitMQ, and AMQP pieces here, or at least point our docs to the relevant docs for those.
@TODO: expand on these components and how they fit together.
aegir:exitcode and how they interact with relayd/dispatcherd
In Issue #64, we implemented the drupal aegir:validate_queue command (see commits
along with some testing/debugging mechanisms currently living alongside the
“Check connection settings” page (admin/aegir/queue), called “Check task queue”.
It works like this:
queue_valid. Here again, we’ve implemented both a relayd.queue_valid and a dispatcherd.queue_valid routine, one for each of the workers, and they are triggered in turn. Both use the same technique: set a State API variable to FALSE initially, then call the worker task whose job is to end up setting that same variable to TRUE. After posting the task, the submit handler currently polls the State variable (resetting the cache each time through the wait loop), and returns TRUE when the State variable changes, or FALSE if it times out.
For the relayd.queue_valid task, we simply call drupal aegir:queue_valid immediately, validating that the Python Celery task code can call out to a drupal aegir command, in turn feeding data back to the frontend Aegir site.
For the dispatcherd.queue_valid task, we emulate the “round trip” feedback mechanism, where a backend task in turn posts a Celery task onto the queue for relayd to pick up and process (generally via a drupal aegir console command). In this case, we post relayd.validate_queue, which in turn calls the drupal aegir:queue_valid command, just as in the previous step.
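The check described above can be sketched in a few lines of Python. This is a simulation under stated assumptions, not the actual submit handler (which lives in Drupal): a dict stands in for the State API, a background timer stands in for the Celery queue, and the function names merely echo the task and command names used above. The timeout and polling interval are arbitrary.

```python
import threading
import time

# Stand-in for Drupal's State API (assumption): a dict guarded by a lock.
_state = {}
_state_lock = threading.Lock()


def state_set(key, value):
    with _state_lock:
        _state[key] = value


def state_get(key):
    with _state_lock:
        return _state.get(key)


def post_task(func, *args, delay=0.05):
    """Pretend to enqueue a task: run it on a background thread after a delay."""
    timer = threading.Timer(delay, func, args=args)
    timer.daemon = True
    timer.start()


def drupal_queue_valid(key):
    """Stand-in for the drupal aegir:queue_valid command: set the flag to True."""
    state_set(key, True)


def relayd_queue_valid(key):
    """relayd task: call the console command immediately."""
    drupal_queue_valid(key)


def relayd_validate_queue(key):
    """relayd task posted by dispatcherd; also ends in the console command."""
    drupal_queue_valid(key)


def dispatcherd_queue_valid(key):
    """dispatcherd task: emulate the round trip by posting a task for relayd."""
    post_task(relayd_validate_queue, key)


def check_queue(task, key, timeout=2.0, interval=0.01):
    """Set the flag to False, post the task, then poll until True or timeout."""
    state_set(key, False)
    post_task(task, key)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if state_get(key) is True:
            return True
        time.sleep(interval)
    return False
```

Calling check_queue(relayd_queue_valid, "relayd_ok") exercises the direct path, and check_queue(dispatcherd_queue_valid, "dispatcherd_ok") exercises the round trip. A task that never sets the flag makes check_queue return False once the timeout elapses.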