Queue System

Queue system architecture

The Aegir5 queue is implemented using Celery, a full-featured distributed task queue written in Python. Celery relies on a separate message broker to carry task messages, and Aegir5 uses RabbitMQ for that role.

The queue system therefore consists of the RabbitMQ service, on top of which run two Celery workers:

  • relayd runs on the front-end server. Its main job is to relay data about back-end Operations and Tasks (status, log data, etc.) back into the front-end Drupal system.
  • dispatcherd runs on the back-end server. Its main job is to dispatch Tasks to a back-end plugin (currently Ansible, but eventually anything) that actually provisions something (Platform, Site, Server, Service) or acts on the provisioned resources (runs a backup, performs updates, etc.).
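The round trip between the two workers can be modelled in a few lines of plain Python. This is a toy sketch, not Aegir5 code: `queue.Queue` objects stand in for the RabbitMQ queues, and the queue names (`dispatcher`, `relay`), the task name (`provision_site`) and the message fields are all hypothetical.

```python
import queue

# Toy stand-ins for the two RabbitMQ queues (names are hypothetical).
bus = {"dispatcher": queue.Queue(), "relay": queue.Queue()}

def frontend_post_task(task_name, target):
    """Front end (Drupal) posts a Task for the back end to carry out."""
    bus["dispatcher"].put({"task": task_name, "target": target})

def dispatcherd_step():
    """Back-end worker: take one Task, 'run' it, report back via the relay queue."""
    msg = bus["dispatcher"].get_nowait()
    # A real dispatcherd would hand the Task to a backend plugin such as Ansible.
    log = f"ran {msg['task']} on {msg['target']}"
    bus["relay"].put({"task": msg["task"], "status": "success", "log": log})

def relayd_step(frontend_state):
    """Front-end worker: take one status message and record it for the front end."""
    msg = bus["relay"].get_nowait()
    frontend_state[msg["task"]] = (msg["status"], msg["log"])

state = {}
frontend_post_task("provision_site", "example.com")
dispatcherd_step()
relayd_step(state)
print(state)
# {'provision_site': ('success', 'ran provision_site on example.com')}
```

The key point is the direction of travel: work flows from the front end to dispatcherd, and status and logs flow back through relayd.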

2 Workers, 2 Queue exchanges

Celery routes tasks to workers through named queues (what we’re thinking of as “channels” on top of the underlying “bus”). In AMQP terms, a task message is published to an exchange, and the exchange uses the message’s routing key to deliver it to whichever queues are bound to that key. Each worker specifies which queue or queues it listens to when it starts up, and we specify which queue to use whenever we post a task.
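The exchange/queue/worker relationship can be illustrated with a toy model of an AMQP direct exchange, which delivers a message to every queue bound with exactly the same routing key. This is a simplified stdlib sketch, not RabbitMQ or Celery code, and the queue and routing-key names are hypothetical. On the real Celery side, a worker picks its queues with the `-Q`/`--queues` command-line option, and a producer picks a queue with the `queue` option of `apply_async()` or `send_task()`. By default Celery also creates a missing queue on demand, together with a direct exchange and routing key of the same name, which helps explain why “queue” and “exchange” tend to blur together in practice.

```python
from collections import defaultdict, deque

class DirectExchange:
    """Toy model of an AMQP direct exchange: routes by exact routing-key match."""

    def __init__(self):
        self.bindings = defaultdict(set)  # routing key -> names of bound queues
        self.queues = {}                  # queue name -> pending messages

    def declare_queue(self, name, routing_key):
        # Create the queue (if needed) and bind it to this exchange under routing_key.
        self.queues.setdefault(name, deque())
        self.bindings[routing_key].add(name)

    def publish(self, routing_key, message):
        # Deliver to every queue bound with this key; unroutable messages are dropped.
        for name in self.bindings.get(routing_key, ()):
            self.queues[name].append(message)

    def consume(self, queue_name):
        # A worker listening on queue_name only ever sees messages in that queue.
        pending = self.queues[queue_name]
        return pending.popleft() if pending else None

ex = DirectExchange()
ex.declare_queue("dispatcher", routing_key="dispatcher")
ex.declare_queue("relay", routing_key="relay")
ex.publish("dispatcher", {"task": "backup_site"})
print(ex.consume("relay"))       # None: nothing was routed to the relay queue
print(ex.consume("dispatcher"))  # {'task': 'backup_site'}
```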

In practice, a worker must be able to handle every task that is put on any queue it listens to. If a task arrives for which the worker has no registered handler, the worker cannot run it: Celery reports it as an unregistered task and discards the message, so the task silently never happens.
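This failure mode can be sketched with a toy worker that keeps its own registry of task handlers, loosely mirroring how a Celery worker only runs tasks registered with its app. The `ToyWorker` class and the task names `log_output` and `run_playbook` are hypothetical and exist only for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("toy-worker")

class ToyWorker:
    """Toy worker: only runs tasks that were registered with it."""

    def __init__(self, name):
        self.name = name
        self.registry = {}  # task name -> callable

    def task(self, func):
        # Decorator that registers func under its function name.
        self.registry[func.__name__] = func
        return func

    def handle(self, message):
        func = self.registry.get(message["task"])
        if func is None:
            # No handler: log an error and drop the message without running anything.
            log.error("%s: unregistered task %r, message discarded",
                      self.name, message["task"])
            return None
        return func(*message.get("args", ()))

relayd = ToyWorker("relayd")

@relayd.task
def log_output(line):
    return f"logged: {line}"

print(relayd.handle({"task": "log_output", "args": ["hello"]}))  # logged: hello
print(relayd.handle({"task": "run_playbook", "args": []}))       # logs an error, prints None
```

Here a back-end task such as `run_playbook` landing on relayd’s queue is simply lost, which is why each worker should only listen on the queue that carries its own tasks.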

We have therefore refactored the AbstractTaskQueue and related classes to take an “exchange” argument, and configured dispatcherd and relayd to each listen on a specific queue/exchange/channel when they start up. This still needs better documentation. We also need a better understanding of the Celery, RabbitMQ, and AMQP pieces involved, or at least to point our docs to the relevant documentation for those tools.
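The idea behind the refactoring can be sketched as a queue client that fixes its exchange at construction time, so every task posted through it lands on the channel its worker listens to. This is a hypothetical illustration: `TaskQueue`, `post`, `DISPATCHER` and `RELAY` are invented names and do not reproduce the actual AbstractTaskQueue code, and a plain list stands in for the broker.

```python
from dataclasses import dataclass, field

# Hypothetical channel names for the two workers.
DISPATCHER = "dispatcher"  # consumed by dispatcherd on the back end
RELAY = "relay"            # consumed by relayd on the front end

@dataclass
class TaskQueue:
    """Hypothetical wrapper: every task posted through it goes to one exchange."""
    exchange: str
    posted: list = field(default_factory=list)  # stands in for the broker

    def __post_init__(self):
        # Reject channels that no worker listens on.
        if self.exchange not in (DISPATCHER, RELAY):
            raise ValueError(f"unknown exchange: {self.exchange!r}")

    def post(self, task_name, **kwargs):
        # A real implementation would publish to RabbitMQ here, e.g. via
        # Celery's app.send_task(task_name, kwargs=kwargs, queue=self.exchange).
        message = {"exchange": self.exchange, "task": task_name, "kwargs": kwargs}
        self.posted.append(message)
        return message

to_backend = TaskQueue(DISPATCHER)
to_frontend = TaskQueue(RELAY)
to_backend.post("provision_platform", platform="drupal8")
to_frontend.post("update_status", task_id=42, status="running")
print([m["exchange"] for m in to_backend.posted + to_frontend.posted])
# ['dispatcher', 'relay']
```

Validating the exchange up front turns a misrouted task into an immediate error at the posting side, rather than a message that some worker discards later.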

See commits d05e9bf, 1fee690, 2ed8c72 for the related changes here.