
Commit 52c1062

committed
[doc] Extend task management docs
Some further information was written in the commit message of the task management implementation. This information is now saved in the official documentation.
1 parent 1b412ad commit 52c1062

2 files changed

Lines changed: 78 additions & 5 deletions

File tree

docs/web/background_tasks.md

Lines changed: 33 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -32,10 +32,13 @@ Tasks are generally spawned by API handlers, executed in the control flow of a T
3232

3333
1. An **API** request arrives (later, this might be extended with a _`cron`_ -like scheduler) which exercises an endpoint that results in the need for a task.
3434
2. _(Optionally)_ some conformance checks are executed on the input, in order to not even create the task if the input is ill-formed.
35-
3. A task **`token`** is _`ALLOCATED`_: the record is written into the database, and now we have a unique identifier for the task.
36-
4. The task is **pushed** to the _task queue_ of the CodeChecker server, resulting in the _`ENQUEUED`_ status.
37-
5. The task's identifier **`token`** is returned to the user.
35+
3. A task **`token`** is _`ALLOCATED`_: the **`BackgroundTask`** record is written into the database, and now we have a unique identifier for the task.
36+
4. The task is **pushed** to a shared, synchronised _task queue_ of the CodeChecker server, resulting in the _`ENQUEUED`_ status.
37+
* `AbstractTask` subclasses **MUST** be `pickle`-able and reasonably small.
38+
* The library offers means to store additional large data on the file system, in a temporary directory specific to the task.
39+
5. The **`task token`** is returned to the user via the RPC API call, and the API worker is free to respond to other requests.
3840
6. The API handler exits and the Thrift RPC connection is terminated.
41+
7. In a loop with some frequency, the user exercises the `getTaskInfo()` API (executed in the context of any _API worker_ process, synchronised over the database) to query whether the task was completed, if the user wishes to receive this information.
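The polling in step 7 could be sketched roughly as follows. This is only an illustration: `wait_for_task` and its `get_status` callable are hypothetical stand-ins for a client wrapper around `getTaskInfo()`, and the set of terminal statuses is assembled from the status names used in this document, not taken from the real client API.

```python
import time
from typing import Callable

# Status names as used in this document; reaching one of them ends polling.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED", "DROPPED"}


def wait_for_task(get_status: Callable[[str], str], token: str,
                  interval_s: float = 1.0, max_polls: int = 60) -> str:
    """Poll `get_status(token)` until a terminal status or the poll budget ends.

    `get_status` is a hypothetical stand-in for a getTaskInfo() call that
    returns the task's current status as a string.
    """
    status = get_status(token)
    for _ in range(max_polls - 1):
        if status in TERMINAL_STATUSES:
            break
        time.sleep(interval_s)
        status = get_status(token)
    return status
```

Bounding the number of polls keeps a client from waiting forever on a task that never leaves a non-terminal status.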
3942

4043
The API request dispatching of the CodeChecker server has a **`TaskManager`** instance which should be passed to the API handler implementation, if not already available.
4144
Then, you can use this _`TaskManager`_ object to perform the necessary actions to enqueue the execution of a task:
@@ -118,7 +121,7 @@ The business logic of tasks are implemented by subclassing the _`AbstractTask`_
118121
4. The implementation does its thing, periodically calling _`task_manager.heartbeat()`_ to update the progress timestamp of the task, and, if appropriate, checking with _`task_manager.should_cancel()`_ whether the admins requested the task to cancel or the server is shutting down.
119122
5. If _`should_cancel()`_ returned `True`, the task does some appropriate clean-up, and exits by raising the special _`TaskCancelHonoured`_ exception, indicating that it responded to the request. (At this point, the status becomes either _`CANCELLED`_ or _`DROPPED`_, depending on the circumstances of the service.)
120123
6. Otherwise, or if the task is for some reason not cancellable without causing damage, the task executes its logic.
121-
7. If the task's _`_implementation()`_ method exits cleanly, it reaches the _`COMPLETED`_ status; otherwise, if any exception escapes from the _`_implementation()`_ method, the task becomes _`FAILED`_.
124+
7. If the task's _`_implementation()`_ method exits cleanly, it reaches the _`COMPLETED`_ status; otherwise, if any exception escapes from the _`_implementation()`_ method, the task becomes _`FAILED`_, and exception information is logged into the `BackgroundTask.comments` column of the database.
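As a rough illustration of how the outcome of `_implementation()` maps to a final status, consider the sketch below. It is not the server's actual code: `run_and_classify` is a hypothetical helper, the `TaskCancelHonoured` class here is a local stand-in, and choosing `DROPPED` over `CANCELLED` based on a single shutdown flag is an assumption about "the circumstances of the service".

```python
import traceback
from typing import Callable, Optional, Tuple


class TaskCancelHonoured(Exception):
    """Local stand-in for the exception raised after honouring a cancel."""


def run_and_classify(implementation: Callable[[], None],
                     server_shutting_down: bool = False
                     ) -> Tuple[str, Optional[str]]:
    """Run the task body and return (final status, comment to store)."""
    try:
        implementation()
    except TaskCancelHonoured:
        # Assumption: a cancel honoured during shutdown counts as DROPPED.
        return ("DROPPED" if server_shutting_down else "CANCELLED"), None
    except Exception:
        # Any other escaping exception fails the task; keep the traceback
        # so it can be saved alongside the task record.
        return "FAILED", traceback.format_exc()
    return "COMPLETED", None
```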
122125

123126
**Caution!** Tasks execute in a separate background process, one of the many processes spawned by a CodeChecker server, and no longer have the ability to synchronously communicate with the user!
124127
This also includes the lack of ability to "return" a value: tasks **only exercise side-effects**, but do not calculate a "result".
@@ -170,6 +173,32 @@ class MyTask(AbstractTask):
170173
foo(element)
171174
```
172175

176+
### Abnormal path 1: admin cancellation
177+
178+
At any point following _`ALLOCATED`_ status, but most likely in the _`ENQUEUED`_ and _`RUNNING`_ statuses, a **`SUPERUSER`** may issue a _`cancelTask()`_ order.
179+
This will set `BackgroundTask.cancel_flag`, and the task is expected (although not required!) to poll its own _`should_cancel()`_ status internally at checkpoints, and terminate gracefully in response to this request. This is done by **`_implementation()`** exiting by raising a **`TaskCancelHonoured`** exception.
180+
(If the task does not raise one, it will be allowed to conclude normally, or fail in some other manner.)
181+
Tasks cancelled gracefully will have the _`CANCELLED`_ status.
182+
183+
For example, a background task that performs an action over a set of input files generally should be implemented like this:
184+
185+
```py3
def _implementation(self, tm: TaskManager):
    for file in INPUTS:
        # Checkpoint: poll for a cancel request before each unit of work.
        if tm.should_cancel(self):
            # Undo partial work, then signal that the request was honoured.
            ROLLBACK()
            raise TaskCancelHonoured(self)

        DO_LOGIC(file)
```
194+
195+
### Abnormal path 2: server shutdown
196+
197+
Alternatively, at any point in this life cycle, the server might receive the command to terminate itself (kill signals `SIGINT`, `SIGTERM`; alternatively caused by `CodeChecker server --stop`). Following the termination of _API workers_, the _background workers_ will also shut down one by one.
198+
At this point, the default behaviour is to cause a special _cancel event_ which tasks currently _`RUNNING`_ may still gracefully honour, as if it were a `SUPERUSER`'s single-task cancel request. All other tasks that have not started executing yet and are in the _`ALLOCATED`_ or _`ENQUEUED`_ status will never start.
199+
200+
All tasks not in a _normal termination state_ will be set to the _`DROPPED`_ status, with the `comments` field containing a log of the state in which the task was dropped, and why. (Together, _`CANCELLED`_ and _`DROPPED`_ are the _"abnormal termination states"_, indicating that the task terminated due to some external influence.)
201+
173202
Client-side handling
174203
--------------------
175204

docs/web/server_config.md

Lines changed: 45 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,15 +9,28 @@ using the package's installed `config/server_config.json` as a template.
99
1010
Table of Contents
1111
=================
12+
* [Task handling](#task-handling)
13+
* [Number of API worker processes](#number-of-api-worker-processes)
14+
* [Number of task worker processes](#number-of-task-worker-processes)
15+
1216
* [Run limitation](#run-limitations)
1317
* [Storage](#storage)
1418
* [Directory of analysis statistics](#directory-of-analysis-statistics)
1519
* [Limits](#Limits)
1620
* [Maximum size of failure zips](#maximum-size-of-failure-zips)
1721
* [Size of the compilation database](#size-of-the-compilation-database)
22+
* [Keepalive](#keepalive)
23+
* [Idle time](#idle-time)
24+
* [Interval time](#interval-time)
25+
* [Probes](#probes)
1826
* [Authentication](#authentication)
27+
* [Secrets](#secrets)
28+
* [server_secrets.json](#server_secretsjson)
29+
* [Environmental variables](#environmental-variables)
30+
31+
## Task handling
1932

20-
## Number of API worker processes
33+
### Number of API worker processes
2134
The `worker_processes` section of the config file controls how many processes
2235
will be started on the server to process API requests.
2336

@@ -33,6 +46,37 @@ processes will be started on the server to process background jobs.
3346

3447
The server needs to be restarted if the value is changed in the config file.
3548

49+
### `--machine-id`
50+
Unfortunately, servers don't always terminate gracefully (cue the aforementioned
51+
`SIGKILL`, but also the container, VM, or the host machine could simply die
52+
during execution, in ways the server is not able to handle). Because tasks are
53+
not shared across server processes, and there are crucial bits of information in
54+
the now dead process's memory which would have been needed to execute the task,
55+
a server later restarting in place of a previously dead one should be able to
56+
identify which tasks its "predecessor" left behind without clean-up.
57+
58+
This is achieved by storing the running computer's identifier, configurable via
59+
`CodeChecker server --machine-id`, as an additional piece of information for
60+
each task. By default, the machine ID is constructed from
61+
`gethostname():portnumber`, e.g., `cc-server:80`.
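The default format described above can be approximated in a few lines. This is only a sketch under the assumption that the ID is the plain concatenation of the host name and the listening port; `default_machine_id` is a hypothetical helper name, not a function of the CodeChecker code base.

```python
import socket


def default_machine_id(port: int) -> str:
    """Build a "hostname:port" identifier for the running server."""
    return f"{socket.gethostname()}:{port}"
```

On a host named `cc-server` serving on port 80, this sketch would produce `cc-server:80`, matching the example above.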
62+
63+
In containerised environments, relying on `gethostname()` may not be entirely
64+
stable! For example, Docker exposes the first 12 digits of the container's
65+
unique hash as the _"hostname"_ of the insides of the container. If the
66+
container is started with `--restart always` or `--restart unless-stopped`, then
67+
this is fine; however, more advanced systems, such as _Docker swarm_, will
68+
**create a new container** in case the old one died (!), resulting in a new
69+
value of `gethostname()`.
70+
71+
In such environments, service administrators must take additional care and
72+
configure their instances by setting `--machine-id` for subsequent executions of
73+
the "same" server accordingly. If a server with machine ID **`M`** starts up
74+
(usually after a container or "system" restart), it will set every task not in
75+
any "termination states" and associated with machine ID **`M`** to the
76+
_`DROPPED`_ status (with an appropriately formatted comment accompanying),
77+
signifying that the _previous instance_ "dropped" these tasks, but had no chance
78+
of recording this fact.
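The start-up clean-up described above might look roughly like the following sketch. It operates on an in-memory list instead of the database, and `TaskRecord`, `drop_leftover_tasks`, the comment format, and the exact set of termination states are all illustrative assumptions, not the server's actual schema or logic.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumed set of "termination states", based on the statuses named in the docs.
TERMINATION_STATES = {"COMPLETED", "FAILED", "CANCELLED", "DROPPED"}


@dataclass
class TaskRecord:
    """Simplified, hypothetical stand-in for a stored task row."""
    token: str
    machine_id: str
    status: str
    comments: str = ""


def drop_leftover_tasks(tasks: Iterable[TaskRecord],
                        machine_id: str) -> List[str]:
    """Mark unfinished tasks of `machine_id` as DROPPED; return their tokens."""
    dropped = []
    for task in tasks:
        if (task.machine_id == machine_id
                and task.status not in TERMINATION_STATES):
            # Record the previous status so admins can see what was lost.
            task.comments = (f"Dropped at start-up of '{machine_id}'; "
                             f"previous status: {task.status}")
            task.status = "DROPPED"
            dropped.append(task.token)
    return dropped
```

Tasks that belong to another machine ID are left untouched, which is why a stable `--machine-id` matters: a restarted server with a different ID would never find the tasks its predecessor left behind.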
79+
3680
## Run limitation
3781
The `max_run_count` section of the config file controls how many runs can be
3882
stored on the server for a product.
