truthtracer/super-node
Copyright (c) 2019-2050 TruthTracer. All rights reserved.
Author
TruthTracer
License
super-node is under the LGPL V3 license.
See LICENSE for the full license text.
About
A supervisor and data node for various workers in edge computing (AI inference, including LLM backends), IoT, etc.
monitors all task workers and keeps them working (health check / restart / send params)
collects data from all workers into queue caches, to be consumed by other biz services
workflow
basically, the pattern is
1. the node receives a task from '/cmd' and starts a worker with params
2. the worker sends data back to the node (master) via websocket('/put'); the node caches the data in a queue of limited size
3. other apps fetch data from the node via websocket('/fetch')
4. for every task worker, the node uses a configurable fixed-length ordered queue as its cache
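The per-task fixed-length queue described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the class name `NodeCache` and its methods are hypothetical, and dropping the oldest item when the queue is full is an assumed overflow policy that the README does not specify.

```python
from collections import deque
from typing import Any, Dict, Optional


class NodeCache:
    """Per-task bounded FIFO cache (hypothetical name, not from the project)."""

    def __init__(self, max_len: int) -> None:
        self.max_len = max_len
        self.queues: Dict[str, deque] = {}

    def put(self, task_id: str, item: Any) -> None:
        # A deque with maxlen discards its oldest item when a new one
        # arrives at full capacity (assumed overflow policy).
        queue = self.queues.setdefault(task_id, deque(maxlen=self.max_len))
        queue.append(item)

    def fetch(self, task_id: str) -> Optional[Any]:
        # Return the oldest cached item for the task, or None if nothing is cached.
        queue = self.queues.get(task_id)
        return queue.popleft() if queue else None


cache = NodeCache(max_len=3)
for reading in [1, 2, 3, 4]:
    cache.put("sensor-a", reading)

# Reading 1 was dropped when reading 4 arrived, so the oldest item is now 2.
first = cache.fetch("sensor-a")
```

Keeping one queue per task_id means a busy worker can only overwrite its own backlog, never another task's data.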
workers can be:
IoT sensor collectors
audio / video pullers
data from other network services (rpc/http/websocket/mqtt/redis ...)
other computation jobs / data gathering
api design
'/m/cmd' : send a command to the node, in json
'/m/fetch/[task_id]' : a biz app gets data from a specific task; when task_id is not passed, the data of all tasks is fetched one after another
'/m/stat' : get task status
'/w/ch' : worker channel; a worker puts data to the node, gets commands from the master ...
multiple clients talking to the same api url are allowed
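To make the endpoint layout above concrete, here is a small sketch of how the paths could be resolved to a handler name and an optional task_id. The function `resolve`, the `ROUTES` table and the handler names are hypothetical and only illustrate the optional `[task_id]` segment; the project may route requests differently.

```python
from typing import Optional, Tuple

# Fixed endpoints from the api list; the handler names are made up for this sketch.
ROUTES = {
    "/m/cmd": "cmd",
    "/m/stat": "stat",
    "/w/ch": "worker_channel",
}


def resolve(path: str) -> Tuple[str, Optional[str]]:
    """Map a request path to (handler name, task_id or None)."""
    path = path.rstrip("/")
    if path in ROUTES:
        return ROUTES[path], None
    if path == "/m/fetch":
        # No task_id: the caller wants data from all tasks, one after another.
        return "fetch", None
    if path.startswith("/m/fetch/"):
        return "fetch", path[len("/m/fetch/"):]
    raise ValueError(f"unknown endpoint: {path}")
```

For example, `resolve("/m/fetch/cam-1")` yields the fetch handler with task_id `cam-1`, while `resolve("/m/fetch")` yields the fetch handler with no task_id.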
worker management
monitor all worker processes and restart any worker that is unresponsive, frozen or abnormal ...
a worker can register in 2 ways
exec a child process
connect as a net client
restart conditions
process status : not configurable, a dead/gone process is killed and restarted
process memory usage : configurable, in KB, default 0 : no limit
process execution timeout : configurable, in ms, default 0 : no limit
net client heartbeat
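The restart conditions above could be combined into a single check along these lines. This is a sketch under stated assumptions: the names `Limits`, `WorkerSample` and `restart_reason` are hypothetical, and the heartbeat timeout field (with 0 meaning "no check") is an assumption, since the README only says that a net client heartbeat is one of the conditions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:
    max_memory_kb: int = 0         # 0 means no limit
    exec_timeout_ms: int = 0       # 0 means no limit
    heartbeat_timeout_ms: int = 0  # assumed field; 0 disables the heartbeat check


@dataclass
class WorkerSample:
    alive: bool
    memory_kb: int
    running_ms: int
    ms_since_heartbeat: Optional[int] = None  # None for child-process workers


def restart_reason(sample: WorkerSample, limits: Limits) -> Optional[str]:
    """Return why the worker should be restarted, or None if it looks healthy."""
    if not sample.alive:
        return "process dead"  # not configurable: always triggers a restart
    if limits.max_memory_kb > 0 and sample.memory_kb > limits.max_memory_kb:
        return "memory limit exceeded"
    if limits.exec_timeout_ms > 0 and sample.running_ms > limits.exec_timeout_ms:
        return "execution timeout"
    if (
        limits.heartbeat_timeout_ms > 0
        and sample.ms_since_heartbeat is not None
        and sample.ms_since_heartbeat > limits.heartbeat_timeout_ms
    ):
        return "heartbeat missed"
    return None
```

Returning a reason string rather than a boolean makes it easy to log why a worker was restarted.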
data structure
task_id : string, unique among all workers of one node
start_params : string
pid : int, the pid is used to signal the process (SIGTERM, SIGKILL ...)
first_sig : the signal sent to the process first when a worker restart is triggered
second_sig_duration : int, in ms, how long to wait after first_sig before using SIGKILL to terminate the process
heartbeat_interval : if the worker talks to the node as a net client, a heartbeat check is done at this interval
accept_tasks : yes/no, default no, needs to be supported by the worker
exec_interval : int, in ms, interval for repeatedly executing a command
PS: a worker reconnecting to continue its work after the node is gone is not supported, because the node cannot persist some states (restarting task count, stopped tasks ...)
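The fields above, together with the two-stage stop implied by first_sig and second_sig_duration, can be sketched as follows. This is an illustration, not the project's code: the class `TaskRecord`, the function `stop_worker`, all default values, the numeric type of first_sig, the unit of heartbeat_interval and the mapping of yes/no to a bool are assumptions. The sketch targets POSIX systems.

```python
import signal
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class TaskRecord:
    task_id: str
    start_params: str = ""
    pid: int = 0
    first_sig: int = signal.SIGTERM   # assumed default
    second_sig_duration: int = 1000   # ms, assumed default
    heartbeat_interval: int = 0       # unit assumed to be ms
    accept_tasks: bool = False        # "no" by default
    exec_interval: int = 0            # ms


def stop_worker(proc: subprocess.Popen, task: TaskRecord) -> int:
    """Send first_sig, wait second_sig_duration ms, then kill if still running."""
    proc.send_signal(task.first_sig)
    try:
        return proc.wait(timeout=task.second_sig_duration / 1000)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()


# Start a throwaway child process that would otherwise sleep for a minute.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
task = TaskRecord(task_id="demo", pid=proc.pid)
exit_code = stop_worker(proc, task)
```

Giving the worker a grace period after the first signal lets it flush data or close connections before the node falls back to SIGKILL.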
data gathering
data received from workers is cached in a queue (length=N) as entities
fetch from the queue via REST or websocket