Research · Systems · Protocol Design

Research that turns distributed behavior into interpretable intelligence.

This page collects research that sits between systems, human understanding, and real-world decision-making. The first featured project is EGESS, a protocol that turns swarm failure, disagreement, and recovery into an interpretable hazard signal.

TL;DR

EGESS turns missing nodes into an explainable hazard signal.

It uses absence tomography, 2-bit neighbor states, and a conservative `T` score to estimate direction, distance, approach speed, and recovery in a distributed swarm.

Core Idea

Failure patterns are treated as information, not just noise.

What It Shows

Where the hazard is coming from, how close it is, and whether it is spreading or recovering.

Proof

Python node network, visual demo, phase-based evaluation, and exportable paper evidence.

EGESS

Experimental Gear for Evaluation of Swarm Systems

EGESS is a distributed sensing protocol built around a simple but unusual idea: the pattern of node failure can become the signal. Instead of only asking sensors to report the environment, each node watches reachability, disagreement, spread, and recovery across its local neighborhood. I describe this as absence tomography: using what disappears at the edges of the network to reconstruct where hazard pressure is moving.

The improved model separates sudden destruction from persistent spread, then combines them into a conservative instability score called `T`. That makes EGESS useful as a swarm-systems prototype, a fault-aware inference tool, and a repeatable evaluation harness for resilient distributed coordination.

How the idea works

Every cycle, a node pulls status from nearby peers with a lightweight request. If a peer stops responding, EGESS treats that as possible destruction. If a peer is reachable but reports a much higher instability state, EGESS treats that as possible spread. Recovery is also part of the model: when missing nodes return and spread flattens, the system can shift from warning or impact toward contained recovery.

Updated Model

Two lanes feed one conservative signal

Absence Tomography

EGESS treats missing neighbors as an edge-mapped signal. The absence pattern becomes a way to infer hazard shape, direction, and pressure.

Destruction Lane

Confirms sudden missing or unreachable neighbors. This lane is sensitive to fresh breakage and moving impact.

Spread Lane

Confirms persistent disagreement and instability around a node. This lane captures hazard pressure even before direct loss.

Recovery Lane

Tracks returning nodes and stalled spread so recovery is explainable instead of being treated as a silent reset.

Combined T

`T` is the conservative combined instability score. The node listens to the strongest danger signal rather than averaging risk away.

Absence Tomography

The 2-bit state says something is happening. Tomography estimates where it is coming from.

Each node sends a compact 2-bit alert state to its six neighboring nodes. That gives every node a local hazard view without requiring global knowledge of the whole network.

2-bit alert state

00 normal

01 warning

10 watch

11 impact
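As a rough illustration of the table above, the four states fit naturally into a small enum. This is a minimal sketch, not the project's code: the names `AlertState`, `encode`, `decode`, `pack_neighbors`, and `unpack_neighbors` are hypothetical, and packing six neighbor states into one 12-bit integer is an assumption made only to show how compact the local view can be.

```python
from enum import IntEnum


class AlertState(IntEnum):
    # Bit patterns copied from the 2-bit alert state table.
    NORMAL = 0b00
    WARNING = 0b01
    WATCH = 0b10
    IMPACT = 0b11


def encode(state: AlertState) -> str:
    """Render a state as its two-character bit string, e.g. '10'."""
    return format(int(state), "02b")


def decode(bits: str) -> AlertState:
    """Parse a two-character bit string back into a state."""
    return AlertState(int(bits, 2))


def pack_neighbors(states) -> int:
    """Pack six neighbor states into one 12-bit integer (hypothetical layout)."""
    if len(states) != 6:
        raise ValueError("expected exactly six neighbor states")
    value = 0
    for state in states:
        value = (value << 2) | int(state)
    return value


def unpack_neighbors(value: int):
    """Inverse of pack_neighbors: recover the six states in the same order."""
    return [AlertState((value >> (2 * (5 - i))) & 0b11) for i in range(6)]


view = [
    AlertState.NORMAL, AlertState.NORMAL, AlertState.WATCH,
    AlertState.WARNING, AlertState.NORMAL, AlertState.NORMAL,
]
packed = pack_neighbors(view)
print(encode(AlertState.WATCH), packed, unpack_neighbors(packed) == view)
```

Keeping the encoding in one enum means the visualizer and the protocol logs can share a single definition of what each bit pattern means.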

Glassboro example

E and NE are fine. NW has `T = 3` and slope `+1.5`. W has `T = 1` and slope `+0.5`. The local pattern says the strongest pressure is northwest and getting worse.

T-gradient = [0, 0, 3, 1, 0, 0]
Delta bits = [0, 0, 2, 1, 0, 0]

Direction = atan2(3 x sin(120°) + 1 x sin(180°), 3 x cos(120°) + 1 x cos(180°)) ≈ 134° ≈ northwest
Distance = 12 / max(3, 1) = 4 hops
Approach speed = last distance - current distance = 5.2 - 4 = 1.2 hops/cycle
ETA = 4 / 1.2 ≈ 3.3 cycles

In simpler terms: weighted voting becomes direction, direction plus distance becomes movement, and movement gives an ETA. Glassboro can conclude: something hazardous is approaching from the northwest, about four hops away, moving at about 1.2 hops per cycle, and likely to arrive in a little over three cycles.
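The Glassboro arithmetic can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, bearings are measured counterclockwise from east, the 240° and 300° bearings for SW and SE are inferred from a regular six-neighbor layout (the example only shows 120° and 180°), and the constant 12 is copied from the worked example without knowing how it was derived.

```python
import math

# Neighbor bearings in the order E, NE, NW, W, SW, SE (assumed layout).
NEIGHBOR_ANGLES_DEG = (0, 60, 120, 180, 240, 300)

# Constant taken from the worked example; its derivation is not stated.
RANGE_CONSTANT = 12.0


def estimate_direction_deg(t_gradient):
    """Weighted vote over neighbor bearings; None when the votes cancel out."""
    x = sum(t * math.cos(math.radians(a)) for t, a in zip(t_gradient, NEIGHBOR_ANGLES_DEG))
    y = sum(t * math.sin(math.radians(a)) for t, a in zip(t_gradient, NEIGHBOR_ANGLES_DEG))
    if math.hypot(x, y) < 1e-9:
        return None
    return math.degrees(math.atan2(y, x)) % 360


def estimate_distance_hops(t_gradient):
    """Distance shrinks as the strongest neighbor T grows; None with no signal."""
    peak = max(t_gradient)
    return RANGE_CONSTANT / peak if peak > 0 else None


def estimate_approach(last_distance, distance):
    """Return (speed in hops/cycle, ETA in cycles); ETA is None if not approaching."""
    speed = last_distance - distance
    eta = distance / speed if speed > 0 else None
    return speed, eta


t_gradient = [0, 0, 3, 1, 0, 0]
direction = estimate_direction_deg(t_gradient)
distance = estimate_distance_hops(t_gradient)
speed, eta = estimate_approach(5.2, distance)
print(f"direction={direction:.1f} deg, distance={distance:.1f} hops, "
      f"speed={speed:.1f} hops/cycle, eta={eta:.1f} cycles")
```

Returning `None` when there is no usable signal keeps the caller from reading a meaningless bearing or a division by zero as a real estimate.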

Pull Cycle

What each cycle collects

New Missing

A neighbor that was reachable last cycle and fails this cycle. This is the strongest signal for fresh front movement.

Persistent Missing

A neighbor that was already gone and is still gone. This matters for impact and severity, even if the damage is no longer new.

Recovered

A neighbor that was missing and then comes back. Recovery is part of the sensing model, not just a cleanup event.

Disagreement

A neighbor whose instability is already much higher than the observing node's own. This acts as a directional warning and confidence signal.
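The four categories above could be derived from two consecutive reachability snapshots, roughly as follows. This is an illustrative sketch, not the logic in `node.py`: `CycleObservation`, `classify_cycle`, and the `disagreement_gap` parameter are hypothetical, and the gap of 2.0 is an assumed stand-in for "much higher", which the text does not quantify.

```python
from dataclasses import dataclass, field


@dataclass
class CycleObservation:
    new_missing: set = field(default_factory=set)
    persistent_missing: set = field(default_factory=set)
    recovered: set = field(default_factory=set)
    disagreement: set = field(default_factory=set)


def classify_cycle(prev_reachable, now_reachable, neighbor_t, my_t, disagreement_gap=2.0):
    """Sort neighbors into the four pull-cycle categories.

    prev_reachable / now_reachable map neighbor name -> bool.
    neighbor_t maps reachable neighbor name -> reported T.
    """
    obs = CycleObservation()
    for name, reachable in now_reachable.items():
        was_reachable = prev_reachable.get(name, True)
        if not reachable:
            # Missing now: fresh loss if it answered last cycle, else ongoing.
            target = obs.new_missing if was_reachable else obs.persistent_missing
            target.add(name)
            continue
        if not was_reachable:
            obs.recovered.add(name)
        # Assumed threshold for "much higher than the observing node".
        if neighbor_t.get(name, 0.0) - my_t >= disagreement_gap:
            obs.disagreement.add(name)
    return obs


prev = {"E": True, "NE": True, "NW": True, "W": False, "SW": False, "SE": True}
now = {"E": True, "NE": True, "NW": False, "W": False, "SW": True, "SE": True}
reported_t = {"E": 0, "NE": 4, "SW": 1, "SE": 0}
obs = classify_cycle(prev, now, reported_t, my_t=1)
print(obs)
```

In this example NW is a fresh loss, W is still gone, SW has come back, and NE is reachable but reports a noticeably higher `T`.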

Scoring Model

Sub-scores make the conservative T signal explainable

FrontScore

Answers: “Is danger moving toward me?” It emphasizes new disappearances, then reinforces that signal with disagreement, corroboration, persistent loss, and momentum.

FrontScore = 2 x new_missing + disagreement + corroboration + 0.5 x persistent_missing + momentum

ImpactScore

Answers: “How bad is the local damage right now?” It counts total missing neighbors and increases when those losses are adjacent, because clustering suggests real local concentration rather than random isolated failure.

ImpactScore = 3 x total_missing + 2 x adjacency_cluster

ArrestScore

Answers: “Has the event stalled or started recovering?” It rises when no new damage appears for several cycles and neighbors begin returning, then falls if the front keeps spreading sideways.

ArrestScore = stalled + recovering - sideways_spread

T = max(FrontScore, ImpactScore)

The combined score `T` is deliberately conservative: whichever is worse between approaching danger and local damage becomes the node’s working instability value. That makes the signal easier to reason about in both the visualizer and the protocol logs.
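The three formulas and the `max` combination translate directly into code. The sketch below mirrors the formulas as written; the function names and the sample input values are hypothetical, and how each input is counted or scaled in the real project is not specified here.

```python
def front_score(new_missing, disagreement, corroboration, persistent_missing, momentum):
    """FrontScore: is danger moving toward this node?"""
    return (2 * new_missing + disagreement + corroboration
            + 0.5 * persistent_missing + momentum)


def impact_score(total_missing, adjacency_cluster):
    """ImpactScore: how bad is the local damage right now?"""
    return 3 * total_missing + 2 * adjacency_cluster


def arrest_score(stalled, recovering, sideways_spread):
    """ArrestScore: has the event stalled or started recovering?"""
    return stalled + recovering - sideways_spread


def combined_t(front, impact):
    """T takes the worse of the two danger signals instead of averaging them."""
    return max(front, impact)


# Illustrative inputs only, not measured data.
front = front_score(new_missing=1, disagreement=1, corroboration=0,
                    persistent_missing=2, momentum=0)
impact = impact_score(total_missing=3, adjacency_cluster=1)
arrest = arrest_score(stalled=1, recovering=1, sideways_spread=0)
t = combined_t(front, impact)
print(front, impact, arrest, t)
```

Note that ArrestScore does not enter `T` in the formula above; it is kept as a separate sub-score so recovery can be explained alongside the danger signal.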

State Logic

How severity becomes an interpretable state

Normal: `T = 0`. The node sees no active hazard pattern.

Watch: `T >= 3`. The node is starting to see an early front signal.

Warning: `T >= 6`. The signal is reinforced strongly enough to suggest confirmed approach.

Impact: `T >= 10`. The node is effectively in the damage zone.

Recovery: recovered neighbors and flat or declining spread shift the system into contained and recovering states.
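The thresholds can be read as a simple lookup from `T` to a named state. This is a minimal sketch with one loud assumption: the text defines Normal only at `T = 0`, so treating every value below the Watch threshold as normal is a guess, not documented behavior. The function name `state_for` is hypothetical, and recovery is left out because its interaction with the thresholds is not specified.

```python
# Thresholds from the state logic, checked from most to least severe.
THRESHOLDS = (
    (10, "impact"),
    (6, "warning"),
    (3, "watch"),
)


def state_for(t):
    """Map an instability score T to a named state."""
    for minimum, name in THRESHOLDS:
        if t >= minimum:
            return name
    # Assumption: anything below the Watch threshold counts as normal.
    return "normal"


for sample in (0, 2, 3, 6, 10):
    print(sample, state_for(sample))
```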

Evaluation Harness

The improved version is built for repeatable paper evidence

EGESS now has a phase-based evaluation runner for exact active windows of `60s` or `120s`. The harness can run steady baselines, moving hazards, fire/bomb spread, and adversarial stress while collecting compact evidence for dashboards, spreadsheets, figures, and paper appendices.

Phase 1: Baseline

Measures steady-state reachability, throughput, overhead, and per-node load without injected damage.

Phase 2: Fire & Bomb

Simulates center ignition, hop-based spread, temporary bomb impact, and recovery trailing the front.

Phase 3: Tornado Sweep

Moves a hazard band across the grid so local and far watch nodes can show detection over distance.

Phase 4: Adversarial Stress

Injects false unavailability, lying sensors, noisy behavior, flapping, and recovery to test robustness.
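One way to picture the four phases with exact active windows is a small schedule builder. Everything in this sketch is hypothetical: the `Phase` class, the `build_schedule` helper, the phase identifiers, the window chosen for each phase, and the idea that phases run back to back are assumptions for illustration, not the behavior of the actual evaluation runner.

```python
from dataclasses import dataclass

# The harness description mentions exact active windows of 60s or 120s.
ALLOWED_WINDOWS_S = (60, 120)


@dataclass(frozen=True)
class Phase:
    name: str
    window_s: int

    def __post_init__(self):
        if self.window_s not in ALLOWED_WINDOWS_S:
            raise ValueError(f"window must be one of {ALLOWED_WINDOWS_S}")


def build_schedule(phases):
    """Return (name, start_s, end_s) tuples, assuming phases run back to back."""
    schedule = []
    start = 0
    for phase in phases:
        schedule.append((phase.name, start, start + phase.window_s))
        start += phase.window_s
    return schedule


# Window lengths below are illustrative choices, not project settings.
plan = build_schedule([
    Phase("baseline", 60),
    Phase("fire_bomb", 120),
    Phase("tornado_sweep", 120),
    Phase("adversarial_stress", 60),
])
for row in plan:
    print(row)
```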

Implementation Evidence

What exists in the actual project

Python Node Network

EGESS runs as a real multi-node simulation with `node.py`, background protocol loops, trigger tooling, and fault injection controls.

Protocol Overview

The local project includes a full `egess_protocol.html` overview explaining the model, commands, and scoring behavior.

Interactive Visual Demo

`egess-demo` includes a browser-based visualizer that exposes missing neighbors, disagreement, front score, impact score, and `T`.

Auto Demonstrations

The repo supports staged runs like baseline, fire/bomb, tornado sweep, adversarial noise, and recovery.

Proof and Paper Runs

`demo_proof.sh` and `run_paper_eval.sh` export event logs, compact evidence, TSV summaries, dashboards, and portable bundles.

Monitor, Inspector, and Reports

The toolchain includes a terminal monitor, visual inspector, merged dashboards, figure exports, and Google Sheets-ready CSVs.

Why It Matters

Research significance

EGESS matters because it treats failure patterns as information rather than only as noise. The improved version is stronger because it is no longer only a visual swarm idea: it has a detector layout, a proof runner, exact evaluation windows, storage-safe collection, cross-protocol comparison scaffolding, and exportable evidence. That makes it relevant to hazard inference, distributed systems, resilience engineering, and HCI for interpretable autonomy.