
Reliable connection lifecycle, auto-reconnect, and per-SET correlation#68

Merged
jnimmo merged 3 commits into jnimmo:master from peter-dolkens:master
May 14, 2026
Conversation

@peter-dolkens
Contributor

Reliable connection lifecycle, auto-reconnect, and per-SET correlation

Why

I hit a chronic "the IntesisHome integration goes unavailable until I reload it"
problem on my Home Assistant instance. Tracing it through the library turned up
a cluster of related lifecycle bugs in the persistent TCP session, plus an
opportunity to make per-SET behaviour observable to callers.

This PR collects the library-side changes into three commits, each
self-contained. Companion changes for hass-intesishome will follow in a
separate PR.

Three commits

1. fix: protocol cleanup and map alignment

Small standalone fixes uncovered while debugging:

  • get_fan_speed returns None instead of the raw int when the device's
    reported fan_speed value isn't in the device's fan_map. HA would
    otherwise accept the integer as a fan-mode string and fail on the
    reverse-map lookup when setting it.
  • poll_status guards installation.get("devices") against None,
    matching the existing guard on config.get("inst").
  • connect_req is built via json.dumps instead of %-format. The old
    form only produced valid JSON when the cloud's token happened to be
    numeric.
  • COMMAND_MAP key reset_eror renamed to error_reset (fixes the typo
    and matches INTESIS_MAP[54]["name"]; no callers referenced the old key).
  • INTESIS_MAP[6] (hvane read), COMMAND_MAP["vvane"], and
    COMMAND_MAP["hvane"] extended from manual1–5 to manual1–9 so they
    match SWINGMODE_BITS and INTESIS_MAP[5] (vvane). Devices that
    advertise positions 6–9 via the config_*_vanes bitmap can now be
    read and written through the full range.
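The connect_req point is easy to demonstrate in isolation. Below is a minimal
sketch of why %-formatting only yields valid JSON for numeric tokens; the
builder names (connect_req_old, connect_req_new) and the frame shape are
illustrative, not the library's actual code:

```python
import json

# Illustrative frame builders; the real message layout may differ.
def connect_req_old(token):
    # %-format: the token is interpolated unquoted, so the result is
    # valid JSON only when the token happens to be numeric.
    return '{"command": "connect_req", "data": {"token": %s}}' % token

def connect_req_new(token):
    # json.dumps quotes and escapes the token correctly for any type.
    return json.dumps({"command": "connect_req", "data": {"token": token}})

json.loads(connect_req_old(12345))  # fine: a bare number is valid JSON

try:
    json.loads(connect_req_old("abc123"))  # bare abc123 is not valid JSON
except json.JSONDecodeError:
    pass

assert json.loads(connect_req_new("abc123"))["data"]["token"] == "abc123"
```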

2. feat: reliable connection lifecycle with auto-reconnect

Root cause of the unavailability: _send_command called await self.stop()
on response timeout, tearing the whole controller down. Combined with the
keepalive waiting up to 5 s for a response the cloud doesn't actually emit
for GETs (verified empirically via TCP captures), every keepalive cycle
ended in a permanently dead controller. Recovery required a manual
integration reload.

Lifecycle fixes:

  • _send_command on timeout / error closes the writer only via a new
    _close_writer helper. Tasks and the websession stay intact;
    _data_received sees EOF, exits via its finally, which is now the
    single trigger for the disconnect-handling path.
  • The disconnect path cancels the keepalive task (was leaking across
    reconnects) and calls a new _handle_disconnect hook.
  • _send_keepalive is fire-and-forget (wait_for_response=False). The
    cloud has no per-GET ACK, so treating its silence as a failure is
    what was killing the connection.
  • stop() swallows ConnectionResetError / BrokenPipeError /
    OSError on wait_closed() and always nulls the writer. The peer
    often beats us to closing the socket.
  • connect() resets _connecting via try/finally (the previous flow
    left it stuck on auth failure, blocking any retry), raises
    IHConnectionError on socket-open OSError instead of swallowing it,
    and post-checks _connected before launching the keepalive (so we
    don't run keepalive on a session that never received connect_rsp).
  • connect() no longer wipes _devices. The reset-then-repopulate
    window raced any concurrent consumer of get_devices(). poll_status
    overwrites entries in place, so the wipe was only ever defensive
    against removed-upstream devices.
  • connect() doesn't raise on transient socket-level failures.
    Auto-reconnect is the single recovery path.
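The writer-only teardown can be sketched as follows. The class and method
names mirror the description above, but this is an illustrative reduction
over a local loopback socket, not the library's implementation:

```python
import asyncio

class Session:
    """Sketch: on send failure, close only the writer; the reader loop's
    finally block is the single trigger for disconnect handling."""

    def __init__(self):
        self._writer = None
        self.disconnect_handled = 0

    async def _close_writer(self):
        # Tear down only the transport; tasks and session state survive.
        if self._writer is not None:
            self._writer.close()
            try:
                await self._writer.wait_closed()
            except (ConnectionResetError, BrokenPipeError, OSError):
                pass  # the peer may beat us to closing the socket
            self._writer = None

    async def _data_received(self, reader):
        try:
            while True:
                data = await reader.read(1024)
                if not data:  # EOF after _close_writer (or a peer drop)
                    break
        finally:
            self._handle_disconnect()  # single disconnect-handling path

    def _handle_disconnect(self):
        self.disconnect_handled += 1

async def main():
    async def handle(reader, writer):
        await reader.read()  # hold the connection until the client closes

    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)

    session = Session()
    session._writer = writer
    reader_task = asyncio.create_task(session._data_received(reader))

    await session._close_writer()  # simulate a send timeout: writer only
    await reader_task              # reader exits via EOF -> finally -> hook
    server.close()
    await server.wait_closed()
    return session.disconnect_handled

print(asyncio.run(main()))  # → 1
```

The key property is that every teardown route converges on the same
finally block, so disconnect handling cannot run twice or be skipped.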

Auto-reconnect:

  • IntesisHome overrides _handle_disconnect to schedule a
    _reconnect_task with exponential backoff (5 s → 300 s cap).
    Authentication failures halt the loop; connection errors keep
    retrying.
  • stop() disables reconnect (_should_reconnect = False) and
    cancels the running reconnect task so integration unload is clean.
  • connect() re-enables _should_reconnect so a successful manual
    reconnect doesn't disable future automatic recovery.
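The backoff schedule described above can be sketched as a doubling delay
clamped at the cap; the function name is illustrative, not the library's API:

```python
def backoff_delays(initial=5, cap=300, attempts=8):
    """Exponential backoff: start at `initial` seconds, double each
    retry, never exceed `cap` (the 5 s -> 300 s schedule above)."""
    delay, out = initial, []
    for _ in range(attempts):
        out.append(delay)
        delay = min(delay * 2, cap)
    return out

print(backoff_delays())  # → [5, 10, 20, 40, 80, 160, 300, 300]
```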

3. feat: per-seqNo set_ack correlation and awaitable SETs

The cloud emits a set_ack frame for every SET we send, but the library
used to hardcode seqNo=0 and treat set_ack as an unknown command.
There was no way for callers to know whether a SET was acknowledged.

Empirical observation: each set_ack carries the deviceId, an 8-bit echo
of our seqNo (seqNo & 0xFF), and a fresh rssi reading. There is no
protocol-level success / failure signal in the seqNo field — the cloud
ACKs malformed and clamped SETs identically to valid ones. False
therefore only indicates a connection-level problem.

Changes:

  • Each SET uses a unique seqNo allocated cyclically over 0–255 via
    _next_set_seqno. _pending_set_acks tracks
    {seqNo: _PendingSet(uid, value, device_id, future)}.
  • _handle_set_ack resolves the matching future with True and updates
    rssi from the embedded field; stale / unmatched ACKs are logged at
    debug.
  • _set_value is now awaitable and returns bool: True when the cloud
    acknowledged the SET, False if no ACK arrived within 5 s. Multiple
    SETs can be in flight concurrently.
  • On disconnect, pending futures are intentionally left pending so the
    auto-reconnect has a chance to bring the session back within the
    per-SET timeout. stop() resolves them with False since that's an
    intentional shutdown.
  • All public set_* methods on IntesisBase (set_mode,
    set_preset_mode, set_temperature, set_fan_speed,
    set_vertical_vane, set_horizontal_vane, set_power_on,
    set_power_off, and the set_mode_* wrappers) propagate the bool.
    Validation-time early exits return False consistently.
    set_*_vane previously could raise KeyError on an unknown position;
    it now returns False.

Existing callers that don't check the return see no behaviour change;
new callers (notably HA service handlers) can surface SET failures to
their users.
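The correlation mechanism above reduces to a small pattern: allocate a cyclic
seqNo, park a future keyed by it, and resolve that future when the 8-bit echo
comes back. The sketch below uses names echoing the description, but it is an
assumed reduction with the frame I/O elided, not the library's code:

```python
import asyncio

class SetTracker:
    """Sketch of per-SET correlation: cyclic seqNo allocation plus a
    future per in-flight SET, resolved by the cloud's 8-bit echo."""

    def __init__(self):
        self._seqno = 0
        self._pending = {}  # {seqNo: future}

    def _next_set_seqno(self):
        seqno = self._seqno
        self._seqno = (self._seqno + 1) % 256  # cyclic over 0..255
        return seqno

    async def set_value(self, uid, value, timeout=5):
        seqno = self._next_set_seqno()
        fut = asyncio.get_running_loop().create_future()
        self._pending[seqno] = fut
        # ... a frame carrying {"seqNo": seqno, ...} would be written here ...
        try:
            return await asyncio.wait_for(fut, timeout)
        except asyncio.TimeoutError:
            return False  # no ACK: a connection-level problem
        finally:
            self._pending.pop(seqno, None)

    def handle_set_ack(self, ack_seqno):
        fut = self._pending.get(ack_seqno & 0xFF)  # 8-bit echo of our seqNo
        if fut is not None and not fut.done():
            fut.set_result(True)
        # stale/unmatched ACKs would be logged at debug level

async def main():
    tracker = SetTracker()
    task = asyncio.create_task(tracker.set_value("fan_speed", 2))
    await asyncio.sleep(0)        # let the SET register its pending future
    tracker.handle_set_ack(0)     # simulate the cloud echoing seqNo & 0xFF
    print(await task)             # → True

asyncio.run(main())
```

Because each SET holds its own future keyed by seqNo, multiple SETs can be
awaited concurrently without cross-talk, which is the concurrency property
the PR claims.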

Testing

  • Running on my HA install for several days against an Intesis WMP
    (modelId 550, IntesisHome cloud). Observed multiple natural
    keepalive-timeout → reconnect cycles complete cleanly without the
    entity going unavailable.
  • TCP captures from my live device (kept out of the PR to avoid leaking
    account-side state) confirmed the set_ack shape, the 8-bit seqNo
    encoding, and the fact that the cloud doesn't emit GET responses.
    Happy to share sanitised samples privately if helpful for review.
  • All three commits leave the test suite passing (the end-to-end mocked
    HTTP suite added in cecf220 continues to pass).

Open questions

  1. The auto-reconnect default backoff is 5 s → 300 s with a cap. Happy
    to expose those as constructor kwargs if you'd prefer them
    user-tunable.
  2. The change in set_*_vane from "raises KeyError on unknown
    position" to "returns False" is a subtle contract change. I think
    False is more consistent with the rest of the public API, but it's
    a noticeable behavioural difference for any code that relied on the
    exception.
  3. The capture-driven understanding of the set_ack protocol is
    empirical from one device/model; I haven't seen Intesis docs for it.
    If you've got reference material that contradicts my reading
    (especially around success / failure signalling), happy to revise.

Commit messages

1. fix: protocol cleanup and map alignment

Small standalone fixes uncovered while debugging a recurring
'integration goes unavailable' problem:

- get_fan_speed returns None when the device's reported fan_speed
  is not in the fan_map (was leaking the raw int upstream, which
  HA would then accept as a fan_mode string and fail on the
  reverse-map set).
- poll_status guards installation.get('devices') against None,
  matching the existing guard on config.get('inst').
- connect_req is built via json.dumps. The old %-format produced
  valid JSON only when the cloud's token happened to be numeric.
- COMMAND_MAP key 'reset_eror' renamed to 'error_reset' (matches
  INTESIS_MAP[54]['name'] and is a one-line dead-code rename - no
  caller referenced the old typo'd key).
- INTESIS_MAP[6] (hvane read map), COMMAND_MAP['vvane'], and
  COMMAND_MAP['hvane'] extended from manual1-5 to manual1-9 to
  match SWINGMODE_BITS and INTESIS_MAP[5] (vvane). Devices that
  advertise positions 6-9 via the config_*_vanes bitmap can now
  be read and written through the full range.

2. feat: reliable connection lifecycle with auto-reconnect

Hardens the persistent TCP session used by the cloud (IntesisHome,
airconwithme, anywair) so the controller recovers from transient
drops on its own instead of dying permanently after the first
hiccup.

Root cause: _send_command called 'await self.stop()' on response
timeout, tearing the controller down. Combined with the keepalive
waiting up to 5s for a response the cloud doesn't actually emit
for GETs (verified empirically), every keepalive cycle ended in a
permanently dead controller. Recovery required a manual integration
reload.

Lifecycle fixes:
  - _send_command on timeout/error now closes the writer only via
    a new _close_writer helper, leaving tasks and the websession
    intact. _data_received sees EOF and exits via its finally
    block, which is now the single trigger for the disconnect
    handling path.
  - The disconnect path cancels the keepalive task (was leaking
    across reconnects) and calls a new _handle_disconnect hook.
  - _send_keepalive sends and forgets (wait_for_response=False).
    The cloud doesn't emit a per-GET response; treating that
    silence as a failure is what was killing the connection.
  - stop() swallows ConnectionResetError / BrokenPipeError /
    OSError on wait_closed() and always nulls the writer. The
    peer often beats us to closing the socket.
  - connect() resets _connecting via try/finally (the previous
    flow left it stuck on auth failure, blocking any retry),
    raises IHConnectionError on socket-open OSError instead of
    swallowing it, and post-checks _connected before launching
    the keepalive (so we don't run a keepalive on a session that
    never received connect_rsp).
  - connect() no longer wipes _devices. The reset-then-repopulate
    window raced any concurrent consumer of get_devices() (e.g.
    a second platform setting up against the same controller).
    poll_status overwrites entries in place, so the wipe was only
    ever defensive against removed-upstream devices.
  - connect() doesn't raise on transient socket-level failures
    (open_connection succeeds but the auth handshake never
    completes). Raising would make integrations like HA respawn a
    duplicate controller via PlatformNotReady while the original
    one's auto-reconnect was already running, and the Intesis
    cloud RSTs whichever connection it considers stale. The
    single recovery path is now auto-reconnect.

Auto-reconnect:
  - IntesisHome overrides _handle_disconnect to schedule a
    _reconnect_task with exponential backoff (5s -> 300s cap).
    Authentication failures halt the loop (same creds won't fix
    themselves); connection errors keep retrying.
  - stop() disables reconnect via _should_reconnect=False and
    cancels the running reconnect task so integration unload is
    clean.
  - connect() re-enables _should_reconnect so a successful manual
    reconnect doesn't disable future automatic recovery.

The integration's existing async_update_callback can drop its own
reconnect logic now that the library handles it.

3. feat: per-seqNo set_ack correlation and awaitable SETs

The cloud emits a set_ack frame for every SET we send, but the
library used to hardcode seqNo=0 and treat set_ack as an unknown
command. There was no way for callers to know whether a SET was
acknowledged.

Empirical capture (in the captures directory of the integration's
companion repo): each set_ack carries the deviceId, an 8-bit echo
of our seqNo (seqNo & 0xFF), and a fresh rssi reading. There is
no protocol-level success/failure signal in the seqNo field - the
cloud ACKs malformed and clamped SETs identically to valid ones.
False therefore only indicates a connection-level problem.

Changes:
  - Each SET now uses a unique seqNo allocated cyclically over
    0..255 via _next_set_seqno. _pending_set_acks tracks
    {seqNo: _PendingSet(uid, value, device_id, future)} for
    in-flight commands.
  - _handle_set_ack resolves the matching future with True and
    updates rssi from the embedded field; stale/unmatched ACKs
    are logged at debug.
  - _set_value is now awaitable and returns bool: True when the
    cloud acknowledged the SET, False if no ACK arrived within
    _set_ack_timeout (5s). Multiple SETs can be in flight
    concurrently, each with its own future.
  - On disconnect, pending futures are intentionally left
    pending so the auto-reconnect has a chance to bring the
    session back within the per-SET timeout window. stop() does
    resolve them with False since that's an intentional shutdown.
  - All public set_* methods on IntesisBase (set_mode,
    set_preset_mode, set_temperature, set_fan_speed,
    set_vertical_vane, set_horizontal_vane, set_power_on,
    set_power_off, and the set_mode_* convenience wrappers)
    propagate the bool. Validation-time early exits return False
    consistently. set_*_vane previously could raise KeyError on
    an unknown position; it now returns False instead.

Existing callers that don't check the return see no behaviour
change; new callers can surface SET failures to their users.
@peter-dolkens
Contributor Author

Also see jnimmo/hass-intesishome#41 for the matching hass-intesishome PR

@jnimmo jnimmo merged commit 067de68 into jnimmo:master May 14, 2026
4 of 6 checks passed