
Reliable connection lifecycle, auto-reconnect, and per-SET correlation#68

Merged
jnimmo merged 3 commits into jnimmo:master from peter-dolkens:master
May 14, 2026
Conversation

@peter-dolkens
Contributor

Reliable connection lifecycle, auto-reconnect, and per-SET correlation

Why

I hit a chronic "the IntesisHome integration goes unavailable until I reload it"
problem on my Home Assistant instance. Tracing it through the library turned up
a cluster of related lifecycle bugs in the persistent TCP session, plus an
opportunity to make per-SET behaviour observable to callers.

This PR collects the library-side changes into three commits, each
self-contained. Companion changes for hass-intesishome will follow in a
separate PR.

Three commits

1. fix: protocol cleanup and map alignment

Small standalone fixes uncovered while debugging:

  • get_fan_speed returns None instead of the raw int when the device's
    reported fan_speed value isn't in the device's fan_map. HA would
    otherwise accept the integer as a fan-mode string and fail on the
    reverse-map lookup when setting it.
  • poll_status guards installation.get("devices") against None,
    matching the existing guard on config.get("inst").
  • connect_req is built via json.dumps instead of %-format. The old
    form only produced valid JSON when the cloud's token happened to be
    numeric.
  • COMMAND_MAP key reset_eror renamed to error_reset (fixes the typo
    and matches INTESIS_MAP[54]["name"]; no callers referenced the old key).
  • INTESIS_MAP[6] (hvane read), COMMAND_MAP["vvane"], and
    COMMAND_MAP["hvane"] extended from manual1–5 to manual1–9 so they
    match SWINGMODE_BITS and INTESIS_MAP[5] (vvane). Devices that
    advertise positions 6–9 via the config_*_vanes bitmap can now be
    read and written through the full range.
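The connect_req point is easy to demonstrate in isolation. Below is a minimal
sketch of why %-formatting only yields valid JSON for numeric tokens; the
builder names (connect_req_old, connect_req_new) and the frame shape are
illustrative, not the library's actual code:

```python
import json

# Illustrative frame builders; the real message layout may differ.
def connect_req_old(token):
    # %-format: the token is interpolated unquoted, so the result is
    # valid JSON only when the token happens to be numeric.
    return '{"command": "connect_req", "data": {"token": %s}}' % token

def connect_req_new(token):
    # json.dumps quotes and escapes the token correctly for any type.
    return json.dumps({"command": "connect_req", "data": {"token": token}})

json.loads(connect_req_old(12345))  # fine: a bare number is valid JSON

try:
    json.loads(connect_req_old("abc123"))  # bare abc123 is not valid JSON
except json.JSONDecodeError:
    pass

assert json.loads(connect_req_new("abc123"))["data"]["token"] == "abc123"
```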

2. feat: reliable connection lifecycle with auto-reconnect

Root cause of the unavailability: _send_command called await self.stop()
on response timeout, tearing the whole controller down. Combined with the
keepalive waiting up to 5 s for a response the cloud doesn't actually emit
for GETs (verified empirically via TCP captures), every keepalive cycle
ended in a permanently dead controller. Recovery required a manual
integration reload.

Lifecycle fixes:

  • _send_command on timeout / error closes the writer only via a new
    _close_writer helper. Tasks and the websession stay intact;
    _data_received sees EOF, exits via its finally, which is now the
    single trigger for the disconnect-handling path.
  • The disconnect path cancels the keepalive task (was leaking across
    reconnects) and calls a new _handle_disconnect hook.
  • _send_keepalive is fire-and-forget (wait_for_response=False). The
    cloud has no per-GET ACK, so treating its silence as a failure is
    what was killing the connection.
  • stop() swallows ConnectionResetError / BrokenPipeError /
    OSError on wait_closed() and always nulls the writer. The peer
    often beats us to closing the socket.
  • connect() resets _connecting via try/finally (the previous flow
    left it stuck on auth failure, blocking any retry), raises
    IHConnectionError on socket-open OSError instead of swallowing it,
    and post-checks _connected before launching the keepalive (so we
    don't run keepalive on a session that never received connect_rsp).
  • connect() no longer wipes _devices. The reset-then-repopulate
    window raced any concurrent consumer of get_devices(). poll_status
    overwrites entries in place, so the wipe was only ever defensive
    against removed-upstream devices.
  • connect() doesn't raise on transient socket-level failures.
    Auto-reconnect is the single recovery path.
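The writer-only teardown can be sketched as follows. The class and method
names mirror the description above, but this is an illustrative reduction
over a local loopback socket, not the library's implementation:

```python
import asyncio

class Session:
    """Sketch: on send failure, close only the writer; the reader loop's
    finally block is the single trigger for disconnect handling."""

    def __init__(self):
        self._writer = None
        self.disconnect_handled = 0

    async def _close_writer(self):
        # Tear down only the transport; tasks and session state survive.
        if self._writer is not None:
            self._writer.close()
            try:
                await self._writer.wait_closed()
            except (ConnectionResetError, BrokenPipeError, OSError):
                pass  # the peer may beat us to closing the socket
            self._writer = None

    async def _data_received(self, reader):
        try:
            while True:
                data = await reader.read(1024)
                if not data:  # EOF after _close_writer (or a peer drop)
                    break
        finally:
            self._handle_disconnect()  # single disconnect-handling path

    def _handle_disconnect(self):
        self.disconnect_handled += 1

async def main():
    async def handle(reader, writer):
        await reader.read()  # hold the connection until the client closes

    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)

    session = Session()
    session._writer = writer
    reader_task = asyncio.create_task(session._data_received(reader))

    await session._close_writer()  # simulate a send timeout: writer only
    await reader_task              # reader exits via EOF -> finally -> hook
    server.close()
    await server.wait_closed()
    return session.disconnect_handled

print(asyncio.run(main()))  # → 1
```

The key property is that every teardown route converges on the same
finally block, so disconnect handling cannot run twice or be skipped.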

Auto-reconnect:

  • IntesisHome overrides _handle_disconnect to schedule a
    _reconnect_task with exponential backoff (5 s → 300 s cap).
    Authentication failures halt the loop; connection errors keep
    retrying.
  • stop() disables reconnect (_should_reconnect = False) and
    cancels the running reconnect task so integration unload is clean.
  • connect() re-enables _should_reconnect so a successful manual
    reconnect doesn't disable future automatic recovery.
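The backoff schedule described above can be sketched as a doubling delay
clamped at the cap; the function name is illustrative, not the library's API:

```python
def backoff_delays(initial=5, cap=300, attempts=8):
    """Exponential backoff: start at `initial` seconds, double each
    retry, never exceed `cap` (the 5 s -> 300 s schedule above)."""
    delay, out = initial, []
    for _ in range(attempts):
        out.append(delay)
        delay = min(delay * 2, cap)
    return out

print(backoff_delays())  # → [5, 10, 20, 40, 80, 160, 300, 300]
```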

3. feat: per-seqNo set_ack correlation and awaitable SETs

The cloud emits a set_ack frame for every SET we send, but the library
used to hardcode seqNo=0 and treat set_ack as an unknown command.
There was no way for callers to know whether a SET was acknowledged.

Empirical observation: each set_ack carries the deviceId, an 8-bit echo
of our seqNo (seqNo & 0xFF), and a fresh rssi reading. There is no
protocol-level success / failure signal in the seqNo field — the cloud
ACKs malformed and clamped SETs identically to valid ones. False
therefore only indicates a connection-level problem.

Changes:

  • Each SET uses a unique seqNo allocated cyclically over 0–255 via
    _next_set_seqno. _pending_set_acks tracks
    {seqNo: _PendingSet(uid, value, device_id, future)}.
  • _handle_set_ack resolves the matching future with True and updates
    rssi from the embedded field; stale / unmatched ACKs are logged at
    debug.
  • _set_value is now awaitable and returns bool: True when the cloud
    acknowledged the SET, False if no ACK arrived within 5 s. Multiple
    SETs can be in flight concurrently.
  • On disconnect, pending futures are intentionally left pending so the
    auto-reconnect has a chance to bring the session back within the
    per-SET timeout. stop() resolves them with False since that's an
    intentional shutdown.
  • All public set_* methods on IntesisBase (set_mode,
    set_preset_mode, set_temperature, set_fan_speed,
    set_vertical_vane, set_horizontal_vane, set_power_on,
    set_power_off, and the set_mode_* wrappers) propagate the bool.
    Validation-time early exits return False consistently.
    set_*_vane previously could raise KeyError on an unknown position;
    it now returns False.

Existing callers that don't check the return see no behaviour change;
new callers (notably HA service handlers) can surface SET failures to
their users.
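The correlation mechanism above reduces to a small pattern: allocate a cyclic
seqNo, park a future keyed by it, and resolve that future when the 8-bit echo
comes back. The sketch below uses names echoing the description, but it is an
assumed reduction with the frame I/O elided, not the library's code:

```python
import asyncio

class SetTracker:
    """Sketch of per-SET correlation: cyclic seqNo allocation plus a
    future per in-flight SET, resolved by the cloud's 8-bit echo."""

    def __init__(self):
        self._seqno = 0
        self._pending = {}  # {seqNo: future}

    def _next_set_seqno(self):
        seqno = self._seqno
        self._seqno = (self._seqno + 1) % 256  # cyclic over 0..255
        return seqno

    async def set_value(self, uid, value, timeout=5):
        seqno = self._next_set_seqno()
        fut = asyncio.get_running_loop().create_future()
        self._pending[seqno] = fut
        # ... a frame carrying {"seqNo": seqno, ...} would be written here ...
        try:
            return await asyncio.wait_for(fut, timeout)
        except asyncio.TimeoutError:
            return False  # no ACK: a connection-level problem
        finally:
            self._pending.pop(seqno, None)

    def handle_set_ack(self, ack_seqno):
        fut = self._pending.get(ack_seqno & 0xFF)  # 8-bit echo of our seqNo
        if fut is not None and not fut.done():
            fut.set_result(True)
        # stale/unmatched ACKs would be logged at debug level

async def main():
    tracker = SetTracker()
    task = asyncio.create_task(tracker.set_value("fan_speed", 2))
    await asyncio.sleep(0)        # let the SET register its pending future
    tracker.handle_set_ack(0)     # simulate the cloud echoing seqNo & 0xFF
    print(await task)             # → True

asyncio.run(main())
```

Because each SET holds its own future keyed by seqNo, multiple SETs can be
awaited concurrently without cross-talk, which is the concurrency property
the PR claims.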

Testing

  • Running on my HA install for several days against an Intesis WMP
    (modelId 550, IntesisHome cloud). Observed multiple natural
    keepalive-timeout → reconnect cycles complete cleanly without the
    entity going unavailable.
  • TCP captures from my live device (kept out of the PR to avoid leaking
    account-side state) confirmed the set_ack shape, the 8-bit seqNo
    encoding, and the fact that the cloud doesn't emit GET responses.
    Happy to share sanitised samples privately if helpful for review.
  • All three commits leave the test suite passing (the end-to-end mocked
    HTTP suite added in cecf220 continues to pass).

Open questions

  1. The auto-reconnect default backoff is 5 s → 300 s with a cap. Happy
    to expose those as constructor kwargs if you'd prefer them
    user-tunable.
  2. The change in set_*_vane from "raises KeyError on unknown
    position" to "returns False" is a subtle contract change. I think
    False is more consistent with the rest of the public API, but it's
    a noticeable behavioural difference for any code that relied on the
    exception.
  3. The capture-driven understanding of the set_ack protocol is
    empirical from one device/model; I haven't seen Intesis docs for it.
    If you've got reference material that contradicts my reading
    (especially around success / failure signalling), happy to revise.

Commit messages

1. fix: protocol cleanup and map alignment

Small standalone fixes uncovered while debugging a recurring
'integration goes unavailable' problem:

- get_fan_speed returns None when the device's reported fan_speed
  is not in the fan_map (was leaking the raw int upstream, which
  HA would then accept as a fan_mode string and fail on the
  reverse-map set).
- poll_status guards installation.get('devices') against None,
  matching the existing guard on config.get('inst').
- connect_req is built via json.dumps. The old %-format produced
  valid JSON only when the cloud's token happened to be numeric.
- COMMAND_MAP key 'reset_eror' renamed to 'error_reset' (matches
  INTESIS_MAP[54]['name'] and is a one-line dead-code rename - no
  caller referenced the old typo'd key).
- INTESIS_MAP[6] (hvane read map), COMMAND_MAP['vvane'], and
  COMMAND_MAP['hvane'] extended from manual1-5 to manual1-9 to
  match SWINGMODE_BITS and INTESIS_MAP[5] (vvane). Devices that
  advertise positions 6-9 via the config_*_vanes bitmap can now
  be read and written through the full range.

2. feat: reliable connection lifecycle with auto-reconnect

Hardens the persistent TCP session used by the cloud (IntesisHome,
airconwithme, anywair) so the controller recovers from transient
drops on its own instead of dying permanently after the first
hiccup.

Root cause: _send_command called 'await self.stop()' on response
timeout, tearing the controller down. Combined with the keepalive
waiting up to 5s for a response the cloud doesn't actually emit
for GETs (verified empirically), every keepalive cycle ended in a
permanently dead controller. Recovery required a manual integration
reload.

Lifecycle fixes:
  - _send_command on timeout/error now closes the writer only via
    a new _close_writer helper, leaving tasks and the websession
    intact. _data_received sees EOF and exits via its finally
    block, which is now the single trigger for the disconnect
    handling path.
  - The disconnect path cancels the keepalive task (was leaking
    across reconnects) and calls a new _handle_disconnect hook.
  - _send_keepalive sends and forgets (wait_for_response=False).
    The cloud doesn't emit a per-GET response; treating that
    silence as a failure is what was killing the connection.
  - stop() swallows ConnectionResetError / BrokenPipeError /
    OSError on wait_closed() and always nulls the writer. The
    peer often beats us to closing the socket.
  - connect() resets _connecting via try/finally (the previous
    flow left it stuck on auth failure, blocking any retry),
    raises IHConnectionError on socket-open OSError instead of
    swallowing it, and post-checks _connected before launching
    the keepalive (so we don't run a keepalive on a session that
    never received connect_rsp).
  - connect() no longer wipes _devices. The reset-then-repopulate
    window raced any concurrent consumer of get_devices() (e.g.
    a second platform setting up against the same controller).
    poll_status overwrites entries in place, so the wipe was only
    ever defensive against removed-upstream devices.
  - connect() doesn't raise on transient socket-level failures
    (open_connection succeeds but the auth handshake never
    completes). Raising would make integrations like HA respawn a
    duplicate controller via PlatformNotReady while the original
    one's auto-reconnect was already running, and the Intesis
    cloud RSTs whichever connection it considers stale. The
    single recovery path is now auto-reconnect.

Auto-reconnect:
  - IntesisHome overrides _handle_disconnect to schedule a
    _reconnect_task with exponential backoff (5s -> 300s cap).
    Authentication failures halt the loop (same creds won't fix
    themselves); connection errors keep retrying.
  - stop() disables reconnect via _should_reconnect=False and
    cancels the running reconnect task so integration unload is
    clean.
  - connect() re-enables _should_reconnect so a successful manual
    reconnect doesn't disable future automatic recovery.

The integration's existing async_update_callback can drop its own
reconnect logic now that the library handles it.

3. feat: per-seqNo set_ack correlation and awaitable SETs

The cloud emits a set_ack frame for every SET we send, but the
library used to hardcode seqNo=0 and treat set_ack as an unknown
command. There was no way for callers to know whether a SET was
acknowledged.

Empirical capture (in the captures directory of the integration's
companion repo): each set_ack carries the deviceId, an 8-bit echo
of our seqNo (seqNo & 0xFF), and a fresh rssi reading. There is
no protocol-level success/failure signal in the seqNo field - the
cloud ACKs malformed and clamped SETs identically to valid ones.
False therefore only indicates a connection-level problem.

Changes:
  - Each SET now uses a unique seqNo allocated cyclically over
    0..255 via _next_set_seqno. _pending_set_acks tracks
    {seqNo: _PendingSet(uid, value, device_id, future)} for
    in-flight commands.
  - _handle_set_ack resolves the matching future with True and
    updates rssi from the embedded field; stale/unmatched ACKs
    are logged at debug.
  - _set_value is now awaitable and returns bool: True when the
    cloud acknowledged the SET, False if no ACK arrived within
    _set_ack_timeout (5s). Multiple SETs can be in flight
    concurrently, each with its own future.
  - On disconnect, pending futures are intentionally left
    pending so the auto-reconnect has a chance to bring the
    session back within the per-SET timeout window. stop() does
    resolve them with False since that's an intentional shutdown.
  - All public set_* methods on IntesisBase (set_mode,
    set_preset_mode, set_temperature, set_fan_speed,
    set_vertical_vane, set_horizontal_vane, set_power_on,
    set_power_off, and the set_mode_* convenience wrappers)
    propagate the bool. Validation-time early exits return False
    consistently. set_*_vane previously could raise KeyError on
    an unknown position; it now returns False instead.

Existing callers that don't check the return see no behaviour
change; new callers can surface SET failures to their users.
@peter-dolkens
Contributor Author

Also see jnimmo/hass-intesishome#41 for the matching hass-intesishome PR

@jnimmo jnimmo merged commit 067de68 into jnimmo:master May 14, 2026
4 of 6 checks passed