Reliable connection lifecycle, auto-reconnect, and per-SET correlation#68
Merged
Conversation
Small standalone fixes uncovered while debugging a recurring
'integration goes unavailable' problem:
- get_fan_speed returns None when the device's reported fan_speed
is not in the fan_map (was leaking the raw int upstream, which
HA would then accept as a fan_mode string and fail on the
reverse-map set).
- poll_status guards installation.get('devices') against None,
matching the existing guard on config.get('inst').
- connect_req is built via json.dumps. The old %-format produced
valid JSON only when the cloud's token happened to be numeric.
- COMMAND_MAP key 'reset_eror' renamed to 'error_reset' (matches
INTESIS_MAP[54]['name'] and is a one-line dead-code rename - no
caller referenced the old typo'd key).
- INTESIS_MAP[6] (hvane read map), COMMAND_MAP['vvane'], and
COMMAND_MAP['hvane'] extended from manual1-5 to manual1-9 to
match SWINGMODE_BITS and INTESIS_MAP[5] (vvane). Devices that
advertise positions 6-9 via the config_*_vanes bitmap can now
be read and written through the full range.
Hardens the persistent TCP session used by the cloud (IntesisHome,
airconwithme, anywair) so the controller recovers from transient
drops on its own instead of dying permanently after the first
hiccup.
Root cause: _send_command called 'await self.stop()' on response
timeout, tearing the controller down. Combined with the keepalive
waiting up to 5s for a response the cloud doesn't actually emit
for GETs (verified empirically), every keepalive cycle ended in a
permanently dead controller. Recovery required a manual integration
reload.
Lifecycle fixes:
- _send_command on timeout/error now closes the writer only via
a new _close_writer helper, leaving tasks and the websession
intact. _data_received sees EOF and exits via its finally
block, which is now the single trigger for the disconnect
handling path.
- The disconnect path cancels the keepalive task (was leaking
across reconnects) and calls a new _handle_disconnect hook.
- _send_keepalive sends and forgets (wait_for_response=False).
The cloud doesn't emit a per-GET response; treating that
silence as a failure is what was killing the connection.
- stop() swallows ConnectionResetError / BrokenPipeError /
OSError on wait_closed() and always nulls the writer. The
peer often beats us to closing the socket.
- connect() resets _connecting via try/finally (the previous
flow left it stuck on auth failure, blocking any retry),
raises IHConnectionError on socket-open OSError instead of
swallowing it, and post-checks _connected before launching
the keepalive (so we don't run a keepalive on a session that
never received connect_rsp).
- connect() no longer wipes _devices. The reset-then-repopulate
window raced any concurrent consumer of get_devices() (e.g.
a second platform setting up against the same controller).
poll_status overwrites entries in place, so the wipe was only
ever defensive against removed-upstream devices.
- connect() doesn't raise on transient socket-level failures
(open_connection succeeds but the auth handshake never
completes). Raising would make integrations like HA respawn a
duplicate controller via PlatformNotReady while the original
one's auto-reconnect was already running, and the Intesis
cloud RSTs whichever connection it considers stale. The
single recovery path is now auto-reconnect.
Auto-reconnect:
- IntesisHome overrides _handle_disconnect to schedule a
_reconnect_task with exponential backoff (5s -> 300s cap).
Authentication failures halt the loop (same creds won't fix
themselves); connection errors keep retrying.
- stop() disables reconnect via _should_reconnect=False and
cancels the running reconnect task so integration unload is
clean.
- connect() re-enables _should_reconnect so a successful manual
reconnect doesn't disable future automatic recovery.
The integration's existing async_update_callback can drop its own
reconnect logic now that the library handles it.
The cloud emits a set_ack frame for every SET we send, but the
library used to hardcode seqNo=0 and treat set_ack as an unknown
command. There was no way for callers to know whether a SET was
acknowledged.
Empirical capture (in the captures directory of the integration's
companion repo): each set_ack carries the deviceId, an 8-bit echo
of our seqNo (seqNo & 0xFF), and a fresh rssi reading. There is
no protocol-level success/failure signal in the seqNo field - the
cloud ACKs malformed and clamped SETs identically to valid ones.
False therefore only indicates a connection-level problem.
Changes:
- Each SET now uses a unique seqNo allocated cyclically over
0..255 via _next_set_seqno. _pending_set_acks tracks
{seqNo: _PendingSet(uid, value, device_id, future)} for
in-flight commands.
- _handle_set_ack resolves the matching future with True and
updates rssi from the embedded field; stale/unmatched ACKs
are logged at debug.
- _set_value is now awaitable and returns bool: True when the
cloud acknowledged the SET, False if no ACK arrived within
_set_ack_timeout (5s). Multiple SETs can be in flight
concurrently, each with its own future.
- On disconnect, pending futures are intentionally left
pending so the auto-reconnect has a chance to bring the
session back within the per-SET timeout window. stop() does
resolve them with False since that's an intentional shutdown.
- All public set_* methods on IntesisBase (set_mode,
set_preset_mode, set_temperature, set_fan_speed,
set_vertical_vane, set_horizontal_vane, set_power_on,
set_power_off, and the set_mode_* convenience wrappers)
propagate the bool. Validation-time early exits return False
consistently. set_*_vane previously could raise KeyError on
an unknown position; it now returns False instead.
Existing callers that don't check the return see no behaviour
change; new callers can surface SET failures to their users.
Contributor
Author
|
Also see jnimmo/hass-intesishome#41 for the matching hass-intesishome PR |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reliable connection lifecycle, auto-reconnect, and per-SET correlation
Why
I hit a chronic "the IntesisHome integration goes unavailable until I reload it"
problem on my Home Assistant instance. Tracing it through the library turned up
a cluster of related lifecycle bugs in the persistent TCP session, plus an
opportunity to make per-SET behaviour observable to callers.
This PR collects the library-side changes into three commits, each
self-contained. Companion changes for
hass-intesishomewill follow in aseparate PR.
Three commits
1.
fix: protocol cleanup and map alignmentSmall standalone fixes uncovered while debugging:
get_fan_speedreturnsNoneinstead of the raw int when the device'sreported
fan_speedvalue isn't in the device'sfan_map. HA wouldotherwise accept the integer as a fan-mode string and fail on the
reverse-map lookup when setting it.
poll_statusguardsinstallation.get("devices")againstNone,matching the existing guard on
config.get("inst").connect_reqis built viajson.dumpsinstead of%-format. The oldform only produced valid JSON when the cloud's token happened to be
numeric.
COMMAND_MAPkeyreset_erorrenamed toerror_reset(typo + matchesINTESIS_MAP[54]["name"]; no callers referenced the old key).INTESIS_MAP[6](hvane read),COMMAND_MAP["vvane"], andCOMMAND_MAP["hvane"]extended from manual1–5 to manual1–9 so theymatch
SWINGMODE_BITSandINTESIS_MAP[5](vvane). Devices thatadvertise positions 6–9 via the
config_*_vanesbitmap can now beread and written through the full range.
2.
feat: reliable connection lifecycle with auto-reconnectRoot cause of the unavailability:
_send_commandcalledawait self.stop()on response timeout, tearing the whole controller down. Combined with the
keepalive waiting up to 5 s for a response the cloud doesn't actually emit
for GETs (verified empirically via TCP captures), every keepalive cycle
ended in a permanently dead controller. Recovery required a manual
integration reload.
Lifecycle fixes:
_send_commandon timeout / error closes the writer only via a new_close_writerhelper. Tasks and the websession stay intact;_data_receivedsees EOF, exits via itsfinally, which is now thesingle trigger for the disconnect-handling path.
reconnects) and calls a new
_handle_disconnecthook._send_keepaliveis fire-and-forget (wait_for_response=False). Thecloud has no per-GET ACK, so treating its silence as a failure is
what was killing the connection.
stop()swallowsConnectionResetError/BrokenPipeError/OSErroronwait_closed()and always nulls the writer. The peeroften beats us to closing the socket.
connect()resets_connectingviatry/finally(the previous flowleft it stuck on auth failure, blocking any retry), raises
IHConnectionErroron socket-openOSErrorinstead of swallowing it,and post-checks
_connectedbefore launching the keepalive (so wedon't run keepalive on a session that never received
connect_rsp).connect()no longer wipes_devices. The reset-then-repopulatewindow raced any concurrent consumer of
get_devices().poll_statusoverwrites entries in place, so the wipe was only ever defensive
against removed-upstream devices.
connect()doesn't raise on transient socket-level failures.Auto-reconnect is the single recovery path.
Auto-reconnect:
IntesisHomeoverrides_handle_disconnectto schedule a_reconnect_taskwith exponential backoff (5 s → 300 s cap).Authentication failures halt the loop; connection errors keep
retrying.
stop()disables reconnect (_should_reconnect = False) andcancels the running reconnect task so integration unload is clean.
connect()re-enables_should_reconnectso a successful manualreconnect doesn't disable future automatic recovery.
3.
feat: per-seqNo set_ack correlation and awaitable SETsThe cloud emits a
set_ackframe for every SET we send, but the libraryused to hardcode
seqNo=0and treatset_ackas an unknown command.There was no way for callers to know whether a SET was acknowledged.
Empirical observation: each
set_ackcarries the deviceId, an 8-bit echoof our seqNo (
seqNo & 0xFF), and a fresh rssi reading. There is noprotocol-level success / failure signal in the seqNo field — the cloud
ACKs malformed and clamped SETs identically to valid ones.
Falsetherefore only indicates a connection-level problem.
Changes:
_next_set_seqno._pending_set_ackstracks{seqNo: _PendingSet(uid, value, device_id, future)}._handle_set_ackresolves the matching future withTrueand updatesrssi from the embedded field; stale / unmatched ACKs are logged at
debug.
_set_valueis now awaitable and returnsbool:Truewhen the cloudacknowledged the SET,
Falseif no ACK arrived within 5 s. MultipleSETs can be in flight concurrently.
auto-reconnect has a chance to bring the session back within the
per-SET timeout.
stop()resolves them withFalsesince that's anintentional shutdown.
set_*methods onIntesisBase(set_mode,set_preset_mode,set_temperature,set_fan_speed,set_vertical_vane,set_horizontal_vane,set_power_on,set_power_off, and theset_mode_*wrappers) propagate the bool.Validation-time early exits return
Falseconsistently.set_*_vanepreviously could raiseKeyErroron an unknown position;it now returns
False.Existing callers that don't check the return see no behaviour change;
new callers (notably HA service handlers) can surface SET failures to
their users.
Testing
(modelId 550, IntesisHome cloud). Observed multiple natural
keepalive-timeout → reconnect cycles complete cleanly without the
entity going unavailable.
account-side state) confirmed the
set_ackshape, the 8-bit seqNoencoding, and the fact that the cloud doesn't emit GET responses.
Happy to share sanitised samples privately if helpful for review.
HTTP suite added in
cecf220continues to pass).Open questions
to expose those as constructor kwargs if you'd prefer them
user-tunable.
set_*_vanefrom "raises KeyError on unknownposition" to "returns False" is a subtle contract change. I think
False is more consistent with the rest of the public API, but it's
a callable behavioural difference for any code that relied on the
exception.
set_ackprotocol isempirical from one device/model; I haven't seen Intesis docs for it.
If you've got reference material that contradicts my reading
(especially around success / failure signalling), happy to revise.