Skip to content

Enhance ContinuousJfrStarter to utilize JFR DiagnosticCommand MBean for recording management#17851

Merged
gortiz merged 7 commits intoapache:masterfrom
gortiz:improve-continuous-profiler
Mar 24, 2026
Merged

Enhance ContinuousJfrStarter to utilize JFR DiagnosticCommand MBean for recording management#17851
gortiz merged 7 commits intoapache:masterfrom
gortiz:improve-continuous-profiler

Conversation

@gortiz
Copy link
Copy Markdown
Contributor

@gortiz gortiz commented Mar 10, 2026

Summary

This PR reworks Pinot's continuous JFR integration to use the JVM DiagnosticCommand MBean (JFR.configure, JFR.start, JFR.stop) instead of the jdk.jfr.Recording object API, while keeping cluster-level dynamic control through ContinuousJfrStarter.

Why this change is needed

The previous implementation had correctness and operability issues:

  • It used Recording.setDestination(...), which does not match the intended long-running "continuous recording + operator-managed dumps" behavior.
  • It mixed concerns by trying to handle dump-file lifecycle/cleanup in Pinot code.
  • It did not provide a reliable way to control JFR repository location through Pinot in the way operators need.
  • It made semantics around dump boundaries/rotation unclear.

The new approach aligns Pinot with JFR's actual control model and removes behavior that was easy to misinterpret.

Why we decided not to keep using Recording

We moved away from Recording API for control flow because important operational knobs we need are in the global JFR control plane, not the per-recording API:

  • repository path (repositorypath)
  • dump path (dumppath)
  • unified runtime command semantics for start/stop/configure

Using DiagnosticCommand MBean gives one coherent mechanism for all required controls and maps directly to jcmd behavior.

Why we still keep ContinuousJfrStarter (instead of only jcmd or JAVA_OPTS)

We intentionally keep ContinuousJfrStarter as Pinot's control plane because it gives cluster-wide operability that JAVA_OPTS/manual jcmd do not:

  • Dynamic reconfiguration without restart via Pinot cluster config updates.
  • Single cluster-level control surface for enable/disable and JFR parameters.
  • No need to manually run commands on many nodes during incidents.
  • Better consistency with Pinot's existing config propagation model across cluster roles.

JAVA_OPTS and direct jcmd remain valid tools, but they are not sufficient as the primary Pinot control plane for large clusters.

What changed

ContinuousJfrStarter behavior

  • Uses DiagnosticCommand MBean commands:
    • jfrConfigure for runtime options (repositorypath, dumppath)
    • jfrStart for recording start
    • jfrStop for stop by recording name
  • Supports cluster config mapping:
    • pinot.jfr.directory -> repositorypath (kept for backward compatibility)
    • pinot.jfr.dumpPath -> dumppath
    • existing keys still supported: enabled, name, configuration, toDisk, maxSize, maxAge, dumpOnExit
  • Keeps one continuous recording model; no dump rotation/cleanup thread logic.

Reliability hardening

  • If DiagnosticCommand MBean is unavailable, Pinot logs warnings and safely no-ops instead of failing startup/runtime paths.
  • Static initialization is safe even if ObjectName creation fails (logs warning; does not throw).

Tests

  • Unit-style tests validate command generation and control flow.
  • Added integration test that uses the real DiagnosticCommand MBean and verifies recording visibility via jfrCheck.
  • Added coverage for graceful behavior when MBean is unavailable.

Docs (See pinot-contrib/pinot-docs#471)

  • Updated continuous JFR operator docs to reflect:
    • MBean-based control model
    • pinot.jfr.directory compatibility key and pinot.jfr.dumpPath
    • graceful degradation when MBean is unavailable

@gortiz gortiz requested review from Copilot and yashmayya March 10, 2026 16:29
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Pinot’s continuous JFR integration to control recordings via the JVM DiagnosticCommand MBean (jfrConfigure, jfrStart, jfrStop) rather than the jdk.jfr.Recording API, while preserving cluster-level dynamic control through ContinuousJfrStarter.

Changes:

  • Replace Recording API usage with MBean command execution for start/stop/configure.
  • Add runtime option mapping for repository and dump paths and simplify lifecycle (remove dump cleanup thread logic).
  • Update tests to validate generated commands and add an integration-style test using the real DiagnosticCommand MBean.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
pinot-core/src/main/java/org/apache/pinot/core/util/trace/ContinuousJfrStarter.java Switch continuous JFR control to DiagnosticCommand MBean, add repo/dump path configuration, remove dump cleanup thread.
pinot-core/src/test/java/org/apache/pinot/core/util/trace/ContinuousJfrStarterTest.java Update unit tests to assert executed MBean commands and add a real-MBean integration test.
Comments suppressed due to low confidence (1)

pinot-core/src/test/java/org/apache/pinot/core/util/trace/ContinuousJfrStarterTest.java:1

  • Multiple tests assert the full rendered command string (including argument ordering). This makes the suite brittle to harmless internal changes (e.g., reordering arguments) while still producing semantically equivalent commands. Consider asserting on command type plus a set/contains of key arguments (or parsing into a canonical representation) so tests fail only on meaningful command changes.
/**

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 59.91903% with 99 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.21%. Comparing base (58daa99) to head (c62605e).
⚠️ Report is 88 commits behind head on master.

Files with missing lines Patch % Lines
...he/pinot/core/util/trace/ContinuousJfrStarter.java 59.91% 77 Missing and 22 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17851      +/-   ##
============================================
- Coverage     63.23%   63.21%   -0.03%     
- Complexity     1466     1525      +59     
============================================
  Files          3190     3193       +3     
  Lines        192009   193488    +1479     
  Branches      29412    29760     +348     
============================================
+ Hits         121424   122305     +881     
- Misses        61063    61583     +520     
- Partials       9522     9600      +78     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.18% <59.91%> (-0.04%) ⬇️
java-21 63.18% <59.91%> (-0.03%) ⬇️
temurin 63.21% <59.91%> (-0.03%) ⬇️
unittests 63.20% <59.91%> (-0.03%) ⬇️
unittests1 55.49% <59.91%> (-0.07%) ⬇️
unittests2 34.17% <6.88%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment on lines 243 to +252
@VisibleForTesting
protected static Path getRecordingPath(Path parentDir, String name, Instant timestamp) {
String filename = "recording-" + name + "-" + timestamp + ".jfr";
return parentDir.resolve(filename);
}

private void prepareFileDumps(PinotConfiguration subset) {
protected boolean executeDiagnosticCommand(String operationName, String... arguments) {
try {
Path directory = Path.of(subset.getProperty(DIRECTORY, Paths.get(".").toString()));
if (!directory.toFile().canWrite()) {
throw new RuntimeException("Cannot write: " + directory);
}
_mBeanServer.invoke(_diagnosticCommandObjectName, operationName, new Object[]{arguments},
new String[]{String[].class.getName()});
return true;
} catch (Exception e) {
LOGGER.warn("Failed to execute JFR command '{}' with arguments {}", operationName, Arrays.toString(arguments), e);
return false;
}
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

executeDiagnosticCommand() can be invoked with a null _diagnosticCommandObjectName (e.g., if createDiagnosticCommandObjectName() returned null) and will then rely on catching a resulting exception. Adding an explicit null/availability guard here (and logging a clearer one-line message) would avoid noisy stack traces and make the failure mode clearer for callers/subclasses.

Copilot uses AI. Check for mistakes.
Comment on lines +263 to +277
private static boolean waitForRecordingPresence(String recordingName, boolean expectedPresent, int maxAttempts,
long delayMs) {
for (int i = 0; i < maxAttempts; i++) {
boolean present = isRecordingPresent(recordingName);
if (present == expectedPresent) {
return true;
}
try {
Thread.sleep(delayMs);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
return false;
}
}
return false;
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This integration test uses a custom Thread.sleep polling loop with a 1s total wait (10 * 100ms), which is prone to CI flakiness on slow or loaded machines. Prefer using the existing TestUtils.waitForCondition(...) helper (used elsewhere in pinot-core tests) with a more generous timeout and failure message.

Copilot uses AI. Check for mistakes.
Comment on lines +234 to +242
String recordingName = "pinot-continuous-it-" + System.currentTimeMillis();
Map<String, String> enabledConfig = Map.of(
"pinot.jfr.enabled", "true",
"pinot.jfr.name", recordingName,
"pinot.jfr.toDisk", "false");
Map<String, String> disabledConfig = Map.of(
"pinot.jfr.enabled", "false",
"pinot.jfr.name", recordingName,
"pinot.jfr.toDisk", "false");
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The real-MBean integration test only exercises toDisk=false, so it won't catch incompatibilities in generated arguments like maxage/maxsize (which are only sent when toDisk=true). Consider adding an additional integration assertion that starts a recording with toDisk=true and a non-default maxAge to validate that toJfrTimeArgument(...) produces a value accepted by the DiagnosticCommand MBean.

Copilot generated this review using guidance from repository custom instructions.
Copy link
Copy Markdown
Contributor

@yashmayya yashmayya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this seems like a much cleaner approach

Comment on lines +106 to +113
public static final String DEFAULT_MAX_SIZE = "2GB";
/// Key that controls the maximum age of the recording file.
/// Once the file reaches this age, older events will be discarded.
/// If both maxSize and maxAge are set, the recording will be discarded when either condition is met.
///
/// This is only used if toDisk is true.
/// The value is a duration string, as defined by the Duration class.
/// The default value is 1 day (P1D).
/// The default value is 7 days (P7D).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious - why these larger defaults?

return true;
}
try {
Thread.sleep(delayMs);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can use TestUtils::waitForCondition pattern instead?

import org.assertj.core.api.Assertions;
import org.mockito.Mockito;
import org.testng.annotations.BeforeMethod;
import org.testng.SkipException;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is causing the spotless plugin check to fail

@gortiz gortiz requested review from Copilot and yashmayya March 17, 2026 16:04
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Comment on lines +454 to 463
private void cleanupRepositoriesIfNeeded(String recordingName, String repositoryRoot, long maxTotalSizeBytes) {
try {
Configuration configuration = Configuration.getConfiguration(jfrConfName);
return new Recording(configuration);
} catch (ParseException e) {
throw new RuntimeException("Failed to parse JFR configuration '" + jfrConfName + "'", e);
} catch (IOException e) {
throw new UncheckedIOException("Failed to read JFR configuration '" + jfrConfName + "'", e);
Path repositoryRootPath = Path.of(repositoryRoot).toAbsolutePath().normalize();
Path activeRepositoryPath = resolveCurrentRepositoryPath();
cleanupOldRepositoriesInternal(repositoryRootPath, maxTotalSizeBytes, activeRepositoryPath);
} catch (Exception e) {
LOGGER.warn("Failed to clean up old JFR repositories for recording '{}' under '{}'", recordingName, repositoryRoot,
e);
}
}
return;
@VisibleForTesting
protected static void resetGlobalStateForTesting() {
synchronized (GLOBAL_RECORDING_LOCK) {
Comment on lines +399 to +406
if (preserveRepository) {
List<String> configureArgumentsWithPreserve = new ArrayList<>(configureArguments);
configureArgumentsWithPreserve.add("preserve-repository=true");
if (executeDiagnosticCommand(JFR_CONFIGURE_COMMAND, configureArgumentsWithPreserve.toArray(new String[0]))) {
return true;
}
LOGGER.warn("Failed to apply preserveRepository option for continuous JFR recording. Retrying without it");
}
Comment on lines +334 to 341
if (!isDiagnosticCommandAvailable()) {
LOGGER.warn("JFR DiagnosticCommand MBean is unavailable. Cannot start continuous JFR recording");
return false;
}
if (!applyRuntimeOptions(subset)) {
LOGGER.warn("Failed to apply JFR runtime options for recording '{}'", recordingName);
return false;
}
Comment on lines +24 to +25
clusterConfigOverrides.put(CommonConstants.JFR + ".dumpPath", jfrDirectory);
clusterConfigOverrides.put(CommonConstants.JFR + ".preserveRepository", "true");
@xiangfu0 xiangfu0 added the tools Related to Pinot CLI tools and utilities label Mar 20, 2026
@gortiz gortiz merged commit acedd32 into apache:master Mar 24, 2026
15 of 16 checks passed
@gortiz gortiz deleted the improve-continuous-profiler branch March 24, 2026 14:56
xiangfu0 pushed a commit to pinot-contrib/pinot-docs that referenced this pull request Mar 24, 2026
@xiangfu0
Copy link
Copy Markdown
Contributor

Documentation PR: pinot-contrib/pinot-docs#572

This PR updates the Continuous JFR runbook to reflect the DiagnosticCommand MBean integration changes in this PR.

xiangfu0 added a commit to pinot-contrib/pinot-docs that referenced this pull request Mar 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

observability Related to observability (logging, tracing, metrics) tools Related to Pinot CLI tools and utilities

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants