
[receiver/hostmetrics] Divide by logical cores when calculating process.cpu.utilization #31378

Merged

Conversation

BinaryFissionGames (Contributor)

Description:
When calculating the process.cpu.utilization metric, values over 1 were possible because the number of cores was not taken into account (a single process may run on multiple logical cores, effectively multiplying the maximum amount of CPU time the process can accumulate).

This PR adds a division by the number of logical cores to the calculation for cpu utilization.

Link to tracking Issue: Closes #31368

Testing:

  • Added some unit tests
  • Tested locally on my system with the program I posted in the issue:
```json
{
  "name": "process.cpu.utilization",
  "description": "Percentage of total CPU time used by the process since last scrape, expressed as a value between 0 and 1. On the first scrape, no data point is emitted for this metric.",
  "unit": "1",
  "gauge": {
    "dataPoints": [
      {
        "attributes": [{ "key": "state", "value": { "stringValue": "user" } }],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0.8811268516953904
      },
      {
        "attributes": [
          { "key": "state", "value": { "stringValue": "system" } }
        ],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0.0029471002907659667
      },
      {
        "attributes": [{ "key": "state", "value": { "stringValue": "wait" } }],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0
      }
    ]
  }
}
```

In Activity Monitor, this process was clocking in at roughly 1000%–1100% CPU on my machine, which has 12 logical cores. So a value of around 90% total utilization seems correct here.
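That sanity check can be expressed as a quick back-of-the-envelope calculation (illustrative only; the 1050% figure is just a midpoint of the range reported by Activity Monitor):

```go
package main

import "fmt"

func main() {
	// Activity Monitor reports CPU as a percentage of a single core, so
	// ~1050% on a 12-logical-core machine corresponds to a normalized
	// utilization of 10.5 / 12 = 0.875, in line with the ~0.88 user-state
	// data point in the scrape output above.
	fmt.Printf("%.3f\n", 10.5/12.0)
}
```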

Documentation:
N/A

@djaglowski (Member) left a comment
This PR would be correct according to the semantic conventions which define the metric as:

"Difference in process.cpu.time since the last measurement, divided by the elapsed time and number of CPUs available to the process."

@braydonk (Contributor) left a comment

I would definitely wait for @dmitryax to chime in, as part of system semantic conventions work he will be going through and clearing up all the definitions of CPU utilization and process.

Personally I think it's fine to merge this PR because based on the description of this metric in the current metadata.yaml, the existing implementation is categorically wrong and this fixes it as far as I can tell. I'm a bit fuzzy on whether this technically constitutes a breaking change, since we are now fixing something that was incorrect but technically users could be relying on that incorrect number. I'm not sure on that judgement call.

On a personal level I actually am not a huge fan of the described behaviour of this metric. At least in Linux-land, a process' CPU utilization is often interpreted as the percentage of a single core, thus the percentage actually could be over 1 if the process uses more than 1 full core. I discussed this in the last system semconv meeting, and I will open an issue about it for us to work out there.

All that aside, my opinion is that this PR should be merged to match the description, and the semconv working group should make a decision what we want this metric to be in the final process semantic conventions. Definitely wait for Dmitrii to weigh in though.

@andrzej-stencel (Member) left a comment

I agree that this PR aligns the receiver's behavior with the current description in the Semantic Conventions for OS Process Metrics.

The only caveat is that the OS process convention itself is apparently in "Experimental" status. We probably don't want to change this behavior twice. ;)

Perhaps worth marking this as a breaking change to make it more visible to users who might be relying on the current behavior.

@BinaryFissionGames (Author)

Bump @dmitryax - Would like to get your thoughts on this one!

@dmitryax (Member) left a comment

I think it's the right fix given that we don't have an attribute for a particular core as we have for system.cpu.utilization

@dmitryax (Member) commented Mar 7, 2024

But, given that it's a widely used component, should we make the change with a feature gate? @djaglowski @astencel-sumo @braydonk

@andrzej-stencel (Member)

> But, given that it's a widely used component, should we make the change with a feature gate? @djaglowski @astencel-sumo @braydonk

Yeah, I'm afraid I agree. 😉 Adding a feature gate and advancing it through later stages is always additional work, but it's probably justified in this case. This could be quite a disruptive change for users relying on this metric.

@BinaryFissionGames BinaryFissionGames force-pushed the fix/cpu-utilization-divide-by-cores branch from d646652 to 1db0017 Compare March 8, 2024 15:55
@BinaryFissionGames (Author)

@dmitryax I've updated the PR to add a new featuregate, and optionally normalize the metric based on the feature gate.

@andrzej-stencel andrzej-stencel self-requested a review March 8, 2024 20:28
@andrzej-stencel (Member)

@BinaryFissionGames, thanks for adding the feature gate. I've added a couple of comments to make the naming more consistent.

Can you also please add documentation describing the behavior and schedule of the feature gate? See e.g. here for inspiration: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.87.0/extension/storage/filestorage/README.md#extensionfilestoragereplaceunsafecharacters

Comment on lines +205 to +209
The schedule for this feature gate is:
- Introduced in v0.97.0 (March 2024) as `alpha` - disabled by default.
- Moved to `beta` in v0.99.0 (April 2024) - enabled by default.
- Moved to `stable` in v0.101.0 (May 2024) - cannot be disabled.
- Removed three releases after `stable`.
@BinaryFissionGames (Author)

Let me know if this schedule seems too compressed; I tried to go with the two-releases-per-advancement approach.

@andrzej-stencel (Member) left a comment

Love it! Thank you @BinaryFissionGames 👏

@djaglowski djaglowski merged commit 757ddd1 into open-telemetry:main Mar 12, 2024
142 checks passed
@github-actions github-actions bot added this to the next release milestone Mar 12, 2024
DougManton pushed a commit to DougManton/opentelemetry-collector-contrib that referenced this pull request Mar 13, 2024
…ss.cpu.utilization (open-telemetry#31378)

XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this pull request Mar 13, 2024
…ss.cpu.utilization (open-telemetry#31378)


Successfully merging this pull request may close these issues.

[receiver/hostmetrics] The value of process.cpu.utilization may exceed 1
6 participants