start_gpu_data_collector.sh script fails when executed #583

Open
upendart opened this issue May 12, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@upendart

Hello,

I am trying to create a dashboard on Azure for GPU monitoring.
I followed the steps mentioned and updated the fields in the start_gpu_data_collector.sh file with the required details. When I tried to execute the script, it threw the error below:
gethostbyname("::1") failed.

Then I updated the script and executed it as below:
./gpu_data_collector.py -tis $INTERVAL_SECS -dfi $DCGM_FIELD_IDS > /tmp/gpu_data_collector.log
However, I still don't see any GPU Monitor custom fields created in the Log Analytics workspace.

upendart added the bug label on May 12, 2022
@garvct
Collaborator

garvct commented May 17, 2022

I am curious: what modification did you make to the script to overcome the gethostbyname error?

Are you using this GPU monitoring script to monitor the GPU activity of SLURM jobs? This is the default behavior, so if you do not have a SLURM job running, no GPU monitoring data will be sent to Azure Monitor.
If you would like all processes on the nodes to be monitored for GPU activity (even those not associated with a SLURM job) and the data sent to Azure Monitor, add the -fgm command line argument.
I have made some corrections to the start-up and shutdown scripts (start_gpu_data_collector.sh, stop_gpu_data_collector.sh); see #584.
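
As a rough sketch of the -fgm suggestion, the snippet below builds the collector command line with and without that flag. The -tis, -dfi and -fgm options are the ones quoted in this thread; the values assigned to INTERVAL_SECS and DCGM_FIELD_IDS, and the helper name collector_cmd, are placeholders made up for illustration, so substitute the settings from your own start_gpu_data_collector.sh.

```shell
#!/usr/bin/env bash
# Sketch only: INTERVAL_SECS and DCGM_FIELD_IDS are placeholder values
# (assumptions); use the ones set in your start_gpu_data_collector.sh.
INTERVAL_SECS=10
DCGM_FIELD_IDS="203,252"

# collector_cmd is a hypothetical helper, not part of the repository.
# Pass "all" to append -fgm, which asks the collector to report GPU
# activity even when no SLURM job is running.
collector_cmd() {
    local cmd="./gpu_data_collector.py -tis ${INTERVAL_SECS} -dfi ${DCGM_FIELD_IDS}"
    if [ "${1:-}" = "all" ]; then
        cmd="${cmd} -fgm"
    fi
    printf '%s\n' "${cmd}"
}

# Print the command so it can be reviewed before it is run.
collector_cmd all
```

Printing the command first, rather than running it, makes it easy to confirm that -fgm is present before starting the collector on a node.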

If you still do not see any data being sent to Log Analytics (Custom logs), please send me the stdout/stderr (/tmp/gpu_data_collector.log) produced when you execute the Python script.
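
Note that the command quoted earlier in this thread redirects only stdout with `>`, so error messages would not end up in the log. A minimal sketch for capturing both streams follows; run_logged is a hypothetical helper name, and the commented-out invocation assumes INTERVAL_SECS and DCGM_FIELD_IDS are already set in your environment.

```shell
#!/usr/bin/env bash
# Log path taken from this thread; override LOG_FILE to write elsewhere.
LOG_FILE="${LOG_FILE:-/tmp/gpu_data_collector.log}"

# run_logged is a hypothetical helper: it runs the given command and
# writes both stdout and stderr to LOG_FILE (2>&1 sends stderr to the
# same destination as stdout).
run_logged() {
    "$@" > "${LOG_FILE}" 2>&1
}

# Example, to be uncommented on a node where the collector is installed:
# run_logged ./gpu_data_collector.py -tis "$INTERVAL_SECS" -dfi "$DCGM_FIELD_IDS"
# tail -n 50 "${LOG_FILE}"
```

The resulting file then contains any Python traceback or connection error alongside the normal output, which is usually what is needed to diagnose why no data arrives.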
