
VM jobs run indefinitely #152

Open
magsol opened this issue Jun 6, 2018 · 13 comments · May be fixed by #148

Comments


magsol commented Jun 6, 2018

We're testing out a new autograder image on AutoLab, but after making a submission and the autograder spinning up, the launched job simply runs forever.

Expected Behavior

I would expect the job to eventually halt (particularly around the "Timeout" interval set in the autograder, which is 360 seconds in our case), and a Runtime Trace to be available showing the entire command and its output.

Actual Behavior

[Screenshot: Autolab job status page, 2018-06-06 2:57 PM]

Each time I refresh the status page, all the "time elapsed" columns increment, indicating that the jobs are still running. However, when I SSH into the running Tango container, I don't see anything when I run docker ps -a.

[Screenshot: docker ps -a output, 2018-06-06 2:58 PM]

Are these jobs actually running? Or is it some bug in the database? Either way, how can I stop them?

Steps to Reproduce the Behavior

???

Honestly not sure. The autograder settings have a 360-second timeout set, so why these jobs are still running is beyond me. There's no discernible output for any of the jobs: the "Runtime Trace" for each job has entries for adding the job to the queue and eventually "Job [x] started", but that's the last entry. Nothing after that; no debugging output.
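Not an official procedure, but a quick way to check whether these jobs exist anywhere besides the Autolab UI is to ask both Docker and Tango directly. The port (3000) and the REST key below are assumptions; use whatever Tango's config.py actually specifies:

    # On the Tango host: do any grading containers exist at all?
    docker ps -a

    # Ask Tango what it thinks is queued, via its REST "jobs" endpoint
    # (0 = live jobs, 1 = dead jobs); <key> is a placeholder for the key
    # configured in Tango's config.py:
    curl "http://localhost:3000/jobs/<key>/0/"
    curl "http://localhost:3000/jobs/<key>/1/"

If the jobs show up there but no container exists, the entries are stale queue state rather than anything actually executing.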


skoch9 commented Dec 11, 2018

We encounter the same problem: the Docker grading container is no longer running, but the job does not get deleted from the Tango job queue and is therefore blocking the queue.
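One blunt way to unblock a wedged queue is to restart Tango itself so it rebuilds its job state. This is only a sketch: it assumes the docker-compose deployment with services named tango and redis, and it discards any in-flight jobs.

    # Restart Tango so it rebuilds its job queue (in-flight jobs are lost):
    docker-compose restart tango

    # If Tango is configured with USE_REDIS and the stale jobs survive the
    # restart, the Redis store can be wiped too. This is destructive and
    # clears ALL keys, so only do it on a Redis instance dedicated to Tango:
    docker-compose exec redis redis-cli FLUSHALL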

@nitsanshai

Was this issue resolved? I'm hitting the same problem with the default image (ubuntu).


magsol commented Feb 4, 2019

No, it wasn't. Still no answer on our end for what was going on.

@nitsanshai

Were you able to get this to work with any image? What did you have to do for the default image?


devanshk commented Feb 6, 2019

Hm, just looking at the symptoms, it looks like:

  1. Autolab successfully sends a job request to Tango.
  2. Autolab is waiting for Tango to trigger a callback with the results of the job.
  3. Tango must therefore have received the job.
  4. Tango either fails to add the job, or completes it but fails to communicate the result back to Autolab.

Can you give me some more context on your setup? Have you had autograders run successfully in the past? Did you change anything recently? Are Autolab and Tango deployed on the same machine?
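A rough way to narrow down which of those steps is failing (the compose service names and the Autolab URL below are assumptions for a standard docker-compose deployment, not the project's prescribed debugging steps):

    # Steps 3-4: did the job reach Tango, and did it finish? Check Tango's log
    # output (if Tango logs to a file instead, tail the path set by LOGFILE in
    # its config.py):
    docker-compose logs --tail=100 tango

    # Step 4: can the Tango container reach Autolab to deliver the callback?
    # Replace <autolab-host> with your actual Autolab address; curl must be
    # available inside the container:
    docker-compose exec tango curl -sI "http://<autolab-host>/"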


magsol commented Feb 7, 2019

@nitsanshai AutoLab has actually worked fine for me for over two years; the problem only started when we significantly overhauled the Docker image used by the autograder. I unfortunately don't have immediate access to that image anymore (I could probably drum it up, though) but that was the only thing we changed prior to this behavior.

@devanshk Yes, we've had autograders run successfully in the past. It was only when we added an entirely new Docker image for the autograder (one with significant JVM/Scala dependencies) that we observed these problems. That was the only change. Otherwise it's a vanilla one-click install on a single machine, and has otherwise been working fine with no issues.


devanshk commented Feb 7, 2019

Could you join our Slack channel? It would be easier to debug this 1-on-1 and post our findings back here.

https://autolab-slack.herokuapp.com


magsol commented Feb 7, 2019

Sure, but I can't really debug this in the short-term. I filed this ticket in June of last year when these issues cropped up, but we had to move past it months ago. I'm actively using AutoLab for my course right now and that prevents me from doing any live debugging on this issue until the semester is over.


devanshk commented Feb 7, 2019

If you're able to create a test course, we could try things out there. Or, if you send me your Dockerfile, I can replicate your bug on my end and experiment with it.


pratikbin commented Jul 23, 2021

Running docker-compose from a fresh install, I created a course with the hello lab as per the docs, and when I submitted my assignment for autograding it showed me:
[Screenshot: message shown after submitting]

So I ran docker pull autolabproject/autograding_image, and then it worked and went into the queue. But now it sits in the queue forever and autograding never executes, just as in the original report:

We're testing out a new autograder image on AutoLab, but after making a submission and the autograder spinning up, the launched job simply runs forever.
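For anyone hitting the same symptom: the image name configured in the lab's autograder settings has to exist as a local Docker image on the host where Tango runs. A hedged sketch of the fix described above (the bare name autograding_image is an assumption; use whatever your lab settings actually reference):

    # Pull the published grading image
    docker pull autolabproject/autograding_image

    # If the lab settings reference a bare image name, tag the pulled image to match
    docker tag autolabproject/autograding_image autograding_image

    # Confirm the image is now visible on the Tango host
    docker images | grep autograding_image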

@pratikbin

I restarted the whole docker-compose stack, and now it seems to be working, but the job exited with status 125... still investigating.

Runtime Trace

    2021-07-23 18:14:06 +0530 | Added job devops-00-02_hello_3_pratik@xxx:1 to queue
    2021-07-23 18:14:08 +0530 | Dispatched job devops-00-02_hello_3_pratik@xxx:1 [try 0]
    2021-07-23 18:14:08 +0530 | Assigned job devops-00-02_hello_3_pratik@xxxx:1 existing VM prod-1001-autograding_image
    2021-07-23 18:14:08 +0530 | Job devops-00-02_hello_3_pratik@xxxx:1 waiting for VM prod-1001-autograding_image
    2021-07-23 18:14:08 +0530 | VM prod-1001-autograding_image ready for job devops-00-02_hello_3_pratik@xxxx:1
    2021-07-23 18:14:08 +0530 | Input copied for job devops-00-02_hello_3_pratik@xxxx:1 [status=0]
    2021-07-23 18:14:12 +0530 | Job devops-00-02_hello_3_pratik@xxxx:1 executed [status=125]
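For reference, exit status 125 is the code the Docker CLI returns when docker run itself fails (missing image, bad flags, daemon error) rather than the grading command failing inside the container. Assuming Tango surfaces that Docker exit code unchanged, a quick manual check is to run the image by hand; the name autograding_image is taken from the trace above and may need adjusting:

    # Does the image exist locally, and does a container start at all?
    docker images | grep autograding_image
    docker run --rm autograding_image echo "container starts"

If docker run fails here too, its error message usually points at the real cause (for example, a missing or mis-tagged image).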


umar221b commented Jun 1, 2023

@pratikbin did you manage to figure out what was going on? I seem to be having the same problem locally.

@damianhxy

Unfortunately, this seems to be a sporadic issue that we've yet to fully resolve. Some related PRs include #227 and #228.

If you could share the lab / image that you're using, I could take a look.
