NoneType Exception When Queuing Jobs #138

Closed
chriswailes opened this issue Sep 25, 2017 · 4 comments · Fixed by #160

@chriswailes

Expected Behavior

When using Redis with VM ReUse enabled, the server should continue to process jobs.

Actual Behavior

After some combination of job submissions takes place, the jobManager will produce the following error:

DEBUG|2017-09-24 16:58:32,713|JobQueue|get| Acquired lock to job queue.
DEBUG|2017-09-24 16:58:32,724|JobQueue|get| Released lock to job queue.
Starting the stand-alone Tango JobManager
Traceback (most recent call last):
  File "jobManager.py", line 125, in <module>
    jobs.run()
  File "jobManager.py", line 45, in run
    self.__manage()
  File "jobManager.py", line 105, in __manage
    self.jobQueue.makeDead(job.id, str(err))
AttributeError: 'NoneType' object has no attribute 'id'

Steps to Reproduce the Behavior

Run the Tango job manager with Redis and VM ReUse.

Additional Details

The error appears to be the result of getNextPendingJobReuse returning (None, None) at jobManager.py:66. The next line then tries to retrieve a job using None as an ID, which raises an exception and leaves the variable job set to None. The exception handler at jobManager.py:105 then attempts to mark the job as dead; because job is None, the AttributeError above is raised.
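
For illustration, here is a minimal sketch of the failure path and one possible guard. The method names (getNextPendingJobReuse, get, makeDead) come from the traceback and log output above; the loop structure, the variable names, the dispatch call, and the logger call are assumptions for the sketch, not the actual jobManager.py code:

while True:
    # Sketch only -- not the real jobManager.py code.
    job_id, vm = self.jobQueue.getNextPendingJobReuse()
    if job_id is None:
        # (None, None) means there is no runnable job or no free VM; skip this
        # cycle instead of calling get(None) and later dereferencing job.id.
        continue
    job = self.jobQueue.get(job_id)
    try:
        self.__assignJob(job, vm)  # hypothetical dispatch step
    except Exception as err:
        if job is not None:
            self.jobQueue.makeDead(job.id, str(err))
        else:
            self.log.error("Failed to dispatch job %s: %s" % (job_id, err))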

@bstriner

Just saw this bug. Had the same problem myself.

Not a fix to the root problem, but this PR avoids the crash and logs the error.

#146

@chriswailes
Author

It turns out the "solution" to this issue is a necessary but entirely undocumented command. Tango keeps track of "pools" (that aren't actually pools) of VMs and attempts to "allocate" machines from them. For some reason the pool for the first Docker image name given is created for you automatically, but you can't use a different Docker image without manually creating its pool with the following command:

python ./clients/tango-cli -k <key> -l <lab name> --prealloc --image <image name> --num <pool size>

With VM re-use disabled (having it enabled was causing problems for me), the pool size is a useless parameter: Tango will simply spawn as many jobs as it can, as fast as it can.
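
For example (the key, lab name, and image name below are made-up placeholders, not values from a real deployment):

python ./clients/tango-cli -k test_key -l sample_lab --prealloc --image autograding_image --num 2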

@victorhuangwq
Contributor

Indeed there seems to be an issue with VM Reuse, and it still has not been resolved. Bumping this thread so the devs notice it.

@victorhuangwq
Contributor

victorhuangwq commented Jun 20, 2020

Hi, a quick update for those viewing this thread later; read the TLDR below if you don't want to read through the explanation.

Firstly, looking at this line in the codebase, the Docker containers are already destroyed after the feedback is copied out:
https://github.com/autolab/Tango/blob/master/vmms/localDocker.py#L159

I tested commenting out the line where localDocker calls destroyVMs when it finishes its job, and yet the Docker container still exits. It seems that Docker containers are meant to exit when their main command finishes (https://stackoverflow.com/questions/28212380/why-docker-container-exits-immediately). So even without the destroy call, the container exits anyway.
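
This is plain Docker behavior and is easy to confirm outside of Tango (the container name here is arbitrary):

docker run --name exit-test ubuntu echo done
docker ps -a --filter name=exit-test   # status shows "Exited (0) ..." immediately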

In fact, localDocker depends on the container exiting to obtain its output: when I stopped the Docker container from exiting, localDocker was not able to obtain the feedback.

So truly enabling reuse of Docker containers would require a different implementation of localDocker, but I don't think that is needed anyway, because it seems that the correct way to use Docker containers is simply to let them exit when they aren't being used.

TLDR

Applying @bstriner's patch does create a limited pool, capping the number of Docker containers that can be spun up at any one time, but it does not actually enable reuse of Docker containers (reuse might still be possible with other VMs). From what I can see, not reusing Docker containers is the correct way to use them.

@victorhuangwq linked a pull request on Jul 8, 2020 that will close this issue.