
migration to pydantic > 2.0 #613

Open
Vectorrent wants to merge 36 commits into master

Conversation

Vectorrent
Contributor

This is a first pass at upgrading Pydantic to 2.7.3, as discussed in #597. Fortunately, Pydantic is barely used in Hivemind, so this should be fairly straightforward.

So far, I have extended the BaseModel class in order to define a Config object with arbitrary_types_allowed = True. From what I can tell, this was the simplest way to deal with our validator's regex handling. The __get_validators__ method was removed in 2.0, so that code needed some rework.
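
For reference, here is a minimal sketch of the v2 pattern in question (the type name RegexStr and its pattern are hypothetical, not Hivemind's actual code): __get_validators__ is replaced by __get_pydantic_core_schema__, and arbitrary_types_allowed can move onto model_config.

    import re

    from pydantic import BaseModel, ConfigDict, GetCoreSchemaHandler
    from pydantic_core import core_schema

    class RegexStr(str):
        """Hypothetical str subtype validated against a regex."""

        PATTERN = re.compile(r"^[0-9a-f]+$")

        @classmethod
        def __get_pydantic_core_schema__(cls, source_type, handler: GetCoreSchemaHandler):
            # v2 replacement for the removed __get_validators__ hook
            return core_schema.no_info_after_validator_function(cls._validate, core_schema.str_schema())

        @classmethod
        def _validate(cls, value: str) -> "RegexStr":
            if not cls.PATTERN.fullmatch(value):
                raise ValueError(f"{value!r} does not match {cls.PATTERN.pattern}")
            return cls(value)

    class Record(BaseModel):
        # v2 equivalent of `class Config: arbitrary_types_allowed = True`
        model_config = ConfigDict(arbitrary_types_allowed=True)
        key: RegexStr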

I also removed the _patch_schema calls, because 1) field.required was changed to field.is_required in 2.0, and 2) field.is_required is now read-only. I couldn't figure out how to set it... and the code seems to work without it. So is it really necessary here?
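
(For anyone checking: in Pydantic 2, required-ness is exposed on FieldInfo as a read-only method computed from the default, which matches the behavior described above. A quick illustration:)

    from pydantic import BaseModel

    class Example(BaseModel):
        mandatory: int
        optional: int = 0

    # Pydantic 1.x exposed a settable `field.required`; in 2.x it is the
    # read-only FieldInfo.is_required() method.
    print(Example.model_fields["mandatory"].is_required())  # True
    print(Example.model_fields["optional"].is_required())   # False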

DO NOT MERGE; this is an initial attempt. Within the next few days, I will test this in a multi-node environment, and I will test it with Petals as well.

Until then, this version of the code seems to work well. I wanted to open the PR to get us all started.

@Vectorrent
Contributor Author

This PR is as far as I can reasonably take it. I fixed the linting errors introduced by my changes (but not the ones already present in the master branch), and I reverted Pydantic to 2.5.3 (to make it compatible with the CI environment). That said, Hivemind is also compatible with 2.7.3.

I tested this change with a multi-node training run, and it works well. I also tested this with Petals, and was able to run inference there as well.

I'd say this is ready for merge.

@Vectorrent
Contributor Author

It looks like #612 might fix the errors we're seeing in tests here.

@justheuristic self-requested a review on June 6, 2024 at 08:36
@justheuristic
Member

NB: We'll need to update hivemind to fix incompatibility with the latest torch update so we can test this (see failed tests). I intend to do this as soon as I have some bandwidth, but if someone's willing to do this earlier, please do.

@Vectorrent
Contributor Author

It appears that torch.cuda.amp was deprecated and everything was moved to torch.amp, so I fixed that.
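
The change is roughly the following (a sketch with a placeholder model; the exact call sites in hivemind may differ):

    import torch

    model = torch.nn.Linear(8, 8)  # placeholder model for illustration
    batch = torch.randn(2, 8)

    # Old, deprecated spelling: `with torch.cuda.amp.autocast(): ...`
    # New, device-agnostic spelling: pass the device type explicitly.
    with torch.amp.autocast("cuda", enabled=torch.cuda.is_available()):
        out = model(batch)

    # GradScaler moved the same way: torch.cuda.amp.GradScaler() -> torch.amp.GradScaler("cuda")
    scaler = torch.amp.GradScaler("cuda", enabled=torch.cuda.is_available())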

Strangely, some of these tests install torch 2.3.1, while others install 1.9.0 (and thus fail). Not sure if that's intentional, but I'll keep digging.

@samsja
Contributor

samsja commented Jun 7, 2024

It looks like #612 might fix the errors we're seeing in tests here.

Yes, it should fix all CI errors except the black/isort one.

@samsja
Contributor

samsja commented Jun 7, 2024

NB: We'll need to update hivemind to fix incompatibility with the latest torch update so we can test this (see failed tests). I intend to do this as soon as I have some bandwidth, but if someone's willing to do this earlier, please do.

FYI: #612

@Vectorrent
Contributor Author

It looks like @samsja's PR and mine are addressing the same bug, so we should only merge one.

That said, I have no idea why these tests are failing right now while theirs worked. All I'm seeing is "Error: The operation was canceled." The tests are not failing with my code; they simply hang forever and are cancelled after a while.

Both PRs still have the same bug, relating to the Albert test:

#15 20.13   × python setup.py egg_info did not run successfully.
#15 20.13   │ exit code: 1
#15 20.13   ╰─> [6 lines of output]
#15 20.13       Traceback (most recent call last):
#15 20.13         File "<string>", line 2, in <module>
#15 20.13         File "<pip-setuptools-caller>", line 34, in <module>
#15 20.13         File "/tmp/pip-install-_8jw_ha0/pathtools_37faf27303d54936a074efc743740ba1/setup.py", line 25, in <module>
#15 20.13           import imp
#15 20.13       ModuleNotFoundError: No module named 'imp'
#15 20.13       [end of output]

Apparently, the "imp" module was removed in Python 3.12, so I'm sure that's where the problem lies. I was able to reproduce this bug in Docker locally, so I'll spend some time troubleshooting offline.
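
For context, the removed imp.load_source API that old setup.py files rely on has a standard importlib equivalent; the real fix for pathtools has to land upstream or via dependency pinning, but the replacement looks roughly like this:

    import importlib.util

    def load_source(name: str, path: str):
        """importlib-based replacement for imp.load_source (removed in Python 3.12)."""
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module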

@Vectorrent
Contributor Author

It seems that tests/test_optimizer.py is hanging at test_progress_tracker(). Since that code references BytesWithPublicKey, and my Pydantic changes touched that code, it's probably related. I just don't know why it hangs without doing anything.

If anyone with Pydantic experience wants to jump in, I'd appreciate it.

@samsja
Contributor

samsja commented Jun 7, 2024

It seems that tests/test_optimizer.py is hanging at test_progress_tracker(). Since that code references BytesWithPublicKey, and my Pydantic changes touched that code, it's probably related. I just don't know why it hangs without doing anything.

If anyone with Pydantic experience wants to jump in, I'd appreciate it.

I just tried your PR with pydantic 2.5.3 and 2.7.3. It seems to be working fine; at least, the test is not hanging for me. Happy to help if you give me a hint on how to reproduce the hanging behavior.

@Vectorrent
Contributor Author

The code is not hanging anymore, but we are getting some failures relating to the validator. From what I can tell, match_regex() never executes, so it makes sense that we're getting errors about missing fields. I don't know Pydantic well enough to know how to fix this.

The easiest way to reproduce it is with the docker-compose.yml I added. Just cd into that directory and run two commands:

docker compose build
docker compose up

That will execute the tests specified in docker-compose.yml. You can change that as needed.

@Vectorrent
Contributor Author

I made some progress and fixed some Pydantic deprecations, but I've been stumped by this one:

    for schema in self._schemas:
        try:
            parsed_record = schema.model_validate(record)
        except pydantic.ValidationError as e:
            if not self._is_failed_due_to_extra_field(e):
                validation_errors.append(e)
            continue

>       parsed_value = parsed_record.dict(by_alias=True)[field_name]
E       KeyError: 'some_field'
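
One v2 detail that may be relevant here (an observation, not a confirmed diagnosis): BaseModel.dict() is deprecated in Pydantic 2 in favor of model_dump(), and with by_alias=True the result is keyed by aliases rather than field names, so indexing it with the Python field name raises exactly this kind of KeyError. A small illustration with a hypothetical aliased field:

    from pydantic import BaseModel, Field

    class Record(BaseModel):
        some_field: int = Field(alias="some-field")  # hypothetical alias

    r = Record.model_validate({"some-field": 1})
    print(r.model_dump(by_alias=True))  # {'some-field': 1} -- keyed by alias
    print(r.model_dump())               # {'some_field': 1} -- keyed by field name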

@mryab
Member

mryab commented Jun 9, 2024

I started digging into this as well, and the easiest way to triage the problem seems to be replacing all pydantic imports with pydantic.v1 imports. Have you tried that approach? I'd be OK with merging such a change, in case you were hesitating.
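
For anyone unfamiliar with it: Pydantic 2 ships the entire v1 API under the pydantic.v1 namespace, so the triage can be as small as rewriting imports. A sketch with a made-up model:

    # Before (pydantic < 2):
    # from pydantic import BaseModel, conint, validator

    # After (pydantic >= 2, legacy API):
    from pydantic.v1 import BaseModel, conint, validator

    class PeerState(BaseModel):  # hypothetical model for illustration
        epoch: conint(ge=0, strict=True)

        @validator("epoch")
        def check_epoch(cls, value):
            return value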

@Vectorrent
Contributor Author

I didn't even realize pydantic.v1 was an API. Regardless, I made it this far... I might give it a few more days of trying. There are only a couple of tests failing now. But if that doesn't work, I'll do the v1 thing soon. Thanks!

@Vectorrent
Contributor Author

Vectorrent commented Jun 10, 2024

I reverted to the v1 API, since v2 would probably break Pydantic usage in Petals as well... and I don't want to deal with that.

It seems that most tests are passing now. I have no idea why some of them are being cancelled, but it looks like they're timing out. And these failures are not even consistent: whereas 3.8 succeeded on the previous run, it timed out on this attempt. Not sure what's happening there, but it looks like a GitHub issue more than anything.

@Vectorrent
Contributor Author

These tests keep failing with random, transient errors. What works on one run will fail on the next. Hivemind is just being unreliable right now.

@Vectorrent
Contributor Author

I'd recommend merging if you're okay with this code. Tests keep failing, but with transient errors that were probably not introduced by the Pydantic update:

test_averaging.py::test_allreduce_once[0-0] PASSED                       [ 23%]
test_averaging.py::test_allreduce_once[0-1] PASSED                       [ 23%]
test_averaging.py::test_allreduce_once[0-2] PASSED                       [ 24%]
test_averaging.py::test_allreduce_once[1-0] PASSED                       [ 24%]
test_averaging.py::test_allreduce_once[1-1] PASSED                       [ 24%]
test_averaging.py::test_allreduce_once[1-2] PASSED                       [ 25%]
test_averaging.py::test_allreduce_once[2-0] PASSED                       [ 25%]
test_averaging.py::test_allreduce_once[2-1] PASSED                       [ 25%]
test_averaging.py::test_allreduce_once[2-2] PASSED                       [ 25%]

Stuff like this is just a quirk of Hivemind: all_reduce can execute successfully nine times in a row before randomly hanging forever (until the test times out). There is a deeper issue here, but one that should probably be fixed in a different PR.

@justheuristic
Member

LGTM

@mryab
Member

Thanks for the contribution! However, before merging, please reduce the diff of the PR to make sure all changes are Pydantic-related.

LICENSE (outdated; resolved)
docker-compose.yml (outdated; resolved)
@mryab
Member

Thank you again for the pull request! I believe we are close to merging, but there is still a small list of changes that need to be made before then.

tests/test_utils/p2p_daemon.py (outdated; resolved)
tests/test_dht_schema.py (outdated; resolved)
examples/albert/requirements.txt (outdated; resolved)
benchmarks/benchmark_optimizer.py (outdated; resolved)
requirements.txt (outdated; resolved)
.github/workflows/run-tests.yml (outdated; resolved)
@Vectorrent
Contributor Author

I opened #619 and #620 to fix these errors again.

@Vectorrent
Contributor Author

@mryab - is there anything else you need from me? Between the 3 open PRs, we should be able to get past these failing tests.

@samsja
Contributor

samsja commented Jul 5, 2024

@mryab - is there anything else you need from me? Between the 3 open PRs, we should be able to get past these failing tests.

I have been using this PR for a couple of weeks; it seems to be stable.
