tensorboard logging #208

Merged · 6 commits merged into main from bnb/tensorboard_logging · Apr 22, 2024
Conversation

@bnb32 (Collaborator) commented Apr 9, 2024

Adding tensorboard log writing, with some additional timing info.

@castelao (Member) left a comment

At a glance, I don't have anything to add here. But let me know if you would like me to dive in and do a proper review.

One question on the design: could it make sense to split these new resources into a new class instead of continuing to expand AbstractInterface and Sup3rGan? It could be a sibling or an aggregating class (such as ProfiledInterface(AbstractInterface) or ProfiledMixInterface()). The main benefit would be to isolate the related components so they are easier to test, inspect, and maintain later, but it would also allow choosing what to run. For instance, one might prefer to run in production without the profiling-related resources.
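To make the suggestion concrete, here is a minimal sketch of the mixin-style option being proposed. The class names (ProfiledMixIn, BaseModel) are illustrative stand-ins, not the actual sup3r classes; only the `_timing_details` attribute name comes from the PR itself.

```python
import time


class ProfiledMixIn:
    """Hypothetical mixin isolating profiling resources from the model
    classes, so profiling can be tested and dropped independently."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._timing_details = {}

    def record(self, label, fun, *args, **kwargs):
        # Time a single call and stash the elapsed seconds under a
        # 'dt:' key, matching the key style used in the PR.
        t0 = time.time()
        out = fun(*args, **kwargs)
        self._timing_details[f'dt:{label}'] = time.time() - t0
        return out


class BaseModel:
    """Stand-in for the existing model class hierarchy."""

    def forward(self, x):
        return x * 2


class ProfiledModel(ProfiledMixIn, BaseModel):
    """Profiling bolted on sideways; production code could subclass
    BaseModel directly and skip the profiling resources entirely."""


model = ProfiledModel()
result = model.record('forward', model.forward, 21)
```

The cooperative `super().__init__` call is what lets the mixin compose with an arbitrary base class without either side knowing about the other.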

@bnb32 (Collaborator, Author) commented Apr 11, 2024

@castelao Yeah, I considered splitting this into another class as well, so that probably means I should. I've included a flag in the train method to turn off profiling for production, though.

```python
# before
OptimizerClass = getattr(optimizers, class_name)
sig = signature(OptimizerClass)
# after
optimizer_class = getattr(optimizers, class_name)
sig = signature(optimizer_class)
```
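For context, the `getattr` + `signature` pair above is a common pattern for looking up a class by name and filtering kwargs down to what its constructor accepts. A runnable sketch, using a plain stand-in namespace instead of the real optimizers module (which in the PR is presumably tensorflow.keras.optimizers; that detail is an assumption):

```python
from inspect import signature
from types import SimpleNamespace


class Adam:
    """Stand-in optimizer class with a keras-like constructor."""

    def __init__(self, learning_rate=0.001, beta_1=0.9):
        self.learning_rate = learning_rate
        self.beta_1 = beta_1


# Stand-in for the optimizers module the snippet calls getattr on.
optimizers = SimpleNamespace(Adam=Adam)


def init_optimizer(class_name, **kwargs):
    # Look up the class object by name, then keep only the kwargs the
    # constructor actually accepts -- the pattern from the snippet.
    optimizer_class = getattr(optimizers, class_name)
    sig = signature(optimizer_class)
    valid = {k: v for k, v in kwargs.items() if k in sig.parameters}
    return optimizer_class(**valid)


opt = init_optimizer('Adam', learning_rate=0.01, unknown_arg=5)
```

Here `unknown_arg` is silently dropped because it is not in the constructor's signature.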
Member:

I used camel case here because the variable is actually a class object, even though it doesn't have the actual class name. I know it feels/looks kind of weird, but I thought it was appropriate? I could be convinced otherwise.

Collaborator Author:

Yeah, I suppose that makes sense. I figured _class took care of that distinction, and ruff yelled at me about it, so I shrugged.

Member:

I'm also a shrug, but this is why I don't like auto-formatters, haha. No freedom to think for yourself!

Collaborator Author:

Well, it was just a ruff suggestion, to be fair. I had no strong opinion either way.

(Review thread on sup3r/models/abstract.py — outdated, resolved.)
```python
msg = ('Could not run layer #{} "{}" on tensor of shape {}'
       .format(layer_num, layer, hi_res.shape))
logger.error(msg)
raise RuntimeError(msg) from e
```
Member:

Curious, why is this your preference for the nesting of the for loop / try wrapper?

Collaborator Author:

I think it's bad practice to have try/except inside the loop, and there's also a very slight performance cost.

Member:

Why is it bad practice? Makes sense about performance, but I expect it's minimal given that the TensorFlow forward/backward pass operation is expensive.

Collaborator Author:

Bad practice because of the micro-optimization, I guess. If you can do the same thing with a single try/except around the loop rather than one per iteration, that seems like a better practice.

Member:

Gotcha, that makes sense. I liked being able to log which layer broke, but the iter counter works too 👍
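A minimal sketch of the pattern the thread settles on: one try/except wrapping the whole loop, with the loop counter kept in scope so the failing layer can still be reported. The layer classes here are hypothetical stand-ins for the model's real layers.

```python
class Doubler:
    """Toy layer that doubles its input."""

    def __call__(self, x):
        return x * 2


class Failer:
    """Toy layer that always fails, to exercise the error path."""

    def __call__(self, x):
        raise ValueError('boom')


def forward(layers, x):
    layer_num = -1
    layer = None
    try:
        # enumerate keeps the counter and the current layer available
        # to the except block below, so the whole loop needs only one
        # try/except instead of one per iteration.
        for layer_num, layer in enumerate(layers):
            x = layer(x)
    except Exception as e:
        msg = ('Could not run layer #{} "{}" on input {}'
               .format(layer_num, layer, x))
        raise RuntimeError(msg) from e
    return x


out = forward([Doubler(), Doubler()], 1)
```

When a layer raises, the re-raised RuntimeError names the failing layer index, so no diagnostic information is lost by hoisting the try out of the loop.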

```python
t0 = time.time()
loss, loss_details = loss_out
grad = tape.gradient(loss, training_weights)
self._timing_details['dt:tape.gradient'] = time.time() - t0
```
Member:

All of these timer variables are kind of gross and not very readable. I tried to look up a better way to do this and honestly don't see anything great. What about defining a utility function like this:

```python
def timer(fun, *args, **kwargs):
    t0 = time.time()
    out = fun(*args, **kwargs)
    t_elap = time.time() - t0
    return out, t_elap
```

So you could call something like this:

```python
hi_res_exo, t1 = timer(self.get_high_res_exo_input, hi_res_true)
self._timing_details['dt:get_high_res_exo_input'] = t1
```

Not great, but do you think there's a better way to do this besides all of the t0 variables? You could even make the function a method on a timer class that logs the details to self._timing_details instead of returning t_elap.

Collaborator Author:

Yeah, I like the class idea:

```python
out = fun(*args, **kwargs)
t_elap = time.time() - t0
self.log[f'elapsed:{fun.__name__}'] = t_elap
return out
```
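Assembling the fragment above into a complete class, a Timer along these lines might look as follows. The `log` attribute and the `elapsed:` key format come from the snippet; the rest (the class name, making it callable) is an illustrative guess at the final shape, not the merged implementation.

```python
import time


class Timer:
    """Sketch of the Timer class discussed above: wraps any call,
    records the elapsed time keyed by function name."""

    def __init__(self):
        self.log = {}

    def __call__(self, fun, *args, **kwargs):
        # Time the call and record it, so callers don't need to juggle
        # t0 variables around every timed operation.
        t0 = time.time()
        out = fun(*args, **kwargs)
        self.log[f'elapsed:{fun.__name__}'] = time.time() - t0
        return out


def square(x):
    return x * x


timer = Timer()
result = timer(square, 4)
```

One caveat the thread touches on below: consecutive calls to the same function overwrite the same log key, which is fine if the log is flushed every batch.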
Member:

Why don't you want Timer to be part of TensorboardMixIn?

How does the log discern consecutive calls to the function? Just because you write the log to tensorboard every epoch?

I guess I could see how Timer would be useful outside of the tensorboard stuff with additional instance methods — maybe an auto-logger. You don't need to add it now, but it could be useful in the future.

Collaborator Author:

Yeah, I figured it was a pretty general method, and the log is actually written for each batch.

Member:

Makes sense 👍

@bnb32 bnb32 merged commit a171300 into main Apr 22, 2024
8 checks passed
@bnb32 bnb32 deleted the bnb/tensorboard_logging branch April 22, 2024 14:58
github-actions bot pushed a commit that referenced this pull request Apr 22, 2024