Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow performance due copying of instances #987

Open
yoavkatz opened this issue Jul 3, 2024 · 7 comments
Open

Slow performance due copying of instances #987

yoavkatz opened this issue Jul 3, 2024 · 7 comments
Assignees
Labels
bug Something isn't working

Comments

@yoavkatz
Copy link
Member

yoavkatz commented Jul 3, 2024

Current dict_utils performs overwrite of dictionaries:

if query.strip() == "":
       # change the whole input dic, as dic indeed matches ""
       if isinstance(dic, dict):
           if value is None or not isinstance(value, dict):
               raise ValueError(
                   f"Through an empty query, trying to set a whole new value, {value}, to the whole of dic, {dic}, but value is not a dict"
               )
           dic.clear()
           dic.update(value)
           return

       if isinstance(dic, list):
           if value is None or not isinstance(value, list):
               raise ValueError(
                   f"Through an empty query, trying to set a whole new value, {value}, to the whole of dic, {dic}, but value is not a list"
               )
           dic.clear()
           dic.extend(value)
           return

This caused problems because operators made changes to data of their source and it cause conflict with other operators. To workaround this, we now copy instances :

class InstanceFieldOperator(InstanceOperator):

def process(
       self, instance: Dict[str, Any], stream_name: Optional[str] = None
   ) -> Dict[str, Any]:
       # Need to deep copy instance, because when assigning two dictionary fields,
       # dict_set() the target field dictionary fields.
       # This means that if this target field was assigned to another field before,
       # the field is updated as well.
       instance = deepcopy(instance)
   
   

However this cause performance issue with large datasets (6x runtime increase) We should change the code such that new instances are returned with shallow copy - and only relevant atomic are copied.

@yoavkatz yoavkatz added the bug Something isn't working label Jul 3, 2024
@elronbandel
Copy link
Member

To me operators should modify dictionaries in place, the reason is that that's how most people work with dictionaries. In cases where this creates confusion we should avoid it. Also I think we should have a separate issue for creating performance benchmark to unitxt we run on different pipelines, so we can keep track over performance across PRs.

@yoavkatz
Copy link
Member Author

yoavkatz commented Jul 3, 2024

The problem is when operators modify nested dictionary

instance:
predictions : { "a" : 3, "b" :4}
references : [{ "a" : 5, "b" :6}]

Then we have an operator Copy(field="predictions", to_field="orig_predictions"}

instance:
predictions : { "a" : 3, "b" :4}
orig_predictions = { "a" : 3, "b" :4} (same object as above)
references : [{ "a" : 5, "b" :6}]

Now we do Copy(field = "references/0/a" to, "predictions/a")

instance:
predictions : { "a" : 5, "b" :4}
orig_predictions = { "a" : 5, "b" :4} (same object as above < - notice how it was corrupted)
references : [{ "a" : 5, "b" :6}]

How do you suggest we handle it?

@dafnapension
Copy link
Collaborator

dafnapension commented Jul 3, 2024

This is the point that wonders me, and without knowing, mentioned in our slack: please write the name of the metric that dares to change the predictions.
That metric should be punished. Not the whole of Unitxt. Having to copy the fields before each and every application of every metric is unbearable.
The bug is in thinking that copying a structured field (like the predictions that you gave above), can save that field from the mutation that the metric does to it. A structured field is not saved by copying, as opposed to deepcopying.

@dafnapension
Copy link
Collaborator

dafnapension commented Jul 3, 2024

Strangely enough, I manage to reproduce the above bug (that I thought should be a bug, indeed) only with 'bare' dict_utils, but not with Copy Operator, although I commented out the line instance = deepcopy(instance) in InstanceFieldOperator.process() . Not sure why. Perhaps more deepcopy -s are hiding all over there? yes. in Copy.process_value. When I removed the latter, Indeed the bug is reproduceable also with the operators, and the following throws an exception at the end.

from unitxt.dict_utils import dict_get, dict_set
from unitxt.operators import Copy
from unitxt.stream import MultiStream

instance = {"predictions" : { "a" : 3, "b" :4}, "references" : [{ "a" : 5, "b" :6}]}
val = dict_get(instance,"predictions")
dict_set(instance, "predictions_orig", val)
instance["predictions"]["a"] = 10
assert(instance["predictions_orig"]["a"] == 10)

instance = {"predictions" : { "a" : 3, "b" :4}, "references" : [{ "a" : 5, "b" :6}]}
ms = MultiStream.from_iterables({"tmp" :[instance]})
copyoperator = Copy(field="predictions", to_field="predictions_orig")
ms = copyoperator(ms)
instance2 = list(ms["tmp"])[0]
assert(instance2["predictions_orig"] == instance2["predictions"])
instance2["predictions"]["a"] = 10
assert(instance2["predictions_orig"]["a"] == 3)

@dafnapension
Copy link
Collaborator

So, until that metric that modifies predictions is crucified, here is an attempt to reduce the damange:
#991

@dafnapension
Copy link
Collaborator

in the branch suggested in the above PR:

from unitxt.dict_utils import dict_get, dict_set
from unitxt.operators import Copy
from unitxt.stream import MultiStream


instance = {"predictions" : { "a" : 3, "b" :4}}
ms = MultiStream.from_iterables({"tmp" :[instance]})
copyoperator = Copy(field="predictions", to_field="predictions_orig")
ms = copyoperator(ms)
instance2 = list(ms["tmp"])[0]
assert(instance2["predictions_orig"] == instance2["predictions"])
instance2["predictions"]["a"] = 10
assert(instance2["predictions_orig"]["a"] == 10)
# and with use_deep_copy = true:
instance = {"predictions" : { "a" : 3, "b" :4}}
ms = MultiStream.from_iterables({"tmp" :[instance]})
copyoperator = Copy(field="predictions", to_field="predictions_orig", use_deep_copy=True)
ms = copyoperator(ms)
instance2 = list(ms["tmp"])[0]
assert(instance2["predictions_orig"] == instance2["predictions"])
instance2["predictions"]["a"] = 10
assert(instance2["predictions_orig"]["a"] == 3)

@dafnapension
Copy link
Collaborator

Addressed at #1087
An important lesson learned in the process is:
#1087 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants