Why is dataclasses.asdict() slower than __dict__ in Python?

Why is dataclasses.asdict(obj) significantly slower than accessing obj.__dict__ in Python?

I am using Python 3.6 with the dataclasses backport package from ericvsmith. It appears that calling dataclasses.asdict(my_dataclass) is around 10x slower than accessing my_dataclass.__dict__:

import dataclasses
from dataclasses import dataclass

@dataclass
class MyDataClass:
    a: int
    b: int
    c: str

I tested this with the following code:

%%time
_ = [MyDataClass(1, 2, "A" * 1000).__dict__ for _ in range(1_000_000)]

CPU times: user 631 ms, sys: 249 ms, total: 880 ms
Wall time: 880 ms

And the following with dataclasses.asdict():

%%time
_ = [dataclasses.asdict(MyDataClass(1, 2, "A" * 1000)) for _ in range(1_000_000)]

CPU times: user 11.3 s, sys: 328 ms, total: 11.6 s
Wall time: 11.7 s

Is this expected behavior? When should I use dataclasses.asdict(obj) instead of obj.__dict__?

Note: Using __dict__.copy() doesn’t make a significant difference:

%%time
_ = [MyDataClass(1, 2, "A" * 1000).__dict__.copy() for _ in range(1_000_000)]

CPU times: user 922 ms, sys: 48 ms, total: 970 ms
Wall time: 970 ms

What factors contribute to the performance difference when converting a Python dataclass to a dict?

Well, I’ve been working with Python dataclasses for quite a while, and one key thing I’ve noticed is that dataclasses.asdict() can be noticeably slower than reading __dict__ directly, especially when performance is a top concern. The reason is that dataclasses.asdict() goes the extra mile to recursively convert any nested dataclasses (and lists, tuples, and dicts containing them) into dictionaries, deep-copying the values along the way, while __dict__ simply exposes the object’s existing attribute dictionary without any additional processing. That makes converting a dataclass to a dict via __dict__ much faster when you don’t have nested structures. For example:

_ = [obj.__dict__ for obj in my_data_list]

This method will be much quicker for simple dataclasses, especially if there’s no need for deep copying or handling nested objects.
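For instance, here’s a quick sketch of the difference with nesting (the Inner/Outer names are just for illustration):

from dataclasses import dataclass, asdict

@dataclass
class Inner:
    x: int

@dataclass
class Outer:
    inner: Inner

obj = Outer(Inner(1))

# asdict() recurses into nested dataclasses and deep-copies the values.
print(asdict(obj))    # {'inner': {'x': 1}}

# __dict__ just exposes the instance's attribute dict as-is.
print(obj.__dict__)   # {'inner': Inner(x=1)}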

Yeah, exactly! Following up on @devan-skeem’s point, if you don’t require a deep copy and you’re okay with the result being the very same dict the instance uses (so mutating it mutates the instance), you can go ahead and access __dict__ directly. It’s more efficient than dataclasses.asdict() because the latter does a recursive deep copy, which can be overkill. If you only need a shallow copy, obj.__dict__.copy() is a better solution:

_ = [obj.__dict__.copy() for obj in my_data_list]

This still gives you a copy but without all the overhead of deep copying, making it more efficient. Deep copies are useful when you have nested structures that need to be isolated, but for simpler cases, this shallow approach will often be fast enough.
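To make the trade-off concrete, here’s a small sketch reusing MyDataClass from the question:

import dataclasses
from dataclasses import dataclass

@dataclass
class MyDataClass:
    a: int
    b: int
    c: str

obj = MyDataClass(1, 2, "A")

d = obj.__dict__              # the instance's own attribute dict, no copy at all
d["a"] = 99                   # this mutates the instance: obj.a is now 99

d2 = obj.__dict__.copy()      # shallow copy: a new dict, values still shared
d2["b"] = 0                   # obj.b is unchanged

d3 = dataclasses.asdict(obj)  # new dict with recursively deep-copied values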

Yes, and to build on what @devan-skeem said, when you’re converting a large number of dataclass instances to dicts and need to optimize performance, you might want to think about using something like pandas. For large datasets, pandas handles tabular data much more efficiently than lists of native Python objects. Converting your dataclasses to a pandas DataFrame can significantly speed up your processing. Here’s how you can do it:

import pandas as pd
df = pd.DataFrame([obj.__dict__ for obj in my_data_list])

This allows you to manipulate the data much faster, particularly when you have thousands of instances. So, if performance is a concern in these scenarios, switching to pandas can be a game changer.
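And if you later need plain dicts again, the DataFrame can hand them back. A small sketch, assuming my_data_list holds flat instances of the question’s MyDataClass:

import pandas as pd

# Assumed sample data, just for illustration.
my_data_list = [MyDataClass(i, i + 1, "A") for i in range(3)]

df = pd.DataFrame([obj.__dict__ for obj in my_data_list])
records = df.to_dict(orient="records")   # back to a list of plain dicts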