Dataclass vs TypedDict in Python
Type annotations were introduced to Python with PEP 484, and since then they have become a staple of most Python programs. They open up Python code to easier static analysis, easier refactoring, and optional runtime type checking. They also make the code a lot easier to reason about in general.
Take the following example:
def hello(a):
    return "hello " + a
hello(2)
The above code would raise this error:
TypeError: can only concatenate str (not "int") to str
However, using type annotations would help us catch this error before the code is even executed:
def hello(a: str) -> str:
    return "hello " + a
hello(2)
Checking this code with mypy would yield the following error:
error: Argument 1 to "hello" has incompatible type "int"
Of course, using a good IDE or text editor with good static analysis tools would help us catch this error while writing the code.
Obviously, Python type annotations work with classes as well. For example:
import requests
from typing import List
class Book:
    title: str
    authors: List[str]
def get_book_from_api(endpoint: str) -> Book:
    response = requests.get(endpoint)
    response_json = response.json()
    book = Book()
    book.title = response_json["title"]
    book.authors = response_json["authors"]
    return book
In the example above, the `Book` class works as a wrapper around some data we
get from a REST endpoint. It contains two fields, a title field and an authors
field, and we have used type annotations to declare the type of each. However,
assigning the values to each field one by one is tedious. The conventional
object-oriented approach is to assign the attributes inside the `__init__`
method:
class Book:
    title: str
    authors: List[str]
    def __init__(self, title: str, authors: List[str]):
        self.title = title
        self.authors = authors
def get_book_from_api(endpoint: str) -> Book:
    response = requests.get(endpoint)
    response_json = response.json()
    book = Book(title=response_json["title"], authors=response_json["authors"])
    return book
This is a real step forward: we now have a Python class that uses type
annotations and is constructed through its `__init__` method. But this class is
still just a wrapper around some data, so it would be great if we could reduce
the boilerplate while keeping all the advantages that the Python type system
gives us.
This is where `dataclass` becomes useful. It's a Python decorator that
automatically adds the `__init__` method (including types!) and some additional
goodies to a Python class. It can be used as follows:
import requests
from dataclasses import dataclass
from typing import List
@dataclass
class Book:
    title: str
    authors: List[str]
def get_book_from_api(endpoint: str) -> Book:
    response = requests.get(endpoint)
    response_json = response.json()
    book = Book(title=response_json["title"], authors=response_json["authors"])
    return book
As we can see, there's no need to manually define the `__init__` method
anymore. This not only saves time when writing classes, it also lowers the
chance of errors, since there's less boilerplate code for us to type.
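To make the "additional goodies" a bit more concrete, here is a small sketch of two methods the decorator generates besides `__init__`: a readable `__repr__` and a field-by-field `__eq__`. The book values are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Book:
    title: str
    authors: List[str]


# Hypothetical sample data, used only for illustration.
first = Book(title="Dune", authors=["Frank Herbert"])
second = Book(title="Dune", authors=["Frank Herbert"])

# The generated __repr__ lists every field by name.
print(first)  # Book(title='Dune', authors=['Frank Herbert'])

# The generated __eq__ compares two instances of the same class field by field.
print(first == second)  # True
```

Without the decorator, `print(first)` would show the default object representation and `first == second` would be `False`, because plain classes compare by identity.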
However, dataclasses are not a silver bullet and have some drawbacks. But in order to understand those drawbacks we must first introduce the notion of duck typing. The idea behind it is simple: if it quacks like a duck and walks like a duck then it's a duck. In other words, an object is of a particular type if it has all the methods and attributes required by that type.
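As a quick illustration of the idea, the sketch below passes two unrelated classes to the same function. The class and function names are invented for this example; the only thing the function relies on is that its argument has a `quack` method.

```python
class Duck:
    def quack(self) -> str:
        return "Quack!"


class Robot:
    # Unrelated to Duck, but it offers the same method.
    def quack(self) -> str:
        return "Beep quack!"


def make_it_quack(thing) -> str:
    # No isinstance check: any object with a quack() method is accepted.
    return thing.quack()


print(make_it_quack(Duck()))   # Quack!
print(make_it_quack(Robot()))  # Beep quack!
```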
Dataclass equality does not follow duck typing, though. Consider the following example:
@dataclass
class Table:
    color: str
    age: int
@dataclass
class Bookshelf:
    color: str
    age: int
Table(color="brown", age=12) == Bookshelf(color="brown", age=12) # False
Following the logic of duck typing, the `Table` and `Bookshelf` objects should
represent the same type, since they have the same attributes and the same
values. However, they are not treated as equal because they belong to
different classes: the `__eq__` method that `dataclass` generates only compares
instances of the same class. To strip away that difference between the two
objects, we have to resort to dictionaries:
{"color": "brown", "age": 12} == {"color": "brown", "age": 12} # True
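If the data already lives in dataclasses, one possible way to get this dictionary-style comparison is the standard library's `dataclasses.asdict`, which converts an instance into a plain dict. A minimal sketch, redefining the `Table` and `Bookshelf` classes from above so that it runs on its own:

```python
from dataclasses import asdict, dataclass


@dataclass
class Table:
    color: str
    age: int


@dataclass
class Bookshelf:
    color: str
    age: int


table = Table(color="brown", age=12)
shelf = Bookshelf(color="brown", age=12)

# Different classes, so the dataclass comparison fails...
print(table == shelf)  # False

# ...but the plain dicts built from the same fields compare equal.
print(asdict(table) == asdict(shelf))  # True
```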
But using Python dicts would make us lose all the advantages that Python types
give us, right? Wrong. This is where `TypedDict` comes into play. `TypedDict`
lets us declare the keys and value types of a plain Python dict, so that dicts
can be type-checked too:
from typing import TypedDict
class Table(TypedDict):
    color: str
    age: int
class Bookshelf(TypedDict):
    color: str
    age: int
Table(color="brown", age=12) == Bookshelf(color="brown", age=12) # True
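The comparison succeeds because a `TypedDict` only exists for the type checker. At runtime, calling it simply builds an ordinary dict, as this short sketch shows:

```python
from typing import TypedDict


class Table(TypedDict):
    color: str
    age: int


table = Table(color="brown", age=12)

# The object we get back is a plain dict, not an instance of a new class.
print(type(table) is dict)  # True

# So it compares equal to any dict with the same keys and values.
print(table == {"color": "brown", "age": 12})  # True
```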
However, even though `TypedDict` can give us all the advantages of duck typing,
it has two big disadvantages compared to dataclasses:
Validation is harder
Not immutable
Let's start with the first point. Since dataclasses have an `__init__` method
(and support a `__post_init__` hook that runs right after it), their fields can
be validated whenever an object of that type is constructed. This is not true
for `TypedDict`: at runtime it is an ordinary dict with no constructor hook, so
every time we construct a dictionary of a specific type we have to validate its
values ourselves.
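The difference can be sketched as follows. The rule that `age` must be non-negative and the `validate_table` helper are assumptions made up for this example; the point is only that the dataclass check runs on construction, while the `TypedDict` check has to be called by hand.

```python
from dataclasses import dataclass
from typing import TypedDict


@dataclass
class Table:
    color: str
    age: int

    def __post_init__(self) -> None:
        # Called automatically at the end of the generated __init__.
        if self.age < 0:
            raise ValueError("age must be non-negative")


class TableDict(TypedDict):
    color: str
    age: int


def validate_table(table: TableDict) -> TableDict:
    # Hypothetical helper: nothing calls it unless we do so explicitly.
    if table["age"] < 0:
        raise ValueError("age must be non-negative")
    return table


try:
    Table(color="brown", age=-1)
except ValueError as error:
    print(error)  # age must be non-negative

# Constructing the TypedDict runs no checks at all.
unchecked = TableDict(color="brown", age=-1)
print(unchecked["age"])  # -1
```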
As for immutability, dataclasses provide it through the `frozen=True`
argument. If set, any object of that type is immutable in the sense that
assigning to its fields raises an error. `TypedDict` has no such option, since
its instances are ordinary, mutable dicts.
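A minimal sketch of that contrast, with `TableDict` as an illustrative name for the dict-based version:

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import TypedDict


@dataclass(frozen=True)
class Table:
    color: str
    age: int


class TableDict(TypedDict):
    color: str
    age: int


frozen_table = Table(color="brown", age=12)
try:
    # A type checker would also flag this assignment.
    frozen_table.age = 13  # type: ignore[misc]
except FrozenInstanceError:
    print("cannot assign to a field of a frozen dataclass")

table_dict = TableDict(color="brown", age=12)
table_dict["age"] = 13  # allowed: it is an ordinary dict
print(table_dict["age"])  # 13
```

Note that `frozen=True` only blocks assignment to the fields themselves. A mutable value stored in a field, such as a list, can still be changed in place.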