How to group by in Python with a dataset?

Shielagaa · November 17, 2024, 6:30pm

How do you perform a group-by operation in Python? Given a dataset of (value, type) pairs like this:

input = [
    ('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'),
    ('5349618', 'ETH'), ('11788544', 'NOT'), ('962142', 'ETH'),
    ('7795297', 'ETH'), ('7341464', 'ETH'), ('9843236', 'KAT'),
    ('5594916', 'ETH'), ('1550003', 'ETH')
]

You want to group by the type and produce the following output:

result = [
    { 'type': 'KAT', 'items': ['11013331', '9843236'] },
    { 'type': 'NOT', 'items': ['9085267', '11788544'] },
    { 'type': 'ETH', 'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'] }
]

yanisleidi-rodriguez · November 17, 2024, 6:31pm

Hey everyone, I would like to share a solution using collections.defaultdict for grouping data by type. Here’s how it works:

from collections import defaultdict

def group_by_type(input):
    grouped = defaultdict(list)
    for value, type_ in input:
        grouped[type_].append(value)
    
    return [{'type': k, 'items': v} for k, v in grouped.items()]

# Result
result = group_by_type(input)
print(result)

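This approach makes a single pass over the data, and defaultdict creates each list automatically the first time a key is seen, so you don’t need to check whether the key exists before appending the value. That simplifies the code and avoids unnecessary conditions. Since dictionaries preserve insertion order (Python 3.7+), the groups come out in the order each type first appears, which matches the output in the question.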
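To make the automatic initialization concrete, here is a minimal sketch of what defaultdict(list) does on a missing key. The 'XYZ' key is made up purely for illustration; it also shows one caveat worth knowing, namely that merely reading a missing key inserts it.

```python
from collections import defaultdict

grouped = defaultdict(list)

# 'KAT' is missing, so defaultdict calls list() to create [] and then we append to it
grouped['KAT'].append('11013331')
grouped['KAT'].append('9843236')
kat_items = list(grouped['KAT'])

# Caveat: a plain lookup of a missing key also creates an empty entry
# ('XYZ' is a hypothetical type used only for this demonstration)
_ = grouped['XYZ']
has_xyz = 'XYZ' in grouped

print(kat_items)  # ['11013331', '9843236']
print(has_xyz)    # True
```

If that side effect is unwanted, checking with `in` or using `.get()` does not insert anything.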

madhurima_sil · November 20, 2024, 8:54am

I see the value in using defaultdict, but another approach you could try is using a manual dictionary. Here’s how you can achieve the same result, though it requires a little more verbosity:

def group_by_type(input):
    grouped = {}
    for value, type_ in input:
        if type_ not in grouped:
            grouped[type_] = []
        grouped[type_].append(value)
    
    return [{'type': k, 'items': v} for k, v in grouped.items()]

# Result
result = group_by_type(input)
print(result)

This method requires checking if the key exists in the dictionary, and if not, manually initializing it. It’s more explicit, but definitely less concise than the defaultdict solution. It’s all about trade-offs based on your preference for readability versus succinctness.
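A possible middle ground between the two, sketched here under the assumption that you want a plain dict but not the explicit check, is dict.setdefault. The function name group_by_type_setdefault, the parameter name pairs, and the shortened sample list are mine, not from the posts above.

```python
def group_by_type_setdefault(pairs):
    grouped = {}
    for value, type_ in pairs:
        # setdefault returns the existing list for type_, or stores and returns a new []
        grouped.setdefault(type_, []).append(value)
    return [{'type': k, 'items': v} for k, v in grouped.items()]

# Shortened version of the data from the question
sample = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'),
          ('5349618', 'ETH'), ('9843236', 'KAT')]

result = group_by_type_setdefault(sample)
print(result)
```

One small cost of this variant is that a new empty list is built on every iteration, even when the key already exists, so it trades a little work for brevity.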

prynka.chatterjee · November 21, 2024, 8:55am

Both of your solutions are great! Another way to approach this would be by using itertools.groupby. It’s a bit different since it requires sorting the input, but it’s a more compact solution:

from itertools import groupby

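def group_by_type(input):
    # groupby only merges adjacent items, so the data must be sorted by type first.
    # sorted() returns a new list and leaves the caller's data untouched.
    ordered = sorted(input, key=lambda x: x[1])
    grouped = [
        {'type': key, 'items': [item[0] for item in group]}
        for key, group in groupby(ordered, key=lambda x: x[1])
    ]
    return grouped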

# Result
result = group_by_type(input)
print(result)

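In this case, we use groupby from the itertools module, which only merges consecutive items that share a key, so sorting by type first is a crucial step. It’s concise and uses only the standard library, but the sort makes it O(n log n) rather than a single pass, and the groups come out in sorted key order (ETH, KAT, NOT) rather than the order shown in the question.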
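To see why the sort matters, here is a small sketch comparing groupby on unsorted and sorted input. The three-item list and the helper name collect are made up for illustration.

```python
from itertools import groupby

def collect(pairs):
    # Turn groupby's lazy groups into (type, [values]) tuples so they can be inspected
    return [(key, [value for value, _ in group])
            for key, group in groupby(pairs, key=lambda x: x[1])]

pairs = [('1', 'KAT'), ('2', 'ETH'), ('3', 'KAT')]

# Without sorting, the two 'KAT' items are not adjacent, so they land in separate groups
unsorted_groups = collect(pairs)
print(unsorted_groups)  # [('KAT', ['1']), ('ETH', ['2']), ('KAT', ['3'])]

# After sorting by type, each type appears exactly once
sorted_groups = collect(sorted(pairs, key=lambda x: x[1]))
print(sorted_groups)    # [('ETH', ['2']), ('KAT', ['1', '3'])]
```

Because Python's sort is stable, the values inside each group keep their original relative order.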