Nik Kantar

Wednesday, March 8, 2023

Dedupe a List in Python, Slowly

A pretty slow way of removing duplicate elements in a Python list.

Let’s say you have a list in Python:

words = ["foo", "bar", "foo", "baz", "foo", "barf", "foo"]

That’s a lot of "foo"s. Too many "foo"s, in fact. You probably want only one.

Trey Hunner published a whole post about some ways to do it, and I thought of a potentially slow but clever method I want to try:

unique_words = [
    words[idx]
    for idx in range(len(words))
    if not idx > words.index(words[idx])
]
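
Since list.index always returns the index of the first occurrence, idx can never be less than it, so the condition not idx > words.index(words[idx]) only holds when the two are equal. An equivalent (and arguably clearer) spelling of the same idea, as a quick sketch:

```python
words = ["foo", "bar", "foo", "baz", "foo", "barf", "foo"]

# Keep an element only when the current index is the index of
# that element's first occurrence, i.e. its first appearance.
unique_words = [
    words[idx]
    for idx in range(len(words))
    if idx == words.index(words[idx])
]

print(unique_words)  # ['foo', 'bar', 'baz', 'barf']
```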

The list comprehension walks the list by index and keeps an element only when its index matches the index of that element's first occurrence, so only the first appearance of each element survives. Since list.index rescans the list from the start on every call, this sounds slow, but let's verify by comparing against Trey's examples!

Given this dedupe.py file:

WORDS = ["foo", "bar", "foo", "baz", "foo", "barf", "foo"]

def idx_approach():
    [
        WORDS[idx]
        for idx in range(len(WORDS))
        if not idx > WORDS.index(WORDS[idx])
    ]

def list_approach():
    unique_words = []
    for word in WORDS:
        if word not in unique_words:
            unique_words.append(word)

def set_approach():
    list(set(WORDS))

def set_dict_approach():
    list(dict.fromkeys(WORDS))

Here are the timing results:

$ python3 -m timeit "import dedupe; dedupe.idx_approach()"
1000000 loops, best of 5: 652 nsec per loop

$ python3 -m timeit "import dedupe; dedupe.list_approach()"
1000000 loops, best of 5: 316 nsec per loop

$ python3 -m timeit "import dedupe; dedupe.set_approach()"
1000000 loops, best of 5: 226 nsec per loop

$ python3 -m timeit "import dedupe; dedupe.set_dict_approach()"
1000000 loops, best of 5: 286 nsec per loop

Yup, pretty slow! :D
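
One caveat worth noting alongside the timings: list(set(...)) makes no promise about ordering, while the other three approaches preserve the order of first appearance (dicts have guaranteed insertion order since Python 3.7). A quick sketch:

```python
words = ["foo", "bar", "foo", "baz", "foo", "barf", "foo"]

# dict.fromkeys keeps insertion order (guaranteed since Python 3.7),
# so each word lands at the position of its first appearance.
ordered = list(dict.fromkeys(words))
print(ordered)  # ['foo', 'bar', 'baz', 'barf']

# set() makes no ordering promise: same elements, but the
# iteration order of the result is arbitrary.
unordered = list(set(words))
print(sorted(unordered) == sorted(ordered))  # True: same elements
```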


Thanks for reading! You can keep up with my writing via the feed or newsletter, or you can get in touch via email.

