I need help with splitting strings into words when there are multiple word boundary delimiters, like punctuation. For example, I have a string:
"Hey, you - what are you doing here!?"
And I want to split it into a list of words like this:
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
However, when I use Python's str.split(), it only splits on whitespace and leaves punctuation attached to the words. How can I split on multiple characters like spaces, commas, dashes, and exclamation marks? Any suggestions?
Alright, I've been working with regular expressions for a while now, and they're a neat way to split strings in Python when you're dealing with several delimiters or punctuation. Here's a quick example that uses re.findall() to extract the words and skip the punctuation. Note that it keeps the original capitalization, so call .lower() on the string first if you need lowercase output like in the question.
import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
The pattern r"[\w']+" matches runs of word characters (letters, digits, and underscores) and apostrophes, so contractions like "don't" stay in one piece and all other punctuation is skipped. This is a handy way to split by multiple characters in Python when you only want the words from a sentence.
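If you'd rather split on the delimiters than match the words, re.split() with a character class is a possible alternative. Here's a minimal sketch, assuming the delimiters are whitespace, commas, dashes, exclamation marks, and question marks. The helper name split_words is made up for illustration, and it also lowercases the result to match the output asked for in the question.

```python
import re

def split_words(text):
    # Split on runs of whitespace, commas, '!', '?' and dashes.
    # The hyphen sits last in the class so it is taken literally.
    parts = re.split(r"[\s,!?-]+", text)
    # re.split() returns empty strings when the text starts or ends
    # with a delimiter, so drop them and lowercase what is left.
    return [part.lower() for part in parts if part]

print(split_words("Hey, you - what are you doing here!?"))
# Prints ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Unlike the findall() version, this one only splits on the characters you list, so any other punctuation (a semicolon, say) stays attached to its word.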
Yeah, I totally agree with using regex, but I've found that re.sub() can also be pretty handy in certain cases. Instead of just finding words, you can remove unwanted characters like punctuation and then split the string. Here's how you could do that:
import re
DATA = "Hey, you - what are you doing here!?"
cleaned_data = re.sub(r"[^\w\s']", "", DATA)
words = cleaned_data.split()
print(words)
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
In this case, re.sub() removes every character that isn't a word character, whitespace, or an apostrophe. After that, a plain split() on whitespace gives you the words. One caveat: the punctuation is deleted rather than replaced, so words separated only by punctuation (like "a,b") get merged into one.
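To show the difference between deleting punctuation and replacing it with a space, here's a small sketch. The sample string and both function names are my own, not from the question, so treat it as an illustration rather than the one right way to do it.

```python
import re

def words_deleting(text):
    # Delete punctuation outright, as in the snippet above.
    return re.sub(r"[^\w\s']", "", text).split()

def words_spacing(text):
    # Replace punctuation with a space so neighbouring words stay separate.
    return re.sub(r"[^\w\s']", " ", text).split()

sample = "one,two;three"
print(words_deleting(sample))  # Prints ['onetwothree']
print(words_spacing(sample))   # Prints ['one', 'two', 'three']
```

For the example sentence in the question both versions give the same list, because every punctuation mark there already has a space next to it.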
I've tried a couple of ways to handle this, and I think str.translate() is another neat method. If you want to avoid regular expressions altogether and just get rid of punctuation, str.translate() works great. You can use it with string.punctuation to remove punctuation and then split the string.
import string
DATA = "Hey, you - what are you doing here!?"
translator = str.maketrans('', '', string.punctuation)
cleaned_data = DATA.translate(translator)
words = cleaned_data.split()
print(words)
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
What happens here is that translate() deletes every ASCII punctuation character listed in string.punctuation, and split() then breaks the result on whitespace. Keep in mind that string.punctuation includes the apostrophe, so "don't" becomes "dont", and that it doesn't cover non-ASCII punctuation such as curly quotes. It's a simple, readable way to split by multiple characters in Python without regex. All three approaches give the same list for the example string.
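If you want the translate() approach to keep apostrophes and avoid gluing together words that are separated only by punctuation, one option is to map punctuation to spaces instead of deleting it. This is a rough sketch under that assumption. The names KEEP, table, and split_keep_apostrophes are hypothetical.

```python
import string

# Characters to leave untouched (an assumption: only the apostrophe).
KEEP = "'"

# Map every other ASCII punctuation character to a space.
table = str.maketrans({ch: " " for ch in string.punctuation if ch not in KEEP})

def split_keep_apostrophes(text):
    # Punctuation becomes spaces, then split() collapses the extra whitespace.
    return text.translate(table).split()

print(split_keep_apostrophes("Don't stop, it's fine - really!"))
# Prints ["Don't", 'stop', "it's", 'fine', 'really']
```

Passing a dict to str.maketrans() lets each character map to its own replacement, which is what makes the "replace with a space" variant possible here.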