Matching Multiple Regex Patterns in Pandas
If you do much web scraping or parsing of human-readable tables you’ve probably come across the problem of matching multiple regex patterns across multiple columns or rows in your data. What is the best way to handle regex patterns when they start becoming complex? How do you aggregate your results once you have them?
This article assumes you know how to use regular expressions and you’re somewhat experienced using pandas. First, we’re going to explore how to handle matching regex patterns using pandas.
Regex in Pandas
Let’s suppose we have data that identify a time period. Unfortunately, the strings are not all formatted the same, and there is variation in how the time period is described. Sometimes it is a single year, sometimes it is a range of years, sometimes it includes a season such as “Fall 2020”. We want to extract any relevant information describing the time period, but nothing more.
Considering the following example:
Match
Pandas provides several functions where regex patterns can be applied to Series or DataFrames. Series.str.match
returns a boolean value indicating whether the string starts with a match.
First, let’s try to match any four digits.
We can see that ' 2020'
didn't match because of the leading whitespace. None of the time periods matched when there was text in front of our year.
Contains
Series.str.contains
might be better to use in this case. It is based on re.search
instead of re.match
. It searches for the regex pattern at any position in the string, not only matching from the first character. It will return True
if our pattern is found at any position in the string.
Now we get get a match for every string! Note that we would also get a match if our string contained any combination of 4 digits in a row. Even if it were something like 12345.45
.
Extract
We want to extract these values from our strings so let’s try the Series.str.extract
method.
Note that we’ve added parentheses to our regex pattern. When using extract
you need to have at least 1 group in your pattern. If you include more than 1 group the method will return a DataFrame with one column for each group. Normally, Series.str.extract
would return a DataFrame anyway, with a single column if only a single group existed in the pattern. However, expand = False
tells the method to return a Series instead.
Now, you may have noticed that our output wasn’t exactly what we wanted. We did successfully extract the first 4 digits found together in each string, but remember we wanted to extract '2012-2017'
from 'Percent change, 2012-2017'
not '2012'
.
So it looks like we need to make our regular expression much more complicated.
This seems to work, at least for these values. I’ve added a bunch of non-reporting groups with the (?:.*)
syntax. And I've made them optional with a ?
after them.
What happens when someone comes along and says, “Can you add 2012 to 2017
to the list?" What about 2012 through 2017
? What about ...?
You can see that our regex might quickly get unwieldy. This is where I would want to run multiple extracts. For each new matching requirement, we could run a new extract.
There are a couple of problems with that approach. We might have to deal with multiple matches when more than one of our extracts returned a result for the same string. But more than that, we are iterating through all of our data multiple times to find different year formats. I’d still like to combine our searches into a single regex.
I was inspired by this Stack Overflow answer to solve the problem. We can write multiple regex strings and then concatenate them with '|'
. Let's rewrite our search above using this method. Also, note: We are getting rid of any leading or trailing whitespace in our series.
Hmmm. Why isn’t this working?
When using |
regex takes the first match. So if it finds a match for '\d{4}\'
then it won't match '\d{4}-\d{2}'
if that expression comes later in our list. Therefore we need to have the most specific regex patterns appear first in our list. That would be in the exact opposite order we have currently. So let's reverse our list.
That’s better. We are now correctly matching these examples. But now we have to deal with our results being spread out across multiple columns. One approach would be to iterate through the rows of our resulting DataFrame, finding the non-missing value, and constructing a new series that contains this value for each row.
We could do that. In fact, that was my first approach. But we can make this pretty simple by making our regex list have non-reporting groups (with '(?:.*)'
). We also have to add a pair of parentheses encompassing the entire expression to create a single group for the expression.
Now we have a single column that extracts exactly what we set out to do.
Let me know if you found this interesting or if you have a different approach.