Matching Multiple Regex Patterns in Pandas

Photo by Mélody P on Unsplash

If you do much web scraping or parsing of human-readable tables you’ve probably come across the problem of matching multiple regex patterns across multiple columns or rows in your data. What is the best way to handle regex patterns when they start becoming complex? How do you aggregate your results once you have them?

This article assumes you know how to use regular expressions and you’re somewhat experienced using pandas. First, we’re going to explore how to handle matching regex patterns using pandas.

Regex in Pandas

Let’s suppose we have data that identify a time period. Unfortunately, the strings are not all formatted the same, and there is variation in how the time period is described. Sometimes it is a single year, sometimes it is a range of years, sometimes it includes a season such as “Fall 2020”. We want to extract any relevant information describing the time period, but nothing more.

Considering the following example:

A series of time periods

Match

Pandas provides several functions where regex patterns can be applied to Series or DataFrames. Series.str.match returns a boolean value indicating whether the string starts with a match.

First, let’s try to match any four digits.

years.str.match

We can see that ' 2020' didn't match because of the leading whitespace. None of the time periods matched when there was text in front of our year.

Contains

Series.str.contains might be better to use in this case. It is based on re.search instead of re.match. It searches for the regex pattern at any position in the string, not only matching from the first character. It will return True if our pattern is found at any position in the string.

years.str.contains

Now we get get a match for every string! Note that we would also get a match if our string contained any combination of 4 digits in a row. Even if it were something like 12345.45.

Extract

We want to extract these values from our strings so let’s try the Series.str.extract method.

years.str.extract

Note that we’ve added parentheses to our regex pattern. When using extract you need to have at least 1 group in your pattern. If you include more than 1 group the method will return a DataFrame with one column for each group. Normally, Series.str.extract would return a DataFrame anyway, with a single column if only a single group existed in the pattern. However, expand = False tells the method to return a Series instead.

Now, you may have noticed that our output wasn’t exactly what we wanted. We did successfully extract the first 4 digits found together in each string, but remember we wanted to extract '2012-2017' from 'Percent change, 2012-2017' not '2012'.

So it looks like we need to make our regular expression much more complicated.

Extract a more complicated regex

This seems to work, at least for these values. I’ve added a bunch of non-reporting groups with the (?:.*) syntax. And I've made them optional with a ? after them.

What happens when someone comes along and says, “Can you add 2012 to 2017 to the list?" What about 2012 through 2017? What about ...?

You can see that our regex might quickly get unwieldy. This is where I would want to run multiple extracts. For each new matching requirement, we could run a new extract.

There are a couple of problems with that approach. We might have to deal with multiple matches when more than one of our extracts returned a result for the same string. But more than that, we are iterating through all of our data multiple times to find different year formats. I’d still like to combine our searches into a single regex.

I was inspired by this Stack Overflow answer to solve the problem. We can write multiple regex strings and then concatenate them with '|'. Let's rewrite our search above using this method. Also, note: We are getting rid of any leading or trailing whitespace in our series.

Concatenating our regex list

Hmmm. Why isn’t this working?

When using | regex takes the first match. So if it finds a match for '\d{4}\' then it won't match '\d{4}-\d{2}' if that expression comes later in our list. Therefore we need to have the most specific regex patterns appear first in our list. That would be in the exact opposite order we have currently. So let's reverse our list.

Extract using regex_list_reverse

That’s better. We are now correctly matching these examples. But now we have to deal with our results being spread out across multiple columns. One approach would be to iterate through the rows of our resulting DataFrame, finding the non-missing value, and constructing a new series that contains this value for each row.

We could do that. In fact, that was my first approach. But we can make this pretty simple by making our regex list have non-reporting groups (with '(?:.*)'). We also have to add a pair of parentheses encompassing the entire expression to create a single group for the expression.

Extracting one, big match

Now we have a single column that extracts exactly what we set out to do.

Let me know if you found this interesting or if you have a different approach.

Data Nerd, Information Visualizer, Social Scientist