Introducing Python’s Parse: The Ultimate Alternative to Regular Expressions | by Peng Qian | Jun, 2023


The parse API is similar to Python's re module and consists mainly of the parse, search, and findall functions. Basic usage is covered in the parse documentation.

Pattern format

The parse format is very similar to the Python format syntax. You can capture matched text using {} or {field_name}.

For example, in the following text, if I want to get the profile URL and username, I can write it like this:

content:
Hello everyone, my Medium profile url is https://qtalen.medium.com,
and my username is @qtalen.

parse pattern:
Hello everyone, my Medium profile url is {profile},
and my username is {username}.

Or suppose you want to extract multiple phone numbers. The country codes in front are written in different formats, but the numbers themselves always have a fixed length of 11 digits. You can write it like this:

from parse import Parser

# {phone:11.11} sets both the width and the precision to 11 characters.
compiler = Parser("{country_code}{phone:11.11},")
content = "0085212345678901, +85212345678902, (852)12345678903,"

results = compiler.findall(content)

for result in results:
    print(result)

Or if you need to process a piece of text in an HTML tag, but the text is preceded and followed by an indefinite length of whitespace, you can write it like this:

content:
<div> Hello World </div>

pattern:
<div>{:^}</div>

In the phone number pattern above, {phone:11.11} combines a width and a precision. {:11} sets the width, which means to capture at least 11 characters, roughly equivalent to the regular expression .{11,}. {:.11} sets the precision, which means to capture at most 11 characters, roughly equivalent to .{,11}. Combined as {:11.11}, they mean exactly 11 characters, that is .{11,11}. The result is:

Capture fixed-width characters. Image by Author

The most powerful feature of parse is its handling of date and time text, which can be parsed directly into Python datetime objects. For example, if we want to parse the timestamp in an HTTP log:

content:
[04/Jan/2019:16:06:38 +0800]

pattern:
[{:th}]

Retrieving results

There are two ways to retrieve the results:

  1. For fields captured with {} and no field name, use result.fixed to get the results as a tuple.
  2. For fields captured with {field_name}, use result.named to get the results as a dictionary.

Custom Type Conversions

Although using {field_name} is already quite simple, the source code reveals that {field_name} is internally converted to (?P<field_name>.+?). So parse still uses regular expressions for matching. .+? matches one or more arbitrary characters in non-greedy mode.

The transformation process of parse format to regular expressions. Image by Author
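To make the conversion concrete, here is a hand-written approximation using only the standard re module. It is a sketch of the idea rather than the exact expression parse generates, which may differ in escaping and flags.

```python
import re

# Roughly what the parse pattern "my username is {username}." turns into:
# the field becomes a named, non-greedy group and the literal "." is escaped.
hand_written = re.compile(r"my username is (?P<username>.+?)\.")

match = hand_written.fullmatch("my username is @qtalen.")
username = match.group("username")
print(username)
```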

However, we often want to match more precisely. For example, given the text “my email is xxx@xxx.com”, the pattern “my email is {email}” captures the email address. But sometimes the data is dirty, for example “my email is xxxx@xxxx”, and we don’t want to capture it.

Is there a way to use regular expressions for more accurate matching?

That’s when the with_pattern decorator comes in handy.

For example, for capturing email addresses, we can write it like this:

from parse import Parser, with_pattern

@with_pattern(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
def email(text: str) -> str:
    # Receives the text matched by the pattern above and returns it unchanged.
    return text

compiler = Parser("my email address is {email:Email}", dict(Email=email))

legal_result = compiler.parse("my email address is xx@xxx.com")  # legal email
illegal_result = compiler.parse("my email address is xx@xx")  # illegal email, returns None

Using the with_pattern decorator, we can define a custom field type, in this case Email, which matches the email address in the text. We can also use this approach to match other complicated patterns.


