Introduction to Regular Expressions in Python
Latest revision as of 12:38, 3 September 2024
THIS ARTICLE IS STILL IN EDITING MODE
Introduction
Regular expressions are sequences of characters that form a search pattern, mainly used for searching and manipulating strings. In Python, they are provided by the standard library's 're' module.
Basic Concepts
- Patterns: Specific sequences of characters that describe a search criterion.
- Metacharacters: Special characters that signify broader types of patterns (like '*', '+', '?').
- Character Classes: Represent a set of characters that match any one character from the set (like '\d', '\w', '\s').
Use Cases
- String Parsing: Extracting specific information from text.
- Data Validation: Checking that data follows the expected format (such as emails, phone numbers, passwords).
- Text Preprocessing: Used in natural language processing for tasks like tokenization and data cleaning.
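The three use cases above can be sketched with the 're' module as follows. This is a minimal illustration, not a recommended production approach: the sample text, the simplified phone format, and the naive tokenizer are all assumptions made for the example.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "Order 66 shipped on 2024-09-03 to alice@example.com"

# String parsing: pull the first ISO-style date out of free text.
date = re.search(r"\d{4}-\d{2}-\d{2}", text).group()
print(date)  # 2024-09-03

# Data validation: fullmatch requires the WHOLE string to fit the pattern.
# The NNN-NNNN phone format here is a deliberately simplified assumption.
is_phone = re.fullmatch(r"\d{3}-\d{4}", "555-0199") is not None
print(is_phone)  # True

# Text preprocessing: a naive tokenizer that keeps runs of word characters.
tokens = re.findall(r"\w+", "Hello, world! It's 2024.")
print(tokens)  # ['Hello', 'world', 'It', 's', '2024']
```

Note that the naive tokenizer splits "It's" into two tokens; real preprocessing pipelines usually need more careful rules.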
Regular expression patterns
Special characters:
. - any character except a newline (\n), unless the re.DOTALL flag is set
^ - start of the string (or of each line when the re.MULTILINE flag is set)
$ - end of the string, or just before a trailing newline (end of each line with re.MULTILINE)
* - the preceding element repeats 0 or more times; equivalent to {0,}
+ - the preceding element repeats 1 or more times; equivalent to {1,}
? - the preceding element is optional; equivalent to {0,1}
{n,m} - the preceding element repeats from n to m times (write it without a space after the comma)
{n} - the preceding element repeats exactly n times
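As a rough illustration of the special characters listed above, the following sketch runs each one against small made-up strings. The sample inputs and variable names are assumptions chosen for the example.

```python
import re

# . matches any single character except a newline (by default).
dot_ok = re.fullmatch(r"a.c", "abc") is not None        # True
dot_newline = re.fullmatch(r"a.c", "a\nc") is not None  # False

# * (0 or more), + (1 or more), ? (optional) apply to the preceding element.
star = re.findall(r"ab*", "a ab abbb")        # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")        # ['ab', 'abbb']
opt = re.findall(r"colou?r", "color colour")  # ['color', 'colour']

# {n} and {n,m} give exact and bounded repetition counts.
# \b (word boundary) keeps the match from starting inside a longer number.
exact = re.findall(r"\b\d{3}\b", "7 42 123 4567")     # ['123']
ranged = re.findall(r"\b\d{2,3}\b", "7 42 123 4567")  # ['42', '123']

# ^ anchors at the start of the string; with re.MULTILINE, at each line start.
lines = "first line\nsecond line"
starts_default = re.findall(r"^\w+", lines)              # ['first']
starts_multi = re.findall(r"^\w+", lines, re.MULTILINE)  # ['first', 'second']
```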
Ranges of characters:
[A-Za-z] - any ASCII letter, lowercase or uppercase
[^0-9] - any character except a digit
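A short sketch of the two bracket expressions above, applied to a made-up sample string (the string and variable names are assumptions for illustration):

```python
import re

sample = "Room 42B, floor 3"

# [A-Za-z]+ collects runs of ASCII letters.
letters = re.findall(r"[A-Za-z]+", sample)
print(letters)  # ['Room', 'B', 'floor']

# [^0-9]+ collects runs of anything that is NOT a digit,
# including spaces and punctuation.
non_digits = re.findall(r"[^0-9]+", sample)
print(non_digits)  # ['Room ', 'B, floor ']
```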
Metacharacters that represent a class:
\d - matches any digit
\s - matches any whitespace character, such as spaces, tabs, and newlines
\w - matches any word character: letters (both uppercase and lowercase), digits, and the underscore
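The class shorthands above can be tried on a small sample, as sketched below. The inputs are made up for the example. The last two lines show that, for ordinary Python 3 string patterns, \w also matches non-ASCII letters unless the re.ASCII flag is passed.

```python
import re

sample = "id_7\tok\n"

digits = re.findall(r"\d", sample)   # ['7']
spaces = re.findall(r"\s", sample)   # ['\t', '\n']
words = re.findall(r"\w+", sample)   # ['id_7', 'ok']

# With str patterns, \w is Unicode-aware by default.
unicode_words = re.findall(r"\w+", "café")          # ['café']
ascii_words = re.findall(r"\w+", "café", re.ASCII)  # ['caf']
```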
Combinations
[\w\s]+ - The character set [\w\s] matches any single word character or whitespace character, and the + means one or more consecutive characters from that set.
([\w\s]+?) - The parentheses form a capture group, which means that whatever this part of the pattern matches can be extracted separately from the full match.
The ? after the + makes the + "lazy" instead of "greedy". In regular expressions, greedy means matching as much as possible, while lazy means matching as little as possible. So [\w\s]+? matches the shortest run of word and whitespace characters that still lets the rest of the pattern match; on its own, with nothing following it, it matches just one character.
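The difference between greedy and lazy matching can be seen in a small sketch like the one below. The sample sentence and the trailing " and" in the patterns are assumptions added to give the quantifier something to stop at.

```python
import re

text = "Ada and Bob and Eve"

# Greedy: [\w\s]+ grabs as much as it can, then backs off only as far
# as needed for " and" to match, so it stops at the LAST " and".
greedy = re.search(r"([\w\s]+) and", text).group(1)
print(greedy)  # Ada and Bob

# Lazy: [\w\s]+? grows one character at a time and stops at the FIRST
# point where " and" can follow.
lazy = re.search(r"([\w\s]+?) and", text).group(1)
print(lazy)  # Ada

# With nothing after it, the lazy form is satisfied by a single character.
alone = re.search(r"[\w\s]+?", text).group()
print(alone)  # A
```

In both searches, group(1) returns only the text captured by the parentheses, not the trailing " and".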