Introduction to Regular Expressions in Python

From Sustainability Methods

Latest revision as of 12:38, 3 September 2024

THIS ARTICLE IS STILL IN EDITING MODE

Introduction

Regular expressions are sequences of characters that form a search pattern, mainly used for string searching and manipulation. In Python, they are implemented through the built-in 're' module.
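
As a minimal sketch of the 're' module in use, the snippet below searches a string for a literal pattern with re.search. The sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Regular expressions in Python"

# re.search scans the string and returns a match object for the
# first occurrence of the pattern, or None if there is no match.
match = re.search(r"Python", text)

if match:
    print(match.group())  # the matched text
    print(match.start())  # index at which the match begins
```

Writing patterns as raw strings (the r prefix) keeps backslashes from being interpreted by Python before the 're' module sees them.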

Basic Concepts

  • Patterns: Specific sequences of characters that represent a search criterion.
  • Metacharacters: Special characters that stand for broader types of patterns rather than for themselves (like '*', '+', '?').
  • Character Classes: Sets of characters; a class matches any single character from its set (like '\d', '\w', '\s').

Use Cases

  • String Parsing: Extracting specific information from text.
  • Data Validation: Ensuring formats of data are correct (like emails, phone numbers, passwords).
  • Text Preprocessing: Used in natural language processing for tasks such as tokenization and data cleaning.
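
The three use cases above can be sketched roughly as follows. The sample sentence, the helper name looks_like_email, and the simplified email pattern are assumptions made for illustration; real-world validation usually needs stricter rules.

```python
import re

# Hypothetical sample sentence, used only for illustration.
sentence = "Contact us at info@example.org or call 555-1234."

# String parsing: extract a phone-like fragment (three digits, hyphen, four digits).
phone = re.search(r"\d{3}-\d{4}", sentence)

# Data validation: a deliberately simplified email check (assumption:
# this pattern is far looser than a production-grade validator).
def looks_like_email(candidate):
    return re.fullmatch(r"[\w.]+@[\w.]+\.\w+", candidate) is not None

# Text preprocessing: a naive tokenizer that keeps runs of word characters.
tokens = re.findall(r"\w+", sentence.lower())

print(phone.group())
print(looks_like_email("info@example.org"), looks_like_email("not an email"))
print(tokens)
```

re.fullmatch is used for validation because it only succeeds when the whole string fits the pattern, whereas re.search accepts a match anywhere inside the string.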

Regular expression patterns

Special characters:

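. - any character except a newline (\n)
^ - start of the string (or of each line when the re.MULTILINE flag is set)
$ - end of the string or just before its final newline (end of each line when the re.MULTILINE flag is set)
* - the preceding character or group repeats 0 or more times; equivalent to {0,}
+ - the preceding character or group repeats 1 or more times; equivalent to {1,}
? - the preceding character or group is optional; equivalent to {0,1}
{n,m} - the preceding character or group repeats from n to m times (written without a space after the comma)
{n} - the preceding character or group repeats exactly n times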
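
The special characters listed above can be tried out with re.findall, which returns all non-overlapping matches as a list. This is only a sketch; the sample strings and variable names are invented for illustration.

```python
import re

# Hypothetical sample strings, chosen only for illustration.
words = "a ab abb"
lines = "first line\nsecond line"

# . matches any single character except a newline.
dot = re.findall(r"c.t", "cat cut c\nt")                    # ['cat', 'cut']

# ^ and $ anchor to the whole string unless re.MULTILINE is set.
start_default = re.findall(r"^\w+", lines)                  # ['first']
start_multiline = re.findall(r"^\w+", lines, re.MULTILINE)  # ['first', 'second']
end_default = re.findall(r"\w+$", lines)                    # ['line']

# Quantifiers apply to the element immediately before them (here: 'b').
star = re.findall(r"ab*", words)           # ['a', 'ab', 'abb']
plus = re.findall(r"ab+", words)           # ['ab', 'abb']
optional = re.findall(r"ab?", words)       # ['a', 'ab', 'ab']
exactly_two = re.findall(r"ab{2}", words)  # ['abb']
one_to_two = re.findall(r"ab{1,2}", words) # ['ab', 'abb']
```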

Ranges of characters:

[A-Za-z] - any letter of the basic Latin alphabet, lowercase or uppercase
[^0-9] - any character except a digit
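
A short sketch of both ranges in action, using a made-up sample string:

```python
import re

# Hypothetical sample string, chosen only for illustration.
sample = "Room 42B, floor 3"

# [A-Za-z]+ collects runs of ASCII letters.
letter_runs = re.findall(r"[A-Za-z]+", sample)     # ['Room', 'B', 'floor']

# [^0-9]+ collects runs of anything that is not a digit.
non_digit_runs = re.findall(r"[^0-9]+", sample)    # ['Room ', 'B, floor ']

# Replacing every non-digit with nothing leaves only the digits.
digits_only = re.sub(r"[^0-9]", "", sample)        # '423'
```

The ^ has a different meaning inside square brackets: placed first, it negates the set instead of anchoring to the start of the string.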

Metacharacters that represent a class:

\d - matches any digit
\s - matches any whitespace character, such as a space, tab, or newline
\w - matches any word character, which includes letters (both uppercase and lowercase), digits, and underscores
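
The following sketch applies each of the three class metacharacters to a made-up sample string; the string and variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample string containing a tab, chosen only for illustration.
sample = "user_1 logged in\tat 09:30"

# \d matches a single digit; \d+ matches a run of digits.
single_digits = re.findall(r"\d", sample)   # ['1', '0', '9', '3', '0']
digit_runs = re.findall(r"\d+", sample)     # ['1', '09', '30']

# \s+ matches runs of whitespace, so it can serve as a splitting rule.
fields = re.split(r"\s+", sample)           # ['user_1', 'logged', 'in', 'at', '09:30']

# \w+ matches runs of letters, digits, and underscores.
word_runs = re.findall(r"\w+", sample)      # ['user_1', 'logged', 'in', 'at', '09', '30']
```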

Combinations

[\w\s]+ - The class [\w\s] matches any single word character or whitespace character, and the + means one or more consecutive occurrences of such characters.

([\w\s]+?) - The parentheses form a capture group, which means the text matched by the pattern inside them can be extracted separately.

The ? after the + makes the + "lazy" instead of "greedy". In regular expressions, greedy means matching as much as possible, while lazy means matching as little as possible. So [\w\s]+? matches the shortest sequence of word characters and whitespace that still lets the rest of the pattern succeed; used on its own, it matches a single character.
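
To make the greedy versus lazy distinction concrete, here is a small sketch. The sample string is invented, and a trailing space is added to the pattern so that both variants have something to stop at.

```python
import re

# Hypothetical sample string, chosen only for illustration.
sample = "alpha beta gamma"

# Greedy: the group grabs as much as it can, then gives back just enough
# for the trailing space in the pattern to match.
greedy = re.match(r"([\w\s]+) ", sample).group(1)    # 'alpha beta'

# Lazy: the group grows one character at a time and stops at the first
# point where the trailing space can match.
lazy = re.match(r"([\w\s]+?) ", sample).group(1)     # 'alpha'

# With nothing after it, the lazy group is satisfied by one character.
lazy_alone = re.match(r"([\w\s]+?)", sample).group(1)  # 'a'

print(greedy, lazy, lazy_alone, sep=" | ")
```

group(1) returns the text captured by the first pair of parentheses, while group() or group(0) would return the whole match, including the trailing space in the first two patterns.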