Mailing List Archive

Adding new escapes to regex module
Other regex implementations have escape sequences for horizontal
whitespace (`\h` and `\H`) and vertical whitespace (`\v` and `\V`).

The regex module already supports `\h`, but I can't use `\v` because it
represents `\x0b`, as it does in the re module.

Now that someone has asked for it, I'm trying to find a nice way of
adding it, and I'm currently thinking that maybe I could use `\y` and
`\Y` instead as they look a little like `\v` and `\V`, and, also,
vertical whitespace is sort-of in the y-direction.

As far as I can tell, only PostgreSQL uses them, and, even then, it's
for what everyone else writes as `\b` and `\B`.

I want the regex module to remain compatible with the re module, in case
they get added there sometime in the future.
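
To make the compatibility concern concrete, here is a minimal sketch that probes which of the escapes under discussion the standard-library re module already accepts. The helper name `escape_status` is made up for this illustration, and the check only tells us whether re compiles the escape, not what it means.

```python
import re

def escape_status(letter):
    """Report whether re accepts backslash + letter as a pattern (hypothetical helper)."""
    try:
        re.compile("\\" + letter)
    except re.error:
        # re rejects unknown escapes made of a backslash and an ASCII letter.
        return "unused"
    return "taken"

for letter in "vVyYhH":
    print(letter, escape_status(letter))
```

On current CPython releases this should report only `\v` as taken, which is why `\v` cannot simply be given a new meaning while `\y` and `\Y` are still free in re.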

Opinions?
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/AYOYEAFOJW4ZHVYBDVMH4MWKXNLBBJ62/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Adding new escapes to regex module
> On 16 Aug 2022, at 21:24, MRAB <python@mrabarnett.plus.com> wrote:
>
> Other regex implementations have escape sequences for horizontal whitespace (`\h` and `\H`) and vertical whitespace (`\v` and `\V`).
>
> The regex module already supports `\h`, but I can't use `\v` because it represents `\x0b`, as it does in the re module.

You seem to be mixing up the use of \ as the escape character in strings and the \ that re uses.
Is it the behaviour that '\<unknown>' becomes '\\<unknown>' that means this is a breaking change?

Won't this work?
```
re.compile('\v:\\v')
# which is the same as
re.compile(r'\x0b:\v')
```
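
As a small illustration of the two levels of escaping in the snippet above, the following sketch looks only at what the string literals contain before re ever sees them. The variable names are invented for the example.

```python
# '\v' is processed by the Python parser: one character, U+000B.
string_level = '\v'
# '\\v' and r'\v' both hold two characters: a backslash and the letter v.
regex_level = '\\v'

assert len(string_level) == 1 and ord(string_level) == 0x0B
assert len(regex_level) == 2 and regex_level == r'\v'
```

The two-character form is the one that re has to interpret itself, so that is where any new meaning for `\v` would take effect.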

Barry

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/R7MG2MKGXTIEXOAQDJ72LE2QLGDT7KNA/
Re: Adding new escapes to regex module
On 2022-08-16 22:14, Barry Scott wrote:
> You seem to be mixing up the use of \ as the escape character in strings and the \ that re uses.
> Is it the behaviour that '\<unknown>' becomes '\\<unknown>' that means this is a breaking change?
>
> Won't this work?
> ```
> re.compile('\v:\\v')
> # which is the same as
> re.compile(r'\x0b:\v')
> ```
>
Some languages, e.g. Perl, have a dedicated syntax for writing regexes,
and they take `\n` (a backslash followed by 'n') to mean "match a newline".

Other languages, including Python, write regexes as string literals, which
can contain an actual newline, but their regex engines also take `\n` (a
backslash followed by 'n') to mean "match a newline".

Thus:

>>> print(re.match('\n', '\n')) # Literal newline.
<re.Match object; span=(0, 1), match='\n'>
>>> print(re.match('\\n', '\n')) # `\n` sequence.
<re.Match object; span=(0, 1), match='\n'>

On the other hand:

>>> print(re.match('\b', '\b')) # Literal backspace.
<re.Match object; span=(0, 1), match='\x08'>
>>> print(re.match('\\b', '\b')) # `\b` sequence, which means a word boundary.
None

The problem is that the re and regex modules already have the `\v` (a
backslash followed by 'v') sequence to mean "match the '\v' character", so:

re.compile('\v')

and:

re.compile('\\v')

mean exactly the same.
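
A short sketch of that equivalence, using only the standard-library re module (the variable names are chosen for the example): the two pattern strings differ, but both match a single vertical tab and nothing else.

```python
import re

literal = re.compile('\v')    # pattern string holds the actual U+000B character
escaped = re.compile('\\v')   # pattern string holds a backslash followed by 'v'

# The pattern strings are different...
assert literal.pattern != escaped.pattern
# ...but both match the vertical tab character,
assert literal.fullmatch('\x0b') and escaped.fullmatch('\x0b')
# and the escaped form does not match the letter 'v'.
assert escaped.fullmatch('v') is None
```

Because existing patterns already rely on the second form matching only U+000B, redefining it as a whitespace class would silently change what they match.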

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/KHI74Y2JJRYFRBGGNJUSL7RZCBAI7IAN/
Re: Adding new escapes to regex module
16.08.22 23:24, MRAB wrote:
> Now that someone has asked for it, I'm trying to find a nice way of
> adding it, and I'm currently thinking that maybe I could use `\y` and
> `\Y` instead as they look a little like `\v` and `\V`, and, also,
> vertical whitespace is sort-of in the y-direction.

I do not like introducing escapes which are not supported in other RE
implementations. There is a chance of future conflicts.

Java broke compatibility in Java 8 by redefining \v from a single
vertical tab character to the vertical whitespace class. I am not sure
that it is a good example for us to follow, because \v having different
semantics in raw and non-raw strings is a potential source of bugs.
But with a special flag that controls the meaning of \v it may be safer.

Horizontal whitespace can be matched by
[ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000] in re or by [\t\p{Zs}]
in regex. Vertical whitespace can be matched by
[\n\x0b\f\r\x85\u2028\u2029]. Note that there is a dedicated Unicode
category for horizontal whitespace (excluding the tab itself), but not
for vertical whitespace, which suggests that vertical whitespace is less
important.
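
The two explicit character classes for re can be tried out with a minimal sketch like the one below. The constant names and the `classify` helper are made up for the illustration; the classes themselves are copied from the paragraph above, with the leading space kept in the horizontal one.

```python
import re

# Raw strings are used so that re, not the Python parser, interprets the escapes.
HORIZONTAL_WS = re.compile(r"[ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]")
VERTICAL_WS = re.compile(r"[\n\x0b\f\r\x85\u2028\u2029]")

def classify(ch):
    """Return 'horizontal', 'vertical', or 'other' for a single character."""
    if HORIZONTAL_WS.fullmatch(ch):
        return "horizontal"
    if VERTICAL_WS.fullmatch(ch):
        return "vertical"
    return "other"

for ch in [" ", "\t", "\u3000", "\n", "\x0b", "\u2028", "x"]:
    print(repr(ch), classify(ch))
```

This only needs the standard library, so it works today without any new escape or property name.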

In any case it is simple to introduce special Unicode categories and use
\p{ht} and \p{vt} for horizontal and vertical whitespaces.

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/CXK6TDJDGYM2BTPVNIDQIWMQN76K65KN/
Re: Adding new escapes to regex module
On 2022-08-17 08:25, Serhiy Storchaka wrote:
> Java broke compatibility in Java 8 by redefining \v from a single
> vertical tab character to the vertical whitespace class. I am not sure
> that it is a good example for us to follow, because \v having different
> semantics in raw and non-raw strings is a potential source of bugs.
> But with a special flag that controls the meaning of \v it may be safer.
>

It's not just Java. Perl supports all 4 of \h, \H, \v and \V. That might
be why Java 8 changed.
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/UPDWUF3RDVKG7LNKSS4LHCSF7XA32H6W/
Re: Adding new escapes to regex module
On 2022-08-17 17:34, MRAB wrote:
>> In any case it is simple to introduce special Unicode categories and use
>> \p{ht} and \p{vt} for horizontal and vertical whitespaces.
>
> It's not just Java. Perl supports all 4 of \h, \H, \v and \V. That might
> be why Java 8 changed.

I've found that Perl has \p{HorizSpace} and \p{VertSpace}, so I'm going
with that.
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/UKT4YPABIEEDK3XYSVLTF2ALWUVEPW5Z/
Re: Adding new escapes to regex module
On Wed, 17 Aug 2022 19:23:02 +0100
MRAB <python@mrabarnett.plus.com> wrote:
> >> In any case it is simple to introduce special Unicode categories and use
> >> \p{ht} and \p{vt} for horizontal and vertical whitespaces.
> >
> > It's not just Java. Perl supports all 4 of \h, \H, \v and \V. That might
> > be why Java 8 changed.
>
> I've found that Perl has \p{HorizSpace} and \p{VertSpace}, so I'm going
> with that.

+1 for special Unicode categories rather than retargeting existing
escapes for something else.

(also, matching horizontal/vertical whitespace sounds rather unusual)

Regards

Antoine.


Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/7XN73YFKX4CGMSZBP7D4D3GOQOQVH5NM/
Re: Adding new escapes to regex module
17.08.22 19:34, MRAB wrote:
> It's not just Java. Perl supports all 4 of \h, \H, \v and \V. That might
> be why Java 8 changed.

But Perl does not have a conflict between strings and regular expressions,
because a regular expression is a separate syntactic construct.

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/V47KWBKWCTBBXM637VPPJTR3QD4C5S23/