Mailing List Archive

Comparison of cyclic objects (was RE: trashcan and PR#7)
[Jeremy Hylton]>
> It doesn't seem like tail-recursion is the issue, rather we need to
> define some rules about when to end the recursion. If I understand
> what is being suggest, it is to create a worklist of subobjects to
> compare instead of making recursive calls to compare. This change
> would turn the core dump into an infinite loop; I guess that's an
> improvement, but not much of one.
>
> ...
>
> So the real problem is defining some reasonable semantics for
> comparison of recursive objects.

I think this is exactly a graph isomorphism problem, since Python always
compares "by value" (so isomorphism is the natural generalization).

This isn't hard (!= tedious, alas) to define or to implement naively, but a
straightforward implementation would be very expensive at runtime compared
to the status quo. That's why "real languages" <wink> would rather suffer
an infinite loop. It's expensive because there's no cheap way to know
whether you have a loop in an object.

An anal compromise would be to run comparisons full speed without trying to
detect loops, but if the recursion got "too deep" break out and start over
with an expensive alternative that does check for loops. The later requires
machinery similar to copy.deepcopy's.

> ...
> I think the comparison ought to return false or raise a ValueError.

After

a = []
b = []
a.append(a)
b.append(b)

it certainly "ought to be" the case that a == b in Python. "false" makes no
sense. ValueError makes no sense either unless we incur the expense of
proving first that at least one object does contain a loop (as opposed to
that it's just possibly nested thousands of levels deep) -- but then we may
as well implement an isomorphism discriminator.

> I'm not sure which is right. It seems odd to me that comparing two
> builtin lists could ever raise an exception, but it may be more
> Pythonic to raise an exception in the face of ambiguity. As the
> X3J13 committee noted:

Lisps have more notions of "equality" than Python 1.6 has flavors of strings
<wink>. Python has only one notion of equality (conveniently ignoring that
it actually has two <wink>). The thing the Lisp people argue about is which
of the three or four notions of equality to apply at varying levels when
trying to compute one of their *other* notions of equality -- there *can't*
be a universally "good" answer to that mess. Python's life is easier here.

in-concept-if-not-in-implementation-ly y'rs - tim
Re: Comparison of cyclic objects (was RE: trashcan and PR#7) [ In reply to ]
>>>>> "TP" == Tim Peters <tim_one@email.msn.com> writes:

>> So the real problem is defining some reasonable semantics for
>> comparison of recursive objects.


I'm not familiar with any algorithms for the graph isomorphism
problem, but I took a stab at a simple comparison algorithm. The idea
is to detect comparisons that would cross back-edges in the object
graphs. Instead of starting a new comparison, assume they are the
same. If, in fact, the objects are not the same, they must differ in
some other way; some other part of the comparison will fail.


My first attempt at implementing this is expensive. I maintain a
dictionary that contains all the object pairs that are currently being
compared. Specifically, the dictionary is used to implement a set of
object id pairs. Every call to PyObject_Compare will add a new
pair to the dictionary when it is called and remove it when it returns
(except for a few trivial cases).

A naive patch is included below. It does seem to involve a big
performance hit -- more than 10% slower on pystone. It also uses a
lot of extra space.

Note that the patch has all its initialization code inline in
PyObject_Compare; moving that elsewhere will help a little. It also
use a bunch of function calls where macros would be more efficient.


It looks like the anal compromise might be necessary. I'll
re-implement the patch more carefully and see what the real
effect on performance is.

Jeremy

Index: object.c
===================================================================
RCS file: /projects/cvsroot/python/dist/src/Objects/object.c,v
retrieving revision 2.67
diff -r2.67 object.c
239c239
< "__repr__ returned non-string (type %.200s)",
---
> "__repr__ returned non-string (type %s)",
276c276
< "__str__ returned non-string (type %.200s)",
---
> "__str__ returned non-string (type %s)",
300a301,328
> static PyObject *cmp_state_key = NULL;
>
> static PyObject*
> cmp_state_make_pair(v, w)
> PyObject *v, *w;
> {
> PyObject *pair = PyTuple_New(2);
> if (pair == NULL)
> return NULL;
> if ((long)v <= (long)w) {
> PyTuple_SET_ITEM(pair, 0, PyInt_FromLong((long)v));
> PyTuple_SET_ITEM(pair, 1, PyInt_FromLong((long)w));
> } else {
> PyTuple_SET_ITEM(pair, 0, PyInt_FromLong((long)w));
> PyTuple_SET_ITEM(pair, 1, PyInt_FromLong((long)v));
> }
> return pair;
> }
>
> void
> cmp_state_clear_pair(dict, key)
> PyObject *dict, *key;
> {
> PyDict_DelItem(dict, key);
> Py_DECREF(key);
> }
>
>
305a334,336
> PyObject *tstate_dict, *cmp_dict, *pair;
> int result;
>
311a343,376
> tstate_dict = PyThreadState_GetDict();
> if (tstate_dict == NULL) {
> PyErr_BadInternalCall();
> return -1;
> }
> /* fprintf(stderr, "PyObject_Compare(%X: %s, %X: %s)\n", (long)v,
> v->ob_type->tp_name, (long)w, w->ob_type->tp_name);
> */
> /* XXX should initialize elsewhere */
> if (cmp_state_key == NULL) {
> cmp_state_key = PyString_InternFromString("compare_state");
> cmp_dict = PyDict_New();
> if (cmp_dict == NULL)
> return -1;
> PyDict_SetItem(tstate_dict, cmp_state_key, cmp_dict);
> } else {
> cmp_dict = PyDict_GetItem(tstate_dict, cmp_state_key);
> if (cmp_dict == NULL)
> return NULL;
> PyDict_SetItem(tstate_dict, cmp_state_key, cmp_dict);
> }
>
> pair = cmp_state_make_pair(v, w);
> if (pair == NULL) {
> PyErr_BadInternalCall();
> return -1;
> }
> if (PyDict_GetItem(cmp_dict, pair)) {
> /* already comparing these objects. assume they're
> equal until shown otherwise
> */
> Py_DECREF(pair);
> return 0;
> }
316a382,384
> if (PyDict_SetItem(cmp_dict, pair, pair) == -1) {
> return -1;
> }
317a386
> cmp_state_clear_pair(cmp_dict, pair);
329a399,401
> if (PyDict_SetItem(cmp_dict, pair, pair) == -1) {
> return -1;
> }
344a417
> cmp_state_clear_pair(cmp_dict, pair);
350,364c423,425
< else if (PyUnicode_Check(v) || PyUnicode_Check(w)) {
< int result = PyUnicode_Compare(v, w);
< if (result == -1 && PyErr_Occurred() &&
< PyErr_ExceptionMatches(PyExc_TypeError))
< /* TypeErrors are ignored: if Unicode coercion
< fails due to one of the arguments not
< having the right type, we continue as
< defined by the coercion protocol (see
< above). Luckily, decoding errors are
< reported as ValueErrors and are not masked
< by this technique. */
< PyErr_Clear();
< else
< return result;
< }
---
> cmp_state_clear_pair(cmp_dict, pair);
> if (PyUnicode_Check(v) || PyUnicode_Check(w))
> return PyUnicode_Compare(v, w);
372c433,434
< if (vtp->tp_compare == NULL)
---
> if (vtp->tp_compare == NULL) {
> cmp_state_clear_pair(cmp_dict, pair);
374c436,439
< return (*vtp->tp_compare)(v, w);
---
> }
> result = (*vtp->tp_compare)(v, w);
> cmp_state_clear_pair(cmp_dict, pair);
> return result;
Re: Comparison of cyclic objects (was RE: trashcan and PR#7) [ In reply to ]
I did one more round of work on this idea, and I'm satisfied with the
results. Most of the performance hit can be eliminated by doing
nothing until there are at least N recursive calls to
PyObject_Compare, where N is fairly large. (I picked 25000.)
Non-circular objects that are not deeply nested only pay for an
integer increment, a decrement, and a compare.

Background for patches-only readers: This patch appears to fix PR#7.

Comments and suggestions solicitied. I think this is worth checking
in.

Jeremy

Index: Include/object.h
===================================================================
RCS file: /projects/cvsroot/python/dist/src/Include/object.h,v
retrieving revision 2.52
diff -r2.52 object.h
286a287,289
> /* tstate dict key for PyObject_Compare helper */
> extern PyObject *_PyCompareState_Key;
>
Index: Python/pythonrun.c
===================================================================
RCS file: /projects/cvsroot/python/dist/src/Python/pythonrun.c,v
retrieving revision 2.91
diff -r2.91 pythonrun.c
151a152,153
> _PyCompareState_Key = PyString_InternFromString("cmp_state");
>
Index: Objects/object.c
===================================================================
RCS file: /projects/cvsroot/python/dist/src/Objects/object.c,v
retrieving revision 2.67
diff -r2.67 object.c
300a301,306
> PyObject *_PyCompareState_Key;
>
> int _PyCompareState_nesting = 0;
> int _PyCompareState_flag = 0;
> #define NESTING_LIMIT 25000
>
305a312,313
> int result;
>
372c380
< if (vtp->tp_compare == NULL)
---
> if (vtp->tp_compare == NULL) {
374c382,440
< return (*vtp->tp_compare)(v, w);
---
> }
> ++_PyCompareState_nesting;
> if (_PyCompareState_nesting > NESTING_LIMIT)
> _PyCompareState_flag = 1;
> if (_PyCompareState_flag &&
> (vtp->tp_as_mapping || (vtp->tp_as_sequence &&
> !PyString_Check(v))))
> {
> PyObject *tstate_dict, *cmp_dict, *pair;
>
> tstate_dict = PyThreadState_GetDict();
> if (tstate_dict == NULL) {
> PyErr_BadInternalCall();
> return -1;
> }
> cmp_dict = PyDict_GetItem(tstate_dict, _PyCompareState_Key);
> if (cmp_dict == NULL) {
> cmp_dict = PyDict_New();
> if (cmp_dict == NULL)
> return -1;
> PyDict_SetItem(tstate_dict,
> _PyCompareState_Key,
> cmp_dict);
> }
>
> pair = PyTuple_New(2);
> if (pair == NULL) {
> return -1;
> }
> if ((long)v <= (long)w) {
> PyTuple_SET_ITEM(pair, 0, PyInt_FromLong((long)v));
> PyTuple_SET_ITEM(pair, 1, PyInt_FromLong((long)w));
> } else {
> PyTuple_SET_ITEM(pair, 0, PyInt_FromLong((long)w));
> PyTuple_SET_ITEM(pair, 1, PyInt_FromLong((long)v));
> }
> if (PyDict_GetItem(cmp_dict, pair)) {
> /* already comparing these objects. assume
> they're equal until shown otherwise
> */
> Py_DECREF(pair);
> --_PyCompareState_nesting;
> if (_PyCompareState_nesting == 0)
> _PyCompareState_flag = 0;
> return 0;
> }
> if (PyDict_SetItem(cmp_dict, pair, pair) == -1) {
> return -1;
> }
> result = (*vtp->tp_compare)(v, w);
> PyDict_DelItem(cmp_dict, pair);
> Py_DECREF(pair);
> } else {
> result = (*vtp->tp_compare)(v, w);
> }
> --_PyCompareState_nesting;
> if (_PyCompareState_nesting == 0)
> _PyCompareState_flag = 0;
> return result;
Re: Comparison of cyclic objects (was RE: trashcan and PR#7) [ In reply to ]
On Thu, 13 Apr 2000, Jeremy Hylton wrote:
> >>>>> "TP" == Tim Peters <tim_one@email.msn.com> writes:
>
> TP> [Jeremy Hylton]>
> >> So the real problem is defining some reasonable semantics for
> >> comparison of recursive objects.

There is a "right" way to do this, i believe, and my friend
Mark Miller implemented it in E.

He tells me his algorithm is inspired by the method for
unification of cyclic structures in Prolog III. It's available
in the E source code (in the file elib/tables/Equalizer.java).

See interesting stuff on equality and cyclic data structures at

http://www.erights.org/javadoc/org/erights/e/elib/tables/Equalizer.html
http://www.erights.org/elang/same-ref.html
http://www.erights.org/elang/blocks/defVar.html
http://www.eros-os.org/~majordomo/e-lang/0698.html

There is also a thread about equality issues in general at:

http://www.eros-os.org/~majordomo/e-lang/0000.html

It's long, but worth perusing.

Here is my rough Python translation of the code in the E Equalizer.

Python 1.4 (Mar 25 2000) [C]
Copyright 1991-1997 Stichting Mathematisch Centrum, Amsterdam
Python Console v1.4 by Ka-Ping Yee <ping@lfw.org>
>>> def same(left, right, sofar={}):
... hypothesis = (id(left), id(right))
... if left is right or sofar.has_key(hypothesis): return 1
... if type(left) is not type(right): return 0
... if type(left) is type({}):
... left, right = left.items(), right.items()
... if type(left) is type([]):
... sofar[hypothesis] = 1
... try:
... for i in range(len(left)):
... if not same(left[i], right[i], sofar): return 0
... return 1
... finally:
... del sofar[hypothesis]
... return left == right
...
...
>>> same([3],[4])
0
>>> same([3],[3])
1
>>> a = [1,2,3]
>>> b = [1,2,3]
>>> c = [1,2,3]
>>> same(a,b)
1
>>> a[1] = a
>>> same(a,a)
1
>>> same(a,b)
0
>>> b[1] = b
>>> same(a,b)
1
>>> b[1] = c
>>> b
[1, [1, 2, 3], 3]
>>> same(a,b)
0
>>> c[1] = b
>>> same(a,b)
1
>>> same(b,c)
1
>>>

I would like to see Python's comparisons work this way (i.e.
"correct" as opposed to "we give up").


-- ?!ng
Re: Comparison of cyclic objects (was RE: trashcan and PR#7) [ In reply to ]
JH> Comments and suggestions solicitied. I think this is worth
JH> checking in.

Please regenerate with unified or context diffs!

-Barry
Re: Comparison of cyclic objects (was RE: trashcan and PR#7) [ In reply to ]
Here it is contextified. One small difference from the previous patch
is that NESTING_LIMIT is now only 1000. I think this is sufficient to
cover commonly occuring nested containers.

Jeremy

Index: Include/object.h
===================================================================
RCS file: /projects/cvsroot/python/dist/src/Include/object.h,v
retrieving revision 2.52
diff -c -r2.52 object.h
*** object.h 2000/03/21 16:14:47 2.52
--- object.h 2000/04/13 21:50:10
***************
*** 284,289 ****
--- 284,292 ----
extern DL_IMPORT(int) Py_ReprEnter Py_PROTO((PyObject *));
extern DL_IMPORT(void) Py_ReprLeave Py_PROTO((PyObject *));

+ /* tstate dict key for PyObject_Compare helper */
+ extern PyObject *_PyCompareState_Key;
+
/* Flag bits for printing: */
#define Py_PRINT_RAW 1 /* No string quotes etc. */

Index: Python/pythonrun.c
===================================================================
RCS file: /projects/cvsroot/python/dist/src/Python/pythonrun.c,v
retrieving revision 2.91
diff -c -r2.91 pythonrun.c
*** pythonrun.c 2000/03/10 23:03:54 2.91
--- pythonrun.c 2000/04/13 21:50:25
***************
*** 149,154 ****
--- 149,156 ----
/* Init Unicode implementation; relies on the codec registry */
_PyUnicode_Init();

+ _PyCompareState_Key = PyString_InternFromString("cmp_state");
+
bimod = _PyBuiltin_Init_1();
if (bimod == NULL)
Py_FatalError("Py_Initialize: can't initialize __builtin__");
Index: Objects/object.c
===================================================================
RCS file: /projects/cvsroot/python/dist/src/Objects/object.c,v
retrieving revision 2.67
diff -c -r2.67 object.c
*** object.c 2000/04/10 13:42:33 2.67
--- object.c 2000/04/13 21:44:42
***************
*** 298,308 ****
--- 298,316 ----
return PyInt_FromLong(c);
}

+ PyObject *_PyCompareState_Key;
+
+ int _PyCompareState_nesting = 0;
+ int _PyCompareState_flag = 0;
+ #define NESTING_LIMIT 1000
+
int
PyObject_Compare(v, w)
PyObject *v, *w;
{
PyTypeObject *vtp, *wtp;
+ int result;
+
if (v == NULL || w == NULL) {
PyErr_BadInternalCall();
return -1;
***************
*** 369,377 ****
/* Numerical types compare smaller than all other types */
return strcmp(vname, wname);
}
! if (vtp->tp_compare == NULL)
return (v < w) ? -1 : 1;
! return (*vtp->tp_compare)(v, w);
}

long
--- 377,443 ----
/* Numerical types compare smaller than all other types */
return strcmp(vname, wname);
}
! if (vtp->tp_compare == NULL) {
return (v < w) ? -1 : 1;
! }
! ++_PyCompareState_nesting;
! if (_PyCompareState_nesting > NESTING_LIMIT)
! _PyCompareState_flag = 1;
! if (_PyCompareState_flag &&
! (vtp->tp_as_mapping || (vtp->tp_as_sequence &&
! !PyString_Check(v))))
! {
! PyObject *tstate_dict, *cmp_dict, *pair;
!
! tstate_dict = PyThreadState_GetDict();
! if (tstate_dict == NULL) {
! PyErr_BadInternalCall();
! return -1;
! }
! cmp_dict = PyDict_GetItem(tstate_dict, _PyCompareState_Key);
! if (cmp_dict == NULL) {
! cmp_dict = PyDict_New();
! if (cmp_dict == NULL)
! return -1;
! PyDict_SetItem(tstate_dict,
! _PyCompareState_Key,
! cmp_dict);
! }
!
! pair = PyTuple_New(2);
! if (pair == NULL) {
! return -1;
! }
! if ((long)v <= (long)w) {
! PyTuple_SET_ITEM(pair, 0, PyInt_FromLong((long)v));
! PyTuple_SET_ITEM(pair, 1, PyInt_FromLong((long)w));
! } else {
! PyTuple_SET_ITEM(pair, 0, PyInt_FromLong((long)w));
! PyTuple_SET_ITEM(pair, 1, PyInt_FromLong((long)v));
! }
! if (PyDict_GetItem(cmp_dict, pair)) {
! /* already comparing these objects. assume
! they're equal until shown otherwise
! */
! Py_DECREF(pair);
! --_PyCompareState_nesting;
! if (_PyCompareState_nesting == 0)
! _PyCompareState_flag = 0;
! return 0;
! }
! if (PyDict_SetItem(cmp_dict, pair, pair) == -1) {
! return -1;
! }
! result = (*vtp->tp_compare)(v, w);
! PyDict_DelItem(cmp_dict, pair);
! Py_DECREF(pair);
! } else {
! result = (*vtp->tp_compare)(v, w);
! }
! --_PyCompareState_nesting;
! if (_PyCompareState_nesting == 0)
! _PyCompareState_flag = 0;
! return result;
}

long
RE: Comparison of cyclic objects (was RE: trashcan and PR#7) [ In reply to ]
[Jeremy Hylton]
> I'm not familiar with any algorithms for the graph isomorphism
> problem,

Well, while an instance of graph isomorphism, this one is a relatively
simple special case (because "the graphs" here are rooted, directed, and
have ordered children).

> but I took a stab at a simple comparison algorithm. The idea
> is to detect comparisons that would cross back-edges in the object
> graphs. Instead of starting a new comparison, assume they are the
> same. If, in fact, the objects are not the same, they must differ in
> some other way; some other part of the comparison will fail.

Bingo! That's the key trick.
Re: Comparison of cyclic objects (was RE: trashcan and PR#7) [ In reply to ]
Jeremy Hylton wrote:
>
> Here it is contextified. One small difference from the previous patch
> is that NESTING_LIMIT is now only 1000. I think this is sufficient to
> cover commonly occuring nested containers.
>
> Jeremy
>
> [patch omitted]

Nice.

I think you don't need the _PyCompareState_flag. Like in trashcan,
_PyCompareState_nesting is enough to enter the sections of the code
that depend on _PyCompareState_flag.

--
Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
Re: Comparison of cyclic objects (was RE: trashcan and PR#7) [ In reply to ]
>>>>> "VM" == Vladimir Marangozov <Vladimir.Marangozov@inrialpes.fr> writes:

VM> Jeremy Hylton wrote:
>> Here it is contextified. One small difference from the previous
>> patch is that NESTING_LIMIT is now only 1000. I think this is
>> sufficient to cover commonly occuring nested containers.
>>
>> Jeremy
>>
>> [patch omitted]

VM> Nice.

VM> I think you don't need the _PyCompareState_flag. Like in
VM> trashcan, _PyCompareState_nesting is enough to enter the
VM> sections of the code that depend on _PyCompareState_flag.

Right. Thanks for the suggestion, and thanks to Barry & Fred for
theirs. I've checked in the changes.

Jeremy
RE: Comparison of cyclic objects (was RE: trashcan and PR#7) [ In reply to ]
On Thu, 13 Apr 2000, Tim Peters wrote:

> Well, while an instance of graph isomorphism, this one is a relatively
> simple special case (because "the graphs" here are rooted, directed, and
> have ordered children).

Ordered? What about dictionaries?

--
Moshe Zadka <mzadka@geocities.com>.
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com
RE: Comparison of cyclic objects (was RE: trashcan andPR#7) [ In reply to ]
[Tim]
> Well, while an instance of graph isomorphism, this one is a relatively
> simple special case (because "the graphs" here are rooted, directed, and
> have ordered children).

[Moshe Zadka]
> Ordered? What about dictionaries?

An ordering of a dict's kids is forced in the context of comparison (see
dict_compare in dictobject.c).