Mailing List Archive

GadFly - MemoryError
Hello!

I tried to add yet another database backend to my project "Bookmarks
database". My database now contains about 3000 URLs - not too much, I think.
I subclassed BookmarksParser to parse bookmarks.html into a gadfly database
and got a database of 500 Kbytes - a very small database, I hope.
Then I tried to find duplicates (there are duplicates). I ran the query:

SELECT b1.rec_no, b2.rec_no, b1.URL
FROM bookmarks b1, bookmarks b2
WHERE b1.URL = b2.URL
AND b1.rec_no < b2.rec_no

GadFly eats all memory and fails with MemoryError. Does gadfly
really construct the cross product of b1 and b2, generating 3000*3000
pairs? Or did I just do something wrong?
I tried to create indices, but they were of no help.

BTW, where are indices stored? After creating them I didn't find any
additional files - only one file in the database changed.

Oleg.
----
Oleg Broytmann National Research Surgery Centre http://sun.med.ru/~phd/
Programmers don't die, they just GOSUB without RETURN.
GadFly - MemoryError [ In reply to ]
I don't know why you have this problem.
Gadfly should not construct the cross product with
the equality you mention present.
(b1.URL = b2.URL)

Please try adding an index on URL on the table
in question. Also, maybe try increasing your virtual
memory (page file size, or whatever).

Good luck. fwiw, I've definitely run tables with
100k rows in Gadfly on a 64Meg machine.
-- Aaron Watters

ps: let me know how it goes.

-----Original Message-----
>From: Oleg Broytmann <phd@sun.med.ru>
>To: arw@ifu.net
>Cc: Python Mailing List <python-list@cwi.nl>
>Subject: GadFly - MemoryError
>Reply-To: phd@sun.med.ru
>Date: Saturday, April 03, 1999 9:35 AM
>
>Hello!
>
> I tried to add yet another database backend to my project "Bookmarks
>database". My database now contains about 3000 URLs - not too much, I think.
>I subclassed BookmarksParser to parse bookmarks.html into a gadfly database
>and got a database of 500 Kbytes - a very small database, I hope.
> Then I tried to find duplicates (there are duplicates). I ran the query:
>
>SELECT b1.rec_no, b2.rec_no, b1.URL
> FROM bookmarks b1, bookmarks b2
>WHERE b1.URL = b2.URL
>AND b1.rec_no < b2.rec_no
>
> GadFly eats all memory and fails with MemoryError. Does gadfly
>really construct the cross product of b1 and b2, generating 3000*3000
>pairs? Or did I just do something wrong?
> I tried to create indices, but they were of no help.
>
> BTW, where are indices stored? After creating them I didn't find any
>additional files - only one file in the database changed.
>
>Oleg.
>----
> Oleg Broytmann National Research Surgery Centre http://sun.med.ru/~phd/
> Programmers don't die, they just GOSUB without RETURN.
>
>
>
GadFly - MemoryError [ In reply to ]
Oops. Sorry, I didn't see the cc to the python list on the last message.
Please also make sure you are using the kjbuckets extension module, as it
significantly improves both speed and memory usage. Lemme know...
As I say, I've run some pretty large gadfly databases on some pretty
mediocre machines...
-- Aaron Watters

===
time flies like an arrow.
fruit flies like a banana.
GadFly - MemoryError [ In reply to ]
Hello!

On Mon, 5 Apr 1999, arw wrote:
> I don't know why you have this problem.
> Gadfly should not construct the cross product with
> the equality you mention present.
> (b1.URL = b2.URL)

I freed some memory and the program worked. It ate 30 Megs while running,
so I really was short of memory. But it ran for 30 minutes (on 3000 rows!)

> Please try adding an index on URL on the table
> in question. Also, maybe try increasing your virtual
> memory (page file size, or whatever).

Neither indices nor the compiled kjbuckets were of much help - instead of
running 35 minutes the program ran for 30. I think my computer spent more
time swapping than working.

> Good luck. fwiw, I've definitely run tables with
> 100k rows in Gadfly on a 64Meg machine.
> -- Aaron Watters

You are in luck.

> ps: let me know how it goes.

Not so good. I think I will not use gadfly, at least for this project.
(Currently I am using a very simple "database" stored in text files. Of
course, I cannot make complex queries, such as joins.)

Oleg.
----
Oleg Broytmann National Research Surgery Centre http://sun.med.ru/~phd/
Programmers don't die, they just GOSUB without RETURN.
GadFly - MemoryError [ In reply to ]
Oleg Broytmann wrote:
>
> Hello!
>
> I tried to add yet another database backend to my project "Bookmarks
> database". My database now contains about 3000 URLs - not too much, I think.
> I subclassed BookmarksParser to parse bookmarks.html into a gadfly database
> and got a database of 500 Kbytes - a very small database, I hope.
> Then I tried to find duplicates (there are duplicates). I ran the query:
>
> SELECT b1.rec_no, b2.rec_no, b1.URL
> FROM bookmarks b1, bookmarks b2
> WHERE b1.URL = b2.URL
> AND b1.rec_no < b2.rec_no

How many duplicates are there? Something like

SELECT URL FROM bookmarks GROUP BY URL HAVING COUNT(*) > 1

will produce the URLs with duplicates; you could then do

SELECT rec_no, URL FROM bookmarks
WHERE URL IN
(SELECT URL FROM bookmarks GROUP BY URL HAVING COUNT(*) > 1)

or create a temp table first with the results of the subquery, then join it in a
separate query.
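Tres's two-step approach (find the URLs with a count above one, then pull their rows) can be emulated in plain Python; `bookmarks` here is a hypothetical list of (rec_no, URL) tuples standing in for the table:

```python
# Emulate: SELECT URL FROM bookmarks GROUP BY URL HAVING COUNT(*) > 1
# then:    SELECT rec_no, URL FROM bookmarks WHERE URL IN (...)
bookmarks = [
    (1, "http://a/"), (2, "http://b/"),
    (3, "http://a/"), (4, "http://c/"), (5, "http://b/"),
]

# GROUP BY URL with COUNT(*)
counts = {}
for rec_no, url in bookmarks:
    counts[url] = counts.get(url, 0) + 1

# HAVING COUNT(*) > 1
dup_urls = set(url for url, n in counts.items() if n > 1)

# WHERE URL IN (subquery)
dup_rows = [(rec_no, url) for rec_no, url in bookmarks if url in dup_urls]
print(dup_rows)  # every row whose URL occurs more than once
```

This is a sketch of the logic only, not a gadfly session; the point is that the subquery runs once over the table rather than pairing every row with every other row.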
--
=========================================================
Tres Seaver tseaver@palladion.com 713-523-6582
Palladion Software http://www.palladion.com
GadFly - MemoryError [ In reply to ]
Hello!

On Wed, 7 Apr 1999, Tres Seaver wrote:
> SELECT rec_no, URL FROM bookmarks
> WHERE URL IN
> (SELECT URL FROM bookmarks GROUP BY URL HAVING COUNT(*) > 1)
>
> or create a temp table first with the results of the subquery, then join it in a
> separate query.

It looks nice. I'll try.
The problem with it is that I have worked with the Postgres SQL server for
a little more than a year, and Postgres has many problems with HAVING, so I
usually avoid it. I forgot to switch my mindset when I switched to GadFly.
Thanks.

> --
> =========================================================
> Tres Seaver tseaver@palladion.com 713-523-6582
> Palladion Software http://www.palladion.com
>

Oleg.
----
Oleg Broytmann National Research Surgery Centre http://sun.med.ru/~phd/
Programmers don't die, they just GOSUB without RETURN.
GadFly - MemoryError [ In reply to ]
It would be interesting to try this, but my guess is that it would be
slower than the first query, since equalities are optimized and "in" is not.
I hope Oleg is using gadfly 1.0 too (not beta 0.2 or whatever). -- Aaron
Watters

In article <370B5E83.C796BFF3@palladion.com>,
Tres Seaver <tseaver@palladion.com> wrote:
> Oleg Broytmann wrote:
> >
> > Hello!
> >
> > I tried to add yet another database backend to my project "Bookmarks
> > database". My database now contains about 3000 URLs - not too much, I think.
> > I subclassed BookmarksParser to parse bookmarks.html into a gadfly database
> > and got a database of 500 Kbytes - a very small database, I hope.
> > Then I tried to find duplicates (there are duplicates). I ran the query:
> >
> > SELECT b1.rec_no, b2.rec_no, b1.URL
> > FROM bookmarks b1, bookmarks b2
> > WHERE b1.URL = b2.URL
> > AND b1.rec_no < b2.rec_no
>
> How many duplicates are there? Something like
>
> SELECT URL FROM bookmarks GROUP BY URL HAVING COUNT(*) > 1
>
> will produce the URL's with duplicates; you could then do
>
> SELECT rec_no, URL FROM bookmarks
> WHERE URL IN
> (SELECT URL FROM bookmarks GROUP BY URL HAVING COUNT(*) > 1)
>
> or create a temp table first with the results of the subquery, then join it in a
> separate query.
> --
> =========================================================
> Tres Seaver tseaver@palladion.com 713-523-6582
> Palladion Software http://www.palladion.com
>

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
GadFly - MemoryError [ In reply to ]
Hi!

On Thu, 8 Apr 1999 aaron_watters@my-dejanews.com wrote:
> It would be interesting to try this, but my guess is that this would be
> slower than the first query since equalities are optimized and "in" is not.
> I hope oleg is using gadfly 1.0 too (not beta 0.2 or whatever). -- Aaron
> Watters

I started playing with GadFly a few weeks ago, so I downloaded the latest
versions of GadFly and kjBuckets.
Yesterday I found a way to use kjSet in my program.

BTW, what are "kw" in "kwParsing" and "kj" in "kjBuckets"?

Oleg.
----
Oleg Broytmann National Research Surgery Centre http://sun.med.ru/~phd/
Programmers don't die, they just GOSUB without RETURN.
GadFly - MemoryError [ In reply to ]
My guess is that you have a lot of URL matches and
gadfly is doing what you asked it to do. Could be wrong.
Rewriting the query might help, as suggested elsewhere.
If you want to discuss this further I don't think we need to
copy the list. With 3000 entries you have a potential
9M matches.

Please try this

select url, count(*) from bookmarks group by url order by 2 desc

And have a look at how many redundancies you have. This
query should run faster, I hope.
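For reference, the GROUP BY/COUNT can be emulated in plain Python to inspect the redundancy distribution; `urls` is a hypothetical list of the URL column, and we sort by the count so the most duplicated URLs come first:

```python
# Emulate: select url, count(*) from bookmarks group by url,
# sorted with the largest counts first.
urls = ["http://a/", "http://b/", "http://a/", "http://a/", "http://c/"]

counts = {}
for u in urls:
    counts[u] = counts.get(u, 0) + 1

# order by the count, descending
by_count = sorted(counts.items(), key=lambda kv: -kv[1])
print(by_count)  # most redundant URLs first
```

A run like this quickly shows whether a handful of URLs (an empty default, say) account for most of the matches.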

We could take this kind of discussion offline, if you like. :)
-- Aaron Watters

-----Original Message-----
>From: Oleg Broytmann <phd@sun.med.ru>
>To: arw <arw@ifu.net>
>Cc: Python Mailing List <python-list@cwi.nl>
>Subject: RE: GadFly - MemoryError
>Reply-To: phd@sun.med.ru
>Date: Wednesday, April 07, 1999 4:16 AM
>
>Hello!
>
>On Mon, 5 Apr 1999, arw wrote:
>> I don't know why you have this problem.
>> Gadfly should not construct the cross product with
>> the equality you mention present.
>> (b1.URL = b2.URL)
>
> I freed some memory and the program worked. It ate 30 Megs while running,
>so I really was short of memory. But it ran for 30 minutes (on 3000 rows!)
>
>> Please try adding an index on URL on the table
>> in question. Also, maybe try increasing your virtual
>> memory (page file size, or whatever).
>
> Neither indices nor the compiled kjbuckets were of much help - instead of
>running 35 minutes the program ran for 30. I think my computer spent more
>time swapping than working.
>
>> Good luck. fwiw, I've definitely run tables with
>> 100k rows in Gadfly on a 64Meg machine.
>> -- Aaron Watters
>
> You are in luck.
>
>> ps: let me know how it goes.
>
> Not so good. I think I will not use gadfly, at least for this project.
>(Currently I am using a very simple "database" stored in text files. Of
>course, I cannot make complex queries, such as joins.)
>
>Oleg.
>----
> Oleg Broytmann National Research Surgery Centre http://sun.med.ru/~phd/
> Programmers don't die, they just GOSUB without RETURN.
>
>
>
GadFly - MemoryError [ In reply to ]
Sorry to reply again, but I replied a bit too quickly last
time.

Could it be that a large number (say 1000) of records
have a default URL value of the null string, generating
1000000 matching record pairs? This would explain the
behavior you see. In this case rewriting the query is in order;
see below.

-----Original Message-----
>From: Oleg Broytmann <phd@sun.med.ru>
> I freed some memory and the program worked. It ate 30 Megs while running,
>so I really was short of memory. But it ran for 30 minutes (on 3000 rows!)

My feeling is that you are not using gadfly appropriately.
For example, if what you want is to identify rows with matching
URLs, a join is not the best way to do it, by far. Try this method
instead.

select * from bookmarks order by url, rec_no

This query also should be fairly fast.

Take the result of that query and loop through it
in a Python "for" loop to find the matching URLs grouped together
in the sequence. This is the way I'd recommend doing it
with gadfly, oracle, sybase, mysql, etc...
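The sorted-scan approach above can be sketched in plain Python; assume `rows` is the (hypothetical) result of the ORDER BY query, a list of (URL, rec_no) pairs already sorted by URL:

```python
# Collect groups of rows that share a URL in one pass over the
# sorted result - no self-join, no quadratic intermediate table.
rows = [
    ("http://a/", 1), ("http://a/", 3),
    ("http://b/", 2), ("http://b/", 5),
    ("http://c/", 4),
]

duplicates = []          # groups of rows that share a URL
group = [rows[0]]
for row in rows[1:]:
    if row[0] == group[0][0]:   # same URL as the current group
        group.append(row)
    else:
        if len(group) > 1:      # only keep URLs seen more than once
            duplicates.append(group)
        group = [row]
if len(group) > 1:
    duplicates.append(group)

for g in duplicates:
    print(g[0][0], [rec_no for _, rec_no in g])
```

The scan is linear in the number of rows, so the cost is dominated by the ORDER BY, not by pairing rows.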

You can also potentially use "group by URL" with max() or
min() to good effect, depending on what you want.

Sorry for the fuss and Best regards, Aaron Watters

===
Please come back another day, and another person.
-- from a story told by Erdos
GadFly - MemoryError [ In reply to ]
In article <Pine.SOL2.3.96.SK.990408202514.3496S-100000@sun.med.ru>,
phd@sun.med.ru wrote:
> I started playing with GadFly a few weeks ago, so I downloaded latest
> versions of GadFly and kjBuckets.
> Yesterday I found a way to use kjSet in my program.
>
> BTW, what are "kw" in "kwParsing" and "kj" in "kjBuckets"?

That's for me to know and you to guess.

BTW, I ran the following benchmark on my workstation, emulating
your query with artificial data:

===snip
fanout = 5
length = 3000

# create a table for self-join test
import gadfly
g = gadfly.gadfly()
g.startup("jtest", "dbtest") # dir ./dbtest should exist
c = g.cursor()
print "making table"
c.execute("create table test (a integer, b integer)")
def mapper(i): return (i, i/fanout)
data = map(mapper, range(length))
c.execute("insert into test(a,b) values (?,?)", data)

# do a self join with fanout
from time import time
print "doing query"
now = time()
c.execute("select * from test x, test y where x.b=y.b and x.a<y.a")
print "elapsed", time()-now
print len(c.fetchall()), "results generated from initial", length
===snip

On my machine (200Mhz P5 with 64Meg, NT4.0WS) it prints

C:\gadfly>testjoin.py
making table
doing query
elapsed 2.39299988747
6000 results generated from initial 3000

This is actually reasonably fast, I think. Adding an index didn't
make much of a difference because the join algorithm builds an
equivalent index on the fly for this particular query.
The optimized join builds an intermediate table of size
15000 before eliminating most of the intermediate entries with the
x.a<y.a predicate, I think.
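The on-the-fly index described here is essentially a hash join; a rough sketch of the idea in plain Python (not gadfly's actual implementation), using the same artificial data as the benchmark:

```python
# Hash join on column b: build a mapping from b-value to rows once,
# then probe it instead of scanning the whole table for each row.
test = [(i, i // 5) for i in range(3000)]   # (a, b) with fanout 5

index = {}                     # b-value -> rows with that b
for a, b in test:
    index.setdefault(b, []).append((a, b))

# select * from test x, test y where x.b = y.b and x.a < y.a
result = []
for xa, xb in test:
    for ya, yb in index[xb]:   # only rows with a matching b
        if xa < ya:
            result.append((xa, xb, ya, yb))

print(len(result))             # 6000, matching the benchmark output
```

Each of the 600 b-groups holds 5 rows, contributing 10 ordered pairs with x.a < y.a, hence 6000 results; the probe touches only matching rows, never the full 3000x3000 cross product.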

I conclude that the problem you had is probably your data and
your query, with your machine contributing a bit if it has little
memory. See my previous remarks for alternative approaches that
will probably work better.
-- Aaron Watters

===
His leather jacket had chains that would jingle
They both met movie stars, partied and mingled
They're A&R man said "I don't hear a single"
The future was wide open -- Tom Petty
