Mailing List Archive

LRU cache
Here's my problem today. I am using a dict() to implement a quick and
dirty in-memory cache.

I stop adding elements when I reach 1000 elements (a totally
arbitrary number), but I would like to have something slightly more
sophisticated to free up space for newer and potentially more relevant
entries.

I am thinking of the Least Recently Used principle, but how to implement
it is not immediately obvious. Before I embark on reinventing the wheel, is
there a tool, library or smart trick that will allow me to remove
elements with LRU logic?

thanks

Dino
--
https://mail.python.org/mailman/listinfo/python-list
Re: LRU cache
On Wed, 15 Feb 2023 at 09:37, Dino <dino@no.spam.ar> wrote:
>
>
> Here's my problem today. I am using a dict() to implement a quick and
> dirty in-memory cache.
>
> I stop adding elements when I reach 1000 elements (a totally
> arbitrary number), but I would like to have something slightly more
> sophisticated to free up space for newer and potentially more relevant
> entries.
>
> I am thinking of the Least Recently Used principle, but how to implement
> it is not immediately obvious. Before I embark on reinventing the wheel, is
> there a tool, library or smart trick that will allow me to remove
> elements with LRU logic?
>

Check out functools.lru_cache :)
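
For instance, a minimal sketch (the function "expensive" is just a
stand-in for whatever slow computation you are caching):

import functools

@functools.lru_cache(maxsize=1000)
def expensive(arg):
    # the body runs only on a cache miss; at 1000 entries the least
    # recently used result is evicted automatically
    return arg * arg

expensive(2)                   # computed
expensive(2)                   # served from the cache
print(expensive.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=1000, currsize=1)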

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
RE: LRU cache
Dino,

If your question is understood, you want to treat a dictionary as a sort of
queue with a maximum number of entries. And, you want to remove some kind of
least useful item to make room for any new one.

Python dictionaries now keep entries in the order they were first entered
(guaranteed since Python 3.7). There may already be some variant out there
that implements this, but it should not be hard to create.

So you could simply change the method that adds an item to the dictionary.
If the new item is not already in the dictionary and the cache is full,
simply remove the first (oldest) item using whatever method you wish.

Building a list of all the keys can be avoided by using an iterator once,
as in next(iter(my_dict.keys())) or something like that.

You can then remove that item using the key and insert your new item at the
end.

Of course, that is not least recently used but least recently entered.
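
A rough sketch of that idea (the function name and shape are just
illustrative):

def put(cache, key, value, maxsize=1000):
    # evict the oldest *inserted* entry once the dict is full
    if key not in cache and len(cache) >= maxsize:
        del cache[next(iter(cache))]  # first key in insertion order
    cache[key] = value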

Keeping track of what was accessed last, or never, is a problem not
trivially solved with just a dictionary. You could store a tuple for each
item that includes an access date along with the payload as the value, and
then search all the values to find the one with the oldest date. That seems
a bit much, so another architecture could be to maintain a second data
structure holding keys and dates, kept sorted by date; every time the cache
accesses a value, it finds the item in that second structure, updates the
date, and moves the item to the end of the structure.
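
Concretely, the scan-on-evict variant might look like this (a sketch;
the O(n) scan for the oldest entry is the "bit much" part):

import time

def get(cache, key):
    value, _ = cache[key]                   # KeyError on a miss
    cache[key] = (value, time.monotonic())  # refresh the access time
    return value

def put(cache, key, value, maxsize=1000):
    if key not in cache and len(cache) >= maxsize:
        oldest = min(cache, key=lambda k: cache[k][1])  # O(n) scan by access time
        del cache[oldest]
    cache[key] = (value, time.monotonic())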

But note that some may want to keep an access count so you always know how
many times an item has been re-used and thus not remove them as easily.

The main goal of a dictionary is to speed up access, making lookups close
to constant time. Your algorithm can add so much overhead, depending on how
it is done, that it can defeat the purpose.

What some might do is skip the dictionary and use some kind of queue, such
as a deque (collections.deque), that holds your thousand entries: new items
are added at one end, items that get accessed are moved back to that same
end, and a brute-force search is used to find an entry. But note that if
the two ends are mixed up (say, new items placed at the eviction end), an
algorithm like that may end up removing the newest item immediately.
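
One careful arrangement, purely as an illustration (the search and the
move are both O(n), which is the brute-force part):

from collections import deque

cache = deque()  # (key, value) pairs; most recently used at the right end

def put(key, value, maxsize=1000):
    if len(cache) >= maxsize:
        cache.popleft()              # evict from the left: least recently used
    cache.append((key, value))

def lookup(key):
    for i, (k, v) in enumerate(cache):  # brute-force search
        if k == key:
            del cache[i]
            cache.append((k, v))        # move the hit to the most-recent end
            return v
    return None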

An OrderedDict may be one solution, as shown here:

https://www.geeksforgeeks.org/lru-cache-in-python-using-ordereddict/
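
The core of that approach, roughly (a sketch, not the linked article's
exact code):

from collections import OrderedDict

class LRU:
    def __init__(self, maxsize=1000):
        self.maxsize = maxsize
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # a hit becomes most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)  # evict the least recently used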



-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On
Behalf Of Dino
Sent: Tuesday, February 14, 2023 5:07 PM
To: python-list@python.org
Subject: LRU cache


Here's my problem today. I am using a dict() to implement a quick and dirty
in-memory cache.

I stop adding elements when I reach 1000 elements (a totally
arbitrary number), but I would like to have something slightly more
sophisticated to free up space for newer and potentially more relevant
entries.

I am thinking of the Least Recently Used principle, but how to implement
it is not immediately obvious. Before I embark on reinventing the wheel, is
there a tool, library or smart trick that will allow me to remove elements
with LRU logic?

thanks

Dino
--
https://mail.python.org/mailman/listinfo/python-list

--
https://mail.python.org/mailman/listinfo/python-list
Re: LRU cache
On 2/14/23 15:07, Dino wrote:
>
> Here's my problem today. I am using a dict() to implement a quick and
> dirty in-memory cache.
>
> I stop adding elements when I reach 1000 elements (a totally
> arbitrary number), but I would like to have something slightly more
> sophisticated to free up space for newer and potentially more relevant
> entries.
>
> I am thinking of the Least Recently Used principle, but how to implement
> it is not immediately obvious. Before I embark on reinventing the wheel, is
> there a tool, library or smart trick that will allow me to remove
> elements with LRU logic?

It depends here how fancy you want to get. If you really need the
behavior of a size-limited cache and you expect there to be a lot of
churn, then you'll probably want a combination of data structures...
e.g. a dict-like thing for quick hashed lookups (means your "keys" need
to be hashable) and a doubly-linked list for ordering so you can quickly
move a hit to the most-recent position - and maybe you want to keep
cache stats as well.
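
Something along those lines, as a bare-bones sketch (illustrative only;
no cache stats, no thread safety):

class _Node:
    __slots__ = ("key", "value", "prev", "nxt")

    def __init__(self, key=None, value=None):
        self.key, self.value = key, value
        self.prev = self.nxt = None

class LRU:
    def __init__(self, maxsize=1000):
        self.maxsize = maxsize
        self.map = {}  # key -> node, for O(1) hashed lookup
        self.head, self.tail = _Node(), _Node()  # sentinel ends of the list
        self.head.nxt, self.tail.prev = self.tail, self.head

    def _unlink(self, node):
        node.prev.nxt, node.nxt.prev = node.nxt, node.prev

    def _to_front(self, node):  # front == most recently used
        node.prev, node.nxt = self.head, self.head.nxt
        self.head.nxt.prev = node
        self.head.nxt = node

    def get(self, key):
        node = self.map.get(key)
        if node is None:
            return None
        self._unlink(node)    # O(1) move of a hit to the most-recent position
        self._to_front(node)
        return node.value

    def put(self, key, value):
        node = self.map.get(key)
        if node is not None:
            node.value = value
            self._unlink(node)
            self._to_front(node)
            return
        if len(self.map) >= self.maxsize:
            lru = self.tail.prev  # least recently used sits at the back
            self._unlink(lru)
            del self.map[lru.key]
        node = _Node(key, value)
        self.map[key] = node
        self._to_front(node)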

Do look at functools if that well-tested standard library module fits
your needs. If you need, or are motivated (as a learning experience?), to
roll your own, my memory tells me the RealPython crew did a tutorial on
implementing an LRU cache; it might be worth a look (not sure if this is
one of the all-free ones, or one of the
must-be-a-paid-subscriber-to-get-the-advanced-stuff ones).


--
https://mail.python.org/mailman/listinfo/python-list
RE: LRU cache
Chris,

That is a nice decorator solution with some extra features.

We don't know if the OP needed a cache that was more general purpose and
could be accessed from multiple points, and shared across multiple
functions.



-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On
Behalf Of Chris Angelico
Sent: Tuesday, February 14, 2023 5:46 PM
To: python-list@python.org
Subject: Re: LRU cache

On Wed, 15 Feb 2023 at 09:37, Dino <dino@no.spam.ar> wrote:
>
>
> Here's my problem today. I am using a dict() to implement a quick and
> dirty in-memory cache.
>
> I stop adding elements when I reach 1000 elements
> (a totally arbitrary number), but I would like to have something
> slightly more sophisticated to free up space for newer and potentially
> more relevant entries.
>
> I am thinking of the Least Recently Used principle, but how to
> implement it is not immediately obvious. Before I embark on reinventing
> the wheel, is there a tool, library or smart trick that will allow me to
> remove elements with LRU logic?
>

Check out functools.lru_cache :)

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list

--
https://mail.python.org/mailman/listinfo/python-list
Re: LRU cache
Thank you Mats, Avi and Chris

btw, functools.lru_cache seems rather different from what I need, but
maybe I am missing something. I'll look closer.

On 2/14/2023 7:36 PM, Mats Wichmann wrote:
> On 2/14/23 15:07, Dino wrote:
>>

--
https://mail.python.org/mailman/listinfo/python-list
Re: LRU cache
I think this does the trick:

https://gist.github.com/Gerardwx/c60d200b4db8e7864cb3342dd19d41c9


#!/usr/bin/env python3
import collections
import random
from typing import Hashable, Any, Optional, Dict, Tuple


class LruCache:
    """Dictionary-like storage of most recently inserted values"""

    def __init__(self, size: int = 1000):
        """:param size: number of cached entries"""
        assert isinstance(size, int)
        self.size = size
        self.insert_counter = 0
        self.oldest = 0
        self._data: Dict[Hashable, Tuple[Any, int]] = {}  # value and age index per key
        self._lru: Dict[int, Hashable] = {}  # age counter -> key

    def insert(self, key: Hashable, value: Any) -> None:
        """Insert into dictionary"""
        existing = self._data.get(key, None)
        self._data[key] = (value, self.insert_counter)
        self._lru[self.insert_counter] = key
        if existing is not None:
            self._lru.pop(existing[1], None)  # remove old counter value, if it exists
        self.insert_counter += 1
        if (sz := len(self._data)) > self.size:  # is cache full?
            assert sz == self.size + 1
            # index may not be present, if the value was reinserted
            while (key := self._lru.get(self.oldest, None)) is None:
                self.oldest += 1
            del self._data[key]  # remove oldest key / value from dictionary
            del self._lru[self.oldest]
            self.oldest += 1  # next oldest index
        assert len(self._lru) == len(self._data)

    def get(self, key: Hashable) -> Optional[Any]:
        """Get value or return None if not in cache"""
        # note: get() does not refresh an entry's age; only insert() does
        if (tpl := self._data.get(key, None)) is not None:
            return tpl[0]
        return None


if __name__ == "__main__":
    CACHE_SIZE = 1000
    TEST_SIZE = 1_000_000
    cache = LruCache(size=CACHE_SIZE)

    values = []  # random test values
    for i in range(TEST_SIZE):
        values.append(random.randint(-5000, 5000))

    summary = collections.defaultdict(int)
    for value in values:
        cache.insert(value, value * value)
        summary[value] += 1
    smallest = TEST_SIZE
    largest = -TEST_SIZE
    total = 0
    for value, count in summary.items():
        smallest = min(smallest, count)
        largest = max(largest, count)
        total += count
    avg = total / len(summary)
    print(f"{len(summary)} values, occurrences range from {smallest} to {largest}, average {avg:.1f}")

    recent = set()  # the most recent entries
    for i in range(len(values) - 1, -1, -1):  # loop backwards to get the most recent entries
        value = values[i]
        if len(recent) < CACHE_SIZE:
            recent.add(value)
        if value in recent:
            if (r := cache.get(value)) != value * value:
                raise ValueError(f"Cache missing recent {value} {r}")
        else:
            if cache.get(value) is not None:
                raise ValueError(f"Cache includes old {value}")

From: Python-list <python-list-bounces+gweatherby=uchc.edu@python.org> on behalf of Dino <dino@no.spam.ar>
Date: Wednesday, February 15, 2023 at 3:07 PM
To: python-list@python.org <python-list@python.org>
Subject: Re: LRU cache

Thank you Mats, Avi and Chris

btw, functools.lru_cache seems rather different from what I need, but
maybe I am missing something. I'll look closer.

On 2/14/2023 7:36 PM, Mats Wichmann wrote:
> On 2/14/23 15:07, Dino wrote:
>>

--
https://mail.python.org/mailman/listinfo/python-list
--
https://mail.python.org/mailman/listinfo/python-list
Re: LRU cache
Thank you, Gerard. I really appreciate your help

Dino

On 2/16/2023 9:40 PM, Weatherby,Gerard wrote:
> I think this does the trick:
>
> https://gist.github.com/Gerardwx/c60d200b4db8e7864cb3342dd19d41c9
>
> [... full code example and quoted text snipped; see Gerard's message above ...]

--
https://mail.python.org/mailman/listinfo/python-list
Re: LRU cache
I sometimes use this trick, which I learnt from a book by Martelli.
Instead of try/except, membership testing with "in" (__contains__) might
be faster. Probably "depends". Matter of measuring.
def somefunc(arg, _cache={}):
    if len(_cache) > 10 ** 5:
        _cache.popitem()  # make room: popitem() discards the most recently inserted pair
    try:
        return _cache[arg]
    except KeyError:
        result = expensivefunc(arg)
        _cache[arg] = result
        return result
Albert-Jan
--
https://mail.python.org/mailman/listinfo/python-list
Re: LRU cache
On 2/18/2023 5:38 AM, Albert-Jan Roskam wrote:
> I sometimes use this trick, which I learnt from a book by Martelli.
> Instead of try/except, membership testing with "in" (__contains__) might
> be faster. Probably "depends". Matter of measuring.
> def somefunc(arg, _cache={}):
>     if len(_cache) > 10 ** 5:
>         _cache.popitem()
>     try:
>         return _cache[arg]
>     except KeyError:
>         result = expensivefunc(arg)
>         _cache[arg] = result
>         return result
> Albert-Jan

_cache.get(arg) should be a little faster and use slightly fewer
resources than the try/except.
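
Presumably something of this shape (a sketch; it assumes a genuine
result is never None, with expensivefunc as in the earlier posts):

def somefunc(arg, _cache={}):
    result = _cache.get(arg)    # None means "not cached"
    if result is None:
        result = expensivefunc(arg)
        _cache[arg] = result
    return result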

--
https://mail.python.org/mailman/listinfo/python-list
Re: LRU cache
On 18/02/2023 15:29, Thomas Passin wrote:
> On 2/18/2023 5:38 AM, Albert-Jan Roskam wrote:
>>     I sometimes use this trick, which I learnt from a book by Martelli.
>>     Instead of try/except, membership testing with "in"
>> (__contains__) might
>>     be faster. Probably "depends". Matter of measuring.
>>     def somefunc(arg, _cache={}):
>>         if len(_cache) > 10 ** 5:
>>             _cache.popitem()
>>         try:
>>             return _cache[arg]
>>         except KeyError:
>>             result = expensivefunc(arg)
>>             _cache[arg] = result
>>             return result
>>     Albert-Jan
>
> _cache.get(arg) should be a little faster and use slightly fewer
> resources than the try/except.
>
Provided that you can provide a default value to get() which will never
be a genuine "result".

--
https://mail.python.org/mailman/listinfo/python-list
Re: LRU cache
On Feb 18, 2023 17:28, Rob Cliffe via Python-list <python-list@python.org>
wrote:

On 18/02/2023 15:29, Thomas Passin wrote:
> On 2/18/2023 5:38 AM, Albert-Jan Roskam wrote:
>>     I sometimes use this trick, which I learnt from a book by
Martelli.
>>     Instead of try/except, membership testing with "in"
>> (__contains__) might
>>     be faster. Probably "depends". Matter of measuring.
>>     def somefunc(arg, _cache={}):
>>         if len(_cache) > 10 ** 5:
>>             _cache.popitem()
>>         try:
>>             return _cache[arg]
>>         except KeyError:
>>             result = expensivefunc(arg)
>>             _cache[arg] = result
>>             return result
>>     Albert-Jan
>
> _cache.get(arg) should be a little faster and use slightly fewer
> resources than the try/except.
>
Provided that you can provide a default value to get() which will never
be a genuine "result".

=====
This might be better than None:
_cache.get(arg, Ellipsis)
--
https://mail.python.org/mailman/listinfo/python-list
Re: LRU cache
On 18/02/2023 17:19, Albert-Jan Roskam wrote:
>
>
> On Feb 18, 2023 17:28, Rob Cliffe via Python-list
> <python-list@python.org> wrote:
>
> On 18/02/2023 15:29, Thomas Passin wrote:
> > On 2/18/2023 5:38 AM, Albert-Jan Roskam wrote:
> >>     I sometimes use this trick, which I learnt from a book by
> Martelli.
> >>     Instead of try/except, membership testing with "in"
> >> (__contains__) might
> >>     be faster. Probably "depends". Matter of measuring.
> >>     def somefunc(arg, _cache={}):
> >>         if len(_cache) > 10 ** 5:
> >>             _cache.popitem()
> >>         try:
> >>             return _cache[arg]
> >>         except KeyError:
> >>             result = expensivefunc(arg)
> >>             _cache[arg] = result
> >>             return result
> >>     Albert-Jan
> >
> > _cache.get(arg) should be a little faster and use slightly fewer
> > resources than the try/except.
> >
> Provided that you can provide a default value to get() which will
> never
> be a genuine "result".
>
>
> =====
>
> This might be better than None:
> _cache.get(arg, Ellipsis)
>
>
A common strategy is to have a dedicated sentinel object.  E.g. (untested):
IMPOSSIBLE_RESULT = object()
...
        if _cache.get(arg, IMPOSSIBLE_RESULT) is IMPOSSIBLE_RESULT:
            # arg was not in the cache
            ...
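
Spelled out as a complete lookup (still a sketch; expensivefunc as in
the earlier posts):

IMPOSSIBLE_RESULT = object()  # dedicated sentinel, distinct from every real value

def somefunc(arg, _cache={}):
    result = _cache.get(arg, IMPOSSIBLE_RESULT)
    if result is IMPOSSIBLE_RESULT:  # arg was not in the cache
        result = expensivefunc(arg)
        _cache[arg] = result
    return result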
--
https://mail.python.org/mailman/listinfo/python-list