Fixed
Status Update
Comments
pc...@google.com <pc...@google.com>
ar...@google.com <ar...@google.com> #2
This issue does not reproduce with dev preview 4.
ye...@google.com <ye...@google.com> #3
Closing this issue as per comment #2 from reporter.
[Deleted User] <[Deleted User]> #4
[Comment deleted]
al...@gmail.com <al...@gmail.com> #5
[Comment deleted]
ar...@google.com <ar...@google.com> #6
Do you have any results to share from heapy?
il...@google.com <il...@google.com> #7
I've reproduced this issue with testbed [https://developers.google.com/appengine/docs/python/tools/localunittesting] and did some investigation and memory profiling.
There's a race condition in the ndb stack which prevents a percentage of futures from being marked complete. As a result, they are not removed from the state object, and this keeps the entity batch reachable (via a callback closure). Yikes!
I'll post an update when I have more information about the root cause and [hopefully!] a workaround. Note that this leak only affects requests when the in-app cache is disabled or cleared.
Regards,
Isaac
il...@google.com <il...@google.com> #8
Some progress:
The non-determinism observed with fetch_page is due to the iteration order of eventloop.rpcs, which is passed to datastore_rpc.MultiRpc.wait_any(); apiproxy_stub_map.__check_one then selects the *last* rpc from the iterator.
Fetching with page_size of 10 does an rpc with count=10, limit=11, a standard technique to force the backend to more accurately determine whether there are more results. This returns 10 results, but due to a bug in the way the QueryIterator is unraveled, an RPC is added to fetch the last entry (using the obtained cursor and count=1). NDB then returns the batch of entities without processing this RPC. I believe that this RPC will not be evaluated until selected at random (if MultiRpc consumes it before a necessary rpc), since it doesn't block client code.
For better or worse, direct use of iter() does not have this issue (count and limit will be the same). iter() can be used as a workaround for the performance and memory issues associated with fetch page caused by the above.
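As a minimal sketch of this workaround (illustrative only, not taken from the report; the model name and batch size are placeholders):
from google.appengine.ext import ndb

class MyModel(ndb.Model):  # placeholder model
    name = ndb.StringProperty()

def walk_entities():
    # iter() issues RPCs with count equal to limit, so no stray trailing RPC
    # is left behind the way fetch_page() leaves one.
    for entity in MyModel.query().iter(batch_size=100):
        yield entity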
Joachim also reports memory increases over an iter() call, but I have not been able to reproduce this locally. I don't see what prevents no-longer-needed futures and rpcs from building up, though, so this could explain his observations. We may be able to fix this by replacing some structures with weakrefs, starting with ndb.tasklets._state.all_pending and ndb.eventloop.rpcs.
https://gist.github.com/isaacl/9226812 has some of my local test setup; you may find it useful for testing ndb features or benchmarking database functions.
Please speak up if you're able to come up with a reproducible leak with query().iter().
Cheers,
Isaac
jk...@jkrebs.com <jk...@jkrebs.com> #9
Isaac, many thanks for your investigation and promising results!
- Are you able to reproduce the problem with query().iter() in production? This is the only way I have tested so far.
- Do you have an idea why the value of the batch_size parameter seems to play a role?
- Is there a way to reset the state object completely as a temporary work around? (to allow the objects to be collected)
cs...@khanacademy.org <cs...@khanacademy.org> #10
Joachim, we at Khan Academy are very interested in this bug, since we suspect it may be affecting us as well.
I could reproduce the memory growth locally (on dev-appserver) with the /task/page handler, but when I tried to tweak your example to remove the task queue -- to do the ndb work in get() rather than post() -- it did not use excess memory. Do you know why the task queue is necessary for the memory use bug to emerge?
ho...@gmail.com <ho...@gmail.com> #11
Running in a task or backend is necessary for longer running processes that would exceed the timeout limit of a normal request.
I am seeing memory growth regardless of whether I call iter() or fetch_page()
Any progress towards a resolution?
cs...@khanacademy.org <cs...@khanacademy.org> #12
I spent a long time looking into this, using a hacked version of python that I could do memory profiling on. Caveats: this was only on dev-appserver, so I only looked at /task/page, since /task/memory didn't have this same growth.
My conclusion is that everything is working normally, and while the memory grows to 300M+ while executing /task/page and then stays there, that is because of the way memory allocation works.
Basically, the /task/page job requires lots of memory during the request in order to do its job (it's fetching a lot of data; maybe not 300M worth of data but the extra needed memory could be due to the bug that isaac found). When the request is over, python is properly free()ing the memory, but the malloc implementation doesn't return that freed memory back to the OS, it keeps it on its own internal freelists. (All malloc implementations work this way, for efficiency.)
Thus, python is 'using' a lot of memory but not leaking it. When I run the /task/page job again, it stays using 300M the entire time, showing there's not a leak, at least for me.
The prod environment may well be different, so my analysis is not useful for closing this bug one way or the other, but I am confident that, at least in the dev-appserver implementation, there's no memory leak.
cs...@khanacademy.org <cs...@khanacademy.org> #13
[I meant /task/basic above, not /task/memory]
One thing to point out: we had to put the entire task inside a @ndb.toplevel to avoid memory leaks. Otherwise, rogue RPCs might stick around forever, which *would* be a memory leak.
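As a minimal sketch of what this looks like (illustrative only; the handler, route, and model are placeholders, not taken from our code):
import logging

import webapp2
from google.appengine.ext import ndb

class MyModel(ndb.Model):  # placeholder model
    name = ndb.StringProperty()

class PageTaskHandler(webapp2.RequestHandler):
    @ndb.toplevel  # wait for all outstanding NDB RPCs before the request ends
    def post(self):
        for entity in MyModel.query().iter(batch_size=100):
            logging.info('processing %s', entity.key)  # stand-in for real work

app = webapp2.WSGIApplication([('/task/page', PageTaskHandler)])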
cu...@gmail.com <cu...@gmail.com> #14
It looks like this bug has gone stale, so I just wanted to pipe in that this is definitely a problem in production and @ndb.toplevel doesn't help. I ended up with a workaround whereby I dynamically create a db.Expando class from the ndb model to do a keys-only equivalent query. I convert the old db keys to ndb keys, load the entities, and do the appropriate work. This resulted in all of our memory errors going away.
jk...@jkrebs.com <jk...@jkrebs.com> #15
That sounds like an excellent workaround!
Would you mind sharing some of the code as an example?
cu...@gmail.com <cu...@gmail.com> #16
[Comment deleted]
cu...@gmail.com <cu...@gmail.com> #17
It's not a full example but I attached some code with the relevant pieces. I think the code is pretty self explanatory but I'd be happy to elaborate if needed. Hope that helps!
[Deleted User] <[Deleted User]> #18
I also use @ndb.toplevel in my module's get handler, where I upload all my entities in one go using ndb.put_multi(). I have seen my modules being shut down automatically because the memory consumed is sometimes over 150-200 MB, although the total size of all entities in the put_multi() is less than 5-10 MB. I have been forced to use machines with more memory, but for some reason, for the past two days, even with bigger machines, the memory used by the same modules with almost the same dataset has spiked to over 600 MB, and _ah/stop starts killing all my module instances. Is there a possibility the memory balloons while the background process which uploads entities stalls?
Is there a work around, till this problem gets fixed?
Error messages:
F 11:06:32.784 Exceeded soft private memory limit with 614.477 MB after servicing 2 requests total
W 11:06:32.784 While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
[Deleted User] <[Deleted User]> #19
I use a script to upload/update data on the datastore: I query the datastore for a certain product Id, then I do some updates to the properties and update it using a put_multi inside a deferred.defer.
To avoid the issue of memory blowing up and having the module quit, I used B4_1G. Now I get "Process moved to a different machine" and the module just quits halfway through the update process. I can see that utilization and memory are within limits, but the module still quits.
I opened another issue, 11167; I strongly suspect these issues are related.
App Id: Shopina-Voyage-001
[Deleted User] <[Deleted User]> #20
One additional note: the problem "Process moved to a different machine" occurs with both NDB and DB models. I checked with DB, and after updating around 22K entities (in ~90 minutes) it fails and the module just quits unexpectedly.
ni...@gmail.com <ni...@gmail.com> #21
You've got a different problem, I'm sure.
[Deleted User] <[Deleted User]> #22
re...@gmail.com <re...@gmail.com> #23
My symptom of this bug was the following message coming from appstats:
"Found 1 RPC request(s) without matching response (presumably due to timeouts or other errors)"
It took me a while to find this bug based on that, so I'm adding it here so it's searchable.
"Found 1 RPC request(s) without matching response (presumably due to timeouts or other errors)"
It took me a while to find this bug based on that, so I'm adding it here so it's searchable.
ni...@gmail.com <ni...@gmail.com> #24
We run into this problem every 2 weeks or so. We've started making db peer classes for long running tasks (like Map/Reduce data migrations, for example). It's a crap workaround that results in duplication of code and effort. We've even thought about abandoning ndb altogether and just using db, but the caching benefits of ndb have kept us using this workaround. Does anyone at Google care about this issue anymore? Perhaps Google doesn't even use ndb? It's been sitting here for well over a year with 80 stars on it now and no news.
cu...@gmail.com <cu...@gmail.com> #25
Nick, if you look at the code I attached in fix.py, you'll see how you can create the peer class automatically. That helps a little bit with the code bloat/maintenance.
It seems that this should be higher than 'Medium' priority. How can we get that bumped?
kr...@agosto.com <kr...@agosto.com> #26
Curt, fix.py works if you only need to get at the key of the object. I couldn't get it to work and be able to get values of other properties.
cu...@gmail.com <cu...@gmail.com> #27
Kris, well, there isn't a 1-1 mapping between 'ndb' and 'db' properties. So yes, that is problematic. But! The trick is to do a 'db' keys only query, convert the 'db' keys to 'ndb' keys and then finally 'get' or 'get_multi' on the 'ndb' key(s) to get the ndb entity(s). And then you should be set.
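A rough sketch of that sequence (my reading of the description, with placeholder names; the attached fix.py is the authoritative version):
from google.appengine.ext import db, ndb

class MyEntityPeer(db.Expando):
    @classmethod
    def kind(cls):
        return 'MyEntity'  # must match the kind name of the ndb model

def iter_ndb_entities(batch_size=500):
    # keys-only db query, then convert each old key to an ndb key and get()
    q = db.Query(MyEntityPeer, keys_only=True)
    for old_key in q.run(batch_size=batch_size):
        yield ndb.Key.from_old_key(old_key).get()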
kr...@agosto.com <kr...@agosto.com> #28
Thanks, Curt, that's what I ended up doing in my mapreduce.
[Deleted User] <[Deleted User]> #29
I'm seeing this in requests (wrapped with ndb.toplevel) where if ndb.Query.iter() is invoked *at all*, even with a keys-only query, the request causes a memory fault prior to fully completing. It seems even the creation of the generator is enough to cause the fault, even if the generator never actually produces any results (returning immediately after calling ndb.Query.iter() still produces the fault).
I'll try the db-key conversion to see if that helps.
te...@gmail.com <te...@gmail.com> #30
The `ndb` to `db` conversion seems to do the trick for keys-only queries. I'm also finding I have to manually track and flush async requests (even within the bounds of an @ndb.toplevel decorator) to avoid memory faults.
[Deleted User] <[Deleted User]> #31
Ran into this issue and created a support case for it. This is their response:
"At present, our Team is still working on a fix for this issue. Thought, we still don't have an ETA on resolution but we will update the external issue as soon a fix is available. I advice starring the issue to get any updates moving forward and also keep measuring the developers impact."
Really? It's been a year and 6 months!
"At present, our Team is still working on a fix for this issue. Thought, we still don't have an ETA on resolution but we will update the external issue as soon a fix is available. I advice starring the issue to get any updates moving forward and also keep measuring the developers impact."
Really? It's been a year and 6 months!
ba...@codalytics.com <ba...@codalytics.com> #32
Just wanted to share something that has seemed to help us a lot. We turned in-context caching off for all our NDB models, and turn it on (explicitly) for lighter (smaller) models that get called frequently like the User: https://cloud.google.com/appengine/docs/python/ndb/cache#policy_functions
Not sure what version of the SDK this was introduced in, but we just define _use_cache = False within the model class and we're seeing this issue far less frequently (1% of the time vs 20%+).
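A minimal sketch of the per-model setting (model names are illustrative):
from google.appengine.ext import ndb

class BigBlobEntry(ndb.Model):
    _use_cache = False  # skip the in-context cache for this heavy kind
    payload = ndb.TextProperty()

class User(ndb.Model):
    _use_cache = True  # small, frequently read kind: keep in-context caching
    email = ndb.StringProperty()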
kr...@agosto.com <kr...@agosto.com> #33
Eric, that's the exact same response I got when I opened a support ticket A YEAR AGO.
We still run into this on a regular basis. Our code is now full of kludges to work around this issue.
It appears that ilevy@google.com located the source of the problem last February but nothing has been done to resolve it since then.
kr...@agosto.com <kr...@agosto.com> #34
So a LinkedIn search shows that Isaac Levy left Google last August! Eric, is your support ticket still open? Can you ask them to assign this issue to somebody who actually still works for Google?
[Deleted User] <[Deleted User]> #35
Kris, I will! Can you share some of the work arounds/kludges that worked for you? I understand there is no silver bullet, but we're seeing this in mapreduce and we're down to only allowing one task to run at a time on a module instance. Not good.
I'll talk with support and see what the hell is going on.
kr...@agosto.com <kr...@agosto.com> #36
Eric, I created peer classes for my entities that extend db.Model . So if I have a MyEntity(ndb.Model) class defined in models.py, I create a MyEntity(db.Model) in models_db.py. You don't need to define any properties on the db model, because you're only going to use the key from it. Specify this class as the entity_kind in your mapreduce input_reader. An instance of the db entity will get passed to the mapper function. You can then take the key from that object, convert it to an ndb.Key and do a get() to get the ndb object.
Here's a barebones example:
output = yield mapreduce_pipeline.MapperPipeline(
    "My Mapper",
    "mapper_function",
    "mapreduce.input_readers.DatastoreInputReader",
    output_writer_spec="mapreduce.output_writers.FileOutputWriter",
    params={
        "input_reader": {
            "entity_kind": 'models_db.MyEntity',
            'filters': filters,
            'batch_size': 500,
        },
        "output_writer": {
            "filesystem": "gs",
            "gs_bucket_name": MAPREDUCE_BUCKET,
            "output_sharding": "none",
        },
    },
    shards=16
)

models.py:

class MyEntity(ndb.Model):
    created = ndb.DateTimeProperty(auto_now_add=True)
    name = ndb.StringProperty(required=True)
    # other properties

models_db.py:

class MyEntity(db.Model):
    pass

def mapper_function(db_entity):
    entity = ndb.Key.from_old_key(db_entity.key()).get()
    # do stuff
    # yield something
Hope this helps!
[Deleted User] <[Deleted User]> #37
[Comment deleted]
[Deleted User] <[Deleted User]> #38
Why use a DatastoreInputReader instead of a DatastoreKeyInputReader which would change the mapper to:
def mapper_function(db_key):
    entity = ndb.Key.from_old_key(db_key).get()
    # do stuff
    # yield something
Would this be more efficient, since Mapreduce is presumably calling get() on the key to produce the db_entity?
kr...@agosto.com <kr...@agosto.com> #39
Eric, you can do the same thing with MapreducePipeline. Passing the output of the mapper to the reducer is the job of the library. You don't have to do anything special. Google "appengine mapreduce python" for more examples.
David, you can do whatever fits your use case. This was just one example of how to have a mapreduce iterate over db entities and then convert them to ndb entities, in order to avoid the memory leak in ndb.
[Deleted User] <[Deleted User]> #40
Just an update from what I heard from google (Jose Montes de Oca):
"With regards the points shared around the external issue shared. I am working towards getting it updated accordingly. One thing to keep in mind, even within the scope of an App Engine Request, if you are using In-Context caching and you are querying and transversing an ever increasing amount of data, the Instance memory will grow unbounded proportionally to the data being fetch and will eventually hi8t the limits. Disabling In-Context cache for this uses cases are advice unless you can break your workload into smaller units that will not require that much in-memory data while running.
I hope this clarifies the behavior encountered. Furthermore, I'll work to get this better documented within NDB docs as this In-Context cache is True by default. Also, I've suggested making this an opt-in where Developers are aware of the Instance memory implications this may create."
"With regards the points shared around the external issue shared. I am working towards getting it updated accordingly. One thing to keep in mind, even within the scope of an App Engine Request, if you are using In-Context caching and you are querying and transversing an ever increasing amount of data, the Instance memory will grow unbounded proportionally to the data being fetch and will eventually hi8t the limits. Disabling In-Context cache for this uses cases are advice unless you can break your workload into smaller units that will not require that much in-memory data while running.
I hope this clarifies the behavior encountered. Furthermore, I'll work to get this better documented within NDB docs as this In-Context cache is True by default. Also, I've suggested making this an opt-in where Developers are aware of the Instance memory implications this may create."
[Deleted User] <[Deleted User]> #41
So I switched to using a 'mapreduce.input_readers.DatastoreKeyInputReader' and just added the "ndb.Key.from_old_key(key()).get()" and so far, so good. Memory seems pretty stable so far, but I need to test with a full set of production data.
jf...@google.com <jf...@google.com>
jk...@jkrebs.com <jk...@jkrebs.com> #42
Just commenting to say that clearing the In-Context cache has not resolved the problem. The memory leak still accumulates.
We use ndb.get_context().clear_cache() to ensure that only a certain amount of data is in context at any time. See the test program in the first issue comment for an example program.
If the context cache was to blame, I also don't see how using the ext.db library to query would help the situation the way that it clearly does (ndb.Key.get also context-caches the entity with that method).
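For reference, a minimal sketch of periodically clearing the in-context cache during a long iteration (the flush interval is arbitrary):
from google.appengine.ext import ndb

def process_query(query, flush_every=500):
    for i, entity in enumerate(query.iter()):
        # ... do the per-entity work here ...
        if i and i % flush_every == 0:
            ndb.get_context().clear_cache()  # drop entities cached so far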
[Deleted User] <[Deleted User]> #43
I've seen the issue doing keys-only NDB queries. You'd think in those cases there would be no opportunity for the in-context cache to get involved.
si...@gmail.com <si...@gmail.com> #44
I ran into these ndb memory issues as well and this bit of middleware helped stabilize things for me:
(put in appengine_config.py)
from google.appengine.ext import ndb

class DisableNdbCaching(object):
    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        ctx = ndb.get_context()
        ctx.set_cache_policy(False)
        response = self.application(environ, start_response)
        ctx.clear_cache()
        return response

def webapp_add_wsgi_middleware(app):
    app = ndb.toplevel(app)
    app = DisableNdbCaching(app)
    return app
For me the context cache is only there to help badly designed or lazy code - my app shouldn't ever be requesting the same data from the database, so I just want it disabled all the time. Even with it turned off, though, it seems some things can sneak in which need to be cleared. I also have a variation that checks and logs the memory use before and after each call and does a gc.collect() if the memory rises over a certain threshold, but I haven't needed that lately - it is surprising how much memory some requests consume, though.
Having max_concurrent_requests set too high in app.yaml caused a lot of memory problems too. 8-10 for an F1 is the limit for me, raising proportionally for the others (16 for F2, 32+ for F4). Strangely, 32 on F4 results in lower instance memory than 8-10 on F1, but I think the faster CPU results in fewer requests actually being concurrent.
The final fix is re-working the most heavily used pages in Go - not as difficult as I first imagined, and the performance and memory issues all went away.
ru...@gmail.com <ru...@gmail.com> #45
[Comment deleted]
ni...@gmail.com <ni...@gmail.com> #46
[Comment deleted]
ni...@cirrusmachina.com <ni...@cirrusmachina.com> #47
Just want to point out that there is a lot of talk here about how to avoid using the in-context cache, which at best is hiding the issues in 9610. A correctly-implemented cache evicts entries based on a least-recently-used algorithm when memory is tight, and if that's not happening it should be considered a BUG in the cache implementation or, worse, a bug in the gc implementation. 9610 seems to be about more than just what's going on with the cache anyway, as others have mentioned: it happens with keys-only queries (which do not use the cache), and it happens with query.iter(), which takes down the entire instance on the first item invocation in some cases.
si...@intesoft.net <si...@intesoft.net> #48
I don't believe the in-context cache is intended to be anything other than simply "per request" - no LRU or anything like that.
IMO the reason it helps is that it reduces how much memory each request may consume while it runs, depending on how the app works, which is why it appears to help some more than others.
The most significant thing is the combination of how much each request is consuming combined with how many concurrent requests are running. If you use a lot of memory then having the concurrent requests set too high will cause problems. Your options are to turn this dial down, turn the memory dial up (larger instance size) or make sure you use less memory.
There is no single solution - it's a combination of all of them. I was very surprised how much the memory use went up when I logged the before and after of each request - it gave me a better idea of what might be going on in production.
I don't use this but it helped identify some of the more problematic handlers and allowed me to search the logs for "high mem request".
import gc
import logging

from google.appengine.api import runtime

class LogResourceUsage(object):
    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        memory_usage = runtime.memory_usage()
        memory_usage_current = memory_usage.current()
        logging.debug('mem:{0}'.format(memory_usage_current))
        response = self.application(environ, start_response)
        memory_usage = runtime.memory_usage()
        memory_usage_final = memory_usage.current()
        memory_usage_request = memory_usage_final - memory_usage_current
        logging.debug('mem:{0} {1}'.format(memory_usage_final, memory_usage_request))
        if memory_usage_request > 2:  # TODO: limit for instance? or memory_usage_final > 126:
            logging.warn('high mem request')
            gc.collect()
        return response
[Deleted User] <[Deleted User]> #49
joachim@jkrebs.com: Clearing the cache won't help. You have to explicitly tell it not to cache in the first place.
The way I got this to work for me:
ctx = ndb.get_context()
ctx.set_cache_policy(lambda key: False)
Nick:
You are absolutely right. We are talking about work arounds, when we all know this is not expected behavior for a caching algorithm.
si...@gmail.com <si...@gmail.com> #50
As the context / instance cache is scoped to a request, I don't think there is much benefit to it being turned into an LRU cache and it seems unfair to criticise it for not being something it isn't intended to be.
Any LRU cache would need to be per-instance and entities shared between requests (suggesting read-only 'lookup' type objects). There is a good blog article here on this approach:
http://bjk5.com/post/2320616424/layer-caching-in-app-engine-with-memcache-and
Unless you already have memory issues well under control though this is likely to make things worse because those entities will be kept around for much longer than the single request. Trading memory for latency only makes sense when you have memory spare and possibly stale entities aren't an issue.
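As a very rough sketch of that instance-level layering idea, for read-only lookup entities only (names are illustrative, and serving stale data is a real risk):
from google.appengine.ext import ndb

_instance_cache = {}  # module-level dict, shared by all requests on this instance

def get_lookup_entity(key):
    entity = _instance_cache.get(key)
    if entity is None:
        entity = key.get()  # falls through to NDB (and memcache, if enabled)
        _instance_cache[key] = entity  # kept for the life of the instance
    return entity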
jf...@google.com <jf...@google.com> #51
Hello All,
First off, we want to extend an apology to the community for falling behind on updating this issue accordingly. Despite the appearance of staleness, we’ve been working on this topic and I will attempt to provide the latest developments.
As previously uncovered by Isaac, NDB manifests non-deterministic memory behavior when using the fetch_page() function [1], which can cause extra memory use. We haven’t fixed this yet, though we have a few solutions in the works. To properly keep track of this, I’ve created issue #11648 [2] which will help us communicate further developments.
Now, going back to the original topic of the issue: throughout our investigation, we have not been able to reproduce a memory leak either locally or in production. Our findings are below.
We last tested this in production by slightly simplifying the example provided (courtesy of joachim@). We removed the need for a task and instead defined a module with manual scaling, instance class B4, and just 1 instance. This allows the request to run for more than 60 seconds, provides a big memory buffer, and lets subsequent requests use the same instance over and over to determine if there’s a leak between requests. Also, we utilize App Engine’s Runtime API [3] to print out memory values once a request finishes. A short gist of the modified code we used can be found here: https://gist.github.com/josemontesdeoca/56309320f03dbb6c0a97
Results revealed that the ‘/task/pages’ handler instance memory usage (as measured by our Runtime API [3] which is providing OS Level memory usage) grows up to ~300 MB (using a batch size of 25) but does not grow without bound, as one would expect in the event of a memory leak.
Our conclusion is that the code provided (using fetch_page) requires more instance memory than what an F1 offers (128 MB). This explains why it reaches the memory limits on small instance class, but once a bigger instance is provided, we can truly see how much memory the relevant piece of code needs. This was also observed, tested and documented early on by the Khan Academy Team within comment #12 [4] alongside a quick description of how Python memory works.
We believe that the issues found in the fetch_page function, plus the way ‘TextProperty’ works in Python (Python encodes unicode characters as 32-bit ints, 4 bytes per char), exacerbate the memory usage for the initial code provided.
Furthermore, we also see a few interactions within the issue (not the original report) where developers are running into memory problems due to NDB’s in-context feature [5]. We acknowledge that having it on by default brings more confusion due to the way it works, especially on long running operations. We are discussing how to best set this moving forward and I’ve created issue #11647 [6] to properly track this and inform interested parties of the outcome.
Finally, if someone has a reproducible test case where a memory leak is visible, please send it over. We are eager to take a closer look, but without it we have no evidence that will allow us to keep working within this issue.
I look forward to responses from the community to provide closure on this long running issue.
Jose M, on behalf of the App Engine team
[1] https://cloud.google.com/appengine/docs/python/ndb/queryclass#Query_fetch_page
[2] https://code.google.com/p/googleappengine/issues/detail?id=11648
[3] https://cloud.google.com/appengine/docs/python/backends/runtimeapi
[4] https://code.google.com/p/googleappengine/issues/detail?id=9610#c12
[5] https://cloud.google.com/appengine/docs/python/ndb/cache#incontext
[6] https://code.google.com/p/googleappengine/issues/detail?id=11647
ml...@gmail.com <ml...@gmail.com> #52
So technically there may not be a memory leak, but for all practical purposes there is. Hopefully this explanation proves useful. (I left Google a week ago, so I am no longer able to bring this analysis directly to the devs and am relying on you to pass along the message, assuming you believe it will be helpful.)
The NDB cache is stored in the _State’s context, which means it is a thread-local cache, not shared between threads. While a dev-appserver has 8 threads (https://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/tools/devappserver2/python_runtime.py#63), my production introspection shows 600+ threads, about 80% WorkerThread, 20% DummyThread. So the problem of reaching steady-state memory consumption will potentially be much worse (50x?) than what csilvers/Khan Academy demonstrated in a dev-appserver.
Second, these caches are only cleared when the next request comes in (in tasklets.get_context). This means for the duration *in between* calls, the thread will hold on to the cache. Given that the threadpool balancer appears to be doing a great job of load-balancing between threads (my manual inspection shows each request gets its own thread-local variables), this means each thread will accumulate its own cache of NDB objects that sits in memory, unused, between requests.
I can fix this part by disabling NDB cache entirely. It’s not worth this headache.
However, things go deeper…because of Futures. Some Futures complete and get their job done, triggering completion callbacks to be stuck in the event queue. However, since I’m not using tasklets, and just writing regular NDB code, there is no guarantee that the event queue will be emptied when the request finishes. This leaves Futures stuck in the thread-local cache, in each thread, uselessly between requests. This one is much harder to fix…there is no way to disable this like with an NDB cache.
However, I can solve this by wrapping *any* code that uses NDB objects with @tasklets.toplevel. The documentation mentions this only in the context of being useful for tasklet code, but honestly, I believe this should be run for *ALL* code that uses NDB libraries. In fact, given the nature of what it does, I honestly think it should be done in a wrapper around the wsgi library in the appengine runtime, similar to what is done with the threadlocal variables in appengine.runtime’s management of request_environment.current_request.
I’ve had hourly/bi-hourly outages for the past week on my site as I relied on NDB more and made my objects larger, not realizing the consequences of this. I had to ssh into my managed VMs to have any context for wtf was going on, and spot kswapd doing its damage. And then spend a lot of time poring over gc object output (objgraph.py proved helpful, but working with the gc in a multithreaded environment proved tricky to get useful data).
te...@gmail.com <te...@gmail.com> #53
Is @tasklets.toplevel distinct from @ndb.toplevel? Because I have all my wsgi entry points wrapped with @ndb.toplevel and I still see the memory issue.
ml...@gmail.com <ml...@gmail.com> #54
Sorry, I should have said @ndb.toplevel, which is the external API. Internally it is tasklets.py's toplevel decorator, but I've been staring at the library source code for too long and called it by the internal location...
When you say you still see "the memory issue", which issue are you referring to? See Jose M's note above for context. He says fetch_page() has bugs. And he acknowledges [5] that a single request can still blow up due to the in-context cache (and trigger the soft and/or hard memory limits that cause appengine to kill your task). My issue described above is distinct from both of those.
My issue appears more like a slow memory leak over time, and refers to memory retention of NDB objects across requests within each thread in a Managed VM environment. It *may* also apply to the shared environment of traditional appengine, but I can't really introspect what that runtime does very well, so can't make any strong statements about whether it applies there.
ml...@gmail.com <ml...@gmail.com> #55
Thanks to a suggestion from Patrick, I realized I can fix my issues mentioned above by using ndb.toplevel itself as a wsgi middleware, which effectively adds ndb.toplevel to every endpoint retrieved by that wsgi app. In appengine_config.py:
def webapp_add_wsgi_middleware(app):
    # Note: I haven't verified, but this may flatten chunked-responses into a single response, if you happen to use them in your app.
    return ndb.toplevel(app)
This gets 99% of the per-thread state cleaned up and finished properly (including potentially NDB put() operations). If you’re curious in what the last 1% of memory is, it seems to be more thread-locals that stick around in-between requests and grow with O(num-threads). In particular:
1) It seems ndb.toplevel cleans up slightly incorrectly. It does:
set_context(None)
ctx.flush().check_success()
Unfortunately, check_success() itself sets a context again. I see check_success()->wait()->get_event_loop().run1()->(in my app at least...)->_on_future_completion()->_help_tasklet_along(). And _help_tasklet_along() calls get_context() to save the current context away to restore later... but in the process of fetching a get_context() on a None context, it actually creates a new context which will last beyond the end of the request. I believe reversing the order of these two calls would clean out an extra Context, which saves the associated datastore_rpc.Connection, datastore_rpc.Configuration, ModelAdapter, 7 AutoBatchers, and perhaps more. That Context will be thrown away on the subsequent request anyway, so really there's no use in creating it. (A sketch of this reordering follows point 2 below.)
2) ndb.event_loop uses thread-locals to store an event_loop.EventLoop and its two empty deques. And api.datastore uses thread-locals to store a datastore_rpc.Connection and associated datastore_rpc.Configuration. But I don’t think these are worth attempting to fix.
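A hedged sketch of the reordering proposed in point 1 above (this is not the actual ndb source; the tasklets helpers are used as I understand them, so treat it as pseudocode if your SDK differs):
from google.appengine.ext.ndb import tasklets

def finish_toplevel(ctx):
    # Sketch only: flush while ctx is still the current context, so that
    # check_success() does not lazily create a brand-new Context that then
    # outlives the request; only afterwards clear the thread-local context.
    try:
        ctx.flush().check_success()
    finally:
        tasklets.set_context(None)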
te...@gmail.com <te...@gmail.com> #56
I am seeing the issue with a simple call to fetch/iter causing a memory failure (with @ndb.toplevel applied), so apparently that's a distinct issue. The only workaround I've found thus far is to do a "db" query in place of an "ndb" query. Unfortunately that's a bit tricky to do without parallel model definitions.
ml...@gmail.com <ml...@gmail.com> #57
So after posting the above, I discovered a problem with ndb.toplevel when used with the mapreduce/ library, whereby the toplevel wrapper tries to finish the unbounded query that the mapreduce/ library started, which causes problems with loading too much data and taking forever... but I have not yet figured out a fix to this one, so I am running a hacked/not-really-safe version of ndb.toplevel.
I'm not sure if the above is at all related to your (technomage.talk)'s problem or not, but I am curious:
- Does disabling the ndb cache for that model help for you? (I'm wondering if caching the entire query's results within your process/thread is causing you problems)
- Can you give a few-line example of the fetch/iter code you are using that has problems?
- How many results are expected to be returned by your query, if it runs to completion? (i'm wondering if it's a large query that you interrupt, if ndb.toplevel finishing your query could be causing you problems)
And just a response to #30 relating to iter(), if you are using @ndb.toplevel with iter(), then I believe even if you just create an iterator, the ndb.toplevel code will run the eventloop and try to finish all outstanding requests, including your iterator. That might explain part of the memory behavior mystery.
te...@gmail.com <te...@gmail.com> #58
I've got a job/task management system we use (similar to mapreduce) which can take a python function and a query and run that function on each item produced by the query.
When I adapted the system to handle NDB queries, I immediately ran into memory issues on a simple keys-only query. The handler deserializes the query, and the call to q.iter() usually causes a memory blowout. Usually the query is iterated until the request runs out of time, but the failure happens before any iteration results are processed.
Here's the relevant code, obtaining a generator with which to produce work. I don't recall if I tried providing a non-None limit, but that would kind of defeat the purpose of walking the whole query:
if isinstance(iterator, ndb.Query):
    return iterator.iter(start_cursor=ndb.Cursor(urlsafe=cursor) if cursor else None,
                         limit=limit, offset=offset,
                         batch_size=batch_size, produce_cursors=True)
Disabling the ndb cache for that model has no effect (or at least doesn't prevent the memory faults).
ml...@gmail.com <ml...@gmail.com> #59
Is your code running through the whole query, or is it "break"ing out of the iterator loop before it reaches the end?
Regardless, and unrelated to the issue my first few posts were about, I have now found two problems with NDB and iteration over big queries. I think your problem (and many others in this thread) is similar to point 2 below:
1) Normal DB code (or NDB code without @ndb.toplevel) may construct a large query, and then abandon it (break out of the loop, etc). However, the use of @ndb.toplevel means the decorator will try to finish all "pending" operations, including the abandoned query.
This means in the following code, my mapreduce does 30 seconds of legit work, but then when finishing, toplevel spends the remainder of the request timeout (until 60 seconds) just uselessly iterating/loading the query results, until it runs out of time.
# mapreduce handler:
for entity in ndb_query_iterator():
    process_entity()
    if time_elapsed > 30:
        break
My fix was to use a hacked-toplevel function that didn't try to finish the eventloop at the end of a request. I was unable to figure out any way to cancel the in-progress query/iterator and remove it from the eventloop. The proposal above for using weakrefs sounds appealing, but I don't know the consequences of that.
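For anyone curious, a rough sketch of what such a "toplevel that doesn't drain the eventloop" could look like is below. This is not my exact code; it assumes the ndb.tasklets make_default_context()/set_context() helpers that the stock decorator is built on, and unlike the real @ndb.toplevel it neither wraps the handler in a synctasklet nor waits for pending RPCs, so the usual toplevel correctness guarantees no longer apply.
from google.appengine.ext.ndb import tasklets

def toplevel_no_drain(func):
    # Sketch only: give the handler a fresh NDB context, but do NOT run the
    # eventloop to completion when the handler returns.
    def wrapper(*args, **kwds):
        ctx = tasklets.make_default_context()
        try:
            tasklets.set_context(ctx)
            return func(*args, **kwds)
        finally:
            tasklets.set_context(None)  # drop the context; pending work is simply abandoned
    return wrapper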
2) There is some non-determinism in the order in which the eventloop chooses to run stuff. I believe this was documented at the beginning of the thread, in relation to the fetch_page issue...though I believe it is much more broad than fetch_page, and applies to unbounded iterators too. This means when you have a long-running query (in which more data can always be found), and another future (to perform a dependent lookup), it may choose to run the iterator to completion, as opposed to the dependent lookup.
For example:
for entity in ndb_query_iterator():
    some_object = make_ndb_key(entity.property).get()
    some_object.change_properties()
    some_object.put()
In this code, the get() and put() create RPCs to be evaluated in the same eventloop as the ndb_query_iterator()'s RPCs. But when the eventloop has multiple RPCs that are completed and can have their completion callback called, it chooses one effectively at random in apiproxy_stub_map's wait_any(). It grabs one of the self.rpcs to wait for (from dict.keys()[0]), waits for it, then removes it from self.rpcs and returns it to be processed. If that RPC belongs to an iterator, then when it is processed I believe it will add the next batch onto self.rpcs, and the next time through the loop it may well wait on the same iterator's RPC again. :( (A toy illustration of the dict-ordering behavior is at the end of this post.)
Normally, the logic works fine, since the eventqueue is trying to run everything to completion. I believe this execution model breaks in the case of an infinite (or effectively-infinite) query stream, as the eventloop may get stuck processing more and more of the iterator before getting around to processing the get() or put() RPC calls.
I don't have any hacky solutions to this one yet. I guess "use DB to query to avoid the eventloop, then load entities manually with NDB get/get_multi" is one such approach to solve/avoid this issue. Two ideas for the NDB devs that might help here...though with NDB part of the built-ins, I'm not sure how I'd be able to try these without extensive hackery of my own:
a) extend apiproxy_stub_map to have a get_any_completed() (in addition to the wait_any() that actually waits). Then, in the eventloop's run0(), it could call get_any_completed() first (to empty the queue of already-finished RPCs) before calling wait_any() to wait for more RPCs.
b) use self.rpcs as a lookup dict, but also add a self.rpcs_deque. Push new RPCs onto the end, and when we go to wait, wait for the one at the front of the deque, to ensure that we don't keep processing the most-recently-added RPC (which might just be the iterator looping on itself).
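As a toy illustration of the "chooses one at random" behavior mentioned above (plain Python 2, no NDB involved): dict iteration order depends on hashing, not insertion order, so "the first key" is effectively arbitrary.
rpcs = {}
for name in ['query_batch_rpc', 'get_rpc', 'put_rpc']:   # made-up RPC names
    rpcs[name] = object()                                 # stand-in for a UserRPC
print rpcs.keys()[0]   # not necessarily 'query_batch_rpc'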
te...@gmail.com <te...@gmail.com> #60
My code is not iterating *at all* (unless the call to q.iter() does something internally). The call to ndb.Query.iter() is sufficient to blow out memory. I've tried returning immediately after the call to .iter() with the same results.
ml...@gmail.com <ml...@gmail.com> #61
So this is what I hacked up tonight, which seems to help with my runaway mapreduces.
In my case, the unbounded iterator over the datastore was fetching batch after batch of items from the database (most of which went unused), and starving the other RPCs from running that actually were trying to process those items. The inside of the loop (which did its own NDB put() operations) would run sporadically (5-10 seconds at a time), then often getting stuck for much longer, waiting while the eventloop ran the iterator RPCs repeatedly.
My hacky attempt at a fix uses a FIFO (first-in, first-out) queue for the pending RPCs, so that the oldest-added RPC always gets run first. I believe this should help prevent the starvation where the iterator RPC accidentally gets run with top priority and never relinquishes control. It's a monkeypatching hack due to the private methods, but if you're feeling adventurous, feel free to give it a shot. I run my code from appengine_config.py to ensure it gets run before the eventloop is created.
https://gist.github.com/mikelambert/6615b37fa326c1008a4b
If anyone tries it, I'd be curious to hear how it works. I've only been running it about an hour, but it seems to work for me...so far.
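If you do try it, the appengine_config.py wiring is roughly the sketch below. fifo_eventloop and install() are placeholder names for however you vendor the gist; they are not the actual names it uses.
# appengine_config.py -- loaded before any handler code, so the monkeypatch is
# applied before the first NDB eventloop is created.
import fifo_eventloop        # placeholder name for the vendored gist module

fifo_eventloop.install()     # placeholder entry point that applies the patch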
[Deleted User] <[Deleted User]> #62
Running on an F1, the following results in a memory fault more often than not:
results = []
start = time.time()
# jobs.Job is derived from ndb.Model, nothing special
q = jobs.Job.query().order(-jobs.Job.created)
for job in q.iter():
    # look for jobs of interest
    results.append(job)
    if len(results) >= page_size:
        break
    if time.time() - start > 30:
        break
This happens whether caching is enabled or not. My only solution thus far has been to convert the query to a DB query and convert any filters or options on the fly.
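For reference, the ext.db fallback looks roughly like the sketch below. JobDb is a hypothetical db.Model mirror of jobs.Job, and page_size is whatever the request asked for; this is illustrative, not our production code.
import time
from google.appengine.ext import db

class JobDb(db.Model):                        # hypothetical db mirror of jobs.Job
    created = db.DateTimeProperty(auto_now_add=True)

def fetch_jobs(page_size):
    results = []
    start = time.time()
    q = db.Query(JobDb).order('-created')
    for job in q.run(batch_size=page_size):   # ext.db iteration, no NDB eventloop involved
        results.append(job)
        if len(results) >= page_size or time.time() - start > 30:
            break
    return results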
jf...@google.com <jf...@google.com>
sr...@gmail.com <sr...@gmail.com> #63
@mlamb
I tried your FIFO event loop fix, but it doesn't seem to solve the problem for me.
I am running a mapreduce operation which iterates over a particular model; one of the attributes of the model is the Key to another model. So I do an ndb get and, based on the attributes of the result, I do a GCM messaging push to my customers.
See the screenshot for the burst of "Next" operations, each with the default batch size of 50.
gr...@gmail.com <gr...@gmail.com> #65
[Comment deleted]
se...@googlemail.com <se...@googlemail.com> #66
I can't provide a lot of deep insight into the issue, but I'd like to point out that I had a similar issue with a long-running worker. In our code base, we use a cursor and fetch_page to query for 100 objects at a time, then iterate over those 100 and perform some actions.
The task was exiting due to hitting the max memory limit and was terminated before it was done. Nothing from this thread really helped. I stumbled upon a StackOverflow answer mentioning that disabling appstats did the trick for somebody, so I tried it and succeeded.
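For anyone wondering what "disabling appstats" means concretely: appstats is normally hooked in from appengine_config.py, so removing that hook is the whole change. A minimal sketch, assuming the standard recording middleware setup:
# appengine_config.py
# Appstats enabled (the usual setup being removed here):
# def webapp_add_wsgi_middleware(app):
#     from google.appengine.ext.appstats import recording
#     return recording.appstats_wsgi_middleware(app)

# Appstats disabled: simply stop wrapping the app.
def webapp_add_wsgi_middleware(app):
    return app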
ml...@gmail.com <ml...@gmail.com> #67
The fixes I posted above helped me way back when, but were never sufficient to stop all the OOM-ing. And a recent change I made last week (with more NDB put calls) made things much worse for me.
Thanks to that timely suggestion from Sebastian...disabling appstats seems to have helped with my issues as well. Hard to say for certain as I haven't let the servers run for more than 24 hours yet (I'm actively pushing new code daily), but they appear to be much healthier. And I'm reaaaaally hoping this helps enough that I don't need to dive back into debugging memory leaks in the NDB library again. :)
js...@google.com <js...@google.com> #68
Hello all,
I have created https://code.google.com/p/googleappengine/issues/detail?id=12721 specifically to track the AppStats angle. I'm holding this issue open because it's become a catch-all for memory-related issues, but the fact that disabling AppStats can significantly change memory consumption is interesting.
If you have experienced this please share what background you can.
br...@google.com <br...@google.com> #69
Hey all just wanted to let you know I've been actively working on this issue. I do have a patch that works for me, so hang tight.
br...@google.com <br...@google.com>
ba...@gmail.com <ba...@gmail.com> #70
Hi Bryan,
Hanging tight for over 2 months :) anything we can look forward to?
kr...@agosto.com <kr...@agosto.com> #71
We've been hanging tight for well over 2 *YEARS*
br...@google.com <br...@google.com> #72
Hello everyone,
Thank you for your patience. I have some good news to announce: this long-standing issue is now fixed! I've attached an app to demonstrate that it works.
In the app, hit /setup, /fetch, and then /count. Before, /count would increase linearly with the number of /fetch hits but now, there are no residual models.
Note: the fix is only valid in production and not in local development. However, local development is not strongly affected by this issue to begin with due to a limited number of threads.
Feel free to let me know if there are any issues surrounding this. If not, I'll close this out in a week.
al...@gmail.com <al...@gmail.com> #73
Is there a specific appengine release tied to this fix? Or do you have a date when it's been released globally?
I would like to check the impact of the fix on our production apps, but need to know when/if it's been applied.
ml...@gmail.com <ml...@gmail.com> #74
Bryan, what version of gcloud or google appengine do we need to be using to see your improvements?
I'm assuming the fixes you are referring to are those described in https://github.com/GoogleCloudPlatform/datastore-ndb-python/commit/45dbb9e0b4a033024709452ae0e558a2838d42da ?
This looks to be an improvement in per-thread memory usage, which fixes the production memory blowup caused by the extensive number of threads used. In fact, this fix appears to be similar to the fixes I submitted above, which tried to address the exact same issue (see #57 above).
However, I suspect you're not entirely out of the woods yet (and I'd give it more than a week before claiming this is all "fixed"). There are other memory/execution issues in this thread that are unrelated to the "per-thread memory usage", which I don't believe your change addresses (if I'm understanding it correctly).
In particular, @ndb.toplevel seems to cause bad problems when mixed with loops that "break out". And there are other issues where it results in runaway memory usage (and infinite loops) due to the non-deterministic order that RPCs are run in. There is some discussion and potential fixes in #58-#63, but none of them are a great fix, and I was hoping a proper NDB owner would be able to look into them and address them better.
(As it is now, I am continually surprised that Google continues to push and recommend NDB library for clients, when it is much less supported and has fundamental issues that make it worse than DB. I understand you may want to push the world over to NDB, but I think it's a very bad idea to recommend NDB as production-worthy to newbie users until these issues are addressed.)
Thanks for your time and help here, it is certainly appreciated.
br...@google.com <br...@google.com> #75
@alexis
The code is already released. Any app that has runtime: python-compat or runtime: python27 + vm: true will see the code fixes. This particular issue did not affect GAE Standard due to threads being terminated at the end of a request, but it did affect GAE Flexible.
@mlambert
Thanks for bringing up other potential issues. I don't think I fully follow what you're describing. Do you have a simple app that demonstrates the issue you're describing?
[Deleted User] <[Deleted User]> #76
As a follow-up: this ticket was created before Managed VM/Flexible runtimes existed, and it very clearly affects traditional GAE. For us it shows up simply by iterating over query results that have more than 100 records; the larger the entities, the more we leak. We have written simple loops to "migrate" entities to a new properties/schema, where we iterate over entities, modify them, call .put(), and then move on. After a few hundred entities we usually crash.
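(A rough sketch of the kind of migration loop I mean; MyKind, its properties, and the transform are placeholders, not our actual code.)
from google.appengine.ext import ndb

class MyKind(ndb.Model):                      # placeholder model
    old_property = ndb.StringProperty()
    new_property = ndb.StringProperty()

def migrate():
    # Placeholder migration loop illustrating the pattern described above.
    for entity in MyKind.query().iter(batch_size=100):
        entity.new_property = (entity.old_property or '').upper()  # placeholder transform
        entity.put()   # memory climbs; the instance crashes after a few hundred entities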
I'm glad to hear that the Managed VMs have been improved in this regard. We rewrote all datastore access from Managed VMs to use db because we want them to be able to run for days without dying (whereas we let our API-serving GAE instances die several times per day).
[Deleted User] <[Deleted User]> #77
@chris: Are you using custom runtime for the managed VM?
[Deleted User] <[Deleted User]> #78
we are using python-compat runtime.
[Deleted User] <[Deleted User]> #79
@Chris: Early this month, the doc for the python-compat runtime was updated, and it states that the DB API is not supported: https://cloud.google.com/appengine/docs/flexible/python/migrating-an-existing-app .
With this configuration, your application uses a Python 2.7 runtime that has a certain number of App Engine standard environment's APIs included in the image:
App Identity
Datastore (ndb)
Logging
Memcache
Search
Task Queue
URL Fetch
Users
Notably, the following APIs are not supported:
Blobstore
Datastore (db)
Channel
Logservice
OAuth
Does DB API from python-compat currently work for you?
th...@google.com <th...@google.com> #80
We want to avoid having any sort of catch-all issue for ndb issues (memory or not). As such, we will mark this one as fixed and will track the status of remaining memory issues in separate bugs (listed below). If you have discovered a novel bug please submit a new issue, and if possible, provide steps to reproduce.
Outstanding ndb bugs:
* 12721 - AppStats increases memory usage.
* 11648 - fetch_page memory usage.
* 11647 - in-context cache growth with long-running operations.
As for vision/documentation issues, you're welcome to submit those as new bugs on this issue tracker as well, though it may be better to send those as feedback on individual documentation pages.
ml...@gmail.com <ml...@gmail.com> #81
Bryan: Jose from Google already did a great job of summarizing the multiple issues here in post #53, but I'll try to summarize again in light of where I believe things stand now. (I see now Jon jumped in too, but I hope this provides some historical catch-up clarity to you and anyone else on this catch-all thread.)
1) There is apparently no true "memory leak", as csilvers proved way up above in #10-13. I believe this was the issue of per-thread memory wastage, which manifested a bit in his dev-appserver environment but became much worse on Flexible/Managed VMs (which he didn't test on back in early 2014). This one is not noticed so much on Standard VMs for reasons you are already aware of, as Standard tends to kill and restart workers as necessary. It was solved for some people with @ndb.toplevel (which caused some state to be cleaned up at the end of a request), though that introduced other problems (due to non-deterministic eventloops). I had my own fix/workaround for it, and diagnosed/fixed this in #54-#57 above. As far as I understand, your fix basically fixes this issue in a better way, and does not require @ndb.toplevel.
2) There is the NDB cache storing objects. This is kept in a thread-local cache, which for large iterations will result in OOMs. This was probably the issue in the original post #1. (Curious, have you run it? Or just your modified example which iterates over far fewer objects?) As the cache is on by default, folks using NDB out of the box will likely encounter OOMs in any naive iteration. This is seen as "not a bug" according to Jose in #53, though it is being tracked in https://code.google.com/p/googleappengine/issues/detail?id=11647 . Unfortunately, there is no mention of this problem in the documentation for new users who are likely to run into it.
3) There is a related problem (a mixture of 1 and 2 above): the NDB cache is stored in the thread-local State (as part of State.current_context._cache). This means the NDB cache is held onto *after* the request finishes. Some people were able to fix this with @ndb.toplevel (which unsets the current_context), others by disabling the NDB cache entirely, though your patch appears to fix this issue without requiring either workaround.
4) The NDB eventloop is non-deterministic (it uses Python hashing order). This was branched out into https://code.google.com/p/googleappengine/issues/detail?id=11648 based on some initial investigation by Isaac from Google in #8. Unfortunately a lot of conversation continued in this thread in #61-#63 (sorry!), since it was very difficult to disentangle the various issues from each other due to their overlapping and similar effects: OOM with NDB iterators! I believe this one manifests as:
a) memory blowups with iterators within a single request (using way more memory than was actually fetched)
b) infinite loops (looping over a fetch, and never getting a chance to run the code inside the loop)
The problem here is that the NDB eventloop optimistically fetches more results, prioritizing that over returning results to the iterator (sometimes), or over processing the dependent NDB gets/puts being performed in the loop itself. This issue is what I refer to in part 2 of post #61. My FIFO fix in #63 was an attempt to help with this issue.
5) If you used @ndb.toplevel (either due to NDB's "async" documentation, or in an attempt to fix (1) above), and you wrote an iterator loop that "break"s out of the loop (such as Google's appengine-mapreduce library), your Flexible/Managed VMs would eventually hang (no idea about Standard GAE). But they hang in a way that was invisible to VM healthchecking, so your servers would return timeouts to all user requests, while returning healthy to all healthchecks. These user errors would persist until you manually logged in and restarted the servers. Yes, this sucked, and was very difficult to diagnose. :(
This was due to the fetch always having more data, and the NDB eventloop optimistically fetching more results despite the calling iterator already break-ing out and ignoring the rest. For large tables, this could take forever (in the @ndb.toplevel at the end of a request) and effectively hang that thread (and be very expensive on DB operations!) This issue is what I refer to in part 1 of post #61. My FIFO fix in #63 was an attempt to help with this issue, but it's not complete, and I'm sure I'm still being overcharged. :(
This one doesn't have an associated issue of its own, so I have created one as https://code.google.com/p/googleappengine/issues/detail?id=12993 .
If you (Bryan) or anyone else on GAE are interested in really tackling these issues, please feel free to hit me up directly and I'm happy to walk you through stuff. I am frustrated that NDB is pushed so strongly by Google (to the point that they are removing support for DB as described in #83), but still has deep remaining issues with it (a legacy of the original author who left without productionizing it). As the old Google adage goes, the old thing is deprecated, but the new thing doesn't work yet. I was a google engineer for 10 years before leaving to do my own stuff, so also found it ironic (and frustrating) that I was effectively still working for Google (debugging/fixing these issues), except I was no longer being paid for my time. :P
th...@google.com <th...@google.com> #82
Thanks for the detailed response.
Bryan will follow up on individual points.
A few things I wanted to note:
For point (2), I can work on adding a notice to the documentation about ndb caching leading to OOMs with long-running queries.
To address the issue of deprecation: to be clear, db is still supported on App Engine Standard. It may be superseded by ndb, but it is *not* deprecated. We do not support it (and various other App Engine APIs) in the python-compat runtime for Flexible. Flexible users have the additional option of using gcloud-python's Datastore API client.
br...@google.com <br...@google.com> #83
Great post. Some brief replies before I close out this thread.
1) Yup. No true memory leaks, just a lot of memory wasting going on.
2) This unfortunately works as intended, though an LRU cache could be helpful.
3) This is now fixed.
4) This is a real bug that needs its own tracking issue. Thank you for creating one.
5) This is actually fixed as well because we cleanup outstanding RPCs and generators at the end of a request now.
ml...@gmail.com <ml...@gmail.com> #84
Thanks for the responses (and clarifying the push from GAE DB to gcloud-python DB). Bryan, regarding (5) and https://code.google.com/p/googleappengine/issues/detail?id=12993 , I believe it remains unfixed.
From my understanding, CallbackMiddleware (including your cleanup handler for RPCs and eventloop tasklets) is always run at the end of the MetaApp execution, and therefore run after the individual @ndb.toplevel-wrapped app/request handlers. The bug in (5) occurs because @ndb.toplevel itself tries to finish unbounded query iterators, and this would occur before your cleanup handler is able to run. (And in fact, this is a good thing! If your cleanup handler ran before @ndb.toplevel, it would completely break the correctness semantics promised by @ndb.toplevel in the NDB documentation.)
Description
The main issue happens during querying and makes it impossible to process any significant number of entities in a request. The same symptoms are not exhibited with the legacy ext.db library.
I've included a self-contained test application below:
- Setup handler to insert some entities
- Basic task handler that iterates through a query to total the length of data in the entities
- Page task handler that tries to accomplish the same using cursors
There is a constant named BATCH_SIZE which can be modified to test various scenarios. I am regularly clearing the NDB context cache.
There is also a constant named ENABLE_GC which aggressively calls gc.collect(). However, this causes the error "The datastore operation timed out, or the data was temporarily unavailable." with 100% reproducibility. This is probably a separate issue with NDB or the datastore.
Here are the handlers in the application:
SetupHandler (/setup):
- Handler is able to insert 10,000 entities with ~25k of data each, no problems with memory or deadline.
BasicTaskHandler (/task/basic):
- BATCH_SIZE = 25: success, but instance memory usage grows to 100MB and remains after the request.
- BATCH_SIZE = 50: success, but memory usage grows to 143MB and the instance is killed after the request due to high memory.
- BATCH_SIZE = 100: failure due to high memory usage.
PageTaskHandler (/task/page):
- Failure due to high memory usage on *all* batch sizes.
- Handler appears to be significantly slower and less efficient, even on the local SDK
Please note the results above are all 100% reproducible with both Python 2.7 and 2.5 and all SDK versions. The SDK development server does not enforce memory restrictions so I have not done memory profiling to check if the issue occurs locally.
There is also a group discussion of the issue here with peer review and confirmation:
from google.appengine.ext import ndb
from google.appengine.api import taskqueue
import webapp2
import logging
import gc

BATCH_SIZE = 100
ENABLE_GC = False


class Entity(ndb.Model):
    text = ndb.TextProperty()


class SetupHandler(webapp2.RequestHandler):
    def get(self):
        for n in xrange(100):
            entities = [Entity(text='a' * 25000) for n in xrange(100)]
            ndb.put_multi(entities)
            ndb.get_context().clear_cache()


class BasicTaskHandler(webapp2.RequestHandler):
    def get(self):
        self.queue()

    def post(self):
        if self.request.headers.get('X-AppEngine-TaskRetryCount', '0') != '0':
            logging.warning("Preventing retry")
            return
        total = 0
        entity_query = Entity.query()
        for entity in entity_query.iter(batch_size=BATCH_SIZE):
            total += len(entity.text)
            ndb.get_context().clear_cache()
            if ENABLE_GC:
                gc.collect()

    @classmethod
    def queue(cls):
        task = taskqueue.Task(url='/task/basic')
        taskqueue.Queue().add(task)


class PageTaskHandler(webapp2.RequestHandler):
    def get(self):
        self.queue()

    def post(self):
        if self.request.headers.get('X-AppEngine-TaskRetryCount', '0') != '0':
            logging.warning("Preventing retry")
            return
        total = 0
        entity_query = Entity.query()
        cursor = ndb.Cursor()
        while True:
            entities, cursor, more = entity_query.fetch_page(BATCH_SIZE, start_cursor=cursor)
            for entity in entities:
                total += len(entity.text)
            if not more:
                break
            ndb.get_context().clear_cache()
            if ENABLE_GC:
                gc.collect()

    @classmethod
    def queue(cls):
        task = taskqueue.Task(url='/task/page')
        taskqueue.Queue().add(task)


app = webapp2.WSGIApplication([
    ('/setup', SetupHandler),
    ('/task/basic', BasicTaskHandler),
    ('/task/page', PageTaskHandler)
], debug=True)