| Issue 2: | Performance and Quota tuning | |
| 1 person starred this issue and may be notified of changes. | Back to list |
Skimming the code, it appears that there are many facets that can be tuned. A few obvious ones are... - Use AsyncMemcacheService to perform the I/O in parallel with the request. This would reduce user-facing latency. - Cache Datastore operations (expirable in-memory, memcache) as CounterNames rarely change and metrics data can be lossy. Currently ProdEagle consumes 5-10% of the free quota. - Reduce GC overhead by creating less garbage, thereby reducing user-facing latency. This is non-invasive, such as using Collections' singletonMap and emptyMap instead of HashMap or checking log level enablement prior to logging. While minor, its easy to see that ProdEagle is causing more frequent heap churn when comparing the application before/after integration.
Dec 30, 2011
Hi Ed, I knew it was quick-and-dirty, no worries about that. Its really great to have the visibility into production, so thanks! I'm using this for a hobby project to learn Scala + GAE, so its all gravy. For a pre-launch app, prodeagle consumes 7% of Datastore's free quota. This is on an idle day with only prodeagle operations (1.19K in 20hrs). Appstats information is below. If I get a chance, I'll go through it again and send you a patch if I see anything worthy. -Ben POST /prodeagle/ 722 90 (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) ... memcache.Get 362 datastore_v3.Get 90 datastore_v3.RunQuery 90 memcache.BatchIncrement 90 memcache.Delete 90 (1) 2011-12-31 04:02:01.272 "POST /prodeagle/?last_time=1325304060" 200 real=98ms api=0ms overhead=0ms (8 RPCs) memcache.Get 4 datastore_v3.Get 1 datastore_v3.RunQuery 1 memcache.Delete 1 memcache.BatchIncrement 1 (2) 2011-12-31 04:01:00.625 "POST /prodeagle/?last_time=1325303940" 200 real=97ms api=0ms overhead=0ms (9 RPCs) memcache.Get 5 datastore_v3.Get 1 datastore_v3.RunQuery 1 memcache.Delete 1 memcache.BatchIncrement 1 (3) 2011-12-31 03:59:59.941 "POST /prodeagle/?last_time=1325303880" 200 real=92ms api=0ms overhead=0ms (8 RPCs) memcache.Get 4 datastore_v3.Get 1 datastore_v3.RunQuery 1 memcache.Delete 1 memcache.BatchIncrement 1
Dec 31, 2011
On every request, HarvestServlet's checks authentication with #isProdEagle(). This looks for the auth key in datastore in #getKeySecret. I don't see why this can't be cached locally and in memcache. MemCacheManager#getMultipleCounters() may throw an exception by applying an invalid cast of a Map<K, V> to a Future<Map<K, V>>. This could instead return a custom Future (static singleton instance).
Dec 31, 2011
Attached is a diff of the changes described above.
Jan 1, 2012
Good point about the auth key. I forgot about that bit! I've applied the diff and made a slight change to ensure that the auth key does always end up in Memcache (in case it expires out for whatever reason). Let me know if that improves performance for you.
Status:
Fixed
Jan 7, 2012
I deployed v1.1.1 tonight (from v1.0) and the usage has improved by replacing the datastore.GET with a memcache.GET. I don't know where there is still a datastore.RunQuery being performed, though. 2012-01-08 07:32:25.050 "POST /prodeagle/?last_time=1326007860" 200 real=60ms api=0ms overhead=0ms (8 RPCs) memcache.Get 5 memcache.Delete 1 datastore_v3.RunQuery 1 memcache.BatchIncrement 1 I am also receiving alerts from ProdEagle which indicate data loss. From the logs GAE reports, 2012-01-07 23:36:27.727 /prodeagle/?last_time=1326008100 200 78ms 0kb AppEngine-Google; (+https://code.google.com/appengine appid: s~prod-eagle) 0.1.0.40 - - [07/Jan/2012:23:36:27 -0800] "POST /prodeagle/?last_time=1326008100 HTTP/1.1" 200 328 - "AppEngine-Google; (+https://code.google.com/appengine appid: s~prod-eagle)" "drdncws.appspot.com" ms=78 cpu_ms=80 api_cpu_ms=22 cpm_usd=0.002298 instance=00c61b117c0a834c0af3747657abfa7b6894f6 W 2012-01-07 23:36:27.720 com.prodeagle.java.servlets.HarvestHandler wasDataLostSinceLastHarvest: ProdEagle counters lost before 1326008040000 ProdEagle warns via email... Accuracy of your data for the past hour too low for 5 minutes! Counts for 'Accuracy' over the last 10 minutes: 0!, 0!, 99.9916666667, 99.985, 0!, 0!, 0!, 0!, 0!, 0! (min allowed: 95).
Jan 8, 2012
com.google.appengine.tools.appstats.Recorder:290 makeAsyncCall() com.googlecode.objectify.cache.TriggerFutureHook:144 makeAsyncCall() com.google.apphosting.api.ApiProxy:184 makeAsyncCall() com.google.appengine.api.datastore.DatastoreApiHelper:59 makeAsyncCall() com.google.appengine.api.datastore.PreparedQueryImpl:144 runQuery() com.google.appengine.api.datastore.PreparedQueryImpl:70 asIterator() com.google.appengine.api.datastore.BasePreparedQuery:38 asIterator() com.prodeagle.java.counters.CounterNamesManager:96 allCounterNameShards() com.prodeagle.java.counters.CounterNamesManager:105 allCounterNames() com.prodeagle.java.servlets.HarvestHandler:133 createReport() com.prodeagle.java.servlets.HarvestHandler:75 doGet() com.prodeagle.java.servlets.HarvestHandler:88 doPost() javax.servlet.http.HttpServlet:637 service()
Jan 8, 2012
I think I've figured out the issue with inaccurate data. Changing the memcache operations to Async means that on line 307 of the HarvestHandler the memcache keys might not be put until after the next check to see if they exist. I've got a fix in mind that I'll put in this week. As for the Datastore query... let me hunt for that and get back to you.
Status:
Accepted
Jan 8, 2012
Amusing, so my fault there. Perhaps the async only appeared to be a natural fit. If its too difficult to make it play nicely then feel free to revert it. The query is coming from CounterNamesManager#allCounterNameShards(), identified by Appstats in the stracktrace above. This can likely be memoized in memcached with the query invalidated when #addIfNew() or #delete() write to the DB. On the plus side, the latency and GC were improved. Once the query have been cached, I think the DB quota usage will practically disappear, which makes prodeagle all the more awesome.
Jan 9, 2012
I've left most of the async in, looks like its working fine on my dev site, but let me know what you find. I'll try remove the other datastore operation later - probably tomorrow!
Jan 19, 2012
Just want to confirm that the async fix has been working nicely on my site as well over the last week. |
Owner: ed.go...@gmail.com
Cc: andri...@gmail.com
Labels: -Type-Defect Type-Enhancement