
shrink jmp-to-slowpath sequence #166

Open

derekbruening opened this issue Nov 28, 2014 · 1 comment

@derekbruening Contributor

From derek.br...@gmail.com on December 10, 2010 17:58:11

PR 494769

To shrink the jmp-to-slowpath sequence:

Don't store the return pc; derive it from the app pc instead. Share a
single jmp-to-slowpath for the whole bb that stores the start pc of the
fragment; the slowpath then decodes forward until it finds the mov-immed
that matches the app pc, and the return pc is just past the jmp that
follows that mov-immed (a sketch of this scan is below). But we can't
have custom regs holding the app pc, since the shared jmp-to-slowpath
needs to use just one: so don't do this opt until we also have a
whole-bb stolen reg (which we have now: PR 489221).

Try to use a short jmp to the shared code: if it can't reach, use a
relay spot, or a duplicated copy of the shared code. A 2-byte short jmp
saves 3 bytes over a 5-byte near jmp, so this should save space if
there are more than 3 jmp-to-slowpaths, which should happen if there
are more than 128 bytes of code.

Additionally: don't store the app pc; instead store its offset from the
tag (=> a 2-byte mov-immed). To find the tag, either have the 1st
slowpath entry store the full app pc (== the tag) (but then how do we
tell it apart from an offset, since the offset must go into an 8-bit
sub-reg?), or have the shared per-bb slowpath entry store it into TLS.

Alternative: store the tag instead of the cache start pc, to support
traces. Then we need to get the start pc from the tag via DR (we would
have to add an API routine, and disrupt our philosophy of hiding the
cache).

Alternative ideas:

  • Store a pointer to a data structure, a la linkstub, instead of both
    the app pc and the return pc. Total memory usage goes up but cache
    footprint goes down. Also, it has to be deleted when the fragment is
    deleted: so we need a hashtable, or we embed the data inside the
    fragment, but then the cache size goes up even if the executed
    instrs do not.
  • Jmp to the slowpath without returning to the same fragment: instead,
    create a new fragment. If the slowpath is rare enough, it is worth
    the new fragments and duplicated tails, since it eliminates the need
    for the 2nd arg to the slowpath. We would have to restore eflags in
    the slowpath handler if they were saved at the top of the bb.

PR 494769: shrink jmp-to-slowpath sequence, part 1

  • Added a new type, app_loc_t, to represent an app pc, an app system
    call, or (for PR 494769) an untranslated app pc; a sketch of such a
    type is below
  • The bulk of the diff is replacing the "app_pc pc" and "const char
    *syscall_aux" params to various routines with "app_loc_t *loc"
  • This is a cleaner way to pass syscalls than the old hack of using
    a low pc and passing syscall_aux separately
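A minimal sketch of what such a tagged union could look like; the
actual app_loc_t in Dr. Memory may use different field names:

/* Sketch of a tagged union along these lines; field names are
 * illustrative, not necessarily Dr. Memory's. */
typedef enum {
    APP_LOC_PC,      /* an app pc, possibly not yet translated */
    APP_LOC_SYSCALL, /* an app system call */
} app_loc_type_t;

typedef struct {
    app_loc_type_t type;
    union {
        struct {
            bool valid; /* false while the pc is an untranslated cache pc */
            app_pc pc;  /* app pc if valid, else the cache pc */
        } addr;
        struct {
            uint sysnum;             /* system call number */
            const char *syscall_aux; /* auxiliary identifying info */
        } syscall;
    } u;
} app_loc_t;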

Server: perforce-panda.eng.vmware.com:1985

PR 494769: shrink jmp-to-slowpath sequence, Part 2
Not yet on by default but just about all there.

Goal is to shrink this:
0x1f845076 c7 c1 b0 f4 a4 00 mov $0x00a4f4b0 -> %ecx
0x1f84507c c7 c2 87 50 84 1f mov $0x1f845087 -> %edx
0x1f845082 e9 72 50 12 00 jmp $0x1f96a0f9

Optimization #1 (simple):

  • Use OP_mov_imm to eliminate one byte from each slowpath store (the
    5-byte b8+r mov-immed-to-reg encoding instead of the 6-byte c7 form
    shown above)
  • Had to add support for OP_mov_imm with instr operands to
    core/x86/encode.c; see the sketch below
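Presumably the instr-operand support is for storing the pc of a
not-yet-emitted return point; a hedged sketch of how a client could use
it, with insert_store_retaddr as an illustrative name:

#include "dr_api.h"

/* Sketch: store the address of a not-yet-emitted return point by
 * giving OP_mov_imm an instr_t (label) operand, which the encoder
 * resolves to the label's final pc at emit time. */
static void
insert_store_retaddr(void *drcontext, instrlist_t *ilist, instr_t *where,
                     instr_t *retaddr_label)
{
    instrlist_meta_preinsert(ilist, where,
        INSTR_CREATE_mov_imm(drcontext, opnd_create_reg(DR_REG_XDX),
                             opnd_create_instr(retaddr_label)));
}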

Optimization #2 (complex):

  • Eliminate one of the arguments. I decided to go with one of my
    alternate-approach ideas, which seemed simpler and better than the
    main one, though it still ended up being complex.

Approach:

  • Store the return pc only
  • Decode from the copy of the app instr already present at the return
    point, walking past any spills and restores in between if necessary
    (see the sketch below). If there is too much in between (such as
    adjust-esp instrumentation), or if DR mangles the app instr, use a
    never-executed clone of the app instr.
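A minimal sketch of that walk, using DR's public
instr_is_reg_spill_or_restore() in place of Dr. Memory's internal
instr_is_{spill,restore}() checks; skip_spills_and_decode is an
illustrative name:

#include "dr_api.h"

/* Sketch: starting at the return pc, skip any register spills and
 * restores sitting between the return point and the copy of the app
 * instr, then leave that copy decoded in *app_instr (which the caller
 * must have instr_init'ed). Returns the pc of the copy, or NULL. */
static byte *
skip_spills_and_decode(void *drcontext, byte *return_pc, instr_t *app_instr)
{
    byte *pc = return_pc;
    while (true) {
        instr_reset(drcontext, app_instr);
        byte *next = decode(drcontext, pc, app_instr);
        if (next == NULL)
            return NULL; /* failed to decode */
        if (!instr_is_reg_spill_or_restore(drcontext, app_instr,
                                           NULL, NULL, NULL, NULL))
            return pc; /* app_instr now holds the app instr copy */
        pc = next;
    }
}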

List of changes:

  • Added dr_app_pc_from_cache_pc()
  • Changed how meta translation is treated when the exact non-meta
    target has not yet been hit during translation
  • Under the option -single_arg_slowpath
  • Uses the new app_loc_t data structure added in a separate commit
    (Part 1) to distinguish an untranslated pc from a system call, since
    there is no clean sentinel value for both. The cache pc is only
    translated when absolutely needed, such as to check certain error
    exceptions, to report an actual error, or to mark an instr as
    do-not-share.
  • For xl8 sharing, uses an additional cache pc entry to increment the
    slowpath count and avoid the xl8 cost on each slowpath. Once over
    the threshold, the xl8 is done to mark the app pc entry as
    do-not-share (see the sketch after this list).
  • Added is_spill_slot_opnd() and improved instr_is_{spill,restore}()
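A hedged sketch of the xl8-sharing counter just described, using the
drcontainers hashtable; xl8_count_table, XL8_SHARE_THRESHOLD, and
mark_do_not_share() are illustrative names and values, not
Dr. Memory's:

#include "dr_api.h"
#include "hashtable.h" /* drcontainers */

static hashtable_t xl8_count_table; /* keyed by cache pc; init at startup */
#define XL8_SHARE_THRESHOLD 16 /* hypothetical value */

/* hypothetical stand-in for marking the app pc entry do-not-share */
static void mark_do_not_share(app_pc pc) { (void) pc; }

static void
slowpath_count_hit(byte *cache_pc)
{
    ptr_uint_t count =
        (ptr_uint_t) hashtable_lookup(&xl8_count_table, (void *) cache_pc);
    count++;
    hashtable_add_replace(&xl8_count_table, (void *) cache_pc, (void *) count);
    if (count > XL8_SHARE_THRESHOLD) {
        /* pay the translation cost once, then stop sharing */
        app_pc app = dr_app_pc_from_cache_pc(cache_pc);
        mark_do_not_share(app);
    }
}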

Not enabled by default because of this hole in the implementation: to
handle selfmod and other situations where I can't predict for sure
whether an app instr will remain unmangled, I need to add DR support.
My plan is to implement issue #156/PR 306163 and add post-mangling
events for both bbs and traces.

Original issue: http://code.google.com/p/drmemory/issues/detail?id=166

@derekbruening Contributor Author

From bruen...@google.com on March 31, 2011 10:25:57

Other alternative ideas from my notes:

  • For non-mangled app instrs, we don't need the app pc for slow_path,
    since we can decode the post-return-point instr: except for
    reporting errors, and for mangled instrs. Update: actually we do
    need it for xl8-sharing updating via slow_path_xl8_sharing.
    =>
    Store the return pc only. Add a pclookup to DR to get the start tag
    from the cache pc, or even to translate? If it does not translate,
    the client has to walk. If we know whether the instr at the return
    pc is a non-mangled app instr, we can avoid the xl8 cost in the
    common case.
    What about the esp fastpath and other instrumentation: where is it
    placed? Sometimes there is a whole-bb reg restore or re-spill prior
    to the app instr; for the bb-final instr there is an eflags restore
    as well.
    If we knew ahead of time, we could store the app pc for the cases
    where it is not at the return pc: though how does the slowpath tell
    that apart from an uninit 2nd value? Or we could put an extra copy
    of the app instr at the return pc and never execute it; then we only
    need a translation to get the app pc when reporting an error.
    Instead of an extra copy, we could have a hashtable that stores the
    app pc; since usually the app instr is there, the table would be
    small. The slowpath looks there first (see the first sketch after
    this list).

    Impl: always add the clone prior to the fastpath label. Store the pc
    of the jmp for the retaddr so the clone can easily be removed? After
    adding all the instrumentation for an app instr, check whether the
    added clone is identical to the subsequent instr, and whether it
    will be mangled (whether it is an ind branch or a sys/int?!?), and
    then remove it.

  • Create a custom to-slowpath out-of-fragment "exit stub" so that jmps
    to the slowpath can go straight there and don't need a stub inside
    the fragment. Though if we have 3 2-byte jccs to a 15-byte stub
    inside, that would become 3 5-byte jccs to an external stub, minus
    the 2-byte fastpath jmp-over-stub: so a savings of 8 bytes (23 bytes
    down to 15), vs a savings of 10 bytes if we keep an inline stub as a
    landing pad that just does a jmp. Though for an instr w/o a shadow
    table lookup (b/c of sharing, or b/c there is no mem ref) the jcc
    should go straight there. We would have to link the stub to the
    lifetime of the fragment, or insert it at the bottom or sthg. Is
    having in-cache post-fragment exit stubs any better than
    mid-fragment stubs? We would have to adjust DR's bb-end requirements
    to allow post-final-mbr/cbr meta instrs.

  • If the slowpath gets rare enough, use a fault: instead of "jcc
    slowpath", use "cmovcc <invalid addr>, reg". We can have the invalid
    addr be the offset of the return pc from the faulting cmovcc (see
    the second sketch after this list).
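A hedged sketch of the lookup order the first idea proposes;
mangled_pc_table and resolve_app_pc are illustrative names, and the
table is assumed to hold app pcs for the instrs whose copy is not at
the return pc:

#include "dr_api.h"
#include "hashtable.h" /* drcontainers */

static hashtable_t mangled_pc_table; /* return pc -> app pc; init at startup */

/* Sketch: resolve the app pc for a slowpath entry at return_pc. */
static app_pc
resolve_app_pc(byte *return_pc, bool for_error_report)
{
    /* 1) mangled/displaced instrs: app pc was stored at instru time */
    app_pc app = (app_pc) hashtable_lookup(&mangled_pc_table,
                                           (void *) return_pc);
    if (app != NULL)
        return app;
    /* 2) common case: a non-mangled copy of the app instr sits at the
     * return pc, so the slowpath can just decode it in place; the app
     * pc itself is only needed when reporting an error */
    if (!for_error_report)
        return (app_pc) return_pc; /* decode here; no xl8 cost */
    /* 3) rare: pay for a full translation */
    return dr_app_pc_from_cache_pc(return_pc);
}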
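And a minimal sketch of the recovery arithmetic for the cmovcc idea: in
a DR client, the faulting pc and the faulting address would come from
the signal/exception event (e.g. dr_siginfo_t's raw_mcontext and
access_address fields); slowpath_return_pc is an illustrative name.

#include "dr_api.h"

/* Sketch: the invalid address the cmovcc dereferenced was chosen at
 * instrumentation time to equal the return pc's offset from the
 * faulting cmovcc, so one addition recovers the return pc. */
static byte *
slowpath_return_pc(byte *fault_pc, void *fault_addr)
{
    return fault_pc + (ptr_int_t) fault_addr;
}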
