
shrink jmp-to-slowpath sequence #166

Open

derekbruening opened this issue Nov 28, 2014 · 1 comment

@derekbruening Contributor

From derek.br...@gmail.com on December 10, 2010 17:58:11

PR 494769

To shrink the jmp-to-slowpath sequence:

Don't store the return pc; derive it from the app pc instead. Share a
single jmp-to-slowpath for the whole bb that stores the start pc of the
fragment; the slowpath then decodes forward until it finds the mov-immed
that matches the app pc, and the return pc is just past the jmp that
follows that mov-immed (a sketch of this scan is below). But we can't
have custom regs holding the app pc, since the shared jmp-to-slowpath
needs to use just one: so don't do this opt until we also have a
whole-bb stolen reg (which we have now: PR 489221).

Try to use a short jmp to the shared code: if it can't reach, use a
relay spot, or a duplicated copy of the shared code. A 2-byte short jmp
saves 3 bytes over a 5-byte near jmp, so this should save space if
there are more than 3 jmp-to-slowpaths, which should happen if there
are more than 128 bytes of code.

Additionally: don't store the app pc; instead store its offset from the
tag (=> a 2-byte mov-immed). To find the tag, either have the 1st
slowpath entry store the full app pc (== the tag) (but then how do we
tell it apart from an offset, since the offset must go into an 8-bit
sub-reg?), or have the shared per-bb slowpath entry store it into TLS.

Alternative: store the tag instead of the cache start pc, to support
traces. Then we need to get the start pc from the tag via DR (we would
have to add an API routine, and disrupt our philosophy of hiding the
cache).

Alternative ideas:

  • Store a pointer to a data structure, a la linkstub, instead of both
    the app pc and the return pc. Total memory usage goes up but cache
    footprint goes down. Also, it has to be deleted when the fragment is
    deleted: so we need a hashtable, or we embed the data inside the
    fragment, but then the cache size goes up even if the executed
    instrs do not.
  • Jmp to the slowpath without returning to the same fragment: instead,
    create a new fragment. If the slowpath is rare enough, it is worth
    the new fragments and duplicated tails, since it eliminates the need
    for the 2nd arg to the slowpath. We would have to restore eflags in
    the slowpath handler if they were saved at the top of the bb.

PR 494769: shrink jmp-to-slowpath sequence, part 1

  • Added a new type, app_loc_t, to represent an app pc, an app system
    call, or (for PR 494769) an untranslated app pc; a sketch of such a
    type is below
  • The bulk of the diff is replacing the "app_pc pc" and "const char
    *syscall_aux" params to various routines with "app_loc_t *loc"
  • This is a cleaner way to pass syscalls than the old hack of using
    a low pc and passing syscall_aux separately
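A minimal sketch of what such a tagged union could look like; the
actual app_loc_t in Dr. Memory may use different field names:

/* Sketch of a tagged union along these lines; field names are
 * illustrative, not necessarily Dr. Memory's. */
typedef enum {
    APP_LOC_PC,      /* an app pc, possibly not yet translated */
    APP_LOC_SYSCALL, /* an app system call */
} app_loc_type_t;

typedef struct {
    app_loc_type_t type;
    union {
        struct {
            bool valid; /* false while the pc is an untranslated cache pc */
            app_pc pc;  /* app pc if valid, else the cache pc */
        } addr;
        struct {
            uint sysnum;             /* system call number */
            const char *syscall_aux; /* auxiliary identifying info */
        } syscall;
    } u;
} app_loc_t;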

Server: perforce-panda.eng.vmware.com:1985

PR 494769: shrink jmp-to-slowpath sequence, Part 2
Not yet on by default but just about all there.

Goal is to shrink this:
0x1f845076 c7 c1 b0 f4 a4 00 mov $0x00a4f4b0 -> %ecx
0x1f84507c c7 c2 87 50 84 1f mov $0x1f845087 -> %edx
0x1f845082 e9 72 50 12 00 jmp $0x1f96a0f9

Optimization #1 (simple):

  • Use OP_mov_imm to eliminate one byte from each slowpath store (the
    5-byte b8+r mov-immed-to-reg encoding instead of the 6-byte c7 form
    shown above)
  • Had to add support for OP_mov_imm with instr operands to
    core/x86/encode.c; see the sketch below
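Presumably the instr-operand support is for storing the pc of a
not-yet-emitted return point; a hedged sketch of how a client could use
it, with insert_store_retaddr as an illustrative name:

#include "dr_api.h"

/* Sketch: store the address of a not-yet-emitted return point by
 * giving OP_mov_imm an instr_t (label) operand, which the encoder
 * resolves to the label's final pc at emit time. */
static void
insert_store_retaddr(void *drcontext, instrlist_t *ilist, instr_t *where,
                     instr_t *retaddr_label)
{
    instrlist_meta_preinsert(ilist, where,
        INSTR_CREATE_mov_imm(drcontext, opnd_create_reg(DR_REG_XDX),
                             opnd_create_instr(retaddr_label)));
}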

Optimization #2 (complex):

  • Eliminate one of the arguments. I decided to go with one of my
    alternate-approach ideas, which seemed simpler and better than the
    main one, though it still ended up being complex.

Approach:

  • Store the return pc only
  • Decode from the copy of the app instr already present at the return
    point, walking past any spills and restores in between if necessary
    (see the sketch below). If there is too much in between (such as
    adjust-esp instrumentation), or if DR mangles the app instr, use a
    never-executed clone of the app instr.
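A minimal sketch of that walk, using DR's public
instr_is_reg_spill_or_restore() in place of Dr. Memory's internal
instr_is_{spill,restore}() checks; skip_spills_and_decode is an
illustrative name:

#include "dr_api.h"

/* Sketch: starting at the return pc, skip any register spills and
 * restores sitting between the return point and the copy of the app
 * instr, then leave that copy decoded in *app_instr (which the caller
 * must have instr_init'ed). Returns the pc of the copy, or NULL. */
static byte *
skip_spills_and_decode(void *drcontext, byte *return_pc, instr_t *app_instr)
{
    byte *pc = return_pc;
    while (true) {
        instr_reset(drcontext, app_instr);
        byte *next = decode(drcontext, pc, app_instr);
        if (next == NULL)
            return NULL; /* failed to decode */
        if (!instr_is_reg_spill_or_restore(drcontext, app_instr,
                                           NULL, NULL, NULL, NULL))
            return pc; /* app_instr now holds the app instr copy */
        pc = next;
    }
}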

List of changes:

  • Added dr_app_pc_from_cache_pc()
  • Changed how meta translation is treated when the exact non-meta
    target has not yet been hit during translation
  • Under the option -single_arg_slowpath
  • Uses the new app_loc_t data structure added in a separate commit
    (Part 1) to distinguish an untranslated pc from a system call, since
    there is no clean sentinel value for both. The cache pc is only
    translated when absolutely needed, such as to check certain error
    exceptions, to report an actual error, or to mark an instr as
    do-not-share.
  • For xl8 sharing, uses an additional cache pc entry to increment the
    slowpath count and avoid the xl8 cost on each slowpath. Once over
    the threshold, the xl8 is done to mark the app pc entry as
    do-not-share (see the sketch after this list).
  • Added is_spill_slot_opnd() and improved instr_is_{spill,restore}()
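A hedged sketch of the xl8-sharing counter just described, using the
drcontainers hashtable; xl8_count_table, XL8_SHARE_THRESHOLD, and
mark_do_not_share() are illustrative names and values, not
Dr. Memory's:

#include "dr_api.h"
#include "hashtable.h" /* drcontainers */

static hashtable_t xl8_count_table; /* keyed by cache pc; init at startup */
#define XL8_SHARE_THRESHOLD 16 /* hypothetical value */

/* hypothetical stand-in for marking the app pc entry do-not-share */
static void mark_do_not_share(app_pc pc) { (void) pc; }

static void
slowpath_count_hit(byte *cache_pc)
{
    ptr_uint_t count =
        (ptr_uint_t) hashtable_lookup(&xl8_count_table, (void *) cache_pc);
    count++;
    hashtable_add_replace(&xl8_count_table, (void *) cache_pc, (void *) count);
    if (count > XL8_SHARE_THRESHOLD) {
        /* pay the translation cost once, then stop sharing */
        app_pc app = dr_app_pc_from_cache_pc(cache_pc);
        mark_do_not_share(app);
    }
}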

Not enabled by default because of this hole in the implementation: to
handle selfmod and other situations where I can't predict for sure
whether an app instr will remain unmangled, I need to add DR support.
My plan is to implement issue #156/PR 306163 and add post-mangling
events for both bbs and traces.

Original issue: http://code.google.com/p/drmemory/issues/detail?id=166

@derekbruening Contributor Author

From bruen...@google.com on March 31, 2011 10:25:57

Other alternative ideas from my notes:

  • For non-mangled app instrs, we don't need the app pc for slow_path,
    since we can decode the post-return-point instr: except for
    reporting errors, and for mangled instrs. Update: actually we do
    need it for xl8-sharing updating via slow_path_xl8_sharing.
    =>
    Store the return pc only. Add a pclookup to DR to get the start tag
    from the cache pc, or even to translate? If it does not translate,
    the client has to walk. If we know whether the instr at the return
    pc is a non-mangled app instr, we can avoid the xl8 cost in the
    common case.
    What about the esp fastpath and other instrumentation: where is it
    placed? Sometimes there is a whole-bb reg restore or re-spill prior
    to the app instr; for the bb-final instr there is an eflags restore
    as well.
    If we knew ahead of time, we could store the app pc for the cases
    where it is not at the return pc: though how does the slowpath tell
    that apart from an uninit 2nd value? Or we could put an extra copy
    of the app instr at the return pc and never execute it; then we only
    need a translation to get the app pc when reporting an error.
    Instead of an extra copy, we could have a hashtable that stores the
    app pc; since usually the app instr is there, the table would be
    small. The slowpath looks there first (see the first sketch after
    this list).

    Impl: always add the clone prior to the fastpath label. Store the pc
    of the jmp for the retaddr so the clone can easily be removed? After
    adding all the instrumentation for an app instr, check whether the
    added clone is identical to the subsequent instr, and whether it
    will be mangled (whether it is an ind branch or a sys/int?!?), and
    then remove it.

  • Create a custom to-slowpath out-of-fragment "exit stub" so that jmps
    to the slowpath can go straight there and don't need a stub inside
    the fragment. Though if we have 3 2-byte jccs to a 15-byte stub
    inside, that would become 3 5-byte jccs to an external stub, minus
    the 2-byte fastpath jmp-over-stub: so a savings of 8 bytes (23 bytes
    down to 15), vs a savings of 10 bytes if we keep an inline stub as a
    landing pad that just does a jmp. Though for an instr w/o a shadow
    table lookup (b/c of sharing, or b/c there is no mem ref) the jcc
    should go straight there. We would have to link the stub to the
    lifetime of the fragment, or insert it at the bottom or sthg. Is
    having in-cache post-fragment exit stubs any better than
    mid-fragment stubs? We would have to adjust DR's bb-end requirements
    to allow post-final-mbr/cbr meta instrs.

  • If the slowpath gets rare enough, use a fault: instead of "jcc
    slowpath", use "cmovcc <invalid addr>, reg". We can have the invalid
    addr be the offset of the return pc from the faulting cmovcc (see
    the second sketch after this list).
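A hedged sketch of the lookup order the first idea proposes;
mangled_pc_table and resolve_app_pc are illustrative names, and the
table is assumed to hold app pcs for the instrs whose copy is not at
the return pc:

#include "dr_api.h"
#include "hashtable.h" /* drcontainers */

static hashtable_t mangled_pc_table; /* return pc -> app pc; init at startup */

/* Sketch: resolve the app pc for a slowpath entry at return_pc. */
static app_pc
resolve_app_pc(byte *return_pc, bool for_error_report)
{
    /* 1) mangled/displaced instrs: app pc was stored at instru time */
    app_pc app = (app_pc) hashtable_lookup(&mangled_pc_table,
                                           (void *) return_pc);
    if (app != NULL)
        return app;
    /* 2) common case: a non-mangled copy of the app instr sits at the
     * return pc, so the slowpath can just decode it in place; the app
     * pc itself is only needed when reporting an error */
    if (!for_error_report)
        return (app_pc) return_pc; /* decode here; no xl8 cost */
    /* 3) rare: pay for a full translation */
    return dr_app_pc_from_cache_pc(return_pc);
}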
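And a minimal sketch of the recovery arithmetic for the cmovcc idea: in
a DR client, the faulting pc and the faulting address would come from
the signal/exception event (e.g. dr_siginfo_t's raw_mcontext and
access_address fields); slowpath_return_pc is an illustrative name.

#include "dr_api.h"

/* Sketch: the invalid address the cmovcc dereferenced was chosen at
 * instrumentation time to equal the return pc's offset from the
 * faulting cmovcc, so one addition recovers the return pc. */
static byte *
slowpath_return_pc(byte *fault_pc, void *fault_addr)
{
    return fault_pc + (ptr_int_t) fault_addr;
}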
