DynamicLoadingOptions
Options for supporting dynamic loading, and how they interact with dynamic libraries
Dynamic loading in Native Client
Introduction
Native Client needs to ensure that only validated code can be executed. This requires a mechanism to separate code and data. We would like to extend Native Client to support dynamic loading of code. There are two main use cases for this:
There are two types of interface that we might provide for loading code dynamically:
Current scheme
The current sandboxing scheme supports statically-linked executables only. It uses the following address space layout:
The Native Client process is limited so that the only instructions it can execute are below code_top. Each of the three architectures has a different way to do this. All three require indirect jumps to be preceded by a masking instruction that forces the destination address to be aligned, but whether this instruction limits the range of the address varies.
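As a rough illustration of the masking step described above, the following C sketch models the two variants: a mask that only forces alignment, and a mask that also limits the range. The 32-byte bundle size, the power-of-two region size, and the function names are assumptions made for this example. In practice the mask is an instruction sequence checked by the validator, and its exact form differs per architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bundle size: indirect jump targets must be multiples of this. */
#define BUNDLE_SIZE 32u

/* Alignment-only mask: clears the low bits so the target is bundle-aligned.
   Some other mechanism must then keep the target below code_top. */
uint32_t mask_align_only(uint32_t target) {
    return target & ~(BUNDLE_SIZE - 1u);
}

/* Alignment-and-range mask for a code region [0, region_size), where
   region_size is assumed to be a power of two: the high bits are cleared
   as well, so the result always lands inside the region. */
uint32_t mask_align_and_range(uint32_t target, uint32_t region_size) {
    assert(region_size >= BUNDLE_SIZE);
    assert((region_size & (region_size - 1u)) == 0);
    return target & (region_size - 1u) & ~(BUNDLE_SIZE - 1u);
}
```

With a 256MB region, for instance, mask_align_and_range() maps any 32-bit value to a bundle-aligned address below 0x10000000, whereas mask_align_only() leaves the high bits untouched.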
Interface 1: ContiguousCode
This interface scheme adds an extra region to the address space, into which code can be dynamically loaded. This region appears after the executable's code segment and before its data segment, so that the address space layout becomes:
Implementation: CC-HLTRewrite
One proposed implementation for this interface works as follows:
Pros:
Cons:
Implementation: CC-MProtect
We could fix the memory usage problem of CC-HLTRewrite by using page protections. Instead of filling the entire dynamic code region with HLTs on startup, we initially set the page mappings in the read view to be unreadable (no permission bits set). Whenever we need to allocate a page, we use the write view to fill it with HLTs, and then make the read view mapping readable using mprotect(). This assumes that the OS allocates pages on demand, when we write the HLTs rather than when we map the pages.
The Windows equivalent of mprotect() is VirtualProtect(). The Windows docs state, for the PAGE_NOACCESS flag, that "This flag is not supported by the CreateFileMapping function", which may mean that we cannot make pages in shared memory segments selectively unreadable. Instead, we could map pages into the code region dynamically; this is subject to the WindowsDLLInjectionProblem.

Implications for dynamic linking
One of the main points of dynamic linking is that programs don't know in advance what they will be loading, so they don't know how much space to reserve. With the scheme above, programs will want to reserve a large amount of space to be on the safe side. For example, if address space is 1GB, we might reserve a large proportion of that for code, maybe 512MB or 256MB. Otherwise, a process could get into a situation where it can't continue when, say, a required plugin cannot be loaded because insufficient space was reserved up front for code, while plenty of address space remains for data and plenty of memory remains available.
If we knew in advance what libraries we were going to load, we wouldn't need dynamic loading support. Instead we could concatenate our libraries (.so files) and executable into one big executable before running sel_ldr. We could fudge the dynamic linker to find the pre-loaded libraries in memory rather than loading them from the virtual filesystem.
Dynamic library segment layout
ELF dynamic libraries are normally set up so that a library's data segment immediately follows its code segment. (On x86-64 systems there is a ~1MB gap between the code and data segments in order to support hypothetical systems with a 1MB page size. The resulting address space wastage is not considered significant when you have a 48-bit address space to play with.) This means that code and data are interleaved in address space, i.e.
Once an ELF shared library is linked to become a .so file, its code and data segments are fixed relative to each other (with the caveat that --emit-relocs causes ld to retain the relocations that were applied). The ELF shared library is relocatable, but whatever offset the code segment is moved by, the data segment must be moved by too. This is what the ELF Program Headers format assumes, and any dynamic loader will assume. This means the size of the gap between segments gets linked in to the shared library at link time.
Shared libraries linked with different segment gap sizes will be difficult to load together (because of the difficulty of allocating address space), so we may want to choose a standard segment gap size, such as 256MB. Let's call this scheme Big Segment Gap (BSG).

Linker script changes
It is straightforward to specify the segment gap size by changing ld's linker script. Normally a linker script contains an instruction like the following (taken from /usr/lib/ldscripts/elf_i386.x) to ensure that the code and data segments are on different pages:

    /* Adjust the address for the data segment. We want to adjust up to
       the same address within the page on the next page up. */
    . = ALIGN (CONSTANT (MAXPAGESIZE)) - ((CONSTANT (MAXPAGESIZE) - .) & (CONSTANT (MAXPAGESIZE) - 1));

NaCl's linker script for statically linked executables has a slightly different instruction. This forces the data segment to start on a new page:

    . = ALIGN(CONSTANT (MAXPAGESIZE)); /* NaCl wants page alignment */

(For a discussion of the relative merits of those two instructions, see issue 193.)
To link a shared library with a segment gap of 256MB, we would instead use:

    . = 0x10000000;

The same effect can also be achieved without a linker script change by using the ld option --section-start .rodata=0x10000000.

Address space wastage
This segment layout wastes some data address space. Suppose library 1's code segment is 192k and its data segment is 64k (after rounding up to a page size of 64k). We will have to set aside 192k of address space for the data segment so that library 2's segments (which are fixed relative to each other) can fit in after library 1's segments. Hence we waste 128k of address space. However, we don't waste memory because nothing needs to be mapped into this space. Furthermore, this is really fragmentation rather than wastage because mmap() could still allocate from these gaps.
Note that space in the code region can be similarly wasted if the library's data segment is larger than the code segment, but it is more usual for the code segment to be larger.

Address space layout diagram
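The arithmetic in the address space wastage example (192k of code, 64k of data, 128k wasted) can be checked with a short C sketch. It assumes libraries are packed back to back, so each library takes a slot of the same size in both the code and data regions, and it assumes the 64k page size used in the example. The function names are hypothetical.

```c
#include <stddef.h>

#define PAGE_64K ((size_t)64 * 1024)

/* Round a segment size up to the assumed 64k page size. */
size_t round_up_to_page(size_t n) {
    return (n + PAGE_64K - 1) / PAGE_64K * PAGE_64K;
}

/* Because the segment gap is fixed, the next library's code and data both
   start the same distance further on, so each library's slot must be as
   large as the bigger of its two segments. */
size_t bsg_slot_size(size_t code_size, size_t data_size) {
    size_t c = round_up_to_page(code_size);
    size_t d = round_up_to_page(data_size);
    return c > d ? c : d;
}

/* Address space left unused in the data region for this library. */
size_t bsg_wasted_data_space(size_t code_size, size_t data_size) {
    return bsg_slot_size(code_size, data_size) - round_up_to_page(data_size);
}

/* Address space left unused in the code region for this library. */
size_t bsg_wasted_code_space(size_t code_size, size_t data_size) {
    return bsg_slot_size(code_size, data_size) - round_up_to_page(code_size);
}
```

For the library in the example, bsg_wasted_data_space(192 * 1024, 64 * 1024) gives 128k and bsg_wasted_code_space() gives zero. Swapping the two sizes reverses the result, which is the less common case noted above.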
Deferring the choice of segment gap size
There are two ways in which we might defer the choice of segment gap size until after linking the .so:
Use ld's --emit-relocs option. This tells ld to include all ELF relocations in the output. It may be possible to use this information to rewrite the .so file to move the segments relative to each other. However, it is not clear that this is the purpose of --emit-relocs, and it may be easier to simply re-link the .so file from the original inputs to ld. We could use this to change sets of libraries and executables en masse from one segment gap size to another. So if a gap size of 256MB turns out to be too small and we want to load more than 256MB of code, we can rewrite our ELF objects to use a gap size of 300MB without having to rebuild them. There are two potential obstacles:
Extend ELF with a new Program Headers format in which code and data segments are not fixed relative to each other. In this scenario, each dynamic relocation gains an extra flag to say whether it is relative to code or data, and the dynamic linker is extended to understand these relocations. This means libraries and executables do not have an inbuilt preferred segment gap size. The main advantage of this scheme is that it can avoid the address space wastage mentioned above. The resulting executables would contain a load of TEXTRELs (relocations in the code/text segment). TEXTRELs usually come from failing to compile libraries with -fPIC, but in SSG they would occur regardless of whether we compile with -fPIC. TEXTRELs are usually considered to be bad, because they prevent the memory for the code from being shared between processes. Although NaCl does not currently implement sharing code via mmap, pervasive use of TEXTRELs prevents this sharing from being introduced in the future, so this seems like a step in the wrong direction. Extending ELF would involve a lot of toolchain work to implement a feature that no-one else would use, just to save some address space fragmentation, so implementing this does not seem worthwhile. If we wanted to reduce the number of TEXTRELs in SSG, there are a couple of things we could do. When code is compiled with -fPIC, many of the TEXTRELs will come from references to the GOT (Global Offset Table). The GOT is the data structure through which code acquires the addresses of symbols in other ELF objects (including symbols in the same ELF object that can potentially be overridden by other ELF objects). On x86, which lacks a PC-relative addressing mode, the GOT's address is also used as a starting point for finding the relocated addresses of code and data in the same ELF object (this is implemented via the R_386_GOTOFF relocation). 
On x86, there is a set of tiny functions called __i686.get_pc_thunk.reg (where reg is a register name) which are used for copying %eip into a general purpose register (usually %ebx) whose value is then adjusted to get the GOT's address. (See also issue 229 for a discussion of how this works.)
We could introduce a __get_got_address function. The adjustment to find the GOT address would be done in only one place in the code segment so there would be only one TEXTREL. We would use this function on all architectures in place of PC-relative addressing or x86's __i686.get_pc_thunk. All data segment references would have to go via the GOT, as in x86. This would add overhead to every function that uses a global variable or calls a function in another ELF object. Finding code segment addresses by adding a constant to %ebx via R_386_GOTOFF (e.g. taking the address of a static function) would no longer work. These addresses could be added as GOT entries.
Instead of placing the GOT in the data segment, we could place it in the code segment. Though we can't place large amounts of data in the code segment, we can place chunks of up to 28 or 31 bytes (depending on architecture) in the code segment provided that each chunk is preceded by a HLT instruction to prevent the data from being executed as code. The GOT can fit these constraints because the pointers it contains do not need to be laid out contiguously. In this scheme, 1 out of every 8 GOT slots will be set aside to contain a HLT instruction. The dynamic linker would have to make a NaCl syscall to relocate a GOT entry. This would add overhead for lazy symbol binding (PLTGOT entries), but GOT relocations for eager symbol binding can be done in bulk. The syscall's implementation can be simple: the trusted runtime just has to check that the 32 byte block being written to starts with a HLT instruction. Note that this assumes we are not using HLTs at the starts of blocks for some other scheme, such as the proposed SFI-invariant-preserved-across-blocks scheme. Direct jumps into blocks starting with a HLT would have to be disallowed. Finding data segment addresses (e.g. addresses of static variables or string literals) by adding a constant to %ebx (via R_386_GOTOFF) on x86 or by PC-relative addressing on other architectures would no longer work. These addresses could be added as GOT entries.

Allocating address space at runtime
The dynamic linker normally allocates address space for a library by first mmap()ing a region large enough for the whole library, both code and data segments, which assumes that the segments are contiguous. It passes a NULL start argument to mmap(), so the kernel chooses the start address. The kernel keeps track of what parts of address space have been allocated so the process doesn't have to. Allocation will be less straightforward when each library's code and data segments are discontiguous.
The dynamic linker will have to allocate space from the code and data regions of address space simultaneously. The data region may contain mappings created by the main program or by libraries; the dynamic linker will need to avoid overwriting these. We may need to extend mmap() to support this; some possibilities are:
Implications for debuggers
Debuggers that assume that each ELF object is contiguous in address space may get confused because ELF objects will appear to them to overlap. The same applies to debugging tools such as backtrace generators. These tools may need fixing to correctly map addresses to symbol names.

Interface 2: InterleavedCode
In this interface scheme, NaCl's mmap() call is extended to provide a verify-and-map-code operation. Code and data may be interleaved in the NaCl process's address space. Code may be mapped with page granularity, which is 64k for compatibility with Windows.

Implementation: IC-NXBit
On systems that support it, this can be implemented using NX page protection. Architectures:
There are other architectures that we may not target ourselves. However, the ability of NaCl to be ported to them may affect community adoption:
Implementation: IC-HarvardX86
On x86 systems without the NX bit, we implement this using x86 segmentation. The x86 code segment is set up to be disjoint from the x86 data segment. Pages that contain validated code are mapped into the region covered by the x86 code segment. Other pages are mapped into the x86 data segment. As an example, sel_ldr's address space would contain the following after loading the initial executable:
(Note that as before, "unmapped" really means "mapped with no permission bits set". The x86 segments are shown side by side to illustrate the correspondence though really one is after the other.)
This is a Harvard architecture-style approach, because any given address in the NaCl process's address space could potentially have two meanings depending on whether it is used for code or data access. In practice we would not allow differing code and data pages to be mapped at the same address, because of the potential for confusion that this could cause.
Note that PaX, a set of patches to the Linux kernel, uses a similar approach to emulate the NX bit on x86 in its SEGMEXEC scheme.

x86 code segment size
The layout above halves the amount of address space available to the NaCl process from 1024MB to 512MB. However, we could reduce the address space loss if it were acceptable for address space to be non-uniform. Suppose Windows lets us allocate 1024MB of sel_ldr's address space. We could allocate 300MB to the x86 code segment and 724MB to the x86 data segment. Hence the NaCl process sees 724MB of address space, but it can map code only into the bottom 300MB. However code is still allowed to be interleaved with data. The remaining 424MB can only contain data. Though the resulting address space is non-uniform, it is still more flexible than the ContiguousCode scheme.

Whether code is readable as data
Should pages that we map into the x86 code segment also be mapped into the x86 data segment?
If we do this, code will be readable as data, as on other platforms. However, there may be difficulties in mapping pages twice on Windows.
If we do not do this, we will have a discrepancy between platforms. This could lead to portability problems if programmers rely on being able to read code as data, and test their software on only one platform. We cannot change the other platforms to match because processors do not usually provide an executable-but-not-readable page permission. PROT_EXEC usually implies PROT_READ. This may not be a problem because reading code is inherently unportable -- any program that does so will depend on the instruction set. NaCl does not allow arbitrary data in an ELF code segment (with the possible exception of 31 byte chunks guarded by a HLT byte), so no data in the code segment can be in a portable format.

Evaluation
Pros:
Cons:
Hybrid interfaces
We could combine the benefits of interleaved code and data with the benefits of loading chunks of code that are smaller than page size. NaCl could provide operations to map some HLT-filled pages with code, and incrementally fill those pages with validated code using the HLT-overwriting technique above.

Deallocation of code
In some situations it will be desirable to be able to unload code (e.g. dlclose()) so that the space can be reused.

Dealing with jumps
Code is loaded in chunks. If chunks are allowed to contain internal, unaligned jumps (i.e. jumps to validated instructions in the middle of instruction bundles), we must ensure that code is unloaded in chunks too, which means we must record which chunks have been loaded. This applies whether we load code by mapping pages or by overwriting HLTs. If all direct jumps are required to be aligned, this saves us the trouble of having to remember chunks.

Dealing with multiple threads
What happens in the presence of multi-threading? We must deal with the case where other threads are executing the code that we are attempting to unload.
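The bookkeeping needed when chunks may contain internal unaligned jumps (see "Dealing with jumps") could look something like the following C sketch. It is only an illustration: the fixed-size table, the struct layout, and the function names are assumptions, and it does not address the multi-threading question at all. The key property is that an unload request is honoured only if it names exactly one previously loaded chunk.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CHUNKS 64

struct chunk {
    uintptr_t start;
    size_t size;
};

struct chunk_table {
    struct chunk chunks[MAX_CHUNKS];
    size_t count;
};

/* Record a newly loaded chunk. Returns 0 on success, or -1 if the table
   is full, the chunk is empty, or it overlaps a chunk already recorded. */
int chunk_record(struct chunk_table *t, uintptr_t start, size_t size) {
    if (size == 0 || t->count == MAX_CHUNKS) {
        return -1;
    }
    for (size_t i = 0; i < t->count; i++) {
        const struct chunk *c = &t->chunks[i];
        if (start < c->start + c->size && c->start < start + size) {
            return -1;
        }
    }
    t->chunks[t->count].start = start;
    t->chunks[t->count].size = size;
    t->count++;
    return 0;
}

/* Unload a chunk. The range must match a recorded chunk exactly; partial
   unloads are refused because jumps elsewhere in the chunk may target the
   part being removed. Returns 0 on success, -1 otherwise. */
int chunk_unload(struct chunk_table *t, uintptr_t start, size_t size) {
    for (size_t i = 0; i < t->count; i++) {
        if (t->chunks[i].start == start && t->chunks[i].size == size) {
            t->chunks[i] = t->chunks[t->count - 1];
            t->count--;
            return 0;
        }
    }
    return -1;
}
```

If all direct jumps were required to be aligned, this table would be unnecessary, which is the trade-off described above.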
mmapping code to share memory
Most operating systems that support dynamic linking allow multiple processes using the same library to share the memory pages containing the library code. This usually works by mmapping the code from a common file. This would be desirable to support in Native Client if we expect to have many NaCl processes running simultaneously while using the same libraries.
If we were to support this, we would have to ensure that the mapped code will not change after it has been validated, otherwise this would violate the safety of the system. On Linux, mmap()'s MAP_PRIVATE flag is not sufficient to achieve this because it is not fully copy-on-write: writes to the mapping in memory cause a copy, but writes to the file will be visible in the mapped region. We would therefore have to ensure that the file's contents will not be changed. There are two ways we might achieve this:
Proposed NaCl-syscall interface
The basic syscall interface would be:

    /* Load code from memory */
    int nacl_copy_code(void *src_addr, void *dest_addr, size_t size);

dest_addr must be non-NULL: it is the caller's responsibility to allocate address ranges in the code region. In a higher level interface provided by a library, the library would allocate addresses and would record which parts of the code region have been allocated so far. dest_addr need not be page-aligned.
We can also provide a convenience interface:

    /* Load code from file descriptor. */
    int nacl_map_code(int filedesc, int file_offset, void *dest_addr, size_t size);

This can be implemented as a library function or as a second syscall. If nacl_map_code() is a library function, a CC-HLTRewrite implementation will involve copying the code three times:
If nacl_map_code() is a syscall, it can avoid the first copy.

Status
Support for dynamic loading in the trusted codebase:
Support for dynamic loading in the untrusted codebase:
Something came to mind while reading the section on IC-HarvardX86. Just-in-time compilers tend to need to be able to read (and write) code as data. As one example, the HotSpot Java virtual machine embeds heap pointers into the code it generates at runtime, and reads and updates those pointers during full garbage collections. When porting such a system to Native Client it would probably be feasible to require the VM to publish all updates to its dynamically generated code in one operation, but it would be more difficult if it could not read the original code it had generated.