Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: amd64 vdso parser issues #8197

Closed
gopherbot opened this issue Jun 12, 2014 · 43 comments
Closed

runtime: amd64 vdso parser issues #8197

gopherbot opened this issue Jun 12, 2014 · 43 comments
Milestone

Comments

@gopherbot
Copy link

by amluto:

The vdso parser
(https://code.google.com/p/go/source/browse/src/pkg/runtime/vdso_linux_amd64.c) appears
to have multiple issues.

1. It looks at the section table.  It shouldn't.  In-memory mappings of DSOs are not
required to have section tables at all.

2. It does this:

        for(i=0; i<vdso_info->num_sym; i++) {
                Elf64_Sym *sym = &vdso_info->symtab[i];

That's bogus.  There's no guarantee that the number of symbols in the dynamic symbol
table equals the number of SHT_DYNSYM entries.

3. The error handling is completely broken.  The code does, *in function scope*:

        struct vdso_info vdso_info;

and then assumes that vdso_info is zero-initialized.  C doesn't work that way.

This code appears to borrow heavily from the reference parser I wrote in
Documentation/vDSO/parse_vdso.c in the kernel tree, but all three of these bugs seem to
be go-specific additions to the code :(


There's a report that Go programs don't run on early Linux 3.16 builds.  I haven't
reproduced it yet, but if this problem is real, I will probably do something to allow
these broken Go programs to run, but they are unlikely to continue being able to find
vDSO functions.

Please fix your vDSO parser.
@bradfitz
Copy link
Contributor

Comment 1:

The C compiler as used by the runtime does zero automatic variables.

Status changed to Accepted.

@bradfitz
Copy link
Contributor

Comment 2:

Or, I was thinking of something else. In any case, thanks for the bug report.

@gopherbot
Copy link
Author

Comment 3 by amluto:

Hmm.  It's possible that I've misdiagnosed the exact failure mode.  It looks like the
num_sym loop is reading its way past the vdso and blowing up on iteration 498 (!).
I'm having debug info issues, so it's a bit awkward to debug the Go runtime here.

@gopherbot
Copy link
Author

Comment 4 by amluto:

Candidate workaround.  This applies to this Linux revision:
commit 4d048b0255e3dd4fb001c5f1f609fb67463d04d6
Author: H. Peter Anvin <hpa@zytor.com>
Date:   Tue Jun 10 14:25:26 2014 -0700
    x86, vdso: Remove one final use of htole16()
I'm not sure if it's in Linus' tree yet, but it's in the x86/vdso branch of tip.git.
This will regress performance for all Go programs that use the broken vDSO parser.  How
big of a deal is this?

@gopherbot
Copy link
Author

Comment 5 by amluto:

...now with attachment.

Attachments:

  1. 0001-x86-vdso-Hack-to-keep-64-bit-Go-programs-working.patch (3966 bytes)

@gopherbot
Copy link
Author

Comment 6 by amluto:

Also, see this:
https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=vdso/cleanups&id=87f62cc3db04047f9edbf7bacac059396d7112f1
and its parent for an updated reference parser.  These changes make it work for 32-bit
userspace.

@gopherbot
Copy link
Author

Comment 7:

CL https://golang.org/cl/101260044 mentions this issue.

@gopherbot
Copy link
Author

Comment 8 by amluto:

This doesn't appear to address the issue with all the vdso_info variables not being
initialized.  Are you *sure* that your compiler does that for you?  My reading of the
code suggests that, if that were the case, then the old code wouldn't have crashed.

@ianlancetaylor
Copy link
Contributor

Comment 9:

Which fields specifically are you concerned about?
Actually, I see now that the valid field was not initialized.  I just fixed that.  Do
you see another one?

@rsc
Copy link
Contributor

rsc commented Jun 12, 2014

Comment 10:

It's just a C compiler. It doesn't zero local variables. However, the function is called
very early on, in a program that starts with a zeroed stack. So the content is not
random garbage. It is the data left over from runtime.check's variables, which are
constants. That might help things a little or at least explain why this hasn't been
flaky.

@gopherbot
Copy link
Author

Comment 12 by amluto:

num_sym was affected as well, but that looks like it's going away.

@ianlancetaylor
Copy link
Contributor

Comment 13:

Yes, you say the new VDSO has no normal symbol table, so there is no more num_sym.

@krasin
Copy link

krasin commented Jun 12, 2014

Comment 14:

Ack. Sorry for the issues.
re: uninitialized field. Is there a way to build it with AddressSanitizer and use it in
the regular Go tests? It would have caught the non-initialized field issue.

@gopherbot
Copy link
Author

Comment 15 by amluto:

I think it wouldn't have caught it except on 3.16 :(
Could you test the new code to make sure it still works if there's no vdso at all?  I
think there are old kernels like that.

@rsc
Copy link
Contributor

rsc commented Jun 12, 2014

Comment 16:

There is no AddressSanitizer. We just write as little C code as we can. 
I'm still annoyed we're even having this discussion. Glibc and/or the Linux kernel is
fundamentally flawed if we have to go to all this effort just to figure out how to read
the time of day. </rant>

@ianlancetaylor
Copy link
Contributor

Comment 17:

If there is no VDSO there is presumably no AT_SYSINFO_EHDR, in which case none of this
code will run anyhow.

@ianlancetaylor
Copy link
Contributor

Comment 18:

Well, we don't need this effort to read the time of day.  We just need it to read the
time more quickly.
But I agree that the approach seems broken.  Considerable effort is being pushed into a
number of tools for little clear reason.

@gopherbot
Copy link
Author

Comment 19 by amluto:

If you have a better idea for how the kernel could provide this service more simply, I'm
all ears.  Note that it's a huge speedup: it's something like 21ns vs 100ns. 
CLOCK_MONOTONIC_COARSE is even more dramatic.  And on 32-bit systems, you need the same
code to avoid using int $0x80 for all syscalls, which is remarkably slow.
We used to provide a service by which you could ask for the time by calling
0xffffffffff600000, but that was a really juicy target for exploits.
Note that Go is a bit unusual here: basically everything else already has a perfectly
function ELF loader that can handle this.  And I did provide a nice CC0-licensed piece
of reference code that worked fine, modulo Go-specific stack size issues.

@krasin
Copy link

krasin commented Jun 13, 2014

Comment 20:

I recall that the stack size was the major pain when writing this code ~2 years ago. The
code had to become less clear and it was quite hard to test it.

@gopherbot
Copy link
Author

Comment 21 by amluto:

Out of curiosity, why does stack size matter?  Is this code not run on the initial
kernel-provided stack?

@ianlancetaylor
Copy link
Contributor

Comment 22:

Basically you want the kernel to provide a mapping for a small number of magic symbols
to addresses that can be called at runtime.  In other words, you want to map a small
number of indexes to addresses.  I can think of many different ways to handle that in
the kernel.  I don't think the first mechanism I would reach for would be for the kernel
to create an in-memory shared library.  It's kind of a baroque mechanism for
implementing a simple table.
It's true that dynamically linked programs can use the ELF loader.  But the ELF loader
needed special changes to support VDSOs.  And so did gdb.  And this approach doesn't
help statically linked programs much.  And glibc functions needed to be changed anyhow
to be aware of the VDSO symbols.  So as far as I can tell, all of this complexity really
didn't get anything for free.  It just wound up being complex.
All just my opinion, of course.

@ianlancetaylor
Copy link
Contributor

Comment 23:

The stack size issue was kind of a red herring.  The linker has special code to catch
stack size overflows, but that code was being triggered inappropriately.  It
unfortunately caused a lot of reworking of the code but it was never meaningful.

@gopherbot
Copy link
Author

Comment 24 by amluto:

Fair enough.  If I'd done this from scratch, I might not have used an ELF file.
OTOH, debuggers probably do want a real ELF image: they can look at the DWARF call-frame
information, and, if they're really fancy, they can even find the debug symbols that we
helpfully install (sometimes) in /lib/modules/KERNELVER/vdso using the ELF build-id
mechanism.  So it's not that bad.
The actual cause of the 3.16 issue is that I'm trying to clean this stuff up.  It was a
giant mess two years ago, and it's gradually getting better -- at least we're converging
on one way of doing this instead of the giant mess that used to exist.

@krasin
Copy link

krasin commented Jun 13, 2014

Comment 25:

*not an actual proposal, just thoughts*
I guess these parser issues will soon become the source of mysterious crashes of Go
servers running on the newer kernels and will potentially affect many people.
Would it be possible for the kernel to detect the bogus programs before they are
executed, and don't provide VDSO for them? In this case, they will run slower, but at
least they won't crash. 
If implemented, it would make sense to keep this compatibility hack for a couple of
kernel releases (say, until 3.18), until most of the actively-being-worked-on Go
programs are recompiled, and then just drop the hack.
* again, that's not a well-thought proposal and I don't even know, if such behavior is
appropriate for the kernel even on a temporary basis. Also, I don't see an easy marker
to detect the binaries affected by the bug *

@gopherbot
Copy link
Author

Comment 26 by amluto:

I've send this patch:
https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=vdso/cleanups
the idea is that the kernel will provide a fake section table that contains exactly this:
const __visible struct elf64_shdr vdso_fake_sections[] = {
 {
 .sh_type = SHT_DYNSYM,
 .sh_entsize = sizeof(Elf64_Sym),
 }
};
Old Go programs will see it, initialize num_sym to zero, and not find any symbols.
The general kernel policy is to keep these things around for a very long time, if not
forever.  Sometimes they get hidden behind a configuration option if they become a
maintenance burden (e.g. CONFIG_COMPAT_VDSO, which used to be a giant hack to support a
particular broken glibc minor version, and now just disables the vDSO if selected).

@krasin
Copy link

krasin commented Jun 13, 2014

Comment 27:

Thanks for doing this, Andy.
I regret that I didn't come to an idea to ask you for a code review when I wrote this
code.

@gopherbot
Copy link
Author

Comment 28 by amluto:

Bah.  I should have done it anyway -- I had the issue starred, so it made it to my inbox.
Would you be willing to test the new code on a 3.16 kernel with and without my patch?

@krasin
Copy link

krasin commented Jun 13, 2014

Comment 29:

Sure. Will do tomorrow.

@gopherbot
Copy link
Author

Comment 30 by amluto:

Shameless plug: if you want a really easy way to do this, give virtme ([1] or [2]) a try:
[1] https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git/
[2] https://github.com/amluto/virtme
I've been using virtme-run --kimg arch/x86/boot/bzImage --console to test this stuff.
Would it be possible to remove the vsyscall fallback part?  Admittedly, getting
everything (i.e. Go, glibc, etc) to get rid of the vsyscall fallbacks may be a lost
cause, but these days falling back to ordinary syscalls will be much faster than falling
back to the vsyscall page.
Also, I'm not 100% convinced by your symbol version checking code.  I probably need to
remind myself exactly how this is supposed to work.  The version checking code in my
parser is much more complicated, and I assume there's a reason I did that.

@gopherbot
Copy link
Author

Comment 31 by amluto:

I see what's going on.  I think you should abort if vdso_find_version fails, rather than
just assuming that any version is okay.  If someone ever releases a vdso that doesn't
have a "LINUX_2.6" (which is very unlikely), then you might match a symbol that
shouldn't match.

@krasin
Copy link

krasin commented Jun 13, 2014

Comment 32:

re: virtme: thanks for the pointer, I will try it.

@rsc
Copy link
Contributor

rsc commented Jun 13, 2014

Comment 33:

I don't really want to have this conversation again in 2 years. Can we just import the
Linux code please?
Like https://golang.org/cl/105950043 (untested).

@ianlancetaylor
Copy link
Contributor

Comment 34:

The static vdso_info variable makes me sad.
Anyhow, this is standard ELF dynamic loading code.  It doesn't have to be magic.  Where
we went wrong last time was using the standard symbol table rather than sticking to
dynamic linker info.  That was probably my fault, and I'm certainly sorry now.

@rsc
Copy link
Contributor

rsc commented Jun 13, 2014

Comment 35:

Why is the static vdso_info a problem? I think it is 8 words long.

@ianlancetaylor
Copy link
Contributor

Comment 36:

I tweaked the version code so that an unknown version does not match.
I filed issue #8200 to stop using vsyscalls.

@ianlancetaylor
Copy link
Contributor

Comment 37:

I updated the CL to precompute the hash codes.

@ianlancetaylor
Copy link
Contributor

Comment 38:

This issue was closed by revision 1db4c8d.

Status changed to Fixed.

@ianlancetaylor
Copy link
Contributor

Comment 39:

Labels changed: added repo-main, release-go1.3.1.

@krasin
Copy link

krasin commented Jun 15, 2014

Comment 40:

Hi Andy,
I am now have the system setup for testing, but I realized that the branch you referred
to is now deleted:
https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=vdso/cleanups
What's the new one to try?

@krasin
Copy link

krasin commented Jun 15, 2014

Comment 41:

Nevermind. I found
torvalds/linux@c728762
Will test.

@krasin
Copy link

krasin commented Jun 17, 2014

Comment 42:

For the record: 3.16 with the vdso kernel hack works fine with both old and new Go
binaries.
root@(none):/home/krasin# ./all.sh 
Running on the system:
Linux (none) 3.16.0-rc1 #3 SMP Mon Jun 16 15:08:12 PDT 2014 x86_64 x86_64 x86_64
GNU/Linux
Before fix
Hello, it's  2014-06-17 00:04:18.507931 -0700 PDT
Go runtime:  go1.2.2
After fix
Hello, it's  2014-06-17 00:04:18.684539441 -0700 PDT
Go runtime:  devel +2dc2bf98e6e3 Mon Jun 16 16:32:47 2014 -0700
Also, it works with no issues on older kernels:
./all.sh 
Running on the system:
Linux vdso 3.13.11.2-custom #3 SMP Sat Jun 14 12:36:54 PDT 2014 x86_64 x86_64 x86_64
GNU/Linux
Before fix
Hello, it's  2014-06-17 00:07:11.235963278 -0700 PDT
Go runtime:  go1.2.2
After fix
Hello, it's  2014-06-17 00:07:11.240788594 -0700 PDT
Go runtime:  devel +2dc2bf98e6e3 Mon Jun 16 16:32:47 2014 -0700
Thanks, Andy, Ian, Russ.

@krasin
Copy link

krasin commented Jun 17, 2014

Comment 43:

Test Go program:
$ cat ~/hello.go 
package main
import (
        "fmt"
        "time"
        "runtime"
)
func main() {
   fmt.Println("Hello, it's ", time.Now())
   fmt.Println("Go runtime: ", runtime.Version())
}

@rsc
Copy link
Contributor

rsc commented Aug 11, 2014

Comment 44:

It looks like 1.3 will continue to work on Linux 3.16 because of some kernel
adjustments, so this can be delayed until 1.4.

Labels changed: added release-go1.4, removed release-go1.3.1.

@rsc rsc added this to the Go1.4 milestone Apr 14, 2015
@rsc rsc removed the release-go1.4 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 25, 2016
wheatman pushed a commit to wheatman/go-akaros that referenced this issue Jun 25, 2018
Reportedly in the Linux 3.16 kernel the VDSO will not have
section headers or a normal symbol table.

Too late for 1.3 but perhaps for 1.3.1, if there is one.

Fixes golang#8197.

LGTM=rsc
R=golang-codereviews, mattn.jp, rsc
CC=golang-codereviews
https://golang.org/cl/101260044
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants