
plan9front - issue #213

Fresh 9front install in QEMU hangs at boot


Posted on Sep 28, 2014 by Swift Bird

What happened: I installed 9front onto a qcow2 image through QEMU. I finished the installation, and QEMU rebooted. I killed the boot, and restarted QEMU to boot into the 9front installation. However, it hung just after reaching the boot sector.

What was expected: After printing that the PC is booting from the Hard Drive / MBR, 9front should boot.

Steps to reproduce:
0. Install QEMU (I use Homebrew's bottled QEMU 2.1.2)
1. Download 9front ISO (9front-3853.02ebd469f43a.iso.bz2)
2. Create new QEMU image: qemu-img create -f qcow2 9front.qcow2.img 20G
3. Boot 9front ISO: qemu-system-i386 -hda 9front.qcow2.img -cdrom 9front-3853.02ebd469f43a.iso -boot d -vga std -m 1G
4. Install 9front onto "9front.qcow2.img"
5. Press enter at [finish], QEMU reboots (into the ISO, since the QEMU call hasn't changed)
6. Kill QEMU
7. Boot 9front installation: qemu-system-i386 -hda 9front.qcow2.img -boot c -vga std -m 1G
8. Boot hangs


Comment #1

Posted on Sep 28, 2014 by Happy Hippo

that seems odd. there are way too many dots here. the pbs is responsible for loading the 2nd stage loader "9bootfat" from the root of the 9fat partition.

boot the iso, and in a rio window, type:

9fs 9fat

then compare the files:

/n/9fat/9bootfat with /386/9bootfat

like:

ls -l /n/9fat/9bootfat /386/9bootfat
md5sum /n/9fat/9bootfat /386/9bootfat

they should be identical.

Comment #2

Posted on Sep 28, 2014 by Swift Bird

So /n/9fat/9bootfat didn't exist. I reran the installation, and I tried hjfs as well, but it didn't help.

Comment #3

Posted on Sep 28, 2014 by Happy Hippo

that would make sense, the pbs is scanning the root directory looking for the file.

the 9fat partition is set up in the "bootsetup" step of the install process. re-run this step and check for error messages (scroll up if they scrolled away).

Comment #4

Posted on Sep 28, 2014 by Swift Bird

Okay, just reinstalled again. If I run md5sum /n/9fat/9bootfat /386/9bootfat at the end of the installation, just before finishing and rebooting, then /n/9fat/9bootfat is present, and the md5 sums match.

However, after rebooting, the boot still hangs. I'm stuck on the divide error bug, but I bet /n/9fat/9bootfat wouldn't exist, if I could get to rio.

Comment #5

Posted on Sep 28, 2014 by Happy Hippo

maybe we're just rebooting too fast, before qemu flushes its data to the disk? at least you got the kernel booted now. the divide by zero panic is caused by the stats(1) command reading /dev/sysstat (this little window graphing system statistics).

you might just wait a bit before hitting enter on the bootargs prompt to avoid this.

you can also try the kernel i just made that has the fix:

http://www.felloff.net/usr/cinap_lenrek/9pcf.alexchandel

you can copy it to 9fat renamed as 9pcf.

another thing, the 9front kernel is a multiboot image. you can try loading it directly with qemu with the -kernel option. plan9.ini (contents) can be passed as -initrd option.

Comment #6

Posted on Sep 28, 2014 by Grumpy Bird

alexchandel: remember you need to run 9fs 9fat to mount the 9fat partition. /n/9fat will not be mounted until you do so. also note: /n/9fat will only be accessible from the same namespace where you run 9fs 9fat.

Comment #7

Posted on Sep 28, 2014 by Swift Bird

Nice, I booted with 9pcf.alexchandel with the command: qemu-system-i386 -hda 9front.qcow2.img -cdrom 9front-3853.02ebd469f43a.iso -boot d -vga std -m 1G -kernel 9pcf.alexchandel -initrd plan9.ini

As soon as the GUI is drawn, a screen with this error flashes: Plan 9 Console i8042: 08 returned to the ea command

It disappears quickly, and then there's a "kernel fault: no user process" panic. I've attached a screenshot.


Comment #8

Posted on Sep 28, 2014 by Grumpy Dog

what is the content of the plan9.ini you passed to qemu?

Comment #9

Posted on Sep 28, 2014 by Swift Bird

@mischief It's:
```
# config for initial cd booting
cdboot=yes
mouseport=ask
monitor=ask
vgasize=ask
bootfile=/386/9pcf
```

Comment #10

Posted on Sep 28, 2014 by Swift Bird

And yeah, I ran 9fs 9fat before checking each time, and from within the same window. In fact /n/9fat was empty. Also it's worth noting that for my past three posts, I chose cwfs64x during installation.

When I use hjfs and boot with qemu-system-i386 -hda 9front.qcow2.img -cdrom 9front-3853.02ebd469f43a.iso -boot d -vga std -m 1G -kernel 9pcf.alexchandel -initrd plan9.ini, the "panic: kernel fault: no user process" error doesn't occur. However, /n/9fat is still empty.

Moreover, booting with qemu-system-i386 -hda 9front.qcow2.img -boot c -vga std -m 1G -kernel 9pcf.alexchandel gives lots of errors, mostly along the lines of "can't open, /rc not found".

Comment #11

Posted on Sep 28, 2014 by Swift Bird

To summarize, 9front appears to create the second stage bootloader on the hard drive during installation, but after rebooting it's gone. Booting off the hard drive hangs; it's only possible to boot off the ISO. Even booting off the hard drive using a kernel image (thus skipping the bootloader) still fails.

Additionally, after installation, if the HD's filesystem is cwfs64x, booting off the ISO will panic with "kernel fault: no user process".

Comment #12

Posted on Sep 28, 2014 by Happy Hippo

decoded the panic, but it makes no sense. it would mean that the machp[0] array contains 0x9 for the mach address of cpu0. this entry gets only set once to a fixed address and then is never touched.

term% ktrace -i f0108507 f0015b24
src(0xf0108507); // dumpstack+0x10
// data at 0xf0015b2c? f0163141
src(0xf0163141); // panic+0xd2
// data at 0xf0015c54? f010867a
src(0xf010867a); // fault386+0xd2
// data at 0xf0015d04? f0107c14
src(0xf0107c14); // trap+0x15b
// data at 0xf0015dc4? f01005ec
src(0xf01005ec); // forkret
//passing interrupt frame; last pc found at sp=0xf0015dc4
// data at 0xf0015e04? f013882e
src(0xf013882e); // ps2mouseputc+0x19
// data at 0xf0015e38? f01f2add
src(0xf01f2add); // i8042intr+0x7a
// data at 0xf0015e58? f0107c14
src(0xf0107c14); // trap+0x15b
// data at 0xf0015f18? f01005ec
src(0xf01005ec); // forkret
//passing interrupt frame; last pc found at sp=0xf0015f18
// data at 0xf0015f58? f010055b
src(0xf010055b); // halt+0xe
// data at 0xf0015f64? f015d946
src(0xf015d946); // idlehands+0x11
// data at 0xf0015f70? f020bee6
src(0xf020bee6); // runproc+0x160
// data at 0xf0015fa4? f020b6b5
src(0xf020b6b5); // sched+0x165
// data at 0xf0015fd0? f020b463
src(0xf020b463); // schedinit+0x85
// data at 0xf0015fe4?

acid: src(0xf013882e); // ps2mouseputc+0x19
/sys/src/9/pc/mouse.c:99
 94	int buttons, dx, dy;
 95
 96	/*
 97	 * Resynchronize in stream with timing; see comment above.
 98	 */
 99	m = MACHP(0)->ticks;
 100	if(TK2SEC(m - lasttick) > 2)
 101		nb = 0;
 102	lasttick = m;
 103
 104	/*

acid: asm(ps2mouseputc)
ps2mouseputc 0xf0138815 SUBL $0x28,SP
ps2mouseputc+0x3 0xf0138818 MOVL packetsize(SB),DI
ps2mouseputc+0x9 0xf013881e MOVL nb$1(SB),SI
ps2mouseputc+0xf 0xf0138824 MOVL c+0x0(FP),BX
ps2mouseputc+0x13 0xf0138828 MOVL machp(SB),AX
ps2mouseputc+0x19 0xf013882e MOVL 0x24(AX),BP <- fault
ps2mouseputc+0x1c 0xf0138831 MOVL BP,CX

Comment #13

Posted on Sep 28, 2014 by Happy Hippo

ok, i could reproduce this now after many tries in qemu for windows. the trick is to keep twitching the mouse constantly during boot. fix committed in rd2af87472b59. see the explanation there. i built another kernel for you to test under:

http://www.felloff.net/usr/cinap_lenrek/9pcf.alexchandel

Comment #14

Posted on Sep 28, 2014 by Swift Bird

The panic no longer occurs. However, I just noticed an abnormality during the install:

Ream the filesystem? (yes, no)[yes]
Starting cwfs64x file server for /dev/sdC0/fscache
Reaming filesystem
bad nvram key
bad authentication id
bad authentication domain
nvrcheck: can't read nvram
config: config: config: auth disabled
config: config: config: config: config: config: config: currnt fs in "main"
cmd_users: cannot access /adm/users
63-bit cwfs as of Thu Sep 4 20:04:10 2014
last boot Sun Sep 28 17:06:33 2014
Configuring cwfs64x file server for /dev/sdC0/fscache
% mount -c /srv/cwfs /n/newfs
Mounting cwfs64x file server for /dev/sdC0other
% mount -c /srv/cwfs /n/other other

The bootsetup still appears error free:
dossrv: serving #s/dos
% dd -bs 512 -count 1 -if /dev/sdC0/9fat -of /tmp/pbs.bak
1+0 records in
1+0 records out
Initializing Plan 9 FAT partition
% disk/format -r 2 -d -b /n/newfs/386/pbs /dev/sdC0/9fat
Initializing FAT file system
type hard, 12 tracks, 255 heads, 63 secors/track, 512 bytes/sec
used 4096 bytes
% mount -c /srv/dos /n/9fat /dev/sdC0/9fat
% rm -f /n/9fat/9bootfat /n/9fat/plan9.ini /n/9fat/9pcf
% cp /n/newfs/386/9bootfat /n/9fat/9bootfat
% chmod +al /n/9fat/9bootfat
% cp /tmp/plan9.ini /n/9fat/plan9.ini
% cp /n/newfs/386/9pcf /n/9fat/9pcf
% cp /tmp/pbs.bak /n/9fat
% unmount /n/9fat

Regardless, /n/9fat is still empty when I reboot and run "9fs 9fat". And attempting to boot into the HD still hangs at "MBR...pbs....."

Moreover, attempting to boot into the HD using the kernel flag (qemu-system-i386 -hda 9front.qcow2.img -boot c -vga std -m 1G -kernel 9pcf.alexchandel) throws bad nvram key errors and more, screenshot attached. When I type ls in the terminal, it errors with:
checktag pc=9b4f cw"/dev/sdC0/fscache"w"/dev/sdC0/fsworm"(11305) tag/path=Tnone/0; expected Tdir
ls: . :phase error -- cannot happen


Comment #15

Posted on Sep 28, 2014 by Happy Hippo

the messages from the installation are expected. these are ok. but after reboot, the fat is missing and the cwfs filesystem is partially corrupted. my guess would be that we're just too fast in rebooting? and qemu doesn't flush the changes out to the qcow image for some reason?

reads and writes to /dev/sdXX/parts are uncached and synchronous. the plan9 kernel has no buffer caches. and dossrv writes immediately. maybe qemu expects us to issue write barriers to really flush stuff to the disk?

maybe just wait a minute after installation when it prompts for the [finish] step?

i can try checking qemu source in the meantime...
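
One way to probe the flush theory from the host side, as a rough sketch: check the image for qcow2-level damage with qemu-img check, then boot with the disk cache mode set to writethrough, so that QEMU reports a guest write as complete only after it has been pushed through to the host file. The image name is the one from the report. Whether the cache mode changes anything on this setup is an assumption to test, not a known fix. The helper name flush_probe_cmds is made up for illustration, and it only prints the commands.

```shell
# Hypothetical helper: prints the commands rather than running them,
# so they can be reviewed before use.
flush_probe_cmds() {
    img=$1
    # Report any qcow2 metadata inconsistencies in the image.
    echo "qemu-img check $img"
    # Boot from the hard disk with write-through caching.
    echo "qemu-system-i386 -drive file=$img,format=qcow2,cache=writethrough -boot c -vga std -m 1G"
}

flush_probe_cmds 9front.qcow2.img
```

If the image survives a reboot with cache=writethrough but not with the default cache mode, that would point at host-side caching rather than at 9front.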

Comment #16

Posted on Sep 28, 2014 by Happy Hippo

short explanation of what checktag messages are:

the cwfs fileserver uses blocks (of 16k in case of cwfs64x) where it stores some redundant checking info at the end (the tag). the tag contains the type of the block (file-data/directory/indirect pointer blocks...) and the qid (file number). it always checks the tag to see that the block it just read is what it expected.

a tag of Tnone/0 means the tag is zero. the block appears to be zeroed out. ... like it was never written.

Comment #17

Posted on Sep 28, 2014 by Swift Bird

I waited ~20 minutes, same result. Is it possible that 9front is corrupting the filesystem when it's shut down? Zeroed out blocks might be a result of qcow2 corruption.

Comment #18

Posted on Sep 28, 2014 by Happy Hippo

cwfs writes changes to disk lazily. that is, there's a background process that flushes dirty blocks to disk. but waiting 20 minutes is a bit crazy. it should be a few seconds at max. even with qemu's slow i/o, not more than 10 seconds max.

dossrv on the other hand writes immediately. the write() syscall will not return until dossrv did the whole roundtrip to disk. what puzzles me is that your fat filesystem is missing.

this corruption cannot be explained with the lazy writing of cwfs.

maybe it has something to do with the qemu configuration? can you try using a sparse file for the disk image? maybe the qcow got damaged with all this testing?

people have used qemu with 9front for a while now, but these issues haven't come up yet.
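
The sparse-file experiment could look roughly like this. It is a sketch under assumptions: the raw image name 9front.raw.img is invented here, the size and other flags are copied from the original report, and the helper raw_image_cmds is hypothetical and only prints the commands. qemu-img creates raw images as sparse files on host filesystems that support them.

```shell
# Hypothetical helper: prints the commands for the raw-image experiment.
raw_image_cmds() {
    raw=$1
    iso=$2
    # Create a fresh raw image, taking qcow2 out of the picture.
    echo "qemu-img create -f raw $raw 20G"
    # Reinstall from the ISO onto the raw image.
    echo "qemu-system-i386 -drive file=$raw,format=raw -cdrom $iso -boot d -vga std -m 1G"
    # Then boot the installed system from the raw image.
    echo "qemu-system-i386 -drive file=$raw,format=raw -boot c -vga std -m 1G"
}

raw_image_cmds 9front.raw.img 9front-3853.02ebd469f43a.iso
```

If the 9fat contents survive a reboot on the raw image, the qcow2 layer becomes the main suspect.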

Comment #19

Posted on Sep 28, 2014 by Happy Hippo

another theory. maybe the ide controller that qemu emulates doesn't work right?

you could try using virtio instead.
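
As a sketch, switching the disk from the emulated IDE controller to virtio means replacing -hda with a -drive option that sets if=virtio. The remaining flags are copied from the original report, and the helper virtio_boot_cmd is hypothetical and only prints the command line. One assumption to verify: under virtio the disk will probably not show up as /dev/sdC0 in the guest, so the install target and bootargs may need adjusting.

```shell
# Hypothetical helper: prints a QEMU command line that attaches the
# image through virtio-blk instead of the emulated IDE controller.
virtio_boot_cmd() {
    img=$1
    echo "qemu-system-i386 -drive file=$img,if=virtio -boot c -vga std -m 1G"
}

virtio_boot_cmd 9front.qcow2.img
```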

Comment #20

Posted on Sep 28, 2014 by Swift Bird

Just noticed, when I restart QEMU by entering fshalt in 9front, and then system_reset in the QEMU console, the filesystem is preserved, and /n/9fat has its contents. However, if I restart QEMU in any other way, including killing it while 9front is idle, then the filesystem is corrupted.

Comment #21

Posted on Sep 28, 2014 by Swift Bird

http://wiki.qemu.org/Features/Qcow2DataIntegrity recommends using I/O barriers to avoid data corruption.

Comment #22

Posted on Sep 28, 2014 by Swift Bird

Never mind, I was using the wrong image. fshalt/system_reset still results in a corrupted filesystem.

Comment #23

Posted on Dec 28, 2014 by Grumpy Dog

any progress here? is this still reproducible?

Comment #24

Posted on Jan 1, 2015 by Swift Bird

The newest ISO, 9front-4045, still exhibits the same hanging behavior when the reported steps are performed (install, [finish], QEMU restarts, kill QEMU, restart QEMU without cdrom arg, hangs at boot).

Comment #25

Posted on Jan 2, 2015 by Grumpy Dog

are you still using qemu 2.1.2, and the same qemu arguments as in the original bug report? i can try to reproduce on this version, but i only have linux to test on. i have never had a problem like you described, and i've tried quite a number of qemu versions during ethervirtio development.. it could be an osx-specific issue, or an issue with how brew packages qemu..

Status: NeedsTesting

Labels:
Type-Other Priority-Low