Stability Problem with Mercury & Jaunty

kcoop:

I had a soft lockup this morning on my 32-bit Mercury instance (AMI ami-9f9f70f6). No use of XFS at all. Rebooting failed with more soft lockups. I've appended the initial error output, fwiw.

As I searched for the issue, I ran into the following from Eric Hammond. The kernel we use sounds like the same one - is it?

[EDIT: oops - misread Eric's post. The security problem is referring to an earlier build. My bad.]

http://developer.amazonwebservices.com/connect/message.jspa?messageID=15...

danpeg1:

It was started with ami-ccf615a5 (alestic/ubuntu-9.04-jaunty-base-20091011.manifest.xml). I'm going to start up a new one now, but was curious what happened.

Eric Hammond:

If you're using XFS, that AMI uses an Amazon kernel which does not support XFS (reboot triggering the problem).

If you need XFS, you might switch to the 20090804 version of that AMI, though the Amazon kernel there has a security hole which allows any logged in user to become root.

Better yet, upgrade to the Karmic AMIs published by Canonical which use true Ubuntu kernels that do not have problems with XFS or the security hole.
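For anyone wanting to double-check the "no XFS" part on their own instance, here is a minimal sketch. It assumes the standard Linux /proc/mounts and /proc/filesystems text formats; the function names are hypothetical helpers made up for illustration, not part of Mercury or any AMI tooling. Note that /proc/filesystems only lists filesystems the running kernel has registered, so an XFS module that exists but is not loaded would not show up.

```python
def mounts_by_fstype(mounts_text, fstype):
    """Return (device, mount_point) pairs whose filesystem type matches.

    mounts_text is expected to be in /proc/mounts format (an assumption):
    device, mount point, fs type, options, dump, pass.
    """
    matches = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == fstype:
            matches.append((fields[0], fields[1]))
    return matches


def kernel_lists_fstype(filesystems_text, fstype):
    """Return True if fstype appears in /proc/filesystems-style text.

    Lines look like "nodev<TAB>proc" or "<TAB>ext3"; the type is the
    last whitespace-separated field on each line.
    """
    for line in filesystems_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == fstype:
            return True
    return False
```

On a running instance you would pass in the contents of /proc/mounts and /proc/filesystems, e.g. `mounts_by_fstype(open("/proc/mounts").read(), "xfs")`. An empty list means nothing is currently mounted as XFS.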

BUG: soft lockup detected on CPU#0!
[<c104a2b2>] softlockup_tick+0xa5/0xb4
[<c1009814>] timer_interrupt+0x55f/0x5b7
[<ee0118bd>] do_blkif_request+0x32e/0x377 [xenblk]
[<c1129d7e>] __add_entropy_words+0x56/0x18b
[<c104a53a>] handle_IRQ_event+0x36/0x6e
[<c104b9f2>] handle_level_irq+0x81/0xc7
[<c104b971>] handle_level_irq+0x0/0xc7
[<c100719a>] do_IRQ+0xac/0xd2
[<c114f086>] evtchn_do_upcall+0x82/0xdb
[<c100585e>] hypervisor_callback+0x46/0x4e
[<c1201b17>] _spin_lock+0xa/0xf
[<c10181e9>] mm_unpin+0x14/0x23
[<c101835f>] _arch_exit_mmap+0x167/0x16f
[<c1062769>] exit_mmap+0x1c/0xee
[<c101f2c3>] __cond_resched+0x25/0x3c
[<c10219ef>] mmput+0x34/0x78
[<c1026457>] do_exit+0x215/0x730
[<c10063bd>] die+0x227/0x24c
[<c10067cb>] do_invalid_op+0x0/0xab
[<c100686d>] do_invalid_op+0xa2/0xab
[<c101bdf3>] xen_l2_entry_update+0x8d/0x98
[<c1057b78>] __do_page_cache_readahead+0x8b/0x202
[<c105645c>] get_page_from_freelist+0x28c/0x33c
[<c1201edd>] error_code+0x35/0x3c
[<c105007b>] utrace_report_vfork_done+0x5a/0x156
[<c101bdf3>] xen_l2_entry_update+0x8d/0x98
[<c105e16d>] __pte_alloc+0x1e1/0x269
[<c105fc4b>] __handle_mm_fault+0x199/0x1146
[<c10e1738>] prio_tree_remove+0x8d/0x9b
[<c1061366>] free_pgtables+0x70/0x7c
[<c105a71a>] vma_prio_tree_insert+0x17/0x2a
[<c1203862>] do_page_fault+0x72d/0xc24
[<c10641ec>] do_mmap_pgoff+0x584/0x6ec
[<c1203135>] do_page_fault+0x0/0xc24
[<c1201edd>] error_code+0x35/0x3c

Comments

EBS?

joshk:

Are you running off an EBS volume or have you done anything else to alter the base AMI? We've run the system through many a reboot without issues, but I've personally had some problems in the past with improperly formatted/mounted EBS volumes.

I didn't have an EBS volume mounted

kcoop:

I'm pretty sure it was a stock configuration (other than my own /var/www and MySQL data). It had been running for about a week. The problem wasn't so much the reboot as that the instance failed while running and was then left in a bad state. Have you had instances running for long periods of time?
