Skip to content
  • Dave Chinner's avatar
    xfs: log recovery lsn ordering needs uuid check · 566055d3
    Dave Chinner authored
    
    
    After a fair number of xfstests runs, xfs/182 started to fail
    regularly with a corrupted directory - a directory read verifier was
    failing after recovery because it found a block with a XARM magic
    number (remote attribute block) rather than a directory data block.
    
    The first time I saw this repeated failure I did /something/ and the
    problem went away, so I was never able to find the underlying
    problem. Test xfs/182 failed again today, and I found the root
    cause before I did /something else/ that made it go away.
    
    Tracing indicated that the block in question was being correctly
    logged, the log was being flushed by sync, but the buffer was not
    being written back before the shutdown occurred. Tracing also
    indicated that log recovery was also reading the block, but then
    never writing it before log recovery invalidated the cache,
    indicating that it was not modified by log recovery.
    
    More detailed analysis of the corpse indicated that the filesystem
    had a uuid of "a4131074-1872-4cac-9323-2229adbcb886" but the XARM
    block had a uuid of "8f32f043-c3c9-e7f8-f947-4e7f989c05d3", which
    indicated it was a block from an older filesystem. The reason that
    log recovery didn't replay it was that the LSN in the XARM block was
    larger than the LSN of the transaction being replayed, and so the
    block was not overwritten by log recovery.
    
    Hence, log recovery cant blindly trust the magic number and LSN in
    the block - it must verify that it belongs to the filesystem being
    recovered before using the LSN. i.e. if the UUIDs don't match, we
    need to unconditionally recovery the change held in the log.
    
    This patch was first tested on a block device that was repeatedly
    causing xfs/182 to fail with the same failure on the same block with
    the same directory read corruption signature (i.e. XARM block). It
    did not fail, and hasn't failed since.
    
    Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
    Reviewed-by: default avatarBen Myers <bpm@sgi.com>
    Signed-off-by: default avatarBen Myers <bpm@sgi.com>
    566055d3