Intro
Today I once again backed up a machine that has served me well for years. As often happens with old hardware, some of its hard drive sectors have apparently gone bad over time. I found these messages in syslog:
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=23181294, sector=23181191
ide: failed opcode was: unknown
end_request: I/O error, dev hda, sector 23181191
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=23181294, sector=23181199
ide: failed opcode was: unknown
end_request: I/O error, dev hda, sector 23181199
These are bad sectors, and they should be added to the bad-blocks inode of the ext3 filesystem the machine uses. A quick Google search turned up tons of tutorials on how to run badblocks to find bad sectors, but I didn't find one that tells you how to add sectors that have already been reported. So here it comes.
A little background on how ext3 sits on your hard drive
A very good introduction is http://www2.uic.edu/~aciani1/sector_blues.html. Below is my own humble explanation in case the page is gone.
The kernel presents your hard drive as one long file. In my case that file is /dev/hda. Because of the way hard drives work, the unit of access is the sector, which is typically 512 bytes long. That means if you want to modify byte 1 of your file, you have to read the first sector (bytes 0 to 511), modify byte 1, and then write that whole sector (bytes 0 to 511) back to the disk. Although only one byte changed, we read 512 bytes in and wrote 512 bytes out. That's why a hard drive is called a block device: it reads and writes whole blocks of bytes. These blocks are called physical blocks.
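The kernel and the drive handle this for you, but the read-modify-write cycle is easy to mimic with dd. The sketch below works on a scratch file standing in for a disk with 512-byte sectors, so nothing real is touched: it copies out the whole first sector, changes byte 1 in the copy, and writes all 512 bytes back.

```shell
# Mimics a one-byte write on a 512-byte-sector device, using a scratch
# file instead of a real disk (the file and its size are stand-ins).
img=$(mktemp)
trap 'rm -f "$img" "$img.sector"' EXIT

# A 4-sector "disk" of zero bytes.
dd if=/dev/zero of="$img" bs=512 count=4 2>/dev/null

# 1. Read the whole first sector (bytes 0..511) into a copy.
dd if="$img" of="$img.sector" bs=512 count=1 2>/dev/null
# 2. Change byte 1 in the copy.
printf 'X' | dd of="$img.sector" bs=1 seek=1 conv=notrunc 2>/dev/null
# 3. Write the complete 512-byte sector back to sector 0.
dd if="$img.sector" of="$img" bs=512 count=1 conv=notrunc 2>/dev/null
```

The "disk" keeps its size of 2048 bytes; only the content of sector 0 differs.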
A hard drive normally contains several partitions that divide that file into smaller sections. fdisk and parted are two common Linux tools to view and manipulate partitions. A partition can be subdivided further (a common example is the MS-DOS "extended partition"), or it can contain a filesystem. The machine in question has two partitions: the first (/dev/hda1) holds the root filesystem and the second (/dev/hda2) is the swap partition.
Filesystems in turn subdivide their partition into logical blocks: each logical block is a fixed number of consecutive physical blocks. ext2/3/4 are so-called inode-based filesystems, which means a file is identified by a number (its inode number), not by a filename. Filenames are entries in directories that point to inodes. Some blocks store metadata about the filesystem itself, such as the superblock; other blocks store information about files or directories, or their contents.
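To see the name/inode split in practice, a hard link gives one inode two names. A minimal sketch, assuming GNU coreutils stat (ls -i shows the same inode numbers):

```shell
# Assumes GNU coreutils stat (-c format option); works in a temp dir.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

echo hello > "$dir/original"
# A hard link adds a second directory entry for the same inode.
ln "$dir/original" "$dir/alias"

# %i = inode number, %h = number of names (hard links), %n = file name.
# Both lines show the same inode number and a link count of 2.
stat -c '%i %h %n' "$dir/original" "$dir/alias"
```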
Marking a physical block bad in ext2/3/4
To mark the reported physical blocks as "bad" and thus prevent them from being used, we have to go through these steps:
- Map the physical block of the hard drive to a physical block of the partition
- Map the physical block of the partition to a logical block of the partition
- (Optionally) check what file this block belongs to
- Add the block to the "bad blocks" inode.
The first step can be done with fdisk by listing the boundaries of the partition in physical blocks.
root@host ~ $ fdisk -lu /dev/hda
Disk /dev/hda: 20.0 GB, 20003880960 bytes
255 heads, 63 sectors/track, 2432 cylinders, total 39070080 sectors
Units = sectors of 1 * 512 = 512 bytes
Disk identifier: 0xaffeaffe
Device Boot Start End Blocks Id System
/dev/hda1 * 63 38090114 19045026 83 Linux
/dev/hda2 38090115 39070079 489982+ 82 Linux swap / Solaris
The table above means that the first physical block ("sector") of /dev/hda1 is block 63 of /dev/hda. Both sector 23181191 and sector 23181199 from the error messages at the beginning of the page belong to /dev/hda1, since they lie between the Start and End of that partition. To get the partition-relative address of these two sectors, we need to subtract 63 from them.
root@host ~ $ echo $((23181191 - 63))
23181128
root@host ~ $ echo $((23181199 - 63))
23181136
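Reading Start and End off the table by hand works for two sectors. For a longer list, a small awk filter over the fdisk -lu listing can do the lookup. sector_to_partition is a hypothetical helper name, and the sketch assumes the column layout shown above (Device, optional Boot flag, Start, End, ...):

```shell
# Usage: fdisk -lu /dev/hda | sector_to_partition SECTOR
# Reads fdisk's partition listing on stdin and prints the partition that
# contains SECTOR, followed by SECTOR's offset from that partition's start.
# Exits non-zero if no listed partition contains SECTOR.
sector_to_partition() {
    awk -v s="$1" '
        BEGIN { s = s + 0 }
        $1 ~ /^\/dev\// {
            # The boot flag "*" (if present) shifts Start/End one column right.
            if ($2 == "*") { first = $3 + 0; last = $4 + 0 }
            else           { first = $2 + 0; last = $3 + 0 }
            if (s >= first && s <= last) { print $1, s - first; found = 1 }
        }
        END { exit !found }'
}
```

With the table above, sector_to_partition 23181191 prints the partition /dev/hda1 and the offset 23181128, the same number as the manual subtraction.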
The second step is to map these two physical blocks of the partition to logical ones. The logical block size can be determined using dumpe2fs:
root@host ~ $ /sbin/dumpe2fs /dev/hda1 | grep size
Block size: 4096
Fragment size: 4096
Inode size: 128
Journal size: 128M
In this case the logical block size is 4096 bytes. Since the physical block size is 512 bytes, every 8 (4096/512) physical blocks are grouped into one logical block. The first block of an ext2/3/4 partition is block 0, so we can calculate the logical block number like this:
root@host ~ $ echo $((23181128 / 8))
2897641
root@host ~ $ echo $((23181136 / 8))
2897642
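The same two steps (read the block size, then divide) can be wrapped in shell functions. fs_block_size and partition_sector_to_block are hypothetical helper names; the sketch assumes dumpe2fs prints a "Block size:" line as above (dumpe2fs -h limits the output to the superblock, which includes it) and that sectors are 512 bytes unless told otherwise. The remainder tells you which of the 8 sectors inside the block is the bad one:

```shell
# Usage: dumpe2fs -h /dev/hda1 2>/dev/null | fs_block_size
# Prints the value of the "Block size:" line from dumpe2fs output.
fs_block_size() {
    awk -F: '$1 == "Block size" { gsub(/[ \t]/, "", $2); print $2 }'
}

# Usage: partition_sector_to_block SECTOR BLOCK_SIZE [SECTOR_SIZE]
# SECTOR is counted from the start of the partition (block 0 starts there).
# Prints the filesystem block that holds SECTOR and SECTOR's index
# within that block. SECTOR_SIZE defaults to 512 bytes.
partition_sector_to_block() {
    per_block=$(( $2 / ${3:-512} ))
    echo "$(( $1 / per_block )) $(( $1 % per_block ))"
}
```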
Now it would be interesting to feed these two logical blocks into e2fsck with the -l parameter, but I couldn't find a description of the file format it expects. To be continued.
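For what it's worth, the e2fsck(8) man page says the file given to -l has the same format as the output of badblocks(8), and that the block numbers are based on the filesystem's block size, which is exactly what was computed above. badblocks prints one block number per line, so a list file could be prepared like this. This is a sketch under that reading of the man page, not a tested recipe:

```shell
# Writes the logical block numbers computed above into a list file, one
# decimal block number per line (the badblocks(8) output format).
# The block numbers and /dev/hda1 are the ones from this article.
badlist=$(mktemp)
printf '%s\n' 2897641 2897642 > "$badlist"

# Deliberately only printed, not run: e2fsck must be run on an
# unmounted filesystem (for /dev/hda1, the root fs, that means booting
# a rescue system first).
echo "e2fsck -l $badlist /dev/hda1"
```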
Determining names of damaged files
If you want to know which files were damaged, you can find out using debugfs. An excellent tutorial on how to do this is http://www.linuxjournal.com/article/0193.
In my case all damaged blocks belonged to the same file:
# sector numbers from the kernel log -> 4 KiB blocks of /dev/hda1 -> owning inodes
grep UncorrectableError /var/log/messages \
  | sed -n 's!.*sector=\([0-9]*\)$!\1!p' | sort -un \
  | while read -r sector; do echo $(( (sector - 63) / 8 )); done | sort -un \
  | xargs echo icheck | debugfs /dev/hda1
debugfs: Block Inode number
2897641 1436251
2897642 1436251
If you want to run the command above on your own machine, replace /dev/hda1 with your partition, the 63 with the first sector of that partition, and the 8 with your filesystem's block size divided by 512.
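icheck only gets you as far as inode numbers. debugfs's ncheck command maps inode numbers back to path names (find / -xdev -inum 1436251 would also work, but it walks the whole filesystem). Below is a small filter that pulls the inode numbers out of the icheck output; icheck_inodes is a hypothetical helper name:

```shell
# Usage: ... | debugfs /dev/hda1 | icheck_inodes
# Extracts the unique inode numbers from debugfs "icheck" output, whose
# data lines are "BLOCK INODE"; the header line and blocks reported as
# "<block not found>" are skipped because they are not two numbers.
icheck_inodes() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $2 }' | sort -un
}

# Example (as root; blocks and partition are the ones from this article):
#   echo icheck 2897641 2897642 | debugfs /dev/hda1 | icheck_inodes \
#     | xargs echo ncheck | debugfs /dev/hda1
```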