My dCache notes

Usage Info
Webpage

CMS disk usage
CMS User

Check if a storage node is just too busy
On the storage node: iostat -x 2; look at %util (for the sd* devices) and rkB/s ('r' is for 'read').
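For example, a quick look (five 2-second samples, devices only):
iostat -x -d 2 5
If %util is pinned near 100 and rkB/s is large on the data disks, the node is just saturated with reads and there is nothing to fix on the dCache side.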

Finding the pool a file is stuck in
(see also: Checking file integrity)
Take a look at the restoreHandler to find the pnfsid
On gfe02:
su - postgres
psql -U postgres chimera
select * from t_locationinfo where ipnfsid='0003000000000000027A93E8';
(the ilocation column is the pool the file is on)
\q (to log out)
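The same lookup as a one-liner (same pnfsid as above; run on gfe02 as postgres; it prints just the pool):
psql -U postgres chimera -t -c "select ilocation from t_locationinfo where ipnfsid='0003000000000000027A93E8'"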

To find all pnfsids on a pool
psql chimera -t -c "select ipnfsid from t_locationinfo where ilocation='sedsk18other_0'"
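A rough, untested sketch for the decommissioning case: dump everything on the pool and flag files that have no copy on any other pool. The pool name and the /tmp file are just examples; run as postgres on gfe02 as above.
psql chimera -t -c "select ipnfsid from t_locationinfo where ilocation='sedsk18other_0'" > /tmp/sedsk18other_0.ids
while read id; do
  [ -z "$id" ] && continue
  n=$(psql chimera -t -c "select count(*) from t_locationinfo where ipnfsid='$id' and ilocation<>'sedsk18other_0'" | tr -d ' ')
  [ "$n" = "0" ] && echo "only copy is on this pool: $id"
done < /tmp/sedsk18other_0.ids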


Old stuff (some of this is probably obsolete)

Related pages
Adding a new VO to dCache.
Setting up a new storage node.
Decommissioning a storage node

Set the pool read-only, but (unlike what the documentation suggests) leave it enabled. Do either:
cd [pool_name]
pool enable -rdonly
or
cd PoolManager
psu set pool [pool_name] rdonly

Use the 'migration move' command:
cd [pool_name]
migration move [target_pool] (make sure source and target are associated with the same VO)
Set the concurrency to 2 (not sure what the optimal number is, but 24 is definitely too big, as I found out by accident):
migration concurrency [job nr] 2
rep ls
unregister pnfs
Files in an error state will not unregister - force them with rep rm -force [pnfsid]
It is not possible to remove 'precious' (<-P---------(0)[0]>) files even if they are obviously dead (i.e. the move keeps failing over days). In this case the only solution seems to be to set the error bit by hand and then remove them: rep set bad [pnfsid] on, followed by rep rm -force [pnfsid]. If this doesn't work, then banging your head against the wall might do.
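For reference, the whole drain strung together as typed into the admin shell. The pool and target names are made-up examples, the job number is whatever migration move prints, and (from memory) '..' backs you out of a cell - if not, just log out and back in.
cd PoolManager
psu set pool sedsk18other_0 rdonly
..
cd sedsk18other_0
migration move sedsk37other_0
migration concurrency 1 2 (assuming the job number was 1)
rep ls (repeat until the pool is empty)
unregister pnfs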

How to diagnose a crashed RAID card
Can't copy a file off dCache.
Find out where the file is:
On gfe02:
ssh gfe02admin
\c PnfsManager
cacheinfoof /pnfs/hep.ph.ic.ac.uk/data/cms/store/...
\q
Log into the sedsk node shown and run dmesg and areca-cli vsf info.
This is a bad sign:
[670448.258801] XFS (dm-10): Log I/O Error Detected. Shutting down filesystem
[670448.266017] XFS (dm-10): Please umount the filesystem and rectify the problem(s)
[670448.273806] sd 0:0:0:0: rejecting I/O to offline device
[670448.279249] sd 0:0:0:0: rejecting I/O to offline device
Try first:
service dcache-server stop
umount /srv/data/*
for i in /dev/mapper/sedsk37_datavg-*; do echo $i; xfs_repair "$i"; done
(The last command might have to be run twice.)
If it's not a broken disk but a crashed RAID card, do:
disable dCache on startup: service dcache-server stop; chkconfig dcache-server off
unmount the disks and make sure they don't come back on startup: umount /srv/data/*; comment disks out in /etc/fstab.
reboot
for i in /dev/mapper/sedsk37_datavg-*; do echo $i; xfs_repair -L "$i"; done
remount the disks: emacs /etc/fstab (uncomment them again); mount -a
checks: df -h; tail /var/log/dcache/sedskNNDomain.log;
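The same recovery as one annotated shell sequence (sedsk37 is only the example node from above; adjust the volume group and domain names):
service dcache-server stop
chkconfig dcache-server off   # keep dCache off across the reboot
umount /srv/data/*
# comment the data partitions out of /etc/fstab, then:
reboot
# after the reboot, zero the XFS logs and repair:
for i in /dev/mapper/sedsk37_datavg-*; do echo "$i"; xfs_repair -L "$i"; done
# uncomment the partitions in /etc/fstab again, then remount and check:
mount -a
df -h
tail /var/log/dcache/sedsk37Domain.log
# if that all looks sane, bring dCache back:
chkconfig dcache-server on
service dcache-server start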

dCache has disabled all its pools on sedsk13:
[root@sedsk13 log]# /etc/init.d/dcache restart pool
Stopping sedsk13Domain (pid=25417) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Giving up : sedsk13Domain might still be running
sedsk13Domain might still be running

Oh well, do it in two steps:

[root@sedsk13 log]# /etc/init.d/dcache stop pool
Stopping sedsk13Domain (pid=25417) Done
[root@sedsk13 log]# /etc/init.d/dcache start pool
Starting sedsk13Domain Done (pid=18325)
Check here.

A ping to sedsk13 is only successful from a subset of machines
Try:
/etc/init.d/network stop; sleep 10; /etc/init.d/network start;

Space tokens
ssh gfe02admin
cd SrmSpaceManager
ls (to get space token number, t2k has 1270000)
listFilesInSpace 1270000