How to fix a failed VMFS datastore
A failed VMFS datastore is one of the most stressful incidents in a VMware environment because it affects not just one file or one virtual machine, but the storage layer that multiple workloads depend on at the same time. In practice the symptoms vary: the datastore suddenly becomes inaccessible, virtual machines disappear from the inventory or datastore browser, hosts report I/O errors, ESXi cannot mount the datastore, or virtual machines begin failing with lock, snapshot, or inaccessible-file messages. In this situation the most important thing is to slow down. Many serious data loss cases happen not because the first failure was catastrophic, but because the response was rushed.
A good recovery process starts with discipline: first determine whether the problem is in VMFS metadata, the LUN itself, the RAID controller, multipath connectivity, the iSCSI or FC layer, or only in the view of one host. Only after that should you decide whether to rescan, remount, repair a VM chain, or restore from backup. If your virtual infrastructure supports important services, it is worth planning for resilience in advance: flexible transition environments are often built on Virtual Servers, heavier workloads may need Dedicated Servers, and backup retention is often safer with Cloud Storage.
VMFS failures rarely happen “for no reason”. The root cause is usually one of these layers: storage controller degradation, disk issues, an unclean host restart, LUN ID changes, path flapping, APD or PDL events, broken snapshots, incorrect storage migration, or corruption in the VM file chain. That is why the goal of this guide is not to provide one magic command, but to show the right sequence so you do not reduce your recovery options.
1) First rule: do not write anything to the failed datastore
If the datastore appears damaged or inaccessible, do not create new files on it, do not create test VMs there, and do not format it. Even if the vSphere UI offers to create a new datastore on the same device, that option should be ignored at this stage. Your priority is to preserve the current metadata and file layout as much as possible so you still have a clean basis for analysis or recovery.
If virtual machines are still running from the affected datastore, avoid panic restarts. First determine whether they can be cleanly shut down or migrated elsewhere. If Storage vMotion is available and the storage layer is stable, migration may be a safe exit. But if the storage path is actively throwing I/O errors, aggressive migration can increase pressure on the failing datastore and make the incident worse.
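Whether a VM can be shut down cleanly is easy to check from the ESXi shell before touching anything else. A minimal sketch using `vim-cmd` (the VM ID `42` is a placeholder; take the real ID from the first column of `getallvms`):

```shell
# List registered VMs and their numeric IDs (first column)
vim-cmd vmsvc/getallvms

# Check the power state of a specific VM (replace 42 with the real ID)
vim-cmd vmsvc/power.getstate 42

# Request a clean guest-OS shutdown (requires VMware Tools in the guest)
vim-cmd vmsvc/power.shutdown 42
```

Prefer the guest shutdown over `power.off`; a hard power-off on an already unstable datastore adds one more unclean state to recover from.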
2) Check whether the problem is host-side, path-related, or inside the LUN
Your first practical question should be: does one host fail to see the datastore, do several hosts fail, or is the problem global? If only one ESXi host is affected, there is a good chance VMFS itself is still healthy and the issue is in the HBA, multipath state, or host-side storage discovery. If the datastore disappeared from all hosts at once, the problem is more likely at the storage or LUN layer.
On ESXi, inspect devices, paths, and filesystem visibility. In practice administrators start with commands that show device information, path states, and mounted filesystems. If the device is visible but the filesystem is not mounted, the likely problem is VMFS signature, mount state, or metadata. If the device itself is missing, go lower: iSCSI sessions, FC zoning, HBA state, RAID controller, or storage array health.
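The host-side checks described above map to a few read-only esxcli commands; a sketch (output formats vary by ESXi version, and none of these commands modify anything):

```shell
# Does this host see the backing device at all?
esxcli storage core device list

# Are the paths to it active, dead, or standby?
esxcli storage core path list

# Is the VMFS volume mounted, and what is its UUID and free space?
esxcli storage filesystem list

# Map VMFS volumes to their backing SCSI devices and partitions
esxcfg-scsidevs -m
```

If the device appears in the first two commands but the volume is absent from the filesystem list, the problem is likely at the VMFS mount or signature layer rather than connectivity.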
For iSCSI, verify sessions and target availability. For FC, review zoning and fabric-side errors. For local RAID, inspect the controller event log, cache or battery status, and physical disk health. This matters because the datastore failure is often just a symptom, not the real defect.
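For the transport-layer checks, the ESXi side can be inspected like this (fabric zoning and array health still have to be verified on the switch and storage side):

```shell
# iSCSI: list software/hardware iSCSI adapters on this host
esxcli iscsi adapter list

# iSCSI: are sessions to the target currently established?
esxcli iscsi session list

# FC: adapter and port state (link state, WWNs)
esxcli storage san fc list
```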
3) Rescan, remount, and resignature — only with a clear goal
If the storage side has been stabilized and the LUN is visible again, the next step is a controlled rescan. A rescan only forces the host to reread the storage topology; it does not “repair” anything by itself. If the datastore still does not show up after a rescan, determine whether ESXi sees the VMFS as a snapshot copy, an unknown signature, or an inactive mount candidate.
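A controlled rescan and the snapshot-detection check look like this on the ESXi shell:

```shell
# Force the host to rescan all storage adapters
esxcli storage core adapter rescan --all

# After the rescan, check whether ESXi treats the volume
# as a snapshot / unresolved VMFS copy instead of mounting it
esxcli storage vmfs snapshot list
```

If the volume appears in the snapshot list, read the next paragraphs before acting: the mount-versus-resignature decision is the point of no return in this procedure.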
This is where the difference between remount and resignature becomes critical. Remount attempts to mount the existing datastore using its current identity. Resignature creates a new VMFS identity. That can be correct for cloned or intentionally copied LUNs, but during failure recovery it can complicate things because the datastore UUID changes and tooling or scripts that depend on the original identity may no longer behave as expected.
A practical rule is simple: if you are not completely certain that you are looking at a clone rather than the original datastore, do not rush into resignature. In many incidents the safer step is remount, path stabilization, or rollback from a storage snapshot rather than assigning a new signature.
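Both operations are exposed through the same esxcli namespace; `"DatastoreLabel"` below is a placeholder for the label reported by `esxcli storage vmfs snapshot list`:

```shell
# Mount the volume with its EXISTING signature (preferred during recovery)
esxcli storage vmfs snapshot mount -l "DatastoreLabel"

# Only for confirmed clones or intentional copies: assign a NEW signature.
# The datastore UUID changes, so scripts and tooling keyed to the old UUID break.
esxcli storage vmfs snapshot resignature -l "DatastoreLabel"
```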
4) Check VM directories, descriptor files, and snapshot chains
If the datastore is visible but specific virtual machines are inaccessible, the problem may be at the file level rather than the VMFS mount layer. Check whether VM directories are present, whether .vmx files exist, whether .vmdk descriptor files are intact, and whether only -flat.vmdk or -delta.vmdk files remain. That does not automatically mean the data is gone, but it does mean you need to work carefully.
Descriptor files are especially important because they define the logical disk chain. If a descriptor is missing or damaged, it may be possible to reconstruct access, but only if you understand the disk size, adapter type, and snapshot hierarchy correctly. A badly recreated descriptor can make recovery harder than the original issue.
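If a descriptor such as `vm1.vmdk` is missing but the data file `vm1-flat.vmdk` survived, the widely documented approach is to generate a fresh descriptor of identical size and point it at the surviving flat file. A hedged sketch (`datastore1` and `vm1` are placeholders; verify every step against your own layout, and work on copies where possible):

```shell
cd /vmfs/volumes/datastore1/vm1

# The flat file's size in bytes determines the new descriptor's geometry
SIZE=$(ls -l vm1-flat.vmdk | awk '{print $5}')

# Create a temporary disk of EXACTLY that size; this produces a fresh,
# valid descriptor (temp.vmdk) plus a temp-flat.vmdk we will discard
vmkfstools -c "${SIZE}" -d thin temp.vmdk

# Discard the temporary flat file and point the new descriptor
# at the original one
rm temp-flat.vmdk
sed -i 's/temp-flat.vmdk/vm1-flat.vmdk/' temp.vmdk
# If the original disk was thick-provisioned, also delete the
# ddb.thinProvisioned = "1" line from temp.vmdk
mv temp.vmdk vm1.vmdk

# Verify the disk chain is consistent before powering anything on
vmkfstools -e vm1.vmdk
```

Getting the size or adapter type wrong here produces exactly the "badly recreated descriptor" problem described above, so double-check before the final rename.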
If the datastore had been close to full capacity, there is a strong chance the failure is tied to snapshot growth or a broken consolidation attempt. After regaining access, do not immediately delete “old delta files”. First identify which disks are active, whether there are locks from another host, and whether the snapshot chain is complete. Only then plan consolidation or cleanup.
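Locks and the delta-file situation can be inspected without changing anything; paths below are placeholders:

```shell
# Show lock state and owner for a file; the owner field contains the MAC
# address of the locking host (all zeros means no lock is held)
vmkfstools -D /vmfs/volumes/datastore1/vm1/vm1-flat.vmdk

# Inventory the snapshot chain before deciding on any cleanup
ls -lh /vmfs/volumes/datastore1/vm1/ | grep -E "delta|sesparse"
```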
5) Logs and commands that provide the most value
For practical analysis, vmkernel logs, vobd messages, and storage-related event records are extremely useful because they reveal APD, PDL, SCSI sense codes, and VMFS metadata symptoms. If you see repeated APD or PDL events, focus first on path availability and storage health rather than trying to “repair” the filesystem. If logs suggest actual corruption, proceed even more conservatively.
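A quick first pass over the relevant logs can be done directly on the host:

```shell
# Look for all-paths-down / permanent-device-loss events and SCSI errors
grep -iE "apd|pdl|lost access|scsi" /var/log/vmkernel.log | tail -n 50

# Storage observation events are logged separately by vobd
tail -n 50 /var/log/vobd.log
```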
Commands that list VMFS extents, filesystems, and SCSI device mappings help you understand which device backs the datastore and what state ESXi believes it is in. They are good first-line diagnostics because they do not overwrite data. In many incidents the real mistake begins only when analysis ends too early and destructive action starts too soon.
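The extent-to-device mapping mentioned above is available read-only; the NAA identifier below is a made-up example to replace with your own:

```shell
# Which SCSI device (and partition) backs each VMFS datastore?
esxcli storage vmfs extent list

# Detailed state of one backing device (replace with your device's NAA ID)
esxcli storage core device list -d naa.600508b1001c3abc
```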
6) When to repair, when to restore from backup, and when to stop
If the incident is caused by path loss, HBA issues, iSCSI or FC instability, or temporary storage-side unavailability, a proper remount is often enough to restore service without data loss. If the issue is specific to VM files or snapshot chains, file-level recovery may be possible. But if you suspect real VMFS metadata damage or physical disk degradation, the safest move is often to stop and restore from backup rather than experiment on the original data.
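When metadata damage is suspected, VMware ships an on-disk checker, VOMA, that can assess VMFS metadata without repairing it. A sketch (the datastore must be unmounted and have no running VMs; the device:partition is a placeholder):

```shell
# Read-only VMFS metadata check with VOMA
voma -m vmfs -f check -d /vmfs/devices/disks/naa.600508b1001c3abc:1
```

If VOMA reports errors, treat that as confirmation that restore-from-backup is on the table, not as an invitation to improvise repairs on the original device.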
Prevention matters just as much as recovery. A consistent backup policy, datastore free-space monitoring, and storage event monitoring dramatically reduce the chance that a VMFS incident becomes prolonged downtime. Do not keep datastores nearly full, because snapshot growth and consolidation become much riskier in that state. It is also valuable to document LUN mappings, datastore UUIDs, and path policies so you do not have to reconstruct your environment under pressure.
If you use shared storage across multiple hosts, regularly verify that all hosts see the same device set consistently and that multipath status is healthy everywhere. Subtle differences between hosts often surface only after maintenance or reboot events. The better observed and documented your storage layer is, the less likely a datastore failure is to escalate into a major outage.
A good administrator does not try to defeat every storage failure with a single command. A good administrator knows when to stop. If you have recent backups or storage snapshots, using them is often safer than aggressive repair attempts on live data. The right sequence is simple: do not write anything, determine the cause, collect diagnostics, and only then choose the lowest-risk recovery path.