# Troubleshooting Guide
Use this guide to diagnose and fix common deployment issues. Start with the symptom you're seeing, then follow the steps.
## How to Check Logs
Most issues can be diagnosed from the pod logs:
- Go to your Dashboard
- Click the deployment card or the ⋮ menu
- Click View Logs
- Look for lines containing `ERROR` or `WARNING`
If logs show "Container starting up..." with a spinner, the pod hasn't finished pulling the Docker image yet. This can take 5–15 minutes for large images.
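If you save the log output to a file or paste it into a script, the same check can be automated. A minimal sketch (the helper name is illustrative, not part of any ModelPilot tooling):

```python
def find_problems(log_text: str) -> list[str]:
    """Return the log lines that contain ERROR or WARNING markers."""
    return [line for line in log_text.splitlines()
            if "ERROR" in line or "WARNING" in line]
```

Paste your pod logs into a string or read them from a file, then scan the result for the lines that matter.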
## ⏳ Deployment Stuck on "Starting"
Deployments go through several phases. Here's how long each should take:
| Model Type | Normal | Extended | Possibly Stuck | Likely Failed |
|---|---|---|---|---|
| Image / Text | 0–15 min | 15–30 min | 30–60 min | 60+ min |
| Video | 0–30 min | 30–60 min | 60–120 min | 120+ min |
| Custom Workflow | 0–30 min | 30–50 min | 50–90 min | 90+ min |
Why it's slow: First-time deployments download models from HuggingFace (5–20 GB), and video models take the longest. Subsequent deploys of the same model are faster because the models are cached in the Docker image.
💡 Tip: If your deployment reaches the "Likely Failed" threshold, stop it to save costs and try again. A fresh pod may land on a node with better network connectivity.
## ❌ Deployment Failed
Check your logs for these common errors:
### GPU Memory Exhausted (CUDA OOM)
Log message: `CUDA out of memory` or `OutOfMemoryError`
The model needs more VRAM than your GPU has.
- Upgrade to a GPU with more VRAM (L4 24GB → A6000 48GB → A100 80GB)
- Use a smaller model variant (e.g., 8B instead of 32B)
- Reduce image resolution or batch size in your workflow
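A quick back-of-envelope check helps pick the right GPU before deploying: fp16 weights take about 2 bytes per parameter, so an 8B model needs roughly 16 GB of VRAM for the weights alone. A sketch of that arithmetic (the function name is illustrative):

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough VRAM needed for model weights alone.

    fp16/bf16 weights are 2 bytes per parameter; fp32 is 4, int8 is 1.
    Activations, caches, and framework overhead add more on top,
    so leave roughly 20-30% headroom above this number.
    """
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bytes_per_param
```

By this estimate an 8B fp16 model (~16 GB) fits an L4 (24 GB) with modest headroom, while a 32B model (~64 GB) needs an A100 (80 GB).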
### Model Not Found
Log message: `Model not found`, or `No such file` referencing a `.safetensors` file
- Check the model name is spelled correctly
- For workflows: verify model filenames match what ComfyUI expects
- For gated models: add your HuggingFace token in settings
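For workflow deployments, a quick local check can confirm that every model file the workflow references actually exists under the models directory. A sketch, comparing by exact filename (the helper name and directory layout are assumptions, not a ComfyUI API):

```python
from pathlib import Path

def missing_models(referenced: list[str], models_dir: str) -> list[str]:
    """Return the referenced model filenames not found under models_dir."""
    present = {p.name for p in Path(models_dir).rglob("*") if p.is_file()}
    return [name for name in referenced if name not in present]
```

Anything this returns is a filename your workflow expects but the container does not have, which is exactly what produces the error above.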
### Missing Python Package
Log message: `ModuleNotFoundError` or `No module named`
- A custom node requires a Python package not in the Docker image
- Open ComfyUI Manager → Install Missing Packages → Restart
- Or redeploy to trigger a fresh custom node installation
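The missing name can be read straight off the log line. Note that the import name is not always the pip package name (for example, `cv2` comes from `opencv-python`). A small parser sketch (illustrative, not part of ComfyUI Manager):

```python
import re
from typing import Optional

def missing_module(log_line: str) -> Optional[str]:
    """Extract the module name from a ModuleNotFoundError log line."""
    match = re.search(r"No module named '([^']+)'", log_line)
    return match.group(1) if match else None
```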
### GPU Driver Error
Log message: `CUDA error` or `cuDNN error`
- Restart the pod to reset GPU state
- Try a different GPU type
- This is usually a transient hardware issue on the RunPod node
### Process Killed (OOMKilled)
Log message: `Killed` or `OOMKilled`
- The system ran out of RAM (system memory, not GPU VRAM)
- Switch to a GPU type that comes with more system RAM
- Reduce the complexity of your workflow
### Disk Full
Log message: `No space left on device`
- The container disk (200 GB default) is full — usually from too many models
- Delete unused generated files from the ComfyUI output folder
- Note: CPU instances have only 20 GB disk
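To find what is eating the disk, list the biggest files under the ComfyUI directory or its output folder. A sketch (the helper name is illustrative; point it at whatever path your container uses):

```python
from pathlib import Path

def largest_files(root: str, top_n: int = 10) -> list[tuple[float, str]]:
    """Return the top_n biggest files under root as (size_gb, path), largest first."""
    sizes = [(p.stat().st_size / 1e9, str(p))
             for p in Path(root).rglob("*") if p.is_file()]
    return sorted(sizes, reverse=True)[:top_n]
```

Old generated images and duplicate model checkpoints usually dominate the list.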
### Gated Model (Access Denied)
Log message: `gated model` or `access token required`
- Some HuggingFace models require you to accept a license first
- Go to the model page on huggingface.co and accept the terms
- Add your HuggingFace token to your ModelPilot account settings
## 🔒 GPU Unavailable
Error: "The selected instance type is temporarily unavailable"
This means RunPod's data centers have no machines with your selected GPU available right now. This is not a ModelPilot issue — it's GPU supply and demand.
- Try a different GPU: L4 ($0.51/hr), RTX 4090 ($0.77/hr), A6000 ($0.64/hr), and A100 ($1.81/hr) have different availability
- Wait and retry: GPU capacity changes every few minutes as other users finish
- Try off-peak hours: Availability is typically better outside US business hours
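Since capacity comes and goes every few minutes, an automated retry with exponential backoff is often enough. A sketch; the `deploy` callable and the RuntimeError signal stand in for whatever your deployment script or API client uses:

```python
import time

def deploy_with_retry(deploy, attempts: int = 5, base_delay: float = 60.0):
    """Call deploy(), retrying with exponential backoff on capacity errors.

    A RuntimeError stands in here for an "instance type unavailable"
    failure; adapt the exception type to your client.
    """
    for attempt in range(attempts):
        try:
            return deploy()
        except RuntimeError:
            if attempt == attempts - 1:
                raise  # out of attempts; give up
            time.sleep(base_delay * 2 ** attempt)  # 60s, 120s, 240s, ...
```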
## 🐌 Model Running But Slow
- Wrong GPU for the model: Large models (32B+ parameters) on a 24 GB GPU will swap to system memory. Upgrade to A6000 (48 GB) or A100 (80 GB).
- First request is slow: The model loads into GPU memory on the first request. Subsequent requests are much faster.
- High resolution / large batch: Reduce image dimensions or batch size in your ComfyUI workflow.
## 🔌 API Errors
| Status | Meaning | Fix |
|---|---|---|
| 401 | Invalid or missing API key | Check your API key in Dashboard → API Keys |
| 402 | Insufficient credits | Add credits at Billing |
| 404 | Deployment not found | Check the pod ID — the deployment may have been deleted |
| 429 | Rate limited | Wait 60 seconds and retry. API limit is 100 requests/min |
| 503 | Pod not ready | The model is still loading. Wait for status to show "Running" |
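In client code, only 429 and 503 are worth retrying automatically; the rest need user action. A sketch of that branching (the advice strings mirror the table above; nothing here is an official SDK):

```python
ADVICE = {
    401: "Check your API key in Dashboard -> API Keys",
    402: "Add credits at Billing",
    404: "Check the pod ID; the deployment may have been deleted",
    429: "Rate limited; wait 60 seconds and retry (limit: 100 requests/min)",
    503: "Pod not ready; wait for status to show 'Running'",
}

def handle_status(status: int) -> tuple[bool, str]:
    """Return (retryable, advice) for an API response status code."""
    retryable = status in (429, 503)  # transient conditions only
    return retryable, ADVICE.get(status, "Unexpected status; contact support")
```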
## Still Need Help?
If your issue isn't covered here:
- Email support@modelpilot.ai with your deployment ID and the error from your logs
- Check the FAQ for more answers