Troubleshooting Guide

Use this guide to diagnose and fix common deployment issues. Start with the symptom you're seeing, then follow the steps.


How to Check Logs

Most issues can be diagnosed from the pod logs:

  1. Go to your Dashboard
  2. Click the deployment card or the menu
  3. Click View Logs
  4. Look for lines with ERROR or WARNING

If logs show "Container starting up..." with a spinner, the pod hasn't finished pulling the Docker image yet. This can take 5–15 minutes for large images.
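If you download the logs as text, the ERROR/WARNING scan in step 4 can be automated with a few lines of Python. The sample log below is illustrative, not real ModelPilot output:

```python
def find_problem_lines(log_text):
    """Return log lines that contain ERROR or WARNING markers."""
    return [line for line in log_text.splitlines()
            if "ERROR" in line or "WARNING" in line]

sample_log = """INFO  Container starting up...
WARNING Model cache miss, downloading from HuggingFace
ERROR CUDA out of memory
INFO  Retrying..."""

for line in find_problem_lines(sample_log):
    print(line)
```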

⏳ Deployment Stuck on "Starting"

Deployments go through several phases. Here's how long each should take:

Model Type         Normal      Extended     Possibly Stuck   Likely Failed
Image / Text       0–15 min    15–30 min    30–60 min        60+ min
Video              0–30 min    30–60 min    60–120 min       120+ min
Custom Workflow    0–30 min    30–50 min    50–90 min        90+ min

Why it's slow: First-time deployments download models from HuggingFace (5–20 GB). Video models take longest. Subsequent deploys of the same model are faster because models are cached in the Docker image.

💡 Tip: If your deployment reaches the "Likely Failed" threshold, stop it to save costs and try again. A fresh pod may land on a node with better network connectivity.
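If you script your own monitoring, the table above can be encoded as a small helper. This is a sketch; the thresholds are copied from the table and the function name is hypothetical:

```python
# Upper bounds (in minutes) for the normal / extended / possibly-stuck
# phases, per model type, taken from the timing table above.
THRESHOLDS = {
    "image/text": (15, 30, 60),
    "video": (30, 60, 120),
    "custom workflow": (30, 50, 90),
}

def classify_startup(model_type, elapsed_min):
    """Classify how long a deployment has been stuck on 'Starting'."""
    normal, extended, stuck = THRESHOLDS[model_type]
    if elapsed_min <= normal:
        return "normal"
    if elapsed_min <= extended:
        return "extended"
    if elapsed_min <= stuck:
        return "possibly stuck"
    return "likely failed"
```

A result of "likely failed" is the point where stopping the pod and redeploying is usually cheaper than waiting.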

❌ Deployment Failed

Check your logs for these common errors:

GPU Memory Exhausted (CUDA OOM)

Log message: CUDA out of memory or OutOfMemoryError

The model needs more VRAM than your GPU has.

  • Upgrade to a GPU with more VRAM (L4 24GB → A6000 48GB → A100 80GB)
  • Use a smaller model variant (e.g., 8B instead of 32B)
  • Reduce image resolution or batch size in your workflow
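As a rough rule of thumb, fp16 weights take about 2 bytes per parameter, plus headroom for activations and the CUDA context. A ballpark estimator (the 20% overhead factor is an assumption, not an exact figure):

```python
def estimate_vram_gb(params_billions, bytes_per_param=2, overhead=1.2):
    """Rough VRAM needed to hold model weights in fp16 (2 bytes/param),
    with ~20% headroom for activations and CUDA context.
    A ballpark, not an exact requirement."""
    return params_billions * bytes_per_param * overhead

# A 32B model needs roughly 32 * 2 * 1.2 ≈ 77 GB: too big for a 24 GB L4,
# borderline on a 48 GB A6000, comfortable on an 80 GB A100.
```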

Model Not Found

Log message: Model not found or No such file .safetensors

  • Check the model name is spelled correctly
  • For workflows: verify model filenames match what ComfyUI expects
  • For gated models: add your HuggingFace token in settings

Missing Python Package

Log message: ModuleNotFoundError or No module named

  • A custom node requires a Python package not in the Docker image
  • Open ComfyUI Manager → Install Missing Packages → Restart
  • Or redeploy to trigger a fresh custom node installation
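To see which imports are missing before redeploying, you can probe them from any Python shell on the pod using the standard library. The package names below are placeholders:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level package names that can't be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Example: check a couple of names a custom node might need.
print(missing_packages(["json", "some_missing_custom_node_dep"]))
```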

GPU Driver Error

Log message: CUDA error or cuDNN error

  • Restart the pod to reset GPU state
  • Try a different GPU type
  • This is usually a transient hardware issue on the RunPod node

Process Killed (OOMKilled)

Log message: Killed or OOMKilled

  • The system ran out of RAM (system memory, not GPU VRAM)
  • Switch to an instance type with more system RAM
  • Reduce the complexity of your workflow

Disk Full

Log message: No space left on device

  • The container disk (200 GB default) is full — usually from too many models
  • Delete unused generated files from the ComfyUI output folder
  • Note: CPU instances have only 20 GB disk
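You can confirm disk pressure from a Python shell on the pod with the standard library:

```python
import shutil

def disk_usage_report(path="/"):
    """Report disk space in GB for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    gb = 1024 ** 3
    return {"total_gb": usage.total / gb,
            "used_gb": usage.used / gb,
            "free_gb": usage.free / gb}

report = disk_usage_report(".")
print(f"{report['free_gb']:.1f} GB free of {report['total_gb']:.1f} GB")
```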

Gated Model (Access Denied)

Log message: gated model or access token required

  • Some HuggingFace models require you to accept a license first
  • Go to the model page on huggingface.co and accept the terms
  • Add your HuggingFace token to your ModelPilot account settings

🔒 GPU Unavailable

Error: "The selected instance type is temporarily unavailable"

This means RunPod's data centers have no machines with your selected GPU available right now. This is not a ModelPilot issue — it's GPU supply and demand.

  • Try a different GPU: L4 ($0.51/hr), RTX 4090 ($0.77/hr), A6000 ($0.64/hr), and A100 ($1.81/hr) have different availability
  • Wait and retry: GPU capacity changes every few minutes as other users finish
  • Try off-peak hours: Availability is typically better outside US business hours
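The wait-and-retry advice can be automated with jittered exponential backoff. This is a sketch: `try_deploy` stands in for whatever deploy call you use and is not a ModelPilot API:

```python
import random
import time

def deploy_with_retry(try_deploy, attempts=5, base_delay=60):
    """Retry a deployment with jittered exponential backoff.

    `try_deploy` is any callable returning True on success and False
    when the selected GPU type is temporarily unavailable.
    """
    for attempt in range(attempts):
        if try_deploy():
            return True
        # Capacity changes every few minutes, so back off and try again.
        delay = base_delay * (2 ** attempt) + random.uniform(0, 10)
        time.sleep(delay)
    return False
```

Jitter spreads out retries so that many clients hitting the same shortage don't all retry at the same instant.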

🐌 Model Running But Slow

  • Wrong GPU for the model: Large models (32B+ parameters) on a 24 GB GPU will swap to system memory. Upgrade to A6000 (48 GB) or A100 (80 GB).
  • First request is slow: The model loads into GPU memory on the first request. Subsequent requests are much faster.
  • High resolution / large batch: Reduce image dimensions or batch size in your ComfyUI workflow.

🔌 API Errors

Status   Meaning                      Fix
401      Invalid or missing API key   Check your API key in Dashboard → API Keys
402      Insufficient credits         Add credits at Billing
404      Deployment not found         Check the pod ID; the deployment may have been deleted
429      Rate limited                 Wait 60 seconds and retry. API limit is 100 requests/min
503      Pod not ready                The model is still loading. Wait for status to show "Running"
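A client can branch on these status codes mechanically. A minimal sketch (the suggested wait times come from the table for 429 and are an assumption for 503, not an API guarantee):

```python
def handle_api_status(status):
    """Map an API status code to a suggested action, per the table above."""
    retryable = {429: 60, 503: 30}  # status -> suggested wait in seconds
    if status in retryable:
        return ("retry", retryable[status])
    fatal = {
        401: "check your API key",
        402: "add credits",
        404: "verify the pod ID",
    }
    if status in fatal:
        return ("fix", fatal[status])
    return ("ok", None) if 200 <= status < 300 else ("unknown", None)
```

Treat 429 and 503 as transient (retry with a delay) and 401/402/404 as errors that retrying will never fix.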

Still Need Help?

If your issue isn't covered here: