NVIDIA NIM Deployment

Deploy optimized AI models using NVIDIA Inference Microservices (NIM) Hosted API.

✅ Benefits

  • No infrastructure management - Fully hosted by NVIDIA
  • Pay-per-use pricing - No idle costs
  • Instant deployment - No provisioning wait time
  • High performance - Optimized inference with TensorRT
  • Wide model selection - Access to popular open-source models

Prerequisites

NVIDIA NGC Account

  • Sign up at ngc.nvidia.com
  • Generate an API key from your account settings
  • No credit card required for free tier access

Required Environment Variables

NVIDIA_API_KEY=nvapi-XXXXXXXXXXXXXXXXXXXXXXXXXXXX
NIM_BASE_URL=https://integrate.api.nvidia.com  # Optional, defaults to this
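
As a minimal server-side sketch, these variables can be read and validated before any NIM call is made (the helper name getNimConfig is illustrative, not part of Staque IO; keep it server-side only):

// Reads NIM settings from the environment; never expose these to the browser
export function getNimConfig(): { apiKey: string; baseUrl: string } {
  const apiKey = process.env.NVIDIA_API_KEY;
  if (!apiKey) {
    // Matches the "NVIDIA_API_KEY is not configured" error in Troubleshooting
    throw new Error("NVIDIA_API_KEY is not configured");
  }
  // Fall back to the documented default when NIM_BASE_URL is unset
  const baseUrl = process.env.NIM_BASE_URL ?? "https://integrate.api.nvidia.com";
  return { apiKey, baseUrl };
}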

Step 1: Get NVIDIA API Key

Via NGC Console

  1. Log in to NVIDIA NGC
  2. Click on your profile icon (top right)
  3. Select "Setup" → "Generate API Key"
  4. Copy the generated key
  5. Add to your .env.local file

Security Note: Never commit your API key to version control. Always use environment variables.

Step 2: Browse Available Models

Via Staque IO API

GET /api/models/nvidia

// Response
{
  "success": true,
  "provider": "nvidia-nim",
  "models": [
    {
      "id": "meta/llama-3.2-11b-vision-instruct",
      "name": "Llama 3.2 11B Vision Instruct",
      "provider": "nvidia-nim",
      "tags": ["instruction-following", "vision", "11b"],
      "task": "text-generation"
    },
    {
      "id": "mistralai/mistral-7b-instruct-v0.3",
      "name": "Mistral 7B Instruct",
      "provider": "nvidia-nim",
      "tags": ["instruction-following", "7b"],
      "task": "text-generation"
    }
  ]
}
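
For example, a minimal TypeScript call to this endpoint from a script or page (error handling omitted for brevity):

// List the curated NVIDIA NIM models exposed by Staque IO
const res = await fetch("/api/models/nvidia");
const data = await res.json();
if (data.success) {
  for (const model of data.models) {
    console.log(`${model.id} (${model.name}, ${model.task})`);
  }
}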

Via NVIDIA Catalog

Browse all available models at build.nvidia.com

Step 3: Deploy via Staque IO UI

  1. Navigate to Get Started
    • Click "Get Started" in the navigation
    • Get AI recommendations or skip to model selection
  2. Select NVIDIA NIM Platform
    • Choose "NVIDIA NIM" as the platform
    • Browse curated model list
  3. Choose Model
    • Select model based on your use case
    • Review model capabilities and context window
  4. Configure Deployment
    • Enter conversation title
    • Specify use case
    • No instance type selection needed (fully managed)
  5. Deploy
    • Click "Deploy" - instant activation
    • System verifies API connectivity
    • Ready to use immediately

Step 4: Deploy via API

Verify Connectivity (Dry Run)

POST /api/deploy/nims
Content-Type: application/json
Authorization: Bearer <your-token>

{
  "modelId": "mistralai/mistral-7b-instruct-v0.3",
  "dryRun": true
}

// Response
{
  "success": true,
  "provider": "nvidia-nim",
  "modelId": "mistralai/mistral-7b-instruct-v0.3",
  "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
  "message": "Dry run successful"
}

Actual Deployment

POST /api/deploy/nims
Content-Type: application/json
Authorization: Bearer <your-token>

{
  "modelId": "mistralai/mistral-7b-instruct-v0.3",
  "dryRun": false
}

// Response
{
  "success": true,
  "provider": "nvidia-nim",
  "modelId": "mistralai/mistral-7b-instruct-v0.3",
  "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
  "message": "NIM Hosted API reachable"
}
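
Both calls share the same request shape, so a small wrapper can run the dry run and the real deployment in sequence. A sketch (the helper name deployNim and its error handling are illustrative; the documented response fields are provider, modelId, endpoint, and message):

// POST /api/deploy/nims; dryRun: true verifies connectivity without deploying
async function deployNim(modelId: string, dryRun: boolean, token: string) {
  const res = await fetch("/api/deploy/nims", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ modelId, dryRun }),
  });
  const body = await res.json();
  if (!body.success) {
    throw new Error(`Deployment failed for ${modelId}`);
  }
  return body; // { success, provider, modelId, endpoint, message }
}

// Dry run first, then the actual deployment
await deployNim("mistralai/mistral-7b-instruct-v0.3", true, token);
await deployNim("mistralai/mistral-7b-instruct-v0.3", false, token);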

Create Conversation

POST /api/conversations
Content-Type: application/json
Authorization: Bearer <your-token>

{
  "title": "Mistral Assistant",
  "use_case": "general-purpose",
  "deployed_resource": {
    "resource_name": "Mistral 7B",
    "resource_type": "nvidia-nim",
    "aws_resource_id": "mistralai/mistral-7b-instruct-v0.3",
    "region": "global",
    "instance_type": "api-based",
    "estimated_hourly_cost": 0
  }
}

// Response
{
  "success": true,
  "conversation_id": "uuid-here",
  "resource_id": "resource-uuid-here",
  "message": "Conversation and resource created successfully"
}
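
The returned conversation_id and resource_id are what you pass to later calls such as usage tracking and status checks. A short sketch of capturing them in TypeScript (the payload mirrors the example above; the token variable is illustrative):

// Create the conversation and keep the IDs for later usage/status calls
const res = await fetch("/api/conversations", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${token}`,
  },
  body: JSON.stringify({
    title: "Mistral Assistant",
    use_case: "general-purpose",
    deployed_resource: {
      resource_name: "Mistral 7B",
      resource_type: "nvidia-nim",
      aws_resource_id: "mistralai/mistral-7b-instruct-v0.3",
      region: "global",
      instance_type: "api-based",
      estimated_hourly_cost: 0,
    },
  }),
});
const { conversation_id, resource_id } = await res.json();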

Available Models

Curated Model List

Staque IO provides access to a curated selection of high-quality models:

Model | Provider | Size | Best For
meta/llama-3.2-11b-vision-instruct | Meta | 11B | Vision + text tasks
mistralai/mistral-7b-instruct-v0.3 | Mistral AI | 7B | General purpose
meta/llama3-8b-instruct | Meta | 8B | Balanced performance
nvidia/nemotron-mini-4b-instruct | NVIDIA | 4B | Fast, efficient
google/gemma-3-27b-it | Google | 27B | High-quality outputs
ibm/granite-34b-code-instruct | IBM | 34B | Code generation
microsoft/phi-4-mini-instruct | Microsoft | Small | Efficient reasoning

Note: The full NVIDIA NIM catalog includes 200+ models. Staque IO filters this down to a curated set of high-performance, reliable models. See the full list at build.nvidia.com.

Model Configuration

System Prompts

Customize model behavior with system prompts:

POST /api/bedrock/system-prompt
Content-Type: application/json
Authorization: Bearer <your-token>

{
  "modelId": "mistralai/mistral-7b-instruct-v0.3",
  "systemPrompt": "You are a technical support specialist..."
}

// Response
{
  "success": true,
  "message": "System prompt updated successfully"
}

Request Parameters

Configure inference parameters per request:

  • temperature: 0.0-1.0 (controls randomness; higher values give more varied output)
  • max_tokens: Maximum response length
  • top_p: Nucleus sampling threshold
  • stream: Enable streaming responses
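
The hosted endpoint follows the familiar chat-completions request shape, so as a sketch these parameters would sit in the request body like this (model and prompt are illustrative):

// Example per-request inference parameters for the chat completions endpoint
const requestBody = {
  model: "mistralai/mistral-7b-instruct-v0.3",
  messages: [
    { role: "user", content: "Summarize our return policy in two sentences." },
  ],
  temperature: 0.3, // lower values give more deterministic answers
  max_tokens: 256,  // cap response length to control cost and latency
  top_p: 0.9,       // nucleus sampling threshold
  stream: false,    // set to true to receive tokens incrementally
};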

Cost Management

Understanding NIM Pricing

  • Token-based: Pay per input/output token
  • No idle costs: Only charged when you make requests
  • No infrastructure: No instance or hourly fees
  • Free tier available: Limited requests per day for testing

Cost Comparison vs SageMaker

Aspect | NVIDIA NIM | SageMaker
Pricing Model | Per-token | Per-hour
Idle Cost | $0 | ~$0.74-3.83/hr
Deployment Time | Instant | 5-10 minutes
Infrastructure | Managed | Self-managed
Best For | Variable workloads, testing | High-volume, custom models

Track Usage

GET /api/usage/current?conversationId=<conversation-id>

// Response includes cost tracking
{
  "success": true,
  "mtd": {
    "cost_usd": 2.45,
    "tokens_in": 45000,
    "tokens_out": 67500,
    "requests": 350
  }
}
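
A short sketch of reading the month-to-date (mtd) figures from TypeScript, assuming the same bearer-token auth used by the other endpoints (token and conversationId are illustrative):

// Fetch month-to-date usage and cost for a conversation
const res = await fetch(`/api/usage/current?conversationId=${conversationId}`, {
  headers: { Authorization: `Bearer ${token}` },
});
const { mtd } = await res.json();
console.log(
  `MTD: $${mtd.cost_usd} across ${mtd.requests} requests ` +
    `(${mtd.tokens_in} tokens in, ${mtd.tokens_out} tokens out)`
);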

Performance Optimization

Model Selection Tips

  • Small models (1-4B): Use for simple tasks, high-volume requests, low latency needs
  • Medium models (7-11B): Use for balanced performance, general-purpose tasks
  • Large models (27-34B): Use for complex reasoning, high-quality outputs, specialized tasks

Latency Optimization

  • Choose smaller models when possible
  • Reduce max_tokens to minimum required
  • Use shorter system prompts
  • Enable streaming for better user experience
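
Streaming is the biggest perceived-latency win for chat UIs. A minimal server-side sketch against the hosted endpoint, assuming the OpenAI-compatible server-sent-events format (model and prompt are illustrative; keep this server-side so the API key is never exposed):

// Stream a completion from the NIM hosted endpoint (Node 18+, server-side only)
const res = await fetch("https://integrate.api.nvidia.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.NVIDIA_API_KEY}`,
  },
  body: JSON.stringify({
    model: "nvidia/nemotron-mini-4b-instruct",
    messages: [{ role: "user", content: "Give me three onboarding tips." }],
    max_tokens: 128,
    stream: true, // tokens arrive incrementally as server-sent events
  }),
});

// Read the SSE stream chunk by chunk; each chunk contains raw "data: {...}" lines
const reader = res.body!.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  process.stdout.write(decoder.decode(value));
}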

Monitoring and Health

Check NIM Status

GET /api/resources/<resource-id>/status

// Response
{
  "success": true,
  "resource": {
    "type": "nvidia-nim",
    "status": "InService",
    "health": "healthy"
  },
  "metrics": {
    "response_time_ms": 150
  },
  "costs": {
    "hourly_cost": 0,
    "pricing_model": "token"
  }
}

Troubleshooting

Common Issues

Error: "NVIDIA_API_KEY is not configured"

Cause: Missing NVIDIA API key environment variable

Solution: Get an API key from NGC and add it to .env.local

Error: "401 Unauthorized"

Cause: Invalid or expired API key

Solution: Regenerate API key from NGC console and update environment variable

Error: "Model not found"

Cause: Model ID not in curated list or doesn't exist

Solution: Use GET /api/models/nvidia to get valid model IDs

Error: "Rate limit exceeded"

Cause: Too many requests in a short time period

Solution: Implement exponential backoff or request higher rate limits from NVIDIA
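
A minimal backoff sketch in TypeScript (the retry count and delays are illustrative):

// Retry a request with exponential backoff when the API responds 429 (rate limited)
async function fetchWithBackoff(
  url: string,
  init: RequestInit,
  maxRetries = 5
): Promise<Response> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429) return res;
    // Wait 1s, 2s, 4s, ... before the next attempt
    const delayMs = 1000 * 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("Rate limit exceeded after retries");
}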

Best Practices

Development

  • Start with smaller models for testing
  • Use dry run deployments to verify connectivity
  • Monitor token usage closely
  • Test with different models to find best fit

Production

  • Implement request caching when appropriate
  • Use conversation context judiciously (trim when too long)
  • Set up monitoring for API errors and latency
  • Have fallback models in case of API issues

Security

  • Never expose API keys to clients
  • Rotate API keys periodically
  • Use environment variables for key management
  • Monitor for unauthorized usage

Next Steps