NVIDIA NIM Deployment

Deploy optimized AI models using NVIDIA Inference Microservices (NIM) Hosted API.

✅ Benefits

  • No infrastructure management - Fully hosted by NVIDIA
  • Pay-per-use pricing - No idle costs
  • Instant deployment - No provisioning wait time
  • High performance - Optimized inference with TensorRT
  • Wide model selection - Access to popular open-source models

Prerequisites

NVIDIA NGC Account

  • Sign up at ngc.nvidia.com
  • Generate an API key from your account settings
  • No credit card required for free tier access

Required Environment Variables

NVIDIA_API_KEY=nvapi-XXXXXXXXXXXXXXXXXXXXXXXXXXXX
NIM_BASE_URL=https://integrate.api.nvidia.com  # Optional, defaults to this
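
As a minimal server-side sketch, these variables can be read and validated before any NIM call is made (the helper name getNimConfig is illustrative, not part of Staque IO; keep it server-side only):

// Reads NIM settings from the environment; never expose these to the browser
export function getNimConfig(): { apiKey: string; baseUrl: string } {
  const apiKey = process.env.NVIDIA_API_KEY;
  if (!apiKey) {
    // Matches the "NVIDIA_API_KEY is not configured" error in Troubleshooting
    throw new Error("NVIDIA_API_KEY is not configured");
  }
  // Fall back to the documented default when NIM_BASE_URL is unset
  const baseUrl = process.env.NIM_BASE_URL ?? "https://integrate.api.nvidia.com";
  return { apiKey, baseUrl };
}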

Step 1: Get NVIDIA API Key

Via NGC Console

  1. Log in to NVIDIA NGC
  2. Click on your profile icon (top right)
  3. Select "Setup" → "Generate API Key"
  4. Copy the generated key
  5. Add to your .env.local file

Security Note: Never commit your API key to version control. Always use environment variables.

Step 2: Browse Available Models

Via Staque IO API

GET /api/models/nvidia

// Response
{
  "success": true,
  "provider": "nvidia-nim",
  "models": [
    {
      "id": "meta/llama-3.2-11b-vision-instruct",
      "name": "Llama 3.2 11B Vision Instruct",
      "provider": "nvidia-nim",
      "tags": ["instruction-following", "vision", "11b"],
      "task": "text-generation"
    },
    {
      "id": "mistralai/mistral-7b-instruct-v0.3",
      "name": "Mistral 7B Instruct",
      "provider": "nvidia-nim",
      "tags": ["instruction-following", "7b"],
      "task": "text-generation"
    }
  ]
}
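
For example, a minimal TypeScript call to this endpoint from a script or page (error handling omitted for brevity):

// List the curated NVIDIA NIM models exposed by Staque IO
const res = await fetch("/api/models/nvidia");
const data = await res.json();
if (data.success) {
  for (const model of data.models) {
    console.log(`${model.id} (${model.name}, ${model.task})`);
  }
}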

Via NVIDIA Catalog

Browse all available models at build.nvidia.com

Step 3: Deploy via Staque IO UI

  1. Navigate to Get Started
    • Click "Get Started" in the navigation
    • Get AI recommendations or skip to model selection
  2. Select NVIDIA NIM Platform
    • Choose "NVIDIA NIM" as the platform
    • Browse curated model list
  3. Choose Model
    • Select model based on your use case
    • Review model capabilities and context window
  4. Configure Deployment
    • Enter conversation title
    • Specify use case
    • No instance type selection needed (fully managed)
  5. Deploy
    • Click "Deploy" - instant activation
    • System verifies API connectivity
    • Ready to use immediately

Step 4: Deploy via API

Verify Connectivity (Dry Run)

POST /api/deploy/nims
Content-Type: application/json
Authorization: Bearer <your-token>

{
  "modelId": "mistralai/mistral-7b-instruct-v0.3",
  "dryRun": true
}

// Response
{
  "success": true,
  "provider": "nvidia-nim",
  "modelId": "mistralai/mistral-7b-instruct-v0.3",
  "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
  "message": "Dry run successful"
}

Actual Deployment

POST /api/deploy/nims
Content-Type: application/json
Authorization: Bearer <your-token>

{
  "modelId": "mistralai/mistral-7b-instruct-v0.3",
  "dryRun": false
}

// Response
{
  "success": true,
  "provider": "nvidia-nim",
  "modelId": "mistralai/mistral-7b-instruct-v0.3",
  "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
  "message": "NIM Hosted API reachable"
}
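
Both calls share the same request shape, so a small wrapper can run the dry run and the real deployment in sequence. A sketch (the helper name deployNim and its error handling are illustrative; the documented response fields are provider, modelId, endpoint, and message):

// POST /api/deploy/nims; dryRun: true verifies connectivity without deploying
async function deployNim(modelId: string, dryRun: boolean, token: string) {
  const res = await fetch("/api/deploy/nims", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ modelId, dryRun }),
  });
  const body = await res.json();
  if (!body.success) {
    throw new Error(`Deployment failed for ${modelId}`);
  }
  return body; // { success, provider, modelId, endpoint, message }
}

// Dry run first, then the actual deployment
await deployNim("mistralai/mistral-7b-instruct-v0.3", true, token);
await deployNim("mistralai/mistral-7b-instruct-v0.3", false, token);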

Create Conversation

POST /api/conversations
Content-Type: application/json
Authorization: Bearer <your-token>

{
  "title": "Mistral Assistant",
  "use_case": "general-purpose",
  "deployed_resource": {
    "resource_name": "Mistral 7B",
    "resource_type": "nvidia-nim",
    "aws_resource_id": "mistralai/mistral-7b-instruct-v0.3",
    "region": "global",
    "instance_type": "api-based",
    "estimated_hourly_cost": 0
  }
}

// Response
{
  "success": true,
  "conversation_id": "uuid-here",
  "resource_id": "resource-uuid-here",
  "message": "Conversation and resource created successfully"
}
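
The returned conversation_id and resource_id are what you pass to later calls such as usage tracking and status checks. A short sketch of capturing them in TypeScript (the payload mirrors the example above; the token variable is illustrative):

// Create the conversation and keep the IDs for later usage/status calls
const res = await fetch("/api/conversations", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${token}`,
  },
  body: JSON.stringify({
    title: "Mistral Assistant",
    use_case: "general-purpose",
    deployed_resource: {
      resource_name: "Mistral 7B",
      resource_type: "nvidia-nim",
      aws_resource_id: "mistralai/mistral-7b-instruct-v0.3",
      region: "global",
      instance_type: "api-based",
      estimated_hourly_cost: 0,
    },
  }),
});
const { conversation_id, resource_id } = await res.json();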

Available Models

Curated Model List

Staque IO provides access to a curated selection of high-quality models:

Model | Provider | Size | Best For
meta/llama-3.2-11b-vision-instruct | Meta | 11B | Vision + text tasks
mistralai/mistral-7b-instruct-v0.3 | Mistral AI | 7B | General purpose
meta/llama3-8b-instruct | Meta | 8B | Balanced performance
nvidia/nemotron-mini-4b-instruct | NVIDIA | 4B | Fast, efficient
google/gemma-3-27b-it | Google | 27B | High-quality outputs
ibm/granite-34b-code-instruct | IBM | 34B | Code generation
microsoft/phi-4-mini-instruct | Microsoft | Small | Efficient reasoning

Note: The full NVIDIA NIM catalog includes 200+ models. Staque IO filters this down to a curated set of high-performance, reliable models. See the full list at build.nvidia.com.

Model Configuration

System Prompts

Customize model behavior with system prompts:

POST /api/bedrock/system-prompt
Content-Type: application/json
Authorization: Bearer <your-token>

{
  "modelId": "mistralai/mistral-7b-instruct-v0.3",
  "systemPrompt": "You are a technical support specialist..."
}

// Response
{
  "success": true,
  "message": "System prompt updated successfully"
}

Request Parameters

Configure inference parameters per request:

  • temperature: 0.0-1.0 (controls randomness; higher values give more varied output)
  • max_tokens: Maximum response length
  • top_p: Nucleus sampling threshold
  • stream: Enable streaming responses
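
The hosted endpoint follows the familiar chat-completions request shape, so as a sketch these parameters would sit in the request body like this (model and prompt are illustrative):

// Example per-request inference parameters for the chat completions endpoint
const requestBody = {
  model: "mistralai/mistral-7b-instruct-v0.3",
  messages: [
    { role: "user", content: "Summarize our return policy in two sentences." },
  ],
  temperature: 0.3, // lower values give more deterministic answers
  max_tokens: 256,  // cap response length to control cost and latency
  top_p: 0.9,       // nucleus sampling threshold
  stream: false,    // set to true to receive tokens incrementally
};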

Cost Management

Understanding NIM Pricing

  • Token-based: Pay per input/output token
  • No idle costs: Only charged when you make requests
  • No infrastructure: No instance or hourly fees
  • Free tier available: Limited requests per day for testing

Cost Comparison vs SageMaker

Aspect | NVIDIA NIM | SageMaker
Pricing Model | Per-token | Per-hour
Idle Cost | $0 | ~$0.74-3.83/hr
Deployment Time | Instant | 5-10 minutes
Infrastructure | Managed | Self-managed
Best For | Variable workloads, testing | High-volume, custom models

Track Usage

GET /api/usage/current?conversationId=<conversation-id>

// Response includes cost tracking
{
  "success": true,
  "mtd": {
    "cost_usd": 2.45,
    "tokens_in": 45000,
    "tokens_out": 67500,
    "requests": 350
  }
}
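
A short sketch of reading the month-to-date (mtd) figures from TypeScript, assuming the same bearer-token auth used by the other endpoints (token and conversationId are illustrative):

// Fetch month-to-date usage and cost for a conversation
const res = await fetch(`/api/usage/current?conversationId=${conversationId}`, {
  headers: { Authorization: `Bearer ${token}` },
});
const { mtd } = await res.json();
console.log(
  `MTD: $${mtd.cost_usd} across ${mtd.requests} requests ` +
    `(${mtd.tokens_in} tokens in, ${mtd.tokens_out} tokens out)`
);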

Performance Optimization

Model Selection Tips

  • Small models (1-4B): Use for simple tasks, high-volume requests, low latency needs
  • Medium models (7-11B): Use for balanced performance, general-purpose tasks
  • Large models (27-34B): Use for complex reasoning, high-quality outputs, specialized tasks

Latency Optimization

  • Choose smaller models when possible
  • Reduce max_tokens to minimum required
  • Use shorter system prompts
  • Enable streaming for better user experience
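
Streaming is the biggest perceived-latency win for chat UIs. A minimal server-side sketch against the hosted endpoint, assuming the OpenAI-compatible server-sent-events format (model and prompt are illustrative; keep this server-side so the API key is never exposed):

// Stream a completion from the NIM hosted endpoint (Node 18+, server-side only)
const res = await fetch("https://integrate.api.nvidia.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.NVIDIA_API_KEY}`,
  },
  body: JSON.stringify({
    model: "nvidia/nemotron-mini-4b-instruct",
    messages: [{ role: "user", content: "Give me three onboarding tips." }],
    max_tokens: 128,
    stream: true, // tokens arrive incrementally as server-sent events
  }),
});

// Read the SSE stream chunk by chunk; each chunk contains raw "data: {...}" lines
const reader = res.body!.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  process.stdout.write(decoder.decode(value));
}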

Monitoring and Health

Check NIM Status

GET /api/resources/<resource-id>/status

// Response
{
  "success": true,
  "resource": {
    "type": "nvidia-nim",
    "status": "InService",
    "health": "healthy"
  },
  "metrics": {
    "response_time_ms": 150
  },
  "costs": {
    "hourly_cost": 0,
    "pricing_model": "token"
  }
}

Troubleshooting

Common Issues

Error: "NVIDIA_API_KEY is not configured"

Cause: Missing NVIDIA API key environment variable

Solution: Get an API key from NGC and add it to .env.local

Error: "401 Unauthorized"

Cause: Invalid or expired API key

Solution: Regenerate API key from NGC console and update environment variable

Error: "Model not found"

Cause: Model ID not in curated list or doesn't exist

Solution: Use GET /api/models/nvidia to get valid model IDs

Error: "Rate limit exceeded"

Cause: Too many requests in a short time period

Solution: Implement exponential backoff or request higher rate limits from NVIDIA
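
A minimal backoff sketch in TypeScript (the retry count and delays are illustrative):

// Retry a request with exponential backoff when the API responds 429 (rate limited)
async function fetchWithBackoff(
  url: string,
  init: RequestInit,
  maxRetries = 5
): Promise<Response> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429) return res;
    // Wait 1s, 2s, 4s, ... before the next attempt
    const delayMs = 1000 * 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("Rate limit exceeded after retries");
}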

Best Practices

Development

  • Start with smaller models for testing
  • Use dry run deployments to verify connectivity
  • Monitor token usage closely
  • Test with different models to find best fit

Production

  • Implement request caching when appropriate
  • Use conversation context judiciously (trim when too long)
  • Set up monitoring for API errors and latency
  • Have fallback models in case of API issues

Security

  • Never expose API keys to clients
  • Rotate API keys periodically
  • Use environment variables for key management
  • Monitor for unauthorized usage

Next Steps