NVIDIA NIM Deployment
Deploy optimized AI models using the NVIDIA Inference Microservices (NIM) Hosted API.
✅ Benefits
- ✓ No infrastructure management - Fully hosted by NVIDIA
- ✓ Pay-per-use pricing - No idle costs
- ✓ Instant deployment - No provisioning wait time
- ✓ High performance - Optimized inference with TensorRT
- ✓ Wide model selection - Access to popular open-source models
Prerequisites
NVIDIA NGC Account
- Sign up at ngc.nvidia.com
- Generate an API key from your account settings
- No credit card required for free tier access
Required Environment Variables
NVIDIA_API_KEY=nvapi-XXXXXXXXXXXXXXXXXXXXXXXXXXXX
NIM_BASE_URL=https://integrate.api.nvidia.com  # Optional, defaults to this
Step 1: Get NVIDIA API Key
Via NGC Console
- Log in to NVIDIA NGC
- Click on your profile icon (top right)
- Select "Setup" → "Generate API Key"
- Copy the generated key
- Add to your .env.local file
Security Note: Never commit your API key to version control. Always use environment variables.
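If you are wiring the key into a Node.js or Next.js backend, a fail-fast check at startup surfaces a misconfigured key immediately instead of as a 401 at request time. A minimal sketch (the helper name and file layout are illustrative, not part of Staque IO):

```typescript
// env.ts -- hypothetical helper; adapt to your project's structure.
// Reads the NVIDIA credentials from the server-side environment and fails
// fast with a clear error instead of surfacing an authorization failure later.
export function getNvidiaConfig(): { apiKey: string; baseUrl: string } {
  const apiKey = process.env.NVIDIA_API_KEY;
  if (!apiKey || !apiKey.startsWith("nvapi-")) {
    throw new Error("NVIDIA_API_KEY is not configured (expected a key starting with 'nvapi-')");
  }
  return {
    apiKey,
    // Falls back to the default NIM Hosted API endpoint if NIM_BASE_URL is unset.
    baseUrl: process.env.NIM_BASE_URL ?? "https://integrate.api.nvidia.com",
  };
}
```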
Step 2: Browse Available Models
Via Staque IO API
GET /api/models/nvidia
// Response
{
"success": true,
"provider": "nvidia-nim",
"models": [
{
"id": "meta/llama-3.2-11b-vision-instruct",
"name": "Llama 3.2 11B Vision Instruct",
"provider": "nvidia-nim",
"tags": ["instruction-following", "vision", "11b"],
"task": "text-generation"
},
{
"id": "mistralai/mistral-7b-instruct-v0.3",
"name": "Mistral 7B Instruct",
"provider": "nvidia-nim",
"tags": ["instruction-following", "7b"],
"task": "text-generation"
}
]
}
Via NVIDIA Catalog
Browse all available models at build.nvidia.com
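If you prefer to script this step, the same endpoint works from any HTTP client. A minimal TypeScript sketch using fetch; the bearer-token auth and response shape follow the example above, and the interface name is ours:

```typescript
// Hypothetical response typing based on the JSON shown above.
interface NimModel {
  id: string;
  name: string;
  provider: string;
  tags: string[];
  task: string;
}

// List the curated NVIDIA NIM models exposed by the Staque IO API.
async function listNvidiaModels(baseUrl: string, token: string): Promise<NimModel[]> {
  const res = await fetch(`${baseUrl}/api/models/nvidia`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`Failed to list models: ${res.status}`);
  const body = await res.json();
  return body.models as NimModel[];
}
```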
Step 3: Deploy via Staque IO UI
- Navigate to Get Started
- Click "Get Started" in the navigation
- Get AI recommendations or skip to model selection
- Select NVIDIA NIM Platform
- Choose "NVIDIA NIM" as the platform
- Browse curated model list
- Choose Model
- Select model based on your use case
- Review model capabilities and context window
- Configure Deployment
- Enter conversation title
- Specify use case
- No instance type selection needed (fully managed)
- Deploy
- Click "Deploy" - instant activation
- System verifies API connectivity
- Ready to use immediately
Step 4: Deploy via API
Verify Connectivity (Dry Run)
POST /api/deploy/nims
Content-Type: application/json
Authorization: Bearer <your-token>
{
"modelId": "mistralai/mistral-7b-instruct-v0.3",
"dryRun": true
}
// Response
{
"success": true,
"provider": "nvidia-nim",
"modelId": "mistralai/mistral-7b-instruct-v0.3",
"endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
"message": "Dry run successful"
}
Actual Deployment
POST /api/deploy/nims
Content-Type: application/json
Authorization: Bearer <your-token>
{
"modelId": "mistralai/mistral-7b-instruct-v0.3",
"dryRun": false
}
// Response
{
"success": true,
"provider": "nvidia-nim",
"modelId": "mistralai/mistral-7b-instruct-v0.3",
"endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
"message": "NIM Hosted API reachable"
}
Create Conversation
POST /api/conversations
Content-Type: application/json
Authorization: Bearer <your-token>
{
"title": "Mistral Assistant",
"use_case": "general-purpose",
"deployed_resource": {
"resource_name": "Mistral 7B",
"resource_type": "nvidia-nim",
"aws_resource_id": "mistralai/mistral-7b-instruct-v0.3",
"region": "global",
"instance_type": "api-based",
"estimated_hourly_cost": 0
}
}
// Response
{
"success": true,
"conversation_id": "uuid-here",
"resource_id": "resource-uuid-here",
"message": "Conversation and resource created successfully"
}
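The three calls above can be chained into a single deployment script. A hedged TypeScript sketch; the paths and payloads mirror the examples above, and error handling is kept minimal:

```typescript
// Verify connectivity, deploy, and create the tracking conversation in one pass.
async function deployNimModel(baseUrl: string, token: string, modelId: string) {
  const headers = {
    "Content-Type": "application/json",
    Authorization: `Bearer ${token}`,
  };

  // 1. Dry run: verify the NIM Hosted API is reachable before committing.
  const dryRun = await fetch(`${baseUrl}/api/deploy/nims`, {
    method: "POST",
    headers,
    body: JSON.stringify({ modelId, dryRun: true }),
  });
  if (!dryRun.ok) throw new Error(`Dry run failed: ${dryRun.status}`);

  // 2. Actual deployment -- instant, since no infrastructure is provisioned.
  const deploy = await fetch(`${baseUrl}/api/deploy/nims`, {
    method: "POST",
    headers,
    body: JSON.stringify({ modelId, dryRun: false }),
  });
  if (!deploy.ok) throw new Error(`Deployment failed: ${deploy.status}`);

  // 3. Create the conversation that tracks the deployed resource.
  const conv = await fetch(`${baseUrl}/api/conversations`, {
    method: "POST",
    headers,
    body: JSON.stringify({
      title: "Mistral Assistant",
      use_case: "general-purpose",
      deployed_resource: {
        resource_name: "Mistral 7B",
        resource_type: "nvidia-nim",
        aws_resource_id: modelId,
        region: "global",
        instance_type: "api-based",
        estimated_hourly_cost: 0,
      },
    }),
  });
  return conv.json(); // { success, conversation_id, resource_id, message }
}
```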
Available Models
Curated Model List
Staque IO provides access to a curated selection of high-quality models:
| Model | Provider | Size | Best For |
|---|---|---|---|
| meta/llama-3.2-11b-vision-instruct | Meta | 11B | Vision + text tasks |
| mistralai/mistral-7b-instruct-v0.3 | Mistral AI | 7B | General purpose |
| meta/llama3-8b-instruct | Meta | 8B | Balanced performance |
| nvidia/nemotron-mini-4b-instruct | NVIDIA | 4B | Fast, efficient |
| google/gemma-3-27b-it | Google | 27B | High-quality outputs |
| ibm/granite-34b-code-instruct | IBM | 34B | Code generation |
| microsoft/phi-4-mini-instruct | Microsoft | Small | Efficient reasoning |
Note: The full NVIDIA NIM catalog includes 200+ models. Staque IO filters to a curated set of high-performance, reliable models. See the full list at build.nvidia.com
Model Configuration
System Prompts
Customize model behavior with system prompts:
POST /api/bedrock/system-prompt
Content-Type: application/json
Authorization: Bearer <your-token>
{
"modelId": "mistralai/mistral-7b-instruct-v0.3",
"systemPrompt": "You are a technical support specialist..."
}
// Response
{
"success": true,
"message": "System prompt updated successfully"
}
Request Parameters
Configure inference parameters per request (see the example after this list):
- temperature: 0.0-1.0 (creativity level)
- max_tokens: Maximum response length
- top_p: Nucleus sampling threshold
- stream: Enable streaming responses
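These parameters map onto the OpenAI-compatible request body accepted by the NIM chat completions endpoint shown earlier. A sketch of a direct server-side call; if you route requests through Staque IO instead, the parameter names should carry over unchanged:

```typescript
// Direct call to the NIM Hosted API (OpenAI-compatible chat completions).
// NVIDIA_API_KEY must stay server-side; never ship it to the browser.
const response = await fetch("https://integrate.api.nvidia.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.NVIDIA_API_KEY}`,
  },
  body: JSON.stringify({
    model: "mistralai/mistral-7b-instruct-v0.3",
    messages: [{ role: "user", content: "Summarize NVIDIA NIM in two sentences." }],
    temperature: 0.2,  // low temperature for focused, deterministic output
    max_tokens: 256,   // cap response length to control cost and latency
    top_p: 0.9,        // nucleus sampling threshold
    stream: false,     // set true to receive tokens incrementally
  }),
});
const data = await response.json();
console.log(data.choices[0].message.content);
```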
Cost Management
Understanding NIM Pricing
- Token-based: Pay per input/output token
- No idle costs: Only charged when you make requests
- No infrastructure: No instance or hourly fees
- Free tier available: Limited requests per day for testing
Cost Comparison vs SageMaker
| Aspect | NVIDIA NIM | SageMaker |
|---|---|---|
| Pricing Model | Per-token | Per-hour |
| Idle Cost | $0 | ~$0.74-3.83/hr |
| Deployment Time | Instant | 5-10 minutes |
| Infrastructure | Managed | Self-managed |
| Best For | Variable workloads, testing | High-volume, custom models |
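A quick back-of-the-envelope comparison makes the trade-off concrete. The rates below are placeholders, not actual NVIDIA or AWS prices; substitute current pricing for your model and instance type:

```typescript
// Hypothetical rates for illustration only -- check current NIM and SageMaker pricing.
const nimCostPerMTokens = 0.20; // USD per 1M tokens (input + output combined, assumed)
const sagemakerHourly = 1.50;   // USD per hour for an always-on endpoint, assumed

// Example month: 20M tokens processed, endpoint available 24/7.
const tokensPerMonth = 20_000_000;
const nimMonthly = (tokensPerMonth / 1_000_000) * nimCostPerMTokens; // $4
const sagemakerMonthly = sagemakerHourly * 24 * 30;                  // $1,080

console.log({ nimMonthly, sagemakerMonthly });
// With bursty or low-volume traffic, per-token pricing wins; sustained high
// throughput can tip the balance back toward a dedicated endpoint.
```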
Track Usage
GET /api/usage/current?conversationId=<conversation-id>
// Response includes cost tracking
{
"success": true,
"mtd": {
"cost_usd": 2.45,
"tokens_in": 45000,
"tokens_out": 67500,
"requests": 350
}
}
Performance Optimization
Model Selection Tips
- Small models (1-4B): Use for simple tasks, high-volume requests, low latency needs
- Medium models (7-11B): Use for balanced performance, general-purpose tasks
- Large models (27-34B): Use for complex reasoning, high-quality outputs, specialized tasks
Latency Optimization
- Choose smaller models when possible
- Reduce max_tokens to minimum required
- Use shorter system prompts
- Enable streaming for better user experience (see the sketch below)
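Streaming returns tokens as they are generated, which improves perceived latency even when total generation time is unchanged. A minimal sketch of consuming the stream from the NIM endpoint; the SSE parsing is simplified and assumes each event arrives in a single chunk:

```typescript
// Consume a streamed completion (SSE "data:" lines, OpenAI-compatible format).
// Assumes a Node 18+ server runtime where fetch exposes a web ReadableStream.
const res = await fetch("https://integrate.api.nvidia.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.NVIDIA_API_KEY}`,
  },
  body: JSON.stringify({
    model: "mistralai/mistral-7b-instruct-v0.3",
    messages: [{ role: "user", content: "Write a haiku about GPUs." }],
    stream: true,
  }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Each chunk may contain one or more "data: {...}" lines; "[DONE]" ends the stream.
  for (const line of decoder.decode(value).split("\n")) {
    if (!line.startsWith("data:") || line.includes("[DONE]")) continue;
    const delta = JSON.parse(line.slice(5)).choices?.[0]?.delta?.content;
    if (delta) process.stdout.write(delta);
  }
}
```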
Monitoring and Health
Check NIM Status
GET /api/resources/<resource-id>/status
// Response
{
"success": true,
"resource": {
"type": "nvidia-nim",
"status": "InService",
"health": "healthy"
},
"metrics": {
"response_time_ms": 150
},
"costs": {
"hourly_cost": 0,
"pricing_model": "token"
}
}
Troubleshooting
Common Issues
Error: "NVIDIA_API_KEY is not configured"
Cause: Missing NVIDIA API key environment variable
Solution: Get API key from NGC and add to .env.local
Error: "401 Unauthorized"
Cause: Invalid or expired API key
Solution: Regenerate API key from NGC console and update environment variable
Error: "Model not found"
Cause: Model ID not in curated list or doesn't exist
Solution: Use GET /api/models/nvidia to get valid model IDs
Error: "Rate limit exceeded"
Cause: Too many requests in short time period
Solution: Implement exponential backoff or request higher rate limits from NVIDIA
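A simple retry wrapper with exponential backoff, as suggested above; the retry count and delays are arbitrary starting points:

```typescript
// Retry a request with exponential backoff when the API returns 429 (rate limited).
async function fetchWithBackoff(url: string, init: RequestInit, maxRetries = 5): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429) return res; // success, or a non-retryable error
    const delayMs = Math.min(2 ** attempt * 1000, 30_000); // 1s, 2s, 4s, ... capped at 30s
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("Rate limit still exceeded after retries");
}
```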
Best Practices
Development
- Start with smaller models for testing
- Use dry run deployments to verify connectivity
- Monitor token usage closely
- Test with different models to find best fit
Production
- Implement request caching when appropriate
- Use conversation context judiciously (trim when too long)
- Set up monitoring for API errors and latency
- Have fallback models in case of API issues (see the sketch below)
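One way to implement the fallback recommendation: try the primary model and fall back to a smaller one when a call fails. The wrapper below is illustrative; the model IDs come from the curated list above:

```typescript
// Try each model in order until one call succeeds.
async function completeWithFallback(prompt: string, apiKey: string): Promise<string> {
  const models = [
    "mistralai/mistral-7b-instruct-v0.3", // primary
    "nvidia/nemotron-mini-4b-instruct",   // smaller, cheaper fallback
  ];
  for (const model of models) {
    try {
      const res = await fetch("https://integrate.api.nvidia.com/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
        body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
      });
      if (!res.ok) continue; // try the next model on API errors
      const data = await res.json();
      return data.choices[0].message.content;
    } catch {
      // Network or parsing error: fall through to the next model.
    }
  }
  throw new Error("All fallback models failed");
}
```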
Security
- Never expose API keys to clients
- Rotate API keys periodically
- Use environment variables for key management
- Monitor for unauthorized usage