NVIDIA NIM (Inference Microservices)
NVIDIA NIM provides optimized inference microservices for AI models, delivering high-performance, cost-effective API access to popular open-source models with NVIDIA's hardware acceleration.
Key Features
- ✓ GPU-Optimized: TensorRT and CUDA-optimized inference for maximum performance
- ✓ Cost-Effective: Pay-per-use pricing with no infrastructure management
- ✓ Fast Response Times: Sub-second latency for most models
- ✓ Popular Models: Access to Llama, Mistral, Gemma, and more
- ✓ Easy Integration: OpenAI-compatible API format
Available Models
Staque IO integrates with a curated selection of NVIDIA NIM models optimized for performance:
Meta Llama Models
- llama-3.2-11b-vision-instruct: Multimodal model with vision capabilities
- llama-3.2-3b-instruct: Efficient 3B parameter model
- llama-3.2-1b-instruct: Ultra-fast 1B parameter model
- llama3-8b-instruct: Balanced 8B model for general use
Mistral AI Models
- mistral-7b-instruct-v0.3: Latest Mistral 7B with improved capabilities
- mistral-7b-instruct-v0.2: Stable Mistral 7B version
Google Gemma Models
- gemma-3-12b-it: Google's 12B instruction-tuned model
- gemma-3-27b-it: Larger Gemma model for complex tasks
- gemma-2-9b-it: Efficient 9B Gemma model
Microsoft Phi Models
- phi-4-mini-instruct: Latest compact high-performance model
- phi-3.5-mini-instruct: Small but powerful instruction model
IBM Granite Models
- granite-3.3-8b-instruct: Latest enterprise-grade model
- granite-3.0-8b-instruct: Business-focused 8B model
- granite-34b-code-instruct: Code-specialized large model
- granite-3.0-3b-a800m-instruct: Compact efficient model
Other Models
- nemotron-mini-4b-instruct: NVIDIA's efficient instruction model
- jamba-1.5-mini-instruct: AI21 Labs' hybrid SSM-Transformer
- breeze-7b-instruct: MediaTek's optimized 7B model
- solar-10.7b-instruct: Upstage's high-performance model
How It Works in Staque IO
1. Model Selection from the curated list
↓
2. API Connectivity Verification (instant)
↓
3. Configuration Storage (no infrastructure provisioning)
↓
4. Immediate Availability
↓
5. Pay only for requests made
"Deployment" Process
NVIDIA NIM models don't require traditional deployment. When you "deploy" a NIM model in Staque IO:
- The platform verifies API connectivity to NVIDIA's hosted service
- A ping test confirms the model is reachable
- A configuration entry is created in the database
- The model is immediately ready for chat interactions
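The connectivity check is lightweight. A minimal TypeScript sketch of what such a verification might look like; the pingNimModel helper and the one-token probe request are illustrative assumptions, not Staque IO's internal implementation:

```typescript
// Hypothetical sketch: confirm a NIM-hosted model answers before storing its config.
// Assumes NVIDIA_API_KEY is set; the single-token completion is only a reachability probe.
async function pingNimModel(modelId: string): Promise<boolean> {
  const res = await fetch("https://integrate.api.nvidia.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.NVIDIA_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: modelId,
      messages: [{ role: "user", content: "ping" }],
      max_tokens: 1,
    }),
  });
  return res.ok; // a 200 response means the hosted model is reachable with this key
}
```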
OpenAI-Compatible API
NVIDIA NIM uses an OpenAI-compatible API format, making it easy to integrate with existing tools and libraries designed for OpenAI's API.
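Because the format is OpenAI-compatible, an existing OpenAI client can typically be pointed at NVIDIA's endpoint by overriding the base URL and API key. A minimal TypeScript sketch using the openai npm package; the model ID and prompt are only examples:

```typescript
import OpenAI from "openai";

// Reuse the OpenAI SDK against NVIDIA's hosted endpoint by swapping baseURL and key.
const client = new OpenAI({
  baseURL: "https://integrate.api.nvidia.com/v1",
  apiKey: process.env.NVIDIA_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "mistralai/mistral-7b-instruct-v0.3",
  messages: [{ role: "user", content: "Hello!" }],
  temperature: 0.7,
  max_tokens: 1000,
});

console.log(completion.choices[0].message.content);
```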
Pricing
Pay-Per-Use Model
NVIDIA NIM uses token-based pricing similar to other hosted inference services:
- No Idle Costs: Pay only for actual inference requests
- Token-Based: Charged per input and output token
- Cost-Effective: Generally lower cost than dedicated, always-on infrastructure
- Predictable: Transparent per-token pricing
💰 Cost Advantage: NVIDIA NIM is often 2-3x more cost-effective than running equivalent models on dedicated SageMaker instances, especially for variable or low-to-medium volume workloads.
Cost Comparison
Scenario: 1M tokens/day (input + output)

NVIDIA NIM (token-based):
- Daily cost: ~$2-5 (varies by model)
- Monthly cost: ~$60-150

SageMaker ml.g4dn.xlarge (dedicated):
- Daily cost: $20.40 (24/7 operation)
- Monthly cost: $612.00

Savings: 75-90% for variable workloads
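The comparison above is easy to rerun for your own traffic. A small sketch of the arithmetic; the per-million-token price is a placeholder to replace with the current rate for your chosen model:

```typescript
// Rough monthly cost estimate for token-based pricing.
// pricePerMillionTokens is an assumed placeholder; check current NVIDIA pricing per model.
function estimateMonthlyCost(tokensPerDay: number, pricePerMillionTokens: number): number {
  const dailyCost = (tokensPerDay / 1_000_000) * pricePerMillionTokens;
  return dailyCost * 30;
}

// Example: 1M tokens/day at a hypothetical $3 per million tokens ≈ $90/month,
// versus ~$612/month for a 24/7 ml.g4dn.xlarge SageMaker instance.
console.log(estimateMonthlyCost(1_000_000, 3));
```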
Configuration
Required Environment Variables
# NVIDIA API Key (Required)
NVIDIA_API_KEY=nvapi-xxxxxxxxxxxxxxxxxxxxx

# Optional: Custom NIM Base URL (default: https://integrate.api.nvidia.com)
NIM_BASE_URL=https://integrate.api.nvidia.com

# Optional: JWT Secret
JWT_SECRET=your-secret-key-here
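A small sketch of how these variables might be read and validated at startup; the loadNimConfig helper is illustrative and not part of Staque IO:

```typescript
// Illustrative config loader: fail fast if the required key is missing,
// fall back to the default base URL otherwise.
interface NimConfig {
  apiKey: string;
  baseUrl: string;
}

function loadNimConfig(): NimConfig {
  const apiKey = process.env.NVIDIA_API_KEY;
  if (!apiKey || !apiKey.startsWith("nvapi-")) {
    throw new Error("NVIDIA_API_KEY is missing or does not look like an nvapi- key");
  }
  return {
    apiKey,
    baseUrl: process.env.NIM_BASE_URL ?? "https://integrate.api.nvidia.com",
  };
}
```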
Obtaining an API Key
- Visit build.nvidia.com
- Sign in with your NVIDIA account (or create one)
- Navigate to API Keys section
- Generate a new API key
- Add it to your environment variables as NVIDIA_API_KEY
System Prompts
As with Bedrock models, you can customize the system prompt for each NVIDIA NIM model to tailor its behavior to your specific use case.
// Update system prompt
POST /api/bedrock/system-prompt
{
"modelId": "mistralai/mistral-7b-instruct-v0.3",
"systemPrompt": "You are a helpful AI assistant specialized in..."
}
// Retrieve current prompt
GET /api/bedrock/system-prompt?modelId=mistralai/mistral-7b-instruct-v0.3
Usage Examples
Deploying a NIM Model
POST /api/deploy/nims
{
"modelId": "mistralai/mistral-7b-instruct-v0.3",
"dryRun": false
}
// Response (instant)
{
"success": true,
"provider": "nvidia-nim",
"modelId": "mistralai/mistral-7b-instruct-v0.3",
"endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
"message": "NIM Hosted API reachable"
}
Sending a Chat Message
POST /api/chat/thread
{
"message": "Explain quantum computing in simple terms",
"conversationId": "conversation-uuid",
"resourceId": "resource-uuid",
"threadId": "thread-uuid" // Optional
}
// Response
{
"success": true,
"threadId": "thread-uuid",
"messages": [
{
"role": "user",
"content": "Explain quantum computing...",
"timestamp": "2024-01-10T12:00:00Z"
},
{
"role": "assistant",
"content": "Quantum computing uses quantum mechanical...",
"timestamp": "2024-01-10T12:00:01Z",
"tokens_in": 12,
"tokens_out": 245,
"tokens_total": 257,
"latency_ms": 456
}
]
}
Direct API Usage
You can also call the NVIDIA NIM API directly (outside of Staque IO):
curl -X POST 'https://integrate.api.nvidia.com/v1/chat/completions' \
-H "Authorization: Bearer $NVIDIA_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "mistralai/mistral-7b-instruct-v0.3",
"messages": [
{
"role": "system",
"content": "You are a helpful AI assistant."
},
{
"role": "user",
"content": "Hello!"
}
],
"temperature": 0.7,
"max_tokens": 1000
}'
Best Practices
Model Selection
- Prototyping: Start with smaller models (1B-3B) for fast iteration
- Production: Use 7B-8B models for balanced performance
- Complex Tasks: Use 12B-27B models for advanced reasoning
- Code Tasks: Use Granite Code or specialized models
- Vision Tasks: Use Llama 3.2 Vision models
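These guidelines can be captured in a small routing helper. The sketch below uses model IDs from the allow-list table later on this page; the task categories and the mapping itself are illustrative choices, not a Staque IO feature:

```typescript
// Illustrative mapping from task type to an allow-listed NIM model ID.
type Task = "prototyping" | "general" | "reasoning" | "code" | "vision";

const MODEL_FOR_TASK: Record<Task, string> = {
  prototyping: "meta/llama-3.2-1b-instruct",      // smallest, fastest iteration
  general: "mistralai/mistral-7b-instruct-v0.3",  // balanced default
  reasoning: "google/gemma-3-27b-it",             // larger model for complex tasks
  code: "ibm/granite-34b-code-instruct",          // code-specialized
  vision: "meta/llama-3.2-11b-vision-instruct",   // multimodal input
};

function pickModel(task: Task): string {
  return MODEL_FOR_TASK[task];
}
```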
Performance Optimization
- Use smaller models when possible to reduce latency and cost
- Implement client-side caching for common responses (see the sketch after this list)
- Batch similar requests together when feasible
- Use streaming responses for better user experience
- Set appropriate max_tokens limits
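Client-side caching can be as simple as keying responses by model and prompt. A minimal in-memory sketch; the askModel callback it wraps is an assumption for illustration:

```typescript
// Minimal in-memory response cache keyed by model + prompt.
// In production you would add eviction/TTL; this only illustrates the idea.
const cache = new Map<string, string>();

async function cachedAsk(
  modelId: string,
  prompt: string,
  askModel: (modelId: string, prompt: string) => Promise<string>,
): Promise<string> {
  const key = `${modelId}::${prompt}`;
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // skip the paid inference call entirely
  const answer = await askModel(modelId, prompt);
  cache.set(key, answer);
  return answer;
}
```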
Cost Optimization
- Choose the smallest model that meets quality requirements
- Implement prompt engineering to reduce token usage
- Use context length efficiently
- Monitor usage through Staque IO's tracking
- Set up alerts for unexpected usage spikes
Reliability
- Implement retry logic with exponential backoff (see the sketch after this list)
- Handle rate limits gracefully
- Use timeout settings appropriate for your use case
- Have fallback models configured
- Monitor API status and response times
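A minimal sketch of retry with exponential backoff around a NIM request; the retryable status codes and delays are reasonable defaults rather than prescribed values:

```typescript
// Retry a request with exponential backoff on rate limits and transient server errors.
async function withRetry(
  doRequest: () => Promise<Response>,
  maxAttempts = 4,
): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await doRequest();
    if (res.ok) return res;
    const retryable = res.status === 429 || res.status >= 500;
    if (!retryable || attempt === maxAttempts - 1) return res;
    const delayMs = 500 * 2 ** attempt; // 500ms, 1s, 2s, ...
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("unreachable");
}
```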
Model Availability
Staque IO uses a curated allow-list of NVIDIA NIM models to ensure quality and compatibility. Currently supported models include:
| Model ID | Size | Best For |
|---|---|---|
| meta/llama-3.2-11b-vision-instruct | 11B | Multimodal tasks |
| mistralai/mistral-7b-instruct-v0.3 | 7B | General purpose |
| google/gemma-3-12b-it | 12B | Complex reasoning |
| ibm/granite-34b-code-instruct | 34B | Code generation |
| microsoft/phi-4-mini-instruct | ~4B | Fast responses |
See the full list of supported models via the API: GET /api/models/nvidia
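A quick way to check the current allow-list from a script; the STAQUE_BASE_URL and STAQUE_API_TOKEN environment variables and the response shape are assumptions to adapt to your deployment:

```typescript
// Fetch the curated NVIDIA NIM model list from Staque IO.
// STAQUE_BASE_URL and STAQUE_API_TOKEN are assumed names; the payload shape is not documented here.
const res = await fetch(`${process.env.STAQUE_BASE_URL}/api/models/nvidia`, {
  headers: { Authorization: `Bearer ${process.env.STAQUE_API_TOKEN}` },
});
const models = await res.json();
console.log(models);
```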
Troubleshooting
API Key Invalid
Problem: 401 Unauthorized or invalid API key error
Solution:
- Verify NVIDIA_API_KEY is correctly set in the environment
- Check the API key hasn't expired (regenerate if needed)
- Ensure no extra spaces or quotes in the key
- Verify key is from build.nvidia.com, not another NVIDIA service
Model Not Available
Problem: Model ID not found or not accessible
Solution:
- Check model ID is in the allow-list (see table above)
- Verify the model ID format: provider/model-name
- List available models: GET /api/models/nvidia
- Contact support to add new models to the allow-list
Rate Limiting
Problem: 429 Too Many Requests
Solution:
- Implement exponential backoff retry logic
- Reduce request frequency
- Check your NVIDIA account tier limits
- Consider upgrading to a paid NVIDIA tier for higher limits
Slow Responses
Problem: Higher than expected latency
Solution:
- Use smaller models (1B-3B instead of 7B+)
- Reduce the max_tokens parameter
- Enable streaming for incremental responses
- Check network connectivity to NVIDIA's API
- Consider caching frequent responses