NVIDIA NIM (Inference Microservices)
NVIDIA NIM provides optimized inference microservices for AI models, delivering high-performance, cost-effective API access to popular open-source models with NVIDIA's hardware acceleration.
Key Features
- ✓ GPU-Optimized: TensorRT and CUDA-optimized inference for maximum performance
- ✓ Cost-Effective: Pay-per-use pricing with no infrastructure management
- ✓ Fast Response Times: Sub-second latency for most models
- ✓ Popular Models: Access to Llama, Mistral, Gemma, and more
- ✓ Easy Integration: OpenAI-compatible API format
Available Models
Staque IO integrates with a curated selection of NVIDIA NIM models optimized for performance:
Meta Llama Models
- llama-3.2-11b-vision-instruct: Multimodal model with vision capabilities
- llama-3.2-3b-instruct: Efficient 3B parameter model
- llama-3.2-1b-instruct: Ultra-fast 1B parameter model
- llama3-8b-instruct: Balanced 8B model for general use
Mistral AI Models
- mistral-7b-instruct-v0.3: Latest Mistral 7B with improved capabilities
- mistral-7b-instruct-v0.2: Stable Mistral 7B version
Google Gemma Models
- gemma-3-12b-it: Google's 12B instruction-tuned model
- gemma-3-27b-it: Larger Gemma model for complex tasks
- gemma-2-9b-it: Efficient 9B Gemma model
Microsoft Phi Models
- phi-4-mini-instruct: Latest compact high-performance model
- phi-3.5-mini-instruct: Small but powerful instruction model
IBM Granite Models
- granite-3.3-8b-instruct: Latest enterprise-grade model
- granite-3.0-8b-instruct: Business-focused 8B model
- granite-34b-code-instruct: Code-specialized large model
- granite-3.0-3b-a800m-instruct: Compact efficient model
Other Models
- nemotron-mini-4b-instruct: NVIDIA's efficient instruction model
- jamba-1.5-mini-instruct: AI21 Labs' hybrid SSM-Transformer
- breeze-7b-instruct: MediaTek's optimized 7B model
- solar-10.7b-instruct: Upstage's high-performance model
How It Works in Staque IO
1. Model Selection from the curated list
↓
2. API Connectivity Verification (instant)
↓
3. Configuration Storage (no infrastructure provisioning)
↓
4. Immediate Availability
↓
5. Pay only for requests made
"Deployment" Process
NVIDIA NIM models don't require traditional deployment. When you "deploy" a NIM model in Staque IO:
- The platform verifies API connectivity to NVIDIA's hosted service
- A ping test confirms the model is reachable
- A configuration entry is created in the database
- The model is immediately ready for chat interactions
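The connectivity check is lightweight. A minimal TypeScript sketch of what such a verification might look like; the pingNimModel helper and the one-token probe request are illustrative assumptions, not Staque IO's internal implementation:

```typescript
// Hypothetical sketch: confirm a NIM-hosted model answers before storing its config.
// Assumes NVIDIA_API_KEY is set; the single-token completion is only a reachability probe.
async function pingNimModel(modelId: string): Promise<boolean> {
  const res = await fetch("https://integrate.api.nvidia.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.NVIDIA_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: modelId,
      messages: [{ role: "user", content: "ping" }],
      max_tokens: 1,
    }),
  });
  return res.ok; // a 200 response means the hosted model is reachable with this key
}
```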
OpenAI-Compatible API
NVIDIA NIM uses an OpenAI-compatible API format, making it easy to integrate with existing tools and libraries designed for OpenAI's API.
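Because the format is OpenAI-compatible, an existing OpenAI client can typically be pointed at NVIDIA's endpoint by overriding the base URL and API key. A minimal TypeScript sketch using the openai npm package; the model ID and prompt are only examples:

```typescript
import OpenAI from "openai";

// Reuse the OpenAI SDK against NVIDIA's hosted endpoint by swapping baseURL and key.
const client = new OpenAI({
  baseURL: "https://integrate.api.nvidia.com/v1",
  apiKey: process.env.NVIDIA_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "mistralai/mistral-7b-instruct-v0.3",
  messages: [{ role: "user", content: "Hello!" }],
  temperature: 0.7,
  max_tokens: 1000,
});

console.log(completion.choices[0].message.content);
```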
Pricing
Pay-Per-Use Model
NVIDIA NIM uses token-based pricing similar to other hosted inference services:
- No Idle Costs: Pay only for actual inference requests
- Token-Based: Charged per input and output token
- Cost-Effective: Generally lower cost than dedicated, always-on infrastructure
- Predictable: Transparent per-token pricing
💰 Cost Advantage: NVIDIA NIM is often 2-3x more cost-effective than running equivalent models on dedicated SageMaker instances, especially for variable or low-to-medium volume workloads.
Cost Comparison
Scenario: 1M tokens/day (input + output)

NVIDIA NIM (token-based):
- Daily cost: ~$2-5 (varies by model)
- Monthly cost: ~$60-150

SageMaker ml.g4dn.xlarge (dedicated):
- Daily cost: $20.40 (24/7 operation)
- Monthly cost: $612.00

Savings: 75-90% for variable workloads
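The comparison above is easy to rerun for your own traffic. A small sketch of the arithmetic; the per-million-token price is a placeholder to replace with the current rate for your chosen model:

```typescript
// Rough monthly cost estimate for token-based pricing.
// pricePerMillionTokens is an assumed placeholder; check current NVIDIA pricing per model.
function estimateMonthlyCost(tokensPerDay: number, pricePerMillionTokens: number): number {
  const dailyCost = (tokensPerDay / 1_000_000) * pricePerMillionTokens;
  return dailyCost * 30;
}

// Example: 1M tokens/day at a hypothetical $3 per million tokens ≈ $90/month,
// versus ~$612/month for a 24/7 ml.g4dn.xlarge SageMaker instance.
console.log(estimateMonthlyCost(1_000_000, 3));
```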
Configuration
Required Environment Variables
# NVIDIA API Key (Required)
NVIDIA_API_KEY=nvapi-xxxxxxxxxxxxxxxxxxxxx

# Optional: Custom NIM Base URL (default: https://integrate.api.nvidia.com)
NIM_BASE_URL=https://integrate.api.nvidia.com

# Optional: JWT Secret
JWT_SECRET=your-secret-key-here
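A small sketch of how these variables might be read and validated at startup; the loadNimConfig helper is illustrative and not part of Staque IO:

```typescript
// Illustrative config loader: fail fast if the required key is missing,
// fall back to the default base URL otherwise.
interface NimConfig {
  apiKey: string;
  baseUrl: string;
}

function loadNimConfig(): NimConfig {
  const apiKey = process.env.NVIDIA_API_KEY;
  if (!apiKey || !apiKey.startsWith("nvapi-")) {
    throw new Error("NVIDIA_API_KEY is missing or does not look like an nvapi- key");
  }
  return {
    apiKey,
    baseUrl: process.env.NIM_BASE_URL ?? "https://integrate.api.nvidia.com",
  };
}
```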
Obtaining an API Key
- Visit build.nvidia.com
- Sign in with your NVIDIA account (or create one)
- Navigate to API Keys section
- Generate a new API key
- Add it to your environment variables as NVIDIA_API_KEY
System Prompts
As with Bedrock models, you can customize the system prompt for each NVIDIA NIM model to tailor its behavior to your specific use case.
// Update system prompt
POST /api/bedrock/system-prompt
{
"modelId": "mistralai/mistral-7b-instruct-v0.3",
"systemPrompt": "You are a helpful AI assistant specialized in..."
}
// Retrieve current prompt
GET /api/bedrock/system-prompt?modelId=mistralai/mistral-7b-instruct-v0.3
Usage Examples
Deploying a NIM Model
POST /api/deploy/nims
{
"modelId": "mistralai/mistral-7b-instruct-v0.3",
"dryRun": false
}
// Response (instant)
{
"success": true,
"provider": "nvidia-nim",
"modelId": "mistralai/mistral-7b-instruct-v0.3",
"endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
"message": "NIM Hosted API reachable"
}
Sending a Chat Message
POST /api/chat/thread
{
"message": "Explain quantum computing in simple terms",
"conversationId": "conversation-uuid",
"resourceId": "resource-uuid",
"threadId": "thread-uuid" // Optional
}
// Response
{
"success": true,
"threadId": "thread-uuid",
"messages": [
{
"role": "user",
"content": "Explain quantum computing...",
"timestamp": "2024-01-10T12:00:00Z"
},
{
"role": "assistant",
"content": "Quantum computing uses quantum mechanical...",
"timestamp": "2024-01-10T12:00:01Z",
"tokens_in": 12,
"tokens_out": 245,
"tokens_total": 257,
"latency_ms": 456
}
]
}
Direct API Usage
You can also call the NVIDIA NIM API directly (outside of Staque IO):
curl -X POST 'https://integrate.api.nvidia.com/v1/chat/completions' \
-H "Authorization: Bearer $NVIDIA_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "mistralai/mistral-7b-instruct-v0.3",
"messages": [
{
"role": "system",
"content": "You are a helpful AI assistant."
},
{
"role": "user",
"content": "Hello!"
}
],
"temperature": 0.7,
"max_tokens": 1000
}'
Best Practices
Model Selection
- Prototyping: Start with smaller models (1B-3B) for fast iteration
- Production: Use 7B-8B models for balanced performance
- Complex Tasks: Use 12B-27B models for advanced reasoning
- Code Tasks: Use Granite Code or specialized models
- Vision Tasks: Use Llama 3.2 Vision models
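These guidelines can be captured in a small routing helper. The sketch below uses model IDs from the allow-list table later on this page; the task categories and the mapping itself are illustrative choices, not a Staque IO feature:

```typescript
// Illustrative mapping from task type to an allow-listed NIM model ID.
type Task = "prototyping" | "general" | "reasoning" | "code" | "vision";

const MODEL_FOR_TASK: Record<Task, string> = {
  prototyping: "meta/llama-3.2-1b-instruct",      // smallest, fastest iteration
  general: "mistralai/mistral-7b-instruct-v0.3",  // balanced default
  reasoning: "google/gemma-3-27b-it",             // larger model for complex tasks
  code: "ibm/granite-34b-code-instruct",          // code-specialized
  vision: "meta/llama-3.2-11b-vision-instruct",   // multimodal input
};

function pickModel(task: Task): string {
  return MODEL_FOR_TASK[task];
}
```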
Performance Optimization
- Use smaller models when possible to reduce latency and cost
- Implement client-side caching for common responses (see the sketch after this list)
- Batch similar requests together when feasible
- Use streaming responses for better user experience
- Set appropriate max_tokens limits
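Client-side caching can be as simple as keying responses by model and prompt. A minimal in-memory sketch; the askModel callback it wraps is an assumption for illustration:

```typescript
// Minimal in-memory response cache keyed by model + prompt.
// In production you would add eviction/TTL; this only illustrates the idea.
const cache = new Map<string, string>();

async function cachedAsk(
  modelId: string,
  prompt: string,
  askModel: (modelId: string, prompt: string) => Promise<string>,
): Promise<string> {
  const key = `${modelId}::${prompt}`;
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // skip the paid inference call entirely
  const answer = await askModel(modelId, prompt);
  cache.set(key, answer);
  return answer;
}
```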
Cost Optimization
- Choose the smallest model that meets quality requirements
- Implement prompt engineering to reduce token usage
- Use context length efficiently
- Monitor usage through Staque IO's tracking
- Set up alerts for unexpected usage spikes
Reliability
- Implement retry logic with exponential backoff (see the sketch after this list)
- Handle rate limits gracefully
- Use timeout settings appropriate for your use case
- Have fallback models configured
- Monitor API status and response times
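A minimal sketch of retry with exponential backoff around a NIM request; the retryable status codes and delays are reasonable defaults rather than prescribed values:

```typescript
// Retry a request with exponential backoff on rate limits and transient server errors.
async function withRetry(
  doRequest: () => Promise<Response>,
  maxAttempts = 4,
): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await doRequest();
    if (res.ok) return res;
    const retryable = res.status === 429 || res.status >= 500;
    if (!retryable || attempt === maxAttempts - 1) return res;
    const delayMs = 500 * 2 ** attempt; // 500ms, 1s, 2s, ...
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("unreachable");
}
```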
Model Availability
Staque IO uses a curated allow-list of NVIDIA NIM models to ensure quality and compatibility. Currently supported models include:
| Model ID | Size | Best For |
|---|---|---|
| meta/llama-3.2-11b-vision-instruct | 11B | Multimodal tasks |
| mistralai/mistral-7b-instruct-v0.3 | 7B | General purpose |
| google/gemma-3-12b-it | 12B | Complex reasoning |
| ibm/granite-34b-code-instruct | 34B | Code generation |
| microsoft/phi-4-mini-instruct | ~4B | Fast responses |
See the full list of supported models via the API: GET /api/models/nvidia
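A quick way to check the current allow-list from a script; the STAQUE_BASE_URL and STAQUE_API_TOKEN environment variables and the response shape are assumptions to adapt to your deployment:

```typescript
// Fetch the curated NVIDIA NIM model list from Staque IO.
// STAQUE_BASE_URL and STAQUE_API_TOKEN are assumed names; the payload shape is not documented here.
const res = await fetch(`${process.env.STAQUE_BASE_URL}/api/models/nvidia`, {
  headers: { Authorization: `Bearer ${process.env.STAQUE_API_TOKEN}` },
});
const models = await res.json();
console.log(models);
```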
Troubleshooting
API Key Invalid
Problem: 401 Unauthorized or invalid API key error
Solution:
- Verify NVIDIA_API_KEY is correctly set in the environment
- Check the API key hasn't expired (regenerate if needed)
- Ensure no extra spaces or quotes in the key
- Verify key is from build.nvidia.com, not another NVIDIA service
Model Not Available
Problem: Model ID not found or not accessible
Solution:
- Check model ID is in the allow-list (see table above)
- Verify the model ID format: provider/model-name
- List available models: GET /api/models/nvidia
- Contact support to add new models to the allow-list
Rate Limiting
Problem: 429 Too Many Requests
Solution:
- Implement exponential backoff retry logic
- Reduce request frequency
- Check your NVIDIA account tier limits
- Consider upgrading to a paid NVIDIA tier for higher limits
Slow Responses
Problem: Higher than expected latency
Solution:
- Use smaller models (1B-3B instead of 7B+)
- Reduce the max_tokens parameter
- Enable streaming for incremental responses
- Check network connectivity to NVIDIA's API
- Consider caching frequent responses