AWS SageMaker
Amazon SageMaker is a fully managed machine learning service that enables you to deploy custom models on dedicated infrastructure with full control over the deployment environment.
Key Features
- ✓ Custom Model Support: Deploy any model from HuggingFace, local files, or S3
- ✓ Dedicated Infrastructure: Predictable performance with dedicated GPU/CPU instances
- ✓ Full Control: Complete control over instance type, scaling, and configuration
- ✓ JumpStart Catalog: Access to pre-trained models and solutions
- ✓ Auto-Scaling: Automatic scaling based on traffic patterns
Available Models
SageMaker JumpStart
Access hundreds of pre-trained models from the JumpStart catalog:
- Llama Models: Meta's Llama 2 (7B, 13B, 70B) and Llama 3 (8B, 70B)
- Falcon Models: TII's Falcon models optimized for inference
- Mistral Models: Mistral AI's open-source models
- BLOOM: BigScience's multilingual model
- Stable Diffusion: Image generation models
- Domain-Specific: Financial, healthcare, and legal models
Custom Models
Deploy your own models by providing:
- Model artifacts from S3
- Inference container image (PyTorch, TensorFlow, custom)
- Inference script for request handling
How It Works in Staque IO
1. Model Selection (JumpStart or custom)
   ↓
2. Infrastructure Configuration (instance type, VPC, IAM)
   ↓
3. Deployment (5-10 minutes provisioning time)
   ↓
4. Endpoint Creation (Model → Config → Endpoint)
   ↓
5. Ready for Inference
   ↓
6. Ongoing: Hourly charges apply while endpoint is active
Deployment Process
When you deploy a SageMaker model through Staque IO, the platform performs the following steps (a boto3 sketch of the equivalent AWS calls appears after the list):
- Creates a Model: Registers the model with SageMaker
- Creates Endpoint Configuration: Defines instance type and scaling
- Creates Endpoint: Provisions the infrastructure
- Waits for InService: Typically takes 5-10 minutes
- Enables Inference: Endpoint is ready to receive requests
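For reference, here is the same Model → Config → Endpoint sequence expressed as direct AWS calls. This is a minimal boto3 sketch, not Staque IO's actual implementation; the resource names, image URI, and S3 path are placeholders, and error handling is omitted.

```python
import boto3

sm = boto3.client("sagemaker", region_name="eu-north-1")

# 1. Register the model (container image + artifacts + execution role)
sm.create_model(
    ModelName="my-llama-model",
    PrimaryContainer={
        "Image": "<inference-image-uri>",  # placeholder
        "ModelDataUrl": "s3://my-bucket/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
)

# 2. Define the instance type and count
sm.create_endpoint_config(
    EndpointConfigName="my-llama-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-llama-model",
        "InstanceType": "ml.g4dn.xlarge",
        "InitialInstanceCount": 1,
    }],
)

# 3. Provision the endpoint, then block until it reaches InService
sm.create_endpoint(EndpointName="my-llama-endpoint",
                   EndpointConfigName="my-llama-config")
sm.get_waiter("endpoint_in_service").wait(EndpointName="my-llama-endpoint")
```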
Automatic Resource Management
Staque IO automatically handles:
- Name sanitization (AWS resource naming requirements; see the sketch after this list)
- Region-specific inference image selection
- VPC configuration for secure deployments
- IAM role assignment for execution
- Status monitoring and health checks
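On the first point: SageMaker endpoint names may only contain alphanumerics and hyphens and are capped at 63 characters, so arbitrary model names need cleaning before deployment. A hypothetical sanitizer along these lines (not Staque IO's actual code):

```python
import re

def sanitize_sagemaker_name(name: str, max_len: int = 63) -> str:
    """Coerce an arbitrary string into a valid SageMaker resource name:
    alphanumerics and hyphens only, no leading/trailing hyphen, <= 63 chars."""
    name = re.sub(r"[^a-zA-Z0-9-]", "-", name)  # replace invalid characters
    name = re.sub(r"-{2,}", "-", name)          # collapse runs of hyphens
    return name.strip("-")[:max_len].rstrip("-")

print(sanitize_sagemaker_name("meta/llama_2 7B (chat)"))  # meta-llama-2-7B-chat
```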
Instance Types
SageMaker offers various instance types optimized for different workloads:
GPU Instances (Recommended for LLMs)
| Instance Type | GPU | Memory | Best For | Est. Cost/hr |
|---|---|---|---|---|
| ml.g4dn.xlarge | 1x NVIDIA T4 | 16 GB | Small models (7B) | $0.85 |
| ml.g5.xlarge | 1x NVIDIA A10G | 24 GB | Medium models (13B) | $1.41 |
| ml.g5.12xlarge | 4x NVIDIA A10G | 96 GB | Large models (70B) | $7.09 |
| ml.p4d.24xlarge | 8x NVIDIA A100 | 320 GB | Very large models | $32.77 |
CPU Instances
For smaller models or non-LLM workloads:
- ml.m5.xlarge - General purpose, $0.23/hr
- ml.c5.2xlarge - Compute optimized, $0.41/hr
- ml.r5.xlarge - Memory optimized, $0.30/hr
Pricing
Hourly Instance Costs
SageMaker charges are based on:
- Instance Hours: Charged per second, minimum 60 seconds
- Always Running: Costs accrue whether the endpoint is used or not
- Data Transfer: Minimal charges for requests/responses
- Storage: S3 storage for model artifacts
⚠️ Important: Unlike Bedrock, SageMaker incurs costs even when idle. Delete endpoints when not in use to avoid unnecessary charges.
Cost Example
ml.g4dn.xlarge @ $0.85/hour:
- Daily cost: $20.40 (24 hours)
- Monthly cost: $612.00 (30 days)
- With 50% uptime: $306.00/month

ml.g5.xlarge @ $1.41/hour:
- Daily cost: $33.84
- Monthly cost: $1,015.20 (30 days)
- With 50% uptime: $507.60/month
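The same arithmetic as a small helper, for estimating other instance/uptime combinations:

```python
def monthly_cost(hourly_rate: float, uptime_fraction: float = 1.0, days: int = 30) -> float:
    """Estimated monthly endpoint cost: hourly rate x 24 hours x days x uptime."""
    return hourly_rate * 24 * days * uptime_fraction

print(monthly_cost(0.85))       # ml.g4dn.xlarge, always on -> 612.0
print(monthly_cost(1.41, 0.5))  # ml.g5.xlarge at 50% uptime -> 507.6
```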
Configuration
Required Environment Variables
# AWS Credentials
STAQUE_AWS_REGION=eu-north-1
STAQUE_AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
STAQUE_AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# SageMaker Configuration
SAGEMAKER_EXECUTION_ROLE_ARN=arn:aws:iam::123456789012:role/SageMakerRole
SAGEMAKER_SUBNET_IDS=subnet-12345678,subnet-87654321
SAGEMAKER_SECURITY_GROUP_IDS=sg-12345678

# Optional: JWT Secret
JWT_SECRET=your-secret-key-here
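How these variables are consumed is platform-internal, but a minimal sketch of turning them into an AWS session with boto3 (assuming the variable names above) looks like:

```python
import os
import boto3

# Build an AWS session from the STAQUE_-prefixed variables defined above
session = boto3.Session(
    region_name=os.environ["STAQUE_AWS_REGION"],
    aws_access_key_id=os.environ["STAQUE_AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["STAQUE_AWS_SECRET_ACCESS_KEY"],
)
sm = session.client("sagemaker")
role_arn = os.environ["SAGEMAKER_EXECUTION_ROLE_ARN"]  # used as ExecutionRoleArn
```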
IAM Role Requirements
The SageMaker execution role needs these permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:CreateModel",
"sagemaker:CreateEndpointConfig",
"sagemaker:CreateEndpoint",
"sagemaker:DeleteEndpoint",
"sagemaker:DescribeEndpoint",
"sagemaker:InvokeEndpoint",
"s3:GetObject",
"s3:ListBucket",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "*"
}
]
}
VPC Configuration
For security, deploy SageMaker endpoints in a VPC (see the VpcConfig sketch after this list):
- Subnets: At least 2 subnets in different availability zones
- Security Groups: Configure ingress/egress rules for model access
- VPC Endpoints: S3 and ECR endpoints for private connectivity
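In the SageMaker API these settings map to the VpcConfig block passed when creating the model. A sketch using the environment variables defined earlier:

```python
import os

# VpcConfig as accepted by sagemaker.create_model(..., VpcConfig=vpc_config)
vpc_config = {
    "Subnets": os.environ["SAGEMAKER_SUBNET_IDS"].split(","),  # >= 2 AZs
    "SecurityGroupIds": os.environ["SAGEMAKER_SECURITY_GROUP_IDS"].split(","),
}
```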
Usage Examples
Deploying a JumpStart Model
POST /api/deploy/sagemaker
{
"endpointName": "my-llama-endpoint",
"instanceType": "ml.g4dn.xlarge",
"modelPackageArn": "arn:aws:sagemaker:eu-north-1:...:model-package/llama-2-7b",
"dryRun": false
}
// Response (deployment takes 5-10 minutes)
{
"success": true,
"message": "Endpoint creation started",
"endpointName": "my-llama-endpoint",
"endpoint": "https://runtime.sagemaker.eu-north-1.amazonaws.com/..."
}
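Once the endpoint is InService, it can also be invoked directly via the SageMaker runtime. A minimal boto3 sketch; note that the request payload format depends on the model's inference container, so the JSON body here is an assumption:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="eu-north-1")

response = runtime.invoke_endpoint(
    EndpointName="my-llama-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": "Explain SageMaker endpoints in one sentence."}),
)
print(response["Body"].read().decode())  # container-specific response payload
```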
Deploying a Custom Model
POST /api/deploy/sagemaker
{
"endpointName": "my-custom-model",
"instanceType": "ml.g5.xlarge",
"inferenceImage": "763104351884.dkr.ecr.eu-north-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310",
"modelDataUrl": "s3://my-bucket/model.tar.gz",
"executionRoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
"dryRun": false
}
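The modelDataUrl must point at a model.tar.gz archive. For the stock PyTorch inference containers, the archive conventionally contains the weights plus a code/inference.py handler script (defining model_fn, input_fn, predict_fn, and output_fn). A sketch of packaging it, with placeholder file names:

```python
import tarfile

# Expected layout for the PyTorch inference container (file names are placeholders):
#   model.tar.gz
#   +-- model.pth           <- model weights
#   +-- code/inference.py   <- request/response handlers
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.pth")
    tar.add("code/inference.py")
```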
Checking Endpoint Status
GET /api/resources/resource-uuid/status
// Response
{
"success": true,
"resource": {
"type": "sagemaker",
"status": "InService", // Creating, Updating, InService, etc.
"health": "healthy",
"instance_type": "ml.g4dn.xlarge",
"instance_count": 1
},
"metrics": {
"response_time_ms": 342,
"throughput_per_minute": 87,
"cpu_utilization": 35,
"memory_utilization": 48
},
"costs": {
"hourly_cost": 0.85,
"daily_cost": 20.40,
"monthly_estimate": 612.00
}
}
Deleting an Endpoint
POST /api/resources/resource-uuid/control
{
"action": "delete",
"confirm": true // Required to prevent accidental deletion
}
// Response
{
"success": true,
"message": "Endpoint deletion initiated",
"action": "delete",
"status": "deleting"
}
Best Practices
Instance Selection
- 7B Models: ml.g4dn.xlarge or ml.g5.xlarge
- 13B Models: ml.g5.2xlarge or ml.g5.4xlarge
- 70B Models: ml.g5.12xlarge or ml.p4d.24xlarge
- Test First: Always start with smaller instances and scale up as needed
Cost Optimization
- Delete When Idle: Delete endpoints that aren't actively used
- Use Auto-Scaling: Scale instance count based on traffic
- Spot Instances: Consider managed spot for training workloads (spot capacity is not offered for real-time inference endpoints)
- Right-Size: Monitor utilization and downgrade if under-utilized
- Schedule Deletion: Use automation to delete non-production endpoints after hours (see the sketch below)
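For the last point, a scheduled job (for example a nightly EventBridge-triggered Lambda) can sweep up non-production endpoints. A hypothetical sketch that assumes development endpoints share a dev- name prefix:

```python
import boto3

sm = boto3.client("sagemaker", region_name="eu-north-1")

def delete_dev_endpoints(prefix: str = "dev-") -> None:
    """Delete every InService endpoint whose name starts with `prefix`.
    Intended to run on a schedule (e.g. a nightly EventBridge rule)."""
    paginator = sm.get_paginator("list_endpoints")
    for page in paginator.paginate(StatusEquals="InService"):
        for ep in page["Endpoints"]:
            if ep["EndpointName"].startswith(prefix):
                sm.delete_endpoint(EndpointName=ep["EndpointName"])
```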
Performance Optimization
- Model Optimization: Use quantization (INT8, FP16) to reduce memory
- Batch Requests: Process multiple requests together when possible
- Enable Multi-Model Endpoints: Host multiple models on one endpoint
- Use TensorRT: For NVIDIA GPU acceleration
- Monitor Metrics: Track latency, throughput, and resource utilization
Security Best Practices
- Always deploy in a VPC with private subnets
- Use VPC endpoints for S3 and ECR access
- Implement least-privilege IAM policies
- Enable encryption at rest and in transit
- Use security groups to restrict network access
Troubleshooting
Endpoint Stuck in "Creating" Status
Problem: Endpoint takes longer than 15 minutes to deploy
Solution:
- Check VPC configuration (subnets, security groups, route tables)
- Verify ECR and S3 VPC endpoints exist
- Check IAM role has required permissions
- Review CloudWatch logs for specific errors (see the sketch below)
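Endpoint container logs land in the /aws/sagemaker/Endpoints/&lt;endpoint-name&gt; log group. A quick boto3 sketch for pulling error lines (the endpoint name is a placeholder):

```python
import boto3

logs = boto3.client("logs", region_name="eu-north-1")

# Endpoint container logs live under /aws/sagemaker/Endpoints/<endpoint-name>
resp = logs.filter_log_events(
    logGroupName="/aws/sagemaker/Endpoints/my-llama-endpoint",
    filterPattern="?Error ?ERROR ?Exception",
)
for event in resp["events"]:
    print(event["message"])
```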
Endpoint Failed to Deploy
Problem: Status changes to "Failed"
Solution:
- Check model artifacts are accessible in S3
- Verify inference image exists and is accessible
- Ensure instance type is available in your region
- Review CloudWatch logs for error details
High Latency or Timeout
Problem: Slow response times or request timeouts
Solution:
- Increase instance size or use GPU instances
- Optimize model (quantization, pruning)
- Enable auto-scaling to handle traffic spikes
- Check network configuration (VPC, security groups)
- Monitor endpoint metrics in CloudWatch
Unexpected Costs
Problem: Higher than expected charges
Solution:
- Delete unused endpoints immediately
- Use Staque IO's cost monitoring dashboard
- Set up AWS Budgets and alerts
- Review instance utilization and downsize if possible
- Consider switching to Bedrock for variable workloads