Chat System

Staque IO features a sophisticated conversation management system with thread-based chat, usage tracking, and cost monitoring.

Architecture

Conversation (Top-level container)
  └─ Conversation Threads (Individual chat threads)
      └─ Messages (User & assistant messages with metadata)
          ├─ Message content
          ├─ Timestamp
          ├─ Token usage (input, output, total)
          ├─ Cost data
          └─ Model response time

Data Models

Conversation

A conversation is the top-level container that groups related chat threads together.

{
  id: "uuid",
  user_id: "uuid",
  title: "Customer Support Bot",
  use_case: "customer-support",
  status: "deployed",
  created_at: "2024-01-01T00:00:00Z",
  updated_at: "2024-01-10T12:00:00Z"
}

Conversation Thread

A thread contains the actual conversation history and usage statistics.

{
  id: "uuid",
  conversation_id: "uuid",
  user_id: "uuid",
  model_id: "amazon.nova-pro-v1:0",
  thread_data: {
    messages: [
      {
        role: "user",
        content: "Hello!",
        timestamp: "2024-01-10T12:00:00Z",
        metadata: { username: "john" }
      },
      {
        role: "assistant",
        content: "Hello! How can I help you?",
        timestamp: "2024-01-10T12:00:01Z",
        metadata: {
          model_id: "amazon.nova-pro-v1:0",
          model_response_time: 0.823,
          tokens_used: 42
        },
        tokens_in: 15,
        tokens_out: 27,
        tokens_total: 42,
        latency_ms: 823,
        model_id: "amazon.nova-pro-v1:0"
      }
    ],
    resource_id: "uuid",
    model_id: "amazon.nova-pro-v1:0"
  },
  tokens_input_total: 150,
  tokens_output_total: 380,
  tokens_total: 530,
  requests_count: 10,
  cost_total_usd: 0.0042,
  last_usage_at: "2024-01-10T12:00:00Z",
  created_at: "2024-01-10T10:00:00Z",
  updated_at: "2024-01-10T12:00:00Z"
}

Chat Flow

Creating a New Thread

POST /api/chat/thread

{
  "message": "Hello! How can you help me?",
  "conversationId": "conversation-uuid",
  "resourceId": "resource-uuid"
  // threadId omitted = new thread
}

Response:
{
  "success": true,
  "threadId": "thread-uuid",
  "messages": [
    { role: "user", content: "Hello! How can you help me?", ... },
    { role: "assistant", content: "Hello! I can assist with...", ... }
  ]
}
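
A minimal client sketch for this call (assuming a same-origin deployment with authentication already handled):

// Start a new thread by omitting threadId.
const res = await fetch("/api/chat/thread", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    message: "Hello! How can you help me?",
    conversationId: "conversation-uuid",
    resourceId: "resource-uuid"
    // no threadId -> server creates a new thread
  })
})
const { threadId, messages } = await res.json()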

Continuing a Thread

POST /api/chat/thread

{
  "message": "Tell me more about pricing",
  "conversationId": "conversation-uuid",
  "resourceId": "resource-uuid",
  "threadId": "thread-uuid"  // Include existing threadId
}

Response:
{
  "success": true,
  "threadId": "thread-uuid",
  "messages": [
    // ... previous messages ...
    { role: "user", content: "Tell me more about pricing", ... },
    { role: "assistant", content: "Our pricing is...", ... }
  ]
}
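
Since the same endpoint serves both cases, a thin client helper can carry the thread ID forward automatically. A sketch (sendMessage and currentThreadId are illustrative names, not part of the API):

// Reuses threadId when one exists; otherwise the server creates a thread.
let currentThreadId = null

async function sendMessage(message, conversationId, resourceId) {
  const body = { message, conversationId, resourceId }
  if (currentThreadId) body.threadId = currentThreadId

  const res = await fetch("/api/chat/thread", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body)
  })
  const data = await res.json()
  currentThreadId = data.threadId  // remember for follow-up messages
  return data.messages
}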

Retrieving Thread History

GET /api/chat/thread?threadId=thread-uuid&conversationId=conversation-uuid

Response:
{
  "success": true,
  "threadId": "thread-uuid",
  "messages": [
    { role: "user", content: "...", timestamp: "..." },
    { role: "assistant", content: "...", timestamp: "..." }
  ]
}
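
A sketch of the same call from a client:

const params = new URLSearchParams({
  threadId: "thread-uuid",
  conversationId: "conversation-uuid"
})
const history = await fetch(`/api/chat/thread?${params}`).then(r => r.json())
// history.messages holds the ordered user/assistant transcript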

Model Integration

AWS Bedrock Models

Bedrock models are invoked with conversation context and system prompts:

// For Amazon Nova models
{
  messages: [{
    role: "user",
    content: [{
      text: "System: You are a helpful assistant...\n\nConversation history:\nuser: Hello\nassistant: Hi there!\nuser: How are you?"
    }]
  }],
  inferenceConfig: {
    max_new_tokens: 1000,
    temperature: 0.1
  }
}

// For Anthropic Claude models
{
  messages: [{
    role: "user",
    content: "System: You are a helpful assistant...\n\nConversation history:\nuser: Hello\nassistant: Hi there!\nuser: How are you?"
  }],
  max_tokens: 4000,
  temperature: 0.7,
  anthropic_version: "bedrock-2023-05-31"
}
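
For reference, payloads like the two above go in the InvokeModel request body. A sketch using the AWS SDK for JavaScript v3 (the region and the payload variable are illustrative):

import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime"

const client = new BedrockRuntimeClient({ region: "us-east-1" })

// novaPayload is the Nova-format body shown above.
const response = await client.send(new InvokeModelCommand({
  modelId: "amazon.nova-pro-v1:0",
  contentType: "application/json",
  accept: "application/json",
  body: JSON.stringify(novaPayload)
}))

// The response body is a Uint8Array of JSON.
const result = JSON.parse(new TextDecoder().decode(response.body))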

NVIDIA NIM Models

NIM models use the OpenAI-compatible chat completions format:

{
  model: "mistralai/mistral-7b-instruct-v0.3",
  messages: [
    {
      role: "system",
      content: "You are a helpful assistant..."
    },
    {
      role: "user",
      content: "Hello"
    },
    {
      role: "assistant",
      content: "Hi there!"
    },
    {
      role: "user",
      content: "How are you?"
    }
  ],
  temperature: 0.1,
  max_tokens: 1000
}
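
Because the format is OpenAI-compatible, any OpenAI-style client can send it. A raw fetch sketch (NIM_BASE_URL and NIM_API_KEY are placeholders for your deployment's endpoint and credentials):

// nimPayload is the chat-completions body shown above.
const nimResponse = await fetch(`${NIM_BASE_URL}/v1/chat/completions`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${NIM_API_KEY}`
  },
  body: JSON.stringify(nimPayload)
})
const completion = await nimResponse.json()
const reply = completion.choices[0].message.content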

System Prompts

System prompts guide the model's behavior and can be customized per model:

Setting a System Prompt

POST /api/bedrock/system-prompt

{
  "modelId": "amazon.nova-pro-v1:0",
  "systemPrompt": "You are a customer support specialist. Be helpful, concise, and professional. Always provide actionable solutions."
}

Response:
{
  "success": true,
  "message": "System prompt updated successfully"
}

Retrieving a System Prompt

GET /api/bedrock/system-prompt?modelId=amazon.nova-pro-v1:0

Response:
{
  "success": true,
  "systemPrompt": "You are a customer support specialist..."
}
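
A sketch that sets a prompt and reads it back through the two endpoints above:

// Update the prompt for a model...
await fetch("/api/bedrock/system-prompt", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    modelId: "amazon.nova-pro-v1:0",
    systemPrompt: "You are a customer support specialist..."
  })
})

// ...then verify it took effect.
const { systemPrompt } = await fetch(
  "/api/bedrock/system-prompt?modelId=amazon.nova-pro-v1:0"
).then(r => r.json())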

Usage Tracking

The system automatically tracks token usage and costs for each interaction:

Per-Message Tracking

  • tokens_in: Input tokens consumed
  • tokens_out: Output tokens generated
  • tokens_total: Total tokens (in + out)
  • latency_ms: Model response time in milliseconds
  • cost: Calculated cost based on token pricing

Per-Thread Aggregates

  • tokens_input_total: Sum of all input tokens
  • tokens_output_total: Sum of all output tokens
  • tokens_total: Sum of all tokens
  • requests_count: Number of requests made
  • cost_total_usd: Total cost in USD
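
One way to picture the rollup: each completed request folds its per-message usage into the thread counters. An illustrative sketch (field names match the thread model above; the function itself is not part of the API):

function applyUsage(thread, message) {
  thread.tokens_input_total += message.tokens_in
  thread.tokens_output_total += message.tokens_out
  thread.tokens_total += message.tokens_total
  thread.requests_count += 1
  thread.cost_total_usd += message.cost ?? 0
  thread.last_usage_at = message.timestamp
  return thread
}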

Querying Usage

GET /api/usage/current?conversationId=conv-uuid

Response:
{
  "success": true,
  "mtd": {
    "cost_usd": 12.45,
    "tokens_in": 125000,
    "tokens_out": 187500,
    "tokens_total": 312500,
    "requests": 850
  },
  "last24h": {
    "cost_usd": 2.34,
    "tokens_in": 15000,
    "tokens_out": 22500,
    "tokens_total": 37500,
    "requests": 102
  }
}

Cost Calculation

Costs are calculated based on model-specific token pricing:

// Example calculation
const inputTokens = 150
const outputTokens = 280
const inputPricing = 0.0008  // per 1K tokens
const outputPricing = 0.0032  // per 1K tokens

const cost = (
  (inputTokens / 1000) * inputPricing +
  (outputTokens / 1000) * outputPricing
)
// cost = 0.00012 + 0.000896 = 0.001016 USD
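
The same arithmetic as a reusable helper (pricing values are illustrative; real rates come from the model's pricing table):

function calculateCostUsd(tokensIn, tokensOut, pricing) {
  return (tokensIn / 1000) * pricing.inputPer1k +
         (tokensOut / 1000) * pricing.outputPer1k
}

calculateCostUsd(150, 280, { inputPer1k: 0.0008, outputPer1k: 0.0032 })
// => 0.001016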

Message Metadata

Each message can include additional metadata:

{
  role: "assistant",
  content: "Here's your answer...",
  timestamp: "2024-01-10T12:00:00Z",
  metadata: {
    model_id: "amazon.nova-pro-v1:0",
    model_response_time: 0.823,
    tokens_used: 42,
    username: "john"
  },
  tokens_in: 15,
  tokens_out: 27,
  tokens_total: 42,
  latency_ms: 823,
  model_id: "amazon.nova-pro-v1:0"
}

Legacy Chat System

The platform previously stored messages in a chat_messages table. That table has been replaced by the thread-based system stored in conversation_threads, but a legacy endpoint remains available for backward compatibility:

GET /api/chat/legacy?conversationId=conv-uuid

// Returns messages from old chat_messages table

Best Practices

  • Thread Management: Create new threads for distinct conversations
  • Context Length: Monitor conversation length to avoid exceeding model token limits
  • System Prompts: Use specific, clear system prompts for better results
  • Cost Monitoring: Regularly check usage statistics to manage costs
  • Error Handling: Implement retry logic for transient failures
  • Rate Limiting: Implement client-side throttling if needed

⚠️ Token Limits

Different models have different context window limits:

  • Amazon Nova Pro: ~300K tokens
  • Claude 3 Sonnet: ~200K tokens
  • Most NVIDIA models: ~32K tokens

Implement conversation summarization or truncation for long conversations; a minimal truncation sketch follows.
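
A sketch of naive truncation (the character-based token estimate is a rough stand-in; a real implementation would use the model's tokenizer):

// Keep the newest messages that fit a rough token budget.
function truncateHistory(messages, maxTokens) {
  const kept = []
  let budget = maxTokens
  for (let i = messages.length - 1; i >= 0; i--) {
    const estimate = Math.ceil(messages[i].content.length / 4)  // ~4 chars/token
    if (estimate > budget) break
    budget -= estimate
    kept.unshift(messages[i])
  }
  return kept
}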