Enhanced Audio Transcription with Speaker Identification

This document describes the enhanced audio transcription system, which identifies individual speakers and preserves the complete transcript during real-time updates.

🎯 Key Features

1. Speaker Identification

  • Voice Fingerprinting: Uses audio analysis to create unique voice profiles for each speaker
  • Real-time Detection: Automatically identifies when speakers change during conversation
  • Visual Indicators: Each speaker gets a unique color and label for easy identification
  • Speaker Statistics: Tracks speaking time and segment count for each participant

2. Enhanced Transcript Structure

  • Structured Segments: Each transcript segment includes speaker ID, timestamps, and confidence scores
  • Complete Preservation: No words are lost during real-time updates
  • Backward Compatibility: Maintains legacy transcript format for existing integrations
  • Multiple Export Formats: Support for text, JSON, and SRT subtitle formats

3. Real-time Updates

  • Live Speaker Detection: Continuously monitors voice activity and speaker changes
  • Interim Text Display: Shows partial results as they're being spoken
  • Smooth Transitions: Seamless updates between interim and final transcript segments
  • Auto-scroll: Automatically scrolls to show the latest content

🔧 Technical Implementation

Audio Analysis System

The system identifies speakers by analyzing the following acoustic characteristics of the live audio stream (a sketch of extracting them follows the interface):

interface VoiceCharacteristics {
  pitch: number             // Fundamental frequency
  volume: number            // Audio amplitude
  spectralCentroid: number  // Frequency distribution center
  mfcc: number[]            // Mel-frequency cepstral coefficients
  zeroCrossingRate: number  // Voice activity indicator
  energy: number            // Overall audio energy
}
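
As a rough illustration, the sketch below shows how a subset of these characteristics could be derived from a Web Audio AnalyserNode. The extractVoiceCharacteristics helper is hypothetical (it is not part of the shipped code), and pitch and MFCC extraction, which normally require autocorrelation or a DSP library, are omitted.

// Illustrative sketch (not the shipped implementation): derive a subset of
// VoiceCharacteristics from a Web Audio AnalyserNode. Pitch and MFCC are omitted.
function extractVoiceCharacteristics(
  analyser: AnalyserNode,
  sampleRate: number
): Pick<VoiceCharacteristics, 'volume' | 'energy' | 'zeroCrossingRate' | 'spectralCentroid'> {
  const timeData = new Float32Array(analyser.fftSize)
  const freqData = new Uint8Array(analyser.frequencyBinCount)
  analyser.getFloatTimeDomainData(timeData)
  analyser.getByteFrequencyData(freqData)

  // Mean energy, RMS volume, and zero-crossing rate from the time-domain frame
  let sumSquares = 0
  let crossings = 0
  for (let i = 0; i < timeData.length; i++) {
    sumSquares += timeData[i] * timeData[i]
    if (i > 0 && Math.sign(timeData[i]) !== Math.sign(timeData[i - 1])) crossings++
  }
  const energy = sumSquares / timeData.length
  const volume = Math.sqrt(energy)

  // Spectral centroid: magnitude-weighted average frequency across FFT bins
  const binWidthHz = sampleRate / analyser.fftSize
  let weightedSum = 0
  let magnitudeSum = 0
  for (let i = 0; i < freqData.length; i++) {
    weightedSum += i * binWidthHz * freqData[i]
    magnitudeSum += freqData[i]
  }

  return {
    volume,
    energy,
    zeroCrossingRate: crossings / timeData.length,
    spectralCentroid: magnitudeSum > 0 ? weightedSum / magnitudeSum : 0,
  }
}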

Speaker Identification Algorithm

  1. Voice Activity Detection: Monitors audio levels to detect when someone is speaking
  2. Feature Extraction: Analyzes voice characteristics in real-time
  3. Similarity Matching: Compares the current voice with known speaker profiles (see the sketch after this list)
  4. Profile Creation: Creates new speaker profiles for unrecognized voices
  5. Confidence Scoring: Assigns confidence levels to speaker identifications
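
The sketch below illustrates steps 3-5 under some assumptions: a SpeakerProfile shape, a cosine-similarity comparison, and a 0.75 matching threshold, none of which are taken from the actual implementation.

// Illustrative sketch of similarity matching, profile creation, and confidence
// scoring (steps 3-5). SpeakerProfile and the threshold value are assumptions.
interface SpeakerProfile {
  id: string
  name: string
  features: number[]   // running average of the speaker's feature vector
  sampleCount: number
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return normA && normB ? dot / (Math.sqrt(normA) * Math.sqrt(normB)) : 0
}

function identifySpeaker(
  features: number[],
  profiles: SpeakerProfile[],
  similarityThreshold = 0.75
): { profile: SpeakerProfile; confidence: number } {
  let best: SpeakerProfile | null = null
  let bestScore = 0
  for (const profile of profiles) {
    const score = cosineSimilarity(features, profile.features)
    if (score > bestScore) { best = profile; bestScore = score }
  }
  if (best && bestScore >= similarityThreshold) {
    // Update the matched profile with a running average of the new sample
    const matched = best
    matched.features = matched.features.map(
      (v, i) => (v * matched.sampleCount + features[i]) / (matched.sampleCount + 1)
    )
    matched.sampleCount++
    return { profile: matched, confidence: bestScore }
  }
  // No sufficiently similar profile: create a new speaker (step 4)
  const created: SpeakerProfile = {
    id: `speaker-${profiles.length + 1}`,
    name: `Speaker ${profiles.length + 1}`,
    features: [...features],
    sampleCount: 1,
  }
  profiles.push(created)
  return { profile: created, confidence: 1 } // a brand-new profile matches itself by definition
}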

Transcript Management

The enhanced transcript system provides:

interface TranscriptSegment {
  id: string            // Unique segment identifier
  speakerId: string     // Associated speaker ID
  speakerName: string   // Display name for speaker
  text: string          // Transcribed text
  startTime: number     // Segment start time (ms)
  endTime: number       // Segment end time (ms)
  confidence: number    // Recognition confidence (0-1)
  isFinal: boolean      // Whether segment is finalized
}
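
A minimal sketch of one way interim results can be folded into this structure without losing text: an interim segment is replaced in place until the recognizer marks it final, after which it is never overwritten. The upsertSegment helper is illustrative, not the actual API.

// Illustrative sketch: replace a segment's interim text in place, lock it once final.
// upsertSegment is a hypothetical helper, not part of the shipped API.
function upsertSegment(
  segments: TranscriptSegment[],
  incoming: TranscriptSegment
): TranscriptSegment[] {
  const index = segments.findIndex(s => s.id === incoming.id)
  if (index === -1) {
    // New segment (either the first interim chunk or a fresh final segment)
    return [...segments, incoming]
  }
  if (segments[index].isFinal) {
    // Never overwrite finalized text; append instead so no words are lost
    return [...segments, { ...incoming, id: `${incoming.id}-cont` }]
  }
  // Replace the interim segment with the newer (interim or final) version
  const next = [...segments]
  next[index] = incoming
  return next
}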

🎨 User Interface Enhancements

Speaker Display

  • Color-coded Labels: Each speaker gets a unique color for easy identification
  • Speaker List: Shows all identified speakers with speaking time statistics
  • Current Speaker Highlighting: Highlights the currently speaking participant
  • Speaker Management: Ability to rename speakers and manage their profiles

Transcript Controls

  • Show/Hide Speaker Labels: Toggle speaker name display
  • Show/Hide Timestamps: Toggle timestamp display for each segment
  • Auto-scroll Toggle: Control automatic scrolling behavior
  • Export Options: Download transcripts in multiple formats

Visual Indicators

  • Border Colors: Each transcript segment has a colored border matching the speaker (a color-assignment sketch follows this list)
  • Speaking Status: Visual indicators show who is currently speaking
  • Interim Text: Italicized, gray text shows partial results
  • Final Text: Regular text shows confirmed transcript segments
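
One simple way to keep colors stable for the whole session is to hand out palette entries in the order speakers first appear. The palette values and the speakerColor helper below are assumptions for illustration only.

// Illustrative sketch: stable color assignment per speaker, in order of first appearance.
// The palette and helper name are assumptions, not the shipped implementation.
const SPEAKER_PALETTE = ['#3b82f6', '#ef4444', '#10b981', '#f59e0b', '#8b5cf6', '#ec4899']
const assignedColors = new Map<string, string>()

function speakerColor(speakerId: string): string {
  if (!assignedColors.has(speakerId)) {
    assignedColors.set(speakerId, SPEAKER_PALETTE[assignedColors.size % SPEAKER_PALETTE.length])
  }
  return assignedColors.get(speakerId)!
}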

📊 Data Export and Analysis

Export Formats

  1. Text Format:

    [00:01:23] Speaker 1: Hello, how are you today?
    [00:01:28] Speaker 2: I'm doing well, thank you for asking.
    
  2. JSON Format:

    {
      "segments": [...],
      "speakers": [...],
      "sessionStartTime": 1234567890,
      "totalDuration": 300000
    }
    
  3. SRT Subtitle Format:

    1
    00:00:01,230 --> 00:00:05,180
    Speaker 1: Hello, how are you today?
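
For reference, the sketch below shows how TranscriptSegment data could be rendered into the SRT layout above; the toSrt helper and its timestamp formatting are illustrative rather than the shipped exporter.

// Illustrative sketch: render finalized segments as SRT, matching the layout shown above.
function toSrt(segments: TranscriptSegment[]): string {
  const pad = (n: number, width = 2) => String(n).padStart(width, '0')
  const srtTime = (ms: number) => {
    const h = Math.floor(ms / 3_600_000)
    const m = Math.floor(ms / 60_000) % 60
    const s = Math.floor(ms / 1000) % 60
    return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`
  }
  return segments
    .filter(seg => seg.isFinal)
    .map((seg, i) =>
      `${i + 1}\n${srtTime(seg.startTime)} --> ${srtTime(seg.endTime)}\n${seg.speakerName}: ${seg.text}\n`
    )
    .join('\n')
}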
    

Statistics and Analytics

The system tracks comprehensive statistics:

  • Total speaking time per speaker
  • Number of segments per speaker
  • Average segment length
  • Session duration and timeline
  • Recognition confidence scores
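
These figures can be derived directly from the segment list. The speakerStats helper below is an illustrative sketch, not the actual implementation.

// Illustrative sketch: derive per-speaker statistics from finalized segments.
function speakerStats(segments: TranscriptSegment[]) {
  const stats = new Map<string, { speakingTimeMs: number; segmentCount: number }>()
  for (const seg of segments.filter(s => s.isFinal)) {
    const entry = stats.get(seg.speakerId) ?? { speakingTimeMs: 0, segmentCount: 0 }
    entry.speakingTimeMs += seg.endTime - seg.startTime
    entry.segmentCount += 1
    stats.set(seg.speakerId, entry)
  }
  return stats
}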

🔄 Real-time Processing Flow

  1. Audio Capture: Microphone stream is captured and analyzed
  2. Voice Activity Detection: System detects when someone starts/stops speaking
  3. Speaker Identification: Voice characteristics are analyzed and matched to known speakers
  4. Speech Recognition: The Web Speech API converts audio into text (see the sketch after this list)
  5. Transcript Update: New segments are added with speaker information
  6. UI Update: Interface updates to show new content with speaker labels
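
Step 4 relies on the browser's Web Speech API. The sketch below shows a minimal continuous-recognition setup with interim results; error handling and the speaker-identification hookup are omitted, and handleResult is a hypothetical callback into the transcript layer.

// Illustrative sketch of step 4: continuous recognition with interim results via the
// Web Speech API (Chrome exposes it as webkitSpeechRecognition). handleResult is a
// hypothetical hand-off into the transcript layer, not part of the shipped code.
function handleResult(text: string, confidence: number, isFinal: boolean): void {
  console.log(isFinal ? 'final:' : 'interim:', text, confidence)
}

const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition
const recognition = new SpeechRecognitionImpl()
recognition.continuous = true       // keep listening across utterances
recognition.interimResults = true   // emit partial results while the user is still speaking
recognition.lang = 'en-US'

recognition.onresult = (event: any) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i]
    handleResult(result[0].transcript, result[0].confidence, result.isFinal)
  }
}
recognition.start()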

🛠️ Configuration Options

Audio Analysis Settings

  • Voice Activity Threshold: Sensitivity for detecting speech
  • Silence Timeout: Time before considering a speaker change
  • Similarity Threshold: Minimum similarity for speaker matching
  • Feature Update Rate: How often voice profiles are updated
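
Expressed in code, these settings might look like the following; the AudioAnalysisSettings shape and the default values are illustrative assumptions, not the actual defaults.

// Illustrative sketch of the audio-analysis settings; names and defaults are assumptions.
interface AudioAnalysisSettings {
  voiceActivityThreshold: number   // RMS level above which speech is assumed (0-1)
  silenceTimeoutMs: number         // silence duration before a speaker change is considered
  similarityThreshold: number      // minimum similarity to match a known speaker
  featureUpdateIntervalMs: number  // how often voice profiles are refreshed
}

const defaultAnalysisSettings: AudioAnalysisSettings = {
  voiceActivityThreshold: 0.02,
  silenceTimeoutMs: 1500,
  similarityThreshold: 0.75,
  featureUpdateIntervalMs: 250,
}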

Display Options

  • Speaker Colors: Customizable color palette for speakers
  • Timestamp Format: Choose between different time display formats
  • Auto-scroll Behavior: Control when and how auto-scrolling occurs
  • Segment Styling: Customize visual appearance of transcript segments

🔍 Troubleshooting

Common Issues

  1. Speaker Not Identified:

    • Ensure good microphone quality
    • Check for background noise
    • Verify the speaker is speaking clearly
    • Allow time for voice profile creation
  2. Incorrect Speaker Assignment:

    • Check microphone positioning
    • Verify audio quality
    • Consider adjusting similarity threshold
    • Manually rename speakers if needed
  3. Missing Transcript Segments:

    • Check internet connection stability
    • Verify browser compatibility
    • Ensure microphone permissions are granted
    • Check for audio processing errors

Performance Optimization

  1. Audio Quality: Use high-quality microphones for better speaker identification
  2. Environment: Minimize background noise for clearer voice analysis
  3. Browser: Use Chrome or Chromium-based browsers for best performance
  4. Network: Ensure stable internet connection for speech recognition

🚀 Future Enhancements

Planned Features

  • Machine Learning Integration: Improved speaker identification using ML models
  • Voice Cloning Detection: Identify when speakers are using voice modification
  • Emotion Recognition: Detect emotional tone in speech
  • Language Detection: Automatic language identification and switching
  • Cloud Processing: Offload heavy processing to cloud services

Integration Possibilities

  • Video Analysis: Combine with video feeds for enhanced speaker detection
  • Meeting Platforms: Integration with Zoom, Teams, and other platforms
  • AI Summarization: Automatic meeting summaries with speaker attribution
  • Search and Indexing: Full-text search across all transcript segments

📝 Usage Examples

Basic Usage

  1. Start a video chat session
  2. Click the transcription button
  3. Allow microphone access
  4. Begin speaking; speakers are identified automatically
  5. View real-time transcript with speaker labels

Advanced Features

  1. Customize Display: Toggle speaker labels and timestamps
  2. Export Transcripts: Download in your preferred format
  3. Manage Speakers: Rename speakers for better organization
  4. Analyze Statistics: View speaking time and participation metrics

Integration with Other Tools

  • Meeting Notes: Combine with note-taking tools
  • Action Items: Extract action items with speaker attribution
  • Follow-up: Use transcripts for meeting follow-up and documentation
  • Compliance: Maintain records for regulatory requirements

The enhanced transcription system provides a comprehensive solution for real-time speaker identification and transcript management, ensuring no spoken words are lost while providing rich metadata about conversation participants.