# Enhanced Audio Transcription with Speaker Identification
This document describes the enhanced audio transcription system, which identifies individual speakers and preserves the complete transcript as it updates in real time.
## 🎯 Key Features
### 1. **Speaker Identification**
- **Voice Fingerprinting**: Uses audio analysis to create unique voice profiles for each speaker
- **Real-time Detection**: Automatically identifies when speakers change during conversation
- **Visual Indicators**: Each speaker gets a unique color and label for easy identification
- **Speaker Statistics**: Tracks speaking time and segment count for each participant
### 2. **Enhanced Transcript Structure**
- **Structured Segments**: Each transcript segment includes speaker ID, timestamps, and confidence scores
- **Complete Preservation**: No words are lost during real-time updates
- **Backward Compatibility**: Maintains legacy transcript format for existing integrations
- **Multiple Export Formats**: Support for text, JSON, and SRT subtitle formats
### 3. **Real-time Updates**
- **Live Speaker Detection**: Continuously monitors voice activity and speaker changes
- **Interim Text Display**: Shows partial results as they're being spoken
- **Smooth Transitions**: Seamless updates between interim and final transcript segments
- **Auto-scroll**: Automatically scrolls to show the latest content
## 🔧 Technical Implementation
### Audio Analysis System
The system derives the following acoustic features from the microphone signal and uses them to tell speakers apart:
```typescript
interface VoiceCharacteristics {
  pitch: number // Fundamental frequency
  volume: number // Audio amplitude
  spectralCentroid: number // Frequency distribution center
  mfcc: number[] // Mel-frequency cepstral coefficients
  zeroCrossingRate: number // Voice activity indicator
  energy: number // Overall audio energy
}
```
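As a rough illustration of how two of these features could be computed, the sketch below derives `energy` and `zeroCrossingRate` from a single frame of samples. The function names are hypothetical and not taken from the project's source. It assumes the frame holds PCM samples in the range [-1, 1], and it treats energy as a root-mean-square value, which is one common choice rather than the documented definition. Pitch, spectral centroid, and MFCCs need a frequency-domain transform and are left out.

```typescript
// Hypothetical helpers, not the project's actual code.
// Assumption: `frame` holds PCM samples in [-1, 1], the format that
// AnalyserNode.getFloatTimeDomainData() writes into a Float32Array.

/** Root-mean-square energy of one frame (0 for an empty frame). */
function rmsEnergy(frame: Float32Array): number {
  if (frame.length === 0) return 0
  let sumSquares = 0
  for (let i = 0; i < frame.length; i++) {
    sumSquares += frame[i] * frame[i]
  }
  return Math.sqrt(sumSquares / frame.length)
}

/** Fraction of adjacent sample pairs whose signs differ. */
function zeroCrossingRate(frame: Float32Array): number {
  if (frame.length < 2) return 0
  let crossings = 0
  for (let i = 1; i < frame.length; i++) {
    const previousIsPositive = frame[i - 1] >= 0
    const currentIsPositive = frame[i] >= 0
    if (previousIsPositive !== currentIsPositive) crossings++
  }
  return crossings / (frame.length - 1)
}
```

Voiced speech tends to show higher energy and a lower zero-crossing rate than background hiss, which is why the two values are often read together.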
### Speaker Identification Algorithm
1. **Voice Activity Detection**: Monitors audio levels to detect when someone is speaking
2. **Feature Extraction**: Analyzes voice characteristics in real-time
3. **Similarity Matching**: Compares current voice with known speaker profiles
4. **Profile Creation**: Creates new speaker profiles for unrecognized voices
5. **Confidence Scoring**: Assigns confidence levels to speaker identifications
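
Steps 3 to 5 could look something like the sketch below. It assumes each profile stores a single feature vector (for example, averaged MFCCs) and compares vectors with cosine similarity against a fixed threshold. The metric, the default threshold of 0.85, and the names `SpeakerProfile` and `identifySpeaker` are assumptions made for illustration, not details confirmed by this document.

```typescript
// Hypothetical types and helpers for illustration only.
interface SpeakerProfile {
  id: string
  features: number[] // assumed: one feature vector per speaker
}

interface IdentificationResult {
  speakerId: string
  confidence: number // similarity to the matched profile, 0 for a new profile
  isNew: boolean
}

/** Cosine similarity of two vectors; 0 if either vector is all zeros. */
function cosineSimilarity(a: number[], b: number[]): number {
  const n = Math.min(a.length, b.length)
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < n; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0
  return dot / Math.sqrt(normA * normB)
}

/**
 * Matches `features` against known profiles. If no profile reaches the
 * threshold, a new profile is appended to `profiles` (the array is mutated).
 */
function identifySpeaker(
  features: number[],
  profiles: SpeakerProfile[],
  similarityThreshold = 0.85 // assumed default
): IdentificationResult {
  let best: SpeakerProfile | undefined
  let bestScore = -Infinity
  for (const profile of profiles) {
    const score = cosineSimilarity(features, profile.features)
    if (score > bestScore) {
      best = profile
      bestScore = score
    }
  }
  if (best !== undefined && bestScore >= similarityThreshold) {
    return { speakerId: best.id, confidence: bestScore, isNew: false }
  }
  const created: SpeakerProfile = {
    id: `speaker-${profiles.length + 1}`,
    features: features.slice(),
  }
  profiles.push(created)
  return { speakerId: created.id, confidence: 0, isNew: true }
}
```

A lower threshold merges similar voices more readily, while a higher one creates new profiles more often, so the value is a trade-off rather than a fixed constant.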
### Transcript Management
Each piece of recognized speech is stored as a structured segment:
```typescript
interface TranscriptSegment {
  id: string // Unique segment identifier
  speakerId: string // Associated speaker ID
  speakerName: string // Display name for speaker
  text: string // Transcribed text
  startTime: number // Segment start time (ms)
  endTime: number // Segment end time (ms)
  confidence: number // Recognition confidence (0-1)
  isFinal: boolean // Whether segment is finalized
}
```
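One way to keep interim updates from duplicating or overwriting text is to key every update on the segment `id`. The sketch below is an assumption about how that might work, not the project's actual logic; `upsertSegment` and `SegmentRecord` are hypothetical names, and the record repeats the `TranscriptSegment` fields only so that the example compiles on its own.

```typescript
// Same fields as TranscriptSegment, repeated so the sketch is self-contained.
interface SegmentRecord {
  id: string
  speakerId: string
  speakerName: string
  text: string
  startTime: number
  endTime: number
  confidence: number
  isFinal: boolean
}

/**
 * Returns a new array with `incoming` added or substituted by id.
 * A finalized segment is never replaced by a late interim update.
 */
function upsertSegment(
  segments: SegmentRecord[],
  incoming: SegmentRecord
): SegmentRecord[] {
  const index = segments.findIndex((segment) => segment.id === incoming.id)
  if (index === -1) {
    return segments.concat([incoming])
  }
  if (segments[index].isFinal && !incoming.isFinal) {
    return segments
  }
  const next = segments.slice()
  next[index] = incoming
  return next
}
```

Returning a new array instead of mutating in place keeps the function easy to use with UI state that relies on reference changes to trigger a re-render.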
## 🎨 User Interface Enhancements
### Speaker Display
- **Color-coded Labels**: Each speaker gets a unique color for easy identification
- **Speaker List**: Shows all identified speakers with speaking time statistics
- **Current Speaker Highlighting**: Highlights the currently speaking participant
- **Speaker Management**: Ability to rename speakers and manage their profiles
### Transcript Controls
- **Show/Hide Speaker Labels**: Toggle speaker name display
- **Show/Hide Timestamps**: Toggle timestamp display for each segment
- **Auto-scroll Toggle**: Control automatic scrolling behavior
- **Export Options**: Download transcripts in multiple formats
### Visual Indicators
- **Border Colors**: Each transcript segment has a colored border matching the speaker
- **Speaking Status**: Visual indicators show who is currently speaking
- **Interim Text**: Italicized, gray text shows partial results
- **Final Text**: Regular text shows confirmed transcript segments
## 📊 Data Export and Analysis
### Export Formats
1. **Text Format**:
```
[00:01:23] Speaker 1: Hello, how are you today?
[00:01:28] Speaker 2: I'm doing well, thank you for asking.
```
2. **JSON Format**:
```json
{
"segments": [...],
"speakers": [...],
"sessionStartTime": 1234567890,
"totalDuration": 300000
}
```
3. **SRT Subtitle Format**:
```
1
00:00:01,230 --> 00:00:05,180
Speaker 1: Hello, how are you today?
```
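The text and SRT outputs above can be produced with a few lines of formatting code. This sketch assumes `startTime` and `endTime` are millisecond offsets from the start of the session (the interface does not say whether they are offsets or epoch timestamps). `ExportSegment`, `formatSrtTime`, `toSrt`, and `toText` are hypothetical names.

```typescript
// Hypothetical export helpers; only the fields needed for export are listed.
interface ExportSegment {
  speakerName: string
  text: string
  startTime: number // assumed: ms since session start
  endTime: number // assumed: ms since session start
}

function pad(value: number, width: number): string {
  return String(value).padStart(width, '0')
}

/** Formats milliseconds as an SRT timestamp, HH:MM:SS,mmm. */
function formatSrtTime(ms: number): string {
  const total = Math.max(0, Math.round(ms))
  const hours = Math.floor(total / 3600000)
  const minutes = Math.floor((total % 3600000) / 60000)
  const seconds = Math.floor((total % 60000) / 1000)
  const millis = total % 1000
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)},${pad(millis, 3)}`
}

/** Numbered SRT cues separated by blank lines. */
function toSrt(segments: ExportSegment[]): string {
  const cues = segments.map((segment, index) => {
    const range = `${formatSrtTime(segment.startTime)} --> ${formatSrtTime(segment.endTime)}`
    return `${index + 1}\n${range}\n${segment.speakerName}: ${segment.text}`
  })
  return cues.join('\n\n') + '\n'
}

/** Plain text lines in the form "[HH:MM:SS] Speaker: text". */
function toText(segments: ExportSegment[]): string {
  return segments
    .map((segment) => {
      const stamp = formatSrtTime(segment.startTime).slice(0, 8)
      return `[${stamp}] ${segment.speakerName}: ${segment.text}`
    })
    .join('\n')
}
```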
### Statistics and Analytics
The system tracks comprehensive statistics:
- Total speaking time per speaker
- Number of segments per speaker
- Average segment length
- Session duration and timeline
- Recognition confidence scores
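
Most of these figures can be derived from the segment list alone. The following is a minimal sketch under that assumption; `SpeakerStats` and `computeSpeakerStats` are hypothetical names, and the project may well maintain running totals instead of recomputing them.

```typescript
// Hypothetical aggregation over finalized segments.
interface StatSegment {
  speakerId: string
  startTime: number // ms
  endTime: number // ms
  confidence: number // 0-1
}

interface SpeakerStats {
  totalSpeakingTimeMs: number
  segmentCount: number
  averageSegmentMs: number
  averageConfidence: number
}

function computeSpeakerStats(segments: StatSegment[]): Map<string, SpeakerStats> {
  const stats = new Map<string, SpeakerStats>()
  for (const segment of segments) {
    // Guard against segments whose end precedes their start.
    const duration = Math.max(0, segment.endTime - segment.startTime)
    const current = stats.get(segment.speakerId)
    const previousCount = current ? current.segmentCount : 0
    const previousTotal = current ? current.totalSpeakingTimeMs : 0
    const previousConfidence = current ? current.averageConfidence : 0
    const count = previousCount + 1
    const total = previousTotal + duration
    stats.set(segment.speakerId, {
      totalSpeakingTimeMs: total,
      segmentCount: count,
      averageSegmentMs: total / count,
      // Running mean of the per-segment confidence values.
      averageConfidence: (previousConfidence * previousCount + segment.confidence) / count,
    })
  }
  return stats
}
```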
## 🔄 Real-time Processing Flow
1. **Audio Capture**: Microphone stream is captured and analyzed
2. **Voice Activity Detection**: System detects when someone starts/stops speaking
3. **Speaker Identification**: Voice characteristics are analyzed and matched to known speakers
4. **Speech Recognition**: Web Speech API processes audio into text
5. **Transcript Update**: New segments are added with speaker information
6. **UI Update**: Interface updates to show new content with speaker labels
## 🛠️ Configuration Options
### Audio Analysis Settings
- **Voice Activity Threshold**: Sensitivity for detecting speech
- **Silence Timeout**: How long silence must last before the next speech is treated as a possible speaker change
- **Similarity Threshold**: Minimum similarity for speaker matching
- **Feature Update Rate**: How often voice profiles are updated
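
To show how the first two settings interact, here is a small sketch of a voice activity detector driven by an energy threshold and a silence timeout. The option names, units, and the factory function `createVoiceActivityDetector` are assumptions for illustration; this document does not specify the real configuration keys.

```typescript
// Hypothetical settings; names and units are assumed.
interface VoiceActivityOptions {
  voiceActivityThreshold: number // level at or above which a frame counts as speech
  silenceTimeoutMs: number // silence needed before speech is considered ended
}

type VoiceActivityEvent = 'start' | 'end' | null

/**
 * Returns an update function. Call it once per audio frame with the current
 * level and a timestamp; it reports when speech starts or ends.
 */
function createVoiceActivityDetector(
  options: VoiceActivityOptions
): (level: number, nowMs: number) => VoiceActivityEvent {
  let speaking = false
  let lastVoiceAt = -Infinity
  return (level, nowMs) => {
    if (level >= options.voiceActivityThreshold) {
      lastVoiceAt = nowMs
      if (!speaking) {
        speaking = true
        return 'start'
      }
      return null
    }
    // Below threshold: only end the utterance after enough continuous silence.
    if (speaking && nowMs - lastVoiceAt >= options.silenceTimeoutMs) {
      speaking = false
      return 'end'
    }
    return null
  }
}
```

The timeout keeps short pauses between words from splitting one utterance into many segments; a longer value produces fewer, longer segments at the cost of slower speaker-change detection.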
### Display Options
- **Speaker Colors**: Customizable color palette for speakers
- **Timestamp Format**: Choose between different time display formats
- **Auto-scroll Behavior**: Control when and how auto-scrolling occurs
- **Segment Styling**: Customize visual appearance of transcript segments
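
A customizable palette could be applied as in the sketch below, which hands out colors in order and keeps each speaker's color stable for the session. The palette values and the function name `createColorAssigner` are made up for this example. Colors repeat once there are more speakers than palette entries, so uniqueness only holds up to the palette size.

```typescript
// Hypothetical palette; any list of CSS color strings would work.
const DEFAULT_PALETTE = ['#e6194b', '#3cb44b', '#4363d8', '#f58231', '#911eb4']

/** Returns a function that maps a speaker ID to a stable color. */
function createColorAssigner(
  palette: string[] = DEFAULT_PALETTE
): (speakerId: string) => string {
  if (palette.length === 0) {
    throw new Error('palette must contain at least one color')
  }
  const assigned = new Map<string, string>()
  return (speakerId) => {
    const existing = assigned.get(speakerId)
    if (existing !== undefined) return existing
    // Cycle through the palette in order of first appearance.
    const color = palette[assigned.size % palette.length]
    assigned.set(speakerId, color)
    return color
  }
}
```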
## 🔍 Troubleshooting
### Common Issues
1. **Speaker Not Identified**:
   - Ensure good microphone quality
   - Check for background noise
   - Verify speaker is speaking clearly
   - Allow time for voice profile creation
2. **Incorrect Speaker Assignment**:
   - Check microphone positioning
   - Verify audio quality
   - Consider adjusting similarity threshold
   - Manually rename speakers if needed
3. **Missing Transcript Segments**:
   - Check internet connection stability
   - Verify browser compatibility
   - Ensure microphone permissions are granted
   - Check for audio processing errors
### Performance Optimization
1. **Audio Quality**: Use high-quality microphones for better speaker identification
2. **Environment**: Minimize background noise for clearer voice analysis
3. **Browser**: Use Chrome or Chromium-based browsers for best performance
4. **Network**: Ensure stable internet connection for speech recognition
## 🚀 Future Enhancements
### Planned Features
- **Machine Learning Integration**: Improved speaker identification using ML models
- **Voice Cloning Detection**: Identify when speakers are using voice modification
- **Emotion Recognition**: Detect emotional tone in speech
- **Language Detection**: Automatic language identification and switching
- **Cloud Processing**: Offload heavy processing to cloud services
### Integration Possibilities
- **Video Analysis**: Combine with video feeds for enhanced speaker detection
- **Meeting Platforms**: Integration with Zoom, Teams, and other platforms
- **AI Summarization**: Automatic meeting summaries with speaker attribution
- **Search and Indexing**: Full-text search across all transcript segments
## 📝 Usage Examples
### Basic Usage
1. Start a video chat session
2. Click the transcription button
3. Allow microphone access
4. Begin speaking; speakers are identified automatically
5. View real-time transcript with speaker labels
### Advanced Features
1. **Customize Display**: Toggle speaker labels and timestamps
2. **Export Transcripts**: Download in your preferred format
3. **Manage Speakers**: Rename speakers for better organization
4. **Analyze Statistics**: View speaking time and participation metrics
### Integration with Other Tools
- **Meeting Notes**: Combine with note-taking tools
- **Action Items**: Extract action items with speaker attribution
- **Follow-up**: Use transcripts for meeting follow-up and documentation
- **Compliance**: Maintain records for regulatory requirements
---
*The enhanced transcription system provides a comprehensive solution for real-time speaker identification and transcript management, ensuring no spoken words are lost while providing rich metadata about conversation participants.*