§ProcessContent
§File: Indexing/Process/ProcessContent.rs
§Role in Air Architecture
Provides content processing functionality for the File Indexer service, handling encoding detection, MIME type detection, and content tokenization.
§Primary Responsibility
Process file content for indexing by detecting encoding and MIME types, and by tokenizing text for search operations.
§Secondary Responsibilities
- File encoding detection (UTF-8, UTF-16, ASCII)
- MIME type detection from extensions and content
- Content tokenization for search indexing
- Language detection for code analysis
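As an illustration of the encoding detection responsibility listed above, the following is a minimal sketch using only the standard library. The `Encoding` enum and the `detect_encoding_sketch` function are hypothetical names chosen for this example and are not the module's actual API. The sketch assumes a BOM check comes first, followed by an ASCII check and a UTF-8 validity check.

```rust
/// Encodings named in the list above (hypothetical enum, not the crate's own type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Encoding {
    Utf8,
    Utf16Le,
    Utf16Be,
    Ascii,
}

/// Guess an encoding from a BOM first, then from the byte values.
/// Returns `None` when no reasonable guess can be made.
pub fn detect_encoding_sketch(bytes: &[u8]) -> Option<Encoding> {
    // `starts_with` never indexes out of bounds, so short inputs are safe.
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        return Some(Encoding::Utf8);
    }
    if bytes.starts_with(&[0xFF, 0xFE]) {
        return Some(Encoding::Utf16Le);
    }
    if bytes.starts_with(&[0xFE, 0xFF]) {
        return Some(Encoding::Utf16Be);
    }
    // No BOM: pure 7-bit content is reported as ASCII (this includes empty input).
    if bytes.is_ascii() {
        return Some(Encoding::Ascii);
    }
    // Otherwise accept the content as UTF-8 only if it validates.
    if std::str::from_utf8(bytes).is_ok() {
        return Some(Encoding::Utf8);
    }
    None
}
```

A caller could fall back to a default such as UTF-8 with `detect_encoding_sketch(bytes).unwrap_or(Encoding::Utf8)`. Whether the real module does so is not stated here.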
§Dependencies
External Crates:
- None (uses std library)
Internal Modules:
- crate::Result - Error handling type
§Dependents
- Indexing::Scan::ScanFile - Content processing during file scan
- Indexing::Store::StoreEntry - Index storage operations
§VSCode Pattern Reference
Inspired by VSCode’s content processing in src/vs/base/node/encoding/
§Security Considerations
- Safe BOM marker detection
- Null byte filtering
- Length limits on processed content
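The null byte filtering and length limits above could look roughly like the sketch below. Both function names are hypothetical, and the choice to keep newlines, carriage returns, and tabs while dropping other control characters is an assumption rather than documented behavior.

```rust
/// Drop NUL bytes and other control characters.
/// Assumption: newline, carriage return, and tab are kept so line structure survives.
pub fn sanitize_sketch(content: &str) -> String {
    content
        .chars()
        .filter(|&c| !c.is_control() || matches!(c, '\n' | '\r' | '\t'))
        .collect()
}

/// Keep at most `max_chars` characters.
/// The cut is made on a character boundary, so slicing a multi-byte
/// UTF-8 sequence in half (which would panic) cannot happen.
pub fn truncate_sketch(content: &str, max_chars: usize) -> &str {
    match content.char_indices().nth(max_chars) {
        // `byte_index` is where the first excluded character starts.
        Some((byte_index, _)) => &content[..byte_index],
        // Fewer than `max_chars + 1` characters: nothing to cut.
        None => content,
    }
}
```

Counting characters rather than bytes matches the "maximum size in characters" wording used for the truncation function in this module.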
§Performance Considerations
- Efficient tokenization with minimal allocations
- Early termination for binary files
- Lazy content evaluation
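Early termination for binary files can be sketched as follows. The sample size, the ratio threshold, and the function name are all hypothetical values picked for illustration; the module's real thresholds are not documented here.

```rust
/// Hypothetical limits: inspect at most this many leading bytes.
const SAMPLE_LIMIT: usize = 8192;
/// Hypothetical threshold: fraction of control bytes above which content counts as binary.
const MAX_NON_TEXT_RATIO: f64 = 0.30;

/// Return `true` when the content is likely binary.
pub fn is_binary_sketch(bytes: &[u8]) -> bool {
    // Only a bounded prefix is examined, so large files stay cheap.
    let sample = &bytes[..bytes.len().min(SAMPLE_LIMIT)];
    if sample.is_empty() {
        return false;
    }

    let mut non_text = 0usize;
    for &b in sample {
        // A single NUL byte is treated as conclusive: stop immediately.
        if b == 0 {
            return true;
        }
        // Count control bytes other than tab, line feed, and carriage return.
        // Bytes >= 0x80 are not counted because they occur in valid UTF-8 text.
        if b < 0x20 && !matches!(b, b'\t' | b'\n' | b'\r') {
            non_text += 1;
        }
    }

    (non_text as f64) / (sample.len() as f64) > MAX_NON_TEXT_RATIO
}
```

Because the loop returns at the first null byte, typical binary formats are rejected after reading only a few bytes, and no allocation is needed.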
§Error Handling Strategy
Content processing functions return Option or a safe default when detection fails instead of returning an error, so that indexing can continue.
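The two styles described above, a safe default and an Option, might look like this in practice. This is a sketch under assumptions: the function names, the small extension table, and the shebang rules are hypothetical and far less complete than a real implementation would be.

```rust
use std::path::Path;

/// Safe-default style: invalid UTF-8 never fails, it is replaced with U+FFFD.
pub fn content_to_string_sketch(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Option style: `None` means "unknown", and the caller picks the fallback.
/// The extension is checked first, then the shebang on the first line.
pub fn detect_language_sketch(file_name: &str, first_line: &str) -> Option<&'static str> {
    let extension = Path::new(file_name).extension().and_then(|e| e.to_str());
    match extension {
        Some("rs") => return Some("rust"),
        Some("py") => return Some("python"),
        Some("ts") => return Some("typescript"),
        _ => {}
    }

    // No known extension: look for an interpreter name in a shebang line.
    if first_line.starts_with("#!") {
        if first_line.contains("python") {
            return Some("python");
        }
        if first_line.contains("bash") || first_line.ends_with("/sh") {
            return Some("shell");
        }
    }
    None
}
```

An indexing loop can then keep going with something like `detect_language_sketch(name, line).unwrap_or("plaintext")` instead of aborting on an unrecognized file.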
§Thread Safety
Content processing functions are pure and safe to call from parallel indexing tasks.
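Because pure functions share no mutable state, they can be called from several threads without locking. The sketch below shows one way this could be done with scoped threads from the standard library. `line_count_sketch` and `count_lines_in_parallel` are hypothetical helpers, and spawning one thread per document is only for illustration; a real indexer would more likely use a bounded worker pool.

```rust
use std::thread;

/// A pure function: the result depends only on the input.
pub fn line_count_sketch(content: &str) -> usize {
    content.lines().count()
}

/// Process each document on its own scoped thread and collect results in order.
pub fn count_lines_in_parallel(documents: &[&str]) -> Vec<usize> {
    thread::scope(|scope| {
        // Scoped threads may borrow `documents` because the scope joins
        // every thread before returning.
        let handles: Vec<_> = documents
            .iter()
            .map(|doc| scope.spawn(move || line_count_sketch(doc)))
            .collect();

        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```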
Functions§
- ContentToString - Convert content to UTF-8 string with error handling
- DetectEncoding - Detect file encoding (simplified detection)
- DetectLanguage - Detect programming language from file extension and shebang
- DetectMimeType - Detect MIME type with comprehensive file type detection
- GetCharCount - Get char count from content
- GetLineCount - Get line count from content
- IsBinaryContent - Check if content is likely binary (contains null bytes or high ratio of non-text)
- SanitizeContent - Remove null bytes and control characters from content
- TokenizeContent - Tokenize content for indexing with improved word boundary handling
- TruncateContent - Truncate content to specified maximum size in characters
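To make the tokenization and character counting entries above more concrete, here is a minimal sketch. The function names are hypothetical, and the word boundary rule (split on anything that is not alphanumeric or an underscore, then lowercase) is an assumption; the module's actual boundary handling is not described on this page.

```rust
/// Split content into lowercase tokens.
/// Assumption: identifiers such as `read_file` should stay in one piece,
/// so underscores count as word characters.
pub fn tokenize_sketch(content: &str) -> Vec<String> {
    content
        .split(|c: char| !(c.is_alphanumeric() || c == '_'))
        // Consecutive separators produce empty slices; skip them.
        .filter(|token| !token.is_empty())
        .map(|token| token.to_lowercase())
        .collect()
}

/// Count Unicode scalar values, not bytes.
pub fn char_count_sketch(content: &str) -> usize {
    content.chars().count()
}
```

Lowercasing at tokenization time makes searches case-insensitive at the cost of one allocation per token. An implementation aiming for minimal allocations might instead return borrowed slices and normalize case at query time.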