
Speech to Text (STT):

Def: A Speech-To-Text (STT) or Automatic Speech Recognition (ASR) system transforms a speech recording input into output text that contains, word by word, what is said in the recording. EXAMPLE: raw audio -> STT -> "we are learning about speech to text"

We can describe the problem as sequence labelling: the computer receives a continuous sequence of data (the audio waveform) and has to assign discrete "labels" (words or letters) to parts of that sound.

Sequence-to-Sequence Mapping: This is a more general term. It simply means mapping one type of sequence (audio) to another type of sequence (text).

STT EVALUATION:

STT systems make mistakes, so before evaluating we must accept that these systems are not perfect.

Error Rates: Depending on the difficulty of the task, error rates typically range between 3% and 30%.

Why do errors happen? The accuracy depends on many factors, such as background noise (acoustic environment), the speaker's accent, emotions, or how fast they are speaking.

What You Need for Evaluation 

Test Set: You need a set of audio recordings. It must be large enough (at least 30 minutes or 2000 words) and varied (different speakers/conditions) to be statistically significant.

Reference (REF): You need humans to manually transcribe these recordings perfectly. This is expensive and slow (it takes about 10 minutes of work to transcribe 1 minute of audio).

Word Error Rate (WER) -> Standard metric for ASR accuracy

How to calculate it: First, you align the Reference (REF) (what was actually said) with the Recognized (REC) text (what the computer wrote). You then look for three types of errors:

1. S (Substitutions): The computer replaced a word with the wrong one (e.g., "understands" became "under").

2.D (Deletions): The computer missed a word entirely. 

3.I (Insertions): The computer added a word that wasn't there. ----- BONUS: Hits (H) – Words correctly recognized -----


Once you count these errors, you use this formula: WER = ((S + D + I) / N) x 100%, where N is the total number of words in REF (the ground truth).

Important Notes on WER:

It can be > 100%: Because of "Insertions," the computer can write more errors than there were original words.

Trade-off: You can tune the system to be "shy" (more deletions) or "talkative" (more insertions) using an insertion-penalty parameter.
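The WER computation described above can be sketched in a few lines of Python: a standard word-level Levenshtein alignment counts the minimum number of substitutions, deletions and insertions (the function name and example sentences are illustrative, not from the slides).

```python
def wer(ref: str, rec: str) -> float:
    """Word Error Rate: edit distance between REF and REC word sequences."""
    r, h = ref.split(), rec.split()
    # d[i][j] = minimum edits to turn the first i REF words into the first j REC words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])   # substitution (or hit)
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)   # N = words in REF

# Two deletions out of six reference words -> WER ~ 33.3%
print(wer("the cat sat on the mat", "the cat sat mat"))
```

Note that because insertions are counted in the numerator while N counts only REF words, this value can indeed exceed 100%.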

Alternative Metrics -> Sometimes, counting words isn't the best method, so other metrics exist:

PER (Phone Error Rate): Counts errors by sound (phonemes) instead of words.

SER (Sentence Error Rate): The percentage of sentences that have at least one mistake. This is usually very high because getting a long sentence 100% perfect is difficult.

CER (Character Error Rate): Counts errors by individual letters (characters).

70s–80s: Template-Based Approaches for Speech recognition (Template Matching):

Feature extraction was done by chopping the audio into tiny slices called "windows". For each slice, the system computes a list of numbers that describe the sound frequencies, called "feature vectors". The specific type used was Mel-Frequency Cepstral Coefficients (MFCC).

Entire workflow: 1. Off-line (training): training audio (people speaking) -> feature extraction -> save these extractions in a database as "templates".

2. On-line (recognition): word to recognize (a user says a new word) -> feature extraction -> (small) database of features -> comparison (the new recording is compared against every template in the database to see which one is the closest match) -> recognized word.

The comparison uses DTW (Dynamic Time Warping, an algorithm that "stretches" or "compresses" the time axis to make the two recordings fit together best), because the training and testing recordings can be at different speeds.
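A minimal DTW sketch (illustrative, not the exact template-matching code from the course): it fills a cost matrix where each cell allows a diagonal match or a stretch/compression of either time axis.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """DTW distance between two feature sequences of shape (T, D)."""
    Ta, Tb = len(a), len(b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            # best predecessor: match, or stretch one of the two time axes
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return float(D[Ta, Tb])

# A template compared against a "slower" version of itself (every frame doubled)
template = np.array([[0.0], [1.0], [2.0]])
slow = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]])
print(dtw_distance(template, slow))   # identical content, so distance 0.0
```

This is exactly why DTW beats a naive frame-by-frame comparison: the slowed-down recording still matches its template perfectly.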


HMM-GMM Architecture (Acoustic Model)

The HMM-GMM is a statistical framework for speech recognition. It combines Hidden Markov Models (HMM) to model the temporal sequence of speech and Gaussian Mixture Models (GMM) to model the acoustic properties of sounds.

Step 1: Parameterization (Input): The system transforms the raw audio into a sequence of Feature Vectors (typically MFCCs). This process extracts the relevant acoustic information frame by frame.

Step 2: Modeling (HMM & GMM): HMMs (probabilistic generative finite-state models, with a left-to-right topology) represent the sequence of speech sounds (phonemes) using states and transition probabilities (a_ij) that determine the likelihood of moving between states.

GMMs (parametric models of a probability density function) are associated with each HMM state to calculate the Emission Probability, modeling how well a feature vector fits a specific sound. Since acoustic data is complex, each GMM is typically composed of a weighted sum of multiple Gaussian distributions (mixtures). (Each HMM state has its own GMM.)

Step 3: Recognition (Decoding): Based on the statistical models, the recognizer searches for the most likely sequence of states that matches the input feature vectors. It combines the probabilities from the GMMs (acoustic match) and the HMMs (sequence logic) to output the recognized text.

Combining Acoustic Model and LM : Given a sequence of observations Y=y_1, ... , y_T, the recognizer identifies the sequence of words W=w_1, ..., w_k from the lexicon that most likely generated Y. This is achieved by solving the equation W = argmax_W [P(Y|W)*P(W)]. In this formula, P(Y|W) represents the acoustic probability (how well the audio features match the ideal statistical models), while P(W) represents the language probability (the likelihood of that word sequence occurring based on linguistic context).

EXAMPLE: speech -> feature extraction -> feature vectors (Y) -> decoder (acoustic models, pronunciation dictionary, language model) -> recognized text output (words W).


Pronunciation Dictionary or The Lexicon is a file containing the list of all allowed words mapped to their corresponding phonetic transcriptions (sequences of phonemes). This connects the words (W) to the sounds (Y). 

Classic method, N-grams: a statistical model that assumes the probability of a word depends only on the N-1 preceding words. It usually uses N = 3. It is trained by reading millions of texts and counting, e.g., how many times "the dog barks" appears versus "the dog flies": $P(\text{word} \mid \text{previous}) \approx \frac{\text{count(whole phrase)}}{\text{count(previous phrase)}}$. The slide mentions "data scarcity": if a specific sequence of words was never seen in the training text, its probability becomes zero. Solution? Smoothing techniques.
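A count-based sketch of this idea, using bigrams (N = 2) for brevity; the tiny corpus and function name are invented for illustration. Note how the never-seen pair gets probability exactly zero, which is the data-scarcity problem that smoothing techniques fix.

```python
from collections import Counter

def train_bigram(corpus):
    """Build P(word | prev) ~ count(prev, word) / count(prev) from raw sentences.

    Minimal sketch: no smoothing, no sentence-boundary symbols.
    """
    pair_counts, word_counts = Counter(), Counter()
    for sentence in corpus:
        words = sentence.split()
        word_counts.update(words)
        pair_counts.update(zip(words, words[1:]))   # adjacent word pairs

    def prob(prev, word):
        if word_counts[prev] == 0:
            return 0.0
        return pair_counts[(prev, word)] / word_counts[prev]

    return prob

prob = train_bigram(["the dog barks", "the dog barks", "the cat meows"])
print(prob("dog", "barks"))   # seen pair -> high probability
print(prob("dog", "flies"))   # unseen pair -> zero (needs smoothing)
```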

Language Modelling - RNNs (The Modern Approach): Artificial Neural Networks, specifically RNNs (Recurrent Neural Networks) like LSTMs (Long Short-Term Memory) and GRUs, are very effective for Language Modelling because they can remember longer contexts than N-grams. But the field has moved towards the Transformer architecture, leading to the era of Large Language Models (LLMs) (the technology behind systems like ChatGPT). NLP and ASR are very related.

Real-time decoding (how to make it fast): as we saw in the architecture equation, the system combines HMM states, the dictionary, and the language model into one massive "recognition network". The number of possible paths is far too large to search exhaustively in real time (a direct Viterbi implementation is infeasible).

Solution? A token-passing algorithm together with beam search to optimize decoding, which requires a specific way of organizing the network.


Training: HMM training is a complex, iterative process that starts with very simple models and iteratively refines them:

1 – Data preparation
2 – Feature extraction
3 – Train monophone models: one model for each phone, e.g. /a/
4 – Train initial triphone models: one for each phone in context, e.g. /p/-/a/+/t/
5 – Train advanced triphone models
6 – Final training steps (e.g. discriminative training)

Toolkits: To implement this effectively, engineers rely on specialized software toolkits such as HTK (classic, well-documented for beginners) or KALDI (modern, fast, and the standard for experts). 

Hybrid HMM-DNN Systems:

Hybrid HMM-DNN Training: 1.Train an old HMM-GMM system first: Since GMMs can learn without perfect labels, you train one of those first. 2.Force Alignment: Use that old GMM system to label the audio. It goes through the recording and marks: "Time 0.1s is State 1", "Time 0.2s is State 2". 3.Train the DNN: Now that you have labels (generated by the GMM), you can train the Neural Network to predict those states. 

DNN Structure: The input is the audio features, and the output is a probability distribution over the HMM states (using a softmax layer).

Hybrid HMM-DNN Decoding (after training): Process: 1.Extract features: Get the vectors from the audio. 2.Feed the DNN: Put those vectors into the Neural Network. 3.Forward Pass: The Neural Network calculates the probabilities for every possible state. 4.Beam Search: These probabilities are sent to the standard HMM decoder (the same one we used before).

HMM-DNN Toolkits: Difficulty: Training these hybrid systems is even harder than the old ones because you need to manage both HMMs and Neural Networks.

The Tool: KALDI is highlighted again as the best tool for this. It is the industry standard for these hybrid systems because it handles the complex connection between the HMM logic and the DNN math very efficiently.


NEW TOPIC: People originally wanted to use RNNs for sequence labelling, but it wasn't possible directly, so they had to rely on complex hybrid systems like HMM-DNNs.

Sequence labelling (the computer receives a continuous sequence of data (audio frames) and has to assign discrete "labels" (words or letters) to parts of that sound) has a problem: the audio is huge and continuous, but the text is short. It's tricky because the computer doesn't know exactly when each letter is spoken in the audio.

So sequence labelling is hard for RNNs because it requires frame-by-frame alignment during training, and this creates two issues:

Pre-segmentation: To train the network, we would need a human to manually label every single millisecond of the audio (e.g., "the letter A starts at 0.1s and ends at 0.2s"). This data is expensive and difficult to obtain.

Post-processing: Without a mechanism to handle time, the network outputs repetitive, messy predictions (like H-H-H-E-E-L-L-L-O) that require complex external algorithms to clean up.

Solution? Using CTC (Connectionist Temporal Classification). Connectionist Temporal Classification (CTC) solves this by enabling end-to-end training without explicit alignment. CTC introduces a 'blank' token (epsilon or _) to model silence and separate repeated characters, and it trains by mathematically summing over all possible alignments instead of forcing a single specific timing.
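The CTC collapsing function B can be sketched in a few lines (the function name is illustrative): merge consecutive repeated symbols, then drop blanks. The blank is what keeps genuinely doubled letters (like the two L's in HELLO) apart.

```python
def ctc_collapse(path: str, blank: str = "_") -> str:
    """B(pi): merge repeated symbols, then remove blank tokens."""
    out = []
    prev = None
    for sym in path:
        # keep a symbol only when it starts a new run and is not the blank
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

print(ctc_collapse("HH_EE_LL_LL_O"))   # blanks separate the two L's -> HELLO
print(ctc_collapse("HHEELLLLO"))       # without blanks the L's merge -> HELO
```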

(TODO: draw the diagram from slide 20.)

NOTE! THE PROBLEM WITH CTC is that "each RNN output at a time t is assumed to be conditionally independent of the others". That is, each letter is predicted independently, without looking back; it depends only on the sound heard at that instant.


WORKFLOW:

After the RNN and a softmax function, the probability at each time t is y_{k,t} = Pr(k, t | x). This gives us, at each time step, the probability of each symbol, and we use it for what follows. Example: t=1: N 0.8, O 0.1, _ 0.1; t=2: ...

The probability of a complete path pi of T CTC symbols is: p(pi | x) = prod_t y_{pi_t, t}. Example: p(N, N, O) = 0.8 x 0.4 x 0.9 = 0.288.

Probability of a particular phone string: p(l | x) = sum_{pi} p(pi | x) -> the final word (l) is calculated as the SUM of the probabilities of all possible paths (pi) that, when collapsed, form that word. Example: P("NO" | audio) = p(N, N, O) + p(N, _, O) + ...

To compute this sum of probabilities efficiently (because there are lots of paths), we use the CTC forward-backward algorithm to compute the probability of the specific phone string.
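For tiny examples, p(l|x) can be computed by brute force: enumerate every path and sum those that collapse to the target string. The probability table below is invented for illustration (only the t=1 row and the p(N,N,O) = 0.288 path match the slide's example); real systems use forward-backward precisely because this enumeration explodes exponentially.

```python
from itertools import product

def string_probability(target, probs, symbols, blank="_"):
    """Brute-force p(l|x): sum p(pi|x) over all paths collapsing to `target`.

    probs: one dict {symbol: probability} per time step (softmax outputs).
    """
    def collapse(path):
        out, prev = [], None
        for s in path:
            if s != prev and s != blank:
                out.append(s)
            prev = s
        return "".join(out)

    total = 0.0
    for path in product(symbols, repeat=len(probs)):   # every T-length path
        if collapse(path) == target:
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]                       # p(pi|x) = product of y's
            total += p
    return total

probs = [{"N": 0.8, "O": 0.1, "_": 0.1},    # t = 1 (from the slide)
         {"N": 0.4, "O": 0.5, "_": 0.1},    # t = 2 (assumed values)
         {"N": 0.05, "O": 0.9, "_": 0.05}]  # t = 3 (assumed values)
# Sums over (N,N,O), (N,O,O), (N,_,O), (N,O,_), (_,N,O)
print(string_probability("NO", probs, ["N", "O", "_"]))
```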

CTC LOSS: CTC trains the network by minimizing the CTC loss function (where y* is the ground-truth phone string of training utterance x): CTC(x) = -log Pr(y* | x). So if the probability is high, the log is close to 0 and the punishment is low, and vice versa.

CTC Training: The CTC cost function is differentiable, so this cost can be propagated with backpropagation. (Here we have the audio and our target word: "NO", "HELLO", ...)

CTC Decoding: Decoding is simple (in theory): it consists of selecting the labelling (l) that maximizes Pr(l | x): h(x) = argmax_l p(l | x). In practice, there is no tractable algorithm to obtain h(x), only approximations (best-path decoding (highest score), prefix decoding (a smarter method)). (Here the network is already trained and you don't have the target word.)

Conclusion: CTC beats or ties with old systems, but without the headache of having to manually align the data.

Improvement: Finally, to further improve accuracy and correct spelling errors, WFST-based decoding is often integrated. This allows the use of Language Models and dictionaries through efficient Beam Search, achieving accuracy comparable to complex hybrid systems but with significantly faster decoding speeds.


To solve CTC's conditional independence problem, we use the RNN-T. Structure of an RNN Transducer:

2 RNNs for decoding:

1. CTC acoustic model with BLSTMs: computes Pr(k | t), where k is a CTC symbol and t is the input timestamp (the audio). It introduced deep BLSTMs for the first time in speech recognition. (It listens to the audio.)

2. Auto-regressive RNN that works as a language model: computes Pr(k | u), where k is a CTC symbol and u is the output timestamp (e.g., after "Ho", what is the probability that "l" comes next?). It depends on ALL previously decoded symbols. (It looks at the text history.)

Both RNNs are combined by a separate feed-forward output network that computes Pr(k | t, u); its inputs are the hidden activations of the two RNNs.

Training and decoding: the whole RNN-Transducer can be trained from scratch, but better results are obtained if the CTC acoustic model is pretrained with the CTC loss and the auto-regressive RNN LM is pretrained; the whole RNN-Transducer is then further retrained. Decoding is performed with beam search, which is faster and more effective than prefix search.

The RNN Transducer is superior to CTC, and pre-training the acoustic and linguistic components separately before joint training yields the lowest error rate.

ATTENTION:

Why use attention for sequences? Initial Seq2Seq model structure: the encoder maps the input sequence to an intermediate vector, and the decoder maps that intermediate vector to the output sequence.
(TODO: MAKE AN IMAGE!!!)

Limitations: the intermediate vector representation is a bottleneck, and the decoder needs different input information at different time steps (not always the same input vector). E.g.: "I like cats more than dogs" vs. "I like dogs more than cats".

How does attention work? (Slide 40 is not very clear to me.)


FRAMEWORK: We have an input audio, which we process with an encoder (bidirectional LSTM) to create encoded features h = (h_1, ..., h_L). Then, during decoding, we start from an initial state s_0, and at each step (i) the decoder uses its current memory (s_{i-1}) to look over the whole encoded map (h) and compute the attention weights (alpha), i.e., where to look to decide which phoneme it is. From these we get g_i, the context vector.

alpha_i = Attend(s_{i-1}, alpha_{i-1}, h); g_i = sum_j alpha_{i,j} h_j; y_i = Generate(s_{i-1}, g_i); and the step ends by computing a new generator state s_i = Recurrency(s_{i-1}, g_i, y_i). (IMAGE.)

Types of attention:
Content-based: alpha_i = Attend(s_{i-1}, h). Limitation: similar speech fragments at different positions get the same attention (and they should not!).
Location-based: alpha_i = Attend(s_{i-1}, alpha_{i-1}). Limitation: the attention mechanism is forced to predict the duration of the next phone based only on the previous one (which is very hard!).
Hybrid: alpha_i = Attend(s_{i-1}, alpha_{i-1}, h). The proposed approach.

Attention (even content-based) is able to find the correct alignments

Joint CTC/Attention – Main idea: the main weakness of attention mechanisms is the lack of left-to-right constraints, unlike CTC or HMM-DNN.

Idea: improve attention-based encoder-decoders by training them in a Multi-Task Learning (MTL) framework, combining the attention decoder with the CTC loss: L_MTL = lambda * L_CTC + (1 - lambda) * L_Attention, where 0 <= lambda <= 1. Only training is modified; decoding is still performed with attention. Joint CTC/Attention – results: faster convergence and better results.

Joint CTC/Attention – Improvements: joint CTC-attention decoding (not only training, as in previous works); a deep CNN (VGG) in the encoder network; an RNN-LM in parallel with CTC and attention, which can be trained jointly or separately, and can be used to rescore the output or be integrated in the beam search (one-pass decoding).


Transformers: seq2seq models with attention have limitations: the encoder and decoder are usually RNNs (LSTMs), so training and decoding are slow due to recurrence and cannot be parallelized, and long-range dependencies are hard to model, even with LSTMs.

Instead of using RNNs going left-to-right and right-to-left, use attention mechanisms (TRANSFORMERS) to allow access to all of the input (or output) without recurrences.

  • Self-Attention: This is the foundational mechanism. With Self-Attention, the output of each layer at any time depends on all the outputs of the previous layer (or inputs to the transformer). The Transformer learns to focus its attention on the important parts of the sequence.

    • Attention Definition: An attention mechanism is a function that maps an Input Query (Q) vector and a set of Key (K)-Value (V) pairs into a weighted sum of the values.

  • Scaled Dot Product Attention: This is the specific compatibility function used. It involves taking the dot product between keys and queries, scaling the result by $1/\sqrt{d_k}$, optionally masking future values, and then applying a softmax function to normalize the weights.

  • Multi-Head Attention (MHA): This enhances the focus by performing multiple attention calculations in parallel. Instead of one single attention, MHA projects Queries, Keys, and Values into h different, lower-dimensionality spaces, performs h scaled dot product attentions, and then concatenates the outputs. This allows the model to focus on different positions of the input simultaneously.
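The scaled dot product attention described above can be sketched in NumPy (a minimal single-head version; the shapes and the boolean masking convention are assumptions for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    mask: optional boolean (n_q, n_k) array; True = position is hidden.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # query/key compatibility
    if mask is not None:
        scores = np.where(mask, -1e9, scores)  # masked positions get ~0 weight
    # numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                # weighted sum of values

Q = np.array([[1.0, 0.0]])                     # one query, closest to key 0
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[1.0], [2.0]])
out, w = scaled_dot_product_attention(Q, K, V)
print(w)   # weights sum to 1, with more mass on the matching key
```

Multi-Head Attention simply runs h copies of this function on lower-dimensional projections of Q, K and V and concatenates the results.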


Architecture and Final Components

  • Encoder: The Encoder contains Self-Attention layers where the Queries, Keys, and Values are all the same (the output of the previous layer at all timesteps).

  • Decoder: The Decoder contains two types of attention layers: Self-Attention layers (masked to hide future outputs) and Encoder-Decoder Attention layers (where the Query comes from the previous decoder layer, and Keys/Values come from the output of the Encoder).

  • Feed-Forward Networks: Position-wise Feed Forward Networks (composed of two linear layers with a ReLU activation in between) are applied after each attention module.

  • Residual Connections and Normalization: The model uses Residual connections and Layer Normalization throughout its architecture.

  • Positional Encoding: Since the model has no recurrence, Positional Encoding (a set of sinusoidals with different frequencies) is added to the input embeddings to inject information about the order of the tokens.

  • Input/Output: The model uses Input/Output embeddings (mappings from words into vectors). The output probabilities are calculated using a linear layer with softmax activations to estimate next word probabilities.

  • Fixed Dimensionality: A key architectural constraint is that the dimensionality of inputs and outputs of all layers is fixed at a constant value ($d_{model}$).
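The sinusoidal positional encoding mentioned above can be sketched in NumPy (this sketch assumes an even $d_{model}$; the function name is illustrative):

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding.

    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    Assumes d_model is even.
    """
    pos = np.arange(max_len)[:, None]          # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / 10000 ** (i / d_model)      # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = positional_encoding(50, 8)
print(pe.shape)   # one d_model-dim vector per position, added to the embeddings
```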




Speech Transformer Specific Adaptations

The final architecture requires specific adjustments for ASR tasks. To handle the high frame rate of audio input, a crucial pre-processing block consisting of two Convolutional layers plus one linear layer is introduced at the input. This convolutional block significantly reduces the time and frequency dimensions by a factor of 4, thereby making the frame rate more similar to the character rate. This downsampling step simplifies and accelerates training. While developers may utilize several optional additional models, these are typically not necessary when using a large base Transformer model.


Wav2Vec2.0:
Main ideas: learn speech representations from the audio itself, with no labels, so huge amounts of audio can be used. Later, fine-tune the model with a small amount of transcribed speech using CTC, just adding an additional final linear layer and a softmax. The key is how to learn from speech alone: mask parts of the input (features) and force the model to learn to reconstruct the masked parts; instead of reconstructing the features, force the model to reconstruct quantized representations of the audio.

Architecture:

Input: The model receives the Raw Audio Waveform (X).

Encoding: This raw wave is passed through a Convolutional Neural Network (CNN), which acts as a feature encoder. Its job is to transform the complex waveform into Latent Speech Representations (Z). These vectors contain the acoustic essence of the sound.

Masking: Parts of the latent speech representations (Z) are randomly masked (hidden) before being fed into the Transformer.

Context (C): The Transformer processes the sequence with the gaps to generate Context Representations (C). The Transformer must use the surrounding sounds to predict what was hidden.

The Target (Q): The quantization module simultaneously converts the latent representations (Z) into Quantized Representations (Q). These are automatically-learned, discrete speech units (like internal phonetic codes) that act as the ground-truth target. Quantization is not differentiable, so a Gumbel softmax is used: P_g,v = exp(...) / sum{exp(...)}.



Contrastive Loss: The model is trained using a contrastive loss function (L_m) to force the predicted context (C) to be more similar to the true quantized code (Q) that was hidden than to any of the other incorrect codes (distractors).

Additionally, a diversity loss function L_d is included. Its goal is to force the vector quantization to use all the possible codebook entries equally: L = L_m + alpha * L_d.

Training: pre-training (training the feature encoder, transformer and quantizer) and fine-tuning (add a linear layer and a softmax on top; first train only that linear layer, then retrain the transformer as well).

Decoding: decoding is performed with standard beam search, combining the acoustic model with language models.


Conformer networks – Main idea: Transformers are good at capturing content-based global interactions, while CNNs exploit local features effectively. Combining both, we get the best of both worlds. Conformer: convolution-augmented transformer.
