## Speech Recognition
eevee reports the following metrics for evaluating ASR output against tagged ground-truth transcriptions:
| Metric | Description |
|---|---|
| WER | Word Error Rate |
| Utterance False Positive Rate (uFPR) | Ratio of cases where non-speech utterances were transcribed. |
| Utterance False Negative Rate (uFNR) | Ratio of cases where utterances were transcribed as silence. |
| SER | Sentence Error Rate |
| Min 3 WER | The minimum Word Error Rate when considering only the first three alternatives |
| Min WER | The minimum Word Error Rate across all the alternatives |
| Short Utterance WER | WER of utterances with a ground-truth length of 1 or 2 words |
| Long Utterance WER | WER of utterances with at least 3 words in the ground truth |
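For intuition, WER is the word-level edit distance (substitutions + insertions + deletions) between the prediction and the ground truth, normalized by the number of ground-truth words. Below is a minimal sketch of the idea, not eevee's implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    A sketch for intuition only; eevee computes this internally.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    if not ref:
        # Empty reference: any hypothesis words count as pure insertion error.
        return float(bool(hyp))
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("iya iya iya", "iya iya"))  # one deletion over 3 words ≈ 0.33
```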
### Data schema
We expect:

- `tagged.transcriptions.csv` to have columns called `id` and `transcription`, where `transcription` can have only one string as value for each row. If a transcription is not present, leave it empty as it is; it'll get parsed as `NaN`.
- `predicted.transcriptions.csv` to have columns called `id` and `utterances`, where each value in the `utterances` column looks like this:
```
'[[
  {"confidence": 0.94847125, "transcript": "iya iya iya iya iya"},
  {"confidence": 0.9672866, "transcript": "iya iya iya iya"},
  {"confidence": 0.8149829, "transcript": "iya iya iya iya iya iya"}
]]'
```
As you might have noticed, the value is expected to be in JSON format. Each `transcript` represents one alternative from the ASR, and `confidence` represents the ASR's confidence for that particular alternative. If no utterances are present for a particular `id`, leave the value as `'[]'` (`json.dumps` of an empty list `[]`).
Note: Please remove the `transcription` column from `predicted.transcriptions.csv` (if it exists) before using eevee.
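A minimal sketch of writing a conforming `predicted.transcriptions.csv` with pandas; the ids and values below are made up for illustration:

```python
import json

import pandas as pd

# One row per audio id. `utterances` holds the JSON-serialized alternatives
# (note the list-of-lists nesting shown above); use '[]' when the ASR
# returned nothing for that id.
alternatives = [[
    {"confidence": 0.94847125, "transcript": "iya iya iya iya iya"},
    {"confidence": 0.9672866, "transcript": "iya iya iya iya"},
]]
pred_df = pd.DataFrame({
    "id": ["a1", "a2"],  # made-up ids
    "utterances": [json.dumps(alternatives), json.dumps([])],
})
pred_df.to_csv("data/predicted.transcriptions.csv", index=False)
```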
### Usage

#### Command Line

Use the `asr` sub-command as shown below:

```
eevee asr ./data/tagged.transcriptions.csv ./data/predicted.transcriptions.csv
```
```
                        Value  Support
Metric
WER                  0.571429        6
Utterance FPR        0.500000        2
Utterance FNR        0.250000        4
SER                  0.666667        6
Min 3 WER            0.571429        6
Min WER              0.571429        6
Short Utterance WER  0.000000        1
Long Utterance WER   0.809524        3
```
For users who want utterance-level metrics or edit operations, add the `--dump` flag:

```
eevee asr ./data/tagged.transcriptions.csv ./data/predicted.transcriptions.csv --dump
```

This will produce two CSV files:

- `predicted.transcriptions-dump.csv`: utterance-level metrics.
- `predicted.transcriptions-ops.csv`: dataset-level edit operations.

The filenames are based on the prediction filename given by the user.
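Both dumps are plain CSVs, so they can be inspected with pandas; the exact columns depend on the eevee version, so print them rather than assuming names:

```python
import pandas as pd

# Utterance-level metrics and dataset-level edit operations from --dump.
dump_df = pd.read_csv("data/predicted.transcriptions-dump.csv")
ops_df = pd.read_csv("data/predicted.transcriptions-ops.csv")

print(dump_df.columns.tolist())  # inspect the available per-utterance metrics
print(ops_df.head())             # peek at the edit operations
```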
For users who want ASR metrics reported separately on noisy and non-noisy subsets of audios, use the `--noisy` flag:

```
eevee asr ./data/tagged.transcriptions.csv ./data/predicted.transcriptions.csv --noisy
```

Results come as two DataFrames, one each for the noisy and non-noisy subsets, in that order. An important note: the transcriptions in `tagged.transcriptions.csv` are expected to contain info tags, like `<audio_silent>`, `<inaudible>`, etc., which aren't expected when not using the `--noisy` flag.
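For illustration, a tagged file for a `--noisy` run might look like the sketch below; the ids are made up, and the exact tag conventions should be checked against your own tagging setup:

```python
import pandas as pd

# Hypothetical tagged ground truth where some transcriptions carry
# info tags such as <inaudible> or <audio_silent>.
true_df = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "transcription": ["iya iya", "<inaudible>", "<audio_silent>"],
})
true_df.to_csv("data/tagged.transcriptions.csv", index=False)
```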
#### Python module
```python
>>> import pandas as pd
>>> from eevee.metrics.asr import asr_report
>>>
>>> true_df = pd.read_csv("data/tagged.transcriptions.csv", usecols=["id", "transcription"])
>>> pred_df = pd.read_csv("data/predicted.transcriptions.csv", usecols=["id", "utterances"])
>>>
>>> asr_report(true_df, pred_df)
                        Value  Support
Metric
WER                  0.571429        6
Utterance FPR        0.500000        2
Utterance FNR        0.250000        4
SER                  0.666667        6
Min 3 WER            0.571429        6
Min WER              0.571429        6
Short Utterance WER  0.000000        1
Long Utterance WER   0.809524        3
```
For ASR metrics segregated into noisy and non-noisy subsets:
```python
>>> import pandas as pd
>>> from eevee.metrics.asr import asr_report, process_noise_info
>>>
>>> true_df = pd.read_csv("data/tagged.transcriptions.csv", usecols=["id", "transcription"])
>>> pred_df = pd.read_csv("data/predicted.transcriptions.csv", usecols=["id", "utterances"])
>>>
>>> noisy_dict, not_noisy_dict = process_noise_info(true_df, pred_df)
>>>
>>> asr_report(noisy_dict["true"], noisy_dict["pred"])
                        Value  Support
Metric
WER                  0.571429        6
Utterance FPR        0.500000        2
Utterance FNR        0.250000        4
SER                  0.666667        6
Min 3 WER            0.571429        6
Min WER              0.571429        6
Short Utterance WER  0.000000        1
Long Utterance WER   0.809524        3
>>> asr_report(not_noisy_dict["true"], not_noisy_dict["pred"])
                        Value  Support
Metric
WER                  0.571429        6
Utterance FPR        0.500000        2
Utterance FNR        0.250000        4
SER                  0.666667        6
Min 3 WER            0.571429        6
Min WER              0.571429        6
Short Utterance WER  0.000000        1
Long Utterance WER   0.809524        3
```
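To view both subsets side by side, the two reports can be concatenated (a sketch, assuming `asr_report` returns the DataFrames shown above):

```python
>>> noisy_report = asr_report(noisy_dict["true"], noisy_dict["pred"])
>>> clean_report = asr_report(not_noisy_dict["true"], not_noisy_dict["pred"])
>>> pd.concat([noisy_report, clean_report], axis=1, keys=["noisy", "not_noisy"])
```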