ASR accuracy for native vs. non-native Japanese speakers

Summary

This study assesses how accurately commercially available ASR tools recognise spoken Japanese from non-native speakers.

The ability of off-the-shelf ASR tools to recognise non-native speech to an acceptable standard is central to the software in my thesis, so this study tests the efficacy of that design choice.

Equally, there has been little public research into the word-recognition efficacy of these commercially available ASR tools, and less still into how they perform for non-native speakers 1)2). The native/non-native research that does exist concerns English rather than other languages (one study observed a 30% drop in recognition for non-native English speakers 3)). This study should add useful data and analysis to that body of research.

Finally, this study will also test how effectively these tools recognise non-native speech when their recognition space is contextually constrained, i.e. when the systems have been trained to expect certain responses.

Key Question

How effective are commercially available ASRs at comprehending non-native spoken Japanese?

Hypotheses

  • All ASRs will [recognise native speaker statements at a higher rate than non-native statements]
  • All ASRs will perform worse than native speakers at [recognising non-native speaker statements]
  • All ASRs will [recognise 24-week Japanese learners at a rate closer to native speakers than 0-week learners]
  • All ASRs will demonstrate an improved recognition rate when their recognition is contextually constrained, with a more notable increase for non-native speakers
  • Each ASR will perform differently
  • Each ASR will be categorisable as “forgiving” or “unforgiving”, consistently recognising user inputs (whether from native or non-native Japanese speakers) at a higher or lower rate than the other systems

Variables

  • ASR
    • Google
    • IBM
    • Facebook
    • Amazon
    • Microsoft
  • Context
    • No context
    • Contextually constrained
  • Speaker
    • Native speaker
    • Non-native speaker, 24 weeks tuition
    • Non-native speaker, 0 weeks tuition
  • Statements
    • Related to coffee ordering
    • Wider corpus examples

Recognition Measurements

  • Word recognition rate (and confidence)
  • Intent recognition rate (and confidence)
  • Entity recognition rate (and confidence)
  • Levenshtein distance between spoken and heard examples
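The Levenshtein distance listed above can be computed directly; a minimal dynamic-programming sketch (the function name is illustrative):

```python
def levenshtein(spoken: str, heard: str) -> int:
    """Minimum number of single-character edits (insertions,
    deletions, substitutions) turning `spoken` into `heard`."""
    # prev[j] holds the distance from the current prefix of
    # `spoken` to the first j characters of `heard`.
    prev = list(range(len(heard) + 1))
    for i, s in enumerate(spoken, 1):
        curr = [i]
        for j, h in enumerate(heard, 1):
            cost = 0 if s == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

Dividing the distance by the length of the spoken (reference) string yields a character error rate, which is easier to compare across utterances of different lengths and sidesteps word segmentation, a non-trivial issue for Japanese text.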

We must outline a procedure for balancing correct recognition against false positives, particularly inside trained systems (methodology from 4)).
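One candidate balance metric (an assumption on my part, not necessarily the methodology of 4)) is to report precision, recall, and their harmonic mean for intent recognition, so that a trained system that over-triggers intents is penalised alongside one that misses them:

```python
def precision_recall_f1(true_positives: int, false_positives: int,
                        false_negatives: int) -> tuple[float, float, float]:
    """Precision penalises false positives (mis-triggered intents);
    recall penalises false negatives (missed intents); F1 is their
    harmonic mean, a single balanced score."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```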

  • ASR recognition efficacy comparison [NATIVE vs NON-NATIVE24 vs NON-NATIVE0]: one-way ANOVA
  • ASR recognition efficacy comparison [ASR vs ASR vs ASR vs ASR vs ASR]: one-way ANOVA
  • ASR vs native for non-native recognition efficacy: Independent Samples T-Test
  • Context vs Non-context for ASR efficacy: Independent Samples T-Test
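The one-way ANOVA above reduces to an F statistic: the ratio of between-group to within-group variance. A minimal sketch with only the standard library (in practice a statistics package such as SciPy would also supply the p-value):

```python
from statistics import mean

def one_way_anova_f(*groups):
    """F statistic for a one-way ANOVA over lists of per-utterance
    recognition rates, one list per speaker group or ASR."""
    grand = mean(x for g in groups for x in g)
    k = len(groups)                     # number of groups
    n = sum(len(g) for g in groups)     # total observations
    # Between-group sum of squares: how far group means sit
    # from the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: spread inside each group.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F relative to the F distribution with (k − 1, n − k) degrees of freedom indicates the groups (e.g. native vs 24-week vs 0-week speakers) differ more than chance would suggest.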

Comparisons

  • ASR: Native speaker vs Non-native speaker recognition
  • ASR: Non-native (24 weeks tuition) speaker vs Non-native (no tuition) speaker recognition
  • ASR vs ASR vs ASR vs ASR vs ASR
  • ASR vs native: recognition rate of non-native (24 weeks tuition) speaker phrases
  • ASR vs native: recognition rate of non-native (no tuition) speaker phrases
  • ASR: Contextualised vs non-contextualised recognition rates

Conducting the experiment

  1. Record non-native speakers reading the corpora
  2. Test whether at least 50% of native listeners can understand the non-native (0 weeks tuition) recordings
  3. Test whether at least 50% of native listeners can understand the non-native (24 weeks tuition) recordings
  4. Record native speakers reading the corpora
  5. Process the non-native (0 weeks), non-native (24 weeks), and native recordings through the default ASRs
  6. Process the non-native (0 weeks), non-native (24 weeks), and native recordings through the contextually trained ASRs
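Steps 5 and 6 can be sketched as one batch loop; `recognise` is a hypothetical wrapper around each vendor's speech-to-text API (the real call signatures differ per vendor):

```python
def run_study(recordings, asrs, recognise):
    """Run every recording through every ASR, with and without
    contextual constraint.

    recordings -- dict mapping speaker group name to audio file paths
    asrs       -- list of ASR identifiers
    recognise  -- hypothetical callable (asr, audio, constrained) -> transcript
    """
    results = []
    for asr in asrs:
        for speaker_group, audio_files in recordings.items():
            for audio in audio_files:
                for constrained in (False, True):
                    transcript = recognise(asr, audio, constrained)
                    results.append((asr, speaker_group, constrained, transcript))
    return results
```

Collecting every (ASR, speaker group, context) cell in one pass keeps the later ANOVA and t-test comparisons straightforward to slice out.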

Analyse results

There are multiple methods for examining the results; each hypothesis maps onto a comparison:

  • All ASRs will [recognise native speaker statements at a higher rate than non-native statements]
    • Compare average ASR Recognition Measurements between [native] and [non-native] speakers
  • All ASRs will perform worse than native speakers at [recognising non-native speaker statements]
    • Compare word recognition between [ASR] and [native] speakers
  • All ASRs will [recognise 24-week Japanese learners at a rate closer to native speakers than 0-week learners]
    • Compare average ASR Recognition Measurements between [24-week non-native] and [0-week non-native] speakers
  • All ASRs will demonstrate an improved recognition rate when their recognition is contextually constrained, with a more notable increase for non-native speakers
    • Compare individual + average ASR Recognition Measurements between [trained] and [non-trained] contexts
  • Each ASR will perform differently
    • Compare ASR Recognition Measurements between ASRs
1) Investigating differences between native English and non-native English speakers in interacting with a voice user interface: a case of Google Home
4) A Comparison and Critique of Natural Language Understanding Tools