Speech-to-Text API Comparison: OpenAI Whisper, Google Speech-to-Text, Amazon Transcribe

Speech recognition APIs are becoming an essential technology across many industries. This post compares OpenAI Whisper, Google Speech-to-Text, and Amazon Transcribe across the main dimensions: performance, features, language support, pricing, integration, and security.

1. Accuracy and Speed

Word Error Rate(WER)

  • OpenAI Whisper-v2:

| Size | Parameters | English-only model | Multilingual model |
|---|---|---|---|
| tiny | 39 M | ✓ | ✓ |
| base | 74 M | ✓ | ✓ |
| small | 244 M | ✓ | ✓ |
| medium | 769 M | ✓ | ✓ |
| turbo | 798 M | - | ✓ |
| large | 1550 M | - | ✓ |
| large-v2 | 1550 M | - | ✓ |
| large-v3 | 1550 M | - | ✓ |

WER: 8.06%; processing speed: 10-30 minutes per hour of audio

Advantage: the model size is selectable (39M to 1.55B parameters), so accuracy can be traded against speed.

  • Google Speech-to-Text: WER: 16.51%-20.63%; processing speed: 20-30 minutes per hour of audio
  • Amazon Transcribe: WER: 18.42%-22%; processing speed: similar to Google.

Sources: Hugging Face, Clari, Statista, Gladia

English test

Conclusion: Whisper leads on both accuracy and processing speed. However, hallucinations can occur.
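The WER figures above follow the usual definition: the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognizer's output, divided by the number of reference words. A minimal Python sketch of that calculation is below. The function names `edit_distance` and `wer` and the sample sentences are illustrative assumptions, not taken from any of the cited benchmarks, and real evaluations usually normalize case and punctuation first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Illustrative sentences: one substitution (sat -> sit) and one deletion (the).
print(f"WER = {wer('the cat sat on the mat', 'the cat sit on mat'):.3f}")
```

With two errors against six reference words, the sample pair scores a WER of about 0.333, or 33.3%.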

2. Features

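Comparison table

| Feature | OpenAI | Google | Amazon |
|---|---|---|---|
| Real-time transcription (live) | ✓ | ✓ | ✓ |
| Automatic language detection | ✓ | ✓ | ✓ |
| Word-level timestamps | ✓ | ✓ | ✓ |
| Speaker diarization | - | ✓ | ✓ |
| Profanity filter | - | ✓ | ✓ |
| PII redaction (removal of sensitive information) | - | - | ✓ |
| Sentiment analysis | - | - | - |
| Custom vocabulary | - | - | ✓ |
| Speech adaptation | ✓ | ✓ | ✓ |
| Multichannel recognition | - | ✓ | ✓ |
| Noise robustness | ✓ | ✓ | ✓ |
| Domain-specific models | - | ✓ | ✓ |
| Automatic punctuation | ✓ | ✓ | ✓ |
| Toxic audio content detection | - | - | ✓ |
| Word-level confidence | - | ✓ | - |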

Conclusion: Amazon offers the most complete feature set, with specialized features such as PII redaction and medical/call-center analytics.

3. Language Support

  • OpenAI Whisper: supports 98 languages with high accuracy. However, because English makes up a large share of the data, performance may degrade in other languages. → Custom models can optimize for specific languages and dialects.
  • Google Speech-to-Text: supports more than 125 languages and dialects. Designed to handle varied accents and noise. → The model adaptation feature helps it recognize specific words or phrases.
  • Amazon Transcribe: supports more than 100 languages, with automatic language detection and custom vocabulary.

Conclusion: Google leads on the number of languages, but Whisper is strong on accuracy and in real-world use cases.

4. Cost

Service pricing

  • OpenAI Whisper: $0.006/min
  • Google Speech-to-Text: $0.016/min
  • Amazon Transcribe: $0.0102-$0.024/min
  • RTZR: see reference

Conclusion: Whisper is the cheapest and offers good quality for the cost.
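To make the per-minute prices concrete, the sketch below multiplies them by a monthly audio volume. The prices are the ones listed above; the 1,000-hour volume and the names `PRICE_PER_MINUTE_USD` and `monthly_cost` are assumptions for illustration, and free tiers or volume discounts are ignored.

```python
# (low, high) price per audio minute in USD, as listed in this section.
PRICE_PER_MINUTE_USD = {
    "OpenAI Whisper": (0.006, 0.006),
    "Google Speech-to-Text": (0.016, 0.016),
    "Amazon Transcribe": (0.0102, 0.024),
}


def monthly_cost(minutes):
    """Return {service: (low_usd, high_usd)} for the given number of audio minutes."""
    return {
        name: (round(low * minutes, 2), round(high * minutes, 2))
        for name, (low, high) in PRICE_PER_MINUTE_USD.items()
    }


# Hypothetical workload: 1,000 hours of audio per month.
for name, (low, high) in monthly_cost(1000 * 60).items():
    if low == high:
        print(f"{name}: ${low:,.2f}")
    else:
        print(f"{name}: ${low:,.2f} to ${high:,.2f}")
```

For 1,000 hours (60,000 minutes) this works out to roughly $360 for Whisper, $960 for Google, and $612 to $1,440 for Amazon.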

5. Integration

  • OpenAI Whisper API: supports many languages, including Python and JavaScript; the simple code structure (6 lines or fewer) allows quick integration. → Intuitive documentation.
  • Google Speech-to-Text: strong integration with Google Cloud services, but the initial setup is complex. (see reference)
  • Amazon Transcribe: SDKs for many languages and well-organized documentation, but the initial sign-up process is complex.

Conclusion: Whisper offers the easiest initial setup and a simple onboarding[1] experience.
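As a rough illustration of the "6 lines or fewer" claim, here is a minimal sketch using the official `openai` Python package (v1 or later). It assumes the package is installed, `OPENAI_API_KEY` is set in the environment, and `whisper-1` is the hosted model name. The helper `transcribe_file` and its optional `client` parameter are hypothetical names, added so the function can be exercised without a network call.

```python
def transcribe_file(path, client=None, model="whisper-1"):
    """Send one audio file to the transcription endpoint and return the text."""
    if client is None:
        # Imported lazily so the helper can be tested with a stand-in client.
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text
```

A call such as `print(transcribe_file("meeting.mp3"))` would then print the transcript; `meeting.mp3` is a placeholder file name.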

6. Privacy and Security

  • Amazon Transcribe: uses TLS for data in transit, with additional encryption via KMS keys.
  • Google Speech-to-Text: processes data in memory without storing it. Complies with regulations such as GDPR and HIPAA.
  • OpenAI Whisper: with the OSS model, all data is processed locally. With the API, data is stored for 30 days; a Zero Data Retention option is available.

Conclusion: Google and Amazon are the strongest on security.
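For the local-processing case mentioned above, a minimal sketch with the open-source `openai-whisper` package might look like the following. It assumes the package and `ffmpeg` are installed. `transcribe_locally`, `MODEL_NAMES`, and the `load_model` parameter are hypothetical names for illustration; the model names are the sizes listed in section 1.

```python
# Model sizes listed in the Whisper table in section 1.
MODEL_NAMES = ("tiny", "base", "small", "medium", "turbo", "large", "large-v2", "large-v3")


def transcribe_locally(path, model_name="base", load_model=None):
    """Transcribe an audio file on the local machine; no audio leaves the host."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown model size: {model_name}")
    if load_model is None:
        # Imported lazily; install with `pip install openai-whisper`.
        import whisper
        load_model = whisper.load_model
    model = load_model(model_name)   # downloads the weights on first use
    result = model.transcribe(path)  # returns a dict that includes "text"
    return result["text"]
```

Smaller sizes such as `tiny` or `base` run faster at some cost in accuracy, which is the trade-off described in section 1.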

Final conclusion: which API is the right fit?

| Criterion | OpenAI Whisper | Google Speech-to-Text | Amazon Transcribe |
|---|---|---|---|
| Accuracy/speed | Best | Average | Average |
| Features | Limited | Extensive | Most extensive |
| Language support | High accuracy | Most languages | Medium |
| Cost | Cheapest | Medium | Expensive, but with varied options |
| Integration/ease of use | Easy | Medium | Medium |
| Security | Local processing supported | Strong security | Strong security |

Whisper is strong on speed, accuracy, and cost, and is a developer-friendly solution.

Google and Amazon suit enterprise environments that need additional features and security.

Other notes

I had thought Whisper was a good fit as the STT model for the current project, but "ReturnZero" (RTZR) has reached a world-class level of Korean comprehension.

https://blog.rtzr.ai/korean-speechai-benchmark/

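| API \ Dataset | Avg. CER (%) | Meetings by major domain | Meetings | Consultations | Low-quality telephone network | Korean lectures | KsponSpeech eval clean | KsponSpeech eval other |
|---|---|---|---|---|---|---|---|---|
| OpenAI Whisper | 11.39 | 10.49 | 10.16 | 7.51 | 17.27 | 10.89 | 12.06 | 11.34 |
| Google (API v2) | 11.50 | N/A[2] | 11.62 | 8.37 | 14.11 | 11.48 | 11.82 | 11.59 |
| ETRI | 10.19 | 9.95 | 10.56 | 8.36 | 15.46 | 9.89 | 9.99 | 7.15 |
| Naver ClovaSpeech | 9.52 | 7.88 | 8.53 | 5.89 | 9.09 | 13.71 | 10.66 | 10.86 |
| ReturnZero | 6.18 | 6.78 | 7.27 | 3.56 | 4.66 | 7.76 | 6.61 | 6.64 |
| ReturnZero Whisper[3] | 6.59 | 6.84 | 8.33 | 4.1 | 4.26 | 7.11 | 7.78 | 7.73 |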

CER (Character Error Rate) is computed in almost the same way as the WER described earlier. The difference is that WER uses words as tokens, while CER uses characters. Korean speech recognition is said to be more appropriately evaluated with CER than with WER.

Why use CER?

Korean is an agglutinative language that uses particles; compared with other languages, its morpheme structure is complex and the boundaries between words are ambiguous. These structural properties make word-level evaluation difficult. CER, which measures errors at the character level, is therefore regarded as a more accurate evaluation method for Korean speech recognition.
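The difference between the two metrics can be seen on a single Korean sentence. The sketch below computes both with the same edit-distance routine, once over words and once over characters. The sentences and the function names are illustrative assumptions, and stripping whitespace before computing CER is one common convention, not necessarily the one used in the RTZR benchmark.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Error rate with whitespace-separated words as tokens."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference, hypothesis):
    """Error rate with characters as tokens (assumption: whitespace is ignored)."""
    def chars(text):
        # NFC keeps each Hangul syllable as a single code point.
        return list("".join(unicodedata.normalize("NFC", text).split()))
    ref = chars(reference)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, chars(hypothesis)) / len(ref)


reference = "나는 학교에 갔다"   # "I went to school"
hypothesis = "나는 학교를 갔다"  # only the particle differs (에 vs 를)
print(f"WER = {wer(reference, hypothesis):.3f}")  # 1 of 3 words is wrong
print(f"CER = {cer(reference, hypothesis):.3f}")  # 1 of 7 characters is wrong
```

A single wrong particle makes the whole word count as an error, so WER is about 0.333 while CER is about 0.143 for the same mistake.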

Reference

Footnotes

  1. Onboarding: the process of teaching new employees the knowledge and skills they need to adapt well to an organization.

  2. Omitted because of Google's speech recognition file size limit.

  3. A model made by fine-tuning the open-source Whisper model released by OpenAI on ReturnZero's data.