ReturnZero Overview


ReturnZero offers a speech-to-text (STT) service with strong Korean recognition performance, which it has demonstrated on a range of benchmarks. Notably, it also ranked first among lightweight large language models (sLLMs) on the LogicKor leaderboard, which measures the multi-domain reasoning ability of Korean language models.

Korean STT Performance Comparison

The following benchmark compares the Korean STT performance of several APIs by character error rate (CER, lower is better). ReturnZero records the lowest average CER among them.

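| API \ Dataset | Avg. CER (%) | Meetings by major domain | Meetings | Consultations | Low-quality telephone network | Korean lectures | KsponSpeech eval clean | KsponSpeech eval other |
|---|---|---|---|---|---|---|---|---|
| OpenAI Whisper | 11.39 | 10.49 | 10.16 | 7.51 | 17.27 | 10.89 | 12.06 | 11.34 |
| Google api v2 | 11.50 | N/A [1] | 11.62 | 8.37 | 14.11 | 11.48 | 11.82 | 11.59 |
| ETRI | 10.19 | 9.95 | 10.56 | 8.36 | 15.46 | 9.89 | 9.99 | 7.15 |
| Naver ClovaSpeech | 9.52 | 7.88 | 8.53 | 5.89 | 9.09 | 13.71 | 10.66 | 10.86 |
| ReturnZero | 6.18 | 6.78 | 7.27 | 3.56 | 4.66 | 7.76 | 6.61 | 6.64 |
| ReturnZero Whisper [2] | 6.59 | 6.84 | 8.33 | 4.1 | 4.26 | 7.11 | 7.78 | 7.73 |
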
The STT OpenAPI is offered in two forms:
  • General STT: converts an audio file to text
  • Streaming STT: performs speech recognition in real time

Notes

  • Free usage: 10 hours are provided free of charge by default
  • Retention period: to protect personal information, transcribed data is kept for only 3 days and is deleted after that
  • Integrating through a server is recommended over calling the API directly from a web or app environment

Authentication Guide

To use the API, you must first register an application. Follow the steps below.

  1. Sign up on the RTZR Developers site
  2. Enter the console
  3. Issue the SECRET (client_id, client_secret) required for API integration

Once the application is registered, you can obtain a token by authenticating with the issued secret.

API list

| Method | URL | Description |
|---|---|---|
| POST | /v1/authenticate | Request an authentication token |

Sample code for requesting an authentication token

import requests
 
resp = requests.post(
    'https://openapi.vito.ai/v1/authenticate',
    data={'client_id': '{YOUR_CLIENT_ID}',
          'client_secret': '{YOUR_CLIENT_SECRET}'}
)
resp.raise_for_status()
print(resp.json())

Response Body

On success, the following response is returned with HTTP status 200.

{
  "access_token": "{YOUR_JWT_TOKEN}",
  "expire_at": 1690377931
}

CAUTION

The token expires after 6 hours, so it must be refreshed periodically by calling /v1/authenticate.

For details on error codes, see the RTZR STT documentation.
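
Because the token expires after six hours, a long-running server may want to cache it and re-authenticate shortly before `expire_at`. The following is a minimal sketch and not part of the RTZR documentation: `TokenCache`, `fetch_token`, and the 60-second safety margin are hypothetical names and values chosen for illustration, and `expire_at` is assumed to be a Unix timestamp in seconds, as the sample response suggests.

```python
import time


class TokenCache:
    """Caches an access token and refreshes it shortly before it expires.

    `fetch_token` is a hypothetical callable that performs the POST to
    /v1/authenticate and returns the parsed JSON body, i.e. a dict with
    "access_token" and "expire_at" (assumed to be Unix seconds).
    """

    def __init__(self, fetch_token, margin_seconds=60, clock=time.time):
        self._fetch_token = fetch_token
        self._margin = margin_seconds  # assumed safety margin, not an API value
        self._clock = clock
        self._token = None
        self._expire_at = 0

    def get(self):
        # Re-authenticate when no token is cached or it is about to expire.
        if self._token is None or self._clock() >= self._expire_at - self._margin:
            body = self._fetch_token()
            self._token = body["access_token"]
            self._expire_at = body["expire_at"]
        return self._token
```

In practice `fetch_token` could wrap the `requests.post` call from the sample code above and return `resp.json()`.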

General STT

The general STT API converts audio files to text. It is implemented as an HTTP-based REST API [3] and supports a variety of audio file formats.

Supported formats

  • mp4, m4a, mp3, amr, flac, wav

API list

| Method | URL | Description |
|---|---|---|
| POST | /v1/transcribe | Request a file transcription |
| GET | /v1/transcribe/{TRANSCRIBE_ID} | Retrieve the file transcription result |

1) [POST]/v1/transcribe

Requests transcription of a stored audio file.

HTTP request

POST https://openapi.vito.ai/v1/transcribe

Request headers

Authorization: bearer {YOUR_JWT_TOKEN}
  • scheme: bearer
  • bearerFormat: JWT

Request body

content-type: multipart/form-data

| Field | Type | Required |
|---|---|---|
| config | RequestConfig | required |
| file | Binary | required |

RequestConfig

| Name | Description | Type | Required | Value | Default |
|---|---|---|---|---|---|
| model_name | Speech recognition model | string | optional | sommers, whisper | sommers |
| language | Recognition language; applies only when whisper is used | string | optional | | ko |
| use_diarization | Whether to use speaker diarization | boolean | optional | | false |
| diarization.spk_count | Number of speakers; applies only when use_diarization is true | integer | optional | integer ≥ 0 | 0 (number of speakers is estimated) |
| use_itn | Whether to apply English/number/unit conversion | boolean | optional | | true |
| use_disfluency_filter | Whether to use the disfluency (filler word) filter | boolean | optional | | true |
| use_profanity_filter | Whether to use the profanity filter | boolean | optional | | false |
| use_paragraph_splitter | Whether to split the text into paragraphs | boolean | optional | | true |
| paragraph_splitter.max | Maximum paragraph length in characters; applies only when use_paragraph_splitter is true | integer | optional | integer ≥ 1 | 50 |
| domain | Type of audio file (domain) | string | optional | GENERAL, CALL | GENERAL |
| use_word_timestamp | Whether to use per-word timestamps | boolean | optional | | false |
| keywords | Word list for keyword boosting | array | optional | | |
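
The dotted names in the table, such as `diarization.spk_count`, correspond to nested JSON objects inside the `config` field, as the option-rich sample in this section shows. As an illustration only, the hypothetical helper `build_config` below expands dotted keys into that nested form. It is a convenience sketch and not part of the RTZR API.

```python
import json


def build_config(options):
    """Expand dotted keys ("diarization.spk_count") into nested dicts.

    `options` maps parameter names from the RequestConfig table to values.
    This helper is hypothetical; the API itself only sees the final JSON.
    """
    config = {}
    for name, value in options.items():
        parts = name.split(".")
        node = config
        # Walk (or create) the intermediate objects, then set the leaf value.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return config


config = build_config({
    "use_diarization": True,
    "diarization.spk_count": 2,
    "use_paragraph_splitter": True,
    "paragraph_splitter.max": 50,
})
# Serialized string to send as the multipart `config` field.
config_json = json.dumps(config)
```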

CAUTION

The general STT API has the following constraints.

  1. Concurrency limit for the POST API: 10. This is the number of requests that have been submitted through the POST API but have not yet finished processing; completion can be checked through the GET API.
  2. Maximum file size: 2 GB; maximum audio duration: 4 hours.
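
A client can check a file locally against these limits before uploading it. The sketch below is illustrative and rests on several assumptions: `check_audio_file` is a hypothetical helper, "2 GB" is read as 2 × 1024³ bytes (the source does not say which unit convention applies), the supported formats are matched by file extension only, and the 4-hour duration limit is not checked because that would require decoding the audio.

```python
import os

SUPPORTED_EXTENSIONS = {"mp4", "m4a", "mp3", "amr", "flac", "wav"}
MAX_FILE_BYTES = 2 * 1024 ** 3  # assumption: "2 GB" read as 2 GiB


def check_audio_file(path, size_bytes=None):
    """Return a list of problems found; an empty list means the file looks OK.

    `size_bytes` may be passed explicitly (e.g. in tests); otherwise the
    size is read from disk with os.path.getsize.
    """
    problems = []
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported format: %r" % extension)
    if size_bytes is None:
        size_bytes = os.path.getsize(path)
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file is larger than 2 GB")
    return problems
```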

Sample code

import json
import requests
 
config = {}
# Open the audio file in a context manager so it is closed after the upload.
with open('sample.wav', 'rb') as audio_file:
    resp = requests.post(
        'https://openapi.vito.ai/v1/transcribe',
        headers={'Authorization': 'bearer '+'{YOUR_JWT_TOKEN}'},
        data={'config': json.dumps(config)},
        files={'file': audio_file}
    )
resp.raise_for_status()
print(resp.json())

Example with various options set:

import json
import requests
 
config = {
  "use_diarization": True,
  "diarization": {
    "spk_count": 2
  },
  "use_itn": False,
  "use_disfluency_filter": False,
  "use_profanity_filter": False,
  "use_paragraph_splitter": True,
  "paragraph_splitter": {
    "max": 50
  }
}
# Open the audio file in a context manager so it is closed after the upload.
with open('sample.wav', 'rb') as audio_file:
    resp = requests.post(
        'https://openapi.vito.ai/v1/transcribe',
        headers={'Authorization': 'bearer '+'{YOUR_JWT_TOKEN}'},
        data={'config': json.dumps(config)},
        files={'file': audio_file}
    )
resp.raise_for_status()
print(resp.json())

Response Body

On success, the following response is returned with HTTP status 200.

{
  "id": "{TRANSCRIBE_ID}"
}

If the request fails, see the RTZR STT documentation for the error codes.

2) [GET]/v1/transcribe/{TRANSCRIBE_ID}

Retrieves the transcription result using the TRANSCRIBE_ID issued by the transcription request.

HTTP request

GET https://openapi.vito.ai/v1/transcribe/{TRANSCRIBE_ID}

Request headers

Authorization: bearer {YOUR_JWT_TOKEN}
  • scheme: bearer
  • bearerFormat: JWT

Sample code

import requests
 
resp = requests.get(
    'https://openapi.vito.ai/v1/transcribe/'+'{TRANSCRIBE_ID}',
    headers={'Authorization': 'bearer '+'{YOUR_JWT_TOKEN}'},
)
resp.raise_for_status()
print(resp.json())

Response Body

On success, the following response is returned with HTTP status 200.

| Name | Description | Type | Value |
|---|---|---|---|
| id | Transcription ID | string | |
| status | Transcription status | string | transcribing, completed, failed |
| results.utterances | Utterance information | array | |
| results.utterances.start_at | Utterance start time (ms) | integer | |
| results.utterances.duration | Utterance duration (ms) | integer | |
| results.utterances.msg | Utterance text | string | |
| results.utterances.spk | Speaker/channel ID | integer | |
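
As an illustration of how these fields fit together, the following sketch turns a completed response into a speaker-labelled transcript. `format_transcript` and the `[mm:ss.mmm] spk N:` layout are hypothetical choices, not something the API prescribes, and the sample data uses made-up English text.

```python
def format_transcript(response):
    """Render each utterance as '[mm:ss.mmm] spk N: text'.

    Expects the parsed JSON of a completed transcription, where
    `start_at` is given in milliseconds.
    """
    lines = []
    for utt in response["results"]["utterances"]:
        minutes, rest_ms = divmod(utt["start_at"], 60_000)
        seconds, millis = divmod(rest_ms, 1000)
        lines.append("[%02d:%02d.%03d] spk %d: %s"
                     % (minutes, seconds, millis, utt["spk"], utt["msg"]))
    return "\n".join(lines)


sample = {
    "id": "example-id",
    "status": "completed",
    "results": {"utterances": [
        {"start_at": 4737, "duration": 2360, "msg": "Hello.", "spk": 0},
        {"start_at": 68197, "duration": 3280, "msg": "Hi, nice to meet you.", "spk": 1},
    ]},
}
print(format_transcript(sample))
# [00:04.737] spk 0: Hello.
# [01:08.197] spk 1: Hi, nice to meet you.
```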

TIP

The general STT API uses polling so that it can also support long audio files. While status is transcribing, query the result periodically until it reaches a final state (completed or failed). The recommended polling interval is 5 seconds.

status: transcribing

{
  "id": "{TRANSCRIBE_ID}",
  "status": "transcribing"
}

status: completed

{
  "id": "{TRANSCRIBE_ID}",
  "status": "completed",
  "results": {
    "utterances": [
      {
        "start_at": 4737,
        "duration": 2360,
        "msg": "안녕하세요.",
        "spk": 0
      },
      {
        "start_at": 8197,
        "duration": 3280,
        "msg": "네, 안녕하세요? 반갑습니다.",
        "spk": 1
      }
    ]
  }
}
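
The polling flow described in the tip can be sketched as follows. This is a minimal illustration under assumptions: `wait_for_result`, `fetch_status`, and the `max_attempts` cap are hypothetical, and `fetch_status` stands for whatever function performs the GET request and returns the parsed JSON. Only the 5-second interval comes from the text above.

```python
import time


def wait_for_result(fetch_status, interval_seconds=5, max_attempts=720,
                    sleep=time.sleep):
    """Poll until the transcription reaches a final state.

    `fetch_status` is a hypothetical zero-argument callable returning the
    parsed JSON of GET /v1/transcribe/{TRANSCRIBE_ID}.
    """
    for _ in range(max_attempts):
        body = fetch_status()
        if body["status"] in ("completed", "failed"):
            return body
        sleep(interval_seconds)  # still "transcribing": wait, then retry
    raise TimeoutError("transcription did not finish in time")
```

The cap on attempts is there so that a stuck job cannot block the caller forever; the default of 720 attempts (about one hour at 5 seconds) is an arbitrary choice.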

If the request fails, see the RTZR STT documentation for the error codes.

Pricing

Rates

  • Rates per product are as follows:
    • General STT: 1,000 KRW per hour
    • Streaming STT: 1,000 KRW per hour
  • Charges are billed monthly based on usage; VAT is not included
  • Usage is counted in 1-second units and then converted to hours (e.g., 900 seconds = 15 minutes = 0.25 hours)
    • For general STT, usage is counted by the length of the audio file
      • When the multi-channel option is used, the sum of the audio lengths of all channels is counted
    • For streaming STT, usage is counted by the length of the real-time stream

See the following for the pricing of other STT services.

Plans

  • Every user starts on the basic plan immediately upon sign-up

Free usage

  • Free usage: every user receives 10 free hours immediately upon sign-up
    • The 10 hours apply to the combined API usage time, with no distinction between general STT and streaming STT
  • No charge is billed for free usage
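
The billing rules above reduce to simple arithmetic. The sketch below is a rough estimate under assumptions: `estimate_charge_krw` is a hypothetical helper, the result excludes VAT, rounding is left to the caller because the source does not specify it, and the remaining free quota is passed in explicitly since how it is tracked is not described here.

```python
from fractions import Fraction

RATE_KRW_PER_HOUR = 1000  # same rate for general and streaming STT
FREE_SECONDS = 10 * 3600  # 10 free hours granted at sign-up


def estimate_charge_krw(used_seconds, free_seconds_left=0):
    """Return the pre-VAT charge in KRW for `used_seconds` of audio.

    Usage is counted in seconds and converted to hours; any remaining
    free quota is subtracted first.
    """
    billable_seconds = max(0, used_seconds - free_seconds_left)
    return Fraction(billable_seconds * RATE_KRW_PER_HOUR, 3600)


print(float(estimate_charge_krw(900)))                      # 250.0 (0.25 hours)
print(float(estimate_charge_krw(12 * 3600, FREE_SECONDS)))  # 2000.0 (2 billable hours)
```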

Footnotes

  1. Omitted because of Google's file size limit for speech recognition

  2. A model obtained by fine-tuning the open-source Whisper model released by OpenAI on ReturnZero's data

  3. Representational State Transfer API: an approach in which resources are identified by name and the state of those resources is exchanged