Weekly Diary - Week 4

Weekly Diary, Week 4 (March 25 ~ March 31)

Activity log

Team activities

March 25 (Mon) 21:00 ~ 22:15 (regular meeting)

→ 1 hour 15 minutes in total

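Individual activities

유재휘
- 2024.03.27 : Tested ICA sample code (did not work well; to be retried) - about 2 hours
- 2024.03.28 : Tested VAD sample code - about 2 hours 30 minutes
- 2024.03.29 : Tested a real-time VAD algorithm (checked whether RabbitMQ can be used) and a record-then-run-VAD algorithm - about 3 hours
- 2024.03.30 : Organized the VAD algorithm code and results - about 3 hours
→ about 10 hours 30 minutes in total
전준표
- March 27 (Wed) 16:00 ~ 18:00 (set up the development environment and installed libraries)
- March 29 (Fri) 21:00 ~ 01:30 the next day (implemented the audio-to-mel-spectrogram algorithm)
→ 6 hours 30 minutes in total
이민석
- March 27 (Wed) 22:00 ~ 23:30 (installed the Raspberry Pi OS and did the initial setup)
- March 28 (Thu) 21:00 ~ 22:00 (installed and set up VS Code and Python on the Raspberry Pi)
- March 30 (Sat) 14:00 ~ 17:00 (studied Blender and did some simple frame work)
→ 5 hours 30 minutes in total
조민수
- March 26 (Tue) 10:00 ~ 11:00 (revised the UI)
- March 29 (Fri) 13:00 ~ 14:30 (set up the UI development environment and installed libraries)
- March 30 (Sat) 7:00 ~ 11:30 (coded the previously designed UI)
→ 7 hours in total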

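Progress

1) Implementing the dataset preprocessing algorithm

To build the dataset for training the Siamese network model, we first implemented, before the model itself, a preprocessing algorithm that converts speech data into mel-spectrograms.

Development language: Python

Installing the required libraries and checking versions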

pip install librosa
pip install numpy
pip install matplotlib

import sys
import numpy as np
import matplotlib
import librosa

print("python : ", sys.version)
print("librosa : ", librosa.__version__)
print("numpy : ", np.__version__)
print("matplotlib : ", matplotlib.__version__)

python : 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
librosa : 0.10.1
numpy : 1.25.2
matplotlib : 3.7.1

Loading the speech data

import librosa
import librosa.display
import matplotlib.pyplot as plt

# Path to the audio file
audio_file = "/content/KsponSpeech_000001.wav"

# Read the audio data (sr=None keeps the file's native sampling rate)
y, sr = librosa.load(audio_file, sr=None, mono=True)

plt.figure(figsize=(12,6))
librosa.display.waveshow(y, sr=sr)

plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()

2-1. Converting a PCM speech file to a WAV file

import wave

def pcm_to_wav(pcm_file, wav_file, channels, sample_width, frame_rate):
    # Open the raw PCM file
    with open(pcm_file, 'rb') as pcm:
        # Open the output WAV file
        with wave.open(wav_file, 'wb') as wav:
            # Set the WAV header fields
            wav.setnchannels(channels)
            wav.setsampwidth(sample_width)
            wav.setframerate(frame_rate)

            # Write the audio data
            data = pcm.read()
            wav.writeframes(data)

# PCM file format: 16 kHz, 16-bit, little-endian linear PCM
pcm_file = "KsponSpeech_000001"
wav_file = pcm_file+".wav"
channels = 1  # number of channels
sample_width = 2  # sample width in bytes
frame_rate = 16000  # sampling rate in Hz

# Convert the PCM file to a WAV file
pcm_to_wav(pcm_file+".pcm", wav_file, channels, sample_width, frame_rate)
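A quick way to confirm that the conversion wrote the intended header is to read the WAV file back and compare its parameters. The sketch below repeats `pcm_to_wav` so that it runs on its own and uses a synthetic tone in place of a real KsponSpeech file; the helper `wav_info` and the temporary file names are illustrative assumptions, not part of the project code.

```python
import math
import os
import struct
import tempfile
import wave

def pcm_to_wav(pcm_file, wav_file, channels, sample_width, frame_rate):
    """Wrap raw PCM bytes in a WAV container by writing a header in front of them."""
    with open(pcm_file, "rb") as pcm, wave.open(wav_file, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(frame_rate)
        wav.writeframes(pcm.read())

def wav_info(wav_file):
    """Return (channels, sample_width, frame_rate, n_frames) read from the WAV header."""
    with wave.open(wav_file, "rb") as wav:
        return (wav.getnchannels(), wav.getsampwidth(),
                wav.getframerate(), wav.getnframes())

# Synthetic stand-in for a KsponSpeech file (hypothetical test data):
# 0.5 s of a 440 Hz tone, 16 kHz, 16-bit little-endian, mono
n_samples = 8000
samples = [int(10000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(n_samples)]
raw = struct.pack("<%dh" % n_samples, *samples)

tmp_dir = tempfile.mkdtemp()
pcm_path = os.path.join(tmp_dir, "example.pcm")
wav_path = os.path.join(tmp_dir, "example.wav")
with open(pcm_path, "wb") as f:
    f.write(raw)

pcm_to_wav(pcm_path, wav_path, channels=1, sample_width=2, frame_rate=16000)
info = wav_info(wav_path)
print(info)
```

Because the raw file carries no metadata, wrong values for `channels`, `sample_width` or `frame_rate` would still produce a playable file, only with the wrong speed or with noise, so a check like this is worth running once per dataset.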

Converting to a mel-spectrogram

# Compute the mel-spectrogram
mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr)

# Convert the mel-spectrogram to decibels
mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)

plt.figure(figsize=(10, 4))
librosa.display.specshow(mel_spectrogram_db, sr=sr, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel-spectrogram')
plt.show()
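Since `melspectrogram` is called with its defaults here, the size of the resulting array can be worked out by hand. The sketch below assumes librosa's default parameters (`n_mels=128`, `hop_length=512`, centered frames); the function name `mel_spectrogram_shape` and the 3-second clip length are only illustrative.

```python
def mel_spectrogram_shape(n_samples, n_mels=128, hop_length=512):
    """Shape (n_mels, n_frames) of a mel-spectrogram computed with centered frames."""
    # With centered frames there is one frame per hop, plus one
    n_frames = 1 + n_samples // hop_length
    return (n_mels, n_frames)

sr = 16000                                # native rate of the KsponSpeech audio
shape = mel_spectrogram_shape(3 * sr)     # a hypothetical 3-second clip
frame_period_ms = 1000 * 512 / sr         # time between consecutive columns
print(shape, frame_period_ms)
```

Clips of different lengths therefore give arrays with different numbers of columns, which matters if the model later expects inputs of one fixed size.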

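Full code

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
import wave

def pcm_to_wav(pcm_file, wav_file, channels, sample_width, frame_rate):
    # Open the raw PCM file
    with open(pcm_file, 'rb') as pcm:
        # Open the output WAV file
        with wave.open(wav_file, 'wb') as wav:
            # Set the WAV header fields
            wav.setnchannels(channels)
            wav.setsampwidth(sample_width)
            wav.setframerate(frame_rate)

            # Write the audio data
            data = pcm.read()
            wav.writeframes(data)

# Paths to the raw PCM file and the WAV file created from it
pcm_file = "/content/KsponSpeech_000001.pcm"
audio_file = "/content/KsponSpeech_000001.wav"

# Convert the header-less PCM file (16 kHz, 16-bit, mono) to WAV
pcm_to_wav(pcm_file, audio_file, channels=1, sample_width=2, frame_rate=16000)

# Read the audio data
# (librosa resamples to 22,050 Hz by default; pass sr=None to keep the native rate)
y, sr = librosa.load(audio_file)

# Compute the mel-spectrogram
mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr)

# Convert the mel-spectrogram to decibels
mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)

# Plot the mel-spectrogram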
plt.figure(figsize=(10, 4))
librosa.display.specshow(mel_spectrogram_db, sr=sr, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel-spectrogram')
plt.show()

Error handling

The following error occurred in step 2, while reading the speech data


LibsndfileError                           Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/librosa/core/audio.py in load(path, sr, mono, offset, duration, dtype, res_type)
    174         try:
--> 175             y, sr_native = __soundfile_load(path, offset, duration, dtype)
    176 

7 frames
LibsndfileError: Error opening '/content/KsponSpeech_000001.pcm': Format not recognised.

During handling of the above exception, another exception occurred:

NoBackendError                            Traceback (most recent call last)
<decorator-gen-119> in __audioread_load(path, offset, duration, dtype)

/usr/local/lib/python3.10/dist-packages/audioread/__init__.py in audio_open(path, backends)
    130 
    131     # All backends failed!
--> 132     raise NoBackendError()

NoBackendError: 

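Running the same code on a WAV file worked correctly.

→ The PCM speech data from AI-Hub is raw data without a header, which is why loading the .pcm file failed

→ Wrote a function that adds a header to the raw PCM data, converting the .pcm file into a WAV file (2-1)

→ Loading the converted WAV file resolved the problem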
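The "Format not recognised" error comes down to the missing signature: a WAV file starts with the bytes `RIFF`, a size field, and `WAVE`, whereas a raw PCM file starts directly with samples. As a rough sketch, the check and a direct decode of the raw samples could look as follows; both helper names are hypothetical, and the 16-bit little-endian mono layout is taken from the dataset description in 2-1.

```python
import array
import struct
import sys

def has_riff_wave_header(data):
    """True if the bytes start with the 'RIFF....WAVE' signature of a WAV file."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"

def decode_raw_pcm16(data):
    """Decode header-less 16-bit little-endian mono PCM into floats in [-1.0, 1.0)."""
    samples = array.array("h")
    samples.frombytes(data[: len(data) // 2 * 2])  # ignore a trailing odd byte
    if sys.byteorder == "big":
        samples.byteswap()  # the data is little-endian regardless of the host
    return [s / 32768.0 for s in samples]

# Four hand-made samples standing in for the contents of a .pcm file
raw = struct.pack("<4h", 0, 16384, -16384, 32767)
print(has_riff_wave_header(raw))
print(decode_raw_pcm16(raw)[:3])
```

Decoding directly avoids writing an intermediate WAV file, but the sampling rate then has to be supplied by hand wherever it is needed, so the conversion in 2-1 remains the simpler route when the files are going to be loaded with `librosa.load`.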

2) UI제작 수정 사항

<?xml version="1.0" encoding="UTF-8"?>
<ui version="4.0">
 <class>MainWindow</class>
 <widget class="QMainWindow" name="MainWindow">
  <property name="geometry">
   <rect>
    <x>0</x>
    <y>0</y>
    <width>800</width>
    <height>600</height>
   </rect>
  </property>
  <property name="windowTitle">
   <string>MainWindow</string>
  </property>
  <widget class="QWidget" name="centralwidget">
   <widget class="QPushButton" name="pushButton">
    <property name="geometry">
     <rect>
      <x>330</x>
      <y>380</y>
      <width>75</width>
      <height>23</height>
     </rect>
    </property>
    <property name="autoFillBackground">
     <bool>true</bool>
    </property>
    <property name="text">
     <string>Start learning</string>
    </property>
   </widget>
   <widget class="QTextBrowser" name="textBrowser">
    <property name="geometry">
     <rect>
      <x>240</x>
      <y>90</y>
      <width>256</width>
      <height>31</height>
     </rect>
    </property>
    <property name="mouseTracking">
     <bool>true</bool>
    </property>
    <property name="html">
     <string>&lt;!DOCTYPE HTML PUBLIC &quot;-//W3C//DTD HTML 4.0//EN&quot; &quot;http://www.w3.org/TR/REC-html40/strict.dtd&quot;&gt;
&lt;html&gt;&lt;head&gt;&lt;meta name=&quot;qrichtext&quot; content=&quot;1&quot; /&gt;&lt;style type=&quot;text/css&quot;&gt;
p, li { white-space: pre-wrap; }
&lt;/style&gt;&lt;/head&gt;&lt;body style=&quot; font-family:'Gulim'; font-size:9pt; font-weight:400; font-style:normal;&quot;&gt;
&lt;p align=&quot;center&quot; style=&quot; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;&quot;&gt;&lt;span style=&quot; font-size:12pt; font-weight:600;&quot;&gt;Choose what you want to practice&lt;/span&gt;&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;</string>
    </property>
   </widget>
   <widget class="QListView" name="listView">
    <property name="geometry">
     <rect>
      <x>240</x>
      <y>150</y>
      <width>256</width>
      <height>192</height>
     </rect>
    </property>
    <property name="autoFillBackground">
     <bool>true</bool>
    </property>
   </widget>
   <widget class="QTextBrowser" name="textBrowser_2">
    <property name="geometry">
     <rect>
      <x>270</x>
      <y>180</y>
      <width>201</width>
      <height>31</height>
     </rect>
    </property>
    <property name="html">
     <string>&lt;!DOCTYPE HTML PUBLIC &quot;-//W3C//DTD HTML 4.0//EN&quot; &quot;http://www.w3.org/TR/REC-html40/strict.dtd&quot;&gt;
&lt;html&gt;&lt;head&gt;&lt;meta name=&quot;qrichtext&quot; content=&quot;1&quot; /&gt;&lt;style type=&quot;text/css&quot;&gt;
p, li { white-space: pre-wrap; }
&lt;/style&gt;&lt;/head&gt;&lt;body style=&quot; font-family:'Gulim'; font-size:9pt; font-weight:400; font-style:normal;&quot;&gt;
&lt;p align=&quot;center&quot; style=&quot; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;&quot;&gt;Sentence&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;</string>
    </property>
   </widget>
   <widget class="QTextBrowser" name="textBrowser_3">
    <property name="geometry">
     <rect>
      <x>270</x>
      <y>240</y>
      <width>201</width>
      <height>31</height>
     </rect>
    </property>
    <property name="html">
     <string>&lt;!DOCTYPE HTML PUBLIC &quot;-//W3C//DTD HTML 4.0//EN&quot; &quot;http://www.w3.org/TR/REC-html40/strict.dtd&quot;&gt;
&lt;html&gt;&lt;head&gt;&lt;meta name=&quot;qrichtext&quot; content=&quot;1&quot; /&gt;&lt;style type=&quot;text/css&quot;&gt;
p, li { white-space: pre-wrap; }
&lt;/style&gt;&lt;/head&gt;&lt;body style=&quot; font-family:'Gulim'; font-size:9pt; font-weight:400; font-style:normal;&quot;&gt;
&lt;p align=&quot;center&quot; style=&quot; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;&quot;&gt;Word&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;</string>
    </property>
   </widget>
  </widget>
  <widget class="QMenuBar" name="menubar">
   <property name="geometry">
    <rect>
     <x>0</x>
     <y>0</y>
     <width>800</width>
     <height>21</height>
    </rect>
   </property>
   <widget class="QMenu" name="menu">
    <property name="title">
     <string>Korean Pronunciation Evaluation Program</string>
    </property>
   </widget>
   <addaction name="menu"/>
  </widget>
  <widget class="QStatusBar" name="statusbar"/>
  <widget class="QToolBar" name="toolBar">
   <property name="windowTitle">
    <string>toolBar</string>
   </property>
   <attribute name="toolBarArea">
    <enum>TopToolBarArea</enum>
   </attribute>
   <attribute name="toolBarBreak">
    <bool>false</bool>
   </attribute>
  </widget>
 </widget>
 <resources/>
 <connections/>
</ui>

The .ui code above implements the UI start screen (example screenshot)
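The layout in a .ui file can be checked without launching Qt, because the file is plain XML. The sketch below lists each widget with its geometry using only the standard library; `UI_XML` is a trimmed copy of the file above and `list_widgets` is a hypothetical helper, not part of the project.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the .ui file above, kept short for illustration
UI_XML = """<ui version="4.0">
 <class>MainWindow</class>
 <widget class="QMainWindow" name="MainWindow">
  <widget class="QWidget" name="centralwidget">
   <widget class="QPushButton" name="pushButton">
    <property name="geometry">
     <rect><x>330</x><y>380</y><width>75</width><height>23</height></rect>
    </property>
   </widget>
   <widget class="QListView" name="listView">
    <property name="geometry">
     <rect><x>240</x><y>150</y><width>256</width><height>192</height></rect>
    </property>
   </widget>
  </widget>
 </widget>
</ui>
"""

def list_widgets(ui_xml):
    """Return (class, name, geometry) per <widget>; geometry is (x, y, w, h) or None."""
    root = ET.fromstring(ui_xml)
    result = []
    for widget in root.iter("widget"):
        # Only look at the widget's own geometry property, not its children's
        rect = widget.find("property[@name='geometry']/rect")
        geometry = None
        if rect is not None:
            geometry = tuple(int(rect.findtext(tag))
                             for tag in ("x", "y", "width", "height"))
        result.append((widget.get("class"), widget.get("name"), geometry))
    return result

widgets = list_widgets(UI_XML)
for entry in widgets:
    print(entry)
```

At run time the same file would normally be loaded through the Qt bindings, for example with `uic.loadUi` if the project uses PyQt5; which binding the team uses is an assumption here, since the diary does not say.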

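3) Implementing the VAD (silence removal) algorithm

record.py
- Records the voice before VAD is applied
- After launching, press Enter to start recording; if no sound is detected for more than 3 seconds, the recording ends automatically. Once the terminal output stops, the recording has finished; press Ctrl+C to terminate the script
  
  → The plan was to end the recording by pressing Enter again, but that did not work well…
- When the recording ends, it is saved automatically as a .wav file in the same directory
- When recording starts, any other .wav files in the same directory are all deleted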
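One possible reason the "press Enter again to stop" idea is awkward is that `input()` blocks, so it cannot sit inside the same loop that reads the microphone. A common workaround is to wait for the key press in a background thread and signal the loop through a `threading.Event`. The sketch below shows only that pattern, with a fake microphone; `wait_for_enter`, `record_until_stopped` and `max_chunks` are hypothetical names, and this is not the approach record.py currently takes.

```python
import threading

def wait_for_enter(stop_event, read_line=input):
    """Block until a line is entered, then signal the recording loop to stop."""
    read_line()
    stop_event.set()

def record_until_stopped(read_chunk, stop_event, max_chunks=None):
    """Collect chunks from read_chunk() until stop_event is set or max_chunks is reached."""
    frames = []
    while not stop_event.is_set():
        frames.append(read_chunk())
        if max_chunks is not None and len(frames) >= max_chunks:
            break
    return frames

fake_chunk = b"\x00\x00" * 320  # one silent 20 ms chunk at 16 kHz, 16-bit

# Case 1: the simulated "Enter" arrives before the loop starts, so nothing is recorded
stop_event = threading.Event()
listener = threading.Thread(target=wait_for_enter,
                            args=(stop_event, lambda: ""), daemon=True)
listener.start()
listener.join()
frames_after_enter = record_until_stopped(lambda: fake_chunk, stop_event)

# Case 2: no key press at all; the safety limit ends the loop instead
frames_with_limit = record_until_stopped(lambda: fake_chunk, threading.Event(),
                                         max_chunks=3)
print(len(frames_after_enter), len(frames_with_limit))
```

In record.py, `read_chunk` would be something like `lambda: stream.read(FRAMES_PER_BUFFER)` and `read_line` would stay at its default, `input`. The thread is marked as a daemon so that a pending `input()` does not keep the process alive after the recording has been saved.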

# *record.py*
# ----------------------------------------------------------------
#1
# Import the required modules
import os
import sys
import wave  # reading/writing .wav files
import webrtcvad  # voice activity detection (VAD)
import pyaudio  # audio input/output
import time
import datetime  # used to name the .wav file after the current time
# ----------------------------------------------------------------

# ----------------------------------------------------------------
#2
# Terminal prompt: recording starts once Enter is pressed
input("Press Enter to start recording (Ctrl + C to quit)")
# ----------------------------------------------------------------

# ----------------------------------------------------------------
#3
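# Audio parameter settings
FORMAT = pyaudio.paInt16  # 16-bit samples
CHANNELS = 1  # number of audio channels (mono)
RATE = 16000  # sampling rate in samples per second
FRAMES_PER_BUFFER = 320  # frames per buffer (320 / 16000 Hz = 20 ms, a frame length webrtcvad accepts)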
# ----------------------------------------------------------------

# ----------------------------------------------------------------
#4
# VAD setup
# Aggressiveness mode: an integer from 0 to 3. 0 is the least aggressive about
# filtering out non-speech and 3 is the most aggressive; 3 is used here.
vad = webrtcvad.Vad(3)

# Create a PyAudio object to capture audio from the microphone,
# opening the input stream with the audio parameters set above
pa = pyaudio.PyAudio()
stream = pa.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True, frames_per_buffer=FRAMES_PER_BUFFER)

# Directory that contains this Python file -> used when saving the .wav file
current_directory = os.path.dirname(os.path.abspath(__file__))
# ----------------------------------------------------------------

# ----------------------------------------------------------------
#5
# Delete .wav files that already exist
# If the current directory contains any .wav files, delete them all before saving
existing_wav_files = [f for f in os.listdir(current_directory) if f.endswith('.wav')]
if existing_wav_files:
    for existing_wav_file in existing_wav_files:
        os.remove(os.path.join(current_directory, existing_wav_file))
        print(f"Deleted existing WAV file: {existing_wav_file}")
# ----------------------------------------------------------------

# ----------------------------------------------------------------
#6
inactive_session = False
inactive_since = time.time()
frames = []

# Maximum silence, in seconds, before the recording is saved (adjustable)
idle_time = 3

while True:
    # Read audio data from the microphone
    data = stream.read(FRAMES_PER_BUFFER)

    # Collect every chunk once -> joined and saved as a .wav file later
    frames.append(data)

    # Check whether this chunk contains speech
    is_active = vad.is_speech(data, sample_rate=RATE)

    if is_active:
        # Speech detected
        inactive_session = False
    elif not inactive_session:
        # First silent chunk: start tracking the silence (up to idle_time seconds)
        inactive_session = True
        inactive_since = time.time()

    # After idle_time seconds without speech, print 'X' and save the recording
    if inactive_session and (time.time() - inactive_since) > idle_time:
        sys.stdout.write('X')

        # Save as a .wav file in this script's directory (name = current time)
        current_time = datetime.datetime.now().strftime('%Y%m%d_%H-%M-%S')
        audio_recorded_filename = os.path.join(current_directory, f'RECORDED-{current_time}.wav')
        with wave.open(audio_recorded_filename, 'wb') as wf:
            wf.setnchannels(CHANNELS)
            wf.setsampwidth(pa.get_sample_size(FORMAT))
            wf.setframerate(RATE)
            wf.writeframes(b''.join(frames))

        # Wait 5 seconds after 'X', then get ready to record again
        time.sleep(5)
        inactive_session = False
    else:
        # While recording: print '1' for speech and '_' for silence
        sys.stdout.write('1' if is_active else '_')

    sys.stdout.flush()
# ----------------------------------------------------------------

# ----------------------------------------------------------------
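#7
# End of recording: stop the PyAudio stream
# (not reached while the loop above keeps running; the script is stopped with Ctrl+C)
stream.stop_stream()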
# ----------------------------------------------------------------
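The two numeric choices in record.py, the 320-sample buffer and the 3-second idle limit, can be checked in isolation. The sketch below verifies that a buffer size gives a frame length webrtcvad accepts (10, 20 or 30 ms at 8, 16, 32 or 48 kHz) and works out at which frame a 3-second silence would stop the recording. It counts frames instead of reading the wall clock, which is a deliberate simplification of the `time.time()` logic above; all function names are hypothetical.

```python
def frame_duration_ms(frames_per_buffer, rate):
    """Duration of one buffer in milliseconds."""
    return 1000 * frames_per_buffer / rate

def is_valid_vad_frame(frames_per_buffer, rate):
    """webrtcvad accepts 10, 20 or 30 ms frames at 8000, 16000, 32000 or 48000 Hz."""
    return (rate in (8000, 16000, 32000, 48000)
            and frame_duration_ms(frames_per_buffer, rate) in (10, 20, 30))

def stop_index(speech_flags, frame_ms=20, idle_seconds=3):
    """Index of the frame at which recording stops, or None if it never does.

    Recording stops once more than idle_seconds of consecutive non-speech
    frames have been seen, counted in frames rather than wall-clock time.
    """
    limit = idle_seconds * 1000 / frame_ms
    idle_frames = 0
    for i, is_speech in enumerate(speech_flags):
        idle_frames = 0 if is_speech else idle_frames + 1
        if idle_frames > limit:
            return i
    return None

print(is_valid_vad_frame(320, 16000))   # buffer size used in record.py
print(is_valid_vad_frame(512, 16000))   # 32 ms, not an accepted frame length

# One second of speech (50 frames) followed by four seconds of silence
flags = [True] * 50 + [False] * 200
print(stop_index(flags))
```

Counting frames makes the timeout reproducible in tests, at the cost of drifting from real time if chunks are ever dropped.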

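vad.py
- Applies VAD to the .wav file recorded with record.py
- After VAD has been applied, the original (pre-VAD) .wav file is deleted automatically
- (librosa.effects.trim only separates the start and end of the speech; silence in the middle of the recording is not removed)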
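The difference between trimming the edges and removing every silent stretch is easiest to see on a toy sequence of per-frame decisions. The sketch below does not call librosa; `trim_edges` only imitates edge-only trimming in the style of `librosa.effects.trim`, and `drop_all_silence` imitates what vad.py aims for. Both names are hypothetical.

```python
def trim_edges(flags):
    """Indices kept when only leading and trailing silence is removed."""
    voiced = [i for i, v in enumerate(flags) if v]
    if not voiced:
        return []
    # Everything between the first and last voiced frame survives,
    # including silent frames in the middle
    return list(range(voiced[0], voiced[-1] + 1))

def drop_all_silence(flags):
    """Indices kept when every silent frame is removed, wherever it occurs."""
    return [i for i, v in enumerate(flags) if v]

# 0 = silent frame, 1 = voiced frame: silence at the start, in the middle and at the end
flags = [0, 0, 1, 1, 0, 0, 0, 1, 1, 0]
edge_trimmed = trim_edges(flags)
fully_trimmed = drop_all_silence(flags)
print(edge_trimmed)
print(fully_trimmed)
```

Frames 4 to 6 survive edge trimming but not the frame-wise removal, which is the gap between the two approaches that the note above points out.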

# *vad.py*
# ----------------------------------------------------------------
#1
# Import the required modules
import os
import sys
import numpy as np
import scipy
import scipy.signal
import scipy.io.wavfile  # reading and writing .wav files
# ----------------------------------------------------------------

# ----------------------------------------------------------------
#2
# Function that splits a signal into blocks
# a = signal array, stride_length = length of each block, stride_step = distance between block starts
# Returns the array of blocked frames
def stride_trick(a, stride_length, stride_step):
    nrows = ((a.size - stride_length) // stride_step) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a,
                                           shape=(nrows, stride_length),
                                           strides=(stride_step*n, n))
# ----------------------------------------------------------------

# ----------------------------------------------------------------
#3
# Function that splits a signal into frames (fixed-size units)
# sig = signal to frame, as a numpy array
# fs = sampling frequency in Hz -> a higher value means more samples per frame
# win_len and win_hop are the frame length and hop in seconds (defaults 0.025 and 0.01)
def framing(sig, fs=16000, win_len=0.025, win_hop=0.01):

    # Check that the framing parameters are valid (the hop must not exceed the frame length)
    if win_len < win_hop: print("Parameter Error: win_len < win_hop...")

    # Compute the frame length and step, in samples
    frame_length = win_len * fs
    frame_step = win_hop * fs

    # Quantities needed for padding
    signal_length = len(sig)  # length of the original signal
    frames_overlap = frame_length - frame_step  # overlap between consecutive frames

    # Zero-pad the signal so that it splits into a whole number of frames
    rest_samples = np.abs(signal_length - frames_overlap) % np.abs(frame_length - frames_overlap)
    pad_signal = np.append(sig, np.array([0] * int(frame_step - rest_samples) * int(rest_samples != 0.)))

    # Frame the padded signal with stride_trick
    # Returns the framed signal and the frame length
    frames = stride_trick(pad_signal, int(frame_length), int(frame_step))
    return frames, frame_length
# ----------------------------------------------------------------

# ----------------------------------------------------------------
#4
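# Function that computes the normalized short-time energy of each frame (a measure of how loud the frame is)
# Note: n=len(frames) is the number of frames, not the frame length, so each frame
# is cropped or zero-padded to that many samples before the FFT
def _calculate_normalized_short_time_energy(frames):
    return np.sum(np.abs(np.fft.rfft(a=frames, n=len(frames)))**2, axis=-1) / len(frames)**2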
# ----------------------------------------------------------------

# ----------------------------------------------------------------
#5
# Function implementing the voice activity detection (VAD) algorithm
# threshold = decision threshold in dB (minimum log energy for a frame to count as speech)
def naive_frame_energy_vad(sig, fs, threshold=-20, win_len=0.25, win_hop=0.25, E0=1e7):
    # Split the recorded signal into frames
    frames, frames_len = framing(sig=sig, fs=fs, win_len=win_len, win_hop=win_hop)

    # Compute each frame's energy and convert it to a log scale (dB relative to E0)
    energy = _calculate_normalized_short_time_energy(frames)
    log_energy = 10 * np.log10(energy / E0)

    # Smooth the log energy with a 5-point median filter, then repeat each frame's
    # value once per sample (np.repeat needs an integer count)
    energy = scipy.signal.medfilt(log_energy, 5)
    energy = np.repeat(energy, int(frames_len))

    # Mark each sample as speech or non-speech and extract the speech samples
    vad = np.array(energy > threshold, dtype=sig.dtype)
    vframes = np.array(frames.flatten()[np.where(vad==1)], dtype=sig.dtype)

    return energy, vad, np.array(vframes, dtype=np.float64)
# ----------------------------------------------------------------

# ----------------------------------------------------------------
#6
# Main part
if __name__ == "__main__":
    # Find the files with a .wav extension in the current directory
    wav_files = [f for f in os.listdir() if f.endswith('.wav')]
    if len(wav_files) < 1:
        print("Error: no .wav file found in the current directory")
    else:
        input_filename = wav_files[0]
        output_filename = "after_VAD_record.wav"

        # Load the file
        fs, sig = scipy.io.wavfile.read(input_filename)

        # Keep only the voiced part
        _, _, voiced = naive_frame_energy_vad(sig, fs, threshold=-35,
                                              win_len=0.025, win_hop=0.025)

        # Save the extracted speech as a new WAV file
        scipy.io.wavfile.write(output_filename, fs, np.array(voiced, dtype=sig.dtype))

        # Delete the original audio file that the speech was extracted from
        os.remove(input_filename)
        print("VAD finished; the result was saved as after_VAD_record.wav")
# ----------------------------------------------------------------
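The core of the energy-based decision can be reproduced on a synthetic signal without NumPy or SciPy. The sketch below is a simplified stand-in rather than the vad.py algorithm itself: it uses the time-domain mean square of each frame instead of the FFT-based energy, works on float samples with a reference energy of 1.0 instead of `E0=1e7`, and reimplements the 5-point median filter with zero padding at the edges. The -35 dB threshold and the 400-sample frame (25 ms at 16 kHz) follow the values used in the main part; the function names are hypothetical.

```python
import math
import statistics

def frame_log_energy(samples, frame_len, e0=1.0):
    """Mean-square energy of each non-overlapping frame, in dB relative to e0."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(s * s for s in frame) / frame_len
        # Floor the energy so that digital silence does not produce log10(0)
        energies.append(10 * math.log10(max(mean_square, 1e-12) / e0))
    return energies

def median_filter(values, kernel=5):
    """Sliding median with zero padding at both ends."""
    half = kernel // 2
    padded = [0.0] * half + list(values) + [0.0] * half
    return [statistics.median(padded[i:i + kernel]) for i in range(len(values))]

def voiced_mask(samples, frame_len, threshold_db):
    """One boolean per frame: True where the smoothed log energy exceeds the threshold."""
    smoothed = median_filter(frame_log_energy(samples, frame_len))
    return [e > threshold_db for e in smoothed]

# Synthetic recording at 16 kHz: silence, a 440 Hz tone, silence (5 frames each)
frame_len = 400
silence = [0.0] * (5 * frame_len)
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(5 * frame_len)]
signal = silence + tone + silence

mask = voiced_mask(signal, frame_len, threshold_db=-35)
print(mask)
```

Only the five tone frames are marked as voiced. The median filter is what keeps a single loud or quiet frame from flipping the decision, although the zero padding means the first and last two frames are smoothed against artificial 0 dB neighbours.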


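<Problem>

.wav file issue: what should be done about disk usage as recordings keep piling up? When several .wav files exist, vad.py also cannot tell which one it should process…

<Solution>

Delete the files on every run → without deletion, audio files accumulate each time record.py is run, and because vad.py cannot tell which file to process, it keeps running on the first recorded file only...

+ The leftover files must be removed on every run; otherwise they accumulate (disk usage problem)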
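If old recordings ever need to be kept, an alternative to deleting them is to let vad.py pick the newest one by name. The file names written by record.py (`RECORDED-%Y%m%d_%H-%M-%S.wav`) sort chronologically as plain text, so the sketch below simply takes the largest matching name. `latest_recording` is a hypothetical helper and the demo file names are made up; this is not what the project currently does.

```python
import os
import re
import tempfile

# Matches the names produced by record.py, e.g. RECORDED-20240329_21-05-10.wav
RECORDED_PATTERN = re.compile(r"^RECORDED-\d{8}_\d{2}-\d{2}-\d{2}\.wav$")

def latest_recording(directory):
    """Newest RECORDED-<timestamp>.wav in directory, or None if there is none.

    Other .wav files, such as after_VAD_record.wav, are ignored.
    """
    candidates = [f for f in os.listdir(directory) if RECORDED_PATTERN.match(f)]
    return max(candidates) if candidates else None

# Demo with empty placeholder files in a temporary directory
demo_dir = tempfile.mkdtemp()
for name in ("RECORDED-20240329_21-05-10.wav",
             "RECORDED-20240330_09-00-00.wav",
             "after_VAD_record.wav"):
    open(os.path.join(demo_dir, name), "wb").close()

newest = latest_recording(demo_dir)
print(newest)
```

This removes the ambiguity about which file to process but does nothing about disk usage, so it complements the delete-on-every-run rule instead of replacing it.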

Test results (silence at the start, in the middle, and at the end of the recording is removed)

Before applying *vad.py*

After applying *vad.py*