Japanese text-to-speech with Open JTalk

Packages to install beforehand

Building Open JTalk requires a compiler toolchain, so install the build-essential package first.

$ sudo apt-get install build-essential
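As a quick sanity check after installing, a small script can report which build tools are still missing from PATH. This is only a sketch: the helper name `missing_tools` is made up for this example, and the tool list (gcc, g++, make) is my assumption about what build-essential pulls in on Debian.

```shell
#!/bin/sh
# Hypothetical helper: print every command from the argument list that is
# not found on PATH (prints nothing when all of them are available).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# gcc, g++ and make are assumed to be what build-essential provides.
missing=$(missing_tools gcc g++ make)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "missing build tools:" $missing
else
    echo "build tools look OK"
fi
```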

Getting the source code

From hts_engine API:

hts_engine_API-1.04.tar.gz (HTS runtime engine)

From Open JTalk:

open_jtalk-1.03.tar.gz (main program)
open_jtalk_dic_utf_8-1.03.tar.gz (dictionary for Open JTalk)
hts_voice_nitech_jp_atr503_m001-1.02.tar.gz (HTS voice for Open JTalk)

And while we are at it, from MMDAgent - Toolkit for Building Voice Interaction Systems:

MMDAgent_Example-1.0.zip (MMDAgent sample scripts)

These were the latest versions of each as of May 3, 2011.
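Before starting the build, it may help to confirm that all five archives really are in the working directory. The following is a minimal sketch; the function name `missing_archives` is made up for this example, and the file names are simply the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper: print each expected archive that is absent from
# the directory given as the first argument.
missing_archives() {
    dir=$1
    for f in \
        hts_engine_API-1.04.tar.gz \
        open_jtalk-1.03.tar.gz \
        open_jtalk_dic_utf_8-1.03.tar.gz \
        hts_voice_nitech_jp_atr503_m001-1.02.tar.gz \
        MMDAgent_Example-1.0.zip
    do
        [ -f "$dir/$f" ] || printf '%s\n' "$f"
    done
}

# Check the current directory; no output means everything is in place.
missing_archives .
```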

Building and installing hts_engine API

Place hts_engine_API-1.04.tar.gz in a working directory, then run:

$ tar zxvf hts_engine_API-1.04.tar.gz
$ cd ./hts_engine_API-1.04
$ ./configure
$ make
$ sudo make install

This completes the installation.

Building and installing Open JTalk

Place open_jtalk-1.03.tar.gz in the working directory, extract it, and move into ./open_jtalk-1.03/jpcommon.

$ tar zxvf open_jtalk-1.03.tar.gz
$ cd ./open_jtalk-1.03/jpcommon

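Edit jpcommon_label.c in that directory as shown below. The diff is taken from "音素継続長の処理 - Open JTalk" [ja.nishimotz.com], a page on phoneme duration handling. The change works around the issue described there, so it may no longer be needed once a newer version of hts_voice_nitech_jp_atr503_m001 is released.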

--- open_jtalk-1.03/jpcommon/jpcommon_label.c	2011-04-29 14:08:32.000000000 +0900
+++ jpcommon_label.c	2011-05-01 19:55:40.000000000 +0900
@@ -270,6 +270,7 @@
       if (index == a)
          break;
    }
+   if (i > 3) i = 3;
    return i;
 }
 
@@ -369,6 +370,7 @@
 
    for (i = 0, index = m->next; index != NULL; index = index->next)
       i++;
+   if (i > 10) i = 10;
    return index_mora_in_utterance(m) + i;
 }

To keep things simple, copy the diff above verbatim into a patch file named jpcommon_label.diff and apply it.

$ patch < jpcommon_label.diff
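Because a copy-pasted diff can easily lose whitespace, it can be worth checking that a patch applies cleanly before touching the source. The sketch below assumes GNU patch, whose `--dry-run` option reports what would happen without modifying any file. It demonstrates the idea on a throwaway file (`demo.c` and `demo.diff` are made-up names) rather than on jpcommon_label.c itself.

```shell
#!/bin/sh
# Demonstrate --dry-run on a scratch file; assumes GNU patch is installed.
work=$(mktemp -d)
cd "$work" || exit 1

# A tiny stand-in source file (hypothetical, not part of Open JTalk).
printf 'int clamp(int i)\n{\n   return i;\n}\n' > demo.c

cat > demo.diff <<'EOF'
--- demo.c
+++ demo.c
@@ -1,4 +1,5 @@
 int clamp(int i)
 {
+   if (i > 3) i = 3;
    return i;
 }
EOF

# Apply for real only if the dry run succeeds.
if patch --dry-run < demo.diff; then
    patch < demo.diff
else
    echo "patch would not apply cleanly" >&2
fi
```

For the real file, the same check would be `patch --dry-run < jpcommon_label.diff`, run from the jpcommon directory.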

After applying the patch, go back up one directory and run:

$ cd ../
$ ./configure --with-charset=UTF-8
$ make
$ sudo make install

This completes the installation.

Placing the dictionary and HTS voices

Place open_jtalk_dic_utf_8-1.03.tar.gz in the working directory, extract it, and move the resulting directory into /usr/local/share/open_jtalk.

$ tar zxvf open_jtalk_dic_utf_8-1.03.tar.gz
$ sudo mkdir /usr/local/share/open_jtalk
$ sudo mv ./open_jtalk_dic_utf_8-1.03 /usr/local/share/open_jtalk/

Place hts_voice_nitech_jp_atr503_m001-1.02.tar.gz in the working directory, extract it, and move the resulting directory into /usr/local/share/hts_voice.

$ tar zxvf hts_voice_nitech_jp_atr503_m001-1.02.tar.gz
$ sudo mkdir /usr/local/share/hts_voice
$ sudo mv ./nitech_jp_atr503_m001-1.02 /usr/local/share/hts_voice/

Place MMDAgent_Example-1.0.zip in the working directory, extract it, and move MMDAgent_Example-1.0/Voice/mei_normal into /usr/local/share/hts_voice.

$ unzip MMDAgent_Example-1.0.zip
$ sudo mv ./MMDAgent_Example-1.0/Voice/mei_normal /usr/local/share/hts_voice/

Scripts for Open JTalk

Running open_jtalk with no arguments prints a short usage summary.

$ open_jtalk

open_jtalk - The Japanese TTS system "Open JTalk"

  usage:
       open_jtalk [ options ] [ infile ] 
  options:                                                                   [  def][ min--max]
    -x dir         : dictionary directory                                    [  N/A]
    -td tree       : decision trees file for state duration                  [  N/A]
    -tm tree       : decision trees file for spectrum                        [  N/A]
    -tf tree       : decision trees file for Log F0                          [  N/A]
    -tl tree       : decision trees file for low-pass filter                 [  N/A]
    -md pdf        : model file for state duration                           [  N/A]
    -mm pdf        : model file for spectrum                                 [  N/A]
    -mf pdf        : model file for Log F0                                   [  N/A]
    -ml pdf        : model file for low-pass filter                          [  N/A]
    -dm win        : window files for calculation delta of spectrum          [  N/A]
    -df win        : window files for calculation delta of Log F0            [  N/A]
    -dl win        : window files for calculation delta of low-pass filter   [  N/A]
    -ow s          : filename of output wav audio (generated speech)         [  N/A]
    -ot s          : filename of output trace information                    [  N/A]
    -s  i          : sampling frequency                                      [16000][   1--48000]
    -p  i          : frame period (point)                                    [   80][   1--]
    -a  f          : all-pass constant                                       [ 0.42][ 0.0--1.0]
    -g  i          : gamma = -1 / i (if i=0 then gamma=0)                    [    0][   0-- ]
    -b  f          : postfiltering coefficient                               [  0.0][-0.8--0.8]
    -l             : regard input as log gain and output linear one (LSP)    [  N/A]
    -u  f          : voiced/unvoiced threshold                               [  0.5][ 0.0--1.0]
    -em tree       : decision tree file for GV of spectrum                   [  N/A]
    -ef tree       : decision tree file for GV of Log F0                     [  N/A]
    -el tree       : decision tree file for GV of low-pass filter            [  N/A]
    -cm pdf        : filename of GV for spectrum                             [  N/A]
    -cf pdf        : filename of GV for Log F0                               [  N/A]
    -cl pdf        : filename of GV for low-pass filter                      [  N/A]
    -jm f          : weight of GV for spectrum                               [  1.0][ 0.0--2.0]
    -jf f          : weight of GV for Log F0                                 [  1.0][ 0.0--2.0]
    -jl f          : weight of GV for low-pass filter                        [  1.0][ 0.0--2.0]
    -k  tree       : use GV switch                                           [  N/A]
    -z  i          : audio buffer size                                       [ 1600][   0--48000]
  infile:
    text file                                                                [stdin]
  note:
    option '-d' may be repeated to use multiple delta parameters.
    generated spectrum, log F0, and low-pass filter coefficient
    sequences are saved in natural endian, binary (float) format.

As you can see, open_jtalk takes too many options to be convenient to run directly, so I wrote Perl scripts for it, based on "動かしてみる - Open JTalk" [ja.nishimotz.com].

#!/usr/bin/perl
# Using nitech_jp_atr503_m001-1.02 as the HTS voice

use strict;
use warnings;

my $voice = '/usr/local/share/hts_voice/nitech_jp_atr503_m001-1.02';
my $dic   = '/usr/local/share/open_jtalk/open_jtalk_dic_utf_8-1.03';

my @opts = (
            -x  => "$dic",
            -td => "$voice/tree-dur.inf",
            -tm => "$voice/tree-mgc.inf",
            -tf => "$voice/tree-lf0.inf",
            -md => "$voice/dur.pdf",
            -mm => "$voice/mgc.pdf",
            -mf => "$voice/lf0.pdf",
            -dm => "$voice/mgc.win1",
            -dm => "$voice/mgc.win2",
            -dm => "$voice/mgc.win3",
            -df => "$voice/lf0.win1",
            -df => "$voice/lf0.win2",
            -df => "$voice/lf0.win3",
            -ow => "out.wav",
            -em => "$voice/tree-gv-mgc.inf",
            -ef => "$voice/tree-gv-lf0.inf",
            -cm => "$voice/gv-mgc.pdf",
            -cf => "$voice/gv-lf0.pdf",
            -k  => "$voice/gv-switch.inf",
           );

exec( 'open_jtalk', @opts, 'in.txt' ) or die "couldn't exec open_jtalk: $!\n";

exit;

#!/usr/bin/perl
# Using mei_normal as the HTS voice

use strict;
use warnings;

my $voice = '/usr/local/share/hts_voice/mei_normal';
my $dic   = '/usr/local/share/open_jtalk/open_jtalk_dic_utf_8-1.03';

my @opts = (
            -x  => "$dic",
            -td => "$voice/tree-dur.inf",
            -tm => "$voice/tree-mgc.inf",
            -tf => "$voice/tree-lf0.inf",
            -tl => "$voice/tree-lpf.inf",
            -md => "$voice/dur.pdf",
            -mm => "$voice/mgc.pdf",
            -mf => "$voice/lf0.pdf",
            -ml => "$voice/lpf.pdf",
            -dm => "$voice/mgc.win1",
            -dm => "$voice/mgc.win2",
            -dm => "$voice/mgc.win3",
            -df => "$voice/lf0.win1",
            -df => "$voice/lf0.win2",
            -df => "$voice/lf0.win3",
            -dl => "$voice/lpf.win1",
            -ow => 'out.wav',
            -s  => 48000,
            -p  => 240,  # frame period in points (speech rate?)
            -a  => 0.58, # all-pass constant (voice quality?)
            -u  => 0.0,  # voiced/unvoiced threshold
            -em => "$voice/tree-gv-mgc.inf",
            -ef => "$voice/tree-gv-lf0.inf",
            -cm => "$voice/gv-mgc.pdf",
            -cf => "$voice/gv-lf0.pdf",
            -jm => 0.7,  # weight of GV for spectrum (volume?)
            -jf => 0.5,  # weight of GV for Log F0 (intonation?)
            -k  => "$voice/gv-switch.inf",
            -z  => 6000,
           );

exec( 'open_jtalk', @opts, 'in.txt' ) or die "couldn't exec open_jtalk: $!\n";

exit;
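One way to read the -s and -p values: the usage text gives the frame period in points, so dividing it by the sampling frequency should yield the frame shift in time. Under the assumption that "point" means "sample", the arithmetic below suggests that 240 points at 48000 Hz and the defaults of 80 points at 16000 Hz both come out to 5 ms. The helper name `frame_shift_ms` is made up for this sketch.

```shell
#!/bin/sh
# Hypothetical helper: frame shift in milliseconds for a frame period given
# in points (assumed to mean samples) and a sampling frequency in Hz.
# Shell arithmetic is integer-only, so fractional results are truncated.
frame_shift_ms() {
    period=$1
    rate=$2
    echo $(( period * 1000 / rate ))
}

frame_shift_ms 80 16000     # open_jtalk defaults for -p and -s
frame_shift_ms 240 48000    # values used in the mei_normal script
```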

Prepare a text file named in.txt containing the text to read aloud, then run one of the scripts above. The synthesized speech is written to out.wav in WAV format.
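The usage text lists stdin as the default infile, so a shell wrapper could also take the text from a pipe instead of in.txt. The following is a rough sketch of that idea using the same options as the nitech script. The function name `jtalk_say` and the `OPEN_JTALK` override variable are made up for this example, and I have not compared its output with the Perl version.

```shell
#!/bin/sh
# Sketch: synthesize text read from stdin with the nitech voice.
# Paths match the ones used in this article; adjust them to your install.
VOICE=/usr/local/share/hts_voice/nitech_jp_atr503_m001-1.02
DIC=/usr/local/share/open_jtalk/open_jtalk_dic_utf_8-1.03
# OPEN_JTALK is a hypothetical override hook, e.g. for a dry run.
OPEN_JTALK=${OPEN_JTALK:-open_jtalk}

# Usage: echo "text" | jtalk_say [output.wav]
# No infile is passed, so open_jtalk is expected to read from stdin.
jtalk_say() {
    "$OPEN_JTALK" \
        -x  "$DIC" \
        -td "$VOICE/tree-dur.inf" \
        -tm "$VOICE/tree-mgc.inf" \
        -tf "$VOICE/tree-lf0.inf" \
        -md "$VOICE/dur.pdf" \
        -mm "$VOICE/mgc.pdf" \
        -mf "$VOICE/lf0.pdf" \
        -dm "$VOICE/mgc.win1" \
        -dm "$VOICE/mgc.win2" \
        -dm "$VOICE/mgc.win3" \
        -df "$VOICE/lf0.win1" \
        -df "$VOICE/lf0.win2" \
        -df "$VOICE/lf0.win3" \
        -em "$VOICE/tree-gv-mgc.inf" \
        -ef "$VOICE/tree-gv-lf0.inf" \
        -cm "$VOICE/gv-mgc.pdf" \
        -cf "$VOICE/gv-lf0.pdf" \
        -k  "$VOICE/gv-switch.inf" \
        -ow "${1:-out.wav}"
}
```

With this loaded into the shell, something like `echo "こんにちは" | jtalk_say hello.wav` should write the speech to hello.wav.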

The HTS voice tuning does not come close to the Open JTalk demo site, but for now I was able to synthesize reasonably convincing Japanese speech on a local machine.

Tested on: Debian GNU/Linux 6.0 squeeze