{"id":1719,"date":"2024-02-15T13:42:00","date_gmt":"2024-02-15T04:42:00","guid":{"rendered":"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/?p=1719"},"modified":"2025-11-11T12:29:14","modified_gmt":"2025-11-11T03:29:14","slug":"an-investigation-efficient-methods-for-voice-cloning%ef%bc%88-undergraduate-student-research-work-for-the-2023-academic-year-%ef%bc%89","status":"publish","type":"post","link":"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/en\/archives\/1719","title":{"rendered":"&#8220;An investigation of efficient methods for voice cloning&#8221; (Undergraduate student research work for the 2023 academic year)"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">This study investigated methods for efficiently generating voice clones from a small amount of speech data. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Voice cloning is a technology that records a person\u2019s voice and reproduces it through machine learning, and it has been widely applied in systems such as Vocaloid and recent AI-based voice generation tools (Figure 1).<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"468\" height=\"244\" src=\"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wordpress\/wp-content\/uploads\/2024\/02\/image-1.png\" alt=\"\" class=\"wp-image-1822\" style=\"width:468px;height:auto\" srcset=\"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wordpress\/wp-content\/uploads\/2024\/02\/image-1.png 468w, https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wordpress\/wp-content\/uploads\/2024\/02\/image-1-300x156.png 300w\" sizes=\"auto, (max-width: 468px) 100vw, 468px\" \/><figcaption class=\"wp-element-caption\">Figure 1. An overview of voice cloning<\/figcaption><\/figure>\n<\/div>\n\n\n<p class=\"wp-block-paragraph\">However, conventional approaches often require large amounts of recordings and complex processes, making them difficult for 
non-specialists to handle. To address this issue, this study employed <a href=\"https:\/\/github.com\/NVIDIA\/tacotron2\">Tacotron2<\/a>, a speech synthesis model, to examine the quality of voice clones produced with minimal voice data. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Specifically, 20 sentences were recorded by four participants and used for training. The generated speech was extracted at different training stages and evaluated on a five-point scale using the study\u2019s own criteria. The results showed that the quality of the generated voice improved as the number of training iterations increased, with the average score reaching its highest value, 3.36, after 5,000 iterations (Table 1).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Future work will focus on refining the selection of sentences to be recorded and developing a fully automated system to achieve faster and higher-quality voice cloning.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<div class=\"wp-block-group alignwide is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-524f8de7 wp-block-group-is-layout-flex\">\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"462\" height=\"234\" src=\"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wordpress\/wp-content\/uploads\/2025\/10\/image.png\" alt=\"\" class=\"wp-image-1717\" srcset=\"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wordpress\/wp-content\/uploads\/2025\/10\/image.png 462w, https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wordpress\/wp-content\/uploads\/2025\/10\/image-300x152.png 300w\" sizes=\"auto, (max-width: 462px) 100vw, 462px\" \/><\/figure>\n<\/div><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Table 1. 
The average score of the voice clone for each training epoch&nbsp;&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>This study investigated methods for efficiently generating voice clones from a small amount of speech data. <\/p>\n","protected":false},"author":14,"featured_media":1717,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"no","_lmt_disable":"","_locale":"en_US","_original_post":"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/?p=1716","footnotes":""},"categories":[24],"tags":[],"class_list":["post-1719","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-24","en-US"],"modified_by":"\u89d2\u7530\u3000\u7a1a\u5b99","_links":{"self":[{"href":"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wp-json\/wp\/v2\/posts\/1719","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wp-json\/wp\/v2\/comments?post=1719"}],"version-history":[{"count":10,"href":"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wp-json\/wp\/v2\/posts\/1719\/revisions"}],"predecessor-version":[{"id":1834,"href":"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wp-json\/wp\/v2\/posts\/1719\/revisions\/1834"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wp-json\/wp\/v2\/media\/1717"}],"wp:attachment":[{"href":"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wp-json\/wp\/v2\/media?parent=1719"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wp-json\/wp\/v2\/categories?post=1719"},{"taxonomy":"post_t
ag","embeddable":true,"href":"https:\/\/www.comm.tcu.ac.jp\/masuda-lab\/wp-json\/wp\/v2\/tags?post=1719"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}