The video is currently slightly outdated as we’ve released new features since it was made, and the training time is significantly quicker. However, a lot of the information in it is still relevant.

Unlike Instant Voice Cloning (IVC), which clones a voice nearly instantaneously from very short samples, Professional Voice Cloning (PVC) lets you train a hyper-realistic model of a voice. This is achieved by training a dedicated model on a large set of voice data, producing a model that’s indistinguishable from the original voice.

Since the custom models require fine-tuning and training, it will take a bit longer to train these Professional Voice Clones compared to the Instant Voice Clones. Giving an estimate is challenging as it depends on the number of people in the queue before you and a few other factors.

Here are the current estimates for Professional Voice Cloning:

  • English: ~3 hours
  • Multilingual: ~6 hours

Voice Creation

There are a few things to be mindful of before you start uploading your samples, and some steps that you need to take to ensure the best possible results.

Firstly, Professional Voice Cloning is highly accurate in cloning the samples used for its training. It will create a near-perfect clone of what it hears, including all the intricacies and characteristics of that voice, but also any artifacts and unwanted audio present in the samples. This means that if you upload low-quality samples with background noise, room reverb or echo, or other unwanted sounds such as music or multiple people speaking, the AI will try to replicate all of these elements in the clone as well.

Secondly, make sure there’s only a single speaking voice throughout the audio, as more than one speaker, excessive noise, or any of the issues above can confuse the AI. This confusion can leave the AI unable to discern which voice to clone, or cause it to misinterpret what the voice actually sounds like because it is masked by other sounds, leading to a less-than-optimal clone.

Thirdly, make sure you have enough material to clone the voice properly. The bare minimum we recommend is 30 minutes of audio, but for the optimal result and the most accurate clone, we recommend closer to 3 hours of audio. You might be able to get away with less, but at that point, we can’t vouch for the quality of the resulting clone.
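
The runtime guidance above (a 30-minute minimum, with closer to 3 hours recommended) can be checked before uploading. The following is a minimal sketch, assuming the samples are PCM WAV files in a local folder; the folder name `samples` and the function names are illustrative and not part of any ElevenLabs tooling.

```python
import wave
from pathlib import Path

MIN_SECONDS = 30 * 60          # bare minimum recommended in the text: 30 minutes
OPTIMAL_SECONDS = 3 * 60 * 60  # recommended target in the text: about 3 hours


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def assess_total_runtime(durations_seconds):
    """Classify the combined runtime of all samples against the guidance."""
    total = sum(durations_seconds)
    if total < MIN_SECONDS:
        return total, "below the recommended 30-minute minimum"
    if total < OPTIMAL_SECONDS:
        return total, "meets the minimum; closer to 3 hours is recommended"
    return total, "at or above the roughly 3-hour recommendation"


if __name__ == "__main__":
    # "samples" is a placeholder folder name (an assumption for this sketch).
    files = sorted(Path("samples").glob("*.wav"))
    durations = [wav_duration_seconds(p) for p in files]
    total, verdict = assess_total_runtime(durations)
    print(f"{total / 60:.1f} minutes total: {verdict}")
```

Only the combined runtime is assessed here, which matches the point that the total amount of audio matters rather than the number of files.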

Fourthly, the speaking style in the samples you provide will be replicated in the output, so depending on what delivery you are looking for, the training data should correspond to that style (e.g. if you are looking to voice an audiobook with a clone of your voice, the audio you submit for training should be a recording of you reading a book in the tone of voice you want to use). It is better to include just one style in the uploaded samples for consistency’s sake.

Lastly, it’s best to use samples in which you speak the language that the PVC will mainly be used for. Of course, the AI can speak any language that we currently support. However, it is worth noting that if the voice itself is not native to the language you want the AI to speak - meaning you cloned a voice speaking a different language - it might have an accent from the original language and might mispronounce words and inflections. For instance, if you clone a voice speaking English and then want it to speak Spanish, it will very likely have an English accent when speaking Spanish. We only support cloning samples recorded in one of our supported languages, and the application will reject your sample if it is recorded in an unsupported language.

For now, we only allow you to clone your own voice. You will be asked to go through a verification process before submitting your fine-tuning request.

  • Professional Recording Equipment: Use high-quality recording equipment for optimal results, as the AI will clone everything about the audio. High-quality input = high-quality output. Any microphone will work, but an XLR mic going into a dedicated audio interface would be our recommendation. A few general recommendations on the low end would be something like an Audio Technica AT2020 or a Rode NT1 going into a Focusrite interface or similar.
  • Use a Pop-Filter: Use a pop filter when recording to minimize plosives.
  • Microphone Distance: Position yourself at the right distance from the microphone - approximately two fists away from the mic is recommended, but it also depends on what type of recording you want.
  • Noise-Free Recording: Ensure that the audio input doesn’t have any interference, like background music or noise. The AI cloning works best with clean, uncluttered audio.
  • Room Acoustics: Preferably, record in an acoustically-treated room. This reduces unwanted echoes and background noises, leading to clearer audio input for the AI. You can make something temporary using a thick duvet or quilt to dampen the recording space.
  • Audio Pre-processing: Consider editing your audio beforehand if you’re aiming for a specific sound you want the AI to output. For instance, if you want a polished podcast-like output, pre-process your audio to match that quality. Likewise, if your recording has long pauses or many “uhm”s and “ahm”s between words, consider editing them out, as the AI will mimic those as well.
  • Volume Control: Maintain a consistent volume that’s loud enough to be clear but not so loud that it causes distortion. The goal is to achieve a balanced and steady audio level. The ideal would be between -23dB and -18dB RMS with a true peak of -3dB.
  • Sufficient Audio Length: Provide at least 30 minutes of high-quality audio that follows the above guidelines for best results - preferably closer to 3 hours of audio. The more quality data you can feed into the AI, the better the voice clone will be. The number of samples is irrelevant; the total runtime is what matters. However, if you plan to upload multiple hours of audio, it is better to split it into multiple ~30-minute samples. This makes it easier to upload.
  • Uploading: After pressing upload, the clone will be locked in and you will not be able to make any changes to it. Ensure that you have uploaded the correct samples that you want to use.
  • Verify Your Voice: Once everything is recorded and uploaded, you will be asked to verify your voice. To ensure a smooth experience, please try to verify your voice using the same or similar equipment used to record the samples and in a tone and delivery that is similar to what was present in the samples. If you do not have access to the same equipment, try verifying the best you can. If it fails, you will have to reach out to support.
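
The Volume Control guideline above quotes an RMS level between -23dB and -18dB with a peak of -3dB. As a rough illustration of how such levels might be measured, the sketch below computes RMS and peak in decibels relative to digital full scale. It makes several assumptions: samples are floats in the range -1.0 to 1.0, the -3dB figure is treated as a ceiling, and the peak is a plain sample peak rather than an oversampled true-peak measurement, so a dedicated audio meter may report slightly different values.

```python
import math

TARGET_RMS_DB = (-23.0, -18.0)  # RMS range quoted in the guideline
PEAK_LIMIT_DB = -3.0            # peak figure quoted in the guideline, used as a ceiling


def to_db(value):
    """Convert a linear amplitude (1.0 = full scale) to decibels."""
    return 20.0 * math.log10(value) if value > 0 else float("-inf")


def measure_levels(samples):
    """Return (rms_db, peak_db) for float samples in the range -1.0..1.0."""
    if not samples:
        raise ValueError("no samples to measure")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)  # sample peak, not oversampled true peak
    return to_db(rms), to_db(peak)


def check_levels(samples):
    """Compare measured levels with the quoted targets."""
    rms_db, peak_db = measure_levels(samples)
    return {
        "rms_db": rms_db,
        "peak_db": peak_db,
        "rms_in_range": TARGET_RMS_DB[0] <= rms_db <= TARGET_RMS_DB[1],
        "peak_ok": peak_db <= PEAK_LIMIT_DB,
    }


if __name__ == "__main__":
    # Synthetic one-second 220 Hz tone at amplitude 0.15, standing in for real audio.
    rate = 48000
    tone = [0.15 * math.sin(2 * math.pi * 220 * n / rate) for n in range(rate)]
    print(check_levels(tone))
```

For real recordings, the samples would first need to be decoded and scaled to floats; that step is left out here because it depends on the file format and tooling in use.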

Keep in mind that all of this depends on the output you want. The AI will try to clone everything in the audio, but for the AI to work optimally and predictably, we suggest following the guidelines mentioned above.
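
The Sufficient Audio Length guideline suggests splitting multiple hours of audio into several samples of about 30 minutes each. One possible way to plan those cut points is sketched below; the function name and the equal-length strategy are assumptions for illustration. In practice, each cut would be nudged to the nearest pause so that no word is split.

```python
def plan_segments(total_seconds, target_seconds=30 * 60):
    """Split a recording into near-equal segments of roughly target_seconds.

    Returns a list of (start, end) offsets in seconds.
    """
    if total_seconds <= 0:
        return []
    # Pick the segment count whose length lands closest to the target.
    count = max(1, round(total_seconds / target_seconds))
    length = total_seconds / count
    return [(i * length, (i + 1) * length) for i in range(count)]


if __name__ == "__main__":
    # A hypothetical 3-hour recording.
    for start, end in plan_segments(3 * 60 * 60):
        print(f"{start / 60:6.1f} min -> {end / 60:6.1f} min")
```

Equal-length segments are used so that a recording slightly over a multiple of 30 minutes does not leave a very short final file.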

Once you’ve uploaded your samples, there are four stages of the cloning process that you might see on your voice card.

  • Verify: This means that you have uploaded the voice samples but have not yet finished the verification step. You will need to finish this step before training can start.
  • Processing: This means that the voice has been verified and is preprocessing, ready to be trained. When you’ve reached this step, the rest is automatic, and you will not need to do anything.
  • Fine-tuning: This is when the voice is actually training. Along with this label, you will also see a loading bar to show you the progress.
  • Fine-tuned: This means the voice has finished training and is ready to be used!
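
The four voice card labels above form a simple linear progression, and only the first one requires anything from you. The sketch below models that as an enumeration. The label strings come from the list above; the class, the mapping, and the function names are hypothetical and do not represent any ElevenLabs API.

```python
from enum import Enum


class CloneStage(Enum):
    """Voice card labels, in the order they appear during cloning."""
    VERIFY = "Verify"
    PROCESSING = "Processing"
    FINE_TUNING = "Fine-tuning"
    FINE_TUNED = "Fine-tuned"


# What each label means for the user, paraphrased from the list above.
NEXT_STEP = {
    CloneStage.VERIFY: "Complete voice verification so training can start.",
    CloneStage.PROCESSING: "Nothing to do; the verified voice is being preprocessed.",
    CloneStage.FINE_TUNING: "Nothing to do; training is in progress.",
    CloneStage.FINE_TUNED: "Training is finished; the voice is ready to use.",
}


def describe(label):
    """Map a voice card label to its stage and a short explanation."""
    stage = CloneStage(label)  # raises ValueError for an unknown label
    return stage, NEXT_STEP[stage]


def needs_user_action(label):
    """Only the Verify stage waits on the user."""
    return CloneStage(label) is CloneStage.VERIFY


if __name__ == "__main__":
    for stage in CloneStage:
        print(f"{stage.value}: {NEXT_STEP[stage]}")
```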

Scripts

What you read is not very important; how you read it is. The AI will try to mimic everything it hears in a voice: the tonal quality, the accent, the inflection, and many other intricate details. It will replicate how you pronounce certain words, vowels, and consonants, but not the actual words themselves. So, it is better to choose a text or script that conveys the emotion you want to capture, and to read it in the tone of voice you want to use.