Bilingual snapshot of https://devblogs.microsoft.com/oldnewthing/20240726-00/?p=110048, saved on 2024-07-30 at 15:37.

What can I do if IMLangConvertCharset is unable to convert from code page 28591 directly to UTF-8?

Raymond Chen

A customer wanted to do a character set conversion from code page 28591 directly to UTF-8. They found that when they ask IMultiLanguage::CreateConvertCharset to create such a converter, it returns S_FALSE, meaning that no such conversion is available.

auto mlang = wil::CoCreateInstance<IMultiLanguage>(
        CLSID_CMultiLanguage);

// This next call returns S_FALSE, indicating no conversion.
wil::com_ptr<IMLangConvertCharset> convert;
mlang->CreateConvertCharset(28591, CP_UTF8, 0, &convert);

Oh no, what shall we ever do?

Okay, so CMultiLanguage can’t convert from 28591 directly to UTF-8. But you can just convert through UTF-16.

HRESULT ConvertStringFrom28591ToUtf8(
    char const* input,
    int inputLength,
    char * output,
    int outputCapacity,
    int* actualOutput)
{
    *actualOutput = 0;

    // Ensure we are not working with negative numbers.
    RETURN_HR_IF(E_INVALIDARG, inputLength < 0 ||
                               outputCapacity < 0);

    // Empty string converts to empty string.
    if (inputLength == 0)
    {
        return S_OK;
    }

    // Avoid edge cases if outputCapacity = 0.
    // This also short-circuits cases where we know that the
    // output buffer isn't big enough to hold the converted input.
    RETURN_HR_IF(HRESULT_FROM_WIN32(ERROR_INSUFFICIENT_BUFFER),
                inputLength > outputCapacity);

    // Code page 28591 resides completely in the BMP, so each input
    // byte becomes exactly one UTF-16 code unit.
    auto bufferCapacity = std::min(inputLength, outputCapacity);
    auto buffer = wil::make_unique_hlocal_nothrow<wchar_t[]>(
        bufferCapacity);
    RETURN_IF_NULL_ALLOC(buffer);

    // Convert from 28591 to UTF-16LE.
    auto result = MultiByteToWideChar(28591, MB_ERR_INVALID_CHARS,
        input, inputLength, buffer.get(), bufferCapacity);
    RETURN_IF_WIN32_BOOL_FALSE(result != 0);

    // Convert from UTF-16LE to UTF-8, passing only the code units
    // that the first conversion actually produced.
    *actualOutput = WideCharToMultiByte(CP_UTF8, 0,
        buffer.get(), result,
        output, outputCapacity, nullptr, nullptr);
    RETURN_IF_WIN32_BOOL_FALSE(*actualOutput != 0);

    return S_OK;
}

After dealing with some edge cases, we allocate a temporary UTF-16LE buffer. That buffer needs to be big enough to hold the converted input, but doesn’t need to be so big that the caller-provided output couldn’t possibly hold the result.

Since all the characters of code page 28591 have code points less than U+10000, they will convert to a single UTF-16LE code unit. Therefore, we will need at most inputLength UTF-16LE code units to hold the intermediate UTF-16LE output.

And since all the code points of the intermediate buffer will be less than U+10000, there will never be a need for more UTF-16LE code units than corresponding UTF-8 code units. Therefore, any intermediate buffer bigger than outputCapacity wouldn’t fit in the caller-provided buffer anyway, so we can just return the “insufficient buffer” error right away without having to do any work.
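
To make the size argument concrete, here is a minimal portable sketch that counts how many UTF-8 bytes a code page 28591 string would need. It assumes that the code page follows ISO 8859-1, where byte value N maps to code point U+00NN, so bytes below 0x80 take one UTF-8 byte and all others take two. The helper name Utf8BytesNeededForLatin1 is made up for this illustration and is not part of any Windows API.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the UTF-8 bytes needed to encode a string
// of ISO 8859-1 bytes, assuming byte value N maps to code point U+00NN.
std::size_t Utf8BytesNeededForLatin1(const std::string& input)
{
    std::size_t bytes = 0;
    for (char c : input) {
        // Code points below U+0080 need one UTF-8 byte;
        // U+0080 through U+00FF need two.
        bytes += (static_cast<unsigned char>(c) < 0x80) ? 1 : 2;
    }
    return bytes;
}
```

Under that assumption, the count is never smaller than the input length, which is why an output buffer shorter than the input can be rejected up front.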

The rest is anticlimactic: We convert the input buffer to our temporary buffer, and then we convert the temporary buffer to the output buffer.
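
For readers who want to see the two-step pivot without the Windows APIs, the following is a rough portable sketch of the same idea. It is not what MultiByteToWideChar or WideCharToMultiByte do internally. It assumes the ISO 8859-1 identity mapping (byte value N becomes U+00NN) and does not handle surrogate pairs, which the first step can never produce anyway. The names WidenLatin1 and EncodeUtf8 are invented for this example.

```cpp
#include <string>

// Step 1: widen each ISO 8859-1 byte to one UTF-16 code unit.
// Assumption: byte value N maps to code point U+00NN.
std::u16string WidenLatin1(const std::string& input)
{
    std::u16string wide;
    wide.reserve(input.size());
    for (char c : input) {
        wide.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    }
    return wide;
}

// Step 2: encode UTF-16 code units as UTF-8.
// Simplification: every code unit is treated as a BMP code point;
// surrogate pairs are not combined.
std::string EncodeUtf8(const std::u16string& wide)
{
    std::string utf8;
    for (char16_t unit : wide) {
        if (unit < 0x80) {
            utf8.push_back(static_cast<char>(unit));
        } else if (unit < 0x800) {
            utf8.push_back(static_cast<char>(0xC0 | (unit >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (unit & 0x3F)));
        } else {
            utf8.push_back(static_cast<char>(0xE0 | (unit >> 12)));
            utf8.push_back(static_cast<char>(0x80 | ((unit >> 6) & 0x3F)));
            utf8.push_back(static_cast<char>(0x80 | (unit & 0x3F)));
        }
    }
    return utf8;
}
```

Calling EncodeUtf8(WidenLatin1(input)) chains the two steps, with the std::u16string playing the role of the temporary buffer.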

In the general case, a single input byte could result in two UTF-16LE code units, if it represents a character outside the BMP. (We assume that no code page has a single input byte that converts to multiple Unicode characters.) And the worst-case expansion from UTF-8 bytes to UTF-16LE code units is just 1:1. So in the general case, the required temporary buffer capacity is std::min(2 * inputLength, outputCapacity).
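
The general-case capacity formula can be written down as a small helper. This is only a sketch: the name IntermediateCapacity is hypothetical, and the widening to 64 bits is an addition so that doubling a large inputLength cannot overflow a 32-bit int.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: capacity, in UTF-16 code units, of the
// intermediate buffer for the general case described above.
// Assumes both arguments are non-negative.
int IntermediateCapacity(int inputLength, int outputCapacity)
{
    // Widen before doubling so that 2 * inputLength cannot overflow.
    std::int64_t worstCase = 2 * static_cast<std::int64_t>(inputLength);

    // The result is never larger than outputCapacity, so it fits in an int.
    return static_cast<int>(
        std::min<std::int64_t>(worstCase, outputCapacity));
}
```

For example, a 10-byte input with a 100-byte output buffer gives a capacity of 20 code units, while the same input with a 15-byte output buffer is capped at 15.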

The whole IMultiLanguage interface was a red herring. You never needed it. The conversion was in front of you the whole time.

Bonus chatter: The entire MultiLanguage API family has been deprecated since at least 2008, possibly longer, so it’s a good thing we migrated away from it.

Bonus bonus chatter: The International Components for Unicode (ICU) have been included with Windows since Windows 10 version 1703, so if you don’t need to support anything older than that, you can just use the copy of ICU built into Windows. The ucnv_convertEx function lets you convert from one encoding to another. Mind you, it “pivots” through UTF-16, so it’s internally doing the same thing we are, but at least it’s done for you. You can consult the ICU documentation for more information about converters.

2 comments

  • Kevin Norris 21 hours ago 0

    > (We assume that no code page has a single input byte that converts to multiple Unicode characters.)

    I’m not entirely convinced that this is actually true....

  • Neil Rashbrook 0

    Although I agree that your output buffer needs to be at least as long as your input buffer, I found your logic as to why this is the case to be confusing. In particul...