Consolidate nested, overlapping <strong> and <em> tags

Updated: 2023-09-26

I have a string of text for which I store the markup separately. For example:

var content = {
    text: "a little white rabbit hops",
    style: [
        {
            type: "strong",
            start: 0,
            length: 8
        },
        {
            type: "em",
            start: 2,
            length: 12
        }
    ]
}

I then convert this to HTML, but the em tag has to be opened and closed twice for the markup to be well-formed:

<strong>a <em>little</em></strong><em> white</em> rabbit hops

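My question is: what is the best way to parse the HTML retrieved from the DOM so that the separated em tags are consolidated (or, conceivably, the strong tags: in my scenario either one can be nested inside the other)?

If I iterate over a NodeList of descendants (p.getElementsByTagName('em')), I would have to run several for loops and check the start/length of every nested tag. There must be an easier way, but I haven't thought of one. Is there a library that handles this kind of formatting (or a way to do it directly through the DOM)?

I am not using jQuery and don't want to add it to my project just for this. Any help is much appreciated!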

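---EDIT---

To clarify: this is essentially about converting the formatting to and from HTML, and the question is how best to handle tag nesting. That is, even though there are two em child elements, there is really only one em-formatted block (the end of em element 1 and the start of em element 2 are contiguous).

Here are two functions, one for each direction of the conversion.

First, the one that converts an HTML string to the content structure you described: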

function htmlToContent(html) {
    // The object to fill and return:
    var content = {
        text: '',
        style: []
    };
    // Keep track of recently closed tags (i.e. without text following them as of yet)
    var closedStyles = [];
    // Recursive function
    function parseNode(elem) {
        var style, type;
        if (elem.nodeType === 3) {
            // This is a text node (no children)
            content.text += elem.nodeValue;
            // Any styles that were closed should be added to the content
            // style array, as they cannot be "extended" any more
            [].push.apply(content.style, closedStyles);
            closedStyles = [];
        } else if (elem.nodeType === 1) {
            // This is an element. nodeName is upper case for HTML elements,
            // so normalise it to match the lower-case types ("em", "strong")
            type = elem.nodeName.toLowerCase();
            // See if we can extend a style that was closed
            if (!closedStyles.some(function (closedStyle, idx) {
                if (closedStyle.type === type) {
                    style = closedStyle;
                    // Style will be extended, so it's no longer closed
                    closedStyles.splice(idx, 1);
                    return true; // exit ".some"
                }
            })) {
                // No style could be extended, so we create a new one
                style = {
                    type: type,
                    start: content.text.length,
                    length: 0
                };
            }
            // Recurse into the child nodes:
            [].forEach.call(elem.childNodes, function (child) {
                parseNode(child);
            });
            // Set style length and store it as a closed one
            style.length = content.text.length - style.start;
            closedStyles.push(style);
        }
    }
    // Create a node with this html (declared with var to avoid an implicit global)
    var wrapper = document.createElement('p');
    wrapper.innerHTML = html;
    parseNode(wrapper);
    // Flush remaining styles to the result
    closedStyles.pop(); // Discard wrapper
    [].push.apply(content.style, closedStyles);
    return content;
}

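This function first injects the HTML string into a DOM wrapper element and then recurses through the node hierarchy to build the content structure. The main idea is that closed elements are first collected in a temporary closedStyles array. They are only added to the content structure once it is certain that they can no longer be merged with an upcoming node, which happens when a text node is encountered. If, however, a tag closes and opens again without any text in between, the matching style is located in the closedStyles array, removed from it, and reused so that it can be extended.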
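The closedStyles idea can also be tried without a DOM. The following is only a sketch under assumptions: the token format ({ open }, { close }, { text }) and the name tokensToContent are made up for illustration, and the tokens are assumed to be properly nested.

```javascript
// Hypothetical token format: { open: 'em' }, { close: 'em' } or { text: '...' }
function tokensToContent(tokens) {
    var content = { text: '', style: [] };
    var openStyles = [];   // stack of styles whose tag is still open
    var closedStyles = []; // closed styles that may still be extended

    tokens.forEach(function (token) {
        var style, i;
        if (typeof token.text === 'string') {
            content.text += token.text;
            // Text arrived, so the closed styles can no longer be extended
            [].push.apply(content.style, closedStyles);
            closedStyles = [];
        } else if (token.open) {
            // Reuse a just-closed style of the same type, if there is one
            for (i = 0; i < closedStyles.length; i++) {
                if (closedStyles[i].type === token.open) {
                    style = closedStyles.splice(i, 1)[0];
                    break;
                }
            }
            if (!style) {
                style = { type: token.open, start: content.text.length, length: 0 };
            }
            openStyles.push(style);
        } else if (token.close) {
            // Assumes proper nesting: the closing tag matches the top of the stack
            style = openStyles.pop();
            style.length = content.text.length - style.start;
            closedStyles.push(style);
        }
    });
    // Flush whatever is still waiting at the end
    [].push.apply(content.style, closedStyles);
    return content;
}

// Tokens for: <strong>a <em>little</em></strong><em> white</em> rabbit hops
var result = tokensToContent([
    { open: 'strong' }, { text: 'a ' }, { open: 'em' }, { text: 'little' },
    { close: 'em' }, { close: 'strong' }, { open: 'em' }, { text: ' white' },
    { close: 'em' }, { text: ' rabbit hops' }
]);
// result.style: [{type: 'strong', start: 0, length: 8}, {type: 'em', start: 2, length: 12}]
```

The second em opens before any text follows the first one, so the first em style is picked up from closedStyles and extended instead of a new one being created.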

The function that does the opposite can be defined as follows:

function contentToHtml(content) {
    var tags = [];
    // Build list of opening and closing tags with the offset of injection
    content.style.forEach(function (tag) {
        tags.push({
            html: '<' + tag.type + '>',
            offset: tag.start
        }, {
            html: '</' + tag.type + '>',
            offset: tag.start + tag.length
        });
    });
    // Sort this list by decreasing offset:
    tags.sort(function(a, b) {
        return b.offset - a.offset;
    });
    var html = '';
    var text = content.text;
    // Insert opening and closing tags from end to start in text
    tags.forEach(function (tag) {
        // Prefix the html with the open/close tag and the escaped text that follows it
        html = tag.html + textToHtml(text.slice(tag.offset)) + html;
        // Reduce the text to the part that still needs to be processed
        text = text.slice(0, tag.offset);
    });
    // Remaining text:
    html = textToHtml(text) + html;
    // Create a node with this html, in order to get valid html tag sequences
    var p = document.createElement('p');
    p.innerHTML = html;
    // p.innerHTML will change here if html was not valid.
    return p.innerHTML;
}

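This function first turns each style into two objects, one for the opening tag and one for the closing tag. These tags are then inserted into the text at the right positions (working from the end of the text to the start). Finally, the trick you described yourself is applied: the resulting HTML is put into a DOM element and read back out again. That way the browser repairs any invalid sequence of HTML tags.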
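If no DOM is available to repair the tag order, the nesting can be worked out directly with a stack of open styles. This is a sketch of an alternative technique, not the method described above; contentToNestedHtml and escapeText are hypothetical names, and the styles are assumed to have non-negative start and length values.

```javascript
// Escape the characters that are special in HTML text content
function escapeText(text) {
    return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function contentToNestedHtml(content) {
    function end(style) { return style.start + style.length; }

    // Every offset at which some style starts or ends, in ascending order
    var seen = {};
    content.style.forEach(function (style) {
        seen[style.start] = true;
        seen[end(style)] = true;
    });
    var offsets = Object.keys(seen).map(Number).sort(function (a, b) { return a - b; });

    var open = []; // stack of currently open styles, innermost last
    var html = '';
    var pos = 0;
    offsets.forEach(function (offset) {
        html += escapeText(content.text.slice(pos, offset));
        pos = offset;
        // Pop until no style that ends here is left on the stack. Styles that
        // were closed only to reach an outer one are remembered for reopening.
        var toOpen = [];
        while (open.some(function (s) { return end(s) === offset; })) {
            var inner = open.pop();
            html += '</' + inner.type + '>';
            if (end(inner) !== offset) toOpen.push(inner);
        }
        // Add the styles that start here (empty styles are ignored)
        content.style.forEach(function (style) {
            if (style.start === offset && style.length > 0) toOpen.push(style);
        });
        // Open the style that ends last first, so it becomes the outermost
        toOpen.sort(function (a, b) { return end(b) - end(a); });
        toOpen.forEach(function (style) {
            html += '<' + style.type + '>';
            open.push(style);
        });
    });
    return html + escapeText(content.text.slice(pos));
}

var nested = contentToNestedHtml({
    text: 'a little white rabbit hops',
    style: [
        { type: 'strong', start: 0, length: 8 },
        { type: 'em', start: 2, length: 12 }
    ]
});
// nested === '<strong>a <em>little</em></strong><em> white</em> rabbit hops'
```

At offset 8 the em is closed only because the strong around it ends there, so it is reopened immediately afterwards. That is where the split em comes from.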

This function uses the textToHtml utility function, which is defined as follows:

function textToHtml(text) {
    // See http://www.w3.org/International/questions/qa-escapes#use
    // Use global regular expressions: a string pattern would only replace the first match
    return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}
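One detail worth checking in any escaping helper like this: String.prototype.replace with a string pattern replaces only the first occurrence, so escaping needs a regular expression with the g flag. A small illustration follows; escapeHtmlText is a hypothetical name used only for this demonstration.

```javascript
var sample = 'a < b && b < c';

// A string pattern only replaces the first match
var firstOnly = sample.replace('<', '&lt;');
// firstOnly === 'a &lt; b && b < c'

// A regular expression with the g flag replaces every match.
// The ampersand is escaped first so the entities added later stay intact.
function escapeHtmlText(text) {
    return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}
var escaped = escapeHtmlText(sample);
// escaped === 'a &lt; b &amp;&amp; b &lt; c'
```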

You can see it at work in this fiddle, which uses a sample HTML string that also includes nested tags of the same type. Those are preserved.