Check if a string array element is a substring of a URL

Keywords: string, URL, array element, check. Updated: 2023-09-26

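I am working on a URL whitelist for a browser extension. What I have works, but I need to check the list in two places, and I would like to make the check as efficient as possible to reduce the chance of increasing page load times.

The first check is in a page mod with an attached content script that is applied to all sites; the content script is changed if the URL is on the whitelist. The second check is in a request observer, which sends different headers if the URL is whitelisted.

I tried checking the list only once and passing the result from the page mod to the request observer, or from the request observer to the page mod. Either way this causes timing issues: the headers are incorrect, or the modifications to the content script are not applied when they should be.

Is there a way to improve the substring check code below to make it faster?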

I have a list of sites entered by the user, which is sorted alphabetically before it is saved.

For now the format of the list is simple.

example1.com
b.example2.com/some/content.html
c.exampleN.com

A URL can be, for example:

http://example1.com/some/site/content.html

I am checking whether the URL contains the value of any array element as a substring:

// check if a url is in the list
function listCheck(list, url) {
    for (var i = 0; i < list.length; i++) {
        if (url.indexOf(list[i]) > -1)
            return true;
    }
    return false;
}
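To make the behaviour of this check concrete, here is a small usage sketch. The whitelist reuses the sample entries from the question; the URLs other than the first one are made up for illustration. It also shows a caveat of plain substring matching: an entry matches anywhere in the URL, not only in the host name.

```javascript
// Same linear substring check as above, repeated so the snippet runs on its own.
function listCheck(list, url) {
    for (var i = 0; i < list.length; i++) {
        if (url.indexOf(list[i]) > -1)
            return true;
    }
    return false;
}

// Sample whitelist in the format from the question (already sorted).
var whitelist = ["b.example2.com/some/content.html", "c.exampleN.com", "example1.com"];

console.log(listCheck(whitelist, "http://example1.com/some/site/content.html")); // true
console.log(listCheck(whitelist, "http://example3.com/"));                       // false

// Caveat: the entry is found inside another host name or a query string too.
// Both URLs below are hypothetical.
console.log(listCheck(whitelist, "http://notexample1.com/"));                    // true
console.log(listCheck(whitelist, "http://other.test/?ref=example1.com"));        // true
```

If that kind of accidental match matters, the check has to look at the host name rather than the whole URL string.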

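You could use a binary search on the first letter of the URL. This is handy because the whitelist can grow quickly. However, you cannot do this with patterns (e.g. *.somedomain.com).
Consider using a hash table to store the whitelist. You can make it efficient and specialized by writing your own hash function.
Regex would make things easier, but can also make them slower. If you use regex, make sure you know what you are doing. You can narrow down the list to compare against first, using one of the methods above.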
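The hash table idea could look roughly like the sketch below. It swaps the hand-written hash function for JavaScript's built-in Set, and it assumes that whitelist entries are bare host names (an entry with a path, such as b.example2.com/some/content.html, would need separate handling). The helper names buildHostSet and hostInSet are invented for this example.

```javascript
// Hypothetical helper: build the lookup table once, e.g. when the list is saved.
function buildHostSet(list) {
    return new Set(list.map(function (entry) { return entry.toLowerCase(); }));
}

// Hypothetical helper: true if the URL's host, or any parent domain of it, is in the set.
function hostInSet(hostSet, url) {
    var host;
    try {
        host = new URL(url).hostname; // the URL parser lower-cases http(s) host names
    } catch (e) {
        return false; // not a parseable absolute URL
    }
    var labels = host.split(".");
    // Try "a.b.example.com", then "b.example.com", then "example.com", then "com".
    for (var i = 0; i < labels.length; i++) {
        if (hostSet.has(labels.slice(i).join(".")))
            return true;
    }
    return false;
}

var hostSet = buildHostSet(["c.exampleN.com", "example1.com"]);
console.log(hostInSet(hostSet, "http://example1.com/some/site/content.html")); // true
console.log(hostInSet(hostSet, "http://www.example1.com/"));                   // true (parent domain)
console.log(hostInSet(hostSet, "http://notexample1.com/"));                    // false
```

Each lookup costs a handful of Set probes (one per label in the host name) regardless of how long the whitelist is, which is the main attraction over scanning the whole list.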

Edit: This is what I mean by binary search. It only works if you do not use wildcards.

function binarySearch(needle, haystack, startIndex, endIndex) {
    //console.log("\ttrying to find " + needle + " between " +
    //    haystack[startIndex] + "(" + startIndex + ") and " +
    //    haystack[endIndex] + "(" + endIndex + ")");
    // an empty range (e.g. an empty list) cannot contain the needle
    if (endIndex < startIndex)
        return -1;
    // the basic case, where the list is narrowed down to 1 or 2 items
    if (startIndex === endIndex || endIndex - startIndex === 1) {
        if (haystack[startIndex] === needle)
            return startIndex;
        if (haystack[endIndex] === needle)
            return endIndex;
        return -1;
    }
    var midIndex = Math.ceil((startIndex + endIndex) / 2);
    //console.log("\t\tgot " + haystack[midIndex] + "(" + midIndex +
    //    ") for middle of the list.");
    var comparison = haystack[midIndex].localeCompare(needle);
    //console.log("\t\tcomparison: " + comparison);
    if (comparison > 0)
        return binarySearch(needle, haystack, startIndex, midIndex);
    if (comparison < 0)
        return binarySearch(needle, haystack, midIndex, endIndex);
    return midIndex; // (comparison == 0)
}
var sitelist = [ // the whitelist (the haystack).
        "alpha.com",
        "bravo.com",
        "charlie.com",
        "delta.com",
        "echo.com",
        "foxtrot.com",
        "golf.com",
        "hotel.com",
        "india.com",
        "juliet.com",
        "kilo.com",
        "lima.com",
        "mike.com",
        "november.com",
        "oscar.com",
        "papa.com",
        "quebec.com",
        "romeo.com",
        "sierra.com",
        "tango.com",
        "uniform.com",
        "victor.com",
        "whiskey.com",
        "xray.com",
        "yankee.com",
        "zulu.com"
    ];
function testBinarySearch(needle) {
    console.log("trying to find " + needle);
    var foundIndex = binarySearch(needle, sitelist, 0, sitelist.length - 1);
    if (foundIndex < 0)
        console.log(needle + " not found");
    else
        console.log(needle + " found at: " + foundIndex);
}
// note that the list is already sorted. if the list is not sorted, sort it
// with the same comparison the search uses:
// sitelist.sort(function (a, b) { return a.localeCompare(b); });
// we can find "uniform.com" using 5 comparisons, instead of 21
testBinarySearch("uniform.com");
// we can confirm the non-existence of "google.com" in about 5 comparisons, not 26
testBinarySearch("google.com");
// this is an interesting (worst) case, it takes 5 comparisons, instead of 1
testBinarySearch("alpha.com");
// "zulu.com" takes 4 comparisons instead of 26
testBinarySearch("zulu.com");

Binary search scales well as your list grows. I won't go into the other pros and cons of binary search, since they are well documented in many places.
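Because a binary search finds exact matches only, the needle has to be the host name rather than the full URL. The sketch below shows one way to connect the two. It is not the recursive function above but a shorter iterative variant, and it assumes the whitelist holds bare host names sorted with the default sort(); indexOfSorted and hostIsListed are names made up for this example.

```javascript
// Iterative binary search for an exact match in an array sorted with the default sort().
function indexOfSorted(sorted, needle) {
    var lo = 0, hi = sorted.length - 1;
    while (lo <= hi) {
        var mid = (lo + hi) >>> 1; // middle index, rounded down
        if (sorted[mid] === needle) return mid;
        if (sorted[mid] < needle) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}

// Hypothetical helper: extract the host name from the URL, then search for it.
function hostIsListed(sortedHosts, url) {
    var host;
    try {
        host = new URL(url).hostname;
    } catch (e) {
        return false; // not a parseable absolute URL
    }
    return indexOfSorted(sortedHosts, host) !== -1;
}

var hosts = ["tango.com", "alpha.com", "mike.com"].sort();
console.log(hostIsListed(hosts, "http://mike.com/some/page.html")); // true
console.log(hostIsListed(hosts, "http://www.mike.com/"));           // false: exact host match only
console.log(hostIsListed([], "http://mike.com/"));                  // false: empty list is handled
```

The comparison operators here use the same ordering as the default sort(), so the list and the search agree on what "sorted" means.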

More questions about binary search in JavaScript:

Binary search code
Binary search on JSON objects

Using a regexp will make things easier. With this code you only have to do one comparison.

function listCheck(list, url) {
    var exp = new RegExp('(' + list.join('|') + ')');
    return exp.test(url);
}

EDIT: You can get errors from the symbols . or / or - in the URL, so this code works better:

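function listCheck(list, url) {
    // an empty list would produce the pattern "()", which matches every URL
    if (list.length === 0) return false;
    // escape /, . and - so that they are matched literally
    var exp = new RegExp('(' + list.join('|').replace(/(\/|\.|\-)/g, '\\$1') + ')');
    return exp.test(url);
}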
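Since the function above rebuilds the regular expression on every call, one possible refinement is to compile it once when the list changes and reuse it for every URL. The sketch below assumes that is acceptable for the extension; makeListChecker and escapeRegExp are hypothetical names, and the escaping covers all regex metacharacters rather than only ., / and -.

```javascript
// Hypothetical helper: escape every character that has a special meaning in a regex.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: compile the whitelist once and return a reusable check function.
function makeListChecker(list) {
    if (list.length === 0) {
        // an empty pattern would match every URL, so short-circuit instead
        return function () { return false; };
    }
    var exp = new RegExp(list.map(escapeRegExp).join("|"));
    return function (url) { return exp.test(url); };
}

var isWhitelisted = makeListChecker(["b.example2.com/some/content.html", "example1.com"]);
console.log(isWhitelisted("http://example1.com/some/site/content.html")); // true
console.log(isWhitelisted("http://example1xcom/"));                       // false: the dot is literal
console.log(makeListChecker([])("http://example1.com/"));                 // false
```

The returned function can then be shared by the page mod and the request observer, so both places pay only for a single regex test per URL. Whether this is actually faster than the loop for a given list is worth measuring.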