Create array of unique combinations from array of strings

Keywords: array, combination, unique, create, string. Updated: 2023-09-26

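I'm writing something that takes a block of text and breaks it down into possible database queries that could be used to find similar blocks of text. (Something like the "similar questions" list that is generated as I type this.) The basic process:

Remove stop words from the text
Remove special characters
From the remaining text, create an array of unique "stems"
Create an array of possible combinations of the stem array (where I'm stuck... sort of)

Here's what I have so far: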

    //baseList starts with an empty array
    //candList starts with the array of unique stems
    //target is where the arrays of unique combinations are stored
    function createUniqueCombos(baseList,candList,target){
        for(var i=0;i<candList.length;i++){
            //copy the base List
            var newList = baseList.slice(0);
            //add the candidate list item to the base list copy
            newList.push(candList[i]);
            //add the new array to the target array
            target.push(newList);
            //re-call function using new array as baseList
            //and remaining candidates as candList
            var nextCandList = candList.slice(i + 1);
            createUniqueCombos(newList,nextCandList,target);
        }
    }

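This works, but for blocks of text with more than about 25 words it crashes my browser. I realize that, mathematically, there can be a huge number of possible combinations. What I'd like to know is:

Is there a more efficient way to do this?
How can I define a minimum/maximum length for the combination arrays?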

I think your logic is fundamentally flawed because of how many combinations you are creating.
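To put a number on that: a function like the one in the question emits every non-empty subset of the stems, which is 2^n - 1 arrays for n unique stems. A quick sketch of the arithmetic (countNonEmptySubsets is just a name made up for this illustration):

```javascript
// Number of non-empty subsets of n distinct stems: 2^n - 1.
// (countNonEmptySubsets is a hypothetical helper, not part of the question's code.)
function countNonEmptySubsets(n) {
    return Math.pow(2, n) - 1;
}

console.log(countNonEmptySubsets(10)); // 1023
console.log(countNonEmptySubsets(25)); // 33554431
```

So at 25 stems the target array would have to hold more than 33 million arrays, which is consistent with the browser giving up around that size.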

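One approach I would take is:

Split the text into individual words (we'll call this variable split_words)
Remove special characters
Remove short/common words (and, or, I, a); either do this by length or, more sensibly, with a blacklist of words
Have a table (e.g. blocks) with the columns block_id and word

Then run a SQL query such as: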

SELECT block_id FROM blocks 
WHERE word IN (split_words) GROUP BY block_id 
ORDER BY COUNT(*) DESC

You will then have a list of block_id values, ordered by how many words the blocks have in common.
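For anyone who wants to try the idea without a database first, here is a rough in-memory sketch of the same steps in JS. The names STOP_WORDS, extractWords and rankBlocks are made up for this example, the stop-word list is only a placeholder, and rows is assumed to be an array of {block_id, word} objects mirroring the blocks table with one row per distinct word in a block:

```javascript
// Placeholder stop-word list (assumption); a real blacklist would be much longer.
const STOP_WORDS = ["and", "or", "i", "a", "the", "to", "of", "is"];

// Split text into lowercase words, dropping special characters,
// stop words and duplicates. This plays the role of split_words.
function extractWords(text) {
    const tokens = text.toLowerCase().replace(/[^a-z0-9\s]/g, " ").split(/\s+/);
    const seen = new Set();
    const words = [];
    for (const token of tokens) {
        if (token.length > 0 && STOP_WORDS.indexOf(token) === -1 && !seen.has(token)) {
            seen.add(token);
            words.push(token);
        }
    }
    return words;
}

// In-memory stand-in for:
//   SELECT block_id FROM blocks WHERE word IN (split_words)
//   GROUP BY block_id ORDER BY COUNT(*) DESC
function rankBlocks(rows, splitWords) {
    const wanted = new Set(splitWords);
    const counts = new Map();
    for (const row of rows) {
        if (wanted.has(row.word)) {
            counts.set(row.block_id, (counts.get(row.block_id) || 0) + 1);
        }
    }
    // Highest shared-word count first.
    return Array.from(counts.entries())
        .sort((a, b) => b[1] - a[1])
        .map(entry => entry[0]);
}
```

Calling rankBlocks(rows, extractWords(someText)) returns the block ids that share at least one word with the text, best match first. The work grows with the number of rows rather than with the number of word combinations.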

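I found a previous question: Algorithm to find articles with similar text

One of the answers links to an article that suggests counting how many adjacent character pairs two strings have in common. [http://www.catalysoft.com/articles/StrikeAMatch.html]

The example is in Java, but I believe it could easily be ported to JS: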

import java.util.ArrayList;

/** @return an array of adjacent letter pairs contained in the input string */
private static String[] letterPairs(String str) {
   int numPairs = Math.max(0, str.length()-1); // an empty word yields no pairs
   String[] pairs = new String[numPairs];
   for (int i=0; i<numPairs; i++) {
       pairs[i] = str.substring(i,i+2);
   }
   return pairs;
}
/** @return an ArrayList of 2-character Strings. */
private static ArrayList wordLetterPairs(String str) {
   ArrayList allPairs = new ArrayList();
   // Tokenize the string and put the tokens/words into an array
   String[] words = str.split("\\s+");
   // For each word
   for (int w=0; w < words.length; w++) {
       // Find the pairs of characters
       String[] pairsInWord = letterPairs(words[w]);
       for (int p=0; p < pairsInWord.length; p++) {
           allPairs.add(pairsInWord[p]);
       }
   }
   return allPairs;
}
/** @return lexical similarity value in the range [0,1] */
public static double compareStrings(String str1, String str2) {
   ArrayList pairs1 = wordLetterPairs(str1.toUpperCase());
   ArrayList pairs2 = wordLetterPairs(str2.toUpperCase());
   int intersection = 0;
   int union = pairs1.size() + pairs2.size();
   for (int i=0; i<pairs1.size(); i++) {
       Object pair1=pairs1.get(i);
       for(int j=0; j<pairs2.size(); j++) {
           Object pair2=pairs2.get(j);
           if (pair1.equals(pair2)) {
               intersection++;
               pairs2.remove(j);
               break;
           }
       }
   }
   return (2.0*intersection)/union;
}
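Here is roughly what a JS port might look like. Treat it as a sketch rather than a tested library: the function names simply mirror the Java above, and returning 0 when neither string contains any letter pair is my own assumption, since the Java version does not handle that case explicitly.

```javascript
// Adjacent letter pairs of one word, e.g. "FRANCE" -> FR, RA, AN, NC, CE.
function letterPairs(word) {
    const pairs = [];
    for (let i = 0; i < word.length - 1; i++) {
        pairs.push(word.substring(i, i + 2));
    }
    return pairs;
}

// All letter pairs of every whitespace-separated word in the string.
function wordLetterPairs(str) {
    const allPairs = [];
    for (const word of str.split(/\s+/)) {
        for (const pair of letterPairs(word)) {
            allPairs.push(pair);
        }
    }
    return allPairs;
}

// Similarity in the range [0,1]: twice the number of shared pairs
// divided by the total number of pairs in both strings.
function compareStrings(str1, str2) {
    const pairs1 = wordLetterPairs(str1.toUpperCase());
    const pairs2 = wordLetterPairs(str2.toUpperCase());
    const union = pairs1.length + pairs2.length;
    if (union === 0) {
        return 0; // assumption: no pairs at all counts as "not similar"
    }
    let intersection = 0;
    for (const pair of pairs1) {
        const j = pairs2.indexOf(pair);
        if (j !== -1) {
            intersection++;
            pairs2.splice(j, 1); // each pair in pairs2 can only match once
        }
    }
    return (2.0 * intersection) / union;
}
```

For example, compareStrings("France", "French") shares the pairs FR and NC out of 10 pairs in total, so it comes out at 0.4.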

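Your problem can be solved quite easily using my binomial coefficient class. Take a look at the code in my answer to a related question. I don't know whether porting the C# code to a SQL stored procedure is a good idea. It would probably be easier to port it to Java or JS and call the stored proc from that code.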
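The class itself is not reproduced here, so the following is not a port of it, just a minimal JS sketch of the underlying idea: the number of k-element combinations of n stems is the binomial coefficient C(n, k), and capping the combination length keeps the total manageable. The names binomial and combosBySize are invented for this example, and minSize is assumed to be at least 1.

```javascript
// Binomial coefficient C(n, k): how many k-element combinations n items have.
function binomial(n, k) {
    if (k < 0 || k > n) {
        return 0;
    }
    k = Math.min(k, n - k);
    let result = 1;
    for (let i = 1; i <= k; i++) {
        // After this step result equals C(n - k + i, i), which is an integer.
        result = (result * (n - k + i)) / i;
    }
    return result;
}

// Collect every combination of items whose length is between
// minSize and maxSize (inclusive), preserving the original item order.
function combosBySize(items, minSize, maxSize) {
    const results = [];
    function extend(base, start) {
        if (base.length >= minSize) {
            results.push(base.slice());
        }
        if (base.length === maxSize) {
            return; // do not grow past the maximum length
        }
        for (let i = start; i < items.length; i++) {
            base.push(items[i]);
            extend(base, i + 1);
            base.pop();
        }
    }
    extend([], 0);
    return results;
}
```

With 25 stems and lengths limited to 2 and 3, that is binomial(25, 2) + binomial(25, 3) = 300 + 2300 = 2600 combinations instead of tens of millions.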