Extract keyphrases from text (1-4 word ngrams)
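What's the best way to extract keyphrases from a block of text? I'm writing a tool to do keyword extraction: something like this. I've found a few libraries for Python and Perl that extract n-grams, but I'm writing this in Node, so I need a JavaScript solution. If there aren't any existing JavaScript libraries, could someone explain how to do this so I can write it myself?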
I like the idea, so I've implemented it: see below (descriptive comments are included).
Preview: https://jsfiddle.net/WsKMx
/*@author Rob W, created on 16-17 September 2011, on request for Stackoverflow (http://stackoverflow.com/q/7085454/938089)
* Modified on 17 July 2012, fixed IE bug by replacing [,] with [null]
* This script will calculate words. For the simplicity and efficiency,
* there's only one loop through a block of text.
* A 100% accuracy requires much more computing power, which is usually unnecessary
**/
var text = "A quick brown fox jumps over the lazy old bartender who said 'Hi!' as a response to the visitor who presumably assaulted the maid's brother, because he didn't pay his debts in time. In time in time does really mean in time. Too late is too early? Nonsense! 'Too late is too early' does not make any sense.";
var atLeast = 2; // Show results with at least .. occurrences
var numWords = 5; // Show statistics for one to .. words
var ignoreCase = true; // Case-sensitivity
var REallowedChars = /[^a-zA-Z'\-]+/g;
// RE pattern to select valid characters. Invalid characters are replaced with a whitespace
var i, j, k, textlen, len, s;
// Prepare key hash
var keys = [null]; //"keys[0] = null", a word boundary with length zero is empty
var results = [];
numWords++; //for human logic, we start counting at 1 instead of 0
for (i=1; i<=numWords; i++) {
    keys.push({});
}
// Remove all irrelevant characters
text = text.replace(REallowedChars, " ").replace(/^\s+/,"").replace(/\s+$/,"");
// Create a hash
if (ignoreCase) text = text.toLowerCase();
text = text.split(/\s+/);
for (i=0, textlen=text.length; i<textlen; i++) {
    s = text[i];
    keys[1][s] = (keys[1][s] || 0) + 1;
    for (j=2; j<=numWords; j++) {
        if(i+j <= textlen) {
            s += " " + text[i+j-1];
            keys[j][s] = (keys[j][s] || 0) + 1;
        } else break;
    }
}
// Prepares results for advanced analysis
for (var k=1; k<=numWords; k++) {
    results[k] = [];
    var key = keys[k];
    for (var i in key) {
        if(key[i] >= atLeast) results[k].push({"word":i, "count":key[i]});
    }
}
// Result parsing
var outputHTML = []; // Buffer data. This data is used to create a table using `.innerHTML`
// Note: despite its name, this comparator puts the highest count first (descending order)
var f_sortAscending = function(x,y) {return y.count - x.count;};
for (k=1; k<numWords; k++) {
    results[k].sort(f_sortAscending);//sorts results
    // Customize your output. For example:
    var words = results[k];
    if (words.length) outputHTML.push('<td colSpan="3" class="num-words-header">'+k+' word'+(k==1?"":"s")+'</td>');
    for (i=0,len=words.length; i<len; i++) {
        //Characters have been validated. No fear for XSS
        outputHTML.push("<td>" + words[i].word + "</td><td>" +
            words[i].count + "</td><td>" +
            Math.round(words[i].count/textlen*10000)/100 + "%</td>");
        // textlen defined at the top
        // The relative occurrence has a precision of 2 digits.
    }
}
outputHTML = '<table id="wordAnalysis"><thead><tr>' +
'<td>Phrase</td><td>Count</td><td>Relativity</td></tr>' +
'</thead><tbody><tr>' +outputHTML.join("</tr><tr>")+
"</tr></tbody></table>";
document.getElementById("RobW-sample").innerHTML = outputHTML;
/*
CSS:
#wordAnalysis td{padding:1px 3px 1px 5px}
.num-words-header{font-weight:bold;border-top:1px solid #000}
HTML:
<div id="RobW-sample"></div>
*/
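I don't know of such a library in JavaScript, but the logic is:
- split the text into an array
- then sort and count
Alternatively:
- split the text into an array
- create a secondary array
- traverse each item of the first array
- check whether the current item exists in the secondary array
- if it does not exist, push it as a key for that item
- otherwise, increment the value whose key equals the current item
HTH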
Ivo Stoykov
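The second approach above can be sketched roughly as follows. This is only an illustration, not the answerer's code: the function name `countWords` is hypothetical, a `Map` stands in for the "secondary array", and the tokenizer assumes plain English text (lowercased, split on anything that is not a letter or an apostrophe).

```javascript
// Hypothetical helper: counts how often each word occurs in a text.
function countWords(text) {
  // Assumption: words are runs of letters/apostrophes; everything else separates them.
  const words = text.toLowerCase().split(/[^a-z']+/).filter(Boolean);

  // The "secondary array" from the steps above: word -> number of occurrences.
  const counts = new Map();
  for (const word of words) {
    // Key missing: start at 1. Key present: increment its value.
    counts.set(word, (counts.get(word) || 0) + 1);
  }

  // Return [word, count] pairs, most frequent first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(countWords("In time in time does really mean in time."));
```

A `Map` keeps insertion order and avoids collisions with inherited object properties such as `constructor`, which is why it is used here instead of a plain object.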
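function ngrams(seq, n) {
  // Declared with const so it does not leak into the global scope
  const to_return = []
  for (let i=0; i<seq.length-(n-1); i++) {
    // Collect the n consecutive items starting at position i
    let cur = []
    for (let j=i; j<seq.length && j<=i+(n-1); j++) {
      cur.push(seq[j])
    }
    to_return.push(cur.join(''))
  }
  return to_return
}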
> ngrams(['a', 'b', 'c'], 2)
['ab', 'bc']
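The function above joins items with an empty string, which suits single characters. For keyphrases made of whole words, a variant that joins with a space and counts the results might look like the sketch below. The names `wordNgrams` and `keyphrases`, the default limits (1 to 4 words, at least 2 occurrences), and the simple English-only tokenizer are all assumptions made for illustration, and the code runs in Node without a DOM.

```javascript
// Hypothetical helper: word-level n-grams, each joined with a single space.
function wordNgrams(words, n) {
  const grams = [];
  for (let i = 0; i + n <= words.length; i++) {
    grams.push(words.slice(i, i + n).join(" "));
  }
  return grams;
}

// Hypothetical helper: phrases of 1..maxWords words that occur at least minCount times.
function keyphrases(text, maxWords = 4, minCount = 2) {
  // Assumption: words consist of letters, apostrophes and hyphens only.
  const words = text.toLowerCase().split(/[^a-z'-]+/).filter(Boolean);

  const counts = new Map(); // phrase -> number of occurrences
  for (let n = 1; n <= maxWords; n++) {
    for (const gram of wordNgrams(words, n)) {
      counts.set(gram, (counts.get(gram) || 0) + 1);
    }
  }

  // Keep only repeated phrases, most frequent first.
  return [...counts]
    .filter(([, count]) => count >= minCount)
    .sort((a, b) => b[1] - a[1]);
}

console.log(keyphrases("Too late is too early. Nonsense. Too late is too early again."));
```

Raw frequency tends to favor common words such as "the" or "is", so in practice a stop-word filter would probably be needed before the results look like keyphrases.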