Javascript: Using dictionary to filter out words from a string?

Keywords: filter, words, string, dictionary, Javascript. Last updated: 2023-09-26

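I need to filter out a few hundred "stop" words from a string. Since there are so many stop words, I don't think doing something like this is a good idea: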

sentence.replace(/\b(?:the|it is|we all|an?|by|to|you|[mh]e|she|they|we...)\b/ig, '');

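How can I create something like a hash map to store the stop words? In this map the key would be the stop word itself and the value would not matter. Filtering would then come down to checking that a word is not present in the stop-word map. What data structure should I use to build such a map?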
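For reference, the lookup idea described in the question could be sketched roughly as follows. This is only an illustration: it assumes an ES2015 `Set` is available (a plain object used as a key map would work the same way), the short word list is a placeholder, and `removeStopWords` is a hypothetical name.

```javascript
// Hypothetical, shortened stop list; a real one would hold hundreds of entries.
const stopWords = new Set(["the", "a", "an", "by", "to", "you", "me", "he", "she", "they", "we"]);

// Split the sentence on whitespace and drop every word found in the set.
// Limitation: a per-word lookup cannot match multi-word entries such as "it is",
// and punctuation attached to a word ("you.") prevents a match.
function removeStopWords(sentence) {
  return sentence
    .split(/\s+/)
    .filter(function (word) {
      return word !== "" && !stopWords.has(word.toLowerCase());
    })
    .join(" ");
}

const filtered = removeStopWords("The cat sat by the door to greet you");
console.log(filtered);
```

Each lookup is a constant-time membership test, so the cost grows with the number of words in the sentence rather than with the size of the stop list.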

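For this kind of job, nothing beats regular expressions. However, they have two problems: they are hard to maintain (as you pointed out in your post), and very large ones perform poorly. I don't know how many alternatives a single regexp can handle, but I'd guess that up to 20-30 will be fine in any case.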

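So you need some code that builds the regular expressions dynamically from some data structure, which can be an array or just a string. I personally prefer strings, since they are the easiest to maintain.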

// taken from http://www.ranks.nl/resources/stopwords.html
var stops = ""
+"a about above after again against all am an and any are aren't as  "
+"at be because been before being below between both but by can't    "
+"cannot could couldn't did didn't do does doesn't doing don't down  "
+"during each few for from further had hadn't has hasn't have        "
+"haven't having he he'd he'll he's her here here's hers herself     "
+"him himself his how how's i i'd i'll i'm i've if in into is isn't  "
+"it it's its itself let's me more most mustn't my myself no nor     "
+"not of off on once only or other ought our ours ourselves out      "
+"over own same shan't she she'd she'll she's should shouldn't so    "
+"some such than that that's the their theirs them themselves then   "
+"there there's these they they'd they'll they're they've this       "
+"those through to too under until up very was wasn't we we'd we'll  "
+"we're we've were weren't what what's when when's where where's     "
+"which while who who's whom why why's with won't would wouldn't     "
+"you you'd you'll you're you've your yours yourself yourselves      "
// how many stop words to put into each regexp
var reSize = 20;
// build regexps, longest words first so that e.g. "he's" is removed before "he"
var regexes = [];
stops = stops.match(/\S+/g).sort(function(a, b) { return b.length - a.length });
for (var n = 0; n < stops.length; n += reSize)
    regexes.push(new RegExp("\\b(" + stops.slice(n, n + reSize).join("|") + ")\\b", "gi"));

Once you have this, the rest is obvious:

regexes.forEach(function(r) {
    text = text.replace(r, '')
})

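You will need to experiment with the reSize value to find the best balance between the length of each regexp and the total number of regexps. If performance is critical, you can also run the generation part once and cache the result (i.e. the generated regexps) somewhere.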
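One possible way to do the caching is to wrap the generation step in a function that builds the regexps once and returns a filter that reuses them. The sketch below is only an illustration: `makeStopWordFilter`, the short stop list, the sample sentence and the chunk sizes in the timing loop are all made up for the example.

```javascript
// Hypothetical helper: builds the chunked regexps once for a given chunk size.
function makeStopWordFilter(stopList, chunkSize) {
  // Longest words first, so "he's" is removed before "he" can match part of it.
  var words = stopList.match(/\S+/g).sort(function (a, b) {
    return b.length - a.length;
  });
  var regexes = [];
  for (var n = 0; n < words.length; n += chunkSize) {
    regexes.push(new RegExp("\\b(" + words.slice(n, n + chunkSize).join("|") + ")\\b", "gi"));
  }
  // The returned function closes over the prebuilt regexps and reuses them.
  return function (text) {
    regexes.forEach(function (r) {
      text = text.replace(r, "");
    });
    return text;
  };
}

// Placeholder stop list and sentence, for illustration only.
var stopList = "the a an it it's is he he's of to";
var sample = "He's the owner of a small shop";

var removeStops = makeStopWordFilter(stopList, 4);
// Removing words leaves runs of spaces behind, so collapse them afterwards.
var cleaned = removeStops(sample).replace(/\s+/g, " ").trim();
console.log(cleaned);

// Rough timing loop for comparing chunk sizes; numbers depend on the engine and input.
[2, 4, 10].forEach(function (size) {
  var filter = makeStopWordFilter(stopList, size);
  var start = Date.now();
  for (var i = 0; i < 1000; i++) filter(sample);
  console.log("chunk size " + size + ": " + (Date.now() - start) + " ms");
});
```

With a realistic stop list and representative text, the same loop gives a rough idea of which chunk size works best in a given environment.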