Extract words in a string with Unicode characters

Keywords: word, string, extraction, unicode, characters. Updated: 2024-04-13

In JavaScript (Node.js), I need to index text strings that contain Unicode characters, i.e. given a string such as:

"Bonjour à tous le monde, 
je voulais être le premier à vous dire:
  -'comment ça va'
  -<est-ce qu'il fait beau?>"

I would like to get the following array of words:

 ["Bonjour", "à", "tous", "le", "monde", "je", "voulais", "être", ... "beau"]

How can I achieve this with a regex or by any other means?

PS: I installed and tried the xregexp module, which provides Unicode support for JavaScript, but since I am hopeless with regular expressions I could not get very far...

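You can use the version of XRegExp bundled with addons, which (among other addons) adds support for regex Unicode categories. We are interested in the category "not a Unicode letter", which is \P{L}. You can then split the string with the regex XRegExp("\\P{L}+").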

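// Assumes xregexp v3 or later, where require('xregexp') returns XRegExp with Unicode support.
var XRegExp = require('xregexp');
var s = "Bonjour à tous le monde,\nje voulais être le premier à vous dire:\n  -'comment ça va'\n  -<est-ce qu'il fait beau?>";
var notALetter = XRegExp("\\P{L}+");
// The last element is an empty string because s ends with non-letter characters.
var words = XRegExp.split(s, notALetter);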

See this fiddle.
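For comparison, the same "runs of letters" idea can be sketched without a library on engines that support ES2018 Unicode property escapes (recent Node.js versions do). This is an alternative to the XRegExp call above, not part of that answer; the variable names `text` and `nativeWords` are made up for the illustration.

```javascript
// Same input as in the question, written with real newlines (\n).
var text = "Bonjour à tous le monde,\nje voulais être le premier à vous dire:\n  -'comment ça va'\n  -<est-ce qu'il fait beau?>";

// \p{L} matches any Unicode letter; the `u` flag is required for \p{...}.
// Matching letter runs (instead of splitting on non-letters) avoids
// empty strings at the start or end of the result.
var nativeWords = text.match(/\p{L}+/gu) || [];

console.log(nativeWords);
```

Matching rather than splitting is a design choice here: `match` returns `null` when nothing matches, hence the `|| []` fallback.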

You could probably use the library "uwords" - https://github.com/AlexAtNet/uwords. It extracts words from text by grouping together characters that belong to the L* Unicode groups.

It works like XRegExp("\\p{L}+") but is extremely fast.

Example:

var uwords = require('uwords');
var words = uwords('Bonjour à tous le monde,\n' +
    'je voulais être le premier à vous dire:\n' +
    '-\'comment ça va\'\n' +
    '-<est-ce qu\'il fait beau?>');
console.log(words);
[ 'Bonjour',
  'à',
  'tous',
  'le',
  'monde',
  'je',
  'voulais',
  'être',
  'le',
  'premier',
  'à',
  'vous',
  'dire',
  'comment',
  'ça',
  'va',
  'est',
  'ce',
  'qu',
  'il',
  'fait',
  'beau' ]
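To make the "group letters together" idea concrete, here is a minimal sketch of a character-by-character scan. It is not the uwords source code: `groupLetters` is a hypothetical helper, and it assumes an engine with ES2018 Unicode property escapes.

```javascript
// Hypothetical helper (not taken from uwords): collect maximal runs of letters.
function groupLetters(text) {
  var isLetter = /^\p{L}$/u; // exactly one code point from any L* category
  var words = [];
  var current = '';
  // for...of iterates by code point, so letters outside the BMP stay intact.
  for (var ch of text) {
    if (isLetter.test(ch)) {
      current += ch;          // still inside a word
    } else if (current !== '') {
      words.push(current);    // a non-letter ends the current word
      current = '';
    }
  }
  if (current !== '') words.push(current); // flush a word at end of input
  return words;
}

console.log(groupLetters("-<est-ce qu'il fait beau?>"));
// → [ 'est', 'ce', 'qu', 'il', 'fait', 'beau' ]
```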

PS: Sorry for being late; I hope it is still useful.

One idea is to split the string on the various characters that are not part of a word, and then filter out the empty strings:

var str = "Bonjour à tous le monde, je voulais être le premier à vous dire:  -'comment ça va'  -<est-ce qu'il fait beau?>";
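// Split on runs of separators; \s covers spaces and newlines.
var result = str.split(/[-:'"?\s><]+/).filter(function(item) { return item !== ''; });
/*
["Bonjour", "à", "tous", "le", "monde,", "je", "voulais", "être", "le", "premier", "à", "vous", "dire", "comment", "ça", "va", "est", "ce", "qu", "il", "fait", "beau"]
Note that "monde," keeps its comma because "," is not in the character class.
*/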

Similarly, you can match against the negation of the character class above, and then you do not have to filter out empty strings:

var result = str.match(/[^-:'"?\s><]+/g);
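As a quick check that the two variants agree, the sketch below runs both on the same input. It also adds "," to the character class, which is an assumption about the desired output (the question asks for "monde" without the comma); the names `sample`, `separators`, `nonSeparators`, `bySplit` and `byMatch` are made up for the illustration.

```javascript
var sample = "Bonjour à tous le monde, je voulais être le premier à vous dire:  -'comment ça va'  -<est-ce qu'il fait beau?>";

// Same class as above, plus "," (assumption: commas should not stick to words).
var separators = /[-:,'"?\s><]+/;
var nonSeparators = /[^-:,'"?\s><]+/g;

// Variant 1: split on separator runs, then drop the empty strings.
var bySplit = sample.split(separators).filter(function (item) { return item !== ''; });

// Variant 2: match runs of non-separators directly; nothing to filter.
var byMatch = sample.match(nonSeparators) || [];

console.log(bySplit.join('|') === byMatch.join('|')); // → true
console.log(bySplit[4]);                              // → monde
```

Any enumerated class like this only handles the punctuation it lists, so other separators (for example ";" or "!") would need to be added by hand.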