Possible to get text between two specific Regular Expression Matches?

Keywords: regular expression, between, extract text, two matches      Updated: 2023-09-26

The text I have to parse looks like this:

var textToParse = "INTRO
1.  MORE INTRO
2.  THINGS
3.  CONTENTS.   The 200 teststs.  The 300 test. 
4.  REF.  jytjndga.
5.  COLORING BOOK.  The 400 teststs.  The 500 test. 
WETRJEWO /EWRGGWE RE
100.
FUN STUFF
101.
RTHRT QWERATGER
A.  WSHNJDBRTH ARGSERTHERTHB
B. aqhretgwaefawef
C. trtrrttrtrtr
101.1
loads
   .2
thinking of loading
   .3
Loading 
   .4
unloading
   .5
reloading
   .6 
deloading
   .7
reREloading
   .8
done loading
   .9
not loading
   .10
fish
200.
PROCEDURES
201.
PROCEDURES 1
202.
PROCEDURES 2
A.  hear about procedure 203.
B.  think about procedure 203.
C.  eat cookie
D.  procrastinate
E.  sleep.
203.
PROCEDURES 3
203.1
A.  Trim Lawn
203.
PROCEDURES 3 (CONT’D)
203.1
B.  Clean stuff
C.  Finsih cleaning
204.
PROCEDURES 4
204.1
A.  wax on.
B.  Wax off
C.  crane kick
D.  Don't sweep the leg
E.  Sweep leg anyway
204.
PROCEDURES 4 (CONT’D)
204.1
F.  Finish procedure
205.
LAUNDRY DAY";

I run this regex to find all the main section headers (and a few lines that aren't headers):

var sectionHeadersRegex = /^\s*\d{3}\.?(\s|$)/;
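
As a rough illustration of what this check does, the sketch below applies the same pattern to a handful of lines copied from the sample text. The names `sampleLines` and `headerLines` are made up for this example and are not part of the original code.

```javascript
// Hypothetical sample: a few lines copied from the text to parse.
var sampleLines = [
  '100.',
  'FUN STUFF',
  '101.',
  '101.1',
  '   .2',
  '203.',
  'A.  hear about procedure 203.'
];

// Same pattern as in the question, with the backslashes in place.
var sectionHeadersRegex = /^\s*\d{3}\.?(\s|$)/;

// Keep only the lines that look like three-digit section headers.
var headerLines = sampleLines.filter(function (line) {
  return sectionHeadersRegex.test(line);
});

console.log(headerLines); // [ '100.', '101.', '203.' ]
```

Note that '101.1' and the line that merely mentions "203." in the middle are rejected, because the pattern is anchored to the start of the line and requires whitespace or the end of the line after the optional dot.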

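So my question is: I want to get all the text between two matches.

For example, I want to get all the text between match[1] ('101.') and match[5] ('203.').

So the text would be: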

var desireText = "RTHRT QWERATGER
A.  WSHNJDBRTH ARGSERTHERTHB
B. aqhretgwaefawef
C. trtrrttrtrtr
101.1
loads
   .2
thinking of loading
   .3
Loading 
   .4
unloading
   .5
reloading
   .6 
deloading
   .7
reREloading
   .8
done loading
   .9
not loading
   .10
fish
200.
PROCEDURES
201.
PROCEDURES 1
202.
PROCEDURES 2
A.  hear about procedure 203.
B.  think about procedure
C.  eat cookie
D.  procrastinate
E.  sleep.
";
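I know the matches include extra whitespace at the start, and I know one answer is to take advantage of that whitespace, so that I could build a regex like this: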

var newRegexToGetTextBetweenMatchesOneandFive = new RegExp(match[1] + '([^~]+?)' + match[5]);

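But I can't rely on the whitespace in front of the header numbers to prevent false matches.

Even if I could, the goal is basically to be able to say "get all the text between the second match and the sixth match", rather than "get all the text between '101.' and '203.'".

Thanks for your help, and let me know if I can clarify anything.

EDIT:

@Dawg sorry for the confusion. I think this example will clear things up. @Wiktor your answer seems to get the text in the same way.

I modified the text to be parsed slightly so that I can show the problem the way you did.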

var str = 'var textToParse = "INTRO\n\n1.  MORE INTRO\n\n2.  THINGS\n\n3.  CONTENTS.   The 200 teststs.  The 300 test. \n\n4.  REF.  jytjndga.\n\n5.  COLORING BOOK.  The 400 teststs.  The 500 test. \n\nWETRJEWO /EWRGGWE RE\n\n100.\nFUN STUFF\n\n101.\nRTHRT QWERATGER\n\nA.  WSHNJDBRTH ARGSERTHERTHB\n\nB. aqhretgwaefawef\n\nC. trtrrttrtrtr\n\n101.1\nloads\n   .2\nthinking of loading\n   .3\nLoading \n   .4\nunloading\n   .5\nreloading\n   .6 \ndeloading\n   .7\nreREloading\n   .8\ndone loading\n   .9\nnot loading\n   .10\nfish\n\n200.\nPROCEDURES\n\n201.\nPROCEDURES 1\n\n202.\nPROCEDURES 2\n\nA.  hear about procedure 203.\n\nB.  think about procedure 203.\n\nC.  eat cookie\n\nD.  procrastinate\n\nE.  sleep.\n\n203.\n THIS SHOULD BE CAPTURED\n\n203.\nPROCEDURES 3\n\n203.1\nA.  Trim Lawn\n\n203.\nPROCEDURES 3 (CONT’D)\n\n203.1\nB.  Clean stuff\n\nC.  Finsih cleaning\n\n204.\nPROCEDURES 4\n\n204.1\nA.  wax on.\n\nB.  Wax off\n\nC.  crane kick\n\nD.  Don\'t sweep the leg\n\nE.  Sweep leg anyway\n\n204.\nPROCEDURES 4 (CONT’D)\n\n204.1\nF.  Finish procedure\n\n205.\nLAUNDRY DAY";';

I changed this part:

'sleep.\n\n203.\nPROCEDURES'

to:

'sleep.\n\n203.\n THIS SHOULD BE CAPTURED\n\n203.\nPROCEDURES'

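So now the closing match is match[6] instead of match[5].

Therefore, it can't just be a regex that uses the text of the two matches as the start and end of the desired text.

It has to be all the text between match position [1] and match position [6].

I wish I had thought to explain it this way from the start. I think this makes it clearer.

Since the question was edited, I have completely reworked my previous answer.

You need to get a substring between certain matches of the following regex: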

var re = /^\s*\b\d{3}\.?(?:\s|$)/gm;

Then, once you have str ready, define an array to hold the positions of the matches:

var indices = [];

Then, iterate over all the matches with RegExp.exec():

var m; // declared explicitly so the loop does not create an implicit global
while ((m = re.exec(str)) !== null) {
   indices.push({ start: m.index, end: m.index + m[0].length });
}

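Note how the start and end positions are obtained: the start position comes from the index property of the match result (m.index), and the end position is the sum of that index and the length of the matched text (m[0]).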
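
To make the arithmetic concrete, here is a minimal sketch on a much shorter input. The names `demoStr`, `demoRe`, `demoIndices` and `dm` are invented for this illustration; only the pattern is taken from the answer.

```javascript
// Hypothetical miniature input, just to show how start and end are computed.
var demoStr = 'INTRO\n100.\nFUN\n101.\nMORE';
var demoRe = /^\s*\b\d{3}\.?(?:\s|$)/gm;

var demoIndices = [];
var dm;
while ((dm = demoRe.exec(demoStr)) !== null) {
  // start: where the match begins; end: start plus the matched length.
  demoIndices.push({ start: dm.index, end: dm.index + dm[0].length });
}

console.log(demoIndices);
// [ { start: 6, end: 11 }, { start: 15, end: 20 } ]

// Everything from the end of the first match up to the start of the second.
console.log(JSON.stringify(demoStr.substring(demoIndices[0].end, demoIndices[1].start)));
// "FUN\n"
```

Each match here is five characters long ('100.' plus the newline consumed by the final group), which is why end is start + 5.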

Next, use those positions with the string.substring method to get the text you need (note the indices 1 and 6 being passed):

var newRegexToGetTextBetweenMatchesOneandFive = str.substring(indices[1].end, indices[6].start);

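The property used on indices[1] is end, because the text we want begins where that match ends; the property used on indices[6] is start, because the substring should run up to the beginning of the match at index 6.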
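
Since the stated goal is to ask for "the text between match N and match M" by position, the same steps could be wrapped in a small function. This is only a sketch under assumptions: `textBetweenMatches`, `sample` and `headerRe` are hypothetical names, and returning null for a missing match is a design choice of this example rather than anything from the original answer.

```javascript
// Hypothetical helper: returns the text between the end of match number
// `from` and the start of match number `to` (both zero-based), or null
// when either match does not exist.
function textBetweenMatches(text, pattern, from, to) {
  // Work on a copy that has the global flag, so exec() advances through
  // the text and the caller's regex keeps its own lastIndex.
  var flags = pattern.flags.indexOf('g') === -1 ? pattern.flags + 'g' : pattern.flags;
  var re = new RegExp(pattern.source, flags);

  var positions = [];
  var m;
  while ((m = re.exec(text)) !== null) {
    positions.push({ start: m.index, end: m.index + m[0].length });
    // Guard against zero-length matches, which would otherwise loop forever.
    if (m[0].length === 0) re.lastIndex++;
  }

  if (!positions[from] || !positions[to]) return null;
  return text.substring(positions[from].end, positions[to].start);
}

// Hypothetical shortened input with a repeated '203.' header.
var sample = 'X\n100.\nA\n101.\nB\n203.\nC\n203.\nD';
var headerRe = /^\s*\b\d{3}\.?(?:\s|$)/gm;

console.log(JSON.stringify(textBetweenMatches(sample, headerRe, 1, 3)));
// "B\n203.\nC\n"
```

Because the function addresses matches by position rather than by their text, the repeated '203.' header does not cut the result short.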

The complete demo follows.

var re = /^\s*\b\d{3}\.?(?:\s|$)/gm;
var str = 'var textToParse = "INTRO\n\n1.  MORE INTRO\n\n2.  THINGS\n\n3.  CONTENTS.   The 200 teststs.  The 300 test. \n\n4.  REF.  jytjndga.\n\n5.  COLORING BOOK.  The 400 teststs.  The 500 test. \n\nWETRJEWO /EWRGGWE RE\n\n100.\nFUN STUFF\n\n101.\nRTHRT QWERATGER\n\nA.  WSHNJDBRTH ARGSERTHERTHB\n\nB. aqhretgwaefawef\n\nC. trtrrttrtrtr\n\n101.1\nloads\n   .2\nthinking of loading\n   .3\nLoading \n   .4\nunloading\n   .5\nreloading\n   .6 \ndeloading\n   .7\nreREloading\n   .8\ndone loading\n   .9\nnot loading\n   .10\nfish\n\n200.\nPROCEDURES\n\n201.\nPROCEDURES 1\n\n202.\nPROCEDURES 2\n\nA.  hear about procedure 203.\n\nB.  think about procedure 203.\n\nC.  eat cookie\n\nD.  procrastinate\n\nE.  sleep.\n\n203.\n THIS SHOULD BE CAPTURED\n\n203.\nPROCEDURES 3\n\n203.1\nA.  Trim Lawn\n\n203.\nPROCEDURES 3 (CONT’D)\n\n203.1\nB.  Clean stuff\n\nC.  Finsih cleaning\n\n204.\nPROCEDURES 4\n\n204.1\nA.  wax on.\n\nB.  Wax off\n\nC.  crane kick\n\nD.  Don\'t sweep the leg\n\nE.  Sweep leg anyway\n\n204.\nPROCEDURES 4 (CONT’D)\n\n204.1\nF.  Finish procedure\n\n205.\nLAUNDRY DAY";';
var indices = [];
var m; // declared explicitly so the loop does not create an implicit global
while ((m = re.exec(str)) !== null) {
   // Record where each header match starts and ends.
   indices.push({ start: m.index, end: m.index + m[0].length });
}
// Text from the end of match [1] up to the start of match [6].
var newRegexToGetTextBetweenMatchesOneandFive = str.substring(indices[1].end, indices[6].start);
document.body.innerHTML = "<pre>" + newRegexToGetTextBetweenMatchesOneandFive + "</pre>";