JavaScript Regex to get title and subtitle from text file

Keywords: get, title, subtitle, file, Regex, text, JavaScript. Updated: 2023-09-26

I have the text below, which comes from a *.text file:

1) TEXTDATA.TXT

A 57-year-old female presents to the office with fatigue, jaundice and dyspnea. On physical exam you note her face is pale. Laboratory testing shows slightly elevated MCV, increased LDH, indirect bilirubin, and reticulocytes. Positive Direct Coombs test shows antibodies on RBCs and peripheral smear shows spherocytes. What is the most likely diagnosis?
A. Glucose-6-phospate dehydrogenase (G6PD) deficiency
B. Vitamin B12 deficiency
C. Paroxysmal nocturnal hemoglobinuria (PNH)
D. Iron deficiency anemia
E. Autoimmune hemolytic anemia
The correct answer is (E) Autoimmune hemolytic anemia This patient most likely has warm autoimmune hemolytic anemia as evidenced by her positive Direct Coombs test, elevated MCV, increased LDH, indirect bilirubin, and reticulocytes. Warm autoimmune hemolytic anemias are idiopathic or associated with autoimmune processes (SLE), drugs, lymphoproliferative disorders (CLL) and typically present with severe anemia (pallor, jaundice, fatigue, dyspnea). Peripheral smear can show spherocytes.
Choice A (Glucose-6-phospate dehydrogenase (G6PD) deficiency) is incorrect. G6PD is a X-linked recessive disease, which is seen more commonly in males.
Choice B (Vitamin B12 deficiency) is incorrect. Pernicious anemia typically presents with peripheral neuropathy, fatigue, leg stiffness, ataxia, memory impairment, and depression.
Choice C (Paroxysmal nocturnal hemoglobinuria (PNH)) is incorrect. Paroxysmal nocturnal hemoglobinuria presents with intermittent dark colored urine in the morning.
Choice D (Iron deficiency anemia) is incorrect. Iron deficiency anemia is associated with decreased Hgb, hematocrit, serum Fe, ferritin, transferrin saturation, and MCV, increased TIBC and RDW.

AUTOIMMUNE HEMOLYTIC ANEMIA
Hemolytic anemia
Ax: 
Warm autoimmune hemolytic anemias are idiopathic or associated with autoimmune processes (SLE), drugs, lymphoproliferative disorders (CLL).

1) I have updated TEXTDATA.TXT. I am trying to get the text between the last "Choice X" and "Ax:". Is there a simple trick for this? My code looks like this:

var string = string.toString().substring(fileContent.indexOf("Choice E") + 8, string.indexOf("Cx:") - 3);

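It does not quite work for the last choice, because here the last choice is D ("Choice D"), not E.

2) I only need Title="AUTOIMMUNE HEMOLYTIC ANEMIA" and Subtitle="Hemolytic anemia" from the TEXTDATA.TXT file. If I could get the content between the last "Choice X" and "Ax:", that would be perfect.

Code: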

var ifdtdata = string.toString().substring(string.indexOf("Choice E") + 8, string.indexOf("Cx:") - 3);
titleifdt = /(?:\r?\n){2}([A-Z].*)/.exec(ifdtdata);
subifdt = /(?:\r?\n){2}([A-Z].*)\r?\n(.*)/.exec(ifdtdata);
ifdtdata = ifdtdata.replace(/[^a-z0-9 ,.?!]/ig, '');
if(valUndefinedNull(subifdt) == false){
       subifdt = /([A-Z0-9 ]*[A-Z]{2,}?)([A-Z][a-z]+[^.]*)/.exec(ifdtdata);
}
if(valUndefinedNull(titleifdt) == false){
       titleifdt = /([A-Z0-9 ]*[A-Z]{2,}?)([A-Z][a-z]+[^.]*)/.exec(ifdtdata);
}

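I think you need the second "meaningful" line of the content. You can split the content with a regex that matches any type of line break and take only the second element. Since the line breaks may contain \r characters, I suggest the following sample code: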

var s = "TITLE X (CD55 and CD59 markers) are positive in paroxysmal nocturnal hemoglobinuria (PNH).\n\nAUTOIMMUNE HEMOLYTIC ANEMIA\nHemolytic anemia\n\nTITLE Z: Warm autoimmune hemolytic anemias are idiopathic or associated with autoimmune processes (SLE)";
var arr = s.replace(/^\s*|\s*$/g, '').split(/[\r\n]+/);
document.write(arr[1]);

The .replace(/^\s*|\s*$/g, '') call trims the input, and .split(/[\r\n]+/) splits the content into separate lines, whether the text file uses Windows, Linux, or MacOS line endings.
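
To illustrate the trim-and-split step on input with Windows line endings, here is a minimal sketch. The helper name splitLines and the sample text are made up for this example, and String.prototype.trim() stands in for the .replace() call above.

```javascript
// Hypothetical helper (not from the question): trims the text and splits it
// on any run of \r and/or \n characters.
function splitLines(text) {
  return text.trim().split(/[\r\n]+/);
}

// Made-up sample with Windows (\r\n) line endings and one blank line.
var windowsText = "First paragraph.\r\n\r\nAUTOIMMUNE HEMOLYTIC ANEMIA\r\nHemolytic anemia\r\n";
var lines = splitLines(windowsText);

console.log(lines[1]); // AUTOIMMUNE HEMOLYTIC ANEMIA
console.log(lines[2]); // Hemolytic anemia
```

Because [\r\n]+ consumes a whole run of line-break characters, blank lines disappear from the result, which is why the title ends up at index 1 rather than 2.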

If you need the first line that follows a double line break and starts with an uppercase letter, use

var s = "TITLE X (CD55 and CD59 markers) are positive in paroxysmal nocturnal hemoglobinuria (PNH).\n\nAUTOIMMUNE HEMOLYTIC ANEMIA\nHemolytic anemia\n\nTITLE Z: Warm autoimmune hemolytic anemias are idiopathic or associated with autoimmune processes (SLE)";
var m = /(?:\r?\n){2}([A-Z].*)/.exec(s);
if (m !== null)
  document.write(m[1]);

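Here, the regex matches:

(?:\r?\n){2} - two line breaks
([A-Z].*) - a line that starts with an uppercase letter [A-Z], followed by as many characters other than line breaks as possible (greedy). This value will be in m[1]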
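
As a quick check of how the two parts behave, the sketch below runs the same pattern against LF and CRLF input, and against input where the first blank line is followed by a lowercase line. The function name firstTitle and the sample strings are assumptions made for this illustration.

```javascript
// Same pattern as above: two line breaks, then a line starting with A-Z.
var titleRe = /(?:\r?\n){2}([A-Z].*)/;

// Hypothetical wrapper: returns the captured line, or null when nothing matches.
function firstTitle(text) {
  var m = titleRe.exec(text);
  return m === null ? null : m[1];
}

var lf = firstTitle("intro text\n\nAUTOIMMUNE HEMOLYTIC ANEMIA\nHemolytic anemia");
var crlf = firstTitle("intro text\r\n\r\nAUTOIMMUNE HEMOLYTIC ANEMIA\r\nHemolytic anemia");
var skipped = firstTitle("intro\n\nlowercase line\n\nTitle line\nrest");
var none = firstTitle("single line only");

console.log(lf);      // AUTOIMMUNE HEMOLYTIC ANEMIA
console.log(crlf);    // AUTOIMMUNE HEMOLYTIC ANEMIA
console.log(skipped); // Title line
console.log(none);    // null
```

In JavaScript the dot does not match \r or \n, so the captured group stops at the end of the line even with Windows line endings.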

Update

To find the subtitle as well, use

var s = "TITLE X (CD55 and CD59 markers) are positive in paroxysmal nocturnal hemoglobinuria (PNH).\n\nAUTOIMMUNE HEMOLYTIC ANEMIA\nHemolytic anemia\n\nTITLE Z: Warm autoimmune hemolytic anemias are idiopathic or associated with autoimmune processes (SLE)";
var m = /(?:\r?\n){2}([A-Z].*)\r?\n(.*)/.exec(s);
if (m !== null){
  document.write("Title: " + m[1] + "<br/>Subtitle: " + m[2]);
}
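
For the other half of the question (locating the block after the last "Choice X" line without hard-coding the letter), one possible sketch follows. It assumes every choice explanation sits on a single line that starts with "Choice " plus a capital letter, and that the block ends at a marker such as "Ax:". The function name extractTitleAndSubtitle, the endMarker parameter, and the shortened sample are all hypothetical.

```javascript
// Hypothetical helper: returns the first two non-empty lines found between
// the last "Choice X ..." line and endMarker.
function extractTitleAndSubtitle(fileContent, endMarker) {
  var choiceRe = /^Choice [A-Z] .*$/gm;
  var start = 0;
  var m;
  // Walk over every "Choice X" line and remember where the last one ends.
  while ((m = choiceRe.exec(fileContent)) !== null) {
    start = choiceRe.lastIndex;
  }
  // Assumption: if the marker is missing, read to the end of the text.
  var end = fileContent.indexOf(endMarker, start);
  if (end === -1) {
    end = fileContent.length;
  }
  var lines = fileContent.slice(start, end).trim().split(/[\r\n]+/);
  return { title: lines[0] || null, subtitle: lines[1] || null };
}

// Shortened, made-up version of TEXTDATA.TXT.
var sample = [
  "Choice C (Paroxysmal nocturnal hemoglobinuria (PNH)) is incorrect.",
  "Choice D (Iron deficiency anemia) is incorrect.",
  "",
  "AUTOIMMUNE HEMOLYTIC ANEMIA",
  "Hemolytic anemia",
  "Ax: ",
  "Warm autoimmune hemolytic anemias are idiopathic."
].join("\n");

var result = extractTitleAndSubtitle(sample, "Ax:");
console.log(result.title);    // AUTOIMMUNE HEMOLYTIC ANEMIA
console.log(result.subtitle); // Hemolytic anemia
```

Looping over all matches avoids the indexOf("Choice E") problem, since it works whether the last choice is D, E, or any other letter.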

Here I just split on new lines and take the third line (indexing starts at 0, so the third line is index 2):

var title = fileContent.split("\n")[2];
console.log(title);
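
One caveat with splitting on "\n" alone: if the file has Windows line endings, each piece keeps a trailing \r. The sketch below compares it with splitting on /\r?\n/; the sample text and variable names are made up for the illustration.

```javascript
// Made-up content with Windows (\r\n) line endings; index 1 is the blank line.
var crlfContent = "Choice D (Iron deficiency anemia) is incorrect.\r\n\r\nAUTOIMMUNE HEMOLYTIC ANEMIA\r\nHemolytic anemia";

var naiveTitle = crlfContent.split("\n")[2];      // keeps the trailing "\r"
var robustTitle = crlfContent.split(/\r?\n/)[2];  // no trailing "\r"

console.log(JSON.stringify(naiveTitle));  // "AUTOIMMUNE HEMOLYTIC ANEMIA\r"
console.log(JSON.stringify(robustTitle)); // "AUTOIMMUNE HEMOLYTIC ANEMIA"
```

Either way, the fixed index 2 only holds while the title is always the third line of the text being split.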

I would just match a line that is all uppercase and take only the first match: /^[A-Z\W 0-9]{3,}$/m
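
A small sketch of how that pattern could be applied is shown below. The function name firstUppercaseLine and the sample text are assumptions. Note that \W also matches line-break characters, so the match can start with the newline of a preceding blank line; the sketch trims the result for that reason.

```javascript
// Hypothetical wrapper around the pattern above: returns the first
// all-uppercase line (letters, digits, spaces, punctuation), or null.
function firstUppercaseLine(text) {
  var m = /^[A-Z\W 0-9]{3,}$/m.exec(text);
  // trim() drops a leading line break that \W may have picked up.
  return m === null ? null : m[0].trim();
}

// Shortened, made-up version of the text from the question.
var textData = "Choice D (Iron deficiency anemia) is incorrect.\n\nAUTOIMMUNE HEMOLYTIC ANEMIA\nHemolytic anemia\nAx: \nWarm autoimmune hemolytic anemias are idiopathic.";

var heading = firstUppercaseLine(textData);
console.log(heading); // AUTOIMMUNE HEMOLYTIC ANEMIA
console.log(firstUppercaseLine("only lowercase words here")); // null
```

This approach finds the title only; the subtitle would still need one of the line-based techniques above.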