Python MapReduce in Spark

Keywords: mapreduce, spark, Python. Updated: 2023-09-26

I have some text, and I have to count the occurrences of certain words (for example John and Marry) using Hadoop.

In JavaScript I can write it like this:

require('timothy').map(function(line){
        emit("count", 1);
        if(new RegExp("john", "i").test(line))     emit("John", 1);
        if(new RegExp("marry", "i").test(line))    emit("Marry", 1);
    }).reduce(function(key, values){
        var result = 0;
        values.forEach(function(value){
            result += +value;
        });
        emit(key, result);
}).run();

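I apply the map function to every line and emit a record for each match. Now I want to do the same thing with Spark, but I have to write it in Python. This is the code I have so far: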

import sys
import re
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        sys.exit(-1)
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(sys.argv[1], 1)

    def map_line(line):
        # The map function body must go here.
        pass

    counts = lines.map(map_line).reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))
    sc.stop()

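My problem is that the map function can return only one (key, value) pair per line. How can I emit several pairs, as in the first example? Thank you.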

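If your question is how to emit multiple values in the map phase, the answer is to use the flatMap operator with a function that returns a sequence of values instead of a single value. The flatMap transformation flattens each returned sequence into individual records. For example: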

# Here spark is assumed to refer to the SparkContext (sc in the question).
file = spark.textFile("file://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

line.split(" ") returns a list of strings, and flatMap turns each element of that list into a separate record.
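
Applied to the question's case, the same idea might look like the sketch below. It is only an illustration under a few assumptions: the names PATTERNS, emit_pairs and run are hypothetical, and the keys and patterns simply mirror the JavaScript example. The function returns a list of (key, 1) pairs for each line, and flatMap flattens those lists before reduceByKey adds the values up.

```python
import re
from operator import add

# Hypothetical table of (output key, pattern), mirroring the JavaScript example.
PATTERNS = [
    ("John", re.compile("john", re.IGNORECASE)),
    ("Marry", re.compile("marry", re.IGNORECASE)),
]


def emit_pairs(line):
    """Return all (key, 1) pairs for one line, like repeated emit() calls."""
    pairs = [("count", 1)]  # one pair per line, as in emit("count", 1)
    for key, pattern in PATTERNS:
        if pattern.search(line):
            pairs.append((key, 1))
    return pairs


def run(path):
    """Count matches in the text file at path; requires a Spark installation."""
    from pyspark import SparkContext  # imported here so emit_pairs works without Spark

    sc = SparkContext(appName="PythonWordCount")
    try:
        lines = sc.textFile(path)
        # flatMap flattens each returned list into separate (key, 1) records.
        return lines.flatMap(emit_pairs).reduceByKey(add).collect()
    finally:
        sc.stop()
```

In a script submitted with spark-submit, run(sys.argv[1]) would return the list of (key, total) pairs, which can then be printed as in the question. Like the JavaScript version, this counts matching lines rather than individual occurrences within a line.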