Is there a way to measure string similarity in Google BigQuery?
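I'm wondering if anyone knows of a way to measure string similarity in BigQuery.
It seems like it would be a neat function to have.
My case is that I need to compare the similarity of two URLs to be fairly sure they refer to the same article.
I can find examples using JavaScript, so maybe a UDF is the way to go, but I have not used UDFs at all (or JavaScript, for that matter :)
Just wondering if there is a way to do this with the existing regex functions, or if anyone could get me started on porting the JavaScript example to a UDF.
Any help is much appreciated, thanks.
EDIT: Adding some example code
So if I have a UDF defined as: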
// distance function
function levenshteinDistance (row, emit) {
  //if (row.inputA.length <= 0 ) {var myresult = row.inputB.length};
  if (typeof row.inputA === 'undefined') {var myresult = 1};
  if (typeof row.inputB === 'undefined') {var myresult = 1};
  //if (row.inputB.length <= 0 ) {var myresult = row.inputA.length};
  var myresult = Math.min(
    levenshteinDistance(row.inputA.substr(1), row.inputB) + 1,
    levenshteinDistance(row.inputB.substr(1), row.inputA) + 1,
    levenshteinDistance(row.inputA.substr(1), row.inputB.substr(1)) + (row.inputA[0] !== row.inputB[0] ? 1 : 0)
  ) + 1;
  emit({outputA: myresult})
}
bigquery.defineFunction(
  'levenshteinDistance', // Name of the function exported to SQL
  ['inputA', 'inputB'], // Names of input columns
  [{'name': 'outputA', 'type': 'integer'}], // Output schema
  levenshteinDistance // Reference to JavaScript UDF
);
// make a test function to test individual parts
function test(row, emit) {
  if (row.inputA.length <= 0) { var x = row.inputB.length} else { var x = row.inputA.length};
  emit({outputA: x});
}
bigquery.defineFunction(
  'test', // Name of the function exported to SQL
  ['inputA', 'inputB'], // Names of input columns
  [{'name': 'outputA', 'type': 'integer'}], // Output schema
  test // Reference to JavaScript UDF
);
Any query I try for testing, such as:
SELECT outputA FROM (levenshteinDistance(SELECT "abc" AS inputA, "abd" AS inputB))
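I get the error:
Error: TypeError: Cannot read property 'substr' of undefined at line 11, columns 38-39 Error Location: User-defined function
It seems that row.inputA is not a string, or that for some reason the string functions cannot work on it. I'm not sure whether this is a type issue or something quirky about the utilities a UDF can use by default.
Thanks again for any help.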
Ready-to-use shared UDFs - Levenshtein distance:
SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
, fhoffa.x.levenshtein('googgle', 'goggles')
, fhoffa.x.levenshtein('is this the', 'Is This The')
6 2 0
Soundex:
SELECT fhoffa.x.soundex('felipe')
, fhoffa.x.soundex('googgle')
, fhoffa.x.soundex('guugle')
F410 G240 G240
Fuzzy choose one:
SELECT fhoffa.x.fuzzy_extract_one('jony'
, (SELECT ARRAY_AGG(name)
FROM `fh-bigquery.popular_names.gender_probabilities`)
#, ['john', 'johnny', 'jonathan', 'jonas']
)
johnny
How-to:
- https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83
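If you are familiar with Python, you can use the functions defined by fuzzywuzzy in BigQuery by using an external library loaded from GCS.
Steps:
- Download the JavaScript version of fuzzywuzzy (fuzzball)
- Take the compiled file of the library, dist/fuzzball.umd.min.js, and rename it to something clearer (such as fuzzball)
- Upload it to a Google Cloud Storage bucket
- Create a temp function to use the lib in your query (set the path in OPTIONS to the relevant path)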
CREATE TEMP FUNCTION token_set_ratio(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
return fuzzball.token_set_ratio(a, b);
"""
OPTIONS (
library="gs://my-bucket/fuzzball.js");
with data as (select "my_test_string" as a, "my_other_string" as b)
SELECT a, b, token_set_ratio(a, b) from data
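Implementing Levenshtein in JS would be one option. You can use the algorithm to get the absolute string distance, or convert it to a percentage similarity by simply calculating abs((strlen - distance) / strlen).
The easiest way to do this is to define a Levenshtein UDF that takes two inputs, a and b, and computes the distance between them. The function can return a, b, and the distance.
To call it, you pass the two URLs in as columns aliased as "a" and "b":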
SELECT a, b, distance
FROM
Levenshtein(
SELECT
some_url AS a, other_url AS b
FROM
your_table
)
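As a rough sketch of what the body of such a UDF could look like, the JavaScript below computes the Levenshtein distance between two plain strings with the usual two-row dynamic-programming table and then derives a similarity ratio from it. The names `levenshtein` and `similarity` are illustrative, and dividing by the length of the longer string is an assumption about how the percentage should be normalised. The functions take strings rather than `row` objects, so inside a legacy UDF they would be called as `levenshtein(row.a, row.b)` and the result handed to `emit`.

```javascript
// Levenshtein distance between two plain strings (iterative, two rows).
function levenshtein(a, b) {
  // prev[j] is the distance between the processed prefix of a
  // and the first j characters of b.
  var prev = [];
  for (var j = 0; j <= b.length; j++) prev[j] = j;
  for (var i = 1; i <= a.length; i++) {
    var curr = [i];
    for (var k = 1; k <= b.length; k++) {
      var cost = a.charAt(i - 1) === b.charAt(k - 1) ? 0 : 1;
      curr[k] = Math.min(
        prev[k] + 1,        // deletion
        curr[k - 1] + 1,    // insertion
        prev[k - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1], where 1 means the strings are identical.
// Normalising by the longer string is an assumption, not a fixed rule.
function similarity(a, b) {
  var longest = Math.max(a.length, b.length);
  if (longest === 0) return 1;
  return (longest - levenshtein(a, b)) / longest;
}
```

Keeping the distance function separate from the UDF wrapper avoids recursing on the wrapper itself, which expects a row and an emit callback rather than two strings.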
Below is a simple version of the Hamming distance that uses WITH OFFSET instead of ROW_NUMBER() OVER()
#standardSQL
WITH Input AS (
SELECT 'abcdef' AS strings UNION ALL
SELECT 'defdef' UNION ALL
SELECT '1bcdef' UNION ALL
SELECT '1bcde4' UNION ALL
SELECT '123de4' UNION ALL
SELECT 'abc123'
)
SELECT 'abcdef' AS target, strings,
(SELECT COUNT(1)
FROM UNNEST(SPLIT('abcdef', '')) a WITH OFFSET x
JOIN UNNEST(SPLIT(strings, '')) b WITH OFFSET y
ON x = y AND a != b) hamming_distance
FROM Input
I couldn't find a direct answer to this, so I came up with this solution in standard SQL
#standardSQL
CREATE TEMP FUNCTION HammingDistance(a STRING, b STRING) AS (
(
SELECT
SUM(counter) AS diff
FROM (
SELECT
CASE
WHEN X.value != Y.value THEN 1
ELSE 0
END AS counter
FROM (
SELECT
value,
ROW_NUMBER() OVER() AS row
FROM
UNNEST(SPLIT(a, "")) AS value ) X
JOIN (
SELECT
value,
ROW_NUMBER() OVER() AS row
FROM
UNNEST(SPLIT(b, "")) AS value ) Y
ON
X.row = Y.row )
)
);
WITH Input AS (
SELECT 'abcdef' AS strings UNION ALL
SELECT 'defdef' UNION ALL
SELECT '1bcdef' UNION ALL
SELECT '1bcde4' UNION ALL
SELECT '123de4' UNION ALL
SELECT 'abc123'
)
SELECT strings, 'abcdef' as target, HammingDistance('abcdef', strings) as hamming_distance
FROM Input;
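Compared with other solutions (such as this one), it takes two strings (of the same length, following the definition of Hamming distance) and outputs the expected distance.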
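For comparison, the same position-by-position count can be sketched in JavaScript, which could also serve as the body of a JS UDF. `hammingDistance` is a hypothetical name, and throwing on unequal lengths is an assumption about how to enforce the equal-length requirement; the SQL version does not check it.

```javascript
// Hamming distance: number of positions at which two equal-length strings differ.
function hammingDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error('Hamming distance needs strings of equal length');
  }
  var diff = 0;
  for (var i = 0; i < a.length; i++) {
    if (a.charAt(i) !== b.charAt(i)) diff++;
  }
  return diff;
}

// Same inputs as the query above, compared against 'abcdef'.
var target = 'abcdef';
var inputs = ['abcdef', 'defdef', '1bcdef', '1bcde4', '123de4', 'abc123'];
var distances = inputs.map(function (s) { return hammingDistance(target, s); });
console.log(distances.join(' ')); // 0 3 1 2 4 3
```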
This is how I did it:
CREATE TEMP FUNCTION trigram_similarity(a STRING, b STRING) AS (
(
WITH a_trigrams AS (
SELECT
DISTINCT tri_a
FROM
unnest(ML.NGRAMS(SPLIT(LOWER(a), ''), [3,3])) AS tri_a
),
b_trigrams AS (
SELECT
DISTINCT tri_b
FROM
unnest(ML.NGRAMS(SPLIT(LOWER(b), ''), [3,3])) AS tri_b
)
SELECT
COUNTIF(tri_b IS NOT NULL) / COUNT(*)
FROM
a_trigrams
LEFT JOIN b_trigrams ON tri_a = tri_b
)
);
Here is a comparison with Postgres's pg_trgm:
select trigram_similarity('saemus', 'seamus');
-- 0.25 vs. pg_trgm 0.272727
select trigram_similarity('shamus', 'seamus');
-- 0.5 vs. pg_trgm 0.4
I gave the same answer on "How to perform trigram operations in Google BigQuery?"
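The gap between these numbers and pg_trgm appears to come from two differences, which the JavaScript sketch below tries to make concrete: the SQL function divides the shared trigrams by the trigrams of `a` only, whereas pg_trgm pads each word (two spaces in front, one behind) and divides by the union of both trigram sets. The function names are illustrative, and the padded version is a simplified single-word approximation of pg_trgm, not its actual implementation.

```javascript
// Distinct character trigrams of a lowercased string.
function trigrams(s) {
  var set = {};
  var t = s.toLowerCase();
  for (var i = 0; i + 3 <= t.length; i++) set[t.slice(i, i + 3)] = true;
  return Object.keys(set);
}

// Counts how many items of xs also appear in ys.
function countShared(xs, ys) {
  return xs.filter(function (x) { return ys.indexOf(x) !== -1; }).length;
}

// Mirrors the SQL function: shared trigrams divided by the trigrams of a.
function trigramSimilarity(a, b) {
  var ta = trigrams(a), tb = trigrams(b);
  return countShared(ta, tb) / ta.length;
}

// pg_trgm-style approximation for single words: pad, then shared / union.
function paddedTrigramSimilarity(a, b) {
  var ta = trigrams('  ' + a + ' '), tb = trigrams('  ' + b + ' ');
  var shared = countShared(ta, tb);
  return shared / (ta.length + tb.length - shared);
}

console.log(trigramSimilarity('saemus', 'seamus'));       // 0.25
console.log(paddedTrigramSimilarity('saemus', 'seamus')); // 3/11, about 0.2727
console.log(trigramSimilarity('shamus', 'seamus'));       // 0.5
console.log(paddedTrigramSimilarity('shamus', 'seamus')); // 0.4
```

Note that the unpadded ratio is not symmetric: swapping `a` and `b` can change the result whenever the two strings have a different number of distinct trigrams.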
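While I was looking for Felipe's answer above, I worked on my own query and ended up with two versions, one I call string approximation and the other string resemblance.
The first looks at the shortest distance between the letters of the source string and the test string and returns a score between 0 and 1, where 1 is a complete match. It always scores against the longer of the two strings. It turns out to return results similar to the Levenshtein distance.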
#standardSql
CREATE OR REPLACE FUNCTION `myproject.func.stringApproximation`(sourceString STRING, testString STRING) AS (
(select avg(best_result) from (
select if(length(testString)<length(sourceString), sourceoffset, testoffset) as ref,
case
when min(result) is null then 0
else 1 / (min(result) + 1)
end as best_result,
from (
select *,
if(source = test, abs(sourceoffset - (testoffset)),
greatest(length(testString),length(sourceString))) as result
from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
cross join
(select *
from unnest(split(lower(testString),'')) as test with offset as testoffset)
) as results
group by ref
)
)
);
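The second is a variation of the first that looks at sequences of matching distances, so that a character matching at the same distance as the character before or after it counts as one point. This works quite well, better than string approximation, but not quite as well as I would like (see the example output below).
#standardSQL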
CREATE OR REPLACE FUNCTION `myproject.func.stringResemblance`(sourceString STRING, testString STRING) AS (
(
select avg(sequence)
from (
select ref,
if(array_length(array(select * from comparison.collection intersect distinct
(select * from comparison.before))) > 0
or array_length(array(select * from comparison.collection intersect distinct
(select * from comparison.after))) > 0
, 1, 0) as sequence
from (
select ref,
collection,
lag(collection) over (order by ref) as before,
lead(collection) over (order by ref) as after
from (
select if(length(testString) < length(sourceString), sourceoffset, testoffset) as ref,
array_agg(result ignore nulls) as collection
from (
select *,
if(source = test, abs(sourceoffset - (testoffset)), null) as result
from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
cross join
(select *
from unnest(split(lower(testString),'')) as test with offset as testoffset)
) as results
group by ref
)
) as comparison
)
)
);
Here is an example of the results:
#standardSQL
with test_subjects as (
select 'benji' as name union all
select 'benjamin' union all
select 'benjamin alan artis' union all
select 'ben artis' union all
select 'artis benjamin'
)
select name, quick.stringApproximation('benjamin artis', name) as approximation, quick.stringResemblance('benjamin artis', name) as resemblance
from test_subjects
order by resemblance desc
This returns
+---------------------+---------------------+---------------------+
| name                | approximation       | resemblance         |
+---------------------+---------------------+---------------------+
| artis benjamin      | 0.2653061224489796  | 0.8947368421052629  |
+---------------------+---------------------+---------------------+
| benjamin alan artis | 0.6078947368421053  | 0.8947368421052629  |
+---------------------+---------------------+---------------------+
| ben artis           | 0.4142857142857142  | 0.7142857142857143  |
+---------------------+---------------------+---------------------+
| benjamin            | 0.6125850340136053  | 0.5714285714285714  |
+---------------------+---------------------+---------------------+
| benji               | 0.36269841269841263 | 0.28571428571428575 |
+---------------------+---------------------+---------------------+
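EDIT: updated the resemblance algorithm to improve results.
Try Flookup for Google Sheets... it is definitely faster than Levenshtein distance, and it calculates percentage similarity out of the box. One Flookup function you might find useful is:
FUZZYMATCH (string1, string2)
Parameter details
- string1: compared to string2
- string2: compared to string1
The percentage similarity is then calculated based on these comparisons. Both parameters can be ranges.
I am currently trying to optimise it for large datasets, so your feedback is very welcome.
Edit: I am the creator of Flookup.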