Bigram

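A bigram (also called a digram) is a sequence of two adjacent units: two letters, two syllables, or two words. Bigrams are very widely used in the statistical analysis of text.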

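Introduction

Given a preceding word, bigrams help compute the probability that a particular word follows it. This is an application of conditional probability: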

$P(W_{n}|W_{n-1})={P(W_{n-1},W_{n}) \over P(W_{n-1})}$

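That is, the probability of a word $W_{n}$ given the preceding word $W_{n-1}$, written $P(W_{n}|W_{n-1})$, equals the probability of the two words occurring together as a bigram, $P(W_{n-1},W_{n})$, divided by the probability of the preceding word, $P(W_{n-1})$.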
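The formula above can be illustrated with a minimal sketch that estimates the conditional probability from raw counts in a token list. The function name `bigram_probability` and the toy sentence are assumptions made for illustration, and the ratio of counts is only the simplest (unsmoothed, maximum-likelihood) way to approximate the two probabilities in the formula.

```python
from collections import Counter

def bigram_probability(tokens, prev_word, word):
    """Estimate P(word | prev_word) as count(prev_word, word) / count(prev_word)."""
    # Count only words that have a successor, so both counts share one denominator.
    prev_counts = Counter(tokens[:-1])
    # Count each pair of adjacent words.
    pair_counts = Counter(zip(tokens, tokens[1:]))
    if prev_counts[prev_word] == 0:
        return 0.0  # the preceding word never occurs, so no estimate is possible
    return pair_counts[(prev_word, word)] / prev_counts[prev_word]

# Hypothetical toy corpus, used only to exercise the function.
tokens = "the cat sat on the mat the cat ran".split()
print(bigram_probability(tokens, "the", "cat"))
```

In this toy corpus "the" is followed by another word three times, twice by "cat", so the estimate is 2/3. A real language model would typically add smoothing so that unseen word pairs do not receive probability zero.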

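Gappy bigrams, also called skipping bigrams, are word pairs that allow gaps between the two words (for example, to avoid connecting adjacent words, or to allow some simulation of dependencies, as in a dependency grammar).

Head word bigrams are gappy bigrams with an explicit dependency relationship.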
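As a rough illustration of gappy bigrams, the sketch below enumerates word pairs with a bounded number of skipped words between them. The function name `skip_bigrams` and the `max_gap` parameter are hypothetical, chosen for this example. The sketch does not model the dependency relationships that head word bigrams require.

```python
def skip_bigrams(tokens, max_gap):
    """Return ordered word pairs with at most max_gap words skipped between them."""
    pairs = []
    for i, first in enumerate(tokens):
        # The partner word may sit up to max_gap + 1 positions to the right.
        last = min(i + max_gap + 2, len(tokens))
        for j in range(i + 1, last):
            pairs.append((first, tokens[j]))
    return pairs

# Hypothetical example sentence.
print(skip_bigrams(["the", "big", "dog", "barks"], max_gap=1))
```

With `max_gap=0` the function returns only ordinary adjacent bigrams; larger values add pairs that skip over intervening words.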

Applications

Bigrams are used in one of the most successful language models for speech recognition.^[1] They are a special case of the n-gram.

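The term is also used in cryptography, where bigram frequency attacks are sometimes used in attempts to break a ciphertext. See frequency analysis.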

Bigram frequency in English

According to statistics from a small English corpus, the frequencies of the most common letter bigrams are:^[2]

th 1.52%       en 0.55%       ng 0.18%
he 1.28%       ed 0.53%       of 0.16%
in 0.94%       to 0.52%       al 0.09%
er 0.94%       it 0.50%       de 0.09%
an 0.82%       ou 0.50%       se 0.08%
re 0.68%       ea 0.47%       le 0.08%
nd 0.63%       hi 0.46%       sa 0.06%
at 0.59%       is 0.46%       si 0.05%
on 0.57%       or 0.43%       ar 0.04%
nt 0.56%       ti 0.34%       ve 0.04%
ha 0.56%       as 0.33%       ra 0.04%
es 0.56%       te 0.27%       ld 0.02%
st 0.55%       et 0.19%       ur 0.02%
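A table of this kind can be produced by counting adjacent letter pairs in a corpus. The following is a minimal sketch, assuming that bigrams are counted only within words, that case is ignored, and that each frequency is normalized by the total number of bigrams found. The cited source may use a different normalization, and the function name `letter_bigram_frequencies` is hypothetical.

```python
from collections import Counter

def letter_bigram_frequencies(text):
    """Return the relative frequency of each adjacent letter pair within words."""
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only; punctuation and digits are dropped (a simplification).
        letters = [c for c in word if c.isalpha()]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}

# Hypothetical sample text; a real count would use a large corpus.
freqs = letter_bigram_frequencies("the then there")
for pair, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{pair} {freq:.2%}")
```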

Complete bigram frequencies drawn from a larger corpus are also available.^[3]

References

[1] Collins, Michael John. A new statistical parser based on bigram lexical dependencies. Association for Computational Linguistics: 184–191. 1996-06-24 [2018-10-09]. doi:10.3115/981863.981888. (Archived from the original on 2018-10-08.)
[2] Cornell Math Explorer's Project – Substitution Ciphers. [2011-03-22]. (Archived from the original on 2011-06-05.)
[3] Jones, Michael N; D J K Mewhort. Case-sensitive letter and bigram frequency counts from large-scale English corpora. Behavior Research Methods, Instruments, and Computers. August 2004, 36 (3): 388–396. ISSN 0743-3808. PMID 15641428.
