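Requirements:
Pick any English text of roughly 1000 words:
1. Using the list comprehension techniques covered in class, store the text word by word as a list.
2. Classify the words in that list by first letter, so that the letters "a" to "z" are stored as 26 lists.
3. Final output:
- For each letter "a" to "z", the number of words in that category (repeated words are not counted).
- The 10 most frequent words in the text and their frequencies.
Design and implementation:
1. Preparing and reading the text:
The English text of "I Have a Dream" is saved as a plain-text file. The re module is imported, the text is read, and a regular expression is used to match the words and store them in a list. (The matching could also be written by hand, but since that is not the focus of this experiment and would complicate the project, the re module is used directly.)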
import re

# Read the whole speech into a single string.
with open('Data_List_resolution_I Have a Dream.txt', 'r') as file:
    file_contents = file.read()
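2. List comprehension
Apply a list comprehension to the string holding the text, so that the text is stored word by word as a list. This is the focus of the experiment:
words = [word for word in re.findall(r'\w+', file_contents)]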
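To see what the pattern actually captures, here is a small sketch on a made-up sentence (the sample string and both function names are hypothetical, not part of the original program). It shows that `\w+` also matches digits and underscores and splits contractions at the apostrophe. The second function is one possible letters-only alternative, offered as an assumption rather than as the method used in this report.

```python
import re

# Hypothetical sample text; not taken from the speech file.
sample = "I have a dream. It's 1963, and we can't wait!"

def tokenize(text):
    # Same pattern as in the report: \w+ matches runs of letters,
    # digits and underscores, so "It's" becomes "It" and "s".
    return [word for word in re.findall(r'\w+', text)]

def tokenize_letters(text):
    # Assumed alternative: letters only, keeping inner apostrophes,
    # so "It's" stays whole and "1963" is dropped.
    return [word for word in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)]

print(tokenize(sample))
print(tokenize_letters(sample))
```

Which behavior is preferable depends on whether numbers and contractions should count as words for the assignment.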
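3. Classifying by letter
Create a dictionary to hold the 26 lists. A dictionary comprehension generates the keys, using chr() to convert the code points 97 to 122 into the letters "a" to "z" so they can be matched later. Each word is taken from the word list, its first character is lowercased (so matching is case-insensitive) and compared with the dictionary keys, and the word is appended to the list stored under the matching key. Words whose first character is not a letter a-z (for example a digit) are skipped:
# One empty list per letter: chr(97) is 'a', chr(122) is 'z'.
words_list = {chr(i): [] for i in range(97, 123)}
for word in words:
    first_letter = word[0].lower()
    if first_letter in words_list:
        words_list[first_letter].append(word)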
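The grouping step can be sketched as a small reusable function and checked on a tiny hand-made list. The function name `group_by_first_letter` and the sample words are hypothetical; `string.ascii_lowercase` is used here as a readable stand-in for the `chr(i)` loop and yields the same 26 keys.

```python
import string

def group_by_first_letter(words):
    # One empty list per lowercase letter 'a'..'z'.
    groups = {letter: [] for letter in string.ascii_lowercase}
    for word in words:
        first = word[0].lower()
        # Words starting with a digit or underscore have no bucket and are skipped.
        if first in groups:
            groups[first].append(word)
    return groups

# Hypothetical sample; '1963' and '_x' do not start with a letter.
sample_words = ['Dream', 'day', 'a', 'Freedom', '1963', '_x']
groups = group_by_first_letter(sample_words)
print(groups['d'])
print(len(groups))
```

The original spelling of each word is kept in the bucket; only the first letter is lowercased for the lookup.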
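4. Printing the number of words per category
The word lists contain repeated words. Recalling from class that the elements of a set are unique, set() is applied to each list in the dictionary and the deduplicated result is stored back. Each key's list is then retrieved from the dictionary in turn (described in class as slicing the dictionary; strictly speaking this is a lookup by key), the number of words is computed with len(), and the count is printed:
# Remove repeated words from each letter's list.
for letter in words_list:
    words_list[letter] = list(set(words_list[letter]))
# Print the number of distinct words per letter.
for letter in words_list:
    count = len(words_list[letter])
    print(f'The number of words starting with {letter} is :{count}')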
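One detail worth noting: `set()` compares strings exactly, so "The" and "the" remain two separate entries after deduplication. If repeated words should be ignored regardless of capitalization, a minimal sketch is to lowercase inside a set comprehension. The function names and sample dictionary below are hypothetical, and whether this changes the counts for the speech file was not checked here.

```python
def unique_counts_exact(groups):
    # Mirrors the report's approach: exact string comparison.
    return {letter: len(set(ws)) for letter, ws in groups.items()}

def unique_counts_ignore_case(groups):
    # Assumed variant: lowercase each word before deduplicating.
    return {letter: len({w.lower() for w in ws}) for letter, ws in groups.items()}

# Hypothetical sample buckets.
sample_groups = {'t': ['The', 'the', 'to', 'To', 'table'], 'x': []}

print(unique_counts_exact(sample_groups))
print(unique_counts_ignore_case(sample_groups))
```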
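5. Printing the ten most frequent words and their frequencies
Create a dictionary that records the frequency of each word. sorted() is then applied to the dictionary's items(), which yields key-value pairs (each pair is a tuple), sorted in descending order by value, and the first ten pairs are kept in a list. Finally, each word and its frequency are taken from the list in order and printed together with their rank:
# Count occurrences, ignoring case.
word_freq = {}
for word in words:
    word = word.lower()
    if word in word_freq:
        word_freq[word] += 1
    else:
        word_freq[word] = 1
# Sort (word, count) pairs by count, highest first, and keep the top ten.
get_ten_words = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:10]
for i, (word, freq) in enumerate(get_ten_words, start=1):
    print(f'The ranking of the frequency of occurrence of "{word}" is {i},its frequency is {freq}')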
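The counting and ranking above can also be expressed with `collections.Counter` from the standard library, whose `most_common(n)` method returns the `n` highest counts. This is an alternative sketch, not the method used in the report; the function name `top_words` and the sample list are hypothetical.

```python
from collections import Counter

def top_words(words, n=10):
    # Count lowercased words and return the n most frequent (word, count) pairs.
    return Counter(word.lower() for word in words).most_common(n)

# Hypothetical sample list.
sample_words = ['The', 'dream', 'the', 'of', 'Dream', 'the', 'of']

for rank, (word, freq) in enumerate(top_words(sample_words, 2), start=1):
    print(f'{rank}: "{word}" appears {freq} times')
```

With equal counts, `most_common` keeps the words in the order they were first encountered, which matches the behavior of the `sorted()` call in the report.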
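Testing:
Running the complete program above reads the text, processes it, and prints the output below, which was checked and found to be correct: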
The number of words starting with a is :28
The number of words starting with b is :28
The number of words starting with c is :42
The number of words starting with d is :31
The number of words starting with e is :13
The number of words starting with f is :26
The number of words starting with g is :20
The number of words starting with h is :37
The number of words starting with i is :18
The number of words starting with j is :7
The number of words starting with k is :1
The number of words starting with l is :31
The number of words starting with m is :30
The number of words starting with n is :21
The number of words starting with o is :20
The number of words starting with p is :28
The number of words starting with q is :2
The number of words starting with r is :29
The number of words starting with s is :66
The number of words starting with t is :37
The number of words starting with u is :10
The number of words starting with v is :10
The number of words starting with w is :29
The number of words starting with x is :0
The number of words starting with y is :6
The number of words starting with z is :0
The ranking of the frequency of occurrence of "the" is 1,its frequency is 103
The ranking of the frequency of occurrence of "of" is 2,its frequency is 99
The ranking of the frequency of occurrence of "to" is 3,its frequency is 59
The ranking of the frequency of occurrence of "and" is 4,its frequency is 54
The ranking of the frequency of occurrence of "a" is 5,its frequency is 38
The ranking of the frequency of occurrence of "we" is 6,its frequency is 33
The ranking of the frequency of occurrence of "be" is 7,its frequency is 33
The ranking of the frequency of occurrence of "will" is 8,its frequency is 27
The ranking of the frequency of occurrence of "that" is 9,its frequency is 24
The ranking of the frequency of occurrence of "is" is 10,its frequency is 23
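Summary and analysis:
This experiment put what was learned into practice: set(), dictionary access, list comprehensions, and so on. It also showed how convenient Python is for processing text and sorting; built-in functions and library modules really do make programming more efficient. The underlying logic still deserves attention, though. For example, when sorted() orders the words by frequency, how does it handle key-value pairs with the same frequency? In this experiment "we" and "be" have the same frequency, so why is "we" ranked ahead of "be"? Answering that requires a closer look at how sorted() works internally. Python is a very good tool; understanding a tool is the way to master it and to apply it appropriately in different situations.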
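On the tie question: Python's `sorted()` is documented as stable, also with `reverse=True`, so items with equal keys keep their original relative order. Since a dictionary preserves insertion order (Python 3.7+), tied words come out in the order they first appeared in the text. That would be consistent with "we" first occurring before "be" in the speech, although this was not checked against the file. The sketch below uses a hypothetical word list to show the effect, plus one assumed way to break ties alphabetically instead.

```python
def count_words(words):
    # Same counting idea as in the report, written with dict.get().
    freq = {}
    for word in words:
        word = word.lower()
        freq[word] = freq.get(word, 0) + 1
    return freq

# Hypothetical sample: "we" is first seen before "be"; both occur twice.
sample_words = ['we', 'will', 'be', 'free', 'we', 'be']
freq = count_words(sample_words)

# Stable sort: ties stay in first-appearance order.
ranked = sorted(freq.items(), key=lambda x: x[1], reverse=True)
print(ranked)

# Assumed variant: descending count, then alphabetical order for ties.
ranked_alpha = sorted(freq.items(), key=lambda x: (-x[1], x[0]))
print(ranked_alpha)
```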