AI with Python – Solving Problems
In this section, we will work through some related problems.
Category Prediction
In a set of documents, not only the words themselves matter but also their category — that is, which text category a particular word belongs to. For example, we may want to predict whether a given sentence belongs to the category email, news, sports, computers, and so on. In the example below, we will use tf-idf to build feature vectors and find the category of documents, using data from sklearn's 20 Newsgroups dataset.
We need to import the necessary packages:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
Define the category map. We will use five different categories: Religion, Autos, Hockey, Electronics and Space.
category_map = {'talk.religion.misc':'Religion','rec.autos':'Autos',
'rec.sport.hockey':'Hockey','sci.electronics':'Electronics', 'sci.space': 'Space'}
Create the training set:
training_data = fetch_20newsgroups(subset = 'train',
categories = category_map.keys(), shuffle = True, random_state = 5)
Build the count vectorizer and extract the term counts:
vectorizer_count = CountVectorizer()
train_tc = vectorizer_count.fit_transform(training_data.data)
print("\nDimensions of training data:", train_tc.shape)
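To see what the count vectorizer actually produces, here is a minimal sketch on three toy documents (the documents below are invented for illustration; the real run above uses the newsgroup posts):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three invented documents standing in for the newsgroup posts.
docs = [
    'the puck slid across the ice',
    'the shuttle reached orbit',
    'the engine needs a new spark plug',
]

vec = CountVectorizer()
counts = vec.fit_transform(docs)   # sparse matrix: one row per document,
                                   # one column per vocabulary term
print(counts.shape)
print(sorted(vec.vocabulary_))     # the learned vocabulary
```

Note that the default tokenizer drops one-letter tokens, so 'a' does not appear in the vocabulary.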
Create the tf-idf transformer:
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)
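As a side note, scikit-learn's TfidfVectorizer combines the two steps above (counting and tf-idf weighting) into one; a quick sketch on toy documents shows the two routes agree:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ['hockey puck on ice',
        'space shuttle launch',
        'ice on the shuttle wing']

# Two-stage route: raw counts first, then tf-idf weighting.
two_stage = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(docs))

# One-stage route: TfidfVectorizer does both with the same defaults.
one_stage = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_stage.toarray(), one_stage.toarray()))
```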
Now, define the test data:
input_data = [
'Discovery was a space shuttle',
'Hindu, Christian, Sikh all are religions',
'We must have to drive safely',
'Puck is a disk made of rubber',
'Television, Microwave, Refrigerator all use electricity'
]
Train a multinomial Naive Bayes classifier on the tf-idf-transformed training data:
classifier = MultinomialNB().fit(train_tfidf, training_data.target)
Transform the input data using the count vectorizer:
input_tc = vectorizer_count.transform(input_data)
Now, transform the count-vectorized input using the tf-idf transformer:
input_tfidf = tfidf.transform(input_tc)
Predict the output categories:
predictions = classifier.predict(input_tfidf)
The output is generated as follows:
for sent, category in zip(input_data, predictions):
    print('\nInput Data:', sent, '\n Category:', \
        category_map[training_data.target_names[category]])
The category predictor generates the following output:
Dimensions of training data: (2755, 39297)
Input Data: Discovery was a space shuttle
Category: Space
Input Data: Hindu, Christian, Sikh all are religions
Category: Religion
Input Data: We must have to drive safely
Category: Autos
Input Data: Puck is a disk made of rubber
Category: Hockey
Input Data: Television, Microwave, Refrigerator all use electricity
Category: Electronics
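The same count → tf-idf → Naive Bayes chain can also be packaged as a scikit-learn Pipeline. A minimal sketch with invented training sentences and labels (not the newsgroup data):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented stand-ins for the newsgroup posts and their categories.
train_docs = [
    'the goalie stopped the puck',
    'the rocket reached low earth orbit',
    'the team won the hockey game',
    'astronauts boarded the space station',
]
train_labels = ['Hockey', 'Space', 'Hockey', 'Space']

model = Pipeline([
    ('count', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])
model.fit(train_docs, train_labels)

print(model.predict(['the puck hit the goalie']))
```

The Pipeline applies the same transforms to training and test text automatically, which avoids the manual transform calls used above.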
Gender Identifier
In this problem, we will train a classifier to identify gender (male or female) from a given name. We will use a heuristic to construct the feature vector and train the classifier, using the labelled names corpus that ships with NLTK. Here is the Python code for building the gender identifier:
Let us import the necessary packages:
import random
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy
from nltk.corpus import names
Now we need to extract the last N letters from the input word. These letters will act as features:
def extract_features(word, N = 2):
    last_n_letters = word[-N:]
    return {'feature': last_n_letters.lower()}
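Before training, it helps to see what the extractor returns; the function is repeated here so the check is self-contained:

```python
def extract_features(word, N=2):
    last_n_letters = word[-N:]
    return {'feature': last_n_letters.lower()}

print(extract_features('Alexander'))     # {'feature': 'er'}
print(extract_features('Alexander', 4))  # {'feature': 'nder'}
```

Each name is thus reduced to a single lower-cased suffix, which is what the Naive Bayes classifier will condition on.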
if __name__=='__main__':
Create the training data using the labelled names (male and female) available in NLTK:
    male_list = [(name, 'male') for name in names.words('male.txt')]
    female_list = [(name, 'female') for name in names.words('female.txt')]
    data = (male_list + female_list)
    random.seed(5)
    random.shuffle(data)
Now, create the test data:
    namesInput = ['Rajesh', 'Gaurav', 'Swati', 'Shubha']
Define the number of samples used for training and testing with the following code:
    train_sample = int(0.8 * len(data))
Now, we need to iterate over different suffix lengths so that the accuracies can be compared:
    for i in range(1, 6):
        print('\nNumber of end letters:', i)
        features = [(extract_features(n, i), gender) for (n, gender) in data]
        train_data, test_data = features[:train_sample], features[train_sample:]
        classifier = NaiveBayesClassifier.train(train_data)
The accuracy of the classifier can be computed as follows:
        accuracy_classifier = round(100 * nltk_accuracy(classifier, test_data), 2)
        print('Accuracy = ' + str(accuracy_classifier) + '%')
Now, we can predict the output:
        for name in namesInput:
            print(name, '==>', classifier.classify(extract_features(name, i)))
The program above will generate the following output:
Number of end letters: 1
Accuracy = 74.7%
Rajesh ==> female
Gaurav ==> male
Swati ==> female
Shubha ==> female
Number of end letters: 2
Accuracy = 78.79%
Rajesh ==> male
Gaurav ==> male
Swati ==> female
Shubha ==> female
Number of end letters: 3
Accuracy = 77.22%
Rajesh ==> male
Gaurav ==> female
Swati ==> female
Shubha ==> female
Number of end letters: 4
Accuracy = 69.98%
Rajesh ==> female
Gaurav ==> female
Swati ==> female
Shubha ==> female
Number of end letters: 5
Accuracy = 64.63%
Rajesh ==> female
Gaurav ==> female
Swati ==> female
Shubha ==> female
From the output above, we can see that accuracy peaks when the last two letters are used and falls off as more ending letters are included.
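The drop-off for longer suffixes has a simple data-sparsity explanation: the longer the suffix, the fewer training names share it, so each probability estimate rests on less evidence. A tiny illustration with an invented name list:

```python
from collections import Counter

# Invented mini-dataset for illustration only.
name_list = ['Amit', 'Rohit', 'Sumit', 'Anita', 'Sunita', 'Kavita']

for n in (1, 2, 4):
    suffixes = Counter(name[-n:].lower() for name in name_list)
    print(n, dict(suffixes))
```

With one-letter suffixes the six names collapse into just two well-populated buckets ('t' and 'a'), while four-letter suffixes fragment them into five buckets, most seen only once.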