
AI: Problem Solving

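In this section, we work through two related problems: predicting the category of a piece of text and identifying gender from a first name.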

Category Prediction

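In a set of documents, the individual words matter, but so does the category each text belongs to. For example, we may want to predict whether a given sentence belongs to the category email, news, sports, computers, and so on. In the following example, we use tf-idf to build feature vectors and then predict the category of a document. The data comes from the 20 newsgroups dataset, which can be loaded through sklearn.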
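To make the tf-idf weighting concrete, here is a minimal pure-Python sketch on a hypothetical three-sentence corpus. It assumes the smoothed form idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is my understanding of the scikit-learn defaults; the corpus and the helper name `tfidf_vector` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the shuttle reached space",
    "the puck hit the goal",
    "the shuttle launch was delayed",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each word appears in.
df = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf_vector(tokens):
    """Return an L2-normalized {word: tf-idf weight} mapping for one document."""
    tf = Counter(tokens)
    # Assumed smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
        for word, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}


vec = tfidf_vector(tokenized[0])
for word, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {weight:.3f}")
```

Words that occur in every document, such as "the", receive the lowest weight, while words unique to one document receive the highest.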

First, import the necessary packages:

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

Define the category map. We use five categories: Religion, Autos, Hockey, Electronics and Space.

category_map = {'talk.religion.misc': 'Religion', 'rec.autos': 'Autos',
                'rec.sport.hockey': 'Hockey', 'sci.electronics': 'Electronics',
                'sci.space': 'Space'}

Create the training set:

# Pass the category names as a plain list rather than a dict view.
training_data = fetch_20newsgroups(subset='train',
    categories=list(category_map.keys()), shuffle=True, random_state=5)

Build a count vectorizer and extract the term counts:

vectorizer_count = CountVectorizer()
train_tc = vectorizer_count.fit_transform(training_data.data)
print("\nDimensions of training data:", train_tc.shape)
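The shape printed above is (number of documents, vocabulary size). The following sketch reproduces the idea of a term-count matrix in plain Python on two hypothetical sentences. The tokenization rule used here (lowercase, tokens of two or more word characters) is an assumption chosen for illustration, not a statement of exactly what `CountVectorizer` does in every configuration.

```python
import re
from collections import Counter

# Hypothetical input documents.
docs = ["Discovery was a space shuttle", "The shuttle left the launch pad"]

# Assumption: lowercase the text and keep tokens of two or more word characters.
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Vocabulary: each distinct token gets a column index (sorted for determinism).
all_tokens = sorted({tok for toks in tokenized for tok in toks})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

# Term-count matrix: one row per document, one column per vocabulary entry.
matrix = []
for toks in tokenized:
    counts = Counter(toks)
    matrix.append([counts.get(tok, 0) for tok in all_tokens])

print("Dimensions:", (len(matrix), len(vocabulary)))
```

The real vectorizer stores the same information as a sparse matrix, because most entries are zero once the vocabulary grows to tens of thousands of terms.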

Create the tf-idf transformer and fit it on the term counts:

tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)

Now, define the test data:

input_data = [
'Discovery was a space shuttle',
'Hindu, Christian, Sikh all are religions',
'We must have to drive safely',
'Puck is a disk made of rubber',
'Television, Microwave, Refrigrated all uses electricity'
]

Train a Multinomial Naive Bayes classifier on the tf-idf features of the training set:

classifier = MultinomialNB().fit(train_tfidf, training_data.target)
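For intuition about what the classifier learns, here is a small from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It works on raw word counts from a hypothetical toy training set, whereas the tutorial feeds tf-idf weights to `MultinomialNB`; the function name `predict` and all data are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (tokens, label).
train = [
    (["space", "shuttle", "orbit"], "Space"),
    (["shuttle", "launch", "nasa"], "Space"),
    (["puck", "goal", "ice"], "Hockey"),
    (["goal", "team", "ice"], "Hockey"),
]

# Per-class document counts and per-class word counts.
class_docs = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}


def predict(tokens, alpha=1.0):
    """Return the label with the highest log posterior (Laplace smoothing alpha)."""
    scores = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(n_docs / len(train))  # log prior
        for w in tokens:
            if w not in vocab:
                continue  # ignore words never seen in training
            likelihood = (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


print(predict(["shuttle", "orbit"]))
print(predict(["ice", "puck"]))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long documents.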

Transform the input data with the fitted count vectorizer:

input_tc = vectorizer_count.transform(input_data)

Now apply the tf-idf transformer to the count-vectorized input:

input_tfidf = tfidf.transform(input_tc)

Predict the output categories:

predictions = classifier.predict(input_tfidf)

Print each input sentence together with its predicted category:

for sent, category in zip(input_data, predictions):
    print('\nInput Data:', sent, '\n Category:',
          category_map[training_data.target_names[category]])
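The lookup in the print statement goes through two steps: the classifier returns an integer index, `target_names` turns that index into a newsgroup name, and `category_map` turns the name into a readable label. The sketch below imitates this with plain lists. It assumes the loader lists the selected newsgroup names in sorted order, and the predicted indices are hypothetical.

```python
category_map = {'talk.religion.misc': 'Religion', 'rec.autos': 'Autos',
                'rec.sport.hockey': 'Hockey', 'sci.electronics': 'Electronics',
                'sci.space': 'Space'}

# Assumption: the dataset loader exposes the selected names in sorted order.
target_names = sorted(category_map)

# Hypothetical integer predictions, as a classifier would return them.
predictions = [4, 3, 0]

# index -> newsgroup name -> readable label
labels = [category_map[target_names[i]] for i in predictions]
print(labels)
```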

The category predictor produces the following output:

Dimensions of training data: (2755, 39297)
Input Data: Discovery was a space shuttle
Category: Space
Input Data: Hindu, Christian, Sikh all are religions
Category: Religion
Input Data: We must have to drive safely
Category: Autos
Input Data: Puck is a disk made of rubber
Category: Hockey
Input Data: Television, Microwave, Refrigrated all uses electricity
Category: Electronics

Gender Finder

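In this problem, we train a classifier that identifies gender (male or female) from a first name. We use a heuristic to build the feature vector and then train the classifier. The labeled data comes from the names corpus that ships with NLTK. The following Python code builds the gender finder: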

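Import the necessary packages. The NLTK names corpus must be available locally; it can be downloaded once with nltk.download('names').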

import random
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy
from nltk.corpus import names

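Next, extract the last N letters from the input word. These letters serve as the feature:

def extract_features(word, N=2):
    # Use the lowercased last N letters of the name as the single feature.
    last_n_letters = word[-N:]
    return {'feature': last_n_letters.lower()}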
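A quick usage check of the feature extractor, repeated here so the snippet runs on its own. It shows the feature produced for different values of N and what happens when a name is shorter than N.

```python
def extract_features(word, N=2):
    # Use the lowercased last N letters of the name as the single feature.
    last_n_letters = word[-N:]
    return {'feature': last_n_letters.lower()}


for n in (1, 2, 3):
    print(n, extract_features('Rajesh', n))

# A name shorter than N simply yields the whole name, lowercased.
short = extract_features('Al', 5)
print(short)
```

Because slicing never raises on short strings, no length check is needed before calling the function.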


if __name__=='__main__':

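Create the training data from the labeled names (male and female) available in NLTK:

    male_list = [(name, 'male') for name in names.words('male.txt')]
    female_list = [(name, 'female') for name in names.words('female.txt')]
    data = male_list + female_list
    # Fix the seed so the shuffle, and therefore the split, is reproducible.
    random.seed(5)
    random.shuffle(data)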

Now, define a few input names to classify:

    namesInput = ['Rajesh', 'Gaurav', 'Swati', 'Shubha']

Define the number of samples used for training (80% of the data); the remainder is used for testing:

    train_sample = int(0.8 * len(data))
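The split itself is just list slicing on the shuffled data. Here is a minimal sketch using a hypothetical ten-item list in place of the NLTK names, so it runs without downloading the corpus.

```python
import random

# Hypothetical stand-in for the labeled names.
data = [(f"name{i}", "male" if i % 2 else "female") for i in range(10)]

# Shuffle first so both parts contain a mix of labels.
random.seed(5)
random.shuffle(data)

# The first 80% is used for training, the rest for testing.
train_sample = int(0.8 * len(data))
train_data, test_data = data[:train_sample], data[train_sample:]

print(len(train_data), len(test_data))
```

Shuffling matters here because the original list is built as all male names followed by all female names; an unshuffled 80/20 cut would leave the test set with a single label.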

Now, iterate over different suffix lengths so that the accuracy can be compared:

    for i in range(1, 6):
        print('\nNumber of end letters:', i)
        features = [(extract_features(n, i), gender) for (n, gender) in data]
        train_data, test_data = features[:train_sample], features[train_sample:]
        classifier = NaiveBayesClassifier.train(train_data)

Compute the accuracy of the classifier on the held-out test data:

        accuracy_classifier = round(100 * nltk_accuracy(classifier, test_data), 2)
        print('Accuracy = ' + str(accuracy_classifier) + '%')

Now, predict the gender of each input name:

        for name in namesInput:
            print(name, '->', classifier.classify(extract_features(name, i)))

The program above produces the following output:

Number of end letters: 1
Accuracy = 74.7%
Rajesh -> female
Gaurav -> male
Swati -> female
Shubha -> female
Number of end letters: 2
Accuracy = 78.79%
Rajesh -> male
Gaurav -> male
Swati -> female
Shubha -> female
Number of end letters: 3
Accuracy = 77.22%
Rajesh -> male
Gaurav -> female
Swati -> female
Shubha -> female
Number of end letters: 4
Accuracy = 69.98%
Rajesh -> female
Gaurav -> female
Swati -> female
Shubha -> female
Number of end letters: 5
Accuracy = 64.63%
Rajesh -> female
Gaurav -> female
Swati -> female
Shubha -> female

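In the output above, accuracy peaks at 78.79% when the last two letters are used. It is lower with a single letter (74.7%) and falls steadily as more than two ending letters are used, down to 64.63% with five.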
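The comparison can also be done programmatically. The sketch below takes the accuracy figures reported in the output above and picks the suffix length with the highest value; the dictionary name `accuracy_by_n` is illustrative.

```python
# Accuracy (%) per number of end letters, copied from the output above.
accuracy_by_n = {1: 74.7, 2: 78.79, 3: 77.22, 4: 69.98, 5: 64.63}

# Pick the suffix length with the highest accuracy.
best_n = max(accuracy_by_n, key=accuracy_by_n.get)
print(f"Best number of end letters: {best_n} ({accuracy_by_n[best_n]}%)")

# Show how far each setting falls below the best one.
for n, acc in accuracy_by_n.items():
    print(n, round(accuracy_by_n[best_n] - acc, 2))
```

One plausible reading of the drop at longer suffixes is that each distinct ending appears in fewer training names, so the classifier has less evidence per feature value; the tutorial itself does not state a cause.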
