AI with Python – Solving Problems
In this section, we will work through some related problems.
Category Prediction
In a set of documents, not only the words themselves matter but also their category — that is, which text category a particular word belongs to. For example, we may want to predict whether a given sentence belongs to the category email, news, sports, computers, and so on. In the example below, we will use tf-idf to build feature vectors and find the category of documents, using data from sklearn's 20 Newsgroups dataset.
We need to import the necessary packages:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
Define the category map. We will use five different categories: Religion, Autos, Hockey, Electronics and Space.
category_map = {'talk.religion.misc':'Religion','rec.autos':'Autos',
'rec.sport.hockey':'Hockey','sci.electronics':'Electronics', 'sci.space': 'Space'}
Create the training set:
training_data = fetch_20newsgroups(subset = 'train',
categories = category_map.keys(), shuffle = True, random_state = 5)
Build the count vectorizer and extract the term counts:
vectorizer_count = CountVectorizer()
train_tc = vectorizer_count.fit_transform(training_data.data)
print("\nDimensions of training data:", train_tc.shape)
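To see what the count vectorizer actually produces, here is a minimal sketch on three toy documents (the documents below are invented for illustration; the real run above uses the newsgroup posts):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three invented documents standing in for the newsgroup posts.
docs = [
    'the puck slid across the ice',
    'the shuttle reached orbit',
    'the engine needs a new spark plug',
]

vec = CountVectorizer()
counts = vec.fit_transform(docs)   # sparse matrix: one row per document,
                                   # one column per vocabulary term
print(counts.shape)
print(sorted(vec.vocabulary_))     # the learned vocabulary
```

Note that the default tokenizer drops one-letter tokens, so 'a' does not appear in the vocabulary.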
Create the tf-idf transformer:
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)
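As a side note, scikit-learn's TfidfVectorizer combines the two steps above (counting and tf-idf weighting) into one; a quick sketch on toy documents shows the two routes agree:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ['hockey puck on ice',
        'space shuttle launch',
        'ice on the shuttle wing']

# Two-stage route: raw counts first, then tf-idf weighting.
two_stage = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(docs))

# One-stage route: TfidfVectorizer does both with the same defaults.
one_stage = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_stage.toarray(), one_stage.toarray()))
```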
Now, define the test data:
input_data = [
'Discovery was a space shuttle',
'Hindu, Christian, Sikh all are religions',
'We must have to drive safely',
'Puck is a disk made of rubber',
'Television, Microwave, Refrigerator all use electricity'
]
Train a multinomial Naive Bayes classifier on the tf-idf-transformed training data:
classifier = MultinomialNB().fit(train_tfidf, training_data.target)
Transform the input data using the count vectorizer:
input_tc = vectorizer_count.transform(input_data)
Now, transform the count-vectorized input using the tf-idf transformer:
input_tfidf = tfidf.transform(input_tc)
Predict the output categories:
predictions = classifier.predict(input_tfidf)
The output is generated as follows:
for sent, category in zip(input_data, predictions):
    print('\nInput Data:', sent, '\n Category:', \
        category_map[training_data.target_names[category]])
The category predictor generates the following output:
Dimensions of training data: (2755, 39297)
Input Data: Discovery was a space shuttle
Category: Space
Input Data: Hindu, Christian, Sikh all are religions
Category: Religion
Input Data: We must have to drive safely
Category: Autos
Input Data: Puck is a disk made of rubber
Category: Hockey
Input Data: Television, Microwave, Refrigerator all use electricity
Category: Electronics
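The same count → tf-idf → Naive Bayes chain can also be packaged as a scikit-learn Pipeline. A minimal sketch with invented training sentences and labels (not the newsgroup data):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented stand-ins for the newsgroup posts and their categories.
train_docs = [
    'the goalie stopped the puck',
    'the rocket reached low earth orbit',
    'the team won the hockey game',
    'astronauts boarded the space station',
]
train_labels = ['Hockey', 'Space', 'Hockey', 'Space']

model = Pipeline([
    ('count', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])
model.fit(train_docs, train_labels)

print(model.predict(['the puck hit the goalie']))
```

The Pipeline applies the same transforms to training and test text automatically, which avoids the manual transform calls used above.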
Gender Identifier
In this problem, we will train a classifier to identify gender (male or female) from a given name. We will use a heuristic to construct the feature vector and train the classifier, using the labelled names corpus that ships with NLTK. Here is the Python code for building the gender identifier:
Let us import the necessary packages:
import random
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy
from nltk.corpus import names
Now we need to extract the last N letters from the input word. These letters will act as features:
def extract_features(word, N = 2):
    last_n_letters = word[-N:]
    return {'feature': last_n_letters.lower()}
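Before training, it helps to see what the extractor returns; the function is repeated here so the check is self-contained:

```python
def extract_features(word, N=2):
    last_n_letters = word[-N:]
    return {'feature': last_n_letters.lower()}

print(extract_features('Alexander'))     # {'feature': 'er'}
print(extract_features('Alexander', 4))  # {'feature': 'nder'}
```

Each name is thus reduced to a single lower-cased suffix, which is what the Naive Bayes classifier will condition on.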
if __name__=='__main__':
Create the training data using the labelled names (male and female) available in NLTK:
    male_list = [(name, 'male') for name in names.words('male.txt')]
    female_list = [(name, 'female') for name in names.words('female.txt')]
    data = (male_list + female_list)
    random.seed(5)
    random.shuffle(data)
Now, create the test data:
    namesInput = ['Rajesh', 'Gaurav', 'Swati', 'Shubha']
Define the number of samples used for training and testing with the following code:
    train_sample = int(0.8 * len(data))
Now, we need to iterate over different suffix lengths so that the accuracies can be compared:
    for i in range(1, 6):
        print('\nNumber of end letters:', i)
        features = [(extract_features(n, i), gender) for (n, gender) in data]
        train_data, test_data = features[:train_sample], features[train_sample:]
        classifier = NaiveBayesClassifier.train(train_data)
The accuracy of the classifier can be computed as follows:
        accuracy_classifier = round(100 * nltk_accuracy(classifier, test_data), 2)
        print('Accuracy = ' + str(accuracy_classifier) + '%')
Now, we can predict the output:
        for name in namesInput:
            print(name, '==>', classifier.classify(extract_features(name, i)))
The program above will generate the following output:
Number of end letters: 1
Accuracy = 74.7%
Rajesh ==> female
Gaurav ==> male
Swati ==> female
Shubha ==> female
Number of end letters: 2
Accuracy = 78.79%
Rajesh ==> male
Gaurav ==> male
Swati ==> female
Shubha ==> female
Number of end letters: 3
Accuracy = 77.22%
Rajesh ==> male
Gaurav ==> female
Swati ==> female
Shubha ==> female
Number of end letters: 4
Accuracy = 69.98%
Rajesh ==> female
Gaurav ==> female
Swati ==> female
Shubha ==> female
Number of end letters: 5
Accuracy = 64.63%
Rajesh ==> female
Gaurav ==> female
Swati ==> female
Shubha ==> female
From the output above, we can see that accuracy peaks when the last two letters are used and falls off as more ending letters are included.
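The drop-off for longer suffixes has a simple data-sparsity explanation: the longer the suffix, the fewer training names share it, so each probability estimate rests on less evidence. A tiny illustration with an invented name list:

```python
from collections import Counter

# Invented mini-dataset for illustration only.
name_list = ['Amit', 'Rohit', 'Sumit', 'Anita', 'Sunita', 'Kavita']

for n in (1, 2, 4):
    suffixes = Counter(name[-n:].lower() for name in name_list)
    print(n, dict(suffixes))
```

With one-letter suffixes the six names collapse into just two well-populated buckets ('t' and 'a'), while four-letter suffixes fragment them into five buckets, most seen only once.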