
[Data Analysis in Practice] Predicting Pet Adoption Status

萌宠菠菠乐园
2024-10-28 17:30



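Dataset

Dataset description

The pet adoption dataset offers a broad look at the factors that may affect how likely a pet is to be adopted from a shelter. It contains detailed information about pets available for adoption, covering a range of characteristics and attributes.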

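Features

- PetID: unique identifier for each pet.
- PetType: type of pet (e.g., dog, cat, bird, rabbit).
- Breed: specific breed of the pet.
- AgeMonths: age of the pet in months.
- Color: color of the pet.
- Size: size category of the pet (small, medium, large).
- WeightKg: weight of the pet in kilograms.
- Vaccinated: vaccination status (0 = not vaccinated, 1 = vaccinated).
- HealthCondition: health status (0 = healthy, 1 = medical condition).
- TimeInShelterDays: time the pet has spent in the shelter, in days.
- AdoptionFee: adoption fee in US dollars.
- PreviousOwner: whether the pet had a previous owner (0 = no, 1 = yes).
- AdoptionLikelihood: likelihood that the pet is adopted (0 = unlikely, 1 = likely).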
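As a quick sanity check on the feature list above, the following sketch verifies that a DataFrame has the listed columns and that the 0/1 flags really are binary. The helper name `check_schema`, the grouping of binary columns, and the two toy rows are assumptions made for illustration; they do not come from the dataset itself.

```python
import pandas as pd

# Column names are taken from the feature list above.
EXPECTED_COLUMNS = [
    "PetID", "PetType", "Breed", "AgeMonths", "Color", "Size", "WeightKg",
    "Vaccinated", "HealthCondition", "TimeInShelterDays", "AdoptionFee",
    "PreviousOwner", "AdoptionLikelihood",
]
# Columns described as 0/1 flags in the feature list.
BINARY_COLUMNS = ["Vaccinated", "HealthCondition", "PreviousOwner", "AdoptionLikelihood"]


def check_schema(df):
    """Return a list of problems found; an empty list means the frame looks fine."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col in BINARY_COLUMNS:
        if col in df.columns and not df[col].isin([0, 1]).all():
            problems.append(f"{col} contains values other than 0/1")
    return problems


# Toy rows invented for this example (hypothetical values, not real records).
toy = pd.DataFrame({
    "PetID": [1, 2],
    "PetType": ["Dog", "Cat"],
    "Breed": ["Labrador", "Siamese"],
    "AgeMonths": [24, 8],
    "Color": ["Black", "White"],
    "Size": ["Large", "Small"],
    "WeightKg": [28.5, 3.9],
    "Vaccinated": [1, 0],
    "HealthCondition": [0, 0],
    "TimeInShelterDays": [30, 12],
    "AdoptionFee": [150, 80],
    "PreviousOwner": [0, 1],
    "AdoptionLikelihood": [1, 0],
})
print(check_schema(toy))
```

Running the same check on the real CSV right after loading it would flag a renamed column or an unexpected flag value before any modeling starts.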

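Use cases

This dataset is well suited to data scientists and analysts who want to understand and predict pet adoption trends. It can be used for:

- Predictive modeling to estimate how likely a pet is to be adopted.
- Analyzing how individual factors affect adoption rates.
- Developing strategies to raise adoption rates at shelters.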
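The second use case, analyzing how a factor relates to adoption rates, can be sketched with a simple group-by. The function name `adoption_rate_by` and the six toy rows are hypothetical; only the column names come from the dataset description.

```python
import pandas as pd


def adoption_rate_by(df, column, target="AdoptionLikelihood"):
    """Share of pets with target == 1 for each value of `column`, highest first."""
    return df.groupby(column)[target].mean().sort_values(ascending=False)


# Toy rows invented for this example (not taken from the real dataset).
toy = pd.DataFrame({
    "Vaccinated": [1, 1, 1, 0, 0, 0],
    "Size": ["Small", "Medium", "Large", "Small", "Medium", "Large"],
    "AdoptionLikelihood": [1, 1, 0, 0, 1, 0],
})

rates = adoption_rate_by(toy, "Vaccinated")
print(rates)
print(adoption_rate_by(toy, "Size"))
```

A difference in these rates describes an association only; it does not show that the factor causes pets to be adopted more or less often.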

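Note

This dataset is intended to support research and initiatives focused on raising pet adoption rates and helping more pets find their forever homes.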

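Pet adoption prediction

Environment setup

This Python 3 environment comes with many useful analytics libraries installed. It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python. For example, here are a few useful packages to load: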

    import numpy as np  # linear algebra
    import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

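Input data files are available in the read-only "…/Input/" directory. For example, running the following (by clicking Run or pressing Shift+Enter) lists all files under the input directory: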

    import os

    # Walk the input directory and print the full path of every file
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))

Load the dataset:

    df = pd.read_csv('/kaggle/input/predict-pet-adoption-status-dataset/pet_adoption_data.csv')

Exploring the DataFrame

Running df.head() produces:

[Figure: output of df.head()]

Input:

    def get_df_info(df):
        """Print a structural and statistical overview of a DataFrame."""
        # \033[1m switches the terminal to bold text, \033[0m resets it
        print("\n\033[1mShape of DataFrame:\033[0m ", df.shape)
        print("\n\033[1mColumns in DataFrame:\033[0m ", df.columns.to_list())
        print("\n\033[1mData types of columns:\033[0m\n", df.dtypes)

        print("\n\033[1mInformation about DataFrame:\033[0m")
        df.info()

        print("\n\033[1mNumber of unique values in each column:\033[0m")
        for col in df.columns:
            print(f"\033[1m{col}\033[0m: {df[col].nunique()}")

        print("\n\033[1mNumber of null values in each column:\033[0m\n", df.isnull().sum())
        print("\n\033[1mNumber of duplicate rows:\033[0m ", df.duplicated().sum())
        print("\n\033[1mDescriptive statistics of DataFrame:\033[0m\n", df.describe().transpose())

    # Call the function
    get_df_info(df)


The output is shown in the figure:

[Figure: output of get_df_info(df)]

Data preprocessing

1. Drop the 'PetID' column:

    df = df.drop('PetID', axis=1)

2. Split the DataFrame into features (X) and target (y):

    X = df.drop('AdoptionLikelihood', axis=1)
    y = df['AdoptionLikelihood']

3. One-hot encode the categorical variables in X:

    X = pd.get_dummies(X)
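To see what the three steps above do to the data, here is a minimal sketch on a toy frame. The three rows are invented for illustration, and passing `dtype=int` is an optional choice made here so that the indicator columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Toy rows invented for this example (not taken from the real dataset).
toy = pd.DataFrame({
    "PetID": [500, 501, 502],
    "PetType": ["Dog", "Cat", "Dog"],
    "AgeMonths": [12, 30, 7],
    "AdoptionLikelihood": [1, 0, 1],
})

# Steps 1 and 2: drop the identifier and separate features from the target.
features = toy.drop(columns=["PetID", "AdoptionLikelihood"])
target = toy["AdoptionLikelihood"]

# Step 3: each category of a text column becomes its own indicator column,
# while numeric columns pass through unchanged.
encoded = pd.get_dummies(features, dtype=int)
print(encoded)
```

PetType is replaced by one indicator column per category (PetType_Cat and PetType_Dog here), and AgeMonths is left as it was.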

Machine learning

Input:

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score
    from sklearn.preprocessing import StandardScaler, LabelEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import (
        RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier,
        GradientBoostingClassifier, VotingClassifier, StackingClassifier,
    )
    from sklearn.svm import SVC
    from sklearn.multiclass import OneVsRestClassifier
    from xgboost import XGBClassifier
    from lightgbm import LGBMClassifier
    from catboost import CatBoostClassifier
    from imblearn.over_sampling import SMOTE

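The function apply_models takes the features (X) and the target labels (y) as input and performs the following tasks:

Data preprocessing:
- Split the data into training and test sets.
- Check for class imbalance and apply SMOTE (oversampling) if needed.
- Scale the features with StandardScaler.

Model training and evaluation:
- Define a set of machine learning classification models.
- Train each model on the training data.
- Evaluate each model on the test data using accuracy and F1 score.
- Print a detailed report for each model (accuracy, confusion matrix, classification report).

Ensemble learning:
- Select the three best-performing models by F1 score.
- Build two ensemble models from those three (a voting classifier and a stacking classifier).
- Evaluate the ensembles on the test data using accuracy, confusion matrix, and classification report.

In short, the function explores a range of classification models, identifies the best performers, and tries to improve on them through ensemble learning.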
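The two preprocessing checks described above can be sketched with plain NumPy. This is an illustration of the ideas, not the function's actual code: the names `is_imbalanced` and `standardize` are hypothetical, the sketch covers only the ratio part of the imbalance test, and it computes the scaling by hand instead of calling SMOTE or StandardScaler.

```python
import numpy as np


def is_imbalanced(y, threshold=0.1):
    """True when the rarest class has less than `threshold` times the samples of the largest."""
    counts = np.bincount(y)
    counts = counts[counts > 0]  # ignore label values that never occur
    return bool(counts.min() / counts.max() < threshold)


def standardize(train, test):
    """Scale both arrays using the mean and standard deviation of the training array only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (train - mean) / std, (test - mean) / std


# Toy labels: 95 negatives and 5 positives give a ratio of about 0.05.
y_skewed = np.array([0] * 95 + [1] * 5)
y_even = np.array([0, 1, 0, 1, 1])
print(is_imbalanced(y_skewed), is_imbalanced(y_even))

# Toy features: the second column is constant in the training data.
train = np.array([[1.0, 10.0], [3.0, 10.0]])
test = np.array([[2.0, 11.0]])
train_scaled, test_scaled = standardize(train, test)
print(train_scaled)
print(test_scaled)
```

Computing the statistics on the training split alone keeps information about the test set out of the fitted preprocessing, which is why the scaler is fitted before it is applied to the test data.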

Input:

    def apply_models(X, y):
        # Split the data into training and test sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Check for class imbalance
        class_counts = np.bincount(y_train)
        if len(class_counts) > 2 or np.min(class_counts) / np.max(class_counts) < 0.1:
            print("Class imbalance detected. Applying SMOTE...")
            # Oversample the minority class on the training data only
            smote = SMOTE(random_state=42)
            X_train, y_train = smote.fit_resample(X_train, y_train)

        # Fit the scaler on the training data, then transform both sets
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        # Define the models
        models = {
            'LogisticRegression': LogisticRegression(),
            'SVC': SVC(),
            'DecisionTree': DecisionTreeClassifier(),
            'RandomForest': RandomForestClassifier(),
            'ExtraTrees': ExtraTreesClassifier(),
            'AdaBoost': AdaBoostClassifier(),
            'GradientBoost': GradientBoostingClassifier(),
            # use_label_encoder is deprecated in recent XGBoost releases and is omitted here
            'XGBoost': XGBClassifier(eval_metric='logloss'),
            'LightGBM': LGBMClassifier(),
            'CatBoost': CatBoostClassifier(verbose=0),
        }

        # Holds (accuracy, f1) for each model
        model_performance = {}

        # Train and evaluate each model
        for model_name, model in models.items():
            print(f"\n\033[1mClassification with {model_name}:\033[0m\n{'-' * 30}")

            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)

            accuracy = accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred, average='weighted')
            model_performance[model_name] = (accuracy, f1)

            print("\033[1m**Accuracy**:\033[0m\n", accuracy)
            print("\n\033[1m**Confusion Matrix**:\033[0m\n", confusion_matrix(y_test, y_pred))
            print("\n\033[1m**Classification Report**:\033[0m\n", classification_report(y_test, y_pred))

        # Sort the models by F1 score and pick the top 3
        top_3_models = sorted(model_performance.items(), key=lambda x: x[1][1], reverse=True)[:3]
        print("\n\033[1mTop 3 Models based on F1 Score:\033[0m\n", top_3_models)

        # Extract the names and classifiers of the top 3 models
        top_3_model_names = [model[0] for model in top_3_models]
        top_3_classifiers = [models[model_name] for model_name in top_3_model_names]
        estimators = list(zip(top_3_model_names, top_3_classifiers))

        # Voting classifier built from the top 3 models
        print("\n\033[1mInitializing Voting Classifier with top 3 models...\033[0m\n")
        voting_clf = VotingClassifier(estimators=estimators, voting='hard')
        voting_clf.fit(X_train, y_train)
        y_pred = voting_clf.predict(X_test)
        print("\n\033[1m**Voting Classifier Evaluation**:\033[0m\n")
        print("\033[1m**Accuracy**:\033[0m\n", accuracy_score(y_test, y_pred))
        print("\n\033[1m**Confusion Matrix**:\033[0m\n", confusion_matrix(y_test, y_pred))
        print("\n\033[1m**Classification Report**:\033[0m\n", classification_report(y_test, y_pred))

        # Stacking classifier built from the top 3 models
        print("\n\033[1mInitializing Stacking Classifier with top 3 models...\033[0m\n")
        stacking_clf = StackingClassifier(estimators=estimators)
        stacking_clf.fit(X_train, y_train)
        y_pred = stacking_clf.predict(X_test)
        print("\n\033[1m**Stacking Classifier Evaluation**:\033[0m\n")
        print("\033[1m**Accuracy**:\033[0m\n", accuracy_score(y_test, y_pred))
        print("\n\033[1m**Confusion Matrix**:\033[0m\n", confusion_matrix(y_test, y_pred))
        print("\n\033[1m**Classification Report**:\033[0m\n", classification_report(y_test, y_pred))


Apply the function to X and y:

    apply_models(X, y)

[Figure: output of apply_models(X, y)]
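To make the hard-voting step more concrete, the following sketch combines the predictions of three models by majority vote using plain NumPy. The function name `hard_vote` and the prediction arrays are hypothetical, and breaking ties in favor of the lowest label is a choice made for this example.

```python
import numpy as np


def hard_vote(predictions):
    """Majority vote over an array of shape (n_models, n_samples) of integer labels."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # For every sample (column), count how many models voted for each class.
    votes = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    # votes has shape (n_classes, n_samples); argmax picks the lowest label on ties.
    return votes.argmax(axis=0)


# Invented predictions from three models on four samples.
preds = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
print(hard_vote(preds))
```

Stacking works differently: rather than counting votes, it trains a final estimator on the base models' predictions, which lets it learn how much weight each base model deserves.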