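The Pet Adoption dataset offers a comprehensive look at the factors that may affect how likely a pet is to be adopted from a shelter. It contains detailed records for pets available for adoption, covering a range of characteristics and attributes.
The dataset is well suited to data scientists and analysts who want to understand and predict pet adoption trends. It can be used for:
Predictive modeling to estimate the likelihood that a pet is adopted.
Analyzing the impact of various factors on adoption rates.
Developing strategies to raise shelter adoption rates.
The dataset is intended to support research and initiatives focused on raising pet adoption rates and helping more pets find their forever homes.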
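This Python 3 environment comes with many useful analytics libraries installed. It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python. For example, here are a few useful packages to load:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)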
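Input data files are available in the read-only "…/Input/" directory. For example, running the following (by clicking Run or pressing Shift+Enter) lists all files under the input directory:
import os

# Walk the input directory and print the full path of every file found
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))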
Load the dataset:
df = pd.read_csv('/kaggle/input/predict-pet-adoption-status-dataset/pet_adoption_data.csv')
Run df.head() to preview the first five rows of the DataFrame.
Next, define a helper that prints a summary of the DataFrame, then call it:
def get_df_info(df):
    """Print a structural and statistical summary of a DataFrame."""
    # \033[1m switches the terminal to bold text, \033[0m resets it
    print("\n\033[1mShape of DataFrame:\033[0m ", df.shape)
    print("\n\033[1mColumns in DataFrame:\033[0m ", df.columns.to_list())
    print("\n\033[1mData types of columns:\033[0m\n", df.dtypes)

    print("\n\033[1mInformation about DataFrame:\033[0m")
    df.info()

    print("\n\033[1mNumber of unique values in each column:\033[0m")
    for col in df.columns:
        print(f"\033[1m{col}\033[0m: {df[col].nunique()}")

    print("\n\033[1mNumber of null values in each column:\033[0m\n", df.isnull().sum())
    print("\n\033[1mNumber of duplicate rows:\033[0m ", df.duplicated().sum())
    print("\n\033[1mDescriptive statistics of DataFrame:\033[0m\n", df.describe().transpose())

# Call the function
get_df_info(df)
The output is shown in the figure.
1. Drop the 'PetID' column
df = df.drop('PetID', axis=1)
2. Split the DataFrame into features (X) and target (y)
X = df.drop('AdoptionLikelihood', axis=1)
y = df['AdoptionLikelihood']
3. Encode the categorical variables in X
X = pd.get_dummies(X)
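To make the effect of pd.get_dummies concrete, here is a minimal sketch on a toy DataFrame. The column names and values (PetType, AgeMonths) are illustrative assumptions, not rows taken from the actual dataset.

```python
import pandas as pd

# Toy frame; column names and values are made up for illustration only
toy = pd.DataFrame({
    "PetType": ["Dog", "Cat", "Dog"],
    "AgeMonths": [12, 30, 7],
})

# Text columns are expanded into one indicator column per category;
# numeric columns pass through unchanged
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

Each distinct value of a text column becomes its own indicator column, so the number of features after encoding depends on how many categories the data contains.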
Import the modeling libraries:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import VotingClassifier, StackingClassifier
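The function apply_models takes the features (X) and target labels (y) as input and performs the following tasks:
Data preprocessing: Split the data into training and test sets. Check for class imbalance and apply SMOTE (oversampling) when needed. Scale the features with StandardScaler.
Model training and evaluation: Define a set of machine learning classifiers. Train each model on the training data. Evaluate each model on the test data using accuracy and F1 score. Print a detailed report for each model (accuracy, confusion matrix, classification report).
Ensemble learning: Pick the three best-performing models by F1 score. Build two ensemble models (a voting classifier and a stacking classifier) from those three. Evaluate the ensembles on the test data using accuracy, confusion matrix, and classification report.
In short, the function explores a range of classifiers, identifies the best performers, and tries to improve on them through ensemble learning.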
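The class-imbalance check in the preprocessing step can be looked at in isolation. The sketch below assumes non-negative integer class labels; the helper name needs_resampling and the toy label lists are hypothetical, and the 0.1 ratio threshold is the one used inside apply_models.

```python
import numpy as np

def needs_resampling(y, ratio_threshold=0.1):
    """Return True when the imbalance rule used in apply_models would fire.

    Hypothetical helper for illustration; assumes non-negative integer labels.
    """
    counts = np.bincount(np.asarray(y))  # counts[i] = number of samples with label i
    multi_class = len(counts) > 2        # more than two label slots
    skewed = counts.min() / counts.max() < ratio_threshold
    return bool(multi_class or skewed)

# Toy label vectors, made up for illustration
print(needs_resampling([0] * 95 + [1] * 5))   # rare class is about 5% of the majority
print(needs_resampling([0] * 60 + [1] * 40))  # mild skew, above the threshold
print(needs_resampling([0, 1, 2] * 10))       # three perfectly balanced classes
```

Note that the first clause fires for any target with more than two classes, even a perfectly balanced one, so under this rule SMOTE is skipped only for binary targets whose minority-to-majority ratio is at least 0.1.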
def apply_models(X, y):
    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Check for class imbalance (expects non-negative integer labels)
    class_counts = np.bincount(y_train)
    if len(class_counts) > 2 or np.min(class_counts) / np.max(class_counts) < 0.1:
        print("Class imbalance detected. Applying SMOTE...")
        # Oversample the minority class(es) in the training data only
        smote = SMOTE(random_state=42)
        X_train, y_train = smote.fit_resample(X_train, y_train)

    # Fit the scaler on the training data, then transform both training and test data
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Define the models
    models = {
        'LogisticRegression': LogisticRegression(),
        'SVC': SVC(),
        'DecisionTree': DecisionTreeClassifier(),
        'RandomForest': RandomForestClassifier(),
        'ExtraTrees': ExtraTreesClassifier(),
        'AdaBoost': AdaBoostClassifier(),
        'GradientBoost': GradientBoostingClassifier(),
        'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
        'LightGBM': LGBMClassifier(),
        'CatBoost': CatBoostClassifier(verbose=0)
    }

    # Dictionary holding (accuracy, weighted F1) for each model
    model_performance = {}

    # Apply each model
    for model_name, model in models.items():
        print(f"\n\033[1mClassification with {model_name}:\033[0m\n{'-' * 30}")

        # Fit the model to the training data
        model.fit(X_train, y_train)

        # Make predictions on the test data
        y_pred = model.predict(X_test)

        # Calculate the accuracy and the weighted F1 score
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        # Store the performance in the dictionary
        model_performance[model_name] = (accuracy, f1)

        # Print the accuracy, confusion matrix, and classification report
        print("\033[1m**Accuracy**:\033[0m\n", accuracy)
        print("\n\033[1m**Confusion Matrix**:\033[0m\n", confusion_matrix(y_test, y_pred))
        print("\n\033[1m**Classification Report**:\033[0m\n", classification_report(y_test, y_pred))

    # Sort the models by F1 score (second element of the tuple) and pick the top 3
    top_3_models = sorted(model_performance.items(), key=lambda x: x[1][1], reverse=True)[:3]
    print("\n\033[1mTop 3 Models based on F1 Score:\033[0m\n", top_3_models)

    # Extract the model names and classifiers for the top 3 models
    top_3_model_names = [model[0] for model in top_3_models]
    top_3_classifiers = [models[model_name] for model_name in top_3_model_names]

    # Create a Voting Classifier with the top 3 models
    print("\n\033[1mInitializing Voting Classifier with top 3 models...\033[0m\n")
    voting_clf = VotingClassifier(estimators=list(zip(top_3_model_names, top_3_classifiers)), voting='hard')
    voting_clf.fit(X_train, y_train)
    y_pred = voting_clf.predict(X_test)
    print("\n\033[1m**Voting Classifier Evaluation**:\033[0m\n")
    print("\033[1m**Accuracy**:\033[0m\n", accuracy_score(y_test, y_pred))
    print("\n\033[1m**Confusion Matrix**:\033[0m\n", confusion_matrix(y_test, y_pred))
    print("\n\033[1m**Classification Report**:\033[0m\n", classification_report(y_test, y_pred))

    # Create a Stacking Classifier with the top 3 models
    print("\n\033[1mInitializing Stacking Classifier with top 3 models...\033[0m\n")
    stacking_clf = StackingClassifier(estimators=list(zip(top_3_model_names, top_3_classifiers)))
    stacking_clf.fit(X_train, y_train)
    y_pred = stacking_clf.predict(X_test)
    print("\n\033[1m**Stacking Classifier Evaluation**:\033[0m\n")
    print("\033[1m**Accuracy**:\033[0m\n", accuracy_score(y_test, y_pred))
    print("\n\033[1m**Confusion Matrix**:\033[0m\n", confusion_matrix(y_test, y_pred))
    print("\n\033[1m**Classification Report**:\033[0m\n", classification_report(y_test, y_pred))
Apply the function to X and y:
apply_models(X, y)
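The ensemble step can also be tried on its own. The following is a minimal sketch on synthetic data from make_classification, not the pet adoption dataset; the three base models are arbitrary stand-ins for whichever models rank in the top three, and the variable names are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the pet adoption dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Three arbitrary base models standing in for the "top 3"
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
]

# Hard voting: each base model casts one vote and the majority label wins
voting = VotingClassifier(estimators=estimators, voting="hard").fit(X_tr, y_tr)

# Stacking: a final estimator (LogisticRegression by default) is trained on
# cross-validated predictions of the base models
stacking = StackingClassifier(estimators=estimators).fit(X_tr, y_tr)

scores = {
    "voting": accuracy_score(y_te, voting.predict(X_te)),
    "stacking": accuracy_score(y_te, stacking.predict(X_te)),
}
print(scores)
```

Both ensemble classes fit clones of the estimators they are given, so passing models that were already trained, as apply_models does, simply refits fresh copies on the training data.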