
Data Analysis of Actual Patient Attendance After Appointment Booking

Time: 2024-05-27 17:46:09


Project: Data Analysis of Actual Patient Attendance After Appointment Booking

Introduction

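Appointment booking is a patient-convenience service introduced in recent years to shorten the process of seeing a doctor and save patients' time. It lets patients plan their visits in advance and reduces waiting time, and it helps hospitals improve management, raise efficiency and quality of care, and lower medical safety risks. Booked patients, however, never attend 100% of the time: reported no-show rates are above 10% and as high as 50% in China, and 3% to 34% abroad. Patients miss appointments for a variety of subjective and objective reasons, which wastes a great deal of medical resources. Studying the characteristics of patients who miss their appointments is therefore important for allocating medical resources sensibly.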

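This analysis uses the Medical Appointment No Shows data set from Kaggle, which contains roughly 110,000 appointment records from Brazil. Each row records several attributes of the patient. Missing values in the data are marked as NaN. The columns have the following meanings: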

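PatientId: patient ID
AppointmentID: appointment ID
Gender: gender
ScheduledDay: booking date, the date on which the patient made the appointment
AppointmentDay: appointment date, the date on which the patient is due to be seen
Age: age
Neighbourhood: neighbourhood, the location of the hospital
Scholarship: welfare coverage, whether the patient is enrolled in the Brazilian welfare programme Bolsa Família
Hipertension: hypertension
Diabetes: diabetes
Alcoholism: alcoholism
Handcap: disability (the column name is misspelled)
SMS_received: whether the patient received an SMS reminder
No-show: whether the patient kept the appointment; "No" means the patient attended, "Yes" means the patient did not show up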

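This analysis focuses on the following questions:
- Does a shorter interval between booking and appointment make attendance more likely?
- Do patients who receive welfare benefits attend at a higher rate?
- How does attendance differ by gender?
- How does attendance differ by age group?
- Do patients who receive an SMS reminder keep their appointments?
- How does attendance differ by type of disease?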

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Data Wrangling

Assessing the Data

# Load the data and print the first few rows.
df = pd.read_csv('noshowappointments-kagglev2-may-.csv')
df.head()


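# Check the shape of the data
df.shape
# Check the data types and whether any values are missing or invalid.
df.info()
# Number of duplicated rows
df.duplicated().sum()
# Number of rows with missing values
df.isnull().any(axis=1).sum()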

# Unique values of Age
df.Age.unique()

Out[7]:

array([ 62,  56,   8,  76,  23,  39,  21,  19,  30,  29,  22,  28,  54,
        15,  50,  40,  46,   4,  13,  65,  45,  51,  32,  12,  61,  38,
        79,  18,  63,  64,  85,  59,  55,  71,  49,  78,  31,  58,  27,
         6,   2,  11,   7,   0,   3,   1,  69,  68,  60,  67,  36,  10,
        35,  20,  26,  34,  33,  16,  42,   5,  47,  17,  41,  44,  37,
        24,  66,  77,  81,  70,  53,  75,  73,  52,  74,  43,  89,  57,
        14,   9,  48,  83,  72,  25,  80,  87,  88,  84,  82,  90,  94,
        86,  91,  98,  92,  96,  93,  95,  97, 102, 115, 100,  99,  -1],
      dtype=int64)

The data set has no missing values, which is very good news for the analysis that follows.
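The checks above can be bundled into one small helper so they are easy to rerun after each cleaning step. This is only a sketch: the function name `assess` and the toy DataFrame are assumptions made for illustration and are not part of the original notebook or the Kaggle file.

```python
import numpy as np
import pandas as pd


def assess(frame: pd.DataFrame) -> dict:
    """Return a few basic data-quality counts for a DataFrame (hypothetical helper)."""
    return {
        "shape": frame.shape,
        "duplicate_rows": int(frame.duplicated().sum()),
        "rows_with_missing": int(frame.isnull().any(axis=1).sum()),
        "negative_ages": int((frame["Age"] < 0).sum()),
    }


# Toy data invented for this example; it is NOT the appointment data set.
toy = pd.DataFrame({
    "Age": [62, 8, -1, 8, np.nan],
    "No-show": ["No", "Yes", "No", "Yes", "No"],
})
report = assess(toy)
print(report)
```

On the real data the same call would be `assess(df)`; a non-zero `negative_ages` count is what flags the invalid Age value of -1 seen in the unique values.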

Cleaning the Data

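While assessing the data, we found that the "Handcap" column label is misspelled. To keep the column names consistent, the hyphen in "No-show" is also replaced with an underscore. "ScheduledDay" and "AppointmentDay" are both time columns and should be converted to the datetime type. The unique values of "Age" include -1, which should be removed; to make the analysis easier, "Age" is also split into age groups.
Renaming columns
To match the 'PatientId' column name, 'AppointmentID' is renamed to 'AppointmentId'. The misspelled "Handcap" label is corrected to "Handicap", and the "-" in the "No-show" label is replaced with "_".
Converting to datetime
The pandas to_datetime() function converts the "ScheduledDay" and "AppointmentDay" columns to the datetime type, after which the interval between the two can be computed.
Binning ages
The pandas cut function "cuts" data into groups. pd.cut() is used to split ages into age groups: 0 to 6 is childhood, 7 to 17 is juvenile, 18 to 40 is youth, 41 to 65 is middle_aged, and 66 and over is elderly.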
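To make the bin boundaries concrete, the sketch below runs `pd.cut` on a handful of boundary ages using the edges and labels described above. The sample ages are made up for illustration. One detail worth noticing is that `pd.cut` builds right-closed intervals by default, so the first bin is (0, 6] and an age of exactly 0 falls outside it unless `include_lowest=True` is passed.

```python
import pandas as pd

# Boundary ages chosen for illustration only.
ages = pd.Series([0, 6, 7, 17, 18, 40, 41, 65, 66, 115])
age_edges = [0, 6, 17, 40, 65, 179]
age_names = ["childhood", "juvenile", "youth", "middle_aged", "elderly"]

# Default behaviour: bins are (0, 6], (6, 17], ... so age 0 is not assigned.
default_bins = pd.cut(ages, age_edges, labels=age_names)
# include_lowest=True closes the first bin on the left as well.
closed_bins = pd.cut(ages, age_edges, labels=age_names, include_lowest=True)

print(pd.DataFrame({"age": ages, "default": default_bins, "include_lowest": closed_bins}))
```

Whether newborns recorded with Age 0 should count as childhood is an analysis decision; the point here is only that the default silently leaves them unassigned.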

# Rename columns
df.rename(columns={'AppointmentID': 'AppointmentId',
                   'Handcap': 'Handicap',
                   'No-show': 'No_show'}, inplace=True)

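# Convert to datetime
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
# Compute the interval between booking and appointment
df['WaitingTime'] = df['AppointmentDay'] - df['ScheduledDay']
# Keep only the whole-day part of WaitingTime; same-day appointments come out as -1 or 0
df['WaitingTime'] = df['WaitingTime'].dt.days
# Filter out rows whose appointment lies more than a day before the booking
df = df[df['WaitingTime'] >= -1]
df.WaitingTime.describe()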

Out[10]:

count    110522.000000
mean          9.184253
std          15.255115
min          -1.000000
25%          -1.000000
50%           3.000000
75%          14.000000
max         178.000000
Name: WaitingTime, dtype: float64
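The minimum of -1 may look odd. It presumably arises because the booking timestamp carries a time of day while the appointment is recorded at midnight, so a same-day visit yields a small negative timedelta whose `.days` component is -1. The sketch below illustrates this with a single made-up booking (the timestamps are assumptions, not rows quoted from the data set) and shows that normalizing both columns to midnight first gives 0 instead.

```python
import pandas as pd

# Hypothetical booking: made at 18:38 for a visit on the same calendar day.
scheduled = pd.to_datetime(pd.Series(["2016-04-29T18:38:08Z"]))
appointment = pd.to_datetime(pd.Series(["2016-04-29T00:00:00Z"]))

# Raw difference: minus 18h38m is stored as "-1 days +05:21:52", so .days is -1.
raw_days = (appointment - scheduled).dt.days

# Dropping the time of day first gives the calendar-day difference.
calendar_days = (appointment.dt.normalize() - scheduled.dt.normalize()).dt.days

print(raw_days.iloc[0], calendar_days.iloc[0])
```

The notebook keeps the raw `.days` value and handles the -1 through the bin edges, which is why the first bin starts at -2.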

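# Labels for the waiting-time groups
time_names = ['Sameday', 'OneDay', 'OneWeek', 'TwoWeek', 'ThreeWeek', 'OneMonth']
# Bin edges used to "cut" the data into groups
time_edges = [-2, 0, 1, 7, 13, 18, 179]
# Create the WaitingDays column
df['WaitingDays'] = pd.cut(df['WaitingTime'], time_edges, labels=time_names)
# Show the first rows of the data frame to confirm the change
df.head()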


# Confirm the data types
df.info()

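# Remove the invalid ages
df = df[df['Age'] >= 0]
# Split the 'Age' column into age groups
# Labels for the age groups
age_names = ['childhood', 'juvenile', 'youth', 'middle_aged', 'elderly']
# Bin edges used to "cut" the data into groups
age_edges = [0, 6, 17, 40, 65, 179]
# Create the Generation column; include_lowest=True keeps Age == 0 in the first bin,
# which pd.cut's default right-closed bin (0, 6] would otherwise leave as NaN
df['Generation'] = pd.cut(df['Age'], age_edges, labels=age_names, include_lowest=True)
# Show the first rows of the data frame to confirm the change
df.head()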


Exploratory Data Analysis

1. Does a shorter interval between booking and appointment make attendance more likely?

# Plotting helper: bar chart of counts with the no-show probability on a second axis
def CP_plot(df, index, df_P, key=''):
    fig, ax1 = plt.subplots(figsize=(8, 5))
    df.plot(kind='bar', rot=0, ax=ax1)
    # Title and labels
    plt.ylabel('Counts', fontsize=14)
    plt.xlabel('{:s}'.format(key), fontsize=14)
    plt.title('Probability of the {:s}'.format(key), fontsize=14)
    # Plot the second data set on a second (right-hand) axis
    ax2 = ax1.twinx()
    ax2.set_ylim([0, 0.5])
    sns.pointplot(x=index, y=df_P, color='r', ax=ax2)
    plt.ylabel('Probability', fontsize=14)
    return None

# Compute the no-show probability for each waiting-time group
WaitingDays_Count = df.query('No_show == "Yes"').WaitingDays.value_counts(sort=False)
No_WaitingDays_Count = df.query('No_show == "No"').WaitingDays.value_counts(sort=False)
df_1 = pd.DataFrame({'Yes': WaitingDays_Count, 'No': No_WaitingDays_Count})
WaitingDays_P = WaitingDays_Count / df['WaitingDays'].value_counts(sort=False)
# Plot
CP_plot(df_1, WaitingDays_P.index, WaitingDays_P, key='WaitingDays')

df.groupby('WaitingDays')['No_show'].value_counts()

WaitingDays  No_show
Sameday      No         40870
             Yes         2905
OneDay       No          5123
             Yes         1602
OneWeek      No         16852
             Yes         5727
TwoWeek      No          6699
             Yes         2994
ThreeWeek    No          3948
             Yes         1878
OneMonth     No         14715
             Yes         7208
Name: No_show, dtype: int64

df.groupby('WaitingDays')['No_show'].value_counts().unstack()

# Same chart built directly from the grouped counts; WaitingDays_P is reused from the cell above
fig, ax1 = plt.subplots(figsize=(8, 5))
df.groupby('WaitingDays')['No_show'].value_counts().unstack().plot(kind='bar', rot=0, ax=ax1)
plt.ylabel('Count', fontsize=14)
# Plot the second data set on a second (right-hand) axis
ax2 = ax1.twinx()
ax2.set_ylim([0, 0.5])
sns.pointplot(x=WaitingDays_P.index, y=WaitingDays_P, color='r', ax=ax2)
plt.ylabel('Probability', fontsize=14)
plt.title('Probability of the WaitingDays', fontsize=14)

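Same-day bookings are the largest group. The shorter the interval between booking and appointment, the more likely the patient is to attend, and the no-show rate climbs as the interval grows: from about 7% for same-day appointments to about 33% in the longest waiting group.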
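The no-show rate of each waiting-time group can be checked by hand from the grouped counts printed above. The short sketch below does the arithmetic with plain Python; the numbers are copied from that output, and the variable names are chosen here for illustration.

```python
# (attended, missed) counts per waiting-time group, copied from the grouped output above.
counts = {
    "Sameday": (40870, 2905),
    "OneDay": (5123, 1602),
    "OneWeek": (16852, 5727),
    "TwoWeek": (6699, 2994),
    "ThreeWeek": (3948, 1878),
    "OneMonth": (14715, 7208),
}

# No-show rate = missed / (attended + missed)
no_show_rate = {
    label: missed / (attended + missed)
    for label, (attended, missed) in counts.items()
}

for label, rate in no_show_rate.items():
    print(f"{label:<10}{rate:6.1%}")
```

The rates rise monotonically across the groups, with by far the largest jump between same-day appointments and everything else.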

2. Do patients who receive welfare benefits attend at a higher rate?

# Compute the no-show rate of patients with and without welfare benefits
Scholarship_Count = df.query('No_show == "Yes"').Scholarship.value_counts()
No_Scholarship_Count = df.query('No_show == "No"').Scholarship.value_counts()
df_2 = pd.DataFrame({'Yes': Scholarship_Count, 'No': No_Scholarship_Count})
Scholarship_P = (df.query('Scholarship == 1')['No_show'] == "Yes").mean()
No_Scholarship_P = (df.query('Scholarship == 0')['No_show'] == "Yes").mean()
# Plot
CP_plot(df_2, ['0', '1'], [No_Scholarship_P, Scholarship_P], key='Scholarship')

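The plot shows that patients who receive welfare benefits are less likely to attend than patients who do not, so welfare coverage did not raise the on-time attendance rate.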
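The rate for each group can also be obtained in one step by turning the outcome into a boolean and averaging it per group, which reports the group size alongside the rate. This is a sketch on a tiny invented table, so its numbers say nothing about the real data; only the column names `Scholarship` and `No_show` come from the data set.

```python
import pandas as pd

# Toy data invented for this example; the real analysis would use df instead.
toy = pd.DataFrame({
    "Scholarship": [0, 0, 0, 0, 1, 1, 1, 1],
    "No_show": ["No", "No", "No", "Yes", "No", "No", "Yes", "Yes"],
})

summary = (
    toy.assign(missed=toy["No_show"].eq("Yes"))  # True where the appointment was missed
       .groupby("Scholarship")["missed"]
       .agg(rate="mean", n="size")               # share of missed appointments and group size
)
print(summary)
```

The same pattern applies to any binary column in the data set, for example `SMS_received`.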

3. How does attendance differ by gender and age group?

A. How does the attendance rate differ by gender?

# Compute the no-show rate for each gender
Gender_Count = df.query('No_show == "Yes"').Gender.value_counts()
No_Gender_Count = df.query('No_show == "No"').Gender.value_counts()
df_3 = pd.DataFrame({'Yes': Gender_Count, 'No': No_Gender_Count})
Female_Yes = (df.query('Gender == "F"')['No_show'] == "Yes").mean()
Male_Yes = (df.query('Gender == "M"')['No_show'] == "Yes").mean()
# Plot
CP_plot(df_3, ['Female', 'Male'], [Female_Yes, Male_Yes], key='Gender')

Male patients are slightly less likely to attend than female patients.

B. How does the attendance rate differ by age group?

# Compute the no-show rate for each age group
Generation_Count = df.query('No_show == "Yes"').Generation.value_counts(sort=False)
No_Generation_Count = df.query('No_show == "No"').Generation.value_counts(sort=False)
df_4 = pd.DataFrame({'Yes': Generation_Count, 'No': No_Generation_Count})
Generation_P = Generation_Count / df['Generation'].value_counts(sort=False)
# Plot
CP_plot(df_4, Generation_P.index, Generation_P, key='Generation')

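Middle-aged patients (41 to 65) account for the largest number of kept appointments. Elderly patients over 65 are the most likely to attend, while adolescent patients are less likely to attend.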

C. How does attendance differ by gender and age group combined?

# Count appointments by attendance status and age group, separately for each gender
female_counts = df.query('Gender == "F"').groupby(['No_show', 'Generation'])['Age'].count()
male_counts = df.query('Gender == "M"').groupby(['No_show', 'Generation'])['Age'].count()
# No-show rate of each age group for female and male patients
Female_Yes = female_counts['Yes'] / df.query('Gender == "F"').groupby(['Generation'])['Age'].count()
Male_Yes = male_counts['Yes'] / df.query('Gender == "M"').groupby(['Generation'])['Age'].count()
df_111 = pd.DataFrame({'F': Female_Yes, 'M': Male_Yes})
# Plot
df_111.plot(figsize=(8, 5), style='o-')
# Title and labels
plt.ylabel('Probability', fontsize=14)
plt.xlabel('Generation', fontsize=14)
labels = ['childhood', 'juvenile', 'youth', 'middle_aged', 'elderly']  # x-axis tick labels
plt.xticks(np.arange(len(Female_Yes)), labels)
plt.title('Probability of the Generation and Gender', fontsize=14);

For both male and female patients, the no-show rate is highest among juveniles and lowest among the elderly.
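A two-way table of no-show rates, with age groups as rows and genders as columns, can also be produced directly with `pivot_table`. The sketch below uses a small invented table (only the column names match the data set), so the resulting rates are purely illustrative.

```python
import pandas as pd

# Toy data invented for this example; the real analysis would use df instead.
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Generation": ["juvenile", "juvenile", "elderly", "elderly",
                   "juvenile", "juvenile", "elderly", "elderly"],
    "No_show": ["Yes", "No", "No", "No", "Yes", "Yes", "No", "Yes"],
})

rate_table = (
    toy.assign(missed=toy["No_show"].eq("Yes").astype(float))  # 1.0 = missed, 0.0 = attended
       .pivot_table(index="Generation", columns="Gender",
                    values="missed", aggfunc="mean")           # mean of 0/1 values = no-show rate
)
print(rate_table)
```

Calling `rate_table.plot(style='o-')` on such a table would give one line per gender.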

# Appointment counts per age group for female patients
female_counts = df.query('Gender == "F"').groupby(['No_show', 'Generation'])['Age'].count()
Female_No = female_counts['No']
Female_Yes = female_counts['Yes']
# Appointment counts per age group for male patients
male_counts = df.query('Gender == "M"').groupby(['No_show', 'Generation'])['Age'].count()
Male_No = male_counts['No']
Male_Yes = male_counts['Yes']

# Plot kept and missed appointment counts by age group, one panel per gender
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
ind = np.arange(len(Female_No))  # x positions of the groups
width = 0.35
labels = ['childhood', 'juvenile', 'youth', 'middle_aged', 'elderly']  # x-axis tick labels
panels = [('Female', Female_No, Female_Yes), ('Male', Male_No, Male_Yes)]
for ax, (title, no_counts, yes_counts) in zip(axes, panels):
    # Draw the bars
    ax.bar(ind, no_counts, width, color='r', alpha=.7, label='No')
    ax.bar(ind + width, yes_counts, width, color='b', alpha=.7, label='Yes')
    # Title and labels
    ax.set_ylabel('Counts')
    ax.set_xlabel('Generation')
    ax.set_title(title)
    ax.set_xticks(ind + width / 2)  # x-axis tick positions
    ax.set_xticklabels(labels)
    # Legend
    ax.legend()

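Among female patients, the 41 to 65 age group has the most kept appointments, and the 0 to 6 group has the fewest kept and the fewest missed appointments. Among male patients, the 41 to 65 group also has the most kept appointments, while the over-65 group has the fewest kept and the fewest missed appointments. For both men and women, the age group with the highest attendance rate is over 65.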

5. What is the attendance rate of patients who received an SMS reminder?

# Compute the no-show rate of patients who did and did not receive an SMS reminder
SMS_received_Count = df.query('No_show == "Yes"').SMS_received.value_counts(sort=False)
No_SMS_received_Count = df.query('No_show == "No"').SMS_received.value_counts(sort=False)
df_s = pd.DataFrame({'Yes': SMS_received_Count, 'No': No_SMS_received_Count})
SMS_received_P = (df.query('SMS_received == 1')['No_show'] == "Yes").mean()
No_SMS_received_P = (df.query('SMS_received == 0')['No_show'] == "Yes").mean()
# Plot
CP_plot(df_s, ['0', '1'], [No_SMS_received_P, SMS_received_P], key='SMS_received')

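The plot shows that patients who received an SMS reminder attended at a lower rate than patients who did not, so the SMS reminder did not raise the attendance rate.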

6. How does the attendance rate differ by type of disease?

A. Age distribution for each disease

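# Set the style and the title
sns.set_style('dark')
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4), sharex=True)
f.suptitle('Age distribution of different types of Diseases')
# Plot; histplot with kde=True stands in for the deprecated sns.distplot
sns.histplot(df.query('Hipertension == 1')['Age'], kde=True, stat='density', ax=ax1)
sns.histplot(df.query('Diabetes == 1')['Age'], kde=True, stat='density', color='g', ax=ax2)
sns.histplot(df.query('Alcoholism == 1')['Age'], kde=True, stat='density', color='r', ax=ax3)
# Name each panel; legend('Hipertension') would treat the string as a list of one-letter labels
ax1.set_title('Hipertension')
ax2.set_title('Diabetes')
ax3.set_title('Alcoholism');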

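# Number of patients with each disease in each age group
# (only the three disease columns are summed, so text and datetime columns are left out)
m = df.groupby('Generation')[['Hipertension', 'Diabetes', 'Alcoholism']].sum()
# Set the style and the title
sns.set_style('dark')
f, axes = plt.subplots(1, 3, figsize=(15, 4))
f.suptitle('Generation distribution of different types of Diseases')
# Plot one pie chart per disease
m.Hipertension.plot(kind='pie', autopct='%.0f%%', ax=axes[0])
m.Diabetes.plot(kind='pie', autopct='%.0f%%', ax=axes[1])
m.Alcoholism.plot(kind='pie', autopct='%.0f%%', ax=axes[2])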

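The plots show that patients with hypertension or diabetes are concentrated between the ages of 50 and 70, while patients with alcoholism are concentrated around the age of 50. By age group, all three conditions have very few patients in the childhood and juvenile groups, and the middle-aged group has the most patients.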

B. Attendance rate for each disease

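# No-show rate in each age group among patients with each disease
diseases = ['Hipertension', 'Diabetes', 'Alcoholism']
m1 = (df.query('No_show == "Yes"').groupby('Generation')[diseases].sum()
      / df.groupby('Generation')[diseases].sum())
# Set the style and the title
sns.set_style('dark')
f, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(6, 6), sharex=True)
f.suptitle('No-show rate by Generation for different types of Diseases')
# Plot; recent seaborn versions require x and y as keyword arguments
sns.pointplot(x=m1.index, y=m1.Hipertension, ax=ax1)
sns.pointplot(x=m1.index, y=m1.Diabetes, color='g', ax=ax2)
sns.pointplot(x=m1.index, y=m1.Alcoholism, color='r', ax=ax3);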

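For hypertension and diabetes, children show a relatively high no-show rate, and for alcoholism the no-show rate is highest among children and juveniles. For all three conditions, however, the number of children and juveniles is too small for these results to be meaningful. Looking only at the other three age groups, the older the patient, the more likely they are to attend, whatever the condition.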
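One way to keep tiny groups from being over-interpreted is to report the group size next to each rate and blank out rates that rest on too few patients. The sketch below illustrates the idea on invented data; the threshold `MIN_N` and the toy rows are assumptions made for this example, not values taken from the analysis.

```python
import pandas as pd

# Toy data invented for this example: every row is a patient with diabetes.
toy = pd.DataFrame({
    "Generation": ["childhood"] * 2 + ["youth"] * 6 + ["elderly"] * 8,
    "Diabetes": 1,
    "No_show": ["Yes", "No"]
               + ["Yes", "Yes", "No", "No", "No", "No"]
               + ["Yes"] + ["No"] * 7,
})

MIN_N = 5  # hypothetical minimum group size for reporting a rate

stats = (
    toy[toy["Diabetes"] == 1]
    .assign(missed=lambda d: d["No_show"].eq("Yes").astype(float))
    .groupby("Generation", sort=False)["missed"]
    .agg(rate="mean", n="size")
)
# Keep the rate only where the group is large enough; smaller groups become NaN.
stats["rate_reported"] = stats["rate"].where(stats["n"] >= MIN_N)
print(stats)
```

In this toy table the childhood group has only two patients, so its rate is suppressed rather than plotted alongside the others.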

Conclusion

From the analysis above, we can draw the following conclusions:

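- The shorter the interval between booking and appointment, the more likely the patient is to attend; same-day appointments have the highest attendance rate.
- Welfare benefits and SMS reminders did not help raise the attendance rate.
- Looking at gender alone, attendance rates differ little.
- Middle-aged patients account for the largest number of kept appointments; elderly patients over 65 are the most likely to attend, and adolescent patients are less likely to attend.
- Among female patients, the 41 to 65 age group has the most kept appointments, and the 0 to 6 group has the fewest kept and the fewest missed appointments.
- Among male patients, the 41 to 65 age group has the most kept appointments, and the over-65 group has the fewest kept and the fewest missed appointments.
- Patients over 65 have the highest attendance rate among both men and women.
- Patients with hypertension or diabetes are concentrated between the ages of 50 and 70, and patients with alcoholism around the age of 50. Very few children or juveniles have any of the three conditions, and for each condition, the older the patient, the more likely they are to attend.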