http://liao.cpython.org/pandas20/
http://liao.cpython.org/pandas21/
    import numpy as np
    import pandas as pd

    val = np.arange(10, 38).reshape(7, 4)
    col = 'a b c d'.split()
    idx = 'this is just a fake practise today'.split()
    df = pd.DataFrame(val, index=idx, columns=col)
    df['e'] = np.nan          # new column, all NaN
    df.at['is', 'a'] = 100
    df.at['a', 'c'] = 300
    df.loc['d'] = np.nan      # new row, all NaN
    df['f'] = np.nan          # another all-NaN column
    df
    Out[74]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    NaN    NaN
    is        100.0   15.0   16.0   17.0    NaN    NaN
    just       18.0   19.0   20.0   21.0    NaN    NaN
    a          22.0   23.0  300.0   25.0    NaN    NaN
    fake       26.0   27.0   28.0   29.0    NaN    NaN
    practise   30.0   31.0   32.0   33.0    NaN    NaN
    today      34.0   35.0   36.0   37.0    NaN    NaN
    d           NaN    NaN    NaN    NaN    NaN    NaN
    df.isnull()
    Out[76]:
                  a      b      c      d      e      f
    this      False  False  False  False   True   True
    is        False  False  False  False   True   True
    just      False  False  False  False   True   True
    a         False  False  False  False   True   True
    fake      False  False  False  False   True   True
    practise  False  False  False  False   True   True
    today     False  False  False  False   True   True
    d          True   True   True   True   True   True
    df.isnull().sum()
    Out[77]:
    a    1
    b    1
    c    1
    d    1
    e    8
    f    8
    dtype: int64
    df.isnull().sum().sum()
    Out[78]: 20
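The counting steps above can be sketched on a smaller frame. This is a minimal illustration, not the article's data: the frame `demo` and its columns `x`, `y`, `z` are made up for the example. It also shows two variations the article does not use, counting per row with `axis=1` and taking the mean of the mask to get the fraction of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame (not the article's df)
demo = pd.DataFrame(
    {
        "x": [1.0, np.nan, 3.0],
        "y": [np.nan, np.nan, 6.0],
        "z": [7.0, 8.0, 9.0],
    }
)

per_column = demo.isnull().sum()         # number of NaN in each column
per_row = demo.isnull().sum(axis=1)      # number of NaN in each row
total = int(demo.isnull().sum().sum())   # number of NaN in the whole frame
share = demo.isnull().mean()             # fraction of NaN in each column

print(per_column)
print(per_row)
print(total)
print(share)
```

Summing a boolean mask works because `True` counts as 1 and `False` as 0, which is also why the mean of the mask gives a fraction.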
Count the non-null values in each column:
    df.count()
    Out[79]:
    a    7
    b    7
    c    7
    d    7
    e    0
    f    0
    dtype: int64
The counterpart of isnull() is notnull():
    df.notnull()
    Out[80]:
                  a      b      c      d      e      f
    this       True   True   True   True  False  False
    is         True   True   True   True  False  False
    just       True   True   True   True  False  False
    a          True   True   True   True  False  False
    fake       True   True   True   True  False  False
    practise   True   True   True   True  False  False
    today      True   True   True   True  False  False
    d         False  False  False  False  False  False
Construct a df with some non-null values in column e:
    df.at['is', 'e'] = 100
    df.at['d', 'e'] = 520
    df
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    NaN    NaN
    is        100.0   15.0   16.0   17.0  100.0    NaN
    just       18.0   19.0   20.0   21.0    NaN    NaN
    a          22.0   23.0  300.0   25.0    NaN    NaN
    fake       26.0   27.0   28.0   29.0    NaN    NaN
    practise   30.0   31.0   32.0   33.0    NaN    NaN
    today      34.0   35.0   36.0   37.0    NaN    NaN
    d           NaN    NaN    NaN    NaN  520.0    NaN
Get the boolean mask and use it to select the non-null values:
    df.e.notnull()
    Out[88]:
    this        False
    is           True
    just        False
    a           False
    fake        False
    practise    False
    today       False
    d            True
    Name: e, dtype: bool

    df.e[df.e.notnull()]
    Out[89]:
    is    100.0
    d     520.0
    Name: e, dtype: float64
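The same boolean mask can also select whole rows of the DataFrame, not just values of one Series. The sketch below assumes a small made-up frame `demo`, and uses `.loc` for the selection, which is one common way to write it rather than the article's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame (not the article's df)
demo = pd.DataFrame(
    {"a": [10.0, 100.0, 18.0], "e": [np.nan, 100.0, np.nan]},
    index=["this", "is", "just"],
)

mask = demo["e"].notnull()     # boolean Series aligned with demo's index

rows = demo.loc[mask]          # all columns, only rows where e is not NaN
values = demo.loc[mask, "e"]   # only column e, same rows

print(rows)
print(values)
```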
Calling dropna on a Series column of the DataFrame returns that column without its missing values, but it does not affect the DataFrame itself:
    df.e.dropna()
    Out[92]:
    is    100.0
    d     520.0
    Name: e, dtype: float64

    df
    Out[93]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    NaN    NaN
    is        100.0   15.0   16.0   17.0  100.0    NaN
    just       18.0   19.0   20.0   21.0    NaN    NaN
    a          22.0   23.0  300.0   25.0    NaN    NaN
    fake       26.0   27.0   28.0   29.0    NaN    NaN
    practise   30.0   31.0   32.0   33.0    NaN    NaN
    today      34.0   35.0   36.0   37.0    NaN    NaN
    d           NaN    NaN    NaN    NaN  520.0    NaN
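A short sketch of why the DataFrame is unchanged: `Series.dropna()` returns a new Series. If the goal is instead to drop the DataFrame rows where one column is missing, `DataFrame.dropna` accepts a `subset` argument. The frame `demo` below is made up for the illustration, and the `subset` variant is an addition not shown in the article.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame (not the article's df)
demo = pd.DataFrame(
    {"a": [10.0, 100.0, np.nan], "e": [np.nan, 100.0, 520.0]},
    index=["this", "is", "d"],
)

cleaned = demo["e"].dropna()           # new Series without the NaN entry
still_missing = int(demo["e"].isnull().sum())  # demo itself is untouched

# Drop whole rows of the frame based on column e only
filtered = demo.dropna(subset=["e"])

print(cleaned)
print(still_missing)
print(filtered)
```

Note that row `d` survives in `filtered` even though its `a` value is NaN, because only column `e` is inspected.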
    df
    Out[97]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    NaN    NaN
    is        100.0   15.0   16.0   17.0  100.0  999.0
    just       18.0   19.0   20.0   21.0    NaN    NaN
    a          22.0   23.0  300.0   25.0    NaN    NaN
    fake       26.0   27.0   28.0   29.0    NaN    NaN
    practise   30.0   31.0   32.0   33.0    NaN    NaN
    today      34.0   35.0   36.0   37.0    NaN    NaN
    d           NaN    NaN    NaN    NaN  520.0    NaN

    df.dropna()
    Out[96]:
            a     b     c     d      e      f
    is  100.0  15.0  16.0  17.0  100.0  999.0
    df
    Out[99]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    NaN    NaN
    is        100.0   15.0   16.0   17.0  100.0  999.0
    just       18.0   19.0   20.0   21.0    NaN    NaN
    a          22.0   23.0  300.0   25.0    NaN    NaN
    fake       26.0   27.0   28.0   29.0    NaN    NaN
    practise   30.0   31.0   32.0   33.0    NaN    NaN
    today      34.0   35.0   36.0   37.0    NaN    NaN
    d           NaN    NaN    NaN    NaN    NaN    NaN

    df.dropna(how = 'all')
    Out[100]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    NaN    NaN
    is        100.0   15.0   16.0   17.0  100.0  999.0
    just       18.0   19.0   20.0   21.0    NaN    NaN
    a          22.0   23.0  300.0   25.0    NaN    NaN
    fake       26.0   27.0   28.0   29.0    NaN    NaN
    practise   30.0   31.0   32.0   33.0    NaN    NaN
    today      34.0   35.0   36.0   37.0    NaN    NaN
    df
    Out[102]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    NaN    NaN
    is        100.0   15.0   16.0   17.0  100.0  999.0
    just       18.0   19.0   20.0   21.0    NaN    NaN
    a          22.0   23.0  300.0   25.0    NaN    NaN
    fake       26.0   27.0   28.0   29.0    NaN    NaN
    practise   30.0   31.0   32.0   33.0    NaN    NaN
    today      34.0   35.0   36.0   37.0    NaN    NaN
    d           NaN    NaN    NaN    NaN    NaN    NaN

    # Keep the columns that have at least 7 non-NaN values:
    df.dropna(axis = 1, thresh = 7)
    Out[103]:
                  a      b      c      d
    this       10.0   11.0   12.0   13.0
    is        100.0   15.0   16.0   17.0
    just       18.0   19.0   20.0   21.0
    a          22.0   23.0  300.0   25.0
    fake       26.0   27.0   28.0   29.0
    practise   30.0   31.0   32.0   33.0
    today      34.0   35.0   36.0   37.0
    d           NaN    NaN    NaN    NaN
Keep the rows that have at least 4 non-null values:
    df.dropna(axis = 0, thresh = 4)
    Out[108]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    NaN    NaN
    is        100.0   15.0   16.0   17.0  100.0  999.0
    just       18.0   19.0   20.0   21.0    NaN    NaN
    a          22.0   23.0  300.0   25.0    NaN    NaN
    fake       26.0   27.0   28.0   29.0    NaN    NaN
    practise   30.0   31.0   32.0   33.0    NaN    NaN
    today      34.0   35.0   36.0   37.0    NaN    NaN
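The dropna variants used above can be compared side by side on a smaller frame. This is a sketch under stated assumptions: `demo`, its row labels `r1` to `r4`, and its values are invented so that each call keeps a different set of rows or columns.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame (not the article's df).
# Non-NaN counts per row: r1=3, r2=0, r3=2, r4=1
# Non-NaN counts per column: a=2, b=1, c=3
demo = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": [1.0, np.nan, np.nan, np.nan],
        "c": [1.0, np.nan, 3.0, 6.0],
    },
    index=["r1", "r2", "r3", "r4"],
)

any_missing = demo.dropna()                 # default how='any': drop rows with any NaN
all_missing = demo.dropna(how="all")        # drop only rows where every value is NaN
at_least_two = demo.dropna(thresh=2)        # keep rows with at least 2 non-NaN values
full_columns = demo.dropna(axis=1, thresh=3)  # keep columns with at least 3 non-NaN values

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(at_least_two.index.tolist())
print(full_columns.columns.tolist())
```

`thresh` counts values that are present, not values that are missing, which is easy to read the wrong way around.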
    df.fillna(0)
    Out[109]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    0.0    0.0
    is        100.0   15.0   16.0   17.0  100.0  999.0
    just       18.0   19.0   20.0   21.0    0.0    0.0
    a          22.0   23.0  300.0   25.0    0.0    0.0
    fake       26.0   27.0   28.0   29.0    0.0    0.0
    practise   30.0   31.0   32.0   33.0    0.0    0.0
    today      34.0   35.0   36.0   37.0    0.0    0.0
    d           0.0    0.0    0.0    0.0    0.0    0.0
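method = 'ffill': forward fill. Each gap is filled with the last valid value before it in the same column (or row).
method = 'bfill': backward fill. Each gap is filled with the next valid value after it in the same column (or row).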
    df
    Out[124]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    NaN    NaN
    is        100.0   15.0   16.0   17.0  100.0  999.0
    just       18.0   19.0   20.0   21.0    NaN    NaN
    a          22.0   23.0  300.0   25.0    NaN    NaN
    fake       26.0   27.0   28.0   29.0    NaN    NaN
    practise   30.0   31.0   32.0   33.0    NaN    NaN
    today      34.0   35.0   36.0   37.0   88.0    NaN
    d           NaN    NaN    NaN    NaN    NaN    NaN

    df.f.fillna(method = 'bfill')
    Out[125]:
    this        999.0
    is          999.0
    just          NaN
    a             NaN
    fake          NaN
    practise      NaN
    today         NaN
    d             NaN
    Name: f, dtype: float64

    df.f.fillna(method = 'ffill')
    Out[126]:
    this          NaN
    is          999.0
    just        999.0
    a           999.0
    fake        999.0
    practise    999.0
    today       999.0
    d           999.0
    Name: f, dtype: float64
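In pandas 2.1 and later, passing `method=` to `fillna` is deprecated in favor of the dedicated `ffill()` and `bfill()` methods. The sketch below reproduces the forward and backward fill on a shortened, made-up version of column `f` (the Series `s` is an assumption for the example), and adds the `limit` parameter, which the article does not cover.

```python
import numpy as np
import pandas as pd

# Hypothetical short Series shaped like the article's column f
s = pd.Series(
    [np.nan, 999.0, np.nan, np.nan],
    index=["this", "is", "just", "a"],
)

forward = s.ffill()          # gaps take the last valid value before them
backward = s.bfill()         # gaps take the next valid value after them
limited = s.ffill(limit=1)   # fill at most 1 consecutive gap

print(forward)
print(backward)
print(limited)
```

A leading gap has nothing before it, so forward fill leaves `this` as NaN. Likewise, backward fill leaves the trailing gaps as NaN.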
Filling rules differ between columns and rows. By default (axis = 0) values are propagated down each column; with axis = 1 they are propagated across each row:
    df
    Out[131]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    NaN    NaN
    is        100.0   15.0   16.0   17.0  100.0  999.0
    just       18.0   19.0   20.0   21.0    NaN    NaN
    a          22.0   23.0  300.0   25.0    NaN    NaN
    fake       26.0   27.0   28.0   29.0    NaN    NaN
    practise   30.0   31.0   32.0   33.0    NaN    NaN
    today      34.0   35.0   36.0   37.0   88.0    NaN
    d           NaN    NaN  666.0  666.0  666.0  666.0

    df.fillna(method = 'ffill')
    Out[132]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    NaN    NaN
    is        100.0   15.0   16.0   17.0  100.0  999.0
    just       18.0   19.0   20.0   21.0  100.0  999.0
    a          22.0   23.0  300.0   25.0  100.0  999.0
    fake       26.0   27.0   28.0   29.0  100.0  999.0
    practise   30.0   31.0   32.0   33.0  100.0  999.0
    today      34.0   35.0   36.0   37.0   88.0  999.0
    d          34.0   35.0  666.0  666.0  666.0  666.0

    df.fillna(method = 'ffill', axis = 0)
    Out[134]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    NaN    NaN
    is        100.0   15.0   16.0   17.0  100.0  999.0
    just       18.0   19.0   20.0   21.0  100.0  999.0
    a          22.0   23.0  300.0   25.0  100.0  999.0
    fake       26.0   27.0   28.0   29.0  100.0  999.0
    practise   30.0   31.0   32.0   33.0  100.0  999.0
    today      34.0   35.0   36.0   37.0   88.0  999.0
    d          34.0   35.0  666.0  666.0  666.0  666.0

    df.fillna(method = 'ffill', axis = 1)
    Out[136]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0   13.0   13.0
    is        100.0   15.0   16.0   17.0  100.0  999.0
    just       18.0   19.0   20.0   21.0   21.0   21.0
    a          22.0   23.0  300.0   25.0   25.0   25.0
    fake       26.0   27.0   28.0   29.0   29.0   29.0
    practise   30.0   31.0   32.0   33.0   33.0   33.0
    today      34.0   35.0   36.0   37.0   88.0   88.0
    d           NaN    NaN  666.0  666.0  666.0  666.0
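The effect of the axis argument is easier to follow on two rows. The following sketch uses a made-up frame `demo` with columns borrowed from the article's naming, and the `ffill()` method rather than `fillna(method='ffill')`; both names and values are assumptions for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame (not the article's df)
demo = pd.DataFrame(
    {"c": [12.0, np.nan], "d": [13.0, 666.0], "e": [np.nan, np.nan]},
    index=["this", "d"],
)

down = demo.ffill()           # axis=0 (default): each column is filled top to bottom
across = demo.ffill(axis=1)   # axis=1: each row is filled left to right

print(down)
print(across)
```

Filling down leaves column `e` entirely NaN because it has no valid value to propagate. Filling across leaves `c` in row `d` as NaN because nothing sits to its left.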
    df
    Out[137]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    NaN    NaN
    is        100.0   15.0   16.0   17.0  100.0  999.0
    just       18.0   19.0   20.0   21.0    NaN    NaN
    a          22.0   23.0  300.0   25.0    NaN    NaN
    fake       26.0   27.0   28.0   29.0    NaN    NaN
    practise   30.0   31.0   32.0   33.0    NaN    NaN
    today      34.0   35.0   36.0   37.0   88.0    NaN
    d           NaN    NaN  666.0  666.0  666.0  666.0

    fill = pd.Series([5, 6, 7], index = ['just', 'a', 'fake'])
    # Assign the result back; fillna(..., inplace = True) on df['e']
    # is not guaranteed to update df in recent pandas versions.
    df['e'] = df['e'].fillna(fill)
    df
    Out[140]:
                  a      b      c      d      e      f
    this       10.0   11.0   12.0   13.0    NaN    NaN
    is        100.0   15.0   16.0   17.0  100.0  999.0
    just       18.0   19.0   20.0   21.0    5.0    NaN
    a          22.0   23.0  300.0   25.0    6.0    NaN
    fake       26.0   27.0   28.0   29.0    7.0    NaN
    practise   30.0   31.0   32.0   33.0    NaN    NaN
    today      34.0   35.0   36.0   37.0   88.0    NaN
    d           NaN    NaN  666.0  666.0  666.0  666.0
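Filling from a Series works by index alignment: only the labels present in the fill Series are touched, and existing non-null values are never overwritten. The sketch below shows this on a made-up frame `demo`, and adds a dict-based fill with a different value per column, a variation the article does not show.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame (not the article's df)
demo = pd.DataFrame(
    {
        "e": [np.nan, 100.0, np.nan, np.nan],
        "f": [np.nan, 999.0, np.nan, np.nan],
    },
    index=["this", "is", "just", "a"],
)

# Fill values keyed by row label; 'this' is absent, so it stays NaN
fill = pd.Series([5, 6], index=["just", "a"])
demo["e"] = demo["e"].fillna(fill)   # assign back instead of using inplace=True

# A dict gives one fill value per column
by_column = demo.fillna({"e": 0, "f": -1})

print(demo)
print(by_column)
```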
URL: Data Cleaning with Pandas https://m.mcbbbk.com/newsview475149.html