×

数据分析工具pandas快速入门教程5-处理缺失数据

第5章 缺失数据

介绍

很少没有任何缺失值的数据集。 有许多缺失数据的表示。 在数据库中是NULL值,一些编程语言使用NA。缺失值可以是空字符串:”或者甚至是数值88或99等。Pandas显示缺失值为NaN。

本章将涵盖:

  • 什么是缺失值
  • 如何创建缺失值
  • 如何重新编码并使用缺失值进行计算

什么是缺失值

可以从numpy中获得NaN值,在Python中看到缺失值使用几种方式显示:NaN,NAN或nan,他们都是相等的。

NaN不等于0或空字符串”。

In [1]: from numpy import NaN, NAN, nanIn [2]: print(NaN == True, NaN == False, NaN == 0, NaN == '', sep='|')False|False|False|FalseIn [3]: print(NaN == NaN, NaN == nan, NaN == NAN, nan == NAN, sep='|')False|False|False|FalseIn [4]: import pandas as pdIn [5]: print(pd.isnull(NaN), pd.isnull(nan), pd.isnull(NAN), sep='|')True|True|TrueIn [6]: print(pd.notnull(NaN), pd.notnull(99), pd.notnull("https://china-testing.github.io"), sep='|')False|True|True

缺失值的来源

来自加载数据或数据处理

  • 加载数据

当我们加载数据时,pandas会自动找到该缺少数据的单元格,并填充NaN值。在read_csv函数中,参数na_values, keep_default_na, na_filter用于处理缺失值。比如:na_values=[99]。na_filter设置为False,在读大文件时会提升性能。

5-1.py

import pandas as pdvisited_file = 'data/survey_visited.csv'print(pd.read_csv(visited_file))print(pd.read_csv(visited_file, keep_default_na=False))print(pd.read_csv(visited_file, na_values=[''], keep_default_na=False))

执行结果

$ python3 5-1.py    ident   site       dated0    619   DR-1  1927-02-081    622   DR-1  1927-02-102    734   DR-3  1939-01-073    735   DR-3  1930-01-124    751   DR-3  1930-02-265    752   DR-3         NaN6    837  MSK-4  1932-01-147    844   DR-1  1932-03-22   ident   site       dated0    619   DR-1  1927-02-081    622   DR-1  1927-02-102    734   DR-3  1939-01-073    735   DR-3  1930-01-124    751   DR-3  1930-02-265    752   DR-3            6    837  MSK-4  1932-01-147    844   DR-1  1932-03-22   ident   site       dated0    619   DR-1  1927-02-081    622   DR-1  1927-02-102    734   DR-3  1939-01-073    735   DR-3  1930-01-124    751   DR-3  1930-02-265    752   DR-3         NaN6    837  MSK-4  1932-01-147    844   DR-1  1932-03-22
  • 合并数据
import pandas as pdvisited = pd.read_csv('data/survey_visited.csv')survey = pd.read_csv('data/survey_survey.csv')print(visited)print(survey)vs = visited.merge(survey, left_on='ident', right_on='taken')print(vs)

执行结果

$ python3 5-2.py    ident   site       dated0    619   DR-1  1927-02-081    622   DR-1  1927-02-102    734   DR-3  1939-01-073    735   DR-3  1930-01-124    751   DR-3  1930-02-265    752   DR-3         NaN6    837  MSK-4  1932-01-147    844   DR-1  1932-03-22    taken person quant  reading0     619   dyer   rad     9.821     619   dyer   sal     0.132     622   dyer   rad     7.803     622   dyer   sal     0.094     734     pb   rad     8.415     734   lake   sal     0.056     734     pb  temp   -21.507     735     pb   rad     7.228     735    NaN   sal     0.069     735    NaN  temp   -26.0010    751     pb   rad     4.3511    751     pb  temp   -18.5012    751   lake   sal     0.1013    752   lake   rad     2.1914    752   lake   sal     0.0915    752   lake  temp   -16.0016    752    roe   sal    41.6017    837   lake   rad     1.4618    837   lake   sal     0.2119    837    roe   sal    22.5020    844    roe   rad    11.25    ident   site       dated  taken person quant  reading0     619   DR-1  1927-02-08    619   dyer   rad     9.821     619   DR-1  1927-02-08    619   dyer   sal     0.132     622   DR-1  1927-02-10    622   dyer   rad     7.803     622   DR-1  1927-02-10    622   dyer   sal     0.094     734   DR-3  1939-01-07    734     pb   rad     8.415     734   DR-3  1939-01-07    734   lake   sal     0.056     734   DR-3  1939-01-07    734     pb  temp   -21.507     735   DR-3  1930-01-12    735     pb   rad     7.228     735   DR-3  1930-01-12    735    NaN   sal     0.069     735   DR-3  1930-01-12    735    NaN  temp   -26.0010    751   DR-3  1930-02-26    751     pb   rad     4.3511    751   DR-3  1930-02-26    751     pb  temp   -18.5012    751   DR-3  1930-02-26    751   lake   sal     0.1013    752   DR-3         NaN    752   lake   rad     2.1914    752   DR-3         NaN    752   lake   sal     0.0915    752   DR-3         NaN    752   lake  temp   -16.0016    752   DR-3         NaN    752    roe   sal    41.6017    837  MSK-4  1932-01-14    837   lake   rad     1.4618    837  MSK-4  1932-01-14    837   lake   sal     0.2119    837  MSK-4  1932-01-14    837    roe   sal    22.5020    844   DR-1  1932-03-22    844    roe   rad    11.25
  • 用户输入
import pandas as pdfrom numpy import NaN, NAN, nannum_legs = pd.Series({'goat': 4, 'amoeba': nan})print(num_legs)scientists = pd.DataFrame({'Name': ['Rosaline Franklin', 'William Gosset'],                           'Occupation': ['Chemist', 'Statistician'],                           'Born': ['1920-07-25', '1876-06-13'],                           'Died': ['1958-04-16', '1937-10-16'],                           'missing': [NaN, nan]})print(scientists)scientists['missing'] = nanprint(scientists)

执行结果

$ python3 5-3.py amoeba    NaNgoat      4.0dtype: float64         Born        Died               Name    Occupation  missing0  1920-07-25  1958-04-16  Rosaline Franklin       Chemist      NaN1  1876-06-13  1937-10-16     William Gosset  Statistician      NaN         Born        Died               Name    Occupation  missing0  1920-07-25  1958-04-16  Rosaline Franklin       Chemist      NaN1  1876-06-13  1937-10-16     William Gosset  Statistician      NaN
  • 重新索引

5-4.py

import pandas as pdfrom numpy import NaN, NAN, nangapminder = pd.read_csv('data/gapminder.tsv', sep='\t')life_exp = gapminder.groupby(['year'])['lifeExp'].mean()print(life_exp)print(life_exp.reindex(range(2000, 2010)))

执行结果

year1952    49.0576201957    51.5074011962    53.6092491967    55.6782901972    57.6473861977    59.5701571982    61.5331971987    63.2126131992    64.1603381997    65.0146762002    65.6949232007    67.007423Name: lifeExp, dtype: float64year2000          NaN2001          NaN2002    65.6949232003          NaN2004          NaN2005          NaN2006          NaN2007    67.0074232008          NaN2009          NaNName: lifeExp, dtype: float64

处理缺失数据

  • 统计缺失数据

5-5.py

import pandas as pdfrom numpy import NaN, NAN, nanimport numpy as npebola = pd.read_csv('data/country_timeseries.csv')print(ebola.head())print(ebola.count())num_rows = ebola.shape[0]print("num_rows")print(num_rows)num_missing = num_rows - ebola.count()print("num_missing:")print(num_missing)print(np.count_nonzero(ebola.isnull()))print(np.count_nonzero(ebola['Cases_Guinea'].isnull()))print(ebola.Cases_Guinea.value_counts(dropna=False).head())

执行结果

 Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \0    1/5/2015  289        2776.0            NaN            10030.0   1    1/4/2015  288        2775.0            NaN             9780.0   2    1/3/2015  287        2769.0         8166.0             9722.0   3    1/2/2015  286           NaN         8157.0                NaN   4  12/31/2014  284        2730.0         8115.0             9633.0      Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \0            NaN            NaN                 NaN          NaN         NaN   1            NaN            NaN                 NaN          NaN         NaN   2            NaN            NaN                 NaN          NaN         NaN   3            NaN            NaN                 NaN          NaN         NaN   4            NaN            NaN                 NaN          NaN         NaN      Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \0         1786.0             NaN              2977.0             NaN   1         1781.0             NaN              2943.0             NaN   2         1767.0          3496.0              2915.0             NaN   3            NaN          3496.0                 NaN             NaN   4         1739.0          3471.0              2827.0             NaN      Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali  0             NaN                  NaN           NaN          NaN  1             NaN                  NaN           NaN          NaN  2             NaN                  NaN           NaN          NaN  3             NaN                  NaN           NaN          NaN  4             NaN                  NaN           NaN          NaN  Date                   122Day                    122Cases_Guinea            93Cases_Liberia           83Cases_SierraLeone       87Cases_Nigeria           38Cases_Senegal           25Cases_UnitedStates      18Cases_Spain             16Cases_Mali              12Deaths_Guinea           92Deaths_Liberia          81Deaths_SierraLeone      87Deaths_Nigeria          38Deaths_Senegal          22Deaths_UnitedStates     18Deaths_Spain            16Deaths_Mali             12dtype: int64num_rows122num_missing:Date                     0Day                      0Cases_Guinea            29Cases_Liberia           39Cases_SierraLeone       35Cases_Nigeria           84Cases_Senegal           97Cases_UnitedStates     104Cases_Spain            106Cases_Mali             110Deaths_Guinea           30Deaths_Liberia          41Deaths_SierraLeone      35Deaths_Nigeria          84Deaths_Senegal         100Deaths_UnitedStates    104Deaths_Spain           106Deaths_Mali            110dtype: int64121429NaN       29 86.0      3 495.0     2 112.0     2 390.0     2Name: Cases_Guinea, dtype: int64
  • 处理缺失数据

5-6.py

import pandas as pdfrom numpy import NaN, NAN, nanimport numpy as npebola = pd.read_csv('data/country_timeseries.csv')print(ebola.iloc[0:10, 0:5])print(ebola.fillna(0).iloc[0:10, 0:5])# 前向填充print(ebola.fillna(method='ffill').iloc[0:10, 0:5])# 后向填充print(ebola.fillna(method='bfill').iloc[0:10, 0:5])print(ebola.interpolate().iloc[0:10, 0:5])print(ebola.shape)ebola_dropna = ebola.dropna()print(ebola_dropna.shape)print(ebola_dropna)ebola['Cases_multiple'] = ebola['Cases_Guinea'] + ebola['Cases_Liberia'] + \ebola['Cases_SierraLeone']ebola_subset = ebola.loc[:, ['Cases_Guinea', 'Cases_Liberia',                             'Cases_SierraLeone', 'Cases_multiple']]print(ebola_subset.head(n=10))print(ebola.Cases_Guinea.sum(skipna = True))print(ebola.Cases_Guinea.sum(skipna = False))

执行结果

         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone0    1/5/2015  289        2776.0            NaN            10030.01    1/4/2015  288        2775.0            NaN             9780.02    1/3/2015  287        2769.0         8166.0             9722.03    1/2/2015  286           NaN         8157.0                NaN4  12/31/2014  284        2730.0         8115.0             9633.05  12/28/2014  281        2706.0         8018.0             9446.06  12/27/2014  280        2695.0            NaN             9409.07  12/24/2014  277        2630.0         7977.0             9203.08  12/21/2014  273        2597.0            NaN             9004.09  12/20/2014  272        2571.0         7862.0             8939.0         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone0    1/5/2015  289        2776.0            0.0            10030.01    1/4/2015  288        2775.0            0.0             9780.02    1/3/2015  287        2769.0         8166.0             9722.03    1/2/2015  286           0.0         8157.0                0.04  12/31/2014  284        2730.0         8115.0             9633.05  12/28/2014  281        2706.0         8018.0             9446.06  12/27/2014  280        2695.0            0.0             9409.07  12/24/2014  277        2630.0         7977.0             9203.08  12/21/2014  273        2597.0            0.0             9004.09  12/20/2014  272        2571.0         7862.0             8939.0         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone0    1/5/2015  289        2776.0            NaN            10030.01    1/4/2015  288        2775.0            NaN             9780.02    1/3/2015  287        2769.0         8166.0             9722.03    1/2/2015  286        2769.0         8157.0             9722.04  12/31/2014  284        2730.0         8115.0             9633.05  12/28/2014  281        2706.0         8018.0             9446.06  12/27/2014  280        2695.0         8018.0             9409.07  12/24/2014  277        2630.0         7977.0             9203.08  12/21/2014  273        2597.0         7977.0             9004.09  12/20/2014  272        2571.0         7862.0             8939.0         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone0    1/5/2015  289        2776.0         8166.0            10030.01    1/4/2015  288        2775.0         8166.0             9780.02    1/3/2015  287        2769.0         8166.0             9722.03    1/2/2015  286        2730.0         8157.0             9633.04  12/31/2014  284        2730.0         8115.0             9633.05  12/28/2014  281        2706.0         8018.0             9446.06  12/27/2014  280        2695.0         7977.0             9409.07  12/24/2014  277        2630.0         7977.0             9203.08  12/21/2014  273        2597.0         7862.0             9004.09  12/20/2014  272        2571.0         7862.0             8939.0         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone0    1/5/2015  289        2776.0            NaN            10030.01    1/4/2015  288        2775.0            NaN             9780.02    1/3/2015  287        2769.0         8166.0             9722.03    1/2/2015  286        2749.5         8157.0             9677.54  12/31/2014  284        2730.0         8115.0             9633.05  12/28/2014  281        2706.0         8018.0             9446.06  12/27/2014  280        2695.0         7997.5             9409.07  12/24/2014  277        2630.0         7977.0             9203.08  12/21/2014  273        2597.0         7919.5             9004.09  12/20/2014  272        2571.0         7862.0             8939.0(122, 18)(1, 18)          Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \19  11/18/2014  241        2047.0         7082.0             6190.0       Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \19           20.0            1.0                 4.0          1.0         6.0       Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \19         1214.0          2963.0              1267.0             8.0       Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali  19             0.0                  1.0           0.0          6.0     Cases_Guinea  Cases_Liberia  Cases_SierraLeone  Cases_multiple0        2776.0            NaN            10030.0             NaN1        2775.0            NaN             9780.0             NaN2        2769.0         8166.0             9722.0         20657.03           NaN         8157.0                NaN             NaN4        2730.0         8115.0             9633.0         20478.05        2706.0         8018.0             9446.0         20170.06        2695.0            NaN             9409.0             NaN7        2630.0         7977.0             9203.0         19810.08        2597.0            NaN             9004.0             NaN9        2571.0         7862.0             8939.0         19372.084729.0nan

参考资料

人已赞赏
0 条回复 A文章作者 M管理员
    暂无讨论,说说你的看法吧
个人中心
今日签到
有新消息 消息中心
有新私信 私信列表
搜索