Processing Excel with the pandas Data Analysis Library

pandas has powerful facilities for importing and processing Excel data. This article gives a brief introduction to using pandas with CSV, Excel, and similar formats, and to drawing charts from the results.

pandas relies on third-party libraries such as xlrd, OpenPyXL, and XlsxWriter to read and write Excel files.


Overview of reading data with pandas

Reading and writing text

Function         Description
read_csv         Load delimited data from a file, URL, or file-like object; uses comma as the default delimiter
read_table       Load delimited data from a file, URL, or file-like object; uses tab ('\t') as the default delimiter
read_fwf         Read data in fixed-width column format (i.e., no delimiters)
read_clipboard   Version of read_table that reads data from the clipboard; useful for converting tables from web pages
read_excel       Read tabular data from an Excel XLS or XLSX file
read_hdf         Read HDF5 files written by pandas
read_html        Read all tables found in the given HTML document
read_json        Read data from a JSON (JavaScript Object Notation) string representation
read_msgpack     Read pandas data encoded using the MessagePack binary format
read_pickle      Read an arbitrary object stored in Python pickle format
read_sas         Read a SAS dataset stored in one of the SAS system's custom storage formats
read_sql         Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame
read_stata       Read a dataset from Stata file format
read_feather     Read the Feather binary file format

The parameters to these functions mainly cover indexing, type inference and data conversion, date and time parsing, iteration, and handling of dirty data.
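As a rough illustration, here is a hedged sketch that touches each of those parameter groups; the file name and column names ('examples/data.csv', 'id', 'qty', 'created_at') are hypothetical:

import pandas as pd

df = pd.read_csv(
    'examples/data.csv',            # hypothetical file
    index_col='id',                 # indexing: use the 'id' column as the row index
    dtype={'qty': float},           # type inference/conversion: force 'qty' to float
    parse_dates=['created_at'],     # date and time handling
    nrows=100,                      # iteration: read only the first 100 rows
    na_values=['N/A', '?'],         # dirty data: extra strings treated as missing
)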

The contents of ex1.csv:

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Read it:

In [9]: df = pd.read_csv('examples/ex1.csv')

In [10]: df
Out[10]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

You can also read it with read_table instead:

In [11]: pd.read_table('examples/ex1.csv', sep=',')
Out[11]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

The contents of ex2.csv:

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

You can pass header=None to indicate there is no header row, supply your own column names with names, and promote a column to the row index with index_col.

In [13]: pd.read_csv('examples/ex2.csv', header=None)
Out[13]:
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo

In [14]: pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
Out[14]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

In [15]: names = ['a', 'b', 'c', 'd', 'message']

In [16]: pd.read_csv('examples/ex2.csv', names=names, index_col='message')
Out[16]:
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12

The contents of csv_mindex.csv:

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16

Build a hierarchical index:

In [18]: parsed = pd.read_csv('examples/csv_mindex.csv',
   ....:                      index_col=['key1', 'key2'])

In [19]: parsed
Out[19]:
           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16

Use a regular expression to handle irregular delimiters:

In [20]: list(open('examples/ex3.txt'))
Out[20]:
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

In [21]: result = pd.read_table('examples/ex3.txt', sep='\s+')

In [22]: result
Out[22]:
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491

The contents of ex4.csv:

# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

skiprows skips the specified rows:

In [24]: pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])
Out[24]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

The contents of ex5.csv:

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo

You can specify which values should be treated as missing, and even pass a dict to na_values to set different sentinel values for each column.

In [26]: result = pd.read_csv('examples/ex5.csv')

In [27]: result
Out[27]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

In [28]: pd.isnull(result)
Out[28]:
   something      a      b      c      d  message
0      False  False  False  False  False     True
1      False  False  False   True  False    False
2      False  False  False  False  False    False

In [29]: result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])

In [30]: result
Out[30]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

In [31]: sentinels = {'message': ['foo', 'NA'], 'something': ['two']}

In [32]: pd.read_csv('examples/ex5.csv', na_values=sentinels)
Out[32]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN

Commonly used arguments to pandas.read_csv and pandas.read_table:

Argument          Description
path              String indicating filesystem location, URL, or file-like object
sep or delimiter  Character sequence or regular expression to use to split fields in each row
header            Row number to use as column names; defaults to 0 (first row), but should be None if there is no header row
index_col         Column numbers or names to use as the row index in the result; can be a single name/number or a list of them for a hierarchical index
names             List of column names for the result; combine with header=None
skiprows          Number of rows at the beginning of the file to ignore, or a list of row numbers (starting from 0) to skip
na_values         Sequence of values to replace with NA
comment           Character(s) to split comments off the end of lines
parse_dates       Attempt to parse data to datetime; False by default. If True, will attempt to parse all columns. Otherwise can specify a list of column numbers or names to parse. If an element of the list is a tuple or list, will combine multiple columns together and parse to date (e.g., if date/time is split across two columns)
keep_date_col     If joining columns to parse date, keep the joined columns; False by default
converters        Dict mapping column number or name to a function (e.g., {'foo': f} would apply the function f to all values in the 'foo' column)
dayfirst          When parsing potentially ambiguous dates, treat as international format (e.g., 7/6/2012 -> June 7, 2012); False by default
date_parser       Function to use to parse dates
nrows             Number of rows to read from the beginning of the file
iterator          Return a TextParser object for reading the file piecemeal
chunksize         For iteration, size of file chunks
skip_footer       Number of lines to ignore at the end of the file
verbose           Print various parser output information, like the number of missing values placed in non-numeric columns
encoding          Text encoding for Unicode (e.g., 'utf-8' for UTF-8 encoded text)
squeeze           If the parsed data only contains one column, return a Series
thousands         Separator for thousands (e.g., ',' or '.')
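To illustrate the iterator and chunksize entries above, here is a minimal sketch of reading a file piecemeal; it assumes a large examples/ex6.csv with a 'key' column:

import pandas as pd

# chunksize makes read_csv return an iterable of DataFrame pieces
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

tot = pd.Series([], dtype=float)
for piece in chunker:
    # Aggregate per chunk so the whole file never sits in memory at once.
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)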

References:

More details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

CSV

Reading and writing CSV with pandas

pandas_parsing_and_write.py

import pandas as pd

input_file = r"supplier_data.csv"
output_file = r"output_files\1output.csv"

data_frame = pd.read_csv(input_file)
print(data_frame)
data_frame.to_csv(output_file, index=False)

Of course, the same can be done in plain Python:

1csv_simple_parsing_and_write.py

input_file = r"supplier_data.csv"
output_file = r"output_files\1output.csv"

with open(input_file, newline='') as filereader:
    with open(output_file, 'w', newline='') as filewriter:
        for row in filereader:
            filewriter.write(row)

2csv_reader_parsing_and_write.py

import csv

input_file = r"supplier_data.csv"
output_file = r"output_files\2output.csv"

with open(input_file, 'r', newline='') as csv_in_file:
    with open(output_file, 'w', newline='') as csv_out_file:
        filereader = csv.reader(csv_in_file, delimiter=',')
        filewriter = csv.writer(csv_out_file, delimiter=',')
        for row_list in filereader:
            filewriter.writerow(row_list)

Filtering specific rows

  • Select rows whose Supplier Name contains 'Z' or whose Cost is greater than 600

pandas_value_meets_condition.py

import pandas as pd

input_file = r"supplier_data.csv"
output_file = r"output_files\3output.csv"

data_frame = pd.read_csv(input_file)
data_frame['Cost'] = data_frame['Cost'].str.strip('$').astype(float)
data_frame_value_meets_condition = data_frame.loc[
    (data_frame['Supplier Name'].str.contains('Z')) |
    (data_frame['Cost'] > 600.0), :]
data_frame_value_meets_condition.to_csv(output_file, index=False)

Note that pandas' str.strip() takes an argument: it removes every occurrence of the given characters from both ends of each string (here the leading '$'), not just whitespace. It does not touch characters in the middle of a string; for those, use str.replace().
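A quick, self-contained illustration of the difference (the sample values are made up):

import pandas as pd

costs = pd.Series(['$1200.00', '$600.50'])
print(costs.str.strip('$').astype(float))    # leading '$' removed: 1200.0, 600.5

# Characters inside the string, such as thousands separators, need str.replace:
big = pd.Series(['$1,200.00'])
print(big.str.strip('$').str.replace(',', '').astype(float))   # 1200.0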

  • Select rows whose values belong to a set:

Select the rows dated '1/20/14' or '1/30/14':

import pandas as pd

input_file = r"supplier_data.csv"
output_file = r"output_files\4output.csv"

data_frame = pd.read_csv(input_file)
important_dates = ['1/20/14', '1/30/14']
data_frame_value_in_set = data_frame.loc[
    data_frame['Purchase Date'].isin(important_dates), :]
data_frame_value_in_set.to_csv(output_file, index=False)
  • Select rows matching a pattern

pandas_value_matches_pattern.py

import pandas as pd

input_file = r"supplier_data.csv"
output_file = r"output_files\4output.csv"

data_frame = pd.read_csv(input_file)
# .ix is deprecated; boolean indexing with .loc does the same job
data_frame_value_matches_pattern = data_frame.loc[
    data_frame['Invoice Number'].str.startswith("001-"), :]
data_frame_value_matches_pattern.to_csv(output_file, index=False)

Filtering specific columns

  • Select columns 0 and 3

pandas_column_by_index.py

import pandas as pd

input_file = r"supplier_data.csv"
output_file = r"output_files\6output.csv"

data_frame = pd.read_csv(input_file)
data_frame_column_by_index = data_frame.iloc[:, [0, 3]]
data_frame_column_by_index.to_csv(output_file, index=False)

pandas_column_by_name.py

import pandas as pd

input_file = r"supplier_data.csv"
output_file = r"output_files\7output.csv"

data_frame = pd.read_csv(input_file)
data_frame_column_by_name = data_frame.loc[
    :, ['Invoice Number', 'Purchase Date']]
data_frame_column_by_name.to_csv(output_file, index=False)

Selecting a contiguous block of rows (stripping unwanted header and footer lines):

pandas_select_contiguous_rows.py

import pandas as pd

input_file = r"supplier_data_unnecessary_header_footer.csv"
output_file = r"output_files\11output.csv"

data_frame = pd.read_csv(input_file, header=None)
data_frame = data_frame.drop([0, 1, 2, 16, 17, 18])  # junk rows before and after the data
data_frame.columns = data_frame.iloc[0]              # promote the real header row
data_frame = data_frame.reindex(data_frame.index.drop(3))
data_frame.to_csv(output_file, index=False)
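A leaner alternative is to let read_csv skip the junk directly. This is a sketch: it assumes the junk header and footer are each exactly three lines, matching the drop() calls above:

import pandas as pd

data_frame = pd.read_csv(
    "supplier_data_unnecessary_header_footer.csv",  # same input as above
    skiprows=3,        # skip the three junk lines above the real header
    skipfooter=3,      # skip the three junk lines at the end of the file
    engine='python',   # skipfooter is only supported by the python engine
)
data_frame.to_csv(r"output_files\11output.csv", index=False)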

Adding a header row

pandas_add_header_row.py

import pandas as pd

input_file = r"supplier_data_no_header_row.csv"
output_file = r"output_files\11output.csv"

header_list = ['Supplier Name', 'Invoice Number',
               'Part Number', 'Cost', 'Purchase Date']
data_frame = pd.read_csv(input_file, header=None, names=header_list)
data_frame.to_csv(output_file, index=False)

Concatenating multiple files

pandas_concat_rows_from_multiple_files.py

import pandas as pd
import glob
import os

input_path = r"D:\code\foundations-for-analytics-with-python\csv"
output_file = r"output_files\12output.csv"

all_files = glob.glob(os.path.join(input_path, 'sales_*'))
all_data_frames = []
for file in all_files:
    data_frame = pd.read_csv(file, index_col=None)
    all_data_frames.append(data_frame)
data_frame_concat = pd.concat(all_data_frames, axis=0, ignore_index=True)
data_frame_concat.to_csv(output_file, index=False)

Sums and averages

pandas_sum_average_from_multiple_files.py

import pandas as pd
import glob
import os

input_path = r"D:\code\foundations-for-analytics-with-python\csv"
output_file = r"output_files\12output.csv"

all_files = glob.glob(os.path.join(input_path, 'sales_*'))
all_data_frames = []
for input_file in all_files:
    print(input_file)
    data_frame = pd.read_csv(input_file, index_col=None)
    print(data_frame)
    sales = pd.DataFrame([float(str(value).strip('$').replace(',', ''))
                          for value in data_frame.loc[:, 'Sale Amount']])
    total_cost = sales.sum()
    average_cost = sales.mean()
    data = {'file_name': os.path.basename(input_file),
            'total_sales': total_cost,
            'average_sales': average_cost}
    all_data_frames.append(pd.DataFrame(
        data, columns=['file_name', 'total_sales', 'average_sales']))
data_frames_concat = pd.concat(all_data_frames, axis=0, ignore_index=True)
data_frames_concat.to_csv(output_file, index=False)

XLS

Reading and writing XLS files with pandas

pandas_parsing_and_write_keep_dates.py

import pandas as pd

input_file = "sales_2013.xlsx"
output_file = "pandas_output.xls"

# sheetname= is deprecated in newer pandas; use sheet_name=
data_frame = pd.read_excel(input_file, sheet_name='january_2013')
writer = pd.ExcelWriter(output_file)
data_frame.to_excel(writer, sheet_name='jan_13_output', index=False)
writer.save()

Filtering specific rows

  • Records with a Sale Amount greater than 1,400

pandas_value_meets_condition.py

import pandas as pd

input_file = "sales_2013.xlsx"
output_file = "pandas_output.xls"

data_frame = pd.read_excel(input_file, 'january_2013', index_col=None)
data_frame_value_meets_condition = \
    data_frame[data_frame['Sale Amount'].astype(float) > 1400.0]
writer = pd.ExcelWriter(output_file)
data_frame_value_meets_condition.to_excel(
    writer, sheet_name='jan_13_output', index=False)
writer.save()
  • Records with specific purchase dates

pandas_value_in_set.py

import pandas as pd

input_file = "sales_2013.xlsx"
output_file = "pandas_output.xls"

data_frame = pd.read_excel(input_file, 'january_2013', index_col=None)
important_dates = ['01/24/2013', '01/31/2013']
data_frame_value_in_set = data_frame[
    data_frame['Purchase Date'].isin(important_dates)]
writer = pd.ExcelWriter(output_file)
data_frame_value_in_set.to_excel(writer, sheet_name='jan_13_output', index=False)
writer.save()
  • Other conditions

Other string predicates include startswith, endswith, match, and search. The script below uses startswith; a sketch of the others follows it.

pandas_value_matches_pattern.py

import pandas as pd

input_file = "sales_2013.xlsx"
output_file = "pandas_output.xls"

data_frame = pd.read_excel(input_file, 'january_2013', index_col=None)
data_frame_value_matches_pattern = data_frame[
    data_frame['Customer Name'].str.startswith("J")]
writer = pd.ExcelWriter(output_file)
data_frame_value_matches_pattern.to_excel(
    writer, sheet_name='jan_13_output', index=False)
writer.save()
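For the remaining predicates, a hedged sketch reusing the same column; note that pandas exposes regex searching as str.contains rather than a str.search method:

import pandas as pd

input_file = "sales_2013.xlsx"
data_frame = pd.read_excel(input_file, 'january_2013', index_col=None)

ends_with_n = data_frame[data_frame['Customer Name'].str.endswith("n")]
starts_like = data_frame[data_frame['Customer Name'].str.match(r"J.*")]   # regex anchored at the start
anywhere = data_frame[data_frame['Customer Name'].str.contains("oh")]     # regex search anywhere in the string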

Selecting specific columns

  • Use iloc to select the 2nd and 5th columns by position

import pandas as pd

input_file = "sales_2013.xlsx"
output_file = "pandas_output.xls"

data_frame = pd.read_excel(input_file, 'january_2013', index_col=None)
data_frame_column_by_index = data_frame.iloc[:, [1, 4]]
writer = pd.ExcelWriter(output_file)
data_frame_column_by_index.to_excel(
    writer, sheet_name='jan_13_output', index=False)
writer.save()
  • Use loc to select the 2nd and 5th columns by name

pandas_column_by_name.py

import pandas as pd

input_file = "sales_2013.xlsx"
output_file = "pandas_output.xls"

data_frame = pd.read_excel(input_file, 'january_2013', index_col=None)
data_frame_column_by_name = data_frame.loc[:, ['Customer ID', 'Purchase Date']]
writer = pd.ExcelWriter(output_file)
data_frame_column_by_name.to_excel(
    writer, sheet_name='jan_13_output', index=False)
writer.save()

Working with all worksheets

  • Select rows with a Sale Amount greater than 2,000

pandas_value_meets_condition_all_worksheets.py

import pandas as pd

input_file = "sales_2013.xlsx"
output_file = "pandas_output.xls"

# sheet_name=None reads every worksheet into a dict of DataFrames
data_frame = pd.read_excel(input_file, sheet_name=None, index_col=None)
row_output = []
for worksheet_name, data in data_frame.items():
    row_output.append(data[data['Sale Amount'].replace('$', '').
                           replace(',', '').astype(float) > 2000.0])
filtered_rows = pd.concat(row_output, axis=0, ignore_index=True)
writer = pd.ExcelWriter(output_file)
filtered_rows.to_excel(writer, sheet_name='sale_amount_gt2000', index=False)
writer.save()
  • Use loc to select the 2nd and 5th columns, by name, from every worksheet

pandas_value_meets_condition_all_worksheets.py

import pandas as pd

input_file = "sales_2013.xlsx"
output_file = "pandas_output.xls"

data_frame = pd.read_excel(input_file, sheet_name=None, index_col=None)
column_output = []
for worksheet_name, data in data_frame.items():
    column_output.append(data.loc[:, ['Customer Name', 'Sale Amount']])
selected_columns = pd.concat(column_output, axis=0, ignore_index=True)
writer = pd.ExcelWriter(output_file)
selected_columns.to_excel(
    writer, sheet_name='selected_columns_all_worksheets', index=False)
writer.save()

Working with a subset of worksheets

  • Select rows whose Sale Amount exceeds a threshold (1,900 in the script below)

pandas_value_meets_condition_set_of_worksheets.py

import pandas as pd

input_file = "sales_2013.xlsx"
output_file = "pandas_output.xls"

my_sheets = [0, 1]
threshold = 1900.0
# sheet_name accepts a list of sheet positions or names
data_frame = pd.read_excel(input_file, sheet_name=my_sheets, index_col=None)
row_list = []
for worksheet_name, data in data_frame.items():
    row_list.append(data[data['Sale Amount'].replace('$', '').
                         replace(',', '').astype(float) > threshold])
filtered_rows = pd.concat(row_list, axis=0, ignore_index=True)
writer = pd.ExcelWriter(output_file)
filtered_rows.to_excel(writer, sheet_name='set_of_worksheets', index=False)
writer.save()

Working with multiple workbooks

  • Concatenation with concat

pandas_concat_data_from_multiple_workbooks.py

import pandas as pd
import glob
import os

input_path = "/media/andrew/6446FA2346F9F5A0/code/foundations-for-analytics-with-python/excel"
output_file = "pandas_output.xls"

all_workbooks = glob.glob(os.path.join(input_path, '*.xls*'))
data_frames = []
for workbook in all_workbooks:
    all_worksheets = pd.read_excel(
        workbook, sheet_name=None, index_col=None)
    for worksheet_name, data in all_worksheets.items():
        data_frames.append(data)
all_data_concatenated = pd.concat(data_frames, axis=0, ignore_index=True)
writer = pd.ExcelWriter(output_file)
all_data_concatenated.to_excel(
    writer, sheet_name='all_data_all_workbooks', index=False)
writer.save()
  • Sums and averages

pandas_sum_average_multiple_workbooks.py

import pandas as pd
import glob
import os

input_path = "/media/andrew/6446FA2346F9F5A0/code/foundations-for-analytics-with-python/excel"
output_file = "pandas_output.xls"

all_workbooks = glob.glob(os.path.join(input_path, '*.xls*'))
data_frames = []
for workbook in all_workbooks:
    # sheetname= is deprecated; sheet_name=None loads all worksheets
    all_worksheets = pd.read_excel(workbook, sheet_name=None, index_col=None)
    workbook_total_sales = []
    workbook_number_of_sales = []
    worksheet_data_frames = []
    worksheets_data_frame = None
    workbook_data_frame = None
    for worksheet_name, data in all_worksheets.items():
        # .ix is deprecated; .loc selects the column by label
        total_sales = pd.DataFrame(
            [float(str(value).strip('$').replace(',', '')) for value in
             data.loc[:, 'Sale Amount']]).sum()
        number_of_sales = len(data.loc[:, 'Sale Amount'])
        average_sales = pd.DataFrame(total_sales / number_of_sales)
        workbook_total_sales.append(total_sales)
        workbook_number_of_sales.append(number_of_sales)
        data = {'workbook': os.path.basename(workbook),
                'worksheet': worksheet_name,
                'worksheet_total': total_sales,
                'worksheet_average': average_sales}
        worksheet_data_frames.append(
            pd.DataFrame(data,
                         columns=['workbook', 'worksheet', 'worksheet_total',
                                  'worksheet_average']))
    worksheets_data_frame = pd.concat(
        worksheet_data_frames, axis=0, ignore_index=True)
    workbook_total = pd.DataFrame(workbook_total_sales).sum()
    workbook_total_number_of_sales = pd.DataFrame(
        workbook_number_of_sales).sum()
    workbook_average = pd.DataFrame(
        workbook_total / workbook_total_number_of_sales)
    workbook_stats = {'workbook': os.path.basename(workbook),
                      'workbook_total': workbook_total,
                      'workbook_average': workbook_average}
    workbook_stats = pd.DataFrame(workbook_stats,
                                  columns=['workbook', 'workbook_total',
                                           'workbook_average'])
    workbook_data_frame = pd.merge(
        worksheets_data_frame, workbook_stats, on='workbook', how='left')
    data_frames.append(workbook_data_frame)
all_data_concatenated = pd.concat(data_frames, axis=0, ignore_index=True)
writer = pd.ExcelWriter(output_file)
all_data_concatenated.to_excel(
    writer, sheet_name='sums_and_averages', index=False)
writer.save()

Drawing charts in Excel

import pandas as pd
import random

# Some sample data to plot.
cat_1 = ['y1', 'y2', 'y3', 'y4']
index_1 = range(0, 21, 1)
multi_iter1 = {'index': index_1}
for cat in cat_1:
    multi_iter1[cat] = [random.randint(10, 100) for x in index_1]

# Create a Pandas dataframe from the data.
index_2 = multi_iter1.pop('index')
df = pd.DataFrame(multi_iter1, index=index_2)
df = df.reindex(columns=sorted(df.columns))

# Create a Pandas Excel writer using XlsxWriter as the engine.
excel_file = 'legend.xlsx'
sheet_name = 'Sheet1'
writer = pd.ExcelWriter(excel_file, engine='xlsxwriter')
df.to_excel(writer, sheet_name=sheet_name)

# Access the XlsxWriter workbook and worksheet objects from the dataframe.
workbook = writer.book
worksheet = writer.sheets[sheet_name]

# Create a chart object.
chart = workbook.add_chart({'type': 'line'})

# Configure the series of the chart from the dataframe data.
for i in range(len(cat_1)):
    col = i + 1
    chart.add_series({
        'name':       ['Sheet1', 0, col],
        'categories': ['Sheet1', 1, 0, 21, 0],
        'values':     ['Sheet1', 1, col, 21, col],
    })

# Configure the chart axes.
chart.set_x_axis({'name': 'Index'})
chart.set_y_axis({'name': 'Value', 'major_gridlines': {'visible': False}})

# Insert the chart into the worksheet.
worksheet.insert_chart('G2', chart)

# Close the Pandas Excel writer and output the Excel file.
writer.save()
[Screenshot test.PNG: the line chart produced in legend.xlsx]

Reference: http://pandas-xlsxwriter-charts.readthedocs.io/
