Pandas读取Excel数据，简单清洗后存入数据库

2019-04-18

场景描述

客户提供的Excel文件，需要进行简单清洗之后导入数据库。

代码实现

import pandas as pd
import numpy as np
import os
from sqlalchemy import create_engine
from sqlalchemy import text
from sqlalchemy.types import Integer
from sqlalchemy.types import NVARCHAR

# Excel文件所在路径
workpath = 'D:/Users/Temp/working'
os.chdir(workpath)

# 数据库 odbc 连接
con_string = "mssql+pyodbc://username:pwd@127.0.01/MyMSSQL?driver=SQL+Server"
engine = create_engine(con_string)

# 设置数据库字段类型
def mapping_df_types(df):
    dtype_dict = {}
    for col, dtype in zip(df.columns, df.dtypes):
        if str(dtype) in ['object','float64']:
            dtype_dict.update({col: NVARCHAR(length=2000)})
        if str(dtype) in ['int64','int']:
            dtype_dict.update({col: Integer()})            
    return dtype_dict

# Trim所有字符串列
def col_strip(df):
    for col in list(df): # 遍历列名
        if df[col].dtype.kind == 'O': # 若为字符串，strip一下
            df[col] = df[col].astype(str).str.strip()

# 某些需要特别指定数据类型的字段
dtype = {'EmpNo':str,'OldEmpNo':str,'xCode':str,'Mobile':str,'TEL':str}

# 读取excel文件
df_Emp = pd.read_excel('导入数据.xlsx', sheet_name=0,header=0,dtype=dtype)

# 将 code-title 形式的 xcode 拆为仅 code 
df_Emp['xCode'] = df_Emp.xCode.str.split('-',expand = True)[0]

# 将所有读取出来的 nan 替换为空串
df_Emp = df_Emp.replace('nan', '')

# trip所有字符串
col_strip(df_Emp)

# 设置数据库字段类型
dtype_dict = mapping_df_types(df_Emp)

# 数据插入数据库
df_Emp.to_sql('DataImport',con=engine,if_exists='replace',dtype=dtype_dict,index = False)

说明

通过Pandas读取Excel，可以自动识别字段类型，然后再部分指定字段类型，最后直接通过to_sql函数，一条语句即实现建表和插入数据的功能；
设置字段类型那段非常关键，因为是SQL SERVE数据库，而且数据中含中文，若不设置则会均按默认varchar走，最终存到数据库中的很多都会是乱码。