最近一段时间,做一些分析运营相关的工作,和hive打的交道比较多。每次需要看数据的时候都需要写一个SQL去验证一下,有的时候甚至不知道表结构,但是hive并不像MySQL那样使用方便。目前我司使用hive查询只有两个入口:

  • 通过公司的HUE管理页面,这个是推荐的方法,但是有一个问题,每一张表都需要申请。很麻烦,而且我们部门审批hive表的那个XXX太XXX了,申请比较困难。
  • 使用JDBC进行连接。这个是可以,但是Java使用起来还是不方便。

所以我就想写一个Python脚本来执行SQL。本来觉得很容易,但是发现这里面的坑太多了。主要问题是Python的库都太弱并且好多库都没人维护了。折腾了一天终于找到了一个比较好的库,下面就是具体的过程:

安装Python依赖库

1
2
pip install impyla

实现代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
__author__ = 'guochenglai'
from impala.dbapi import connect
hive_config = {}
hive_config['online'] = {
'host': 'l-xxx.com',
'db': 'xxx',
'port': 12001
}
def get_hive_cursor(conf):
conn = connect(host=conf['host'], port=conf['port'], database=conf['db'], auth_mechanism='PLAIN')
return conn.cursor()
def execute_sql(cursor,sql):
cursor.execute(sql)
return cursor.fetchall()
def assemble_sql():
sql='''SHOW Tables'''
return sql
if __name__ == '__main__':
hive = hive_config['online']
cursor = get_hive_cursor(hive)
sql = assemble_sql()
print execute_sql(cursor,sql)

如果是学习的小伙伴上面两部链接hive应该的已经成功了。下面是我尝试其他方法遇到的问题。与教程无关,如果是小白用户就不用看了。

其他链接方法中遇到的坑

在前期尝试了很多其他的方法遇到了很多坑,至今没有解决。如果有人遇到以下的问题,并且已经解决。烦请多交流。
遇到的第一个问题程序报错如下:

1
2
3
4
5
6
7
8
9
10
File "build/bdist.macosx-10.11-intel/egg/pyhs2/__init__.py", line 7, in connect
# Copyright (c) 1999 by Secret Labs AB.
File "build/bdist.macosx-10.11-intel/egg/pyhs2/connections.py", line 47, in __init__
File "build/bdist.macosx-10.11-intel/egg/pyhs2/TCLIService/TCLIService.py", line 154, in OpenSession
File "build/bdist.macosx-10.11-intel/egg/pyhs2/TCLIService/TCLIService.py", line 165, in recv_OpenSession
File "build/bdist.macosx-10.11-intel/egg/thrift/protocol/TBinaryProtocol.py", line 140, in readMessageBegin
File "build/bdist.macosx-10.11-intel/egg/thrift/transport/TTransport.py", line 58, in readAll
File "build/bdist.macosx-10.11-intel/egg/thrift/transport/TTransport.py", line 159, in read
File "build/bdist.macosx-10.11-intel/egg/thrift/transport/TSocket.py", line 120, in read
thrift.transport.TTransport.TTransportException: TSocket read 0 bytes

遇到问题二程序报错如下:

1
2
3
4
5
6
7
8
/usr/local/lib/python2.7/site-packages/setuptools-12.3-py2.7.egg/pkg_resources/init.py:1224: UserWarning: /home/tap/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable).
Traceback (most recent call last):
File "testPktFlow2.py", line 7, in
database='default') as conn:
File "build/bdist.linux-x86_64/egg/pyhs2/init.py", line 7, in connect
File "build/bdist.linux-x86_64/egg/pyhs2/connections.py", line 45, in init
File "build/bdist.linux-x86_64/egg/pyhs2/cloudera/thrift_sasl.py", line 66, in open
thrift.transport.TTransport.TTransportException: Could not start SASL: Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found