【Spark】Reading from and Writing to HBase with an External Data Source


Unless otherwise stated, posts on this blog are original. Please credit the source when reposting: Big data enthusiast (http://www.lubinsu.com/)

Permalink: 【Spark】Reading from and Writing to HBase with an External Data Source (http://www.lubinsu.com/index.php/archives/389)

This post shows how Spark reads from and writes to HBase through an external data source, the Hortonworks Spark-HBase Connector (SHC).
The project source code is here: https://github.com/hortonworks-spark/shc
You can build the jar from source yourself, or use the jars pre-built by the project team: http://repo.hortonworks.com/content/groups/public/
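If you would rather not manage the jar by hand, spark-submit can also resolve it straight from that repository with --packages and --repositories instead of --jars. The Maven coordinates below are only an assumption inferred from the jar name shc-1.0.0-2.0-s_2.11.jar used later in this post; verify them against the repository for your Spark and Scala versions:

# Alternative launch: resolve SHC from the Hortonworks repository instead of passing --jars.
# The coordinates are assumed from the jar name shc-1.0.0-2.0-s_2.11.jar; check the repo before use.
spark-submit \
    --repositories http://repo.hortonworks.com/content/groups/public/ \
    --packages com.hortonworks:shc:1.0.0-2.0-s_2.11 \
    hbase_test.py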
Now, on to how it is actually used; here is the code:

from pyspark.sql import Row
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.appName('hbase test').getOrCreate()

    # Catalog mapping the DataFrame schema onto the HBase table "test":
    # "key" is the row key, "column1" and "column2" live in column family "f1".
    catalog = """{
            "table":{"namespace":"default", "name":"test"},
            "rowkey":"key",
            "columns":{
                "key":{"cf":"rowkey", "col":"key", "type":"string"},
                "column1":{"cf":"f1", "col":"column1", "type":"string"},
                "column2":{"cf":"f1", "col":"column2", "type":"string"}
            }
            }"""

    # Build 100 sample rows and write them to HBase through the SHC data source.
    # Note: if the HBase table does not exist yet, SHC also expects a "newtable" option
    # (e.g. .option('newtable', '5')) telling it to create the table with that many regions.
    Record = Row("key", "column1", "column2")
    df = spark.createDataFrame([Record(str(i), "column1_" + str(i), "column2_" + str(i)) for i in range(1, 101)])
    df.write.options(catalog=catalog).format('org.apache.spark.sql.execution.datasources.hbase').save()

    spark.stop()

Run the script as follows:
PYSPARK_PYTHON=python3 /home/hadoop/spark-2.1.1-bin-hadoop2.6/bin/spark-submit \
    --master local[5] \
    --jars lib/htrace-core-3.1.0-incubating.jar,lib/hbase-protocol-1.2.0-cdh5.7.0.jar,lib/hbase-server-1.2.0-cdh5.7.0.jar,lib/hbase-common-1.2.0-cdh5.7.0.jar,lib/hbase-client-1.2.0-cdh5.7.0.jar,lib/shc-1.0.0-2.0-s_2.11.jar,lib/spark-streaming-kafka-0-8_2.11-2.1.1.jar,lib/kafka_2.11-0.8.2.1.jar,lib/zkclient-0.3.jar,lib/zookeeper-3.4.6.jar,lib/metrics-core-2.2.0.jar,lib/kafka-clients-0.8.2.1.jar,/home/hadoop/spark-2.1.1-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.1.1.jar \
    /home/hadoop/workspace/lubinsu/job/spark2-pro/hbase_test.py
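The script above only writes. Reading the table back goes through the same catalog and the same data source, just via spark.read instead of df.write. Here is a minimal sketch, assuming the write above has already created and populated the test table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('hbase read test').getOrCreate()

# Same catalog as in the write script: it tells SHC how to map HBase cells back to DataFrame columns.
catalog = """{
        "table":{"namespace":"default", "name":"test"},
        "rowkey":"key",
        "columns":{
            "key":{"cf":"rowkey", "col":"key", "type":"string"},
            "column1":{"cf":"f1", "col":"column1", "type":"string"},
            "column2":{"cf":"f1", "col":"column2", "type":"string"}
        }
        }"""

# Load the HBase table as a DataFrame through the SHC data source.
df = spark.read \
    .options(catalog=catalog) \
    .format('org.apache.spark.sql.execution.datasources.hbase') \
    .load()

df.show()          # inspect a few rows
print(df.count())  # 100 if the table only contains the rows written above

spark.stop()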
