Setting Up a Big Data Development Environment

Posted by wsxq2 on 2018-05-28
TAGS:  Big Data, Artificial Intelligence

Background and conventions (this article assumes the reader already knows the following):

  1. The username used when installing Ubuntu in this article is wsxq2; substitute your own username wherever it appears. Also, the most privileged Linux user is root, which belongs to the root group; logged in as root you can do anything to the system (which is dangerous), so you normally log in as an ordinary user with administrator (sudo) rights instead, such as my wsxq2.
  2. If a command fails with a permission error (or some other strange message), try prefixing it with sudo. With sudo the command runs with root's privileges.
  3. If you change ~/.bashrc and want the change to take effect in the current terminal immediately (by default it only applies after reopening the terminal), run . ~/.bashrc (or source ~/.bashrc).
  4. ~ is equivalent to /home/<your username> (mine is /home/wsxq2).
  5. In this article, "terminal" and "shell" are used interchangeably.
  6. $<word> denotes a shell variable (bash is the most common shell and is roughly analogous to the Windows Command Prompt), e.g. the $XXX_HOME variables used later; it is equivalent to ${<word>}, so ${BIG_DATA} below is the same as $BIG_DATA.
  7. In most Linux configuration files # starts a comment, much like // in C.
  8. Most of the software in this article is downloaded from the XDU open-source mirror, linux.xidian.edu.cn, so the article is most convenient for readers at the same university.
  9. By default Ubuntu clears /tmp on every boot, and HDFS uses that directory; you can change this behaviour by editing /etc/default/rcS (set TMPTIME=-1 or to a very large value); a sketch of this change follows the list.
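A minimal sketch of that change (assuming the stock Ubuntu 16.04 /etc/default/rcS, which already contains a TMPTIME= line; back the file up first):

    sudo cp /etc/default/rcS /etc/default/rcS.bak             # keep a backup of the original
    sudo sed -i 's/^TMPTIME=.*/TMPTIME=-1/' /etc/default/rcS  # a negative TMPTIME means /tmp is never cleaned at boot
    grep TMPTIME /etc/default/rcS                             # confirm the new value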

1 Installing Ubuntu in VirtualBox

Ubuntu version: 16.04 LTS
VirtualBox version: 5.2.12-122591
Host platform: Windows 10
Downloaded from: the campus open-source mirror linux.xidian.edu.cn

The XDU open-source mirror, linux.xidian.edu.cn, is very fast on campus (about 5 Mb/s both on Yixun (as long as the router admin is not throttling you) and on the campus network (where it does not count against your traffic quota)). It hosts almost every major piece of open-source software, such as Ubuntu, VirtualBox, Apache Hadoop, Hive, Spark, Storm and Kafka.

1.1 Installing VirtualBox

  1. Download VirtualBox: VirtualBox-5.2.12-122591-Win.exe
  2. Install VirtualBox: follow the illustrated tutorial 基于VirtualBox虚拟机安装Ubuntu图文教程

1.2 Installing Ubuntu

  1. Download the Ubuntu image: ubuntu-16.04.4-desktop-amd64.iso
  2. Install Ubuntu: follow the same tutorial, 基于VirtualBox虚拟机安装Ubuntu图文教程

1.3 Installing the Guest Additions

The Guest Additions make it easier to work across the host and the VM: shared clipboard, drag and drop, and sharing a host folder with the guest. Install them first:

  1. Download the VirtualBox Extension Pack: Oracle_VM_VirtualBox_Extension_Pack-5.2.12-122591.vbox-extpack
  2. In VirtualBox: File -> Preferences -> Extensions -> click the add-package icon -> select the downloaded Oracle_VM_VirtualBox_Extension_Pack -> Open -> …
  3. Start the Ubuntu VM: Devices -> Insert Guest Additions CD Image. Follow the prompt that appears inside the VM, then reboot (command: reboot).

With the Guest Additions installed, enable shared clipboard, drag and drop, and sharing of the host folder d:\ for the VM:

  1. Enable shared clipboard and drag and drop: Machine -> Settings -> General -> Advanced -> Shared Clipboard: Bidirectional -> Drag'n'Drop: Bidirectional -> OK
  2. Share the host folder d:\:

    1. Machine -> Settings -> Shared Folders -> click the add-folder icon -> Folder Path (e.g. d:\) -> Folder Name (e.g. d) -> check Make Permanent -> OK
    2. Open a terminal in the VM (Ctrl+Alt+T) and run the following commands (text after # is a comment and the backticks are only delimiters; type neither); a quick check that the share is mounted follows these commands:
    sudo gedit /etc/fstab # you will be asked for your password (the one chosen during installation), because sudo
                          # performs a privileged operation; gedit then opens /etc/fstab for editing.
                          # Append the following line to the end of the file:
                          # `d  /mnt/d  vboxsf  defaults  0  0`
                          # Save with Ctrl+S, quit with Ctrl+Q, then continue with:
    sudo mkdir /mnt/<any name (e.g. d)>
    sudo mount -a
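A minimal check that the shared folder is actually mounted (d is just the folder name used in the example above):

    mount | grep vboxsf # should show the share mounted at /mnt/d
    ls /mnt/d           # should list the contents of the host's d:\ drive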
    

1.4 Configuring Ubuntu

  1. Switch the package source to the campus mirror (the official mirror is used by default; the campus one is faster): sudo gedit /etc/apt/sources.list (or sudo apt edit-sources), comment out every line that is not already commented (just prepend a #), then append the following:

    
    deb http://linux.xidian.edu.cn/mirrors/ubuntu/ xenial main restricted universe multiverse
    #deb-src http://linux.xidian.edu.cn/mirrors/ubuntu/ xenial main restricted universe multiverse
       
    deb http://linux.xidian.edu.cn/mirrors/ubuntu/ xenial-security main restricted universe multiverse
    #deb-src http://linux.xidian.edu.cn/mirrors/ubuntu/ xenial-security main restricted universe multiverse
       
    deb http://linux.xidian.edu.cn/mirrors/ubuntu/ xenial-updates main restricted universe multiverse
    #deb-src http://linux.xidian.edu.cn/mirrors/ubuntu/ xenial-updates main restricted universe multiverse
       
    #deb http://linux.xidian.edu.cn/mirrors/ubuntu/ xenial-backports main restricted universe multiverse
    #deb-src http://linux.xidian.edu.cn/mirrors/ubuntu/ xenial-backports main restricted universe multiverse
       
    #deb http://linux.xidian.edu.cn/mirrors/ubuntu/ xenial-proposed main restricted universe multiverse
    #deb-src http://linux.xidian.edu.cn/mirrors/ubuntu/ xenial-proposed main restricted universe multiverse
    

    For details, see the mirror's Ubuntu 镜像使用帮助 page.

    After the edit, run sudo apt update to refresh the package lists; after that you can upgrade installed packages with sudo apt upgrade and install software with sudo apt install <package name>, as sketched below.
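    For example, a first refresh plus a few common tools (the package names here are only examples; nothing later in this article depends on them):

    sudo apt update               # refresh the package lists from the new mirror
    sudo apt upgrade              # upgrade the packages that are already installed
    sudo apt install vim git curl # install whatever everyday tools you like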

1.5 Backing up

Because the following steps will make extensive changes to the VM (installing all the software needed for big data development), first save the current state with a snapshot so it can be restored later: Machine -> Take Snapshot -> fill in a snapshot name and description -> OK

2 Jupyter Notebook and TensorFlow

Big data and artificial intelligence are closely related, so we also want TensorFlow, the well-known machine learning framework developed by Google. Given how well the framework supports Python and how easy Python is to use, we use Python as the language for developing on TensorFlow, and Jupyter Notebook makes learning it much more comfortable. So we install both TensorFlow and Jupyter Notebook.

2.1 Jupyter notebook

Jupyter official site: http://Jupyter.org
Jupyter official docs: Installing Jupyter

2.1.1 About Jupyter Notebook

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

2.1.2 Installing Jupyter using Anaconda

We strongly recommend installing Python and Jupyter using the Anaconda Distribution, which includes Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.

  1. Download Anaconda. We recommend downloading Anaconda's latest Python 3 version.
  2. Install the version of Anaconda which you downloaded, following the instructions on the download page (a command-line sketch follows this list).
  3. Congratulations, you have installed Jupyter Notebook! To run the notebook, run the following command at the Terminal (Mac/Linux) or Command Prompt (Windows):

    
    jupyter notebook
    

    See Running the Notebook for more details.
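A command-line sketch of the three steps above on Ubuntu (the installer file name is an assumption based on the Anaconda 3 release current at the time of writing; substitute whatever you actually downloaded):

    cd ~/Downloads
    wget https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh # hypothetical version; check the download page
    bash Anaconda3-5.2.0-Linux-x86_64.sh # accept the license and let the installer add itself to ~/.bashrc
    source ~/.bashrc                     # pick up the new PATH in the current terminal
    jupyter notebook                     # starts the notebook server and opens it in the browser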

2.2 TensorFlow

Official site: https://www.TensorFlow.org/?hl=zh-cn
Official docs: 在 Ubuntu 上安装 TensorFlow (Installing TensorFlow on Ubuntu)

2.2.1 About TensorFlow

TensorFlow™ is an open-source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the edges represent the multidimensional data arrays (tensors) passed between them. This flexible architecture lets you deploy computation to one or more CPUs or GPUs on a desktop, server, or mobile device through a single API. TensorFlow was originally developed by researchers and engineers on the Google Brain team (part of Google's Machine Intelligence research organization) for machine learning and deep neural network research, but the system is general enough to apply to many other domains as well.

2.2.2 Installing TensorFlow

Configure a PyPI mirror (for faster downloads):

mkdir -p $HOME/.config/pip/
gedit $HOME/.config/pip/pip.conf # delete or comment out anything already in this file, then add:
                                 # [global]
                                 # index-url = https://linux.xidian.edu.cn/mirrors/pypi/web/simple/

Installing with native pip has been verified to be simple and workable, so that is the method we use to install the Python 3.n, CPU-only version of TensorFlow:

sudo apt install python3-pip python3-dev
pip3 install tensorflow # on Ubuntu 16.04 the pip installed by apt is pip3 (Python 3.5); add sudo or --user if you hit a permission error

2.2.3 Verifying the installation

  1. Basic check. Enter the following short program in an interactive Python shell (started with python3):
    
    # Python
    import tensorflow as tf
    hello = tf.constant('Hello, TensorFlow!')
    sess = tf.Session()
    print(sess.run(hello))
    

    If the system outputs the following, you can start writing TensorFlow programs:

    
    Hello, TensorFlow!
    
  2. Check inside Jupyter Notebook: New -> Python 3 -> type import tensorflow as tf and press Ctrl+Enter; if no ImportError is raised, the setup works. (Note that Jupyter installed through Anaconda runs on Anaconda's own Python, so if the import fails there, TensorFlow also has to be installed into that environment, e.g. with Anaconda's pip.)

3 Hadoop

Official site: https://hadoop.apache.org
Official docs: Hadoop: Setting up a Single Node Cluster.

3.1 About Hadoop

3.1.1 What Is Apache Hadoop?

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

3.1.2 Included modules

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
  • Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A Scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
  • ZooKeeper™: A high-performance coordination service for distributed applications.

3.2 Installing Hadoop and setting up a pseudo-distributed cluster

3.2.1 Preparation

  1. Install the required packages: openssh-server, rsync (already shipped with Ubuntu 16.04 LTS) and openjdk-8-jdk:
    
    sudo apt-get install openssh-server openjdk-8-jdk
    
  2. Configure ssh.
    1. Make sshd start on boot. Check whether sshd is running with systemctl status sshd; if it is not, start it with the commands below (it seems to be started by default right after installation):
      
      sudo systemctl start sshd # start sshd now
      sudo systemctl enable sshd # make it start automatically on boot
      
    2. Make ssh localhost work without a password (a verification sketch follows this list):
      
      ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
      cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      chmod 0600 ~/.ssh/authorized_keys
      
  3. Configure the JDK. This mainly means setting the JAVA_HOME variable (also covered by the sketch after this list):
    
    dpkg -L openjdk-8-jdk # find the main installation directory of openjdk-8-jdk
    # by default it is /usr/lib/jvm/java-8-openjdk-amd64; use that as the value of JAVA_HOME
    echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc # append the quoted text to the end of ~/.bashrc
    source ~/.bashrc # make the change to ~/.bashrc take effect immediately (otherwise it only applies to new terminals)
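Before moving on, a quick sketch to confirm both settings (the ssh command must log in without prompting for a password):

    ssh localhost exit           # answer yes to the host-key prompt the first time; no password should be asked for
    echo $JAVA_HOME              # should print /usr/lib/jvm/java-8-openjdk-amd64
    $JAVA_HOME/bin/java -version # should report an OpenJDK 1.8 runtime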
    

3.2.2 Installing Hadoop

  1. Download to the default directory (~/Downloads/, or ~/下载/ on a Chinese-language system): hadoop-2.8.4.tar.gz
  2. Create the directory ~/bigdata to hold all the big data software, copy the downloaded Hadoop archive into it and unpack it:
    
    mkdir ~/bigdata # create the directory (no sudo needed in your own home directory; sudo would leave it owned by root)
    cp ~/Downloads/hadoop-2.8.4.tar.gz ~/bigdata # on a Chinese-language system replace `Downloads` with `下载`
    cd ~/bigdata # enter the installation directory
    tar xf hadoop-2.8.4.tar.gz # unpack
    

That completes the installation.

3.2.3 Configuring Hadoop

  1. Go into the ~/bigdata/hadoop-2.8.4/ directory:
    
    cd ~/bigdata/hadoop-2.8.4/
    
  2. "Prepare to Start the Hadoop Cluster" configuration: edit etc/hadoop/hadoop-env.sh and set

    
    # set to the root of your Java installation
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    

    Then try bin/hadoop; if it prints the hadoop command's usage/help text, the basic installation works.

  3. Pseudo-Distributed Operation configuration:
    1. Edit etc/hadoop/core-site.xml:
      
      <configuration>
          <property>
              <name>fs.defaultFS</name>
              <value>hdfs://localhost:9000</value>
          </property>
      </configuration>
      
    2. Edit etc/hadoop/hdfs-site.xml:
      
      <configuration>
          <property>
              <name>dfs.replication</name>
              <value>1</value>
          </property>
      </configuration>
      
  4. Append the following to ~/.bashrc (a quick check follows this list):

    
    # bigdata
    export BIG_DATA="/home/wsxq2/bigdata"
    
    # hadoop
    export HADOOP_HOME="${BIG_DATA}/hadoop-2.8.4"
    export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
    PATH="$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH"
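A quick check that the new variables are picked up:

    source ~/.bashrc  # or open a new terminal
    echo $HADOOP_HOME # should print /home/wsxq2/bigdata/hadoop-2.8.4 (with your own username)
    hadoop version    # should now work from any directory and report 2.8.4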
    

3.2.4 Verifying Hadoop

hdfs dfs -ls /
hdfs dfs -mkdir /input
hdfs dfs -put etc/hadoop/ /input
hdfs dfs -ls /input
hdfs dfs -ls /input/hadoop
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.4.jar grep /input/hadoop/ /output 'dfs[a-z.]+'
hdfs dfs -cat /output/*

The block above is the condensed version used in this article (with absolute HDFS paths); the detailed steps from the official documentation follow. The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see YARN on Single Node. (A jps check of the running daemons is sketched after the steps.)

  1. Format the filesystem:
    
    $ bin/hdfs namenode -format
    
  2. Start NameNode daemon and DataNode daemon:

    
    $ sbin/start-dfs.sh
    

    The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

  3. Browse the web interface for the NameNode; by default it is available at:
    • NameNode - http://localhost:50070/
  4. Make the HDFS directories required to execute MapReduce jobs:

    
    $ bin/hdfs dfs -mkdir /user
    $ bin/hdfs dfs -mkdir /user/<username>
    
  5. Copy the input files into the distributed filesystem:

    
    $ bin/hdfs dfs -put etc/hadoop input
    
  6. Run some of the examples provided:

    
    $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.4.jar grep input output 'dfs[a-z.]+'
    
  7. Examine the output files: Copy the output files from the distributed filesystem to the local filesystem and examine them:

    
    $ bin/hdfs dfs -get output output
    $ cat output/*
    

    or View the output files on the distributed filesystem:

    
    $ bin/hdfs dfs -cat output/*
    
  8. When you’re done, stop the daemons with:

    
    $ sbin/stop-dfs.sh
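While the daemons are running (i.e. between steps 2 and 8), jps from the JDK is a quick way to confirm that they are actually up; a sketch of typical output (the process IDs will differ):

    jps
    # 12001 NameNode
    # 12198 DataNode
    # 12411 SecondaryNameNode
    # 12690 Jps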
    

3.2.5 YARN on a Single Node

You can run a MapReduce job on YARN in a pseudo-distributed mode by setting a few parameters and running ResourceManager daemon and NodeManager daemon in addition.

  1. Configure parameters as follows:
    1. etc/hadoop/mapred-site.xml:

      
      <configuration>
          <property>
              <name>mapreduce.framework.name</name>
              <value>yarn</value>
          </property>
      </configuration>
      
    2. etc/hadoop/yarn-site.xml:

      
      <configuration>
          <property>
              <name>yarn.nodemanager.aux-services</name>
              <value>mapreduce_shuffle</value>
          </property>
      </configuration>
      
  2. Start ResourceManager daemon and NodeManager daemon:

    
      $ sbin/start-yarn.sh
    
  3. Browse the web interface for the ResourceManager; by default it is available at:
    • ResourceManager - http://localhost:8088/
  4. Run a MapReduce job (see the sketch after this list).
  5. When you’re done, stop the daemons with:

    
    $ sbin/stop-yarn.sh
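Step 4 above does not spell out a command: with mapreduce.framework.name set to yarn, the same hadoop jar example from section 3.2.4 is simply scheduled by YARN. A sketch (run from $HADOOP_HOME with HDFS, YARN and the input directory from 3.2.4 already in place; the output directory must not exist yet, hence output2):

    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.4.jar grep input output2 'dfs[a-z.]+'
    bin/hdfs dfs -cat output2/* # the finished job should also appear in the ResourceManager UI at http://localhost:8088/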
    

4 Hive

Official site: https://hive.apache.org/
Official docs: AdminManual Installation

4.1 About Hive

APACHE HIVE TM The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.

4.2 Installing Hive

  1. Download: apache-hive-2.3.3-bin.tar.gz
  2. Move it to ~/bigdata/: mv ~/Downloads/apache-hive-2.3.3-bin.tar.gz ~/bigdata/
  3. Unpack it (in ~/bigdata/): tar xf apache-hive-2.3.3-bin.tar.gz

4.3 Configuring Hive

Edit ~/.bashrc and append the following:

# Hive
export HIVE_HOME=${BIG_DATA}/apache-hive-2.3.3-bin
export PATH=$HIVE_HOME/bin:$PATH

4.4 Testing

  1. Start HDFS first: start-dfs.sh
  2. Run the tests.
    1. Hive CLI. To use the Hive command line interface (CLI) go to the Hive home directory and execute the following command:
      
      bin/hive
      
    2. Beeline CLI. HiveServer2 (introduced in Hive 0.11) has a new CLI called Beeline (see Beeline – New Command Line Shell). To use Beeline, execute the following command in the Hive home directory:

      
      bin/beeline
      

    If this reports an error, reformat the NameNode (hdfs namenode -format), start HDFS again and retry. On Hive 2.x an uninitialized metastore is another common cause; see the sketch below.
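Hive's Getting Started guide also has you initialize the metastore schema and create the warehouse directories in HDFS before the first run; a minimal sketch assuming the default embedded Derby metastore (run from the Hive home directory with HDFS up):

    bin/schematool -dbType derby -initSchema     # creates the metastore_db directory under the current directory
    hdfs dfs -mkdir -p /tmp /user/hive/warehouse # directories Hive expects in HDFS
    hdfs dfs -chmod g+w /tmp /user/hive/warehouse
    bin/hive                                     # should now drop you at the hive> prompt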

5 Spark

Official site: https://spark.apache.org/

Official docs:

  1. Using Spark’s “Hadoop Free” Build(spark-without-hadoop)
  2. Spark Overview

5.1 About Spark

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

5.2 Installing Spark

  1. Download: spark-2.3.0-bin-without-hadoop.tgz
  2. Move it to ~/bigdata/: mv ~/Downloads/spark-2.3.0-bin-without-hadoop.tgz ~/bigdata/
  3. Unpack it (in ~/bigdata/): tar xf spark-2.3.0-bin-without-hadoop.tgz

5.3 Configuring Spark

Edit ~/.bashrc and append the following:

# spark
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
PATH="$BIG_DATA/spark-2.3.0-bin-without-hadoop/bin:$PATH"

5.4 Testing

Spark comes with several sample programs. Scala, Java, Python and R examples are in the examples/src/main directory. To run one of the Java or Scala sample programs, use bin/run-example <class> [params] in the top-level Spark directory. (Behind the scenes, this invokes the more general spark-submit script for launching applications). For example,

./bin/run-example SparkPi 10

You can also run Spark interactively through a modified version of the Scala shell. This is a great way to learn the framework.

./bin/spark-shell --master local[2]

The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing. For a full list of options, run Spark shell with the --help option.

Spark also provides a Python API. To run Spark interactively in a Python interpreter, use bin/pyspark:

./bin/pyspark --master local[2]

Example applications are also provided in Python. For example,

./bin/spark-submit examples/src/main/python/pi.py 10

6 ZooKeeper

Official site: https://zookeeper.apache.org
Official docs: ZooKeeper Getting Started Guide

6.1 About ZooKeeper

ZooKeeper: A Distributed Coordination Service for Distributed Applications. ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.

Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications the responsibility of implementing coordination services from scratch.

6.2 Installing ZooKeeper

  1. Download: zookeeper-3.4.12.tar.gz
  2. Move it to ~/bigdata/: mv ~/Downloads/zookeeper-3.4.12.tar.gz ~/bigdata/
  3. Unpack it (in ~/bigdata/): tar xf zookeeper-3.4.12.tar.gz

6.3 Configuring ZooKeeper

To start ZooKeeper you need a configuration file. Here is a sample, create it in conf/zoo.cfg:

tickTime=2000
# the docs use /var/lib/zookeeper by default, but a non-root user cannot write there; to avoid
# constant sudo we use /tmp/zookeeper instead (note that /tmp may be cleared on reboot; see convention 9 at the top)
dataDir=/tmp/zookeeper
clientPort=2181

Append the following to ~/.bashrc:

# zookeeper
PATH="$BIG_DATA/zookeeper-3.4.12/bin:$PATH"

6.4 Testing

Now that you created the configuration file, you can start ZooKeeper:

bin/zkServer.sh start

Then you can connect to ZooKeeper:

$ bin/zkCli.sh -server 127.0.0.1:2181
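Inside the zkCli prompt, a few commands from the Getting Started guide make an easy smoke test (the znode name /zk_test and its data are just the guide's examples):

    ls /                    # list the root of the znode tree
    create /zk_test my_data # create a znode holding some data
    get /zk_test            # read the data back
    delete /zk_test         # clean up
    quit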

7 Kafka

Official site: https://kafka.apache.org
Official docs:

  1. Introduction
  2. Quickstart

7.1 About Kafka

Apache Kafka® is a distributed streaming platform. What exactly does that mean? A streaming platform has three key capabilities:

  • Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
  • Store streams of records in a fault-tolerant durable way.
  • Process streams of records as they occur.

Kafka is generally used for two broad classes of applications:

  • Building real-time streaming data pipelines that reliably get data between systems or applications
  • Building real-time streaming applications that transform or react to the streams of data

7.2 Installing Kafka

  1. Download: kafka_2.11-1.1.0.tgz
  2. Move it to ~/bigdata/: mv ~/Downloads/kafka_2.11-1.1.0.tgz ~/bigdata/
  3. Unpack it (in ~/bigdata/): tar xf kafka_2.11-1.1.0.tgz

7.3 Configuring Kafka

Append the following to ~/.bashrc:

# kafka
PATH="$BIG_DATA/kafka_2.11-1.1.0/bin/:$PATH"

7.4 Testing

  1. Start the broker: kafka-server-start.sh config/server.properties (ZooKeeper must already be running, and the relative config path means this is run from the Kafka directory; a consolidated sketch follows this list)
  2. Create a topic: kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
  3. List topics: kafka-topics.sh --list --zookeeper localhost:2181
  4. Start a producer and send a couple of messages:
    
    kafka-console-producer.sh --broker-list localhost:9092 --topic test
    This is a message
    This is another message
    
  5. Start a consumer and read the messages back:
    
    kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
    

    The output should be:

    
    This is a message
    This is another message
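For reference, the whole sequence in one place (each long-running command needs its own terminal; the paths assume the install locations used above):

    zkServer.sh start                               # ZooKeeper first (section 6)
    cd ~/bigdata/kafka_2.11-1.1.0
    kafka-server-start.sh config/server.properties  # start the broker and leave this terminal open
    # then, in new terminals:
    kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
    kafka-console-producer.sh --broker-list localhost:9092 --topic test
    kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning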
    

8 Storm

9 References (all links collected)

  1. Campus open-source mirror (https://linux.xidian.edu.cn):
    1. kafka_2.11-1.1.0.tgz
    2. zookeeper-3.4.12.tar.gz
    3. spark-2.3.0-bin-without-hadoop.tgz
    4. apache-hive-2.3.3-bin.tar.gz
    5. hadoop-2.8.4.tar.gz
    6. VirtualBox-5.2.12-122591-Win.exe
    7. ubuntu-16.04.4-desktop-amd64.iso
    8. Oracle_VM_VirtualBox_Extension_Pack-5.2.12-122591.vbox-extpack
    9. Ubuntu 镜像使用帮助
  2. Official documentation:
    1. TensorFlow:
      1. 在 Ubuntu 上安装 TensorFlow
    2. Jupyter && Anaconda:
      1. Installing Jupyter
      2. Anaconda Distribution
      3. download Anaconda
      4. Running the Notebook
    3. Hadoop:
      1. Hadoop: Setting up a Single Node Cluster.
    4. Hive:
      1. AdminManual Installation
    5. Spark:
      1. Using Spark’s “Hadoop Free” Build(spark-without-hadoop)
      2. Spark Overview
    6. ZooKeeper:
      1. ZooKeeper Getting Started Guide
    7. Kafka:
      1. Introduction
      2. Quickstart
  3. Other documents:
    1. 基于VirtualBox虚拟机安装Ubuntu图文教程