Using PyHive in EMR

Using PyHive on AWS (Amazon Web Services) has been a real challenge, so I'm posting all the pieces I used to get it working.

Components Installed

PyHive

python3 -m pip install --user pyhive

SASL

sudo yum install cyrus-sasl-devel    # courtesy of Stack Overflow
python3 -m pip install --user sasl

Thrift

python3 -m pip install --user thrift
python3 -m pip install --user thrift_sasl

I had to install all of the modules listed above myself: although PyHive depends on SASL and Thrift, they aren't installed as part of its dependencies. This was with Python 3.4, the default on AWS EMR (Elastic MapReduce) as of October 31st, 2018.
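
Since these dependencies have to be installed by hand, it can help to check which of them Python can actually find. The sketch below uses the standard library's importlib.util.find_spec; the helper name missing_modules is my own, and the list assumes the import names match the pip package names used above.

```python
from importlib.util import find_spec

# Import names assumed to match the pip packages installed above.
REQUIRED_MODULES = ["pyhive", "sasl", "thrift", "thrift_sasl"]


def missing_modules(names):
    """Return the names from `names` that cannot be found on the import path.

    This only locates each module; it does not prove the import will succeed.
    """
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it with the same python3 used for the pip commands, since a different interpreter may have a different set of installed packages.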

Error

The challenge in getting this set up was that PyHive is not installed by default, and its dependencies include the Python module sasl (SASL stands for Simple Authentication and Security Layer). That module compiles a C++ wrapper against the Cyrus SASL headers, so attempting to install it with pip before the headers were present resulted in the error below:

In file included from sasl/saslwrapper.cpp:254:0:
sasl/saslwrapper.h:22:23: fatal error: sasl/sasl.h: No such file or directory
#include <sasl/sasl.h>
^
compilation terminated.
error: command 'gcc' failed with exit status 1

After installing everything from above, I was able to get the example from the PyHive homepage working, including the use of localhost.

from pyhive import hive
cursor = hive.connect('localhost').cursor()
cursor.execute('SELECT * FROM database.table LIMIT 10')
print(cursor.fetchall())
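
PyHive describes itself as a DB-API interface, so the cursor calls above (execute, fetchall) follow the same pattern as other Python database drivers. As a rough sketch, the helper below wraps that pattern and closes the cursor afterwards. The name preview_rows is hypothetical, and the demonstration uses the standard library's sqlite3 with a made-up events table only so that it runs without a Hive server.

```python
import sqlite3
from contextlib import closing


def preview_rows(connection, table, limit=10):
    """Return up to `limit` rows from `table` using any DB-API connection.

    `table` is placed directly into the SQL text, so only pass trusted names.
    """
    # str.format is used instead of f-strings so this also runs on Python 3.4.
    query = "SELECT * FROM {} LIMIT {}".format(table, int(limit))
    with closing(connection.cursor()) as cursor:
        cursor.execute(query)
        return cursor.fetchall()


# Stand-in for hive.connect('localhost'): an in-memory SQLite database.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE events (id INTEGER, name TEXT)")
demo.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

print(preview_rows(demo, "events", limit=2))
```

With PyHive the call would presumably look like preview_rows(hive.connect('localhost'), 'database.table'), though I have only sketched it against SQLite here.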

Code Review

Expanding upon the commands above, I wanted to break down each piece and gain a better understanding of what it is doing. Each segment discussed is highlighted in a code block.

Python

  • python3
  • python3 -m
    • The -m option runs a module as a script. For example, python3 pip install --user pyhive without the -m fails, because Python then looks for a script file named pip in the current directory rather than locating the pip module on its import path. With -m, Python finds the module and runs it (for a package such as pip, its __main__.py), passing along any remaining arguments. I found this Stack Overflow post useful for a greater understanding of what's going on.
  • python3 -m pip
    • PIP (Pip Installs Packages) is a package manager used to handle the retrieval and installation of Python modules. Since Python 3.4, it is included by default with Python.
  • python3 -m pip install
    • Install is one of the available commands to call as part of running pip.
  • python3 -m pip install --user
    • This tells pip to install the package in a location that is specific to the user rather than in the system-wide collection of packages. The pip documentation includes more details.
    • Python also has the concept of a Virtual Environment which would allow isolation of a particular package from the system or user installed packages.
  • python3 -m pip install --user pyhive
    • The final argument in this example is the name of the module to install, in this case PyHive. Other options are available, such as --no-deps, which installs the package without installing any of the packages it depends on.
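
To tie the pieces above together, here is a minimal sketch that assembles the same command from Python. The function pip_install_command and its parameters are hypothetical names of my own; only -m, pip, install, --user and --no-deps come from the breakdown above.

```python
import sys


def pip_install_command(package, user=True, no_deps=False):
    """Build the argument list for `python -m pip install ...`."""
    # sys.executable is the interpreter running this script, so the package is
    # installed for that interpreter rather than whichever pip is on the PATH.
    command = [sys.executable, "-m", "pip", "install"]
    if user:
        command.append("--user")
    if no_deps:
        command.append("--no-deps")
    command.append(package)
    return command


if __name__ == "__main__":
    # Print the command rather than running it; pass the list to
    # subprocess.check_call(...) to actually perform the install.
    print(" ".join(pip_install_command("pyhive")))
```

Building the command as a list keeps each argument separate, which avoids shell quoting issues if it is later handed to subprocess.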

Amazon Linux

  • sudo
  • sudo yum
    • YUM (Yellowdog Updater, Modified) is a package manager that makes installing software easier, much as PIP does for Python. It is associated with Linux distros (distributions) in the Red Hat family, such as RHEL and CentOS, and Amazon Linux uses it as well.
    • APT (Advanced Packaging Tool) is another common package manager, associated with the Debian and Ubuntu distros of Linux.
    • This 2011 post, YUM vs APT-GET Differences, covers several practical differences between the two that I've also noticed in my own recent Linux experience.
  • sudo yum install
    • install is a YUM subcommand that retrieves a software package from a repository and installs it. Red Hat provides a cheat sheet of YUM commands.
  • sudo yum install cyrus-sasl-devel
    • Using elevated privileges, run YUM with the install subcommand to retrieve the Cyrus SASL development package, which supplies the sasl/sasl.h header that the sasl Python module needs when it is compiled.
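
The compile error earlier came down to a missing sasl/sasl.h header, so a quick way to confirm that cyrus-sasl-devel did its job is to look for that file. This is only a sketch: find_header is a hypothetical helper, and the include directories listed are an assumption about typical defaults rather than something I verified on every EMR image.

```python
import os

# Assumed default include directories; adjust for your system.
DEFAULT_INCLUDE_DIRS = ("/usr/include", "/usr/local/include")


def find_header(relative_path, include_dirs=DEFAULT_INCLUDE_DIRS):
    """Return the first full path where the header exists, or None."""
    for directory in include_dirs:
        candidate = os.path.join(directory, relative_path)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    location = find_header("sasl/sasl.h")
    if location:
        print("Found", location)
    else:
        print("sasl/sasl.h not found; try: sudo yum install cyrus-sasl-devel")
```

If the header is reported as found, re-running python3 -m pip install --user sasl should get past the fatal error shown above.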