Using PyHive in EMR
Using PyHive on AWS (Amazon Web Services) has been a real challenge, so I'm posting all the pieces I used to get it working.
Components Installed
PyHive
python3 -m pip install --user pyhive
SASL
sudo yum install cyrus-sasl-devel - Courtesy of Stack Overflow
python3 -m pip install --user sasl
Thrift
python3 -m pip install --user thrift
python3 -m pip install --user thrift_sasl
I had to install each of the modules listed above separately: although PyHive depends on SASL and Thrift, pip does not install them as part of its dependencies. This was using Python 3.4, which was the default as of October 31st, 2018 on AWS EMR (Elastic MapReduce).
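Because these modules have to be installed one by one, a quick check that all of them can be found is handy before trying to connect. Here is a minimal sketch, assuming the import names match the pip package names used above; the helper name `missing_modules` is made up for illustration.

```python
import importlib.util

# Import names assumed to match the pip packages installed above.
REQUIRED = ["pyhive", "sasl", "thrift", "thrift_sasl"]


def missing_modules(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All PyHive dependencies are importable.")
```

`find_spec` only locates a module without importing it, so the check does not fail halfway if one of the compiled pieces is broken.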
Error
The challenge in getting this set up was that PyHive is not installed by default, and one of the dependencies it needs is the Python module sasl (SASL stands for Simple Authentication and Security Layer). Attempting to install it with pip on its own resulted in an error because the build could not find the SASL C header. I've included the error below:
In file included from sasl/saslwrapper.cpp:254:0:
sasl/saslwrapper.h:22:23: fatal error: sasl/sasl.h: No such file or directory
#include <sasl/sasl.h>
^
compilation terminated.
error: command 'gcc' failed with exit status 1
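The compiler is complaining that the header `sasl/sasl.h` is missing. A small sketch like the one below can confirm whether the header is present before retrying the pip install. The include directories are an assumption about where headers usually live on Linux, and `find_header` is a hypothetical helper name.

```python
import os

# Assumed default include directories; adjust if your headers live elsewhere.
DEFAULT_INCLUDE_DIRS = ("/usr/include", "/usr/local/include")


def find_header(relative_path, include_dirs=DEFAULT_INCLUDE_DIRS):
    """Return the full path of the first matching header, or None."""
    for directory in include_dirs:
        candidate = os.path.join(directory, relative_path)
        if os.path.isfile(candidate):
            return candidate
    return None


location = find_header("sasl/sasl.h")
if location is None:
    print("sasl/sasl.h not found; install cyrus-sasl-devel before 'pip install sasl'")
else:
    print("Found", location)
```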
After installing everything from above, I was able to get the example from the PyHive homepage working, including the use of localhost.
from pyhive import hive
cursor = hive.connect('localhost').cursor()
cursor.execute('SELECT * FROM database.table LIMIT 10')
print(cursor.fetchall())
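The example above returns bare tuples and leaves the cursor and connection open. PyHive follows the Python DB-API, so the same pattern can be wrapped in a small helper that closes the cursor and pairs each value with its column name. This is only a sketch: `fetch_as_dicts` and `query_hive` are hypothetical names, port 10000 is assumed to be where HiveServer2 is listening, and `database.table` is the same placeholder used above.

```python
def fetch_as_dicts(connection, query):
    """Run a query on any DB-API connection and return rows as dicts."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        # cursor.description holds one tuple per column; item 0 is the name.
        columns = [column[0] for column in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        cursor.close()


def query_hive():
    """Hypothetical wrapper; needs PyHive and a reachable HiveServer2."""
    from pyhive import hive  # imported here so the helper works without it

    connection = hive.connect(host="localhost", port=10000)  # port is assumed
    try:
        return fetch_as_dicts(connection, "SELECT * FROM database.table LIMIT 10")
    finally:
        connection.close()
```

Since `fetch_as_dicts` relies only on the DB-API, it can be tried out against another driver such as `sqlite3` before pointing it at Hive.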
Code Review
Expanding upon the code above, I wanted to break down each piece and gain a better understanding of what they are doing. Each segment discussed will be highlighted in a code block.
Python
- python3- This is a call to run the Python executable for version 3 of Python, as opposed to version 2, which is deprecated and will not be supported after January 1, 2020.
 
- python3 -m- The -m option allows a module to be run as a script. For example, if you tried to run python3 pip install --user pyhive without the -m, it would fail, because Python would try to execute a script file named pip rather than calling the pip module. When the command is invoked with -m, Python runs the __main__.py of the module and passes along any remaining arguments. I found this Stack Overflow post useful for a greater understanding of what's going on.
 
- python3 -m pip- PIP (Pip Installs Packages) is a package manager used to handle the retrieval and installation of Python modules. Since Python 3.4, it is included by default with Python.
 
- python3 -m pip install- Install is one of the available commands to call as part of running pip.
 
- python3 -m pip install --user- This tells pip to install the package in a location that is specific to the current user rather than in the system-wide collection of packages. The pip documentation includes more details.
- Python also has the concept of a Virtual Environment, which allows packages to be isolated from the system or user installed packages.
 
- python3 -m pip install --user pyhive- The final argument in this example is the name of the module to be installed, in this case PyHive. Other options may come after the module name, such as --no-deps, which indicates that the module should be installed without installing any dependent packages.
 
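To see the effect of -m and --user without installing anything, the pieces above can be tried from Python itself. The sketch below uses the standard library module json.tool as a stand-in for pip (an assumption made so that it runs even where pip is unavailable), and `run_python` is a hypothetical helper name.

```python
import json
import site
import subprocess
import sys


def run_python(args, stdin_text=""):
    """Run the current interpreter with args; return (exit code, stdout)."""
    process = subprocess.Popen(
        [sys.executable] + list(args),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    stdout, _stderr = process.communicate(stdin_text)
    return process.returncode, stdout


# With -m, Python finds json.tool on the import path and runs it as a script.
with_code, with_out = run_python(["-m", "json.tool"], '{"module": "json.tool"}')
print(with_code, json.loads(with_out))

# Without -m, "json.tool" is treated as a file name, which should not exist here.
without_code, _ = run_python(["json.tool"])
print(without_code)

# Packages installed with --user go into the per-user site-packages directory.
print(site.getusersitepackages())
```

The second call is expected to exit with a non-zero code, which mirrors what happens when pip is run without -m.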
Amazon Linux
- sudo- SUDO (SuperUser DO) allows the running of commands with elevated privileges as if the user was root, even when not. This Stack Overflow post has a few more details that are relevant, and Linux Academy has a helpful beginner guide.
 
- sudo yum- YUM (Yellowdog Updater, Modified) is a package manager, similar to PIP for Python, that makes installing software easier. It is associated with Linux distros (distributions) in the Red Hat family, such as CentOS, and Amazon Linux uses it as well.
- APT (Advanced Packaging Tool) is another common package manager, associated with the Debian and Ubuntu distros of Linux.
- This post from 2011 on YUM vs APT-GET Differences covers a few practical differences between the two that I've also noticed in my own recent Linux experience.
 
- sudo yum install- Install is a subcommand of YUM that retrieves a software package from a repository and installs it. Red Hat provides a cheat sheet of YUM commands.
 
- sudo yum install cyrus-sasl-devel- Using elevated privileges, run YUM using the Install subcommand to retrieve the Cyrus SASL development package. This package supplies the sasl/sasl.h header that the error above reported as missing.