Get interactive SQL access to months of Papertrail log archives (using Hadoop and Hive), in 5-10 minutes, without any new hardware or software.
This quick start assumes basic familiarity with AWS. For step-by-step instructions or to customize, see Intro to Hadoop and Hive.
Use the AWS CLI's emr command to start a 1-node cluster and run a Papertrail Hive script to load the archives.
IMPORTANT: This will start billing for an m1.medium instance immediately and continue charging until you manually terminate the job. As of this writing, the fee is $0.022 (2.2 cents) per hour in most US and global regions. See Terminate.
First, install the AWS command-line tools. Then, similar to Creating a Job Flow Using Hive, run:
$ export AWS_ACCESS_KEY_ID="abc123"
$ export AWS_SECRET_ACCESS_KEY="def456ghi789"
$ aws emr create-default-roles
$ aws emr create-cluster --name "Papertrail log archives" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-type m1.medium --instance-count 1 \
--steps 'Type=Hive,Name="Hive Program",ActionOnFailure=CONTINUE,Args=[-f,s3://hive.papertrailapp.com/EMR/create_events-0-13-1.q,-d,INPUT=s3://YOUR-ARCHIVE-BUCKET/papertrail/logs/xyz]'
Replace YOUR-ARCHIVE-BUCKET/papertrail/logs/xyz with the bucket and path shown on Archives.
You’ll receive a job ID like j-1234567890123. In about 5 minutes, check its state with:
$ aws emr list-clusters
When the state is WAITING, SSH to the hostname shown. Use the username hadoop with your AWS key, per SSH into the Master Node:
$ ssh -i ~/my-ec2-key.pem hadoop@ec2-1-2-3-4.compute-1.amazonaws.com
After logging in via SSH, run the hive query tool to perform queries:
hadoop@domU-12-34-56-78-90:~$ hive
hive> DESCRIBE events;
hive> SELECT message FROM events LIMIT 5;
To access the Hadoop Web interface, start an SSH session which tunnels your port 9100 to the EMR node, then visit http://localhost:9100/ in a browser:
$ ssh -L 9100:localhost:9100 -i ~/my-ec2-key.pem hadoop@ec2-1-2-3-4.compute-1.amazonaws.com
When finished, terminate the job via the AWS console or CLI:
$ aws emr list-clusters
$ aws emr terminate-clusters --cluster-ids j-1234567890123
The script above loads the archives into a table called events, which has one column per archive field. This is best for most logs and works well for full-text ad-hoc search, such as:
SELECT * FROM events WHERE message LIKE '%some string%';
If some or all of your logs are in a key=value or JSON format, here are a few ways to denormalize columns.
For log messages in key=value key2=value2 format, like those from scrolls, individual keys can be referenced as columns. To do this, run the same aws emr create-cluster command from Setup, but change the Hive script URL in the --steps argument to s3://hive.papertrailapp.com/EMR/create_events_kv-0-13-1.q.
Papertrail will create an events_kv table where the message is a hash instead of a string. The log message path=/abc/def status=200 appcode=abc size=5310 could be queried with:
SELECT message['status'], message['appcode'] FROM events_kv;
The create_events_kv-0-13-1.q script assumes that = delimits keys from values and ` ` (space) delimits key=value pairs from one another. These can be changed.
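Delimiters like these are declared with Hive's ROW FORMAT DELIMITED clause. Here's a minimal sketch of a table whose message column is a Hive map using the same conventions; the table name and columns are illustrative, not the script's actual schema:

CREATE TABLE events_kv_example (
  received_at string,
  message     map<string,string>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'            -- separates the table's columns
  COLLECTION ITEMS TERMINATED BY ' '   -- separates key=value pairs from one another
  MAP KEYS TERMINATED BY '=';          -- separates keys from values

Changing the two map delimiters is enough to support formats like key:value,key2:value2.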
Hive has two popular ways of working with JSON:

- declaring the JSON structure in the schema with a serializer/deserializer ("SerDe"), such as the Hive-JSON-Serde
- using a LATERAL VIEW to parse a string column at query time. This is easier to set up but slightly less powerful.

Here's an example of using the SerDe to make an array, and a hash of arrays, queryable as native Hive columns:
CREATE TABLE json_nested_events (
  country string,
  languages array<string>,
  religions map<string,array<int>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
...
SELECT religions['catholic'][0] FROM json_nested_events;
For more, see the README on GitHub, including Nested structures.
Hive supports a LATERAL VIEW which can act on a string column (like events.message from Setup). See Hive and JSON made simple for user-defined functions (UDFs) and examples.
Combine a LATERAL VIEW with json_tuple() to analyze arbitrary JSON attributes. For example, given the JSON log message:
{ "status":200, "appcode":"abc" }
Run:
SELECT received_at, message_view.* FROM events
LATERAL VIEW json_tuple(events.message, 'status', 'appcode') message_view AS status, appcode;
This uses an alias ("view") message_view, which is the parsed JSON in events.message. The query returns events.received_at alongside two JSON hash keys, status and appcode, as result set columns. Messages which are not valid JSON, or which are valid JSON but do not contain the requested keys, return NULL for those columns. They can be filtered out in a WHERE clause.
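For example, to keep only events whose message parsed as JSON and contained a status key, here's a sketch reusing the query above:

SELECT received_at, message_view.status, message_view.appcode FROM events
LATERAL VIEW json_tuple(events.message, 'status', 'appcode') message_view AS status, appcode
WHERE message_view.status IS NOT NULL;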
Nested JSON elements can be accessed the same way. For example, consider changeset (a hash) and ids (an array):
{ "status":200, "appcode":"abc", "changeset":{ "name":"Sally", "address":"123 Home Row" }, "ids":[3,7,10] }
Turn values from the nested changeset hash into columns:
SELECT received_at, message_view.status, changeset_view.* FROM events
LATERAL VIEW json_tuple(events.message, 'status', 'changeset') message_view AS status, changeset
LATERAL VIEW json_tuple(message_view.changeset, 'name', 'address') changeset_view AS name, address;
Also explode the ids values into a column:
SELECT received_at, message_view.status, message_view.ids, changeset_view.*, ids_view.exploded_id FROM events
LATERAL VIEW json_tuple(events.message, 'status', 'changeset', 'ids') message_view AS status, changeset, ids
LATERAL VIEW json_tuple(message_view.changeset, 'name', 'address') changeset_view AS name, address
LATERAL VIEW explode(split(regexp_replace(message_view.ids, '^\\[|\\]$', ''), ',')) ids_view AS exploded_id;
Although the Hive-JSON-Serde might be a better fit for JSON this complex, it’s still possible using LATERAL VIEW. For additional examples, see Hive plays well with JSON.
Hive supports outputting results from one query into a second table, where they can be persisted for additional reporting. Use SELECT to generate useful results, then CREATE TABLE my_second_table .. followed by INSERT OVERWRITE TABLE my_second_table SELECT .. to persist them.
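For example, here's a sketch that persists per-status event counts; the status_counts table name is illustrative:

CREATE TABLE status_counts (status string, n bigint);

INSERT OVERWRITE TABLE status_counts
SELECT message_view.status, count(*) FROM events
LATERAL VIEW json_tuple(events.message, 'status') message_view AS status
GROUP BY message_view.status;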
This can also be used to extract values from within strings. See Hive’s functions, such as regexp_replace(), sum(), and parse_url_tuple().
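For instance, assuming messages contain a size=<bytes> field as in the key=value example earlier, something like this would total the bytes; the regular expression and alias are illustrative:

SELECT sum(cast(regexp_replace(message, '^.*size=(\\d+).*$', '$1') AS bigint)) AS total_bytes
FROM events
WHERE message LIKE '%size=%';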
EMR supports multiple Hive versions. The scripts above, create_events-0-13-1.q and create_events_kv-0-13-1.q, use the MSCK REPAIR TABLE command to make partitions accessible. This command was introduced in Hive 0.13.1.
For Hive versions prior to 0.13.1, use s3://hive.papertrailapp.com/EMR/create_events.q or s3://hive.papertrailapp.com/EMR/create_events_kv.q. They use the ALTER TABLE .. ADD PARTITION command instead of MSCK REPAIR.
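As a sketch of the difference, where the dt partition column and the S3 path layout are illustrative (the scripts define the real ones):

-- Hive 0.13.1 and later: discover every partition under the table's location
MSCK REPAIR TABLE events;

-- Earlier Hive versions: register each partition explicitly
ALTER TABLE events ADD PARTITION (dt='2015-01-01')
LOCATION 's3://YOUR-ARCHIVE-BUCKET/papertrail/logs/xyz/dt=2015-01-01/';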
Something you’re trying to accomplish? Contact us.
The scripts are not supported under any SolarWinds support program or service. The scripts are provided AS IS without warranty of any kind. SolarWinds further disclaims all warranties including, without limitation, any implied warranties of merchantability or of fitness for a particular purpose. The risk arising out of the use or performance of the scripts and documentation stays with you. In no event shall SolarWinds or anyone else involved in the creation, production, or delivery of the scripts be liable for any damages whatsoever (including, without limitation, damages for loss of business profits, business interruption, loss of business information, or other pecuniary loss) arising out of the use of or inability to use the scripts or documentation.