
# Parsing

One of the most important aspects of log management is parsing unstructured free-text log lines into structured documents so that they become easily searchable and analyzable. ZettaLogs offers an advanced, easy-to-use custom log parsing method. Logs are sent to ZettaLogs according to RFC 5424, which adds a header with the following structure to the log message:

<%pri%>%protocol-version% %timestamp% %hostname% %app-name% %procid% %msgid% [PROJECT-TOKEN@46644] %msg%
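As an illustration of this header structure, a line in this format can be split with an ordinary regular expression. The sample line, field names, and regex below are invented for demonstration and are not ZettaLogs internals:

```python
import re

# Illustrative regex for the RFC 5424-style header shown above.
HEADER_RE = re.compile(
    r"<(?P<pri>\d{1,3})>(?P<version>\d+) "
    r"(?P<timestamp>\S+) (?P<hostname>\S+) (?P<app_name>\S+) "
    r"(?P<procid>\S+) (?P<msgid>\S+) \[(?P<token>[^\]]+)\] (?P<msg>.*)"
)

# A hypothetical sample line in the format above.
line = ("<86>1 2016-03-25T10:39:01.614319+02:00 web01 sshd "
        "1234 - [PROJECT-TOKEN@46644] Accepted publickey for root")

fields = HEADER_RE.match(line).groupdict()
print(fields["pri"], fields["app_name"], fields["msg"])
```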

The syslog line is parsed and its fields are extracted automatically. The following fields are available to the user:

| Field | Note |
| --- | --- |
| pri | The priority value generated by rsyslog according to RFC 5424. It is calculated from the facility and severity values using the formula 8 × facility + severity. The facility takes integer values from 0 to 23, whereas severity takes integer values from 0 to 7; therefore, the value of pri may be between 0 and 191. |
| timestamp | The timestamp generated by rsyslog. Rsyslog generates and appends the time when the log line is read from the file or received from an application. |
| hostname | Name of the host where the log file was generated. |
| app_name | Name of the application that generated the log. If the log is read from a file by rsyslog, this field gets the name of the file; if an application wrote the log to rsyslog over a socket, it gets the name of the application. |
| msg | The log line itself. |

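The relationship between pri, facility, and severity can be verified with a couple of helper functions (illustrative only, not ZettaLogs code):

```python
def encode_pri(facility: int, severity: int) -> int:
    """Compute the RFC 5424 priority value from facility and severity."""
    return 8 * facility + severity

def decode_pri(pri: int) -> tuple[int, int]:
    """Recover (facility, severity) from a priority value."""
    return divmod(pri, 8)

# facility 10 (authpriv), severity 6 (informational)
print(encode_pri(10, 6))   # 86
print(decode_pri(191))     # (23, 7) -- the maximum priority value
```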
The syslog fields are readily available in each document. An aggregation analysis may be run over the app_name field to observe all the different applications generating logs and their respective quantities.

Field analysis results may also be visualized using a pie chart.

The syslog header is well defined, and its fields are automatically inserted into each log document by ZettaLogs. The format of the log message itself is application dependent and thus does not have a constant form. Each application generates many different types of log lines. ZettaLogs automatically parses most common log types, such as default Apache and NGINX access logs.

The parsed fields of an Apache log are shown in tree format. Each feature in the log line is mapped to a field in the structured document, such as clientip and request.

The different client IPs connecting to the Apache server can be visualized as a pie chart.

## Custom Formats

Although many formats for parsing common log types are declared in ZettaLogs by default, many user- and application-specific formats go unparsed. That’s why ZettaLogs gives users the means to declare their custom log formats easily.

Custom log formats are declared using regular expressions. However, since plain regular expressions can become too complicated, ZettaLogs uses a hierarchical regular expression format. This is the same format used by the grok filter from Logstash.

Even the most complicated log formats may be concisely defined using hierarchical patterns. Hundreds of common patterns are defined by default, and the user can also define custom patterns. These patterns are used as building blocks in log format definitions. The syntax for a format is

%{SYNTAX:SEMANTIC}

SYNTAX is the name of the pattern that will match the text. For example, 2016-03-25T10:39:01.614319+02:00 will be matched by the TIMESTAMP_ISO8601 pattern, 123456 will be matched by the NUMBER pattern, and 192.168.1.1 will be matched by the IP pattern. These patterns are system defined by default and can be found in the patterns table. A pattern may be composed of other patterns in a hierarchical manner. At the end of the hierarchy there is always a plain regular expression. For example, the TIMESTAMP_ISO8601 pattern is defined as

%{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?

It is composed of the YEAR, MONTHNUM, MONTHDAY, HOUR, MINUTE, SECOND and ISO8601_TIMEZONE patterns. The ISO8601_TIMEZONE pattern is itself composed of other patterns as follows.

(?:Z|[+-]%{HOUR}(?::?%{MINUTE}))

The MONTHNUM pattern, however, is at the end of the hierarchy and is a plain regex defined as (?:0[1-9]|1[0-2]).
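A minimal sketch of how such hierarchical patterns expand into a flat regular expression, using a small hand-written subset of the patterns named above (the real pattern table is far larger, and the expansion logic here is illustrative, not the ZettaLogs implementation):

```python
import re

# A small subset of grok-style patterns; real deployments define hundreds.
PATTERNS = {
    "YEAR": r"\d{4}",
    "MONTHNUM": r"(?:0[1-9]|1[0-2])",
    "MONTHDAY": r"(?:0[1-9]|[12]\d|3[01])",
    "HOUR": r"(?:[01]\d|2[0-3])",
    "MINUTE": r"[0-5]\d",
    "SECOND": r"(?:[0-5]\d(?:\.\d+)?)",
    "ISO8601_TIMEZONE": r"(?:Z|[+-](?:[01]\d|2[0-3])(?::?[0-5]\d))",
}
PATTERNS["TIMESTAMP_ISO8601"] = (
    "%{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}"
    "(?::?%{SECOND})?%{ISO8601_TIMEZONE}?"
)

def expand(pattern: str) -> str:
    """Recursively replace %{NAME} references until only plain regex remains."""
    while "%{" in pattern:
        pattern = re.sub(r"%\{(\w+)\}",
                         lambda m: PATTERNS[m.group(1)], pattern)
    return pattern

regex = re.compile(expand("%{TIMESTAMP_ISO8601}"))
print(bool(regex.fullmatch("2016-03-25T10:39:01.614319+02:00")))  # True
```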

The SEMANTIC is the identifier assigned to the piece of text being matched. For example, 192.168.1.1 could be the destination IP of a connection, so it is called dstip. Similarly, the string 2016-03-25T10:39:01.614319+02:00 might be the timestamp of the event. These identifiers will be used as the field names for the information extracted from the log event. They will be indexed and made searchable and analyzable.

Optionally, data type conversion may be added to a pattern. By default, all identifiers are stored as strings. This behavior may be overridden by specifying a data type after the identifier’s name, separated with a semicolon. For example, %{NUMBER:num;long} converts the num identifier to long. The supported conversions are long, double, boolean, date and ipv4. If data type conversion is not successful for some reason, the field is retained as a string.
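As a rough sketch of this behavior, the converters and fall-back rule below are a reimplementation for illustration, not ZettaLogs code:

```python
from datetime import datetime
from ipaddress import IPv4Address

# Hypothetical converters mirroring the supported types; on failure the
# raw string is retained, matching the rule described above.
CONVERTERS = {
    "long": int,
    "double": float,
    "boolean": lambda s: s.lower() == "true",
    "date": lambda s: datetime.fromisoformat(s),
    "ipv4": IPv4Address,
}

def convert(value: str, dtype: str):
    try:
        return CONVERTERS[dtype](value)
    except (KeyError, ValueError):
        return value  # conversion failed: keep the field as a string

print(convert("123456", "long"))        # 123456
print(convert("not-a-number", "long"))  # kept as the original string
print(convert("192.168.1.1", "ipv4"))   # an IPv4Address object
```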

The date data type optionally takes an additional argument. By default, date conversion assumes that the matched string is in ISO8601 format. However, the user may specify any Java date format. Assume that the MYPATTERN1 pattern is defined as follows.

%{MONTH} %{MONTHDAY} %{YEAR} %{TIME}%{ISO8601_TIMEZONE}

Strings matched by this pattern will be dates of the form Mar 12 2015 18:00:08+0800. If a field extracted using this pattern needs to be converted to the date type, the format definition would be as follows.

%{MYPATTERN1:timestamp;date;MMM d yyyy HH:mm:ssZ}
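The Java date pattern above can be sanity-checked outside ZettaLogs; for instance, Python's strptime directives `%b %d %Y %H:%M:%S%z` correspond roughly to `MMM d yyyy HH:mm:ssZ`:

```python
from datetime import datetime

# Parse the sample date from the text with the equivalent strptime format;
# this only illustrates the conversion result, not ZettaLogs internals.
dt = datetime.strptime("Mar 12 2015 18:00:08+0800", "%b %d %Y %H:%M:%S%z")
print(dt.isoformat())  # 2015-03-12T18:00:08+08:00
```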

In addition to Java date formats we support two special date format identifiers:

• UNIX : Number of seconds since Unix epoch.
• UNIX_MS : Number of milliseconds since Unix epoch.

For example, assuming that we have a pattern named MYPATTERN2 that extracts a field value of 1487142973, we may use the date conversion format:

%{MYPATTERN2:timestamp;date;UNIX}

The epoch time will be converted to a date. Similarly, for epoch time expressed in milliseconds, you may use the UNIX_MS date format identifier.
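The two identifiers can be illustrated with the sample epoch value from above (plain Python here, just to show the resulting date):

```python
from datetime import datetime, timezone

# 1487142973 is the sample epoch value from the text (UNIX: seconds).
dt = datetime.fromtimestamp(1487142973, tz=timezone.utc)
print(dt.isoformat())  # 2017-02-15T07:16:13+00:00

# The same instant expressed in milliseconds (UNIX_MS):
dt_ms = datetime.fromtimestamp(1487142973000 / 1000, tz=timezone.utc)
print(dt == dt_ms)  # True
```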

The extracted fields from parsed log events will be displayed on the user interface and will be ready to be used in analysis.