Splunk4 + Instant Messaging = SplunkAIM

February 7th, 2010 by David Carasso

This small, unofficial project integrates an open-source AIM (AOL Instant Messaging) Chatbot with Splunk 4, allowing ad hoc searching, running of prepared searches, and real-time search alerting via instant messaging.

What’s real-time searching? It’s new in Splunk 4.1, out shortly, and will allow users to search for “real-time” events, within seconds of them reaching Splunk. Most usefully, you can set up real-time searches and be IM’d with the matching events the second they show up. You could ask to be IM’d, for example, whenever someone logs into your system, whenever there’s an error, whenever someone logs in as root, etc.


Above is a screen capture of real-time alerts printing out for each time someone downloads Splunk!

Note: You can use this project with Splunk 4.0, and everything other than real-time searches will work. That means you can do ad hoc searches and run saved searches over historical data.

Download Project

Example Searches

    ?                               prints out a help message explaining the commands.
    rtsearch login root             sets up a real-time alert that IMs you whenever a user logs in as root.
    rtlist                          lists all your real-time alert jobs.
    rtstop *                        cancels all your real-time alerts.
    search login | top 5 username   runs a historical search reporting the top 5 users who logged in the most.
    admin error                     IMs that do not start with a known command search the existing saved searches (here, saved searches about admin errors).
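
As a rough illustration of the command grammar listed above, here is a minimal Python sketch of how a chatbot might classify an incoming IM. It is not SplunkAIM's actual code: the function name, the tuple it returns, and the "saved_search" label are assumptions made for this example.

```python
# Hypothetical sketch; not taken from the SplunkAIM project.
KNOWN_COMMANDS = ("rtsearch", "rtlist", "rtstop", "search")

def classify_message(text):
    """Return a (kind, argument) pair for one instant message."""
    text = text.strip()
    if text == "?":
        return ("help", "")
    # Split the first word from the rest of the message.
    command, _, rest = text.partition(" ")
    if command in KNOWN_COMMANDS:
        return (command, rest.strip())
    # Anything else is treated as a lookup against saved searches.
    return ("saved_search", text)
```

With this sketch, "rtsearch login root" is classified as an rtsearch command with the argument "login root", while "admin error" falls through to the saved-search lookup.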

Is Your BlackBerry App Spying on You?

February 7th, 2010 by Chris Eng

Tyler Shields gave a presentation earlier today at ShmooCon 2010 on the threats of mobile spyware, particularly as it relates to data privacy. Smart phones and mobile applications have grown tremendously popular over the past couple of years, and it seemed like an appropriate time to raise awareness of what these applications are capable of.

Our goal was to demonstrate how BlackBerry applications can access and leak sensitive information, using only RIM-provided APIs and no trickery or exploits of any sort. We make no assumptions about how the malicious application will be installed on the phone, and we haven’t attempted to sneak a malicious application into BlackBerry App World. BlackBerry apps can be installed from any location, plus, there are so many examples of malware slipping through the screening processes of the various app stores (Apple, Symbian, Android, etc.) that we didn’t find it necessary to prove the point again. To some degree, official app stores give users a false sense of security because people will assume that everything in the store must be trustworthy.

Here’s a video that demonstrates the features of Tyler’s proof-of-concept spyware. We show how it can be used to dump contacts and messages, intercept text messages, eavesdrop on the room, report on phone usage, and monitor GPS data. To view this in HD resolution, click through to Vimeo and use full screen mode for best results.

 

We’re also releasing source code. As far as we know, this is the first public release of source code that demonstrates such a broad range of malicious functionality on a BlackBerry device. Code reviewers and security practitioners can use it as an educational resource to help them recognize malicious behavior and understand the specific risks introduced. This is an important educational asset for those of us working to create more secure software. As for the bad guys, it would be naive to think that they don’t already know how to do this stuff. The code doesn’t go out of its way to be stealthy; in fact, it’s quite the opposite (by design).

Here are the goods:

Slides: Blackberry Mobile Spyware — The Monkey Steals the Berries
Source: txsBBSpy.java

So how can users protect themselves? There are a few places to defend against malware of this nature.

  1. Users can configure their default application permissions to be more restrictive. This way, if an application tries to use an API that accesses the user’s email or contact list, the OS will ask for permission. Avoid granting applications “trusted application” status, which grants untrusted applications additional privileges. Tyler’s slide deck shows the default and trusted permission sets in more detail.
  2. Corporations using a BlackBerry Enterprise Server can configure their IT policies to restrict their users from installing third-party applications, or whitelist certain approved applications (but brace yourself for the backlash).
  3. BlackBerry App World could introduce a rigorous security screening process that submitted applications must pass in order to be listed in the store.

If app stores don’t provide any security testing, the risk reduction responsibility falls to the enterprise. We recommend creating an approved list of applications that have undergone security testing.

Finally, it should be noted that while we chose BlackBerry for our proof-of-concept, this is not just a BlackBerry problem. All mobile platforms provide similar mechanisms for writing applications that have access to the user’s personal, potentially sensitive information. As consumers become increasingly dependent on their mobile devices, we are certain to see an uptick in the volume and sophistication of mobile malware.

Apoena

February 5th, 2010 by Felipe Afonso

The goal is to analyze Snort logs to get a general view of network events. On the left is the attacker view, which plots a sector graph showing the quantity (radius) and priority (red, yellow, green) of attacks; on the right is the victim view, with the same information. You can filter by protocol (TCP, UDP, ICMP) and by priority. The graph is interactive, and a right mouse click retrieves the original log.
This is the abstract of the paper; the original was written in Portuguese.
ABSTRACT
The compromise of computer systems generates evidence on various devices such as routers, operating systems and applications. Monitoring and analyzing this large amount of data is a challenge for network administrators. One way to analyze such large amounts of data is to use information visualization to provide one or more graphics capable of summarizing the data and translating it into information. This work presents a study on the use of visualization techniques applied to information security and monitoring of computer networks, with emphasis on visual analysis of logs generated by the intrusion detection system Snort. It also reports the development of software called Apoena, which aims to analyze the alerts generated by Snort, using graphs and pie charts to display network events.

Unless you have had your head in the sand, you know that SQL Injections have made a fierce comeback to the top of the threat vector charts this year. According to the WHID (Web Hacking Incidents Database), SQL injection is still king of the attack vectors, accounting for 19 percent of attacks, followed by authentication abuse (11 percent), content spoofing (10 percent), DDoS/brute force (10 percent), configuration/admin error (8 percent), cross-site scripting (8 percent), cross-site request forgery (5 percent), DNS hijacking (5 percent), and worms (3 percent).

Reflect on the recent increase in compliance legislation requiring businesses to provide dynamic data access to customers for banking, healthcare, or the influx of simple purchases on the web, and the concern may be scarier for all of us. Recently, Dark Reading reported on the number of companies that have been compromised through SQL Injection attacks.

What is SQL Injection, and How Does it Work?

If you don’t know what this is, or have only just learned what SQL is, I recommend going to OWASP.org and reading up a little. It is a great resource, and the many security professionals dedicated to the Open Web Application Security Project deserve a big shout-out.

Let’s start with how SQL Injection actually works. SQL Injection occurs when an attacker is able to insert a series of SQL statements into a ‘query’ by manipulating a data input, usually a form for users to update their account information. Common relational database management systems that use SQL include Oracle, MSSQL Server, DB2, Sybase, Informix, MS Access, and Ingres; of those, MSSQL is the most popular.

Whether you are a potential attacker, auditor, researcher or an application developer, you may go through the same steps to exploit or find exploitable code:

  1. Input Validation
  2. Information Gathering
  3. 1=1 attacks
  4. Data extraction
  5. OS interaction
  6. OS Cmd Prompt
  7. Expand influence

More information available at OWASP (Victor Chapela, OWASP, “Advanced Topics on SQL Injection Protection”)

Splunk and SQL Injections

Splunk approaches this attack a little differently because of our ability to make all IT data security-relevant. Within the Splunk index, organizations will collect logs, custom application logs, traps, configurations, stack traces, scripted outputs, auth data and metrics for analysis. Splunk can help apply security/audit logic in various detective controls: aggregate IT data in one place, make simple sense of the data, and apply relationship logic to what might appear to be a standard operational issue. This logic can give you a “report”, alert, dashboard, or e-mail, run a script to gather more information, or simply create a news feed of the ongoing event to send to a ticketing system, incident software or other system. Not just designed for tier 1 troubleshooting, Splunk can help incident handlers and analysts backtrack events by digging into logs across geographies, datacenters, applications and technologies. Incident handlers, auditors, and security professionals can then persist the same logic in a “search” to identify the next occurrence, proactively defending sensitive assets.

How Does Splunk Do This?

Well, for this case, let’s use the Splunk Security application, Enterprise Security Suite (ESS). ESS enables simple searching to illuminate information in the muddy, challenging environment of security and operational data accessed by more people than just security androids and SysAdmins. I use it to help organize security data into categorical security areas: Access Protection, Endpoint Protection, Network Protection, Incident Response, and Governance.

You Already Have the Answers: In Your IT Data

The same Splunk rules still apply: you have to put your data in to get good information out. So we need a few key pieces of data to find a SQL Injection, some more damning than others, to identify what starts out as an operational issue but turns into a security investigation.

IDS Logs and Events

Though there are many methods of subterfuge to avoid IDS/IPS detection of the 1=1 statement, getting a look at the application data in a purported attack via a Snort/Cisco/Juniper alert is very helpful as part of a correlated event. SQL injection may include logic of its own; depending on the input validation, a ‘;’ may be a sign, and JOIN or UNION statements may also be indicators of misuse.

Packet Capture

Always good; looking at application data in the packet alongside SQL statements is certainly going to be helpful. The thing is, database replication, linked databases, etc. are often capable of using HTTP as the transport protocol, so be advised: this could be a lot of data, and it may be legitimate. Alerting on these events in Splunk would let you execute a script to trigger tcpdump or something similar based on an event, if tcpdump is available to the Splunk instance.

Vulnerability Assessment Tools

Nessus events, or events from other audit tools, can help qualify the actual threat of the injection language based on the type of systems you are protecting. MSSQL statements are more forgiving than, say, Informix, and if you are a UNIX shop, MSSQL attacks do not pose a risk, though this may mean some interrogative work.

Anti-virus

This is handy should malicious code be dropped, downloaded and/or propagated via SQL language over HTTP, FTP, SSH or other file transfer protocols. When an event turns off AV, or a failure occurs after a noticed injection, there should be concern as to the sanctity of the system it failed on.

Host Data

Perhaps the application server and database servers have file integrity monitoring, or maybe a scripted output of binaries such as top, psstat, or, in Windows, netstat and ipconfig? Spotting a new listening service you didn’t install may be after the fact, but it is at least identifiable with Splunk. If you happen to have something like OSSEC installed, or other kernel monitoring software, perfect. An example SQL Injection provided by OSSEC looks like this:

200.96.104.241 - - [12/Sep/2009:09:44:28 -0300] "GET /modules.php?name=Downloads&d_op=modifydownloadrequest&%20lid=-1%20UNION%20SELECT%200,username,user_id,user_password,name,%20user_email,user_level,0,0%20FROM%20nuke_users HTTP/1.1" 200 9918 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

Note the UNION statement followed by %20SELECT%200. This is the beginning of pulling the user records from a PHP-Nuke site, and getting table data on all of their usernames, IDs, passwords, and e-mails. If you aren’t careful and your web application accounts all share the same credentials, you could be the subject of more than a Twitter DDoS.
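
To make the pattern easier to spot, the request can be URL-decoded and checked for a few SQL keywords. The Python sketch below is only an illustration under stated assumptions: the function name and the two regular expressions are invented for this example, the keyword list is far from complete, and real attacks use many evasions it would miss. It is not how OSSEC or Splunk detect injections.

```python
import re
from urllib.parse import unquote

# Illustrative patterns only (assumptions for this sketch, not a rule set).
SQLI_PATTERN = re.compile(
    r"\bunion\b.+\bselect\b|\bor\s+1\s*=\s*1\b|;\s*--", re.IGNORECASE
)
# Pulls the request path out of a combined-format access log line.
REQUEST_PATTERN = re.compile(r'"(?:GET|POST|HEAD|PUT|DELETE) (\S+) HTTP/[\d.]+"')

def suspicious_request(log_line):
    """Return the URL-decoded request path if it looks like SQL injection, else None."""
    match = REQUEST_PATTERN.search(log_line)
    if not match:
        return None
    decoded = unquote(match.group(1))  # turns %20 back into spaces
    return decoded if SQLI_PATTERN.search(decoded) else None

# The OSSEC sample from the article, split across lines for readability.
SAMPLE = (
    '200.96.104.241 - - [12/Sep/2009:09:44:28 -0300] '
    '"GET /modules.php?name=Downloads&d_op=modifydownloadrequest'
    '&%20lid=-1%20UNION%20SELECT%200,username,user_id,user_password,name,'
    '%20user_email,user_level,0,0%20FROM%20nuke_users HTTP/1.1" 200 9918 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"'
)

print(suspicious_request(SAMPLE))
```

Decoding first matters because the raw log line contains no literal spaces inside the request, so a keyword search on the undecoded text would have to account for %20 and other encodings.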

Application Logs

Depending on verbosity, web server logs may be a good starting point to determine the source/destination of IP traffic, as well as success/failure and some explanation of why. There is nothing special about these logs; we use them for operations and security. Several types of HTTP status errors can be helpful in determining what is going on, for instance a 403 or a 401 error on a customer-facing application.

User Audit Log

Finding failed and successful authentication attempts on a local system is extremely helpful, especially around the time something changed in our environment. Note that actual injection code may be executed on the application tier or web host, rather than on the database server itself. Detective controls should apply to the entire transaction architecture.

MSSQL Error Logs

Error logs provide some good information for Splunking account access, and database errors especially after isolating the application error occurring in the transaction model. To add the error log on a forwarder as an input:

[source::...MSSQL\\LOG\\ERRORLOG]
CHARSET = UTF16-LE
NO_BINARY_CHECK = true

(Make sure to do this on the indexer as well)

The most valuable source of information is actual data across the wire. Ideally you are using tcpdump or Wireshark to monitor the SQL server (if you aren’t, you should at least have the capability should there be a threat). If extended procedures like “xp_cmdshell” are being executed, they may actually be logged when invoked. xp_cmdshell enables a virtual cmd shell within an SQL statement. Maybe you have Windows event logging and registry settings baselined, and a profile of your SQL server’s persisted connections baselined, so you can apply a diff to them frequently, uncovering new network connections and services alike. Similarly, user logon/logoff events to the operating system and application, both successful and denied, are handy and can indicate if a system account seems to persist in failure, à la brute force attack. We may also see a large number of auth attempts to SQL Server, or the database, through the Windows Application log as well.
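
The idea of diffing a baseline against the current state can be shown with a few lines of Python. This is a minimal sketch, assuming listening services are represented as (protocol, port) pairs; that representation and the function name are assumptions for the example, not something Splunk provides.

```python
def new_listeners(baseline, current):
    """Return (protocol, port) pairs present now but absent from the baseline."""
    return sorted(set(current) - set(baseline))

# Hypothetical data: a baselined SQL server and a later snapshot.
baseline = [("tcp", 1433), ("tcp", 3389)]
current = [("tcp", 1433), ("tcp", 3389), ("tcp", 4444)]
print(new_listeners(baseline, current))  # prints [('tcp', 4444)]
```

The same set difference works for connections, services or registry keys, as long as each item can be reduced to a comparable value.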

Determining Contextual Relevance

The same information has different context depending on who is looking at it, what it is used for, and how it relates to other data, what we call contextual relevance. In Security, almost every technology is interrelated in either debunking a threat, or validating it. We use Splunk to help solve Operations availability problems, Information Security problems, Compliance requirements, both on the fly and proactively, bringing these problems to light with alerting and notification.

Any of these events may show up, though on a compromised system running a Splunk forwarder, logging may have been turned off or the system clock reset. The stopping of event data in logs is also telltale, so treat “no events” as an immediate concern, and persist a search that alerts when no events appear.
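
The "alert when nothing arrives" logic can be sketched outside of Splunk in a few lines of Python. The example below assumes you already have the time of the most recent event per host; the dictionary shape, the function name and the 15-minute threshold are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def silent_hosts(last_seen, now, max_silence=timedelta(minutes=15)):
    """Return hosts whose most recent event is older than `max_silence`."""
    return sorted(
        host for host, seen in last_seen.items() if now - seen > max_silence
    )

# Hypothetical data: db01 stopped logging 40 minutes ago.
now = datetime(2010, 2, 7, 12, 0, 0)
last_seen = {
    "web01": now - timedelta(minutes=2),
    "db01": now - timedelta(minutes=40),
}
print(silent_hosts(last_seen, now))  # prints ['db01']
```

A sensible threshold depends on how chatty each source normally is, so a quiet host may need a longer window than a busy web server.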

Input validation is really the first way to stop the problem in development, but stored procedures can be modified to add a regular expression for a credit card number, an SSN, or other fields. Preventing the additive SQL command in the input space of a form, stops the problem ahead of time, so does fuzzing before code is live, but secure coding is another chapter.

Some of the information available in Splunk will also allow the operations personnel or administrators to go back and audit configuration files, to see if DB errors are being thrown to the user’s screen, enabling enumeration/info gathering. Even if the code cannot be pulled from production, enumerating poor SQL input validation can be more difficult if error reporting is turned off.

For instance, is your config file configured to throw exceptions on that server? Have a look in Splunk. Take HTTP status errors, for example a 403 Forbidden status code: this may be something that gives valuable information to a potential attacker. 403 errors may be something you don’t want to show folks, unless in a development environment, so this may suggest turning off error responses if you see them in Splunk. At the same time, this generates an event beneficial to the application owner, but maybe doesn’t need to be broadcast. Users should see 404 errors when error notification is disabled, rather than a 403 error. When in doubt, look in Splunk! A quick search for all 403 errors may let you know there is a potential SQL Injection occurring.

Once you find a 403 occurrence, or a bunch of failed account logins to a database, or an application server unavailable (503), use Splunk to span a 15-minute window by dragging the search range to the 15 minutes around the error. Then, look for all events during that time, on the host that may have hosted the error event. Perhaps a system has been compromised, and credentialed access may allow downloading/uploading?
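
For readers who want to see the windowing logic spelled out, here is a small Python sketch of "all events on one host in the 15 minutes around an error". It assumes events are (timestamp, host, message) tuples and that the window is centred on the error; both are assumptions for the example, and in Splunk itself you would simply adjust the search time range.

```python
from datetime import datetime, timedelta

def events_around(events, error_time, host, window=timedelta(minutes=15)):
    """Return events from `host` inside a window centred on `error_time`."""
    half = window / 2
    start, end = error_time - half, error_time + half
    return [e for e in events if e[1] == host and start <= e[0] <= end]

# Hypothetical events around a 403 seen at 12:00 on db01.
error_time = datetime(2010, 2, 7, 12, 0, 0)
events = [
    (error_time - timedelta(minutes=5), "db01", "login failed"),
    (error_time + timedelta(minutes=7), "db01", "login ok"),
    (error_time + timedelta(minutes=8), "db01", "outside the window"),
    (error_time, "web01", "different host"),
]
for event in events_around(events, error_time, "db01"):
    print(event[2])
```

Only the first two events are printed: the third falls outside the 7.5 minutes on either side of the error, and the fourth belongs to another host.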

Look for failed and successful authentications to the same system. Look for escalated account privileges. Splunk allows you to run a “System Profiler” as well to look at things like the change in listening services and ports on the Network, and things like infections or the state of anti-virus from the Endpoint Protection dashboard. When you find the behavior in all of the layers of events that correlate the event, create a search to populate a summary index as a “notable event” appearing in the Security Posture dashboard, and create an alert for it. Next time, Splunk users and Incident Handlers alike will know without re-creating the on the fly search we just did.

Splunk Helps Make the Seemingly Unrelated – Related

What does all this mean? Simply this: simple users can apply complex concepts from experts in the field to search, detect, alert and report on security threats like SQL Injections, using Splunk. ESS allows a wide variety of users to look at the same information and derive different conclusions based on contextual relevance. Most real security risks don’t flash “SQL INJECTION ATTACK” in your dashboard; you need to understand your environment and what pieces work together. Given the increased frequency of Web application threats, specifically SQL Injection, identifying the threat itself depends on multiple layers of security as well as a way to simply search through all of those layers simultaneously. Found an IP of a suspected system? Splunk it. Found an error for authentication to a back-end system? Splunk it. Can’t find the relationship between events in a given period of time? Use Splunk to make the seemingly unrelated, correlated.

For more information on how to monitor your MSSQL-driven application using Splunk, drop us a line: info@splunk.com

Microsoft has provided their own suggestions as well. http://msdn.microsoft.com/en-us/library/ms998271.aspx

Be successful with Splunk in about an hour…

February 4th, 2010 by David Carasso

Here’s a document that can get you analyzing real data and making real charts, in about an hour or two…

Dive into Splunk

Feedback really, really appreciated.

Splunk Reports

Reports you could be making in about an hour!

Splunk memory use patterns

February 3rd, 2010 by Joshua Rodman

From an operating-system perspective, splunk is a system of programs that work together to provide the utility that users experience. Each of these programs has its own memory use patterns, and having some idea of them is good for investigating memory exhaustion/performance problems, as well as resource planning.

The involved parties in the splunk memory picture are:

  • the operating system
  • splunkweb
  • splunkd

Programs launched by splunkd:

  • splunk-search
  • python search processors
  • splunk-optimize
  • scripted inputs such as wmi, imap, regmon, admon, vmware, or your own customized/created agents
  • scripted alerts
  • index management scripts (warmtocold, coldtofrozen)
  • scripted auth

Many of these (especially the scripts) are largely external to splunk, in that splunkd runs them as requested, but their resource consumption is up to third party authors, external designs, or external factors. The size of these tools will not be covered in great detail.

Operating system

The operating system is expected to provide an efficient data cache for splunk data files, including:

  • splunk binaries
  • web assets
  • config files
  • indexed data files
  • input log files
  • etc

Since memory access is several orders of magnitude faster than disk access, a healthy splunk system should have a significant amount of memory unallocated by any process at most times. A good ballpark is half or more of the RAM free for caching purposes. A corollary is that your operating system should be making use of all your memory.

General memory info

When measuring the memory use of actual programs, always remember to review the real memory usage, not the virtual. Real memory usage is sometimes called “RSS”, “RSIZE”, or “in core”. On Windows the closest approximation is “Private working set”. This can be a bit misleading, as a system under heavy memory pressure will page out more of the memory allocated to programs. Therefore it’s best to first get a sense of overall system memory pressure before reviewing process sizes.

(There are other misleading factors: it’s generally a bad idea to measure dissimilar programs simply by RSIZE to gauge their ‘bloat factor’. If you care about this sort of thing you might be interested in smem on Linux: http://www.selenic.com/smem/ )
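
On Linux, the real and virtual sizes discussed above can be read from /proc/<pid>/status, where VmRSS is the resident size and VmSize the virtual size, both in kB. The Python sketch below is one way to pull out those two numbers; the function names are invented for this example, and it is Linux-only.

```python
def parse_memory_status(status_text):
    """Extract VmRSS (resident) and VmSize (virtual), in kB, from status text."""
    sizes = {}
    for line in status_text.splitlines():
        name, _, value = line.partition(":")
        if name in ("VmRSS", "VmSize"):
            # Values look like "   123456 kB"; keep the integer part.
            sizes[name] = int(value.split()[0])
    return sizes

def read_memory_status(pid="self"):
    """Read and parse /proc/<pid>/status (Linux only)."""
    with open("/proc/{}/status".format(pid)) as handle:
        return parse_memory_status(handle.read())
```

Calling read_memory_status() with the pid of splunkd returns a dictionary with both figures, which makes the gap between virtual and resident size easy to see.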

Splunk Web, the Python process, does need to buffer the data being fed immediately to the browser. For the most part, the RAM requirements are modest (tens to perhaps 100 MB), but there are patterns that can push it up.

If you are displaying 50 items a page, splunkweb will have to acquire 50 items in an XML document from splunkd and then render an HTML fragment with these 50 items. Normally this isn’t very large, and the default document trims them to a fixed number of lines (to avoid breaking the browser). However, for odd cases (events containing lines that are tens or hundreds of kilobytes long) this could become significant per client.

Another example would be a case where you request display of the top 10,000 hostnames based on event quantity. splunkd will need to generate an XML document with 10k stanzas, which Python will have to load and parse, and then generate an HTML entry with the same.

Thus large display cases, multiplied by user concurrency, will predictably cause splunkweb to grow. For so-called ‘pathological’ situations I’ve seen splunkweb grow by 200-300MB for one user.

splunkd has a few tasks in parallel:

  • reading in data from various inputs
  • processing data prior to indexing
  • building indexed data structures
  • launching search requests and providing results, both interactively and scheduled
  • authenticating users
  • possibly sending data outbound to other systems.

While all these tasks use memory, there are a few that dominate.

program baseline

splunkd is a big program. The program text itself will use some 30MB or more.

pipeline data

All the data flows of pre-indexed data to the index on disk or to network outputs live in memory. Typically for both forwarders and indexers, this data is some tens of megabytes. On an indexer, the data size is proportional to event size. Thus if you have a majority of very large events (java exception backtraces, web page documents) then this data will grow proportionally.

Pipeline data can grow sharply when the system is not able to keep up with the dataflow for some reason. An extremely underutilized system will have 1-2 events in each FIFO queue, while a system that is behind will fill up to 1000 events in each FIFO queue. Thus you can grow from ~1MB of pipeline data to more like 20-40MB of pipeline data quickly in situations like disk bandwidth exhaustion, or a blocked downstream splunk instance.

index structures

As part of making the data searchable, an index is built for it. This is built in memory and then flushed out when the memory buffer is full. Each index has an independent buffer.

In Splunk 3, the default per-index buffer was 10MB, while the default buffer for the main user-data index was 100MB. Typically, additional indexes with significant volume would be configured with similarly large buffers, so a high-volume server with two user-data indexes might have around 200+MB for indexing buffers.

In Splunk 4.0, the default per-index buffer is 5MB, while the default for the main user-data index is 20MB. A similar example on Splunk 4 would be more like 30-40MB for indexing buffers.

In both 3 and 4, if the number of indexing threads goes up, additional buffers are allocated for these additional threads. We strongly recommend against adjusting the number of threads.

ldap authentication data

In splunk 3.x and 4.0.x, the responses to the defined LDAP searches that gather user information and group information are buffered in RAM. In some cases, this can be quite large. Ideally these searches should be tuned to narrow the data down to what is necessary. Splunk 4.1 will not buffer significant LDAP data.

searches

In splunk 3, searches live in splunkd ram. Approximately 100k events will result in memory allocations on the order of 1GB.

In splunk 4, the only significant memory use for search will be generating xml descriptions of events. For splunkweb and well-behaved REST clients, this will be very small. It’s possible for a poorly behaved REST client to request extremely large documents which will kick this up.

splunk-search

Splunk-search (4.x+) runs all the operations requested by the search expression, including pulling data off disk, adding fields, sorting, timecharts, and so on. Some operations, like deduping, can use significant memory for large numbers of events, while a simple search does not. Thus, searches will vary from some tens of megabytes to multiple gigabytes.
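
To see why an operation like deduping costs memory while a plain filter does not, consider this minimal Python sketch. It is not Splunk's implementation; the function and the dictionary-shaped events are assumptions for illustration. The point is that the set of already-seen values must stay in memory for the whole search, so it grows with the number of distinct values.

```python
def dedup(events, field):
    """Yield the first event seen for each distinct value of `field`."""
    seen = set()  # grows with the number of distinct values
    for event in events:
        key = event.get(field)
        if key in seen:
            continue
        seen.add(key)
        yield event

# Hypothetical events: two hosts, three events.
events = [
    {"host": "web01", "msg": "a"},
    {"host": "web01", "msg": "b"},
    {"host": "db01", "msg": "c"},
]
print([e["msg"] for e in dedup(events, "host")])  # prints ['a', 'c']
```

A simple keyword filter, by contrast, can look at each event once and forget it, which is why its footprint stays flat no matter how many events pass through.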

If you have memory concerns about your expensive searches it is best to try them and measure using top, ps, etc.

Obviously, you have to consider the quota of searches configured, and the likely overlap of expensive searches by user patterns.

Search processors

In addition to search processors that run natively inside the splunk-search executable, some search processors are written in Python and will be spawned as external processes. Typically these are quite small, but if you have added processors of your own design they may be significant. Ideally these do not buffer any significant amount of data, but just read and write records as they go.

splunk-optimize

From a memory perspective, splunk-optimize is usually a red herring. It looks big but its real footprint is far below that.

Splunk-optimize has the task of combining small .tsidx files (bucket components) into large ones. Depending upon the files combined, the resources can vary from extremely little to significant.

splunk-optimize maps the index files into memory, so the virtual size of this program will appear to be quite large. It then walks the source files in essentially linear order, faulting all of the files into the process space. However, since the memory access patterns are so linear, there will be little effective memory pressure produced by splunk-optimize, so the footprint should decrease dramatically when memory is tighter.

The rest of the tasks, including the various scripts, data gathering programs, alerting programs, and archiving scripts, are generally not significant. There are some notable exceptions:

  • The 3.x vmware app. Written in Java, it’s a bit large, over 1 GB of ram typically.
  • flatfileexport.sh – this coldtofrozen archive script invokes ‘exporttool’ which can be fairly memory hungry for 64bit buckets. It may take as much as 2.5GB of ram.
  • splunk-wmi – largely as a result of the Windows WMI subsystem that this program uses, the memory use of this tool grows with the number of categories it is pulling and with the number of hosts. Thus this growth can be a problem if you gather data from a very large number of hosts, or if you have, for example, a large number of custom eventlog categories, or both.

Mobile App Security

February 3rd, 2010 by Chris Wysopal

Neil MacDonald at Gartner asks the question, “Why Don’t Mobile Application Stores Require Security Testing?”

I couldn’t agree more that we may be missing an opportunity to bring whitelisting to these important new mobile platforms. We need to leave the “detect and revoke” mentality of the PC world behind as we move to new platforms. Attackers are able to game the PC antivirus model by continuously flooding the software ecosystem with new unknown malware. The attackers will win in the mobile world too if we don’t change it. The mobile app store is a form of whitelisting that can assure the security of an entire platform if the whitelisting means something. That is, if the apps are tested for security before being published.

Veracode is being asked by large financial organizations to build security testing into internal mobile app stores. There is obviously a desire for security screened applications in the corporate and government world. Why not just scan once at the platform provider’s app store and give the benefits to all?

Veracode researcher Tyler Shields is presenting 2/7/2010 at Shmoocon on Blackberry malicious mobile code. The presentation and sample code will be available here.

Parsing the Splunk Timezone Format

February 2nd, 2010 by Joshua Rodman

Every once in a while, rarely, you may get a splunkd.log error that looks something like this:

12-07-2009 14:32:06.894 ERROR bucket - Failed to resurrect timezone ('
' delimited): '### SERIALIZED TIMEZONE FORMAT 1.0
C0
Y0 NW 47 4D 54
$'

This is splunk saying it can’t parse the timezone description it just got. This can be a problem when you’re in a distributed environment, and you’re asking for data to be bucketed (collected) into time-specific chunks. A typical example is when using timecharts.

The fix for this particular issue is called Splunk 4.0.7, but if you’re curious to know what timezone it actually is, the hex digits are the name, represented as ASCII values.

A quick trip to python shows us a more familiar name:

jrodman@joshbook:~> python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 0x47, 0x4D, 0x54
(71, 77, 84)
>>> chr(71)
'G'
>>> map(chr, (0x47, 0x4D, 0x54))
['G', 'M', 'T']
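
The session above is Python 2. On Python 3 the same check can be done in one step with bytes.fromhex, which accepts space-separated hex digits. The helper name below is made up for this example, and the sketch only decodes the hex digits; it makes no claim about the rest of the serialized timezone format.

```python
def decode_timezone_name(hex_digits):
    """Turn space-separated hex byte values such as '47 4D 54' into text."""
    return bytes.fromhex(hex_digits).decode("ascii")

print(decode_timezone_name("47 4D 54"))  # prints GMT
```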

The second presentation at the Boston Splunklive event on January 28th was an in-depth profile of a large-scale deployment in a financial services firm, anonymously described as “one of the world’s largest providers of financial services.” Paddy Griffin, Director of Technical Architecture, used his extensive history in the software industry to provide context to his firm’s plans with Splunk. Unlike other major IT projects at his firm, this Splunk-based initiative is being rolled out in record time, using an iterative approach, to show they can provide a continually enhanced log aggregation and search service as part of their “nimble infrastructure.”

Paddy started his presentation by unveiling the name of the overall initiative: LASSIE (yes, like the famous collie from TV). The acronym stands for Log Aggregation Service with Splunk Indexing and Exploration. A somewhat fitting name when you see the last slide (below) in his presentation.

Think of LASSIE as a service: a log aggregation and search service planned, deployed and managed by a central group, providing value to users around the company. Below you can see some of the various data sources going into LASSIE (Splunk). Paddy said, “The ability to index any data without having to maintain and support a data schema is huge.”

Phase 1 of LASSIE focused on providing capabilities for indexing, searching, monitoring and reporting based on log files and changes. Phase 1 also implemented the core foundation for the service including the definition of roles and role-based access controls, and service policies.

As part of the role definitions and role-based access controls, Paddy integrated Splunk with Active Directory. These roles are being used both to control information access and privileges on LASSIE (Splunk), and to provide the views needed by the diverse users in various parts of their business. They are likely to take advantage of the Single Sign-on (SSO) support in the upcoming release of Splunk 4.1. His team also defined a role called “Curators”: people who are aligned with the various business groups (such as bond trading) and have primary responsibility for a business app or service. Curators define the data sources sent into Splunk and who within their business unit can access the data.

Over time LASSIE will need to scale. The approach they are taking is to scale “horizontally,” setting up separate Splunk indexers for each set of users/business groups. Splunk will also enable them to scale linearly, by using multiple Splunk indexers on commodity servers, and let users within a business group search across the indexers. Future plans call for distributed search, giving authorized users a combined view from searching across the separate Splunk indexes set up across the business groups.

The attendees got useful insights into how to plan a major Splunk deployment in a very large enterprise. One benefit of Splunklive Boston for Paddy was that he was able to meet, for the first time, other people from his firm who are already using Splunk. “Splunk has gone viral in my company!”

120 users and prospects came together Thursday morning, January 28th, to attend the first Splunklive of 2010. At the event, held at the Cambridge Marriott in Kendall Square, a major university and a major financial services firm presented on how they are using Splunk to better manage their IT infrastructures. Attendees came from the greater Boston area, Maine, Connecticut, and elsewhere in Massachusetts on a day when it was cold enough to walk across the Charles River.

The event kicked off with a short overview of Splunk: a presentation followed by a product demo.

The first customer presentation was given by Jim Donn, Network Management Systems Engineer, and Tim Hartmann, Unix Systems Administrator. They requested that their university remain unnamed, so I’ll refer to them as “Major U” (consistently ranked among the best colleges and universities in the country and world). Both the networking and systems management groups were looking for solutions that would provide centralized logging for troubleshooting, alerting, reporting and trending analysis. Tim and Jim had started their research independently but soon converged on a single answer: Splunk. Their Splunk deployment environment consists of 400 Unix, Windows, and other servers; 3000+ Cisco devices; TACACS+ authentication logs; VPN access logs; 47 staffers with Splunk logins; and 25 regular Splunk users.

Tim and Jim reported on the quick success they achieved with Splunk. “Everyone in our org, as soon as they start using Splunk they won’t stop.” A major focus for them was on trending analysis. Before Splunk, they would trend a single server or component. Now, with Splunk they are able to do trending for an entire service. “We didn’t have that top-down view before.” The value of trending with Splunk came up in customer presentations and in the impromptu conversations with users during breaks and lunch. They highly recommended to the audience the Splunk for *nix and Splunk for Windows free apps that are included in Splunk.

One of the unexpected paybacks from implementing Splunk is that they were able to decommission two sizeable Oracle RDBMS servers and repurpose the hardware. They had been using two sizeable HP boxes with Oracle licenses to store event data from their SMARTS devices. The repurposed hardware and the cost avoidance of having to buy Oracle reporting software were in the ballpark of their entire Splunk license. The database guys no longer had to support and maintain the Oracle databases, and the users had far better access to the event data for analysis, trending and troubleshooting.

Their migration to Splunk 4 went smoothly and provided them with major performance improvements (as promised by Splunk’s marketing claims!). They’ve encouraged different users and groups at Major U to send them all their logs. “Users didn’t think we could handle it, but we’ve proven we can handle everything they send us.” That’s true not just for data volumes but data types as well. “Anything that spits out text we can get into Splunk.” Major U has plans for expanding Splunk in their organization: more uses, more data, audit and elsewhere. “We keep finding more and more use cases for Splunk.”

After the presentations, an open Q&A was held. Attendees were encouraged to ask any questions of a panel of speakers, other customers, and Splunk attendees. Godfrey Sullivan, Splunk’s president & CEO, attended the Splunklive event and answered a variety of questions about our business, customer use cases and very large customer deployments.