Wednesday, September 4, 2013

No data in TNPM reports? Some things to investigate...

Hi,

This last week I saw something in TNPM that reminded me of one of the most common issues when processing and reporting performance data...no data in the reports. In this case it is quite obvious that something is wrong...but where is the problem?

I would like to share some steps I always take to investigate the problem in TNPM.

Was data ever collected before?

This is the first thing to check...navigate to the past in the report (preferably a report with a chart in order to see raw data over time). Is there any datapoint available? If yes, when was the last one?
Most of the time you will notice that the issue is not in TNPM itself, but in the data source. It can be an unannounced patch applied to the device by the engineering dept (and they always forget to tell anyone :). It can be a network change caused by a new product release. It can be a change in the sftp credentials used to collect data...etc. Looking back in time and identifying exactly the last collection point should be your first step.

Are the datachannel components running correctly?

Go to the Datachannel server and run dccmd status all. Check if all components are on time or if one or more of them are delayed by some hours. Look for a column called "ES DURATION". The numbers indicate how long the component has been in a "fixed state". Ideally you should see small numbers (let's say smaller than 100), indicating that something is being processed, except for the DISC component, which will usually have a big number.
Please pay attention here, especially if you are using CSE/CME formulas that depend on input from different subchannels. If one subchannel is delayed in time it will delay all dependent subchannels as well.

Are there ".bof" files being generated?

Go to the datachannel server and check the done directory for the subchannel with problems. You should see files with the extension ".bof" in it (except for the BCOL where you will find ".pvline" files). Please remember that the data flows in the following order:

BCOL
                => FTE   =>  CME   =>  LDR  => DLDR
SNMP

So, if you can find ".bof" files in the FTE/done directory and in the CME/done directory but not in the LDR/done directory for a specific subchannel, this means that the problem is in the CME and not in the LDR.
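To make the rule of thumb above concrete, here is a minimal Python sketch. It is not part of TNPM: the stage list simply follows the flow drawn above, and `list_done`, `locate_break` and the directory paths you would pass in are hypothetical helpers for illustration. It reports the first stage whose done directory has no ".bof" file and, following the reasoning above, points at the stage just before it.

```python
import os

# Processing order for one subchannel, as drawn above.
# BCOL and SNMP are the two alternative collectors that feed the FTE.
STAGES = ["FTE", "CME", "LDR", "DLDR"]


def list_done(done_dir):
    """Return the file names in a done directory (empty list if it is missing)."""
    try:
        return os.listdir(done_dir)
    except FileNotFoundError:
        return []


def locate_break(files_by_stage):
    """Apply the rule of thumb from the post.

    files_by_stage maps a stage name to the file names found in its done
    directory. Returns (first_stage_without_bof, suspected_stage), or None
    when every stage has at least one ".bof" file.
    """
    for position, stage in enumerate(STAGES):
        names = files_by_stage.get(stage, [])
        if not any(name.endswith(".bof") for name in names):
            if position > 0:
                suspect = STAGES[position - 1]
            else:
                suspect = "collector (BCOL or SNMP)"
            return stage, suspect
    return None
```

You would build the input with something like `{stage: list_done(path) for stage, path in my_paths.items()}`, where `my_paths` holds the done directories of your own installation.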

Can you find some metric for your subelement in the ".bof" files?

Select one subelement that should have data displayed in the report, and get its dbIndex (the easiest way is to export the RST table to csv and look into the resource column).

Go to the data directory, find a recent ".bof" file and execute 

bofDump -r <dbIndex> <filename>.bof 

If nothing comes out, then the problem may be in the collection tree or collection requests (continue reading...)

Is the subelement in the correct folder in the collection tree?

Open the "pvm", go to the "resource editor" and check if the subelement is in the correct collection folder. If not, please check your grouping rules using the "rule editor".

Is a collection formula deployed and active for the collection folder?

Open the "request editor" and check if at least one collection formula is deployed and active for the collection folder.

NOTE: One small trick here. If the subelement exists in the collection folder, the formula is deployed and active and you can see data for other subelements in the same folder but not for this specific one, do a grep in the tnpm.log file using the dbIndex of this subelement and you will probably find an error saying the subelement was dropped because no request exists for it.

If this is the case, there is a problem with the CME local metadata image. Try the following:

1) Bounce the CME in question (dccmd bounce CME.X.Y)
2) If that doesn't solve it, go to the "request editor", select the metrics deployed in the collection folder, disable them, save, enable them and save again. This was the only solution for me in some cases. I know it smells like a bug, but I don't have time right now to open a PMR (feel free to do it if you face this same problem :) )
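If you prefer a script over grep for the log search mentioned in the note above, a small Python sketch could look like this. The function name and the sample log lines in the usage are made up for illustration (I am not reproducing the real tnpm.log format here); the only idea is to match the dbIndex as a whole number, so that a longer number containing the same digits does not produce a false hit.

```python
import re


def lines_mentioning(log_lines, db_index):
    """Return the log lines that contain db_index as a whole number."""
    # Reject matches that are glued to other digits on either side.
    pattern = re.compile(r"(?<!\d)" + re.escape(str(db_index)) + r"(?!\d)")
    return [line.rstrip("\n") for line in log_lines if pattern.search(line)]


# Usage (the path is an assumption, adjust it to your installation):
# with open("tnpm.log") as handle:
#     for hit in lines_mentioning(handle, 200012345):
#         print(hit)
```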

Tuesday, January 29, 2013

Bulk PVLINE example

Hi,

This has been a busy week. As promised I will write something about Bulk collection and the pvline format. Please remember that you should always go back to the official documentation.

In TNPM, data can be collected using 2 modes: SNMP based or Bulk based. SNMP collection uses predefined formulas in TNPM (known as discovery / collection formulas) to query SNMP enabled devices and collect the necessary performance data. Bulk collection reads data from flat files.

Independent of the collection type used, the following steps will be necessary:
  1. Discover the elements/subelements that you will collect data from
  2. Collect the data for those elements/subelements using a predefined polling period
  3. Present the data using reports and charts (you can also export the data)
If using SNMP, you will use SNMP discovery formulas to create the elements and subelements with their properties. If using BULK, all discovery information will be available in the flat file (that we are going to call "pvline" from now on).

Internally, TNPM will create for each BULK and each SNMP collector a kind of "processing line" called subchannel. For SNMP the first component in this processing line would be called something similar to SNMP.1.2 (channel 1, subchannel 2), and for BULK something like BCOL.2.3 (channel 2, subchannel 3).
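As a small illustration of the naming described above, the Python sketch below splits a component name such as SNMP.1.2 into its type, channel and subchannel. The `Component` type and `parse_component` function are hypothetical helpers of mine, and the sketch assumes that every name has exactly three dot-separated parts.

```python
from typing import NamedTuple


class Component(NamedTuple):
    kind: str        # for example SNMP, BCOL or CME
    channel: int
    subchannel: int


def parse_component(name):
    """Split a name like 'BCOL.2.3' into (kind, channel, subchannel)."""
    kind, channel, subchannel = name.split(".")
    return Component(kind, int(channel), int(subchannel))
```

This kind of helper is handy when you want to group log lines or status output by subchannel.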

The file is written in a format called PVLINE (full definition here). You can see an example below:

Type Both 
OPTION:Type=Line 
OPTION:PVMVersion=3.0 
OPTION:Element=NETD1
# Inventory Section
G1998/08/12 23:30:00 | Family | alias | NETD1_CPU1 |  | inventory | NETD_CPU
G1998/08/12 23:30:00 | Label | alias | NETD1_CPU1 |  | inventory | CPU C1
G1998/08/12 23:30:00 | Invariant | alias |  | NETD1_CPU1 | inventory | invcpu1
G1998/08/12 23:30:00 | Instance | alias | NETD1_CPU1 |  | inventory | cpu_<1>
G1998/08/12 23:30:00 | Slot | alias | NETD1_CPU1 |  | property | S1
G1998/08/12 23:30:00 | Frequency | alias | NETD1_CPU1 |  | property | 1GHz
(...)
OPTION:Element=NETD2
G1998/08/12 23:30:00 | Family | alias | NETD2_CPU1 |  | inventory | NETD_CPU
G1998/08/12 23:30:00 | Label | alias | NETD2_CPU1 |  | inventory | CPU C1
G1998/08/12 23:30:00 | Invariant | alias |  | NETD2_CPU1 | inventory | invcpu1
G1998/08/12 23:30:00 | Instance | alias | NETD2_CPU1 |  | inventory | cpu_<1>
G1998/08/12 23:30:00 | Slot | alias | NETD2_CPU1 |  | property | S1
G1998/08/12 23:30:00 | Frequency | alias | NETD2_CPU1 |  | property | 1GHz
(...)
# Data Section
G1998/08/12 23:30:00 | AP~Specific~Bulk~NETD_CPU~CPU_idle_pct | alias | NETD1_CPU1 |  | float | 25.00
G1998/08/12 23:30:00 | AP~Specific~Bulk~NETD_CPU~CPU_user_pct | alias | NETD1_CPU1 |  | float | 35.00
G1998/08/12 23:30:00 | AP~Specific~Bulk~NETD_CPU~CPU_system_pct | alias | NETD1_CPU1 |  | float | 40.00
(...)
G1998/08/12 23:30:00 | AP~Specific~Bulk~NETD_CPU~CPU_idle_pct | alias | NETD2_CPU1 |  | float | 35.00
G1998/08/12 23:30:00 | AP~Specific~Bulk~NETD_CPU~CPU_user_pct | alias | NETD2_CPU1 |  | float | 40.00
G1998/08/12 23:30:00 | AP~Specific~Bulk~NETD_CPU~CPU_system_pct | alias | NETD2_CPU1 |  | float | 25.00
(...)
G1998/08/12 23:45:00 | AP~Specific~Bulk~NETD_CPU~CPU_idle_pct | alias | NETD1_CPU1 |  | float | 10.00
G1998/08/12 23:45:00 | AP~Specific~Bulk~NETD_CPU~CPU_user_pct | alias | NETD1_CPU1 |  | float | 60.00
G1998/08/12 23:45:00 | AP~Specific~Bulk~NETD_CPU~CPU_system_pct | alias | NETD1_CPU1 |  | float | 30.00
(...)
G1998/08/12 23:45:00 | AP~Specific~Bulk~NETD_CPU~CPU_idle_pct | alias | NETD2_CPU1 |  | float | 20.00
G1998/08/12 23:45:00 | AP~Specific~Bulk~NETD_CPU~CPU_user_pct | alias | NETD2_CPU1 |  | float | 70.00
G1998/08/12 23:45:00 | AP~Specific~Bulk~NETD_CPU~CPU_system_pct | alias | NETD2_CPU1 |  | float | 10.00

When creating the pvline it is mandatory to add the ".pvline" extension in the filename (bfile.pvline for example).

Some important details about the pvline format (and common mistakes):
  1. Please be careful with typos. "TYPE Both" may not be treated the same as "Type Both".
  2. The file content must be in time sequence. The collector will ignore any line older than the last line read.
  3. You can split the file into two main sections: the inventory section (lines with "inventory" or "property" in the example above) and the data section (lines with "float"). In the official documentation, you will see that the inventory and data sections are mixed. This is only useful if the inventory data for the same subelement changes inside the same file for different timestamps. If that is not the case, just write all inventory lines at the beginning of the file using the oldest timestamp ("G1998/08/12 23:30:00" in our example) and then put all data after it.
  4. The "OPTION:Element=" is only necessary in the inventory section. You don't have to put it in the data section.
  5. Please notice that the formula path does not start with "~". Adding one is a very common error.
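Two of the mistakes in the list above (lines out of time sequence and a formula path starting with "~") are easy to catch before the file reaches the collector. The Python sketch below is my own illustration, not a TNPM tool. It assumes that every "G" line has the seven "|"-separated fields seen in the example and the timestamp format "YYYY/MM/DD HH:MM:SS", both inferred from the example only; always check the official format definition. The `check_pvline` name and the optional `last_read` parameter are hypothetical.

```python
from datetime import datetime


def check_pvline(lines, last_read=None):
    """Return a list of (line_number, message) for the problems found.

    last_read is an optional datetime standing for the last timestamp
    already read by the collector.
    """
    problems = []
    previous = last_read
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line.startswith("G"):
            continue  # header, OPTION, comment and blank lines are not checked
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 7:
            problems.append((number, "expected 7 '|'-separated fields"))
            continue
        try:
            stamp = datetime.strptime(fields[0][1:], "%Y/%m/%d %H:%M:%S")
        except ValueError:
            problems.append((number, "unreadable timestamp"))
            continue
        if previous is not None and stamp < previous:
            problems.append((number, "timestamp older than the last one read"))
        else:
            previous = stamp
        if fields[1].startswith("~"):
            problems.append((number, "formula path starts with '~'"))
    return problems
```

Running it over the lines of a file (for example `check_pvline(open("bfile.pvline"))`) returns an empty list when neither of the two problems is found.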

That is it. Once you have the BCOL running and you move the pvline file into its input directory, the file will be processed.

Talking about the file processing, some important points are listed below:
  1. When using BULK collection, three conditions must be true before any data can be stored in the database:
    1. The subelement(s) must exist and be active
    2. The collection formulas must exist and the formula requests must be configured and active in the RequestEditor
    3. The oldest timestamp present in the file must be equal to or newer than the last timestamp read by the BCOL.
  2. If configured to do the discovery, the BCOL will generate a new inventory file after every discovery time window (usually 60 min) of data processed. Please notice that the time window is related to the processed data, meaning that if your pvline file contains data from 09:00 until 12:30, it will generate 3 discovery files, one when processing the data for 10:00, another at 11:00 and another at 12:00.
  3. The first processed hour for a new subelement is always discarded, since the subelement won't exist until the discovery window is reached and the inventory file is created and processed.
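The discovery window arithmetic in point 2 above can be sketched in a few lines of Python. This is only my reading of the behaviour described, not the BCOL's actual logic: I am assuming that windows are counted from the first timestamp in the file and that a window closing exactly on the last timestamp still counts. The `discovery_points` function is a hypothetical helper.

```python
from datetime import datetime, timedelta


def discovery_points(first, last, window_minutes=60):
    """Return the data timestamps at which a discovery file would be generated."""
    step = timedelta(minutes=window_minutes)
    points = []
    current = first + step
    while current <= last:
        points.append(current)
        current += step
    return points


# Data from 09:00 until 12:30 with a 60 min window (the date is arbitrary).
for point in discovery_points(datetime(2013, 1, 29, 9, 0),
                              datetime(2013, 1, 29, 12, 30)):
    print(point.strftime("%H:%M"))
```

For the example in the text this gives three points (10:00, 11:00 and 12:00), matching the three discovery files mentioned above.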

Thursday, January 10, 2013

Introduction

Hi all,

This last week I've gone through some of my notes to decide where to start. As I mentioned in my last post, the idea is not to repeat what is already written in the official documentation, so, if you want to know how to install the DataChannel or the minimum requirements, please check the official manuals (I know they are big...but you have to go through them at least once :).

Before I continue, let's talk a little bit about history...

If you have worked with performance management for some time, you have probably heard about a product called Proviso...this is the former name of TNPM (or TNPM is its new name if you wish). In fact, after IBM acquired Quallaby, they decided to pick some other products from its portfolio and combine them under the same solution. They got Proviso, TIP, Cognos and Tivoli Netcool Performance Manager for Wireless, mixed them in the same cooking pot and produced two main dishes: TNPM wireline and TNPM wireless. In this blog we will mostly discuss the wireline one.

Some other changes were made as well. IBM created two components to make the installation easier: one called TopologyEditor and another called Deployer. The idea was to centralize the configuration of all components in one place (the topology part) and allow local or remote installation from a central point (the deployer). Before, the configuration of each component was spread among many servers using config files. It was a good initiative from IBM, but I have to say that it brought more complexity to the process (and a lot of bugs as well).

Well, let's stop the history here...

As I mentioned earlier, I was checking my personal notes in order to find some interesting topics to write about and I found many of them:

  • Why does TNPM install as root and is there a way to work around it?
  • How to manage users in TNPM?
  • Why doesn't SFTP work by default for bulk collections and how do you force it?
  • Composite Subelements (CSE) introduction
  • Are you installing a Remote CME through a firewall? If you follow only the manual it won't work...
  • CME troubleshooting
  • BCOL troubleshooting
  • Creating Bulk adaptors, pvline format and other details
  • ...

I would like to try something here...instead of just choosing the sequence of topics myself, I would like to ask you where to start...The idea is to try to write about a different topic every week...I know this is a very new blog and probably few people know about it (you can change that by sharing the link with your coworkers), but I want it to be as interactive as possible...so, please tell me which one you would like to see first, or send any suggestions you may have for new topics.

Au revoir!

Wednesday, January 2, 2013

And the story begins...

Hi all!!

Welcome to my new blog!

First of all, I would like to explain why I decided to create it. I've been working with IBM Tivoli Netcool Performance Manager (TNPM) for more than 3 years in different projects, installing, upgrading, fixing, and customizing it. I have to admit that it is a great product for performance management of network devices and highly scalable, but with a very complex installation and configuration process.

During those projects, I realized that despite the huge 100+ page manuals, the installation and configuration of each project required some extra information (the tips and tricks...).

So, that is the idea...share the knowledge I got with you. Of course I do not intend to replace the official manuals...they should be your first and main reference...but maybe you will find something here that will make your day (and mine as well).

That's it...let's start...