Wednesday, September 4, 2013

No data in TNPM reports? Some things to investigate...

Hi,

This last week I saw something in TNPM that reminded me of one of the most common issues when processing and reporting performance data...no data in the reports. In this case it is quite obvious that something is wrong...but where is the problem?

I would like to share some steps I always take to investigate this problem in TNPM.

Was data ever collected before?

This is the first thing to check. Navigate back in time in the report (preferably a report with a chart, so you can see raw data over time). Is there any datapoint available? If so, when was the last one?
Most of the time you will notice that the issue is not in TNPM itself, but in the data source. It can be an unannounced patch applied to the device by the engineering department (and they always forget to tell you :). It can be a network change caused by a new product release. It can be a change in the SFTP credentials used to collect data, etc. Looking back in time and identifying exactly the last collection point should be your first step.
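
If the report data can be exported, the last collection point can also be located with a few lines of Python. This is a minimal sketch, assuming a hypothetical list of (timestamp, value) samples in which a missing value is None; that layout is an illustration, not a TNPM export format.

```python
from datetime import datetime
from typing import Optional, Sequence, Tuple

# Hypothetical sample layout: (timestamp, value or None when no data exists)
Sample = Tuple[datetime, Optional[float]]


def last_datapoint(samples: Sequence[Sample]) -> Optional[datetime]:
    """Return the timestamp of the most recent sample that has a value."""
    stamps = [ts for ts, value in samples if value is not None]
    return max(stamps) if stamps else None
```

The returned timestamp is the moment to correlate with patches, network changes or credential changes on the data source side.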

Are the datachannel components running correctly?

Go to the Datachannel server and run dccmd status all. Check whether all components are on time or whether one or more of them are delayed by some hours. Look for a column called "ES DURATION". The numbers indicate how long the component has been in a "fixed state". Ideally you should see small numbers (let's say smaller than 100), indicating that something is being processed, except for the DISC component, which will usually have a big number.
Please be careful here, especially if you are using CSE/CME formulas that depend on input from different subchannels. If one subchannel is delayed, it will delay all dependent subchannels as well.
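
The rule of thumb above can be written down as a small filter. This is only a sketch: it does not parse the dccmd output (its exact layout is not shown here), so it assumes you have already copied the "ES DURATION" numbers into a dictionary. The component names in the usage example are hypothetical.

```python
from typing import Dict, Iterable, List


def delayed_components(
    durations: Dict[str, float],
    threshold: float = 100,
    ignore_prefixes: Iterable[str] = ("DISC",),
) -> List[str]:
    """Return component names whose duration value exceeds the threshold.

    `durations` maps a component name to the number read from the
    "ES DURATION" column. Components whose name starts with one of
    `ignore_prefixes` are skipped, since DISC usually shows a big number.
    """
    skip = tuple(ignore_prefixes)
    return sorted(
        name
        for name, value in durations.items()
        if not name.startswith(skip) and value > threshold
    )
```

For example, `delayed_components({"FTE.1.1": 3, "CME.1.1": 250, "DISC.1": 9000})` flags only the CME. The threshold of 100 is the rough value suggested above, not a documented limit.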

Are there ".bof" files being generated?

Go to the datachannel server and check the done directory for the subchannel with problems. You should see files with the extension ".bof" in it (except for the BCOL where you will find ".pvline" files). Please remember that the data flows in the following order:

BCOL
                => FTE   =>  CME   =>  LDR  => DLDR
SNMP

So, if you can find ".bof" files in the FTE/done directory and in the CME/done directory but not in the LDR/done directory for a specific subchannel, this means that the problem is in the CME and not in the LDR.
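
Checking the done directories one by one can be scripted. The sketch below assumes a hypothetical layout of <root>/<COMPONENT>.<subchannel>/done (for example <root>/CME.1.2/done); the root path and the layout are assumptions, so adjust them to your installation.

```python
from pathlib import Path
from typing import Optional, Sequence


def first_stage_without_bof(
    root: Path,
    subchannel: str,
    stages: Sequence[str] = ("FTE", "CME", "LDR"),
) -> Optional[str]:
    """Return the first stage whose done directory holds no .bof files.

    Assumes the hypothetical layout <root>/<STAGE>.<subchannel>/done.
    Returns None when every stage has at least one .bof file.
    """
    for stage in stages:
        done_dir = root / f"{stage}.{subchannel}" / "done"
        # A missing directory is treated the same as an empty one.
        if not done_dir.is_dir() or not any(done_dir.glob("*.bof")):
            return stage
    return None
```

Following the reasoning above, when the function returns "LDR" the component to investigate is the one just before it in the flow, the CME.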

Can you find some metric for your subelement in the ".bof" files?

Select one subelement that should have data displayed in the report, and get its dbIndex (the easiest way is to export the RST table to csv and look into the resource column).

Go to the data directory, find a recent ".bof" file and execute:

bofDump -r <dbIndex> <filename>.bof

If nothing comes out, then the problem may be in the collection tree or collection requests (continue reading...)
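
To repeat this check over several recent files, the file selection and the command line can be prepared in Python. This is a sketch under assumptions: the helper names are made up for illustration, and the only bofDump option used is the -r form shown above.

```python
from pathlib import Path
from typing import List


def newest_bof_files(directory: Path, count: int = 3) -> List[Path]:
    """Return up to `count` .bof files, most recently modified first."""
    files = sorted(
        directory.glob("*.bof"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    return files[:count]


def bofdump_command(db_index: int, bof_file: Path) -> List[str]:
    """Build the argument list for: bofDump -r <dbIndex> <filename>.bof"""
    return ["bofDump", "-r", str(db_index), str(bof_file)]
```

On the datachannel server, each list can be handed to `subprocess.run(cmd, capture_output=True, text=True)`; an empty standard output corresponds to the "nothing comes out" case.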

Is the subelement in the correct folder in the collection tree?

Open the "pvm", go to the "resource editor" and check if the subelement is in the correct collection folder. If not, please check your grouping rules using the "rule editor".

Is a collection formula deployed and active for the collection folder?

Open the "request editor" and check if at least one collection formula is deployed and active for the collection folder.

NOTE: One small trick here. If the subelement exists in the collection folder, the formula is deployed and active and you can see data for other subelements in the same folder but not for this specific one, do a grep in the tnpm.log file using the dbIndex of this subelement and you will probably find an error saying the subelement was dropped because no request exists for it.

If this is the case, there is a problem with the CME local metadata image. Try the following:

1) Bounce the CME in question (dccmd bounce CME.X.Y)
2) If that doesn't solve it, go to the "request editor", select the metrics deployed in the collection folder, disable them, save, enable them and save again. This was the only solution for me in some cases. I know it smells like a bug, but I don't have time right now to open a PMR (feel free to do it if you face this same problem :) )
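
The grep from the note above has one pitfall: a plain search for a dbIndex such as 1234 also matches 12345. A minimal Python sketch that matches the dbIndex only as a standalone number follows; the log path is whatever your tnpm.log location is, and no particular message text is assumed.

```python
import re
from pathlib import Path
from typing import List


def lines_mentioning(log_path: Path, db_index: int) -> List[str]:
    """Return log lines that contain db_index as a standalone number."""
    # Reject matches that are part of a longer run of digits.
    pattern = re.compile(rf"(?<!\d){db_index}(?!\d)")
    with log_path.open(encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if pattern.search(line)]
```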

18 comments:

  1. This comment has been removed by the author.

  2. Hi Danilo,

    Thank you for the procedure. I am currently facing a similar issue in TNPM. The data for one metric is not available in many reports. As you suggested, I checked the BOF files in FTE and CME and found NO data for this metric. All the grouping rules are correctly assigned. I extracted the formula path and checked it in the Request Editor. To my surprise, I found that no collection request is assigned for this metric (surprising because data is available for this metric in some other reports even though no collection request is assigned). I also did a grep for this metric in proviso.log and found "BAD_RECORDS" (GYMDC10402W). I also extracted the debugDump from the corresponding data loader and found NO collection task created for this metric. Can you please guide me on how to proceed with further investigation?
    Thanks in advance!

    Replies
    1. Hi, the error you mentioned is indeed related to the missing request in the request editor.
      However, the fact that you can see data for this metric in other reports even though no request exists indicates that this is a generic formula, meaning you should have one or more other collection formulas (with a valid request created) that save their results in the dbIndex of the generic formula. This is a common practice when you want to report the same metric type (interface usage, for instance) in one report for different vendors, where the specific collection formula is different for each vendor. To check if this is a generic formula, open it in the formula editor and see if it is empty.
      If it is a generic formula, to solve the problem you have to find the specific collection formula that should be running for the subelement and see if it is saving its value correctly (you should have a SaveAlias directive pointing to the dbIndex of the generic formula).
      Once you find this specific formula, get its dbIndex and repeat the troubleshooting using this value.

    2. Hi Danilo,
      Thank you very much for your answer. I tried to find the specific collection formula for every vendor in the reports using the formula editor. I noticed that no such formula is present for the vendors whose data is missing in the reports, while I did find the specific collection formula for the vendors whose data is available. Can you tell me whether the missing specific collection formula can be the issue, or am I checking in the wrong place? Thanks!

    3. Yes, you are going in the correct direction. You have to create the requests for the specific collection formulas in the appropriate collection folders and it should start collecting data and saving it in the generic formula.

    4. Also a small correction...when using the SaveAlias directive, the .bof files will contain the dbIndex of the generic formula and not the dbIndex of the specific one, since the dbIndex translation happens at the dataload level. So for troubleshooting, you should look for the dbIndex of the generic formula in the .bof files.

    5. Hi Danilo,
      I am working on the solution you provided. Thank you very much for your help!

  3. Hi Danilo, Can you tell me what can be done if data is available in reports but not visible in real time?
    Thanks!!

    Replies
    1. Hi, when you say "real time", do you mean "near real time" like threshold data, or true real time using the Java applet?

      For near real time (NRT), I have faced problems related to performance issues. Keep in mind that NRT data involves not only the database, but also the CME, where the most recent data resides. If the CME is too busy or if the report includes too many subelements, you may hit a timeout for the NRT data. I believe this would report an error in the tnpm log (NRT something...)

      For the Java applet, it can be many things, but the most common problems are related to firewall ports. Please double check that the ports defined in CNS.CORBA_PORT and CMGR.CORBA_PORT are statically defined and open from the dataview to the datachannel.

      I hope this helps.
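
The firewall check mentioned above can be sketched in Python and run from the dataview server. This only tests that a TCP connection can be opened; the host name and port numbers in the usage note are hypothetical and must be replaced with the values configured in CNS.CORBA_PORT and CMGR.CORBA_PORT.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

For example, `port_is_open("datachannel-host", 45000)` with a made-up host and port returns False when a firewall drops or rejects the connection.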

    2. Hi Danilo, thank you for your help. However, in my investigation I found that the CMGR process was being restarted every 5 minutes and a duplicate CMGR process was being created. After solving that problem, the real-time report issue was also solved.

    3. Hi Danilo,

      I didn't find a suitable post for my issue, so I am replying here.
      I am currently finding duplicate collection requests in my database, i.e. collection requests for the same metric and the same resource under different collection groups. There are many such unnecessary requests in my database, and they are impacting collector performance. Can you tell me how I can find out which of them are unnecessary? Is there any way?
      In my investigation, I found that some requests have the origin "requestEdit - 1.3.2.0-12 (Nora.12)" and some have the origin "resmgr". The creation dates of these requests are also different. I feel that one of these two origins is an error. Can you help here?
      Thanks in advance

    4. Hi Vijay,

      I believe the issue is not only at the request level, but also in the inventory grouping. As you know, in TNPM you have the collection tree and the reporting tree. Ideally, in the collection tree each subelement ends up in only one leaf folder, where all collection formulas related to that subelement are deployed. The decision on where each subelement ends up is made by the grouping rules. In your case, it looks like your grouping rules are linking the same subelement to two or more leaf folders where the same formulas are also deployed. This is a mistake and should be corrected.
      I believe the easiest way would be to use resmgr to export all subelements with their grouping and requests and check for duplicates. The commands would be something like:

      $ resmgr -export segp -colNames "npath se.name" -file /tmp/segp_se.dat
      $ resmgr -export segpreq -colNames "segp.npath fgp.nName period state status" -file /tmp/segpreq.dat

      With those 2 files you can search for subelements that are deployed on more than one grouping folder AND have the same formula request.
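
The search across the two export files can be sketched in Python. Several things here are assumptions rather than documented facts: the "|_|" field separator, the "#" prefix on header lines, and the column order (folder path first, then subelement name or formula name, matching the -colNames lists above). Check the real files and adjust `sep` before relying on it.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def parse_rows(lines: Iterable[str], sep: str = "|_|") -> List[List[str]]:
    """Split export lines into fields, skipping blank and '#' header lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            rows.append([field.strip() for field in line.split(sep)])
    return rows


def duplicate_requests(
    segp_rows: Iterable[List[str]],
    segpreq_rows: Iterable[List[str]],
) -> List[Tuple[str, str, str, str]]:
    """Return (subelement, formula, folder_a, folder_b) for each duplicate.

    segp_rows:    [folder path, subelement name, ...]
    segpreq_rows: [folder path, formula name, ...]
    """
    folders_by_se: Dict[str, Set[str]] = defaultdict(set)
    for row in segp_rows:
        if len(row) >= 2:
            folders_by_se[row[1]].add(row[0])

    formulas_by_folder: Dict[str, Set[str]] = defaultdict(set)
    for row in segpreq_rows:
        if len(row) >= 2:
            formulas_by_folder[row[0]].add(row[1])

    found = []
    for se, folders in sorted(folders_by_se.items()):
        # Compare every pair of folders that contain this subelement.
        for a, b in combinations(sorted(folders), 2):
            shared = formulas_by_folder.get(a, set()) & formulas_by_folder.get(b, set())
            for formula in sorted(shared):
                found.append((se, formula, a, b))
    return found
```

A subelement that sits in two folders is reported only when both folders carry a request for the same formula, which is the AND condition described above.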

      Kind Regards

      Danilo

    5. Hi Danilo,

      Thanks for your suggestion. It worked.

      Kind Regards

      Vijay

    6. Hi Danilo,

      Can you tell me if there is any other source of duplicate collection requests besides a subelement ending up in multiple collection groups where the same formulas are deployed?
      If there is no other source, I think the easiest way is to check the configuration of all inventory profiles and find out which profiles have multiple collection groups added to them. Please correct me if I am wrong.

    7. Hi,

      You cannot use this approach to identify duplicate collections. It is normal to have more than one profile using the same collection folder. What is not normal is to have the same subelement ending up in two different leaf folders where the same collection formula is deployed in both. This is the only source (that I know of) of duplicate collection requests.

  4. Thanks. It solved my problem.

  5. Hi Danilo,

    I'm getting an error in the FTE, as it keeps looking for files that don't exist in BCOL/output. I already tried restarting it, but the issue persists. Is there a way or command to reset the FTE so that it will process the existing data in BCOL?

  6. Hi Danilo,

    I have files stuck in the BCOL.2.64/do and done folders. I don't see any errors in the logs and the status looks good. Also, even though there are stuck files in those folders, the reports in SilverStream are being displayed. Please give me some advice. Thanks!
