Comprehensive Troubleshooting Guide for SCOM Linux/UNIX Agents

Introduction
Common Troubleshooting Issues
- Certificate Errors
- Agent Discovery Failures
- Agent Communication Problems
- Heartbeat Failure Troubleshooting
- Validating Agent & Management Server Communication
Advanced Troubleshooting Techniques
- Verbose Logging
- Clearing SSSD Cache
- Verifying OpenSSL Configuration
- Checking OMI Issues
- Ensuring Correct Agent Kit Installation
- Troubleshooting Missing Performance Counters
- WinRM Verification of SCX Agents
- Enumerate OMI Information
- Purge SCX Agent
- Check Linux Cipher Suites
- Firewall Configuration
- Tailing Logs
Best Practices
References

Introduction

This document integrates and builds upon the excellent work by Kevin Holman and draws heavily on Blake Drumm’s guide to troubleshooting SCOM Linux agents. That guide is one of the most detailed and comprehensive resources available on this topic, and credit for the depth and clarity of the information belongs to its author.

In Part 1 of our troubleshooting series, we discussed handling issues with the Windows SCOM agent. In this installment, we will focus on troubleshooting common problems associated with the Linux and UNIX SCOM agents. Just like Windows agents, Linux/UNIX agents can occasionally exhibit behaviors requiring investigation and resolution.

Deploying and managing SCOM agents on Linux/UNIX can present unique challenges. This guide aims to equip administrators with the knowledge to troubleshoot common issues effectively.

Common Troubleshooting Issues

Certificate Errors

Symptom:

Errors related to SSL certificates occur during agent discovery or regular communication between the Linux/UNIX agent and the SCOM management server. Typical symptoms include messages indicating a mismatched common name (CN), expired certificates, or certificates issued by an untrusted certificate authority.

Resolution:

Step 1: Verify Certificate Details

Begin by inspecting the existing SSL certificate used by the Linux/UNIX agent. Execute the following command on the agent system:

openssl x509 -noout -in /etc/opt/microsoft/scx/ssl/scx.pem -subject -issuer -dates

Subject: Confirms the identity of the certificate (specifically the CN).
Issuer: Indicates which Certificate Authority (CA) issued the certificate.
Dates: Provides the validity period, indicating whether the certificate is currently valid or has expired.
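
As a rough illustration of the CN check in Step 1, the following sketch tests whether an expected name appears among the CN components of a certificate subject line. The function name cert_has_cn and the sample subject strings in the comments are assumptions made for this example; the exact subject layout depends on your OpenSSL version and on how the certificate was generated.

```shell
# cert_has_cn SUBJECT_LINE EXPECTED_NAME
# Returns 0 if EXPECTED_NAME matches one of the CN values in SUBJECT_LINE.
# Handles both "/DC=com/CN=host" and "DC = com, CN = host" style subjects
# (illustrative formats, not taken from a real agent).
cert_has_cn() {
    subject_line=$1
    expected=$2
    # Put every subject component on its own line, keep only the CN values,
    # then look for a case-insensitive, whole-line, literal match.
    printf '%s\n' "$subject_line" \
        | tr '/,' '\n\n' \
        | sed -n 's/^[[:space:]]*\(subject=\)\{0,1\}[[:space:]]*CN[[:space:]]*=[[:space:]]*//p' \
        | grep -qixF -- "$expected"
}
```

For example, pass it the output of openssl x509 -noout -subject -in /etc/opt/microsoft/scx/ssl/scx.pem together with the fully qualified name that the management server uses for the agent. A non-zero exit status suggests the certificate should be regenerated as described in Step 2.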

Step 2: Regenerate Certificates if Necessary

If you identify any discrepancies—such as a mismatch in the common name (CN) or expired validity period—you need to regenerate the SSL certificate:

sudo /opt/microsoft/scx/bin/tools/scxsslconfig -f -h <hostname> -d <domain.name>

Replace:

<hostname> with the actual hostname of your Linux/UNIX system.
<domain.name> with your actual domain name (e.g., example.com).

This command forcefully regenerates the certificate, correctly aligning it with the system’s current hostname and domain configuration.

Step 3: Restart the Agent to Apply New Certificate

After regenerating the SSL certificate, restart the SCOM agent to apply the changes:

sudo /opt/microsoft/scx/bin/tools/scxadmin -restart

Step 4: Ensure Management Server Trusts the Certificate

The management server must trust the newly generated certificate. If the agent still exhibits communication issues after restarting, explicitly import the updated SSL certificate into the management server’s trusted store:

Export the agent’s certificate from:
```
/etc/opt/microsoft/scx/ssl/scx.pem
```
Import the certificate into the management server’s trusted certificate store.

This ensures smooth SSL communication between the agent and the management server and resolves persistent certificate-related issues.

Agent Discovery Failures

Symptom:

During agent deployment or discovery, you encounter errors such as:

"No installable Linux/UNIX agent available."

This typically indicates issues related to compatibility, permissions, or network communication between the management server and the Linux/UNIX target system.

Resolution:

Step 1: Verify Agent Compatibility

Ensure that the correct and compatible Linux/UNIX agent package is available and ready for installation on your SCOM management server:

Navigate to the following directory on your management server:

<SCOM_Installation_Path>/AgentManagement/UnixAgents/DownloadedKits/

Confirm the presence of the appropriate agent package matching your Linux or UNIX distribution (e.g., RedHat, CentOS, Ubuntu). The naming typically indicates the specific operating system version and architecture.
If the required package is missing or outdated, download the correct version from Microsoft’s official source and place it into this directory.

Step 2: Check SUDO Permissions

Incorrect or insufficient sudo permissions can lead to discovery or installation failures. Verify that your discovery account is correctly configured in the sudoers file:

Run:

sudo visudo

Ensure the discovery account has sufficient privileges, typically configured as follows:

scomuser ALL=(ALL) NOPASSWD: ALL

Replace scomuser with your actual discovery account name. This ensures the user can execute commands without being prompted for passwords during the discovery process, allowing seamless agent installation.

Step 3: Verify Network and Name Resolution

Network or DNS-related issues often prevent successful agent discovery:

Verify the management server can resolve the hostname of the target Linux/UNIX server:

nslookup <agent-hostname>

If resolution fails, update your DNS records or the management server’s local hosts file.
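Check that the required ports are reachable: TCP 1270 for agent communication and, for agent installation during discovery, TCP 22 (SSH). Use a tool such as nc (netcat) to verify connectivity; on a Windows management server, Test-NetConnection <agent-ip> -Port 1270 in PowerShell performs the same check: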

nc -zv <agent-ip> 1270

A successful connection will yield:

Connection to <agent-ip> 1270 port [tcp/*] succeeded!

If connectivity issues persist, examine firewall rules and network configurations between the management server and the agent system, adjusting firewall settings if necessary.

Following these detailed steps ensures the SCOM management server can successfully discover and deploy the Linux/UNIX agent without encountering the stated error.

Agent Communication Problems

Symptom:

The Linux/UNIX agent appears unresponsive or displays as "greyed out" in the SCOM console, indicating a loss of communication between the management server and the agent.

Resolution:

Step 1: Restart the Agent

Restarting the agent can often restore normal communication. Use the following command:

sudo /opt/microsoft/scx/bin/tools/scxadmin -restart

This command stops and restarts the agent service, potentially clearing temporary issues that may have caused the communication disruption.

Step 2: Check Agent Status

After restarting, verify the agent is operational:

sudo /opt/microsoft/scx/bin/tools/scxadmin -status

Review the status output to confirm the agent is actively running and responsive. Ensure there are no error messages indicating ongoing issues, such as failed processes or halted services.

Step 3: Inspect Logs

If communication issues persist, examine the agent’s log files for detailed error messages and troubleshooting hints:

tail -f /var/opt/microsoft/scx/log/scx.log

Carefully review recent entries for any critical errors, certificate issues, authentication problems, or network connectivity errors. These log entries will provide precise information on the root cause of the communication breakdown, guiding the next troubleshooting steps.
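
When the log is long, filtering it can be quicker than watching tail -f. The sketch below prints the most recent lines that mention an error, warning, or failure. The function name recent_problems and the keyword list are assumptions for illustration; adjust the pattern to the severity labels that actually appear in your scx.log.

```shell
# recent_problems LOG_FILE [MAX_LINES]
# Prints the last MAX_LINES (default 20) lines of LOG_FILE that contain
# "error", "warning", or "fail", ignoring case.
recent_problems() {
    log_file=$1
    max_lines=${2:-20}
    grep -Ei 'error|warning|fail' "$log_file" | tail -n "$max_lines"
}
```

A typical call would be recent_problems /var/opt/microsoft/scx/log/scx.log 50, run with sufficient privileges to read the log.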

Heartbeat Failure Troubleshooting

Heartbeat failures indicate communication interruptions between the SCOM agent and the management server. Follow these steps to identify and resolve the issue:

Step 1: Check Current Agent Status

Verify the operational status of the agent using the following command:

sudo /opt/microsoft/scx/bin/tools/scxadmin -status

This command provides output indicating whether the agent service is running, stopped, or encountering issues. Look specifically for any errors or warnings in the output.

Step 2: Inspect agent logs

To investigate deeper into the issue, examine real-time log entries using:

tail -f /var/opt/microsoft/scx/log/scx.log

Pay attention to any error messages or warnings related to communication problems, timeouts, authentication failures, or certificate errors. These messages can provide insight into the underlying cause of the heartbeat failure.

Step 3: Restart the agent

If issues persist or you find evidence of errors, restart the agent to attempt recovery and re-establish communication:

sudo /opt/microsoft/scx/bin/tools/scxadmin -restart

After restarting, wait a few minutes and recheck the status and logs (repeat Steps 1 and 2) to confirm whether the heartbeat issue is resolved.

Validating Agent and Management Server Communication

Validating communication between the SCOM agent and the management server is essential to ensure data is correctly transmitted and received. Follow these steps to verify connectivity and agent compatibility:

Step 1: Check connectivity using network utilities (telnet or nc)

Confirm that the network connection between the management server and the SCOM agent is functional by using network diagnostic tools like telnet or nc (netcat):

nc -zv <agent-ip> 1270

Replace <agent-ip> with the IP address or hostname of the Linux/UNIX agent.

A successful connection will return output similar to:

Connection to <agent-ip> 1270 port [tcp/*] succeeded!

A failure will indicate issues such as firewall restrictions, port blocking, or network routing problems, requiring further network-level troubleshooting.
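
If several agents need to be checked, the nc test can be wrapped in a small helper. This is a minimal sketch, assuming an nc build that supports the -z and -w options; the function name check_port is hypothetical.

```shell
# check_port HOST [PORT]
# Reports whether a TCP port accepts connections; PORT defaults to 1270.
# Assumes nc supports -z (connect without sending data) and -w (timeout, seconds).
check_port() {
    host=$1
    port=${2:-1270}
    # stdin is redirected so nc never consumes input when called inside a loop
    if nc -z -w 5 "$host" "$port" </dev/null >/dev/null 2>&1; then
        echo "$host:$port reachable"
    else
        echo "$host:$port unreachable"
        return 1
    fi
}
```

It can then be called in a loop, for example: for agent in <agent-1> <agent-2>; do check_port "$agent"; done. The helper also reports "unreachable" when nc itself is missing, so confirm the tool is installed before drawing conclusions.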

Step 2: Verify the Installed Agent Version

Ensure the correct and compatible version of the SCOM agent is installed by executing:

/opt/microsoft/scx/bin/tools/scxadmin -version

Compare the version reported by this command with the version expected by your SCOM management server or as indicated in the official Microsoft compatibility matrix. Mismatches in agent versions can cause compatibility issues or unexpected agent behaviors, so always ensure alignment between agent and management server versions.

Advanced Troubleshooting Techniques

Verbose Logging

Verbose logging is a crucial troubleshooting step that enables detailed and comprehensive logging of SCOM agent activities. It provides deeper visibility into agent operations, communication events, and specific errors, assisting significantly in identifying complex or unclear issues.

Step 1: Enable verbose logging

To activate verbose logging, execute the following command on your Linux/UNIX agent:

sudo /opt/microsoft/scx/bin/tools/scxadmin -log-set all verbose

This command instructs the agent to log all activities in detail.

Verbose logs will include granular events such as data transactions, certificate validation, discovery operations, and heartbeat exchanges.

Step 2: Review Verbose Logs

Once enabled, review detailed logs continuously or after specific events using:

tail -f /var/opt/microsoft/scx/log/scx.log

Pay special attention to error-level entries, warnings, and messages indicating certificate, communication, or permission issues.

Step 3: Revert Logging to Default After Troubleshooting

After resolving the issue, it is important to revert logging to the default state to avoid excessive log-file growth, which can consume considerable disk space over time:

sudo /opt/microsoft/scx/bin/tools/scxadmin -log-reset

This command restores logging levels to their defaults, ensuring optimal system performance and manageable log sizes.

By temporarily enabling verbose logging, administrators can gather necessary diagnostic details, promptly identify underlying issues, and restore logging settings afterward to maintain system efficiency.
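
Because it is easy to forget Step 3, the enable and reset commands can be paired in one helper. The following is a sketch only: the function name with_verbose_logging and the SCXADMIN variable are assumptions introduced here, while the two scxadmin invocations are the ones shown in the steps above.

```shell
# Path to scxadmin as used throughout this guide; can be overridden.
SCXADMIN=${SCXADMIN:-/opt/microsoft/scx/bin/tools/scxadmin}

# with_verbose_logging COMMAND [ARGS...]
# Enables verbose logging, runs COMMAND, then resets logging to defaults,
# and returns COMMAND's exit status. Run it as root.
with_verbose_logging() {
    "$SCXADMIN" -log-set all verbose || return 1
    "$@"
    status=$?
    # Reset even when COMMAND failed, so verbose logging is not left on
    "$SCXADMIN" -log-reset
    return "$status"
}
```

For example, with_verbose_logging sleep 300 keeps verbose logging on for five minutes while you reproduce the problem. The reset does not run if the shell is interrupted before the command finishes, so check the logging level afterwards in that case.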

Clearing the SSSD Cache

The System Security Services Daemon (SSSD) caches authentication credentials and identity information from remote identity providers (such as LDAP or Active Directory). While caching significantly enhances performance by reducing frequent lookups, it can sometimes retain outdated or stale entries, causing authentication failures or inconsistencies, especially after updates or configuration changes.

To effectively clear the SSSD cache and resolve potential authentication issues, follow these detailed steps:

Step 1: Stop the SSSD service

Before clearing the cache, gracefully stop the SSSD service to ensure that no active operations are disrupted and that the cache can be safely cleared. Execute:

sudo service sssd stop

Ensure the service has stopped successfully without errors. You can verify this by running:

sudo service sssd status

Confirm that the service status indicates it is stopped.

Step 2: Remove cache files

Clear the SSSD cache by deleting the cached credential database. This will force SSSD to retrieve fresh information from the configured identity providers upon restart. Execute:

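sudo sh -c 'rm -rf /var/lib/sss/db/*'

The command above removes all cached entries. It runs through sh -c so that the * wildcard is expanded with root privileges; the directory is typically readable only by root, so with a plain sudo rm the calling shell cannot expand the pattern and nothing is deleted.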

Ensure this step is executed carefully, verifying the path to avoid accidental removal of critical data elsewhere.

Step 3: Restart the SSSD service

Restart the SSSD service to reinitialize the cache and resume authentication operations using fresh data. Execute:

sudo service sssd start

After restarting, SSSD will recreate the necessary database files and re-populate the cache from the identity sources. Check the SSSD service status again to confirm successful restart:

sudo service sssd status

Ensure the service status shows active/running without errors.

By following these detailed steps, you clear potentially stale cached credentials, resolve authentication inconsistencies, and allow the SSSD to function optimally.

Verifying OpenSSL Configuration

Proper SSL communication between the Linux/UNIX SCOM agent and the management server is crucial for secure and reliable monitoring. Problems related to SSL misconfiguration often manifest as failed agent connections or intermittent communication issues.

Follow these steps to verify and troubleshoot SSL communication issues clearly and effectively:

Step 1: Test SSL Connectivity with OpenSSL

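Use OpenSSL’s client mode to test SSL connectivity and validate that a secure connection can be established to the agent’s listener on TCP port 1270. The agent, not the management server, listens on this port, so the test targets the agent.

Execute the following command on the Linux/UNIX agent system itself, or from any host with OpenSSL that can reach it:

openssl s_client -connect <agent-hostname>:1270

Replace <agent-hostname> with the actual hostname or IP address of your Linux/UNIX agent.

Analyzing the Output:

Successful SSL connection:

You will see a detailed SSL handshake sequence and information about the certificate chain presented by the agent, such as:

CONNECTED(00000003)

depth=0 CN = <agent-hostname>

Verify return code: 0 (ok)

This indicates proper SSL configuration and successful connectivity. A non-zero verify return code can appear when the certificate that signed the agent certificate is not in the local OpenSSL trust store; in that case, a completed handshake still confirms that the port is reachable and SSL negotiation works.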

Failed SSL connection or handshake issues:

Typical error messages include:

connect: Connection refused

SSL handshake failed

These messages suggest either a network issue, SSL certificate mismatch, expired certificate, or incorrect cipher suite configuration.

Next Steps if Issues Found:

Confirm that firewall settings allow TCP port 1270.
Recheck certificate validity and trust (using steps detailed previously in the certificate troubleshooting section).
Verify and update cipher configurations if required:

openssl ciphers -V

Performing these detailed verifications ensures that the OpenSSL configuration is correct, the SSL certificates are properly trusted, and secure communication between the agent and the management server is functioning optimally.
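
To see at a glance which protocol and cipher were negotiated, the s_client output can be filtered. This sketch assumes the "New, <protocol>, Cipher is <cipher>" summary line that OpenSSL prints after a handshake; the function name negotiated_cipher is hypothetical.

```shell
# negotiated_cipher
# Reads `openssl s_client` output on stdin and prints "<protocol> <cipher>"
# taken from the first "New, ..., Cipher is ..." summary line, if any.
negotiated_cipher() {
    sed -n 's/^New, \([^,]*\), Cipher is \(.*\)$/\1 \2/p' | head -n 1
}
```

A possible invocation is openssl s_client -connect <host>:1270 </dev/null 2>/dev/null | negotiated_cipher, where <host> is the endpoint tested in Step 1. Empty output means no handshake summary was produced, and a result of "(NONE) (NONE)" means the connection opened but no cipher was agreed.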

Checking OMI (Open Management Infrastructure) Issues

The Open Management Infrastructure (OMI) service is essential for the proper functioning of the Linux/UNIX SCOM agent. Problems with OMI often lead to communication failures, monitoring disruptions, or overall agent instability. To diagnose and resolve common OMI-related issues, proceed as follows:

Step 1: Restart the OMI service

A restart of the OMI service often resolves temporary issues related to service responsiveness or stability. Execute the following command:

sudo systemctl restart omid

This command cleanly stops and restarts the OMI service, potentially resolving transient operational or communication problems.

Step 2: Verify that the OMI service is running

After restarting, check that the OMI service is active and healthy by running:

systemctl status omid

Analyzing the output:

Look for a status indicating:

Active: active (running)

If the service shows any state other than active or reports errors, investigate further by reviewing additional log details provided in the output.

Additional diagnostics (if necessary):

In case of persistent OMI-related issues, deeper investigation into logs may be required. You can inspect the OMI server log file directly by executing:

tail -f /var/opt/omi/log/omiserver.log

Review these logs for error messages indicating service crashes, authentication errors, or other operational issues requiring further action.

By following these detailed steps, administrators can effectively troubleshoot, verify, and restore proper functionality of the OMI service to ensure smooth operation of the Linux/UNIX SCOM agent.

Ensuring Correct Agent Kit Installation

Verifying the correct installation and version alignment of the Linux/UNIX SCOM agent is critical for seamless communication and monitoring. Incompatibilities or incorrect agent kits can lead to errors, missing monitoring data, or agent instability. Follow these steps to ensure correct installation and resolve version discrepancies:

Step 1: Verify the Installed Agent Version

Confirm the currently installed agent version by executing the following command on your Linux/UNIX system:

sudo /opt/microsoft/scx/bin/tools/scxadmin -version

This command displays the exact agent version, including build numbers.

Step 2: Cross-reference with Compatibility Matrix

Check the output against the official Microsoft compatibility documentation to confirm compatibility with your operating system and your current SCOM management server version. Agent compatibility matrices are available from official Microsoft resources:

SCOM Linux/UNIX Agent Compatibility Matrix

If the agent version displayed does not match the recommended version for your distribution or the SCOM management server, proceed with the upgrade process.

Step 3: Upgrade Agent Version if Necessary

To upgrade or reinstall the correct version, follow these steps:

Download the appropriate Linux/UNIX agent kit from Microsoft’s official SCOM download repository or through your SCOM console under the "Linux/UNIX Agent Management" section.
Transfer the agent kit to your Linux/UNIX system.
Perform the upgrade installation using the provided script. Typically, this involves executing commands similar to:

sudo sh ./scx-<version>.sh --upgrade

(Replace the placeholder with your specific downloaded agent installer file name.)

After the upgrade completes, verify again with:

sudo /opt/microsoft/scx/bin/tools/scxadmin -version

Ensure the new installed version matches the expected and supported version for your OS and SCOM management environment.

These careful verification and upgrade steps guarantee the agent operates correctly, avoiding compatibility-related issues or unexpected behaviors.
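
Comparing version strings by eye is error-prone once build numbers reach two digits. The sketch below compares two version strings with sort -V, which is available in GNU coreutils but may be missing on some UNIX platforms. The function name version_at_least is an assumption, and you supply both version strings yourself; the example does not parse scxadmin output.

```shell
# version_at_least INSTALLED REQUIRED
# Returns 0 when INSTALLED is the same as or newer than REQUIRED.
# Relies on `sort -V` (version sort), assumed to be available.
version_at_least() {
    installed=$1
    required=$2
    # After a version sort, the first line is the older of the two strings
    lowest=$(printf '%s\n%s\n' "$installed" "$required" | sort -V | head -n 1)
    [ "$lowest" = "$required" ]
}
```

For instance, version_at_least 1.6.10-2 1.6.8-1 succeeds, whereas a plain string comparison would rank 1.6.10 below 1.6.8. The version numbers here are made up for illustration.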

Troubleshooting Missing Performance Counters

If your Linux/UNIX SCOM agent is missing performance data (such as CPU, memory, or disk usage metrics), it typically indicates issues with the CIM (Common Information Model) server—often related to the underlying OMIServer service. Follow these steps to diagnose and resolve performance counter data issues effectively:

Step 1: Check Available CIM Namespaces

Verify that the required CIM namespaces are present and accessible on the Linux/UNIX agent. Execute the following command to list all available namespaces:

sudo /opt/microsoft/scx/bin/tools/scxcimcli ns

Review the output carefully:

Confirm that expected namespaces (such as root/scx) appear in the list.
Missing or inaccessible namespaces indicate potential corruption, configuration issues, or a failure in the CIM server.

Step 2: Restart the CIMOM Service (OMIServer)

If namespaces appear incomplete or performance counters are still missing, restarting the CIMOM service (OMIServer) can often resolve the underlying issue. The OMIServer process runs under the omid service, the same unit name used in the OMI section above. Execute:

sudo systemctl restart omid

Restarting the CIMOM service refreshes the service state and reinitializes namespace data collection processes.

Step 3: Verify Service Status After Restart

After restarting, ensure the service restarted successfully and is operational:

systemctl status omid

Confirm the service shows as:

Active: active (running)

Additional Troubleshooting

If performance counters remain missing after the restart, review CIMOM logs for deeper diagnostics:

tail -f /var/opt/omi/log/omiserver.log

Analyze the logs for error messages indicating issues such as namespace corruption, provider errors, or service crashes.

By performing these detailed verification and corrective steps, administrators can effectively restore missing performance data collection, ensuring accurate monitoring through SCOM.

WinRM Verification of SCX Agents from Management Servers in the Linux/UNIX Resource Pool

To verify communication and enumerate details from the Linux/UNIX agent via WinRM (Windows Remote Management) from SCOM Management Servers within the Linux/UNIX Resource Pool, you can utilize either Basic Authentication or Kerberos Authentication. This enumeration helps confirm that the SCX agent is responsive, properly configured, and able to communicate securely.

Step 1: Verification using Basic Authentication

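If your environment uses basic authentication, perform the enumeration by executing the command below. The agent answers WS-Management requests over HTTPS on TCP port 1270, so the remote endpoint is given as a full URL, and the SCX classes are queried in the root/scx namespace. The -skipCACheck, -skipCNCheck, and -skipRevocationCheck switches bypass certificate validation for this test, which is usually needed because the agent certificate is not issued by a CA that Windows trusts:

winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_Agent?__cimnamespace=root/scx -auth:basic -username:<username> -password:<password> -remote:https://<agent-hostname-or-IP>:1270/wsman -skipCACheck -skipCNCheck -skipRevocationCheck -encoding:utf-8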

Replace the placeholders appropriately:

<username>: Authorized user account for the Linux/UNIX agent.
<password>: Password corresponding to the provided user account.
<agent-hostname-or-IP>: The target Linux/UNIX system’s hostname or IP address.

Step 2: Verification using Kerberos Authentication (recommended)

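For secure environments using Kerberos authentication, perform the enumeration with the command below. The remote endpoint is the agent’s HTTPS listener on TCP port 1270, and the SCX classes are queried in the root/scx namespace. Prefer the agent’s fully qualified hostname over an IP address, because Kerberos identifies the target by name:

winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_Agent?__cimnamespace=root/scx -auth:Kerberos -remote:https://<agent-hostname-or-IP>:1270/wsman -skipCACheck -skipCNCheck -skipRevocationCheck -encoding:utf-8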

Kerberos authentication does not require explicitly passing credentials, as it leverages integrated Active Directory security contexts.
Ensure that Kerberos tickets are available and valid on your Windows management server by using:

klist

Reviewing Command Output:

Successful enumeration will return structured information about the SCX agent, confirming proper configuration and responsiveness.
Failure or error messages (such as access denied or timeout errors) indicate authentication issues, network connectivity problems, or misconfigured Kerberos tickets.

By performing these steps, administrators can reliably verify that the Linux/UNIX SCX agents are correctly responding to enumeration requests and properly configured to communicate with SCOM Management Servers in the Linux/UNIX Resource Pool.

Enumerate OMI Information on Linux/UNIX Machine

Using the omicli tool provided by the Open Management Infrastructure (OMI), you can query detailed information from Linux/UNIX systems directly. This helps confirm that the OMI infrastructure is correctly collecting and exposing vital system metrics to the SCOM management server.

Step 1: Enumerate Operating System Information

To retrieve detailed information about the operating system on your Linux/UNIX system, execute the following command:

sudo /opt/omi/bin/omicli ei root/scx SCX_OperatingSystem

This command returns structured data including:
- OS name and version.
- Kernel version.
- System uptime.
- Hostname and domain details.

Carefully inspect the output to ensure the information is accurate, up-to-date, and properly formatted, indicating that the OMI provider is functioning correctly.

Step 2: Enumerate Processor Statistical Information

For detailed processor statistics, such as CPU usage, idle times, and performance metrics, execute:

sudo /opt/omi/bin/omicli ei root/scx SCX_ProcessorStatisticalInformation

Output typically includes processor usage percentages, processor idle time, interrupt rates, and other critical performance data.
Validate that the data returned matches expected system performance levels, ensuring the OMI provider captures and reports accurate system performance metrics.

Analyzing Enumeration Results:

Proper, detailed responses confirm correct functionality of OMI providers, ensuring the SCOM management server receives accurate and timely system data.
Errors or missing data suggest issues with the OMI providers or configuration, prompting additional troubleshooting steps (such as reviewing OMI service logs or restarting the OMI service).

Performing these steps validates OMI’s health and ensures reliable communication and data exchange between your Linux/UNIX systems and the SCOM management server.
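
The enumeration output can also be checked from a script. The helpers below are a sketch that assumes omicli prints each result as an "instance of <class>" header followed by indented Name=Value lines; both function names are hypothetical, and the sample values used to try them out are invented.

```shell
# omi_instance_count
# Reads omicli output on stdin and prints how many instances it contains.
omi_instance_count() {
    grep -c '^instance of '
}

# omi_property NAME
# Reads omicli output on stdin and prints the value of every plain
# "NAME=value" line. Lines carrying a qualifier prefix are not matched.
omi_property() {
    sed -n "s/^[[:space:]]*$1=//p"
}
```

For example, sudo /opt/omi/bin/omicli ei root/scx SCX_OperatingSystem | omi_instance_count should print at least 1 on a healthy agent, and a count of 0 points to the provider or service problems described above.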

Purge SCX Agent Installation

Completely removing and purging the SCX agent from a Linux/UNIX system is sometimes necessary when the agent installation is corrupted, outdated, or requires a clean reinstall. Follow these steps for a complete removal:

Step 1: Purge the SCX Agent

To perform a complete removal (purge) of the SCX agent, execute the installation script with specific flags:

sudo sh ./scx-<version>.sh --purge --force

Replace <version> with the specific version number included in the agent installation filename.

The --purge option ensures complete removal of all related files, configurations, logs, and certificates.
The --force option ensures the command executes without interactive prompts or confirmation messages, which is useful for scripting or automation purposes.

Step 2: Verify Successful Removal

Confirm the agent’s removal by checking if the installation directories no longer exist:

ls /opt/microsoft/scx

This command should report "No such file or directory", indicating a successful purge.
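
The check above covers only the program directory. Since this guide also refers to agent files under /etc/opt/microsoft/scx and /var/opt/microsoft/scx, a slightly broader check may be useful. The following is a minimal sketch; the function name leftover_paths is an assumption.

```shell
# leftover_paths PATH...
# Prints each given path that still exists; prints nothing if all are gone.
leftover_paths() {
    for path in "$@"; do
        if [ -e "$path" ]; then
            echo "$path"
        fi
    done
}
```

Calling leftover_paths /opt/microsoft/scx /etc/opt/microsoft/scx /var/opt/microsoft/scx should produce no output after a successful purge. Any path it prints is worth inspecting before reinstalling.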

Check Linux Ciphers

Ensuring the appropriate cipher suites are configured on your Linux/UNIX systems is crucial for maintaining secure and compatible SSL communication between the agent and SCOM management servers.

Step 1: Verify Supported Cipher Suites

Execute the following OpenSSL command to list all available cipher suites currently supported on your Linux/UNIX system:

openssl ciphers -V

Analyzing Cipher Output:

The output provides a detailed listing of supported ciphers, including strength, encryption methods, protocols, and cipher suite names.
Verify that the listed ciphers align with security policies and compatibility requirements for communication with SCOM management servers.
Identify and address potential weak or insecure ciphers as per organizational security guidelines.

Step 2: Update Ciphers if Needed

If the cipher configuration requires adjustments, update the OpenSSL configuration file (openssl.cnf) or system security policies to enable or disable specific cipher suites.

After making configuration changes, validate the adjustments by re-running the cipher verification command provided above.

Performing these detailed steps ensures a secure, compliant, and robust SSL communication setup between your Linux/UNIX agents and SCOM management servers.

Firewall Configuration

Proper firewall configuration is critical for successful communication between Linux/UNIX SCOM agents and management servers. By default, the SCOM agent communicates via TCP port 1270. Follow these detailed steps to configure and verify firewall rules effectively:

Step 1: Open Port 1270 for All IPs

To enable inbound communication from the SCOM management server to your Linux/UNIX agent via TCP port 1270, execute:

sudo firewall-cmd --zone=public --add-port=1270/tcp --permanent

The option --permanent ensures that this rule persists across firewall service restarts or system reboots.

Step 2: Reload Firewall Configuration

To immediately apply your updated firewall rules, reload the firewall service:

sudo firewall-cmd --reload

This command applies all permanent firewall rule changes without needing to restart the system.

Step 3: Verify Open Ports

Confirm that the firewall rules have been successfully applied and port 1270 is now open:

sudo firewall-cmd --zone=public --list-ports

Verify the command’s output lists 1270/tcp, confirming successful rule application.

Step 4 (Optional, Recommended): Restrict Access to Specific IP Address

To enhance security, you may restrict access to port 1270 only from the specific IP address of your SCOM management server. Execute the following command, replacing <management-server-ip> with the actual IP address of your management server:

sudo firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="<management-server-ip>" port port=1270 protocol=tcp accept' --permanent

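This creates a rule that accepts traffic to port 1270 from your management server. It only restricts access if the general rule from Step 1 is not present; if you added that rule, remove it with sudo firewall-cmd --zone=public --remove-port=1270/tcp --permanent before reloading.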

Step 5: Reload and Verify the Restricted Rule

Again, reload your firewall rules:

sudo firewall-cmd --reload

Confirm the rich-rule is active by listing all rules:

sudo firewall-cmd --zone=public --list-rich-rules

Ensure your specified IP address and port 1270 are correctly listed, verifying your firewall configuration has been secured appropriately.

These comprehensive steps help ensure secure, reliable, and properly configured firewall access between Linux/UNIX SCOM agents and their management servers.

Tailing Logs

Monitoring log files in real-time (tailing) is an essential troubleshooting step to quickly identify issues, observe system behavior, and understand errors or unexpected behaviors. Follow these detailed steps to tail relevant log files for authentication, system, and agent-specific troubleshooting:

Step 1: Authentication and System Message Logs

To monitor authentication attempts, security events, and general system messages, execute the following commands in separate terminal windows or sessions:

Authentication logs:

sudo tail -f /var/log/secure

This log file captures critical security-related events, such as login attempts, SSH session events, sudo usage, and other authentication-related issues.

General system logs:

sudo tail -f /var/log/messages

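This log provides a broad view of system events, service status changes, kernel messages, hardware detections, and general troubleshooting information. The two paths above apply to Red Hat-family distributions; on Debian and Ubuntu systems the equivalent files are typically /var/log/auth.log and /var/log/syslog.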

Step 2: OMI and SCX-Specific Logs

To troubleshoot the Open Management Infrastructure (OMI) service and SCOM’s SCX agent more specifically, tail their respective log files:

OMI service logs:

sudo tail -f /var/opt/omi/log/omiserver.log

This log tracks OMI-specific issues, including CIM provider errors, connectivity problems, namespace issues, or service disruptions.

SCX agent logs:

sudo tail -f /var/opt/microsoft/scx/log/scx.log

Contains detailed logging related to the SCOM agent’s operations, including data collection events, communication with the management server, SSL handshake activities, heartbeat issues, and discovery errors.

Best Practice for Log Monitoring:

Run each command in separate terminal sessions to simultaneously track multiple logs.
Observe logs closely during troubleshooting scenarios for specific timestamps, patterns, or error codes.
Take note of frequent or recurring issues and escalate as needed based on insights gained from these logs.

Using these detailed logging techniques ensures timely diagnosis and resolution of potential problems in your Linux/UNIX SCOM agent and operating environment.

Best Practices

Regular Log Maintenance

Regularly maintain and manage log files in

/var/opt/microsoft/scx/log/

Logs should be periodically reviewed, archived, or deleted to ensure disk space isn’t exhausted, preventing potential system issues and facilitating quicker troubleshooting.

Ensuring Compatibility

Regularly ensure the SCOM agent and management packs are updated and fully compatible with your Linux/UNIX versions. Check compatibility matrices provided by Microsoft frequently to avoid deployment issues[1][2].

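On the Windows management servers, tools such as IIS Crypto can help you review and adjust the enabled protocols and cipher suites so that they match what the Linux/UNIX agents support.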

Conclusion

Following these troubleshooting steps helps you identify Linux and UNIX SCOM agent issues accurately, resolve them quickly, and reduce the chance that they recur, which keeps your System Center Operations Manager environment stable. Proactively verifying agent compatibility, SSL certificate validity, network configuration, and firewall settings significantly reduces the likelihood of disruptions in monitoring. Regularly reviewing logs, validating OMI and CIMOM health, and checking agent versions helps administrators detect and address potential issues early. Together, these practices and techniques support efficient, reliable, and secure monitoring of UNIX/Linux systems.

References

UNIX/Linux System Center Operations Manager Agents Troubleshooting Tips

Troubleshoot monitoring of UNIX and Linux computers

Monitoring UNIX/Linux with SCOM 2022

OpsMgr (SCOM) – Unix/Linux Agents Requisites and Troubleshooting

Reddit (multiple threads)
