Insight by PagerDuty

Try a more organized approach to keeping systems and missions running

More than simply monitoring enterprise applications, agencies need an organized and automated way to alert the people who can fix problems the fastest.

However thorough agencies’ prevention efforts might be, enterprise application crashes occasionally occur. Two trends have made them more difficult to deal with: digital transformation, which sometimes brings separate applications together, and the use of hybrid cloud environments to host applications.

“What agencies need to do is figure out how to give minutes back to the mission,” said Eric Forseter, vice president for the public sector at PagerDuty. The main question: “How do I ensure the resiliency of the approach that I’m taking, and how do I transform my environment so that I am more modern, more agile, and improving the digital customer experience?”

The challenge, in other words, becomes innovating and transforming while minimizing risk, Forseter said. He pointed to one study showing developers spend more than half their time fixing applications, “as opposed to building and creating that transformation.”

Observability of the IT environment can help focus efforts and mitigate risk, Forseter said, but only if you’ve got a system to sort out important alerts from the “noise.”

“We need to automate,” he said. “We need applications to help us tune down the noise.” More than that, he said, IT organizations should create an environment that will anticipate adverse events. Perhaps with a touch of artificial intelligence, such applications “will tell you when an alert happens and, in effect, ask whether you want a server rebooted while it generates a report on what happened.”

In the same vein, a management overlay could monitor developers, whether government employees or contractors, to better understand the source of a problem, reduce what Forseter called the “mean time to resolution” and gain more knowledge to prevent a repeat.

Forseter said several agencies are using PagerDuty to help them understand not only when peak website visits occur but also the characteristics of the system, together with possible courses of action.

“When an issue arises, the product is alerting different teams. We’re letting management know, but also letting the operational folks know, and giving them decision points.” It may recommend actions, or be configured to orchestrate what it recommends.

“Maybe it’s spin up more servers for that application itself or do something cloud,” Forseter said. “Or maybe it’s, ‘Hey, we need to have a team come in right away to fix it.’”

Hands on needed

But which team should receive the alert? For example, a customer service rep or call center person might become aware of a technology-induced issue. The rep, Forseter said, may hit the PagerDuty chat button to alert others, typically the development, operations and security teams, as well as others in the customer-facing group who may be seeing the same problem.

The internal groups would have the knowledge to send a reliable fix-by time back to the customer service team, Forseter said.

A practical issue affects organizations, especially large ones. Given that members of operations, security and development teams work different days and hours, who precisely does the PagerDuty system alert at a given time?

Few employees outside of medicine use pagers anymore. Forseter said the company, formed well past the pager age, was founded by a software developer who had worked while on pager duty, hence the name. But a pager gives only limited information, say, a number to call back.

“He got this idea. ‘You know what, I don’t want to just get a call,’” Forseter said. “‘I want someone to actually let me know and to alert me when things are happening.’ And that’s where we’re built out of.”

Users configure the product with information on who does what in a company or agency, what their shift schedules look like, and how to contact them, usually on their smartphones. It also escalates if the designated contact fails to respond in a certain amount of time, by moving on to others on the team.
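The schedule-and-escalate idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PagerDuty's actual data model or API: the `Responder` class, the `who_to_notify` function, the names and phone numbers, and the five-minute acknowledgment timeout are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass
class Responder:
    """One person on a team (hypothetical structure, not a real product schema)."""
    name: str
    contact: str       # e.g. a smartphone number; the format is an assumption
    shift_start: time
    shift_end: time

    def on_shift(self, at: datetime) -> bool:
        t = at.time()
        if self.shift_start <= self.shift_end:
            return self.shift_start <= t < self.shift_end
        # Overnight shift that wraps past midnight
        return t >= self.shift_start or t < self.shift_end


def who_to_notify(team, alert_time, now, ack_timeout=timedelta(minutes=5)):
    """Pick the responder to contact at `now`, assuming nobody has acknowledged.

    Responders on shift when the alert fired are tried in listed order; each
    unanswered `ack_timeout` interval moves on to the next person. Returns
    None if no one was on shift.
    """
    order = [r for r in team if r.on_shift(alert_time)]
    if not order:
        return None
    steps = max(0, int((now - alert_time) / ack_timeout))
    return order[min(steps, len(order) - 1)]


# Illustrative team: two day-shift responders and one overnight responder.
team = [
    Responder("Ana", "555-0100", time(8), time(16)),
    Responder("Ben", "555-0101", time(8), time(16)),
    Responder("Cy", "555-0102", time(22), time(6)),
]
alert = datetime(2025, 1, 6, 9, 0)
first = who_to_notify(team, alert, alert)                           # Ana
second = who_to_notify(team, alert, alert + timedelta(minutes=6))   # Ben
```

In this sketch, an alert at 9 a.m. goes to Ana first and moves to Ben if she has not responded within five minutes. A real deployment would also track acknowledgments and notify managers, which the sketch leaves out.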

Of one federal customer, Forseter said, “The complexity of the mission that they work on is beyond comprehension.” He added, “They have documents upon documents saying this guy has this piece, but this lady has this piece.” The result is that when something breaks, employees “end up scrambling, looking through literally pieces of paper to say who is the right person that we need to call.”

He added, “If you can make that less manual, and then understand also who is that person’s boss, or who is other team members on that person’s team, that’s a huge benefit right there.”

At another federal agency, knowing the roles and responsibilities helped solve recurring scrambles.

“We had an agency this summer where literally something went down, and there was finger pointing,” Forseter said. “No, it’s your responsibility. No, it’s your responsibility. And they didn’t even know where to start. It took them four or five days to even get close to resolution, not because they didn’t know how to fix it, but because they didn’t know who the right people were.”

Because the product understands the nature and likelihood of an adverse event, and folds in the organizational information to deal with it, Forseter said, “now, all of a sudden, I’ve given time back to the mission. I can increase the pace of innovation. I can actually do that digital transformation.”

Copyright © 2025 Federal News Network. All rights reserved. This website is not intended for users located within the European Economic Area.
