Useful MOPs

Recently, I was involved in a discussion about good MOPs: Method of Procedure documents, often generated to plan work on networks, power systems, etc.

I said “generated to plan,” but in reality, this is not what happens in most orgs. MOPs become a box-ticking chore that actually subtracts value from the company by creating work with no gain. This is bad on two levels: it wastes your time, and it misses all the benefits of what should be a helpful planning tool.

The best way to take this back is to start with a review of what information is typically on your MOPs. Broadly, I like to see as much of the below as practical:

Summary
Boilerplate
- communication channel
- who is involved; escalation resources
- equipment being configured
- materials & tools
- duration, scheduling, and exclusivity
- dependencies, blockers, or relationship to larger-scale project
- forecast effect on customers
Actions
- pre-execution tests
- steps
- revert procedures, if appropriate
- post-execution tests
- follow-up tasks

One thing I don’t like is a requirement that every command or config change to be run must be written out in the MOP. That is wishful thinking that really just forces obedient folks to abort jobs over easily fixed typos, and creates jeopardy for people who take the initiative to work around planning mistakes.

What you should always have is an escalation path. If the planned activity doesn’t go smoothly, the folks executing it should know who to contact for assistance or authority to deviate from the plan.

What you may have, depending on your needs and tooling, are prepared commands or references to them, like “update route-policy FOO to the latest production version.”

You might also have plenty of guidance, like “run show route 192.0.2.0/24 all and make sure the route isn’t leaking into other VRFs unintentionally; be observant of route table names when checking the output.” This kind of information can help a less-experienced teammate avoid an escalation and increase their confidence. It can help you get more nights of uninterrupted sleep!

Recipe for a Useful MOP

Summary

If someone on your team reads only this, they should have a good idea of what’s happening. “Upgrade software on 10 selected A-plane switches in IAD1 POP” is great.

Boilerplate

A lot of teams get tired of the boilerplate. That’s because they stopped using it as a tool. Anyone on a network engineering team should be able to draft a MOP without knowing all the details, circulate it to colleagues (preferably through a system where everyone can make edits, not just “comment”), and have some prep work completed even if the activity date or personnel involved are not known.

Why? Because that boilerplate helps you choose the schedule and responsible team or person. Does the activity need to be during an overnight window? Do you need on-site hands?

Communication Channel

A defined communication channel ensures everybody knows what phone bridge, IRC channel, or whatever, is being used by the people working on an activity. It might say “network maint bridge” and everybody knows what that means. On the other hand, if you have BigColo remote hands assisting you and you wrote a MOP for them, you’ll want it to include a phone number they can call if needed.

Who is Involved

The “who” might be people, teams, or outside resources like co-lo facility remote hands or electrical contractors.

It should also include escalation contact(s) in case the people performing the task need help. If that’s always the same, you might leave it out; but consider what you do if you write a MOP for a co-lo remote hands person!

Equipment Being Configured

It helps to list what equipment is being configured so your coworkers can more easily understand the scope of the job.

Materials & Tools

If a site tech will benefit from having a light meter available, that’s good information to tell them clearly before they get to your rack. Same thing with patch cables, labeling machines, and so on.

This will save you time and help you get to bed earlier.

Duration, Scheduling, and Exclusivity

An activity’s duration is distinct from its scheduling.

If a thing will take ten minutes, express that clearly. If it’s more like 30 to 90 minutes depending on the moon phase, that’s fine, too.

Scheduling and exclusivity can be simple or not. If you have a small team, you might be fine with saying a task will be done anytime within a weekly 2-hour window (for example).

In a larger org, you might need to arrange for exclusivity in a POP or metro area, or otherwise ensure that, when you’re upgrading the OS on routers in the A-plane, somebody else isn’t swapping cards in the B-plane.
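If your planned activities live somewhere you can query, checking exclusivity can be mechanical. Here is a minimal sketch in Python, assuming each activity carries a scope (a POP or metro name) and a start and end time. The Activity class, its field names, and the sample dates are all hypothetical, not something from a particular tool.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Activity:
    ticket: str      # hypothetical ticket ID
    scope: str       # hypothetical exclusivity scope, e.g. a POP or metro name
    start: datetime
    end: datetime


def find_conflicts(activities):
    """Return pairs of tickets that share a scope and overlap in time."""
    conflicts = []
    for a, b in combinations(activities, 2):
        same_scope = a.scope == b.scope
        # Two windows overlap when each one starts before the other ends.
        overlaps = a.start < b.end and b.start < a.end
        if same_scope and overlaps:
            conflicts.append((a.ticket, b.ticket))
    return conflicts


# Made-up schedule for illustration only.
planned = [
    Activity("1234", "IAD1", datetime(2024, 1, 10, 2, 0), datetime(2024, 1, 10, 4, 0)),
    Activity("1250", "IAD1", datetime(2024, 1, 10, 3, 0), datetime(2024, 1, 10, 3, 30)),
    Activity("1251", "LHR1", datetime(2024, 1, 10, 3, 0), datetime(2024, 1, 10, 5, 0)),
]
print(find_conflicts(planned))
```

Here the two IAD1 activities are flagged because their windows overlap, while the LHR1 job is left alone. A real version would probably need a richer idea of scope (A-plane vs. B-plane, for instance), which this sketch ignores.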

Dependencies, Blockers, or Relationship to Larger-Scale Project

It’s important to know if some other jobs need to be completed before executing an activity. You also need to know if your project is a blocker for others. You may be fine keeping track of this manually even while there are dozens of things in-flight, but once it starts to get too complex to manage by hand, don’t stop managing it. That’s when you really get into trouble. Instead, use a ticketing system that understands dependencies / blockers.

If you can’t get a ticketing system or other project management tool, just use a diagram. Start up Visio and draw a dependency graph, and update it periodically (perhaps once or twice weekly) so you don’t miss the big picture.
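If you would rather keep that graph as text next to your MOPs, a few lines of Python can do the bookkeeping. This is only a sketch of the idea, not a replacement for a real ticketing system: the activity names are made up, and it uses graphlib from the Python standard library (3.9 or later) to order the work and catch circular dependencies.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical activities: each key maps to the set of activities it depends on.
depends_on = {
    "upgrade-a-plane-switches": {"stage-software-images", "verify-console-access"},
    "upgrade-b-plane-switches": {"upgrade-a-plane-switches"},
    "stage-software-images": set(),
    "verify-console-access": set(),
}


def execution_order(graph):
    """Return one valid order in which every activity follows its dependencies."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        # The second element of args holds the list of nodes forming the cycle.
        raise ValueError(f"circular dependency: {err.args[1]}") from err


def blocked_by(graph, activity):
    """Return the activities that directly depend on the given one."""
    return sorted(name for name, deps in graph.items() if activity in deps)


print(execution_order(depends_on))
print(blocked_by(depends_on, "upgrade-a-plane-switches"))
```

The first call answers “what has to happen before my job?” and the second answers “whose job am I blocking?”, which are the two questions this section of the MOP is meant to cover.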

Forecast Effect on Customers

This is what folks outside your group probably care about most. Make it easy for them to find. Make it easy to understand so somebody writing a customer announcement email won’t come ask you to clarify.

Actions

Pre-Execution Tests

Hopefully, you have a status board or a script that can verify that the in-scope part of your network is in good condition at the beginning of your activity. If it isn’t, you’ll already know that when you re-test later, so you can tell whether the activity broke something or whether something was already broken for unrelated reasons.

If you don’t already have this, you should work on it. It’s very important. Don’t write out a bunch of steps like “ping customers and check for failed BGP sessions” because that is a bad way to spend everyone’s time. Get a tool that checks for these things automatically.

This is when you test your serial console if there isn’t already an automatic tool or trustworthy regular process to make sure your out-of-band stuff works.
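If you are starting from nothing, the tool does not have to be fancy. Below is a rough sketch in Python of a check runner. Both check functions are hypothetical placeholders that return canned results; real ones would query your monitoring system or the devices themselves. The point is only the shape: named checks, a pass/fail result for each, and a list of failures at the end.

```python
from typing import Callable, Dict, List, Tuple

# A check returns (ok, detail).
Check = Callable[[], Tuple[bool, str]]


def check_bgp_sessions() -> Tuple[bool, str]:
    failed = []  # placeholder: would be filled from real session data
    return (not failed, f"{len(failed)} failed BGP sessions")


def check_console_access() -> Tuple[bool, str]:
    reachable = True  # placeholder: would come from a real out-of-band test
    detail = "serial console reachable" if reachable else "serial console unreachable"
    return (reachable, detail)


def run_checks(checks: Dict[str, Check]) -> List[str]:
    """Run every check, print a line per result, and return the names that failed."""
    failures = []
    for name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as err:  # a check that crashes counts as a failure
            ok, detail = False, f"check raised {err!r}"
        print(f"{'PASS' if ok else 'FAIL'}  {name}: {detail}")
        if not ok:
            failures.append(name)
    return failures


failures = run_checks({
    "bgp-sessions": check_bgp_sessions,
    "console-access": check_console_access,
})
```

A crashed check is treated as a failure on purpose: before a maintenance, “I couldn’t tell” should stop you just like “it’s broken” does.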

Steps

Here’s where you write “configure all the things” or “deploy latest production version of core router policy templates from git repo.”

Use an appropriate level of detail so your coworkers could perform the activity instead of you if you decide to request a day off. That’s really the goal of teamwork tools: more team responsibilities, fewer tasks tied to individuals, and more freedom to take the day off.

Again, don’t feel the need to write every command that will be executed. Guidance is helpful, but you wouldn’t let someone from another department log in to your routers and cut/paste commands into them, so don’t try to script your activity in such excruciating detail that they could, unless it actually helps your team to do this. Sometimes that is true for unusual jobs!

Revert Procedures, If Appropriate

It’s okay to leave this out or provide only broad guidance. If you need to include any cautionary notes about back-out processes, though, put them here. A good example might be, “if upgrade fails, leave router running on backup RE and reinstall the malfunctioning RE OS back to 10.4R4.5 using USB or TFTP.”

You may also simply write “if you need to revert, escalate to neteng for assistance” so somebody doesn’t have to make a hard decision about whether to wake a coworker up at 2am. Give them guidance in advance.

Post-Execution Tests

Again, we hope you have a nice status board and/or a script to quickly evaluate the health of the affected part of the network. If not, honestly, it’s more important to work on that than to have MOPs at all.
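The payoff of running the same checks before and after is that you can compare the two results. A minimal sketch in Python, assuming each run produces a mapping of check name to pass/fail; the check names and results below are made up for illustration.

```python
def compare_snapshots(before, after):
    """Classify each pre-execution check by how its status changed."""
    report = {"newly_broken": [], "already_broken": [], "fixed": [], "missing": []}
    for name, was_ok in before.items():
        if name not in after:
            # The check ran before the activity but not after it.
            report["missing"].append(name)
        elif was_ok and not after[name]:
            report["newly_broken"].append(name)
        elif not was_ok and not after[name]:
            report["already_broken"].append(name)
        elif not was_ok and after[name]:
            report["fixed"].append(name)
    return report


# Hypothetical results: True means the check passed.
pre = {"bgp-sessions": True, "console-access": False, "customer-pings": True}
post = {"bgp-sessions": False, "console-access": False, "customer-pings": True}

for category, names in compare_snapshots(pre, post).items():
    print(f"{category}: {names}")
```

Anything under newly_broken is probably yours to fix or revert. Anything under already_broken was not caused by the activity, which is exactly what you want to be able to say at 2am.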

Follow-Up Tasks

Do you need to do anything before moving your ticket to review/closed status?

Is there a customer-facing status page that would need to be updated? (Sometimes true for major activities!)

Think about all the tasks that need to happen here. If they are common, write them up as a standard procedure you can re-use. If they are so common that you always need to do them, automate those things.

Don’t forget about the little things, like inventory. If your site tech used up some patch cables from your supply, who is responsible for reducing the inventory of cables? This can seem like more box-ticking, but it’s how your org knows when to order more supplies.

Tools & Format

Are there tools for this? Yes. 100% there are workflow tools. You might want to use one, or might not.

What I will say is, do not use Microsoft Word or Excel documents for this unless you already have a good shared editing and revision tracking system in place for those types of docs among your team. There is basically zero value in using MS Word instead of a text (or markdown!) file for MOPs. If you have to email Word docs as part of your MOP drafting or approval procedure, stop it. That kind of bad process will ensure that good people leave your team.

What I think you should do is familiarize your team with the basic operation of Git and use markdown files, like below. That will allow everyone on your team to make edits, not just suggest and comment on others’ drafts. It also gives folks plenty of resources when they want to see how a similar activity was planned in the past.

network-mops/
    complete/
    pending/
        1234-add-patch-cables-to-rt1-lhr1/
            index.md
            picture-of-linecard.jpg
            picture-of-patchfield.jpg
    templates/

This shouldn’t be your only tool, but keep your MOP documents as text in a git repo, and let your tickets refer to them. You could literally plan your activities using GitHub Issues if you didn’t have access to any better tools.
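To make drafting cheap, you might script the creation of a new MOP directory. The following is a sketch in Python, assuming the repo layout shown above and the section list from earlier in this post. The function names, the “TODO” placeholders, and the heading levels in index.md are my own choices for illustration, not a standard.

```python
from pathlib import Path

# Section outline taken from the list near the top of this post.
SECTIONS = {
    "Boilerplate": [
        "Communication channel",
        "Who is involved; escalation resources",
        "Equipment being configured",
        "Materials & tools",
        "Duration, scheduling, and exclusivity",
        "Dependencies, blockers, or relationship to larger-scale project",
        "Forecast effect on customers",
    ],
    "Actions": [
        "Pre-execution tests",
        "Steps",
        "Revert procedures, if appropriate",
        "Post-execution tests",
        "Follow-up tasks",
    ],
}


def render_mop(summary):
    """Return the markdown skeleton for a new MOP."""
    lines = ["# Summary", "", summary, ""]
    for section, items in SECTIONS.items():
        lines += [f"# {section}", ""]
        for item in items:
            lines += [f"## {item}", "", "TODO", ""]
    return "\n".join(lines)


def create_mop(repo_root, ticket, slug, summary):
    """Create pending/<ticket>-<slug>/index.md under repo_root and return its path."""
    mop_dir = Path(repo_root) / "pending" / f"{ticket}-{slug}"
    # Refuse to overwrite an existing MOP directory.
    mop_dir.mkdir(parents=True, exist_ok=False)
    index = mop_dir / "index.md"
    index.write_text(render_mop(summary), encoding="utf-8")
    return index
```

Calling create_mop("network-mops", "1234", "add-patch-cables-to-rt1-lhr1", "Add patch cables to rt1 in LHR1") would produce the pending/ entry from the example tree, ready for a colleague to start filling in before the date or personnel are known.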

Note about large files: Plan on using Git LFS eventually if you store a lot of pictures in your repo. Same for Visio diagrams. Don’t worry about this until your repository is already starting to get sluggish upon commits or branch changes, though. Just use it the easy way long enough to decide if this is what you and your team want to do; then solve for scale later.

jeff.wheeler.blog

Jeff Wheeler's ramblings