Measuring teaching practice: COPUS observations

by Stephanie Chasteen on April 14, 2016

I’ve said this before, but I *am* going to start posting in this blog again!  I miss the chance to share ideas and reflect on what I’m learning.

So today I’m going to talk about something I’ve been involved with lately, which is the problem of how to measure teaching practice.  There are many of us who work in the area of teaching improvements, and a big problem that plagues us is this: how do we measure what teachers actually DO in their classrooms?

It’s a sticky problem.  We know that what teachers SAY and what they DO don’t always match.  There’s a famous study on this (which I’m forgetting the name of), and there are lots of other instances.  Part of the problem is that there isn’t a clear vocabulary for talking about teaching (e.g., if I ask a teacher if they use Peer Instruction, they may say yes, but not use many of the practices that experts would say comprise Peer Instruction), and it’s hard to reflect accurately on your own behavior.  Just recently I used a survey question that asked teachers to estimate the percent of class time in which they use active learning.  The same question, asked two weeks later, yielded different results.

There are several different approaches to this measurement problem, including validated surveys (which I may write about later), but one method is to use observational protocols.  There are a few of these, including the Reformed Teaching Observation Protocol (RTOP), the Teaching Dimensions Observation Protocol (TDOP), and the Classroom Observation Protocol for Undergraduate STEM (COPUS).  They each have their own strengths and purposes (and I’d love to see a comparison of the different approaches).  There is also one for lab courses.  Right now I’m using the COPUS, and finding out a lot during the process.

What is the COPUS?

It’s a set of codes to help characterize classroom instruction.  In particular, it looks at what the students are doing, and what the instructor is doing, in two-minute increments.  Is the instructor lecturing?  Did she pose a question?  Is she using a clicker question?  Are the students listening, asking a question, discussing with each other?  Are a high fraction of students engaged, or not?  I log everything of this kind that happens in each two-minute period, and then the timer restarts and I do it again.
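
To make the "codes per two-minute interval" idea concrete, here is a minimal sketch of how such a log could be represented. The class, the function, and the code labels (`lecture`, `listening`, and so on) are hypothetical names chosen for illustration; they are not the official COPUS abbreviations, and this is not how GORP stores its data.

```python
from dataclasses import dataclass, field
from typing import List, Set

INTERVAL_MINUTES = 2  # COPUS codes are logged in two-minute windows


@dataclass
class Interval:
    """All codes observed during one two-minute window (hypothetical structure)."""
    index: int
    instructor: Set[str] = field(default_factory=set)
    students: Set[str] = field(default_factory=set)

    @property
    def start_minute(self) -> int:
        return self.index * INTERVAL_MINUTES


def log_observation(intervals: List[Interval], minute: float, who: str, code: str) -> Interval:
    """Record `code` for `who` in the interval that contains `minute`."""
    if who not in ("instructor", "students"):
        raise ValueError("who must be 'instructor' or 'students'")
    index = int(minute // INTERVAL_MINUTES)
    # Create any intervals that have not been started yet.
    while len(intervals) <= index:
        intervals.append(Interval(index=len(intervals)))
    getattr(intervals[index], who).add(code)
    return intervals[index]


# Illustrative labels only, not the protocol's official code names.
session: List[Interval] = []
log_observation(session, 0.5, "instructor", "lecture")
log_observation(session, 1.5, "students", "listening")
log_observation(session, 2.2, "instructor", "clicker_question")
log_observation(session, 3.0, "students", "group_discussion")
```

Using sets means a code is either present or absent in an interval, no matter how many times it occurred, which matches the "mark the box" style of logging described here.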

How do you collect the data?

There is a wonderful, wonderful tool, developed by a team at UC Davis, called GORP (the Generalized Observation and Reflection Platform).  It is part of the Tools for Evidence-Based Action (TEA) project, which has other technological tools for helping to support educational reform.  What the GORP does is provide an interface for logging observations from multiple protocols.  Below is a screenshot from COPUS:

[Screenshot: the COPUS interface in GORP]

Each little box represents one of the codes, and I select the relevant code.  All boxes reset every two minutes.

Alternatively, there is an Excel version, where you can just put a mark in the box for the code that applies in each two-minute increment.  The GORP tool is nice since it works with touch screens, and can automatically give you the data output.

What does COPUS output look like?

Visualization of the data in a way that’s most useful for teachers is an open problem, but here is an example of traditional vs. active learning from the University of Arizona (original page here):

[Figure: Physics comparison]



There are some other visualizations that give you a timeline of the course, so you can see not just the percent of time spent on different activities, but where they occur within the course.
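
As a rough sketch of the two summaries mentioned above (percent of time per activity, and where an activity falls in the session), assuming each two-minute interval is stored as a set of code labels. The data and the labels are made up for illustration and are not real observations or official COPUS code names.

```python
from collections import Counter

# Hypothetical instructor codes, one set per two-minute interval.
intervals = [
    {"lecture"},
    {"lecture"},
    {"clicker_question"},
    {"clicker_question", "moving_through_class"},
    {"follow_up"},
    {"lecture"},
]


def percent_of_intervals(intervals):
    """Percent of intervals in which each code was marked at least once."""
    counts = Counter(code for codes in intervals for code in codes)
    total = len(intervals)
    return {code: 100 * n / total for code, n in counts.items()}


def timeline(intervals, code, interval_minutes=2):
    """Start minutes of the intervals in which `code` was marked."""
    return [i * interval_minutes for i, codes in enumerate(intervals) if code in codes]


summary = percent_of_intervals(intervals)
lecture_minutes = timeline(intervals, "lecture")  # [0, 2, 10]
```

Because several codes can be marked in the same interval, these percentages can add up to more than 100.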

How do you learn to use the COPUS?

The COPUS page has a training protocol, but until I got trained myself, I didn’t appreciate how many nuances there are in the protocol.  I’m part of a project across 7 institutions, so it’s important that we use the protocol appropriately and reliably to keep our data comparable.  Also, I don’t have a group of observers with whom to compare data, so I needed a way to make sure I was reliable. Here’s the process we used:

  1. Code a 45-minute lecture video.
  2. Compare my codes to the codes assigned by a group of observers on the same video.  (We ran reliability measures, using a script we wrote in Excel, and looked for disagreement.)
  3. Code another video, incorporating that feedback.
  4. Compare codes on video 2.
  5. Code a third video, and compare codes.

We got pretty good reliability this way (kappa of 0.80-0.85), and figured out several specifics that don’t seem to be clearly defined in the protocol itself.   However, it’s tough to see what students are doing within the video, and there is a need for more videos for establishing this kind of reliability.
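
For readers who haven’t computed kappa before, here is a minimal sketch of Cohen’s kappa for two observers. It assumes each observer’s data for a single code is reduced to present (1) or absent (0) per two-minute interval; our actual Excel script may have differed in its details, and the data below are made up.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of intervals where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (list(rater_a).count(label) / n) * (list(rater_b).count(label) / n)
        for label in labels
    )
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical data: was "lecture" marked in each of ten intervals?
me = [1, 1, 1, 0, 0, 1, 1, 0, 0, 1]
reference_group = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1]

kappa = cohens_kappa(me, reference_group)  # about 0.8 for this made-up data
```

Here the two observers agree on 9 of 10 intervals (0.9), chance agreement is 0.5, and so kappa is (0.9 - 0.5) / (1 - 0.5) = 0.8.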

Here are some of the nuances I learned, just to give you a sense of the kinds of questions that arise when doing this kind of work:

  1. If an instructor is doing “follow-up” on a clicker question, do not simultaneously code “lecture”; it’s only considered follow-up.
  2. If the instructor is talking about the content of the course, don’t code that as “administration”.
  3. Even if students are first showing low engagement, and then high engagement, don’t code both; choose what seems most appropriate over the two-minute interval.
  4. After an instructor asks a clicker question, if they do not circulate, it’s hard to figure out whether to code that as “Waiting” or not.
  5. When students are first thinking about a clicker question on their own, it’s hard to determine whether to code that as “thinking individually” or just use “discussing clicker in groups”, since they do then discuss in groups.

Also, the “student engagement” codes are close to impossible to code reliably.  The protocol asks you to determine if small, substantial, or large fractions of students are obviously engaged, but from no single vantage point can you tell whether students are on task or not.  You can’t see where they’re all looking, or what’s on their computer screens.  From the back I always see a lot of students on their phones, and can hear off-topic conversations during clicker discussions.  But it’s hard to tell how widespread it is.  I know that UBC recently developed the Behavioral Engagement Related to Instruction (BERI) protocol, perhaps to address this issue.


Overall, it’s been very interesting to learn to use this protocol, and to use it.  It makes me much more aware of instructional moves, and how they can be used throughout the lecture.  I’ve felt privileged to sit in on several courses to use the protocol, but it also gives me a greater appreciation of the student experience.  It’s tough to sit and listen for so long!

I’d love to hear others’ experiences with this, and other protocols.

{ 1 comment }

Alex Small April 15, 2016 at 3:20 pm

Regarding COPUS, it’s not so different from TDOP. In fact, there’s an article on TDOP that mentions this:

The article has some pointed swipes at COPUS, e.g. “This effort led to a protocol known as the COPUS, which is a minor adaptation of the TDOP that involved removing certain categories (e.g., student cognitive engagement) and re-naming existing codes.”

