by Justin Reich, Richard L. Menschel HarvardX Research Fellow
Republished from Education Week, June 30, 2013
In a meeting recently, Andrew Ho (the chair of the HarvardX Research Committee) started using the phrase "person-click dataset," and it struck me that this is a very useful term that will probably one day enter wide circulation among educational researchers using longitudinal methods. So I wanted to record here that, unless someone else has already published it elsewhere, Andrew deserves credit for coming up with it.
The ur-text of longitudinal data analysis in education (and some other fields) is Judy Singer and John Willett's Applied Longitudinal Data Analysis. Take this with a grain of salt, as John was on my committee and I work with Judy now, but it is a marvelous book. Although the material is highly technical, it is written incredibly clearly and succinctly, with concepts illustrated by a wide variety of real data and analyses.
Singer and Willett propose two data structures with broad applications in longitudinal analyses: the person-level dataset and the person-period dataset.
In the person-level dataset, each row of the dataset corresponds to one person. So let's say you have 500 students and you test them three times. You might come up with a data set where the first three rows look like this:
|Student||Gender||Test 1||Test 2||Test 3|
Pretty simple stuff. The problem with the structure is that it requires a new column for every new observation. If you have 500 observations, you need 500 columns, even if you only observe a subset of people at each moment of observation.
Thus, a more flexible structure is a person-period dataset, where each row in the dataset corresponds to one person at one moment of observation.
These data structures were imagined for a world where most educational research studies had only a few waves of observation. You tested kids once a year for 5 years. You swabbed a kid's cheek for saliva once a day for a few weeks.
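To make the difference concrete, here is a minimal sketch of reshaping a person-level table into a person-period one. The student labels, scores, column names, and the `to_person_period` helper are all invented for illustration; they are not taken from Singer and Willett's book or from any real dataset.

```python
# Hypothetical person-level (wide) data: one row per student, one column per test.
# All names and scores below are made up for illustration.
person_level = [
    {"student": "A", "gender": "F", "test_1": 71, "test_2": 75, "test_3": 80},
    {"student": "B", "gender": "M", "test_1": 64, "test_2": None, "test_3": 70},
]


def to_person_period(rows, id_cols=("student", "gender"), prefix="test_"):
    """Reshape wide rows into one row per person per observed wave."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            # Skip identifier columns and waves where the person was not observed.
            if not col.startswith(prefix) or value is None:
                continue
            record = {c: row[c] for c in id_cols}
            record["wave"] = int(col[len(prefix):])
            record["score"] = value
            long_rows.append(record)
    return long_rows


person_period = to_person_period(person_level)
for record in person_period:
    print(record)
```

Note that student B's missing second test simply produces no row in the person-period version, whereas the person-level version has to carry an empty cell for it.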
We now have the opportunity to log everything that students do in online spaces: to record their contributions, their pathways, their timing, and so forth. Essentially, we are sampling each student's behavior at each instant, or at least at each instant that a student logs an action with the server (and to be sure, many of the things we care most about happen between clicks rather than during them).
Thus, we need a specialized form of the person-period dataset: the person-click dataset, where each row in the dataset records one student's action at a given instant, probably tracked to the second or tenth of a second. (I had started referring to this as the person-period(instantaneous) dataset, but person-click is much better.) Despite the volume of data, the fundamental structure is very simple.
|Andrew||1994-11-05T13:15:30Z||Play Video||Unit 3.Video 1|
|Andrew||1994-11-05T13:17:30Z||Submit Answer||Unit 3.Video 1.Question 1|
Then, depending upon the size and length of the course, you will need a few hundred million rows to capture all of the activity, but only those same four columns. This kind of dataset will become central to longitudinal research in online learning.
The "person-period" dataset will then become just a roll-up of person-click data. For many research questions, you don't need to know what everyone did every second; you just need to know what they did every hour, day, or week. You run through a big person-click dataset and sum up how many videos a person watched, pages they viewed, posts they added, questions they answered, and so on. Each row will represent one person in one defined time period, like a day. The larger your "period," the smaller your dataset.
All of these datasets use the "person" as the unit of analysis. One can also create datasets where learning objects are the unit of analysis, as I have done with wikis and as Mako Hill and Andrés Monroy-Hernández have done with Scratch projects. These can be referred to as project-level and project-period datasets, or object-level and object-period datasets.
As a geeky aside to a geeky post: as I understand it, most of the more modern latent variable approaches to longitudinal modeling, such as those that Katherine Masyn and others have developed, require a person-level dataset for analysis. Person-level datasets with the granularity of person-click datasets will be impossible to generate and use (except perhaps for people with access to incredible computational power). For a MOOC, you'd need a column for every second in which any student took an action, and a row for every user. That's a big table. You could not open that table in Excel. So it's possible that most of the recent innovations in latent variable approaches to longitudinal work won't be of any use with online data for a while, at least until the stats nerds can figure out how to do the same analyses on person-click datasets. (There's a thesis for a stats nerd somewhere.) That means the classical training, even going back to Cox proportional hazards modeling, is going to be pretty useful for a while longer.
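A roll-up of this kind can be sketched in a few lines. This is only an illustration under assumptions: the click rows, the timestamps, the four-field tuple layout, and the `roll_up_by_day` helper are invented here, and a real course log would need a streaming or database-backed approach rather than an in-memory list.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical person-click rows: (student, ISO 8601 timestamp, action, object).
# All values are invented for illustration.
clicks = [
    ("Andrew", "2013-06-28T13:15:30Z", "Play Video", "Unit 3.Video 1"),
    ("Andrew", "2013-06-28T13:17:30Z", "Submit Answer", "Unit 3.Video 1.Question 1"),
    ("Andrew", "2013-06-28T13:20:02Z", "Play Video", "Unit 3.Video 2"),
    ("Andrew", "2013-06-29T09:01:10Z", "Play Video", "Unit 4.Video 1"),
    ("Justin", "2013-06-28T22:45:00Z", "Submit Answer", "Unit 1.Video 1.Question 1"),
]


def roll_up_by_day(rows):
    """Collapse person-click rows into person-day rows of action counts."""
    daily = defaultdict(Counter)
    for student, timestamp, action, _obj in rows:
        # Truncate the timestamp to the calendar day (UTC) to define the period.
        day = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").date().isoformat()
        daily[(student, day)][action] += 1
    return daily


daily = roll_up_by_day(clicks)
for (student, day), counts in sorted(daily.items()):
    print(student, day, dict(counts))
```

Swapping the day truncation for an hour or an ISO week changes the period, and with it the number of rows in the resulting person-period dataset.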
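The size problem is easy to see even on a toy log. The sketch below counts how many cells a person-level table would need (one row per student, one column per distinct second in which anyone acted) against how many rows the person-click table holds. The log entries and the `table_sizes` helper are hypothetical, made up only to show the arithmetic.

```python
# Hypothetical click log reduced to (student, timestamp) pairs; invented values.
clicks = [
    ("s1", "2013-06-28T13:15:30Z"),
    ("s1", "2013-06-28T13:17:30Z"),
    ("s2", "2013-06-28T13:15:30Z"),
    ("s2", "2013-06-29T09:00:05Z"),
    ("s3", "2013-06-30T21:42:11Z"),
]


def table_sizes(rows):
    """Return (cells in the wide table, filled cells, rows in the long table)."""
    students = {student for student, _ in rows}
    seconds = {timestamp for _, timestamp in rows}
    wide_cells = len(students) * len(seconds)  # rows x columns of the wide table
    filled_cells = len(set(rows))              # cells that actually hold an action
    long_rows = len(rows)                      # one row per click
    return wide_cells, filled_cells, long_rows


wide_cells, filled_cells, long_rows = table_sizes(clicks)
print(f"person-level table: {wide_cells} cells, {filled_cells} filled")
print(f"person-click table: {long_rows} rows")
```

The wide table grows with the product of students and distinct seconds, while the person-click table grows only with the number of actions, so the gap widens quickly as a course gets larger.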
So, all of this is just for the record: I first heard Andrew Ho use the phrase "person-click dataset" in a meeting on Friday, June 28, in Gutman 403 (coincidentally, the same room where John Willett mentored over 300 doctoral students over 25 years). For those of you wondering what people are doing with data from online learning environments, hopefully this gives a little glimpse into the underlying data structures.