{"id":1285,"date":"2019-02-15T10:32:03","date_gmt":"2019-02-15T10:32:03","guid":{"rendered":"http:\/\/kusuaks7\/?p=890"},"modified":"2023-07-31T09:03:35","modified_gmt":"2023-07-31T09:03:35","slug":"data-is-a-stakeholder","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/data-is-a-stakeholder\/","title":{"rendered":"Data is a stakeholder"},"content":{"rendered":"<p style=\"margin-left: -1.95pt;\"><strong><em>Ready to learn Data Science? <a href=\"https:\/\/www.experfy.com\/training\/courses\">Browse courses<\/a>\u00a0like\u00a0<a href=\"https:\/\/www.experfy.com\/training\/tracks\/data-science-training-certification\">Data Science Training and Certification<\/a> developed by industry thought leaders and Experfy in Harvard Innovation Lab.<\/em><\/strong><\/p>\n<h4 style=\"margin-left: -1.3pt;\">Design, ethnography, and future-proof data\u00a0science<\/h4>\n<p>I never intended to be a data scientist. I graduated from college with a degree in cultural anthropology and a naive confidence that the type of research I wanted to do didn\u2019t require math. Through sheer luck, I fell into one of the few anthropology graduate programs in the United States that took math seriously, which changed my perspective a little but not a lot. When, for my doctoral fieldwork, I traveled to Central Asia to study identity narratives among religious, political, and social activists, I used a handful of standard statistical tools like principal component analysis and regression models alongside my decidedly non-mathy interviews and participant observation.<\/p>\n<p>I eventually got my PhD and landed a job doing strategic intelligence and planning with the U.S. Department of the Army. They hired me because I was an anthropologist, but it didn\u2019t take me long to see that there was a flood of data coming out of Afghanistan that very few people were really leveraging. Over time, I expanded the quantitative part of my toolkit to tackle data at larger and larger scale. I hacked myself into a data scientist because being a data scientist was the best way to do my job. Since that time, I\u2019ve built and managed data science capabilities in the public, private, and non-profit sectors, and across the security, travel, asset management, education, and marketing technology industries.<\/p>\n<p>When people learn about my background, they often ask \u201cdo you miss being an anthropologist?\u201d My answer to that question is always the same: \u201cI\u2019m still an anthropologist.\u201d People almost never ask me what I mean by that, but the more I think about it, the more I believe that what I mean says something important about how data science can remain a cohesive profession in face of increasing automation and public scrutiny.<\/p>\n<h3 style=\"margin-left: -1.2pt;\"><strong>Good implementation requires good\u00a0design<\/strong><\/h3>\n<p>As a profession, I think we\u2019ve devoted too little attention to the issue of design. The relatively few discussions of the topic I\u2019ve seen strike me as either too general or too specific. On the one hand, we seem to talk about \u201cgood design\u201d as something sort of indefinable \u2014 we can only know it when we see it \u2014 while at the same time we seem to shift too quickly from the importance of good design to a list of specific methods, metrics, and other technical tools necessary to turn the design into an implementation.<\/p>\n<p>Once, when my daughter was very little, she wanted a toy kitchen. I got it into my head to get old cabinets and hardware from a second-hand store and make her that kitchen set. In the end, the thing was held together with wood glue and wishful thinking. My daughter loved it, but it was a real piece of junk. My implementation suffered from two weaknesses.<\/p>\n<p>On the one hand, I only used tools that I already had, which meant five screw drivers that were all too large or too small, slip-joint pliers a mechanic had once left under the hood of my car after an oil change, a hammer I took from my dad\u2019s garage when I went to college, and a small bag of tool-like objects collected from IKEA furniture purchases. In short, my technical toolset stunk and I didn\u2019t really know how to use the few tools I did have.<\/p>\n<p>On the other hand, I really had no experience or competency at envisioning, in detail, what I wanted to build. I didn\u2019t know how to plan the thing as a whole, or consider contingencies, or adapt after mistakes, or anticipate the way the various pieces could or should or would go together.<\/p>\n<p>If I had had the biggest, best set of hardware in the world, I would have finished a little sooner and ended up with a marginally better product. Not knowing how to implement was certainly a big problem, but not knowing how to design was what really held me back.<\/p>\n<p>I think design skills are particularly important to data science as a discipline at this particular moment in time, because how much of the technical side of data science will eventually be automated is still very much an open question. Training and validating models is an important part of the job. Many of the models I have trained and validated over the course of my career could have been automated today in whole or in part by services like\u00a0<a href=\"https:\/\/www.datarobot.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">DataRobot<\/a>. Design skills, and how we learn those skills, deserve more attention than we, as a profession, have given them to date.<\/p>\n<p>The word \u201cdesign\u201d has the same root as \u201cdesignate\u201d: the word \u201csignum\u201d, which means \u201ca mark\u201d \u2014 something you write or place or cut that distinguishes one thing from another. In other words, \u201cdesign\u201d is the practice of taking a particular goal and distinguishing all of that goal\u2019s relevant tasks, materials, and possibilities one from another. Certain problems are challenging to conceptually organize in this way, regardless of how difficult the technical implementation may or may not be. I think those problems are a big part of data science\u2019s future.<\/p>\n<h3 style=\"margin-left: -1.2pt;\"><strong>Design as negotiated story-telling<\/strong><\/h3>\n<p>In my experience, design skills are harder for people to learn than other non-technical skills, like \u201cdomain expertise\u201d or \u201ccommunication skills\u201d. I think that\u2019s because of how the learning of each skill happens:<\/p>\n<ul>\n<li>When you need to learn a subject domain, you put yourself in contact with people who are already familiar with the domain. Each time someone uses a word you don\u2019t know, or cites an unfamiliar example, or tells a story from their experience to explain a decision you don\u2019t understand, you gain domain expertise.<\/li>\n<li>When you need to learn communication skills, you communicate and then get feedback on what you could have done better. Each time you learn of a new tactic, watch someone demonstrate best practices, and then try to mimic those in your next attempt to communicate, you gain communication skills.<\/li>\n<\/ul>\n<p>In both cases, the skills themselves can be difficult to learn, but it\u2019s not particularly difficult to understand how to learn them. With design, it\u2019s different. It\u2019s difficult to learn, but it\u2019s also difficult to learn how to learn. That\u2019s because what makes someone competent in design skills is different from what makes someone competent in other, non-technical skill sets:<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 600px; height: 429px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/880\/1*1QevsDb3VDxWUff4X7VskA.png\" alt=\"experfy-blog\" \/><\/p>\n<p>(I\u2019m not going to talk about the \u201cvision\u201d part of the table. That\u2019s a topic for another post).<\/p>\n<p>It boils down to the extent to which you, versus other people, get to decide that particular piece of knowledge or a particular action is reasonable, acceptable, \u201cright\u201d \u2014 in a word, whether it is valid. With domain expertise, no one has that authority: if a thing happens within a domain, it\u2019s a part of the domain. In communication, others have the entire authority to decide what is valid: it doesn\u2019t matter how good you thought your presentation was if everyone else thought it was terrible. You can\u2019t say \u201cI don\u2019t care what everyone else says, I know I\u2019m a great communicator.\u201d That betrays a fundamental misunderstanding about\u00a0<a href=\"https:\/\/xkcd.com\/1028\/\" target=\"_blank\" rel=\"noopener noreferrer\">how communication works<\/a>.<\/p>\n<p>In design, both you and others hold the authority to determine validity. I don\u2019t mean that validity can be determined by either you or others. I mean validity has to be negotiated.<\/p>\n<p>Anthropologists tend to use a method called \u201cethnography\u201d as a way of negotiating design validity. The method gets characterized in many different ways, but I find it useful to think about ethnography as requiring you to go into situations that are more-or-less foreign to you and to put your own intuition and assumptions to sleep. You get input from \u201cnatives\u201d about how they view those situations, and then you wake your own intuition back up in order to translate what you\u2019ve learned into something that looks like a coherent and reasonable story to you. You then take that story back and revise it until it seems like a coherent and reasonable story to them too. It\u2019s iterated, negotiated story-telling.<\/p>\n<p>As an anthropologist, when I have a research question, I find someone who will talk to me about it. I have them tell me how the world works according to them, take notes, and eventually I craft a story that explains the assumptions and context and circumstances that make their story reasonable and logical to me. Then I share that story with them. Then they tell me I\u2019m full of crap.<\/p>\n<p>I don\u2019t think I\u2019ve ever had anyone recognize their story the first time I had tried to retell it. Their intuition and my intuition were just too far apart. I get detailed criticism, then I recraft the story. A few iterations of this, and I get a response along the lines of: \u201cwell, that\u2019s not exactly how I\u2019d put it, but I think you got all the major points right.\u201d Then I find someone else to talk to. This time, I start by telling the story I\u2019ve crafted \u2014 tell them that was how the world works. Then they tell me I\u2019m wrong, and I ask them to put me right.<\/p>\n<p>Over time, as I work with more people and deliberately seek out those who had exposure or perspectives that past contributors didn\u2019t have, I craft a story that most contributors recognize as reasonable. Sometimes the story splits into more than one story, because different groups of people fundamentally disagree. More often than not, I end up with one main story that tells most of the picture, and then several satellite stories that tell important variations.<\/p>\n<p>I don\u2019t just compile information. I fill in the gaps between individual contributors\u2019 stories. Often contributors themselves disagree (sometimes vehemently) with each other\u2019s stories, but agree with my bridge. I defend my contributions to the story as much as they defend theirs. It\u2019s not enough for me to recognize a perspective as simply being there, and it\u2019s not enough for others to tell me what worked and what didn\u2019t. I needed to negotiate a story that blends how I view the world and how others view it, and the results need to be accepted as a whole even if it is hard to swallow in parts.<\/p>\n<p>Good design comes from negotiating stories. There are learnable skills, beyond communication skills, that make a person a good designer.<\/p>\n<h3 style=\"margin-left: -1.2pt;\"><strong>Give data a\u00a0voice<\/strong><\/h3>\n<p>Now, the idea of design as the outcome of negotiation among various stakeholders is certainly nothing new. In fact, that kind of iterative setting of priorities is at the core of agile approaches to software design. However, I think design in data science differs from design in most other engineering contexts, in that the data itself needs to have a voice. I mean that in largely the same way that I mean each stakeholder in a project needs a voice. In a healthy design environment, stakeholders don\u2019t just give input only at the times and in the ways that the designer stipulates they should. They are able to contradict. They\u2019re able to raise doubts. They\u2019re able to add unanticipated insights. I believe it needs to be a core part of a data scientist\u2019s job to create tools that allow individual data stories to be told, even when no one specifically asks for the story.<\/p>\n<p>These storytelling tools need to be responsive because design is negotiated, and therefore iterative. When a stakeholder proposes a course of action to address a particular business problem, the storytelling tool set should be able to react to that: to show, in an intuitive way, some ways that data will look different if that course of action is carried out, and flag any cases where the story ceases to make sense. And it needs to do that quickly: if data is a drag on the conversation, it\u2019s going to be let out.<\/p>\n<p>Improved design skills are the key to solving one of data science\u2019s biggest current challenges: algorithmic bias. It\u2019s not hard to find examples of efficient and scalable technical solutions that suffer from serious design flaws. For example:<\/p>\n<ul>\n<li>An\u00a0<a href=\"https:\/\/www.bloomberg.com\/graphics\/2016-amazon-same-day\/\" target=\"_blank\" rel=\"noopener noreferrer\">investigation by Bloomberg<\/a>\u00a0into Amazon\u2019s offering of same-day delivery service revealed that poorer, more minority-populated zip codes were frequently excluded. The algorithm recommended same day delivery in places where Amazon already had a lot of Prime subscribers, which subscribers tend to be fewer in number in poorer areas.<\/li>\n<li><a href=\"https:\/\/www.wired.com\/story\/excerpt-from-automating-inequality\/\" target=\"_blank\" rel=\"noopener noreferrer\">A story by Wired<\/a>\u00a0showed how child-abuse protection algorithms tended to target poor families. The indicators of child abuse that the algorithm tried to take into account (\u201cnot having enough food, having inadequate or unsafe housing, lacking medical care, or leaving a child alone while you work\u201d, etc.) blurred the lines between neglect and poverty.<\/li>\n<li><a href=\"https:\/\/www.theguardian.com\/technology\/2015\/jul\/08\/women-less-likely-ads-high-paid-jobs-google-study\" target=\"_blank\" rel=\"noopener noreferrer\">Researchers from Carnegie Mellon<\/a>\u00a0created fake users that behaved in the same way and whose profiles only differed in their gender. Google\u2019s ad display system tended to show ads for highly-paid jobs much less often to the users whose profiles indicated that they were women.<\/li>\n<\/ul>\n<p>In none of the above cases is there any indication that race, wealth, or gender were explicitly targeted. The tools were built to answer a simple business question: how do we differentiate people, places or situations that interest us? The technical solution in each case, as far as we can tell, seems entirely reasonable: choose a set of proxy measures and flag the people places and situations that most frequently exhibit those proxy measures. However, in building these system, it does not seem that any set of tools was also built to answer the question of whether the flagged people, places, or situations systematically differed in any attributes form those that were not flagged? That is not a technical failure. It\u2019s a design failure.<\/p>\n<p>I think this issue of tooling-for-design\u00a0<a href=\"https:\/\/www.oreilly.com\/ideas\/machine-learning-tools-for-fairness-at-scale\" target=\"_blank\" rel=\"noopener noreferrer\">is largely unexplored territory<\/a>\u00a0in data science, although it\u2019s\u00a0<a href=\"https:\/\/www.theatlantic.com\/technology\/archive\/2017\/09\/saving-the-world-from-code\/540393\/\" target=\"_blank\" rel=\"noopener noreferrer\">starting to receive a little more attention<\/a>\u00a0in software development. At any rate, it\u2019s an area that is much more robust against the threat of automation than are the technical skill sets that are often promoted as the core of what defines the profession.<\/p>\n<h3 style=\"margin-left: -1.2pt;\"><strong>Example 1: parcel geometry overlap resolution<\/strong><\/h3>\n<p><a href=\"https:\/\/www.valassisdigital.com\/us\/\" target=\"_blank\" rel=\"noopener noreferrer\">Valassis Digital<\/a>\u00a0collects location data from mobile devices and uses that data to help companies more precisely market to their target audiences. A lot of my work with that data touches on the issue of location semantics: we find that certain devices tend to cluster in a particular geographic space, but we want to understand\u00a0<em>why<\/em>\u00a0they cluster there. One way to try to understand location semantics is through parcel geometries. One way to understand why a device might appear in a particular place is to look at who owns that particular place. That kind of data is obtainable through assessor\u2019s offices across the United States and Canada. We get a geometry definition that defines the borders of the property, and by looking at the metadata associated with a geometry, we find way to attach meaning to geolocation data.<\/p>\n<p>Parcel geometry quality varies widely. One difficulty we\u2019ve encountered in using that data is that sometimes parcel geometries overlap. We don\u2019t have a great understanding of why the overlap happens: in some cases, it looks like a simple matter of sloppy record keeping; in other cases, two definitions of property (personal property vs city right-of-ways, for example) create situations where parcels might legitimately overlap. For most of our purposes, it\u2019s helpful for one piece of land to be associate with one and only one parcel. Therefore, I\u2019ve had to work to detect that overlap, which is easy, and automatically resolve it, which is harder. Consider the following case:<\/p>\n<p>This is fairly typical of the overlap we encounter in our parcel data. The blue polygon represents one parcel, and the yellow outline represents another. Seeing this example, several relatively simple technical solutions suggest themselves. We could assign the overlapping area to the larger of the two parcels, or to the smaller of the two, for example. The solution I finally arrived at was to difference the area of overlap from both parcels to create a simulation of what each parcel would look like if it lost its overlap. I then checked each simulation against size and shape thresholds. If a simulated parcel was smaller or more narrow than what we typically saw in useful residential and commercial parcels (the two kinds of land that we are most interested in), I simply removed the shape associated with the unsuccessful simulation. If neither shape resulted in an unsuccessful simulation, I assigned the overlap area to the larger shape. If both resulted in an unsuccessful simulation, I merged the two shapes together.<\/p>\n<p>Once I arrived at the above process for resolving parcel geometry overlap, I went through and did some simple descriptive analysis of the results, noting how many parcels were removed, edited, merged, etc. I also went through dozens of images like the one above to confirm that the results looked reasonable. In other words, I did some normal checking of results. However, I continued to overlay satellite images with polygons \u2014 a few thousand of them, and I built a tool show the results of my algorithm alongside the original breakdown. It was tedious. But we have millions of parcel geometries, and I\u2019m used to seeing new stories emerge unexpectedly, even after long periods of hearing the same story over and over again. The result:<\/p>\n<p>In the image above, an assessor took a multi story building and created a parcel geometry for each apartment on each floor. They\u2019re all stacked. My simple algorithm made a huge mess of this situation: it effectively ended up removing the entire building because every parcel ended up becoming too narrow or two small when compared to some other parcel. This was a special case of parcel overlap that had to be separately detected and dealt with before running the core algorithm. In this case, I looked for stacking (sum the areas of each parcel and divide by the total area that all the parcels cover together \u2014 anything over 1.0 indicates stacking), and in cases where I detected it, I merged all of the stacked parcels together into a single building. That should work fine, right?<\/p>\n<p>Look closely at the border lines in the above image. Nearly all of the lines are doubled. In fact, this entire area is represented by two sets of parcels, although the sets are not identical \u2014 one is offset a little from the other, but the shapes are also slightly different in many of the cases. My stacking detection resulted in this entire subdivision being merged into a single building \u2014 not an optimal result. In this cases, I needed to check for stacking as before, but then check for redundant layers. I created a network where each parcel was a node and each edge was a case of parcels touching \u2014 not overlapping. I then took the largest set of connected components from this network and used that as a base layer. It then went through every other set of connected components, deleting the new layer if it overlapped above a certain threshold, and merging it if it didn\u2019t.<\/p>\n<p>All of the above is a simple example, but it illustrates the benefits \u2014 and sometimes the necessity \u2014 of creating a way for a data set to react to an intended algorithm. It required me to create a new tool set that highlighted areas of overlap so I could more intuitively see problem areas, and it required me to keep \u201casking\u201d portions of the dataset whether my implementation made sense. Most of the images I looked at indicated no problem at all with my basic algorithm. It was only after many hours of searching that I started to discover the gaps.<\/p>\n<h3 style=\"margin-left: -1.2pt;\"><strong>Example 2: location precision calibration<\/strong><\/h3>\n<p>We get hundreds of millions of geo-located records a day. However, not all of that data can be taken at face value. Among other problems, specific ISPs, apps, devices, or particular device configurations can report data in the form of pinpoint coordinates when in fact the coordinates aren\u2019t a pinpoint. There are a number of reasons for this, only one of which is that sources sometimes rely on high-level records that encode an entire area as a single set of coordinates, as one unfortunate family at the geographic center of the contiguous United States\u00a0<a href=\"https:\/\/splinternews.com\/how-an-internet-mapping-glitch-turned-a-random-kansas-f-1793856052\" target=\"_blank\" rel=\"noopener noreferrer\">has felt for over ten years<\/a>.<\/p>\n<p>So, at the start, the problem seemed pretty straightforward: flag coordinates that are attached to an unusually high number of unique devices. For example, if most of the coordinate pairs in an area see between one and ten unique devices, but one coordinate pair has 1,527 unique devices, then that one coordinate pair is probably being used as a default location. This zoom-in of the top 20 coordinate pairs in one area, ordered by number of unique devices, shows what that looks like.<\/p>\n<p>Incidentally, I omitted all coordinates with fewer than three decimal places precision, and rounded all other coordinates to six decimal places. Six decimal places represents a space with diameter of about 4 inches, which is around the limits of what GPS can offer in terms of precision without some substantial triangulation.<\/p>\n<p>There\u2019s an interesting analytic and business question here. The red line in the graph above represents a example of how I could set a threshold for filtering out coordinates. You can clearly see the jump in number of devices, but you can see at least two similarly-sized jumps further down the line. What should be considered an extreme value rather than simply a very busy spot?<\/p>\n<p>There is no good technical answer to that question. There are a variety of methods for doing changepoint detection, but all of those methods naturally have parameters that need to be tuned. And, as far as I\u2019m aware, none of them tell you how many change points to pay attention to. What if you see at least three large jumps like we do in the graph above. Is the first one the cutoff for identifying the fake location data? The last one? Even in cases where there is only one changepoint, that only give us an unambiguous course of action in those situations (and there were many of them) where the changepoint separates the very largest datapoint from all of the rest. That matches our naive conception of how this problem of default locations works. But I found cases where the change point separates many, sometimes dozens of data points from the rest.<\/p>\n<p>There\u2019s no clear technical solution to the problem, and the subject domain doesn\u2019t give us clear guidance \u2014 because we don\u2019t know the system that is generating the hotspots, we don\u2019t know how many are plausible. This is a design challenge.<\/p>\n<p>This is where data needs a voice, and with geo-coordinate data it\u2019s not very hard to give data a voice. One person in a location during any given day is normal \u2014 that fits right in with our assumptions about people with mobile devices. And if we look at the location data we got from one day at Penn Station in New York City, it\u2019s about what we would expect:<\/p>\n<p>There\u2019s a high density of coordinates in the middle of Madison Square Garden, which is what we would expect since Penn Station\u2019s main Amtrak waiting area is right there just below ground level. If we filter this down to only those coordinates where many people connected at the same location within a single day, we get a different picture:<\/p>\n<p>We still see a number of hits in the Amtrak waiting area, many of them practically on top of one another. I\u2019m willing to bet most of those are spots right in front of a schedule board or right next to an electrical outlet. We also have several hits in between Madison Square Gardens and Two Penn Plaza (the rectangular building to the east), where, one level below the Amtrak level, is the Long Island Rail Road waiting area. We also have some on the other side of Two Penn Plaza, which is a major taxi pickup area. In other words, even though these coordinates were attached to more devices than we would normally expect (some of them had as many as 50 devices), the pattern is plausible.<\/p>\n<p>However, I was only able to recognize that plausibility by happening to know something about the area I was looking at. That\u2019s not scalable. So I took the basic map visualization functionality I\u2019d already put together, and incorporated it into a workflow that pulled shapefiles of US administrative divisions (states, counties, cities, and zip codes), matched up the device coordinates dataset with the shapes, and then used the frequency of devices-per-coordinate-pair to let the data say something about how normal a particular device count was.<\/p>\n<p>I divided the count of unique devices by the maximum count found within the boundaries of the administrative division I was looking at. In most cases, the grand majority of the hits were less than 1% of the maximum, so I took all of those that are 50% or more of the maximum. I then started paging through satellite images overlaid with the coordinates, looking for stories. Many looked like I expected them to \u2014 one or two points in the entire shape. Then I landed in a zip code on the outskirts of Fort Worth:<\/p>\n<p>It\u2019s a grid. There are a couple points that don\u2019t fit the grid, and there are naturally parts of the grid missing, but there\u2019s patterning there. Paging through more results, I found the same thing in zip codes in Boise, San Antonio, Colorado Springs. Here\u2019s one in Manhattan:<\/p>\n<p>Note that those circles I plotted on the image are much larger than the actual locations \u2014 I enlarged them for visibility. In reality, each circle is about four inches in diameter.<\/p>\n<p>This took me in a whole new direction. I worked up a simple algorithm to automatically identify the grids \u2014 I took all coordinates that had a lot of unique devices attached to them, calculated the angle of rotation from the x-axis for each coordinate pair, kept the top four angles from that process, and then filtered out all coordinates that weren\u2019t attached to another coordinate by at least one of those angles. Each circle on the following map represents a grid I discovered through that process:<\/p>\n<p>Here\u2019s a zoom in on one cluster of grids:<\/p>\n<p>Each node in the grid was a four-inch patch, separated from other nodes by at least one kilometer. The distances and angles differed for different places \u2014 it wasn\u2019t a single grid over the whole world. It captured a large number of the coordinates that had comparatively extreme counts of devices. And I never would have found it if I hadn\u2019t built a tool to allow the data to create stories on demand about what was normal and what wasn\u2019t.<\/p>\n<p>One other feature of these grids: nearly all of them involved weather apps. However those apps created far more coordinates that didn\u2019t fall within a grid than coordinates that did, and many of the apps involved in the coordinates weren\u2019t obviously weather-related. But it gave me the idea that this could maybe be a weather grid, and so I spent the better part of week researching and downloading weather grid projections from the U.S. National Weather Service. I finally found one that had roughly the same granularity as what I was seeing in my data (the device coordinates are in yellow, the weather grid in white):<\/p>\n<p>But the grid wouldn\u2019t match up. After staring at pictures like the one above for a very long time, I suddenly realized something obvious:<\/p>\n<p>The weather grid (white) is a proper grid \u2014 the nodes form right angles. The same is not true of the data grid (yellow). Because the National Weather Service only uses proper grids, by definition the pattern I was seeing in the data couldn\u2019t have come from the National Weather Service. I was stumped until, as often happens in ethnographic fieldwork, I had a random conversation on a seemingly unrelated topic that gave me some sudden insight. In this case, the conversation was with the front seat of my car:<\/p>\n<p>Notice the pattern in the fabric. Imagine centroids in the middle of each of those hexagons. I hadn\u2019t been able to find an appropriate weather grid because I had been searching for \u201cweather grid\u201d. I need to be searching for \u201cweather mesh\u201d. As soon as I did:<\/p>\n<p>Apparently the Weather Company (owned by IBM and owner of both the Weather Channel and Weather Underground)\u00a0collaborated with the climate group at Los Alamos\u00a0to develop a mesh that could be stretched to have higher resolution at some places in the world and shrunk to have lower resolution at others. Their implementation of that mesh is proprietary, and is possibly subject to change over time as their forecasting needs change. And the weather channel hosts an API that allows others to serve weather, which explains why so many other apps end up falling into the hexagon centroids as well. However, weather apps have tons of hits outside of these centroids, which suggests they only provide centroid data sometimes, perhaps only when location services are disabled on a device. It makes a whole lot of sense for the Weather Company to use some sort of a grid or mesh to decide what forecasts to serve users, and it makes a whole lot of sense for apps to simply report the centroids of those hexagons when reporting user location \u2014 a simple coordinate pair is all most of these reporting systems ask for. Here, understanding the design of the data-generation process enables the most accurate use of the data<\/p>\n<p>This series of discoveries changed the entire design of how I addressed this issue of location precision calibration:<\/p>\n<ol>\n<li>Look only at coordinates from specific weather apps<\/li>\n<li>Pick coordinates that have multiple devices attached to them.<\/li>\n<li>Partition the data into arbitrary groups of nearby coordinates (administrative shapes no longer necessary because now I know what I\u2019m looking for).<\/li>\n<li>Pick the top n-most common angles of rotation between coordinate pairs within a group, and pick the most frequent distance between coordinate pairs that exhibit each angle of rotation.<\/li>\n<li>Go through all coordinates (not just those with multiple devices), keeping only those that are connected by one of the selected angles of rotation, and are separated by roughly the typical distance for that angle of rotation.<\/li>\n<li>Build in an update process since the grid nodes can change over time.<\/li>\n<li>Look for outliers in administrative divisions using a standard changepoint detection approach, but only after the grid coordinates have been removed.<\/li>\n<\/ol>\n<p>By building tools to give data a voice \u2014 to show me in rough but intuitive terms what the coordinates looked like under different assumptions of how many devices was considered \u201cnormal\u201d \u2014 the problem changed from a purely technical challenge of parameterizing a changepoint analysis, to a design problem of how to accommodate a widespread but systematically imprecise location reporting system. And by removing the grid coordinates ahead of time, it made my changepoint detection much more straightforward \u2014 it substantially increased the states, counties, cities, and zip codes in which there was only one set of coordinates with an extreme device count. And, as a bonus, it flagged coordinates that couldn\u2019t be trusted (at least at the precision those coordinates gave), even though the device count at those coordinates wasn\u2019t extreme at all.<\/p>\n<h3 style=\"margin-left: -1.2pt;\"><strong>We need a good question-generating toolset<\/strong><\/h3>\n<p>Data science is currently very good at coming up with answers. It\u2019s not very good at coming up with questions. I believe that requires data scientists to pay more attention to building non-technical skills, but I think it also requires us to build more tools that facilitate that part of the process. In fact, building the tools will contribute, in large measure, to building the non-technical skills.<\/p>\n<p>The examples I gave used in this post was of a relatively easy use case, as geospatial data lends itself to straightforward, intuitive communication. Other kinds of data and other kinds of business problems will present more of a challenge. It\u2019s a challenge worth undertaking. A focus on design \u2014 not just being aware of it but actively building our design intuition, competency, and toolset \u2014 is the best chance data science has to survive the increasing automation of core technical functions, and the heightened scrutiny of a public increasingly skeptical of the idea that algorithms can ethical as well as scalable.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data science is currently very good at coming up with answers. It&rsquo;s not very good at coming up with questions. I believe that requires data scientists to pay more attention to building non-technical skills, but I think it also requires us to build more tools that facilitate that part of the process. In fact, building the tools will contribute, in large measure, to building the non-technical skills.<\/p><\/p>\n","protected":false},"author":238,"featured_media":2825,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[187],"tags":[94],"ppma_author":[1869],"class_list":["post-1285","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":1869,"user_id":238,"is_guest":0,"slug":"schaun-wheeler","display_name":"Schaun Wheeler","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"Wheeler","first_name":"Schaun","job_title":"","description":"Schaun Wheeler is a senior data scientist at Valassis Digital, an advertising and marketing intelligence company. After building data science capabilities for the U.S Department of the Army and working as a data scientist in an educational travel firm and then an asset management startup, he founded and directed the data science team for Success Academy Charter schools in New York City. His current work focuses on location semantics and his broader interests include data science ethics, analytic design, and the creation of humane HR practices in the tech industry. He is a formally trained anthropologist."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1285","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/238"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1285"}],"version-history":[{"count":0,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1285\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/2825"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1285"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1285"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1285"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1285"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}