Unlocking the Value of Open Data: A Case Study Using the New York State Healthcare Open Data

An estimation of the total amount of data existing in the digital universe today is 7.9 Zettabytes [1].

1 Zettabyte is 1 trillion gigabytes which is beyond comprehension of us mortals. An estimation of the total number of stars in our galaxy, the Milky Way, is 300 billion. The total number of gigabytes today is more than 2000 times the number of total stars in our galaxy (to get a feeling of a gigabyte you could say that a good quality movie file could be close to 1 GB). The digital universe size is poised to grow 5 times to a total of 35 Zettabytes (10,000 times the number of stars in our galaxy) by 2020.

So when people talk about big data and explosion of data they really mean it. But it is one thing to have data and another thing to be able to extract value from data. Crude datasets are like crude oil. You need to process them to get value. Nevertheless, one could say that we are sitting on gold mines and we only need the miners to harvest the value. Let’s take, for example, open data. Does it have value or not?

Open data describe large datasets that entities (usually governments) release online and free of charge for anyone to analyze for any purpose. While there is a huge debate regarding the pros and cons of this approach, nobody can doubt the value of these data. A plethora of applications has emerged from those data along with the added benefit of transparency into the government itself.

To demonstrate the value one can mine from data, I analyzed the Hospital Inpatient Discharges (SPARCS De-Identified) of 2012 released from New York State as an open dataset [2].

This dataset contains data regarding hospital discharges (when a patient is released from the hospital) along with demographics, etiology of admission, procedures (if any), severity and cost data. The file is close to 1 gigabyte in size, which is relatively small in terms of the files I usually work with, but as we will see there is quite a bit of value to mine even in 1 gigabyte. One thing with data is that you spend a lot of time cleaning the files (there is actually a pareto law paradigm stating that 80% of your time as a data scientist is spent on cleaning and transforming your data while only 20% is spent on getting results and visualizing). So it might take a while for you to clean and munge your data, as we say, but it is like hunting for gold, where the value far outweighs the effort, however arduous.

First contact with data is always a visual exploration to check trends, outliers and get a general impression and feel for the dataset. Let’s take a look at the age of inpatients vs. the total number of admissions.

We can easily see that most of the hospitalizations are for newborns and 50+ year olds.

Let’s take a look into the male/female distribution. Interestingly enough, female discharges are higher.

For the sake of space, we will look into only three diseases, create some insights for demonstration purposes, and show a process that can easily be generalized for the rest of the diseases.

COPD

Chronic obstructive pulmonary disease (COPD) is a type of obstructive lung disease characterized by chronically poor airflow with main symptoms that include shortness of breath, cough, and sputum production.

Let’s take a look into the COPD age distribution.

What is the distribution of the length of stay? Most of the admissions are between 2-4 days.

What about the geolocation of COPD discharges based on the first three numbers of the zip code (rest of zip codes are truncated for anonymization)? Most of the hospitalizations are of course in New York City, but it is interesting to see a high incidence of COPD discharges outside of NYC. One can create interactive maps online relatively easily for the data stakeholders to be able to “geo – visualize” their data.

What about Chronic Heart Insufficiency, a condition of the heart where the heart muscle does not work efficiently in pumping blood? Can we compare the number of discharges of CHF with those of COPD? It is obvious that for patients older than 70 the incidence of CHF is almost double in comparison with COPD.

Length of stay is more or less the same as COPD.

Now, let’s add Diabetes into the analysis and continue on from there.

Cost Data for the Three Diseases

Clearly CHF is the costliest disease among the three, with average cost per admission nearly twice as high as COPD.

More specifically, we notice that CHF, apart from being very expensive in total, in average and in maximum cost for one hospitalization, also has the highest standard deviation among the three diseases.

Costs of Hospitalization

	COPD	CHF	DIABETES
Mean	$25,779	$42,549	$29,404
Total	$0.958 B	$2,425 B	$1,35 B
STD	$34,534	$84,772	$45,483
25%	$9,971	$12,389	$9,336
50%	$17,329	$22,877	$16,918
75%	$30,243	$44,266	$32,061
max	$1,665,430	$4,214,537	$2,141,412

Let’s take a look into the power that predictive and preventive medicine can have on cost reduction.

Let’s say that we define the vulnerable patients and we find a magic way to reduce emergency admissions by 10%. What would be the cost savings for New York State only?

$203M for CHF, $116M for Diabetes, and $83.9M for COPD.

Is there any correlation between average cost per admissions and day of the week for 2012?

Generally speaking, there shouldn’t be any over the course of the year. But there is no free lunch in data science, so let’s dig deeper into the data to uncover the real answer.

COPD: Average Cost per Admission by Day of Week

Admissions for emergencies surprisingly cost on average the same amount for each day of the week.

Elective admissions have an almost 50% higher average cost on Wednesday but very low costs on weekends, as compared with emergency admissions.

CHF: Average Cost per Admission by Day of Week

Elective admissions cost more on average (probably due to lots of tests). We also watch this peak on electives on Wednesday.

Diabetes: Average Cost per Admission by Day of Week

There is a peak of emergency admission costs on Sunday followed by low average cost on Monday and Friday elective admissions. One guess is that some emergency admissions from Sunday worsen and account for some of the elective admissions on Monday. Moreover, there is likely a reduction in elective admissions on Friday because people don’t want to spend the weekend in the hospital.

Last but not least, we can take a look at what the insurance companies paid on average per hospitalization by type of admission (elective/emergency) for each disease.

COPD: Average Payment per Type of Admission per Insurance Coverage

Source of Payment (rows) / Type of Admission (columns)	Elective	Emergency
Blue Cross	$32,290	$24,371
CHAMPUS	$9,912	$22,182
Insurance Company	$23,025	$25,607
Medicaid	$13,614	$21,689
Medicare	$25,057	$26,875
Other Federal Program	$12,329	$13,947
Other Non-Federal Program	$25,990	$25,915
Self-Pay	$22,594	$27,032
Unknown	$3,739	$17,353
Workers Compensation	$43,622	$25,436

CHF: Average Payment per Type of Admission per Insurance Coverage

Source of Payment (rows) / Type of Admission (columns)	Elective	Emergency
Blue Cross	$78,158	$43,927
CHAMPUS	$49,044	$37,291
Insurance Company	$93,328	$47,836
Medicaid	$75,460	$39,134
Medicare	$68,665	$37,911
Other Federal Program	$7,058	$23,907
Other Non-Federal Program	$36,672	$31,150
Self-Pay	$38,606	$34,050
Workers Compensation	$75,141	$42,851

Diabetes: Average Payment per Type of Admission per Insurance Coverage

Source of Payment (rows) / Type of Admission (columns)	Elective	Emergency
Blue Cross	$27,257	$21,206
Insurance Company	$32,544	$30,226
Medicaid	$23,405	$38,076
Medicare	$23,861	$30,567
Other Non-Federal Program	$56,586	$33,893

And what were the 20 top hospitalizations in terms of cost?

	Disease	Cost in $
1.	Eye Infection	7,066,436
2.	Surgcl/Med Care Compl	6,286,622
3.	Aspiration Pneumonitis	6,230,015
4.	Coag/Hemrrge Disorder	6,196,974
5.	Septicemia	5,166,411
6.	Liveborn	4,971,831
7.	Liveborn	4,953,934
8.	Burns	4,877,072
9.	HIV Infection	4,839,726
10.	Septicemia	4,511,673
11.	Encephalitis	4,362,477
12.	HIV Infection	4,324,821
13.	CHF	4,214,537
14.	Leukemias	4,196,908
15.	CHF	4,138,746
16.	HIV Infection	4,052,052
17.	Liveborn	3,975,798
18.	Liveborn	3,961,028
19.	Adult Respiratory Failure	3,947,940
20.	Anemia	3,891,765

We notice that 5 of them are related to births, 2 to CHF, and 3 to HIV infections.

It seems that with this dataset we might have enough data to make a prediction model for the probable cost of an admission based on gender, age, disease, area, length of stay and insurance type for the region of New York.

I’m guessing that there is a lot more value to be extracted from a dataset like this.

But that is a topic for another blog post….

[1] http://www.csc.com/insights/flxwd/78931-big_data_universe_beginning_to_explode

[2] (https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/u4ud-w55t)