{"id":1112,"date":"2019-02-15T10:31:58","date_gmt":"2019-02-15T10:31:58","guid":{"rendered":"http:\/\/kusuaks7\/?p=717"},"modified":"2023-08-03T10:16:14","modified_gmt":"2023-08-03T10:16:14","slug":"unlocking-the-value-of-open-data-a-case-study-using-the-ny-state-healthcare-open-data-be1dd8f6-7141-4986-aa9f-00e52f1700bc","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/health-tech\/unlocking-the-value-of-open-data-a-case-study-using-the-ny-state-healthcare-open-data-be1dd8f6-7141-4986-aa9f-00e52f1700bc\/","title":{"rendered":"Unlocking the Value of Open Data: A Case Study Using the New York State Healthcare Open Data"},"content":{"rendered":"<p><span style=\"font-family: arial,helvetica,sans-serif;\">The total amount of data in the digital universe today is estimated at 7.9 zettabytes [1].<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">One zettabyte is 1 trillion gigabytes, which is beyond the comprehension of us mortals. Our galaxy, the Milky Way, contains an estimated 300 billion stars, so the total number of gigabytes today is roughly 26 times the number of stars in our galaxy (for a feel of a gigabyte, a good-quality movie file is close to 1 GB). The digital universe is poised to grow more than fourfold, to a total of 35 zettabytes (more than 100 times the number of stars in our galaxy), by 2020.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">So when people talk about big data and the explosion of data, they really mean it. But it is one thing to have data and another to be able to extract value from it. Crude datasets are like crude oil: you need to process them to get value. Nevertheless, one could say that we are sitting on gold mines and only need the miners to harvest the value. Let&#8217;s take, for example, open data. 
Does it\u00a0have value or not?<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Open data describe large datasets that entities (usually governments) release online and free of charge for anyone to analyze for any purpose. While there is a huge debate regarding the pros and cons of this approach, nobody can doubt the value of\u00a0these data. A plethora of applications has\u00a0emerged from those data along with the added benefit of transparency into the government itself.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">To demonstrate the value one can mine from data, I analyzed the Hospital Inpatient Discharges (SPARCS De-Identified) of 2012 released by New York State as an open dataset [2].<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">This dataset contains data regarding hospital discharges (when a patient is released from the hospital) along with demographics, etiology of admission, procedures (if any), severity and cost data. The file is close to 1 gigabyte\u00a0in size, which is relatively small compared with the files I usually work with, but as we will see there is quite a bit of value to mine even in 1 gigabyte. One thing with data is that you spend a lot of time cleaning the files (a common Pareto-style rule of thumb holds that\u00a080% of your time as a data scientist is spent cleaning and transforming your data, while only 20% is spent getting results and visualizing them). So it might take a while for you to clean and munge your data, as we say, but it is like hunting for gold,\u00a0where the value far outweighs the effort, however arduous.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">First contact with data is always a visual exploration to check trends and outliers and to get a general impression of and feel for\u00a0the dataset. Let\u2019s take a look at the age of inpatients vs. 
the total number of admissions.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">We can easily see that most of the hospitalizations are\u00a0for newborns and people aged 50 and over.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Let\u2019s take a look into the male\/female distribution. Interestingly enough, female discharges are higher.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">For the sake of space, we will look into <\/span><span style=\"font-family: arial, helvetica, sans-serif; line-height: 20.8px;\">only<\/span><span style=\"font-family: arial,helvetica,sans-serif;\">\u00a0three diseases, create some insights for demonstration purposes, and show a process that can easily be generalized for the rest of the diseases.<\/span><\/p>\n<h3><span style=\"font-family: arial,helvetica,sans-serif;\">COPD<\/span><\/h3>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Chronic obstructive pulmonary disease (COPD) is a type of obstructive lung disease characterized by chronically poor airflow with main symptoms that include shortness of breath, cough, and sputum production.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Let\u2019s take a look into the COPD age distribution.\u00a0<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">What is the distribution of the length of stay? Most stays last between 2 and 4 days.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">What about the geolocation of COPD discharges based on the first three digits of the zip code (the remaining digits are truncated for anonymization)? Most of the hospitalizations are of\u00a0course\u00a0in New York City, but it is interesting to see a high incidence of COPD discharges outside of NYC. 
One can create interactive maps online relatively easily so that data stakeholders can &#8220;geo-visualize&#8221; their data.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">What about congestive heart failure (CHF), a condition in which the heart muscle does not pump blood efficiently? Can we compare the number of discharges of CHF with those of COPD? For patients older than 70, the number of CHF discharges is almost double that of COPD.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Length of stay is more or less the same as for COPD.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Now, let\u2019s add Diabetes into the analysis and continue on from there.<\/span><\/p>\n<h3><span style=\"font-family: arial,helvetica,sans-serif;\"><strong>Cost Data for the Three Diseases<\/strong><\/span><\/h3>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Clearly CHF is the costliest disease among the three, with an average cost per admission about 65% higher than that of COPD.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">More specifically, we notice that CHF, apart from being very expensive in total, on average, and in maximum cost for one hospitalization, also has the highest standard deviation among the three diseases.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\"><strong>Costs of Hospitalization<\/strong><\/span><\/p>\n<table style=\"width: 500px;\" border=\"1\" cellspacing=\"1\" cellpadding=\"1\">\n<tbody>\n<tr>\n<td><\/td>\n<td><strong>COPD<\/strong><\/td>\n<td><strong>CHF<\/strong><\/td>\n<td><strong>DIABETES<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Mean<\/td>\n<td>$25,779<\/td>\n<td>$42,549<\/td>\n<td>$29,404<\/td>\n<\/tr>\n<tr>\n<td>Total<\/td>\n<td>$0.958 B<\/td>\n<td>$2.425 B<\/td>\n<td>$1.35 
B<\/td>\n<\/tr>\n<tr>\n<td>STD<\/td>\n<td>$34,534<\/td>\n<td>$84,772<\/td>\n<td>$45,483<\/td>\n<\/tr>\n<tr>\n<td>25%<\/td>\n<td>$9,971<\/td>\n<td>$12,389<\/td>\n<td>$9,336<\/td>\n<\/tr>\n<tr>\n<td>50%<\/td>\n<td>$17,329<\/td>\n<td>$22,877<\/td>\n<td>$16,918<\/td>\n<\/tr>\n<tr>\n<td>75%<\/td>\n<td>$30,243<\/td>\n<td>$44,266<\/td>\n<td>$32,061<\/td>\n<\/tr>\n<tr>\n<td>max<\/td>\n<td>$1,665,430<\/td>\n<td>$4,214,537<\/td>\n<td>$2,141,412<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Let\u2019s take a look into the power that predictive and preventive medicine can have on cost reduction.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Let\u2019s say that we identify the vulnerable patients and find a magic way to reduce emergency admissions by 10%. What would be the cost savings for New York State alone?<\/span><\/p>\n<p>$203M for CHF, $116M for Diabetes, and $83.9M for COPD.<\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Is there any correlation between average cost per admission and day of the week for 2012?<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Generally speaking, there shouldn\u2019t be any over the course of the year. 
But there is no free lunch in data science, so let\u2019s dig deeper into the data to uncover\u00a0the real answer.<\/span><\/p>\n<h3><span style=\"font-family: arial,helvetica,sans-serif;\"><strong>COPD: Average\u00a0Cost per\u00a0Admission by\u00a0Day\u00a0of\u00a0Week<\/strong><\/span><\/h3>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Surprisingly, emergency admissions cost about the same on average on each day of the week.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Compared with emergency admissions, elective admissions have an almost 50% higher average cost on Wednesday but very low costs on weekends.<\/span><\/p>\n<h3><span style=\"font-family: arial,helvetica,sans-serif;\"><strong>CHF: Average\u00a0Cost per\u00a0Admission by\u00a0Day\u00a0of\u00a0Week<\/strong><\/span><\/h3>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Elective admissions cost more on average (probably due to lots of tests). We again see a peak for elective admissions on Wednesday.<\/span><\/p>\n<h3><span style=\"font-family: arial,helvetica,sans-serif;\"><strong>Diabetes: Average\u00a0Cost per\u00a0Admission by\u00a0Day\u00a0of\u00a0Week<\/strong><\/span><\/h3>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Emergency admission costs peak on Sunday, while elective admissions show a low average cost on Monday and Friday. One guess is that some emergency admissions from Sunday worsen and account for some of the elective admissions on Monday. 
Moreover, there is likely a reduction in\u00a0elective admissions on Friday because people don\u2019t want to spend the weekend in the hospital.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">Last but not least, we can take a look at\u00a0what the insurance companies paid on average per hospitalization by type of admission (elective\/emergency) for each disease.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><span style=\"font-family: arial,helvetica,sans-serif;\"><strong>COPD: Average Payment per Type of Admission per Insurance Coverage<\/strong><\/span><\/h3>\n<table style=\"width: 500px;\" border=\"1\" cellspacing=\"1\" cellpadding=\"1\">\n<tbody>\n<tr>\n<td>Source of Payment (rows) \/ Type of Admission (columns)<\/td>\n<td>Elective<\/td>\n<td>Emergency<\/td>\n<\/tr>\n<tr>\n<td>Blue Cross<\/td>\n<td>$32,290<\/td>\n<td>$24,371<\/td>\n<\/tr>\n<tr>\n<td>CHAMPUS<\/td>\n<td>$9,912<\/td>\n<td>$22,182<\/td>\n<\/tr>\n<tr>\n<td>Insurance Company<\/td>\n<td>$23,025<\/td>\n<td>$25,607<\/td>\n<\/tr>\n<tr>\n<td>Medicaid<\/td>\n<td>$13,614<\/td>\n<td>$21,689<\/td>\n<\/tr>\n<tr>\n<td>Medicare<\/td>\n<td>$25,057<\/td>\n<td>$26,875<\/td>\n<\/tr>\n<tr>\n<td>Other Federal Program<\/td>\n<td>$12,329<\/td>\n<td>$13,947<\/td>\n<\/tr>\n<tr>\n<td>Other Non-Federal Program<\/td>\n<td>$25,990<\/td>\n<td>$25,915<\/td>\n<\/tr>\n<tr>\n<td>Self-Pay<\/td>\n<td>$22,594<\/td>\n<td>$27,032<\/td>\n<\/tr>\n<tr>\n<td>Unknown<\/td>\n<td>$3,739<\/td>\n<td>$17,353<\/td>\n<\/tr>\n<tr>\n<td>Workers Compensation<\/td>\n<td>$43,622<\/td>\n<td>$25,436<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3><\/h3>\n<h3><span style=\"font-family: arial,helvetica,sans-serif;\"><strong>CHF:\u00a0Average Payment per Type of Admission per Insurance Coverage<\/strong><\/span><\/h3>\n<table style=\"width: 500px;\" border=\"1\" cellspacing=\"1\" cellpadding=\"1\">\n<tbody>\n<tr>\n<td><span style=\"line-height: 20.7999992370605px;\">Source of Payment (rows) \/ Type of Admission 
(columns)<\/span><\/td>\n<td>Elective<\/td>\n<td>Emergency<\/td>\n<\/tr>\n<tr>\n<td>Blue Cross<\/td>\n<td>$78,158<\/td>\n<td>$43,927<\/td>\n<\/tr>\n<tr>\n<td>CHAMPUS<\/td>\n<td>$49,044<\/td>\n<td>$37,291<\/td>\n<\/tr>\n<tr>\n<td>Insurance Company<\/td>\n<td>$93,328<\/td>\n<td>$47,836<\/td>\n<\/tr>\n<tr>\n<td>Medicaid<\/td>\n<td>$75,460<\/td>\n<td>$39,134<\/td>\n<\/tr>\n<tr>\n<td>Medicare<\/td>\n<td>$68,665<\/td>\n<td>$37,911<\/td>\n<\/tr>\n<tr>\n<td>Other Federal Program<\/td>\n<td>$7,058<\/td>\n<td>$23,907<\/td>\n<\/tr>\n<tr>\n<td>Other Non-Federal Program<\/td>\n<td>$36,672<\/td>\n<td>$31,150<\/td>\n<\/tr>\n<tr>\n<td>Self-Pay<\/td>\n<td>$38,606<\/td>\n<td>$34,050<\/td>\n<\/tr>\n<tr>\n<td>Workers Compensation<\/td>\n<td>$75,141<\/td>\n<td>$42,851<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3><\/h3>\n<h3><strong>Diabetes:\u00a0Average Payment per Type of Admission per Insurance Coverage<\/strong><\/h3>\n<table style=\"width: 500px;\" border=\"1\" cellspacing=\"1\" cellpadding=\"1\">\n<tbody>\n<tr>\n<td><span style=\"line-height: 20.7999992370605px;\">Source of Payment (rows) \/ Type of Admission (columns)<\/span><\/td>\n<td>Elective<\/td>\n<td>Emergency<\/td>\n<\/tr>\n<tr>\n<td>Blue Cross<\/td>\n<td>$27,257<\/td>\n<td>$21,206<\/td>\n<\/tr>\n<tr>\n<td>Insurance Company<\/td>\n<td>$32,544<\/td>\n<td>$30,226<\/td>\n<\/tr>\n<tr>\n<td>Medicaid<\/td>\n<td>$23,405<\/td>\n<td>$38,076<\/td>\n<\/tr>\n<tr>\n<td>Medicare<\/td>\n<td>$23,861<\/td>\n<td>$30,567<\/td>\n<\/tr>\n<tr>\n<td>Other Non-Federal Program<\/td>\n<td>$56,586<\/td>\n<td>$33,893<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3><\/h3>\n<h3><strong>And what were the 20 top hospitalizations in terms of cost?<\/strong><\/h3>\n<table style=\"width: 500px;\" border=\"1\" cellspacing=\"1\" cellpadding=\"1\">\n<tbody>\n<tr>\n<td><\/td>\n<td>Disease<\/td>\n<td>Cost in $<\/td>\n<\/tr>\n<tr>\n<td>1.<\/td>\n<td>Eye Infection<\/td>\n<td>7,066,436<\/td>\n<\/tr>\n<tr>\n<td>2.<\/td>\n<td>Surgcl\/Med Care 
Compl<\/td>\n<td>6,286,622<\/td>\n<\/tr>\n<tr>\n<td>3.<\/td>\n<td>Aspiration Pneumonitis<\/td>\n<td>6,230,015<\/td>\n<\/tr>\n<tr>\n<td>4.<\/td>\n<td>Coag\/Hemrrge Disorder<\/td>\n<td>6,196,974<\/td>\n<\/tr>\n<tr>\n<td>5.<\/td>\n<td>Septicemia<\/td>\n<td>5,166,411<\/td>\n<\/tr>\n<tr>\n<td>6.<\/td>\n<td>Liveborn<\/td>\n<td>4,971,831<\/td>\n<\/tr>\n<tr>\n<td>7.<\/td>\n<td>Liveborn<\/td>\n<td>4,953,934<\/td>\n<\/tr>\n<tr>\n<td>8.<\/td>\n<td>Burns<\/td>\n<td>4,877,072<\/td>\n<\/tr>\n<tr>\n<td>9.<\/td>\n<td>HIV Infection<\/td>\n<td>4,839,726<\/td>\n<\/tr>\n<tr>\n<td>10.<\/td>\n<td>Septicemia<\/td>\n<td>4,511,673<\/td>\n<\/tr>\n<tr>\n<td>11.<\/td>\n<td>Encephalitis<\/td>\n<td>4,362,477<\/td>\n<\/tr>\n<tr>\n<td>12.<\/td>\n<td>HIV Infection<\/td>\n<td>4,324,821<\/td>\n<\/tr>\n<tr>\n<td>13.<\/td>\n<td>CHF<\/td>\n<td>4,214,537<\/td>\n<\/tr>\n<tr>\n<td>14.<\/td>\n<td>Leukemias<\/td>\n<td>4,196,908<\/td>\n<\/tr>\n<tr>\n<td>15.<\/td>\n<td>CHF<\/td>\n<td>4,138,746<\/td>\n<\/tr>\n<tr>\n<td>16.<\/td>\n<td>HIV Infection<\/td>\n<td>4,052,052<\/td>\n<\/tr>\n<tr>\n<td>17.<\/td>\n<td>Liveborn<\/td>\n<td>3,975,798<\/td>\n<\/tr>\n<tr>\n<td>18.<\/td>\n<td>Liveborn<\/td>\n<td>3,961,028<\/td>\n<\/tr>\n<tr>\n<td>19.<\/td>\n<td>Adult Respiratory Failure<\/td>\n<td>3,947,940<\/td>\n<\/tr>\n<tr>\n<td>20.<\/td>\n<td>Anemia<\/td>\n<td>3,891,765<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">We notice that 4 of them are related to births (Liveborn), 2 to CHF, and 3 to HIV infections.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">It seems that with this dataset we might have enough data to build a predictive model for the probable cost of an admission based on gender, age, disease, area, length of stay and insurance type\u00a0for the region of New York.<\/span><\/p>\n<p><span style=\"font-family: arial, helvetica, sans-serif;\">I&#8217;m guessing that there is a lot more value to be extracted from a dataset like 
this.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">But that is a topic for another blog post&#8230;.<\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">[1] <a href=\"http:\/\/www.csc.com\/insights\/flxwd\/78931-big_data_universe_beginning_to_explode\" rel=\"noopener\">http:\/\/www.csc.com\/insights\/flxwd\/78931-big_data_universe_beginning_to_explode<\/a><\/span><\/p>\n<p><span style=\"font-family: arial,helvetica,sans-serif;\">[2] (<a href=\"https:\/\/health.data.ny.gov\/Health\/Hospital-Inpatient-Discharges-SPARCS-De-Identified\/u4ud-w55t\" rel=\"noopener\">https:\/\/health.data.ny.gov\/Health\/Hospital-Inpatient-Discharges-SPARCS-De-Identified\/u4ud-w55t<\/a>)\u00a0<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>An analysis on Hospital Inpatient Discharges data released by New York State in 2012. This post shows that open data can be as useful as proprietary data.&nbsp;<\/p>\n","protected":false},"author":514,"featured_media":22108,"comment_status":"open","ping_status":"open","sticky":false,"template":"single-post-2.php","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[197],"tags":[148],"ppma_author":[2465],"class_list":["post-1112","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-health-tech","tag-healthcare-analytics"],"authors":[{"term_id":2465,"user_id":514,"is_guest":0,"slug":"louizos-alexandros-louizos","display_name":"Louizos Louizos","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Louizos","first_name":"Louizos","job_title":"","description":"<span style=\"line-height: 20.8px\">Louizos has a background in Medicine, Nanotechnology and Computational Physics. He also holds a PhD in Computational Quantum Mechanics. 
Originally from Greece, he is based in Brooklyn, NY.<\/span>"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1112","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/514"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1112"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1112\/revisions"}],"predecessor-version":[{"id":29893,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1112\/revisions\/29893"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/22108"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1112"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1112"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1112"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1112"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}