Our approach was to first collect all English-language Twitter posts mentioning medical products. We applied manual and semi-automated techniques to identify posts with resemblance to AEs. Colloquial language was then mapped to a standard regulatory dictionary. We then compared the aggregate frequency of identified product-event pairs with FAERS at the organ system level. A data collection schematic is shown in Fig. 1.
Twitter data were collected from 1 November 2012 through 31 May 2013, and consisted of public posts acquired through the general-use streaming application programming interface (API). We chose this data source because it contains a large volume of publicly available posts about medical products. Data were stored in databases using Amazon Web Services (AWS) cloud services. Since the vast majority of the over 400 million daily Twitter posts have no relevance to AE reporting, we created a list of medical product names and used them as search-term inputs to the Twitter API. Although this approach may exclude posts that contain misspellings, slang terms, and other oblique references, it allowed us to start from a manageable data corpus. To avoid confusion with regulatory definitions of an ‘adverse event’ report, we coined the term ‘Proto-AE’ to designate posts identified in social media sources that resemble AE reports. The labeled subset was chosen both by reviewing the data in the sequence collected from the API over convenient time periods and by searching the unlabeled data for specific product names and symptom terms.
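The product-name pre-filter described above can be sketched as follows (an illustrative simplification: the term list is a hypothetical subset of the full product list, and the actual collection relied on the streaming API's keyword filtering rather than client-side matching):

```python
# Hypothetical subset of the product-name list used as search terms
PRODUCT_TERMS = {"prednisone", "ibuprofen", "metformin"}

def mentions_product(post_text: str, terms=PRODUCT_TERMS) -> bool:
    """Return True if any product term appears as a word in the post."""
    tokens = {tok.strip(".,!?;:\"'").lower() for tok in post_text.split()}
    return not tokens.isdisjoint(terms)

posts = [
    "So much for sleeping, thanks prednisone",
    "Great weather today!",
]
# Keep only posts that mention a listed product
kept = [p for p in posts if mentions_product(p)]
```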
Public FAERS data were obtained from the FDA website in text format for the time period concurrent with the collection of Twitter data, the fourth quarter of 2012 and first quarter of 2013.
In conjunction with the FDA, we selected a priori 23 prescription and over-the-counter drug products in diverse therapeutic areas for quantitative analysis, representing new and old medicines as well as widely used and more specialized products: acetaminophen, adalimumab, alprazolam, citalopram, duloxetine, gabapentin, ibuprofen, isotretinoin, lamotrigine, levonorgestrel, metformin, methotrexate, naproxen, oxycodone, paroxetine, prednisone, pregabalin, sertraline, tramadol, varenicline, venlafaxine, warfarin, and zolpidem. We also selected vaccines for influenza, human papillomavirus (HPV), and hepatitis B, as well as the combined tetanus/diphtheria/pertussis (Tdap) vaccine. Whenever possible, we identified brand and generic names for each product using the DailyMed site from the National Library of Medicine and the FDA’s Orange Book.
Adverse Event (AE) Identification in Twitter
The next step was classification of the information, which included filtering the corpus to remove items irrelevant to AEs. To determine whether a given post constitutes an AE report, we established guidelines enabling human annotators to identify AE reports consistently. We proceeded under the general guidance of the four statutorily required data elements for AE reporting in the USA: an identifiable medical product, an identifiable reporter, an identifiable individual, and mention of a negative outcome; however, we did not automatically exclude posts that failed to meet one of these criteria. We also considered a Twitter account sufficient to meet the requirement for an identifiable reporter, though this standard is not the current regulatory expectation for mandatory reporting.
We applied a tree-based dictionary-matching algorithm to identify both product and symptom mentions. The algorithm consists of three steps. First, we loaded the dictionary from a multi-user editable spreadsheet into a tree structure in memory. The separate product and symptom dictionaries were superimposed into a single tree, so that one pass over the input yielded both product and symptom matches. Second, in the extraction step, a tokenizer stripped punctuation and split the input into a series of tokens, typically corresponding to words. Finally, we processed the tokens one at a time, matching each against the tree and traversing the tree as matches occurred. When a leaf of the tree was reached, a positive match was established and the identifier for the corresponding concept (product or symptom) was returned. Because the concordance analysis is based on the output of this algorithm, its performance characteristics are important. We assessed the performance of the symptom classifier by manually examining a random sample of 10 % of the Proto-AEs and comparing the algorithmically identified symptoms with the symptoms that a rater (CCF) determined to be attributed to a product.
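The matching procedure above can be sketched as a token-level trie (a minimal sketch; the dictionary entries, concept identifiers, and tokenizer here are simplified assumptions rather than the system's actual implementation):

```python
import string

def build_tree(entries):
    """entries: {phrase: concept_id}. Build a token-level trie; a completed
    phrase (leaf) is marked by storing the concept id under the None key."""
    root = {}
    for phrase, concept in entries.items():
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[None] = concept
    return root

def tokenize(text):
    """Strip punctuation and split into lowercase word tokens."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()

def match(text, tree):
    """Scan the token stream, traversing the shared product/symptom tree
    from each start position and emitting a concept id at each marked leaf."""
    tokens, hits = tokenize(text), []
    for i in range(len(tokens)):
        node = tree
        for j in range(i, len(tokens)):
            if tokens[j] not in node:
                break
            node = node[tokens[j]]
            if None in node:  # reached a marked leaf: concept matched
                hits.append(node[None])
    return hits

# Hypothetical combined product + symptom dictionary in one tree
tree = build_tree({"prednisone": "DRUG:prednisone", "wide awake": "SYM:insomnia"})
```

A single call such as `match("I am wide awake thanks prednisone", tree)` returns both the symptom and the product concept in one pass over the tokens.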
We created a curation tool for reviewing and labeling posts. Two trained raters (CCF, CMM) classified a convenience sample of 61,402 posts. Discrepancies between raters were adjudicated by three of the authors (CCF, CMM, ND). Agreement on overlapping subsets increased from 97.9 to 98.4 % (Cohen’s kappa: 0.97) over successive rounds of iterative protocol development and classification. The convenience sample was selected as a training dataset for further development of an automated Bayesian classifier, but that classifier was not used in the analysis presented in this study. The sample was enriched, based on preliminary data review, to include posts that contained AEs.
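Inter-rater agreement of the kind reported above can be quantified with Cohen's kappa; a minimal sketch follows (the binary label lists are made-up illustrations, not the study's annotations):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two raters' label lists of equal length."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    labels = set(a) | set(b)
    # Expected chance agreement from each rater's marginal label frequencies
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)

# Hypothetical binary labels (1 = Proto-AE, 0 = not) from two raters
rater_a = [1, 1, 0, 0, 1, 0]
rater_b = [1, 1, 0, 0, 0, 0]
kappa = cohens_kappa(rater_a, rater_b)
```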
Coding of AEs in Twitter
Further natural language processing was required to identify the event in each post. Starting with the subset of posts identified as containing AEs, we developed a dictionary to convert Internet vernacular to a standardized regulatory terminology, the Medical Dictionary for Regulatory Activities (MedDRA®) version 16 in English. MedDRA® is the international medical terminology developed under the auspices of the International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH); the MedDRA® trademark is owned by the International Federation of Pharmaceutical Manufacturers and Associations (IFPMA) on behalf of ICH. The ontology matches Internet vernacular to the closest relevant MedDRA preferred term, but allows less specific higher-level terms to be used when not enough detail is available for matching to a preferred term. The empirically derived dictionary currently contains over 4,800 terms spread across 257 symptom categories. For example, the post “So much for me going to sleep at 12. I am wide awake thanks prednisone and albuterol” (30 Sep 2013) would be coded to the MedDRA preferred term ‘insomnia’ by identifying ‘wide awake’ as the outcome and both prednisone and albuterol as the drugs involved. Multiple vernacular phrases could map to the same MedDRA preferred term, such as ‘can’t sleep’ and ‘tossing and turning’ in the previous example. As noted above and detailed in Freifeld et al., we used a tree-based text-matching algorithm to match the raw text of the posts to the vernacular dictionary. Preferred terms were aggregated up to the System Organ Class (SOC), the broadest hierarchical category in MedDRA.
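The vernacular-to-MedDRA mapping and the SOC roll-up can be sketched as follows (the phrase list, preferred terms, and SOC assignment below are illustrative entries only, not the actual 4,800-term dictionary):

```python
# Illustrative dictionary entries: vernacular phrase -> MedDRA preferred term
VERNACULAR_TO_PT = {
    "wide awake": "Insomnia",
    "can't sleep": "Insomnia",
    "tossing and turning": "Insomnia",
}
# Illustrative preferred term -> System Organ Class assignment
PT_TO_SOC = {"Insomnia": "Psychiatric disorders"}

def code_post(phrases):
    """Map matched vernacular phrases to preferred terms, then roll the
    preferred terms up to their System Organ Classes."""
    pts = {VERNACULAR_TO_PT[p] for p in phrases if p in VERNACULAR_TO_PT}
    socs = {PT_TO_SOC[pt] for pt in pts}
    return pts, socs

# Both phrases collapse to the single preferred term 'Insomnia'
pts, socs = code_post(["wide awake", "can't sleep"])
```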
AE Identification in FAERS
AEs were identified from public FAERS data for the products of interest using exact name matching on brand and generic names. Un-duplication was conducted using the FDA case identification number, date of event, country of occurrence, age, and gender. Reports submitted by consumers were identified using the reporter field. All roles (primary suspect, secondary suspect, etc.) were considered; preliminary analysis suggested that limiting to primary suspect medicines did not alter results meaningfully (data not shown).
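The un-duplication step can be sketched as keying each report on the five fields named above (a minimal sketch; the field names are illustrative, not actual FAERS column names):

```python
def deduplicate(reports):
    """Keep the first report for each (case id, event date, country, age,
    sex) combination; later reports with the same key are duplicates."""
    seen, unique = set(), []
    for r in reports:
        key = (r["case_id"], r["event_date"], r["country"], r["age"], r["sex"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# Hypothetical records: the first two share all five key fields
reports = [
    {"case_id": 1, "event_date": "2013-01-05", "country": "US", "age": 34, "sex": "F"},
    {"case_id": 1, "event_date": "2013-01-05", "country": "US", "age": 34, "sex": "F"},
    {"case_id": 2, "event_date": "2013-02-10", "country": "US", "age": 51, "sex": "M"},
]
unique = deduplicate(reports)
```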
We analyzed vaccines and drugs separately, since vaccine AE data from the US Vaccine Adverse Event Reporting System (VAERS)  were not available for this analysis. We compared prescription and over-the-counter drug AEs identified in Twitter posts with corresponding FAERS data for those products, at the SOC level. This approach is intended to identify gross patterns, and not to assess the ability to detect rare but serious AEs.
We assessed the precision and recall of the symptom dictionary-matching algorithm by manual classification of a random sample of 437 Twitter posts (10 % of the full sample). A single post can contain multiple symptom mentions; a symptom match was considered a true positive only if the rater judged the symptom to be an AE of one of the mentioned products.
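Under the per-mention definition above, precision and recall can be computed as follows (a minimal sketch; the post identifiers and symptom names are hypothetical):

```python
def precision_recall(algorithm_hits, rater_hits):
    """Both inputs are sets of (post_id, symptom) pairs; a pair is a true
    positive only when both the algorithm and the rater produced it."""
    tp = len(algorithm_hits & rater_hits)
    precision = tp / len(algorithm_hits) if algorithm_hits else 0.0
    recall = tp / len(rater_hits) if rater_hits else 0.0
    return precision, recall

# Hypothetical algorithm output vs. the rater's reference standard
algo = {(1, "insomnia"), (1, "nausea"), (2, "rash")}
gold = {(1, "insomnia"), (2, "rash"), (3, "headache")}
p, r = precision_recall(algo, gold)  # p = 2/3, r = 2/3
```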
We did not seek to verify each individual report as truthful, but rather to identify overall associations between Twitter and official spontaneous report data as a preliminary proof of concept. We calculated correlation across the 22 SOC categories using the Spearman rank correlation statistic (rho), with statistical significance defined as p < 0.05.
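The SOC-level comparison can be sketched with a pure-Python Spearman rank correlation (the per-SOC counts below are made-up illustrations for five of the 22 categories, not study data):

```python
def rank(values):
    """Average ranks (1-based), with ties receiving the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical Proto-AE and FAERS report counts for five SOC categories
rho = spearman_rho([120, 45, 300, 18, 75], [980, 400, 2100, 150, 610])
```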