Python web scraping for Digital Health
N8 CIR
Speakers: Zhongtian Sun, Tahir Aduragba, Jialin Yu
Durham University
www.n8cir.org.uk
Digital Healthcare: What and Why
• Uses technology to help improve individuals' health and wellness
• Hospitals are increasingly using data analytics and artificial intelligence to identify patients potentially at risk
• Consumers increasingly demand access to their healthcare information
• Large amounts of real-time, heterogeneous medical data have become available across healthcare organizations
Outline
• Python Introduction: introduction to Python, including its characteristics, how to use it, data types, functions and modules
• Data Collection & Processing: how to use the Twitter API, and how to collect and process data
• LDA Modelling: how to create the necessary knowledge for machines to read and understand Tweets
• Topic Modelling: description of LDA modelling
• Application: research instantiation
Python Introduction: Basic Knowledge
Introduction to Python
• What is Python?
• What can Python do?
• Why Python?
Install Python
• https://www.python.org/downloads/
• https://www.jetbrains.com/pycharm/download/
How to use
• Run in an Integrated Development Environment (IDE)
How to use
• Colab
Variables
• Containers for storing data values
• Created when a value is first assigned
• The type can change after it has been set
• Values can be assigned to multiple variables in one statement
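The four points above can be sketched in a few lines. The variable names and values here are made up for illustration.

```python
# A variable is created the first time a value is assigned to it.
tweet_count = 10
first_type = type(tweet_count).__name__

# The same name can later hold a value of a different type.
tweet_count = "ten"
second_type = type(tweet_count).__name__

# Several values can be assigned to several variables in one statement...
user, text, likes = "example_user", "Wash your hands", 42

# ...and one value can be assigned to several variables at once.
a = b = c = 0

print(first_type, second_type, user, likes, a, b, c)
```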
Data types
• Different types of data
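As a rough illustration, the built-in `type()` function reports the type of any value. The sample values below are invented; only the built-in types themselves come from Python.

```python
# One example value for each common built-in type.
samples = {
    "integer": 8,
    "float": 36.6,
    "string": "coronavirus",
    "boolean": True,
    "list": ["fever", "cough"],
    "tuple": (54.77, -1.58),
    "dictionary": {"id": 1, "text": "hello"},
    "set": {"a", "b"},
    "none": None,
}

# Map each label to the name of the type Python assigns to the value.
type_names = {label: type(value).__name__ for label, value in samples.items()}

for label, name in type_names.items():
    print(f"{label}: {name}")
```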
List methods
Method     Description
append()   Add an element to the end of the list
index()    Return the index of the first element with the specified value
insert()   Add an element at the specified position
pop()      Remove the element at the specified position
sort()     Sort the list
count()    Return the number of elements with the specified value
clear()    Remove all elements from the list
copy()     Return a copy of the list
remove()   Remove the first element with the specified value
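A short walk through the methods in the table, using a made-up list of symptom words. The comments show the list after each call.

```python
symptoms = ["fever", "cough"]

symptoms.append("fatigue")           # ['fever', 'cough', 'fatigue']
symptoms.insert(1, "headache")       # ['fever', 'headache', 'cough', 'fatigue']
position = symptoms.index("cough")   # 2
removed = symptoms.pop(0)            # removes and returns 'fever'
symptoms.remove("fatigue")           # ['headache', 'cough']
symptoms.sort()                      # ['cough', 'headache']
backup = symptoms.copy()             # independent copy of the sorted list
n_cough = symptoms.count("cough")    # 1
symptoms.clear()                     # []

print(position, removed, backup, n_cough, symptoms)
```

Note that `copy()` returns a separate list, so `backup` keeps its elements after `symptoms.clear()`.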
Functions
• Create functions
• Call functions
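A minimal sketch of creating and calling functions. The function names `count_words` and `greet` are hypothetical examples, not part of the workshop material.

```python
def count_words(text):
    """Return the number of whitespace-separated words in text."""
    return len(text.split())


def greet(name, greeting="Hello"):
    """Build a greeting; 'greeting' has a default value and may be omitted."""
    return f"{greeting}, {name}!"


# Call the functions with positional and keyword arguments.
n_words = count_words("Stay home and stay safe")
message = greet("Durham")
custom = greet("N8", greeting="Welcome")

print(n_words, message, custom)
```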
Modules
• Create modules (file extension .py)
• Use modules
Additional resources
• Python documentation: https://docs.python.org/3/tutorial/
• W3Schools: https://www.w3schools.com/python/
• Real Python: https://realpython.com/
Introduction to Twitter Data Collection & Processing
Accessing Twitter APIs
• Application Programming Interfaces (APIs) are sets of protocols that govern interactions between sites and users
• Twitter Standard API
  – Filter real-time tweets
    • Filter by keywords (e.g. coronavirus)
    • Filter by location
    • Follow Twitter accounts
    • 1% of all public tweets
  – Search tweets
    • Historic (up to 7 days) collection of tweets
Access to Twitter API
• Create a Twitter account: https://twitter.com/i/flow/signup
• Apply for a Twitter Developer Account via https://developer.twitter.com/en/apply-for-access
Create a Twitter Application
• Log in to https://developer.twitter.com/
• Go to https://developer.twitter.com/en/apps and select "Create an app"
• Fill in the app creation page with:
  – a unique name
  – a website name
  – an application description
  – an explanation of why the app is required
API Credentials
Access Twitter API in Python
• Tweepy
  – Most popular Python Twitter package
• pip install tweepy
Collecting Twitter Data
• API reference: https://developer.twitter.com/en/docs/tweets/filter-realtime/api-reference/post-statuses-filter
The JSON output
• Tweet Data Dictionary: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
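As a sketch of what working with this output might look like, the snippet below parses one tweet with the standard `json` module and keeps a few fields. The field names (`id_str`, `text`, `user.screen_name`, `entities.hashtags`) follow the standard tweet object; the tweet content itself is an invented sample, not real data.

```python
import json

# An invented, heavily shortened tweet object used only for illustration.
raw = """
{
  "created_at": "Wed Apr 01 12:00:00 +0000 2020",
  "id_str": "1",
  "text": "Wash your hands! @WHO https://example.org #coronavirus",
  "lang": "en",
  "user": {"screen_name": "example_user"},
  "entities": {"hashtags": [{"text": "coronavirus"}]}
}
"""

tweet = json.loads(raw)

# Keep only the fields needed for later analysis.
record = {
    "id": tweet["id_str"],
    "user": tweet["user"]["screen_name"],
    "text": tweet["text"],
    "hashtags": [tag["text"] for tag in tweet["entities"]["hashtags"]],
}

print(record)
```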
Preprocessing Data
• Extract relevant fields from the tweet object, e.g. tweet id, user, text …
• Remove user mentions, e.g. @WHO
• Remove HTML links
• Remove special characters
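The cleaning steps above could be sketched with regular expressions as follows. The helper name `clean_tweet` and the exact patterns are assumptions; in particular, treating everything except letters and whitespace as a "special character" (which also drops digits and the `#` of hashtags) is one possible choice among several.

```python
import re


def clean_tweet(text):
    """Strip links, user mentions and special characters from a tweet text."""
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"@\w+", " ", text)            # remove user mentions such as @WHO
    text = re.sub(r"[^A-Za-z\s]", " ", text)     # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip().lower()


cleaned = clean_tweet("Wash your hands! @WHO https://example.org #coronavirus")
print(cleaned)
```

Links are removed before the special-character step so that the URL is dropped as a whole rather than broken into stray word fragments.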
Removing Stop Words
• Remove commonly used words, e.g. the, this, an …
• Python nltk package
  – pip install nltk
• Tokenization
  – Separate a piece of text into smaller units (tokens)
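To stay self-contained, the sketch below does not call nltk. It uses a plain whitespace split as a stand-in tokenizer and a tiny hand-picked stop-word set; both are simplifying assumptions. With nltk installed, a fuller English list is available from `nltk.corpus.stopwords.words("english")` after running `nltk.download("stopwords")`.

```python
# A tiny illustrative stop-word list; a real list is much longer.
STOP_WORDS = {"the", "this", "an", "a", "is", "and", "to", "of"}


def tokenize(text):
    """Split text into lowercase tokens on whitespace (a simple stand-in tokenizer)."""
    return text.lower().split()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


tokens = tokenize("This is an update on the outbreak")
filtered = remove_stop_words(tokens)
print(tokens)
print(filtered)
```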
Stemming and Lemmatization
• Reduce inflectional/related forms of a word to a common base form
• Python spacy package
  – pip install spacy
  – python -m spacy download en_core_web_sm
Sample Notebook
• Google Colaboratory link: https://colab.research.google.com/drive/10S0E_M7jGrzHNQP2-mVnMpJHAOdO2RbK?usp=sharing
Introduction to LDA Topic Modelling for Digital Healthcare
Motivation for using Topic Models on Twitter
• Understand the content
Difference in thinking pattern: Human vs. AI
• Human: What is it about? What is it related to? What does it feel like? What does it mean?
• AI: sees only binary data (0001110010000100...)
How to make AI understand the text on Twitter?
• Right level of abstraction
• Knowledge about the world
How to create the necessary knowledge for AI to read and understand Tweets?
• Use Wikipedia data to learn the knowledge (topics)
• https://radimrehurek.com/gensim/
How to have the right level of abstraction?
• Tweets of different lengths
How to have the right level of abstraction?
• Tweets of the same length
How to have the right level of abstraction?
• Let the right words remain
• Remove words that appear too often or too rarely
• More filters
  – Word length
  – Stopwords
  – Lemmatization
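The frequency and word-length filters above could be sketched in plain Python as follows. The function name `filter_vocabulary`, its thresholds and the sample documents are all assumptions chosen for illustration. In gensim, `Dictionary.filter_extremes(no_below=..., no_above=...)` serves a similar purpose.

```python
from collections import Counter


def filter_vocabulary(docs, min_docs=2, max_doc_fraction=0.8, min_length=3):
    """Drop tokens that are too rare, too common, or too short.

    docs is a list of tokenized documents (lists of strings).
    """
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)
    keep = {
        token
        for token, freq in doc_freq.items()
        if freq >= min_docs
        and freq / n_docs <= max_doc_fraction
        and len(token) >= min_length
    }
    return [[token for token in doc if token in keep] for doc in docs]


# Invented, already tokenized sample tweets.
docs = [
    ["covid", "vaccine", "news", "uk"],
    ["covid", "vaccine", "trial"],
    ["covid", "lockdown", "news"],
    ["covid", "lockdown", "uk"],
]
filtered_docs = filter_vocabulary(docs)
print(filtered_docs)
```

In this sample, "covid" is dropped because it appears in every document, "trial" because it appears in only one, and "uk" because it is shorter than three characters.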
How to understand Tweets based on the model?
Number of topics defined (figure: topics 1 to 12)
Topics probability distribution (figure: topics 1 to 12)
Critical topics: IDENTITY (figure: topics 1 to 12)
How to find similarity of Tweets based on the model?
• The Hellinger distance metric gives an output in the range [0, 1] for two probability distributions, with values closer to 0 meaning they are more similar.
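For two discrete distributions P and Q, the Hellinger distance is H(P, Q) = sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2) / sqrt(2). A minimal pure-Python sketch is given below; the three topic distributions are invented for illustration. gensim also provides `gensim.matutils.hellinger` for the same purpose.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


# Invented topic distributions for three tweets over three topics.
tweet_a = [0.7, 0.2, 0.1]
tweet_b = [0.6, 0.3, 0.1]
tweet_c = [0.0, 0.1, 0.9]

dist_ab = hellinger(tweet_a, tweet_b)
dist_ac = hellinger(tweet_a, tweet_c)
print(dist_ab, dist_ac)
```

Here `tweet_a` and `tweet_b` put most of their weight on the same topic, so their distance is smaller than the distance between `tweet_a` and `tweet_c`.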
SUMMARY: Understanding Tweets with LDA
• Monitoring Disease Outbreak
• Live Stream
• Medical Data Classification
• Topic Analysis
• Trend Prediction
• … Much more with Gensim
Questions & Answers