Word and Graph Embeddings for Machine Learning
Abstract
Distributed word embeddings (e.g. word2vec) provide a powerful way to reduce large text corpora to concise features (vectors) readily applicable to a variety of problems in NLP and data science. I will introduce word embeddings and apply them in a variety of new and interesting directions, including:
(1) Multilingual NLP -- The Polyglot project (www.polyglot-NLP.com) employs deep learning and other techniques to build a basic NLP pipeline (including entity recognition, POS tagging, and sentiment analysis) for over 100 different languages. We train our systems on each language's Wikipedia edition, which provides unified data resources in the absence of explicitly annotated data but poses substantial challenges in interpretation and evaluation.
(2) Detecting Historical Shifts in Word Meaning -- Words like "gay" and "mouse" have substantially shifted their meanings over time in response to societal and technological changes. We use word embeddings trained over texts drawn from different time periods to detect changes in word meanings. This is part of our efforts in historical trends analysis.
(3) Feature Extraction from Graphs -- We present DeepWalk, our widely used approach for learning latent representations of vertices in a network. DeepWalk uses local information obtained from truncated random walks to learn embeddings, treating each walk as the equivalent of a sentence in a language. It is suitable for a broad class of applications such as network classification and anomaly detection. We also introduce new graph embedding techniques based on random projections, which produce DeepWalk-quality embeddings thousands of times faster than previous algorithms.
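To make the "walks as sentences" idea in item (3) concrete, here is a minimal sketch of the walk-generation step only. It is an illustration under stated assumptions, not the DeepWalk implementation: the function name, its parameters, and the toy graph are all hypothetical, and the later step of training a skip-gram model (such as word2vec) on the resulting walks is omitted.

```python
import random


def truncated_random_walks(adjacency, walks_per_vertex=10, walk_length=5, seed=0):
    """Return random walks of at most `walk_length` vertices, as lists of string tokens.

    `adjacency` maps each vertex to a list of its neighbors (hypothetical input format).
    """
    rng = random.Random(seed)
    vertices = list(adjacency)
    walks = []
    for _ in range(walks_per_vertex):
        rng.shuffle(vertices)  # start from every vertex, in a fresh random order per pass
        for start in vertices:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: truncate this walk early
                    break
                walk.append(rng.choice(neighbors))
            # Vertex ids become "words", so each walk reads like a sentence.
            walks.append([str(v) for v in walk])
    return walks


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant vertex 3 attached to 2.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = truncated_random_walks(toy_graph, walks_per_vertex=2, walk_length=4)
```

Each element of `walks` is a list of vertex tokens that could be passed to any skip-gram trainer in place of a tokenized sentence. Vertices that co-occur on many walks would then tend to receive nearby vectors.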
Bio:
Skiena received his B.S. in Computer Science from the University of Virginia and his Ph.D. in Computer Science from the University of Illinois in 1988. He is a Fellow of the American Association for the Advancement of Science (AAAS), a former Fulbright scholar, and recipient of the University of Virginia Engineering Distinguished Alumni Award (WahooWa!), the ONR Young Investigator Award and the IEEE Computer Science and Engineering Teaching Award. More info is available at http://www.cs.stonybrook.edu/~skiena/.
Join meeting via Zoom: https://tufts.zoom.us/s/98846620413
Password: 353214
Meeting ID: 988 4662 0413
Dial by location: 1 646 558 8656 US (New York)
Meeting ID: 988 4662 0413
Passcode: 353214
Disregard password and passcode at end of the email; they do not apply to this meeting.