Using Robust PCA to estimate regional characteristics of language use from geo-tagged Twitter messages

Daniel Kondor, I. Csabai, Laszlo Dobos, Janos Szule, Norbert Barankai, Tamas Hanyecz, Tamas Sebok, Zsofia Kallus, G. Vattay

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

6 Citations (Scopus)

Abstract

Principal component analysis (PCA) and related techniques have been successfully employed in natural language processing. Text mining applications in the age of the online social media (OSM) face new challenges due to properties specific to these use cases (e.g. spelling issues specific to texts posted by users, the presence of spammers and bots, service announcements, etc.). In this paper, we employ a Robust PCA technique to separate typical outliers and highly localized topics from the low-dimensional structure present in language use in online social networks. Our focus is on identifying geospatial features among the messages posted by the users of the Twitter microblogging service. Using a dataset which consists of over 200 million geolocated tweets collected over the course of a year, we investigate whether the information present in word usage frequencies can be used to identify regional features of language use and topics of interest. Using the PCA pursuit method, we are able to identify important low-dimensional features, which constitute smoothly varying functions of the geographic location.
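The abstract does not spell out the decomposition itself, so what follows is only a minimal sketch. It assumes that the "PCA pursuit" method mentioned above is principal component pursuit in its standard convex form, in which a data matrix M is split into a low-rank part L and a sparse part S by minimizing ||L||_* + lambda * ||S||_1 subject to L + S = M. The function names, the default lambda and mu, and the synthetic toy matrix are all illustrative assumptions and not the authors' implementation. In the setting of the paper, the rows and columns of M would presumably correspond to geographic cells and word usage frequencies.

```python
import numpy as np


def soft_threshold(x, tau):
    """Elementwise shrinkage: proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def svd_threshold(x, tau):
    """Singular value shrinkage: proximal operator of the nuclear norm."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u * soft_threshold(s, tau)) @ vt


def pcp(m, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split m into low_rank + sparse with an augmented Lagrangian iteration.

    lam and mu default to commonly used heuristic values (an assumption,
    not something stated in the abstract).
    """
    m = np.asarray(m, dtype=float)
    rows, cols = m.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(rows, cols))
    if mu is None:
        mu = rows * cols / (4.0 * np.abs(m).sum())

    low_rank = np.zeros_like(m)
    sparse = np.zeros_like(m)
    dual = np.zeros_like(m)  # Lagrange multiplier for the constraint L + S = M
    norm_m = np.linalg.norm(m)

    for _ in range(max_iter):
        # Update the low-rank part with the sparse part held fixed.
        low_rank = svd_threshold(m - sparse + dual / mu, 1.0 / mu)
        # Update the sparse part with the low-rank part held fixed.
        sparse = soft_threshold(m - low_rank + dual / mu, lam / mu)
        # Dual ascent step on the constraint violation.
        residual = m - low_rank - sparse
        dual = dual + mu * residual
        if np.linalg.norm(residual) <= tol * norm_m:
            break
    return low_rank, sparse


# Toy data (hypothetical): a rank-2 matrix plus a few large, isolated outliers.
rng = np.random.default_rng(0)
true_low = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 40))
true_sparse = np.zeros((60, 40))
mask = rng.random((60, 40)) < 0.05
true_sparse[mask] = 10.0 * rng.choice([-1.0, 1.0], size=mask.sum())

low_rank, sparse = pcp(true_low + true_sparse)
rel_err = np.linalg.norm(low_rank - true_low) / np.linalg.norm(true_low)
print("relative error of the recovered low-rank part:", rel_err)
```

In this toy setup the sparse component plays the role of the outliers and highly localized topics, while the low-rank component stands in for the smooth low-dimensional structure. Ordinary PCA would instead spread the outliers across the leading components.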

Original language: English
Title of host publication: 4th IEEE International Conference on Cognitive Infocommunications, CogInfoCom 2013 - Proceedings
Publisher: IEEE Computer Society
Pages: 393-398
Number of pages: 6
ISBN (Print): 9781479915439
DOIs: https://doi.org/10.1109/CogInfoCom.2013.6719277
Publication status: Published - 2013
Event: 4th IEEE International Conference on Cognitive Infocommunications, CogInfoCom 2013 - Budapest, Hungary
Duration: Dec 2 2013 to Dec 5 2013

Other

Other: 4th IEEE International Conference on Cognitive Infocommunications, CogInfoCom 2013
Country: Hungary
City: Budapest
Period: 12/2/13 to 12/5/13

Fingerprint

Principal component analysis
Processing

ASJC Scopus subject areas

  • Artificial Intelligence
  • Information Systems

Cite this

Kondor, D., Csabai, I., Dobos, L., Szule, J., Barankai, N., Hanyecz, T., ... Vattay, G. (2013). Using Robust PCA to estimate regional characteristics of language use from geo-tagged Twitter messages. In 4th IEEE International Conference on Cognitive Infocommunications, CogInfoCom 2013 - Proceedings (pp. 393-398). [6719277] IEEE Computer Society. https://doi.org/10.1109/CogInfoCom.2013.6719277

@inproceedings{bff929b22e7a4923890aac2554c27100,
title = "Using Robust PCA to estimate regional characteristics of language use from geo-tagged Twitter messages",
author = "Daniel Kondor and I. Csabai and Laszlo Dobos and Janos Szule and Norbert Barankai and Tamas Hanyecz and Tamas Sebok and Zsofia Kallus and G. Vattay",
year = "2013",
doi = "10.1109/CogInfoCom.2013.6719277",
language = "English",
isbn = "9781479915439",
pages = "393--398",
booktitle = "4th IEEE International Conference on Cognitive Infocommunications, CogInfoCom 2013 - Proceedings",
publisher = "IEEE Computer Society",

}

TY - GEN

T1 - Using Robust PCA to estimate regional characteristics of language use from geo-tagged Twitter messages

AU - Kondor, Daniel

AU - Csabai, I.

AU - Dobos, Laszlo

AU - Szule, Janos

AU - Barankai, Norbert

AU - Hanyecz, Tamas

AU - Sebok, Tamas

AU - Kallus, Zsofia

AU - Vattay, G.

PY - 2013

Y1 - 2013

UR - http://www.scopus.com/inward/record.url?scp=84894167337&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84894167337&partnerID=8YFLogxK

U2 - 10.1109/CogInfoCom.2013.6719277

DO - 10.1109/CogInfoCom.2013.6719277

M3 - Conference contribution

AN - SCOPUS:84894167337

SN - 9781479915439

SP - 393

EP - 398

BT - 4th IEEE International Conference on Cognitive Infocommunications, CogInfoCom 2013 - Proceedings

PB - IEEE Computer Society

ER -