Race, religion and the city

Twitter word frequency patterns reveal dominant demographic dimensions in the United States

Eszter Bokányi, Dániel Kondor, László Dobos, Tamás Sebők, József Stéger, I. Csabai, G. Vattay

Research output: Contribution to journal, Article

4 Citations (Scopus)

Abstract

Recently, numerous approaches have emerged in the social sciences to exploit the opportunities made possible by the vast amounts of data generated by online social networks (OSNs). Having access to information about users on such a scale opens up a range of possibilities—from predicting individuals’ demographics and health status to their beliefs and political opinions—all without the limitations associated with often slow and expensive paper-based polls. A question that remains to be satisfactorily addressed, however, is how demography is represented in OSN content—that is, what are the relevant aspects that constitute detectable large-scale patterns in language? Here, we study language use in the United States using a corpus of text compiled from over half a billion geotagged messages from the online microblogging platform Twitter. Our intention is to reveal the most important spatial patterns in language use in an unsupervised manner and relate them to demographics. Our approach is based on Latent Semantic Analysis augmented with the Robust Principal Component Analysis methodology, which permits identification of the data’s main sources of variation with an automatic filtering of noise and outliers without influencing results by a priori assumptions. We find spatially correlated patterns that can be interpreted based on the words associated with them. The main language features can be related to slang use, urbanization, travel, religion and ethnicity, the patterns of which are shown to correlate plausibly with traditional census data. Apart from the standard measure of linear correlation, some relations seem to be better explained by Boolean implications, suggesting a threshold-like behaviour where demographic variables influence the users’ word use. 
Our findings validate the concept of demography being represented in OSN language use and show that the traits observed are inherently present in the word frequencies without any previous assumptions about the dataset. They therefore could form the basis of further research focusing on the evaluation of demographic data estimation from other big data sources, or on the dynamical processes that result in the patterns identified here.
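
The contrast the abstract draws between linear correlation and Boolean implications can be illustrated with a small sketch. It is not the authors' procedure: the function names, the fixed thresholds, the 5% violation tolerance and the toy numbers are all assumptions chosen for illustration. The idea is that "x high implies y high" holds when the (x high, y low) quadrant is almost empty, even if the overall linear correlation is only moderate.

```python
from math import sqrt


def quadrant_counts(x, y, x_thr, y_thr):
    """Count points in the four quadrants cut out by the two thresholds."""
    counts = {("low", "low"): 0, ("low", "high"): 0,
              ("high", "low"): 0, ("high", "high"): 0}
    for xi, yi in zip(x, y):
        key = ("high" if xi > x_thr else "low",
               "high" if yi > y_thr else "low")
        counts[key] += 1
    return counts


def implication_holds(x, y, x_thr, y_thr, max_violation_rate=0.05):
    """Return True if 'x high => y high' is (nearly) never violated.

    A violation is a point with x above its threshold and y at or below
    its threshold. The tolerance is a hypothetical choice, not a value
    taken from the paper.
    """
    counts = quadrant_counts(x, y, x_thr, y_thr)
    n_x_high = counts[("high", "low")] + counts[("high", "high")]
    if n_x_high == 0:
        return False  # no x-high points, so no evidence for the rule
    return counts[("high", "low")] / n_x_high <= max_violation_rate


def pearson(x, y):
    """Plain Pearson correlation coefficient, for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Toy data (invented): x is a demographic share per region,
# y is the strength of a word-usage pattern in the same region.
demographic = [0.1, 0.2, 0.1, 0.3, 0.2, 0.8, 0.9, 0.7, 0.1, 0.2]
word_score = [0.1, 0.9, 0.8, 0.2, 0.7, 0.9, 0.8, 0.9, 0.9, 0.1]

forward = implication_holds(demographic, word_score, 0.5, 0.5)
backward = implication_holds(word_score, demographic, 0.5, 0.5)
r = pearson(demographic, word_score)
print(forward, backward, round(r, 2))
```

In this toy set every region with a high demographic share also has a high word score, but not the other way round, so the implication is one-directional. A single correlation coefficient cannot express that asymmetry.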

Original language: English
Article number: 16010
Journal: Palgrave Communications
Volume: 2
DOI: 10.1057/palcomms.2016.10
Publication status: Published, Jan 1 2016

Fingerprint

twitter
Religion
Demography
Language
social network
Social Support
Access to Information
Urbanization
Social Sciences
Information Storage and Retrieval
health status
Censuses
Principal Component Analysis
Semantics
census
ethnicity
social science

ASJC Scopus subject areas

  • Social Sciences (all)
  • Arts and Humanities (all)
  • Economics, Econometrics and Finance (all)
  • Psychology (all)

Cite this

Race, religion and the city: Twitter word frequency patterns reveal dominant demographic dimensions in the United States. / Bokányi, Eszter; Kondor, Dániel; Dobos, László; Sebők, Tamás; Stéger, József; Csabai, I.; Vattay, G.

In: Palgrave Communications, Vol. 2, 16010, 01.01.2016.

@article{eeb6a9e4d33c4c6e92cf966af839e19d,
title = "Race, religion and the city: Twitter word frequency patterns reveal dominant demographic dimensions in the United States",
author = "Eszter Bok{\'a}nyi and D{\'a}niel Kondor and L{\'a}szl{\'o} Dobos and Tam{\'a}s Seb{\H{o}}k and J{\'o}zsef St{\'e}ger and I. Csabai and G. Vattay",
year = "2016",
month = "1",
day = "1",
doi = "10.1057/palcomms.2016.10",
language = "English",
volume = "2",
journal = "Palgrave Communications",
issn = "2055-1045",
publisher = "Palgrave Macmillan Ltd.",

}

TY - JOUR

T1 - Race, religion and the city

T2 - Twitter word frequency patterns reveal dominant demographic dimensions in the United States

AU - Bokányi, Eszter

AU - Kondor, Dániel

AU - Dobos, László

AU - Sebők, Tamás

AU - Stéger, József

AU - Csabai, I.

AU - Vattay, G.

PY - 2016/1/1

Y1 - 2016/1/1

UR - http://www.scopus.com/inward/record.url?scp=85043794478&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85043794478&partnerID=8YFLogxK

U2 - 10.1057/palcomms.2016.10

DO - 10.1057/palcomms.2016.10

M3 - Article

VL - 2

JO - Palgrave Communications

JF - Palgrave Communications

SN - 2055-1045

M1 - 16010

ER -