Finding Temporal Patterns in Twitter Posts: Exploratory Data Analysis with Python (Part 2) | by Dmitrii Eliuseev | Jun, 2023


User behavior analysis with Python and Pandas

User timelines example, Image by author

In the first part of this article, I analyzed the timestamps of about 70,000 Twitter posts and got some interesting results; for example, it was possible to detect bots or users posting messages from clone accounts. But I was not able to get the local time of each message: at least for a free account, the Twitter API response does not include a time zone, and all timestamps are in UTC. Millions of people use social networks nowadays, and analyzing users’ behavior is not only interesting but may also be important for sociological or psychological studies. For example, it would be interesting to find out whether people post more messages in the evening, at night, or during the day, but without the local time, this is impossible to know. Eventually, I found a workaround that works well, even with the limitations of the free API.

In this article, I will show the full workflow, from collecting the data to the analysis using Python and Pandas.

Methodology

Our data processing flow will consist of several steps:

  • Collecting the data using the Tweepy library.
  • Loading the data and getting basic insights.
  • Data transformation. We will group data by user and find specific metrics useful for analysis.
  • Analyzing the results.

Let’s get started.

1. Collecting the data

As was written in the previous part, we cannot get a proper timezone for Twitter messages; all messages returned by the Twitter API have the UTC time. As a workaround, I decided to test three approaches:

  • I tried to get all messages using the “*” mask and analyze the “location” field of every message. Not every user specifies a location on Twitter, but a pretty large number do. The idea was good, but in practice, it did not work. Twitter is a large social network; it generates a huge amount of data, and collecting all tweets even for a week is unrealistic. Thousands of messages per second are not only too much to process on an ordinary PC but also beyond the limits of the free Twitter developer account.
  • I can use a city name as a request; for example, I can search all tweets with the hashtag “#Berlin”. Then it would be easy to filter users who have “Germany” as a location, and for Germany, we know the time zone. This idea works, but the problem is that the results may be biased. For example, messages with the hashtag “#Berlin” may be posted by people interested in politics or by sports fans. But in general, this approach is interesting; with different search queries, it may be possible to reach different types of audiences.
  • Finally, I’ve found a solution that works well for me. I decided to get all messages in a specific language by specifying the “*” mask and a language code. This obviously will not work for English, but there are many countries in the world that are geographically small enough to easily determine the time zone of their citizens. I selected the Dutch language because the number of Dutch speakers in the world is not so big; this language is mostly used in the Netherlands and in Belgium, and both countries have the same time zone. Some people may live abroad, and there are also native Dutch speakers in Suriname and Curaçao, but those numbers are not that big.

Collecting the data itself is straightforward. The code was already used in the first part; I only specified “*” as a query mask and “nl” as a language code. The free Twitter API limits historical data to the last 7 days. But in practice, it turned out that pagination stops after about 100,000 messages. Is that a lot? Actually, no. Most people probably never realize how many messages are posted on social media. There are only about 25 million Dutch-speaking people in the world, and they post 100,000 messages on Twitter within only 3 hours! In practice, I needed to run the code at least every 2 hours to get all the tweets.
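To make this arithmetic concrete, here is a rough back-of-the-envelope sketch. The figures (about 100,000 messages per run, posted within roughly 3 hours, and a 2-hour polling interval) come from the text above; the variable names are only for illustration.

```python
# Rough estimate of the average message rate and of the overlap between runs.
PAGINATION_LIMIT = 100_000   # approximate cap on messages per collection run
FILL_TIME_HOURS = 3          # approximate time in which that many tweets are posted
POLL_INTERVAL_HOURS = 2      # how often the collection script is started

messages_per_second = PAGINATION_LIMIT / (FILL_TIME_HOURS * 3600)
print(f"Average rate: {messages_per_second:.1f} messages per second")

# Each run reaches back about 3 hours, but runs start every 2 hours,
# so consecutive files overlap and contain duplicate tweets.
overlap_hours = FILL_TIME_HOURS - POLL_INTERVAL_HOURS
print(f"Overlap between consecutive runs: about {overlap_hours} h")
```

The overlap is intentional: it gives some safety margin for busy hours, at the cost of duplicates that have to be removed when the files are merged.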

Collecting the data every two hours is not a problem; it can be easily done in the cloud, but as a free solution, I just took my Raspberry Pi:

Raspberry Pi 4, Image source https://en.wikipedia.org/wiki/Raspberry_Pi

The Raspberry Pi is a small credit-card-sized Linux computer with 1–8 GB of RAM and a 1–2 GHz CPU. Those specs are more than enough for our task, and it is also nice that the Raspberry Pi has no fans, produces no noise, and consumes only 2–5 W of power. So, it’s a perfect choice for running code for a week or two.

I slightly modified the Python script so it could make requests every 2 hours, and I also added a timestamp to the name of each CSV file. After logging in to the Raspberry Pi over SSH, I could run this script in the background by using the Linux “nohup” command:

nohup python3 twit_grabs.py >/dev/null 2>&1 &

By default, “nohup” saves the console output to the “nohup.out” file. This file can become large, so I redirect the output to “/dev/null” to prevent this. Another solution like Cron can also be used, but this simple command is enough for this task.
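The modified script itself is not shown in the article, so the following is only a minimal sketch of how such a periodic loop could look. The function names and the "collect" callback are hypothetical; the actual download code is the one from the first part.

```python
import datetime
import time
from typing import Callable, Optional

def make_csv_name(now: datetime.datetime) -> str:
    """ Build a file name like tweets_20230601220000.csv """
    return f"tweets_{now.strftime('%Y%m%d%H%M%S')}.csv"

def run_periodically(collect: Callable[[str], None], interval_sec: float,
                     max_runs: Optional[int] = None,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """ Call collect(file_name) every interval_sec seconds, return the number of runs """
    runs = 0
    while max_runs is None or runs < max_runs:
        # Each run writes to its own file, named after the start time
        collect(make_csv_name(datetime.datetime.now()))
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_sec)
    return runs

# Usage (collect_tweets is a placeholder for the download code from the first part):
# run_periodically(collect_tweets, interval_sec=2 * 60 * 60)
```

The "max_runs" and "sleep" parameters exist only to make the loop easy to try out without waiting two hours.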

The process is running in the background, so we see nothing on the screen, but we can watch the output file in real time by using the “tail” command (here “20230601220000” is the timestamp in the name of the current file):

tail -f -n 50 tweets_20230601220000.csv

Getting tweets in the console looks like this:

Collecting Twitter messages, Image by author

When needed, we can copy new logs from the Raspberry Pi by using the “scp” command:

scp pi@raspberrypi:/home/pi/Documents/Twitter/tweets_20230601220000.csv .

Here, “/home/pi/Documents/…” is a remote path on the Raspberry Pi, and “.” is the current folder on a desktop PC, where CSV files should be copied.

In my case, I kept the Raspberry Pi running for about 10 days, which was enough to collect some data. But in general, the longer, the better. During the data preparation for the previous part of the article, I saw enough users who were making Twitter posts only once per week; obviously, longer intervals will be needed to see patterns in those users’ behavior.

2. Loading the data

The Python script was getting new Twitter messages every 2 hours, and as an output, a lot of CSV files were generated. We can load all files in Pandas and combine them into one dataset:

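import csv
import glob
import pandas as pd

df_tweets = []
files = glob.glob("data/*.csv")
for file_name in files:
    # Read only the columns we need and parse "created_at" as a datetime
    df_tweets.append(pd.read_csv(file_name, sep=';',
                                 usecols=['id', 'created_at', 'user_name', 'user_location', 'full_text'],
                                 parse_dates=["created_at"],
                                 lineterminator='\n', quoting=csv.QUOTE_NONE))
# Combine all files, drop tweets saved more than once, and sort by id
df = pd.concat(df_tweets).drop_duplicates('id').sort_values(by=['id'], ascending=True)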

The code is straightforward. I load each file into the dataframe, then I combine all the data frames using pd.concat. The time intervals overlap each other; to avoid having duplicate records, I use the drop_duplicates method.
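The effect of the overlap can be illustrated with a tiny example. The two batches below are made-up data, not real tweets; the tweet with id 102 appears in both of them.

```python
import pandas as pd

# Two hypothetical batches with one overlapping message (id=102)
batch_1 = pd.DataFrame({"id": [101, 102], "user_name": ["a", "b"]})
batch_2 = pd.DataFrame({"id": [102, 103], "user_name": ["b", "c"]})

# Same chain as in the loading code: concatenate, drop duplicates by id, sort
merged = (pd.concat([batch_1, batch_2])
            .drop_duplicates("id")
            .sort_values(by=["id"])
            .reset_index(drop=True))
print(merged)
```

After merging, each tweet id appears only once, no matter how many files contained it.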

Let’s see what kind of data we have:

display(df)

The result looks like this:

Dataframe with all messages, Image by author

The text and message ids are actually not important; for the analysis, we will need only the “created_at” field. To make further processing easier, let’s extract the date, time, and hour of the day as separate columns. We can also add a timezone offset to all records:

import datetime
import numpy as np

tz_offset_hours = 2

def update_timezone(t_utc: pd.Timestamp):
    """ Shift the UTC time by a fixed offset and drop the timezone info """
    return (t_utc + np.timedelta64(tz_offset_hours, 'h')).tz_convert(None)

def get_time(dt: datetime.datetime):
    """ Get time in HHMM format from the datetime """
    return dt.time().replace(
        second=0,
        microsecond=0)

def get_date(dt: datetime.datetime):
    """ Get date from the datetime """
    return dt.date()

def get_datetime_hhmm(dt: datetime.datetime):
    """ Get date and time in HHMM format """
    return dt.to_pydatetime().replace(second=0, microsecond=0)

def get_hour(dt: datetime.datetime):
    """ Get hour from the datetime """
    return dt.hour

df["time_local"] = df['created_at'].map(update_timezone)
df["datetime_hhmm"] = df['time_local'].map(get_datetime_hhmm)
df["date"] = df['time_local'].map(get_date)
df["time"] = df['time_local'].map(get_time)
df["hour"] = df['time_local'].map(get_hour)
# Optionally, we can select only several days
df = df[(df['date'] >= datetime.date(2023, 5, 30)) & (df['date'] <= datetime.date(2023, 5, 31))].sort_values(by=['id'], ascending=True)
# Display
display(df)

The result looks like this:

Dataframe with added columns, Image by author
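As an alternative sketch, the same columns can be produced with vectorized Pandas operations. Note that this is a different technique from the fixed 2-hour offset used above: it assumes that "created_at" is a timezone-aware UTC column and uses the "Europe/Amsterdam" zone name, which also accounts for daylight saving time. The function name is hypothetical.

```python
import pandas as pd

def add_local_time_columns(df_in: pd.DataFrame, tz: str = "Europe/Amsterdam") -> pd.DataFrame:
    """ Add local date/time columns using a named time zone (assumes UTC-aware input) """
    out = df_in.copy()
    # Convert to local wall-clock time, then drop the timezone info
    local = out["created_at"].dt.tz_convert(tz).dt.tz_localize(None)
    out["time_local"] = local
    out["datetime_hhmm"] = local.dt.floor("min")   # truncate to minutes
    out["date"] = local.dt.date
    out["time"] = out["datetime_hhmm"].dt.time
    out["hour"] = local.dt.hour
    return out

# Made-up timestamps: one in summer (UTC+2) and one in winter (UTC+1)
sample = pd.DataFrame({"created_at": pd.to_datetime(
    ["2023-06-01 12:34:56", "2023-01-15 23:10:05"], utc=True)})
result = add_local_time_columns(sample)
print(result[["time_local", "date", "hour"]])
```

For data collected within one or two weeks in summer, both approaches give the same values; the difference matters only when the collection period crosses a daylight saving time change.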

The data load is ready. Let’s see what the data looks like.

3. General Insights

This article aims to analyze the patterns in the “time” domain. As a warm-up, let’s see all messages on a single timeline. To draw all the graphs in the article, I will be using the Bokeh library:

from bokeh.io import show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.models import SingleIntervalTicker, LinearAxis
from bokeh.transform import factor_cmap, factor_mark, linear_cmap
from bokeh.palettes import *
output_notebook()

def draw_summary_timeline(df_in: pd.DataFrame):
    """ Group all messages by time and draw the timeline """
    print("All messages:", df_in.shape[0])
    users_total = df_in['user_name'].unique().shape[0]
    print("All users:", users_total)
    days_total = df_in['date'].unique().shape[0]
    print("Days total:", days_total)
    print()

    # Number of messages in each minute, converted to messages per second
    gr_messages = df_in.groupby(['datetime_hhmm'], as_index=False).size()
    gr_messages["msg_per_sec"] = gr_messages['size'].div(60)
    datetime_hhmm = gr_messages['datetime_hhmm']
    amount = gr_messages['msg_per_sec']

    palette = RdYlBu11
    p = figure(x_axis_type='datetime', width=2200, height=500,
               title="Messages per second")
    p.vbar(x=datetime_hhmm, top=amount, width=datetime.timedelta(seconds=50), line_color=palette[0])
    p.xaxis[0].ticker.desired_num_ticks = 30
    p.xgrid.grid_line_color = None
    show(p)

draw_summary_timeline(df)

In this method, I group all messages by date and time. The timestamps I created before have the “HH:MM” format. The number of messages per minute is not a convenient metric, so I divided all values by 60 to get the number of messages per second.
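The grouping step can be checked on a small made-up sample: three messages in one minute and one message in the next.

```python
import pandas as pd

# Made-up timestamps, already truncated to minutes
sample = pd.DataFrame({"datetime_hhmm": pd.to_datetime([
    "2023-06-04 23:35", "2023-06-04 23:35", "2023-06-04 23:35",
    "2023-06-04 23:36"])})

# Count messages per minute, then convert to messages per second
per_minute = sample.groupby(["datetime_hhmm"], as_index=False).size()
per_minute["msg_per_sec"] = per_minute["size"].div(60)
print(per_minute)
```

Three messages per minute correspond to 0.05 messages per second; with real data, the values are in the range of several messages per second.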

The result looks like this:

All Twitter messages, Image by author

The code had been running for about 10 days on the Raspberry Pi. As a result, 6,487,433 Twitter messages made by 1,515,139 unique users were collected. But in the image, we can see some problems. Some intervals are missing; probably there was no Internet access at this time. Another day is partially missing, and I don’t know what caused this issue; probably a free Twitter account has the lowest priority compared to all other requests. Anyway, we cannot complain about the free API, and my goal was to collect the data at least for a week, and I have enough information for that. I can just delete the corrupted intervals at the end:

df = df[(df['date'] >= datetime.date(2023, 5, 30)) &
        (df['date'] <= datetime.date(2023, 6, 5))]

By the way, another point on the timeline caught my attention; the peak happened on the 4th of June when the number of messages per second literally doubled. I became curious about what it was. We can easily filter the dataframe:

df_short = df[(df['datetime_hhmm'] >= datetime.datetime(2023, 6, 4, 23, 35, 0)) &
              (df['datetime_hhmm'] <= datetime.datetime(2023, 6, 4, 23, 55, 0))]
with pd.option_context('display.max_colwidth', 80):
    display(df_short[["created_at", "full_text"]])

The result looks like this:

Tweets posted during the peak interval, Image by author

It turned out that popular football player Zlatan Ibrahimovic from AC Milan announced his retirement at the age of 41, and this message caused a lot of Twitter reposts:

Twitter messages timeline, Image by author

As we can see, the duration of the peak was about an hour; maybe it could be longer, but it was late; according to the timeline, the announcement was made at 23:35.

But let’s return to Pandas. For further time analysis, let’s create two helper methods to draw all messages, grouped by the time of the day:

from bokeh.io import show
from bokeh.plotting import figure, output_file
from bokeh.models import ColumnDataSource
from bokeh.transform import linear_cmap
from bokeh.palettes import *

def draw_dataframe(p: figure, df_in: pd.DataFrame, color: str, legend_label: str):
    """ Draw all messages on the 00..24 timeline """
    messages_per_day = df_in.groupby(['time'], as_index=False).size()
    # The number of days is taken from the full (global) dataframe
    days_total = df["date"].unique().shape[0]

    msg_time = messages_per_day['time']
    # Data was summarized per minute, div by 60 to get seconds
    msg_count = messages_per_day['size']/(days_total*60)

    source = ColumnDataSource(data=dict(xs=msg_time, ys=msg_count))
    p.vbar(x='xs', top='ys', width=datetime.timedelta(seconds=50),
           color=color, legend_label=legend_label, source=source)

def draw_timeline(df_filtered: pd.DataFrame, df_full: pd.DataFrame):
    """ Draw timeline as a bargraph """
    p = figure(width=1600, height=400, title="Messages per second", x_axis_type="datetime", x_axis_label='Time')

    palette = RdYlBu11
    draw_dataframe(p, df_full, color=palette[0], legend_label="All values")
    if df_filtered is not None:
        draw_dataframe(p, df_filtered, color=palette[1], legend_label="Filtered values")

    p.xgrid.grid_line_color = None
    p.x_range.start = 0
    p.x_range.end = datetime.time(23, 59, 59)
    p.xaxis.ticker.desired_num_ticks = 24
    p.toolbar_location = None
    show(p)

This will allow us to see all messages on a single 24-hour timeline:

draw_timeline(df_filtered=None, df_full=df)

The optional “df_filtered” parameter will be used later. The result looks like this:

Messages per day, Image by author

We can clearly see a day/night difference, so my assumption that most of the messages in Dutch were made from the same time zone was correct.
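The day/night difference can also be expressed in numbers. As a minimal sketch (the function name is hypothetical, and the sample data is made up), the share of messages posted in each hour of the day can be calculated like this:

```python
import pandas as pd

def hourly_share(df_in: pd.DataFrame) -> pd.Series:
    """ Percentage of all messages posted in each hour of the day (0..23) """
    share = df_in["hour"].value_counts(normalize=True).sort_index()
    # Hours without any messages get 0%
    return (100 * share).reindex(list(range(24)), fill_value=0.0)

# Made-up sample: one message at night, two at noon, one in the evening
sample = pd.DataFrame({"hour": [3, 12, 12, 20]})
share = hourly_share(sample)
print(share[share > 0])
```

With the real dataframe, low percentages for the night hours would confirm the pattern visible on the graph.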

We can also draw a timeline for a single user. I already used this method in the previous part. For the convenience of readers who may use this article as a tutorial, I’ll place the code here as well:

def draw_user_timeline(df_in: pd.DataFrame, user_name: str):
    """ Draw cumulative messages time for specific user """
    df_u = df_in[df_in["user_name"] == user_name]

    # Group messages by time of the day
    messages_per_day = df_u.groupby(['time'], as_index=False).size()
    msg_time = messages_per_day['time']
    msg_count = messages_per_day['size']

    # Draw
    p = figure(x_axis_type='datetime', width=1600, height=150,
               title=f"Cumulative tweets timeline: {user_name} ({sum(msg_count)} messages)")
    p.vbar(x=msg_time, top=msg_count, width=datetime.timedelta(seconds=30), line_color='black')
    p.xaxis[0].ticker.desired_num_ticks = 30
    p.xgrid.grid_line_color = None
    p.toolbar_location = None
    p.x_range.start = datetime.time(0,0,0)
    p.x_range.end = datetime.time(23,59,0)
    p.y_range.start = 0
    p.y_range.end = 1
    p.yaxis.major_tick_line_color = None
    p.yaxis.minor_tick_line_color = None
    p.yaxis.major_label_text_color = None
    show(p)

draw_user_timeline(df, user_name="Ell_____")

The result looks like this:

Messages timeline for a single user, Image by author

4. Data transformation

In the previous step, we got a “raw” dataframe of messages made by all users. We are going to find daily patterns, so as input data, let’s get the number of messages grouped by an hour and calculated for each user:

gr_messages_per_user = df.groupby(['user_name', 'hour'], as_index=True).size()
display(gr_messages_per_user)

The result looks like this:

As a reminder, I was using 7-day data. In this example, we can see that during this interval, the user posted 4 messages at 7 a.m., 1 message at 8 a.m., 3 messages at 5 p.m., and so on.

For analysis, I decided to use three metrics:

  • The total number of “busy hours” per day when the user was making Twitter posts (in the last example, the number is 5).
  • The total number of messages per user (in the last example, the number is 20).
  • An array of 24 numbers, representing the number of messages grouped by hour. As an important step, I will also normalize the array sum to 100%.
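These three metrics can be illustrated for a single user. The hourly counts below are partly hypothetical: the first three values follow the example above, and the last two are made up so that the totals match (5 busy hours, 20 messages).

```python
import numpy as np

# Hour of the day -> number of messages posted by one user during the week
counts_by_hour = {7: 4, 8: 1, 17: 3, 20: 5, 21: 7}

# Array of 24 values, one per hour; hours without messages stay 0
hours_all = np.zeros(24, dtype=int)
for hr, value in counts_by_hour.items():
    hours_all[hr] = value

busy_hours = int(np.count_nonzero(hours_all))    # hours with at least one message
messages_total = int(hours_all.sum())            # all messages of this user
hours_percent = (100 * hours_all / messages_total).astype(int)  # normalized to 100%

print("Busy hours:", busy_hours)
print("Messages:", messages_total)
print("Percent per hour:", hours_percent)
```

The normalization makes users with very different numbers of messages comparable: only the distribution over the day matters, not the absolute count.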

The output will be a new dataframe, grouped by user names. This method does all the calculations:

from typing import List

def get_user_hours_dataframe(df_in: pd.DataFrame):
    """ Get new dataframe of users """
    busy_hours = []
    messages = []
    hour_vectors = []
    vectors_per_hour = [[] for _ in range(24)]
    gr_messages_per_user = df_in.groupby(['user_name', 'hour'], as_index=True).size()
    users = gr_messages_per_user.index.get_level_values('user_name').unique().values
    for ind, user in enumerate(users):
        if ind % 50000 == 0:
            print(f"Processing {ind} of {users.shape[0]}")
        # Number of messages for each hour of the day, 0 if there were none
        hours_all = [0]*24
        for hr, value in gr_messages_per_user[user].items():
            hours_all[hr] = value

        busy_hours.append(get_busy_hours(hours_all))
        messages.append(sum(hours_all))
        hour_vectors.append(np.array(hours_all))
        hours_normalized = get_hours_normalized(hours_all)
        for hr in range(24):
            vectors_per_hour[hr].append(hours_normalized[hr])

    print("Making the dataframe...")
    cdf = pd.DataFrame({
        "user_name": users,
        "messages": messages,
        "hours": hour_vectors,
        "busy_hours": busy_hours
    })
    # Add hour columns to the dataframe
    for hr in range(24):
        cdf[str(hr)] = vectors_per_hour[hr]

    return cdf.sort_values(by=['messages'], ascending=False)

def get_hours_normalized(hours_all: List) -> np.ndarray:
    """ Normalize all values in list to 100% total sum"""
    a = np.array(hours_all)
    return (100*a/np.linalg.norm(a, ord=1)).astype(int)

def get_busy_hours(hours_all: List) -> int:
    """ Number of hours with at least one message.
    This helper is missing from the listing; its body is an assumption
    based on the "busy hours" metric described above """
    return sum(1 for value in hours_all if value > 0)

df_users = get_user_hours_dataframe(df)
with pd.option_context('display.max_colwidth', None):
    display(df_users)

The result looks like this:

Data metrics, grouped by user, Image by author

Now we have a dataframe with all the metrics, and we’re ready to have some fun with this data.
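The loop over all users can take a while for more than a million users. As an alternative sketch (not the method used in this article), the same users-by-hours table can be built with Pandas operations only. The function name and the sample data are hypothetical.

```python
import pandas as pd

def hours_matrix(df_in: pd.DataFrame) -> pd.DataFrame:
    """ Table of message counts: one row per user, one column per hour (0..23) """
    counts = df_in.groupby(["user_name", "hour"]).size().unstack(fill_value=0)
    # Make sure all 24 hours are present, even if nobody posted at some hour
    return counts.reindex(columns=list(range(24)), fill_value=0)

# Made-up sample: user "a" posted 3 messages, user "b" posted 1
sample = pd.DataFrame({"user_name": ["a", "a", "a", "b"],
                       "hour": [7, 7, 20, 23]})
matrix = hours_matrix(sample)

messages = matrix.sum(axis=1)            # total number of messages per user
busy_hours = (matrix > 0).sum(axis=1)    # number of hours with at least one message
percent = matrix.mul(100).div(messages, axis=0).astype(int)  # normalized to 100%
print(percent.loc["a", [7, 20]])
```

As in the original method, the percentages are truncated to integers, so the sum for one user can be slightly less than 100.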

5. Analysis

In the last step, we transformed a “raw” dataframe with all Twitter messages into the data, grouped by user. This dataframe is actually much more useful. As a warm-up, let’s start with something simple. Let’s get the number of messages per user. The dataframe is already sorted, and we can easily see the “Top 5” of users who posted the maximum number of messages:

display(df_users[:5])

Metrics dataframe, grouped by user, Image by author

Let’s also find percentiles:

> print(df_users["messages"].quantile([0.05, 0.1, 0.5, 0.9, 0.95]))

0.05 1.0
0.10 1.0
0.50 1.0
0.90 4.0
0.95 10.0

The result is interesting. This data was collected within 7 days. There are 1,198,067 unique users in the dataframe who posted at least one message during this period. The median is 1, and the 90th percentile is only 4, which means that 90% of all users posted 4 or fewer messages during this week. A big difference compared to the top users, who posted more than 5,000 tweets! But as was discussed in the first part, some of the “top users” are probably bots. We can easily verify this by using the number of messages per hour. I already have the number of messages grouped by hour and normalized to 100%. Let’s find users who are posting messages continuously, without any pauses. To do this, we only need to filter users who posted about 100/24 ≈ 4% of their messages in every hour of the day:

df_users_filtered = df_users.copy()
for p in range(24):
    df_users_filtered = df_users_filtered[(df_users_filtered[str(p)] >= 2) &
                                          (df_users_filtered[str(p)] <= 5)]

display(df_users_filtered)
for user_name in df_users_filtered["user_name"].values:
    draw_user_timeline(df, user_name)

The number may not be exactly 4%, so I used 2..5% as a filter range. As a result, 28 “users” were found who were posting the same number of messages every hour:
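A small sketch shows why this range works. The two hourly profiles below are made up: one "user" posts 10 messages every hour, the other posts only during the last four hours of the day. The helper names are hypothetical.

```python
import numpy as np

def hours_normalized(hours_all) -> np.ndarray:
    """ Normalize 24 hourly counts to a 100% total, truncated to integers """
    a = np.array(hours_all)
    return (100 * a / np.linalg.norm(a, ord=1)).astype(int)

def is_uniform(percent: np.ndarray, low: int = 2, high: int = 5) -> bool:
    """ True if every hour holds between low and high percent of all messages """
    return bool(np.all((percent >= low) & (percent <= high)))

steady = hours_normalized([10] * 24)                # the same amount every hour
bursty = hours_normalized([0] * 20 + [5, 5, 5, 5])  # only 20:00..23:59

print(steady, is_uniform(steady))
print(bursty, is_uniform(bursty))
```

For the steady profile, every value is 100/24 ≈ 4.17, which is truncated to 4, so the profile passes the filter; a single hour without messages is enough to fail it.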

“Users”, posting messages with equal intervals, Image by author

In the previous part, I had already detected some bots using a clustering algorithm. Here we can see that even with a much simpler approach, we can get similar results.

Let’s go to a more fun part and group users by their activity time. Because the total amount of messages per hour is normalized to 100%, it’s possible to make pretty complex requests. For example, let’s add new columns “morning”, “day”, “evening”, and “night”:

df_users["night"] = df_users["23"] + df_users["0"] + df_users["1"] + df_users["2"] + df_users["3"] + df_users["4"] + df_users["5"] + df_users["6"]
df_users["morning"] = df_users["7"] + df_users["8"] + df_users["9"] + df_users["10"]
df_users["day"] = df_users["11"] + df_users["12"] + df_users["13"] + df_users["14"] + df_users["15"] + df_users["16"] + df_users["17"] + df_users["18"]
df_users["evening"] = df_users["19"] + df_users["20"] + df_users["21"] + df_users["22"]
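The same columns can be written more compactly with a dictionary of hour ranges. This is only a sketch of an equivalent approach; the names "PARTS_OF_DAY" and "add_parts_of_day" are hypothetical, and the sample row is made up.

```python
import pandas as pd

# Hours of the day that belong to each part of the day (same split as above)
PARTS_OF_DAY = {
    "night": [23, 0, 1, 2, 3, 4, 5, 6],
    "morning": [7, 8, 9, 10],
    "day": [11, 12, 13, 14, 15, 16, 17, 18],
    "evening": [19, 20, 21, 22],
}

def add_parts_of_day(df_in: pd.DataFrame) -> pd.DataFrame:
    """ Add one column per part of the day by summing the hour columns "0".."23" """
    out = df_in.copy()
    for name, hours in PARTS_OF_DAY.items():
        out[name] = out[[str(h) for h in hours]].sum(axis=1)
    return out

# Made-up user: 60% of messages at 8 a.m. and 40% at 8 p.m.
row = {str(h): 0 for h in range(24)}
row.update({"8": 60, "20": 40})
result = add_parts_of_day(pd.DataFrame([row]))
print(result[["night", "morning", "day", "evening"]])
```

With this layout, changing the boundaries between, for example, "day" and "evening" requires editing only one list.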

For the analysis, I will use only those users who posted more than 10 messages:

df_users_ = df_users[(df_users['messages'] > 10)]
df_ = df[df["user_name"].isin(df_users_["user_name"])]

Of course, 10 messages are not enough for statistically significant results. This is only a proof of concept, and for real research, collecting data over a longer period is recommended.

Anyway, the results are interesting. For example, we can find users who posted most of their messages in the morning, using only one line of code. We can also get all these messages and draw them on the timeline:

df_users_filtered = df_users_[df_users_['morning'] >= 50]

print(f"Users: {100*df_users_filtered.shape[0]/df_users_.shape[0]}%")
df_filtered = df_[df_["user_name"].isin(df_users_filtered["user_name"])]
draw_timeline(df_filtered, df_)

The result looks like this:

Users posting tweets in the morning, Image by author

Interestingly, this number is only about 3%. As a comparison, 46% of active users send more than 50% of their messages during the day:

Users posting tweets during the day, Image by author

We can make other requests; for example, let’s find users who are making 80% of their messages in the evening:

df_users_filtered = df_users_[df_users_['evening'] >= 80]

The result looks like this:

Users posting tweets in the evening, Image by author

We can also display the timeline of some users to verify the results:

for user_name in df_users_filtered[:5]["user_name"].values:
    draw_user_timeline(df_, user_name)

The output looks like this:

Timeline of selected users, Image by author

Results can be interesting; for example, the user “rhod***” was posting almost all the messages at the same time after 19:00.

Again, I must repeat that these results are not final. I analyzed only more or less active users who posted more than 10 tweets within a week. But a significant number of users posted fewer messages, and to gather more insights about them, data should be collected over several weeks or even months.

Conclusion

In this article, we were able to get all Twitter messages posted in a specific language (in our example, Dutch). This language is mostly used in the Netherlands and Belgium, which are located close to each other and share the same time zone. This allows us to know the user’s timezone, which, alas, I was not able to obtain from the Twitter API, at least using the free account. By analyzing message timestamps, we can get a lot of interesting information; for example, it is possible to find out if more users are active in the morning, during working hours, or in the evening. Finding temporal patterns in users’ behavior can be useful for psychology, cultural anthropology, or even medicine. Millions of people are using social networks, and it is interesting to know how they affect our lives, the rhythm of work, or sleep. And as was shown in the article, analyzing this behavior can be done using simple requests, which are actually not more difficult than school math.

It was also interesting to see how much data social networks can store. I suppose most people never think about how many messages are posted. Even for a relatively small Dutch-speaking community (about 25 million native speakers in the world), more than 10 tweets per second can be generated. For this article, 6,487,433 Twitter messages from 1,515,139 users were analyzed, and these were only messages posted within 10 days! For larger countries like Germany, getting all the messages will probably be beyond the limitations of a free Twitter development account. In that case, it may be possible to combine different request queries with filtering by user location.

Anyway, social networks are an interesting source of information about us, and I wish readers good luck with their own experiments. Those who are interested are also welcome to read the first part about using the K-Means algorithm for clustering Twitter users. And as another approach, the NLP analysis of Twitter posts was also explained.

If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors.

Thanks for reading.


