Create a phrase net with R

I hope this makes sense. I threw it together rather quickly, but it seems to be what you want to do. I grabbed some test text from the hyperlink above. The code shows the words that come after a given word, along with the proportion of times each of those following words occurred. It does nothing for the visualization, although I'm sure that would not be impossible to create. It should handle most of the basic calculations.

library(tau)

#tokenize the sample text (the tokens include the spaces between words)
x <- tokenize("Questions must be at least 2 days old to be eligible for a bounty. There can only be 1 active bounty per question at any given time. Users must have at least 75 reputation to offer a bounty, and may only have a maximum of 3 active bounties at any given time. The bounty period lasts 7 days. Bounties must have a minimum duration of at least 1 day. After the bounty ends, there is a grace period of 24 hours to manually award the bounty. If you do not award your bounty within 7 days (plus the grace period), the highest voted answer created after the bounty started with at least 2 upvotes will be awarded half the bounty amount. If there's no answer meeting that criteria, the bounty is not awarded to anyone. If the bounty was started by the question owner, and the question owner accepts an answer during the bounty period, and the bounty expires without an explicit award – we assume the bounty owner liked the answer they accepted and award it the full bounty amount at the time of bounty expiration. In any case, you will always give up the amount of reputation specified in the bounty, so if you start a bounty, be sure to follow up and award your bounty to the best answer! As an additional bonus, bounty awards are immune to the daily reputation cap and community wiki mode.")

#the number of tokens in the string
n <- length(x)

list <- NULL

count <- 1

#drop the space tokens; list holds the remaining tokens in order
for (i in 1:n) {
  if (x[i] != " ") {
    list[count] <- x[i]
    count <- count + 1
  }
}

#the unique words in the string
y <- unique(list)

#number of tokens once the spaces are removed
n <- length(list)
#number of distinct tokens
m <- length(y)


#assign tokens to values
ind <- NULL
val <- NULL
#make vector of numbers in place of tokens
for (i in 1:m) {
  ind[i] <- i
  for (j in 1:n) {
    if (y[i] == list[j]) {
      val[j] <- i
    }
  }
}


d <- array(0, c(m, m))

#count how often each word directly follows the current word
for (i in 1:(n-1)) {
   d[val[i], val[i+1]] <- d[val[i], val[i+1]] + 1
}

#pick a word
word <- 4

#show the word
y[word]
#[1] "at"

#the words that follow
y[which(d[word,] > 0)]
#[1] "least" "any"   "the" 

#the prob of words that follow
d[word,which(d[word,]>0)]/sum(d[word,])
#[1] 0.5714286 0.2857143 0.1428571
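
For comparison, the same "which words follow a given word, and how often" calculation can be sketched in base R without explicit loops. This is only an illustration under assumptions: the short sample sentence is made up, a plain strsplit() stands in for tau's tokenize(), and the names words, current, following, counts and props are not part of the code above.

```r
# Assumption: a simple split on non-alphanumeric characters stands in for
# tau::tokenize(); the sample sentence is invented for illustration.
text <- "the cat sat on the mat and the cat ran"
words <- strsplit(text, "[^[:alnum:]']+")[[1]]
words <- words[words != ""]

# Pair each word with the word that directly follows it.
current <- head(words, -1)
following <- tail(words, -1)

# Contingency table: rows are the current word, columns the following word.
counts <- table(current, following)

# Row-wise proportions: share of times each word follows a given word.
props <- prop.table(counts, margin = 1)

# Words that follow "the", with their proportions.
after_the <- props["the", ]
print(after_the[after_the > 0])
```

Indexing by word rather than by position (props["the", ]) avoids having to look up the word's number in y first.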

darrelkj

This makes great progress toward a plot that looks more like the one above. The plotting / visualization is actually what I'm struggling with. The plot is almost like a word cloud (size = frequency), and the arrows are similar to a sociogram in network analysis, except that the arrows carry meaning: they represent a stronger link. I think the work you've done will be useful for drawing the arrows. I'm actually not very familiar with network analysis and visualization, so I need a lot of help here.
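
The two ingredients described here (a size per word and a weight per arrow) can be derived from the follow-count matrix. The sketch below is a rough illustration under assumptions: the word vector is invented, and the names nodes, edges, size and weight are hypothetical, chosen only for this example. It rebuilds y, val and d the same way the answer does so that it runs on its own.

```r
# Invented sample tokens; in the answer these come from the tokenized text.
words <- c("at", "least", "one", "at", "least", "two", "at", "any", "time")

y <- unique(words)        # distinct words, as in the answer
val <- match(words, y)    # each token replaced by its index in y
m <- length(y)

# Follow counts: d[i, j] is how often word j directly follows word i.
d <- matrix(0, m, m)
for (i in seq_len(length(words) - 1)) {
  d[val[i], val[i + 1]] <- d[val[i], val[i + 1]] + 1
}

# Node table: one row per word, size taken as the word's frequency.
nodes <- data.frame(word = y, size = tabulate(val, nbins = m),
                    stringsAsFactors = FALSE)

# Edge table: one row per arrow, weight taken as the follow count.
idx <- which(d > 0, arr.ind = TRUE)
edges <- data.frame(from = y[idx[, "row"]],
                    to = y[idx[, "col"]],
                    weight = d[idx],
                    stringsAsFactors = FALSE)

print(nodes)
print(edges)
```

A from/to/weight table like this is the shape that network plotting tools generally expect, for instance igraph's graph_from_data_frame(), though how to map size and weight onto the drawing is left open here.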

Tyler Rinker

Add this to the end to get a graph. It will be clear, though, that you will probably want to filter out the lower-ranked words and use only those with more support. dd <- t(d); library(diagram); plotmat(dd[1:10, 1:10], box.size = 0.05, name = y[1:10], lwd = 2 * dd[1:10,])
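
One way to do the filtering mentioned here is to rank words by how many links they take part in and keep only the top few before plotting. This is a sketch under assumptions: the small y and d below are made-up stand-ins for the answer's objects, "support" is taken to mean outgoing plus incoming counts, and the names support, k, keep, y_top and d_top are hypothetical.

```r
# Made-up stand-ins for the answer's y (distinct words) and d (follow counts).
y <- c("at", "least", "any", "the", "bounty")
d <- matrix(0, nrow = 5, ncol = 5)
d[1, 2] <- 4   # "at" -> "least"
d[1, 3] <- 2   # "at" -> "any"
d[1, 4] <- 1   # "at" -> "the"
d[4, 5] <- 5   # "the" -> "bounty"
d[5, 1] <- 2   # "bounty" -> "at"

# Assumption: a word's support is its outgoing plus incoming follow counts.
support <- rowSums(d) + colSums(d)

# Keep the k best-supported words (k is an arbitrary choice here).
k <- 3
keep <- order(support, decreasing = TRUE)[seq_len(min(k, length(y)))]

y_top <- y[keep]
d_top <- d[keep, keep, drop = FALSE]

print(y_top)
print(d_top)

# The reduced matrix can then be plotted as in the comment above, e.g.:
# library(diagram)
# plotmat(t(d_top), box.size = 0.05, name = y_top)
```

Subsetting rows and columns with the same index vector keeps the matrix square, which plotmat needs, and drop = FALSE keeps it a matrix even if k is 1.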

darrelkj

@darrelkj This seems to be limited to 10 words, but I think that with a little work to connect it to sociograms or something similar, we would have a pretty slick function. I'm marking this answer as correct. darrelkj, after so much work you should put the finishing touches on this and throw it into a package. If you do, let us know. Thanks for your help.

Tyler Rinker

It isn't limited to 10; I just didn't want to use the whole array. The ten used here are also poorly chosen.

darrelkj

I was mistaken. I had made a mistake in the code when I tried it, and so I got an out-of-bounds error. You are absolutely right.

Tyler Rinker
